Deploying an AI product · AI with Sandy

A demo that works once on your machine is not a product. Shipping AI to real users adds concerns a script never had: secrets must be managed, costs must be bounded, failures must be handled, and you must be able to see what is happening in production. This lesson is the bridge from working code to a running service.

Overview

Deploying an AI feature is mostly normal software deployment plus a few AI-specific concerns: variable latency, per-token cost, and non-deterministic output. You handle them with the same disciplines you would use for any external dependency — timeouts, retries, caching, observability, and a controlled rollout — applied with the knowledge that this dependency is slow, paid, and occasionally surprising.

Key ideas

Secrets and config belong to the environment

Never ship keys in code or in the client. The model API is called from your server, where the key lives in the platform's secret manager or environment variables, not in the repo.

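import os

# .get() returns None when the variable is missing, so the check below can run;
# os.environ["..."] would raise KeyError before the check was ever reached.
API_KEY = os.environ.get("ANTHROPIC_API_KEY")  # set in the host platform, never committed
if not API_KEY:
    raise RuntimeError("ANTHROPIC_API_KEY is not set")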

Make every call observable and bounded

Log each request's latency, token usage, and outcome so you can answer "why is it slow" and "why is the bill high" with data. Set a timeout, retry transient failures with backoff, and degrade gracefully when the model is unavailable.

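import logging
import time

import anthropic

logger = logging.getLogger("ai")
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def tracked_call(messages):
    start = time.perf_counter()  # monotonic clock, suited to measuring durations
    try:
        resp = client.messages.create(
            model="claude-sonnet-4-6", max_tokens=400,
            messages=messages, timeout=20,  # seconds
        )
        logger.info("ai_ok", extra={
            "latency_ms": int((time.perf_counter() - start) * 1000),
            "input_tokens": resp.usage.input_tokens,
            "output_tokens": resp.usage.output_tokens,
        })
        return resp.content[0].text
    except Exception as e:
        # Log the failure with its latency, then degrade to a fixed fallback message.
        logger.error("ai_fail", extra={
            "error": str(e),
            "latency_ms": int((time.perf_counter() - start) * 1000),
        })
        return "Sorry, that is not available right now. Please try again shortly."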
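
The text above mentions retrying transient failures with backoff. The sketch below is one way to do that; it is illustrative, not the course's own helper. `TransientError` and `call_with_backoff` are hypothetical names, and mapping real SDK exceptions (rate limits, timeouts, server errors) onto `TransientError` is left as an assumption. The provider's SDK may already retry some errors on its own, so check its retry settings before stacking another layer on top.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (rate limits, timeouts, 5xx)."""


def call_with_backoff(fn, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on TransientError wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller fall back
            delay = base_delay * (2 ** attempt)
            # Jitter spreads retries out so many clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

Only errors that are likely to clear on their own should be retried. A malformed request fails the same way every time, so retrying it adds latency and cost without changing the outcome.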

Control cost before it controls you

Cost scales with tokens and traffic. Cap max_tokens, cache responses to repeated identical prompts, use a smaller model where it suffices, and set a hard spend alert on the provider account. A single runaway loop in production can cost more in an hour than a month of normal use.
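
Two of these levers can be sketched in a few lines: caching identical prompts and estimating spend from token counts. The code assumes an in-memory, single-process cache; `PromptCache` and `estimate_cost` are hypothetical names, and the prices are parameters you would fill in from your provider's current price list, not values taken from this lesson.

```python
import hashlib
import json
import time


class PromptCache:
    """In-memory cache keyed on a hash of (model, messages), with a time-to-live."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}

    @staticmethod
    def key(model, messages):
        # sort_keys makes the JSON stable, so equal prompts hash to the same key.
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model, messages, call):
        k = self.key(model, messages)
        hit = self._store.get(k)
        if hit is not None and self.clock() - hit[0] < self.ttl:
            return hit[1]  # identical prompt seen recently: no tokens spent
        value = call(model, messages)
        self._store[k] = (self.clock(), value)
        return value


def estimate_cost(input_tokens, output_tokens, input_price_per_mtok, output_price_per_mtok):
    """Cost in the currency of the prices, which are quoted per million tokens."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000
```

Feeding the logged `input_tokens` and `output_tokens` into `estimate_cost` gives a running spend figure you can alert on. With more than one server process, the dictionary would need to be replaced by a shared store.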

Roll out gradually and keep evals running

Release behind a feature flag to a small slice of users first. Watch your eval score, error rate, latency, and cost. If a signal degrades, turn the flag off — instant rollback, no redeploy. Keep running the eval set against production samples so quality regressions show up before users complain.
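
A percentage rollout can be sketched with a stable hash of the user ID, so each user always gets the same answer and raising the percentage only ever adds users. `in_rollout` and the flag name `ai_summary` are hypothetical; a real deployment would more likely read the percentage from a feature-flag service or a config value that can change without a redeploy.

```python
import hashlib


def in_rollout(user_id, flag_name, percent):
    """Return True if this user falls inside the first `percent` of 100 buckets."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in 0..99
    return bucket < percent


# Usage: route a small slice of users to the AI feature, everyone else to the old path.
if in_rollout("user-42", "ai_summary", percent=5):
    pass  # call the model-backed feature here
```

Setting `percent` to 0 turns the feature off for everyone, which is the instant rollback described above. Including the flag name in the hash means different features do not all pick the same early users.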

Quick recap

Keep keys in server-side environment secrets; never in code or the client.
Log latency, tokens, and outcomes; set timeouts and graceful fallbacks.
Bound cost with token caps, caching, model choice, and spend alerts.
Roll out behind a flag to a small slice; watch evals, errors, latency, and cost.
Never call the model directly from the client — always go through your backend.
