Deploying an AI product
Take an AI feature from your laptop to real users — safely, observably, and within budget.
Prerequisites
- Evals: knowing if it works
- Calling an LLM API
You will learn
- Move API keys and config to a production-safe setup
- Add logging, cost controls, and graceful failure
- Roll out behind a flag and watch the right signals
A demo that works once on your machine is not a product. Shipping AI to real users adds concerns a script never had: secrets must be managed, costs must be bounded, failures must be handled, and you must be able to see what is happening in production. This lesson is the bridge from working code to a running service.
Overview
Deploying an AI feature is mostly normal software deployment plus a few AI-specific concerns: variable latency, per-token cost, and non-deterministic output. You handle them with the same disciplines you would use for any external dependency — timeouts, retries, caching, observability, and a controlled rollout — applied with the knowledge that this dependency is slow, paid, and occasionally surprising.
Key ideas
Secrets and config belong to the environment
Never ship keys in code or in the client. The model API is called from your server, where the key lives in the platform's secret manager or environment variables, not in the repo.
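import os

# Set in the host platform's secret manager or environment, never committed.
# .get() returns None when the variable is missing, so the check below covers
# both an unset and an empty key.
API_KEY = os.environ.get("ANTHROPIC_API_KEY")
if not API_KEY:
    raise RuntimeError("ANTHROPIC_API_KEY is not set")
Make every call observable and bounded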
Log each request's latency, token usage, and outcome so you can answer "why is it slow" and "why is the bill high" with data. Set a timeout, retry transient failures with backoff, and degrade gracefully when the model is unavailable.
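Retrying with backoff can be sketched without reference to any particular SDK. The helper below is a minimal illustration, not a definitive implementation: `call_with_backoff`, its parameters, and the choice of which exceptions count as transient are all assumptions made for this example. It waits twice as long after each failed attempt, adds a little random jitter so many clients do not retry in lockstep, and re-raises once the attempts run out so the caller can fall back.

```python
import random
import time


def call_with_backoff(fn, max_attempts=3, base_delay=0.5,
                      retryable=(TimeoutError, ConnectionError),
                      sleep=time.sleep):
    """Call fn(); on a retryable error, wait and try again.

    Hypothetical helper: the retryable exception types are an assumption
    and should be replaced with the transient errors your client raises.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller degrade gracefully
            delay = base_delay * (2 ** attempt)          # 0.5s, 1s, 2s, ...
            sleep(delay + random.uniform(0, delay / 2))  # jitter spreads retries
```

Errors that are not in `retryable` (for example a malformed request) propagate immediately, since repeating them only adds latency and cost.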
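import logging
import time

import anthropic

logger = logging.getLogger("ai")
# The SDK reads ANTHROPIC_API_KEY from the environment; max_retries makes it
# retry transient failures with backoff before raising.
client = anthropic.Anthropic(max_retries=2)

def tracked_call(messages):
    start = time.monotonic()  # monotonic clock: unaffected by system clock changes
    try:
        resp = client.messages.create(
            model="claude-sonnet-4-6", max_tokens=400,
            messages=messages, timeout=20,  # seconds
        )
        logger.info("ai_ok", extra={
            "latency_ms": int((time.monotonic() - start) * 1000),
            "input_tokens": resp.usage.input_tokens,
            "output_tokens": resp.usage.output_tokens,
        })
        return resp.content[0].text
    except Exception as e:
        # Log the failure with its latency, then degrade gracefully.
        logger.error("ai_fail", extra={
            "latency_ms": int((time.monotonic() - start) * 1000),
            "error": str(e),
        })
        return "Sorry, that is not available right now. Please try again shortly."
Control cost before it controls you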
Cost scales with tokens and traffic. Cap max_tokens, cache responses to repeated identical prompts, use a smaller model where it suffices, and set a hard spend alert on the provider account. A single runaway loop in production can cost more in an hour than a month of normal use.
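Two of these controls are easy to sketch in application code. The example below is illustrative only: `ResponseCache` and `TokenBudget` are hypothetical helpers, the in-memory dictionary stands in for whatever shared cache a real service would use, and the policy of refusing calls once an allowance is spent is an assumption, not a provider feature. The cache keys on a hash of the model name and messages, so only exact repeats are served without spending tokens.

```python
import hashlib
import json
import time


class ResponseCache:
    """In-memory cache for repeated identical prompts (hypothetical helper)."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock   # injectable so expiry can be tested
        self._store = {}     # key -> (stored_at, text)

    @staticmethod
    def key(model, messages):
        # Identical model + messages always hash to the same key.
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model, messages, call_fn):
        k = self.key(model, messages)
        now = self.clock()
        hit = self._store.get(k)
        if hit is not None and now - hit[0] < self.ttl_seconds:
            return hit[1]             # cache hit: no tokens spent
        text = call_fn(messages)      # cache miss: pay for one real call
        self._store[k] = (now, text)
        return text


class TokenBudget:
    """Track token use against an allowance (assumed in-app policy)."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.used = 0

    def record(self, input_tokens, output_tokens):
        self.used += input_tokens + output_tokens

    def exhausted(self):
        return self.used >= self.max_tokens
```

In use, a request handler would check `budget.exhausted()` before calling the model and return the fallback message if it is, which turns a runaway loop into a bounded cost. If `call_fn` can return a fallback message on failure, avoid storing that result, or the cache will keep serving the error text until the entry expires.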
Roll out gradually and keep evals running
Release behind a feature flag to a small slice of users first. Watch your eval score, error rate, latency, and cost. If a signal degrades, turn the flag off — instant rollback, no redeploy. Keep running the eval set against production samples so quality regressions show up before users complain.
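A percentage rollout and a simple health check can be sketched as follows. This is a minimal illustration under stated assumptions: `in_rollout` and `breached_signals` are hypothetical names, the metric keys and limits are placeholders to be replaced with your own, and a real deployment would more likely read the percentage from a feature-flag service. Hashing the flag name together with the user ID gives each user a stable bucket, so the same users stay in the rollout as the percentage grows.

```python
import hashlib


def in_rollout(user_id, flag_name, percent):
    """Return True if this user falls inside the first `percent` of buckets."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # stable bucket 0..9999
    return bucket < percent * 100


def breached_signals(metrics, limits):
    """Return the names of signals past their limit (empty list = healthy).

    The metric names and limit keys are assumptions made for this sketch.
    """
    breached = []
    if metrics["eval_score"] < limits["min_eval_score"]:
        breached.append("eval_score")
    for name in ("error_rate", "p95_latency_ms", "cost_per_request"):
        if metrics[name] > limits["max_" + name]:
            breached.append(name)
    return breached
```

Setting `percent` to 0 sends every user down the old path, which is the instant rollback described above. A scheduled job could call `breached_signals` on recent production metrics and turn the flag off whenever the returned list is not empty.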
Quick recap
- Keep keys in server-side environment secrets; never in code or the client.
- Log latency, tokens, and outcomes; set timeouts and graceful fallbacks.
- Bound cost with token caps, caching, model choice, and spend alerts.
- Roll out behind a flag to a small slice; watch evals, errors, latency, and cost.
- Never call the model directly from the client — always go through your backend.