Calling an LLM API · AI with Sandy

You have a workspace and you understand prompts. Now you call the model from code in a way you can rely on in production: correct message roles, the right parameters, streaming for responsiveness, and error handling so a hiccup does not crash your app.

Overview

Modern LLM APIs use a messages format. You send a list of messages, each with a role — typically system (instructions that frame the whole conversation), user (the request), and assistant (the model's previous replies, when continuing a chat). The API returns the next assistant message.

A handful of parameters shape the response, and a small set of errors come up often enough that handling them is part of the basic skill.

Key ideas

The messages format

The system prompt sets behaviour once; user messages carry each request.

from anthropic import Anthropic
 
client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
 
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=300,
    system="You are a concise assistant for a Telugu cooking blog. Reply in English.",
    messages=[
        {"role": "user", "content": "Give me 3 tips for crisp dosa."},
    ],
)
 
print(response.content[0].text)
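
To continue a chat, you send the earlier turns back with each request: the API is stateless, so the assistant role is how the model sees its own previous replies. A minimal sketch, assuming a hypothetical helper named add_turn (not part of the SDK) that keeps the history as a plain list; the assistant text is a placeholder standing in for a real reply:

```python
def add_turn(history, role, content):
    """Return a new history list with one more message appended.

    `add_turn` is a hypothetical helper for illustration, not an SDK function.
    """
    if role not in ("user", "assistant"):
        raise ValueError(f"unexpected role: {role!r}")
    return history + [{"role": role, "content": content}]


history = []
history = add_turn(history, "user", "Give me 3 tips for crisp dosa.")
# After the first call, store the model's reply as an assistant turn
# (placeholder text here, not real model output)...
history = add_turn(history, "assistant", "<the model's first reply>")
# ...then append the follow-up question and send the whole list again.
history = add_turn(history, "user", "Which tip matters most?")

print([m["role"] for m in history])
```

Passing history as messages= to client.messages.create(...) gives the model the full conversation. The system prompt stays in the separate system parameter rather than in this list.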

Parameters that matter

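max_tokens caps the length of the reply, measured in tokens. Set it deliberately: it bounds output cost, and a reply that hits the cap is cut off mid-answer (the response's stop_reason is then "max_tokens").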
temperature controls randomness. Use low values (0–0.3) for factual or structured tasks where you want consistency, higher (0.7–1.0) for creative variety.
model picks the trade-off between speed, cost, and capability. Use a smaller, cheaper model for simple, high-volume calls and a larger one for hard reasoning.
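
One way to keep these choices deliberate is to name them once and reuse them. The sketch below is an illustration only: the preset names, the specific values, and both helper functions are assumptions picked from the ranges above, not SDK features. It also checks stop_reason, which the Messages API sets to "max_tokens" when a reply was cut off by the cap.

```python
from types import SimpleNamespace

# Hypothetical presets; values are picked from the ranges discussed above.
PRESETS = {
    "extract": {"max_tokens": 200, "temperature": 0.0},     # structured, repeatable
    "answer": {"max_tokens": 500, "temperature": 0.2},      # factual Q&A
    "brainstorm": {"max_tokens": 800, "temperature": 0.9},  # creative variety
}


def request_kwargs(task, model, messages):
    """Build keyword arguments for client.messages.create(**kwargs)."""
    return {"model": model, "messages": messages, **PRESETS[task]}


def was_truncated(response):
    """True when the reply stopped because it hit max_tokens."""
    return response.stop_reason == "max_tokens"


kwargs = request_kwargs(
    "extract",
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "List the ingredients as JSON."}],
)
print(kwargs["temperature"], kwargs["max_tokens"])

# Stand-in object so the check can be shown without a network call.
fake_response = SimpleNamespace(stop_reason="max_tokens")
print(was_truncated(fake_response))
```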

Stream for responsiveness

For anything a user waits on, stream the tokens as they arrive instead of blocking until the whole reply is ready.

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=300,
    messages=[{"role": "user", "content": "Explain embeddings in 3 lines."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
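
Streaming code usually needs the full reply as well, for logging or for appending to the chat history. A minimal sketch, assuming a hypothetical helper collect_stream that accepts any iterable of text chunks (such as stream.text_stream above), so it can be exercised here with a plain list instead of a live connection:

```python
def collect_stream(chunks, on_chunk=None):
    """Forward each text chunk to `on_chunk` as it arrives and return the full text.

    `collect_stream` is a hypothetical helper, not part of the SDK.
    """
    parts = []
    for text in chunks:
        if on_chunk is not None:
            on_chunk(text)  # e.g. print to a terminal or push to a websocket
        parts.append(text)
    return "".join(parts)


# Simulated chunks standing in for stream.text_stream.
shown = []
full_text = collect_stream(
    ["Embeddings ", "map text ", "to vectors."], on_chunk=shown.append
)
print(full_text)
```

With the real client, the call inside the with block would be collect_stream(stream.text_stream, on_chunk=lambda t: print(t, end="", flush=True)).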

Handle the errors that actually happen

The two you will hit most are rate limits (too many requests too fast) and transient server errors. Retry these with exponential backoff; do not retry authentication or invalid-request errors, because they will fail every time until you fix the input.

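import time
from anthropic import RateLimitError, APIStatusError

# Reuses the `client` created in the first example.
def call_with_retry(messages, attempts=4):
    """Call the API, retrying rate limits and 5xx errors with exponential backoff."""
    for i in range(attempts):
        try:
            return client.messages.create(
                model="claude-sonnet-4-6", max_tokens=300, messages=messages
            )
        except RateLimitError:
            # 429: retry, unless this was the last attempt.
            if i == attempts - 1:
                raise
        except APIStatusError as e:
            # Retry only server-side (5xx) failures. Other 4xx errors, such as
            # bad auth or an invalid request, fail every time, so re-raise them.
            if e.status_code < 500 or i == attempts - 1:
                raise
        time.sleep(2 ** i)  # waits 1s, 2s, 4s between the four default attempts
    raise ValueError("attempts must be at least 1")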
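
The retry rule above comes down to two small decisions: which failures are worth retrying, and how long to wait. A sketch under stated assumptions: is_retryable and backoff_delay are hypothetical names, the status-code split follows the rule above (429 and 5xx retry, other 4xx do not), and the random jitter and 30-second cap are common additions rather than something this lesson requires.

```python
import random


def is_retryable(status_code):
    """Retry rate limits (429) and server errors (5xx); other 4xx will keep failing."""
    return status_code == 429 or status_code >= 500


def backoff_delay(attempt, base=1.0, cap=30.0, jitter=True):
    """Seconds to wait before retry number `attempt` (0-based): base * 2**attempt, capped.

    With jitter, the wait is drawn uniformly from [0, delay] so that many clients
    that failed together do not all retry at the same instant.
    """
    delay = min(cap, base * 2 ** attempt)
    return random.uniform(0, delay) if jitter else delay


for code in (401, 429, 503):
    print(code, is_retryable(code))
print([backoff_delay(i, jitter=False) for i in range(6)])
```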

Quick recap

Use the messages format: system frames behaviour, user carries requests.
Set max_tokens and temperature deliberately — they control length, cost, and consistency.
Stream output for anything a person waits on.
Retry rate limits and 5xx errors with exponential backoff; never retry auth or bad-request errors.
Pick the smallest model that does the job to control cost.
