Calling an LLM API
Send your first real requests — messages, parameters, streaming, and handling errors gracefully.
Prerequisites
- Setting up your AI workspace
- Prompting that actually works
You will learn
- Send a chat-style request with system and user roles
- Control output with temperature and max_tokens
- Stream responses and handle rate limits and errors
You have a workspace and you understand prompts. Now you call the model from code in a way you can rely on in production: correct message roles, the right parameters, streaming for responsiveness, and error handling so a hiccup does not crash your app.
Overview
Modern LLM APIs use a messages format. You send a list of messages, each with a role — typically system (instructions that frame the whole conversation), user (the request), and assistant (the model's previous replies, when continuing a chat). The API returns the next assistant message.
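To make the role of assistant messages concrete, here is a minimal sketch of keeping chat history as a list of messages. It assumes an API like the one used in this lesson, where system instructions are passed separately and the list holds only user and assistant turns. `add_turn` and `VALID_ROLES` are hypothetical names for illustration, not part of any SDK, and the assistant text is a placeholder rather than real model output.

```python
# Sketch: keeping conversation history in the messages format.
# `add_turn` and `VALID_ROLES` are hypothetical helpers, not SDK features.

VALID_ROLES = {"user", "assistant"}


def add_turn(history, role, content):
    """Return a new history list with one more message appended."""
    if role not in VALID_ROLES:
        raise ValueError(f"unsupported role: {role!r}")
    return history + [{"role": role, "content": content}]


history = []
history = add_turn(history, "user", "Give me 3 tips for crisp dosa.")
# After the API replies, store that reply so the next request carries context.
history = add_turn(history, "assistant", "(the model's earlier reply goes here)")
history = add_turn(history, "user", "Which tip matters most?")

for message in history:
    print(message["role"], "->", message["content"])
```

Sending the whole `history` list as `messages` on each request is what lets the model answer the follow-up question with the earlier turns in view.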
A handful of parameters shape the response, and a small set of errors come up often enough that handling them is part of the basic skill.
Key ideas
The messages format
The system prompt sets behaviour once; user messages carry each request.
    from anthropic import Anthropic

    client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=300,
        system="You are a concise assistant for a Telugu cooking blog. Reply in English.",
        messages=[
            {"role": "user", "content": "Give me 3 tips for crisp dosa."},
        ],
    )
    print(response.content[0].text)
Parameters that matter
- max_tokens caps the length of the reply. Set it deliberately, because it also bounds cost.
- temperature controls randomness. Use low values (0–0.3) for factual or structured tasks where you want consistency, and higher values (0.7–1.0) for creative variety.
- model picks the trade-off between speed, cost, and capability. Use a smaller, cheaper model for simple, high-volume calls and a larger one for hard reasoning.
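One way to set these parameters deliberately is to keep named presets per kind of task. The sketch below is an illustration only: the task names, the specific values, and the `request_params` helper are assumptions chosen to match the ranges described above, not recommendations from any SDK.

```python
# Sketch: per-task parameter presets.
# Task names and values are illustrative assumptions, not SDK defaults.
PRESETS = {
    "extract": {"temperature": 0.0, "max_tokens": 200},     # consistent, short
    "summarize": {"temperature": 0.3, "max_tokens": 300},   # mostly consistent
    "brainstorm": {"temperature": 0.9, "max_tokens": 600},  # varied, longer
}


def request_params(task, model="claude-sonnet-4-6"):
    """Build keyword arguments for a request of the given task type."""
    if task not in PRESETS:
        raise KeyError(f"no preset for task: {task!r}")
    return {"model": model, **PRESETS[task]}


params = request_params("extract")
print(params)
```

The resulting dictionary can be unpacked into a request, for example `client.messages.create(**params, messages=[...])`, so every call site states which kind of task it is instead of repeating raw numbers.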
Stream for responsiveness
For anything a user waits on, stream the tokens as they arrive instead of blocking until the whole reply is ready.
    # Reuses the `client` created in the first example.
    with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=300,
        messages=[{"role": "user", "content": "Explain embeddings in 3 lines."}],
    ) as stream:
        # Print each chunk of text as soon as it arrives.
        for text in stream.text_stream:
            print(text, end="", flush=True)
Handle the errors that actually happen
The two you will hit most are rate limits (too many requests too fast) and transient server errors. Retry these with exponential backoff, doubling the wait after each failed attempt. Do not retry authentication or invalid-request errors: they will fail every time until you fix the API key or the request itself.
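As a quick illustration of that rule, the retry decision can be written as a small predicate over HTTP status codes. This is a sketch under the usual conventions (429 for rate limits, 5xx for server errors). `is_retryable` is a hypothetical helper, not an SDK function.

```python
# Sketch: which HTTP status codes are worth retrying.
# `is_retryable` is a hypothetical helper for illustration.

def is_retryable(status_code):
    """Return True for rate limits (429) and server errors (5xx)."""
    return status_code == 429 or 500 <= status_code <= 599


for code in (400, 401, 429, 500, 503):
    print(code, "retry" if is_retryable(code) else "fail fast")
```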
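    import time

    from anthropic import APIStatusError, RateLimitError

    # Reuses the `client` created in the first example.
    def call_with_retry(messages, attempts=4):
        """Call the API, retrying rate limits and 5xx errors with backoff."""
        for i in range(attempts):
            try:
                return client.messages.create(
                    model="claude-sonnet-4-6", max_tokens=300, messages=messages
                )
            except RateLimitError:
                # Caught first because RateLimitError subclasses APIStatusError.
                if i < attempts - 1:
                    time.sleep(2 ** i)  # 1s, 2s, 4s; no sleep after the last try
            except APIStatusError as e:
                # Retry only server-side (5xx) errors; re-raise everything else.
                if e.status_code >= 500 and i < attempts - 1:
                    time.sleep(2 ** i)
                else:
                    raise
        raise RuntimeError("Exhausted retries")
Quick recap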
- Use the messages format: system frames behaviour, user carries requests.
- Set max_tokens and temperature deliberately: they control length, cost, and consistency.
- Stream output for anything a person waits on.
- Retry rate limits and 5xx errors with exponential backoff; never retry auth or bad-request errors.
- Pick the smallest model that does the job to control cost.