The brief
Build an API that takes a user profile (age, goal, equipment, schedule) and returns a personalised weekly training plan. It had to stream the response (users hate staring at a spinner), cost less than $0.01 per request, and survive a modest traffic spike without falling over.
Prompt design
The first version used a single large prompt. It worked, but it was fragile — small changes to the system prompt cascaded unpredictably into the output format.
LangGraph’s graph-based approach helped here. I split the pipeline into three nodes:
- Profile validation — checks the input makes sense, enriches it with defaults.
- Plan generation — the creative step, where GPT-4o actually writes the plan.
- Format enforcement — a lightweight pass that ensures the output matches the expected JSON schema.
Each node has a clear input/output contract. When something breaks, you know exactly which node to look at.
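To make the input/output contracts concrete, here is a minimal plain-Python sketch of the three steps. It is not the service's actual code and does not use the LangGraph API: the state keys, the default values, the age range, and the function names (`validate_profile`, `write_plan`, `enforce_format`, `run_pipeline`) are all assumptions for illustration, and the generation step is a stub standing in for the model call. Each function takes the shared state and returns only the keys it owns, which is the same partial-update shape graph nodes use.

```python
import json
from typing import TypedDict


class PlanState(TypedDict, total=False):
    profile: dict   # raw input, then the validated and enriched profile
    draft: str      # text produced by the generation step
    plan: dict      # final plan after the format check


# Hypothetical defaults, chosen only for this example.
DEFAULTS = {"equipment": [], "schedule": ["Mon", "Wed", "Fri"]}


def validate_profile(state: PlanState) -> PlanState:
    """Node 1: reject input that makes no sense, fill in defaults."""
    profile = {**DEFAULTS, **state["profile"]}
    if not 13 <= int(profile.get("age", 0)) <= 100:  # assumed valid range
        raise ValueError("age out of range")
    if not profile.get("goal"):
        raise ValueError("goal is required")
    return {"profile": profile}


def write_plan(state: PlanState) -> PlanState:
    """Node 2: stub for the model call; returns a JSON draft as text."""
    profile = state["profile"]
    days = {day: f"{profile['goal']} session" for day in profile["schedule"]}
    return {"draft": json.dumps({"days": days})}


def enforce_format(state: PlanState) -> PlanState:
    """Node 3: parse the draft and check it has the expected shape."""
    plan = json.loads(state["draft"])
    if not isinstance(plan.get("days"), dict) or not plan["days"]:
        raise ValueError("plan must contain a non-empty 'days' object")
    return {"plan": plan}


def run_pipeline(profile: dict) -> dict:
    """Run the nodes in order, merging each partial update into the state."""
    state: PlanState = {"profile": profile}
    for node in (validate_profile, write_plan, enforce_format):
        state = {**state, **node(state)}
    return state["plan"]
```

Because every node raises its own error, a failure points straight at the step that produced it: a bad age fails in `validate_profile`, malformed model output fails in `enforce_format`.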
Chunked streaming
FastAPI’s StreamingResponse paired with LangGraph’s async streaming made this straightforward:
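```python
import json
from fastapi.responses import StreamingResponse

async def stream_plan(profile: UserProfile):
    # profile.dict() is Pydantic v1 style; on v2 use profile.model_dump().
    async for chunk in graph.astream(profile.dict()):
        # Serialise each chunk as JSON so the SSE payload is parseable client-side.
        yield f"data: {json.dumps(chunk, default=str)}\n\n"

@app.post("/plan")
async def generate_plan(profile: UserProfile):
    return StreamingResponse(stream_plan(profile), media_type="text/event-stream")
```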
The client receives tokens as they’re generated. Perceived latency drops from ~8s for the complete response to ~1.5s to the first token.
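On the receiving side, the stream is a sequence of `data:` lines separated by blank lines. The sketch below shows one way a Python client might reassemble those frames. The function name `iter_sse_data` is made up for this example, and it handles only `data:` fields, ignoring `event:`, `id:`, `retry:` and comment lines.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete server-sent event."""
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[5:]
            # A single space after the colon is part of the framing, not the data.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Any other field or comment line is skipped in this sketch.
    # An event left unterminated when the stream ends is dropped.
```

It can be fed from any HTTP client that iterates over response lines, which lets the UI render each chunk as soon as its frame is complete.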
Staying cheap
GPT-4o is expensive if you’re careless. Two things kept costs down:
- Caching — identical or near-identical profiles return a cached plan (Redis, 1h TTL). About 30% of requests hit cache after the first week.
- Token budgeting — the prompt enforces a maximum plan length. A 7-day plan doesn’t need 4,000 tokens.
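The post doesn't say how "near-identical" is decided, so the sketch below shows one possible approach, not the one the service uses: normalise the profile before hashing it into a cache key. The 5-year age bucket, the field handling, and the names `cache_key` and `TTLCache` are all assumptions. `TTLCache` is an in-memory stand-in for Redis with a 1h expiry, so the example runs without a server.

```python
import hashlib
import json
import time
from typing import Callable, Optional


def cache_key(profile: dict, age_bucket: int = 5) -> str:
    """Map near-identical profiles to the same key.

    Assumptions: ages in the same bucket can share a plan, and the order
    of the equipment and schedule lists does not matter.
    """
    normalised = {
        "age": (int(profile["age"]) // age_bucket) * age_bucket,
        "goal": str(profile["goal"]).strip().lower(),
        "equipment": sorted(e.strip().lower() for e in profile.get("equipment", [])),
        "schedule": sorted(d.strip().lower() for d in profile.get("schedule", [])),
    }
    payload = json.dumps(normalised, sort_keys=True).encode()
    return "plan:" + hashlib.sha256(payload).hexdigest()


class TTLCache:
    """In-memory stand-in for a Redis key with an expiry."""

    def __init__(self, ttl_seconds: float = 3600,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items: dict[str, tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: behave as a miss
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._items[key] = (self._clock() + self._ttl, value)
```

The trade-off sits in the normalisation: coarser buckets raise the hit rate but make plans less personal, so the bucket size is worth tuning against real traffic.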
Rate limiting
The service runs on Railway behind a simple sliding-window rate limiter (10 requests per user per hour). Railway’s environment variables made the configuration trivial to wire up: no config files, no secrets in code.
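For reference, a sliding-window limiter of this kind can be sketched in a few lines. This is an illustrative in-memory version, not the deployed code. The environment variable names (`RATE_LIMIT_REQUESTS`, `RATE_LIMIT_WINDOW_SECONDS`) and the class name are hypothetical, and the defaults mirror the 10 requests per user per hour mentioned above.

```python
import os
import time
from collections import defaultdict, deque
from typing import Callable

# Hypothetical variable names; the platform injects whatever is configured.
RATE_LIMIT = int(os.environ.get("RATE_LIMIT_REQUESTS", "10"))
RATE_WINDOW_SECONDS = int(os.environ.get("RATE_LIMIT_WINDOW_SECONDS", "3600"))


class SlidingWindowLimiter:
    """Allow at most `limit` requests per user in any `window`-second span."""

    def __init__(self, limit: int, window: float,
                 clock: Callable[[], float] = time.monotonic):
        self._limit = limit
        self._window = window
        self._clock = clock
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        now = self._clock()
        hits = self._hits[user_id]
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._limit:
            return False  # rejected requests are not recorded
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(RATE_LIMIT, RATE_WINDOW_SECONDS)
```

An endpoint would call `limiter.allow(user_id)` and answer with HTTP 429 when it returns `False`. The state lives in one process, so running more than one instance would need a shared store instead.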