Overview
A production API that accepts a user profile (age, goal, available equipment, weekly schedule) and returns a structured, personalised training plan. Responses stream token-by-token so users see output within seconds rather than waiting for the full plan to generate.
Architecture
The generation pipeline is built with LangGraph, breaking the process into three nodes: input validation and enrichment, plan generation with GPT-4o, and output format enforcement. Each node has a clear contract, making the pipeline easy to debug and extend.
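As a rough illustration of the three-node contract, the sketch below chains three plain Python functions over a shared state dictionary. It deliberately does not use the LangGraph API, so it runs without dependencies. The field names (`equipment`, `schedule`, `days_per_week`), the `PlanState` type, the canned draft text and `run_pipeline` are all assumptions made for illustration, and the GPT-4o call is replaced by a placeholder.

```python
from typing import TypedDict


class PlanState(TypedDict, total=False):
    """Shared state passed from node to node (hypothetical shape)."""
    profile: dict    # raw user profile from the request
    enriched: dict   # validated profile plus derived fields
    draft: str       # raw model output
    plan: dict       # final structured plan


def validate_and_enrich(state: PlanState) -> PlanState:
    """Node 1: reject incomplete profiles and add derived fields."""
    profile = state["profile"]
    required = ("age", "goal", "equipment", "schedule")
    missing = [key for key in required if key not in profile]
    if missing:
        raise ValueError(f"missing profile fields: {missing}")
    enriched = {**profile, "days_per_week": len(profile["schedule"])}
    return {**state, "enriched": enriched}


def generate_plan(state: PlanState) -> PlanState:
    """Node 2: placeholder for the GPT-4o call (canned text, runs offline)."""
    enriched = state["enriched"]
    draft = "\n".join(
        f"{day}: {enriched['goal']} session" for day in enriched["schedule"]
    )
    return {**state, "draft": draft}


def enforce_format(state: PlanState) -> PlanState:
    """Node 3: turn the free-text draft into a structured plan."""
    days = []
    for line in state["draft"].splitlines():
        day, _, workout = line.partition(": ")
        days.append({"day": day, "workout": workout})
    return {**state, "plan": {"days": days}}


def run_pipeline(profile: dict) -> dict:
    """Run the three nodes in order and return the structured plan."""
    state: PlanState = {"profile": profile}
    for node in (validate_and_enrich, generate_plan, enforce_format):
        state = node(state)
    return state["plan"]
```

Because each node only reads and writes named keys on the state, a failure can be traced to a single step, and a new node (for example, a safety check) could be slotted in between two existing ones without touching them.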
FastAPI handles the HTTP layer, with StreamingResponse and server-sent events delivering the streamed output to clients.
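The server-sent events framing can be sketched with the standard library alone. The example below wraps each token in a `data:` field terminated by a blank line, which is the wire format an SSE client expects. The helper names (`format_sse`, `fake_token_stream`, `sse_events`) and the final `done` event are assumptions, and the token source is a stand-in for the model stream.

```python
import asyncio
from typing import AsyncIterator, Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Frame one message in the server-sent events wire format."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # A payload containing newlines must be split across several data: fields.
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


async def fake_token_stream() -> AsyncIterator[str]:
    """Stand-in for the model's token stream (hypothetical tokens)."""
    for token in ["Day 1:", " squats", "\n", "Day 2:", " rest"]:
        await asyncio.sleep(0)  # yield control, as a real network read would
        yield token


async def sse_events(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Convert a token stream into SSE frames, ending with a 'done' event."""
    async for token in tokens:
        yield format_sse(token)
    yield format_sse("[DONE]", event="done")
```

In a FastAPI route, a generator like `sse_events(...)` could be passed to `StreamingResponse` with `media_type="text/event-stream"`. The explicit `done` event is one way to let the client know when to close the connection.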
Cost control
Running LLM inference at scale requires careful cost management. Two mechanisms keep costs predictable:
- Response caching via Redis: similar profiles return cached plans (1-hour TTL), reducing API calls by ~30% in steady state.
- Token budgeting: the system prompt enforces an output length ceiling. A 7-day plan doesn’t need 4,000 tokens.
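One possible way to make "similar profiles" share a cache entry is to normalise the profile before hashing it into a key. The sketch below is an assumption about how that could work, not a description of the deployed logic: the 10-year age bands, the field names, the `plan:` key prefix and the in-memory `TTLCache` (a stand-in for Redis so the example runs offline) are all hypothetical.

```python
import hashlib
import json
import time
from typing import Optional

TTL_SECONDS = 3600  # the 1-hour TTL mentioned above


def cache_key(profile: dict) -> str:
    """Build a cache key that is identical for 'similar' profiles."""
    normalised = {
        "age_band": (profile["age"] // 10) * 10,  # assumption: 10-year bands
        "goal": profile["goal"].strip().lower(),
        "equipment": sorted(e.strip().lower() for e in profile["equipment"]),
        "schedule": sorted(d.strip().lower() for d in profile["schedule"]),
    }
    payload = json.dumps(normalised, sort_keys=True, separators=(",", ":"))
    return "plan:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory stand-in for Redis GET and SET with an expiry."""

    def __init__(self, ttl: float = TTL_SECONDS) -> None:
        self.ttl = ttl
        self._store: dict = {}

    def get(self, key: str, now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if now >= expires_at:  # expired: drop the entry and report a miss
            del self._store[key]
            return None
        return value

    def set(self, key: str, value: str, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._store[key] = (value, now + self.ttl)
```

With this scheme a 31-year-old and a 38-year-old with the same goal, equipment and training days would map to the same key. Coarser bands raise the hit rate at the cost of less personalised plans, so the bucketing is a tuning decision.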
Deployment
Deployed on Railway with environment-based configuration. Rate limiting (10 requests/user/hour) is enforced at the API layer using a sliding-window algorithm backed by Redis.
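A sliding-window limiter of this kind can be sketched in memory as follows. This is a minimal single-process illustration, not the deployed code: the class name `SlidingWindowLimiter` and its `allow` method are hypothetical, and a per-user deque of timestamps stands in for the Redis-backed store.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Sliding-window log: keep one timestamp per accepted request."""

    def __init__(self, limit: int = 10, window_seconds: float = 3600.0) -> None:
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)  # user_id -> timestamps, oldest first

    def allow(self, user_id: str, now: Optional[float] = None) -> bool:
        """Return True and record the request if the user is under the limit."""
        now = time.time() if now is None else now
        hits = self._hits[user_id]
        # Drop timestamps that have slid out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # rejected requests are not recorded
        hits.append(now)
        return True
```

Unlike a fixed window that resets on the hour, this approach counts the requests in the 60 minutes before each call, so a user cannot send 10 requests just before a boundary and 10 more just after it. With Redis, the same idea is commonly expressed with a sorted set per user, trimmed with `ZREMRANGEBYSCORE` and counted with `ZCARD`.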