Feb 2026 · 11 min read

Shipping an AI fitness-plan API with FastAPI and LangGraph

From prompt design to chunked generation and rate limiting — the engineering behind a small LLM product that had to stay cheap and fast.

AI / ML · Backend

The brief

Build an API that takes a user profile (age, goal, equipment, schedule) and returns a personalised weekly training plan. It had to stream the response (users hate staring at a spinner), cost less than $0.01 per request, and survive a modest traffic spike without falling over.

Prompt design

The first version used a single large prompt. It worked, but it was fragile — small changes to the system prompt cascaded unpredictably into the output format.

LangGraph’s graph-based approach helped here. I split the pipeline into three nodes:

  1. Profile validation — checks the input makes sense, enriches it with defaults.
  2. Plan generation — the creative step, where GPT-4o actually writes the plan.
  3. Format enforcement — a lightweight pass that ensures the output matches the expected JSON schema.

Each node has a clear input/output contract. When something breaks, you know exactly which node to look at.
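To make the contract idea concrete, here is a minimal plain-Python sketch of the three steps. It deliberately leaves LangGraph out so it runs on its own. The state shape, the default values, the required plan keys, and the `llm` callable are all assumptions for illustration, not the production code.

```python
import json
from typing import Callable, TypedDict


class PlanState(TypedDict, total=False):
    profile: dict   # validated and enriched user profile
    raw_plan: str   # model output, expected to be JSON text
    plan: dict      # parsed plan that passed the format check


# Hypothetical defaults and required keys; the real schema is not shown here.
DEFAULTS = {"equipment": ["bodyweight"], "days_per_week": 3}
REQUIRED_PLAN_KEYS = {"goal", "days"}


def validate_profile(state: PlanState) -> PlanState:
    """Node 1: reject nonsensical input, fill in defaults."""
    profile = {**DEFAULTS, **state["profile"]}
    age = profile.get("age")
    if not isinstance(age, int) or age <= 0:
        raise ValueError("age must be a positive integer")
    if not profile.get("goal"):
        raise ValueError("goal is required")
    return {**state, "profile": profile}


def make_generate_plan(llm: Callable[[str], str]) -> Callable[[PlanState], PlanState]:
    """Node 2: wrap any prompt -> text callable (a stand-in for the model call)."""
    def generate_plan(state: PlanState) -> PlanState:
        prompt = "Write a weekly training plan as JSON for: " + json.dumps(
            state["profile"], sort_keys=True
        )
        return {**state, "raw_plan": llm(prompt)}
    return generate_plan


def enforce_format(state: PlanState) -> PlanState:
    """Node 3: parse the output and check it has the expected shape."""
    plan = json.loads(state["raw_plan"])  # invalid JSON raises a ValueError subclass
    if not isinstance(plan, dict):
        raise ValueError("plan must be a JSON object")
    missing = REQUIRED_PLAN_KEYS - plan.keys()
    if missing:
        raise ValueError(f"plan is missing keys: {sorted(missing)}")
    return {**state, "plan": plan}


def run_pipeline(profile: dict, llm: Callable[[str], str]) -> dict:
    """Run the three nodes in order; an exception points at the failing node."""
    state: PlanState = {"profile": profile}
    for node in (validate_profile, make_generate_plan(llm), enforce_format):
        state = node(state)
    return state["plan"]
```

Because every node takes and returns the same state dictionary, each one can be unit-tested with a hand-built state, and the model call can be swapped for a fake in tests.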

Chunked streaming

FastAPI’s StreamingResponse paired with LangGraph’s async streaming made this straightforward:

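import json

from fastapi.responses import StreamingResponse

async def stream_plan(profile: UserProfile):
    # profile.dict() is the Pydantic v1 spelling; on Pydantic v2 use profile.model_dump().
    # Depending on the LangGraph stream mode, a chunk may be a node update rather than a token.
    async for chunk in graph.astream(profile.dict()):
        # Serialise as JSON so the client can parse each SSE payload.
        yield f"data: {json.dumps(chunk, default=str)}\n\n"

@app.post("/plan")
async def generate_plan(profile: UserProfile):
    return StreamingResponse(stream_plan(profile), media_type="text/event-stream")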

The client receives tokens as they’re generated. Perceived latency drops from ~8s (waiting for the complete plan) to ~1.5s to the first token.
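On the receiving side, each event arrives as a `data:` line followed by a blank line. The helper below is a small sketch of how a client might split such a stream into payloads. The function name `iter_sse_data` is made up for this example, and it ignores every server-sent-event field except `data:`.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete event in a stream of SSE lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the payload.
            if value.startswith(" "):
                value = value[1:]
            buffer.append(value)
        # Other fields (event:, id:, retry:) and comment lines are skipped in this sketch.
    # An event that was never terminated by a blank line is dropped.


if __name__ == "__main__":
    body = 'data: {"day": "Mon"}\n\ndata: {"day": "Tue"}\n\n'
    for payload in iter_sse_data(body.splitlines()):
        print(json.loads(payload))
```

In a browser, `EventSource` or a `fetch` reader does this framing for you. The sketch is mainly useful for tests and for non-browser clients.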

Staying cheap

GPT-4o is expensive if you’re careless. Two things kept costs down:

  • Caching — identical or near-identical profiles return a cached plan (Redis, 1h TTL). About 30% of requests hit cache after the first week.
  • Token budgeting — the prompt enforces a maximum plan length. A 7-day plan doesn’t need 4,000 tokens.
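The caching bullet can be sketched roughly as follows. The normalisation rules (five-year age bands, lower-cased goal, sorted equipment) are assumptions about what "near-identical" could mean, and `TTLCache` is an in-memory stand-in for Redis so the example runs without a server.

```python
import hashlib
import json
import time
from typing import Callable, Optional


def cache_key(profile: dict) -> str:
    """Map equivalent profiles to one key (hypothetical normalisation rules)."""
    normalised = {
        "age_band": int(profile["age"]) // 5 * 5,  # ages 30 to 34 share a band
        "goal": str(profile["goal"]).strip().lower(),
        "equipment": sorted(e.strip().lower() for e in profile.get("equipment", [])),
        "days_per_week": int(profile.get("days_per_week", 3)),
    }
    blob = json.dumps(normalised, sort_keys=True, separators=(",", ":"))
    return "plan:" + hashlib.sha256(blob.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory stand-in for a Redis key with an expiry (1h by default)."""

    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: behave as a miss
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._store[key] = (self._clock() + self._ttl, value)


def get_or_generate(profile: dict, cache: TTLCache,
                    generate: Callable[[dict], str]) -> str:
    """Return a cached plan if one exists, otherwise generate and store it."""
    key = cache_key(profile)
    cached = cache.get(key)
    if cached is not None:
        return cached
    plan = generate(profile)
    cache.set(key, plan)
    return plan
```

Coarser normalisation raises the hit rate but makes plans less personal, so the banding is a product decision as much as an engineering one.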

Rate limiting

The service is deployed on Railway behind a simple sliding-window rate limiter (10 requests / user / hour). Railway’s environment variables made it trivial to wire up: no config files, no secrets in code.
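For reference, a sliding-window limiter of this kind can be sketched in a few lines. This version keeps state in process memory, which only works for a single instance, and the environment variable name `RATE_LIMIT_PER_HOUR` is a hypothetical choice for the example.

```python
import os
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` requests per user within any `window_seconds` span."""

    def __init__(self, limit: int = 10, window_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._limit = limit
        self._window = window_seconds
        self._clock = clock
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        now = self._clock()
        hits = self._hits[user_id]
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._limit:
            return False  # rejected requests are not recorded
        hits.append(now)
        return True


def limiter_from_env() -> SlidingWindowLimiter:
    """Read the limit from an environment variable (hypothetical name), default 10."""
    return SlidingWindowLimiter(limit=int(os.environ.get("RATE_LIMIT_PER_HOUR", "10")))
```

In a FastAPI handler, a `False` result from `allow(user_id)` would typically become an HTTP 429 response. With more than one instance, the timestamps would need to live in shared storage such as Redis.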