Architecture

LLMs Are Infrastructure, Not Just APIs

By Jules Cesar Junior Ndayisenga

The default mental model for LLMs in production is simple: call the API, get a response, display it. OpenAI's Chat Completions endpoint. Anthropic's Messages API. One HTTP request. The LLM is treated as an external service, a black box you query and forget.

This mental model is dangerous for production systems.

APIs Break in Production

When you treat an LLM as just another API, you inherit all the fragility of external dependencies, plus new failure modes unique to language models:

  • Latency variance: the same prompt can take 500ms or 15 seconds depending on load. Your timeout strategy matters more than your prompt engineering.
  • Output instability: the same input produces different outputs across calls. Deterministic workflows need guardrails, retries, and validation layers.
  • Cost explosion: an unconstrained agent loop can burn through your API budget in minutes. Rate limiting and token budgets aren't optional.
  • Vendor lock-in: building directly against OpenAI's SDK couples your architecture to a single provider's pricing and availability.
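The latency and instability problems above are usually handled with retries and backoff around the raw call. A minimal sketch in Python, where `call_fn` is a placeholder for whatever provider SDK call you actually make (the names here are illustrative, not any vendor's API):

```python
import random
import time

def call_with_retries(call_fn, max_attempts=3, base_delay=0.5):
    """Retry a flaky LLM call with exponential backoff and jitter.

    Per-request timeouts should be enforced at the HTTP client level;
    this wrapper only decides when to give up and how long to wait
    between attempts.
    """
    for attempt in range(max_attempts):
        try:
            return call_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff with jitter to avoid synchronized retries.
            time.sleep(base_delay * (2 ** attempt) * random.random())
```

In production you would narrow the `except` to the transient error types your provider raises (rate limits, timeouts) rather than retrying everything.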

The Infrastructure Mindset

Treating LLMs as infrastructure means applying the same engineering discipline we use for databases, message queues, and caching layers:

  • Abstraction layers: wrap LLM calls behind an interface. Swap providers without touching business logic.
  • Circuit breakers: when the LLM service degrades, fall back to cached responses, simpler models, or rule-based logic.
  • Observability: log every prompt, response, latency, and token count. You can't optimize what you can't measure.
  • Semantic caching: similar prompts should return cached results. Not every call needs to hit the model.
  • Output validation: structured output parsing (JSON schemas, Pydantic models) ensures the model's response is usable by downstream systems.
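The first two items, abstraction and fallback, can be sketched together: hide providers behind one interface, then chain them so degradation is graceful. This is a simplified illustration (the class and method names are invented for this example, not a real SDK's):

```python
from typing import Protocol

class LLMClient(Protocol):
    """Provider-agnostic interface: swap vendors without touching callers."""
    def complete(self, prompt: str) -> str: ...

class FallbackClient:
    """Try a primary provider; on failure, degrade to a cheaper backup,
    then to a canned rule-based response."""

    def __init__(self, primary: LLMClient, backup: LLMClient,
                 default: str = "The assistant is temporarily unavailable."):
        self.primary = primary
        self.backup = backup
        self.default = default

    def complete(self, prompt: str) -> str:
        for client in (self.primary, self.backup):
            try:
                return client.complete(prompt)
            except Exception:
                continue  # this tier failed; degrade to the next one
        return self.default  # last resort: never crash the user-facing path
```

Because business logic only sees `LLMClient`, adding a real circuit breaker (tripping after N consecutive failures, skipping the primary for a cooldown window) is a change inside this wrapper, not a change to callers.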

What Changes When You Think This Way

At Asyst, I build AI systems that serve real users: chatbots, interview simulators, automation tools. The difference between a demo and a production system is entirely in how you treat the LLM layer. The prompt is 10% of the work. The other 90% is error handling, retry logic, cost control, output validation, and graceful degradation.
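Of that 90%, cost control is the easiest to sketch and the easiest to skip. One pattern, hedged as a minimal illustration rather than a complete metering system, is a hard per-session token budget checked before every call:

```python
class TokenBudget:
    """Hard cap on tokens spent in a session or agent loop.

    charge() is called with the token count of each request/response pair;
    once the cap would be exceeded, it raises instead of silently spending.
    """

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.spent = 0

    def charge(self, tokens: int) -> None:
        if self.spent + tokens > self.max_tokens:
            raise RuntimeError(
                f"token budget exceeded: {self.spent + tokens} > {self.max_tokens}"
            )
        self.spent += tokens
```

An unconstrained agent loop then fails loudly at a known cost ceiling instead of running until the invoice arrives.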

The industry will mature past the "just call the API" phase. The engineers who treat LLMs as infrastructure today will be the ones trusted to build the systems that run in production tomorrow.

Production AI isn't about the model. It's about everything around it.