The self-hosted open-source LLM pitch sounds compelling. Own your model. No per-token costs. Train it on your data. Run it on your hardware. Total control.

Then you do the math.

The Critical Requirement: Tool Use

Our AI assistant doesn't write poetry. It parses natural language into structured operations. "Order a Bergara for Richard at 15% margin" needs to resolve into the right handler, the right parameters, the right sequence of API calls. That's tool use — and it's where frontier models are significantly ahead of open-source alternatives.

A model that gets the answer right 90% of the time is unusable for operational software. When a dealer says "invoice that order," intent parsing has to be right every time. One misrouted command is a compliance issue, a financial error, or a lost customer. We tested open-source models on our tool-use workloads. It wasn't close.
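To make "the right handler, the right parameters" concrete, here is a minimal sketch of the dispatch side: the model emits a structured tool call, and application code routes it to a registered handler. Tool names, parameters, and the example command are illustrative, not our actual schema.

```python
# Hypothetical sketch: routing a model's structured tool call to a handler.
HANDLERS = {}

def handler(name):
    """Register a function as the handler for a named tool."""
    def register(fn):
        HANDLERS[name] = fn
        return fn
    return register

@handler("create_order")
def create_order(dealer: str, product: str, margin_pct: float) -> str:
    return f"order: {product} for {dealer} at {margin_pct}% margin"

def dispatch(tool_call: dict) -> str:
    """Route one tool call emitted by the model. Unknown tools fail loudly --
    a misrouted command must never be silently guessed."""
    name = tool_call["name"]
    if name not in HANDLERS:
        raise ValueError(f"unknown tool: {name}")
    return HANDLERS[name](**tool_call["arguments"])

# "Order a Bergara for Richard at 15% margin" -> the model would emit:
call = {"name": "create_order",
        "arguments": {"dealer": "Richard", "product": "Bergara",
                      "margin_pct": 15.0}}
```

The model's only job is producing that structured call reliably; everything after it is deterministic code. That is exactly the step where the frontier/open-source gap shows up.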

The GPU Math

Monthly cost comparison at startup scale:

| Approach | Monthly cost | Notes |
| --- | --- | --- |
| Self-hosted (A100 instance) | $1,500 – $3,000 | Running 24/7 for availability, plus DevOps time |
| API (Claude via Bedrock) | $100 – $250 | Pay per token, at startup scale (50 dealers) |
| Break-even point | Hundreds of active tenants | Years away |

The break-even math is clear: you need hundreds of active tenants running thousands of daily conversations before self-hosting makes economic sense. At startup scale, API costs are a rounding error compared to GPU hosting.
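The back-of-envelope version of that math, using the midpoints of the ranges in the table above. The assumption that API cost scales roughly linearly with tenant count is mine, for illustration:

```python
# Break-even estimate: flat GPU cost vs. per-token API cost that scales
# with tenants. Figures are midpoints of the ranges quoted above.
GPU_MONTHLY = 2250.0        # midpoint of $1,500-$3,000 A100 estimate
API_MONTHLY_AT_50 = 175.0   # midpoint of $100-$250 at 50 dealers

api_cost_per_tenant = API_MONTHLY_AT_50 / 50   # ~$3.50 per tenant per month

# Tenants at which a flat GPU bill equals a linearly scaling API bill:
break_even_tenants = GPU_MONTHLY / api_cost_per_tenant
print(round(break_even_tenants))  # ~643 tenants -- hundreds, as stated
```

Even if the per-tenant API cost doubles as usage deepens, break-even stays in the hundreds of tenants, which is the point.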

What "Training" Actually Means

Most people who say "train it on your data" mean one of three things, and two of them are free:

1. A system prompt — instructions and domain context injected into every request.
2. Retrieval — pulling relevant records from your own database and including them in the prompt.
3. Actual fine-tuning — updating model weights, which needs GPUs, training pipelines, and evaluation infrastructure.

In practice, what most people mean by "training" is the first: a good system prompt. You don't need a GPU for that.
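A minimal sketch of this kind of "training": a domain system prompt plus records retrieved from your own database, assembled into the request at call time. No weights are updated. All names and example data here are hypothetical.

```python
# "Training" without a GPU: static instructions + per-request retrieval.
SYSTEM_PROMPT = (
    "You are an order assistant for a dealer platform. "
    "Resolve commands into tool calls. Margins are percentages. "
    "Never invent dealer or product names."
)

def build_messages(user_text: str, retrieved_records: list[str]) -> list[dict]:
    """Inject retrieved context into the request -- the model stays frozen."""
    context = "\n".join(retrieved_records)
    return [
        {"role": "user",
         "content": f"Relevant records:\n{context}\n\nCommand: {user_text}"},
    ]

msgs = build_messages(
    "Order a Bergara for Richard at 15% margin",
    ["dealer: Richard (id 42)", "product: Bergara B-14 (sku BG-014)"],
)
```

Everything here is string assembly. The expensive third meaning, real fine-tuning, is the only one that touches a GPU.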

The Bedrock Advantage

AWS Bedrock collapses the vendor relationship. Single AWS bill. IAM authentication — the same identity system that manages your entire infrastructure. Same region as your database and application servers. No separate API key management, no second vendor relationship, no additional compliance surface.
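Concretely, the call site looks like any other AWS SDK call. A hedged sketch assuming the Bedrock Converse API via boto3; the model ID is illustrative, so check which IDs are enabled in your region:

```python
# Sketch of a Bedrock Converse request. No API keys anywhere: credentials
# resolve through IAM, the same identity system as the rest of the stack.
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # illustrative

def build_request(system_prompt: str, user_text: str) -> dict:
    """Payload shape for bedrock-runtime's Converse operation."""
    return {
        "modelId": MODEL_ID,
        "system": [{"text": system_prompt}],
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
    }

# In the application:
#   import boto3
#   client = boto3.client("bedrock-runtime", region_name="us-east-1")
#   resp = client.converse(**build_request("You are an order assistant.",
#                                          "Invoice order 42."))
#   text = resp["output"]["message"]["content"][0]["text"]
```

The boto3 client picks up the same IAM role as your database and application servers, which is the whole point: no second vendor, no separate secret to rotate.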

For a solo founder, every hour spent wrangling GPU instances, debugging model outputs, and managing inference infrastructure is an hour not spent shipping features for customers. The API lets you treat AI as a utility, like a database or a CDN. You call it when you need it and don't think about it otherwise.

The Hybrid Option Stays Open

The tempting optimization: use a cheap model for simple lookups and a frontier model for complex workflows. But that call needs real usage data, not speculation.

The architecture supports model switching. The AI assistant talks to a port interface, not to Claude directly. Tomorrow we could route simple queries to a smaller model and complex tool-use workflows to the frontier model. But that optimization requires usage patterns we don't have yet. Premature optimization of AI infrastructure is the same trap as premature optimization of code: you're solving a problem you've imagined, not one you've measured.
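The port interface keeps that door open. A minimal sketch, assuming a structural-typing port and a routing rule; the adapters, method names, and the `needs_tools` flag are all hypothetical:

```python
from typing import Protocol

class ChatModel(Protocol):
    """The port the assistant talks to -- never a vendor SDK directly."""
    def complete(self, prompt: str) -> str: ...

class FrontierModel:
    def complete(self, prompt: str) -> str:
        return "frontier:" + prompt   # stand-in for the real adapter

class SmallModel:
    def complete(self, prompt: str) -> str:
        return "small:" + prompt      # stand-in for a cheaper adapter

def route(prompt: str, needs_tools: bool,
          small: ChatModel, frontier: ChatModel) -> str:
    """Hypothetical routing rule: tool-use workflows go to the frontier
    model, simple lookups to the cheap one. The real threshold should
    come from measured usage data, not from guesses."""
    model = frontier if needs_tools else small
    return model.complete(prompt)
```

Because both adapters satisfy the same `Protocol`, swapping or splitting models is a one-line change at the routing layer, not a rewrite of the assistant.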

Ship with the best available model. Collect real usage data. Optimize when the data tells you to, not when the Hacker News comments tell you to.