What Product Managers Actually Need to Know About How Models Work
Layer 1 of the AI Fluency framework: enough technical knowledge to hold your own in conversations with engineers, without becoming an ML researcher.
You’re in a technical review meeting. An engineer mentions that “the model’s context window is the real bottleneck here,” and you need to know whether that’s a fundamental constraint or an excuse. Two minutes later, someone else says “we should fine-tune instead of prompting,” and you need to ask the right question: is that the cheapest solution, or are they reaching for a tool they know instead of thinking through the problem?
This is Layer 1 of the AI Fluency framework I introduced in the first article of this series: enough to never say something wrong in a technical conversation, and enough to ask the questions that separate clear thinking from engineering mythology.
You don’t need to actually build any of this. You just need to understand it well enough to make product decisions and hold ground in conversations where technical trade-offs are real and money is at stake.
Tokens: Why They Matter More Than You’d Think
A token is the smallest unit of text a language model processes. It’s not a word. The distinction matters.
Roughly: one token is about four characters of English text, or 0.75 words. A 1,000-word article is about 1,300 tokens. The reason we talk about tokens instead of words is that models handle subword units using algorithms like Byte Pair Encoding, which lets them handle unknown words, code, punctuation, and multiple languages with a single vocabulary. “Unhappiness” might become two tokens: “un” and “happiness.” “iPhone” might be one.
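The rules of thumb above can be turned into a back-of-envelope estimator. This is a rough sketch, not a tokenizer: the function names are invented for illustration, and the two ratios are the approximations quoted above, which hold only loosely for English prose and drift for code or other languages.

```python
import math

CHARS_PER_TOKEN = 4      # rule of thumb from the text, English prose only
WORDS_PER_TOKEN = 0.75   # rule of thumb from the text

def estimate_tokens_from_words(word_count: int) -> int:
    """Rough token estimate from a word count."""
    return math.ceil(word_count / WORDS_PER_TOKEN)

def estimate_tokens_from_text(text: str) -> int:
    """Rough token estimate from character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)

print(estimate_tokens_from_words(1_000))    # 1334
print(estimate_tokens_from_words(50_000))   # 66667
```

For real budgeting, ask your engineers to count tokens with the provider’s own tokenizer, since the true count depends on the model.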
Why you care: Token counting affects your pricing model, your cost per query, and whether a feature is economically viable at scale.
If you’re building a document Q&A feature and charging per query, you need to estimate how many tokens each query will consume. If your average document is 50,000 words, that’s roughly 67,000 tokens. If you include that entire document in every query, your inference cost balloons. This is why retrieval systems exist: to send only the relevant chunks, not the whole document. Understanding tokens forces you to think about chunking strategy early, not as an afterthought when costs explode.
Test your thinking: if a feature needs to include 100 customer support tickets as context (to handle multi-turn support conversations), and each ticket averages 200 tokens, what’s your per-request cost at current API prices? If that number makes your pricing model break, you now know retrieval is not optional.
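Here is one way to work that exercise through. The price is a placeholder chosen for round numbers, not a quote from any provider, and the sketch counts input tokens only.

```python
def input_cost_usd(tokens: int, price_per_million_usd: float) -> float:
    """Cost of sending `tokens` input tokens at a given price per million."""
    return tokens * price_per_million_usd / 1_000_000

TICKETS = 100
TOKENS_PER_TICKET = 200
HYPOTHETICAL_INPUT_PRICE = 3.00   # USD per 1M input tokens (placeholder)

context_tokens = TICKETS * TOKENS_PER_TICKET
per_request = input_cost_usd(context_tokens, HYPOTHETICAL_INPUT_PRICE)

print(context_tokens)         # 20000
print(f"{per_request:.2f}")   # 0.06
```

Six cents sounds harmless until you multiply it by request volume: under these assumptions, a million requests a month is $60,000 of context before the model has written a word.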
Context Windows: The Hard Limit
The context window is everything the model can see at once. It includes the system prompt, conversation history, any retrieved documents, and the current user query. The model has no memory outside of it. This is a hard architectural constraint, not a software limitation you can engineer around.
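A simple budget check makes the point that everything competes for the same space. This is a sketch under assumptions: the class and field names are hypothetical, and it assumes the generated answer must also fit inside the window, which is how most APIs behave but is worth confirming for your model.

```python
from dataclasses import dataclass

@dataclass
class ContextBudget:
    window: int            # model's maximum context size, in tokens
    reserved_output: int   # room kept free for the model's answer (assumption)

    def fits(self, system: int, history: int, retrieved: int, query: int) -> bool:
        """True if all four input parts plus the reserved output fit in the window."""
        used = system + history + retrieved + query
        return used + self.reserved_output <= self.window

budget = ContextBudget(window=200_000, reserved_output=4_000)
print(budget.fits(system=5_000, history=30_000, retrieved=150_000, query=500))  # True
print(budget.fits(system=5_000, history=60_000, retrieved=150_000, query=500))  # False
```

Note what changed between the two calls: only the conversation history grew. Long chats quietly eat the room you had set aside for retrieved documents.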
Claude Sonnet 4.6 offers a 1 million-token context window (with a 200K standard tier). Gemini 3.1 Pro also offers 1 million tokens. GPT-5.4 exposes 272,000 tokens by default and up to 1 million via the API. These numbers change rapidly, and they matter for your product architecture in three ways.
First, cost. Larger contexts cost more to process, both in tokens consumed and in time to generate the first token. A 1 million-token context doesn’t help your user experience if their first response takes ten seconds.
Second, quality. Models don’t attend equally to all parts of a long context. Research shows a “lost in the middle” phenomenon: information buried in the middle of a very long context is recalled less reliably than information at the beginning or end. If you’re stuffing 100 retrieved documents into a context window, the model may ignore the most relevant ones if they appear in the middle. This makes retrieval quality more important than raw window size. A smaller, well-chosen set of documents often outperforms a larger, unsorted dump.
Third, architectural decisions. Large context windows invite a kind of laziness in product design: just fit everything in and let the model sort it out. That rarely works. Instead, context limits drive you towards smarter retrieval, better chunking strategies, and summarisation pipelines that compress long documents before inclusion. These constraints often improve your product.
The mistake to avoid: “Bigger context window = better product.” It usually doesn’t. Retrieval quality matters more than raw window size. And if someone suggests using a 1 million-token context to avoid building a retrieval system, push back.
Training vs Inference: Where the Fix Lives
A language model has two distinct phases that are easy to conflate but very different in cost, time, and product implication.
Training is the one-time, massively expensive process where a model learns from data. Thousands of GPUs run for weeks or months. Parameters (the model’s internal weights) get updated billions of times. The output is a large file encoding everything the model learned. Training a current frontier model (the Claude 4 series, GPT-5 series, Gemini 3 series) runs into the hundreds of millions of dollars, and the trend is upward. Training creates a hard knowledge cutoff: the model only knows what was in its training data, up to a specific date.
Inference is using the model. That is the whole idea. Training produces a finished object: a large file of learned patterns, sitting on a disk somewhere. Inference is running that file against a new input to produce an output. The model itself does not change. It is a finished thing being used.
An analogy that might help it stick. Training is the years a chef spent developing a cookbook: experimenting, failing, adjusting, and eventually writing everything down. Inference is you cooking dinner from that cookbook tonight. You are not rewriting the recipes. You are following the instructions already there, with your ingredients, to produce a meal. Every ChatGPT message, every API call, every autocomplete suggestion is someone cooking dinner. The cookbook itself never changes.
For a language model specifically, that means your prompt flows in, and the model generates output one token at a time, each new token conditioned on everything that came before. This is called autoregressive generation, and three things follow from it that you cannot ignore.
First, latency is dominated by output length, not input length. A 50-token prompt that produces a 1,000-token answer takes roughly 20x as long to generate as the same prompt with a 50-token answer. This is also why streaming interfaces feel responsive: tokens appear as they are produced, so the user sees progress instead of staring at a spinner. Without streaming, a long response feels broken even when it is not.
Second, there are two different latency metrics and they have different causes. Time to first token (TTFT) is how long the user waits before anything appears. It is dominated by prompt length, because the whole input has to be processed before generation starts. Tokens per second (TPS) is how fast the response streams once it starts. It is set by the model’s decoding speed and by how loaded the serving infrastructure is; output length then determines how long the stream runs at that rate. A product can have fast TTFT and slow TPS, or the reverse, and the user experiences these as two different problems. Know which one is biting you before you try to fix it.
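The two metrics combine into a simple mental model of total response time. The numbers below are made up for illustration and the function name is hypothetical; real TTFT and TPS vary by model, prompt size, and load.

```python
def response_latency_s(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Total wait = time to first token + time spent streaming the output."""
    return ttft_s + output_tokens / tokens_per_s

TTFT = 0.5   # seconds (illustrative)
TPS = 50.0   # tokens per second (illustrative)

print(response_latency_s(TTFT, 50, TPS))      # 1.5
print(response_latency_s(TTFT, 1_000, TPS))   # 20.5
```

The streaming portion goes from 1 second to 20 seconds, the 20x from the first point, while TTFT stays the same. Shortening the prompt would only shave the 0.5; shortening the answer attacks the 20.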
Third, inference is stateless. The model has no memory between calls. If you are building a chat product, every turn sends the entire conversation history back to the model, and you pay for all of it, every time. This is why long chats get expensive: the input cost of each turn grows linearly with the history, so the cumulative cost of the conversation grows roughly quadratically with the number of turns, and latency creeps up as the context fills. Summarisation pipelines, context truncation, and prompt caching exist to blunt this curve.
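A simplified model shows the shape of that curve. It assumes every turn adds the same number of tokens and that the full history is resent each time with no caching or truncation; both the function name and the figures are illustrative.

```python
def input_tokens_per_turn(turns: int, tokens_per_turn: int, system_tokens: int = 0) -> list[int]:
    """Input tokens sent on each turn when the full history is resent every time."""
    return [system_tokens + k * tokens_per_turn for k in range(1, turns + 1)]

per_turn = input_tokens_per_turn(turns=20, tokens_per_turn=300)

print(per_turn[0], per_turn[-1])   # 300 6000
print(sum(per_turn))               # 63000
```

The twentieth turn costs 20 times the first, and the conversation as a whole consumed 63,000 input tokens to exchange 6,000 tokens of actual content.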
On pricing: input tokens and output tokens are priced separately, and output is typically 3 to 5 times more expensive. This quietly shapes architecture. RAG systems pump up your input count, which is cheap. Long generations pump up your output count, which is where the bill lives. A chatbot that produces concise answers is not just better UX, it is better unit economics.
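To see where the bill lives, compare a retrieval-heavy request with a generation-heavy one. The prices are placeholders with output set at 5x input, the top of the range mentioned above; they are not any provider’s actual rates.

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price: float, output_price: float) -> float:
    """Cost of one request; prices are USD per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

INPUT_PRICE = 3.00    # placeholder
OUTPUT_PRICE = 15.00  # placeholder, 5x input

rag_request = request_cost_usd(8_000, 300, INPUT_PRICE, OUTPUT_PRICE)
long_answer = request_cost_usd(500, 3_000, INPUT_PRICE, OUTPUT_PRICE)

print(f"{rag_request:.4f}")   # 0.0285
print(f"{long_answer:.4f}")   # 0.0465
```

The RAG request moves more than twice as many tokens in total and still costs less, because almost all of them are on the cheap side of the ledger.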
On caching: in 2026, every major provider offers some form of prompt caching, where a prefix you reuse across requests (your system prompt, a retrieved document, a large code file) can be charged at a fraction of the normal input rate, sometimes as low as 10 percent. For conversational products this changes the economics materially. If you have a 5,000-token system prompt, you almost certainly want it cached. If you are sending the same retrieved documents across a multi-turn conversation, you want those cached too. This is a standard PM question to ask your engineers: are we actually hitting the cache?
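A rough comparison of a cache hit against a miss, for the 5,000-token system prompt mentioned above. Everything numeric here is an assumption: the base price is a placeholder, the cached rate uses the 10 percent figure as a best case, and the sketch ignores any surcharge a provider may apply when the cache is first written.

```python
def input_cost_with_cache(prefix_tokens: int, fresh_tokens: int,
                          price_per_million: float, cached_fraction: float,
                          cache_hit: bool) -> float:
    """Input cost when a reusable prefix may be billed at a discounted rate."""
    prefix_rate = cached_fraction if cache_hit else 1.0
    billable = prefix_tokens * prefix_rate + fresh_tokens
    return billable * price_per_million / 1_000_000

miss = input_cost_with_cache(5_000, 500, 3.00, 0.10, cache_hit=False)
hit = input_cost_with_cache(5_000, 500, 3.00, 0.10, cache_hit=True)

print(f"{miss:.4f}")   # 0.0165
print(f"{hit:.4f}")    # 0.0030
```

Under these assumptions a hit costs less than a fifth of a miss, which is why the cache hit rate deserves a place on your dashboard.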
Most of the product decisions you make as a PM are about inference, not training. Training is where capability comes from. Inference is where your unit economics, your latency, and most of your visible user experience live.
The consequence: if a user says “the model gave wrong information,” the fix could be any of these, each with different timelines and costs:
- A prompt change (inference-time, immediate, cheap)
- A retrieval fix via RAG (inference-time with moderate complexity)
- A fine-tuning run (training-time, days to weeks, expensive)
- Full retraining (training-time, months, very expensive)
Understanding which bucket the problem falls into determines the timeline and cost you communicate to stakeholders. A prompt change is an hour. A fine-tuning run is a project. These are not equivalent.
This also explains why models get recent events wrong: the training data has a cutoff. Without RAG to inject fresh information at inference time, the model has literally no way to know about last week’s news, and it may hallucinate an answer instead. This is why every serious AI product that touches current information needs retrieval.
Also: when an API provider ships a new model version (GPT-5.3 to GPT-5.4, or Claude Sonnet 4.5 to 4.6), the weights have changed, typically through new or additional training. Behaviour can shift in unexpected ways. Previously-passing evaluations may fail. This is why you need evaluation suites to detect when a model update breaks your product before it reaches users.
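At its smallest, an evaluation suite is a list of prompts with checks, run against both model versions. The sketch below uses two stand-in functions in place of real model calls; the names, cases, and the 2-point tolerance are all hypothetical.

```python
from typing import Callable

# Each case pairs a prompt with a check on the model's answer.
EvalCase = tuple[str, Callable[[str], bool]]

def pass_rate(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of cases whose check passes on the model's output."""
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)

def has_regressed(old_rate: float, new_rate: float, tolerance: float = 0.02) -> bool:
    """Flag a drop in pass rate larger than the tolerance."""
    return new_rate < old_rate - tolerance

cases: list[EvalCase] = [
    ("What is the refund window?", lambda out: "14 days" in out),
    ("Reply in JSON.", lambda out: out.strip().startswith("{")),
]

def old_model(prompt: str) -> str:   # stand-in for the current model version
    return '{"answer": "Refunds take 14 days"}'

def new_model(prompt: str) -> str:   # stand-in for the updated model version
    return "Refunds take 14 days."

print(pass_rate(old_model, cases), pass_rate(new_model, cases))                  # 1.0 0.5
print(has_regressed(pass_rate(old_model, cases), pass_rate(new_model, cases)))  # True
```

The new stand-in still knows the fact but stopped returning JSON, which is exactly the kind of quiet behavioural shift that breaks downstream code.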
The Three Levers: Prompting, RAG, and Fine-tuning
When you want a model to behave differently or know more, you have three primary levers. Picking the wrong one is one of the most expensive mistakes a PM can make.
Prompting means shaping behaviour through the input: system prompt, user message, few-shot examples. It’s immediate, cheap, reversible, and requires no new data. But it consumes context window space, can be fragile (the model can be distracted from instructions by conflicting content), and is bad for injecting large bodies of knowledge.
Use prompting for style, format, tone, and output structure. It’s your first iteration tool. Change a prompt on a Tuesday and see results on Wednesday. Iterate until you’re hitting quality targets.
Retrieval-Augmented Generation (RAG) injects relevant information from an external knowledge base at query time. You embed documents into vectors, store them in a vector database, and at query time you retrieve the most similar chunks and insert them into the prompt. The model then generates a response grounded in that context.
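The embed, retrieve, insert loop can be sketched in a few lines. One loud assumption: `toy_embed` is a bag-of-words stand-in for a real embedding model, so it matches shared words rather than meaning, and a plain list stands in for the vector database. All function names are illustrative.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = toy_embed(query)
    return sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)[:k]

def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Insert the retrieved chunks into the prompt ahead of the question."""
    context = "\n\n".join(context_chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

chunks = [
    "Refunds are issued within 14 days of a return request.",
    "Our office is closed on public holidays.",
    "Shipping takes 3 to 5 business days.",
]
top = retrieve("how long do refunds take", chunks, k=1)
print(build_prompt("how long do refunds take", top))
```

Only the refund chunk reaches the model. Swapping `toy_embed` for a real embedding model changes the quality of the match, not the shape of the pipeline.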
Use RAG when your knowledge base changes frequently (news, documentation, internal data), when factual accuracy matters, when the knowledge is too large to fit in a context window, or when you’re handling proprietary data that shouldn’t appear in training. RAG is updatable without retraining: just update the index.
The catch: retrieval quality is a bottleneck. The model can only be as good as what it retrieves. Poor chunking strategies break context. The pipeline has multiple failure points.
Fine-tuning continues training a pre-trained model on task-specific data. It’s expensive and slow to iterate, requires high-quality labelled data (often hundreds to thousands of examples), and carries the risk of “catastrophic forgetting” where the model’s general capabilities degrade. But fine-tuning deeply encodes behaviour into the model’s weights. It requires no large system prompt at inference time. For narrow, well-defined tasks with stable patterns, it can be very effective.
Use fine-tuning when you have a stable task with clear input/output patterns, sufficient high-quality training data, and when prompting alone isn’t hitting your quality targets.
Here’s the decision framework:
What are you trying to change?
│
├── Knowledge (facts, data, documents)
│   └── Use RAG
│
├── Style, format, or tone
│   ├── Try prompting first
│   └── Fine-tune only if prompting fails
│       AND you have labelled data
│
└── Deep behavioural patterns
    ├── Do you have 500+ quality examples?
    │   ├── Yes → Fine-tune
    │   └── No → Better prompting + examples
    └── Is the task stable and well-defined?
        ├── Yes → Fine-tune may be worth it
        └── No → RAG + prompting (iterate faster)
Most PM instincts that reach for fine-tuning first are actually knowledge problems (RAG) or style problems (better prompting) in disguise.
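The framework is small enough to write down as a function. This is one reading of the tree, not the only one: for behavioural changes it checks the example count first and task stability second, and the function name and argument names are invented for illustration.

```python
def choose_lever(goal: str, labelled_examples: int = 0,
                 task_is_stable: bool = False, prompting_failed: bool = False) -> str:
    """Suggest a starting lever. `goal` is 'knowledge', 'style', or 'behaviour'."""
    if goal == "knowledge":
        return "RAG"
    if goal == "style":
        if prompting_failed and labelled_examples > 0:
            return "fine-tune"
        return "prompting"
    if goal == "behaviour":
        if labelled_examples < 500:
            return "better prompting + examples"
        if not task_is_stable:
            return "RAG + prompting"
        return "fine-tune"
    raise ValueError(f"unknown goal: {goal!r}")

print(choose_lever("knowledge"))                                              # RAG
print(choose_lever("behaviour", labelled_examples=200))                       # better prompting + examples
print(choose_lever("behaviour", labelled_examples=2_000, task_is_stable=True))  # fine-tune
```

Notice how few paths end in fine-tuning, and that every one of them requires labelled data you already have.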
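The mistake to avoid: using fine-tuning to inject facts into a model. Fine-tuning is good at teaching behaviour and style and unreliable at teaching new knowledge: facts added this way are hard to update, and the model may still get them wrong. You’ll waste time and money. Use RAG instead.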
Embeddings and Vector Search: The Engine Behind Retrieval
An embedding is a numerical representation of text as a vector: a list of floating-point numbers in a high-dimensional space. The key property: semantically similar content produces numerically similar vectors. This is geometric. It’s why you can search for “vehicle collision” and find documents about “car accidents” even if those exact words never appear.
This geometric property is the entire reason RAG works.
A semantic search (vector search) finds the closest vectors to a query vector. Unlike keyword search, which matches exact words, vector search matches meaning. “Car accident” finds “vehicle collision” naturally. Typos are handled better. Long-form meaning works well. The trade-off: you can see keyword matches (“there’s the word!”), but vector similarity is more opaque (“it’s close in embedding space… somehow”).
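The difference between the two kinds of search shows up even in a toy example. The three-number vectors below are hand-made to illustrate the geometry; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def keyword_match(query: str, doc: str) -> bool:
    """True if the query and document share at least one exact word."""
    return bool(set(query.lower().split()) & set(doc.lower().split()))

# Hand-made vectors for illustration only.
vectors = {
    "vehicle collision": [0.90, 0.80, 0.10],
    "car accidents": [0.85, 0.75, 0.20],
    "quarterly earnings": [0.10, 0.20, 0.95],
}

query = "vehicle collision"
print(keyword_match(query, "car accidents"))                            # False
print(cosine(vectors[query], vectors["car accidents"]) > 0.9)           # True
print(cosine(vectors[query], vectors["quarterly earnings"]) < 0.5)      # True
```

Keyword search sees no overlap between the two phrases. In vector space they point in almost the same direction, while the finance phrase points somewhere else entirely.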
The quality of your embedding model is the ceiling on how good your retrieval can ever be. If the embedding model doesn’t understand your domain well, retrieval fails, and the generation model can’t fix bad retrieval. This is the garbage-in, garbage-out problem of RAG systems.
For specialised domains (legal, medical, financial), general-purpose embedding models often underperform. The terminology has different meanings. The training data under-represents the domain. This is why domain-specific embedding models exist (e.g., voyage-law-2 for legal text). A smaller, domain-specific model often outperforms a larger, general-purpose one.
In production, the best retrieval systems combine vector search with keyword search (hybrid search). Vector search catches conceptual matches and paraphrase. Keyword search catches exact matches and rare terms that embeddings might miss. Fusion algorithms combine both result sets.
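One widely used fusion method is reciprocal rank fusion (RRF), shown here as an example; the article does not say which algorithm any particular system uses. Each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The value k = 60 is a conventional default, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several best-first rankings into one, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

vector_results = ["d2", "d1", "d3"]    # conceptual matches
keyword_results = ["d1", "d4", "d2"]   # exact-term matches

print(reciprocal_rank_fusion([vector_results, keyword_results]))
# ['d1', 'd2', 'd4', 'd3']
```

Documents that both methods agree on rise to the top, and neither method needs its raw scores to be comparable with the other’s.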
The decision to make: For your specific domain, is a general-purpose embedding model sufficient, or do you need a domain-specific one? If you’re handling legal documents, medical records, or specialised technical content, test both and measure. The cost difference is usually small; the quality difference can be large.
Attention and Transformers: Why Position Matters
Transformers (the architecture behind all modern LLMs) work by letting every token attend to every other token and deciding which ones matter. That’s “attention.” Everything you care about flows from this mechanism.
In the sentence “The lawyer filed the brief because she thought it was incomplete,” the token “she” needs to know it refers to “lawyer,” not “brief.” Attention does this: it lets “she” attend strongly to “lawyer” and weakly to everything else.
More importantly for your product: attention is position-sensitive. Models reliably attend to the beginning of the context (system prompts and early instructions) and the end of the context (the current query), but poorly to the middle. This is the “lost in the middle” phenomenon.
If you have ten retrieved documents in a RAG system, the model may not attend equally to all of them. The most relevant document should be placed at the beginning or end of the context, not buried in the middle.
This matters for prompt design. If you’re crafting a system prompt with instructions, put the most critical instructions first and last. If you’re building a RAG system, your retrieval ranking algorithm should decide not just which documents are relevant, but where in the context they’ll be placed.
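One simple placement strategy is to alternate ranked documents between the front and the back of the context, so the weakest ones land in the middle. This is a heuristic sketch with a hypothetical function name; whether it helps for your model and data is something to measure rather than assume.

```python
def order_for_context(ranked_docs: list[str]) -> list[str]:
    """Take docs ranked best-first and place the best at the edges of the context."""
    front: list[str] = []
    back: list[str] = []
    for i, doc in enumerate(ranked_docs):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

ranked = ["A", "B", "C", "D", "E"]   # A is the most relevant
print(order_for_context(ranked))      # ['A', 'C', 'E', 'D', 'B']
```

The most relevant document opens the context, the second most relevant closes it, and the least relevant sits in the middle where attention is weakest.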
Also: attention requires every token to attend to every other token. That’s O(N²) complexity. This is why longer contexts are disproportionately more expensive. This is why extending context windows is hard. This is why a 1 million-token context required serious architectural innovation.
The intuition: the cost of attention doesn’t scale linearly with sequence length. It scales quadratically. Double your context window, and you roughly quadruple the attention computation.
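The arithmetic is worth seeing once. This sketch counts only the token-to-token comparisons in plain attention; it ignores the rest of the model and any of the optimisations providers use in practice, so treat the ratios as intuition rather than a price list.

```python
def attention_pairs(context_tokens: int) -> int:
    """Number of token-to-token comparisons in plain attention."""
    return context_tokens ** 2

def relative_attention_cost(old_tokens: int, new_tokens: int) -> float:
    """How many times more attention work the new context needs."""
    return attention_pairs(new_tokens) / attention_pairs(old_tokens)

print(relative_attention_cost(100_000, 200_000))     # 4.0
print(relative_attention_cost(200_000, 1_000_000))   # 25.0
```

Going from a 200K context to a 1 million-token context is five times the text and twenty-five times the attention work.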
Temperature and Sampling: The Creativity Dial
After processing your prompt, a language model doesn’t simply decide the next word. It produces a probability distribution over its entire vocabulary: a score for every possible next token. Temperature controls how the model samples from that distribution.
| Temperature | Behaviour | Best for | Avoid for |
|---|---|---|---|
| 0.0 | Deterministic. Always picks the top token. | Structured data extraction, evals and auditing, safety-critical features | Creative tasks where variety matters |
| 0.3 to 0.5 | Mostly predictable, with light variation | Factual Q&A, summarisation, classification | High-variance creative output |
| ~1.0 (default) | Balanced between likely and surprising tokens | General chat, drafting, conversational UX | Tasks requiring reproducibility |
| 1.2 to 1.5 | Creative, surprising, exploratory | Brainstorming, marketing copy, ideation | Safety-critical features, evals, auditing |
| 1.8 to 2.0 | Chaotic, often incoherent | Rare experimental use | Almost everything in production |
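Temperature = 0 means greedy decoding: always pick the highest-probability token. Outputs are deterministic in principle: the same prompt produces the same output. In practice, hosted APIs can still show small run-to-run differences, so treat temperature = 0 as the most repeatable setting available, not a guarantee.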
Temperature = 1.0 (default) samples from the distribution. Some variability, but the most likely tokens are still favoured.
Temperature > 1.0 flattens the distribution. Lower-probability tokens become more likely. Outputs are more creative, surprising, and random.
Temperature doesn’t make the model smarter or dumber. It only changes how adventurously the model samples from probabilities it’s already computed. A less probable but correct answer is still in the distribution. Temperature affects whether it gets sampled.
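Mechanically, temperature divides the model’s raw scores (logits) before they are turned into probabilities. The sketch below uses a made-up three-token vocabulary with made-up scores; a real model does the same thing over tens of thousands of tokens.

```python
import math
import random

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities, after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)                         # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(tokens: list[str], logits: list[float],
                 temperature: float, rng: random.Random) -> str:
    """Greedy pick at temperature 0, otherwise sample from the scaled distribution."""
    if temperature == 0:
        return tokens[logits.index(max(logits))]
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

tokens = ["Paris", "London", "banana"]   # illustrative vocabulary
logits = [4.0, 2.0, 0.0]                 # illustrative scores

print([round(p, 2) for p in softmax_with_temperature(logits, 1.0)])  # [0.87, 0.12, 0.02]
print([round(p, 2) for p in softmax_with_temperature(logits, 2.0)])  # [0.67, 0.24, 0.09]
print(sample_token(tokens, logits, 0, random.Random(0)))             # Paris
```

The ranking of the tokens never changes. Raising the temperature only moves probability from the favourite towards the long shots, which is why it adds variety without adding knowledge.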
For your product:
Use temperature = 0 for anything that needs to be consistent, reproducible, or auditable. Structured data extraction, factual Q&A, legal review, evaluations. You want determinism.
Use higher temperature (0.7-1.2) for creative tasks: brainstorming, marketing copy, ideation. You want variety.
One mistake: running evaluations with high temperature. If your eval results vary every run due to sampling randomness, you can’t tell whether a change you made actually improved things. Eval suites must run at temperature = 0.
Another mistake: using temperature = 0.8 in a safety-critical feature (medical triage, contract review) because you want “natural-sounding” responses. Higher temperature increases the probability of unexpected outputs, including outputs that violate safety constraints. In safety-critical contexts, use low temperature.
Test Your Understanding
Here are four scenarios. Work through each one and think about what you’d ask or decide.
Scenario 1: Knowledge Cutoff Problem
A user reports your AI product gave factually incorrect information about an event that happened last month. What are the two most likely causes, and how would you investigate which one it is?
Scenario 2: Feature Architecture Decision
You’re designing a customer support chatbot. Each conversation can involve multiple back-and-forths over several minutes. The model needs to remember earlier context. What would you choose: increasing the context window, or implementing a summarisation pipeline that compresses conversation history before each request? What’s the trade-off?
Scenario 3: The Fine-tuning Proposal
Your engineering team proposes fine-tuning a model to “improve quality on our specific domain.” You have 200 labelled examples. Is fine-tuning the right move? What questions would you ask before committing to a fine-tuning run?
Scenario 4: Retrieval Ordering
You’re building a RAG system for legal document research. You retrieve five relevant documents. Based on what you know about attention and context position, how would you order them in the prompt to maximise the model’s ability to use them?
Take time with these. The answers reveal whether you’ve internalised the concepts or just skimmed them. If any of these feel unclear, go back to the relevant section and re-read.
What’s Next
This is Layer 1: how models work at a mechanical level. The next layer, which I’ll cover in Article 3, is evaluation and benchmarking: how to measure whether your product decisions actually worked.
If you understand tokens, context windows, training vs inference, the three levers, embeddings, attention, and temperature, you’ve crossed a threshold. You can sit in a room with engineers and ask the right questions. You can push back on bad ideas. You can make decisions about architecture and trade-offs without deferring to whoever sounds most confident.
That’s the point of this framework. Not to make you an ML engineer. To make you a product person who thinks clearly about AI systems.