Introduction
When you write a part I, you’ll need to write part II as well. Let’s pick it up where we left it last week.
Model parameters
When you call a model API, you can pass parameters that control how the model generates text. These are not settings that change the model’s knowledge or capability. They control the sampling process: how the model picks the next token from a probability distribution.
Temperature
Temperature is the most important and widely used parameter. It controls randomness, typically on a scale from 0 to 2:
| Temperature | Effect | Best for |
|---|---|---|
| 0.0 | Fully deterministic, always picks the most probable token | Data extraction, classification, math, factual QA |
| 0.3 – 0.5 | Low randomness, mostly predictable | Coding, technical writing, structured output |
| 0.7 – 1.0 | Balanced, varied but coherent | General chat, copy writing, assistants |
| 1.2 – 2.0 | High randomness, creative but sometimes incoherent | Brainstorming, poetry, creative fiction |
Temperature 0 is not perfectly deterministic in practice. Floating-point rounding means you may still see minor variation. For truly reproducible outputs, also set a seed parameter if the API supports it.
Top-P (nucleus sampling)
While temperature scales all probabilities, Top-P cuts off the “tail” of improbable tokens. With Top-P = 0.9, the model only samples from the smallest set of tokens whose cumulative probability adds up to 90%, ignoring all the unlikely candidates. This prevents the model from producing very unlikely tokens even at high temperatures. Most practitioners tune temperature or Top-P, not both simultaneously.
Presence penalty and frequency penalty
These two parameters (used mainly in OpenAI’s API) combat repetition:
- Presence penalty (range -2 to 2). Penalizes any token that has already appeared in the output, regardless of how many times. Set this higher if the model keeps returning to the same topic or reusing the same phrases
- Frequency penalty (range -2 to 2). Penalizes tokens proportionally to how often they have already appeared. Better for controlling word-level repetition in long outputs
Anthropic’s API does not expose these directly, as far as I know. You are expected to handle repetition through prompt engineering instead.
Max tokens and stop sequences
- Max tokens. Hard cap on how long the response can be. The model will truncate at this limit. Useful for controlling costs and enforcing concise responses
- Stop sequences. A list of strings that cause the model to stop generating immediately when encountered. Useful for structured output parsing
System prompt vs user prompt
Not a sampling parameter, but worth explaining here. Most APIs accept two types of input:
- System prompt. Background instructions that set the model’s persona, constraints, and context. Treated with higher authority than the user message
- User prompt. The actual message the end user sends
In practice, the system prompt is how application developers control model behavior, defining who the assistant is, what it can and cannot discuss, and what format to respond in.

Retrieval Augmented Generation (RAG)
One of the most practical and widely deployed patterns in AI engineering is Retrieval Augmented Generation (RAG). The concept is simple: before asking the model a question, first search a knowledge base and inject the relevant results into the prompt.
The problem RAG solves
LLMs have two significant knowledge limitations. First, they have a training cutoff and know nothing about events after they were trained. Second, they have no access to your private data: your company’s internal documents, your customer database, your codebase. RAG solves both problems without expensive fine-tuning.
RAG vs fine-tuning
| Approach | Best for | Drawbacks |
|---|---|---|
| RAG | Frequently updated knowledge, private data, large document sets, factual accuracy with citations | Adds latency; retrieval quality is critical; complex to build well |
| Fine-tuning | Teaching a model a specific style, format, or skill; baking in domain vocabulary | Expensive; knowledge becomes stale; does not help with recall of specific facts |
| Prompt engineering | Quick changes, small amounts of context, no infrastructure | Limited by context window; no persistent knowledge store |
LLM orchestration
Once you go beyond a single prompt-and-response, you need infrastructure to manage the complexity: chaining multiple LLM calls, managing memory, routing between different models, handling tool use, and building agent loops. This is the domain of LLM orchestration frameworks.
LangChain
LangChain is the most widely known orchestration framework. It provides building blocks for composing LLM-powered applications.
- Chains. Sequences of steps, e.g. “retrieve context, summarise, then generate answer”
- Agents. Loops where the LLM decides which tools to call, interprets results, and decides what to do next
- Tools. Integrations with external systems such as web search, calculators, databases, and APIs that an agent can invoke
- Memory. Modules for maintaining conversation history or persistent state across calls
- LangSmith. LangChain’s observability platform for debugging and evaluating chains in production
LangChain’s strength is its breadth of integrations. Its weakness, frequently cited, is that its abstractions can make simple things complicated and debugging difficult. Many experienced teams end up writing their own simpler orchestration after outgrowing it.
LlamaIndex
LlamaIndex (formerly GPT Index) focuses more specifically on the data layer, making it easy to connect LLMs to external data sources. If LangChain is a general-purpose orchestration framework, LlamaIndex is specialized for RAG and knowledge management.
- Data connectors. Pre-built loaders for PDFs, Notion, Slack, databases, GitHub, and more
- Index types. Multiple ways to structure and query your data, including vector stores, keyword search, knowledge graphs, and summary indices
- Query engines. High-level interfaces for asking questions over your indexed data
- Agentic patterns. More recently expanded into agent workflows as well
Newer approaches
The orchestration landscape moves fast. Several patterns have emerged as alternatives or complements to the above, most notably (DSPy, CrewAI and MCP).
The right orchestration approach depends on your scale. For prototyping, LangChain or LlamaIndex get you moving fast. For production, many teams prefer less magic and more explicit control.
Agent harnesses
An agent harness is everything that surrounds an LLM except the model itself. Where a raw model just takes input and produces output, a harness gives it legs:
- It manages the loop of observe, think, and act
- It connects the model to external tools like web search, file systems, and APIs
- It handles memory beyond the context window
- It deals with the practical plumbing of error handling, retries, and state persistence
A useful formula is “Agent = Model + Harness”. As AI has moved from one-shot chat interactions toward persistent, multi-step autonomous work, the harness has become as important as the model itself. OpenClaw is a good real-world example: swap out the LLM and the harness still handles scheduling, integrations, and execution.
The Chinese AI ecosystem
Chinese AI development has accelerated dramatically since 2023, producing models that are genuinely competitive with Western frontier labs, sometimes at a fraction of the cost.

Key players
| Company / lab | Model family | Notable for |
|---|---|---|
| DeepSeek | DeepSeek / R1 | Shocked the industry in early 2025, frontier-level performance at a fraction of expected cost. R1 is an open source reasoning model matching Western flagships |
| Alibaba | Qwen | Strong and rapidly evolving open source family, widely used for fine-tuning |
| Baidu | ERNIE | China’s largest search company, integrated with Baidu Search |
| ByteDance | Doubao / Seed | TikTok’s parent. Strong in multimodal and voice |
| Zhipu AI | GLM | Academic spin-out from Tsinghua, well-regarded for bilingual capability |
| 01.AI | Yi | Founded by Kai-Fu Lee (ex-Google China), known for fast, cheap inference |
Moonshot AI and the Kimi model
Moonshot AI is a Beijing-based startup notable for building Kimi, a model that made headlines for its extremely long context window before Western models caught up. Kimi is widely used in China for document analysis and research tasks. Moonshot has focused heavily on long-context capability and is backed by Alibaba, among others.
DeepSeek
DeepSeek’s R1 release in January 2025 was a watershed event. Not because it was the most capable model, but because it demonstrated that frontier-level performance could be achieved at a tiny fraction of the compute cost that US labs had assumed was necessary. Key efficiency innovations included:
- Mixture of Experts (MoE). Instead of activating all model parameters for every token, MoE routes each token to only a small subset of “expert” sub-networks. This massively reduces the compute per token while keeping the total model capacity large
- Aggressive quantitation. Reducing the numerical precision of model weights to save memory and speed up inference
- Novel training techniques. Architectural improvements that reduce memory bandwidth requirements
The efficiency story also matters geopolitically. US export controls restrict China’s access to the most advanced chips. DeepSeek showed that world-class models can be built on older, more available hardware if the engineering is good enough.
Latest DeepSeek model (v4) also has the large labs on their toes. The pricing is significantly lower and the (open weights) model is pretty good.
Key platforms
HuggingFace started as a chatbot company and pivoted to become the central hub of the open source AI community. Think of it as GitHub for AI models. It hosts:
- Model Hub. Hundreds of thousands of publicly available models. You can download Llama, Mistral, Qwen, DeepSeek, and countless fine-tuned variants with a single command
- Datasets. A similarly large library of training and evaluation datasets
- Spaces. A platform for deploying demos of models and tools
- Buckets. Store models, datasets, and artifacts with simple per-TB pricing
When Meta releases new Llama weights, they appear on HuggingFace within hours. The community fine-tunes, quantizes, and remixes them, all tracked through the Hub.
Replicate is another useful platform: it lets you run open source models via a simple API without managing any infrastructure. You pay per second of compute used, making it great for experimentation with image generation and audio models.
Bonus
A few concepts worth knowing, but not already covered.
Parameters, weights, and model size
You will often hear models described by their parameter count: “a 7B model”, “a 70B model”. Parameters (often called weights, but not exactly the same) are the numerical values the model learned during training, the actual content of the model file. More parameters generally means more capacity, but also more memory required and slower inference. The relationship between parameter count and capability is non-linear and depends heavily on training quality, data, and architecture.
Fine-tuning and LoRA
Fine-tuning means continuing the training of a pre-trained model on a smaller, task-specific dataset. Full fine-tuning is expensive because you update all parameters. LoRA (Low-Rank Adaptation) is a smarter approach: instead of modifying all weights, LoRA injects small trainable matrices into the model architecture and only trains those. The base model remains frozen. This makes fine-tuning accessible on consumer hardware and has driven an explosion of specialized model variants.
Model distillation
Model distillation is a training technique where a smaller model (the “student”) is trained to mimic the behavior of a larger, more capable model (the “teacher”). Rather than learning purely from raw training data, the student learns from the teacher’s output probabilities, which carry richer information than a simple right/wrong label.
The result is a smaller model that punches above its weight, having absorbed some of the larger model’s reasoning patterns. This is one of the main reasons the smaller tiers (Haiku, Flash, mini) are as capable as they are: they are often distilled from their bigger siblings rather than trained entirely from scratch.
Agents and tool use
An agent is an LLM that has been given tools it can invoke, such as web search, code execution, file access, or API calls, and runs in a loop: observe, think, act, observe again. The model decides which tool to call, interprets the result, and either calls another tool or produces a final answer. Agents can take many steps and perform complex multi-stage tasks autonomously. The reliability and safety of agents in production is one of the central research challenges in the field right now.
Model benchmarks
Model capabilities are measured using standardized benchmarks. Common ones you will encounter:
- Massive Multitask Language Understanding (MMLU). 57 academic subjects, multiple choice. Tests breadth of knowledge
- HumanEval / SWE-bench. Coding benchmarks. SWE-bench is especially rigorous, using real GitHub issues the model must solve
- MATH / AIME. Mathematical reasoning
- GPQA. Graduate-level science questions designed to be hard even for domain experts
- Chatbot Arena (LMSYS). Human preference ranking where real users compare model outputs and vote. Considered one of the most valid real-world measures
Benchmark scores can be gamed, as labs sometimes train on benchmark-adjacent data. Chatbot Arena’s human preference data is harder to game and tends to be a more reliable signal of real-world quality.
Next
I was not planning it, but decided to add a post on running LLMs locally. So that will be covered next week.
