AI Deep dive – Part II

Introduction

When you write a part I, you’ll need to write part II as well. Let’s pick it up where we left it last week.

Model parameters

When you call a model API, you can pass parameters that control how the model generates text. These are not settings that change the model’s knowledge or capability. They control the sampling process: how the model picks the next token from a probability distribution.

Temperature

Temperature is the most important and widely used parameter. It controls randomness, typically on a scale from 0 to 2:

Temperature	Effect	Best for
0.0	Fully deterministic, always picks the most probable token	Data extraction, classification, math, factual QA
0.3 – 0.5	Low randomness, mostly predictable	Coding, technical writing, structured output
0.7 – 1.0	Balanced, varied but coherent	General chat, copy writing, assistants
1.2 – 2.0	High randomness, creative but sometimes incoherent	Brainstorming, poetry, creative fiction

Temperature 0 is not perfectly deterministic in practice. Floating-point rounding means you may still see minor variation. For truly reproducible outputs, also set a seed parameter if the API supports it.

Top-P (nucleus sampling)

While temperature scales all probabilities, Top-P cuts off the “tail” of improbable tokens. With Top-P = 0.9, the model only samples from the smallest set of tokens whose cumulative probability adds up to 90%, ignoring all the unlikely candidates. This prevents the model from producing very unlikely tokens even at high temperatures. Most practitioners tune temperature or Top-P, not both simultaneously.

Presence penalty and frequency penalty

These two parameters (used mainly in OpenAI’s API) combat repetition:

Presence penalty (range -2 to 2). Penalizes any token that has already appeared in the output, regardless of how many times. Set this higher if the model keeps returning to the same topic or reusing the same phrases
Frequency penalty (range -2 to 2). Penalizes tokens proportionally to how often they have already appeared. Better for controlling word-level repetition in long outputs

Anthropic’s API does not expose these directly, as far as I know. You are expected to handle repetition through prompt engineering instead.

Max tokens and stop sequences

Max tokens. Hard cap on how long the response can be. The model will truncate at this limit. Useful for controlling costs and enforcing concise responses
Stop sequences. A list of strings that cause the model to stop generating immediately when encountered. Useful for structured output parsing

System prompt vs user prompt

Not a sampling parameter, but worth explaining here. Most APIs accept two types of input:

System prompt. Background instructions that set the model’s persona, constraints, and context. Treated with higher authority than the user message
User prompt. The actual message the end user sends

In practice, the system prompt is how application developers control model behavior, defining who the assistant is, what it can and cannot discuss, and what format to respond in.

Retrieval Augmented Generation (RAG)

One of the most practical and widely deployed patterns in AI engineering is Retrieval Augmented Generation (RAG). The concept is simple: before asking the model a question, first search a knowledge base and inject the relevant results into the prompt.

The problem RAG solves

LLMs have two significant knowledge limitations. First, they have a training cutoff and know nothing about events after they were trained. Second, they have no access to your private data: your company’s internal documents, your customer database, your codebase. RAG solves both problems without expensive fine-tuning.

RAG vs fine-tuning

Approach	Best for	Drawbacks
RAG	Frequently updated knowledge, private data, large document sets, factual accuracy with citations	Adds latency; retrieval quality is critical; complex to build well
Fine-tuning	Teaching a model a specific style, format, or skill; baking in domain vocabulary	Expensive; knowledge becomes stale; does not help with recall of specific facts
Prompt engineering	Quick changes, small amounts of context, no infrastructure	Limited by context window; no persistent knowledge store

LLM orchestration

Once you go beyond a single prompt-and-response, you need infrastructure to manage the complexity: chaining multiple LLM calls, managing memory, routing between different models, handling tool use, and building agent loops. This is the domain of LLM orchestration frameworks.

LangChain

LangChain is the most widely known orchestration framework. It provides building blocks for composing LLM-powered applications.

Chains. Sequences of steps, e.g. “retrieve context, summarise, then generate answer”
Agents. Loops where the LLM decides which tools to call, interprets results, and decides what to do next
Tools. Integrations with external systems such as web search, calculators, databases, and APIs that an agent can invoke
Memory. Modules for maintaining conversation history or persistent state across calls
LangSmith. LangChain’s observability platform for debugging and evaluating chains in production

LangChain’s strength is its breadth of integrations. Its weakness, frequently cited, is that its abstractions can make simple things complicated and debugging difficult. Many experienced teams end up writing their own simpler orchestration after outgrowing it.

LlamaIndex

LlamaIndex (formerly GPT Index) focuses more specifically on the data layer, making it easy to connect LLMs to external data sources. If LangChain is a general-purpose orchestration framework, LlamaIndex is specialized for RAG and knowledge management.

Data connectors. Pre-built loaders for PDFs, Notion, Slack, databases, GitHub, and more
Index types. Multiple ways to structure and query your data, including vector stores, keyword search, knowledge graphs, and summary indices
Query engines. High-level interfaces for asking questions over your indexed data
Agentic patterns. More recently expanded into agent workflows as well

Newer approaches

The orchestration landscape moves fast. Several patterns have emerged as alternatives or complements to the above, most notably (DSPy, CrewAI and MCP).

The right orchestration approach depends on your scale. For prototyping, LangChain or LlamaIndex get you moving fast. For production, many teams prefer less magic and more explicit control.

Agent harnesses

An agent harness is everything that surrounds an LLM except the model itself. Where a raw model just takes input and produces output, a harness gives it legs:

It manages the loop of observe, think, and act
It connects the model to external tools like web search, file systems, and APIs
It handles memory beyond the context window
It deals with the practical plumbing of error handling, retries, and state persistence

A useful formula is “Agent = Model + Harness”. As AI has moved from one-shot chat interactions toward persistent, multi-step autonomous work, the harness has become as important as the model itself. OpenClaw is a good real-world example: swap out the LLM and the harness still handles scheduling, integrations, and execution.

The Chinese AI ecosystem

Chinese AI development has accelerated dramatically since 2023, producing models that are genuinely competitive with Western frontier labs, sometimes at a fraction of the cost.

Key players

Company / lab	Model family	Notable for
DeepSeek	DeepSeek / R1	Shocked the industry in early 2025, frontier-level performance at a fraction of expected cost. R1 is an open source reasoning model matching Western flagships
Alibaba	Qwen	Strong and rapidly evolving open source family, widely used for fine-tuning
Baidu	ERNIE	China’s largest search company, integrated with Baidu Search
ByteDance	Doubao / Seed	TikTok’s parent. Strong in multimodal and voice
Zhipu AI	GLM	Academic spin-out from Tsinghua, well-regarded for bilingual capability
01.AI	Yi	Founded by Kai-Fu Lee (ex-Google China), known for fast, cheap inference

Moonshot AI and the Kimi model

Moonshot AI is a Beijing-based startup notable for building Kimi, a model that made headlines for its extremely long context window before Western models caught up. Kimi is widely used in China for document analysis and research tasks. Moonshot has focused heavily on long-context capability and is backed by Alibaba, among others.

DeepSeek

DeepSeek’s R1 release in January 2025 was a watershed event. Not because it was the most capable model, but because it demonstrated that frontier-level performance could be achieved at a tiny fraction of the compute cost that US labs had assumed was necessary. Key efficiency innovations included:

Mixture of Experts (MoE). Instead of activating all model parameters for every token, MoE routes each token to only a small subset of “expert” sub-networks. This massively reduces the compute per token while keeping the total model capacity large
Aggressive quantitation. Reducing the numerical precision of model weights to save memory and speed up inference
Novel training techniques. Architectural improvements that reduce memory bandwidth requirements

The efficiency story also matters geopolitically. US export controls restrict China’s access to the most advanced chips. DeepSeek showed that world-class models can be built on older, more available hardware if the engineering is good enough.

Latest DeepSeek model (v4) also has the large labs on their toes. The pricing is significantly lower and the (open weights) model is pretty good.

Key platforms

HuggingFace started as a chatbot company and pivoted to become the central hub of the open source AI community. Think of it as GitHub for AI models. It hosts:

Model Hub. Hundreds of thousands of publicly available models. You can download Llama, Mistral, Qwen, DeepSeek, and countless fine-tuned variants with a single command
Datasets. A similarly large library of training and evaluation datasets
Spaces. A platform for deploying demos of models and tools
Buckets. Store models, datasets, and artifacts with simple per-TB pricing

When Meta releases new Llama weights, they appear on HuggingFace within hours. The community fine-tunes, quantizes, and remixes them, all tracked through the Hub.

Replicate is another useful platform: it lets you run open source models via a simple API without managing any infrastructure. You pay per second of compute used, making it great for experimentation with image generation and audio models.

Bonus

A few concepts worth knowing, but not already covered.

Parameters, weights, and model size

You will often hear models described by their parameter count: “a 7B model”, “a 70B model”. Parameters (often called weights, but not exactly the same) are the numerical values the model learned during training, the actual content of the model file. More parameters generally means more capacity, but also more memory required and slower inference. The relationship between parameter count and capability is non-linear and depends heavily on training quality, data, and architecture.

Fine-tuning and LoRA

Fine-tuning means continuing the training of a pre-trained model on a smaller, task-specific dataset. Full fine-tuning is expensive because you update all parameters. LoRA (Low-Rank Adaptation) is a smarter approach: instead of modifying all weights, LoRA injects small trainable matrices into the model architecture and only trains those. The base model remains frozen. This makes fine-tuning accessible on consumer hardware and has driven an explosion of specialized model variants.

Model distillation

Model distillation is a training technique where a smaller model (the “student”) is trained to mimic the behavior of a larger, more capable model (the “teacher”). Rather than learning purely from raw training data, the student learns from the teacher’s output probabilities, which carry richer information than a simple right/wrong label.

The result is a smaller model that punches above its weight, having absorbed some of the larger model’s reasoning patterns. This is one of the main reasons the smaller tiers (Haiku, Flash, mini) are as capable as they are: they are often distilled from their bigger siblings rather than trained entirely from scratch.

Agents and tool use

An agent is an LLM that has been given tools it can invoke, such as web search, code execution, file access, or API calls, and runs in a loop: observe, think, act, observe again. The model decides which tool to call, interprets the result, and either calls another tool or produces a final answer. Agents can take many steps and perform complex multi-stage tasks autonomously. The reliability and safety of agents in production is one of the central research challenges in the field right now.

Model benchmarks

Model capabilities are measured using standardized benchmarks. Common ones you will encounter:

Massive Multitask Language Understanding (MMLU). 57 academic subjects, multiple choice. Tests breadth of knowledge
HumanEval / SWE-bench. Coding benchmarks. SWE-bench is especially rigorous, using real GitHub issues the model must solve
MATH / AIME. Mathematical reasoning
GPQA. Graduate-level science questions designed to be hard even for domain experts
Chatbot Arena (LMSYS). Human preference ranking where real users compare model outputs and vote. Considered one of the most valid real-world measures

Benchmark scores can be gamed, as labs sometimes train on benchmark-adjacent data. Chatbot Arena’s human preference data is harder to game and tends to be a more reliable signal of real-world quality.

I was not planning it, but decided to add a post on running LLMs locally. So that will be covered next week.