February 28, 2026 | By Lina the Rooftop Gardener & The Documentarian
The Small Machine
Lina: There is a machine in someone's house — not a datacenter, not a rack in a climate-controlled facility with redundant power and a security badge at the door. A home server. The kind of thing you might find next to a router and a tangle of ethernet cables, humming quietly in a closet or a garage or under a desk where the cat likes to sleep.
On this machine runs a language model. Three billion parameters. That sounds like a lot until you learn that the big models — the ones behind the commercial APIs, the ones that live in datacenters the size of warehouses — have hundreds of billions. This one, llama3.2:3b, is small. Small enough to run on hardware you could buy with a month's rent.
It has fifty tools available to it. Fifty things it can do — search the web, fetch a URL, query a database, post to social media, look up a memory. And it has a context window of 4,096 tokens. That is the entirety of what it can hold in its mind at once. Everything it knows about the current conversation, every instruction it has been given, every tool definition it needs to understand — all of it must fit inside that window.
So the question becomes: how do you help something small be smart about what it carries?
The Documentarian: That is the right question. And the answer is: you do not give it everything. You give it what it needs.
The Context Window Problem
The Documentarian: Let me make this concrete with arithmetic.
Each tool definition — its name, description, parameter schema — costs roughly 200 tokens in the system prompt. Some cost more. An OpenAPI-imported tool with six parameters and detailed descriptions might cost 350. But 200 is a reasonable average.
Fifty tools at 200 tokens each: 10,000 tokens.
The model's total context window: 4,096 tokens.
The tools alone would require more than twice the entire context. There is no room left for the system prompt, the conversation history, or the user's actual message. The model cannot function.
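The budget, sketched in a few lines (the 200-token figure is the article's working average, not a measured cost, and the eight-tool selection is the target described later):

```python
# Rough token budget for tool definitions in the system prompt.
# 200 tokens per tool is the article's working average, not a measured value.
TOKENS_PER_TOOL = 200
CONTEXT_WINDOW = 4096

all_tools = 50 * TOKENS_PER_TOOL       # every tool definition, every time
selected_tools = 8 * TOKENS_PER_TOOL   # only the relevant ones

print(all_tools)                                    # 10000
print(round(all_tools / CONTEXT_WINDOW, 2))         # 2.44 -- more than twice the window
print(selected_tools)                               # 1600 -- leaves room to converse
```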
Lina: It is the same problem every community workshop figures out eventually. You do not carry every tool to every job. The person fixing the irrigation line does not bring the soldering iron. The person wiring the solar inverter does not bring the pruning shears. You assess the task, you select what is relevant, and you carry only that.
The Documentarian: Exactly. The system needs to look at what the user is asking and select the eight or ten tools most likely to be useful. Not all fifty. Just the relevant ones.
The question is: how does the system know which tools are relevant?
What Is an Embedding?
Lina: This is where it gets interesting. Tell me about embeddings. I have heard the word thrown around in every AI conversation for the past three years, but I want to understand what it actually means — physically, concretely, the way I would understand a seed or a circuit or a rain catchment system.
The Documentarian: An embedding is a way of representing meaning as numbers.
You take a piece of text — a word, a sentence, a tool description. You pass it through an embedding model — a neural network that has been trained on enormous quantities of text to learn relationships between words and concepts. The model outputs an array of floating-point numbers. In our case, 768 of them.
"search the web for current information" → [0.0234, -0.1891, 0.0445, ..., 0.0712]
Each number is one dimension of semantic space: not a single word or concept, and not individually interpretable, but one axis along which meanings can differ. The full array, taken together, is a coordinate. A point in 768-dimensional space that represents what this text means.
In C#, this is stored as a float[]. In PostgreSQL, it is stored as a vector column using the pgvector extension. The physical representation is just a list of numbers. But the mathematical representation is a direction.
Lina: So meaning has a shape?
The Documentarian: Meaning has a direction. That distinction matters, and it leads us to the math.
High-Dimensional Space and Cosine Similarity
The Documentarian: A 768-dimension vector is a direction in 768-dimensional space. That is difficult to visualize — we live in three dimensions and can sketch things in two. But the math does not require visualization. It requires only the concept of an angle between two arrows.
Imagine two arrows starting from the same origin point. If they point in the same direction, the angle between them is zero. If they point in completely unrelated directions — perpendicular to each other — the angle is 90 degrees. That angle is what we measure.
Lina: Two people pointing at the same star from different rooftops.
The Documentarian: Yes. They are standing in different places. But they are pointing in the same direction. And direction is what carries semantic meaning.
The formula is cosine similarity:
cos(θ) = (A·B) / (||A|| × ||B||)
Three components. Let me define each.
A·B is the dot product. You multiply each pair of corresponding elements and sum the results. For two 768-dimensional vectors, that is 768 multiplications and 767 additions. The result is a single number that measures how much the two vectors align.
||A|| is the magnitude of vector A — its length. Calculated as the square root of the sum of squared elements. Same for ||B||.
Dividing the dot product by the product of the magnitudes normalizes out the length. What remains is purely directional. The cosine of the angle between the two vectors.
The score ranges from -1 to 1. In practice, with text embeddings, the range is typically 0 to 1. A score of 1 means identical direction — the texts mean the same thing. A score of 0 means the vectors are orthogonal — the texts are semantically unrelated. Values in between indicate degrees of relatedness.
Lina: But why is it the direction that matters? Two things could be close together and mean different things. Two things could be far apart and mean the same thing. What makes the pointing more trustworthy than the location?
The Documentarian: Because of how embedding models are trained. They learn to place semantically similar texts in similar directions. The word "dog" and the word "puppy" end up pointing nearly the same way, even though their raw vectors might differ in magnitude. The sentence "find me information about pets" and the tool description "search the web for current information" will share directional components related to retrieval and information-seeking, even though the specific words differ.
Position and magnitude are influenced by text length, word frequency, and other factors that have nothing to do with meaning. Direction is the invariant. Direction is what the model learned to encode.
Lina: So the embedding model is like a compass that points toward meaning instead of north. And when two compasses agree, the texts are related.
The Documentarian: That is a fair analogy. And in PostgreSQL, pgvector gives us the operation directly:
SimilarityScore = 1 - CosineDistance(queryVector)
CosineDistance is 1 - cos(θ). Subtracting it from 1 gives us the similarity. This is what we use to rank results.
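Here is the whole computation in a minimal sketch: plain Python rather than the pgvector internals, with small two- and three-dimensional vectors standing in for the 768-dimensional ones.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(theta) = (A.B) / (||A|| * ||B||)."""
    dot = sum(x * y for x, y in zip(a, b))        # A.B: pairwise multiply, then sum
    mag_a = math.sqrt(sum(x * x for x in a))      # ||A||: length of A
    mag_b = math.sqrt(sum(x * x for x in b))      # ||B||: length of B
    return dot / (mag_a * mag_b)                  # normalize out the lengths

def similarity_from_distance(cosine_distance: float) -> float:
    # pgvector's distance is 1 - cos(theta); subtracting from 1
    # recovers the similarity, as in the SimilarityScore expression above.
    return 1 - cosine_distance

# Same direction, different magnitude: similarity is 1 (up to float rounding).
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))

# Orthogonal vectors: similarity is 0.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0
```

The second example is the "perpendicular arrows" case from earlier: no shared direction, no shared meaning.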
Two Layers of Search
Lina: So you can find tools by meaning. But what if someone knows the exact name of what they want? I do not always describe things poetically. Sometimes I just say "web search" and I mean the tool called web search.
The Documentarian: That is why we use two layers. PostgreSQL full-text search for keyword matching, and pgvector for semantic matching.
Full-text search uses tsvector — a data structure that breaks text into normalized word stems (removing suffixes like "-ing" and "-ed" and dropping common words like "the" and "is"). A GIN index, a type of index optimized for searching within composite values, is built over that column to make lookups fast. When you search for "pet tools," it finds tools with the words "pet" or "tool" in their name or description. It is literal, exact, and fast.
Here is the migration SQL that creates the search index on agent tools:
ALTER TABLE agent_tools
ADD COLUMN search_vector tsvector
GENERATED ALWAYS AS (
setweight(to_tsvector('english', coalesce(name,'')), 'A') ||
setweight(to_tsvector('english', coalesce(description,'')), 'B')
) STORED;
The name gets weight A — highest priority. The description gets weight B. PostgreSQL ranks matches by these weights, so a keyword match in the tool's name scores higher than a match in its description.
The query uses websearch_to_tsquery, which accepts natural-language search syntax:
SELECT * FROM agent_tools
WHERE id = ANY(@candidateIds)
AND search_vector @@ websearch_to_tsquery('english', @query)
ORDER BY ts_rank(search_vector, websearch_to_tsquery('english', @query)) DESC
LIMIT 8
Semantic search complements this. When someone asks "I want to look up an animal," there is no keyword overlap with a tool called fetch_url or web_search. But the embedding vectors for "look up" and "search" point in similar directions. Semantic search finds conceptual matches that keyword search would miss.
Lina: One for when you know the name. One for when you know the idea.
The Documentarian: Precisely. The system runs both, merges the results, and returns the best candidates. Keyword matches provide precision. Semantic matches provide recall. Together they cover more ground than either could alone.
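The article does not show the merge step itself, so the strategy below is an illustrative assumption, not the system's actual code: union the candidates from both searches and keep each tool's best rank from either list. The function name and the rank-union policy are hypothetical.

```python
# Hedged sketch of merging keyword and semantic results.
# The real system's merge logic is not shown in the article;
# this union-by-best-rank policy is an illustrative assumption.
def merge_results(keyword_hits: list[str],
                  semantic_hits: list[str],
                  top_k: int = 8) -> list[str]:
    best_rank: dict[str, int] = {}
    for hits in (keyword_hits, semantic_hits):
        for rank, tool_id in enumerate(hits):
            # Keep the best (lowest) rank a tool achieved in either list.
            if tool_id not in best_rank or rank < best_rank[tool_id]:
                best_rank[tool_id] = rank
    merged = sorted(best_rank, key=lambda t: best_rank[t])
    return merged[:top_k]

keyword = ["web_search", "fetch_url"]               # precision: exact-word matches
semantic = ["web_search", "query_db", "memory"]     # recall: meaning matches
print(merge_results(keyword, semantic))
# ['web_search', 'fetch_url', 'query_db', 'memory']
```

A tool that both layers agree on (here, web_search) naturally surfaces first, which is the point of running them together.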
The Pipeline in Action
Lina: Walk me through what actually happens. A person types a message. Then what?
The Documentarian: The ChatOrchestrator coordinates the flow. Here is the relevant code:
var allToolIds = await toolRegistry.GetAgentToolIdsAsync(agentId);
var lastUserMessage = visibleMessages
.Where(m => m.Role == ChatMessageRole.User)
.OrderByDescending(m => m.CreatedAt)
.Select(m => m.Content)
.FirstOrDefault() ?? "";
var relevantToolIds = await toolSearchService
.SearchToolsByRelevanceAsync(allToolIds, lastUserMessage);
var tools = await toolRegistry
.GetAIFunctionsByToolIdsAsync(relevantToolIds);
Step by step:
- Get all tool IDs assigned to this agent. That is the full set — all fifty, or however many the agent has.
- Extract the most recent user message. This is the search query.
- Pass the full tool set and the user message to the tool search service. It runs both FTS and semantic search against the tool descriptions, scores and ranks them, and returns the top candidates.
- Load only those tools as AIFunction instances. These are what the language model actually sees in its context window.
The result: instead of 10,000 tokens of tool definitions, the model receives perhaps 1,600 — eight tools at 200 tokens each. The context window has room for the conversation. The model can think.

Lina: And the part where it checks whether two meanings are aligned — that is the same math you described earlier?
The Documentarian: Yes. The VectorStoreService handles that:
var results = await baseQuery
.Select(m => new MemorySearchResult
{
Memory = m,
SimilarityScore = 1 - m.Embedding!.CosineDistance(queryVector)
})
.OrderByDescending(r => r.SimilarityScore)
.Take(topK)
.ToListAsync();
This is Entity Framework Core translating a LINQ expression into a pgvector SQL query. CosineDistance maps to pgvector's <=> operator. The database does the heavy lifting — the dot products, the magnitude calculations, the ranking — all in PostgreSQL, indexed and optimized.
Lina: The database is doing the math.
The Documentarian: The database is doing the math. That is the advantage of pgvector. The cosine similarity computation happens at the storage layer, close to the data, without transferring thousands of vectors over the network to compute in application code.
Lina: We started this conversation with fifty tools. But if the database is the one doing the searching — if the model only ever sees the eight that match — then you could keep adding tools to the catalog and the model would never feel it. So how far does this go?
The Documentarian: The full-text search uses a GIN index, which is an inverted index — it maps each word stem to the rows that contain it. Lookup is logarithmic. Whether the table has fifty tools or fifty thousand, the index lookup is sub-millisecond.
For the vector search, it depends on whether you add an approximate nearest neighbor index. Without one, PostgreSQL computes cosine distance against every row — that is linear, and it slows down past ten thousand rows. With an HNSW index — a hierarchical graph structure that pgvector supports — the search becomes logarithmic. At a million rows, a brute-force scan would take seconds. With HNSW, the same search takes five to twenty milliseconds.
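What the un-indexed path does, in miniature: one cosine distance per row, then a sort. This toy sketch (hypothetical names, two-dimensional vectors) shows the linear work that an HNSW index replaces with a graph traversal; it is not the pgvector internals.

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

def brute_force_top_k(query: list[float],
                      rows: list[tuple[str, list[float]]],
                      k: int) -> list[str]:
    # One distance computation per row: O(n) in the table size.
    # This is the scan that an HNSW graph index avoids.
    scored = [(cosine_distance(query, vec), row_id) for row_id, vec in rows]
    scored.sort()
    return [row_id for _, row_id in scored[:k]]

rows = [
    ("web_search",  [0.9, 0.1]),
    ("prune_roses", [0.1, 0.9]),
    ("fetch_url",   [0.8, 0.3]),
]
print(brute_force_top_k([1.0, 0.0], rows, k=2))   # ['web_search', 'fetch_url']
```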
Lina: A million tools in the shed. And the person asking for a hammer still gets one before they finish blinking.
The Documentarian: Correct. The model does not know how many tools exist. It only sees the eight that the search returned. The catalog could contain fifty tools or five million — the model's experience is identical. The architecture decouples the size of the catalog from the capacity of the model.
Lina: That changes how I think about this. We started with a small machine that could not carry enough. But you have not just solved that problem — you have made it so the number of tools does not matter at all. Even a bigger machine with more memory would benefit from this, because nobody wants to sift through every description in the catalog just to find the right eight.
Running It Locally with Ollama
Lina: There is a thing I keep coming back to. The part that turns words into directions — where does that happen?
The Documentarian: On the same machine. Ollama serves both the language model and the embedding model locally. The OllamaEmbeddingService calls the /api/embed endpoint on the local Ollama instance. The request contains the text. The response contains the vector. The round trip takes milliseconds over localhost. No network call leaves the machine.
Lina: No cloud.
The Documentarian: No cloud. No API keys. No per-token billing. No data transmitted to a third party. No usage meters ticking up in someone else's billing dashboard. The text goes from the .NET application to Ollama over localhost, and the vector comes back. The entire pipeline — user message, embedding generation, pgvector similarity search, tool selection, language model inference — runs on one machine: a handful of local processes (the .NET application, Ollama, PostgreSQL) talking over localhost, writing to one local database.
Lina: The whole thing runs on one machine. In someone's house.
The Documentarian: The constraints are real. A 3B model is not as capable as a 70B model. A 4,096-token context window forces trade-offs that a 128K window does not. But a model with a 128K context window would not need this system — you could include all fifty tool definitions and still have room. The 4,096-token limit is what made tool search necessary. The constraint created the design requirement.
Lina: That is a sentiment I hear in a lot of community workshops. It is the scarcity that makes the design interesting. When you have unlimited resources, anyone can build anything. When you have a single solar panel and a rain barrel, you have to be clever. You have to think about what matters. You learn to design systems that are not just functional but appropriate — fitted to the scale of the problem and the resources available.
Closing
Lina: So here is what I take away from this. There is a small model, locally hosted, on a machine someone owns. It has more tools than it can carry. So the system compares the meaning of the question to the meaning of each tool — checks whether they point in the same direction — and hands the model only the ones that match.
It is not about having the biggest model. It is about giving a small model the right tools at the right time. Like a well-organized workshop where everything has its place and you can find the right drawer without opening all fifty.
The Documentarian: That is the core idea. And the stack is reproducible. PostgreSQL with pgvector for storage and similarity search. Ollama for local embedding generation and language model inference. Entity Framework Core for the data layer. .NET for the application. No proprietary services. No vendor lock-in. Every component is open source or runs locally.
If someone wanted to build this — not this exact system, but something shaped like it — they could. The math is cosine similarity, which fits in one line. The storage is pgvector, which is a PostgreSQL extension you can install in a single command. The embedding model runs through Ollama, which is a single binary. The application layer is .NET and Entity Framework Core. The architecture fits on a machine you can hold in your hands.
Nothing in this stack requires permission from a platform. Nothing requires a credit card. Nothing phones home.
Lina: A small machine doing meaningful work. Not because it is the most powerful thing in the room, but because someone took the time to build the right scaffolding around it.
That is a future I can believe in. Not the loud, spectacular kind. The quiet kind. The kind that hums in a closet next to the router and does its job without asking permission.
Built with .NET 10, PostgreSQL + pgvector, Ollama, Entity Framework Core, and llama3.2:3b. The tools are in the drawer.