How Embeddings Work

From the ground up. Connected to what you're building.

What embedding actually does

You give it text — a word, sentence, paragraph, whatever. It returns a list of numbers. Your model (Qwen3-Embedding-0.6B) returns exactly 1,024 numbers. That's it. That list of 1,024 floats IS the embedding.

"Thompson Sampling" → [0.0234, -0.1891, 0.0012, ..., 0.0847]  (1,024 values)
"Banana split"      → [0.1102, 0.0034, -0.2201, ..., -0.0391]  (1,024 values)

Each of those 1,024 positions is a dimension. Not a "chunk" of the text — a dimension of meaning. The model learned during training that, say, dimension 47 correlates with "technical vs casual," dimension 312 correlates with "concrete vs abstract," dimension 891 correlates with "positive vs negative." But you don't get to pick what each dimension means — the model discovered them during training. Most dimensions don't map to anything a human would name.

How does the model decide what numbers to put there?

The text goes through the transformer layers — attention heads, feedforward networks, the whole stack. The final hidden states (the last layer's output, one vector per token) get pooled into a single 1,024-dim vector — often by averaging across token positions, though Qwen3-Embedding takes the final token's hidden state. That vector is the embedding.
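A toy sketch of that pooling step in numpy. The hidden states here are random stand-ins, not real model outputs; the point is just the shape arithmetic (mean pooling shown for concreteness):

```python
import numpy as np

# Pretend the transformer produced one hidden state per token:
# 7 tokens, each a 1,024-dim vector (random stand-ins here).
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(7, 1024))

# Mean pooling: average across token positions -> one 1,024-dim embedding.
embedding = hidden_states.mean(axis=0)
print(embedding.shape)  # (1024,)
```

However long the input, the output is always one vector of the model's fixed dimension.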

The model was trained with a contrastive objective: "these two texts are similar, push their vectors close together. These two texts are different, push them apart." After millions of such comparisons, the 1,024 dimensions organize themselves into a space where cosine similarity between vectors ≈ semantic similarity between texts.
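A minimal illustration of the "push together / push apart" idea, using a toy margin-based contrastive loss on 2-dim vectors. This is not Qwen3's actual training objective (which is a contrastive objective over large batches), just the core mechanic:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-dim "embeddings" standing in for the 1,024-dim ones.
anchor   = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])   # a paraphrase: should be pulled close
negative = np.array([-0.8, 0.6])  # unrelated text: should be pushed apart

# Loss is zero once the positive pair beats the negative pair by a margin;
# otherwise the gradient pulls the positive closer and pushes the negative away.
margin = 0.5
loss = max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))
print(loss)  # 0.0 -- this triple is already well separated
```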

What you're actually measuring when you do cosine similarity

similarity = dot(embedding_A, embedding_B) / (norm(A) * norm(B))

This is the cosine of the angle between two 1,024-dimensional arrows. Close to 1.0 = pointing the same direction = similar meaning. Close to 0 = perpendicular = unrelated. Close to -1.0 = opposite.
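That formula, directly in numpy, with vectors picked to hit each of the three cases:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction, different length
c = np.array([3.0, 0.0, -1.0])  # dot(a, c) = 0: perpendicular

print(cosine_similarity(a, b))   # 1.0  -> same direction
print(cosine_similarity(a, c))   # 0.0  -> unrelated
print(cosine_similarity(a, -a))  # -1.0 -> opposite
```

Note that `b` is just `a` scaled by 2 and still scores 1.0 — cosine similarity ignores vector length and only measures direction.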

The key thing you're asking about — what gets embedded?

This is the "defactorization" question. You choose what text to feed in:

Input → what you get → trade-off:

  - Single word ("retrieval"): word-level meaning. No context, ambiguous.
  - Sentence ("Thompson Sampling converges in 50 tasks"): sentence-level meaning. A specific claim, but loses surrounding context.
  - Paragraph (full method description): paragraph-level meaning. Rich context, but diluted — one 1,024-dim vector for 200 words means each word gets ~5 dimensions of influence.
  - Whole document: document-level meaning. Very diluted — a 5-page paper compressed to 1,024 numbers loses enormous detail.

The dilution problem is real. A 200-word paragraph and a 5-word sentence both become 1,024 numbers. The paragraph has more information crammed into the same space. Some of it gets lost. This is why chunking strategy matters — and why your hierarchical idea (document → paragraph → sentence) is interesting.
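The hierarchical idea can be sketched as indexing the same text at several granularities. The `embed` function below is a hypothetical stand-in (it hashes text into a fake unit vector) — in practice it would call the real model:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # HYPOTHETICAL stand-in for the real embedding model:
    # deterministically hash the text into a fake 1,024-dim unit vector.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=1024)
    return v / np.linalg.norm(v)

doc = "First paragraph about retrieval. Second paragraph about bandits."
paragraphs = doc.split(". ")

# One coarse vector for the whole document, plus one finer vector per paragraph.
index = {
    "document": embed(doc),
    "paragraphs": [embed(p) for p in paragraphs],
}
print(len(index["paragraphs"]), index["document"].shape)
```

At query time you can retrieve coarsely against the document vector, then re-rank against the paragraph vectors — each level trades storage for granularity.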

What your system currently does

Each of your 205 skills gets embedded as one vector (the full skill description, maybe 50-100 words → 1,024 dims). Each task gets embedded similarly. You compare them with cosine similarity. That's your relevance dimension.
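Schematically, that relevance scoring looks like the snippet below — toy 3-dim vectors and made-up skill names standing in for the real 205 skill embeddings:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins: 3 skill embeddings and 1 task embedding (3-dim, not 1,024).
skills = {
    "bandits":   np.array([0.9, 0.1, 0.0]),
    "retrieval": np.array([0.2, 0.9, 0.1]),
    "plotting":  np.array([0.0, 0.1, 0.9]),
}
task = np.array([0.8, 0.3, 0.0])

# Relevance = cosine similarity against the task, ranked best-first.
ranked = sorted(skills, key=lambda name: cosine(skills[name], task), reverse=True)
print(ranked)  # ['bandits', 'retrieval', 'plotting']
```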

What "different factorization protocols" could mean

  1. Chunk size: Embed skills at sentence level instead of full description. More vectors per skill, finer-grained matching, but more computation.
  2. Multi-vector: Instead of one 1,024-dim vector per skill, keep multiple vectors (one per sentence). ColBERT does this — it compares every query token against every document token and takes the best matches. Much more expensive, much more precise.
  3. Dimensionality reduction: Take the 1,024 dims and project down to, say, 128 via PCA or learned projection. Faster, loses some signal. Matryoshka embeddings (which Qwen3 supports) let you truncate to any prefix length.
  4. Factored embeddings: Separate the embedding into interpretable sub-spaces — dimensions 0-255 for "topic," 256-511 for "style," 512-767 for "complexity," 768-1023 for "specificity." Nobody's done this well because the dimensions aren't naturally organized that way. But if you COULD... that's where your R/I/V weighting would operate at the embedding level, not just the retrieval scoring level.
That last one is closest to what you were describing with the WWAD system — paragraph-level embeddings with different dimensions mattering for different retrieval purposes. The problem is the 1,024 dimensions aren't labeled. You'd need to either discover which dimensions correspond to which qualities, or train a projection that separates them.
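The multi-vector option (2) can be sketched as ColBERT-style MaxSim: for each query token, take its best match among the document's token vectors, then sum. Toy shapes and random vectors here, not the real ColBERT implementation:

```python
import numpy as np

def maxsim_score(query_vecs, doc_vecs):
    # Normalize rows so dot products are cosine similarities.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = q @ d.T                        # every query token vs every doc token
    return float(sims.max(axis=1).sum())  # best doc match per query token, summed

rng = np.random.default_rng(1)
query = rng.normal(size=(4, 128))    # 4 query token vectors
doc_a = rng.normal(size=(30, 128))   # 30 random doc token vectors
doc_b = np.vstack([doc_a, query])    # doc_b also contains the query tokens verbatim

# doc_b has an exact match (cosine 1.0) for every query token, so it scores higher.
print(maxsim_score(query, doc_a) < maxsim_score(query, doc_b))  # True
```

The cost shows up in the `q @ d.T` step: it's a full token-by-token similarity matrix per document, instead of a single dot product per document.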

Visual Resources