Understanding Embeddings

The Most Fundamental Representation Idea in Deep Learning

Synthesized from the work of Christopher Olah, Tomas Mikolov, Yoshua Bengio, and others

Introduction: Why Are Embeddings So Important?

In deep learning, one idea runs through virtually every successful model — learning good representations of data. Embedding is the most direct and elegant manifestation of this idea.

The core idea behind embeddings is surprisingly simple: map discrete, high-dimensional symbols (such as words, user IDs, or product codes) into a continuous, low-dimensional vector space where semantically similar objects are close together.

An embedding is a parameterized mapping function \(W: \text{symbols} \rightarrow \mathbb{R}^n\) that maps discrete symbols into an \(n\)-dimensional real vector space. These vectors are not hand-designed — they are learned automatically during the optimization of some task.
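
Concretely, an embedding layer is nothing more than a matrix whose rows are indexed by symbol IDs. A minimal sketch (the vocabulary, dimensionality, and random initialization here are purely illustrative):

```python
import numpy as np

# Hypothetical toy setup: a 5-word vocabulary and 4-dimensional vectors.
vocab = {"cat": 0, "dog": 1, "car": 2, "sat": 3, "on": 4}
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(len(vocab), 4))  # the embedding matrix (trainable)

def embed(word):
    # The mapping W: symbols -> R^n is literally a row lookup.
    return W[vocab[word]]

print(embed("cat").shape)  # (4,)
```

In a real model, `W` is a parameter tensor updated by gradient descent along with the rest of the network.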

As Christopher Olah wrote in his classic blog post: "Why are neural networks effective? Because better ways of representing data can pop out of optimizing layered models." Embedding is the most striking example of this emergence.

From One-Hot to Distributed Representations

To understand why embeddings matter, we first need to understand what they replaced. Before embeddings, the most common way to represent discrete data was one-hot encoding: assigning each word in the vocabulary its own dimension.

[Figure: one-hot encoding assigns "cat", "dog", and "car" vectors with a single 1 in distinct positions; the dimensionality equals the vocabulary size (~50,000+) and all word pairs are equidistant. A learned embedding maps the same words to dense vectors of roughly 100–300 dimensions, in which "cat" and "dog" lie close together in the vector space.]

From sparse one-hot encoding to dense embedding vectors

One-hot encoding has two fundamental problems:

  • Curse of dimensionality: If the vocabulary has 50,000 words, each word becomes a 50,000-dimensional vector — the vast majority of positions are 0, extremely wasteful.
  • Semantic blindness: Any two distinct one-hot vectors are orthogonal with cosine similarity of 0. The distance between "cat" and "dog" is identical to the distance between "cat" and "airplane" — the network cannot extract any semantic information from the representation itself.
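
The contrast is easy to see in a few lines. A minimal sketch (the dense vectors below are hand-picked for illustration, not learned):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot: every pair of distinct words is exactly orthogonal.
cat_oh, dog_oh, plane_oh = np.eye(5)[0], np.eye(5)[1], np.eye(5)[2]
print(cosine(cat_oh, dog_oh))    # 0.0
print(cosine(cat_oh, plane_oh))  # 0.0 -- "dog" is no closer to "cat" than "airplane"

# Illustrative (hand-picked, not learned) dense vectors.
cat = np.array([0.2, -0.4, 0.7])
dog = np.array([0.3, -0.3, 0.6])
plane = np.array([-0.5, 0.8, -0.2])
print(cosine(cat, dog) > cosine(cat, plane))  # True: "cat" is closer to "dog"
```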

Embeddings solve both problems. In 1986, Hinton introduced the concept of distributed representations in his pioneering paper: representing each symbol with a dense, low-dimensional vector where every dimension participates in encoding and each concept is expressed by multiple dimensions jointly. This is the intellectual foundation of embeddings.

The Core Idea Behind Embedding Training

The most elegant aspect of embeddings is that the vectors are not hand-designed, but learned automatically by solving a proxy task. Learning happens via backpropagation — the embedding matrix is part of the model's parameters and is continuously optimized during training.

This approach is rooted in the Distributional Hypothesis from linguistics, proposed by Harris (1954) and Firth (1957):

"You shall know a word by the company it keeps." — J.R. Firth, 1957

A word's meaning is determined by the contexts in which it appears.

Specifically, embedding training typically follows this pattern:

[Figure: input words ("cat", "sat", "on") are looked up in the embedding matrix \(W\), a trainable \(V \times d\) parameter matrix, yielding dense vectors such as [0.2, -0.4, ...]; these feed a task module \(R\) that predicts the next word or judges validity. Backpropagation updates \(W\)'s parameters, so similar words end up with similar vectors.]

The embedding training paradigm: learning representations by solving proxy tasks

Key insight: The desirable properties that emerge in embeddings (such as semantic similarity and analogy relationships) are entirely side effects. We never explicitly asked for "synonyms to have similar vectors" — this is a spontaneous result of the optimization process. As Bengio et al. (2003) explained in their seminal paper A Neural Probabilistic Language Model: the model needs to generalize the validity of "the cat sat on the mat" to "the dog sat on the mat," and the most natural way to do this is to give "cat" and "dog" similar vector representations.

This generalization ability is exponential: if there are \(n\) substitutable positions, each with \(k\) synonyms, then from a single sentence we can generalize to \(k^n\) semantically equivalent sentences.
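
This combinatorics can be verified directly by enumeration (the synonym sets below are hypothetical):

```python
from itertools import product

# Hypothetical synonym sets for n = 2 substitutable positions, k = 3 each.
subjects = ["cat", "dog", "kitten"]
verbs = ["sat", "rested", "slept"]
sentences = [f"the {s} {v} on the mat" for s, v in product(subjects, verbs)]
print(len(sentences))  # k^n = 3^2 = 9
```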

Breakthrough Methods: Word2Vec and GloVe

Although word embeddings were conceptualized as early as 2003, what truly made them standard in NLP was Word2Vec, developed by Mikolov et al. at Google in 2013.

CBOW (Continuous Bag of Words)

[Figure: the embeddings of the context words \(w(t-2)\), \(w(t-1)\), \(w(t+1)\), \(w(t+2)\) are summed and projected to predict the center word \(w(t)\) — context predicts the center word.]

Skip-gram

[Figure: the center word \(w(t)\) is projected through a hidden layer to predict each context word \(w(t-2)\), \(w(t-1)\), \(w(t+1)\), \(w(t+2)\) — the center word predicts its context.]

Two key innovations made Word2Vec successful:

  • Minimalist architecture: Removing the hidden layer from Bengio's model sped up training by orders of magnitude, enabling training on billions of words.
  • Negative Sampling: Instead of computing softmax over the entire vocabulary, randomly sample a few "negative examples" for contrastive learning, drastically reducing computational cost.
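
One skip-gram-with-negative-sampling update can be sketched in a few lines of NumPy. This is an illustrative sketch, not a faithful reimplementation: the vocabulary size, dimensions, and the fixed negative IDs are toy choices, whereas a real implementation samples negatives from a smoothed unigram distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 8                                 # toy vocabulary size and embedding dim
W_in = rng.normal(scale=0.1, size=(V, d))    # center-word ("input") vectors
W_out = rng.normal(scale=0.1, size=(V, d))   # context-word ("output") vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, negatives, lr=0.1):
    """One update: push the real (center, context) pair together, negatives apart."""
    v_c = W_in[center]
    ids = np.array([context] + list(negatives))
    labels = np.array([1.0] + [0.0] * len(negatives))   # 1 = real pair, 0 = noise
    scores = sigmoid(W_out[ids] @ v_c)
    loss = -np.log(scores[0]) - np.sum(np.log(1 - scores[1:]))
    grad = scores - labels                               # gradient of loss w.r.t. logits
    W_in[center] -= lr * grad @ W_out[ids]
    W_out[ids] -= lr * np.outer(grad, v_c)
    return loss

losses = [sgns_step(center=0, context=1, negatives=[5, 7, 9]) for _ in range(50)]
print(losses[0] > losses[-1])  # True: the loss decreases on the repeated pair
```

Note that only `1 + len(negatives)` rows of `W_out` are touched per update, instead of a full-vocabulary softmax — this is the source of the speedup.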

Word2Vec's most stunning discovery was that vector arithmetic encodes semantic analogy relationships:

$$ \vec{king} - \vec{man} + \vec{woman} \approx \vec{queen} $$

Word2Vec analogy relationships: difference vectors encode semantic dimensions (Mikolov et al., 2013)
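
With hand-constructed toy vectors (one dimension loosely standing for "royalty" and one for "gender" — these are not trained values), the arithmetic can be checked directly. By convention, the three query words are excluded from the candidates:

```python
import numpy as np

# Hand-constructed toy vectors: dim 0 ~ "royalty", dim 1 ~ "gender".
vecs = {
    "king":  np.array([0.9,  0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1,  0.8]),
    "woman": np.array([0.1, -0.8]),
    "car":   np.array([-0.7, 0.0]),
}

def nearest(target, exclude):
    # Rank by cosine similarity; dividing by |target| is constant, so omitted.
    return max((w for w in vecs if w not in exclude),
               key=lambda w: vecs[w] @ target / np.linalg.norm(vecs[w]))

target = vecs["king"] - vecs["man"] + vecs["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```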

In 2014, Pennington et al. at Stanford proposed GloVe (Global Vectors), unifying two approaches: leveraging global word co-occurrence matrix statistics (similar to traditional LSA/SVD methods) while also capturing local context window relationships like Word2Vec. Its core objective function is:

$$ J = \sum_{i,j=1}^{V} f(X_{ij})(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2 $$

where \(X_{ij}\) is the number of times word \(i\) and word \(j\) co-occur within a context window, and \(f\) is a weighting function that down-weights rare pairs. GloVe is designed so that the dot product of two word vectors, plus bias terms, approximates the logarithm of their co-occurrence count — theoretically more elegant.
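
The objective can be evaluated directly on a toy co-occurrence matrix. A sketch (the counts, dimensions, and \(x_{max}\) below are illustrative; \(f\) follows the standard GloVe weighting with \(\alpha = 3/4\), and the sum runs only over observed pairs):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 4, 3
X = np.array([[0, 8, 2, 1],
              [8, 0, 5, 1],
              [2, 5, 0, 3],
              [1, 1, 3, 0]], dtype=float)    # toy co-occurrence counts X_ij
w = rng.normal(scale=0.1, size=(V, d))       # word vectors w_i
w_t = rng.normal(scale=0.1, size=(V, d))     # context vectors w~_j
b = np.zeros(V)                              # biases b_i
b_t = np.zeros(V)                            # biases b~_j

def f(x, x_max=10.0, alpha=0.75):
    """GloVe weighting: down-weights rare pairs, caps frequent ones at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss():
    mask = X > 0                              # sum over observed co-occurrences only
    log_X = np.log(X, where=mask, out=np.zeros_like(X))
    err = w @ w_t.T + b[:, None] + b_t[None, :] - log_X
    return float(np.sum(f(X) * mask * err**2))

print(glove_loss())
```

Training would minimize this quantity by gradient descent on `w`, `w_t`, `b`, and `b_t`.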

Remarkable Properties of Embeddings

Embeddings are far more than a data compression technique. After training, they exhibit several striking properties:

Semantic Clustering

Semantically similar words naturally cluster in nearby regions of the vector space. t-SNE visualizations clearly show numbers clustering together, occupations together, animals together. (Turian et al., 2010)

Analogy Reasoning

Relationships between words are encoded as consistent direction vectors. king-man+woman=queen is just the tip of the iceberg — country/capital, verb tenses, comparatives can all be captured. (Mikolov et al., 2013)

Transfer Learning

Embeddings trained on large-scale corpora can be directly transferred to various downstream tasks — named entity recognition, sentiment analysis, parsing — significantly boosting performance. (Luong et al., 2013)

Cross-Modal Alignment

Different modalities of data (text, images, audio) can be embedded into the same space, enabling cross-modal retrieval and zero-shot learning. (Socher et al., 2013; Frome et al., 2013)

Most importantly — these properties are all emergent. We simply train the network to complete a simple prediction task, and these rich structures spontaneously appear in the embedding space. This is the power of deep learning: optimize the representation, and the representation naturally improves.

Shared Representations and Cross-Modal Embeddings

The power of embeddings extends beyond a single type of data. A key trick in deep learning is shared representation: learning a good representation on task A, then applying it to task B — this is the foundation of pretraining, transfer learning, and multi-task learning.

Going further, we can map different types of data into a single representation space:

  • Bilingual embeddings: Socher et al. (2013) embedded English and Chinese words into the same space. After aligning known translation pairs, unknown translation pairs naturally ended up close to each other — as if the two languages have a similar "shape," and aligning a few points causes the whole thing to overlap.
  • Image-text embeddings: Embedding images and words into the same space so that images of dogs map near the "dog" word vector. Even for unseen categories (e.g., "cat"), the model can map cat images to the neighborhood of the "cat" vector — achieving zero-shot classification. (Socher et al., 2013; Frome et al., 2013)
  • CLIP: OpenAI's CLIP (2021) pushed this idea to the extreme, training with contrastive learning on 400 million image-text pairs to achieve powerful zero-shot image classification and cross-modal retrieval.

[Figure: an image encoder (CNN/ViT) and a text encoder (Transformer) map images of dogs, cats, and cars and the words "dog", "cat", "car" into a shared embedding space, where each image forms a semantic cluster with its matching word.]

Cross-modal embedding: images and text embedded into the same space
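
CLIP's training objective is a symmetric contrastive (InfoNCE) loss: within a batch of matched pairs, each image should assign the highest similarity to its own caption, and each caption to its own image. A sketch with stand-in encoder outputs (the batch size, dimensions, and temperature are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                                 # a batch of 4 matched pairs
img = rng.normal(size=(n, d))               # stand-in for image-encoder outputs
txt = img + 0.1 * rng.normal(size=(n, d))   # each caption close to its image

def l2norm(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over cosine-similarity logits; targets are the diagonal."""
    logits = l2norm(img_emb) @ l2norm(txt_emb).T / temperature
    labels = np.arange(len(logits))
    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)          # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()
    return (xent(logits) + xent(logits.T)) / 2           # image->text and text->image

print(clip_loss(img, txt))
```

Mismatching the pairs (e.g., reversing the caption order) drives the loss up, which is exactly the signal that pulls matched image-text pairs together in the shared space.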

A History of Embeddings

1954 — Distributional Hypothesis

Harris proposed the distributional hypothesis: semantically similar words appear in similar contexts.

1986 — Distributed Representations

Hinton introduced the concept of distributed representations, laying the theoretical foundation for embeddings.

1990 — LSA (Latent Semantic Analysis)

Deerwester et al. used SVD to reduce the dimensionality of word-document co-occurrence matrices — one of the earliest vectorized representation methods.

2003 — Neural Probabilistic Language Model

Bengio et al. published A Neural Probabilistic Language Model, the first to introduce neural network word embeddings into language modeling and demonstrate the generalization capabilities of learned representations. This paper is the origin of modern word embeddings.

2008 — Collobert & Weston

Demonstrated that pretrained word embeddings could be shared across multiple NLP tasks, pioneering NLP pretraining.

2013 — Word2Vec

Mikolov et al. at Google proposed Word2Vec (CBOW and Skip-gram), achieving large-scale training through minimalist architecture and negative sampling, and discovering stunning properties like vector analogies. Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases were deeply influential.

2014 — GloVe

Pennington et al. at Stanford proposed GloVe, unifying global statistics and local context approaches. The same year, Cho et al. demonstrated phrase-level embeddings for machine translation.

2014 — Sequence to Sequence Learning

Sutskever et al. used LSTM encoder-decoder for machine translation, embedding entire sentences into fixed vectors — proving embeddings can represent not just words, but sentences.

2016 — fastText

Bojanowski et al. at Facebook proposed subword embeddings, building word vectors from character n-grams so that even out-of-vocabulary words can be represented.

2017 — Transformer and Attention

Vaswani et al. proposed the Transformer architecture (Attention is All You Need), whose positional embeddings and self-attention mechanism fundamentally changed sequence modeling and laid the groundwork for contextualized embeddings.

2018 — ELMo and BERT

Peters et al. proposed ELMo (contextualized word embeddings based on bidirectional LSTMs); Devlin et al. proposed BERT (bidirectional pretraining based on Transformers). Embeddings shifted from static to context-dependent — the same word gets different vectors in different sentences.

2020 — GPT-3

OpenAI's GPT-3 demonstrated the emergent capabilities of embeddings in large language models: representations learned in a 175-billion-parameter model enabled few-shot and zero-shot learning.

2021 — CLIP

Radford et al. proposed CLIP, using contrastive learning to embed images and text into a shared space, achieving powerful zero-shot visual classification.

2022+ — Vector Databases and RAG

With the rise of large language models, text embeddings became the core component of Retrieval-Augmented Generation (RAG). The vector database ecosystem (Pinecone, Weaviate, etc.) flourished.

From Static to Contextualized Embeddings

Word2Vec and GloVe are static embeddings — each word has only one fixed vector regardless of context. But in reality, word meaning often depends on context: "bank" means completely different things in "river bank" and "bank account."

2018 was a turning point for embeddings. ELMo (Peters et al.) and BERT (Devlin et al.) successively introduced the concept of contextualized embeddings: embeddings are no longer fixed lookup tables but are dynamically generated from the entire input sentence.

A static embedding is a lookup table (\(W_\theta(w_n) = \theta_n\)), while a contextualized embedding is a function (\(h_i = f(w_1, w_2, \ldots, w_n; i)\)) — the representation of the \(i\)-th word depends on the entire sequence.
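
The difference is easy to make concrete. In the sketch below (toy vocabulary; a single self-attention-style mixing step stands in for a real contextual encoder \(f\)), the static table returns one fixed vector for "bank", while the contextual function returns different vectors in different sentences:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"river": 0, "bank": 1, "account": 2}
W = rng.normal(size=(len(vocab), 4))           # static table: one vector per word

def static_embed(tokens):
    return W[[vocab[t] for t in tokens]]        # "bank" is identical everywhere

def contextual_embed(tokens):
    """Toy stand-in for f(w_1..w_n; i): one self-attention mixing step."""
    E = static_embed(tokens)
    attn = np.exp(E @ E.T)
    attn /= attn.sum(axis=1, keepdims=True)     # softmax attention weights
    return attn @ E                             # each row now depends on all tokens

a = contextual_embed(["river", "bank"])[1]
b = contextual_embed(["bank", "account"])[0]
print(np.allclose(a, b))  # False: "bank" gets context-dependent vectors
```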

The significance of this shift was profound. In BERT, "bank" receives completely different vector representations in different contexts. More importantly, these contextualized representations serve as universal features, delivering dramatic gains on virtually all NLP tasks — upon release, BERT set new state-of-the-art results on 11 NLP benchmarks.

Modern large language models (GPT, LLaMA, Claude, etc.) are essentially massive contextualized embedding models. Each layer continuously refines and enriches the token representations until the final layer produces representations rich enough to accomplish various tasks.

Wide-Ranging Applications of Embeddings

The impact of embeddings extends far beyond NLP. Any domain involving discrete symbols or requiring learned representations uses the embedding idea:

Recommendation Systems

Embed users and items into the same space, measuring preferences via vector similarity. Core to large-scale systems at YouTube, Spotify, etc.
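
A minimal sketch of the idea (random stand-ins for learned user and item embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 3, 5, 4
users = rng.normal(size=(n_users, d))   # learned user embeddings (random here)
items = rng.normal(size=(n_items, d))   # learned item embeddings (random here)

def recommend(user_id, k=2):
    """Score every item by dot product with the user vector; return top-k item IDs."""
    scores = items @ users[user_id]
    return np.argsort(scores)[::-1][:k].tolist()

print(recommend(0))
```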

Knowledge Graphs

Models like TransE embed entities and relations as vectors, turning knowledge reasoning into vector operations.
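
TransE scores a fact \((h, r, t)\) by \(\|h + r - t\|\), so a relation becomes a translation vector. The vectors below are hand-constructed (not learned) so that the true fact scores lowest:

```python
import numpy as np

# Toy entity/relation vectors constructed so that paris + capital_of ~= france.
ent = {"paris": np.array([1.0, 0.0]),
       "france": np.array([1.0, 1.0]),
       "tokyo": np.array([-1.0, 0.0])}
rel = {"capital_of": np.array([0.0, 1.0])}

def score(h, r, t):
    """TransE energy: small when h + r is close to t."""
    return float(np.linalg.norm(ent[h] + rel[r] - ent[t]))

print(score("paris", "capital_of", "france"))  # 0.0 -- plausible fact
print(score("tokyo", "capital_of", "france"))  # 2.0 -- implausible fact
```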

Molecular Science

Embedding protein sequences and molecular structures as vectors for drug discovery and protein structure prediction (AlphaFold).

Code Understanding

Embedding code snippets as vectors for code search, clone detection, and auto-completion.

Semantic Search / RAG

Embedding documents and queries into the same space, enabling semantic retrieval via vector similarity — the foundation of RAG systems.
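
A minimal retrieval sketch. A hashed bag-of-words function stands in for a real embedding model (a sentence encoder in practice) — this stand-in is purely illustrative:

```python
import zlib
import numpy as np

def embed(text, d=64):
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalized."""
    v = np.zeros(d)
    for tok in text.lower().split():
        v[zlib.crc32(tok.encode()) % d] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

docs = ["the cat sat on the mat",
        "stock markets fell sharply today",
        "how to train a neural network"]
doc_vecs = np.stack([embed(doc) for doc in docs])

def retrieve(query, k=1):
    """Rank documents by cosine similarity to the query (vectors are unit-norm)."""
    scores = doc_vecs @ embed(query)
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

print(retrieve("train a neural network"))
```

In a RAG system the retrieved documents are then passed to a language model as context; the only change is swapping the stand-in `embed` for a trained encoder and a vector index.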

Audio / Images

Intermediate layers of CNNs and ViTs are essentially image embeddings. Audio's Wav2Vec works similarly.

Key Takeaways

1. Embeddings map discrete symbols to continuous vectors, enabling models to measure similarity and generalize.

2. The core of training is the distributional hypothesis: learn representations by predicting context, with semantic information emerging as a side effect.

3. Word2Vec and GloVe proved word vectors can be trained at scale, exhibiting remarkable properties like analogy reasoning.

4. From static to contextualized (ELMo → BERT → GPT), embeddings have become increasingly powerful and flexible.

5. Cross-modal embeddings (CLIP, etc.) align different data types into the same space, enabling zero-shot learning.

6. Embedding is one of the most fundamental ideas in deep learning: learning good representations is the foundation of everything.

References

This article draws from the following research: