Word Embedding (Word2Vec)
Simulate training 2D word vectors using the Skip-gram architecture with Negative Sampling.
Concept Overview
Word embeddings are dense vector representations of words in a continuous vector space, where semantically similar words are mapped to proximate points. The Word2Vec algorithm, introduced by Mikolov et al., revolutionized Natural Language Processing (NLP) by providing an efficient method to learn these high-quality representations from large corpora of text without requiring labeled data. The fundamental intuition behind Word2Vec is the distributional hypothesis: "a word is characterized by the company it keeps." By training a neural network to predict words based on their context (or vice-versa), the network's internal weights learn to capture intricate semantic and syntactic relationships.
Mathematical Definition
Word2Vec primarily uses one of two architectures: Continuous Bag-of-Words (CBOW) or Skip-gram. In Skip-gram with Negative Sampling (SGNS), the objective is to maximize the probability of observed context words given a target word, while minimizing the probability of randomly sampled "negative" context words. Given a sequence of training words w1, w2, ..., wT and a context window of size c, the Skip-gram objective maximizes the average log probability:

(1/T) Σ_{t=1..T} Σ_{-c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t)

With negative sampling, each term log p(wO | wI) is replaced by a binary classification objective over the true context word and k sampled noise words:

log σ(v'_{wO} · v_{wI}) + Σ_{i=1..k} E_{wi ~ Pn(w)} [log σ(−v'_{wi} · v_{wI})]

where σ is the sigmoid function, v_w and v'_w are the "input" and "output" vector representations of word w, and Pn(w) is the noise distribution.
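The SGNS objective above can be computed directly for a single (target, context) pair. The sketch below, in NumPy, is illustrative rather than an optimized implementation; the function and variable names are our own:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_target, v_context, v_negatives):
    """Negative log-likelihood of the SGNS objective for one pair.

    v_target:    input vector of the center word (dim,)
    v_context:   output vector of the true context word (dim,)
    v_negatives: output vectors of k sampled noise words (k, dim)
    """
    pos = np.log(sigmoid(v_context @ v_target))        # score the true pair high
    neg = np.sum(np.log(sigmoid(-v_negatives @ v_target)))  # score noise pairs low
    return -(pos + neg)                                # minimized during training

rng = np.random.default_rng(0)
loss = sgns_loss(rng.normal(size=2), rng.normal(size=2), rng.normal(size=(5, 2)))
```

Because every term is −log of a sigmoid output, the loss is always positive and shrinks as the true pair's dot product grows and the noise pairs' dot products shrink.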
Key Concepts
Skip-gram vs CBOW
- Continuous Bag-of-Words (CBOW): Predicts a target word given its surrounding context words. It tends to be faster to train and has slightly better accuracy for frequent words.
- Skip-gram: Predicts the surrounding context words given a single target word. It handles rare words or phrases much better than CBOW and is generally preferred for large datasets.
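The Skip-gram training data is simply every (target, context) pair within a sliding window. A minimal sketch (the function name and toy sentence are illustrative):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training pairs for Skip-gram."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                      # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "quick", "brown", "fox"], window=1)
# → [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ...]
```

CBOW would group the same windows the other way around: all context words jointly predicting the single center word.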
Negative Sampling
Calculating the exact probability using softmax over the entire vocabulary (which can be millions of words) is computationally prohibitive. Negative sampling transforms this into a binary classification problem. Instead of updating all weights, the model distinguishes the true context word from k randomly sampled "noise" words, drastically reducing computational complexity while maintaining high representation quality.
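In the original Word2Vec, the k noise words are drawn from the unigram distribution raised to the 3/4 power, which boosts rare words relative to their raw frequency. A small NumPy sketch (the vocabulary and function name are made up for illustration):

```python
import numpy as np

def noise_distribution(counts):
    """Noise distribution Pn(w) ∝ unigram frequency ** 0.75 (Mikolov et al.)."""
    freqs = np.array(list(counts.values()), dtype=float) ** 0.75
    return freqs / freqs.sum()

vocab = {"the": 100, "cat": 10, "sat": 8, "quasar": 1}
probs = noise_distribution(vocab)
# "quasar" is sampled more often than its raw 1/119 frequency would suggest,
# while "the" remains the most likely noise word overall.
```

For each true (target, context) pair, k words are then sampled from this distribution and treated as negative examples in the binary classification objective.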
Linear Substructures
One of the most remarkable properties of Word2Vec embeddings is their linear translation properties. Relationships are captured as vector offsets. The famous example is that vector operations capture analogies: v("King") - v("Man") + v("Woman") ≈ v("Queen"). This demonstrates that the geometric spatial arrangement in the latent vector space meaningfully corresponds to semantic concepts.
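The analogy arithmetic can be demonstrated with hand-picked toy 2D vectors in which the "gender" offset is a consistent direction; real Word2Vec learns such structure from data rather than having it built in:

```python
import numpy as np

# Toy 2D embeddings chosen by hand so that king - man + woman lands on queen.
emb = {
    "king":  np.array([0.9, 0.8]),
    "man":   np.array([0.9, 0.1]),
    "woman": np.array([0.1, 0.1]),
    "queen": np.array([0.1, 0.8]),
}

def nearest(query, emb, exclude=()):
    """Word whose vector has the highest cosine similarity to `query`."""
    def cos(a, b):
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude), key=lambda w: cos(emb[w], query))

analogy = emb["king"] - emb["man"] + emb["woman"]
result = nearest(analogy, emb, exclude={"king", "man", "woman"})  # → "queen"
```

Excluding the three query words mirrors standard analogy evaluation, where the answer must be a fourth word.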
Historical Context
The concept of word embeddings builds upon older techniques like Latent Semantic Analysis (LSA) and earlier neural language models introduced by Bengio et al. in 2003. However, these earlier methods were often computationally expensive and did not scale well to vast amounts of text.
In 2013, a team of researchers led by Tomas Mikolov at Google created Word2Vec. By removing the computational bottleneck of dense matrix multiplications in hidden layers (using the simple CBOW and Skip-gram architectures), they were able to train models on billions of words in hours rather than weeks. This breakthrough democratized word embeddings and sparked a massive wave of advancements in NLP.
Real-world Applications
- Information Retrieval & Search: Understanding search query intent by expanding queries with semantically similar terms.
- Recommendation Systems: Applying "Item2Vec" to embed products or movies based on user interaction sequences, enabling similarity-based recommendations.
- Machine Translation: Providing rich initial representations of words to improve translation accuracy across language models.
- Sentiment Analysis: Capturing nuanced meanings and contextual usage of adjectives to better classify sentiment in text.
Related Concepts
- Neural Network Learning — Word2Vec uses fundamental backpropagation to optimize its vector representations.
- PCA Dimensionality Reduction — PCA or t-SNE is commonly used to project high-dimensional word vectors (e.g., 300 dimensions) down to 2D or 3D for visualization.
- Attention Mechanism — Transformers and Attention mechanisms have largely superseded Word2Vec for contextualized embeddings (e.g., BERT, GPT).
Experience it interactively
Adjust parameters, observe in real time, and build deep intuition with Riano’s interactive Word Embedding (Word2Vec) module.
Try Word Embedding (Word2Vec) on Riano →