Investigation into In-Context Learning Capabilities of Transformers

A systematic empirical study of when and why transformers can learn new tasks from examples alone—without any parameter updates—and the surprising conditions under which they generalize even from noisy data.

Rushil Chandrupatla · Leo Bangayan · Sebastian Leng

Mentor: Arya Mazumdar · DSC 180B Capstone, UC San Diego · 2026


Why Study In-Context Learning?

In-context learning (ICL) is the ability of transformer models to solve new tasks using only a handful of example input–output pairs provided in the prompt, with no parameter updates. This is how modern LLMs like GPT-4 and Claude can answer novel questions after seeing just a few demonstrations.

Understanding ICL matters because training large models is extremely expensive—GPT-3 training took roughly 3–4 continuous months. If models can adapt to new tasks purely through context, we can bypass costly retraining cycles. Yet the mechanisms driving ICL remain poorly understood: when does it work, when does it fail, and what makes the difference?

We conducted a comprehensive empirical investigation to answer these questions across three research axes: scaling laws, benign overfitting, and behavior of full transformer architectures.

Task Definition & Scope

We study binary classification tasks generated from a Gaussian mixture model. Each task is an independent classification problem with its own direction vector, so the model must infer the task from context rather than memorize a fixed rule.

μτ ~ Uniform(R · S^{d−1}),  xi = yi μτ + zi,  zi ~ N(0, Id),  yi ∈ {−1, +1}

R controls the separation between classes (signal strength). Each task consists of N labeled context examples plus one query point whose label must be predicted.
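The generative process above can be sketched in a few lines of NumPy (a minimal sketch; the function and variable names are ours, not taken from the released code):

```python
import numpy as np

def sample_task(d, N, R, rng):
    """One binary task: mu_tau uniform on the radius-R sphere, x_i = y_i * mu_tau + z_i."""
    mu = rng.standard_normal(d)
    mu = R * mu / np.linalg.norm(mu)            # uniform on R * S^{d-1}
    y = rng.choice([-1.0, 1.0], size=N + 1)     # N context labels plus one query label
    z = rng.standard_normal((N + 1, d))         # isotropic Gaussian noise, z_i ~ N(0, I_d)
    x = y[:, None] * mu + z
    return x[:N], y[:N], x[N], y[N]             # context pairs and the held-out query

rng = np.random.default_rng(0)
Xc, yc, xq, yq = sample_task(d=50, N=10, R=6.45, rng=rng)
```

Each call draws a fresh direction μτ, so no fixed decision rule works across tasks; the label must be inferred from the context pairs.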

| Parameter | Symbol | Description | Range Tested |
| --- | --- | --- | --- |
| Feature dimension | d | Number of features per data point | 50 – 1000 |
| Context examples | N | Labeled examples shown in-context | 5 – 80 |
| Training tasks | B | Number of tasks per training batch | 50 – 2000 |
| Signal strength | R | Class separation (signal-to-noise) | 1.35 – 9.5 |
| Label noise | ε | Probability of flipped context labels | 0 – 0.4 |

Scope Boundaries

What we did: Systematic empirical sweeps over (d, N, B, R, ε) on synthetic Gaussian mixture tasks using a linear in-context classifier, plus evaluation of commercial LLMs (GPT-4o-mini, Claude, Gemini) on the same task format.

What we did not do: We did not derive new theoretical bounds or train full-scale transformer architectures from scratch. Our linear classifier isolates the geometric mechanism of ICL; full transformers were evaluated as pre-trained black boxes.

Three Research Questions

We organized our investigation around three complementary questions, each probing a different aspect of in-context learning. Methodological details for each follow below.

RQ 1

Scaling Laws

How does in-context test accuracy scale with feature dimension (d), context size (N), and number of training tasks (B)?

RQ 2

Benign Overfitting

Under what conditions does the model memorize noisy context labels yet still generalize correctly on clean test data?

RQ 3

Full Transformers

Do commercial LLMs (GPT, Claude, Gemini) exhibit the same ICL behaviors predicted by linear attention theory?

RQ 1 — Model & Training Details

We use a linear in-context classifier with a single learnable matrix W ∈ Rd×d. The model computes the label-weighted empirical mean of context examples, then predicts via the inner product μ̂TW xquery. This formulation mirrors the theoretical parameterization studied by Frei & Vardi (2024) while remaining tractable for large-scale sweeps.

Training: SGD with learning rate η = 0.01, zero initialization, up to 1000 steps, evaluated every 10 steps across 3 independent seeds. Loss is logistic and computed only on the query—context predictions are used exclusively for evaluation.
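The model and its query-only logistic update can be sketched as follows (an illustrative NumPy reimplementation under the stated hyperparameters, not the project's actual training pipeline):

```python
import numpy as np

def icl_logit(W, Xc, yc, xq):
    """Label-weighted context mean mu_hat, then bilinear score mu_hat^T W x_query."""
    mu_hat = (yc[:, None] * Xc).mean(axis=0)   # (1/N) * sum_i y_i x_i
    return mu_hat @ W @ xq

def sgd_step(W, Xc, yc, xq, yq, lr=0.01):
    """One SGD step on the logistic loss of the query prediction only."""
    mu_hat = (yc[:, None] * Xc).mean(axis=0)
    logit = mu_hat @ W @ xq
    # d/d(logit) of log(1 + exp(-y * logit)) is -y / (1 + exp(y * logit))
    g = -yq / (1.0 + np.exp(yq * logit))
    return W - lr * g * np.outer(mu_hat, xq)   # chain rule: d(logit)/dW = mu_hat x_q^T

d, N = 5, 4
rng = np.random.default_rng(1)
W = np.zeros((d, d))                           # zero initialization, as in training
Xc = rng.standard_normal((N, d))
yc = np.array([1.0, -1.0, 1.0, 1.0])
xq, yq = rng.standard_normal(d), 1.0
W = sgd_step(W, Xc, yc, xq, yq)
```

Because only the query logit enters the loss, the context predictions never generate gradients; they are scored purely for evaluation, as described above.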

Experiments: We swept d ∈ {50, 100, 200, 500, 1000}, N ∈ {5, 10, 20, 40, 80}, B ∈ {50, 100, 250, 500, 1000, 2000}, under both constant R = 6.45 and SNR-scaled R = 0.3√d. Interaction grids over (d, N) and (B, N) captured cross-variable effects.

RQ 2 — Noise Injection & Regime Classification

We inject label noise by independently flipping each context label with probability ε ∈ {0, 0.05, 0.1, 0.2, 0.3, 0.4}. In one set of experiments, noise is applied only to context labels (query labels remain clean); in another, noise is applied uniformly to both context and query labels.
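The noise injection amounts to an independent Bernoulli flip per label; a minimal sketch (function names are ours):

```python
import numpy as np

def flip_labels(y, eps, rng):
    """Independently flip each +/-1 label with probability eps."""
    flips = rng.random(y.shape[0]) < eps
    return np.where(flips, -y, y)

rng = np.random.default_rng(0)
y = np.ones(10_000)
y_noisy = flip_labels(y, eps=0.2, rng=rng)
flip_rate = np.mean(y_noisy != y)   # concentrates near 0.2 for large samples
```

In the context-only condition, this is applied to the N context labels while the query label stays clean; in the uniform condition, the same flip is applied to the query label as well.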

Regime classification:

  • Underfitting: low in-context accuracy, low test accuracy
  • Classical Overfitting: high in-context accuracy, low test accuracy
  • Benign Overfitting: high in-context accuracy on noisy labels, high test accuracy

We visualized regimes using phase diagrams over (d, ε), (R, ε), and (N, ε), sweeping across dimensionality, signal strength, and context size interactions.
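The regime labels can be operationalized as thresholds on the two accuracies; a sketch with illustrative cutoffs (the exact thresholds used for our phase diagrams are in the code, not reproduced here):

```python
def classify_regime(in_context_acc, test_acc, hi=0.9, lo=0.6):
    """Map (in-context accuracy on noisy labels, clean test accuracy) to a regime.

    The hi/lo cutoffs are illustrative, not the study's exact values.
    """
    if in_context_acc >= hi and test_acc >= hi:
        return "benign overfitting"      # memorizes noise, still generalizes
    if in_context_acc >= hi and test_acc <= lo:
        return "classical overfitting"   # memorizes noise, fails to generalize
    if in_context_acc <= lo and test_acc <= lo:
        return "underfitting"            # fails at both
    return "intermediate"
```

Evaluating this map over a grid of (d, ε), (R, ε), or (N, ε) settings yields the phase diagrams described above.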

RQ 3 — Commercial LLM Evaluation

We evaluated Google Gemini 2.0 Flash, Anthropic Claude, and OpenAI GPT-4o-mini on the same Gaussian mixture classification tasks. Since these models expect text, we serialized each feature vector as comma-separated floating-point numbers (4 decimal places) and constructed few-shot prompts with N labeled examples followed by an unlabeled query.
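The serialization step might look like the following (the "Input:"/"Label:" wording is an assumption for illustration; the 4-decimal, N-shot-plus-unlabeled-query format follows the description above):

```python
def serialize_example(x, label=None, decimals=4):
    """Comma-separated floats at fixed precision, optionally followed by a label."""
    feats = ", ".join(f"{v:.{decimals}f}" for v in x)
    line = f"Input: {feats}\nLabel:"
    return f"{line} {int(label)}" if label is not None else line

def build_prompt(context_x, context_y, query_x):
    """N labeled shots followed by one unlabeled query for the LLM to complete."""
    shots = [serialize_example(x, y) for x, y in zip(context_x, context_y)]
    return "\n\n".join(shots + [serialize_example(query_x)])

prompt = build_prompt([[0.1234, -1.0]], [1], [0.5, 0.25])
```

The model's completion after the final "Label:" is then parsed back into {−1, +1}; completions that fail to parse are scored as invalid.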

We also probed the open-weights TinyLlama-1.1B model on synthetic linear regression tasks, comparing its MSE against one-step gradient descent and preconditioned GD++ baselines.
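The one-step gradient descent baseline has a closed form: a single step from w = 0 on the squared loss gives w ∝ XᵀY/N. A sketch on a toy noiseless regression task (the step size and task are illustrative; GD++ additionally preconditions this step and is omitted here):

```python
import numpy as np

def gd1_predict(Xc, yc, xq, lr=0.1):
    """One-step GD baseline: one gradient step from w = 0 on the squared loss
    (1/2N) * sum_i (w @ x_i - y_i)^2, then a linear prediction on the query."""
    w1 = lr * Xc.T @ yc / Xc.shape[0]   # minus the gradient at w = 0, scaled by lr
    return w1 @ xq

# Toy regression task y = w_star @ x with no label noise.
rng = np.random.default_rng(0)
d, N = 8, 32
w_star = rng.standard_normal(d)
Xc = rng.standard_normal((N, d))
yc = Xc @ w_star
xq = rng.standard_normal(d)
sq_err = (gd1_predict(Xc, yc, xq) - w_star @ xq) ** 2
```

Comparing a transformer's in-context MSE against this baseline (and its preconditioned variant) is what lets us ask whether the model behaves like a learned gradient step.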

What We Found

Below are the headline findings from each research question. Full training curves, phase diagrams, and additional figures are available in the full report and poster.


RQ 1

Scaling Laws

Dimension (d)

  • Under constant R, both validation and in-context accuracy reach 1.0 regardless of dimension, but higher d slows convergence because a fixed signal becomes weaker in higher-dimensional noise.
  • Under SNR-scaled R (R = 0.3√d), performance reliably reaches 1.0 regardless of d, since signal strength grows with dimension.
Figure 1. d=50, SNR-scaled R. In-context accuracy reaches 1.0 quickly, but validation accuracy plateaus at ~0.95—the model classifies its own context well but struggles to generalize to unseen query points in low dimensions.
Figure 2. d=1000, constant R=6.45. Despite 1000 dimensions, all three accuracy metrics converge to 1.0, though convergence is more gradual. Higher dimensions slow learning but do not prevent it when signal is sufficient.

Context Size (N)

  • More context examples raise the model's starting accuracy (validation begins at ~0.98 for N=80 vs. ~0.75 for N=5), reflecting better initial task inference.
  • However, all configurations converge to comparable final accuracy. N primarily affects convergence speed, not ceiling performance.
Figure 3. N=5 context examples. Validation accuracy starts at ~0.75 and takes ~30 steps to reach 1.0. With few examples, the model needs more training to compensate for limited per-task information.
Figure 4. N=80 context examples. Validation starts at ~0.98 and reaches 1.0 almost immediately. More context examples give the model a much better initial estimate of the task, dramatically accelerating convergence.
Interpretation: The scaling variables d, N, and B primarily control convergence speed. The theoretical ceiling is determined by signal strength R relative to dimension d. When R is scaled appropriately, the model reaches near-perfect accuracy across all tested configurations.

RQ 2

Benign Overfitting

Context-Only Noise

  • When noise is applied only to context labels (query labels remain clean), benign overfitting occurs across nearly all configurations—even at 40% label flipping.
  • The model memorizes noisy context labels (high in-context accuracy) while maintaining near-perfect validation accuracy on clean queries.
  • Higher dimensions slow convergence but do not prevent benign overfitting, as long as signal strength is sufficient.
Figure 5. ε=0.2 (20% label noise), context-only. Despite 20% of context labels being flipped, the model memorizes even the noisy labels (in-context acc → 1.0) while still achieving perfect validation accuracy. This is classic benign overfitting.
Figure 6. ε=0.4 (40% label noise), context-only. Even with nearly half the labels flipped, in-context accuracy reaches 1.0 and validation converges to ~0.8. Train accuracy (blue) is most affected—it never reaches 1.0 because the noisy training signal caps it, yet generalization is preserved.

Uniform Noise (Context + Query)

  • When noise is applied to both context and query labels, signal strength R becomes critical. Low R (e.g., 1.35) causes collapse to near-random accuracy (underfitting).
  • Sufficient R (e.g., 8.97) restores benign overfitting, with validation accuracy reaching its theoretical maximum (1 − ε).
  • Higher dimensionality can cause collapse rather than facilitate overfitting when signal is not scaled accordingly—opposite to the context-only noise case.
Figure 7. d=1500, R=1.35, ε=0.2, noise on both context and query. With weak signal, the model collapses: validation and train accuracy hover near 0.5 (random chance), while in-context accuracy slowly climbs to ~0.75. This is underfitting—the signal is too weak for the model to learn the task.
Figure 8. Same setup but R=8.97. Stronger signal restores benign overfitting: in-context accuracy reaches ~0.99, while validation accuracy stabilizes at ~0.8 (its theoretical maximum of 1−ε). Signal strength is the decisive factor.
Key Insight: Benign overfitting is not an exotic edge case. Under context-only noise, it is the default behavior. The critical variable is the relationship between signal strength (R) and dimensionality (d)—not noise level alone. Compare Figures 7 and 8: the only difference is R, yet the outcome flips from underfitting to benign overfitting.

RQ 3

Full Transformer Architectures

Commercial LLMs (GPT-4o-mini, Claude, Gemini)

  • Performance improves with stronger signal (higher R) and degrades in high dimensions, partially aligning with linear ICL theory.
  • Results show that full transformers do implement some form of in-context classification on serialized numerical data, though performance is noisier than the trained linear model.
Figure 9. Accuracy vs. signal strength (R) for three commercial LLMs (d=50, N=5). All providers show an upward trend as signal increases, consistent with linear ICL theory. Gemini performs best overall; Claude shows the steepest improvement. Shaded regions represent variance across seeds—note the high uncertainty, reflecting prompt sensitivity.
Figure 10. Accuracy heatmaps across dimension (d) and context length (N) for each provider. Low-dimensional tasks (d=50–100) yield the best results, consistent with theoretical predictions. Performance drops sharply in higher dimensions (d≥500), especially for Claude and GPT. Gemini maintains moderate accuracy across more configurations. Red cells (0.00) indicate the model failed to produce valid predictions.

Open-Weights Probing (TinyLlama)

  • TinyLlama's MSE was compared against one-step gradient descent (GD-1) and preconditioned GD++ baselines on linear regression tasks.
  • Results are consistent with the hypothesis that full transformers implement gradient-descent-like algorithms in-context, as suggested by linear attention theory.
Interpretation: The ICL behaviors predicted by simplified linear theory are partially preserved in full transformer architectures. Commercial LLMs show the same directional trends (signal helps, high dimension hurts) but with greater variance, suggesting additional mechanisms beyond pure geometric inference.

See all figures & training curves in the full report

Limitations & Failure Modes

Synthetic Data Only

All experiments use Gaussian mixture tasks. While this gives precise control, real-world ICL operates on natural language and heterogeneous data distributions. Our findings characterize the geometric mechanism but may not transfer directly to production settings.

Linear Classifier

Our primary model is a linear in-context classifier (single W matrix). This deliberately isolates the geometric mechanism but cannot capture nonlinear attention effects present in full transformers.

Commercial LLM Noise

RQ3 results are noisier due to uncontrollable factors: API-level randomness, prompt sensitivity, and the models' training on vastly different data. We cannot rule out that performance on serialized numerical data reflects prompt engineering rather than true ICL.

Computational Limits

We trained for at most 1000 steps per configuration. Longer training could alter conclusions about convergence behavior, particularly in high-dimensional, low-signal regimes where accuracy was still improving at termination.

How to Interpret Our Results

Our accuracy metrics are averaged over 3 independent seeds, so differences smaller than ±2 percentage points should be treated as noise. When we report that a configuration "reaches 1.0," we mean the mean across seeds is ≥ 0.99. For benign overfitting results, the meaningful signal is the pair of in-context accuracy (on noisy labels) and validation accuracy (on clean data): both being high simultaneously confirms benign overfitting.

Next Steps

Future work could extend this analysis to multi-class settings, non-Gaussian distributions, and mid-scale transformer architectures (e.g., fine-tuned GPT-2) where training dynamics can be observed directly. Investigating the transition between benign and classical overfitting with finer ε resolution would also help map the phase boundary more precisely.

Report, Poster & Code

All project materials are publicly available.

Reproducing Our Results: Clone the repository, install dependencies with pip install -r requirements.txt, then run scripts in src/icl_reproduction/ (e.g., python example_comprehensive_pipeline.py). The commercial LLM tests require API keys set in a .env file—see the repository README for details.

Citations

  1. Frei, S. & Vardi, G. (2024). Trained Transformer Classifiers Generalize and Exhibit Benign Overfitting In-Context.
  2. Garg, S., Tsipras, D., Liang, P., & Valiant, G. (2023). What Can Transformers Learn In-Context? A Case Study of Simple Function Classes.

Our linear in-context classifier formulation follows the theoretical framework of [1]. The function-class perspective on ICL capabilities draws from [2]. Commercial LLM APIs (OpenAI, Anthropic, Google) were accessed via their respective Python SDKs.

Contributors

RC

Rushil Chandrupatla

RQ1 & RQ2: model design, training pipeline, results analysis, report writing

SL

Sebastian Leng

RQ2: benign overfitting experiments, noise injection methods, code

LB

Leo Bangayan

RQ3: commercial LLM evaluation framework, interactive demo, code

AM

Arya Mazumdar

Faculty Mentor