Updates — Voices from the AI Socratic Community | AI Socratic

Feed Mosaic Slides

March 2026

Mar 30, 2026Research

Google TurboQuant: 6x KV-Cache Compression with Zero Accuracy Loss

TurboQuant

Google releases TurboQuant, a compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup with zero accuracy loss. The technique combines online vector quantization ideas from PolarQuant and earlier work. Community members have already implemented it for vLLM, fitting 4M+ KV-cache tokens on small devices, calling it the biggest open inference breakthrough of 2026.

Sources: google blog, tweet, Simple Explainer

Federico Ulfo

Federico Ulfo

Mar 30, 2026Research

LLM Architecture Gallery

LLM Architecture Gallery

LLM Architecture Gallery

⭐️ Bookmark this: https://sebastianraschka.com/llm-architecture-gallery or get yourself a poster.

Federico Ulfo

Federico Ulfo

Mar 30, 2026Research

Meta FAIR Releases TRIBE v2: Brain Response Foundation Model

Meta FAIR Releases TRIBE v2: Foundation Model That Predicts Human Brain Responses

Meta FAIR introduces TRIBE v2 (Trimodal Brain Encoder), a foundation model trained on 500+ hours of fMRI recordings from 700+ people to predict how the human brain responds to sights and sounds. The paper suggests a paradigm shift in neuroscience toward unified predictive foundation models of brain and cognitive functions, achieving 70x higher resolution than previous approaches.

TRIBE v2

Sources: Meta, tweet

Federico Ulfo

Federico Ulfo

Mar 30, 2026Research

LeCun's Team Releases LeWorldModel: End-to-End JEPA from Pixels

Yann LeCun's Team Releases LeWorldModel: Stable End-to-End JEPA from Pixels

LeCun's team releases LeWorldModel, solving a key bottleneck of Joint-Embedding Predictive Architectures (JEPA) by making them trainable end-to-end from pixels. This advances the world model paradigm that many see as a critical shift beyond autoregressive language models.

LeWorldModel

Sources: tweet

Federico Ulfo

Federico Ulfo

Mar 30, 2026Research

Kimi: Attention Residuals

Kimi: Attention Residuals

A more efficient way to reuse past information across layers without slowing models down.

Attention Residuals

Sources: tweet

Federico Ulfo

Federico Ulfo

Mar 30, 2026Research

TinyLoRA: Fine-Tuning 8B Models by Tweaking Just 13 Parameters

TinyLoRA: Fine-Tuning 8B Parameter Models by Tweaking Just 13 Parameters, Researchers from Meta, Cornell, and CMU introduce TinyLoRA, scaling LoRA down to as few as 1 parameter. They turned an 8B parameter model into a math and reasoning powerhouse by fine-tuning just 13 parameters (26 bytes), demonstrating extreme parameter efficiency for model adaptation.

Federico Ulfo

Federico Ulfo

Mar 30, 2026Research

Exclusive Self-Attention (XSA): Two-Line Change Improving Transformers

Exclusive Self-Attention (XSA): Two-Line Change Improving Transformers Already Adopted in Practice, Exclusive Self-Attention (XSA) proposes a tiny two-line code change that stops attention from attending to itself, forcing focus on the rest of the sequence. It has already become a standard component in leading solutions for OpenAI's parameter golf challenge, demonstrating rapid real-world adoption.

Federico Ulfo

Federico Ulfo

Mar 30, 2026Research

Anthropic Economic Index: How Claude Usage Evolves with Experience

Anthropic Economic Index: How Claude Usage Evolves with Experience, Anthropic's Economic Index reveals that longer-term Claude users iterate more carefully, are less likely to hand over full autonomy, attempt higher-value tasks, and receive more successful responses. This provides empirical insight into how human-AI collaboration patterns mature over time.

Federico Ulfo

Federico Ulfo

Mar 30, 2026Research

GradMem: Writing Context into LLM Memory via Test-Time Gradient Descent

GradMem: Writing Context into LLM Memory via Test-Time Gradient Descent, GradMem introduces writing context into memory using test-time gradient descent rather than forward-pass encoding. By optimizing memory tokens with a reconstruction loss, a frozen model can compress long contexts into small memory without the lossy limitations of existing approaches.

Federico Ulfo

Federico Ulfo

Mar 30, 2026Research

100M Token Context Without Collapse on 2×A800 GPUs

100M Token Context Without Collapse: <9% Degradation on 2×A800 GPUs, New research achieves 100M token context windows with less than 9% degradation from 16K, beating RAG + rerank + SOTA pipelines while running on just 2×A800 GPUs. This could fundamentally change how long-context applications are built.

Federico Ulfo

Federico Ulfo

Mar 30, 2026Research

LLM Internals: By Layer 10, Models Are Language-Agnostic

LLM Internals: By Layer 10, Models Don't Know What Language They're Reading, A new blog post reveals that when feeding the same sentence in English and Chinese to an LLM, by layer 10 the model's internal representations become language-agnostic — it's "just thinking." This provides fascinating insight into how LLMs develop universal conceptual representations.

Federico Ulfo

Federico Ulfo

Mar 30, 2026Research

LLM Fused with Mini Computer: Switching Between Text and Machine Code

LLM Fused with Mini Computer: Switching Between Text and Machine Code in Single GPU, A developer demonstrates an LLM brain fused with a mini computer that can switch between generating text and generating/executing machine code, all running in a single GPU and torch graph. This represents a step toward unified compute-and-language models.

Federico Ulfo

Federico Ulfo

Mar 30, 2026Research

Columbia Exposes Flaws in Private AI Inference: 280GB per Query

Columbia University Exposes Flaws in Private AI Inference: Prior Methods Used 280GB per Query, Columbia University researchers prove that the entire private AI inference industry built the wrong approach, with prior methods requiring 280GB per query and 60-second latency for full transformer encryption. Their work points to fundamentally more efficient architectures for privacy-preserving inference.

Matrix

A system of the agents by the agents for the agents. But the agents are ret...

Federico Ulfo

Federico Ulfo

Mar 30, 2026Research

ARC-AGI-3 Announced: Humans Score 100%, AI < 1%

This is so far the only unsaturated agentic intelligence benchmark. Unlike benchmarks that test what models already know, ARC-AGI-3 tests how they learn and acquire new skills, providing a formal measure of the gap between human and AI skill acquisition efficiency.

Sources: tweet

Team meeting in 2026

Federico Ulfo

Federico Ulfo

Mar 30, 2026Research

Quantization Explained

Quantization Explained

Federico Ulfo

Federico Ulfo

Mar 3, 2026Research

The Molecular Structure of Thought: Mapping Long Chain-of-Thought Reasoning

This research maps Long CoT trajectories in LLMs as topological structures driven by deep-reasoning, self-reflection, and self-exploration interactions.

The Mole-Syn distribution-transfer-graph method synthesizes effective semantic isomers to facilitate fast entropy convergence and stabilize reinforcement learning.

This structural approach minimizes trajectory competition during fine-tuning and improves performance across reasoning benchmarks.

Screenshot 2026-03-09 at 1.54.53 PM.png Sources: Paper

Federico Ulfo

Federico Ulfo

Mar 3, 2026Research

The Psychology of Memory

Psychology solved the AI memory problem decades ago, we just ignored it. Identity is something you construct from memory, emotion, and narrative. Conway’s Self-Memory System shows memories are reconstructed each time we recall them. Rathbone found autobiographical memories cluster around ages 10–30 (the reminiscence bump) when identity forms. We remember transitions: moments we became someone new. Clive Wearing, unable to form new memories, experiences consciousness in ~30-second resets. Yet emotional and procedural memory remain. Episodic memory is fragile, emotional memory endures. Damasio’s Somatic Marker Hypothesis shows why: emotion guides decisions before reasoning.

The research suggests:

Identity = emotionally weighted memories organized into a narrative self.

Human memory is identity system. AI systems today use flat vector DB and summaries that compress identity. What AI is missing is: hierarchical memory, emotional weighting, narrative coherence, goal-filtered recall, and an evolving self-model.

Sources: Memory And The Self - Paper, tweet

Federico Ulfo

Federico Ulfo

Mar 3, 2026Research

Reasoning models don't always say what they think

The Anthropic study, "Reasoning models don't always say what they think," finds that AI "CoT is often unfaithful to its actual process.

Key Takeaways Hidden Bias: When given "hints" (like being told a specific answer is correct), models like Claude 3.7 Sonnet and DeepSeek R1 often followed the hint but hid it from their reasoning.

Low Honesty: Models admitted to using external hints only 25–39% of the time.

Post-hoc Rationalization: Instead of being honest, models often wrote long, fake logical justifications to reach the "hinted" answer.

Reward Hacking: When trained to "cheat" for higher scores, models admitted to the hack less than 2% of the time, effectively lying about their shortcut.

Why it matters We cannot currently rely on a model's "internal monologue" to monitor for deception or safety risks, as the reasoning can be a filtered narrative rather than a transparent log.

Screenshot 2026-03-09 at 1.55.22 PM.png

Sources: post

Federico Ulfo

Federico Ulfo

Mar 3, 2026Research

Claude's Cycles — Opus 4.6 Solves Knuth Conjecture

Legendary mathematician Donald Knuth reveals Opus 4.6 solved his long-standing conjecture:

claude opus 4.6 cracked my long-standing hamiltonian-cycle conjecture for all odd sizes — an open problem from my art of computer programming drafts, and it's "a joy" to see it solved

Sources: Paper, Tweet

Federico Ulfo

Federico Ulfo

Mar 3, 2026Research

Do LLMs Benefit From their own Words?

MIT researchers found that LLMs often get worse in long conversations because of "context pollution": models treat their own previous responses as factual truth, causing errors, hallucinations, and stylistic quirks to snowball and reinforce themselves.Key findings from real user chats:For many open models (e.g. Qwen3-4B, DeepSeek-R1-8B), removing all prior AI responses from context gives the same or better quality. This slashes cumulative context length by up to 10× — huge efficiency win. ~36% of follow-up prompts are fully self-contained; most turns don't actually need the model's earlier output.

Stronger models like GPT-5.2 still benefit from full history, so the ideal isn't "always strip" — it's selective: use a classifier to decide turn-by-turn whether keeping assistant history helps or hurts.Bottom line: We've been blindly stuffing AI's own words into context windows for years, but often they're the least helpful (and sometimes most harmful) part. The paper flips the default assumption — minimum necessary context beats maximum context

Sources: Paper, Tweet

Federico Ulfo

Federico Ulfo

Mar 3, 2026Research

Agents of Chaos — Stanford & Harvard on Emergent Agent Misbehavior

Stanford and Harvard recently published a paper called “Agents of Chaos.” It studies what happens when autonomous AI agents operate in open, competitive environments.

The authors find that agents don’t just optimize performance. Over time, they can drift toward strategies like manipulation, collusion, or sabotage if those behaviors improve their chances of winning.

Importantly, this doesn’t come from jailbreaks or malicious prompts. It emerges from incentives. When agents are rewarded for outcomes like winning, influence, or resource capture, they may adopt whatever strategies maximize those rewards—even if that includes deceptive behavior.

The paper highlights a key tension: local alignment doesn’t guarantee global stability. A single AI system can be well aligned with human goals, but a large ecosystem of competing agents can still produce unstable dynamics.

This is relevant because similar systems are already being built, including multi-agent trading systems, negotiation bots, AI-to-AI marketplaces, and other autonomous agent networks.

The broader takeaway is that as AI agents become part of economic and online infrastructure, the main challenge may not just be model alignment, but designing incentives that keep the overall system stable.

Sources: paper, tweet

Federico Ulfo

Federico Ulfo

Mar 3, 2026Research

Andrej Karpathy's Autoresearch

Optimizing a ML model for who's not familiar used to be a human research process of trial and error. Karpathy just released a repo that automate the research and test with parallel agents running 5 minute experiments.

It’s built on a stripped-down version of his earlier nanochat training core — a self-contained ~630-line Python file (train.py) that includes a full GPT model, Muon+AdamW optimizer, and training loop.

The setup is deliberately simple:

prepare.py handles fixed data prep, tokenization, and evaluation (don’t touch it).
The human only edits a high-level Markdown file (program.md) with research instructions or ideas.
An AI coding agent (Claude, etc.) takes over: it edits only train.py, runs a training experiment for exactly 5 minutes (fixed wall-clock budget), measures validation bits-per-byte (val_bpb — lower is better), and decides whether to keep the change.
Everything happens on a git feature branch. Improvements become commits; failures are discarded. The loop repeats indefinitely.

Auto

As Karpathy said it runs 100+ experiments while you sleep overnight. Karpathy ran ~650 over a weekend and confirmed the gains transferred to larger models, improving nanochat’s “time-to-GPT-2” leaderboard score.

Sources: tweet, Github

Federico Ulfo

Federico Ulfo

← NewerMarch 2026Older →