AI Socratic
November 2025
Research

DeepSeek-OCR: Context Compression Through Optical 2D Mapping

DeepSeek AI has released DeepSeek-OCR, an approach to compressing long contexts via optical 2D mapping: text is rendered as an image and encoded into a small number of vision tokens. The system shows that vision-based compression can handle text-heavy documents efficiently, which could change how large language models (LLMs) process long textual inputs.

The DeepSeek-OCR system consists of two components: DeepEncoder as the encoder and DeepSeek3B-MoE-A570M as the decoder. Together, they reach about 97% OCR precision when the compression ratio is below 10× (that is, fewer than 10 text tokens per vision token). Even at an aggressive 20× compression ratio, accuracy stays at approximately 60%.
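
To make the ratio concrete, here is a minimal sketch of the arithmetic. The function names and the token counts are made up for illustration and are not part of DeepSeek-OCR.

```python
import math


def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Text tokens represented per vision token."""
    return text_tokens / vision_tokens


def vision_token_budget(text_tokens: int, target_ratio: float) -> int:
    """Vision tokens needed to hold `text_tokens` at a given compression ratio."""
    return math.ceil(text_tokens / target_ratio)


# Hypothetical page with 1,000 text tokens.
print(compression_ratio(1000, 100))     # ratio when the page is encoded as 100 vision tokens
print(vision_token_budget(1000, 20.0))  # vision tokens available at the aggressive 20x setting
```

At the same page length, doubling the ratio halves the number of vision tokens the decoder has to work with, which is consistent with the accuracy drop reported above.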

Karpathy asks whether all LLM inputs should actually be images. The advantages he lists are:

  • more information compression (see paper) => shorter context windows, more efficiency
  • significantly more general information stream => not just text, but e.g. bold text, colored text, arbitrary images
  • input can be processed with bidirectional attention easily and by default, instead of autoregressive attention; that is a lot more powerful.
  • the tokenizer must go. It imports all the ugliness of Unicode and byte encodings, along with a lot of historical baggage and security/jailbreak risks.


Federico Ulfo
Research

Language Models Are Injective And Hence Invertible

· Claim: Decoder‑only transformer LMs are almost‑surely injective: different prompts map to unique last‑token hidden states; this holds at initialization and is preserved under gradient descent.

· Method: Prove components are real‑analytic, show collisions occur only on a measure‑zero parameter set, and that GD updates don’t move parameters into that set in finite steps.

· Evidence: Billions of collision tests on six SOTA LMs found no collisions.

· Algorithm (SipIt): Reconstructs exact input text from hidden activations by exploiting causality; sequentially matches each token’s hidden state given the known prefix; offers linear‑time guarantees.

· Failure cases: The result applies to decoder‑only transformers with analytic activations and continuous initialization; quantization, weight tying, duplicated embeddings, or non‑analytic components can break injectivity. So there are still ways to preserve the "privacy" of the prompt.
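
The prefix-by-prefix idea behind SipIt can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the "model" is a tiny random recurrent map standing in for a causal transformer, and the search simply tries every vocabulary token at each position. It is not the paper's implementation.

```python
import math
import random

VOCAB_SIZE = 50
DIM = 8

_rng = random.Random(0)
# Hypothetical toy parameters drawn from a continuous distribution.
EMBED = [[_rng.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB_SIZE)]
MIX = [[_rng.gauss(0, 1) / math.sqrt(DIM) for _ in range(DIM)] for _ in range(DIM)]


def step(prev_state, token):
    """Causal update: the new state depends only on the prefix state and the current token."""
    return [
        math.tanh(sum(MIX[i][j] * prev_state[j] for j in range(DIM)) + EMBED[token][i])
        for i in range(DIM)
    ]


def hidden_states(tokens):
    """Return the hidden state observed after each position of the prompt."""
    state, states = [0.0] * DIM, []
    for token in tokens:
        state = step(state, token)
        states.append(state)
    return states


def invert(observed_states):
    """Recover the prompt left to right by matching each observed state given the known prefix."""
    recovered, prefix_state = [], [0.0] * DIM
    for target in observed_states:
        # Try every candidate token and keep the one whose state is closest to the target.
        best = min(
            range(VOCAB_SIZE),
            key=lambda v: sum((a - b) ** 2 for a, b in zip(step(prefix_state, v), target)),
        )
        recovered.append(best)
        prefix_state = step(prefix_state, best)
    return recovered


prompt = [3, 17, 17, 42, 0, 8]
print(invert(hidden_states(prompt)) == prompt)
```

In this toy version the cost is at most sequence length times vocabulary size, so it grows linearly with the length of the prompt. The recovery only works because distinct tokens give distinct states for a fixed prefix, which is the injectivity property the paper argues for.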

Paper: arxiv.org/abs/2510.15511

Federico Ulfo
Research

LLM as a Judge — ranking full slates by preference

Instead of simulating clicks and scrolls, researchers let LLMs reason about which playlist, feed, or product lineup you’d actually prefer.

And it worked. Across Amazon, Spotify, MovieLens, and MIND datasets, they found:

  • LLMs can rank full slates (not just single items) with strong coherence
  • Logical consistency directly predicts preference accuracy
  • Pretrained models generalize; no fine-tuning required
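
One way to read "logical consistency" is transitivity of pairwise judgments: if the judge prefers slate A over B and B over C, it should also prefer A over C. The sketch below measures that on a table of pairwise outcomes. The choice of transitivity as the consistency check, the function name, and the example judgments are all assumptions for illustration, not the researchers' protocol.

```python
from itertools import permutations


def transitivity_rate(prefers):
    """Fraction of chains a > b > c for which the judge also said a > c.

    `prefers[(a, b)]` is True when the judge picked slate a over slate b.
    """
    slates = sorted({s for pair in prefers for s in pair})
    checked = consistent = 0
    for a, b, c in permutations(slates, 3):
        if prefers.get((a, b)) and prefers.get((b, c)):
            checked += 1
            if prefers.get((a, c)):
                consistent += 1
    return consistent / checked if checked else 1.0


# Hypothetical judge outputs over three slates.
coherent = {("A", "B"): True, ("B", "C"): True, ("A", "C"): True}
cyclic = {("A", "B"): True, ("B", "C"): True, ("C", "A"): True}

print(transitivity_rate(coherent))
print(transitivity_rate(cyclic))
```

A judge that produces preference cycles scores low on this check, and a score like this could then be compared against how often the judge agrees with real user preferences.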

link: x.com/alxnderhughes/status/1988202281314251008

Federico Ulfo
Research

Does RL improve LLM reasoning? (NeurIPS 2025 top paper)

This paper received the top score at NeurIPS 2025. It asks: does RL make LLMs better reasoners?

The authors study Reinforcement Learning with Verifiable Rewards (RLVR) and find that while it improves pass@k accuracy for small k, it doesn’t create new reasoning patterns, which means the base model still determines the upper limit of reasoning ability.
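
For readers unfamiliar with pass@k, here is a minimal sketch using the standard unbiased estimator (the probability that at least one of k samples, drawn from n generated samples of which c are correct, solves the problem). The per-problem counts are invented purely to illustrate the shape of the argument and are not results from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


N_SAMPLES = 100
# Hypothetical correct-sample counts on two problems (illustrative only).
rl_like = [80, 0]    # very reliable on problem 1, never solves problem 2
base_like = [30, 2]  # less reliable, but occasionally solves problem 2

for k in (1, 64):
    print(k, round(mean_pass_at_k(rl_like, N_SAMPLES, k), 3),
          round(mean_pass_at_k(base_like, N_SAMPLES, k), 3))
```

With these made-up counts the RL-like model wins at k = 1, while the base-like model wins at k = 64 because it has nonzero coverage on the second problem. That is the pattern the summary above describes: sharper sampling at small k without a higher ceiling.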

Interestingly, it’s distillation, not RL, that shows genuine signs of emergent reasoning 😮.

link: x.com/jiqizhixin/status/1987710546674856051
web: limit-of-rlvr.github.io

Federico Ulfo
Research

Continuous Autoregressive Language Models (CALM)

Tencent and Tsinghua have released a paper called Continuous Autoregressive Language Models (CALM) that challenges the “next-token” paradigm every LLM is built on.

Instead of predicting one token at a time, CALM predicts continuous vectors that represent multiple tokens at once.

Meaning: the model doesn’t think “word by word”… it thinks in ideas per step.

→ 4× fewer prediction steps (each vector = ~4 tokens)
→ 44% less training compute
→ No discrete vocabulary: pure continuous reasoning
→ New metric (BrierLM) replaces perplexity entirely
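
The step-count claim is simple arithmetic, sketched below. The helper names and the sequence length are hypothetical, and the sketch only shows the grouping of tokens into chunks, not how CALM encodes a chunk into a continuous vector.

```python
import math


def chunk(tokens, tokens_per_vector):
    """Group a token sequence into consecutive chunks, one chunk per predicted vector."""
    return [tokens[i:i + tokens_per_vector] for i in range(0, len(tokens), tokens_per_vector)]


def prediction_steps(num_tokens: int, tokens_per_vector: int = 1) -> int:
    """Autoregressive steps needed when each step emits `tokens_per_vector` tokens."""
    return math.ceil(num_tokens / tokens_per_vector)


tokens = list(range(2048))            # hypothetical 2,048-token sequence
print(prediction_steps(len(tokens)))     # one step per token
print(prediction_steps(len(tokens), 4))  # one step per 4-token vector
print(len(chunk(tokens, 4)))
```

With roughly 4 tokens per vector, the number of sequential prediction steps drops by a factor of about 4, matching the first arrow above.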

link: x.com/rryssf_/status/1985646517689208919

Federico Ulfo