AI Socratic

DeepSeek AI has unveiled DeepSeek-OCR, an approach to compressing long contexts via optical 2D mapping: text is rendered as an image and represented by a small number of vision tokens. The system demonstrates that vision-based compression can handle text-heavy documents efficiently, which could change how large language models (LLMs) process long textual inputs.

The DeepSeek-OCR system consists of two primary components: DeepEncoder as the encoder and DeepSeek3B-MoE-A570M as the decoder. Together, they achieve about 97% OCR precision at compression ratios below 10× (that is, fewer than 10 text tokens per vision token). Even at an aggressive 20× compression ratio, the system maintains approximately 60% accuracy.
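To make the compression ratio concrete, here is a minimal sketch of the arithmetic. The function name and the token counts are hypothetical and chosen only for illustration; they are not taken from the DeepSeek-OCR code or paper.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Return how many text tokens each vision token stands in for."""
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


# Hypothetical page: 1,000 text tokens rendered into 100 vision tokens.
ratio = compression_ratio(1000, 100)
print(f"{ratio:.1f}x")  # prints "10.0x"
```

Under these assumed numbers the page sits right at the 10× boundary. Halving the vision-token budget to 50 would give 20×, the regime where the reported accuracy drops to roughly 60%.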

Karpathy asks whether all LLM inputs should actually be images. The advantages he lists are:

  • more information compression (see paper) => shorter context windows, more efficiency
  • significantly more general information stream => not just text, but e.g. bold text, colored text, arbitrary images
  • input can now be processed with bidirectional attention easily and as default, not autoregressive attention - a lot more powerful.
  • the tokenizer must go. It imports all the ugliness of Unicode and byte encodings, along with a lot of historical baggage and security/jailbreak risks.
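As one small illustration of the Unicode point in the last bullet, the sketch below shows two strings that render identically on screen but differ at the code-point and byte level, so a byte-based or text-based tokenizer sees two different inputs. This example is not from Karpathy's post; it uses only Python's standard `unicodedata` module.

```python
import unicodedata

# The same visible word, in composed (NFC) and decomposed (NFD) form.
composed = unicodedata.normalize("NFC", "caf\u00e9")
decomposed = unicodedata.normalize("NFD", "caf\u00e9")

print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 4 5
print(composed.encode("utf-8"))        # b'caf\xc3\xa9'
print(decomposed.encode("utf-8"))      # b'cafe\xcc\x81'
```

Rendered as pixels, both strings would look the same, which is the kind of ambiguity an image-based input avoids.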
