Let me tell you something funny.
DeepSeek — the AI company that's been flexing its text muscles for over a year — finally added vision to its toolkit. You can now upload a photo and have it actually see what's in it. Sounds cool, right?
Here's the punchline: Show it a picture of its own CEO, Liang Wenfeng, and it'll confidently tell you it's Jack Ma. Or Pony Ma. Or some random guy. It has no freaking clue who its own boss is [citation:1].
This went live about 15 hours ago [citation:2]. And honestly? It's exactly the kind of chaotic energy I've come to expect from DeepSeek. They nail the engineering, but the polish? Not quite there yet.
First things first: this isn't just a glorified OCR tool. It's a full visual understanding system that sits right next to "Fast Mode" and "Expert Mode" in the DeepSeek interface [citation:3].
The technical magic behind it is called "Thinking with Visual Primitives" [citation:4]. Here's the dumbed-down version:
This matters because most vision models are terrible at "pointing." They'll say "the thing on the left" and you have no idea what they mean. DeepSeek's approach is different — it anchors its reasoning to actual coordinates in the image [citation:5].
The efficiency is also wild. Processing an 800×800 image costs roughly 90 tokens — while Claude Sonnet needs about 870 and Gemini needs around 1,100 [citation:6]. That's not just clever. It's competitive advantage territory.
When it works, it really works.
Test it with a Shanghai skyline photo — it IDs four major buildings and correctly calls out "the white bridge is probably Zhapu Road Bridge," which is a classic photography spot [citation:7].
Show it a museum artifact, and it'll describe textures and infer historical styles. Show it a meme, and it gets the joke [citation:8]. Show it a webpage layout, and it can spit back working code [citation:9].
For screenshots, tables, documents, and flowcharts, it's fast and reliable [citation:10]. That's genuinely useful for anyone who works with messy digital content.
But here's the problem: the failures are spectacular.
It can't recognize its own CEO. Multiple tests show DeepSeek misidentifying Liang Wenfeng as Jack Ma, Pony Ma, or just saying "I really don't know" after thinking for minutes [citation:11].
It doesn't know anyone recent. Show it Cape Verde goalkeeper Vozinha — who went viral in the last week — and it'll "think" for over a minute, mention Cape Verde multiple times, then give a completely wrong answer [citation:12].
Why? Because the training data stops at 2025 [citation:13]. And there's no internet search in Vision Mode — unlike competitors like Doubao that auto-connect to search results [citation:14].
Here's where it gets really awkward. In late April, DeepSeek published their technical report on this vision model — then took it down hours later. The GitHub page returned a 404 [citation:15].
Was it leaked? Was it incomplete? Did they reveal too much? Nobody knows. But it's not a good look for a company that just raised a reported $7 billion in Series A funding [citation:16].
Look, it's easy to clown on DeepSeek for the CEO fail. But this isn't really about celebrity recognition.
This is about agents.
DeepSeek's V4 model, released in April, was all about Agent capabilities — getting AI to actually do things. But a text-only agent hits a ceiling fast. If you want AI that can:
…it needs to see. Full stop [citation:17].
And DeepSeek might be the late but smart player here. While others rushed to slap on mediocre multi-modal features, DeepSeek waited until they had something different — something that prioritizes pointing accuracy over pixel resolution [citation:18].
The early benchmarks back this up. In counting tasks, DeepSeek scored 89.2% vs Gemini's 88.2% and GPT-5.4's 76.6%. In maze navigation, it scored 66.9% vs GPT-5.4's 50.6% [citation:19]. That's not just competitive. That's dominant in specific niches.
DeepSeek's Vision Mode is technically brilliant but practically unpolished.
The "visual primitives" approach is genuinely innovative. The token efficiency is a legitimate advantage. The benchmark numbers are impressive.
But.
A vision model that can't recognize its own CEO is a bad look. A technical paper that disappears overnight is a red flag. A feature that can't identify anyone or anything from the last 18 months is a hard limit.
Should you try it? Yes. It's free, it's fast, and it's genuinely useful for a lot of tasks.
Should you trust it with anything important? Not yet. Not without double-checking.
DeepSeek has a history of shipping things that start rough and become dominant. This might be one of those times. But right now? It's not there yet.
DeepSeek finally gave its AI eyes — and it's already embarrassing itself in public.
But honestly? That's kind of what makes this fun to watch. AI is still figuring out how to point at things without looking like a toddler. And DeepSeek, for all its flaws, is trying something genuinely different than the pixel-stacking approach everyone else is taking.
The CEO thing? It's funny. It's embarrassing. It'll get fixed.
The deeper question is: does this approach actually work better in the long run? My gut says yes. But I'm not holding my breath until I see it handle a real-world dashboard without hallucinating half the data.
What's your experience been with DeepSeek's Vision Mode? Drop a comment below — I'm genuinely curious.
Keywords: DeepSeek Vision Mode, DeepSeek multi-modal, DeepSeek CEO fail, AI vision review, DeepSeek opinion, Chinese AI models, visual primitives, Osuji Miracle, FLUXVIA
Sources: DeepSeek识图模式正式上线 • 凤凰网: 认不出梁文锋 • DoNews: 技术报告 • IT之家: 正式上线
Published June 18, 2026 • Opinion • 6 min read • FLUXVIA
Most read from Fluxvia Journal — for developers, designers, and curious minds.
AI isn't replacing developers — it's upgrading them. Explore the real state of agentic software development in 2026.
Complete beginner's guide to building consistent, scalable products with atomic design methodology and real-world examples.
It's not for the computer — it's for the humans who will read, maintain, and debug it. An honest look at programming philosophy.
Step-by-step Node.js + Express tutorial with CRUD operations, testing, and React integration. No prior experience needed.
Master the 4-phase protocol and mindset shifts that separate experts from beginners. Includes GDB, Valgrind, and prevention strategies.
DeepSeek's new Vision Mode launched 15 hours ago — and it's already embarrassing itself. Here's why it can't recognize its own CEO.

"Modern web tools, guides, and resources built for developers by developers. Always free, always client-side."
Resources
Support




All Rights Reserved.
Comments
0