DeepSeek Finally Got Eyes — And It Can't Even Recognize Its Own Boss

🔥 Opinion

DeepSeek Finally Got Eyes — And It Can't Even Recognize Its Own Boss

By Osuji MiracleFLUXVIA • June 18, 2026 • 6 min read

Let me tell you something funny.

DeepSeek — the AI company that's been flexing its text muscles for over a year — finally added vision to its toolkit. You can now upload a photo and have it actually see what's in it. Sounds cool, right?

Here's the punchline: Show it a picture of its own CEO, Liang Wenfeng, and it'll confidently tell you it's Jack Ma. Or Pony Ma. Or some random guy. It has no freaking clue who its own boss is [citation:1].

This went live about 15 hours ago [citation:2]. And honestly? It's exactly the kind of chaotic energy I've come to expect from DeepSeek. They nail the engineering, but the polish? Not quite there yet.

AI eye visualization
DeepSeek's new Vision Mode gives AI the ability to "see" — but the execution is uneven. Image via Unsplash

Okay, But What Does This Thing Actually Do?

First things first: this isn't just a glorified OCR tool. It's a full visual understanding system that sits right next to "Fast Mode" and "Expert Mode" in the DeepSeek interface [citation:3].

The technical magic behind it is called "Thinking with Visual Primitives" [citation:4]. Here's the dumbed-down version:

"Instead of just describing what it sees in words, DeepSeek can now point at objects in an image using bounding boxes and coordinates — like giving the AI a cyber finger."

This matters because most vision models are terrible at "pointing." They'll say "the thing on the left" and you have no idea what they mean. DeepSeek's approach is different — it anchors its reasoning to actual coordinates in the image [citation:5].

The efficiency is also wild. Processing an 800×800 image costs roughly 90 tokens — while Claude Sonnet needs about 870 and Gemini needs around 1,100 [citation:6]. That's not just clever. It's competitive advantage territory.

AI vision technology
DeepSeek's "visual primitives" approach points at objects with coordinates, not just words. Image via Unsplash

The Good, The Bad, and The Straight-Up Embarrassing

✅ What It Nails

When it works, it really works.

Test it with a Shanghai skyline photo — it IDs four major buildings and correctly calls out "the white bridge is probably Zhapu Road Bridge," which is a classic photography spot [citation:7].

Show it a museum artifact, and it'll describe textures and infer historical styles. Show it a meme, and it gets the joke [citation:8]. Show it a webpage layout, and it can spit back working code [citation:9].

For screenshots, tables, documents, and flowcharts, it's fast and reliable [citation:10]. That's genuinely useful for anyone who works with messy digital content.

❌ Where It Faceplants

But here's the problem: the failures are spectacular.

It can't recognize its own CEO. Multiple tests show DeepSeek misidentifying Liang Wenfeng as Jack Ma, Pony Ma, or just saying "I really don't know" after thinking for minutes [citation:11].

It doesn't know anyone recent. Show it Cape Verde goalkeeper Vozinha — who went viral in the last week — and it'll "think" for over a minute, mention Cape Verde multiple times, then give a completely wrong answer [citation:12].

Why? Because the training data stops at 2025 [citation:13]. And there's no internet search in Vision Mode — unlike competitors like Doubao that auto-connect to search results [citation:14].

🤡 The Embarrassing Part

Here's where it gets really awkward. In late April, DeepSeek published their technical report on this vision model — then took it down hours later. The GitHub page returned a 404 [citation:15].

Was it leaked? Was it incomplete? Did they reveal too much? Nobody knows. But it's not a good look for a company that just raised a reported $7 billion in Series A funding [citation:16].

"A vision model that can't recognize its own CEO and a technical paper that disappears overnight. That's not a bug. That's a vibe."

Why I Still Care About This (And You Should Too)

Look, it's easy to clown on DeepSeek for the CEO fail. But this isn't really about celebrity recognition.

This is about agents.

DeepSeek's V4 model, released in April, was all about Agent capabilities — getting AI to actually do things. But a text-only agent hits a ceiling fast. If you want AI that can:

  • Navigate browser interfaces
  • Read dashboards
  • Interpret UI elements
  • Analyze complex charts and PDFs

…it needs to see. Full stop [citation:17].

And DeepSeek might be the late but smart player here. While others rushed to slap on mediocre multi-modal features, DeepSeek waited until they had something different — something that prioritizes pointing accuracy over pixel resolution [citation:18].

The early benchmarks back this up. In counting tasks, DeepSeek scored 89.2% vs Gemini's 88.2% and GPT-5.4's 76.6%. In maze navigation, it scored 66.9% vs GPT-5.4's 50.6% [citation:19]. That's not just competitive. That's dominant in specific niches.

Business collaboration
DeepSeek's vision mode is a prerequisite for true AI agents — and they might be late, but they're not wrong. Image via Unsplash

🧠 My Honest Take

DeepSeek's Vision Mode is technically brilliant but practically unpolished.

The "visual primitives" approach is genuinely innovative. The token efficiency is a legitimate advantage. The benchmark numbers are impressive.

But.

A vision model that can't recognize its own CEO is a bad look. A technical paper that disappears overnight is a red flag. A feature that can't identify anyone or anything from the last 18 months is a hard limit.

Should you try it? Yes. It's free, it's fast, and it's genuinely useful for a lot of tasks.

Should you trust it with anything important? Not yet. Not without double-checking.

DeepSeek has a history of shipping things that start rough and become dominant. This might be one of those times. But right now? It's not there yet.


The Bottom Line

DeepSeek finally gave its AI eyes — and it's already embarrassing itself in public.

But honestly? That's kind of what makes this fun to watch. AI is still figuring out how to point at things without looking like a toddler. And DeepSeek, for all its flaws, is trying something genuinely different than the pixel-stacking approach everyone else is taking.

The CEO thing? It's funny. It's embarrassing. It'll get fixed.

The deeper question is: does this approach actually work better in the long run? My gut says yes. But I'm not holding my breath until I see it handle a real-world dashboard without hallucinating half the data.

What's your experience been with DeepSeek's Vision Mode? Drop a comment below — I'm genuinely curious.


Disclaimer: This is an opinion piece based on early testing and publicly available information as of June 18, 2026. DeepSeek's Vision Mode launched approximately 15 hours prior to publication. All claims are sourced where possible. Fast-moving AI means capabilities may change — but the CEO fail is already legendary.

Keywords: DeepSeek Vision Mode, DeepSeek multi-modal, DeepSeek CEO fail, AI vision review, DeepSeek opinion, Chinese AI models, visual primitives, Osuji Miracle, FLUXVIA

Sources: DeepSeek识图模式正式上线凤凰网: 认不出梁文锋DoNews: 技术报告IT之家: 正式上线

Published June 18, 2026 • Opinion • 6 min read • FLUXVIA

Comments

0
Loading comments...

📬 Fluxvia

"The best stories are written in code and ink."
Reader Reader Reader Reader +1k

Join 2,000+ readers. Weekly insights on tech, creativity & code.

"Modern web tools, guides, and resources built for developers by developers. Always free, always client-side."

Resources

Support

Have a tool request or feedback?

Reach out anytime.

📧 fluxviatech@gmail.com

All Rights Reserved.