Shenzhen TV Frontier of Innovation Exclusive Interview with Visincept (Full Text)
On May 16, 2026, the program Frontier of Innovation from Shenzhen TV visited Visincept. It focused on EgoTwin, the egocentric hand 3D alignment data engine for world models jointly launched by Visincept and Baidu Intelligent Cloud, and conducted an in-depth interview with Zhang Lei, Founder and CEO of Visincept, and Liu Wei, Co-founder. Below is the full interview.

Over the past year, the global tech community has increasingly aligned around one strategic direction: the convergence of the virtual and real worlds. Turing Award laureate Yann LeCun left Meta after 12 years to found AMI Labs. World Labs, established by "Godmother of AI" Fei-Fei Li, rolled out its World API in early 2026. With a single click, the tool can turn ordinary photos and videos into interactive 3D spaces that follow physical collision rules. Meanwhile, Chinese innovation teams have repeatedly secured top rankings in WorldArena, the benchmark for world models, and have made real progress in applying the technology across vertical industries.
Against this backdrop, Visincept, a startup based in Shenzhen, has emerged as a standout player in the fiercely competitive world model track. With superior egocentric 3D hand reconstruction and highly efficient data augmentation, the company is tackling the core pain points of embodied intelligence.
1. Bottlenecks for AI to Enter the Physical World
"Over the past few years, AI has made phenomenal achievements in the digital world, which marks a major milestone. People have long anticipated that AI could step into the physical realm. This means endowing AI with a physical form, allowing it to take over work once done solely by humans, and even surpass human capabilities."
Zhang Lei went on to point out the challenges: "From a research perspective, AI has run into substantial bottlenecks. We can easily access massive data across the internet for digital scenarios, yet it is extremely difficult to obtain physical-world data of equivalent scale. Data scarcity has therefore capped the performance of large models."
2. Solving the Problem of "Data Hunger"
The internet is flooded with text and images, enabling large models to master conversational skills with ease. However, real-world operational experience cannot be generated out of nothing. Tasks that humans complete effortlessly, such as folding clothes and wiping tables, become exorbitantly expensive once a robot has to learn them through basic physical interaction. Lacking sufficient hands-on experience, robots suffer from severe data hunger, driving researchers to seek fundamental solutions.
"We have spent years carrying out extensive foundational work, including object understanding and human pose and motion recognition in open environments. Those're all core technologies closely tied to world models," Zhang Lei noted. "Drawing on the latent space model proposed by scholar Yann LeCun in deep learning, and combining it with prevailing industry pain points, we recognized that the conditions for developing world models are now mature. As an innovative R&D team, we have accelerated the development of relevant algorithms, aiming to deliver fundamental contributions to the whole industry."
3. Enabling Models to Master Real-World Physical Rules
Globally, competition in embodied intelligence has turned white-hot. As world models gain growing traction and generative AI advances rapidly, enabling models to accurately simulate real physical laws has become the key to breakthroughs.
"What we aim to build is a world model designed to realize efficient interaction between robots and the physical environment. More fundamentally, we intend to reshape the paradigm where reinforcement learning learns from physical interaction experience," Zhang Lei explained. "Traditional language models are built for next token prediction, while world model focuses on next state prediction. Though the phrasing differs only slightly, the underlying technologies are worlds apart."
4. Core Elements of World Models
Traditional language models predict the next word based on context, while world models forecast how the real world will evolve in the next moment. Beyond simple back-and-forth dialogue in the virtual space, they reconstruct and deduce the physical logic of reality.
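The contrast can be made concrete with a toy Python sketch. It is illustrative only and is not Visincept's implementation: the function names `next_token` and `next_state`, the bigram counter and the hand-written point-mass dynamics are all assumptions chosen for brevity, standing in for learned models.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


def fit_bigram(tokens):
    """Count which token follows which (a toy stand-in for a language model)."""
    table = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev][nxt] += 1
    return table


def next_token(table, prev):
    """Next-token prediction: context in, most frequent following token out."""
    return table[prev].most_common(1)[0][0]


@dataclass(frozen=True)
class State:
    position: float  # metres (hypothetical 1D world)
    velocity: float  # metres per second


def next_state(state, force, dt=0.1, mass=1.0):
    """Next-state prediction: (state, action) in, predicted state out.

    Hand-written point-mass dynamics stand in for a learned world model.
    """
    velocity = state.velocity + (force / mass) * dt
    position = state.position + velocity * dt
    return State(position, velocity)


table = fit_bigram("the robot folds the shirt".split())
print(next_token(table, "robot"))        # the token seen after "robot"
print(next_state(State(0.0, 0.0), 2.0))  # the state after applying a force
```

The point of the sketch is the difference in interface: the first function maps a context to a symbol, while the second maps a state and an action to a predicted state of the world.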
Zhang Lei elaborated in detail: "We hold that a qualified world model needs three core attributes: Object-Centric, Action-Aligned and Causality-Driven. The object-centric design grants the model the ability to perceive objects and distill underlying causal rules from massive datasets. Action alignment is also critical. It allows us to efficiently align egocentric hand data with robotic arm operation data, so that we can fully leverage hand motion data and help robots interact with the environment more effectively. This has also been a major focus across the industry lately."
He emphasized: "The third attribute, causality-driven design, is the essential nature of world models. A world model must learn causal relationships, namely how the state of the world changes after an action is performed. Only causality-aware world models can integrate well with reinforcement learning and boost the intelligence level of robots in environmental interaction."
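As a rough illustration of "how the state of the world changes after an action is performed", the Python sketch below estimates action-conditioned transitions from logged experience and uses them to pick an action. Everything in it is an assumption made for the example: the toy door environment, the log entries, the names `fit_transitions`, `predict` and `plan_one_step`, and the use of a frequency table in place of a learned model.

```python
from collections import Counter, defaultdict

# Hypothetical logged experience: (state, action, next_state).
LOG = [
    ("door_closed", "press_handle", "door_open"),
    ("door_closed", "wave_hand", "door_closed"),
    ("door_closed", "press_handle", "door_open"),
    ("door_open", "wave_hand", "door_open"),
    ("door_open", "push_door", "door_closed"),
]


def fit_transitions(log):
    """Count observed outcomes for every (state, action) pair."""
    counts = defaultdict(Counter)
    for state, action, nxt in log:
        counts[(state, action)][nxt] += 1
    return counts


def predict(counts, state, action):
    """Return the most frequently observed outcome, or None if never seen."""
    outcomes = counts.get((state, action))
    if not outcomes:
        return None
    return outcomes.most_common(1)[0][0]


def plan_one_step(counts, state, goal, actions):
    """Return the first action whose predicted outcome equals the goal."""
    for action in actions:
        if predict(counts, state, action) == goal:
            return action
    return None


model = fit_transitions(LOG)
print(predict(model, "door_closed", "press_handle"))
print(plan_one_step(model, "door_closed", "door_open",
                    ["wave_hand", "press_handle"]))
```

The one-step planner hints at why such a model pairs naturally with reinforcement learning: an agent can query predicted outcomes before acting instead of discovering them only by trial and error.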
5. Visincept's Technological Advantages
Robots need to first identify objects, then understand actions, and finally comprehend the causal links between the two. For example, a closed door will only open when a hand presses down on its handle. This track has drawn widespread attention, yet market players adopt divergent technical routes, which places high demands on teams' industry insight.
"In 2022, our team developed the DINO model, which pushed Transformer-based object detection to state-of-the-art performance in computer vision. It has influenced nearly all subsequent research on visual object detection," Zhang Lei said with pride. "Our vision technologies are adopted by the team led by Professor Fei-Fei Li at Stanford University, as well as many domestic enterprises including Alibaba and Digua Robotics. For instance, we can reconstruct the 3D model of a person performing dances, martial arts and other movements in videos purely via visual analysis."
Regarding the team's long-term vision, he stated: "Our ultimate goal is to build a general-purpose model, and provide algorithm verification, data collection and iterative optimization support for robotics companies of all kinds."
6. EgoTwin: A Pivotal Step Toward Realizing the Vision
Any ambitious technological vision must be grounded in practical implementation. EgoTwin, the newly launched digital engine from Visincept, acts as the bridge between vision and reality. It converts everyday egocentric footage captured by humans into usable digital data for robots.
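One simple way to picture such a conversion is a retargeting step that maps reconstructed hand keypoints to a command for a parallel-jaw gripper. The Python sketch below is a generic heuristic offered for intuition only, not EgoTwin's method or API. It assumes 3D hand keypoints (in metres) have already been recovered from the footage, and the function names, the output fields and the 0.08 m maximum opening are all hypothetical.

```python
import math


def pinch_distance(thumb_tip, index_tip):
    """Euclidean distance between thumb and index fingertips, in metres."""
    return math.dist(thumb_tip, index_tip)


def hand_to_gripper(wrist_pos, thumb_tip, index_tip, max_opening=0.08):
    """Map one frame of hand keypoints to a hypothetical gripper command.

    The wrist position becomes the end-effector target, and the pinch
    distance becomes the jaw opening, clamped to the gripper's range.
    """
    opening = min(pinch_distance(thumb_tip, index_tip), max_opening)
    return {
        "target_position": tuple(wrist_pos),
        "gripper_opening": opening,
    }


# One made-up frame: fingertips 5 cm apart, wrist 30 cm in front of the camera.
frame = hand_to_gripper(
    wrist_pos=(0.30, 0.00, 0.20),
    thumb_tip=(0.00, 0.00, 0.00),
    index_tip=(0.03, 0.04, 0.00),
)
print(frame)
```

A real pipeline would also have to handle hand orientation, camera motion, occlusion and the differences in reach and shape between a hand and a robot arm, none of which this sketch attempts.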
"In developing world models, we prioritize aligning operational data between human hands and robots, two distinct types of operating agents," Zhang Lei explained. "EgoTwin solves a core industry challenge: human hand operation is the most natural way to collect behavioral data. Can we simply record people's daily activities with cameras, and convert the footage directly into data applicable for robots?"
"This is a shared concern throughout the industry. Two years ago, most people thought it was hardly achievable, but now numerous teams are working toward this goal," he added. "We plan to open up EgoTwin's capabilities to the industry as soon as possible, to help accelerate data collection and lower relevant costs across the sector."
7. Future Commercialization Layout
The industrial implementation of cutting-edge technology cannot be accomplished overnight. A phased and gradual commercialization roadmap has become an inevitable choice.
Liu Wei shared the company's plans: "When it comes to world models, embodied intelligence is our primary application direction. On the one hand, we will provide data-powered cognitive capabilities to large model developers and embodied intelligence enterprises at home and abroad. On the other hand, we will develop our own hardware based on world models. Adopting the 'model + hardware' business model, we will serve the manufacturing industry and other sectors. This is our long-term industrial layout."
"Moving forward with EgoTwin, we will first build it into a professional data engine. Working with partners including Baidu Intelligent Cloud, we will deliver data services to existing embodied intelligence companies and help them develop high-performance robots. Afterwards, we will refine and iterate our models with accumulated data feedback, develop proprietary robotic hardware, and ultimately provide hardware services to other robotics firms."