
See It, Understand It, Fix It: How Vision Language Models and AR Are Reinventing Repair & Maintenance

  • 20 hours ago
  • 6 min read

Updated: 10 hours ago


  • Vision-language models (VLMs) can now see a machine and describe a fault, but describing is not repairing.

  • Turning perception into a guided, verified physical repair is a computer-graphics problem: geometry, real-time rendering and AR anchoring.

  • VIGA Entertainment Technology has spent close to a decade building exactly those engines for the Las Vegas Sphere, TVS Motor, satellite intelligence and large-scale digital twins.

  • We built a working demonstration that disassembles, diagnoses and guides the repair of a real device, part by part.

  • The same loop that guides a human technician today will guide a robot tomorrow.

What is VLM-powered maintenance, in one sentence?


It is a system that looks at a real machine, understands what it sees, and guides a human — or a robot — through the correct repair, step by step. A vision-language model handles perception. A geometry-and-CAD layer knows where every component sits and how it comes apart. Augmented reality draws the next instruction onto the object in front of you. And a feedback loop checks whether each step actually worked.

That last part — verification — is what separates a demo from a tool. Captioning a photo is easy. Confirming that the right bolt came loose, that the replacement part is seated, and that the machine is safe to power on is the hard, valuable part.


"Vertical diagram of the VIGA VLM maintenance pipeline: capture, VLM perception, geometry and CAD intelligence, AR overlay, and repair, with a real-time feedback loop."

Why is maintenance the single most valuable place to deploy a VLM right now?

Because maintenance is where a skills shortage, rising machine complexity, and zero tolerance for downtime all collide — and AI guidance pays back fastest there. Experienced technicians are retiring and taking decades of know-how with them. The market has already moved.


"Bar chart of the global augmented reality market growing from $99.8B in 2025 to $387B by 2031, source Mordor Intelligence."

The numbers that matter:

  • The global augmented-reality market is projected to grow from roughly $99.8B in 2025 to about $387B by 2031, a ~25% CAGR (Mordor Intelligence, 2026).

  • Remote assistance and maintenance is the single largest AR application — around 29% of the market in 2025 (Mordor Intelligence).

  • AR-headset adopters report 25–40% better first-time-fix rates and 20–34% faster repairs on complex, multi-component jobs (industry analyses, 2026).

  • Gartner expects roughly half of all service-management deployments to use AR field-service tools by 2026.

  • Predictive-maintenance programs already cut unplanned downtime by up to 50% and maintenance costs by 20–30%.

  • About half of enterprise data processing is moving to the edge by 2026 — close to the machine, low-latency, often offline.
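As a quick sanity check on the first figure in the list, the growth rate implied by the two quoted market sizes can be recomputed. This is a minimal sketch: the function name `cagr` is ours, and it assumes six annual compounding periods between 2025 and 2031.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.25 means 25%)."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted above: $99.8B in 2025 to $387B in 2031 (six annual steps).
rate = cagr(99.8, 387.0, 2031 - 2025)
print(f"Implied CAGR: {rate:.1%}")
```

The result lands at roughly 25%, consistent with the quoted CAGR.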

The perception layer has become a commodity. The opportunity is in everything that turns perception into a correct action.


Why does a computer-graphics studio have the edge here — and not a pure AI vendor?

Because the valuable, difficult part is not the perception. It is the geometry, the real-time anchoring, and the verification loop — and those are graphics and rendering problems we have solved for years. Plenty of teams can wire a camera to a model and get a caption back. Far fewer can close the loop into a guided, verified repair.
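To make "real-time anchoring" concrete, here is a minimal sketch of the geometry underneath it, assuming a simple pinhole camera model. It is an illustration, not VIGA's implementation: the function names, the camera intrinsics (`fx`, `fy`, `cx`, `cy`) and the screw position are all hypothetical. The idea is that the anchor is a fixed 3D point on the machine, and the overlay's pixel position is recomputed from the current camera pose on every frame.

```python
import math
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


def world_to_camera(point: Vec3, yaw_rad: float, translation: Vec3) -> Vec3:
    """Apply a world-to-camera transform: rotate about the Y axis, then translate."""
    x, y, z = point
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    tx, ty, tz = translation
    return (c * x + s * z + tx, y + ty, -s * x + c * z + tz)


def project(point_cam: Vec3, fx: float, fy: float,
            cx: float, cy: float) -> Optional[Tuple[float, float]]:
    """Pinhole projection to pixel coordinates; None if the point is behind the camera."""
    x, y, z = point_cam
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


# Hypothetical anchor: a screw 5 cm to the right of the machine's origin.
screw = (0.05, 0.0, 0.0)
intrinsics = dict(fx=800.0, fy=800.0, cx=320.0, cy=240.0)

# Two hypothetical camera poses: the technician steps sideways between frames.
pose_a = (0.0, (0.0, 0.0, 0.5))
pose_b = (0.0, (-0.1, 0.0, 0.5))

for yaw, translation in (pose_a, pose_b):
    pixel = project(world_to_camera(screw, yaw, translation), **intrinsics)
    print(pixel)
```

The 3D anchor never changes; only the pose does. In this example the highlight moves from pixel column 400 to column 240 as the camera shifts, which is what keeps it visually locked to the screw.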

At VIGA, we build systems, not isolated tools. The repair loop only works if every layer below it works together — and seven years of real-time CG forced us to be fluent across all of them.


"Capability stack showing VIGA spanning five layers: edge and GPU compute, real-time rendering, geometry and CAD intelligence, VLM perception, and AR delivery."

Most vendors sell you one slice of this stack. We own the whole thing — because to ship real-time graphics at production scale, we had to.


How does the VIGA repair loop actually work, step by step?

Perception flows into geometry, geometry into AR guidance, and the result flows back to perception for verification. Concretely:

  1. Detect & identify. The vision-language model looks at the assembled machine, names the components it sees, and flags the symptom.

  2. Anchor to geometry. We match the live view to the known CAD or mesh and resolve where each part is in 3D — even when it is partly hidden.

  3. Plan the disassembly. The system orders the steps in physically valid sequence. No "remove the part behind a part you haven't removed yet."

  4. Guide in AR. Each instruction is drawn onto the real object — highlight this screw, turn it this way — and stays locked in place as the technician moves.

  5. Verify & adapt. After every action, the VLM re-checks the scene. Did the part come free? Is the replacement seated? If not, it corrects course before any damage is done.

This is the orange loop in the diagram above — and it is the difference between a clever caption and a finished repair.
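Step 3, ordering the teardown in a physically valid sequence, can be sketched as a topological sort over a "blocked by" graph. This is one plausible way to do it, not a description of VIGA's planner: the function `plan_removal` and the fan part names are hypothetical, and the sketch assumes each part lists the parts that must come off before it. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def plan_removal(target: str, blocked_by: Dict[str, Set[str]]) -> List[str]:
    """Return a removal order that frees `target`, with every blocker removed first."""
    # Collect the target plus everything that directly or indirectly blocks it.
    needed: Set[str] = set()
    stack = [target]
    while stack:
        part = stack.pop()
        if part in needed:
            continue
        needed.add(part)
        stack.extend(blocked_by.get(part, set()))

    # Sort only that subgraph; predecessors (blockers) come out first.
    # A circular dependency raises graphlib.CycleError.
    subgraph = {part: blocked_by.get(part, set()) for part in needed}
    return list(TopologicalSorter(subgraph).static_order())


# Hypothetical fan: each part maps to the parts that must be removed before it.
FAN = {
    "front_grille": set(),
    "blade_nut": {"front_grille"},
    "blade": {"blade_nut"},
    "rear_grille": {"blade"},
    "motor": {"rear_grille"},
}

print(plan_removal("blade", FAN))
```

For the cracked blade, the plan is front grille, then blade nut, then blade. The rear grille and motor are never touched, because nothing requires removing them to reach the target.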


What does this look like in practice? (Our fan repair demonstration)


We built a working demonstration around a deliberately ordinary object: a household fan. Ordinary is the point. If the system can take a fan apart, find the fault, and guide the rebuild with missing context and awkward parts, the approach generalizes to far more serious equipment.


The following video shows a detailed, step-by-step repair of a fan, assisted by a VLM and an agent.


"Exploded view of a fan generated from geometry, with a vision-language model flagging the cracked blade as the fault."

The system reasons over an exploded view built from the geometry, identifies the cracked blade as the fault, sequences the teardown, and verifies the rebuild. Swap "fan" for a pump, a gearbox, an aircraft access panel, or a factory instrument, and the architecture is identical — only the geometry and the stakes change. That portability is the whole point.


Where is this heading — humanoid robots and autonomous maintenance?

The same loop that guides a person's hands will guide a robot's. Humanoid and mobile robots arriving on factory floors face the identical question a technician does: what is in front of me, where are its parts, and what is the correct next move?

A perception → geometry → action loop that guides a human in AR glasses is, with a different output device, the loop that guides a robot's end-effector. Push the compute to the edge — near the machine, low-latency, working even offline — and you get maintenance that no longer waits for a person or a network connection. We are deliberately building for both readers of that loop: the technician today, the autonomous system tomorrow.
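The "same loop, different output device" idea can be sketched as a small interface. Everything here is hypothetical and simplified: the class names, the `present` method and the `verify` callback are ours, and `verify` stands in for the VLM re-checking the scene after each action.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol


@dataclass
class Step:
    part: str
    action: str


class OutputDevice(Protocol):
    def present(self, step: Step) -> None: ...


class ARGlasses:
    """Shows the step to a human as an overlay (here, just logged)."""
    def __init__(self) -> None:
        self.log: List[str] = []

    def present(self, step: Step) -> None:
        self.log.append(f"overlay: {step.action} {step.part}")


class RobotArm:
    """Sends the same step to a robot as a command (here, just logged)."""
    def __init__(self) -> None:
        self.log: List[str] = []

    def present(self, step: Step) -> None:
        self.log.append(f"command: {step.action} {step.part}")


def run_loop(steps: List[Step], device: OutputDevice,
             verify: Callable[[Step], bool], max_attempts: int = 3) -> List[str]:
    """Present each step, then verify it; repeat a step until verification passes."""
    completed: List[str] = []
    for step in steps:
        for _ in range(max_attempts):
            device.present(step)
            if verify(step):
                completed.append(step.part)
                break
        else:
            # No attempt passed verification: stop rather than continue blindly.
            raise RuntimeError(f"could not verify: {step.action} {step.part}")
    return completed


steps = [Step("front_grille", "remove"), Step("blade", "replace")]
for device in (ARGlasses(), RobotArm()):
    run_loop(steps, device, verify=lambda step: True)
    print(device.log)
```

The loop itself does not know whether it is talking to a person or a machine. Only the `OutputDevice` changes, which is the portability the paragraph above describes.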


Who is VIGA, and why trust us with this?

VIGA Entertainment Technology is a Bengaluru-based team of computer-graphics engineers who build the digital engines behind complex, high-stakes industries, working close to the metal and the GPUs. Our work is proven in production, not in demos:

  • The Wizard of Oz at the Las Vegas Sphere — AI-driven production at one of the most demanding display venues on earth.

  • TVS Motor Company — real-time visualization and configuration for the automotive sector.

  • Satellite intelligence — selected for the NCIIPC AI Grand Challenge.

  • Large-scale digital twins across aerospace, defense, healthcare, sports and automotive, built in Unity, Unreal, WebGL and generative AI.

  • Movie Colab — an enterprise AI system that automates complex filmmaking workflows end to end.

  • Our technology was developed and validated within leading accelerator programs, including NVIDIA, Google and ElevenLabs.

We are systems-first, built for complexity, cross-industry, and AI-integrated by design. When we say an overlay must be locked to the part and a verification step must be right, we mean it — because in safety-critical work, it always had to be.


Frequently asked questions


What exactly is a vision-language model in this context? A VLM is an AI model that takes images or video plus text and reasons across both. It can look at a machine and describe, locate and explain what it sees. We pair it with geometry and AR so that understanding becomes a guided, verified repair rather than just a caption.


Do you need a CAD model of every machine to make this work? A good CAD or 3D model dramatically improves accuracy, but it is not always mandatory. Where models are missing, we can reconstruct geometry, and the VLM still provides perception. The richer your asset data, the tighter the guidance — so we help clients build that data asset as part of the engagement.


What hardware does the technician actually use? It runs on the phone or tablet you already have, and it shines on AR glasses such as XREAL, Meta and the new Android XR devices, as well as on enterprise headsets. The same pipeline can also feed a robot's camera. We are deliberately hardware-flexible.


Why a computer-graphics studio rather than a pure AI vendor? Because the hard, valuable part is the geometry, the real-time anchoring and the verification loop — graphics and rendering problems we have solved across film, VR and safety-critical training. Perception alone does not fix anything.


Which industries is this for? Manufacturing, automotive, aerospace MRO, energy and utilities, heavy equipment, and any field-service operation facing a skills gap and costly downtime. If a repair has steps and stakes, the approach applies.


Let's build the loop for your machines

If you have a repair, maintenance or inspection problem worth solving — especially the kind everyone else calls impossible — we would like to hear about it. We collaborate with teams at every stage, from first concept to production-scale systems.

Start a conversation: info@vigaet.com · vigaet.com

VIGA Entertainment Technology — computer graphics, in service of the real world. Bengaluru, India.

 
 
 
