May 13, 2026

Multimodal AI — The shift that could change how we interact with technology

For the last few years, most people have experienced AI through text.

Ask a question.
Generate some code.
Summarise a document.

That’s been the dominant interaction model.

What’s starting to happen now is much bigger.

AI is becoming multimodal, meaning models are increasingly able to understand and work across multiple forms of information at the same time. Text, images, audio, video and real-world context are beginning to merge into a single system of understanding.

And that changes what AI can actually become.

From language models to world models

Early generative AI systems were largely built around language.

They became incredibly good at predicting and generating text, which opened the door to everything from co-pilots to AI search.

Multimodal AI expands that capability.

Instead of understanding one format in isolation, models can now combine different inputs simultaneously. An AI system can analyse an image while understanding spoken instructions, interpret video in real time or connect written information to visual context.

That sounds subtle on paper, but it’s a major shift.

It moves AI closer to how humans actually process the world.

We don’t separate information into neat categories. We combine visual signals, sound, language and context constantly. Multimodal systems are starting to operate in a similar way.
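
To make that concrete, here’s a minimal sketch of a single request that mixes text and an image. It uses one vendor’s Python SDK purely as an illustration; the model name, image URL and prompt are placeholders, and other providers expose similar interfaces.

    from openai import OpenAI

    client = OpenAI()  # assumes an API key is set in the environment

    # One request, two modalities: a written question plus an image to ground it.
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What is this architecture diagram showing?"},
                    {"type": "image_url", "image_url": {"url": "https://example.com/diagram.png"}},
                ],
            }
        ],
    )

    print(response.choices[0].message.content)

The notable part isn’t the syntax. It’s that the model reasons over both inputs in the same pass, rather than handing the image off to a separate system.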

Why this matters now

The technology has developed quickly over the last 18 months.

Models are now capable of:

  • understanding images and diagrams

  • analysing video and audio

  • processing speech conversationally

  • interacting with interfaces visually

  • generating realistic media across formats

We’re already seeing early examples appear in day-to-day products.

Voice AI systems that can understand tone and context.
AI co-pilots that can interpret screenshots and workflows.
Real-time video generation.
Autonomous systems that can “see” and respond to environments.
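
Some of this is already only a few lines of code. As one sketch (again, a single vendor’s interface, with a placeholder filename), a voice note can be transcribed and then handled like any other text input:

    from openai import OpenAI

    client = OpenAI()

    # Transcribe a short voice note, then treat the transcript as ordinary text.
    with open("voice_note.mp3", "rb") as audio_file:  # placeholder file
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
        )

    print(transcript.text)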

The gap between software and perception is starting to close.

The impact on product design

One of the biggest long-term implications of multimodal AI is what it does to interfaces.

Traditional software has largely been designed around menus, dashboards and structured workflows.

Multimodal AI introduces a different model.

Interaction becomes more natural.

Instead of navigating software manually, users increasingly communicate intent through conversation, visuals or voice. The system interprets context and works out how to execute the task.

That changes how products are designed from the ground up.

The interface becomes less about navigation and more about communication.
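
As a rough sketch of that pattern, the function and model interface below are hypothetical; the shape is the point. The user supplies a screenshot and a plain-language request, and the system returns a structured action instead of expecting the user to find the right menu.

    import json
    from dataclasses import dataclass, field

    @dataclass
    class Action:
        name: str                        # e.g. "export_report"
        arguments: dict = field(default_factory=dict)

    def interpret_intent(screenshot_png: bytes, request: str, model) -> Action:
        """Map a screenshot plus a natural-language request to a structured action.

        `model` stands in for any multimodal model client; the call below is an
        assumed interface, not a specific library's API.
        """
        prompt = (
            "Given this screenshot and the user's request, reply with JSON "
            'of the form {"action": ..., "arguments": {...}}.\n'
            f"Request: {request}"
        )
        raw = model.generate(prompt=prompt, image=screenshot_png)  # assumed interface
        payload = json.loads(raw)
        return Action(name=payload["action"], arguments=payload.get("arguments", {}))

Much of the design work then sits behind that call: validating the returned action before executing it, and deciding what the system is allowed to do on the user’s behalf.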

Why engineering teams are changing

This shift is also influencing hiring and team structure.

Building multimodal systems requires a combination of disciplines that historically operated separately.

Machine learning.
Data infrastructure.
Frontend engineering.
Real-time systems.
Audio and video processing.
Infrastructure and compute optimisation.

The boundaries between these functions are starting to blur.

We’re seeing growing demand for engineers who understand how these systems fit together, rather than for specialists who operate within a single narrow layer.

That’s one reason why more companies are moving toward smaller, cross-functional engineering teams with broader technical capability.

The infrastructure challenge

One thing often overlooked in conversations around multimodal AI is the infrastructure required to support it.

These systems are computationally demanding.

Processing video, audio and real-time interactions simultaneously requires enormous amounts of compute, storage and optimisation. It also introduces challenges around latency, reliability and orchestration.

That’s creating significant demand in areas like:

  • AI infrastructure

  • GPU optimisation

  • distributed systems

  • data engineering

  • observability and monitoring

In many ways, the supporting ecosystem around multimodal AI may become just as important as the models themselves.
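
The latency and orchestration point is easy to see even in a toy sketch. The two processing functions below are placeholders standing in for real speech and vision model calls; the part that matters is running them concurrently and measuring the slowest path, because that path sets the response time of the whole interaction.

    import asyncio
    import time

    async def transcribe_audio(chunk: bytes) -> str:
        await asyncio.sleep(0.05)   # placeholder for a speech-to-text model call
        return "transcript"

    async def describe_frame(frame: bytes) -> str:
        await asyncio.sleep(0.08)   # placeholder for a vision model call
        return "frame description"

    async def handle_turn(audio_chunk: bytes, video_frame: bytes) -> dict:
        start = time.perf_counter()
        # Process both modalities in parallel, so latency is set by the slower
        # stream rather than the sum of both.
        transcript, description = await asyncio.gather(
            transcribe_audio(audio_chunk),
            describe_frame(video_frame),
        )
        latency_ms = (time.perf_counter() - start) * 1000
        return {"transcript": transcript, "scene": description, "latency_ms": latency_ms}

    if __name__ == "__main__":
        print(asyncio.run(handle_turn(b"...", b"...")))

At production scale the same structure also has to handle streaming input, retries, backpressure and GPU scheduling, which is where much of the demand listed above comes from.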

Where this goes next

Over the next five years, multimodal capability will likely become the default rather than the exception.

AI systems will increasingly:

  • understand environments visually

  • communicate naturally through voice

  • operate across devices and platforms

  • combine structured and unstructured information in real time

This is particularly important for areas like:

  • robotics

  • healthcare

  • autonomous systems

  • customer interaction

  • enterprise productivity

  • software development

The long-term opportunity is not just better chatbots.

It’s systems that understand context more like humans do.

What this means for companies

For technology leaders, the challenge isn’t deciding whether multimodal AI matters.

It’s understanding where multimodal AI changes the experience their product delivers, and what capabilities they need internally to support that shift.

The companies moving early are already thinking beyond isolated AI features.

They’re thinking about how AI becomes part of the core interaction model of the product itself.

That’s a much bigger strategic change.

____________________________________________________

A lot of AI discussion still focuses on models.

The more important shift may actually be interaction.

Multimodal AI changes how systems receive information, interpret context and respond to the world around them.

And over time, that probably changes how we interact with technology altogether.