The New Generation Shaping Computer Vision & Multimodal AI
Meet the rising researchers shaping the future of computer vision and multimodal AI. A deep dive into their work, impact, and why they matter in the next wave of AI innovation.
In recent years, computer vision and multimodal AI (vision, language, and other modalities) have shifted from niche research topics to central pillars of modern artificial intelligence. Tasks such as visual grounding, video-language understanding, embodied agents, and vision-language generation and reasoning have exploded, and with them a new generation of researchers is pushing the frontier.
Below are three rising stars worth watching, each making distinct and significant contributions:
1. Kai‑Wei Chang (UCLA)
Chang sits at the intersection of vision, language, and trust, addressing not just what a system knows, but how it behaves. As multimodal models scale, his focus on robustness and fairness becomes increasingly relevant.
- Associate Professor, Department of Computer Science, University of California, Los Angeles.
- Research focus: knowledge acquisition, natural language understanding, vision-language models, and the robustness and trustworthiness of machine learning systems.
- Key contributions: models such as VisualBERT (an early and influential transformer-based vision-language pretraining model) and DesCo (“Learning Object Recognition with Rich Language Descriptions”), which push vision-language object recognition beyond fixed label sets (a minimal VisualBERT usage sketch follows below).
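As a concrete illustration, here is a minimal, hedged sketch of querying VisualBERT through the Hugging Face transformers library. It assumes the publicly released uw-madison/visualbert-vqa-coco-pre checkpoint and substitutes random tensors for the detector-derived region features (and their assumed 2048-dimensional size) that a real pipeline would supply, so it only shows the input format, not a working system.

```python
# Minimal VisualBERT sketch -- illustrative only.
# Assumptions: the Hugging Face checkpoint "uw-madison/visualbert-vqa-coco-pre"
# and 2048-dim region features; random tensors stand in for real detector output.
import torch
from transformers import BertTokenizer, VisualBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertModel.from_pretrained("uw-madison/visualbert-vqa-coco-pre")

# Text side: an ordinary BERT-tokenized sentence.
inputs = tokenizer("A person riding a bicycle down the street", return_tensors="pt")

# Vision side: VisualBERT expects precomputed region features from an object
# detector (e.g., Faster R-CNN). Random values are used here as placeholders.
visual_embeds = torch.randn(1, 36, 2048)  # (batch, num_regions, feature_dim)
visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)

outputs = model(
    **inputs,
    visual_embeds=visual_embeds,
    visual_token_type_ids=visual_token_type_ids,
    visual_attention_mask=visual_attention_mask,
)
print(outputs.last_hidden_state.shape)  # text tokens + visual regions, hidden size 768
```

The point of the design is that text tokens and visual regions share one transformer, so attention can relate words to image regions directly rather than fusing them only at the end.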
2. Soujanya Poria (NTU / SUTD)
Poria’s work is critical in making multimodal AI inclusive (across languages and modalities), safe (emotion, bias), and applicable in real-world interactive systems. As models expand beyond images and text into audio and embodied interaction, his insights shape how they behave, adapt, and are evaluated.
- Associate Professor in AI and multimodal machine learning at the Singapore University of Technology and Design (SUTD); previously affiliated with Nanyang Technological University (NTU).
- Research focus: multimodal machine learning (text, audio, vision), affective computing (emotion recognition, sentiment), and responsible/safe multimodal systems.
- Key work: established datasets and frameworks for multimodal analysis (e.g., multilingual and multimodal sentiment datasets) and advanced methods for combining vision, language, and audio in interactive systems.
3. Ling Shao (MBZUAI / Terminus Group)
Shao bridges foundational research (vision-language, generative models) and deployment at scale (AIoT, smart cities, industrial systems). As vision-language systems move into the physical world and onto edge devices, his systems-level thinking becomes increasingly important.
- British-Chinese computer scientist and entrepreneur: founder of the Inception Institute of Artificial Intelligence (IIAI) and deeply involved in multimodal AI for vision-and-language, medical imaging, AIoT, and smart cities.
- Research interests: computer vision, machine learning, generative AI, vision + language, medical image analysis and large-scale real-world AI systems (smart city, IoT).
Why These Areas Are Hot Right Now
- Scale and convergence: Vision alone is no longer enough. Recent breakthroughs integrate vision and language (and other modalities) to understand, reason, generate and interact.
- Application explosion: From visual question answering and image & video generation to embodied agents (robots, AR/VR) and multimodal assistants.
- From algorithm to system: Researchers now build full pipelines spanning models, datasets, and deployment. The rising stars above reflect this shift.
- Ethics, trust & interaction: As models interpret and generate across modalities, questions of bias, safety, interpretability, and fairness become central.
- Edge/embedded vision: With smart devices, autonomous vehicles, drones, and AR/VR, vision and multimodal AI are leaving the lab and entering the real world.
What to Watch: Emerging Trends
- Vision-language-action models: Models that interpret images + text and act in the physical world (robots, embodied agents, AR/VR).
- Multilingual, multimodal datasets & models: A growing need for models that work across languages, modalities, and geographies.
- Robust, safe, trustworthy multimodal AI: As systems deploy, they must handle noisy inputs and adversarial scenarios while remaining fair, unbiased, and safe in interaction.
- Generative vision-language modeling at scale: Models that generate coherent visual-text narratives, video from text, interactive reasoning across modalities.
- Real-world system thinking: Embedding vision & multimodal AI into smart cities, IoT, healthcare, robotics.
- Continual, interactive multimodal learning: Learning over time, across modalities, adapting to new tasks and environments rather than static training.
Why These Researchers Matter to You
Whether you’re a content writer, part of a technical audience, or exploring creative opportunities:
- They represent where the field is going, not just where it has been. These researchers are shaping the next decade of vision & multimodal AI.
- Their work spans foundations (models, datasets) and applications (systems, deployment, trust), making it relevant for both research and product innovation.
- For your arts network, creative projects, or content writing, multimodal AI opens rich possibilities: vision + text + audio generative experiences, interactive installations, and visual narrative understanding. These researchers are enabling that future.
Fast Facts
What exactly is “multimodal AI”?
Multimodal AI refers to models and systems that process, integrate, or generate across more than one input or output modality, for example images, text, audio, and video. It moves beyond single-modality systems (vision only or language only) toward richer, more general intelligence.
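To make the definition concrete, here is a toy late-fusion sketch in PyTorch. It is purely illustrative: the image and text encoders are assumed to exist upstream, and every dimension and layer size is an arbitrary placeholder rather than any published architecture.

```python
# Toy multimodal fusion sketch in PyTorch -- illustrative only, with
# arbitrary placeholder dimensions rather than any published model.
import torch
import torch.nn as nn

class TinyMultimodalClassifier(nn.Module):
    def __init__(self, image_dim=512, text_dim=300, shared_dim=256, num_classes=10):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, shared_dim)  # image features -> shared space
        self.text_proj = nn.Linear(text_dim, shared_dim)    # text features -> shared space
        self.fusion = nn.Sequential(                        # late fusion of the two modalities
            nn.Linear(2 * shared_dim, shared_dim),
            nn.ReLU(),
            nn.Linear(shared_dim, num_classes),
        )

    def forward(self, image_features, text_features):
        img = torch.relu(self.image_proj(image_features))
        txt = torch.relu(self.text_proj(text_features))
        return self.fusion(torch.cat([img, txt], dim=-1))

# Usage with dummy features standing in for real image/text encoder outputs.
model = TinyMultimodalClassifier()
logits = model(torch.randn(4, 512), torch.randn(4, 300))
print(logits.shape)  # torch.Size([4, 10])
```

Real systems, including the models discussed above, use far richer fusion (for example cross-attention between modalities), but the basic idea of projecting each modality into a shared space and combining the results is the common starting point.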
Why focus on “rising stars” like Chang, Poria and Shao rather than established giants?
While foundational figures laid the groundwork for vision (ImageNet, CNNs, etc.), the rising generation focuses on the next wave: multimodal integration, system- and application-scale deployment, fairness, and multilingual and multimodal generalisation. Their work responds directly to the newest challenges in AI.
How can someone like me use this insight (content writer / creative network) for opportunities?
Look for:
- Generative systems combining vision + language + audio (interactive media, art installations)
- Research that emphasises trust, bias, robustness, interaction (important story angles)
- Datasets and models supporting cross-language, cross-culture, cross-modality (spotlight opportunities)
- Researchers building full stacks (models + systems + datasets) and thinking about end users, not just benchmark results (great for interviews and features).