
Mapping AI Value Across Modalities

As I've said several times before, the most common mistake I see is starting with the technology. Business leaders want to know which AI tool to buy, which model to use, which vendor to choose. But that's backwards.



Instead, start by mapping your actual workflows and customer journeys, then match them to AI capabilities that solve real problems.

Generative AI isn't one thing. It's a collection of distinct capabilities across different modalities:

Text assets: Generating, summarizing, reviewing content; sentiment analysis; translation; code writing and review.

Visual assets: Generating and editing illustrations, photos, 3D models, videos; manipulating visual content.

Audio & voice: Speech-to-text and text-to-speech; voice cloning; real-time translation; music generation.

Data augmentation: Classifying and analyzing data and visuals; generating patterns from trends; semantic search.

Physical assets: Modeling and designing physical objects; simulating processes; guiding robotic systems.

The framework becomes powerful when you stop thinking about AI in the abstract and ask: "Which specific workflow step could benefit from which capability?"

A retail client struggled with returns. Customers would email photos of damaged products with descriptions, then wait 24-48 hours for manual review. The typical customer journey looked like this:

Customer discovers damage

Takes photo and writes description

Submits request

[Wait for human review] ← Friction point

Receives approval/denial

You match the friction point to visual assets (analyzing product photos) + text assets (understanding descriptions) + data augmentation (comparing against policies and patterns).

The solution: Multimodal AI processing both photo and description together, instantly determining if damage qualifies for return. Resolution time drops from 36 hours to 3 minutes.
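To make the "together" part concrete, here is a minimal Python sketch of how the three signals might be combined into one decision. It is an illustration, not the client's system: the field names, the 0.8 score threshold, the 30-day window, and the fallback to human review are all assumptions, and the vision and text models are represented only by their outputs.

```python
from dataclasses import dataclass


@dataclass
class ReturnRequest:
    # Both model-derived fields are hypothetical stand-ins for real model outputs.
    photo_damage_score: float         # visual assets: 0..1 damage score from a vision model
    description_reports_damage: bool  # text assets: flag extracted from the customer's text
    days_since_delivery: int          # data augmentation: input to the policy check


# Hypothetical policy values, chosen only for illustration.
MAX_DAYS = 30
AUTO_APPROVE_SCORE = 0.8


def triage(request: ReturnRequest) -> str:
    """Combine the visual, text, and policy signals into a single decision."""
    if request.days_since_delivery > MAX_DAYS:
        return "deny"
    if (request.photo_damage_score >= AUTO_APPROVE_SCORE
            and request.description_reports_damage):
        return "approve"
    # Weak or conflicting signals: route the request to a person.
    return "human_review"


print(triage(ReturnRequest(0.9, True, 5)))
```

The point of the sketch is that no single modality decides the outcome. The photo, the description, and the policy data each contribute one input to the same decision.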

A B2B software company had sales reps spending 40% of their time customizing proposals. Generic templates didn't win deals, and custom creation was too slow. The bottleneck was:

Discovery call with prospect

[Manual proposal creation - 8 hours] ← Friction point

Send to prospect

Match this to voice capabilities (transcribing calls) + text assets (generating customized content) + visual assets (creating industry diagrams) + data augmentation (analyzing past winning proposals).

The result: AI-assisted proposals from discovery notes, adapted from winning examples, with relevant visuals. Creation time drops to 90 minutes. Win rates increase 23%.
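The time saving is easy to quantify from the two figures above. A short Python calculation, using only the 8 hours and 90 minutes already quoted (the variable names are mine):

```python
# Figures taken from the proposal example above.
before_minutes = 8 * 60  # manual proposal creation: 8 hours
after_minutes = 90       # AI-assisted creation: 90 minutes

saved_minutes = before_minutes - after_minutes
reduction_pct = 100 * saved_minutes / before_minutes

print(f"Saved per proposal: {saved_minutes / 60:.1f} hours")  # 6.5 hours
print(f"Reduction in creation time: {reduction_pct:.2f}%")    # 81.25%
```

That is 6.5 hours back per proposal, or roughly an 81% cut in creation time.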

Map one complete workflow or customer journey. Identify the step that creates the most friction, waiting, or manual effort. Then look at the AI capabilities and ask: "Which capability, or combination of capabilities, could address this specific step?"

That's where AI creates business value. Not in the impressive demo, but in the matched capability solving the real problem.
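If you want to run the mapping exercise in a spreadsheet or script, it can be sketched roughly like this in Python. Treat it as one possible starting point under stated assumptions: the `Step` class, the friction score (waiting plus manual hours, equally weighted), and the capability labels are my own shorthand, and in the sample workflow only the 8 hours for proposal creation comes from the example above. The other durations are placeholders.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    wait_hours: float    # time the work sits idle
    manual_hours: float  # hands-on effort

    @property
    def friction(self) -> float:
        # Assumed scoring rule: waiting plus manual effort, equally weighted.
        return self.wait_hours + self.manual_hours


# The five modalities from this article, with capability labels paraphrased from it.
CAPABILITIES = {
    "text assets": {"text generation", "summarization", "translation", "code review"},
    "visual assets": {"image generation", "image editing", "video generation"},
    "audio & voice": {"speech-to-text", "text-to-speech", "voice cloning"},
    "data augmentation": {"classification", "data analysis", "semantic search"},
    "physical assets": {"object design", "process simulation", "robot guidance"},
}


def biggest_friction(steps: list[Step]) -> Step:
    """Return the step with the highest friction score."""
    return max(steps, key=lambda step: step.friction)


def candidate_modalities(needs: set[str]) -> list[str]:
    """List the modalities that cover at least one of the stated needs."""
    return [name for name, caps in CAPABILITIES.items() if caps & needs]


# Proposal workflow; only the 8 hours is sourced, the rest are placeholders.
workflow = [
    Step("Discovery call with prospect", wait_hours=0.0, manual_hours=1.0),
    Step("Manual proposal creation", wait_hours=0.0, manual_hours=8.0),
    Step("Send to prospect", wait_hours=0.0, manual_hours=0.25),
]

target = biggest_friction(workflow)
needs = {"speech-to-text", "text generation", "image generation", "data analysis"}
print(target.name)
print(candidate_modalities(needs))
```

The output is a shortlist of modalities to evaluate against the highest-friction step, not a recommendation for a specific tool or vendor.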

Frequently Asked Questions

What are the different AI modalities for business value?
The five AI modalities are: text assets (generating, summarizing, translating content), visual assets (generating and editing images, videos, 3D models), audio and voice (speech-to-text, voice cloning, translation), data augmentation (classifying, analyzing, semantic search), and physical assets (modeling objects, simulating processes, guiding robotics).

How do you map AI capabilities to business workflows?
Start by mapping your actual workflows and customer journeys, identify the step that creates the most friction or manual effort, then match specific AI capabilities, or combinations of capabilities across modalities, to that friction point. The value comes from solving real problems, not from impressive demos.

What is multimodal AI and how does it create business value?
Multimodal AI processes multiple types of input simultaneously, such as photos and text descriptions together. For example, a retail returns process that combines visual analysis of product photos with text understanding of damage descriptions can reduce resolution time from 36 hours to 3 minutes.

Originally published in Think Big Newsletter #9 on Amir Elion's Think Big Newsletter.
