himala 🇩🇪

Multi-agent AI applications: from user jobs to reliable collaboration

Category

AI application

Duration

August 2024 — March 2025

While I was Principal Product Designer at an early-stage AI video production startup, we pivoted to building a digital AI assistant platform that would serve as a professional and personal life manager. We ran into the challenge of tying AI functionality to a well-defined user problem and to a clear picture of how automation would deliver proven, recurring value.

As the industry shifts from classic SaaS interfaces to AI-driven experiences, building effective agentic systems requires rethinking how users interact with semi-autonomous assistants. The case study below presents UX solutions that merge current interface patterns with the new capabilities of LLM workflows to help users complete high-priority tasks.

Problem

From the start of working on the AI assistant, we ran into the complexity of actually building an autonomous assistant that could accurately answer questions about the user based on their integrations and create tasks and calendar events that corresponded to user requests. The problem of accuracy and usability lies in the fact that self-directed actions by agents require far more background work and testing on our side before we can claim the system offers a solution.

Challenges

We are in the era of mixing UI and automated assistance. The chat-based interface is the most common and basic medium for that combination, with its awkward limitations (read email summaries in a chat?). Assuming the LLM is fine-tuned correctly, I believe the design process shifts from manually placing GUI components on the screen to orchestrating information into agentic workflows, with the user acting as the knowledge guide behind the prompts. Hence, we need to empower the customer to lead with the right instruments: take the strongest parts of the AI and classic API worlds and make them support each other progressively, because the priority is helping the user solve the problem and reach the goal, not the AI itself.

User jobs to orchestrate AI workflows

In our first months after pivoting, we learned that starting with user job stories shapes both product and technical decisions before any interface design begins. Since users talk to the LLM through prompts, user jobs become the foundation for future agent specialization in the system: each job story maps to specific workflows and coordination patterns.
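
To make the mapping concrete, here is a minimal sketch of how a job story could be tied to an agent specialization and its workflow steps. The stories, agent names, and steps are hypothetical placeholders, not our production configuration.

```python
# Illustrative sketch only: the job stories, agent names, and steps below are
# hypothetical placeholders, not the product's actual configuration.
JOB_STORY_TO_WORKFLOW = {
    "When I start my workday, I want a digest of overnight emails, "
    "so I can decide what needs a reply first": {
        "agent": "email_triage",
        "steps": ["fetch_unread", "classify_priority", "summarize", "propose_replies"],
    },
    "When a meeting is requested, I want a suggested slot on my calendar, "
    "so I can confirm it in one click": {
        "agent": "scheduling",
        "steps": ["parse_request", "check_availability", "draft_event", "ask_confirmation"],
    },
}
```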

Define a task artifact based on jobs

Let's look at the example of tasks. Based on the chain of actions a user would hypothetically take, we created a structure for the task object. Each artifact would include context, time-based indicators, and important actions. This becomes the blueprint for the workflow.
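
As an illustration, a task artifact along these lines could be modeled roughly as below; the field names are assumptions derived from the three groups above, not the schema we actually shipped.

```python
# A minimal sketch of the task artifact; field names are assumptions derived
# from the three groups described above (context, time-based indicators,
# important actions), not the production schema.
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class TaskAction:
    label: str                     # e.g. "Reply to client", "Book a room"
    requires_confirmation: bool    # real-world side effects need user approval

@dataclass
class TaskArtifact:
    title: str
    context: str                                              # source snippets the task was derived from
    source_refs: list[str] = field(default_factory=list)      # ids of related emails, events, issues
    due_at: Optional[datetime] = None                         # time-based indicators
    remind_at: Optional[datetime] = None
    actions: list[TaskAction] = field(default_factory=list)   # important actions
```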

Summarize the task creation process into a workflow

We mapped the path a real human would take to solve the problem and transformed it into prompted steps. By doing so before jumping into implementation, we made sure the team understood the workflow from different angles and was aligned on how we were trying to solve the problem.
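
A rough sketch of what that looks like: the human path encoded as an ordered list of prompted steps, executed one after another. The prompts and the call_llm client are illustrative stand-ins.

```python
# The human path encoded as ordered, prompted steps; the prompts and the
# call_llm client are illustrative stand-ins for what lived in our docs.
TASK_CREATION_WORKFLOW = [
    {"step": "extract", "prompt": "From the user's message and linked sources, extract the task intent and constraints."},
    {"step": "enrich",  "prompt": "Pull related context (threads, events, documents) a person would check before acting."},
    {"step": "draft",   "prompt": "Draft the task artifact: title, context summary, time-based indicators, proposed actions."},
    {"step": "confirm", "prompt": "Present the draft to the user and apply their edits before saving."},
]

def run_workflow(workflow, call_llm, state):
    # Each step's output feeds the next, mirroring how a person would work.
    for step in workflow:
        state[step["step"]] = call_llm(step["prompt"], state)
    return state
```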

How to earn customer's trust?

What is the likelihood that a customer would trust an AI system to answer an important email? I believe the goal is not to do and sound right, but to be collaborative and proactive. Trust is gained through iteration and visible progress, and the system should actively seek that feedback from the customer.

Show user how AI thinks

Because there is no control over user input and LLM output is prone to hallucination, an AI interface should be transparent and allow edits along the workflow steps (a sketch of how these principles could surface in the UI follows the list):

  1. Tell users which tools the AI is using for complex tasks.
  2. When errors happen, let users see what went wrong and give them the option to try again.
  3. Always ask permission before the AI changes real-world things (like sending emails or booking meetings).
  4. Show the AI's "thought process": what information it is looking at and why.
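
As a rough illustration, these principles could surface in the interface as a stream of agent events that the UI renders for the user; the event kinds and fields below are assumptions, not our actual API.

```python
# A rough sketch: the agent's work surfaced as a stream of events the UI can
# render; event kinds and fields are assumptions, not our actual API.
from dataclasses import dataclass
from typing import Literal

@dataclass
class AgentEvent:
    kind: Literal["tool_call", "reasoning", "error", "confirmation_request"]
    detail: str

def render_events(events: list[AgentEvent]) -> list[str]:
    """Turn raw agent events into user-facing lines."""
    lines = []
    for e in events:
        if e.kind == "tool_call":
            lines.append(f"Using tool: {e.detail}")                    # principle 1
        elif e.kind == "error":
            lines.append(f"Something went wrong: {e.detail}. Retry?")  # principle 2
        elif e.kind == "confirmation_request":
            lines.append(f"Needs your approval: {e.detail}")           # principle 3
        elif e.kind == "reasoning":
            lines.append(f"Why: {e.detail}")                           # principle 4
    return lines
```
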
Build smart prompts with good data

Having defined the main user job stories, we created Notion documentation that transforms user actions into clear instructions for an AI conversation. To reduce wrong answers (a sketch of the retrieval side follows the list):

  1. To tackle hallucination, track common problems and teach the LLM to fix them.
  2. For better context, pair the LLM's capabilities with external tools for storing and retrieving data.
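
A minimal sketch of the second point, assuming a generic retrieval function and LLM client: relevant user data is pulled from an external store into the prompt, and known hallucination patterns are corrected after the fact.

```python
# A minimal sketch, assuming generic retrieve() and call_llm() clients; the
# post-hoc fixes dict stands in for tracked hallucination patterns and is
# not a real library API.
def answer_with_context(question: str, retrieve, call_llm, known_fixes: dict) -> str:
    # Pull the most relevant user data from an external store instead of
    # relying on the model's memory.
    passages = retrieve(question, top_k=5)
    prompt = (
        "Answer using only the context below. If the context is insufficient, say so.\n\n"
        + "\n---\n".join(passages)
        + f"\n\nQuestion: {question}"
    )
    answer = call_llm(prompt)
    # Apply corrections for hallucination patterns tracked over time.
    for wrong, right in known_fixes.items():
        answer = answer.replace(wrong, right)
    return answer
```
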
Transform chat interface into a workspace

We replaced the typical multi-topic chat interface with focused workspaces where chat is just a communication channel. This mental shift changes how people work with AI systems. Instead of starting fresh conversations, they enter environments built around their specific needs. Why we believed this would work (a structural sketch follows the list):

  1. Users think in jobs, not conversations.
  2. All relevant context stays in one place.
  3. Performance improves over time with use.
  4. Chat becomes a tool, not the main feature.
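
Structurally, the workspace-first model could be sketched like this, with the workspace owning the job's context and artifacts and chat attached as just one channel; the names and the call_llm signature are assumptions for illustration.

```python
# A structural sketch of the workspace-first model; names and the call_llm
# signature are assumptions for illustration.
from dataclasses import dataclass, field

@dataclass
class Workspace:
    job_story: str                                           # what the user is trying to get done
    artifacts: list = field(default_factory=list)            # tasks, drafts, events created here
    context_refs: list[str] = field(default_factory=list)    # integrated data pinned to this job
    chat_history: list[dict] = field(default_factory=list)   # chat is a channel, not the container

    def ask(self, message: str, call_llm) -> str:
        # Every prompt is grounded in the workspace's own context, so answers
        # improve as the workspace accumulates relevant data.
        self.chat_history.append({"role": "user", "content": message})
        reply = call_llm(message, context=self.context_refs, history=self.chat_history)
        self.chat_history.append({"role": "assistant", "content": reply})
        return reply
```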

Integrate for scale

As our platform grew to support increasingly complex workflows, designing a scalable and flexible approach to third-party data integrations became critical. Our users rely on a diverse ecosystem of tools and need a cohesive way to bring relevant data into their workflows without overwhelming the system or sacrificing control.

Classify data: weighting for accuracy and relevance

Not all data is equal. To help users and systems interpret incoming information, we worked with the tech team to develop a classification framework for all the incoming data. This allowed us to evaluate data based on relevance and accuracy and assign each piece of information to meaningful categories.
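
A simplified sketch of the weighting idea: each incoming item gets a relevance score and an accuracy score, which are combined and bucketed. The weights, thresholds, and bucket names here are placeholders, not the framework we actually tuned.

```python
# A simplified sketch of the weighting idea; the weights, thresholds, and
# bucket names are placeholders, not the framework we actually tuned.
def classify_item(relevance: float, accuracy: float,
                  w_relevance: float = 0.6, w_accuracy: float = 0.4) -> str:
    """relevance and accuracy are 0..1 scores produced upstream."""
    score = w_relevance * relevance + w_accuracy * accuracy
    if score >= 0.75:
        return "primary"      # surfaced directly in the workspace
    if score >= 0.4:
        return "supporting"   # available as context, not pushed to the user
    return "ignored"          # kept out of prompts to avoid noise
```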

Unify interface patterns for data integrations

Our goal was to build a scalable interface capable of supporting multiple types of third-party data integrations — ranging from emails and calendars to project management tools like Linear. To achieve this, we developed a consistent table-template model, applied where appropriate, that allows users to interact with integrated elements much as they would in the native applications.
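
For illustration, the template idea boils down to every integration mapping its items into one shared column schema that the table UI can render; the columns and the email and Linear field mappings below are assumptions.

```python
# Every integration maps its items into one shared column schema that the
# table UI renders; the columns and field mappings are illustrative assumptions.
TABLE_COLUMNS = ["title", "source", "owner", "date", "status", "actions"]

def email_to_row(email: dict) -> dict:
    return {"title": email["subject"], "source": "email", "owner": email["from"],
            "date": email["received_at"], "status": "unread" if email["unread"] else "read",
            "actions": ["reply", "summarize", "create_task"]}

def linear_issue_to_row(issue: dict) -> dict:
    return {"title": issue["title"], "source": "linear", "owner": issue["assignee"],
            "date": issue["updated_at"], "status": issue["state"],
            "actions": ["open", "comment", "create_task"]}
```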

Make data integration purposeful

Integrating data isn’t just about connectivity—it’s about intent. We start by asking a critical question: What do we need from this integration? For example, integrating an email client can surface thousands of emails and attachments. Rather than ingesting all data, our interface supports smart selection and filtering, allowing users to define what should be imported.
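
A small sketch of that intent-first import, assuming a hypothetical email integration: a user-defined filter decides what gets ingested before anything reaches the system.

```python
# A user-defined filter decides what gets imported before anything reaches
# the system; the filter fields and email shape are hypothetical.
from datetime import datetime, timedelta, timezone

EMAIL_IMPORT_FILTER = {
    "folders": ["INBOX"],            # skip archives and spam
    "newer_than_days": 30,           # only recent mail is relevant to active jobs
    "from_domains": ["client.com"],  # senders the user has marked as important
    "include_attachments": False,    # attachments pulled on demand, not in bulk
}

def should_import(email: dict, f: dict = EMAIL_IMPORT_FILTER) -> bool:
    recent = email["received_at"] > datetime.now(timezone.utc) - timedelta(days=f["newer_than_days"])
    sender_ok = any(email["from"].endswith("@" + d) for d in f["from_domains"])
    return email["folder"] in f["folders"] and recent and sender_ok
```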

Creating strong empirical evaluations: benchmarking the workflows

Fine-tuning AI models takes time, but we needed clear ways to track progress and measure accuracy with every cycle.

Run user testing after the last one. Repeat

We ran unmoderated tests with mixed methods to get insight into how users really interact with the AI assistant and to spot where it fails to meet expectations. Navigation was one of the most problematic areas; we had to iterate on it a couple of times to get it right.

Introduce accuracy KPIs early with Grafana

For example, to test email summarization, we sent emails to test accounts and had the AI evaluate its own performance against the scoring bands below (a sketch of the loop follows the rubric). This process helped us improve relevance and contextualization much faster.

  • Fail/low (0-4).
    - Hallucinations, or a possible-seeming hallucination if the selected source is wrong.
    - Major information discrepancies or missing information.
  • Okay (5-7).
    - Lack of sufficient detail, or sometimes specific detail, but the information overall is not incorrect.
    - Maybe a slight discrepancy in the interpretation of a key point from the source material.
  • Low Highs (7-8).
    - Slight discrepancies about the identified key details (but not incorrect information).
  • High Highs (9-10).
    - Very minor discrepancies (possibly by omission) to perfect.
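
A minimal sketch of that self-evaluation loop, assuming a generic call_llm client; the grader prompt is illustrative, but the score bands mirror the rubric above.

```python
# The grader prompt is illustrative; the score bands mirror the rubric above,
# and call_llm stands in for whatever client the pipeline uses.
RUBRIC = """Score the summary against the source emails on a 0-10 scale:
0-4  hallucinations, major discrepancies, or missing information
5-7  not incorrect, but lacking detail or with a slight misinterpretation
7-8  slight discrepancies in key details, nothing incorrect
9-10 very minor omissions, up to perfect"""

def evaluate_summary(source_emails: str, summary: str, call_llm) -> int:
    prompt = (f"{RUBRIC}\n\nSource emails:\n{source_emails}\n\n"
              f"Summary:\n{summary}\n\nReply with the integer score only.")
    return int(call_llm(prompt).strip())
```
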
Create smart UI anchors to gather information

We skipped simple thumbs up/down ratings: they're biased toward frustrated users and are better suited to flagging wrong LLM responses during unmoderated testing. Instead, we measured what we thought actually mattered (a sketch of these signals computed from logs follows the list):

  1. Editing frequency. How much do users change AI-generated content?
  2. Acceptance rate. How often do users keep what the AI creates?
  3. Task completion. Successfully created calendar events, sent emails, etc.
  4. Asynchronous feedback tools for beta users.
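
For illustration, these signals could be computed from interaction logs roughly as below; the event shape is an assumption, not our actual analytics schema.

```python
# The event shape is an assumption, not our actual analytics schema; the point
# is that each signal falls out of ordinary interaction logs.
def interaction_metrics(events: list[dict]) -> dict:
    generated = [e for e in events if e["type"] == "ai_output"]
    edited    = [e for e in generated if e.get("user_edited")]
    accepted  = [e for e in generated if e.get("accepted")]
    tasks     = [e for e in events if e["type"] == "task_attempt"]
    completed = [e for e in tasks if e.get("succeeded")]
    return {
        "editing_frequency": len(edited) / len(generated) if generated else 0.0,
        "acceptance_rate":   len(accepted) / len(generated) if generated else 0.0,
        "task_completion":   len(completed) / len(tasks) if tasks else 0.0,
    }
```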