himala 🇩🇪

Multi-agent AI applications: from user jobs to reliable collaboration

Category

AI application

Duration

August 2024 — March 2025

While I was Principal Product Designer at an early-stage AI video production startup, we pivoted to building a digital AI assistant platform that would serve as a professional and personal life manager. We ran into the challenge of tying AI functionality to a well-defined user problem and to a clear picture of how automation would deliver proven, recurring value.

As the industry shifts from classic SaaS interfaces to AI-driven experiences, building effective agentic systems requires rethinking how users interact with semi-autonomous assistants. The case study below presents UX solutions that merge current interface patterns with the new capabilities of LLM workflows to help users complete high-priority tasks.

Problem

From the start of working on the AI assistant, we ran into the complexity of actually building an autonomous assistant that could accurately answer questions about the user based on their integrations and create tasks and calendar events that corresponded to user requests. The problem of accuracy and usability lies in the fact that self-directed actions by agents require far more background work and testing on our side before we can claim the system offers a solution.

Challenges

We are in the era of mixing UI and automated assistance. The chat-based interface is the most common and basic medium for that combination, with its awkward limitations (read email summaries in a chat?). Assuming the LLM is fine-tuned correctly, I believe the design process shifts from manually placing GUI components on the screen to orchestrating information into agentic workflows, with the user acting as the knowledge guide behind the prompts. Hence, we need to empower the customer to lead with the right instruments: take the strongest parts of the AI and classic API worlds and make them support each other progressively, because the priority is helping the user solve the problem and reach the goal, not the AI itself.

User jobs to orchestrate AI workflows

In our first months after pivoting, we learned that starting with user job stories shapes both product and technical decisions before any interface design begins. Since users talk to the LLM through prompts, user jobs become the foundation for future agent specialization in the system: each job story maps to specific workflows and coordination patterns.
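
To make the mapping concrete, here is a minimal sketch of how a job story could be tied to an agent specialization and its workflow steps. The stories, agent names, and steps are hypothetical placeholders, not our production configuration.

```python
# Illustrative sketch only: the job stories, agent names, and steps below are
# hypothetical placeholders, not the product's actual configuration.
JOB_STORY_TO_WORKFLOW = {
    "When I start my workday, I want a digest of overnight emails, "
    "so I can decide what needs a reply first": {
        "agent": "email_triage",
        "steps": ["fetch_unread", "classify_priority", "summarize", "propose_replies"],
    },
    "When a meeting is requested, I want a suggested slot on my calendar, "
    "so I can confirm it in one click": {
        "agent": "scheduling",
        "steps": ["parse_request", "check_availability", "draft_event", "ask_confirmation"],
    },
}
```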

Define a task artifact based on jobs

Let's look at the example of tasks. Based on the chain of actions a user would hypothetically take, we created a structure for the task object. Each artifact would include context, time-based indicators, and important actions. This becomes the blueprint for the workflow.
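
As an illustration, a task artifact along these lines could be modeled roughly as below; the field names are assumptions derived from the three groups above, not the schema we actually shipped.

```python
# A minimal sketch of the task artifact; field names are assumptions derived
# from the three groups described above (context, time-based indicators,
# important actions), not the production schema.
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class TaskAction:
    label: str                     # e.g. "Reply to client", "Book a room"
    requires_confirmation: bool    # real-world side effects need user approval

@dataclass
class TaskArtifact:
    title: str
    context: str                                              # source snippets the task was derived from
    source_refs: list[str] = field(default_factory=list)      # ids of related emails, events, issues
    due_at: Optional[datetime] = None                         # time-based indicators
    remind_at: Optional[datetime] = None
    actions: list[TaskAction] = field(default_factory=list)   # important actions
```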

Summarize the task creation process into a workflow

We mapped the path a real human would take to solve the problem and transformed it into prompted steps. By doing so before jumping into implementation, we made sure the team understood the workflow from different angles and was aligned on how we were trying to solve the problem.
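
A rough sketch of what that looks like: the human path encoded as an ordered list of prompted steps, executed one after another. The prompts and the call_llm client are illustrative stand-ins.

```python
# The human path encoded as ordered, prompted steps; the prompts and the
# call_llm client are illustrative stand-ins for what lived in our docs.
TASK_CREATION_WORKFLOW = [
    {"step": "extract", "prompt": "From the user's message and linked sources, extract the task intent and constraints."},
    {"step": "enrich",  "prompt": "Pull related context (threads, events, documents) a person would check before acting."},
    {"step": "draft",   "prompt": "Draft the task artifact: title, context summary, time-based indicators, proposed actions."},
    {"step": "confirm", "prompt": "Present the draft to the user and apply their edits before saving."},
]

def run_workflow(workflow, call_llm, state):
    # Each step's output feeds the next, mirroring how a person would work.
    for step in workflow:
        state[step["step"]] = call_llm(step["prompt"], state)
    return state
```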

How to earn customer's trust?

What is the likelihood that a customer would trust an AI system to answer an important email? I believe the goal is not to do and sound right, but to be collaborative and proactive. Trust is gained through iteration and visible progress, and the system should actively seek that feedback from the customer.

Show user how AI thinks

Because there is no control over user input and LLM output is prone to hallucination, an AI interface should be transparent and allow edits along the workflow steps (a sketch of how these principles could surface in the UI follows the list):

  1. Tell users which tools the AI is using for complex tasks.
  2. When errors happen, let users see what went wrong and give them the option to try again.
  3. Always ask permission before the AI changes real-world things (like sending emails or booking meetings).
  4. Show the AI's "thought process": what information it is looking at and why.
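
As a rough illustration, these principles could surface in the interface as a stream of agent events that the UI renders for the user; the event kinds and fields below are assumptions, not our actual API.

```python
# A rough sketch: the agent's work surfaced as a stream of events the UI can
# render; event kinds and fields are assumptions, not our actual API.
from dataclasses import dataclass
from typing import Literal

@dataclass
class AgentEvent:
    kind: Literal["tool_call", "reasoning", "error", "confirmation_request"]
    detail: str

def render_events(events: list[AgentEvent]) -> list[str]:
    """Turn raw agent events into user-facing lines."""
    lines = []
    for e in events:
        if e.kind == "tool_call":
            lines.append(f"Using tool: {e.detail}")                    # principle 1
        elif e.kind == "error":
            lines.append(f"Something went wrong: {e.detail}. Retry?")  # principle 2
        elif e.kind == "confirmation_request":
            lines.append(f"Needs your approval: {e.detail}")           # principle 3
        elif e.kind == "reasoning":
            lines.append(f"Why: {e.detail}")                           # principle 4
    return lines
```
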
Build smart prompts with good data

Having defined the main user job stories, we created Notion documentation that transforms user actions into clear instructions for an AI conversation. To reduce wrong answers (a sketch of the retrieval side follows the list):

  1. To tackle hallucination, track common problems and teach the LLM to fix them.
  2. For better context, pair the LLM's capabilities with external tools for storing and retrieving data.
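
A minimal sketch of the second point, assuming a generic retrieval function and LLM client: relevant user data is pulled from an external store into the prompt, and known hallucination patterns are corrected after the fact.

```python
# A minimal sketch, assuming generic retrieve() and call_llm() clients; the
# post-hoc fixes dict stands in for tracked hallucination patterns and is
# not a real library API.
def answer_with_context(question: str, retrieve, call_llm, known_fixes: dict) -> str:
    # Pull the most relevant user data from an external store instead of
    # relying on the model's memory.
    passages = retrieve(question, top_k=5)
    prompt = (
        "Answer using only the context below. If the context is insufficient, say so.\n\n"
        + "\n---\n".join(passages)
        + f"\n\nQuestion: {question}"
    )
    answer = call_llm(prompt)
    # Apply corrections for hallucination patterns tracked over time.
    for wrong, right in known_fixes.items():
        answer = answer.replace(wrong, right)
    return answer
```
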
Transform chat interface into a workspace

We replaced the typical multi-topic chat interface with focused workspaces where chat is just a communication channel. This mental shift changes how people work with AI systems. Instead of starting fresh conversations, they enter environments built around their specific needs. Why we believed this would work (a structural sketch follows the list):

  1. Users think in jobs, not conversations.
  2. All relevant context stays in one place.
  3. Performance improves over time with use.
  4. Chat becomes a tool, not the main feature.
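
Structurally, the workspace-first model could be sketched like this, with the workspace owning the job's context and artifacts and chat attached as just one channel; the names and the call_llm signature are assumptions for illustration.

```python
# A structural sketch of the workspace-first model; names and the call_llm
# signature are assumptions for illustration.
from dataclasses import dataclass, field

@dataclass
class Workspace:
    job_story: str                                           # what the user is trying to get done
    artifacts: list = field(default_factory=list)            # tasks, drafts, events created here
    context_refs: list[str] = field(default_factory=list)    # integrated data pinned to this job
    chat_history: list[dict] = field(default_factory=list)   # chat is a channel, not the container

    def ask(self, message: str, call_llm) -> str:
        # Every prompt is grounded in the workspace's own context, so answers
        # improve as the workspace accumulates relevant data.
        self.chat_history.append({"role": "user", "content": message})
        reply = call_llm(message, context=self.context_refs, history=self.chat_history)
        self.chat_history.append({"role": "assistant", "content": reply})
        return reply
```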

Integrate for scale

As our platform grew to support increasingly complex workflows, designing a scalable and flexible approach to third-party data integrations became critical. Our users rely on a diverse ecosystem of tools and need a cohesive way to bring relevant data into their workflows without overwhelming the system or sacrificing control.

Classify data: weighting for accuracy and relevance

Not all data is equal. To help users and systems interpret incoming information, we worked with the tech team to develop a classification framework for all the incoming data. This allowed us to evaluate data based on relevance and accuracy and assign each piece of information to meaningful categories.
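
A simplified sketch of the weighting idea: each incoming item gets a relevance score and an accuracy score, which are combined and bucketed. The weights, thresholds, and bucket names here are placeholders, not the framework we actually tuned.

```python
# A simplified sketch of the weighting idea; the weights, thresholds, and
# bucket names are placeholders, not the framework we actually tuned.
def classify_item(relevance: float, accuracy: float,
                  w_relevance: float = 0.6, w_accuracy: float = 0.4) -> str:
    """relevance and accuracy are 0..1 scores produced upstream."""
    score = w_relevance * relevance + w_accuracy * accuracy
    if score >= 0.75:
        return "primary"      # surfaced directly in the workspace
    if score >= 0.4:
        return "supporting"   # available as context, not pushed to the user
    return "ignored"          # kept out of prompts to avoid noise
```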

Unify interface patterns for data integrations

Our goal was to build a scalable interface capable of supporting multiple types of third-party data integrations — ranging from emails and calendars to project management tools like Linear. To achieve this, we developed a consistent table-template model, applied where appropriate, that allows users to interact with integrated elements much as they would in the native applications.
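
For illustration, the template idea boils down to every integration mapping its items into one shared column schema that the table UI can render; the columns and the email and Linear field mappings below are assumptions.

```python
# Every integration maps its items into one shared column schema that the
# table UI renders; the columns and field mappings are illustrative assumptions.
TABLE_COLUMNS = ["title", "source", "owner", "date", "status", "actions"]

def email_to_row(email: dict) -> dict:
    return {"title": email["subject"], "source": "email", "owner": email["from"],
            "date": email["received_at"], "status": "unread" if email["unread"] else "read",
            "actions": ["reply", "summarize", "create_task"]}

def linear_issue_to_row(issue: dict) -> dict:
    return {"title": issue["title"], "source": "linear", "owner": issue["assignee"],
            "date": issue["updated_at"], "status": issue["state"],
            "actions": ["open", "comment", "create_task"]}
```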

Make data integration purposeful

Integrating data isn’t just about connectivity—it’s about intent. We start by asking a critical question: What do we need from this integration? For example, integrating an email client can surface thousands of emails and attachments. Rather than ingesting all data, our interface supports smart selection and filtering, allowing users to define what should be imported.
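
A small sketch of that intent-first import, assuming a hypothetical email integration: a user-defined filter decides what gets ingested before anything reaches the system.

```python
# A user-defined filter decides what gets imported before anything reaches
# the system; the filter fields and email shape are hypothetical.
from datetime import datetime, timedelta, timezone

EMAIL_IMPORT_FILTER = {
    "folders": ["INBOX"],            # skip archives and spam
    "newer_than_days": 30,           # only recent mail is relevant to active jobs
    "from_domains": ["client.com"],  # senders the user has marked as important
    "include_attachments": False,    # attachments pulled on demand, not in bulk
}

def should_import(email: dict, f: dict = EMAIL_IMPORT_FILTER) -> bool:
    recent = email["received_at"] > datetime.now(timezone.utc) - timedelta(days=f["newer_than_days"])
    sender_ok = any(email["from"].endswith("@" + d) for d in f["from_domains"])
    return email["folder"] in f["folders"] and recent and sender_ok
```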

Creating strong empirical evaluations: benchmarking the workflows

Fine-tuning AI models takes time, but we needed clear ways to track progress and measure accuracy with every cycle.

Run user testing after the last one. Repeat

We ran unmoderated tests with mixed methods to get insight into how users really interact with the AI assistant and to spot where it fails to meet expectations. Navigation was one of the most problematic areas; we had to iterate on it a couple of times to get it right.

Introduce accuracy KPIs early with Grafana

For example, to test email summarization, we sent emails to test accounts and had the AI evaluate its own performance against the scoring bands below (a sketch of the loop follows the rubric). This process helped us improve relevance and contextualization much faster.

  • Fail/low (0-4).
    - Hallucinations, or a possible-seeming hallucination if the selected source is wrong.
    - Major information discrepancies or missing information.
  • Okay (5-7).
    - Lack of sufficient detail, or sometimes specific detail, but the information overall is not incorrect.
    - Maybe a slight discrepancy in the interpretation of a key point from the source material.
  • Low Highs (7-8).
    - Slight discrepancies about the identified key details (but not incorrect information).
  • High Highs (9-10).
    - Very minor discrepancies (possibly by omission) to perfect.
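
A minimal sketch of that self-evaluation loop, assuming a generic call_llm client; the grader prompt is illustrative, but the score bands mirror the rubric above.

```python
# The grader prompt is illustrative; the score bands mirror the rubric above,
# and call_llm stands in for whatever client the pipeline uses.
RUBRIC = """Score the summary against the source emails on a 0-10 scale:
0-4  hallucinations, major discrepancies, or missing information
5-7  not incorrect, but lacking detail or with a slight misinterpretation
7-8  slight discrepancies in key details, nothing incorrect
9-10 very minor omissions, up to perfect"""

def evaluate_summary(source_emails: str, summary: str, call_llm) -> int:
    prompt = (f"{RUBRIC}\n\nSource emails:\n{source_emails}\n\n"
              f"Summary:\n{summary}\n\nReply with the integer score only.")
    return int(call_llm(prompt).strip())
```
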
Create smart UI anchors to gather information

We skipped simple thumbs up/down ratings: they're biased toward frustrated users and are better suited to flagging wrong LLM responses during unmoderated testing. Instead, we measured what we thought actually mattered (a sketch of these signals computed from logs follows the list):

  1. Editing frequency. How much do users change AI-generated content?
  2. Acceptance rate. How often do users keep what the AI creates?
  3. Task completion. Successfully created calendar events, sent emails, etc.
  4. Asynchronous feedback tools for beta users.
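
For illustration, these signals could be computed from interaction logs roughly as below; the event shape is an assumption, not our actual analytics schema.

```python
# The event shape is an assumption, not our actual analytics schema; the point
# is that each signal falls out of ordinary interaction logs.
def interaction_metrics(events: list[dict]) -> dict:
    generated = [e for e in events if e["type"] == "ai_output"]
    edited    = [e for e in generated if e.get("user_edited")]
    accepted  = [e for e in generated if e.get("accepted")]
    tasks     = [e for e in events if e["type"] == "task_attempt"]
    completed = [e for e in tasks if e.get("succeeded")]
    return {
        "editing_frequency": len(edited) / len(generated) if generated else 0.0,
        "acceptance_rate":   len(accepted) / len(generated) if generated else 0.0,
        "task_completion":   len(completed) / len(tasks) if tasks else 0.0,
    }
```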