
Testing AI Like a Scientist, Detective, and Therapist: A UX Researcher’s Field Guide

By Ryan McGarry, UX Research Manager, Akraya | May 12, 2026

“Testing AI isn’t like testing a button. A button either works or it doesn’t. AI is more like a toddler with a PhD: brilliant, highly capable, and then sometimes it just makes things up and is confident about it.” – Ryan McGarry

 

 

Introduction: A New Kind of Problem

Somewhere between 2023 and today, the products that UX researchers test stopped behaving like predictable machines. If you’re embedded in enterprise software, developer tooling, or any B2B product ecosystem, you’ve probably felt it: the moment a stakeholder walks into your kick-off meeting and says, “We’re adding AI.”

For our team at Akraya, supporting Google’s suite of developer and cloud tools, that moment arrived in force. What started as occasional AI feature requests has become 85% of our research focus over the last year. And with that shift came an urgent realization: the mental models, session protocols, and methodological playbooks we had spent years refining were no longer sufficient on their own.

This post shares what we’ve learned the hard way through broken sessions, hallucinating models, and a few embarrassing moments of not knowing what questions to ask. Whether you’re a seasoned UXR stepping into your first AI study or a research lead rethinking your team’s approach, this is a practical guide built from field experience. In essence, this specialized framework helps our clients navigate the 'black box' of AI and ensures their R&D spend results in products that users trust.

 

The Fundamental Shift: From Deterministic to Probabilistic Thinking

Traditional UX research operates on a deterministic assumption: if a button doesn’t work, it’s a bug. The user clicked it, it failed, you document it, and the team fixes it. The world is binary and clean.

AI breaks this entirely.

In 2026, we’re no longer testing buttons; we’re testing behaviors. When a generative agent provides two different answers to the same prompt for two different users, that isn't necessarily a "bug." It’s the nature of a probabilistic system. This shift changes the very definition of "valid data" in a session.

This means UX researchers need to develop what we call a probabilistic mindset. Instead of asking “did it work or not?” we ask: “Does this variability help or hurt the user’s experience?” We’re no longer just evaluating interfaces. We’re evaluating trust, mental models, and the messy human experience of interacting with something unpredictable.

Think of yourself as part scientist, part detective, and yes, part therapist. You need the scientist’s rigor to control variables, the detective’s instinct to ask questions nobody else thought to ask, and the therapist’s patience when a participant’s session derails because the model started hallucinating.

 

Phase 1: Research Planning – Ask Everything Before You Write a Single Task

The planning phase is where most AI studies fail before they even begin. Researchers either assume they understand the tool well enough or they rely too heavily on the product team to surface the nuances that matter for research design. Here’s what we’ve learned to do differently.

Understand the Black Box

AI systems are frequently opaque, even to the people who build them. One of the most important questions you can ask a stakeholder is: “Where does this AI source its information from?”

This question sounds basic, but the implications are enormous. Here’s a real example: while analyzing study data, I opened a new Google Doc to compile my notes. A new AI-powered feature offered to auto-generate the document for me based on a prompt. Curious, I tried it. The tool produced content that seemed eerily close to the actual session data I’d collected, data it shouldn’t have had access to, since it lived on a shared team drive I’d never linked.

I spent ten minutes questioning my own memory before realizing the tool had likely just generated plausible-sounding content. But the experience highlighted something critical: if I, as a researcher actively investigating these tools, couldn’t immediately tell where the AI was getting its answers, our participants certainly couldn’t. And that confusion has direct implications for trust, mental model formation, and the kinds of follow-up questions you need to build into your script.

Ask your stakeholders: Does this AI pull from public data? Personal user files? A team-shared repository? Enterprise databases? The answer shapes everything from task design to the pre-task briefing you give participants.

Clarify What the Team Actually Wants to Learn

Not all research questions are created equal, and in AI studies, the gap between what stakeholders ask for and what’s researchable can be significant.

We’ve sat in many kick-offs where a PM says, “We want to know if users trust the output.” That’s a valid research question. But we’ve also had studies where every participant spontaneously commented on latency (how long it took the model to respond), and the team had to remind us mid-project that speed wasn’t something they could actually fix in the near term. We’d spent 20% of every session on a topic that had no actionable path forward for the team.

Before writing a single task, align with your stakeholders on: What can the team actually change based on what we learn? Is this about the interface, the model behavior, or something else? If users comment on model accuracy, is that actionable feedback for this team?

Tighten your research scope around what’s useful, not just what’s interesting.

Identify Failure Modes Before You Meet Your First Participant

One of the best exercises you can do in the planning phase is a pre-mortem: imagine your study has gone completely sideways. What happened? Common AI failure modes we’ve encountered include: the model taking 30+ seconds to respond mid-task, hallucinated outputs that confuse participants, completely different responses to nearly identical inputs, and session environments that crash when connected to certain types of input files.

Anticipating these failure modes lets you build contingencies into your protocol rather than improvising in front of a participant. We’ll talk more about specific contingencies in the execution section.
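
One lightweight way to carry the pre-mortem into your sessions is a contingency map the moderator keeps open alongside the script. The sketch below is illustrative only; the failure modes come from the list above, and the responses preview tactics covered later in this post.

```python
# Illustrative pre-mortem contingency map: each anticipated failure mode
# is paired with the moderator's planned response. Entries are examples,
# not an exhaustive protocol.
CONTINGENCIES = {
    "model takes 30+ seconds to respond": (
        "Use a think-aloud prompt ('What are you expecting to see?') "
        "or park the task and return to it later."
    ),
    "confident but incorrect (hallucinated) output": (
        "Do not correct the AI; observe whether the participant notices, "
        "trusts it anyway, or tries to verify."
    ),
    "very different responses to nearly identical inputs": (
        "Capture both outputs verbatim and ask how the variability "
        "affects the participant's confidence."
    ),
    "environment crashes on the participant's own file": (
        "Pivot to the pre-tested, standardized control file."
    ),
}

# Print as a quick checklist the moderator can keep on a second screen.
for failure, plan in CONTINGENCIES.items():
    print(f"- {failure}: {plan}")
```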

Lock Down the Variables That Matter

AI models are highly sensitive to configuration. If a participant can freely switch between model versions during a study, say, from Gemini 1.5 to Gemini 2.0, you’ve introduced a confounding variable that can make your results uninterpretable. Ask the engineering team to lock in a specific model version for the duration of the study, and document which build was used so anyone reading the research later has that context.
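
If your study environment is driven by a script or an internal harness, pinning the model version in one place and writing it into every session record makes that documentation automatic. Here’s a minimal sketch; the study ID, model name, and settings are placeholder assumptions, not a real configuration:

```python
import json
from datetime import datetime, timezone

# Pin one model build for the whole study so every session runs against the
# same configuration (all identifiers below are hypothetical placeholders).
STUDY_CONFIG = {
    "study_id": "codegen-usability-q2",
    "model_version": "example-model-2.0-build-4711",  # confirm the exact build with engineering
    "temperature": 0.2,
    "locked_on": datetime.now(timezone.utc).isoformat(),
}

def log_session_config(session_id: str, path: str = "session_configs.jsonl") -> None:
    """Append the pinned configuration to a study log so anyone reading the
    research later knows exactly which build produced the data."""
    record = {"session_id": session_id, **STUDY_CONFIG}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log_session_config("P01")
```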

Similarly, clarify terminology before the study begins. “AI agent,” “co-pilot,” and “assistant” mean very different things to very different people. Using imprecise language with participants leads to imprecise feedback. Build a shared glossary with your stakeholders and use it consistently in your script.

 

Phase 2: Study Execution – Controlling the Uncontrollable

Even with a meticulous plan, AI sessions will surprise you. The goal isn’t to eliminate variance; it’s to contain it so your data remains meaningful. At Akraya, we move from the Detective’s preparation to the Scientist’s control and the Therapist’s observation.

The Dual-Input Strategy

This is one of the most valuable tactical lessons we’ve developed at Akraya. When your study involves participants providing input to an AI feature, say, uploading a code file to a code generation tool, you face a classic tension between external validity (using their real file = realistic results) and internal validity (everyone uses the same file = comparable results).

Our solution: use both. Ask participants to begin with their own file. This gives you the realistic, naturalistic data your stakeholders love. But also have a pre-tested, standardized example file ready as a fallback. If the participant’s own file causes the AI to produce garbage output or an irrelevant edge case, we pivot to the Control File. This ensures that every study produces a comparable baseline of data, regardless of how "clean" the participant's own file was.

We pre-test example files thoroughly: we run them through the AI tool multiple times, confirm the output is reliable and correct, and document what to expect so we can spot anomalies during sessions.
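
When the tool under test is scriptable, part of that pre-test can be automated: run the control file through the feature several times and compare the outputs, so the moderator knows what “normal” looks like before the first session. A rough sketch, assuming you can wrap the feature in a `generate` callable; the run count and summary are placeholders for whatever baseline your team documents:

```python
from collections import Counter
from typing import Callable

def pretest_control_file(
    generate: Callable[[str], str],  # wraps the AI feature under test (assumed interface)
    control_file_contents: str,
    runs: int = 5,
) -> None:
    """Run the standardized control file through the feature several times and
    summarize how much the output varies, so anomalies stand out during sessions."""
    outputs = [generate(control_file_contents) for _ in range(runs)]
    for i, out in enumerate(outputs, 1):
        print(f"run {i}: {len(out)} chars")
    print(f"{len(Counter(outputs))} distinct output(s) across {runs} runs")
```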

Engineer on Standby

This tip sounds obvious, but gets skipped more often than you’d think, often because engineers are expensive and busy. Our recommendation: Prioritize having a technical stakeholder silently observe your first three to five sessions. Camera off, muted, watching the live stream. Their job is not to moderate or intervene with questions; it’s to troubleshoot in real time if the environment breaks.

We’ve had cloud environments hang mid-session, leaving participants staring at a loading screen. With an engineer on standby, that becomes a two-minute fix. Without one, it’s a session you must throw out.

Preparing for Latency and Hallucinations

Two failure modes deserve specific session protocols: latency and hallucinations.

When the model is slow, silence can feel like a malfunction. We’ve developed simple techniques for filling that space: using think-aloud prompts (“What are you expecting to see?”), moving to other tasks and returning, or simply normalizing the wait time in the pre-task briefing (“This feature sometimes takes 20-30 seconds to respond; that’s expected for today’s session”). If latency isn’t a research priority, consider using static mocks for those portions of the session to avoid introducing negative bias early.

When the model hallucinates and produces confident but incorrect output, your job isn’t to correct the AI. It’s to observe the participant’s reaction. Do they notice? Do they trust it anyway? Do they know how to verify? These are often the most valuable data points in the session.

 

A Note on Privacy and Policy

AI studies carry unique privacy risks that traditional UX research often doesn’t. At Google, we work within strict internal guidelines about which AI tools can access or process what types of data. Researchers should never expose sensitive code, proprietary internal data, or PII to AI tools with ambiguous data retention policies or external data dependencies.

This applies both to how you conduct research and how you use AI tools in your own workflow. Before any study, review your organization’s current guidelines. These rules change as the tools evolve. Be transparent with participants about how their inputs and any generated outputs will be stored, used, and whether they’ll be used to train or refine the model they’re testing.

De-identify all data before sharing findings, and err on the side of caution when you’re unsure.
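
Part of that de-identification can be automated before a transcript ever reaches a shared doc. Below is a rough first-pass sketch using simple regex patterns for emails and known participant names; it is not a substitute for manual review or for whatever approved tooling your organization provides:

```python
import re

def deidentify(text: str, participant_names: list[str]) -> str:
    """First-pass scrub of session transcripts before sharing findings:
    masks email addresses and known participant names. Always follow
    with a manual review."""
    # Mask anything that looks like an email address.
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "[EMAIL]", text)
    # Mask participant names the researcher already has on file.
    for name in participant_names:
        text = re.sub(re.escape(name), "[PARTICIPANT]", text, flags=re.IGNORECASE)
    return text

print(deidentify("Jordan (jordan@example.com) said the output felt wrong.", ["Jordan"]))
```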

 

The Bigger Picture: Why This Matters for the Industry

The challenges we’ve described aren’t unique to Akraya or to Google. As AI becomes the default feature in enterprise software (coding tools, productivity suites, customer service platforms, and healthcare applications), UX research is the discipline best positioned to bridge the gap between what these systems can do and what users can actually trust and use.

But only if we adapt our methods.

The researchers who will be most valuable in the next five years are the ones who can hold probabilistic thinking alongside rigorous experimental design, who know when to let a session breathe into the uncertainty and when to apply controls that make data meaningful, and who can ask a software engineer the right questions about model architecture and turn the answers into tasks that a non-technical participant can complete.

That’s a different skill set than what most of us were trained on. It’s worth investing in now.

 

Key Takeaways

  • Shift from deterministic to probabilistic thinking when evaluating AI output
  • Ask stakeholders where the AI sources its data before writing a single task
  • Align on what’s actionable before scoping research questions
  • Run a pre-mortem: anticipate failure modes and build contingencies
  • Use a dual-input strategy (personal file + standardized fallback) to balance validity
  • Have an engineer silently on standby for the first several sessions
  • Build latency and hallucination handling into your session protocol
  • Stay current on your organization’s AI privacy and policy guidelines

 

Ryan McGarry is UX Research Manager at Akraya, where he leads a team supporting Google’s Cloud and developer tooling products. He holds a PhD in Human Factors and Applied Cognition from George Mason University.

Want to bring this workshop to your organization? Contact Akraya to learn about our UX Research training programs.