Chatbot Guidance Design

This award-winning paper offers a set of design recommendations for conversational user interfaces to enhance task performance, improvement, and user experience.

This research has three implications for conversational design:
01. Example guidance = better performance; rule guidance = better improvement
02. Providing examples upon request is better than giving them when users fail
03. Design conversation according to purpose: performance or learning-driven




How can we provide better guidance
for task-oriented chatbot users?


I designed and led pioneering research on conversational user interfaces, co-authored a paper published in a top human-computer interaction conference, and received a best paper award (5%).




Between-subjects experiment, Literature Review, Interviews,
Survey, Affinity Diagramming,
Statistical Analysis

My Role

HCI Researcher


Sep 2021 - Aug 2022
11 months


2 Researchers (including me)
2 Coding assistants
1 Designer
1 Project Lead




Discovering Research Gap


Time matters.
Content also matters.

Initially, I was intrigued by the challenge of helping users recover from conversational failure. After reading more than 100 research papers, my team and I narrowed our focus to designing better guidance. We found that previous research lacks consensus and presents conflicting ideas on the ideal timing and type of guidance to offer.

Designing Guidance Combinations


Four timings.
Two types.

To fill this research gap, we set out to explore how eight combinations of two guidance types (Example-based and Rule-based) and four timings (Service-onboarding, Task-intro, Upon-request, and After-failure) affect user performance and experience.

To guide our research, we formulated three research questions and identified the necessary data to answer them. To ensure comprehensive results, we adopted a mixed-methods approach that included a lab experiment and reflection sessions.

Justifying Context & Task


Choosing what's relevant

Once we had defined the research scope, we turned our attention to selecting the tasks. We chose to build chatbots for two popular contexts, travel arrangement and movie booking, so that the outcomes would be broadly applicable.

We developed IBM Watson chatbots and crafted nine guidance conditions, including a control group that did not receive any guidance. I led the conversation-design process, which entailed collecting sample dialogues, mapping 12 conversation flows, and iterating on the design through 10+ pilot tests.
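As a rough illustration of how the nine conditions come about, the sketch below crosses the two guidance types with the four timings and adds the no-guidance control. The label strings and the function name build_conditions are illustrative assumptions, not names taken from the paper or from the chatbot implementation.

```python
from itertools import product

# Guidance types and timings as named in this case study.
GUIDANCE_TYPES = ["Example-based", "Rule-based"]
TIMINGS = ["Service-onboarding", "Task-intro", "Upon-request", "After-failure"]


def build_conditions():
    """Return the eight type/timing pairs plus a no-guidance control.

    The label format is a hypothetical choice made for this sketch.
    """
    conditions = [f"{g} / {t}" for g, t in product(GUIDANCE_TYPES, TIMINGS)]
    conditions.append("Control (no guidance)")
    return conditions


conditions = build_conditions()
for number, name in enumerate(conditions, start=1):
    print(number, name)
```

In a between-subjects design, each participant would be assigned to exactly one of these conditions.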

Talking, And More Talking!


126 interviews,
1512 task interactions

The study consisted of two phases. In the first phase, I observed participants as they performed six tasks of varying complexity with the chatbot, which provided guidance according to one of the nine conditions. Participants then completed a survey that measured their satisfaction with the guidance provided.

In the second phase, I conducted interviews with the participants to gain insight into their perceptions, attitudes, and concerns regarding each guidance combination. During this phase, participants were also asked to rank the guidance combinations in order of preference and explain their rankings.

Getting Our Hands Dirty With Data


Turning 1000+
affinity notes into
three important topics

We decided to use physical affinity diagramming because of the abstract nature of the problem. We wanted to gain clarity by physically moving and rearranging notes in a tangible space.

I led the team through this process, which allowed us to synthesize over 1000 notes into three main topics: task efficiency, performance improvement, and diverse opinions on guidance and timing. This was a challenging process as we had to carefully examine numerous notes, and towards the end, generating unique and innovative insights became difficult due to repetition.

Quantitative data
as story
for user performance

Beyond quotes, we believed that users' actual interactions would reflect their performance and overall experience. To dig deeper, we analyzed task-completion time, non-progress events, and improvement using statistical methods in R and Python. This part of the process was fairly straightforward, and I took the lead in discussing which statistical methods to use and in defining the quantitative metrics.
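A minimal sketch of the kind of per-condition summary that such an analysis might start from is shown below. The record fields, the function name summarize, and every number are made-up placeholders for illustration. They are not data, metric definitions, or results from the study, and the statistical tests actually used are not reproduced here.

```python
from collections import defaultdict
from statistics import mean

# Placeholder records invented for this sketch, NOT data from the study.
records = [
    {"condition": "Example-based / Upon-request", "seconds": 40.0, "non_progress": 0},
    {"condition": "Example-based / Upon-request", "seconds": 50.0, "non_progress": 1},
    {"condition": "Control (no guidance)", "seconds": 70.0, "non_progress": 2},
    {"condition": "Control (no guidance)", "seconds": 90.0, "non_progress": 4},
]


def summarize(rows):
    """Group task records by condition and average two hypothetical metrics."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["condition"]].append(row)
    return {
        condition: {
            "mean_seconds": mean(r["seconds"] for r in group),
            "mean_non_progress": mean(r["non_progress"] for r in group),
        }
        for condition, group in grouped.items()
    }


summary = summarize(records)
for condition, stats in summary.items():
    print(condition, stats)
```

Descriptive summaries like this only organize the logs by condition; comparing conditions would still require an appropriate statistical test.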



First Paper On
Chatbot Guidance

Working on this project for almost a year taught me how to conduct research with both rigor and attention to detail. I also came to realize that pursuing the research gaps that truly matter requires a significant investment of time.


Being The First

We not only identified patterns in the effectiveness of the guidance type and timing pairings, but also explored the underlying reasons for these patterns. From these findings, we generated a set of design recommendations for chatbot practitioners. Our study is just a starting point, and we encourage future researchers to validate the effectiveness of our proposed designs in real-life settings.



Treat Affinity Notes Like Your Baby

Taking thorough and relevant notes is crucial to helping the team synthesize information more effectively and arrive at better insights that inform design decisions.

Mixed Methods = Sticky Story

Just like UX case studies, academic papers are also about telling a good story. Incorporating various methods can greatly improve the quality of my narrative.

Do More Quant

I was involved in high-level tasks such as selecting the appropriate statistical methods and determining the key terms to be considered. I'd like to hone my quant execution skills more.



In the six years that I have worked with many students, Sonia stands out as one of the most well-rounded and skilled student researchers I have had the pleasure of working with. Her sharpness in finding valuable insights is truly remarkable. From our very first meeting, Sonia asked thought-provoking and important questions that drove the research ... She was able to connect users’ quotes with her keen observations during the experiment. When everyone in the room thought the quote was out of context, Sonia was the one who could connect the dots and provide meaningful interpretation to it.

- Stanley Chang (Project Lead; Associate Professor @ NYCU Computer Science Department)