arXiv:1203.2299v1 [cs.CL] 11 Mar 2012
A Cross-cultural Corpus of Annotated Verbal and Nonverbal Behaviors in Receptionist Encounters

Maxim Makatchev
Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA
Carnegie Mellon University in Qatar, Doha, Qatar
ABSTRACT

We present the first annotated corpus of nonverbal behaviors in receptionist interactions, and the first nonverbal corpus (excluding the original video and audio data) of service encounters freely available online. Native speakers of American English and Arabic participated in naturalistic role plays at the reception desks of university buildings in Doha, Qatar, and Pittsburgh, USA. Their manually annotated nonverbal behaviors include gaze direction, hand and head gestures, torso positions, and facial expressions. We discuss possible uses of the corpus and envision it becoming a useful tool for the human-robot interaction community.

INTRODUCTION

Behavioral realism has been one of the promising directions in the development of on-screen conversational agents and robots capable of natural language dialogue. For example, interactions with a robot receptionist that evoke users' social responses are associated with better engagement and a lower rate of breakdowns during information-seeking dialogues. A necessary step in designing such interactions is to identify behaviors with the potential to evoke a desired user response. Data sources that can be used to harvest behavior candidates include ethnographic and controlled studies. Ethnographic studies provide an opportunity to collect naturalistic conversational data, but often face the issues of an unclear sample population and coarse granularity of the captured data. On the other hand, collecting high-resolution data in a controlled setting may hamper the spontaneity and naturalness of the interaction. In general, the data collection methodology can influence both the sociopragmatic choices, namely, what speech act to perform, and their pragmalinguistic realization, namely, how to say it. These methodological difficulties, combined with the challenges of annotating multimodal data, result in a lack of annotated corpora of naturalistic interactions for many scenarios that are currently relevant to human-robot interaction research. The corpus of role plays between a visitor and a receptionist in a realistic environment that we present in this paper attempts to help fill this gap. In the next section, we describe related work on corpora of service encounters. After that, we introduce our data collection methodology and the annotation scheme we use. We conclude with a discussion of possible uses of the corpus.

CORPORA OF SERVICE ENCOUNTERS

Audio corpora of human service encounters have been used for the analysis of linguistic and paralinguistic features, such as timing and prosody. For example, the Vienna-Oxford International Corpus of English (VOICE) includes service encounters between speakers of English as a lingua franca. Audio recordings of Syrian shopping interactions were collected and analyzed by Traverso. Service encounters gathered in public offices and shops of Catalonia were examined with respect to how bilinguals negotiate the code (language) of their interaction. Audio recordings have also been used to analyze politeness strategies in shopping interactions. The importance of gaze and smile in determining the outcome of service interactions suggests the need to capture and study nonverbal behaviors on video. For instance, customers reported higher satisfaction when they interacted face-to-face with a bank teller who responded with a contingent smile, rather than a constantly neutral or constantly smiling expression. The same data showed that amused and polite smiles differ with respect to their temporal properties. An analysis of verbal and nonverbal expressions in videos of interethnic encounters between Korean retailers and Korean and African-American customers showed that these language communities perceived the function of socially minimal and socially expanded encounters differently. Receptionist interactions, a subtype of service encounters, have been analyzed with respect to their verbal content via role plays. Hewitt et al. conducted a discourse analysis of dialogues involving hospital receptionists. The openly accessible CUBE-G corpus of nonverbal behaviors from role plays of German and Japanese participants covers scenarios that may be relevant to service encounters, including a first meeting, a negotiation, and a status difference.

The original Map Task and follow-up projects collected direction-giving dialogues that may be relevant to some receptionist encounters. We were not able to find any nonverbal corpora of human receptionist interactions. With respect to availability, among all the corpora mentioned above only VOICE, CUBE-G, and the Map Task-related corpora are freely accessible. Hence, our corpus may be the first annotated corpus of nonverbal behaviors in receptionist interactions, and the first nonverbal corpus (excluding the original video and audio data) of service encounters freely available online.
DATA COLLECTION

We recruited via emails and posters in Education City, Doha, Qatar, and via announcements posted on bulletin boards across the CMU campus in Pittsburgh, USA. The recruitment materials specified that we were looking for native speakers of American English or Arabic. The majority of the participants (17 of 22) were university students, staff, or faculty. The participants filled out demographic surveys and evaluated themselves on the Ten-Item Personality Inventory (TIPI) and the 20-item Positive and Negative Affect Schedule (PANAS). The distribution of participants is shown in Table 1.

Table 1: Distribution of participants between the Doha and Pittsburgh experiment sites

    Site        Native language     Females  Males
    Doha        Arabic                 2       6
    Doha        American English       2       3
    Pittsburgh  Arabic                 1       1
    Pittsburgh  American English       5       1

People apply different criteria when they report their native language and mother tongue. To control for this, we asked the participants to list the countries they had lived in for more than a year, and their age at the time of moving into and out of each country. All but 3 participants (all in the American English condition in Doha) had spent the majority of their lives in a country where their native language is the primary spoken language. A female participant in Doha changed her reported native language from American English to Tulu after asking the experimenter a clarification question. Her data remains in the corpus, although she is not included in Table 1. The mean age of participants in Doha was 25 years (SD = 7.8); in Pittsburgh, it was 28.7 years (SD = 12.7). Native speakers of Arabic were on average 23.2 years old (SD = 4.2), while native speakers of American English were on average 30.9 years old (SD = 12.5).

After filling out the questionnaires, one of the participants was asked to play the role of a receptionist while the other was asked to imagine themselves as a first-time visitor looking for a particular location inside the building. The location was picked by the experimenter from a list including the library, restroom, cafeteria, student recreation room, a particular professor's office, etc. Visitors were asked to seek the receptionist's help with directions, using English, and then to proceed towards their destination. Most of the participant pairs were not familiar with each other; when familiarity was clear, it is noted in the annotations. Similarly, the annotations include information on whether a participant had a thorough (works or studies inside the building) or passing (works or studies in a nearby building) familiarity with the experiment site. At both sites, the receptionist occupied the actual receptionist area in the lobby of the building. In Doha, on-duty security guards were present in the vicinity of the reception desk. Each pair of participants would have 2-3 interactions with one of the subjects as the receptionist, and then they would switch roles and have 2 or 3 more interactions, depending on the allotted time. Afterwards, the participants were debriefed on their experiences. Overall, more than 60 interactions were recorded, using 2 or 3 consumer-level high-definition cameras. The visitor and the receptionist were each dedicated a camera capturing their torso, arms, and face, positioned about 45 degrees off their default line of sight (namely, the line of sight perpendicular to the front edge of the rectangular reception desk). Most of the interactions also had a third camera capturing a side view of the scene. All cameras were in plain view. In addition to the audio captured by the cameras, an audio recorder (iPod) was placed on the reception desk.

ANNOTATION SCHEME

The main goal of our corpus is to support the analysis of occurrences and timing of verbal and nonverbal behaviors. Consequently, we have chosen to annotate the data at a level of granularity that minimizes the coding effort while still capturing the timing and major features of communicative events. For example, instead of annotating each of the preparation, hold, stroke, and retraction phases of a hand gesture, we annotate the interval between the beginnings of the stroke and retraction phases. Similarly, facial expressions are annotated as intervals extending approximately from the beginning of the rise phase to the beginning of the decay phase, with some error inherent to manual annotation. The annotation scheme, developed in the process of annotating the corpus, is summarized in Table 2.

Table 2: Annotation scheme

    Modality           Values
    Speech             Transcribed utterances, including non-words
    Eye gaze           Pointing (self-initiated), pointing (following interlocutor), focus (interlocutor, guard, desktop, down, up, left, right, front, back, scattered, destination)
    Facial expression  Smile (open or closed mouth)
    Head gesture       Nod, half nod, double nod, multiple nod, upward nod, multiple upward nod, micro nod, shake
    Hand gesture       Pointing (left or right hand), finger only
    Torso              Sitting, standing, focus (left, right, front, back, destination, interlocutor, desk)

Coding nonverbal expressions, as well as transcribing ambiguous speech, involves a degree of subjectivity. For example, the exact point of gaze fixation within the recipient's face is hard to identify, even by the recipient himself. In fact, a typical direct eye contact consists of a sequence of fixations on different points of the face. Since it is unclear whether the exact fixation pattern has any influence
on social communication, in this study we do not distinguish between different fixation points within the general face area (nor does the video fidelity allow it). We plan to validate the annotations by employing a second annotator. The annotation is done using Advene, a multi-track video annotation tool.
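The interval-based scheme above lends itself to simple programmatic analysis. As an illustration only, one can represent each annotation as a timed interval and compute, for example, total mutual-gaze time between the two tracks. The corpus's actual file format and label strings are not specified here; the field names and the "focus (interlocutor)" value below follow Table 2, but the representation itself is an assumption, not part of the corpus distribution.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    start: float     # seconds from the start of the recording
    end: float       # e.g., beginning of the gesture's retraction phase
    modality: str    # "speech", "eye_gaze", "head_gesture", ...
    value: str       # e.g., "focus (interlocutor)", "nod"

def overlap(a: Annotation, b: Annotation) -> float:
    """Length (in seconds) of the temporal overlap of two intervals."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))

def mutual_gaze_time(visitor, receptionist) -> float:
    """Total time during which both tracks show gaze at the interlocutor."""
    at_partner = lambda track: [a for a in track
                                if a.modality == "eye_gaze"
                                and a.value == "focus (interlocutor)"]
    return sum(overlap(v, r)
               for v in at_partner(visitor)
               for r in at_partner(receptionist))

# Toy example: the visitor looks at the receptionist for 0-2 s,
# the receptionist looks back for 1-3 s; they overlap for 1 s.
visitor = [Annotation(0.0, 2.0, "eye_gaze", "focus (interlocutor)"),
           Annotation(2.0, 4.0, "eye_gaze", "focus (destination)")]
receptionist = [Annotation(1.0, 3.0, "eye_gaze", "focus (interlocutor)")]
print(mutual_gaze_time(visitor, receptionist))  # 1.0
```

The same overlap primitive extends directly to other modality pairs, such as speech overlapping with head nods.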
POSSIBLE USES OF THE CORPUS

While the small number of individual participants makes this corpus unsuitable for cross-subject analysis, the multiple trials may be accounted for by mixed-effects models. More appropriately, the corpus should be used for qualitative analysis and the formation of hypotheses for further studies. For example, compare the gaze behaviors of a native Arabic-speaking female, S4 (Subject 4), playing a receptionist responding to a native Arabic-speaking male, S1, playing a visitor (Fig. 1), with the dialogue in which the subjects' roles are reversed (Fig. 2). Notice that both subjects gazed at their interlocutor more in the visitor role. This appears to be a trend that can be explained in part by the receptionist looking towards the destination during the direction-giving speech, while the visitor may continue looking at the receptionist. Now, compare the receptionist gaze of S4 (Fig. 1) with that of S12 (Fig. 3), a female native speaker of American English. Notice the short glances that punctuate fragments of the directions sequence spoken by S12. These glances appear to precede the visitor's backchannels and therefore may play a role in connection events. Receptionist S4, on the contrary, did not glance at the visitor until the very end of the directions sequence. These different gaze behaviors may reflect the individual styles, genders, and cultures of the receptionist-visitor pairs, or their levels of comfort and expertise, among other possibilities. Further, more controlled studies may address these hypotheses.
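Observations like the glance-backchannel pattern above can be checked programmatically over the annotation tracks. The following sketch is a hypothetical illustration, not an analysis performed in this paper: the (start, end) tuple representation and the 1-second window are assumptions chosen for the example. It counts receptionist glances whose end is shortly followed by the start of a visitor backchannel.

```python
# Sketch: do the receptionist's short glances precede the visitor's
# backchannels? Intervals are (start, end) pairs in seconds; the labels
# and the 1-second window are illustrative assumptions.

def glances_preceding_backchannels(glances, backchannels, window=1.0):
    """Count glances whose end is followed, within `window` seconds,
    by the start of a visitor backchannel (e.g., "mm-hm")."""
    return sum(1 for (_, g_end) in glances
               if any(g_end <= b_start <= g_end + window
                      for (b_start, _) in backchannels))

# Receptionist's short glances at the visitor, and the visitor's backchannels:
glances = [(2.0, 2.4), (5.0, 5.3), (9.0, 9.2)]
backchannels = [(2.9, 3.1), (5.5, 5.8)]
print(glances_preceding_backchannels(glances, backchannels))  # 2
```

Such counts could feed into the mixed-effects models mentioned above, with subject and site as random effects, once more data are available.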
ACKNOWLEDGMENTS

This publication was made possible by the support of an NPRP grant from the Qatar National Research Fund. The authors would like to express their gratitude to Michael Agar, Mark Barker, Justine Cassell, Anwar El-Shamy, Ismet Hajdarovic, Alicia Holland, Carol Miller, Dudley Reynolds, Michele de la Reza, Candace Sidner, Mark Stehlik, Mark C. Thompson, the security and receptionist staff of CMU Qatar, and the study participants.
REFERENCES

A. Anderson, M. Bader, E. Bard, E. Boyle, G. M. Doherty, S. Garrod, S. Isard, J. Kowtko, J. McAllister, J. Miller, C. Sotillo, H. S. Thompson, and R. Weinert. The HCRC Map Task corpus. Language and Speech, 34:351–366, 1991.

O. Aubert and Y. Prié. Advene: active reading through hypervideo. In Proc. of ACM Hypertext, September 2005.

B. Bailey. Communication of respect in interethnic service encounters. Language in Society, 26:327–356, 1997.

L. M. Beebe and M. C. Cummings. Natural speech act data versus written questionnaire data: How data collection method affects speech act performance. In S. M. Gass and J. Neu, editors, Speech Acts Across Cultures: Challenges to Communication in a Second Language, pages 65–86. Mouton de Gruyter, Berlin / New York, 1996.

B. Chee, A. Wong, D. Limbu, A. Tay, Y. Tan, and T. Park. Understanding communication patterns for designing robot receptionist. In S. Ge, H. Li, J.-J. Cabibihan, and Y. Tan, editors, Social Robotics, volume 6414 of Lecture Notes in Computer Science, pages 345–354. Springer, Berlin / Heidelberg, 2010.

M. Cook. Gaze and mutual gaze in social encounters: How long—and when—we look others "in the eye" is one of the main signals of nonverbal communication. American Scientist, 65(2):328–333, 1977.

S. D. Gosling, P. J. Rentfrow, and W. B. Swann, Jr. A very brief measure of the Big-Five personality domains. Journal of Research in Personality, 37:504–528, 2003.

H. Hewitt, L. McCloughan, and B. McKinstry. Front desk talk: discourse analysis of receptionist–patient interaction. British Journal of General Practice, 59(565):e260–e266, 2009.

M. E. Hoque, L.-P. Morency, and R. W. Picard. Are you friendly or just polite? Analysis of smiles in spontaneous face-to-face interactions. In Proc. of Affective Computing and Intelligent Interaction, October 2011.

K. Kim. Affect Reflection Technology in Face-to-Face Service Encounters. MS thesis, MIT, September 2009.

S. Kita, I. van Gijn, and H. van der Hulst. Movement phases in signs and co-speech gestures, and their transcription by human coders. In I. Wachsmuth and M. Fröhlich, editors, Gesture and Sign Language in Human-Computer Interaction, pages 23–35. Springer, 1998.

K. C. Kong. Politeness of service encounters in Hong Kong. Pragmatics, 8(4):555–575, 2010.

M. Makatchev, R. Simmons, and M. Sakr. Carnegie Mellon Receptionist Corpus. http://www.qatar.cmu.edu/hala/corpora/.

M. McPherson, L. Smith-Lovin, and J. M. Cook. What is a language community? American Journal of Political Science, 44(1):142–155, 2000.

E. Montague, J. Xu, P.-y. Chen, O. Asan, B. P. Barret, and B. Chewning. Modeling eye gaze patterns in clinician-patient interaction with lag sequential analysis. Journal of the Human Factors and Ergonomics Society, 53:502–516, October 2011.

J. C. Pinheiro and D. M. Bates. Mixed-Effects Models in S and S-PLUS. Springer, 2000.

M. Rehm, E. André, N. Bee, B. Endrass, M. Wissner, Y. Nakano, A. A. Lipi, T. Nishida, and H.-H. Huang. Creating standardized video recordings of multimodal interactions across cultures. In M. Kipp, J.-C. Martin, P. Paggio, and D. Heylen, editors, Multimodal Corpora, pages 138–159. Springer-Verlag, Berlin / Heidelberg, 2009.

C. Rich, A. Holroyd, B. Ponsler, and C. Sidner. Recognizing engagement in human-robot interaction. In Proc. of the ACM/IEEE International Conference on Human-Robot Interaction, pages 375–382, 2010.

C. Rich and C. L. Sidner. Robots and avatars as hosts, advisors, companions and jesters. AI Magazine, 30(1):29–41, 2009.

B. Seidlhofer, A. Breiteneder, T. Klimpfinger, S. Majewski, R. Osimk, and M.-L. Pitzl. Vienna-Oxford International Corpus of English (version 1.1 online). http://voice.univie.ac.at. Accessed January 16, 2012.

R. Simmons, M. Makatchev, R. Kirby, M. Lee, I. Fanaswala, B. Browning, J. Forlizzi, and M. Sakr. Believable robot characters. AI Magazine, 32(4):39–52, 2011.

V. Traverso. Syrian service encounters: a case of shifting strategies within verbal exchange. Pragmatics, 11(4):421–444, 2001.

M. von Cranach and J. H. Ellgring. Problems in the recognition of gaze direction. In M. von Cranach and I. Vine, editors, Social Communication and Movement: Studies of Interaction and Expression in Man and Chimpanzee, pages 419–443. Academic Press, London, 1973.

D. Watson, L. A. Clark, and A. Tellegen. Development and validation of brief measures of positive and negative affect: The PANAS scales. Journal of Personality and Social Psychology, 47:1063–1070, 1988.
Figure 1: Interaction between S1 as a visitor and S4 as a receptionist. Wide vertical stripes represent intervals of speech. Narrow vertical stripes represent (from left to right): intervals of visitor’s and receptionist’s gaze towards the direction pointed by the receptionist, and visitor’s and receptionist’s gaze towards each other. Color coding of these modalities is specified by the icons in the upper part of the plots.
Figure 2: Interaction between S4 as a visitor and S1 as a receptionist. Stripes and color coding as in Figure 1.
Figure 3: Interaction between S11 as a visitor and S12 as a receptionist. The visitor's eye gaze for this particular dialogue is partially inferred from his head gaze. Stripes and color coding as in Figure 1.