This week the paper topic was software, and I stumbled across SimSensei – a virtual human created to assist in the therapeutic diagnosis of those suffering from depression or PTSD. I found this paper interesting as it opens so many potential doors for other use cases, and the best thing is that very little of it is new – most of it is a clever combination of existing technologies. There are some more videos on YouTube showing it in action, but check out below for my comments on the paper.
This paper is available from http://aamas2014.lip6.fr/proceedings/aamas/p1061.pdf
DeVault, D., Artstein, R., Benn, G., Dey, T., Fast, E., Gainer, A., … Morency, L.-P. (2014).
SimSensei Kiosk: A Virtual Human Interviewer for Healthcare Decision Support. In AAMAS ’14 Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems (pp. 1061–1068).
We present SimSensei Kiosk, an implemented virtual human interviewer designed to create an engaging face-to-face interaction where the user feels comfortable talking and sharing information. SimSensei Kiosk is also designed to create interactional situations favorable to the automatic assessment of distress indicators, defined as verbal and nonverbal behaviors correlated with depression, anxiety or post-traumatic stress disorder (PTSD). In this paper, we summarize the design methodology, performed over the past two years, which is based on three main development cycles: (1) analysis of face-to-face human interactions to identify potential distress indicators, dialogue policies and virtual human gestures, (2) development and analysis of a Wizard-of-Oz prototype system where two human operators were deciding the spoken and gestural responses, and (3) development of a fully automatic virtual interviewer able to engage users in 15-25 minute interactions. We show the potential of our fully automatic virtual human interviewer in a user study, and situate its performance in relation to the Wizard-of-Oz prototype.
This paper covers the work of a research group funded by DARPA to create a “virtual human” that can be used for conducting interviews for psychological support for those suffering from post-traumatic stress disorder (PTSD) or depression. The intention is that the virtual human can act as either an initial or interim support between visits with human clinicians. The virtual human (named Ellie) is intended to help clinicians make diagnoses whilst also helping the users feel they have more support to talk. In this context, a virtual human is a piece of software that results in a 3D model of a human being displayed on-screen that the user can interact with. This human takes in video and audio feeds so that it can respond in a human-like way to specific scenarios.
Ellie was created through three main phases of development. Initially, face-to-face interviews were conducted between a trained professional and volunteers who fell into the depression or PTSD categories. This served two purposes – firstly to provide a baseline for later test results, and secondly to give the SimSensei team a series of videos to study. In addition to a literature review, the team studied all the recorded interactions to determine what questions the interviewer asked and how they used body language, gesturing and other nonverbal behaviours, whilst the interviewees were analysed to see how they responded, so that a system could be designed to cope with this. One of the biggest challenges was the wide-ranging information that the interviewee could reveal, making context detection extremely difficult.
The second phase of development took a puppeteer approach, in which two humans controlled Ellie’s software – one controlling the nonverbal behaviour and the other controlling the speech. This was called the Wizard-of-Oz (WoZ) approach. Rather than having unlimited verbal or non-verbal options, the operators were given a large but pre-defined list of choices based on the analysis of the interviews conducted in the first phase. The idea was to use the human operators’ responses to program automated behaviour that would act in the same way, and limiting the responses tested whether a more closed, rather than open-ended, conversational flow could work.
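To make the constrained WoZ setup concrete, here is a minimal sketch of what such an operator control surface could look like: the wizard picks from a fixed catalogue of utterances and gestures rather than typing freely. All names and catalogue entries below are invented for illustration – they are not taken from the actual SimSensei tooling.

```python
# Hypothetical Wizard-of-Oz operator console: two key presses (one per
# operator) map to one command for the avatar, drawn from a pre-defined list.

VERBAL_CHOICES = {
    "1": "Tell me more about that.",
    "2": "How did that make you feel?",
    "3": "That sounds really hard.",
}

NONVERBAL_CHOICES = {
    "a": "head_nod",
    "b": "lean_forward",
    "c": "empathetic_smile",
}

def operator_select(verbal_key: str, nonverbal_key: str) -> dict:
    """Map the two operators' key presses to a single avatar command.

    Unknown keys fall back to silence and a neutral listening pose,
    so the avatar never stalls mid-interview.
    """
    return {
        "say": VERBAL_CHOICES.get(verbal_key, ""),
        "gesture": NONVERBAL_CHOICES.get(nonverbal_key, "neutral_listen"),
    }

command = operator_select("2", "a")
# command == {"say": "How did that make you feel?", "gesture": "head_nod"}
```

The point of the closed catalogue is exactly what the paper tested: if wizards can run a convincing interview from a finite menu, the same menu becomes the action space for the automated system in phase three.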
The third and final phase of development analysed the approach the human wizards took in the WoZ scenarios and automated it, so that Ellie would automatically express appropriate nonverbal and verbal behaviour based on the interviewee’s responses.
The bulk of the work on Ellie consisted of the analysis and selection of data; most of the technology within the system already existed. Broadly speaking, it used a general modular virtual human architecture with a bespoke messaging bus to pass information between the three virtual human software modules. The first module covered non-verbal perception, using the MultiSense framework to perform head tracking, face tracking, gaze detection and basic audio analysis. The second module handled dialogue processing and consisted of four natural language understanding classifiers covering generic dialogue, dialogue valence, domain-specific small talk and domain-specific dialogue. The final module generated non-verbal behaviour (such as expressing body language).

Within these modules, existing components were used, including GAVAM head tracking, CLM-Z face tracking, the FACET SDK for further face analysis, OKAO gaze detection, Cogito Software’s audio analysis, PocketSphinx automated speech recognition, SentiWordNet 3.0 for language valence, the FLoReS dialogue manager, Cerebella for much of the non-verbal behaviour generation and finally SmartBody for the animation. Beyond selecting the technologies well, most of the work went into tweaking each of the algorithms to work together and give an overall impression of deeper understanding than Ellie actually had. In particular, work was done on language valence to detect whether words and sentences carried positive, negative or neutral valence, with significant changes to the default behaviour of SentiWordNet to cope with the potentially unlimited contexts Ellie might face.
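The valence classification step can be illustrated with a tiny lexicon-based scorer in the spirit of the SentiWordNet-style approach described above. The lexicon entries, the negation handling and the thresholds here are all invented for illustration – the real system adapted a full lexical resource and dealt with far more linguistic detail.

```python
# Minimal lexicon-based valence scoring: sum per-word scores, flip the sign
# after a negation word, and threshold the total into three classes.

LEXICON = {          # word -> valence score in [-1.0, 1.0] (illustrative values)
    "happy": 0.8,
    "good": 0.5,
    "sad": -0.7,
    "terrible": -0.9,
    "sleep": 0.0,
}

NEGATIONS = {"not", "never", "no"}

def utterance_valence(utterance: str) -> str:
    """Classify an utterance as 'positive', 'negative' or 'neutral'.

    A preceding negation word flips the sign of the next scored word –
    a crude stand-in for the contextual rules the real system needed.
    """
    score, flip = 0.0, 1.0
    for word in utterance.lower().split():
        if word in NEGATIONS:
            flip = -1.0
            continue
        score += flip * LEXICON.get(word, 0.0)
        flip = 1.0
    if score > 0.2:
        return "positive"
    if score < -0.2:
        return "negative"
    return "neutral"

print(utterance_valence("I am not happy"))    # -> negative
print(utterance_valence("I had a good day"))  # -> positive
```

Even this toy version shows why the out-of-the-box behaviour of a general-purpose lexicon needed tuning: words that are neutral in one interview context can carry strong valence in another.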
The results were not surprising in that the automated Ellie did not perform as well as the WoZ version – however, it came extremely close. The most impressive result was that both the WoZ and automated versions rated better amongst trial participants than the human therapist involved in the first phase of development. Further research has been suggested in this area; however, other papers do suggest this could be because people are willing to talk more freely to a computer without feeling they should withhold information. Several studies show that when people interact with computers, their sentence structure, conversational silences and content all differ from when other humans are present. The conclusion was that, even in its present state, Ellie could offer real benefits to users of this system.
Although there were very few significant leaps forward in Ellie’s software, I feel that the selection of architecture and software components combined has created software far greater than the sum of its parts. A big part of this is down to the changes to natural language understanding which weigh the positive or negative connotations of a sentence. Understanding any possible topic is extremely difficult, so the team came up with a way to gauge the overall tone of the conversation so that Ellie could probe for more information and then measure non-verbal responses whilst building further rapport with the interviewee. This faked understanding is an important step in allowing a user to open up so that their other non-verbal signs can be analysed to assist human clinicians later in treatment.
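The probing behaviour driven by conversational tone can be sketched very simply: a detected valence class selects an empathetic continuation prompt that invites the interviewee to keep talking. The prompts and the mapping below are my own illustration of the idea, not the actual FLoReS dialogue policies.

```python
# Hypothetical valence-driven follow-up selection: the tone of the last
# utterance picks a continuation prompt, giving an impression of
# understanding without Ellie needing to grasp the topic itself.

FOLLOW_UPS = {
    "positive": "That's great. Can you tell me more about that?",
    "negative": "I'm sorry to hear that. What do you think is behind it?",
    "neutral": "Can you say a bit more about that?",
}

def choose_follow_up(valence: str) -> str:
    """Pick a continuation prompt matching the detected tone; fall back
    to the neutral probe for anything unexpected."""
    return FOLLOW_UPS.get(valence, FOLLOW_UPS["neutral"])

print(choose_follow_up("negative"))
# -> I'm sorry to hear that. What do you think is behind it?
```

This is exactly the "faked understanding" trade-off: a three-way tone signal is far easier to compute reliably than open-domain comprehension, yet it is enough to keep the interviewee talking and producing analysable non-verbal behaviour.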