CA2375589A1 - Method and apparatus for determining user satisfaction with automated speech recognition (asr) system and quality control of the asr system - Google Patents

Method and apparatus for determining user satisfaction with automated speech recognition (asr) system and quality control of the asr system

Info

Publication number
CA2375589A1
Authority
CA
Canada
Prior art keywords
asr
user
voice
voice user
behavioural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
CA002375589A
Other languages
French (fr)
Inventor
James Craig
Andrew Osburn
Carter Cockerill
Jeremy Bernard
Mark Boyle
David Burns
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Diaphonics Inc
Original Assignee
Diaphonics Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Diaphonics Inc filed Critical Diaphonics Inc
Priority to CA002375589A
Publication of CA2375589A1
Status: Abandoned

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/26 Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/226 Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
    • G10L2015/227 Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics of the speaker; Human-factor methodology

Abstract

An apparatus for determining user satisfaction using automated speech recognition (ASR) systems is disclosed. The apparatus comprises: means for assessing the voice user emotional state based upon the voice characteristics; means for assessing the voice user behavioural pattern based upon the current, and previous, interactions with the ASR application; means for decision-modelling the overall voice user experience based upon the emotional state and behaviour pattern; and a real-time adaptation means of the voice user interface to match the individual based upon the QC in ASR assessment of the voice user experience. A method of determining user satisfaction using automated speech recognition (ASR) systems is also disclosed.

Description

Method and Apparatus for Determining User Satisfaction with Automated Speech Recognition Systems

Field of the Invention
The present invention relates to a method of determining user satisfaction with automated speech recognition systems and an apparatus using the method. The invention is concerned with gauging user satisfaction with, and quality control (QC) of, Automated Speech Recognition (ASR) systems. The method and apparatus of the invention draw upon analysis of speaker emotion, historical user behaviour, and statistical methods in order to estimate the degree of user satisfaction with an ASR system (QC in ASR). The QC in ASR system operates within the Public Switched Telephone Network (PSTN) and integrates components based in telephony services, automated speech recognition, automated assessment of speaker emotion, and automated speaker behaviour profiling.
Background and Summary of the Invention
Despite ever-increasing usage of the Web, companies still receive more than 70% of their orders through the appliance their customers prefer: the telephone. Thanks to the widespread adoption of cellular telephones, voice remains the preferred means of doing business. There will be more than 2 billion telephones by 2003, a figure that dwarfs the projected 200 million Web-enabled computers. For the foreseeable future, voice will continue to be the dominant mode for exchanging business information.
Automated speech recognition (ASR) is a technology that allows a computer to recognize and interpret human speech, much as the computer would recognize a typed command. ASR has been around for several decades, but only relatively recent improvements in software and computing power have made it a compelling business tool.
The key ASR benefits for callers are as follows:
- The ability to conduct transactions and receive information at any time over any phone: Customers do not require any special equipment or Internet access.
- A simple, user-friendly speech interface: Natural-language speech recognition flattens out the frustrating hierarchy or tree structure associated with touch-tone menus, making ASR a more pleasant, efficient and effective system.
- 24/7 availability: The customer is not limited to store or call-center hours.
For companies using ASR, key benefits include:
- Real-time integration with business systems: No re-keying of data associated with manual transaction and customer care processing.
- Reduced requirements for Customer Service Agents: Customers are never put on hold, and agents can focus on other tasks and more complex transactions.
- Lower administrative costs: ASR allows companies to bring in customers over an automated channel.
- Improved customer service levels: Implementing ASR will help eliminate hold times, and the system is available 24/7.
The challenge with ASR is designing and developing applications that will emulate the natural human dialogue process. A good ASR application provides a natural and intuitive dialogue flow that allows the User to interact with the system without question or concern. Achieving this level of ASR application sophistication is very difficult in practice.
One of the most important elements missing from current ASR systems is the ability to understand not only what the user is saying, but also the dialogue context, meaning, and manner in which the speech is conveyed. In order to fully assess the user satisfaction and interaction with the ASR application, a great deal more information is required. The QC in ASR system meets this challenge by providing a full and robust system for gauging the User-ASR experience.
This is accomplished not only by recognising the speech but also by assessing the User emotional state and individual behavioural characteristics. This assessment is then used to adapt the Voice User Interface to better meet the needs of the individual user. Therefore, the ASR application can be tailored in real time to match the individual and thereby begin to emulate a much more human dialogue process.
There are several commercially available quality control and ASR monitoring software tools in the marketplace today. These tools draw upon data in ASR application logs in order to identify problem areas in the applications, such as bottlenecks, poor dialogue flows, etc. They assist in tuning the applications to smooth out or redefine dialogues that may be misunderstood or misleading. However, these tools are very rudimentary in scope and do not conduct any assessment of the Voice User experience.
There are currently no other similar methods or processes in place to assess the Voice User experience with an ASR in an automated fashion.
According to the present invention, the QC in ASR integrates the following areas of technology:
- Automated speech recognition
- Automated analysis and assessment of speaker emotion based upon utterances made by the speaker during the use of an ASR system
- Automated user behaviour profiling based upon ASR usage logs and historical user profile data
- A decision matrix that takes as input all data regarding the User-ASR experience and determines a User satisfaction level, i.e., estimates the ease with which the User interacted with and was satisfied by the ASR system
- Algorithms that automatically and dynamically adapt the ASR voice interface (i.e., the dialogue flow) to match individual User needs based upon the outputs from the ASR level-of-satisfaction decision matrix
- Statistical methods for aggregating User satisfaction and behavioural data

Features of the invention and advantages associated therewith are as follows:
- The QC in ASR does not rely on a single data source but rather combines a number of unique methods to assess the User-ASR experience.
- Three distinct voice and speech components are analysed (Context and Discourse Structure, Prosody, and Paralinguistic Features) in order to assess the User Emotional State.
- The User Emotional State is assessed in real time, i.e., while the User-ASR interaction is ongoing.
- The User behavioural pattern and history is stored, accessed, and updated based upon each interaction of the User with the ASR. This allows the ASR to know in advance the User preferences and abilities and to tailor the Voice User Interface appropriately.
- The data from the User emotional assessment and behavioural pattern are used as inputs to a decision matrix in order to determine an assessment of the overall User experience and satisfaction level.
- The User emotional assessment and behavioural pattern data are used to dynamically adapt the ASR Voice User Interface to conform to the needs of the individual User.
- Statistical methods are employed in order to assess the effectiveness of the ASR Voice User Interface.

Further understanding of other features, aspects, and advantages of the present invention will be realized by reference to the following description, appended claims, and accompanying drawings.
Brief Description of the Drawings
Embodiment(s) of the present invention will be described with reference to the accompanying drawings, wherein:
Fig. 1 schematically illustrates the components of the QC in ASR system in accordance with the present invention;
Fig. 2 is a schematic presentation of the voice user emotion assessment in Fig. 1; and
Fig. 3 is a schematic flow chart showing the QC in ASR process flow in accordance with the present invention.
Detailed Description of the Preferred Embodiment(s)
Fig. 1 schematically illustrates the components of the QC in ASR system in accordance with one embodiment of the invention.
Each component in Fig. 1 will be explained below.
1.0 Automated Speech Recognition Application
This component can include any ASR-based application implemented using contemporary speech recognition engines.
ASR applications have the ability to record the utterances made by the speaker. These utterances, which are saved in standard audio file formats such as .wav, .vox, etc., can then be used as inputs to the Voice User Emotion Assessment 3.0 component as shown in Fig. 1.
The ASR application also builds a log file for every session conducted with a user. The log file contains a great deal of information regarding the user session including such data as invalid responses, re-prompts, time-outs, gender, etc. The log file data are used as inputs to the Voice User Behavioural Assessment 5.0 component.
2.0 ASR Log Files and Utterances
This component represents the data source of ASR Log Files and User utterances. As discussed above, the utterance and log files contain the source data used by both the Voice User Emotion and Behavioural Assessment components 3.0 and 5.0.
3.0 Voice User Emotion Assessment
This component takes as an input the User utterance file and processes the voice data in order to assess the User emotional state. Several distinct voice and speech components are analysed, for example Context and Discourse Structure, Prosody, and Paralinguistic Features. This component is the most complex within the QC in ASR system and is described in further detail below. The Voice User Emotion Assessment data is updated with every User utterance and passed to the Voice User Level of Satisfaction Matrix.
4.0 Voice User Level of Satisfaction Matrix
This component takes as an input the results of the Voice User Emotion Assessment component 3.0 and the Voice User Behavioural Assessment component 5.0. The decision matrix consists of a set of algorithms that determines an estimate of the overall User satisfaction level based upon the emotional and behavioural assessments. As the emotional and behavioural assessments are continually updated throughout the course of the User-ASR interaction, the decision matrix likewise continually updates its determination of the User satisfaction level.
The estimated User level of satisfaction is frequently updated and passed to the ASR Voice User Interface Adaptation Component 6.0.
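By way of illustration only, a decision matrix of this kind can be sketched as a weighted combination of the two assessment scores. The weights, thresholds, score ranges, and function names below are assumptions made for the sketch; the patent does not specify them:

```python
def satisfaction_score(emotion_score: float, behaviour_score: float,
                       w_emotion: float = 0.6, w_behaviour: float = 0.4) -> float:
    """Combine the emotional (3.0) and behavioural (5.0) assessments into
    one satisfaction estimate. Inputs are assumed normalized to [0, 1],
    where 1.0 is fully positive; the weights are illustrative."""
    return w_emotion * emotion_score + w_behaviour * behaviour_score


def satisfaction_level(score: float) -> str:
    """Map the continuous score onto coarse levels that the adaptation
    component (6.0) can act upon. The thresholds are hypothetical."""
    if score >= 0.7:
        return "satisfied"
    if score >= 0.4:
        return "neutral"
    return "dissatisfied"
```

A production matrix would likely re-weigh its inputs per User and per dialogue state, but the shape of the computation is the same: two continually updated scores in, one satisfaction determination out.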
5.0 Voice User Behavioural Assessment
This component takes as an input the ASR log files and processes the contained data in order to assess the Voice User behavioural pattern. The behavioural pattern describes the manner in which the User is able to interact with and navigate the ASR. For example, a novice User who is unfamiliar with the dialogue flow, or a User who has demonstrated difficulty in using the ASR, requires a more directed and robust dialogue. Experienced Users who have demonstrated that they can move quickly through the ASR require a more terse and brief dialogue flow. Therefore, the User behavioural pattern is built over a period of time based upon each interaction of the User with the ASR.
Each time the User accesses the ASR, the behavioural pattern is created and/or updated as appropriate to reflect the User capabilities and, thereby, reflect the User's individual needs.

The Voice User Behavioural Assessment data is updated as Log File data becomes available and then passed to the Voice User Level of Satisfaction Matrix.
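A minimal sketch of this component, assuming the log fields named in section 1.0 (invalid responses, re-prompts, time-outs) and treating the error rate per dialogue turn as the behavioural signal; the field names, scoring rule, and threshold are all hypothetical:

```python
from dataclasses import dataclass


@dataclass
class SessionLog:
    # Fields mirror the log data named in the description; names are invented.
    invalid_responses: int
    re_prompts: int
    time_outs: int
    turns: int  # total number of User responses in the session


def behaviour_score(log: SessionLog) -> float:
    """Return 1.0 for a User who navigates cleanly, approaching 0.0 as
    error events dominate the session. Purely illustrative."""
    if log.turns == 0:
        return 0.0
    errors = log.invalid_responses + log.re_prompts + log.time_outs
    return max(0.0, 1.0 - errors / log.turns)


def dialogue_style(score: float) -> str:
    """Struggling or novice Users get directed prompts; experienced Users
    get terse ones, per section 5.0. The threshold is an assumption."""
    return "terse" if score >= 0.75 else "directed"


print(behaviour_score(SessionLog(invalid_responses=0, re_prompts=1,
                                 time_outs=0, turns=8)))   # 0.875 -> "terse"
```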
6.0 ASR Voice User Interface Adaptation Component
This component takes as an input the User Level of Satisfaction data from component 4.0. Based upon the determined level of satisfaction, the Voice User Interface within the ASR is updated dynamically to meet the immediate User needs. In this manner the real-time determination of the User-ASR interaction experience is acted upon in order to conform and tailor the ASR Voice User Interface to meet the individual and immediate User requirements.
7.0 Voice User Historical Behavioural Data Component
This component represents the data source for User behavioural data. A database record is created for an individual the first time they access the ASR. The record contains information regarding the individual's interaction with the ASR and reflects their level of satisfaction and ease of use during each interaction. Each successive time the User accesses the ASR, the historical profile is queried in order to tailor the Voice User Interface to meet the individual needs. Upon termination of the User-ASR interaction, the behavioural profile is amended as required.
The Voice User Emotion Assessment Component 3.0 will be described below in greater detail, in conjunction with Fig. 2.
As noted above, this component 3.0 is very sophisticated within the QC in ASR system. The purpose of the component is to process the User spoken utterance files with the objective of determining the speaker emotional state. The results can be sufficient to indicate whether the User has had a "negative" experience as opposed to a "positive" one.

To achieve this objective there are many features and characteristics of the human voice which can be derived and analysed in order to determine an assessment of the User emotional state. The distinct voice and speech components that are analysed are, for example, Context and Discourse Structure 3.1, Prosody 3.2, and Paralinguistic Features 3.3, as illustrated in Fig. 2.
Each component of the Voice User Emotion Assessment is further detailed as follows:
3.1 Context and Discourse Structure
Context and Discourse Structure give consideration to the overall meaning of a sequence of words rather than looking at specific words in isolation. Different words can mean different things depending upon the context in which they are spoken. Therefore, one has to consider the overall discourse and structure of the dialogue flow in order to fully assess the meaning and emotion contained therein.
Techniques used to derive context and structure consider the rise and fall of voice intonation and compute the probability of a certain word based upon the previous words that have been spoken.
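The word-probability idea is the familiar n-gram language model. A minimal bigram sketch follows; the corpus, function name, and use of a raw maximum-likelihood estimate are assumptions (a deployed system would use a large corpus and smoothing):

```python
from collections import Counter


def bigram_probability(corpus: list, prev_word: str, word: str) -> float:
    """Estimate P(word | prev_word) from a token list: the kind of
    statistic a discourse-context model could use to flag unexpected
    (possibly frustrated) wording. Illustrative only."""
    bigrams = Counter(zip(corpus, corpus[1:]))   # consecutive word pairs
    unigrams = Counter(corpus[:-1])              # words that have a successor
    if unigrams[prev_word] == 0:
        return 0.0
    return bigrams[(prev_word, word)] / unigrams[prev_word]


# Example: in this toy corpus, 'please' always follows 'yes'.
tokens = "yes please yes please no".split()
print(bigram_probability(tokens, "yes", "please"))   # -> 1.0
```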
3.2 Prosody
Prosodic features of voice are reflected in vocal effects such as variations in pitch, volume, duration, and tempo, among others. Of the three voice components, prosody in voice holds the greatest potential for determination of conveyed emotion. Prosodic features are extracted from a voice sample through digital signal processing techniques. The prosodic features are determined and then analysed in order to attempt to classify the user emotion. Often several voice samples are required in order to derive an emotional state.
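As an illustration of such extraction, the sketch below derives two of the prosodic cues named above, volume (as RMS energy) and pitch, from a mono sample buffer. Autocorrelation pitch tracking is only one of many possible DSP techniques; the patent does not name a specific one:

```python
import numpy as np


def prosodic_features(samples: np.ndarray, sample_rate: int = 8000) -> dict:
    """Extract a volume proxy (RMS energy) and a pitch estimate from one
    mono utterance. Illustrative only."""
    rms = float(np.sqrt(np.mean(samples ** 2)))

    # Pitch: lag of the autocorrelation peak within a 50-400 Hz voice band.
    ac = np.correlate(samples, samples, mode="full")[len(samples) - 1:]
    lo, hi = sample_rate // 400, sample_rate // 50
    lag = lo + int(np.argmax(ac[lo:hi]))
    return {"rms_volume": rms, "pitch_hz": sample_rate / lag}


# Example on a synthetic 200 Hz tone (0.25 s at 8 kHz): pitch_hz -> 200.0
t = np.arange(2000) / 8000.0
print(prosodic_features(np.sin(2 * np.pi * 200.0 * t)))
```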
3.3 Paralinguistic Features
Paralinguistic features of voice are separated into two classifications. The first is voice quality, which reflects different voice modes such as whisper, falsetto, and huskiness, among others. The second is voice qualifications, which include non-verbal cues such as laugh, cry, tremor, and jitter.
As with prosody, these voice features can be extracted through digital signal processing techniques. Paralinguistic features are then analysed in order to attempt to classify the user emotion.
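For instance, jitter, one of the non-verbal cues listed above, is conventionally measured as the cycle-to-cycle variation of the pitch period. A minimal sketch, assuming the pitch periods come from a tracker such as the one sketched under 3.2:

```python
import numpy as np


def local_jitter(periods_s: np.ndarray) -> float:
    """Mean absolute difference between consecutive pitch periods,
    divided by the mean period -- the usual 'local jitter' measure."""
    diffs = np.abs(np.diff(periods_s))
    return float(np.mean(diffs) / np.mean(periods_s))


# A steady voice (equal periods) has zero jitter; an unsteady one does not.
print(local_jitter(np.array([0.005, 0.005, 0.005])))     # 0.0
print(local_jitter(np.array([0.005, 0.0055, 0.0048])))   # > 0
```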
Fig. 3 is a schematic flow chart showing the QC in ASR process flow in accordance with the present invention.
According to the present invention, the QC in ASR process flow is as follows (a code sketch of the complete loop appears after the list):
1. User calls the ASR application.
2. The ASR, through standard means such as account number, password/PIN, voice biometric characteristics, etc., identifies the caller.
3. The User behavioural profile is retrieved from the User Behavioural database and the ASR Voice User Interface is initially configured based upon the User profile. If the User is accessing the ASR for the first time then a new User Behavioural database record is created and a default Voice User Interface is configured.
4. The ASR interacts with the User and, each time a User response is provided, an utterance file is recorded and a Log File entry is made.
5. The Voice User Emotional Assessment component processes the utterance files and the Voice User Behavioural Assessment component processes the log files.
6. Step 5 is iterative and will be repeated each time a Voice User response is provided.
7. The User Emotional and Behavioural Assessment data are passed to the Voice User Level of Satisfaction Decision Matrix. The data are processed in order to determine the immediate user level of satisfaction.
8. The User level of satisfaction data are passed to the ASR Voice User Adaptation Component. Based on the user satisfaction level, the Voice User Interface can be immediately tailored to match the requirements of the User at that specific time.
9. Upon completion of the User-ASR interaction, the User Behavioural data record is updated.
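The sketch below strings the nine steps into one loop. Every function here is a hypothetical stand-in for the corresponding component, reusing the illustrative weights from the satisfaction-matrix sketch; none of the names or signatures come from the patent:

```python
def assess_emotion(utterance_file: str) -> float:
    """Stand-in for the Voice User Emotion Assessment (3.0); returns [0, 1]."""
    return 0.5


def assess_behaviour(log_entry: dict) -> float:
    """Stand-in for the Voice User Behavioural Assessment (5.0); returns [0, 1]."""
    return 0.5


def configure_vui(score: float) -> None:
    """Stand-in for the VUI Adaptation Component (6.0)."""
    print(f"adapting VUI, satisfaction={score:.2f}")


def qc_in_asr_session(user_id: str, store: dict, turns: list) -> None:
    """Skeleton of steps 1-9; 'turns' stands in for the live call, one
    (utterance_file, log_entry) pair per User response."""
    profile = store.setdefault(user_id, {"sessions": 0, "avg": 0.5})  # steps 1-3
    score = profile["avg"]
    for utterance_file, log_entry in turns:                           # step 4
        emotion = assess_emotion(utterance_file)                      # step 5
        behaviour = assess_behaviour(log_entry)                       # steps 5-6
        score = 0.6 * emotion + 0.4 * behaviour                       # step 7
        configure_vui(score)                                          # step 8
    n = profile["sessions"]                                           # step 9
    profile["avg"] = (profile["avg"] * n + score) / (n + 1)
    profile["sessions"] = n + 1


qc_in_asr_session("user-001", {}, [("utt1.wav", {"re_prompts": 1})])
```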
While the present invention has been described with reference to several specific embodiments, the description is illustrative of the invention and is not to be construed as limiting the invention. Various modifications and variations may occur to those skilled in the art without departing from the true spirit and scope of the invention as defined by the appended claims.

Claims (2)

1. An apparatus for determining user satisfaction using automated speech recognition (ASR) systems, the apparatus comprising:
(a) means for assessing the voice user emotional state based upon the voice characteristics;
(b) means for assessing the voice user behavioural pattern based upon the current, and previous, interactions with the ASR application;
(c) means for decision-modelling the overall voice user experience based upon the emotional state and behaviour pattern; and
(d) a real-time adaptation means of the voice user interface to match the individual based upon the QC in ASR assessment of the voice user experience.
2. An apparatus according to claim 1, further comprising a database storage of historical voice user behavioural data, and decision modelling algorithms employed to assess and weigh the data elements from the emotional state and behavioural pattern assessments in order to achieve an overall determination regarding the voice user experience.
CA002375589A 2002-03-08 2002-03-08 Method and apparatus for determining user satisfaction with automated speech recognition (asr) system and quality control of the asr system Abandoned CA2375589A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CA002375589A CA2375589A1 (en) 2002-03-08 2002-03-08 Method and apparatus for determining user satisfaction with automated speech recognition (asr) system and quality control of the asr system

Publications (1)

Publication Number Publication Date
CA2375589A1 (en) 2003-09-08

Family

ID=27810538

Family Applications (1)

Application Number Title Priority Date Filing Date
CA002375589A Abandoned CA2375589A1 (en) 2002-03-08 2002-03-08 Method and apparatus for determining user satisfaction with automated speech recognition (asr) system and quality control of the asr system

Country Status (1)

Country Link
CA (1) CA2375589A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1659572A1 (en) * 2004-11-18 2006-05-24 Deutsche Telekom AG Dialogue control method and system operating according thereto
CN105096943A (en) * 2014-04-24 2015-11-25 杭州华为企业通信技术有限公司 Signal processing method and device
CN105096943B (en) * 2014-04-24 2019-04-19 杭州华为企业通信技术有限公司 The method and apparatus of signal processing
CN113808621A (en) * 2021-09-13 2021-12-17 地平线(上海)人工智能技术有限公司 Method and device for marking voice conversation in man-machine interaction, equipment and medium

Legal Events

Date Code Title Description
EEER Examination request
FZDE Dead