WO2023148772A1 - A system and method to reduce ambiguity in natural language understanding by user expectation handling - Google Patents

A system and method to reduce ambiguity in natural language understanding by user expectation handling

Info

Publication number
WO2023148772A1
Authority
WO
WIPO (PCT)
Prior art keywords
expectation
conversation
user
natural language
processing
Application number
PCT/IN2023/050105
Other languages
French (fr)
Inventor
Michael Schmitz
Christoph Voigt
Kai Samuel DAVID ERIK KARREN
Original Assignee
Hishab India Private Limited
Application filed by Hishab India Private Limited
Publication of WO2023148772A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/1815 - Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 - Execution procedure of a spoken command

Definitions

  • the expectation handler 103 increases the confidence score of an alternative of a subword and/or keyword or a word with a positive weight, and decreases the confidence score of an alternative with a negative weight corresponding to an entity which previously had a negative effect on the conversation in the user interaction session. It is to be appreciated that the expectation handler 103 does not remove any transcription alternative from the list comprising the subword and/or keyword models, word models, or group-of-words models; only the corresponding confidence score is adjusted as per the weight, as the sketch below illustrates.
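A minimal Python sketch of this re-ranking behaviour (all names are illustrative, not taken from the patent): each alternative's confidence score is shifted by its learned weight and the list is re-sorted, but nothing is ever dropped.

def rerank_alternatives(alternatives, expectation_dictionary):
    # alternatives: list of (text, confidence) pairs from the voice recognizing entity.
    # expectation_dictionary: mapping of text -> learned weight (0.0 when unknown).
    rescored = [
        (text, confidence + expectation_dictionary.get(text, 0.0))
        for text, confidence in alternatives
    ]
    # Every alternative is kept; only the ordering changes with the adjusted scores.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)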
  • Fig. 2A is a flowchart 200 illustrating steps describing how the expectation dictionaries learn from a human-computer conversation, in the context of the IVR communication system 100 described in Fig. 1, in accordance with one or more aspects of the present invention.
  • at step 201, the process starts with a user, such as the user 101, initiating a call, for example, to the IVR communication system 100.
  • the IVR communication system 100 establishes a user interaction session with the user 101.
  • the interaction session includes, for example, but is not limited to, the user 101 inputting service requests, such as requests to receive information, and initiating various processes that may be a part of the service requests.
  • the IVR communication system 100 receives user input from the user 101 in the form of a plurality of speech utterances comprising a plurality of subwords and keywords.
  • the IVR communication system 100 recognizes the entity of the plurality of speech utterances comprising the plurality of subwords and keywords, spoken by the user 101 in the interaction session, for processing and generating a response to the user based on the user speech utterances data.
  • the IVR communication system 100 processes the plurality of speech utterances comprising the plurality of subwords and keywords spoken by the user 101.
  • the IVR communication system 100, while driving the user interaction session according to the use case, also records and concurrently stores inputs received in the user interaction session, such as, for example, latest updates, events, raw conversation data and content associated with the user 101, in the conversation history database 106.
  • the IVR communication system 100 identifies the raw conversation data and extracts the plurality of subwords and keywords spoken by the user 101 and updates and maintains a plurality of subword models and keyword models including, but not limited to, transcription alternatives comprising the plurality of subwords and keywords that are derived from the raw conversation data.
  • the plurality of subword models and keyword models are stored correspondingly in their respective dedicated expectation dictionaries.
  • the IVR communication system 100 further determines and assigns a weight associated with each of the plurality of subwords and keywords, the details of which are explained later in the specification, and stores them correspondingly in their respective dedicated expectation dictionaries.
  • at step 208, the process ends with the user terminating the call to the IVR communication system 100, for example.
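The learning pass of Fig. 2A can be summarized in a compact sketch, under assumed data shapes (each turn is a dict with "subwords" and "keywords" lists, and a dedicated expectation dictionary maps a term to its weight); this illustrates the flow rather than the patented implementation.

def learn_from_session(raw_conversation, expectation_dictionary):
    # Steps 205-207: extract the subwords and keywords from the raw
    # conversation data and ensure each term is present in the dictionary.
    for turn in raw_conversation:
        for term in turn["subwords"] + turn["keywords"]:
            # Newly seen terms start with a neutral weight; the update pass
            # of Fig. 2B later rewards or penalizes them.
            expectation_dictionary.setdefault(term, 0.0)
    return expectation_dictionary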
  • Fig. 2B is a flowchart 250 illustrating steps describing how the expectation dictionaries are updated from a human-computer conversation, in the context of the IVR communication system 100 described in Fig. 1, in accordance with one or more aspects of the present invention.
  • the process starts with a user, such as the user 101, initiating a call, for example, to the IVR communication system 100.
  • at step 252, the IVR communication system 100 establishes and maintains a user interaction session with the user 101.
  • the IVR communication system 100 receives user input from the user 101 in the form of a plurality of speech utterances comprising a plurality of subwords and keywords.
  • the IVR communication system 100 recognizes the entity of the plurality of speech utterances comprising the plurality of subwords and keywords spoken by the user 101 in the interaction session.
  • in the next step, at step 255, the IVR communication system 100 processes the plurality of speech utterances comprising the plurality of subwords and keywords spoken by the user 101 in the interaction session.
  • the IVR communication system 100 records and concurrently stores inputs received in the user interaction session including, but not limited to, latest updates, events, raw conversation data and content associated with the user 101 in the conversation history database 106.
  • the IVR communication system 100 then identifies the raw conversation data and extracts and stores the plurality of subwords and keywords spoken by the user 101.
  • the IVR communication system 100 analyzes the plurality of subwords and keywords in the speech utterances, spoken by the user, using a set of rules or an evaluation model corresponding to one or more dedicated expectation dictionaries for each of the plurality of subwords and keywords.
  • the IVR communication system 100 determines and evaluates a positive or negative indication for each of the plurality of subwords and keywords using the set of rules or the evaluation model.
  • the IVR communication system 100 assigns a weight associated with each of the plurality of subwords and keywords or modifies the weight associated with one or more previously transcribed subwords and keywords with a reward or penalty value based on the evaluation.
  • the previously learnt subwords and keywords are identified and acquired from various past user interaction sessions and are stored in their respective dedicated expectation dictionaries.
  • the word “balance” is recognized for 30% of the users in 50% of the cases with the highest confidence as “lance” or “ball”.
  • “lance” and “ball” have no independent meaning in the corresponding use case and, therefore, would not lead to a positive flow or progress in the conversation with the IVR communication system 100.
  • “lance” and “ball” would therefore be assigned negative weights.
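A toy rule-based evaluation consistent with this example might look as follows; the vocabulary and function names are invented for illustration. Alternatives that carry no meaning in the current use case receive a negative indication, which later translates into a negative weight.

# Assumed use-case vocabulary for a balance enquiry (illustrative only).
BALANCE_USE_CASE_TERMS = {"balance", "account", "statement"}

def evaluate_alternative(term):
    # Positive indication (+1) when the term is meaningful in the use case,
    # negative indication (-1) otherwise, e.g. for "lance" or "ball".
    return 1 if term.lower() in BALANCE_USE_CASE_TERMS else -1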
  • the IVR communication system 100 calculates and adjusts a corresponding confidence score for each of the plurality of subwords and keywords, as part of a re-ranking process, using the set of rules or an evaluation model.
  • the confidence score of a transcription alternative of a subword and/or keyword with a positive weight is increased and the confidence of a transcription alternative of a subword and/or keyword with a negative weight is decreased.
  • a simple rule-based equation of how the weights are used to adjust a confidence score of a transcription alternative as part of the adjustment and re-ranking process is illustrated below. For each transcription alternative of a subword and/or keyword, a word, or a group of words, the IVR communication system 100 checks whether it has a weight in its respective dedicated expectation dictionary, and then calculates the confidence score according to the equation illustrated below:
  • New confidence score = Old confidence score + weight, where the old confidence score is generated by the voice recognizing entity 102.
  • the IVR communication system 100 determines and identifies if the new confidence score for the corresponding transcription alternative comprising the subword, or keyword, or a word, or a group of words surpasses a first threshold confidence score.
  • the first threshold confidence score is predetermined automatically or by an administrator of the IVR communication system 100.
  • the first threshold confidence score is further capable of being adjusted per use-case and/or per positive or negative outcome of a conversation.
  • the IVR communication system 100 is configured with a function or application for adjusting the threshold score to further improve the accuracy of detecting the transcription alternatives.
  • the function for adjusting the threshold score is configured for each subword model and/or keyword model.
  • if, at step 261, the IVR communication system 100 determines that the new confidence score corresponding to the transcription alternative surpasses the first threshold confidence score, then, in the next step, at step 262, the IVR communication system 100 updates the dedicated expectation dictionary and re-ranks the transcription alternative accordingly. The process then ends at step 263.
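Steps 260 to 264 can be sketched in code roughly as follows; the equation is the one stated above, while the threshold value and the function name are assumptions made for illustration.

FIRST_THRESHOLD = 0.5  # predetermined automatically or by an administrator

def update_if_above_threshold(text, old_confidence, weight, dictionary):
    # Step 260: adjust the confidence score with the learned weight.
    new_confidence = old_confidence + weight
    # Step 261: compare against the first threshold confidence score.
    if new_confidence > FIRST_THRESHOLD:
        # Step 262: update the dedicated expectation dictionary; re-ranking
        # then happens when the alternatives are next sorted by score.
        dictionary[text] = weight
        return new_confidence
    # Step 264: the dictionary and the ranking stay unchanged.
    return old_confidence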
  • consider an exemplary scenario in which a user, such as the user 101, states her name, for example “Sofie Muller”, during a user interaction session.
  • the transcription alternative with the highest confidence is “Sophie Miller” and “Sofie Muller” is, for example, the second or third highest-ranking alternative based on the confidence scores.
  • the “i” instead of the “u” is detected and identified in an earlier verification step.
  • “Sophie” instead of “Sofie” is less likely to be recognized by the user in a voice-only interaction mode, and a correction will need to be made with another, later attempt.
  • a common scenario is that the system would then choose another alternative if “Sophie Miller” was rejected, but would not remember the rejection for either user in subsequent user interaction sessions.
  • if a direct database query is made without an earlier verification step, the query will likely fail or a different person may be found, which would jeopardize later user authentication steps.
  • in the dedicated expectation dictionary, such as a user-specific dictionary, “Sophie Miller” would be assigned a negative weight, leading to a lower confidence score, and, as a result, it is going to be re-ranked by the expectation handler 103 when “Sofie” calls the next time.
  • “Sophie Miller” is initially assigned the highest confidence score, but following the re-ranking of the transcription alternatives, “Sophie Miller” will no longer be the transcription alternative with the highest confidence score; “Sofie Muller” will be. Therefore, the verification will be successful and no correction will have to be made by the user. The transcription alternative “Sofie Muller” will then be assigned a positive weight, which will increase the likelihood that it is selected as the best transcription alternative the next time.
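Plugging invented confidence values and learned weights into the re-ranking sketch from earlier shows the example playing out:

alternatives = [("Sophie Miller", 0.80), ("Sofie Muller", 0.74)]
user_dictionary = {"Sophie Miller": -0.10, "Sofie Muller": 0.05}  # learned weights

rescored = sorted(
    ((text, round(conf + user_dictionary.get(text, 0.0), 2)) for text, conf in alternatives),
    key=lambda pair: pair[1],
    reverse=True,
)
print(rescored)  # [('Sofie Muller', 0.79), ('Sophie Miller', 0.7)]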
  • the re-ranking performed by the IVR communication system 100 uses previously learned knowledge, in the form of the plurality of expectation dictionaries, to increase the likelihood that a transcription alternative is selected that leads to a positive outcome (like progress in the dialogue flow) over a transcription alternative that has led to negative outcomes in the past.
  • the same rule is applied to all the inputs that a user, such as the user 101, makes. For example, if the user 101 has spoken “420” but the IVR communication system 100 misrecognized it as “42”, the user 101 then corrects the IVR communication system 100.
  • the transcription alternative “42” would be assigned a negative weight, and inputs that are verified by the user and surpass the corresponding first threshold confidence score would be assigned positive weights.
  • the IVR communication system 100 learns the kind of inputs that are likely for the user 101 and the kind that are not and would correspondingly re-rank the transcription alternatives.
  • the same rule is applied for lowering the likelihood of a partial transcription being chosen.
  • when the transcription alternative with the highest confidence score identified by the IVR communication system 100 comprises only a partial transcription with the most important part missing, the usage of expectation dictionaries helps to prevent such alternatives from being chosen in future user interaction sessions. It can be assumed that, in the IVR communication system 100, the transcription alternatives list contains the partial “I would like to perform” with the highest confidence score and the complete “I would like to perform a transaction” with a lower confidence score.
  • the first transcription is chosen for further processing because it corresponds to the highest confidence score.
  • the word that is actually relevant for the intent classification of the user, i.e. “transaction”, is missing in the chosen transcription, whereas it is present in the second transcription with a lower confidence score; the voice recognizing entity 102 was therefore unable to understand it.
  • “I would like to perform” results in a negative outcome/fallback in the conversation with the user 101 because the dialogue engine 104 is unable to extract the intent of the user. Therefore, “I would like to perform” is assigned a negative weight and the confidence score is updated.
  • the confidence score surpasses a corresponding first threshold confidence score and, therefore, the transcription alternatives are re-ranked accordingly in the one or more dedicated expectation dictionaries by the expectation handler 103 in the IVR communication system 100.
  • the confidence score of “I would like to perform” would be lower than the confidence score of "I would like to perform a transaction" and, as a result, the transcription that leads to a successful extraction of the user’s intent is chosen.
  • the voice recognizing entity 102 selects the second alternative “I would like to perform a transaction” because this transcription alternative now has a higher confidence score after the modification applied by the expectation handler 103 using the dedicated expectation dictionary 110 from the expectation dictionary component 109. This leads to a positive outcome, i.e. progressivity in the conversation, because the IVR communication system 100 does not have to ask again for clarity.
  • according to yet another example embodiment of the present invention, the expectation handler 103 changes the initial confidence scores using the weights from the expectation dictionary, as in the example above.
  • the IVR communication system 100 learns to choose transcription alternatives, over time, containing all the needed information. For example, choosing the correct alternative of “6 6 3 9 9” over the alternative “6 6 3 9” when a corresponding postcode is asked.
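The same mechanism applied to digit strings, with values invented for illustration; once the partial postcode has accumulated a negative weight, the complete alternative wins the ranking.

alternatives = [("6 6 3 9", 0.81), ("6 6 3 9 9", 0.78)]
weights = {"6 6 3 9": -0.15, "6 6 3 9 9": 0.05}  # learned from earlier sessions

best = max(alternatives, key=lambda a: a[1] + weights.get(a[0], 0.0))
print(best[0])  # "6 6 3 9 9": the complete postcode wins after re-ranking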
  • if, at step 261, the IVR communication system 100 determines that the new confidence score corresponding to the transcription alternative does not surpass the first threshold confidence score, then, in the next step, at step 264, the dedicated expectation dictionary is not updated and the transcription alternative is not re-ranked in the list of corresponding subword models, keyword models, word models, or group-of-words models. The process then ends at step 265.
  • Fig. 2C is a flowchart 270 illustrating steps describing how the best transcription alternative is chosen in a human-computer conversation, in the context of the IVR communication system 100 described in Fig. 1, in accordance with one or more aspects of the present invention.
  • the process starts with a user, such as the user 101, in a call or initiating a call, for example, to the IVR communication system 100.
  • the IVR communication system 100 establishes a user interaction session with the user 101.
  • the interaction session includes, for example, but is not limited to, the user 101 inputting service requests, such as requests to receive information, and initiating various processes that may be a part of the service requests.
  • the IVR communication system 100 receives user input from the user 101 in the form of a plurality of speech utterances comprising a plurality of subwords and keywords.
  • the IVR communication system 100 recognizes the entity of the plurality of speech utterances comprising the plurality of subwords and keywords, spoken by the user 101 in the interaction session.
  • the IVR communication system 100 uses the expectation handler 103 to identify and choose the transcription alternative with the highest confidence score from the dedicated expectation dictionary to configure a response to the user's request.
  • the IVR communication system 100 generates a response to the user’s request.
  • the response, to the user comprises the transcription alternative corresponding to the highest confidence score from the dedicated expectation dictionary.
  • the process then ends at step 276.
  • the plurality of expectation dictionaries are equipped with at least a spelling error correction portion and a vocabulary error correction portion.
  • the step of processing the identified errors by substituting them with predetermined keywords and/or subwords extracted from the dedicated expectation dictionary is executed in the background, without causing any interruption in the execution of the interaction session, to minimize delay.
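One way to keep that substitution off the critical path is a worker thread, as in the hedged sketch below; the patent only requires that the processing happens in the background, so the threading approach and the correct_errors helper (here assumed to map a misrecognized term to its predetermined correction) are illustrative assumptions.

import threading

def correct_errors(terms, corrections):
    # Hypothetical substitution: replace each identified error with the
    # predetermined keyword/subword stored in the dedicated dictionary.
    return [corrections.get(term, term) for term in terms]

def correct_in_background(terms, corrections, on_done):
    worker = threading.Thread(
        target=lambda: on_done(correct_errors(terms, corrections)),
        daemon=True,  # the interaction session is never blocked or delayed
    )
    worker.start()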

Abstract

The present invention describes a system and a method for dialogue management in a human-computer interaction by determining and satisfying a user's expectation and improving transcription. The dialogue management method and system (100) comprise receiving conversation data from user utterances and processing the received conversation data to improve transcription and predict a user's expectation in the interaction session, thereby reducing ambiguity in natural language understanding by user expectation handling. The dialogue management method and system (100) include analyzing a plurality of subwords and keywords corresponding to the processed conversation data.

Description

A SYSTEM AND METHOD TO REDUCE AMBIGUITY IN NATURAL LANGUAGE
UNDERSTANDING BY USER EXPECTATION HANDLING
FIELD OF INVENTION
[001] The present invention relates to dialogue management in an interactive voice response system and, more particularly, to systems and methods to satisfy a user's expectation and improve the accuracy of a voice recognizing entity in transcription during a user interaction session in an interactive voice response system, so as to reduce ambiguity in natural language understanding by user expectation handling.
BACKGROUND OF THE INVENTION
[002] In the current scenario, it is difficult to appropriately establish a conversation with a relatively high degree of freedom without consuming a vast amount of resources to correct the speech recognition by an Automatic Speech Recognition (ASR) in an interactive voice response system. The extant ASR process is typically a statistical system using a fixed vocabulary. This means that a word which does not exist in the system’s vocabulary may not be recognized correctly by the ASR. However, simply adding a word to the vocabulary is often not enough as this requires collecting and inputting a large volume of data to represent different factors contributing to the creation of speech signals, such as, for example, speaker identity, accent, emotional state, topic under discussion, and the language used in communication. The ASR also needs to be updated and trained to reflect how the particular word is typically used in context. Furthermore, the ASR process is not capable enough to correctly predict and identify a user’s intent from casual utterances of the user, and therefore, cannot satisfy the user’s implied expectation from an interaction session. Additionally, if a word is not frequently used, an ASR engine might misrecognize the word, favoring one that is statistically more likely to be spoken. These factors can reduce the accuracy with which the ASR engine recognizes any word.
[003] Furthermore, user utterances may include ambiguities, which may make processing the user utterances difficult or impossible and, as a result, transcriptions are likely to include results that are not remotely related to the user’s intention. In such cases, the system is inferred to be incapable of reducing the uncertainty and confusion associated with the user's utterance input that is possibly not in alignment with the intended use case.
[004] Such limitations in the ability of the ASR process to recognize a misalignment cost the user both time and energy, and also cost the system both time and operating expenses. Consequently, the user may not find the interface of the system user-friendly and may be reluctant to use it in the future, as the user needs to stay alert and keep his speech discrete throughout the interaction session, because continuous speech may typically result in higher incidences of recognition errors; the user will therefore need to intervene to provide clarity when being misunderstood. It is also possible that the user might be led in a wrong direction in the user journey, such that he loses the time and interest needed to be led back in the right direction.
[005] Furthermore, given the current scenario, in a dialogue management system or an IVR communication system, a Natural Language Understanding (NLU) pipeline of a user interface driving entity needs to be manually adapted for each and every classified case. Although this suffices on a small scale, it is hard to maintain on a larger scale. As a result, a plurality of cases is missed, and wrong adaptations are implemented, which harms the progressivity of a conversation.
[006] Therefore, a need exists for a flexible solution that can improve the transcription of the user utterance and avoid misleading directions in the user journey, so as to reduce ambiguity in natural language understanding by user expectation handling, correct incorrect text(s) associated with recognition errors without disrupting the session, and further determine and satisfy the user’s expectation, saving cost for both the operation and the user’s usage.
SUMMARY OF THE INVENTION
[007] This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
[008] Embodiments of the present invention are directed to a method and a system for dialogue management in a human-computer interaction by determining and satisfying a user’s expectation and improving transcription. The dialogue management method and system comprise receiving conversation data from user utterances corresponding to a user interaction session and processing the conversation data received to improve transcription and predict a user’s expectation in the interaction session. The dialogue management method and system include generating and analyzing a plurality of subwords and keywords corresponding to the processed conversation data and storing the plurality of subwords and keywords in a corresponding subword model and a keyword model respectively. The dialogue management method and system further include applying a confidence scoring model to the plurality of subwords and keywords for assigning and modifying a plurality of confidence scores in a plurality of expectation dictionaries and selecting from the plurality of subwords and/or keywords corresponding to the subword model and/or the keyword model in the dedicated expectation dictionary to provide an output with the subword and/or keyword with the highest confidence score.
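Read as a pipeline, the claimed method can be sketched as follows; every name is illustrative, and the models are assumed to expose an extract() method returning candidates with text and confidence attributes.

def handle_utterance(conversation_data, subword_model, keyword_model, dictionary):
    # Generate and analyze subwords/keywords from the processed conversation data.
    candidates = (subword_model.extract(conversation_data)
                  + keyword_model.extract(conversation_data))
    # Apply the confidence scoring model using the dedicated expectation dictionary.
    scored = [(c.text, c.confidence + dictionary.get(c.text, 0.0)) for c in candidates]
    # Output the subword and/or keyword with the highest confidence score.
    return max(scored, key=lambda pair: pair[1])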
[009] As a result, this invention can be used to improve transcription and to measure and reduce the uncertainty or confusion surrounding a user's input that is possibly not in alignment with the interaction session in a system. Furthermore, this invention helps in aligning the system expectations to extract intended information from and for the user. As a result, it further reduces ambiguity in the use-case corresponding to the user interaction session. This makes the interface more efficient and approachable. Moreover, it saves time and cost for both the system's operations and the user's usage.
[0010] Implementations may include one or more of the following features.
BRIEF DESCRIPTION OF DRAWINGS
[0011] Fig. 1 is a block diagram illustrating data flow between a user and an IVR communication system for identifying and satisfying user expectations.
[0012] Fig. 2A is a flowchart illustrating a process on how expectation dictionaries learn
[0013] Fig. 2B is a flowchart illustrating a process on how expectation dictionaries are updated.
[0014] Fig. 2C is a flowchart illustrating a process on how a transcription alternative is chosen.
DETAILED DESCRIPTION OF THE INVENTION
[0015] Described herein are methods and systems of dialogue management in an interactive voice response system and to satisfy a user's expectation and improve accuracy in transcription during a user interaction session in an interactive voice response system. The systems and methods are described with respect to figures and such figures are intended to be illustrative rather than limiting to facilitate explanation of the exemplary systems and methods according to embodiments of the invention.
[0016] The foregoing description of the specific embodiments reveals the general nature of the embodiments herein so that others can, by applying current knowledge, readily modify and/or adapt such specific embodiments for various applications without departing from the generic concept; therefore, such adaptations and modifications should be, and are intended to be, comprehended within the meaning and range of equivalents of the disclosed embodiments.
[0017] Also, it is noted that the embodiments may be described as a process which is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process is terminated when its operations are completed, but could have additional steps not included in the figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination corresponds to a return of the function to the calling function or the main function.
[0018] It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments herein have been described in terms of preferred embodiments, those skilled in the art will recognize that the embodiments herein can be practiced with modification within the spirit and scope of the appended claims.
[0019] As used herein, the term “network” refers to any form of a communication network that carries data and is used to connect communication devices (e.g. phones, smartphones, computers, servers) with each other. According to an embodiment of the present invention, the data includes at least one of processed and unprocessed data. Such data includes data which is obtained through automated data processing, manual data processing, or unprocessed data.
[0020] As used herein, the term “artificial intelligence” refers to a set of executable instructions stored on a server and generated using machine learning techniques.
[0021] Although the following description uses terms “first,” “second,” etc. to describe various elements, these elements should not be limited by the terms. These terms are only used to distinguish one element from another. For example, a first intent could be termed a second intent, and, similarly, a second intent could be termed a first intent, without departing from the scope of the various described examples.
[0022] Fig. 1 is a conceptual diagram illustrating an example framework for the overall data flow between a human and an exemplary Interaction voice response (IVR) communication system 100 for dialog management and for improving accuracy in identifying and satisfying user expectations in a user interaction session. In this example, the disclosed system includes a voice recognizing entity 102, an expectation handler 103, a dialogue engine 104, a Natural Language Understanding component 105 (referred to as NLU component 105 hereafter), a conversation history database 106, a conversation analysis component 107, an evaluation component 108, and an expectation dictionary component 109. The expectation dictionary component 109 further includes an expectation dictionary 110. The expectation dictionary 110 includes a subword model 111 and a keyword model 112.
[0023] According to an example embodiment of the present invention, the Interaction voice response (IVR) communication system 100 includes a plurality of expectation dictionaries 110.
[0024] Referring back to FIG. 1, a user 101 initiates a call to the IVR communication system 100 using a client device, not illustrated herein for simplicity. The client device may correspond to a wide variety of electronic devices. According to an example embodiment of the present invention, the client device is a smartphone, a feature phone, or any telecommunication device such as an ordinary landline phone. The client device acts as a service request means for inputting a user request.
[0025] The user 101 is routed to the voice recognizing entity 102 of the IVR communication system 100. The voice recognizing entity 102 corresponds to an Automatic Speech Recognition (referred to as ASR hereafter) module or a speech-to-text (referred to as STT hereafter) module. The ASR receives and translates the voice input signal from the user’s 101 utterance into a text output, which represents its best analysis of the words and extra dialog sounds spoken in the user’s 101 utterance. The voice recognizing entity 102 connects to the expectation handler 103. The expectation handler 103 receives translated data from the voice recognizing entity 102. The expectation handler 103 further connects to the expectation dictionary component 109. The expectation dictionary component 109 provides an Application Programming Interface (API) comprising a plurality of expectation dictionaries, such as the expectation dictionary 110, classified into categories dedicated to a plurality of fields such as, but not limited to, use cases, domains, user models, and interaction steps. The expectation dictionaries store subwords and keywords in corresponding subword models, such as the subword model 111, and keyword models, such as the keyword model 112, and also store semantics and pragmatics received from user utterances and corresponding conversation data. The expectation dictionaries help to pick an expected transcription of an entity value before it is transmitted to an entity extraction module such as the NLU component 105.
[0026] The expectation handler 103 is used to determine and identify an implied expectation of the user corresponding to the user utterance. The expectation handler 103 is further capable of determining and assigning a confidence score to each of the plurality of subwords and keywords stored in the plurality of expectation dictionaries in the expectation dictionary component 109.
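One plausible layout for the expectation dictionary component 109 is sketched below; this is an illustration, not the patented implementation, and all names are invented. It keeps the use-case, domain, user-model and interaction-step categories separate.

from collections import defaultdict

class ExpectationDictionaryComponent:
    """Illustrative container for the plurality of expectation dictionaries."""

    def __init__(self):
        # e.g. self.dictionaries[("user", "user-101")]["Sofie Muller"] -> weight
        self.dictionaries = defaultdict(dict)

    def dedicated(self, category, key):
        # category: "use_case" | "domain" | "user" | "interaction_step"
        return self.dictionaries[(category, key)]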
[0027] The expectation handler 103 further connects to the dialogue engine 104. The dialogue engine 104 drives the voice recognizing entity 102 and provides a user interface between the user and the services mainly by engaging in a natural language dialogue with the user. The dialogue may include questions requesting one or more aspects of a specific service, such as asking for information. In this manner the IVR communication system 100 may also receive general conversational queries and engage in a continuous conversation with the user through the dialogue engine 104. The dialogue engine 104 is further capable of switching domains and use-cases by recognising new intents, use-cases, contexts, and/or domains by the user during a conversation. The dialogue engine 104 keeps and maintains the dynamic structure of the user interaction session as the interaction unfolds. The context, as referred to herein, is the collection of words and their meanings and relations, as they have been understood in the current dialogue in the user interaction session.
[0028] The dialogue engine 104 further connects to the NLU component 105. The NLU component 105 receives input from the dialogue engine 104 and translates the natural language input into machine-readable information. The NLU component 105 determines and generates transcribed context, intent, use-cases, entities, and metadata of the conversation with the user 101. The NLU component 105 also uses natural language processing to determine the use-case from the user’s utterance as conversational input.
[0029] The dialogue engine 104 further connects to the conversation history database 106. The dialogue engine 104 drives the interaction with the user 101 and concurrently inputs latest updates, and events associated with the user interaction session into the conversation history database 106. The conversation history database 106 stores raw conversation data received from the dialogue engine 104 that can be further processed by the conversation analysis component 107. The conversation history database 106 may be stored in any suitable location, for example locally on the IVR communication system 100, or remotely in a cloud computing server.
[0030] The conversation analysis component 107 analyses the raw conversation data associated with the user interaction session received from the conversation history database 106 for each type of dictionary, i.e. at least one of the plurality of use-case, domain, user, and interaction-step expectation dictionaries, for example. The conversation analysis component 107 is further capable of evaluating the raw conversation data for each subword and keyword using the evaluation model component 108. The evaluation model component 108 stores one or a plurality of rules and evaluation and statistical models to provide a positive or negative indication for each of the plurality of subwords and keywords received in the corresponding raw conversation data. The evaluation model component 108 may also employ deep learning on the information received from the data to train, develop and update the evaluation and statistical models.
[0031] The conversation analysis component 107 is further capable of assigning weights to each of the plurality of subwords and keywords received in the corresponding raw conversation data. For example, in a rule-based scenario, the weights are assigned to the subword and keyword transcription alternatives by the conversation analysis component 107 as follows: when a chosen alternative of a subword and/or a keyword results in a positive flow or follows a “happy” path in a conversation with the user, the weight of that alternative in the expectation dictionaries in the expectation dictionary component 109, corresponding to the user interaction session, is increased by a reward value. An exemplary scenario for a positive flow or a “happy” path in a conversation includes, but is not limited to, deriving and executing the implied expectations and dialogues and, as a result, performing the expected service and satisfying the user intent successfully within an interaction session associated with the user, such as the user 101.
[0032] In the same manner, if the conversation with the user results in a negative outcome or follows an “unhappy” path, the weight of the alternative of the subword and/or the keyword transcription corresponding to the user interaction session is decreased by a penalty value. An exemplary scenario for a negative flow or an “unhappy” path in a conversation includes, but is not limited to, producing significant errors and being unable to derive and execute the implied expectations and dialogues and, as a result, failing to perform the expected service and to satisfy the user intent successfully within an interaction session associated with the user, such as the user 101.
[0033] According to an example embodiment of the present invention, the reward value and the penalty value are statically defined values.
[0034] According to an example embodiment of the present invention, the reward value and the penalty value have the same value.
[0035] According to an example embodiment of the present invention, because negative outcomes or the “unhappy” path in a conversation have a higher influence on the interaction session, it is more suitable to have a higher penalty value compared to the reward value in order to decrease the weight of incorrect alternatives faster and by a significant amount.
[0036] According to an example embodiment of the present invention, a new weight assigned to a subword and/or a keyword is calculated as follows:
New weight = min(weight + reward, maximal weight)
New weight = max(weight - penalty, minimal weight)
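In code, the two equations are a pair of clamped updates; the constants below anticipate the worked example that follows, and the function names are illustrative.

REWARD, PENALTY = 0.05, 0.05
MIN_WEIGHT, MAX_WEIGHT = -0.5, 0.9

def reward_weight(weight):
    # The weight grows by the reward but is capped at the maximal weight.
    return min(weight + REWARD, MAX_WEIGHT)

def penalize_weight(weight):
    # The weight shrinks by the penalty but never falls below the minimal weight.
    return max(weight - PENALTY, MIN_WEIGHT)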
[0037] Below is an exemplary scenario including an empty expectation dictionary and the first turn of a conversation that is analyzed using arbitrary values.
[0038] For example, the transcription alternatives for the first turn in an expectation dictionary include:
"text": "I would like to perform",
"confidence": 0.75
"text": "I would like to perform a transaction",
"confidence": 0.72
[0039] According to the above example embodiment of the present invention, the dedicated expectation dictionary is empty. As a result, the corresponding weights do not change. So the first alternative, “I would like to perform”, is selected by the voice recognizing entity 102 because “I would like to perform” has the highest confidence score. However, according to an example embodiment of the present invention, this leads to a fallback because the transcription does not include the intention of the user. Therefore, the IVR communication system 100 fails to satisfy the user expectation and has to recover the conversation. Since “I would like to perform” led to a negative outcome, the transcription is inserted into the dedicated expectation dictionary and its weight is calculated by the conversation analysis component 107 using the aforementioned equation:
weight = max(weight - penalty, minimal weight)
[0040] According to the above example embodiment of the present invention, the penalty value is predetermined to be 0.05 (the same as the reward value), the minimal weight is -0.5 and the maximal weight is 0.9.
weight = max(0 - 0.05, -0.5)
weight = max(-0.05, -0.5)
weight = -0.05
[0041] The dedicated expectation dictionary then reads:
"I would like to perform"
New weight: -0.05
[0042] According to the above example embodiment of the present invention, "I would like to perform a transaction" is also added to the expectation dictionary as follows:
weight = min(weight + reward, maximal weight)
weight = min(0 + 0.05, 0.9)
weight = min(0.05, 0.9)
New weight = 0.05
[0043] The dedicated expectation dictionary then reads:
"I would like to perform": -0.05
"I would like to perform a transaction": 0.05
[0044] The expectation dictionaries improve over time by creating a set of positive and negative weights for transcription alternatives.
[0045] The new weight is then inserted into the dedicated expectation dictionary in the expectation dictionary component 109, replacing the old weight corresponding to the subword or keyword alternative. The “min” and “max” operations, together with the “maximal weight” and “minimal weight” values, specify a fixed limit on how large the weight can grow in both the positive and negative directions.
[0046] Referring back to the expectation handler 103, it can be used to re-rank the subwords, keywords, or groups of words and/or words associated with semantics and pragmatics in a list, in the dedicated expectation dictionaries corresponding to the expectation dictionary component 109, using a plurality of confidence scores.
[0047] The confidence score can be associated with a word, a subword, a keyword, or a group of words and/or words associated with semantics and pragmatics. The confidence score signifies how confident the expectation handler 103 is in the identified subwords and/or keywords or words stored in their respective dedicated expectation dictionaries in the expectation dictionary component 109, including, for example, subwords and/or keywords or words that were previously identified in speech utterances in previous user interaction sessions.
[0048] According to an example embodiment of the present invention, the expectation handler 103 generates a confidence score on a scale from X to Y corresponding to the assigned weight of a subword and/or keyword or a word, where a confidence score of Y means that the expectation handler 103 is very confident that the subword and/or keyword or the word was recognized and used in the context correctly, and a confidence score of X means that the expectation handler 103 could not confidently and correctly recognize and use the subword and/or keyword or the word.
[0049] Previously learned knowledge of which transcription alternatives of a subword and/or keyword or a word had a positive or negative effect on the progress of the conversation is used for the re-ranking by the expectation handler 103.
[0050] According to an example embodiment of the present invention, the expectation handler 103 increases the confidence score of an alternative of a subword and/or keyword or a word with a positive weight, and decreases the confidence score of an alternative of a subword and/or keyword or a word with a negative weight, corresponding to an entity which previously had a negative effect on the conversation in the user interaction session. It is to be appreciated that the expectation handler 103 does not remove any transcription alternative from the list comprising subword and/or keyword models, word models, and group-of-words models; only the corresponding confidence score is adjusted as per the weight.
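As a non-limiting sketch (hypothetical names; Python assumed, not part of the original disclosure), the re-ranking described above — adjusting each alternative's confidence score by its learned weight without removing any alternative — may look as follows:

```python
# Illustrative sketch only: re-rank transcription alternatives using learned weights.
# No alternative is removed; only its confidence score is adjusted before sorting.

def rerank(alternatives: list[dict], expectation_dictionary: dict[str, float]) -> list[dict]:
    for alt in alternatives:
        weight = expectation_dictionary.get(alt["text"], 0.0)  # unseen text -> unchanged
        alt["confidence"] = alt["confidence"] + weight
    return sorted(alternatives, key=lambda a: a["confidence"], reverse=True)

alternatives = [
    {"text": "I would like to perform", "confidence": 0.75},
    {"text": "I would like to perform a transaction", "confidence": 0.72},
]
dictionary = {
    "I would like to perform": -0.05,
    "I would like to perform a transaction": 0.05,
}
best = rerank(alternatives, dictionary)[0]
# best["text"] == "I would like to perform a transaction" at adjusted score 0.77
```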
[0051] Fig. 2A is a flowchart 200 illustrating steps describing how the expectation dictionaries learn from a human-computer conversation, in the context of the IVR communication system 100 described in Fig. 1, in accordance with one or more aspects of the present invention.
[0052] At step 201, the process starts with a user, such as the user 101, initiating a call, for example, to the IVR communication system 100.
[0053] In the next step, at step 202, the IVR communication system 100 establishes a user interaction session with the user 101. The interaction session includes, for example, but is not limited to, the user 101 inputting service requests, such as requests to receive information and to initiate various processes that may be part of the service requests.
[0054] In the next step, at step 203, the IVR communication system 100 receives user input from the user 101 in the form of a plurality of speech utterances comprising a plurality of subwords and keywords.
[0055] In the next step, at step 204, the IVR communication system 100 recognizes the entity of the plurality of speech utterances comprising the plurality of subwords and keywords, spoken by the user 101 in the interaction session, for processing and generating a response to the user based on the user speech utterances data.
[0056] In the next step, at step 205, the IVR communication system 100 processes the plurality of speech utterances comprising the plurality of subwords and keywords spoken by the user 101. Furthermore, in the next step, at step 206, while driving the user interaction session according to the use case, the IVR communication system 100 also records and concurrently stores inputs received in the user interaction session, such as, for example, latest updates, events, raw conversation data and content associated with the user 101, in the conversation history database 106.
[0057] In the next step, at step 207, the IVR communication system 100 identifies the raw conversation data and extracts the plurality of subwords and keywords spoken by the user 101 and updates and maintains a plurality of subword models and keyword models including, but not limited to, transcription alternatives comprising the plurality of subwords and keywords that are derived from the raw conversation data. The plurality of subword models and keyword models are stored correspondingly in their respective dedicated expectation dictionaries. The IVR communication system 100 further determines and assigns a weight associated with each of the plurality of subwords and keywords, the details of which are explained later in the specification, and stores them correspondingly in their respective dedicated expectation dictionaries.
[0058] In the next step, at step 208, the process ends with the user terminating the call to the IVR communication system 100, for example.
[0059] Fig. 2B is a flowchart 250 illustrating steps describing how the expectation dictionaries are updated from a human-computer conversation, in the context of the IVR communication system 100 described in Fig. 1, in accordance with one or more aspects of the present invention.
[0060] At step 251, the process starts with a user, such as the user 101, initiating a call, for example, to the IVR communication system 100.
[0061] In the next step, at step 252, the IVR communication system 100 establishes and maintains a user interaction session with the user 101.
[0062] In the next step, at step 253, the IVR communication system 100 receives user input from the user 101 in the form of a plurality of speech utterances comprising a plurality of subwords and keywords.
[0063] In the next step, at step 254, the IVR communication system 100 recognizes the entity of the plurality of speech utterances comprising the plurality of subwords and keywords spoken by the user 101 in the interaction session.
[0064] In the next step, at step 255, the IVR communication system 100 processes the plurality of speech utterances comprising the plurality of subwords and keywords spoken by the user 101 in the interaction session.
[0065] In the next step, at step 256, the IVR communication system 100 records and concurrently stores inputs received in the user interaction session including, but not limited to, latest updates, events, raw conversation data and content associated with the user 101 in the conversation history database 106. The IVR communication system 100 then identifies the raw conversation data and extracts and stores the plurality of subwords and keywords spoken by the user 101.
[0066] In the next step, at step 257, the IVR communication system 100 analyzes the plurality of subwords and keywords in the speech utterances, spoken by the user, using a set of rules or an evaluation model corresponding to one or more dedicated expectation dictionaries for each of the plurality of subwords and keywords.
[0067] In the next step, at step 258, the IVR communication system 100 determines and evaluates a positive or negative indication for each of the plurality of subwords and keywords using the set of rules or the evaluation model.
[0068] In the next step, at step 259, the IVR communication system 100 assigns a weight associated with each of the plurality of subwords and keywords or modifies the weight associated with one or more previously transcribed subwords and keywords with a reward or penalty value based on the evaluation. The previously learnt subwords and keywords are identified and acquired from various past user interaction sessions and are stored in their respective dedicated expectation dictionaries. In an exemplary scenario, it can be assumed that the word “balance” is recognized for 30% of the users in 50% of the cases with the highest confidence as “lance” or “ball”. However, “lance” and “ball” have no independent meaning in the corresponding use case and, therefore, would not lead to a positive flow or progress in the conversation with the IVR communication system 100. As a result, “lance” and “ball” would be assigned negative weights.
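A minimal rule-based evaluation along these lines might be sketched as follows (an assumed simplification, not the original disclosure; a deployed evaluation model component 108 may use statistical or deep-learning models rather than a keyword list):

```python
# Illustrative sketch only: flag an alternative as positive when it carries meaning
# in the use case (approximated here by a hypothetical keyword list), and negative
# otherwise, as with "lance"/"ball" standing in for the intended word "balance".

USE_CASE_KEYWORDS = {"balance", "transaction", "transfer"}  # assumed vocabulary

def indication(alternative: str) -> str:
    words = alternative.lower().split()
    return "positive" if any(w in USE_CASE_KEYWORDS for w in words) else "negative"

print(indication("check my balance"))  # positive -> weight increased by the reward
print(indication("check my lance"))    # negative -> weight decreased by the penalty
```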
[0069] In the next step, at step 260, the IVR communication system 100 calculates and adjusts a corresponding confidence score for each of the plurality of subwords and keywords, as part of a re-ranking process, using the set of rules or an evaluation model. The confidence score of a transcription alternative of a subword and/or keyword with a positive weight is increased, and the confidence score of a transcription alternative of a subword and/or keyword with a negative weight is decreased.
[0070] According to an example embodiment, a simple rule-based equation of how the weights are used to adjust a confidence score of a transcription alternative, as part of the adjustment and re-ranking process, is illustrated below. For each transcription alternative of a subword and/or keyword or a word, or a group of words, the IVR communication system 100 checks whether it has a weight in its respective dedicated expectation dictionary, and then calculates the confidence score according to the equation illustrated below:
New confidence score = Old confidence score + weight
where the old confidence score is generated by the voice recognizing entity 102.
[0071] In the next step, at step 261, the IVR communication system 100 determines and identifies if the new confidence score for the corresponding transcription alternative comprising the subword, or keyword, or a word, or a group of words surpasses a first threshold confidence score. The first threshold confidence score is predetermined automatically or by an administrator of the IVR communication system 100. The first threshold confidence score is further capable of being adjusted per use case and/or per positive or negative outcome of a conversation.
[0072] According to an example embodiment of the present invention, the IVR communication system 100 is configured with a function or application for adjusting the threshold score to further improve the accuracy of detecting the transcription alternatives. The function for adjusting the threshold score is configured for each subword model and/or keyword model.
[0073] If, at step 261, the IVR communication system 100 determines that the new confidence score corresponding to the transcription alternative surpasses the first threshold confidence score, then, in the next step, at step 262, the IVR communication system 100 updates the dedicated expectation dictionary and re-ranks the transcription alternative accordingly. The process then ends at step 263.
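A non-limiting sketch of the threshold gate of steps 261-264 follows (the names and the threshold value are assumptions for illustration; in practice the threshold is predetermined per use case as described above):

```python
# Illustrative sketch only: after an outcome-based weight update, the dictionary is
# updated (and the alternative re-ranked downstream) only when the adjusted
# confidence score surpasses the first threshold confidence score.

FIRST_THRESHOLD = 0.5  # assumed value for the sketch

def maybe_update(dictionary: dict[str, float], text: str,
                 old_confidence: float, new_weight: float) -> bool:
    new_confidence = old_confidence + new_weight
    if new_confidence > FIRST_THRESHOLD:
        dictionary[text] = new_weight  # step 262: update and re-rank
        return True
    return False  # step 264: dictionary not updated, no re-ranking

dictionary: dict[str, float] = {}
print(maybe_update(dictionary, "I would like to perform", 0.75, -0.05))  # True (0.70 > 0.5)
```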
[0074] According to an exemplary scenario, it can be assumed that a user, such as the user 101, by the name “Sofie Müller” makes a call to the IVR communication system 100, and the transcription alternative with the highest confidence is “Sophie Miller”, while “Sofie Müller” is, for example, the second or third highest-ranking alternative based on the confidence scores. The “i” instead of the “ü” is detected and identified in an earlier verification step. However, “Sophie” instead of “Sofie” is less likely to be noticed by the user in a voice-only interaction mode, and a correction will need to be made in another, later attempt. A common scenario is that the system would choose another alternative if “Sophie Miller” was rejected, but would not remember this rejection in the next user interaction session conversations. When, for example, a direct database query is made without an earlier verification step, the query will likely fail or a different person may be found, which would jeopardize later user authentication steps. However, when using the dedicated Expectation Dictionary, such as a user-specific dictionary, “Sophie Miller” would be assigned a negative weight and, following the resulting lower confidence score, would be re-ranked by the Expectation Handler 103 when “Sofie” calls the next time. For example, in the IVR communication system 100, “Sophie Miller” is initially assigned the highest confidence score, but following the re-ranking of the transcription alternatives, “Sophie Miller” will no longer be the transcription alternative with the highest confidence score; “Sofie Müller” will be the transcription alternative with the highest confidence score. Therefore, the verification will be successful and no correction will have to be made by the user. The transcription alternative “Sofie Müller” will then be assigned a positive weight, which will increase the likelihood that it will be selected as the best transcription alternative the next time. The re-ranking performed by the IVR communication system 100 uses previously learned knowledge in the form of the plurality of Expectation Dictionaries to increase the likelihood that a transcription alternative is selected that leads to a positive outcome (like progress in the dialogue flow) over a transcription alternative that has led to negative outcomes in the past.
[0075] Similarly, in yet another exemplary scenario, the same rule is applied to all the inputs that a user, such as the user 101, makes. For example, if the user 101 has spoken “420” but the IVR communication system 100 misrecognized it as “42”, the user 101 then corrects the IVR communication system 100. As a result, the transcription alternative “42” would be assigned a negative weight, and inputs that are verified by the user and surpass the corresponding first threshold confidence score would be assigned positive weights. The IVR communication system 100 thus learns the kind of inputs that are likely for the user 101 and the kind that are not, and would correspondingly re-rank the transcription alternatives.
[0076] Similarly, in yet another exemplary scenario, the same rule is applied for lowering the likelihood of a partial transcription being chosen. For example, in a scenario where the user 101 has said, “I would like to perform a transaction” but the transcription alternative with the highest confidence score identified by the IVR communication system 100 comprises only a partial transcription with the most important part missing, the usage of expectation dictionaries helps prevent such alternatives from being chosen in future user interaction sessions. It can be assumed that, in the IVR communication system 100, the transcription alternatives list reads as follows:
"text": "I would like to perform", confidence": 0.75 text": "I would like to perform a transaction", confidence": 0.72
[0077] As per the rule, the first transcription is chosen for further processing because it has the highest confidence score. However, the word that is actually relevant for the intent classification of the user, i.e. “transaction”, is missing in that transcription, whereas it is present in the second transcription with a lower confidence score; the voice recognizing entity 102 was therefore unable to understand the user. As a result, “I would like to perform” results in a negative outcome/fallback in the conversation with the user 101 because the dialogue engine 104 is unable to extract the intent of the user. Therefore, “I would like to perform” is assigned a negative weight and the confidence score is updated. The confidence score surpasses a corresponding first threshold confidence score and, therefore, the transcription alternatives are re-ranked accordingly in the one or more dedicated expectation dictionaries by the expectation handler 103 in the IVR communication system 100. After re-ranking the transcription alternatives, in further future interaction sessions, for example with a possibly different user, the confidence score of “I would like to perform” will be lower than the confidence score of “I would like to perform a transaction” and, as a result, the transcription that leads to a successful extraction of the user’s intent is chosen.
[0078] For example, with the penalty value predetermined to be 0.05 (the same as the reward value), the minimal weight -0.5 and the maximal weight 0.9, the weight for the first transcription is calculated as follows:
weight = max(0 - 0.05, -0.5)
weight = max(-0.05, -0.5)
weight = -0.05
[0079] The dedicated expectation dictionary then reads for the following transcription:
"I would like to perform"
New weight: -0.05
Before the expectation dictionary has been updated, the transcription alternatives and their confidence scores read as follows:
"text": "I would like to perform",
Confidence score: 0.75
"text": "I would like to perform a transaction" Confidence score: 0.72
[0080] After the dedicated expectation dictionary has been updated, the transcription alternatives and their confidence scores change as follows:
"I would like to perform", confidence score: 0.75
=> confidence score = 0.75 - 0.05
=> confidence score = 0.70
[0081] The dedicated expectation dictionary for the following transcriptions then reads:
"text": "I would like to perform"
Confidence score: 0.7
"text": "I would like to perform a transaction",
Confidence score: 0.72
[0082] Therefore, the voice recognizing entity 102 then selects the second alternative, “I would like to perform a transaction”, because this transcription alternative now has a higher confidence score after the modification applied by the expectation handler 103 using the dedicated expectation dictionary 110 from the expectation dictionary component 109. This leads to a positive outcome, i.e. progress in the conversation, because the IVR communication system 100 does not have to ask again for clarification.
[0083] “I would like to perform a transaction” is added to the dedicated expectation dictionary as follows:
weight = min(weight + reward, maximal weight)
weight = min(0 + 0.05, 0.9)
weight = min(0.05, 0.9)
weight = 0.05
[0084] The corresponding weights in the dedicated expectation dictionary read as follows:
"I would like to perform": -0.05
"I would like to perform a transaction": 0.05 [0085] The expectation handler 103 changes initial confidence scores, according to yet another an example embodiment of the present invention, using the weights from the expectation dictionary from:
"text": "I would like to perform"
Confidence score: 0.80
"text": "I would like to perform a transaction"
Confidence score: 0.77
[0086] To:
"text": "I would like to perform"
Confidence score: 0.75
"text": "I would like to perform a transaction"
Confidence score: 0.82
[0087] Similarly, when the IVR communication system 100 requests a plurality of pieces of information, and not all of them are included in the list of transcription alternatives with the highest confidence scores, the IVR communication system 100 learns, over time, to choose transcription alternatives containing all the needed information; for example, choosing the correct alternative “6 6 3 9 9” over the alternative “6 6 3 9” when a corresponding postcode is asked for.
[0088] Referring back to Fig. 2B, if at step 261 the IVR communication system 100 determines that the new confidence score corresponding to the transcription alternative does not surpass the first threshold confidence score, then, in the next step, at step 264, the dedicated expectation dictionary is not updated and the transcription alternative is not re-ranked in the list of corresponding subword models, keyword models, word models, or group-of-words models. The process then ends at step 265.
[0089] Fig. 2C is a flowchart 270 illustrating steps describing how the best transcription alternative is chosen in a human-computer conversation, in the context of the IVR communication system 100 described in Fig. 1, in accordance with one or more aspects of the present invention.
[0090] At step 271, the process starts with a user, such as the user 101, in a call or initiating a call, for example, to the IVR communication system 100.
[0091] In the next step, at step 272, the IVR communication system 100 establishes a user interaction session with the user 101. The interaction session includes, for example, but is not limited to, the user 101 inputting service requests, such as requests to receive information and to initiate various processes that may be part of the service requests.
[0092] In the next step, at step 273, the IVR communication system 100 receives user input from the user 101 in the form of a plurality of speech utterances comprising a plurality of subwords and keywords.
[0093] In the next step, at step 274, the IVR communication system 100 recognizes the entity of the plurality of speech utterances comprising the plurality of subwords and keywords, spoken by the user 101 in the interaction session.
[0094] In the next step, at step 275, the IVR communication system 100 uses the expectation handler 103 to identify and choose the transcription alternative with the highest confidence score from the dedicated expectation dictionary to configure a response to the user's request.
[0095] The IVR communication system 100 then generates a response to the user’s request. The response to the user comprises the transcription alternative corresponding to the highest confidence score from the dedicated expectation dictionary. The process then ends at step 276. The plurality of expectation dictionaries are equipped with at least a spelling error correction portion and a vocabulary error correction portion. The step of processing the identified errors by substituting them with predetermined keywords and/or subwords extracted from the dedicated expectation dictionary is executed in the background, without causing any interruption in the execution of the interaction session, in order to minimize delay.
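A compact, non-limiting sketch of this selection step follows (hypothetical names; the weights are assumed to have been learned as described for Fig. 2B, and the sketch is not part of the original disclosure):

```python
# Illustrative sketch only: choose the best transcription alternative by adjusting
# each ASR confidence score with its learned expectation-dictionary weight.

def choose_best(alternatives: list[dict], dictionary: dict[str, float]) -> str:
    scored = [(alt["confidence"] + dictionary.get(alt["text"], 0.0), alt["text"])
              for alt in alternatives]
    return max(scored)[1]  # text of the highest adjusted score

alternatives = [
    {"text": "I would like to perform", "confidence": 0.80},
    {"text": "I would like to perform a transaction", "confidence": 0.77},
]
weights = {"I would like to perform": -0.05, "I would like to perform a transaction": 0.05}
print(choose_best(alternatives, weights))  # "I would like to perform a transaction"
```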
[0096] Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more examples. In the preceding description, numerous specific details were provided, such as examples of various configurations to provide a thorough understanding of examples of the described technology. One skilled in the relevant art will recognize, however, that the technology can be practiced without one or more of the specific details, or with other methods, components, devices, etc. In other instances, well-known structures or operations are not shown or described in detail to avoid obscuring aspects of the technology.
[0097] Although the subject matter has been described in language specific to structural features and/or operations, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features and operations described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. Numerous modifications and alternative arrangements can be devised without departing from the spirit and scope of the described technology.

Claims

We claim:
1. A speech recognition system (100) for processing of natural language conversation into text during a human-computer interaction, the speech recognition system (100) for processing of natural language conversation into text during a human-computer interaction comprising:
a. a voice recognizing entity (102), the voice recognizing entity (102) receives and recognizes user speech input signals;
b. a dialogue engine (104), the dialogue engine (104) handles voice-based interactions with the user in a user interaction session and receives conversation data;
c. a natural language understanding (NLU) component (105), the natural language understanding component (105) processes the received conversation data and generates a plurality of subwords and keywords corresponding to the processed conversation data;
d. an expectation handler (103), the expectation handler (103) determines and identifies an implied expectation of the user corresponding to the processed conversation data from the user utterance;
e. a conversation analysis component (107), the conversation analysis component (107) analyzes the plurality of subwords and keywords corresponding to the processed conversation data, assigns a confidence score for each of the plurality of subwords and keywords in a plurality of expectation dictionaries (110);
f. an expectation dictionary component (109), the expectation dictionary component (109) stores the plurality of subwords and keywords in a corresponding subword model (111) and/or a keyword model (112) respectively in the corresponding expectation dictionaries (110);
the voice recognizing entity (102) selects from the plurality of subwords and/or keywords corresponding to the subword model (111) and/or the keyword model (112) in the dedicated expectation dictionary (110) included in the expectation dictionary component (109), and provides an output with the subword and/or keyword with the highest confidence score to satisfy the implied expectation of the user in the user interaction session.
2. The speech recognition system (100) for processing of natural language conversation into text during a human-computer interaction, as claimed in claim 1, wherein the dialogue engine (104) navigates and maintains user interaction sessions in accordance with the implied expectation of the user corresponding to the data received from the voice recognizing entity (102).
3. The speech recognition system (100) for processing of natural language conversation into text during a human-computer interaction, as claimed in claim 1, wherein the expectation handler (103) updates the subword model (111) and/or the keyword model (112) by modifying the confidence scores in the dedicated expectation dictionary (110) in case of failing to correctly determine and identify the implied expectation of the user; and wherein the expectation handler (103) modifies the confidence scores associated with the plurality of subwords and keywords with respect to the first threshold score in the expectation dictionary component (109).
4. The speech recognition system (100) for processing of natural language conversation into text during a human-computer interaction, as claimed in claim 1, wherein the conversation analysis component (107) is further capable of accessing a conversation history database (106); and wherein the conversation history database (106) stores latest updates and events received from the conversation data.
5. The speech recognition system (100) for processing of natural language conversation into text during a human-computer interaction, as claimed in claim 1, wherein the conversation analysis component (107) uses a set of rules and/or an evaluation model from an evaluation model component (108) to determine a positive or negative indication of a subword and/or a keyword based on the dedicated expectation dictionary (110).
6. The speech recognition system (100) for processing of natural language conversation into text during a human-computer interaction, as claimed in claim 1, wherein the conversation analysis component (107) assigns a weight to each of the plurality of subwords and keywords for at least one of assigning and modifying the corresponding confidence scores using the evaluation model component (108).
7. The speech recognition system (100) for processing of natural language conversation into text during a human-computer interaction, as claimed in claim 1, wherein the conversation analysis component (107), with the help of the evaluation model component (108):
a. increases the weight of a subword or a keyword by a reward value in case of a positive outcome of the user interaction session; and
b. decreases the weight of a subword or a keyword by a penalty value in case of a negative outcome of the user interaction session.
8. The speech recognition system (100) for processing of natural language conversation into text during a human-computer interaction, as claimed in claim 1, wherein the expectation handler (103) stores and re-ranks the plurality of subwords and keywords in the respective subword model (111) and keyword model (112) based on modified confidence scores in the dedicated expectation dictionary (110) included in the expectation dictionary component (109).
9. The speech recognition system (100) for processing of natural language conversation into text during a human-computer interaction, as claimed in claim 8, wherein the expectation handler (103) increases the confidence score for a subword or keyword in the corresponding expectation dictionary (110) in case of a positive indication of the subword or keyword; and decreases the confidence score for a subword or keyword in the corresponding expectation dictionary (110) in case of a negative indication of the subword or keyword.
10. The speech recognition system (100) for processing of natural language conversation into text during a human-computer interaction, as claimed in claim 3, wherein the expectation handler (103) detects and identifies errors and/or ambiguity associated with keywords and/or subwords in the user utterance; and adjusts the identified errors and/or ambiguity by substituting with predetermined keywords and/or subwords extracted from the dedicated expectation dictionary (110).
11. The speech recognition system (100) for processing of natural language conversation into text during a human-computer interaction, as claimed in claim 3, wherein the conversation analysis component (107) assigns a negative indication to the identified errors.
12. The speech recognition system (100) for processing of natural language conversation into text during a human-computer interaction, as claimed in claim 1, wherein each of the plurality of expectation dictionaries (110) in the expectation dictionary component (109) is equipped with at least a spelling error correction portion and a vocabulary error correction portion.
13. The speech recognition system (100) for processing of natural language conversation into text during a human-computer interaction, as claimed in claim 1, wherein the expectation handler (103) selects and obtains input from the dedicated expectation dictionary (110) to re-rank and adjust the output of the voice recognizing entity (102) in order to increase the likelihood of satisfying the user expectation in the user interaction session.
14. A method for processing natural language conversation in a human-computer interaction into text, the method for processing natural language conversation in the human-computer interaction into text comprising the steps of:
a. receiving conversation data from user utterances corresponding to each user interaction session;
b. processing the conversation data received;
c. generating a plurality of subwords and keywords corresponding to the processed conversation data;
d. analyzing the plurality of subwords and keywords corresponding to the processed conversation data;
e. storing the plurality of subwords and keywords in a corresponding subword model (111) and a keyword model (112) respectively;
f. applying a confidence scoring model to the plurality of subwords and keywords for at least one of assigning and modifying a plurality of confidence scores in a plurality of expectation dictionaries (110); and
g. selecting from the plurality of subwords and/or keywords corresponding to the subword model (111) and/or the keyword model (112) in the dedicated expectation dictionary (110), and providing an output with the subword and/or keyword with the highest confidence score and a voice recognizing entity (102) picking the subword and/or keyword from a plurality of expectation dictionaries (110) with the highest confidence score.
15. The method for processing natural language conversation in a human-computer interaction into text, as claimed in claim 14, wherein the step of assigning and modifying a plurality of confidence scores corresponding to the plurality of subwords and keywords in a plurality of expectation dictionaries further comprises the steps of:
a. determining and identifying an implied expectation of the user from the analyzed conversation data corresponding to the user utterance; and
b. selecting and providing with a dedicated expectation dictionary associated with the implied expectation of the user.
16. The method for processing natural language conversation in a human-computer interaction into text, as claimed in claim 14, wherein the method for processing real-time natural language conversation further comprises the step of:
a. at least one of navigating and maintaining user interaction sessions corresponding to the implied expectation of the user.
17. The method for processing natural language conversation in a human-computer interaction into text, as claimed in claim 14, wherein the implied expectation constitutes an intent of the user.
18. The method for processing natural language conversation in a human-computer interaction into text, as claimed in claim 14, wherein the method for processing of the conversation data further comprises the steps of:
a. updating the subword model and the keyword model by modifying the confidence scores in the dedicated expectation dictionary in case of failing to correctly determine and identify the implied expectation of the user; and
b. modifying the confidence scores associated with the plurality of subwords and keywords based on the degree of conformity with respect to the first threshold score.
19. The method for processing natural language conversation in a human-computer interaction into text, as claimed in claim 14, wherein the method for processing conversation data further comprises the step of:
a. at least one of updating and maintaining the plurality of expectation dictionaries with semantics, pragmatics and the subwords and keywords corresponding to the analyzed conversation data when the confidence score surpasses a first threshold score.
20. The method for processing natural language conversation in a human-computer interaction into text, as claimed in claim 14, wherein the plurality of expectation dictionaries (110) are classified into categories dedicated to at least use cases, domains, user models, and interaction steps.
21. The method for processing natural language conversation in a human-computer interaction into text, as claimed in claim 14, wherein the step of assigning and modifying a plurality of confidence scores to the plurality of subwords and keywords in a plurality of expectation dictionaries (110) further comprises the step of:
a. accessing a conversation history database.
22. The method for processing natural language conversation in a human-computer interaction into text, as claimed in claim 21, wherein the method for processing conversation data further comprises the steps of:
a. analyzing conversation history data from the conversation history database for each category of expectation dictionary dedicated to use case, domain, user model, and/or interaction step; and
b. using a set of rules and/or an evaluation model to determine a positive or negative indication of a subword and/or a keyword based on the dedicated expectation dictionary.
23. The method for processing natural language conversation in a human-computer interaction into text, as claimed in claim 14, wherein the step of applying the confidence scoring model to the plurality of subwords and keywords for at least one of assigning and modifying a plurality of confidence scores in a plurality of expectation dictionaries further comprises the step of:
a. assigning a weight to each of the plurality of subwords and keywords for at least one of assigning and modifying the corresponding confidence scores.
24. The method for processing natural language conversation in a human-computer interaction into text, as claimed in claim 23, wherein the step for assigning a weight to each of the plurality of subwords and keywords for modifying the corresponding confidence scores further comprises the steps of:
a. increasing the weight of a subword or a keyword by a reward value in case of a positive outcome of the user interaction session; and
b. decreasing the weight of a subword or a keyword by a penalty value in case of a negative outcome of the user interaction session.
25. The method for processing natural language conversation in a human-computer interaction into text, as claimed in claim 14, wherein the method for processing conversation data further comprises the steps of:
a. storing each subword and keyword corresponding to user utterance from the user interaction sessions; and
b. re-ranking the plurality of subwords and keywords in the respective subword models and keyword model based on modified confidence scores.
26. The method for processing natural language conversation in a human-computer interaction into text, as claimed in claim 24, wherein the reward value and penalty value are statically defined values predetermined automatically or by an administrator.
27. The method for processing natural language conversation in a human-computer interaction into text, as claimed in claim 14, wherein the method for processing conversation data further comprises the step of:
a. generating a new expectation dictionary in case of a new user and/or a new use-case/domain.
28. The method for processing natural language conversation in a human-computer interaction into text, as claimed in claim 24, wherein the positive indication of a subword or keyword includes adopting and/or incrementing the confidence score for the keyword in the corresponding expectation dictionary.
29. The method for processing natural language conversation in a human-computer interaction into text, as claimed in claim 24, wherein the negative indication of a subword or keyword includes decrementing the confidence score for the keyword in the corresponding expectation dictionary.
30. The method for processing natural language conversation in a human-computer interaction into text, as claimed in claim 14, wherein the method for processing conversation data further comprises the steps of:
a. detecting and identifying errors and/or ambiguity associated with keywords and/or subwords in the user utterance received from the analyzed conversation data; and
b. adjusting the identified errors and/or ambiguity by substituting with predetermined keywords and/or subwords extracted from the dedicated expectation dictionary.
31. The method for processing natural language conversation in a human-computer interaction into text, as claimed in claim 30, wherein the step of processing the identified errors by substituting with predetermined keywords and/or subwords extracted from the dedicated expectation dictionary is executed in the background without causing any interruption in execution of the interaction session for minimizing delay.
32. The method for processing natural language conversation in a human-computer interaction into text, as claimed in claim 15, wherein each of the plurality of expectation dictionaries are equipped with at least a spelling error correction portion and a vocabulary error correction portion.
PCT/IN2023/050105 2022-02-06 2023-02-03 A system and method to reduce ambiguity in natural language understanding by user expectation handling WO2023148772A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
BD342022 2022-02-06
BDBD/P/2022/34 2022-02-06

Publications (1)

Publication Number Publication Date
WO2023148772A1 true WO2023148772A1 (en) 2023-08-10

Family

ID=87553439

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IN2023/050105 WO2023148772A1 (en) 2022-02-06 2023-02-03 A system and method to reduce ambiguity in natural language understanding by user expectation handling

Country Status (1)

Country Link
WO (1) WO2023148772A1 (en)


Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210233530A1 (en) * 2018-12-04 2021-07-29 Sorenson Ip Holdings, Llc Transcription generation from multiple speech recognition systems

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DOSHI KETAN: "Audio Deep Learning Made Simple: Automatic Speech Recognition (ASR), How it Works", TOWARDSDATASCIENCE, 25 March 2021 (2021-03-25), XP093084399, Retrieved from the Internet <URL:https://towardsdatascience.com/audio-deep-learning-made-simple-automatic-speech-recognition-asr-how-it-works-716cfce4c706> [retrieved on 20230921] *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117669513A (en) * 2024-01-30 2024-03-08 江苏古卓科技有限公司 Data management system and method based on artificial intelligence
CN117669513B (en) * 2024-01-30 2024-04-12 江苏古卓科技有限公司 Data management system and method based on artificial intelligence

Similar Documents

Publication Publication Date Title
US11817080B2 (en) Using corrections, of predicted textual segments of spoken utterances, for training of on-device speech recognition model
US10453454B2 (en) Dialog system with self-learning natural language understanding
US10331784B2 (en) System and method of disambiguating natural language processing requests
US8108214B2 (en) System and method for recognizing proper names in dialog systems
US7437291B1 (en) Using partial information to improve dialog in automatic speech recognition systems
US8285546B2 (en) Method and system for identifying and correcting accent-induced speech recognition difficulties
US20090287483A1 (en) Method and system for improved speech recognition
US20060287868A1 (en) Dialog system
US11705106B2 (en) On-device speech synthesis of textual segments for training of on-device speech recognition model
US9953637B1 (en) Speech processing using skip lists
EP3956884B1 (en) Identification and utilization of misrecognitions in automatic speech recognition
US11211046B2 (en) Learning transcription errors in speech recognition tasks
US11545133B2 (en) On-device personalization of speech synthesis for training of speech model(s)
CN110021293B (en) Voice recognition method and device and readable storage medium
US11151996B2 (en) Vocal recognition using generally available speech-to-text systems and user-defined vocal training
US11823664B2 (en) Correcting speech misrecognition of spoken utterances
WO2023148772A1 (en) A system and method to reduce ambiguity in natural language understanding by user expectation handling
KR20220128397A (en) Alphanumeric Sequence Biasing for Automatic Speech Recognition
CN110021295B (en) Method and system for identifying erroneous transcription generated by a speech recognition system
US10607596B2 (en) Class based learning for transcription errors in speech recognition tasks
CN111048098A (en) Voice correction system and voice correction method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23749452

Country of ref document: EP

Kind code of ref document: A1