CN110021295B - Method and system for identifying erroneous transcription generated by a speech recognition system - Google Patents


Info

Publication number
CN110021295B
Authority
CN
China
Prior art keywords
class
transcription
utterance
evidence
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910000917.4A
Other languages
Chinese (zh)
Other versions
CN110021295A (en)
Inventor
A·阿龙
郭尚青
J·伦克纳
M·慕克尔吉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US15/863,937 external-priority patent/US10593320B2/en
Priority claimed from US15/863,938 external-priority patent/US10607596B2/en
Application filed by International Business Machines Corp filed Critical International Business Machines Corp
Publication of CN110021295A publication Critical patent/CN110021295A/en
Application granted granted Critical
Publication of CN110021295B publication Critical patent/CN110021295B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/151Transformation
    • G06F40/16Automatic learning of transformation rules, e.g. from examples
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/28Constructional details of speech recognition systems
    • G10L15/30Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L2015/088Word spotting

Abstract

Methods, apparatus, and computer program products are provided for identifying erroneous transcriptions generated by a speech recognition system. A set of known utterance members is provided for use by the speech recognition system, each utterance member composed of a corresponding plurality of words. A received utterance is matched to a first utterance member of the set of known utterance members; the first utterance member is the closest-matching member and has a first plurality of words. The matching operation matches fewer than all of the first plurality of words in the received utterance, and the received utterance differs in a first particular manner from a first word in a first time slot of the first utterance member. The received utterance is sent to an erroneous transcription analyzer component, which increments evidence that the received utterance contains an erroneous transcription. Once the accumulated evidence for the erroneous transcription exceeds a threshold, future received utterances containing the erroneous transcription are processed as if the first word had been recognized.

Description

Method and system for identifying erroneous transcription generated by a speech recognition system
Technical Field
The present disclosure relates generally to machine learning. More particularly, the present disclosure relates to teaching machine learning systems to detect transcription errors of speech recognition tasks.
Background
Speech recognition is a computer technology that allows users to perform a variety of interactive computer tasks as an alternative to communicating via conventional input devices such as mice and keyboards. Such tasks include issuing commands for the computer to perform selected functions, or transcribing speech into written text for a computer application such as a spreadsheet or word processing application. Unfortunately, the speech recognition process is not error-free, and an important issue is correcting transcription errors, or "erroneous transcriptions". An erroneous transcription occurs when the speech recognition component of the computer incorrectly transcribes the acoustic signal of a spoken utterance. In an automatic speech recognition task, when a selected word is transcribed erroneously, a command may not be executed properly or speech may not be transcribed properly. An erroneous transcription may be due to one or more factors: for example, the user may be a non-native speaker, the user's speech may be slurred, or there may be background noise on the channel of the speech recognition system.
One type of erroneous transcription is a replacement error, in which the speech recognition system replaces a spoken word with an incorrect word. Another type is an insertion error, in which the system recognizes a "garbage" utterance such as breathing, background noise, or a filler sound, or interprets one word as two words, and so on. A third type is a deletion error, in which one of the spoken words does not appear in the transcription. In some cases, a deletion occurs because the speech recognition system rejects the recognized phonemes as not forming a word in its dictionary. Alternatively, a deletion results from the incorrect merging of two words. For example, the user may say "nine trees" and the system recognizes the utterance as "ninety".
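The three error types above can be detected mechanically by aligning the expected word sequence against the recognized one. The following is a minimal illustrative sketch (not the patent's method) using Python's standard `difflib` alignment; the function and variable names are invented for illustration:

```python
# Illustrative sketch: classify transcription errors by aligning the expected
# word sequence against the recognized word sequence.
from difflib import SequenceMatcher

def classify_errors(expected, recognized):
    """Return (error_type, expected_span, recognized_span) tuples."""
    errors = []
    matcher = SequenceMatcher(a=expected, b=recognized, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":
            errors.append(("replacement", expected[i1:i2], recognized[j1:j2]))
        elif op == "delete":
            errors.append(("deletion", expected[i1:i2], []))
        elif op == "insert":
            errors.append(("insertion", [], recognized[j1:j2]))
    return errors

# The "nine trees" -> "ninety" merge from the text shows up as a replacement
# of a two-word span by a single word:
print(classify_errors(["nine", "trees"], ["ninety"]))
# [('replacement', ['nine', 'trees'], ['ninety'])]
```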
Traditional methods for resolving erroneous transcriptions include manually checking the transcription for errors and correcting them with an input device such as a keyboard, or having the system flag candidate erroneous transcriptions and enter a dialogue with the user aimed at correcting them. For example, the system may ask the user via a speaker, "Did you say 'chicken'?" If the user says "no," the system records the candidate transcription as erroneous. The number of transcription errors can also be reduced by improving the speech model for a particular user. The speech recognition system's default acoustic model can be better adapted to a user as the system receives a greater number of speech samples from that user, either by having the user read from a known transcription or through the user's continued use of the system.
Further improvements in computer-aided speech recognition are needed.
Disclosure of Invention
In accordance with the present disclosure, a method, apparatus, and computer program product are provided for identifying erroneous transcriptions generated by a speech recognition system. A set of known utterance members is provided for use by the speech recognition system, each utterance member composed of a corresponding plurality of words. A received utterance is matched to a first utterance member of the set of known utterance members; the first utterance member is the closest-matching member and has a first plurality of words. The matching operation matches fewer than all of the first plurality of words in the received utterance, and the received utterance differs in a first particular manner from a first word in a first time slot of the first utterance member. The received utterance is sent to an erroneous transcription analyzer component, which increments evidence that the received utterance contains an erroneous transcription. Once the accumulated evidence for the erroneous transcription exceeds a threshold, future received utterances containing the erroneous transcription are processed as if the first word had been recognized.
Some of the more relevant features of the disclosed subject matter have been summarized above. These features should be construed as merely illustrative. Many other beneficial results can be attained by applying the disclosed subject matter in a different manner or by modifying the invention as will be described.
Drawings
For a more complete understanding of the present invention, and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
FIG. 1 depicts an exemplary block diagram of a distributed data processing environment, in which exemplary aspects of the illustrative embodiments may be implemented;
FIG. 2 is an exemplary block diagram of a data processing system in which exemplary aspects of the illustrative embodiments may be implemented;
FIG. 3 illustrates an architecture diagram of components in a speech recognition system according to an embodiment of the present invention;
FIG. 4 illustrates a flow chart of operation of a speech recognition system according to an embodiment of the present invention;
FIG. 5 is a flow chart for adding class members based on user responses according to an embodiment of the invention;
FIG. 6 is a flow diagram of adding evidence that class members should be added to multiple classes according to an embodiment of the invention;
FIG. 7 is a diagram for detecting whether a new transcription error is a replacement, deletion, or insertion error according to an embodiment of the present invention;
FIG. 8 is a flow chart illustrating a method for incrementing evidence that an erroneously transcribed word in one class member is a valid substitution for the same word in another class member of the same class;
FIG. 9 is a flow chart illustrating a method for incrementing evidence that an erroneously transcribed word from one class is a valid substitution for the same word in a class member of a different class; and
FIG. 10 is a flow chart of one embodiment of the present invention for obtaining additional evidence using a speech recognition system.
Detailed Description
At a high level, the preferred embodiments of the present invention provide a system, method, and computer program product that use machine learning to properly identify and handle erroneous transcriptions from a speech recognition system. The present invention uses a set of one or more known utterances that, when recognized by a speech recognition system, produce a system response. In a preferred embodiment, utterances are arranged into groups or "classes"; when one of a class's utterances is recognized, the class's system response action is performed. When recognized utterances belong to a class, they are referred to as "class members." Each utterance typically consists of a number of words, and the number of words may vary depending on the particular utterance. When a transcription matches some but not all of the words in a member utterance, e.g., word Y is recognized in place of word X in a given time slot in the class member, the transcription is treated as some evidence of an erroneous transcription of word X as word Y in the member utterance. Once the evidence exceeds a threshold, future recognized utterances containing the erroneous transcription are processed as if the original word had been recognized. In some embodiments, a machine learning algorithm is employed to determine a confidence level that the recognized word Y is equivalent to word X in the recognized utterance.
In embodiments of the present invention, the error transcription analyzer uses various rules to determine how much the confidence level should increase with additional instances of the same erroneous transcription. In some embodiments, the error transcription analyzer uses a machine learning algorithm, since the amount of evidence provided by a particular transcription depends on many factors, as described below. For example, the more often an intended X is transcribed as Y in this or other member utterances, the greater the evidence of an erroneous transcription of X as Y. Furthermore, the greater the number of words that match in a particular utterance, e.g., a long utterance with a single suspected mistranscribed word, the more evidence of erroneous transcription the analyzer assumes. As the evidence for a particular erroneous transcription becomes nearly definitive, the classification system can process a recognized utterance containing the erroneous transcription as if the original word had been recognized. One way embodiments of the present invention achieve this is to add, as class members of one or more utterance classes, new utterances in which one or more erroneous transcriptions replace the original words of existing class members. Another way, used by other embodiments, is to register the erroneous transcription as a valid replacement for the original word, so that whenever it is recognized, the system behaves as if the original word had been recognized instead.
In the following description, the process of determining whether a new erroneous transcription or a new utterance should be added for use by the system is generally described as incrementing evidence. Those skilled in the art will appreciate that, in some embodiments, the incremented evidence may be used in confidence calculations, for example, as part of a machine learning system. Thus, when the incremented evidence exceeds a threshold, the threshold may be an accumulated evidence threshold or a confidence threshold calculated from the accumulated evidence. In an evidence or confidence threshold calculation, individual pieces of evidence collected from different instances of an erroneous transcription may carry different weights. In a preferred embodiment, evidence of an erroneous transcription is incremented for each utterance member based on how closely the attributes of the received utterance match the attributes of that member. In other embodiments, the evidence for an erroneous transcription (e.g., recognizing word Y instead of word X) is incremented in a single location, so that once the threshold for that erroneous transcription is exceeded, the system processes any future received utterance containing it as if the original word had been recognized.
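As a rough illustration of the evidence-accumulation idea described above, the sketch below keeps a weighted counter per (expected, heard) word pair and treats the pair as a valid substitution once a threshold is exceeded. The weighting rule, class names, and threshold value are assumptions for illustration, not the patent's specification:

```python
# Illustrative sketch of evidence accumulation for erroneous transcriptions.
from collections import defaultdict

class ErrorTranscriptionAnalyzer:
    def __init__(self, threshold=5.0):
        self.threshold = threshold
        self.evidence = defaultdict(float)  # (expected, heard) -> accumulated weight

    def increment(self, expected, heard, matched_words, utterance_length):
        # A longer utterance with a single suspect word is stronger evidence,
        # so weight each instance by the fraction of words that matched.
        weight = matched_words / utterance_length
        self.evidence[(expected, heard)] += weight
        return self.evidence[(expected, heard)]

    def is_valid_substitution(self, expected, heard):
        return self.evidence[(expected, heard)] >= self.threshold

analyzer = ErrorTranscriptionAnalyzer(threshold=2.0)
for _ in range(3):
    analyzer.increment("restroom", "rest rum", matched_words=5, utterance_length=6)
print(analyzer.is_valid_substitution("restroom", "rest rum"))  # True (3 * 5/6 = 2.5 >= 2.0)
```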
With reference now to the figures, and in particular with reference to FIGS. 1-2, exemplary diagrams of data processing environments are provided in which illustrative embodiments of the present disclosure may be implemented. FIGS. 1-2 are only exemplary and are not intended to assert or imply any limitation with regard to the environments in which aspects or embodiments of the disclosed subject matter may be implemented. Many modifications to the depicted environments may be made without departing from the spirit and scope of the present invention.
With reference now to the figures, FIG. 1 depicts a pictorial representation of an exemplary distributed data processing system in which aspects of the illustrative embodiments may be implemented. Distributed data processing system 100 may include a network of computers in which aspects of the illustrative embodiments may be implemented. Distributed data processing system 100 contains at least one network 102, which network 102 is a medium used to provide communications links between various devices and computers connected together within distributed data processing system 100. Network 102 may include connections, such as wire, wireless communication links, or fiber optic cables.
In the depicted example, server 104 and server 106 connect to network 102 along with network storage unit 108. In addition, clients 110, 112, and 114 are also connected to network 102. These clients 110, 112, and 114 may be, for example, smartphones, tablet computers, personal computers, network computers, and the like. In the depicted example, server 104 provides data, such as boot files, operating system images, and applications to clients 110, 112, and 114. In the depicted example, clients 110, 112, and 114 are clients to server 104. Distributed data processing system 100 may include additional servers, clients, and other devices not shown. One or more server computers may be host computers connected to network 102. The host computer may be, for example, an IBM system z host running an IBM z/OS operating system. Connected to the host may be a host storage unit and a workstation (not shown). The workstation may be a personal computer directly connected to a host computer communicating via a bus, or may be a console terminal directly connected to the host computer via a display port.
In the depicted example, distributed data processing system 100 is the Internet with network 102 representing a worldwide collection of networks and gateways that use the Transmission control protocol/Internet protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, government, educational and other computer systems that route data and messages. Of course, distributed data processing system 100 also may be implemented to include a number of different types of networks, such as for example, an intranet, a Local Area Network (LAN), a Wide Area Network (WAN), and the like. As noted above, the illustration in FIG. 1 is intended as an example, not as an architectural limitation for different embodiments of the disclosed subject matter, and thus, the particular elements illustrated in FIG. 1 should not be considered limiting with respect to the environments in which the illustrative embodiments of the present invention may be implemented.
With reference now to FIG. 2, a block diagram of an exemplary data processing system is shown in which aspects of the illustrative embodiments may be implemented. Data processing system 200 is an example of a computer, such as client 114 in FIG. 1, in which computer usable code or instructions implementing the processes for illustrative embodiments of the present disclosure may be located.
With reference now to FIG. 2, a block diagram of a data processing system is shown in which illustrative embodiments may be implemented. Data processing system 200 is an example of a computer, such as server 104 or client 110 in FIG. 1, in which computer usable program code or instructions implementing the processes may be located for the illustrative embodiments. In this illustrative example, data processing system 200 includes a communication fabric 202 that provides communications between a processor unit 204, a memory 206, persistent storage 208, a communication unit 210, an input/output (I/O) unit 212, and a display 214.
The processor unit 204 is used for executing instructions of software that may be loaded into the memory 206. The processor unit 204 may be a set of one or more processors, or may be a multi-processor core, depending on the particular implementation. Further, the processor unit 204 may be implemented using one or more heterogeneous processor systems in which a primary processor is present on a single chip along with a secondary processor. As another illustrative example, processor unit 204 may be a Symmetric Multiprocessor (SMP) system containing multiple processors of the same type.
Memory 206 and persistent storage 208 are examples of storage devices. A storage device is any hardware capable of temporarily and/or permanently storing information. In these examples, memory 206 may be, for example, random access memory or any other suitable volatile or non-volatile storage. Persistent storage 208 may take various forms depending on the particular implementation. For example, persistent storage 208 may contain one or more components or devices. For example, persistent storage 208 may be a hard drive, a flash memory, a rewritable optical disk, a rewritable magnetic tape, or some combination of the above. The media used by persistent storage 208 also may be removable. For example, a removable hard drive may be used for persistent storage 208.
In these examples, communication unit 210 provides for communication with other data processing systems or devices. In these examples, communication unit 210 is a network interface card. The communication unit 210 may provide communication using one or both of physical and wireless communication links.
Input/output unit 212 allows data to be input and output with other devices that may be connected to data processing system 200. For example, input/output unit 212 may provide a connection for user input through a keyboard and a mouse, and may send output to a printer. Furthermore, an input/output unit may provide a connection to a microphone for audio input from the user and to speakers for audio output from the computer. Display 214 provides a mechanism for displaying information to a user.
Instructions for the operating system and applications or programs are located on persistent storage 208. These instructions may be loaded into memory 206 for execution by processor unit 204. The processes of the different embodiments may be performed by processor unit 204 using computer implemented instructions, which may be located in a memory such as memory 206. These instructions may be referred to as program code, computer usable program code, or computer readable program code that may be read and executed by a processor in processor unit 204. In different embodiments, it may be implemented on different physical or tangible computer readable media, such as memory 206 or persistent storage 208.
Program code 216 is located in a functional form on computer readable media 218. Computer readable media 218 is selectively removable and may be loaded onto or transferred to data processing system 200 for execution by processor unit 204. In these examples, program code 216 and computer readable medium 218 form a computer program product 220. In one example, the computer-readable medium 218 may be in a tangible form, such as, for example, an optical or magnetic disk inserted or placed in a drive or other device that is part of the persistent storage 208 for transfer onto a storage device, such as a hard drive that is part of the persistent storage 208. In a tangible form, computer readable medium 218 may also take the form of a persistent storage device, such as a hard drive, thumb drive, or flash memory connected to data processing system 200. The tangible form of computer readable medium 218 is also referred to as a computer recordable storage medium. In some cases, the computer recordable medium 218 may not be removable.
Alternatively, program code 216 may be transferred to data processing system 200 from computer readable media 218 through a communications link to communications unit 210 and/or through a connection to input/output unit 212. The communication links and/or connections may be physical or wireless in the illustrative examples. The computer readable medium may also take the form of non-tangible media, such as communications links or wireless transmissions containing the program code. The different components illustrated for data processing system 200 are not meant to provide architectural limitations to the manner in which different embodiments may be implemented. The different illustrative embodiments may be implemented in a data processing system that includes components in addition to or in place of those shown for data processing system 200. Other components shown in fig. 2 may differ from the illustrative example shown. As one example, a storage device in data processing system 200 is any hardware apparatus that may store data. Memory 206, persistent storage 208, and computer-readable media 218 are examples of storage devices in tangible form.
In another example, a bus system may be used to implement communication structure 202 and may be comprised of one or more buses, such as a system bus or an input/output bus. Of course, the bus system may be implemented using any suitable type of architecture that provides for a transfer of data between different components or devices attached to the bus system. In addition, the communication unit may include one or more devices for transmitting and receiving data, such as a modem or a network adapter. Further, a memory may be, for example, memory 206 or a cache such as found in an interface and memory controller hub in communication fabric 202.
The computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java™, Smalltalk, C++, C#, and Objective-C, as well as conventional procedural programming languages such as Python or C. The program code may execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
Those of ordinary skill in the art will appreciate that the hardware in FIGS. 1-2 may vary depending on the implementation. Other internal hardware or peripheral devices, such as flash memory, equivalent nonvolatile memory, or optical disk drives and the like, may be used in addition to or in place of the hardware depicted in figures 1-2. Furthermore, the processes of the illustrative embodiments may be applied to a multiprocessor data processing system, other than the SMP system mentioned previously, without departing from the spirit and scope of the disclosed subject matter.
The techniques described herein may operate in conjunction with a standard client-server paradigm such as that shown in FIG. 1, in which a client machine communicates with an Internet-accessible Web-based portal executing on a set of one or more machines. An end user operates an Internet-connectable device (e.g., a desktop computer, a notebook computer, an Internet-enabled mobile device, etc.) capable of accessing and interacting with the portal. Typically, each client or server machine is a data processing system comprising hardware and software, such as that shown in FIG. 2, and the entities communicate with one another over a network, such as the Internet, an intranet, an extranet, a private network, or any other communication medium or link. A data processing system typically includes one or more processors, an operating system, one or more application programs, and one or more utility programs.
While people often cannot hear every word in a conversation correctly, humans use the context of the conversation to help determine what a misheard word must have been. A speech recognition mechanism does not have the tools humans use to make such contextual decisions. However, through machine learning, confidence about what an incorrectly transcribed word must be can be learned by observing the same repeated transcription errors, sometimes in combination with user behavior. Embodiments of the present invention allow the system to learn based on individual users and environments, as well as on categories of users and types of environments.
One environment in which embodiments of the present invention may be implemented is shown in FIG. 3. The speech recognition system 303 receives speech samples 301 for conversion into computer-usable text or markup. Part of the speech recognition system 303 is a classifier 304, such as the IBM Watson natural language classifier or the Stanford classifier. This component identifies classes of questions or assertions (alternatively, "utterances") 309 that ask or state the same thing. For example, there may be a class asking for directions to the nearest restroom. Such a request can take many forms: "Which way to the restroom?", "Where is the toilet?", "Which way do I go to the toilet?", and the like. In this case, these phrasings are referred to as class instances 309a..309c. All instances have the same "intent". The classifier takes the utterance and attempts to match it to any known class and class instance. Classifier 304 returns the highest-confidence class and the confidence 311. If the returned confidence exceeds the threshold set by the system, the system responds 313 with the specified response associated with the given class. At the same time, the error transcription analyzer 312 traverses each class member of the highest-matching class and finds the closest-matching class member. If there is not a perfect match, then after enough evidence is accumulated, a new class member is added to the class in step 310. If a close class instance is found, the system attempts to infer which word (or words) may have been mistranscribed for the known word (or words), in the process set forth in more detail below. Suspected mistranscribed word pairs are stored in a mistranscribed word pair data store 325. The data store 325 stores the misheard word (or words) together with the correct word.
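The matching flow of FIG. 3 can be sketched roughly as follows; the class store contents, overlap-based scoring, and confidence threshold are illustrative stand-ins for whatever trained classifier an actual implementation would use:

```python
# Illustrative sketch: match an utterance against class instances, return the
# best class with a confidence, and treat near-misses as candidates for the
# error transcription analyzer.
def word_overlap(a, b):
    a, b = a.lower().split(), b.lower().split()
    return len(set(a) & set(b)) / max(len(a), len(b))

# Hypothetical class store keyed by class name, listing class instances.
CLASSES = {
    "restroom_directions": [
        "which way to the restroom",
        "where is the toilet",
        "which way do i go to the toilet",
    ],
}

def classify(utterance, confidence_threshold=0.6):
    cls, instance, confidence = max(
        ((c, inst, word_overlap(utterance, inst))
         for c, insts in CLASSES.items() for inst in insts),
        key=lambda t: t[2],
    )
    if confidence >= confidence_threshold:
        return cls, confidence
    return None, confidence  # near miss: route to the error transcription analyzer

print(classify("which way to the rest room"))  # matches on 4 of 6 words
```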
In alternative embodiments of the present invention, the utterances are not organized by class; rather, each utterance has its own associated system response. Most of the following description describes embodiments that organize utterances within a class as class instances or members. In some embodiments of the invention, the error transcription analyzer 312 uses a machine learning algorithm to increment the amount of evidence for a particular erroneous transcription. As discussed below, a particular erroneous transcription instance may increment different amounts of evidence for different class members, depending on the rules used by the error transcription analyzer 312.
In embodiments of the present invention, all components may reside in a single system. In other embodiments, some of the components may be distributed among different systems. For example, the voice sample may be obtained by a client system, such as a smart phone, the voice recognition system 303, classifier 304, and class store 309 may reside at a server, and the system response 313 may be a voice response played back at the client or a response performed at another system in a distributed network.
In the first stage of operation, classifier 304 will recognize each class member and, if a class member is recognized, generate an appropriate system response 313. In many cases, the system response will be speech generated by the system, such as an answer to a user question. The system response 313 may also be a non-voice response, such as performing a search and visually displaying, in a graphical user interface, a window or web page requested by the user.
In an embodiment of the invention, feedback on the system response is collected from the user. The feedback may take the form of additional speech samples, e.g., additional similar questions, a negative response such as "that is not what I meant", or, implicitly, the lack of any further response, indicating that the response was accepted as correct. Other user inputs may indicate acceptance or rejection of the response. For example, if a user asks the speech recognition system questions about a topic on which it is knowledgeable, or about a web page displayed on the system, and the user continues to interact with the system or continues to view the web page in an unsurprised manner, such actions may be interpreted by the system as acceptance of the response. When classifier 304 is unable to recognize the initial speech sample as a class member, the speech recognition system 303 may, in embodiments of the present invention, generate a clarifying question to prompt the user to provide additional information and/or speech samples.
In the first stage of operation, classifier 304 will also send the error transcription analyzer 312 a message containing the recognized speech 305 that did not match a recognized class member. In an embodiment of the present invention, the error transcription analyzer 312 attempts to swap in alternative words, obtained from the mistranscribed word pair data store 325, for the suspected mistranscriptions, and resubmits the text to the classifier 304, e.g., as a candidate class member.
In the second stage of operation, error transcription analyzer 312 adds class members to the existing set of classes used by classifier 304. The error transcription analyzer 312 stores occurrences of candidate class members, including the candidate mistranscription(s), in the candidate class to which the recognized speech is computed most likely to belong. As more of the same candidate class members and the same candidate mistranscription(s) are stored, the confidence that the candidate class member belongs in the class, and that the candidate mistranscription is a substitute form of a word in an existing class member, increases. When a threshold is reached, the candidate class member is added to the class as an identified class member 311 for use by classifier 304 in generating a system response 313 to the user. In an alternative embodiment of the invention, class store 309 is shared between error transcription analyzer 312 and classifier 304. When a candidate class member is added to the class as a new class member by the error transcription analyzer 312, the classifier 304 simply begins using it.
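The accumulate-then-promote flow described above can be sketched as follows. The threshold value, class names, and data structures are illustrative assumptions, not the patent's literal implementation:

```python
from collections import defaultdict

# Assumed promotion threshold; the patent leaves the value configurable.
PROMOTION_THRESHOLD = 3.0

class CandidateStore:
    """Accumulates evidence for candidate class members and promotes a
    candidate to an identified class member once its evidence reaches
    the threshold, at which point the classifier can begin using it."""

    def __init__(self, classes):
        self.classes = classes              # class name -> list of member utterances
        self.evidence = defaultdict(float)  # (class, candidate) -> accumulated evidence

    def add_evidence(self, class_name, candidate, amount):
        key = (class_name, candidate)
        self.evidence[key] += amount
        members = self.classes[class_name]
        if self.evidence[key] >= PROMOTION_THRESHOLD and candidate not in members:
            members.append(candidate)       # promoted into the shared class store
        return self.evidence[key]
```

With a shared class store, as in the alternative embodiment, the classifier picks up the promoted member on its next lookup with no explicit hand-off.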
As part of the second stage of operation, embodiments of the present invention also include "expected class members" or "candidate class members" that classifier 304 uses to recognize utterances. As the error transcription analyzer 312 gains confidence in a candidate class member and candidate mistranscription, it places the expected class member in the class to accelerate evidence accumulation. The error transcription analyzer 312 calculates an intermediate confidence level that exceeds a first intermediate threshold but is below the second threshold required for the mistranscription and candidate class member to be added as identified members of the class. Classifier 304 uses an expected or candidate class member that exceeds the intermediate threshold either to generate a system response 313 as if it were an identified class member, or to enter into an interactive dialogue with the user, e.g., "I think you want X. Is this correct?", where X is the correct system response for the class. If affirmative, the user response adds evidence that the candidate class member and candidate mistranscription should be added to the class.
As a configuration step for operating the above-described system, a set or "category" of recognized utterances that the system is able to recognize and possibly respond to is created and stored in class store 309. These sets may be considered "classes" in the sense of a text classifier, such as those used in a Watson natural language classifier or similar classifier. A class is made up of a group of members representing the various ways of making substantially the same utterance. For example, the class with template question "how do I get to the bathroom?" may have alternative instances: "where is the bathroom?", "which way to the bathroom?", "which way to the toilet?" and the like. Manual creation of classes is used in some embodiments, but as described below, some embodiments provide a set of automated techniques that can extend the manually created classes. Moreover, as described herein, the error transcription analyzer 312 provides new class members based on repeated mistranscriptions.
When the speech recognition system translates a spoken utterance into words, it may incorrectly transcribe one or more words in the utterance. As described above, this is referred to as a mistranscription or transcription error. In a preferred embodiment of the invention, if N-1 of the N words in the utterance "match" one of the class members, i.e., only one word fails to match the class member, the error transcription analyzer treats this as evidence of mistranscription of the unmatched word. One rule the error transcription analyzer uses to increment evidence is that, for a given N-1 (the number of matching words), the greater N (the number of words in the member), the more evidence that a mistranscription exists. In an embodiment of the invention, another rule is that the closer the phonetic similarity between the word in the class member and the candidate mistranscription, the more evidence of mistranscription. A typical mistranscription is a word that sounds similar to the intended word at the same location in the class member. In many cases, the words or phrases in the candidate class member sound similar to the words or phrases in the class member; otherwise, the speech recognition system would not have produced the mistranscription.
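A minimal sketch of the N-1-of-N matching rule above, assuming a pure substitution error and a simple length-based evidence weight (the exact weighting function is an assumption; the text only says longer utterances give more evidence):

```python
def find_mistranscription_candidate(transcript, class_member):
    """Compare a transcription against a class member utterance.

    Returns (member_word, transcript_word, evidence) when exactly one
    word differs, else None. Evidence grows with utterance length N,
    reflecting the rule that a longer utterance with a single mismatch
    is stronger evidence of mistranscription.
    """
    t_words = transcript.lower().split()
    m_words = class_member.lower().split()
    if len(t_words) != len(m_words):
        return None  # not a pure substitution candidate
    mismatches = [(m, t) for m, t in zip(m_words, t_words) if m != t]
    if len(mismatches) != 1:
        return None  # zero mismatches (exact match) or too many
    n = len(m_words)
    evidence = (n - 1) / n  # illustrative: more matching words -> more evidence
    member_word, transcript_word = mismatches[0]
    return member_word, transcript_word, evidence
```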
Thus, in the above case, suppose the speech recognition system transcribes "which wake to the restroom?" rather than "which way to the restroom?"; the error transcription analyzer then treats this as some evidence that "wake" may be a mistranscription of "way". The more frequently instances of the same mistranscription occur, the more evidence is collected and the greater the error transcription analyzer's confidence in the mistranscription. At some point, the confidence exceeds the threshold and the analyzer concludes that a mistranscription has occurred. In embodiments of the present invention that use classes of utterances, the candidate class member is added to the class and the system performs an action, such as a verbal response, as if it had actually recognized "which way to the restroom?".
In embodiments of the present invention, at a lower confidence level, the system may perform a second action, e.g., request clarification: "I did not hear that clearly; were you asking for directions to the restroom?". In other embodiments of the invention, there may be a first, lower threshold at a moderate confidence level, at which candidate class members are added to the class as "trial" members. The system collects user responses as the appropriate actions are performed for the class members and feeds those responses back to the error transcription analyzer. Thus, a user response indicating acceptance of a system response increases the confidence in the mistranscription, while a user response indicating rejection of a system response decreases the confidence in the mistranscription for the class member. If users continue to accept the system response, the confidence level eventually exceeds a second, higher threshold, and the candidate class member transitions from the trial state to the permanent state as a class member of the class.
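The two-threshold trial/permanent progression can be sketched as follows. The threshold values and the minimum sample count are assumptions for illustration; the patent only requires that the second threshold be higher than the first:

```python
TRIAL_THRESHOLD = 0.5      # assumed lower threshold: add as "trial" member
PERMANENT_THRESHOLD = 0.8  # assumed higher threshold: make permanent
MIN_SAMPLES = 5            # assumed minimum feedback count before permanence

class TrialMember:
    """Tracks a candidate class member through candidate, trial, and
    permanent states, updating confidence from user feedback on the
    system responses generated for it."""

    def __init__(self):
        self.accepted = 0
        self.rejected = 0
        self.state = "candidate"

    def confidence(self):
        total = self.accepted + self.rejected
        return self.accepted / total if total else 0.0

    def record_feedback(self, accepted):
        if accepted:
            self.accepted += 1
        else:
            self.rejected += 1
        conf = self.confidence()
        if conf >= PERMANENT_THRESHOLD and self.accepted + self.rejected >= MIN_SAMPLES:
            self.state = "permanent"
        elif conf >= TRIAL_THRESHOLD:
            self.state = "trial"
```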
Embodiments of the present invention use machine learning to recognize mistranscriptions in speech text generated by a speech recognition system. In embodiments of the present invention, classes of utterances with similar meanings are used to interact with the user. A class includes a set of member utterances, each member utterance U_i being composed of a corresponding number N_i of words. When the transcription matches some but not all of these words (e.g., N_i - 1) and word Y is used in place of word X in a given slot of the member utterance (e.g., the j-th slot), the transcription is taken as some evidence of mistranscription of word X as word Y.
The more often the system sees the expected word X transcribed as Y, in this or other known utterances, the more evidence of mistranscription. As described above, one rule is that the greater the number of words N_i in a particular utterance with a single suspected mistranscribed word, the more evidence of mistranscription is assumed.
Embodiments of the present invention allow mistranscription confidence to be aided by knowledge of the same or similar speakers. The same speaker, or similar speakers, are more likely to mispronounce or use words in the same or similar ways. One similarity measure that may be used is detecting that the speakers share the same L1, i.e., the same first language. Another similarity measure is that users share the same environment, e.g., a workplace or organization, and so will tend to use the same vocabulary. In embodiments of the present invention, different classes of member utterances are stored for different users or different classes of users. Embodiments of the present invention use user-based rules to add evidence of mistranscription.
Embodiments of the present invention allow for false transcription confidence to be aided by knowledge of the same or similar circumstances. While there is some overlap with users of the same workplace or organization as described above, in this category the same user will use different words in different environments. Words used in a home environment as opposed to a work environment tend to be different. Furthermore, some types of error transcription are more common in different types of environments, such as insertion errors in noisy environments. In embodiments of the present invention, different classes of member utterances are stored for a particular environment or type of environment. Embodiments of the present invention use context-based rules to add evidence of erroneous transcription.
Other embodiments of the present invention allow mistranscription confidence to be aided by knowledge of whether the word and the suspected mistranscribed word have some degree of phonetic similarity.
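The patent does not name a particular phonetic-similarity measure; as one hypothetical choice, a minimal Soundex encoding can flag sound-alike word pairs:

```python
def soundex(word):
    """A minimal Soundex encoding, used here as one possible measure of
    phonetic similarity between a word and its suspected mistranscription."""
    groups = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3",
              "l": "4", "mn": "5", "r": "6"}

    def code(ch):
        for letters, digit in groups.items():
            if ch in letters:
                return digit
        return ""  # vowels and h, w, y carry no digit

    word = word.lower()
    encoded = word[0].upper()
    prev = code(word[0])
    for ch in word[1:]:
        d = code(ch)
        if d and d != prev:
            encoded += d
        if ch not in "hw":  # h and w do not separate equal codes
            prev = d
    return (encoded + "000")[:4]

def phonetically_similar(a, b):
    """Coarse yes/no similarity: identical Soundex codes."""
    return soundex(a) == soundex(b)
```

A graded measure (e.g., edit distance over phoneme strings) would fit the "degree of similarity" rule more closely; Soundex is just the simplest self-contained stand-in.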
In embodiments of the present invention, a mistranscription of a given word in one class member is considered evidence of mistranscription of that word in other class members that use the same word. In these embodiments, evidence may accumulate before the mistranscription is actually encountered in those class members. For example, in a first class member, the word "thoroughfare" may be misheard as "row", or vice versa. In embodiments of the present invention, the system will accumulate some, preferably smaller, amount of evidence for other class members that share the mistranscribed word(s). Embodiments of the invention may also accumulate evidence for these words in other classes. Rules in some embodiments specify that less evidence is accumulated for utterances in other classes than for fellow class members in the same class.
Other embodiments of the present invention use regular expressions to allow different word orders when matching candidate class members against potentially matching existing class members. A different word order may be allowed in the candidate class member, but implies a lower signal strength, i.e., a lower confidence that the candidate mistranscription is truly a mistranscription.
In embodiments of the present invention, the error transcription analyzer also considers context, geographic proximity, and situational awareness when interpreting confusable words. For example, an utterance such as "which way to the restroom?" can easily be confused with "which way to the restaurant?". If the person is driving when making the utterance (a first context or context type), the second interpretation is more likely correct, i.e., the person is looking for a restaurant. If the utterance is made in an office space, the first interpretation is more likely correct.
Similarly, another confusable pair of statements is "let us start the bar" and "let us watch the movie". The two sentences may be distinguished based on, for example, the context of to whom each sentence was uttered: an office manager is more likely to say the first sentence to his or her employees, while the second is more likely to be said between two friends.
A flow chart of an embodiment of the present invention is shown in FIG. 4. In step 401, a minimum number of transcription errors is set. Thus, in one embodiment, the value MISTRANSCRIBE_MIN_SEEN = the minimum number of instances of the same mistranscription that must be seen before the candidate mistranscription or candidate class member is "identified", i.e., acted on by the system. In step 403, a threshold confidence level that a transcription error exists is set. Thus, the value MISTRANSCRIBE_THRESH = the probability/confidence above which the system assumes the identified word is a mistranscription. Two thresholds are set because the amount of evidence collected differs for each candidate mistranscription instance: each instance may have a different context and a different number of matched and unmatched words between the candidate class member and the existing class member.
Other values may be set governing when an instance may be considered a candidate mistranscription, such as the number of transcription errors allowed per candidate class member; e.g., in a preferred embodiment, MAX_FRACTION_MISTRANSCRIBED = the maximum fraction of mistranscribed words allowed per utterance. If there are too many candidate mistranscriptions in a single candidate class member, there is less likely to be sufficient evidence that the recognized utterance is a class member. In alternative embodiments, different thresholds are set.
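Assembled as code, these configuration gates might look like the sketch below. The numeric defaults are assumptions, since the patent names the parameters but not their values:

```python
# Assumed configuration values for illustration only.
MISTRANSCRIBE_MIN_SEEN = 5          # instances required before the system acts
MISTRANSCRIBE_THRESH = 0.75         # confidence above which mistranscription is assumed
MAX_FRACTION_MISTRANSCRIBED = 0.34  # max fraction of words allowed to differ

def is_actionable(seen_count, confidence, n_mismatched, n_words):
    """Both gates from the flow chart must pass, and the utterance must not
    contain too many suspect words to count as a candidate at all."""
    if n_words == 0 or n_mismatched / n_words > MAX_FRACTION_MISTRANSCRIBED:
        return False  # too many candidate mistranscriptions in one utterance
    return seen_count >= MISTRANSCRIBE_MIN_SEEN and confidence >= MISTRANSCRIBE_THRESH
```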
In step 405, a natural language classifier is initialized with the set of classes to which the system can respond. In an embodiment of the present invention, a set of synonym phrases is also initialized in the classifier. A synonym phrase is a collection of equivalent words or phrases that can substitute for words or phrases in the class members of a class. In this way, class members can be extended without listing every possible variant of a class member as a separate class member. Each class in the set of classes is associated with a so-called intent, and each intent is mapped to a response that the system takes when the intent is recognized.
When an utterance is submitted to the system, step 407, in one embodiment the speech is recognized, and if the classifier determines that the utterance matches a class member, step 409, an appropriate response for the class is returned to the user, step 411. In other embodiments of the invention, rather than an exact match, a confidence level is used to determine whether a response should be returned. For example, the utterance is evaluated by the classifier, which returns the top T classes and an associated confidence conf_i for each class, conf_0 being the confidence in the highest-ranked class. When conf_0 exceeds the threshold THRESH, the system returns the system response associated with the intent of that class, step 411. For example, if the intent is "restroom directions", the system response provides directions to the restroom. In an embodiment of the invention, if there is no exact match, or if the confidence level does not exceed the threshold level, the system enters an interrogation mode in which more information is obtained from the user, for example by asking the user a clarifying question and analyzing the user utterance made in response to that question, step 410.
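The confidence-gated response selection can be sketched as below; the THRESH value and the class/response names are illustrative assumptions:

```python
THRESH = 0.7  # assumed response threshold

def choose_response(ranked_classes, responses):
    """ranked_classes: list of (class_name, confidence) pairs, highest first.
    Returns the response for the top class when conf_0 exceeds THRESH,
    otherwise None to signal that a clarifying question is needed (step 410)."""
    if not ranked_classes:
        return None
    top_class, conf_0 = ranked_classes[0]
    if conf_0 > THRESH:
        return responses.get(top_class)
    return None
```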
Next, a determination is made as to whether there is a candidate transcription error, step 412. In some embodiments, this step is performed by the classifier and passed to the error transcription analyzer. In other embodiments, all of the recognized utterances are passed to an error transcription analyzer, which will determine whether a transcription error has occurred. The process of determining the erroneous transcription is discussed in more detail below. If there is a transcription error, the transcription error and its location in the class member is stored, step 413. If not, the system returns to listen for other user utterances.
As the error transcription analyzer receives new instances of the mistranscription, i.e., the mistranscription reappears in user utterances, the analyzer accumulates more and more evidence that the recognized word is a mistranscription. Each instance may provide a different amount of evidence: in embodiments of the invention, an instance in which all but one word matches a class member provides more evidence than an instance in which several words of the utterance do not match the class member. As evidence accumulates, the confidence level meets the threshold for mistranscription, step 415. Once the confidence level exceeds the threshold, the class member with the mistranscription is stored as an alternative form of the class member. In embodiments of the present invention that use synonym phrases, the mistranscription may be stored as part of the synonym phrases of the class. Other embodiments use other components to store the mistranscription as a valid replacement for the original word(s).
FIG. 5 illustrates a flow chart of one embodiment of the present invention for adding new class members. In the illustrated embodiment, the intermediate confidence level of a candidate class member is used both to send the user a system response and to increment the mistranscription evidence. The classifier sends a message to the error transcription analyzer (not shown) containing the candidate mistranscriptions, the candidate class members, and the user responses to the system responses for the candidate mistranscriptions.
In step 501, a mistranscription is received from the classifier by the error transcription analyzer. A user response is received in step 503. In step 505, the location of the mistranscription within the class member is identified. For example, whenever the classifier detects a class member and the verbatim transcription matches N-k of the N words in the closest-matching class instance, with N words also used in the transcription of the utterance, the non-matching word pairs are represented by (w_{i_j}, a_{i_j}), where there are k indices {i_j}. In an embodiment of the present invention, the pairs (w_{i_j}, a_{i_j}) are stored in a hash of potential mistranscriptions. That is, a_{i_j} is a potential mistranscription of the word w_{i_j}, where w_{i_j} appears in the class instance. In some embodiments, the location of the mistranscription is part of the packet received from the classifier; in other embodiments, the determination is performed by the error transcription analyzer.
In addition to word pairs, in an embodiment of the invention, the error transcription analyzer stores three additional values. In step 507, the system stores the number of times the classifier responded under the assumption of erroneous transcription, and the answer given appears to have been accepted by the user. In step 509, the system stores the number of times the classifier has responded when assuming the wrong transcription, but the response given appears to have been rejected by the user. In step 511 the system stores the number of times that an erroneous transcription was detected, i.e. there is a direct correspondence between w_ { i_j } - > a_ { i_j } of the word in the top class instance and the substitute word in the transcription, but the system confidence in the top class does not exceed THRESH (intermediate threshold), so the system does not give a response.
In one example, the system stores (w_j, a_j, 5, 2, 4), meaning that 5 mistranscription hypotheses led to a user-accepted response, two mistranscription hypotheses led to a user-rejected response, and 4 times the system appeared to hear a_j instead of the word w_j, but the classifier confidence in the top class did not exceed THRESH_1 and therefore no system response was given. In this illustrative embodiment, the general entry in the mistranscription hash is given by (w, a, CO, IN, NO), where w = correct word, a = potentially mistranscribed word, CO = correct count, IN = incorrect count, and NO = no-response count.
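A sketch of the mistranscription hash with its (w, a, CO, IN, NO) entries; the class and method names are assumptions:

```python
class MistranscriptionHash:
    """Stores (w, a) word pairs with CO/IN/NO counts as described:
    CO = responses accepted by the user, IN = responses rejected,
    NO = detections for which no response was given."""

    OUTCOMES = {"accepted": 0, "rejected": 1, "no_response": 2}

    def __init__(self):
        self.entries = {}  # (correct_word, heard_word) -> [CO, IN, NO]

    def record(self, w, a, outcome):
        counts = self.entries.setdefault((w, a), [0, 0, 0])
        counts[self.OUTCOMES[outcome]] += 1

    def correction_confidence(self, w, a):
        """Confidence in the correction a -> w, i.e. CO / (CO + IN)."""
        co, inc, _ = self.entries.get((w, a), (0, 0, 0))
        return co / (co + inc) if co + inc else 0.0
```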
The process continues until the confidence in the top class for the utterance exceeds a higher threshold level THRESH_2, step 513, meaning the machine learning system has sufficient confidence to store the mistranscription as an alternative class member, or until there are no remaining word pairs (w_i, a_i) representing candidate mistranscriptions to be substituted.
Note that there may be several words in the utterance with 5-tuples (w_i, a_i, CO_i, IN_i, NO_i) that are candidate mistranscriptions a_i. In this case, the process is iterated, with the substitutions a_i -> w_i performed in descending order of confidence in the correction, i.e., in descending order of CO_i/(CO_i+IN_i). This process continues as long as the number of mistranscribed words M and the total number of words N are such that M/N does not exceed MAX_FRACTION_MISTRANSCRIBED.
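The iterative substitution step can be sketched as follows, reading the stopping condition as keeping the fraction of corrected words within MAX_FRACTION_MISTRANSCRIBED; the function name and the dict shape are assumptions:

```python
def apply_corrections(words, candidates, max_fraction=0.34):
    """Replace suspected mistranscriptions a_i -> w_i in descending order
    of correction confidence CO_i / (CO_i + IN_i), stopping before the
    fraction of corrected words would exceed max_fraction.

    candidates: dict mapping heard word a_i -> (correct word w_i, CO_i, IN_i)
    """
    ranked = sorted(
        candidates.items(),
        key=lambda kv: kv[1][1] / ((kv[1][1] + kv[1][2]) or 1),
        reverse=True,
    )
    corrected = list(words)
    replaced = 0
    for heard, (correct, _co, _in) in ranked:
        if heard not in corrected:
            continue
        if (replaced + 1) / len(words) > max_fraction:
            break  # next substitution would exceed the allowed fraction
        corrected[corrected.index(heard)] = correct
        replaced += 1
    return corrected
```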
In various embodiments of the invention, class members may be stored that are specific to each user, class of users, specific environment (e.g., location), or type of environment. The trade-off in training class members for a specific user is that, although training will be more accurate for the particular types of mistranscription that user tends to produce, there are fewer recognized speech samples from a single user, which may mean that machine learning takes longer than training with multiple users. Training by user class has the advantage that more speech samples can be recognized, and thus machine learning proceeds faster, but there is a risk that individual users may be incorrectly classified as members of the user class, or that a mistranscription specific to a particular user is handled incorrectly.
Training class members according to a particular environment or type of environment may also be useful to obtain more voice samples than are obtained from a single user. The environment types may include noisy environments as compared to quiet environments. Alternatively, the environment type may be an environment in which certain activities occur, such as an automobile, home, work, or school. The system must sort the environments by type, which may require user input, e.g., confirm the environment type. Alternatively, the system may use geolocation input and mapping data to classify the environment, client data such as a voice utterance from a corporate desktop or a personal-owned smartphone, whether the client device is moving, environmental background noise that accompanies the voice sample. Class members may also be trained based on a particular environment/location (e.g., XYZ company headquarters or Joe's home).
FIG. 6 shows a flow chart of an embodiment of the invention in which class members are trained based on user and environmental characteristics. In an embodiment of the invention, different classes, i.e. class member sets, are trained and stored for the respective users, and different classes are stored for the different environments. In other embodiments, classes are stored for a particular user in a particular environment. The same existing class members exist in different user and environment classes, and as evidence accumulates, new candidate class members (based on existing class members) will have different amounts of evidence in different classes. Thus, when the corresponding threshold is exceeded, the new candidate class member will become a class member in some classes but not others. The graph is also used to show where the context is used to calculate the amount of evidence accumulated for candidate erroneous transcriptions of class members.
In step 601, new candidate class member data is received; i.e., similar to the example above, a candidate class member belonging to a respective class has been identified as containing a candidate mistranscription. Steps 603-613 receive data for determining which classes the new candidate member and candidate mistranscription belong to, and for determining the context in which the new candidate member and new candidate mistranscription were uttered. In step 603, user information is received. The user information may take a variety of forms. In an embodiment of the invention, login information identifies the user. As part of the registration process, the user enters personal information such as name, gender, ethnicity, and so forth. In other embodiments of the invention, the user information is biometric data used to identify and classify the user. During speech recognition, the system may make assumptions based on speech characteristics, e.g., vocal quality and accent. Finally, the system may enter an interactive dialog during the training phase to ask questions about identity, ethnicity, work role, and so forth. In embodiments that train and store classes for individual users, the user information is used to determine the user identity in step 605. In embodiments in which classes are trained and stored for user classes, the user information is used to determine the user class in step 607. A user class is a group of people, members of an organization, or another group of users likely to use words similarly (i.e., to produce similar mistranscriptions).
In step 609, the environmental information is received. In an embodiment of the invention, the environmental information is geolocation information, optionally enhanced by map information. In other embodiments, the environmental information includes background noise captured with speech, indicating a quiet or noisy environment, or movement information captured by a GPS or accelerometer, indicating a moving environment such as a vehicle. In some embodiments of the present invention, the context information is used to uniquely determine the context identification in step 611. In other embodiments, the context information is used to determine the context type in step 613. In some unique environments, such as workplaces or schools, specific terms are used, and thus the same erroneous transcription may occur for different users. In an environment of the same environment type, e.g. a noisy environment, the same erroneous transcription will tend to occur, e.g. background noise is erroneously recognized as speech. In embodiments of the invention, the context information may also be used to determine a class of users, for example, where the location is associated with the class of users.
Although not shown, as mentioned above, in other embodiments question information may also be received, which is useful for determining the context in which a new candidate member was uttered. By comparing recent utterances with the current utterance, the system can determine the probability that the candidate mistranscription is a true mistranscription. In other embodiments of the invention, other data is received.
Once the system determines which classes the new candidate class member and candidate mistranscription belong to, the system calculates how strong the evidence is for each particular class, step 614. For example, if the candidate class member and candidate mistranscription were uttered by a particular user in a particular environment, in embodiments of the present invention the evidence will be greater for the classes trained and stored for that particular user or that particular environment than for the class members of the user class and environment type to which the user and environment respectively belong. Being able to train class members simultaneously by user, user class, environment, and environment type allows the system to have more samples and train faster. It also allows the system to provide class members trained for a particular user in a particular environment, which will be most accurate for detecting mistranscriptions. That is, in embodiments of the present invention, a class is trained for a particular combination of user and environment features. The context of the candidate class member and candidate mistranscription, e.g., location or question information, is also used in embodiments of the present invention to determine the amount of evidence that should be accumulated for the class members in each class.
Next, in step 615, a determination is made as to whether sufficient evidence has been collected for the erroneous transcription of the particular class of users. If so, then in step 617, a new class member is added to the class with the incorrect transcription as a replacement for the original word in the class member. If not, in step 619, the accumulated evidence of the erroneous transcription in the user class is incremented. Dashed lines are shown from step 617, indicating that the accumulated evidence of the user class may be incremented even when the evidence exceeds a threshold of the user class.
Next, in step 621, a determination is made as to whether sufficient evidence has been collected for the erroneous transcription of a particular environmental class. If so, then in step 623, a new class member is added to the class with the incorrect transcription as a replacement for the original word in the class member. If not, in step 625, the accumulated evidence of erroneous transcription in the context type is incremented. From step 623, a dashed line is shown indicating that the accumulated evidence may be incremented even if the evidence exceeds a threshold of the environment type.
In this figure, for convenience of explanation, only decisions for a specific user and a specific environment are shown. However, in alternative embodiments, similar decisions are made for each of the user classes to which the user belongs and the environment type to which the environment belongs and the class of the particular user/environment combination.
In an embodiment of the invention, all classes are loaded for training. However, when the classifier is used to identify whether class members of a class are identified, in embodiments of the invention that identify users and environments, for example, the classifier will only use a selected group of classes for a particular user and/or a particular environment. In a distributed environment, where a client is used to collect and interact with speech samples from multiple individual users, the class can be trained by machine learning from all users in the multiple users, but using only the most focused class for the user and the environment allows faster training and better differentiation.
In other embodiments of the present invention, once training classes for a particular environment/user combination begins, the error transcription analyzer stops loading other classes for training. For example, once a user/environment combination reaches a desired confidence level, not necessarily a sufficiently high confidence level that class members are added to the class, other classes stop being trained in response to candidate false transcriptions from the particular user/environment combination.
In alternative embodiments, one or more of the listed steps may not be performed. For example, where class members are stored based only on user information, the steps relating to the environment are not performed. Where class members are stored for only a single user, the user class steps are not performed.
In fig. 7, a process for storing new candidate erroneous transcriptions is shown. As described above, an erroneous transcription may be a substitution error, in which the speech recognition system replaces an uttered word with an incorrect word; an insertion error, in which, for example, the system recognizes a word in a "garbage" utterance such as breathing or background noise; or a deletion error, in which one of the uttered words does not appear in the transcription. Each of these error types represents a different kind of erroneous transcription, and in embodiments of the present invention each type of erroneous transcription (substitution, insertion, deletion) is stored differently.
In step 700, a new candidate erroneous transcription is detected. The system first determines in step 701 whether the erroneous transcription is a substitution error. If it is, the candidate class member contains the same number of words as the potentially matching class member. If not, in step 703 the system determines whether the erroneous transcription is a deletion error. If so, one or more words from the existing class member are absent from the candidate class member. If the transcription error is not a deletion error, the system determines in step 705 whether it is an insertion error. For simplicity of illustration, only the tests for pure substitution errors, pure deletion errors, and pure insertion errors are shown. However, in alternative embodiments of the present invention, tests for other types of transcription errors are performed. For example, a candidate class member may contain multiple erroneous transcriptions of the same or different kinds, e.g., two substitution errors, or one substitution error and one insertion error.
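The word-count tests of steps 701, 703, and 705 can be sketched as below. As the text notes, only pure single-type errors are distinguished here; mixed errors within one candidate are out of scope for this sketch.

```python
def classify_error(member_words, candidate_words):
    """Pure substitution / deletion / insertion test by word count,
    mirroring steps 701, 703, and 705. A candidate with the same number
    of words as the class member is treated as a substitution; fewer
    words indicate a deletion; more words indicate an insertion."""
    if len(candidate_words) == len(member_words):
        return "substitution"
    if len(candidate_words) < len(member_words):
        return "deletion"
    return "insertion"
```

For example, "turn on lite" against the class member "turn on light" classifies as a substitution, while "turn light" classifies as a deletion.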
Once the system determines which type of candidate erroneous transcription occurs in the candidate class member, the appropriate type of transcription error symbol is used to track evidence. In step 707, the substitution symbol is used for a substitution error. This symbol is discussed above in connection with fig. 5. The location of the erroneous transcription within the class member is identified as i_j, and the non-matching word pair is represented as a (w_{i_j}, a_{i_j}) word pair. The system tracks the number of times it responded and was accepted, the number of times it responded and was rejected, and the number of times the erroneous transcription was detected but the system gave no response. In this illustrative embodiment, the general entry in the error transcription hash is given by (w, a, CO, IN, NO), where w = correct word, a = potential erroneous transcription word, CO = correct count (accepted), IN = incorrect count (rejected), and NO = no-response count.
In step 709, the deletion symbol is used. Because a word is missing from the candidate class member, the word pair is designated (w_{i_j}, 0_{i_j}) to indicate that no word in the candidate class member corresponds to the word w. The error transcription hash entry in this case is given by (w, 0, CO, IN, NO).
Similarly, in step 711, if an insertion error is detected, the insertion symbol is used. An example symbol representing an insertion is (0_{i_j}, w_{i_j}), and the associated error transcription hash entry is given by (0, w, CO, IN, NO).
Once the transcription error evidence has been added to the accumulated evidence of transcription errors for the class member in step 713, the process ends in step 715.
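The (w, a, CO, IN, NO) hash described above can be sketched as follows. The class and method names are invented; the patent specifies only the shape of the entries, with a = 0 marking a deletion and w = 0 marking an insertion.

```python
class TranscriptionHash:
    """Illustrative error transcription hash: maps a (w, a) word pair to
    the counts [CO, IN, NO] = [accepted, rejected, no response]."""

    _OUTCOME_INDEX = {"accepted": 0, "rejected": 1, "no_response": 2}

    def __init__(self):
        self.entries = {}  # (w, a) -> [CO, IN, NO]

    def _bump(self, w, a, outcome):
        entry = self.entries.setdefault((w, a), [0, 0, 0])
        entry[self._OUTCOME_INDEX[outcome]] += 1

    def substitution(self, w, a, outcome):
        self._bump(w, a, outcome)      # (w, a, CO, IN, NO)

    def deletion(self, w, outcome):
        self._bump(w, 0, outcome)      # (w, 0, CO, IN, NO): word w was dropped

    def insertion(self, a, outcome):
        self._bump(0, a, outcome)      # (0, a, CO, IN, NO): word a was inserted
```

For instance, a substitution of "musik" for "music" that the user accepted once and rejected once would yield the entry ("music", "musik") -> [1, 1, 0].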
The erroneous transcription concerns the substitution, deletion, or insertion of a word in the received utterance compared to the first speech member; that is, the received utterance is changed in a first manner compared to a first time slot in the first speech member. As the error transcription analyzer increments the evidence that the received utterance is evidence of an erroneous transcription in the first manner at the first time slot, the evidence will eventually exceed a threshold, and a second speech member is added to the group of speech members for use by the speech recognition system. The second speech member, compared to the first speech member, incorporates the change that has so far been identified as an "erroneous transcription" at the first time slot in the first speech member. Note that when the transcription error is an insertion or deletion error, the total number of slots in the generated second speech member will differ slightly: there will be one more slot for an insertion error and one fewer for a deletion error, but the change is still considered to occur at the first time slot of the first speech member.
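Deriving the second speech member from the first by applying the identified change at the slot can be sketched as below; the function name and string-based word representation are assumptions for illustration.

```python
def derive_member(first_member, slot, change, word=None):
    """Build the second speech member by applying the identified change at
    the given slot of the first member: a substitution keeps the slot
    count, an insertion adds one slot, and a deletion removes one."""
    words = first_member.split()
    if change == "substitution":
        words[slot] = word
    elif change == "insertion":
        words.insert(slot, word)
    elif change == "deletion":
        del words[slot]
    else:
        raise ValueError("unknown change type: %r" % change)
    return " ".join(words)
```

A substitution preserves the number of slots, while an insertion or deletion changes it by one, exactly as the paragraph above notes.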
FIG. 8 is a flow chart illustrating a method for incrementing evidence that a wrongly transcribed word in one class member is also a legitimate replacement for the same word in another class member of the same class. In a preferred embodiment of the invention, finding a single erroneous transcription in a long class member in a single instance, i.e., with all other words matching, is considered strong evidence of an erroneous transcription. However, in embodiments of the present invention, evidence of an erroneous transcription in one class member is also some evidence, preferably a lesser amount, that the erroneous transcription is a valid replacement in other class members.
The process begins at step 800, where a candidate erroneous transcription is identified for a first class member in a class. Next, in step 801, the system determines whether the incorrectly transcribed word is shared by another class member in the same class. If so, a series of decisions is made to determine how strong the evidence is for incrementing the incorrectly transcribed word in the other class members. For example, in step 803, the system determines whether the erroneous transcription is from the same user. If so, it is stronger evidence than an erroneous transcription from another user. As another example, if the erroneous transcription is received in the same environment, step 805, it is stronger evidence than an erroneous transcription received in another environment. In addition, an erroneous transcription received from two users considered to have the same first language (i.e., the same L1 language) is considered stronger evidence than one received from individuals having different L1 languages. As described above, the phonetic similarity between the words in the class members and the erroneous transcription is also evidence. Moreover, as described above, the number of correctly recognized words compared to the number of candidate erroneous transcriptions may be a factor in the amount of evidence to be added, but because this is a "second hand" factor, it may count for less than erroneous transcription evidence for the class member itself. Other decisions, such as whether the user is in the same user class or whether the environment is of the same environment type, as well as other tests, may be included in embodiments of the present invention.
In step 807, the determined amount of evidence is added that the incorrectly transcribed word is a legitimate substitution in the other class member. If there are other class members, step 809, the process repeats until there are no other class members for which to accumulate evidence. The process ends in step 811.
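The weighting decisions of steps 803-805 and the related factors listed above can be sketched as a scoring function. The numeric weights are invented purely for illustration; the patent assigns no values, only relative strength.

```python
def evidence_weight(same_user, same_environment, same_l1,
                    phonetic_similarity, matched_words, candidate_errors):
    """Hypothetical weighting of 'second hand' evidence that an erroneous
    transcription in one class member applies to another class member."""
    weight = 1.0                       # base amount for shared-word evidence
    if same_user:
        weight += 1.0                  # step 803: same user -> stronger evidence
    if same_environment:
        weight += 0.5                  # step 805: same environment -> stronger
    if same_l1:
        weight += 0.5                  # shared first (L1) language -> stronger
    weight += phonetic_similarity      # 0..1 similarity between the two words
    # More correctly recognized words relative to candidate errors -> stronger.
    weight += matched_words / (matched_words + candidate_errors)
    return weight
```

A same-user, same-environment observation with moderate phonetic similarity thus accumulates noticeably more evidence than a cross-user, cross-environment one, matching the relative ordering the text describes.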
FIG. 9 is a flow chart illustrating a method for incrementing evidence that a wrongly transcribed word from one class is a legitimate substitution for the same word in a class member of a different class. The process begins in step 901, where a wrongly transcribed word from one class is identified and the system determines whether it should be considered evidence of a wrongly transcribed word for a class member in another class. In one embodiment of the invention, erroneous transcriptions from class members of different classes are considered weaker evidence than erroneous transcriptions from the same class. Nevertheless, they are still some evidence, as a particular user may utter the same word in the same way regardless of the class in which the word occurs. Thus, the system performs decisions similar to those described above. In step 903, the system determines whether the incorrectly transcribed word is present in a class member of a new class, that is, a class different from the one in which the erroneous transcription was detected. In step 905, the system determines whether the same user uttered the erroneous transcription. In step 907, the system determines whether the erroneous transcription occurred in the same environment. Another decision is whether the same erroneous transcription has been uttered a threshold number of times. All of these factors are used to determine the amount of evidence that should be added that the incorrectly transcribed word is a valid replacement in class members of the different class. Other tests may be used to determine whether evidence should be added to class members.
Next, it is determined whether there is another class member to be checked; if so, the process returns to step 903. It is then determined whether there is another class to be checked; if not, the process ends at step 917.
In fig. 10, a flow chart of one embodiment of the present invention for obtaining additional evidence using a speech recognition system is shown. In one embodiment of the invention, when the transcription matches n_i - k of the words and the k remaining words are likely to be erroneous transcriptions, but some subset of the associated single-word erroneous transcriptions has never been seen, the system synthetically creates, via a text-to-speech subsystem, an audio stream of utterances each containing only a single mispronounced word, and feeds the stream to the speech recognition engine to see whether the erroneous transcription of the single word is corrected. If it is corrected, additional evidence of the erroneous transcription is accumulated; otherwise it is not. This embodiment of the invention takes advantage of the characteristics of a speech recognition system using a sliding N-gram window. In such speech recognition systems, the engine automatically corrects, by means of the sliding N-gram window in its hidden Markov model, some words that would otherwise be erroneous transcriptions if transcribed one word at a time. On the other hand, some speech recognition engines provide word-by-word transcription that is less accurate, but faster, than transcription that uses a sliding N-gram window or other correction means and outputs an uttered phrase at a time. Verbatim transcription is typically used by systems that must respond immediately upon hearing a word or phrase and cannot wait for a pause indicating completion of the utterance. Thus, by pairing a verbatim speech recognition engine, used for fast system response, with a sliding N-gram speech recognition engine, evidence can be accumulated for new class members for use by the verbatim engine.
Comparing a verbatim transcription to a whole-utterance transcription provides examples of many possible erroneous transcriptions. For example, suppose there are two or more suspected erroneous transcriptions in one utterance: the sentence is AA...XX...YY...BB, and the suspected correct version is AA...QQ...RR...BB, where QQ is considered a likely correction for XX and RR for YY. However, suppose the speech recognition engine has never seen the combination QQ...RR substituted for XX...YY, and all previously recognized utterances contained only a single substitution, so the evidence is indirect. In this case, the system generates synthesized speech (using a text-to-speech system) for the utterances AA...QQ...YY...BB and AA...XX...RR...BB and feeds them into a speech recognition system that uses an N-gram window or other correction mechanism, to see whether the utterances are recognized. This provides evidence for the single substitutions. If they are so recognized, there is additional evidence supporting the double substitution; otherwise there is not.
Referring to fig. 10, in step 1001, an utterance having a plurality of erroneous transcriptions is received, and the set of erroneous transcriptions is listed. In step 1003, the next erroneous transcription is selected. In step 1005, the system generates a new synthesized utterance that contains that erroneous transcription of the particular class member. In a preferred embodiment, only a single erroneous transcription from the set is used in the new synthesized utterance. In step 1007, the newly generated synthesized utterance is sent to a speech recognition system that uses an N-gram window or other correction mechanism, to see whether the utterance is recognized, i.e., corrected. In step 1009, it is determined whether the speech recognizer recognizes the synthesized utterance as the class member. If not, the method proceeds to check whether there is another erroneous transcription. If so, then in step 1011, evidence is accumulated that there is an erroneous transcription for the class member and that a new class member should therefore be added. In step 1013, the system determines whether there is another candidate erroneous transcription in the utterance. If so, the process returns to step 1003. If not, the process ends.
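The loop of steps 1001-1013 can be sketched with the text-to-speech and correcting speech recognition engines stubbed out as injected callables; the patent names no concrete engines, so these interfaces are assumptions.

```python
def accumulate_tts_evidence(class_member, candidate_errors,
                            synthesize, recognize):
    """For each suspected single-word error (slot, wrong_word), synthesize
    an utterance containing only that one error (steps 1003-1005), send it
    to a recognizer with a correction mechanism (step 1007), and count
    evidence when the recognizer maps it back to the class member
    (steps 1009-1011)."""
    evidence = {}
    for slot, wrong_word in candidate_errors:
        words = class_member.split()
        words[slot] = wrong_word          # only a single error per utterance
        audio = synthesize(" ".join(words))
        if recognize(audio) == class_member:
            key = (slot, wrong_word)
            evidence[key] = evidence.get(key, 0) + 1
    return evidence
```

With a stub recognizer that corrects "musik" to "music" but cannot correct "pley", only the "musik" substitution accumulates evidence, mirroring the correction test the figure describes.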
In other embodiments, the process extends to other class members within the class and to other utterances containing the same possibly mistranscribed word. In some of these embodiments, rules are used to accumulate less evidence for those other class members and other utterances than for the class member originally identified with the set of multiple erroneous transcriptions.
In an embodiment of the invention, the system adds new class members to the class by identifying a new phrase and then entering an interactive question pattern with the user to determine that the new phrase belongs to one of the existing classes.
In an embodiment of the present invention, a system administrator will define a set of class members for a given class. The system will then add new class members to the class using synonymous phrases or interactive question patterns, in addition to the new class members added due to the incorrect transcription.
While preferred operating environments and use cases have been described, the techniques herein may be used in any other operating environment where deployment of services is desired.
As described above, the above-described functionality may be implemented as a stand-alone method, e.g., one or more software-based functions executed by one or more hardware processors, or it may be implemented as a management service (including as a Web service via SOAP/XML or RESTful interfaces). Specific hardware and software implementation details described herein are for illustrative purposes only and are not meant to limit the scope of the described subject matter.
More generally, computing devices within the context of the disclosed subject matter are each data processing systems including hardware and software, and these entities communicate with each other over a network (such as the Internet, an intranet, an extranet, a private network, or any other communication medium or link). Applications on the data processing system provide native support for the Web and other known services and protocols, including but not limited to support for HTTP, FTP, SMTP, SOAP, XML, WSDL, UDDI and WSFL, etc. Information about SOAP, WSDL, UDDI and WSFL is available from the world wide web consortium (W3C), which is responsible for developing and maintaining these standards; more information about HTTP, FTP, SMTP and XML is available from Internet Engineering Task Force (IETF).
In addition to cloud-based environments, the techniques described herein may be implemented in or in conjunction with a variety of server-side architectures, including simple n-tier architectures, web portals, federated systems, and the like.
More generally, the subject matter described herein may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In a preferred embodiment, the module functions are implemented in software, which includes but is not limited to firmware, resident software, microcode, etc. Furthermore, the interfaces and functions can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can contain or store the program for use by or in connection with the instruction execution system, apparatus, or device. The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device). Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a Random Access Memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical discs include compact disc-read only memory (CD-ROM), compact disc-read/write (CD-R/W), and DVD. The computer readable medium is a tangible, non-transitory item.
The computer program product may be a product having program instructions (or program code) to implement one or more of the functions described. After being downloaded from a remote data processing system over a network, the instructions or code may be stored in a computer readable storage medium in the data processing system. Alternatively, those instructions or code may be stored in a computer readable storage medium in a server data processing system and adapted to be downloaded over a network to a remote data processing system for use with the computer readable storage medium in the remote system.
In representative embodiments, these techniques are implemented in a special purpose computing platform, preferably in software executed by one or more processors. The software is stored in one or more data stores or memories associated with the one or more processors and the software may be implemented as one or more computer programs. In general, such specialized hardware and software includes the functionality described above.
In a preferred embodiment, the functionality provided herein is implemented as an attachment or extension to existing cloud computing deployment management solutions.
While specific sequences of operations are described above as being performed by certain embodiments of the invention, it should be understood that such sequences are exemplary, as alternative embodiments may perform the operations in a different order, combine certain operations, overlap certain operations, etc. References in the specification to a given embodiment indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic.
Finally, while given components of the system have been described separately, one of ordinary skill will appreciate that some of the functions may be combined or shared in given instructions, program sequences, code portions, and the like.
Having described our invention, what we claim is as follows.

Claims (43)

1. A method for identifying erroneous transcriptions generated by a speech recognition system, comprising:
providing a set of known speech members for use by a speech recognition system, each speech member consisting of a corresponding plurality of words;
matching the received utterance with a first utterance member within the set of known utterance members, the first utterance member being a closest matching utterance member having a first plurality of words, wherein fewer than the first plurality of words in the received utterance are matched with the first plurality of words in the first utterance member and the received utterance changes in a first particular manner as compared to a first word in a first time slot in the first utterance member;
transmitting the received utterance to an error transcription analyzer component;
if the received utterance is evidence of a false transcription, incremental evidence of a false transcription is made by the false transcription analyzer; and
In response to the incremental evidence for the erroneous transcription exceeding a threshold, future received utterances including the erroneous transcription are processed as if the first word were recognized.
2. The method of claim 1, wherein the received utterance uses a first word in place of a second word used in a first time slot in the first utterance member;
wherein the false transcription analyzer performs incremental evidence of false transcription if the received utterance is evidence of false transcription of the first word replacing the second word.
3. The method of claim 1, further comprising:
in response to matching a second received utterance with the first utterance member, sending the second received utterance to an erroneous transcription analyzer, wherein the matching matches the first plurality of words and a second plurality of remaining words in the received utterance are candidate erroneous transcriptions;
generating a first synthesized utterance via a text-to-speech subsystem of an audio stream based on replacing a first continuous word group that is presumed to be a false transcription from the second plurality of remaining words in the first speech member with a presumed correct replacement;
Sending the first synthesized utterance to a speech recognition engine having correction features; and
accumulating evidence that the first continuous word group is a false transcription of the assumed correct substitution in response to a correction of the first speech member by the synthesized speech.
4. The method of claim 1, wherein the false transcription analyzer matches the received utterance with respective utterance members having different numbers of words and having a single first candidate false transcription that results in greater evidence for the first candidate false transcription, the first candidate false transcription containing one or more consecutive words that do not exactly match one or more consecutive words in the respective utterance members.
5. The method of claim 2, wherein the erroneous transcription analyzer uses a rule that increases evidence of erroneous transcription for a second word in a second speech member that also includes the first word based on the received utterance that matches the first speech member, wherein an amount of evidence that increases for the erroneous transcription in the second speech member is less than an amount of evidence that increases for the erroneous transcription in the first speech member.
6. The method of claim 1, wherein the error transcription analyzer increments evidence for the error transcription in the first manner at the first time slot based on a plurality of received utterances from a first user having error transcription in the first manner at the first time slot.
7. The method of claim 2, further comprising: each time a received utterance matches the first utterance member and the received utterance is evidence of a false transcription of the second word for the first word, incrementing evidence of the false transcription by the false transcription analyzer, such that more evidence is accumulated for the false transcription with each received utterance in which the second word is transcribed in place of the first word.
8. The method of claim 1, wherein the error transcription analyzer uses the following voice-based rules: a greater degree of speech similarity between the second word in the received utterance and the first word at the first time slot in the first member of the utterance results in a greater amount of evidence per received utterance instance than if such speech similarity was not detected.
9. The method of claim 1, wherein the error transcription analyzer is to increment evidence for the error transcription at the first time slot in a first manner based on a plurality of received utterances from a first environment having error transcription at the first time slot in the first manner.
10. An apparatus for identifying erroneous transcriptions generated by a speech recognition system, comprising:
a processor;
a computer memory holding computer program instructions for execution by the processor for identifying erroneous transcriptions generated by a speech recognition system, the computer program instructions comprising:
program code operable to provide a set of known speech members for use by a speech recognition system, each speech member consisting of a respective plurality of words;
program code operable to match a received utterance with a first utterance member within the set of known utterance members, the first utterance member being a closest matching utterance member having a first plurality of words, wherein fewer words than the first plurality of words in the received utterance match the first plurality of words in the first utterance member and the received utterance changes in a first particular manner as compared to a first word in a first time slot in the first utterance member;
Program code operable to send the received utterance to an error transcription analyzer component;
program code operable to, if the received utterance is evidence of a false transcription, conduct incremental evidence of a false transcription by the false transcription analyzer; and
program code operable to process a future received utterance including a false transcription as if the first word was recognized in response to the incremental evidence for the false transcription exceeding a threshold.
11. The device of claim 10, wherein the received utterance uses a first word in place of a second word used in a first time slot in the first utterance member;
wherein the false transcription analyzer performs incremental evidence of false transcription if the received utterance is evidence of false transcription of the first word replacing the second word.
12. The apparatus of claim 10, further comprising:
program code operable to send a second received utterance to an erroneous transcription analyzer in response to matching the second received utterance with the first utterance member, wherein the matching matches a first plurality of words and a second plurality of remaining words in the received utterance are candidate erroneous transcriptions;
Program code operable to generate a first synthesized utterance via a text-to-speech subsystem of an audio stream based on replacing a first continuous word group hypothesized to be an erroneous transcription from the second plurality of remaining words in the first speech member with a hypothesized correct replacement;
program code operable to send the first synthesized utterance to a speech recognition engine having correction features; and
program code operable to accumulate evidence that the first consecutive word group is assumed to be incorrectly transcribed for correct substitution in response to a correction of the first speech member by the synthesized utterance.
13. The device of claim 11, wherein the false transcription analyzer increases evidence of false transcription for the second word for the first word, wherein evidence of false transcription for a first user that uttered the received utterance is greater than evidence of false transcription for other users of the device.
14. The device of claim 11, wherein the false transcription analyzer is to increment evidence of false transcription for the second word of the first word in the first utterance, wherein evidence of false transcription for a first environment that receives the received utterance is greater than evidence of false transcription from other environments that receive utterances by the device.
15. A non-transitory computer-readable storage medium for a data processing system, the computer-readable storage medium storing computer program instructions for execution by the data processing system for identifying erroneous transcriptions generated by a speech recognition system, the computer program instructions comprising:
program code operable to provide a set of known speech members for use by a speech recognition system, each speech member consisting of a respective plurality of words;
program code operable to match a received utterance with a first utterance member within the set of known utterance members, the first utterance member being a closest matching utterance member having a first plurality of words, wherein fewer words than the first plurality of words in the received utterance match the first plurality of words in the first utterance member and the received utterance changes in a first particular manner as compared to a first word in a first time slot in the first utterance member;
program code operable to send the received utterance to an error transcription analyzer component;
program code operable to, if the received utterance is evidence of a false transcription, conduct incremental evidence of a false transcription by the false transcription analyzer; and
Program code operable to process a future received utterance including the erroneous transcription as if the first word was recognized in response to the incremental evidence of the erroneous transcription exceeding a threshold.
16. The computer-readable storage medium of claim 15, wherein the received utterance uses a first word in place of a second word used in a first time slot in the first utterance member;
wherein the false transcription analyzer increments evidence if the received utterance is evidence of a false transcription of the first word replacing the second word.
17. The computer-readable storage medium of claim 15, further comprising:
program code operable to send a second received utterance to an erroneous transcription analyzer in response to matching the second received utterance with the first utterance member, wherein the matching matches the first plurality of words and a second plurality of remaining words in the received utterance are candidate erroneous transcriptions;
program code operable to generate a first synthesized utterance via a text-to-speech subsystem of an audio stream based on replacing a first continuous word group hypothesized to be an erroneous transcription from the second plurality of remaining words in the first speech member with a hypothesized correct replacement;
Program code operable to send the first synthesized utterance to a speech recognition engine having correction features; and
program code operable to accumulate evidence that the first consecutive word group is assumed to be incorrectly transcribed for correct substitution in response to a correction of the first speech member by the synthesized utterance.
18. The computer-readable storage medium of claim 15, further comprising:
program code operable to add a second utterance member as a temporary member of the group of utterance members for use by the speech recognition system, in response to the incremented evidence of the erroneous transcription in the first manner at the first time slot of the first utterance member exceeding an intermediate threshold that is below a first threshold;
program code operable to increment, by the erroneous transcription analyzer, the evidence of the erroneous transcription based on acceptance of a system response by a user for the first utterance member, if the received utterance is evidence of the erroneous transcription in the first manner at the first time slot.
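Claim 18 describes a two-stage promotion: a candidate crossing an intermediate threshold becomes a temporary member, and user acceptance of the resulting system response contributes further evidence toward the first threshold. A minimal sketch of that bookkeeping follows; the threshold values and the size of the acceptance bonus are illustrative assumptions, not values from the patent.

```python
INTERMEDIATE, FIRST = 2, 4  # assumed threshold values

def update(evidence, candidate, temporary, permanent, user_accepted=False):
    """Advance `candidate` through temporary and permanent membership."""
    evidence[candidate] = evidence.get(candidate, 0) + 1
    if user_accepted:
        # Acceptance of the system response for a temporary member is
        # itself evidence of the erroneous transcription.
        evidence[candidate] += 1
    if evidence[candidate] >= FIRST:
        permanent.add(candidate)
        temporary.discard(candidate)
    elif evidence[candidate] >= INTERMEDIATE:
        temporary.add(candidate)
```

A candidate thus serves the recognizer provisionally before enough evidence accumulates to make it a full member.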
19. The computer-readable storage medium of claim 16, wherein the erroneous transcription analyzer increments the evidence of erroneous transcription of the second word as the first word, wherein evidence from users in a first user class who uttered the received utterance counts for more than evidence from users in other user classes.
20. The computer-readable storage medium of claim 16, wherein the erroneous transcription analyzer increments the evidence of erroneous transcription of the second word as the first word in the first utterance, wherein evidence from environments of a first environment type in which the received utterance is received counts for more than evidence from environments of other environment types.
21. A system for identifying erroneous transcriptions generated by a speech recognition system, comprising means for implementing the steps of any of claims 1-9.
22. A method for identifying erroneous transcriptions generated by a speech recognition system, comprising:
providing a first class of speech members for use by the speech recognition system, each class member consisting of a respective number of words, wherein the first class is defined by a first common meaning and a first common system response if a class member of the first class is recognized;
in response to the speech recognition system matching a received utterance with a first class member of the first class, sending the received utterance to an erroneous transcription analyzer, wherein the received utterance contains an erroneous transcription relative to the first class member;
incrementing, by the erroneous transcription analyzer, evidence of the erroneous transcription if the received utterance is evidence of an erroneous transcription of the first class member;
in response to the incremented evidence of the erroneous transcription for the first class member exceeding a first threshold, adding a second class member to the first class of speech members based on the erroneous transcription of the first class member; and
performing the first common system response in response to recognizing a second received utterance that matches the second class member.
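The core loop of claim 22 — match against class members, count a near-miss as candidate evidence, promote the candidate to a member once a threshold is crossed, then respond to it like any other member — can be sketched as follows. The matching rule (same length, one differing word), the threshold value, and all names are illustrative simplifications, not the claimed implementation.

```python
THRESHOLD = 3  # assumed first threshold for promoting a new class member

class UtteranceClass:
    def __init__(self, meaning, response, members):
        self.meaning = meaning            # first common meaning
        self.response = response          # first common system response
        self.members = [m.split() for m in members]
        self.evidence = {}                # candidate transcription -> count

    def handle(self, utterance):
        words = utterance.split()
        if words in self.members:
            return self.response          # recognized class member
        for member in self.members:
            # Same length, exactly one differing word: treat as a
            # candidate erroneous transcription of this member.
            if len(words) == len(member):
                diffs = sum(a != b for a, b in zip(words, member))
                if diffs == 1:
                    key = tuple(words)
                    self.evidence[key] = self.evidence.get(key, 0) + 1
                    if (self.evidence[key] >= THRESHOLD
                            and words not in self.members):
                        # Promote the erroneous transcription to a member.
                        self.members.append(words)
                    return self.response
        return None
```

Once promoted, the formerly erroneous transcription matches exactly and triggers the common system response directly.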
23. The method of claim 22, further comprising: providing a plurality of classes of speech members for use by the speech recognition system, each class member consisting of a respective number of words, wherein each respective class is defined by a respective common meaning and a respective common system response if a class member of the respective class is recognized.
24. The method of claim 22, wherein the erroneous transcription of the first class member is an erroneously transcribed word, the method further comprising: incrementing evidence for all class members in the class that contain the erroneously transcribed word, according to a rule that increments less evidence for class members other than the first class member.
25. The method of claim 22, wherein the erroneous transcription of the first class member is an erroneously transcribed word, the method further comprising: incrementing evidence for all class members containing the erroneously transcribed word, including those that are not members of the first class, according to a rule that increments less evidence for class members that are not members of the first class.
26. The method of claim 22, further comprising:
providing a first plurality of classes, each class comprising a set of speech members for use by the speech recognition system, each class of the first plurality of classes being for a respective user, wherein each class of the first plurality of classes is defined by the first common meaning and the first common system response if a class member of that class is recognized; and
training class members of each respective class of the first plurality of classes according to the user from whom the received utterance was received.
27. The method of claim 22, further comprising:
providing a second plurality of classes, each class comprising a set of speech members for use by the speech recognition system, each class of the second plurality of classes being for a respective environment, wherein each class of the second plurality of classes is defined by the first common meaning and the first common system response if a class member of that class is recognized; and
training class members of each respective class of the second plurality of classes according to the environment from which the received utterance was received.
28. The method of claim 26, further comprising:
providing each class of the first plurality of classes with a set of identical initial class members;
incrementing different amounts of evidence for class members of the classes of the respective users based on the same respective erroneous transcription instance; and
in response to the incremented evidence of the erroneous transcription for a third class member of a third user exceeding the first threshold, adding the third class member to the class of speech members for the third user while the incremented evidence for corresponding class members of other users does not exceed the first threshold.
29. The method of claim 27, further comprising:
providing a set of identical initial class members to each class of the second plurality of classes;
incrementing different amounts of evidence for class members of the classes of the respective environments based on the same respective erroneous transcription instance; and
in response to the incremented evidence of the erroneous transcription for a fourth class member of a first environment exceeding the first threshold, adding the fourth class member to the class of speech members for the first environment while the incremented evidence for corresponding class members of other environments does not exceed the first threshold.
30. The method of claim 22, further comprising:
providing a third plurality of classes, each class comprising a set of speech members for use by the speech recognition system, each class of the third plurality of classes being for a respective user class, wherein each class of the third plurality of classes is defined by a third common meaning and a third common system response if a class member of the third plurality of classes is identified;
training class members of each respective class of the third plurality of classes according to the user class from which the received utterance was received, wherein the training increments different amounts of evidence for class members of the classes of the respective user classes based on the same respective erroneous transcription instance; and
in response to the incremented evidence of the erroneous transcription for a fifth class member of a first user class exceeding the first threshold, adding the fifth class member to the class of speech members for the first user class while the incremented evidence for corresponding class members of other user classes does not exceed the first threshold.
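Claims 26 through 30 share one mechanism: a single erroneous-transcription instance increments evidence in every per-user, per-environment, or per-user-class copy of a class, but by different amounts, so the candidate can cross the first threshold for one class while remaining below it for the others. A minimal sketch, in which the weights and threshold are assumed values for illustration:

```python
FIRST_THRESHOLD = 3  # assumed first threshold

def increment_for_classes(per_class_evidence, candidate, observed_class,
                          same_class_weight=1.0, other_class_weight=0.25):
    """Increment evidence for `candidate` in every class, weighting the
    class the utterance actually came from more heavily. Returns the
    classes for which the candidate has crossed the first threshold."""
    promoted = []
    for cls, evidence in per_class_evidence.items():
        weight = (same_class_weight if cls == observed_class
                  else other_class_weight)
        evidence[candidate] = evidence.get(candidate, 0) + weight
        if evidence[candidate] >= FIRST_THRESHOLD:
            promoted.append(cls)  # candidate becomes a member for this class
    return promoted
```

Starting all classes from the same initial members and applying this rule yields the per-class divergence the claims describe: the class that actually produces the instances promotes the member first.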
31. An apparatus for identifying erroneous transcriptions generated by a speech recognition system, comprising:
a processor;
a computer memory holding computer program instructions for execution by the processor for identifying erroneous transcriptions generated by a speech recognition system, the computer program instructions comprising:
program code operable to provide a first class of speech members for use by the speech recognition system, each class member consisting of a respective number of words, wherein the first class is defined by a first common meaning and a first common system response if a class member of the first class is recognized;
program code operable to send a received utterance to an erroneous transcription analyzer in response to the speech recognition system matching the received utterance with a first class member of the first class, wherein the received utterance contains an erroneous transcription relative to the first class member;
program code operable to increment, by the erroneous transcription analyzer, evidence of the erroneous transcription if the received utterance is evidence of the erroneous transcription of the first class member;
program code operable to add a second class member to the first class of speech members based on the erroneous transcription of the first class member, in response to the incremented evidence of the erroneous transcription for the first class member exceeding a first threshold; and
program code operable to perform the first common system response in response to recognizing a second received utterance that matches the second class member.
32. The apparatus of claim 31, further comprising:
program code operable to provide a plurality of classes of speech members for use by the speech recognition system, each class member consisting of a respective number of words, wherein each respective class is defined by a respective common meaning and a respective common system response if a class member of the respective class is recognized.
33. The apparatus of claim 31, wherein the erroneous transcription of the first class member is an erroneously transcribed word, further comprising: program code operable to increment evidence for all class members in the class that contain the erroneously transcribed word, according to a rule that increments less evidence for class members other than the first class member.
34. The apparatus of claim 31, further comprising: computer code operable to provide a third plurality of classes, each class comprising a set of speech members for use by the speech recognition system, each class of the third plurality of classes being for a respective user class, wherein different classes of the third plurality of classes increment different amounts of evidence based on the same erroneous transcription instance.
35. The apparatus of claim 33, further comprising: computer code operable to provide a fourth plurality of classes, each class comprising a set of speech members for use by the speech recognition system, each class of the fourth plurality of classes being for a respective environment type, wherein different classes of the fourth plurality of classes increment different amounts of evidence based on the same erroneous transcription instance.
36. The apparatus of claim 33, further comprising: program code operable to increment evidence for all classes having a candidate erroneous transcription, wherein different amounts of evidence are incremented for the respective classes for a particular erroneous transcription instance according to user and environment rules.
37. A non-transitory computer-readable storage medium for a data processing system, the computer-readable storage medium storing computer program instructions for execution by the data processing system for identifying erroneous transcriptions generated by a speech recognition system, the computer program instructions comprising:
program code operable to provide a first class of speech members for use by the speech recognition system, each class member consisting of a respective number of words, wherein the first class is defined by a first common meaning and a first common system response if a class member of the first class is recognized;
program code operable to send a received utterance to an erroneous transcription analyzer in response to the speech recognition system matching the received utterance with a first class member of the first class, wherein the received utterance contains an erroneous transcription relative to the first class member;
program code operable to increment, by the erroneous transcription analyzer, evidence of the erroneous transcription if the received utterance is evidence of the erroneous transcription of the first class member;
program code operable to add a second class member to the first class of speech members based on the erroneous transcription of the first class member, in response to the incremented evidence of the erroneous transcription for the first class member exceeding a first threshold; and
program code operable to perform the first common system response in response to recognizing a second received utterance that matches the second class member.
38. The computer-readable storage medium of claim 37, further comprising:
program code operable to provide a plurality of classes of speech members for use by the speech recognition system, each class member consisting of a respective number of words, wherein each respective class is defined by a respective common meaning and a respective common system response if a class member of the respective class is recognized.
39. The computer-readable storage medium of claim 37, wherein the erroneous transcription of the first class member is an erroneously transcribed word, further comprising: program code operable to increment evidence for all class members in the class that contain the erroneously transcribed word, according to a rule that increments less evidence for class members other than the first class member.
40. The computer-readable storage medium of claim 37, further comprising:
program code operable to identify a user or environment of the received utterance; and
program code operable to select an appropriately trained class for the speech recognition system based on the identified user or environment.
41. The computer-readable storage medium of claim 37, wherein the first utterance class member is for a particular user/environment combination.
42. The computer-readable storage medium of claim 37, wherein evidence is accumulated according to the following rule: an erroneous transcription received from two users in a first user class is treated as stronger evidence of the erroneous transcription for the first user class than a first instance received from a user in the first user class together with a second instance received from a user in a second user class.
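The rule in claim 42 says evidence concentrated inside one user class outweighs the same number of sightings spread across classes. One way to sketch such a scoring rule, where the concentration bonus and the out-of-class discount are illustrative assumptions:

```python
def class_evidence(sightings, user_class):
    """Score evidence for `user_class` from (transcription, class) sightings
    of the same erroneous transcription."""
    in_class = sum(1 for _, cls in sightings if cls == user_class)
    out_class = len(sightings) - in_class
    # Each in-class sighting counts fully; repeated in-class sightings earn
    # a concentration bonus, while out-of-class sightings count only weakly.
    bonus = 0.5 if in_class >= 2 else 0.0
    return in_class + bonus + 0.25 * out_class
```

Two sightings within the first user class thus score strictly higher for that class than one in-class and one out-of-class sighting, matching the ordering the claim requires.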
43. A system for identifying erroneous transcriptions generated by a speech recognition system, comprising means for implementing the steps of any of claims 22-30.
CN201910000917.4A 2018-01-07 2019-01-02 Method and system for identifying erroneous transcription generated by a speech recognition system Active CN110021295B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US15/863,937 US10593320B2 (en) 2018-01-07 2018-01-07 Learning transcription errors in speech recognition tasks
US15/863938 2018-01-07
US15/863,938 US10607596B2 (en) 2018-01-07 2018-01-07 Class based learning for transcription errors in speech recognition tasks
US15/863937 2018-01-07

Publications (2)

Publication Number Publication Date
CN110021295A CN110021295A (en) 2019-07-16
CN110021295B true CN110021295B (en) 2023-12-08

Family

ID=67188728

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910000917.4A Active CN110021295B (en) 2018-01-07 2019-01-02 Method and system for identifying erroneous transcription generated by a speech recognition system

Country Status (1)

Country Link
CN (1) CN110021295B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11557288B2 (en) 2020-04-10 2023-01-17 International Business Machines Corporation Hindrance speech portion detection using time stamps

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6208964B1 (en) * 1998-08-31 2001-03-27 Nortel Networks Limited Method and apparatus for providing unsupervised adaptation of transcriptions
CN1841498A (en) * 2005-03-30 2006-10-04 国际商业机器公司 Method for validating speech input using a spoken utterance
CN101031913A (en) * 2004-09-30 2007-09-05 皇家飞利浦电子股份有限公司 Automatic text correction
CN102915733A (en) * 2011-11-17 2013-02-06 微软公司 Interactive speech recognition
CN103035240A (en) * 2011-09-28 2013-04-10 苹果公司 Speech recognition repair using contextual information

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10048934B2 (en) * 2015-02-16 2018-08-14 International Business Machines Corporation Learning intended user actions




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant