WO2013056343A1 - System, method and computer program for correcting speech recognition information - Google Patents


Info

Publication number
WO2013056343A1
Authority
WO
WIPO (PCT)
Prior art keywords
questions
question
speech recognition
information
speech
Prior art date
Application number
PCT/CA2012/000911
Other languages
French (fr)
Inventor
Ming Li
Yang Tang
Di WANG
Original Assignee
Ming Li
Yang Tang
Wang Di
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ming Li, Yang Tang, Wang Di filed Critical Ming Li
Publication of WO2013056343A1 publication Critical patent/WO2013056343A1/en


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/18: Speech classification or search using natural language modelling
    • G10L15/1815: Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems

Definitions

  • the present invention relates to speech recognition methods and systems.
  • the present invention relates more particularly to methods and systems for correcting the output of voice recognition methods and systems.
  • There is increasing use of speech recognition as an interface to computer systems and computer programs. This includes use of speech recognition as a means of activating mobile device functions or mobile application functions, for example when a user is driving.
  • the first is the training that is required to gradually improve the accuracy of speech recognition output. This training is time consuming and may not be practical, especially for mobile technologies, for a number of reasons, including variable noise conditions that make training-based systems inaccurate in mobile settings.
  • the second, and perhaps more important disadvantage is that noisy environments, speaker diversity, and errors in speech are all widespread factors. Each of these factors has a significant negative impact on speech recognition accuracy, and in combination the negative impact can be quite severe.
  • a computer implemented speech recognition method comprising: (A) capturing one or more elements of speech using a speech capture means, the elements of speech relating to a domain; (B) using one or more speech recognition utilities so as to generate one or more sets of speech recognition information based on the one or more elements of speech; (C) using one or more computers to apply one or more correction routines to the one or more sets of speech recognition information, the one or more correction routines including information distance analysis or compression of the one or more sets of speech recognition information to a set of text elements related to a relevant domain and stored to a database; and (D) constructing one or more data outputs related to a meaning intended by a user by the one or more elements of speech.
  • the method computes a user's intended question or q, and the text elements consist of a database of questions or Q.
  • q is calculated by concurrently: (A) minimizing the information distance between q and Q, with irrelevant information removed; and (B) minimizing the information distance between q and the speech recognition information consisting of output queries.
  • the method yields as output a determination of q, which may be one or more output queries, or a combination of two or more of the output queries, or one of the questions from Q that may be related to two or more of the output queries.
  • the method comprises the further steps of: (A) splitting questions into words, and aligning groups of input questions that include words or name entities with similar pronunciations, so as to produce word alignment results; (B) enhancing the set of input questions by building additional questions based on the word alignment results; (C) determining a set of relevant questions from the database, based on semantic and/or syntactic similarity, optionally if the relevant questions yield a question that is the same as one of the input questions, this question is identified as the correct question; (D) optionally grouping the input questions using one or more hierarchical clustering operations into clusters, and extracting patterns from each cluster ("extracted patterns"); (E) generating candidate questions by mapping the relevant input question into the extracted patterns; (F) ranking candidate questions using the information distance analysis or compression operations; and (G) returning as a corrected question the candidate question with the minimum information distance score.
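The steps (A) through (G) above can be sketched in outline. The following is a minimal illustration, not the patented implementation: the function names are hypothetical, the pattern-mapping of step (D) is omitted, and a simple word-overlap distance stands in for the information-distance score of step (F).

```python
def correct_question(input_questions, database_questions):
    """Return the candidate question closest to the recognizer outputs."""
    # (A) split each recognizer hypothesis into words
    aligned = [q.lower().split() for q in input_questions]
    # (C) keep database questions that share vocabulary with the input
    input_vocab = {w for words in aligned for w in words}
    relevant = [q for q in database_questions
                if input_vocab & set(q.lower().split())]
    # step (C) shortcut: a relevant question identical to an input is
    # identified as the correct question
    lowered_inputs = {q.lower() for q in input_questions}
    for q in relevant:
        if q.lower() in lowered_inputs:
            return q
    # (E) candidates: here simply the relevant database questions
    # (F)-(G) rank by a toy word-overlap distance standing in for the
    # information-distance score, and return the minimum
    def score(candidate):
        cand_words = set(candidate.lower().split())
        return min(len(cand_words ^ set(words)) for words in aligned)
    return min(relevant, key=score) if relevant else input_questions[0]
```

With recognizer outputs "whole is the mayor of waterloo" and "hole is the mayor of water", the database question "who is the mayor of waterloo" wins because it differs from the inputs by the fewest words.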
  • a computer implemented system for corrected speech recognition comprising: one or more computers including or being linked to a server computer, the server computer implementing a server application, the server application defining a correction utility, where the correction utility includes or is linked to one or more databases each including text elements related to a domain; the correction utility being operable to receive from one or more speech recognition utilities, linked to the server computer or one or more remote servers connected to the server via the Internet, one or more sets of speech recognition information based on one or more elements of speech captured from a user and associated with an intended meaning, wherein the one or more sets of speech recognition information are associated with a domain; the correction utility applying one or more correction routines to the one or more sets of speech recognition information that include information distance analysis of the one or more sets of speech recognition information to the text elements related to the domain and stored to the database; and the correction utility constructing one or more data outputs related to a meaning intended by the user by the one or more elements of speech.
  • the computer system computes a user's intended question or q, and the text elements consist of a database of questions or Q.
  • the correction utility applies one or more complementary information distance analysis or compression operations to disregard irrelevant questions.
  • the correction utility is configured to calculate q using a D_min and D_max operation, with the irrelevant questions from Q being removed.
  • the correction utility is configured to calculate q by concurrently: (A) minimizing the information distance between q and Q, with irrelevant information removed; and (B) minimizing the information distance between q and the speech recognition information consisting of output queries.
  • the correction utility generates as output a determination of q, which may be one of the output queries, or a combination of two or more of the output queries, or one of the questions from Q that may be related to two or more of the output queries.
  • the correction utility is further operable to: (A) split questions into words, and align groups of input questions that include words or name entities with similar pronunciations, so as to produce word alignment results; (B) enhance the set of input questions by building additional questions based on the word alignment results; (C) determine a set of relevant questions from the database, based on semantic and/or syntactic similarity, optionally if the relevant questions yield a question that is the same as one of the input questions, this question is identified as the correct question; (D) optionally group the input questions using one or more hierarchical clustering operations into clusters, and extract patterns from each cluster ("extracted patterns"); (E) generate candidate questions by mapping the relevant input question into the extracted patterns; (F) rank candidate questions using the information distance analysis or compression operations; and (G) return as a corrected question the candidate question with the minimum information distance score.
  • the information distance analysis includes clustering the one or more elements of speech using a cluster of related records in the database.
  • the one or more data outputs are generated based on calculation of a D_min operation applied to the one or more elements of speech and a D_max operation applied to the text elements.
  • An Internet implemented system is also provided that provides speech recognition data services to a network of computer devices, where the speech recognition data services include data correction in accordance with the method of the invention.
  • FIG. 1 depicts an exemplary system diagram illustrating the network architecture for implementing the present invention, in accordance with one embodiment of the present invention.
  • FIG. 2 is a workflow diagram illustrating a representative workflow in accordance with one aspect of the invention.
  • FIG. 3 is a representative architecture diagram illustrating a possible implementation of the technology of the present invention.
  • FIG. 4 illustrates a generic computer implementation of the computer program aspects of the present invention.
  • the present invention provides a computer network implemented system, a computer network implemented method, and a computer network architecture that enables improved speech recognition output, and also activation of third party systems based on speech recognition output, using a unique and innovative mechanism for data correction that enables a voice input means to various computer devices that performs far better than what is possible with prior art solutions.
  • the invention also includes a computer program for implementing method functions described, which may be implemented for example as a server application, implemented to one or more servers.
  • the computer network architecture described herein enables delivery of improved speech recognition features for example to a mobile device.
  • a possible computer system implementation of the present invention is illustrated in Fig. 1 and Fig. 2.
  • a computer implemented method that applies information distance analysis, or compression methods, to voice recognition results, namely a set of query outputs (for example voice recognition results generated from one or more third party voice recognition platforms) relative to a set of meaningful queries, from which the system and method of the invention can reconstruct the intended query.
  • information distance is a form of compression that provides desirable outcomes.
  • the performance of the present invention may be illustrated by referring to 5 test cases, each using 200-300 questions from a test set not contained in the database (22) (shown in Fig. 1 ) linked to the computer system.
  • the computer system of the invention reduced the number of errors by 20-40% for native speakers and over 50% for non-native speakers, in comparison to the performance of market-leading speech recognition software such as packages available from GOOGLE™, DRAGON™, or MICROSOFT™.
  • the less than optimal performance of third party speech recognition packages may be explained by: (i) the impact of noisy environments, (ii) speech variations, for example, adults versus children, native speakers versus non-native speakers, females versus males, especially when individual voice input training is not possible or is impractical, and (iii) errors in speech. Regarding errors in speech, humans do not always speak in correct and complete sentences, without any breaks or corrections in the middle. Even a model speaker using a voice recognition system may cough from time to time, elide certain words, and so on. Moreover, prior art speech recognition systems are not able to resolve the differences, for example, between "sailfish" and "sale fish".
  • the present invention is based on a unique and innovative speech recognition improvement methodology.
  • (A) Database (22) is configured to include information from a relevant domain, obtained for example from the Internet or databases, and that may optionally be enhanced.
  • the database is shown to include Q data sets.
  • the database (22) entries may relate to a particular domain or language component.
  • Figs. 1 and 2 illustrate the use of the present invention as a QA service, and therefore database (22) contains a multitude of possible queries.
  • a current implementation of the database (22) includes 35 million queries obtained from the Internet (“database queries").
  • (B) Database (22) is used to correct queries from one or more speech recognition systems (referred to as "output queries”), in accordance with the present example embodiment of the invention.
  • a correction utility or component (21 ) is operable to use the database queries as templates or patterns to generate the original intended question from the voice recognition input or output queries.
  • the correction utility (21 ) may incorporate a correction engine (24) and a database (22).
  • the correction engine (24) embodies the operations described herein, and based on such operations the correction utility (21 ) utilizing one or more databases such as database (22) is operable to generate corrected text (28).
  • Corrected text (28) may in turn be used to support a variety of applications including for example an enhanced QA service implemented using for example a QA server (not shown).
  • a QA server may for example include or link to the correction utility (21 ) of the present invention.
  • Kolmogorov complexity was invented in the 1960s. The concept may be explained in relation to a universal Turing machine U.
  • the Kolmogorov complexity of a binary string x conditioned on another binary string y is the length of the shortest (prefix-free) program for U that outputs x with input y. Since it can be shown that for a different universal Turing machine U' the metric differs by only an additive constant, we simply write K(x|y) instead of K_U(x|y). We write K(x|ε), where ε is the empty string, as K(x). We call a string x random if K(x) ≥ |x|.
  • the minimum number of bits needed to convert between x and y to define their distance may be defined.
  • the cost of conversion between x and y may be defined as:
  • D_max(x, y) = max{ K(x|y), K(y|x) }    (1)
  • This distance D_max is shown to satisfy the basic distance requirements such as positivity, symmetry, and the triangle inequality. Furthermore, D_max is "universal" in the sense that D_max always minorizes any other reasonable computable distance metric. This information distance operation is known in the prior art, and this concept and its normalized versions have been applied in a number of different areas.
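Since Kolmogorov complexity is uncomputable, practical systems estimate K with the length of a compressed string. The sketch below uses the standard approximation K(x|y) ≈ C(yx) − C(y), where C is the compressed length; the choice of zlib as the compressor is an assumption for illustration, not something this document specifies.

```python
import zlib

def C(s: bytes) -> int:
    """Compressed length: a computable stand-in for K(s)."""
    return len(zlib.compress(s, 9))

def d_max(x: bytes, y: bytes) -> int:
    """Approximate D_max(x, y) = max{K(x|y), K(y|x)} via compression:
    K(x|y) is estimated by C(yx) - C(y)."""
    return max(C(y + x) - C(y), C(x + y) - C(x))
```

By construction the estimate is symmetric, and strings that share content score much lower than unrelated strings, which is the property the correction method relies on.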
  • the max distance D_max(x, y) has several problems when we consider only partial matching, where the triangle inequality fails to hold and the irrelevant information must be removed.
  • the present invention includes a complementary information distance to resolve this problem.
  • In Eq. (1) we asked for the smallest number of bits that must be used to reversibly convert between x and y.
  • D_min(x, y) = min{ K(x|y), K(y|x) } is defined as a complementary information distance that disregards irrelevant information.
  • D_min is obviously symmetric, but it does not satisfy the triangle inequality.
  • D_min was used in the QUANTA QA system, for example, to deal with concepts that are too popular. Its use as an operation for enabling information distance operations as between a first set of entities and a second set of entities, where there may be irrelevant information in one or more of the sets, for the purpose of determining the most accurate entity, is a novel and innovative contribution.
  • One of the contributions of the present invention may be understood as an operation for determining q based on a combined D_min and D_max operation, or min/max information distance operation, as further explained below.
  • D_min measures the information distance between q and Q with irrelevant information removed; and D_max is the information distance between the input I (the set of output queries) and q.
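One reading of this combined objective can be sketched as follows: rank each candidate q by the sum of D_min against the database Q and D_max against the recognizer outputs I, both approximated by compressed lengths. The zlib compressor, the equal weighting of the two terms, and the max-over-outputs treatment of I are all assumptions made for illustration.

```python
import zlib

def clen(s: str) -> int:
    """Compressed length of a string, approximating K(s)."""
    return len(zlib.compress(s.encode(), 9))

def cond(x: str, y: str) -> int:
    """Approximate K(x|y) by C(yx) - C(y)."""
    return clen(y + x) - clen(y)

def combined_score(q: str, outputs: list, database: list) -> int:
    # D_min(q, Q): cheapest conversion against any one database question,
    # which lets the rest of Q be disregarded as irrelevant
    d_min = min(min(cond(q, r), cond(r, q)) for r in database)
    # D_max(I, q): max-style conversion cost against the recognizer outputs
    d_max = max(max(cond(q, o), cond(o, q)) for o in outputs)
    return d_min + d_max
```

A candidate close to both a database question and the recognizer outputs scores lowest and is selected.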
  • D_max(x, y) = max{ K(x|y), K(y|x) }
  • the issues outlined in the paragraph above may be resolved using one or more of the following techniques.
  • Q is very large and contains different "types" of questions. For each type of question, we could extract one or more question templates. In this way, Q could be considered as a set of templates, and each template, denoted as p, covers a subset of questions from Q.
  • When encoding q, we do not have to encode q from Q directly. Instead we encode q with respect to the patterns or templates of Q. For example, if a pattern p appears N times in Q, then we can use log2(Total/N) bits to encode the index for this pattern. Given the pattern p, we encode q with p by encoding their word mismatches. There will be a tradeoff between the encoding of p and the encoding of q given p.
  • a common pattern may be encoded with a few bits, but it may require more bits to encode a specific question using this pattern.
  • the template "who is the mayor of City Name” requires more bits to encode than the template "who is the mayor of Noun”, because the former is a smaller class than the latter.
  • the first template will require fewer bits to generate a question "who is the mayor of Waterloo”, since it requires fewer bits to encode Waterloo from the class "City Name” than from the class "Noun”.
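The tradeoff described in the two bullets above can be made concrete with the log2(Total/N) encoding. The counts and class sizes below are invented purely for illustration; only the 35-million total and the log2(Total/N) formula come from the text.

```python
import math

def pattern_bits(total: int, n: int) -> float:
    """Bits to index a pattern that covers n of the total questions."""
    return math.log2(total / n)

def slot_bits(class_size: int) -> float:
    """Bits to pick one filler word from a class of the given size."""
    return math.log2(class_size)

TOTAL = 35_000_000  # size of the question database Q (from the text)

# specific template: "who is the mayor of CityName" -- covers fewer
# database questions, but its slot class ("CityName") is small
specific = pattern_bits(TOTAL, 20_000) + slot_bits(5_000)

# general template: "who is the mayor of Noun" -- covers more database
# questions, but its slot class ("Noun") is much larger
general = pattern_bits(TOTAL, 50_000) + slot_bits(200_000)
```

With these hypothetical numbers the specific template costs more bits by itself (a rarer pattern index) yet fewer bits overall to generate a concrete question such as "who is the mayor of Waterloo", matching both observations in the text.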
  • patterns may be extracted by pre-processing or be extracted dynamically based on analysis of the output queries.
  • Q' may for example be organized in a hierarchical way. Similar questions may be mapped to a cluster and similar clusters may be mapped to a bigger cluster.
  • One pattern may be extracted from each cluster using for example a multiple alignment algorithm. This pattern should be as specific as possible, while at the same time covering all the questions in the cluster. The higher the cluster is in the hierarchy structure, the more general the pattern will be. So our hierarchical clustering technique, in one aspect, may ensure that all the possible patterns are extracted from relevant questions.
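A toy stand-in for the pattern-extraction step can be written as a positional intersection over a cluster: keep the words shared by every question at each position and wildcard the rest. The real system uses a multiple alignment algorithm over variable-length questions; this sketch assumes equal-length, pre-aligned questions.

```python
def extract_pattern(cluster):
    """Most specific pattern covering every question in the cluster:
    keep words shared at each position, wildcard ('*') the rest."""
    split = [q.split() for q in cluster]
    length = min(len(words) for words in split)
    pattern = []
    for i in range(length):
        words_here = {words[i] for words in split}
        pattern.append(words_here.pop() if len(words_here) == 1 else "*")
    return " ".join(pattern)
```

The deeper the cluster sits in the hierarchy (i.e. the more similar its questions), the fewer wildcards survive and the more specific the pattern is.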
  • this aspect of the operation of correction engine (24) may use one or more semantic and/or syntactic information techniques, including a POS tagger, Named Entity Recognition, WordNet and Wikipedia. For example, given a cluster of three questions:
  • the correction engine is operable to extract a pattern such as: Who is the Leader of Location? "Mayor”, “president” and “senator” are all mapped to the Leader class, while “Toronto”, “United States” and “New York” all belong to the Location class.
  • pattern p is treated by correction engine (24) as a sentence, so the problem of item-to-set encoding depends on the item-to-item encoding, the same as the computation of K(q|p) and K(p|q).
  • An optimal alignment between two sentences may be generated for example using a standard dynamic programming algorithm.
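A minimal version of such a dynamic-programming alignment at the word level is sketched below, with unit costs for insertion, deletion and substitution; the real system weights mismatches by semantic and morphological similarity rather than a flat cost of 1.

```python
def align(a, b, gap="-"):
    """Optimal word-level alignment of sequences a and b by standard
    dynamic programming (unit edit costs)."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                          d[i - 1][j] + 1,
                          d[i][j - 1] + 1)
    # backtrack to recover the aligned word pairs
    pairs, i, j = [], n, m
    while i or j:
        if i and j and d[i][j] == d[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i and d[i][j] == d[i - 1][j] + 1:
            pairs.append((a[i - 1], gap))
            i -= 1
        else:
            pairs.append((gap, b[j - 1]))
            j -= 1
    return pairs[::-1]
```

Aligning "who is the mayor of waterloo" against "who was the mayor of water" pairs "is" with "was" and "waterloo" with "water", exactly the mismatches the encoder must then cost.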
  • correction engine (24) encodes a missing word by the negative logarithm of its probability to appear at the said location, and encodes the mismatches by calculating their semantic and morphology similarities. Obviously, it requires fewer bits to encode between synonyms than antonyms.
  • the last problem to consider in Formula (4) is the selection of candidate questions q. It may not be possible to search through the whole question space. We only consider the possible question candidates that are relevant to the input and that could be matched by at least one of our templates from database (22). Furthermore, a bigram language model may be applied by the correction engine (24) to filter questions with low likelihood. The language model may be trained on our background question set. A weighting value is trained by operations of correction engine (24); in the system of the present invention, this value may be a function of the length of the output question. It is optimal for a voice recognition system to respond instantaneously. The speed requirement forced us to adopt some tradeoffs in the implementation, for example not considering all possible patterns. In other words, the correction engine (24) may apply one or more operations to minimize the number of possible patterns analyzed, using one or more pre-configured thresholds.

Further Details regarding Database Q
  • test set T contains 300 questions, selected (with the criteria: no more than 11 words or 65 letters, one question in a sentence, no non-English letters) from a Microsoft QA set at http://research.microsoft.com/en-us/downloads/88c0021c-328a-4148-a158-a42d7331c6cf.
  • the only three questions that do not have similar patterns, in a strict sense, in Q are: Why is some sand white, some brown, and some black? Do flying squirrels fly or do they just glide? Was there ever a movement to abolish the electoral college?
  • Words or name entities with similar pronunciations are mapped together. For example, given three questions: “whole is the mayor of Waterloo”, “hole is the mayor of Water”, and “whole was the mayor of Water”, the best word alignment may be: Whole is the mayor of Waterloo
  • Step 2 Improve input questions 1. Build a question based on the word alignment results from the previous step. For each aligned word block, we choose one word to appear in the result.
  • Step 3 Find relevant database questions, sorted based on their semantic and syntactic similarity to the improved input questions from Step 2.
  • the method may return that input question directly; no further steps need to be done in this case.
  • patterns may be extracted from each cluster.
  • questions may be aligned in each group. Semantic similarities may be used to encode the word distance.
  • a group of questions may be converted into a single list with multiple word blocks, each block containing several alternative words from different questions. For example, given questions "Who is the mayor of New York", "Who is the president of United States” and "Which person is the leader of Toronto", a list of word blocks may be obtained after alignment: ⁇ who, who, which person ⁇ , ⁇ is ⁇ , ⁇ the ⁇ , ⁇ mayor, leader, president ⁇ of ⁇ New York, United States, Toronto ⁇ .
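The conversion just described can be sketched directly; the function name is illustrative, and the questions are assumed to be already position-aligned to equal length, with multi-word entities kept as single tokens.

```python
def word_blocks(aligned_questions):
    """Merge position-aligned questions into one list of word blocks;
    each block lists the alternative words seen at that position."""
    blocks = []
    for column in zip(*aligned_questions):
        block = []
        for word in column:
            if word not in block:  # keep first-seen order, drop repeats
                block.append(word)
        blocks.append(block)
    return blocks
```

Run on the document's own example, this yields blocks such as {who, which person} and {mayor, president, leader}.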
  • tags may be extracted that would best describe the slot.
  • YAGOTM may be used to describe the meaning of each word or phrase.
  • Several most common facts may be determined as the description of each word block. Then several semantic patterns may be determined using for example words and facts from YAGOTM.
  • Step 4. Generate the candidate questions
  • the original input questions that were extracted may be mapped into patterns.
  • Words in the patterns may be replaced with the words from the input. Lots of candidate questions could be generated by considering the different combinations of word replacements.
  • a bigram language model may be trained from our question set and candidate questions with low probability removed.
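A minimal bigram model of the kind described can be sketched as follows. The add-α smoothing scheme and the idea of comparing probabilities against a threshold are assumptions; the text only specifies that a bigram model trained on the question set removes low-probability candidates.

```python
from collections import Counter

def train_bigrams(questions):
    """Count unigrams and bigrams over a question set."""
    uni, bi = Counter(), Counter()
    for q in questions:
        words = ["<s>"] + q.split()
        uni.update(words)
        bi.update(zip(words, words[1:]))
    return uni, bi

def probability(question, uni, bi, alpha=0.1):
    """Add-alpha smoothed bigram probability of a candidate question."""
    p = 1.0
    words = ["<s>"] + question.split()
    vocab = len(uni)
    for a, b in zip(words, words[1:]):
        p *= (bi[(a, b)] + alpha) / (uni[a] + alpha * vocab)
    return p
```

Candidates produced by implausible word-replacement combinations score orders of magnitude lower than well-formed ones and can be dropped before the information-distance ranking.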
  • Step 5 Rank candidate questions using information distance
  • the distance is then calculated between the candidate questions and the input questions K(q
  • a missing word is encoded by the negative logarithm of its probability to appear at the said location, and word mismatches are calculated by their semantic, morphology and metaphone similarities.
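These two encoding costs can be written down directly. The negative-log cost for a missing word follows the text; the linear mapping from a similarity score to bits is an invented illustration, not the patent's formula.

```python
import math

def missing_word_bits(prob: float) -> float:
    """Bits to encode a missing word: the negative log of its
    probability of appearing at that position."""
    return -math.log2(prob)

def mismatch_bits(similarity: float, max_bits: float = 16.0) -> float:
    """Hypothetical cost for a word mismatch: high semantic, morphology
    or metaphone similarity (near 1.0) costs few bits; low similarity
    costs many. The 16-bit ceiling is an arbitrary illustration."""
    return (1.0 - similarity) * max_bits
```

So a word expected at a slot half the time costs exactly one bit, and a near-synonym mismatch costs far less than an unrelated word.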
  • Step 6 Return the candidate with the minimum information distance score as the final result.
  • steps 3.3, 3.4, and 3.5 may be performed offline on the complete database Q.
  • the GOOGLE™ voice recognition returned 3 options for each question.
  • the system of the present invention allows the speaker to repeat the question, and the system computes the optimal solution using the methods above. However, this experiment did not use that feature. Experiments of this nature are a tricky matter; they depend on people, environment, and questions. We have tried to test the system using different types of people.
  • CC means the voice recognition software (of Google, all same below) has returned the correct answer as the first option, and the present system agreed with it;
  • WC means the voice recognition software returned the wrong answer as the first option, and the system of the present invention has returned the correct answer;
  • CW means the voice recognition software returned the correct answer as the first option, and the system of the present invention has returned the wrong answer;
  • WW means the voice recognition software returned the wrong answer as the first option, and the system of the present invention has also returned the wrong answer. All the experiments were performed in quiet environments.
  • the system of the present invention has also demonstrated a clear advantage. Such an advantage will be amplified in a noisy daily environment. Allowing the speaker to repeat the question will increase the success rate, as the following example (with GOOGLE™ voice recognition) shows: the present system generated "How many toes does Mary Monroe have?" at the first query and generated "How many titles does Marilyn Monroe have?" at the second query. Putting the two questions together, the present system generates the correct and intended question "How many toes does Marilyn Monroe have?". This feature is implemented in the present system but was not included in this experiment.
  • T contains 300 questions. T was chosen independently and T ⊄ Q. Not all questions in T were used by each speaker in the experiments, mostly because non-native speakers and children skipped sentences that contained hard-to-pronounce words. Less proficient English speakers tend to skip more questions.
  • a server with 4 cores, 2.8GHz per core, and 4GB of memory was used, and WILLY typically required about 500ms to correct one question. That is, the speaker reads each question into the microphone, GOOGLE™ voice recognition returns 3 questions, and WILLY uses these 3 questions as input, taking about half a second to output one final question.
  • the non-native speakers and the children selected relatively easy questions (without hard-to-pronounce names, for example) from T to do the tests.
  • the ratio of improvement is better for the non-native speakers, reducing the number of errors (the WW column) by 30% on average for experiments 1, 4, 5, 6, 9, 10, 11, 12.
  • WILLY also demonstrated a clear advantage, reducing the number of errors (the WW column) by 16% on average for experiments 2, 3, 7, 8. Such an advantage will be amplified in a noisy real-life environment.
  • the system of the present invention is operable to provide significant improvements over the performance of existing speech recognition software. This enables for example a QA system accessible using voice input, which would provide a powerful and convenient tool for people who may be driving, individuals with literacy challenges, people with impaired vision, children who wish their Talking Tom or R2-D2 to really talk, and mobile device users as a whole.
  • A computer implemented method in accordance with the present invention is shown as a workflow in FIG. 2.
  • Computer device (10) may be any manner of computer device, whether a laptop, desktop computer, tablet computer or otherwise, and is shown as a mobile device (10) in FIG. 1.
  • the computer device (10) includes or is linked to a speech capture utility (14).
  • the computer device (10) may be a mobile device as shown in FIG. 1.
  • the speech capture utility (14) is operable to for example record one or more phrases uttered by a user of mobile device (10).
  • the computer device (10) may also include or be linked to one or more speech recognition utilities (16).
  • the speech recognition utilities (16) generate one or more digital outputs conforming to interpretations of the intended sentences or sentence fragments. As explained above, the output from the speech recognition utilities (16) is often inaccurate. For this reason, one or more components are configured to apply the correction techniques described above.
  • the speech recognition utility may be implemented on the device (10), or the device (10) may be connected to a voice server (18), which in turn may include or be linked to one or more speech recognition utilities (16).
  • database (22) relates preferably to a particular domain.
  • FIG. 1 illustrates a representative implementation of a QA service, and therefore the input to the correction utility (21 ) consists of queries.
  • multiple databases (22) may be used, each database (22) relating for example to a particular domain.
  • the system of the present invention may include a classifier that is operable to analyze the output of the speech recognition utilities (16) and, based on such analysis, determine for example the nature of the output from the speech recognition utilities so as to assign the output to a particular database (22) that matches the domain of the output.
  • the classifier may be operable to determine that the output is a question and therefore route it to the database (22) that relates to queries, as opposed to, for example, a database comprising commands.
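A toy version of such a classifier is sketched below. The cue-word heuristic and the function name are illustrative assumptions; any trained classifier could fill this role, and the database values here are just labels.

```python
def route_output(text: str, databases: dict):
    """Send recognizer output that looks like a question to the query
    database, and anything else to the command database."""
    cues = ("who", "what", "when", "where", "why", "how", "which")
    words = text.strip().lower().split()
    is_question = text.rstrip().endswith("?") or (words and words[0] in cues)
    return databases["queries" if is_question else "commands"]
```

In a fuller system the dict would map domain names to database connections, one per domain as described above.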
  • the correction utility (21 ) of the present invention is operable to use the output queries, and is further operable, if required, to construct the intended query, as explained above.
  • the correction utility (21 ) may include or be linked to one or more utilities configured to support the operations described, including the construction of the output of the system, namely the corrected speech recognition output.
  • the correction utility may include a semantic engine that enables the correction utility (21 ) to use the entries in database (22) as templates as illustrated in the examples above.
  • the correction utility (21 ) may be implemented for example as a web service or a cloud service as further explained below.
  • the correction utility (21 ) may be made part of a web server, implemented for example as a server application.
  • One or more computer devices may call on the system of the present invention, via a communication network, to seek improvement of the accuracy of user generated queries, generated using one or more speech recognition routines.
  • the correction utility (21 ) may be implemented as part of a QA server, that is operable to correct speech recognition output so as to generate corrected text (28), and based on corrected text (28), provide answers that match the corrected text (28) corresponding to queries.
  • the correction utility (21 ) may also be integrated with existing speech recognition technologies.
  • the system and method of the present invention may be implemented with various systems and applications for the purpose of enhanced voice command or voice input functionality.
  • the present invention specifically contemplates the incorporation of the technology described in third party platforms, for example call centre application platforms, control systems, access control systems (such as for example an access control system used in a car system), a help utility for assisting with device or software functions such as for example a smart phone personal assistance system and so on.
  • the computer system of the present invention may be implemented as a knowledge based dialogue computer system, that permits a user to dialogue with one or more other users using the computer system.
  • the dialogue may include (A) at least one human user speaking to a network-connected device; (B) means being provided to the network-connected device to capture the user's voice; (C) the captured voice being transferred to a remote computer system; (D) the captured voice being analyzed to determine its probable meaning, using the natural language processing engine of the present invention; (E) one or more appropriate responses being determined and constructed based on the probable meaning, using a logic engine; and (F) the appropriate response being made available to the user using the network-connected device.
  • a natural user interface may be provided on a network-connected device such as a smart phone, where the natural user interface enables the processing of speech input, and optionally responses to the speech input.
  • the system of the present invention may be configured so as to provide a natural language processing based, multi-language, online QA platform or service.
  • the computer system enables natural language based interactions for both the English and Chinese languages.
  • a computer system may be provided that incorporates the functionality of the present invention, and also includes a series of other utilities that implement a multi-language QA service. Many other implementations are possible that leverage the technology described herein.
  • a user may provide speech audio input to a suitable network-connected device (52).
  • a third party speech recognition system may be used to recognize the audio input so as to produce text data.
  • the network-connected device (52) may be configured to connect to a computer platform (56) (referred to as the RSVP server) that includes the question correction functionality of the present invention.
  • means may be provided to determine the language in which text data is communicated to the computer system (58).
  • Question correction components (60) implement the question correction technology of the present invention.
  • Fig. 3 illustrates how the present invention may be utilized to improve performance of an overall multi-language dialogue computer system.
  • the natural language processing based translation features may use search results from a variety of third party online sources (62).
  • the RSVP platform may also include a translator (64) for translating queries.
  • a possible workflow may include: (A) determination of language of query, (B) question correction, and optionally translation, (C) question classification and then processing by a QA service (68).
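The workflow above can be sketched as a simple pipeline. This is an illustrative sketch only: the function bodies are hypothetical stand-ins (a naive language check, a trivial correction step, a fixed classifier), not the patented implementation.

```python
# Toy sketch of the (A) language determination, (B) question correction,
# (C) classification and QA workflow. All bodies are illustrative stand-ins.

def detect_language(text):
    # Naive stand-in: treat any CJK character as Chinese, else English.
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in text) else "en"

def correct_question(text):
    # Placeholder for the information distance based correction step.
    return text.strip().rstrip("?") + "?"

def classify_question(text):
    # Placeholder question classifier.
    return "factoid"

def answer(text, language):
    # A per-language QA service would consult online sources here.
    return f"[{language}] answer for: {text}"

def handle_query(raw_text):
    lang = detect_language(raw_text)
    corrected = correct_question(raw_text)
    _category = classify_question(corrected)
    return answer(corrected, lang)

print(handle_query("who invented the telephone"))
```

The point of the sketch is the ordering of the stages: correction happens before classification, so the downstream QA service only ever sees a corrected question.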
  • the QA service (68) is usually implemented on a language-by-language basis.
  • the QA service (68) may rely on the online sources (62).
  • the output may be an answer selection in the user's intended language.
  • the solution described may enable, for example, English language resources to be utilized to provide broader data sets for answer selection if a search of Chinese language resources does not provide an answer that is deemed to be correct.
  • the QA service (68) may include, for example, vertical domain answering, knowledge base answering, and community QA searching, in order to find a good answer to a question corrected using the technology of the present invention.
  • a knowledge based dialogue computer system in accordance with the present invention may be implemented using a plan state transition system or a partially observed Markov decision process (or "POMDP").
  • the method of the present invention may also be used to correct numerous grammatical errors and spelling mistakes in the various questions loaded to database (22) which have been obtained from the Internet and therefore may require correction.
  • Application of the correction method makes the QA process more accurate by providing a more correct database.
  • the method of the present invention may also be adapted to create an automatic writing assistant, which would be suitable for a well-defined domain.
  • a writing assistant can be used to convert the queries in a keyboard-input QA system to answerable queries.
  • FIG. 4 and the following discussion are intended to provide a brief, general description of a suitable computing environment in which the various aspects of the present invention can be implemented. While the innovation has been described above in the general context of computer-executable instructions that may run on one or more computers, those skilled in the art will recognize that the innovation also can be implemented in combination with other program modules and/or as a combination of hardware and software.
  • program modules include routines, programs, components, data structures, etc., that perform particular tasks or implement particular abstract data types.
  • inventive methods can be practiced with other computer system configurations, including single-processor or multiprocessor computer systems, minicomputers, mainframe computers, as well as personal computers, hand-held computing devices, microprocessor-based or programmable consumer electronics, and the like, each of which can be operatively coupled to one or more associated devices.
  • the illustrated aspects of the innovation may also be practiced in distributed computing environments where certain tasks are performed by remote processing devices that are linked through a communications network.
  • program modules can be located in both local and remote memory storage devices.
  • the database (22) may be located remotely from a computer device that includes other elements of the correction utility, such that the correction utility queries the database for the cluster of related queries as described above; however, the information distance operations described herein may be utilized to improve performance.
  • a computer typically includes a variety of computer-readable media.
  • Computer-readable media can be any available media that can be accessed by the computer and includes both volatile and non-volatile media, removable and non-removable media.
  • Computer- readable media can comprise computer storage media and communication media.
  • Computer storage media includes both volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data.
  • Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD- ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer.
  • Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and includes any information delivery media.
  • modulated data signal means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
  • communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer-readable media.
  • the system of the present invention represents a collection of hardware and software elements that enable a user to manage a variety of device and information objects associated with or generated by these devices, leveraging in-the-cloud resources in a new way.
  • What has been described above includes examples of the innovation. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the subject innovation, but one of ordinary skill in the art may recognize that many further combinations and permutations of the innovation are possible. Accordingly, the innovation is intended to embrace all such alterations, modifications and variations that fall within the spirit and scope of the appended claims.
  • Cloud computing includes Internet based computing where shared resources, software and data are provided on demand.
  • a “cloud” therefore can refer to a collection of resources (e.g., hardware, data and/or software) provided and maintained by an off-site party (e.g. third party), wherein the collection of resources can be accessed by an identified user over a network.
  • the resources can include data storage services, word processing services, and many other general purpose computation (e.g., execution of arbitrary code) and information technological services that are conventionally associated with personal computers or local servers.
  • a component can be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer.
  • an application running on a server and the server can be a component.
  • One or more components can reside within a process and/or thread of execution, and a component can be localized on one computer and/or distributed between two or more computers.
  • the concepts of "virtual” and “cloud computing” include the utilization of a set of shared computing resources (e.g. servers) which are typically consolidated in one or more data center locations.
  • cloud computing systems may be implemented as a web service that enables a user to launch and manage computing resources (e.g., virtual server instances) in third party data centers.
  • computer resources may be available in different sizes and configurations so that different resource types can be specified to meet specific needs of different users. For example, one user may desire to use a small instance as a web server and a larger instance as a database server, or an even larger instance for processor-intensive applications. Cloud computing offers this type of outsourced flexibility without having to manage the purchase and operation of additional hardware resources within an organization.
  • a cloud-based computing resource is thought to execute or reside somewhere on the "cloud", which may be an internal corporate network or the public Internet.
  • cloud computing enables the development and deployment of applications that exhibit scalability (e.g., increase or decrease resource utilization as needed), performance (e.g., execute efficiently and fast), and reliability (e.g., never, or at least rarely, fail), all without any regard for the nature or location of the underlying infrastructure.

Abstract

A computer implemented speech recognition system and method is provided that improves the accuracy of output from one or more speech recognition systems by applying one or more data correction routines. A data correction routine is provided that includes information distance analysis of one or more sets of speech recognition information to a set of text elements related to the domain and stored to a database. The system and method generate as output corrected text elements related to a meaning intended by a user from whom the speech recognition information was captured.

Description

SYSTEM, METHOD AND COMPUTER PROGRAM FOR CORRECTING SPEECH
RECOGNITION INFORMATION
PRIORITY CLAIM
This patent application claims priority to United States Provisional Patent Application 61/626,635 filed on September 30, 2011 and United States Provisional Patent Application 61/579,397 filed on December 22, 2011.
FIELD OF THE INVENTION
The present invention relates to voice speech recognition methods and systems. The present invention relates more particularly to methods and systems for correcting the output of voice recognition methods and systems.
BACKGROUND OF THE INVENTION
Various speech recognition technologies are known, as well as techniques and technologies for improving the accuracy of voice speech recognition systems.
There is significant demand for speech recognition as an interface to computer systems and computer programs. This includes use of speech recognition as a means of activating mobile device functions or mobile application functions, especially for example when a user is driving.
There are generally two categories of disadvantages related to speech recognition systems, computer programs, and methods. The first is the training that is required to gradually improve the accuracy of speech recognition output. This training is time consuming and may not be practical for use especially for mobile technologies for a number of reasons including variable noise conditions that make training based systems inaccurate in mobile settings. The second, and perhaps more important disadvantage, is that noisy environments, speaker diversity, and errors in speech are all widespread factors. Each of these factors has a significant negative impact on speech recognition accuracy, and in combination the negative impact can be quite severe.
Prior art solutions exist that attempt to address the above mentioned disadvantages. Generally speaking, these prior art solutions either (1) attempt to train for general speech recognition, which is difficult, or (2) train speech recognition for "fixed set commands" such as commands used for example in a car system where the number of commands is finite. Regarding (2), by its nature this type of system is not suitable for more open-ended applications where there is significant variability of possible speech elements requiring computer recognition.
These disadvantages are a practical obstacle to design and implementation of speech recognition technologies that are accurate enough for widespread user adoption, and also explain why voice activated Question Answering (QA) systems, for example to enable voice based Internet search using a mobile device, are not practical based on prior art solutions. The approach explained above in (2) is not suitable for the QA domain for example because this application requires the scope of the domain to be relatively unlimited.
There is a need for a computer system, computer program, and computer implemented method that addresses the above mentioned obstacles. There is a need for an improved speech recognition system and method that improves performance in instances where one or more of noisy environment, speaker diversity or errors in speech would otherwise result in less than desirable accuracy of the speech recognition output. There is a further need for a QA system that provides improved accuracy and therefore enables voice enabled QA services that address a significant segment of the population of interest, including individuals for whom English is not their first language.
SUMMARY OF THE INVENTION
The following presents a simplified summary of the innovation in order to provide a basic understanding of some aspects of the innovation. This summary is not an extensive overview of the innovation. It is not intended to identify key/critical elements of the innovation or to delineate the scope of the innovation. Its sole purpose is to present some concepts of the innovation in a simplified form as a prelude to the more detailed description that is presented later.
In one aspect of the invention, a computer implemented speech recognition method is provided comprising: (A) capturing one or more elements of speech using a speech capture means, the elements of speech relating to a domain; (B) using one or more speech recognition utilities so as to generate one or more sets of speech recognition information based on the one or more elements of speech; (C) using one or more computers to apply one or more correction routines to the one or more sets of speech recognition information, the one or more correction routines including information distance analysis or compression of the one or more sets of speech recognition information to a set of text elements related to a relevant domain and stored to a database; and (D) constructing one or more data outputs related to a meaning intended by a user by the one or more elements of speech. In another aspect, the method computes a user's intended question or q, and the text elements consist of a database of questions or Q.
In another aspect, a complementary information distance analysis or compression operation is used to disregard irrelevant questions. In a still other aspect, the method comprises the step of calculating q using a Dmin and Dmax operation, with the irrelevant questions from Q being removed.
In another aspect, q is calculated by concurrently: (A) minimizing the information distance between q and Q, with irrelevant information removed; and (B) minimizing the information distance between the speech recognition information consisting of output queries. In another aspect, the method yields as output a determination of q, which may be one or more output queries, or a combination of two or more of the output queries, or one of the questions from Q that may be related to two or more of the output queries.
In a still other aspect, the method comprises the further steps of: (A) splitting questions into words, and aligning groups of input questions that include words or name entities with similar pronunciations, so as to produce word alignment results; (B) enhancing the set of input questions by building additional questions based on the word alignment results; (C) determining a set of relevant questions from the database, based on semantic and/or syntactic similarity, optionally if the relevant questions yield a question that is the same as one of the input questions, this question is identified as the correct question; (D) optionally grouping the input questions using one or more hierarchical clustering operations into clusters, and extracting patterns from each cluster ("extracted patterns"); (E) generating candidate questions by mapping the relevant input question into the extracted patterns; (F) ranking candidate questions using the information distance analysis or compression operations; and (G) returning as a corrected question the candidate question with the minimum information distance score. 
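Steps (A) through (G) can be illustrated with a toy sketch. This is an assumption-laden illustration, not the patented implementation: the word alignment assumes the recognizer outputs have equal word counts, and zlib compressed length stands in for the information distance computation of step (F).

```python
# Toy sketch of candidate generation and information distance ranking.
import itertools
import zlib

def compressed_size(s):
    # Compressed length: a crude computable proxy for Kolmogorov complexity.
    return len(zlib.compress(s.encode("utf-8")))

def distance(c, q):
    # Stand-in for the information distance score of step (F):
    # approximate K(x|y) by C(xy) - C(y), where C is compressed size.
    joint = compressed_size(c + " " + q)
    return max(joint - compressed_size(q), joint - compressed_size(c))

def generate_candidates(recognizer_outputs):
    # Steps (A)-(B): align the recognizer outputs word by word (this toy
    # alignment assumes equal word counts) and build additional candidate
    # questions from the aligned word choices.
    columns = zip(*(q.split() for q in recognizer_outputs))
    choices = [sorted(set(col)) for col in columns]
    return [" ".join(words) for words in itertools.product(*choices)]

def correct(recognizer_outputs, question_db):
    # Steps (E)-(G): score each candidate against the question database and
    # return the candidate with the minimum information distance score.
    candidates = generate_candidates(recognizer_outputs)
    return min(candidates,
               key=lambda c: min(distance(c, q) for q in question_db))

outputs = ["how tall is the cn power", "how tall is the cn towel"]
database = ["how tall is the cn tower", "who invented the telephone"]
print(correct(outputs, database))
```

A real implementation would additionally use pronunciation similarity for the alignment and map candidates into patterns extracted from clustered database questions; the sketch only shows the generate-then-rank shape of the procedure.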
In another aspect of the invention, a computer implemented system for corrected speech recognition is provided comprising: one or more computers including or being linked to a server computer, the server computer implementing a server application, the server application defining a correction utility, where the correction utility includes or is linked to one or more databases each including text elements related to a domain; the correction utility being operable to receive from one or more speech recognition utilities, linked to the server computer or one or more remote servers connected to the server via the Internet, one or more sets of speech recognition information based on one or more elements of speech captured from a user and associated with an intended meaning, wherein the one or more sets of speech recognition information are associated with a domain; the correction utility applying one or more correction routines to the one or more sets of speech recognition information that include information distance analysis of the one or more sets of speech recognition information to the text elements related to the domain and stored to the database; and the correction utility constructing one or more data outputs related to a meaning intended by the user by the one or more elements of speech.
In another aspect, the computer system computes a user's intended question or q, and the text elements consist of a database of questions or Q.
In a still other aspect, the correction utility applies one or more complementary information distance analysis or compression operations to disregard irrelevant questions.
In yet another aspect, the correction utility is configured to calculate q using a Dmin and Dmax operation, with the irrelevant questions from Q being removed. In a still other aspect, the correction utility is configured to calculate q by concurrently: (A) minimizing the information distance between q and Q, with irrelevant information removed; and (B) minimizing the information distance between the speech recognition information consisting of output queries.
In yet another aspect, the correction utility generates as output a determination of q, which may be one of the output queries, or a combination of two or more of the output queries, or one of the questions from Q that may be related to two or more of the output queries.
In a still other aspect, the correction utility is further operable to: (A) split questions into words, and align groups of input questions that include words or name entities with similar pronunciations, so as to produce word alignment results; (B) enhance the set of input questions by building additional questions based on the word alignment results; (C) determine a set of relevant questions from the database, based on semantic and/or syntactic similarity, optionally if the relevant questions yield a question that is the same as one of the input questions, this question is identified as the correct question; (D) optionally group the input questions using one or more hierarchical clustering operations into clusters, and extract patterns from each cluster ("extracted patterns"); (E) generate candidate questions by mapping the relevant input question into the extracted patterns; (F) rank candidate questions using the information distance analysis or compression operations; and (G) return as a corrected question the candidate question with the minimum information distance score.
In another aspect of the invention the information distance analysis includes clustering the one or more elements of speech using a cluster of related records in the database.
In another aspect of the invention, the one or more data outputs are generated based on calculation of a Dmin operation applied to the one or more elements of speech and a Dmax operation applied to the text elements.
An Internet implemented system is also provided that provides speech recognition data services to a network of computer devices, where the speech recognition data services include data correction in accordance with the method of the invention.
In this respect, before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not limited in its application to the details of construction and to the arrangements of the components set forth in the following description or illustrated in the drawings. The invention is capable of other embodiments and of being practiced and carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting.
DESCRIPTION OF THE DRAWINGS
FIG. 1 depicts an exemplary system diagram illustrating the network architecture for implementing the present invention, in accordance with one embodiment of the present invention.
FIG. 2 is a workflow diagram illustrating a representative workflow in accordance with one aspect of the invention.
FIG. 3 is a representative architecture diagram illustrating a possible implementation of the technology of the present invention.
FIG. 4 illustrates a generic computer implementation of the computer program aspects of the present invention.
In the drawings, embodiments of the invention are illustrated by way of example. It is to be expressly understood that the description and drawings are only for the purpose of illustration and as an aid to understanding, and are not intended as a definition of the limits of the invention.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
The present invention provides a computer network implemented system, a computer network implemented method, and a computer network architecture that enables improved speech recognition output, and also activation of third party systems based on speech recognition output, using a unique and innovative mechanism for data correction that enables a voice input means to various computer devices that performs far better than what is possible with prior art solutions.
The invention also includes a computer program for implementing method functions described, which may be implemented for example as a server application, implemented to one or more servers. The computer network architecture described herein enables delivery of improved speech recognition features for example to a mobile device. A possible computer system implementation of the present invention is illustrated in Fig. 1 and Fig. 2.
The method of the invention is described below, and a representative system and computer network implementation is described further below. The invention may be explained by reference to a QA system that implements the method of the present invention. In one aspect of the invention, a computer implemented method is provided that applies information distance analysis, or compression methods, to voice recognition results, namely a set of query outputs (for example voice recognition results generated from one or more third party voice recognition platforms) relative to a set of meaningful queries, from which the system and method of the invention can reconstruct the intended query. A skilled reader will understand that in this invention, information distance is a form of compression that provides desirable outcomes.
Prior art solutions generally either tried to train for a general case (which is difficult) or train for a fixed finite domain such as a finite set of possible commands in a car system, or a finite set of names in a directory. In contrast, the present invention takes a very different approach, as described below.
The performance of the present invention may be illustrated by referring to 5 test cases, each using 200-300 questions from a test set not contained in the database (22) (shown in Fig. 1) linked to the computer system. The computer system of the invention reduced the number of errors by 20-40% for native speakers and over 50% for non-native speakers, in comparison to the performance of market-leading speech recognition software such as packages available from GOOGLE™, DRAGON™, or MICROSOFT™. The less than optimal performance of third party speech recognition packages may be explained by: (i) the impact of noisy environments, (ii) speech variations, for example, adults versus children, native speakers versus non-native speakers, females versus males, especially when individual voice input training is not possible or is impractical, and (iii) errors in speech. Regarding errors in speech, humans do not always speak in correct and complete sentences and without any breaks or corrections in the middle. Even a model speaker using a voice recognition system may cough from time to time, elide certain words, and so on. Moreover, prior art speech recognition systems are not able to resolve the difference, for example, between "sailfish" and "sale fish".
It is known in the art that information relationships, knowledge base relationships and conceptual relationships may enable improvements to speech recognition performance.
The present invention, however, is based on a unique and innovative speech recognition improvement methodology.
(A) Database (22) is configured to include information from a relevant domain, obtained for example from the Internet or databases, and that may optionally be enhanced. In Fig. 2, the database is shown to include Q data sets. The database (22) entries may relate to a particular domain or language component. For example, Figs. 1 and 2 illustrate the use of the present invention as a QA service, and therefore database (22) contains a multitude of possible queries. For example, a current implementation of the database (22) includes 35 million queries obtained from the Internet ("database queries"). (B) Database (22) is used to correct queries from one or more speech recognition systems (referred to as "output queries"), in accordance with the present example embodiment of the invention. The output queries from the one or more speech recognition systems may or may not be correct, however, as explained below depending on the factors described, there is a strong likelihood that the accuracy of the output queries may be low in specific instances, and therefore the intent of the output queries may only be partially recognized by a speech recognition software package. In accordance with one aspect of the present invention, a correction utility or component (21 ) is operable to use the database queries as templates or patterns to generate the original intended question from the voice recognition input or output queries. As shown in Fig. 1 , the correction utility (21 ) may incorporate a correction engine (24) and a database (22). The correction engine (24) embodies the operations described herein, and based on such operations the correction utility (21 ) utilizing one or more databases such as database (22) is operable to generate corrected text (28). Corrected text (28) may in turn be used to support a variety of applications including for example an enhanced QA service implemented using for example a QA server (not shown). 
A QA server may for example include or link to the correction utility (21 ) of the present invention.
Providing a method for enabling the use of the database queries in this way, and enabling generation of the original intended question, is not trivial. It is not known which, if any, of the output queries is the original intended question. It is not known if any of the database queries is the original intended question. If there is conflict between the output query indicated, for example by the speech recognition software, as being the original intended question, and the database query indicated to be the original intended question, how is this conflict resolved? Often none of the output queries or the database queries is exactly correct. In these circumstances, how can we nonetheless generate the original intended question?
Information Distance
This section explains the theory on which the method of the invention is based, and based on which the system of the present invention was designed.
Kolmogorov complexity was invented in the 1960s. The concept may be explained in relation to a universal Turing machine U. The Kolmogorov complexity of a binary string x conditional on another binary string y,
K_U(x|y),
is the length of the shortest (prefix-free) program for U that outputs x with input y. Since it can be shown that for a different universal Turing machine U' the metric differs by only a constant, we will just write K(x|y) instead of K_U(x|y).
We write K(x|ε), where ε is the empty string, as K(x). We call a string x random if K(x) ≥ |x|. A skilled reader will appreciate further details of Kolmogorov complexity and its application. K(x) defines the amount of information in x. What would be a good departure point for defining an "information distance" between two objects? There have been studies of the energy cost of conversion between two strings x and y. John von Neumann, for example, hypothesized that performing 1 bit of information processing costs kT of energy, where k is Boltzmann's constant and T is the room temperature. In the 1960's, observing that reversible computations can be done for free, Rolf Landauer revised von Neumann's proposal to hold only for irreversible computations.
Starting from this von Neumann-Landauer principle, the minimum number of bits needed to convert between x and y may be used to define their distance. Formally, with respect to a universal Turing machine U, the cost of conversion between x and y may be defined as:
E(x, y) = min{|p| : U(x, p) = y, U(y, p) = x}    (1)
It is clear from the above that E(x,y) ≤ K(x|y) + K(y|x). The following optimal result may be obtained, modulo an additive O(log(|x| + |y|)) term:
Theorem 1. E(x, y) = max{K(x|y), K(y|x)}.

This enables the calculation of the information distance between two sequences x and y as:

Dmax(x, y) = max{K(x|y), K(y|x)}.
This distance Dmax is shown to satisfy the basic distance requirements such as positivity, symmetry, and the triangle inequality. Furthermore, Dmax is "universal" in the sense that Dmax always minorizes any other reasonable computable distance metric. This information distance operation is known in the prior art, and this concept and its normalized versions have been applied in a number of different areas. It was first applied to whole genome phylogeny, and later in many other applications including chain letter evolution history, plagiarism detection, further phylogeny studies, music classification, a parameter-free data mining paradigm, protein sequence classification, protein structure comparison, heart rhythm data analysis, question answering systems, clustering, multiword expression linguistic analysis, software evolution and engineering, software metrics and obfuscation, web page authorship, topic and domain identification, phylogenetic reconstruction, SVM kernels for string classification, ortholog detection, analyzing worms and network traffic, picture similarity, internet knowledge discovery, multi-document summarization, network structure and dynamic behavior, and gene expression dynamics in macrophages, just to name a few.
Despite its good properties and many applications, the max distance Dmax(x,y) has several problems when we consider only partial matching, where the triangle inequality fails to hold and irrelevant information must be removed. Thus, the present invention includes a complementary information distance to resolve this problem. In Eq. (1), we asked for the smallest number of bits that must be used to reversibly convert between x and y. To remove the irrelevant information from x or y, we thus define, with respect to a universal Turing machine U, the cost of conversion between x and y to be:
E_min(x, y) = min{|p| : U(x, p, r) = y, U(y, p, q) = x, |p| + |q| + |r| ≤ E(x, y)}
To interpret, the above definition separates r from x and q from y. Modulo an O(log(|x| + |y|)) additive term, we have proved the following theorem:
Theorem 2. E_min(x, y) = min{K(x|y), K(y|x)}.
Thus we can now define Dmin(x,y) = E_min(x,y) as a complementary information distance that disregards irrelevant information. Dmin is obviously symmetric, but it does not satisfy the triangle inequality. Dmin was used in the QUANTA QA system, for example, to deal with concepts that are too popular. Its use as an operation for enabling information distance operations between a first set of entities and a second set of entities, where there may be irrelevant information in one or more of the sets, for the purpose of determining the most accurate entity, is a novel and innovative contribution.
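Kolmogorov complexity is not computable, but it can be approximated with a real-world compressor. The following sketch (an illustration only, not the patented implementation) approximates K(x|y) by C(yx) - C(y), where C is the zlib-compressed length, and builds Dmin and Dmax from it:

```python
import zlib

def C(s: str) -> int:
    """Approximate K(s) by the zlib-compressed length of s."""
    return len(zlib.compress(s.encode("utf-8"), 9))

def K(x: str, y: str) -> int:
    """Crude compression approximation of the conditional complexity K(x|y)."""
    return max(C(y + x) - C(y), 0)

def Dmax(x: str, y: str) -> int:
    """Dmax(x,y) = max{K(x|y), K(y|x)}: the universal information distance."""
    return max(K(x, y), K(y, x))

def Dmin(x: str, y: str) -> int:
    """Dmin(x,y) = min{K(x|y), K(y|x)}: insensitive to irrelevant information."""
    return min(K(x, y), K(y, x))

a = "who is the mayor of waterloo ontario"
b = "who is the mayor of toronto ontario"
c = "do flying squirrels fly or do they just glide"
# Related questions should sit closer together than unrelated ones.
print(Dmax(a, b), Dmax(a, c))
```

This compression stand-in preserves the qualitative behavior the theory predicts: the two mayor questions land much closer to each other than either does to the unrelated question.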
Min-Max Distance
Now we formulate our problem in the framework of information distance. One of the contributions of the present invention may be understood as an operation for determining q based on a combined Dmin and Dmax operation, or min/max information distance operation, as further explained below.
Given a Question Database Q, and k input questions from a voice recognition system, say I = {q1, q2, ..., qk}, for example from the Google voice server, which was used in testing of the invention. The goal is to compute the user's intended question q. It could be one of the qi's; it could also be a combination of all k of them; it could also be one of the questions in Q that is close to some parts of the qi's.
We wish to find the most plausible question q such that q fits one of the question patterns in Q, and q has a "close distance" to I. We will assume that Q contains almost all question patterns; later in this disclosure, this assumption is explained further. Thus we can formulate our problem as: given Q and I, find q such that it minimizes the sum of "distances" from Q to q, and from q to I, as shown in the following:
I → q ← Q
Here, Q is a huge database of 35M user-asked questions. We will assume q is "similar" to one of those. For example, a QA user might have asked "Who is the mayor of Waterloo, Ontario?" but in Q, there might be questions like "Who is the mayor of Toronto, Ontario?" or "Who is the mayor of Washington DC?". I may contain such output queries as "Hole is the mayor?" and "Who mayor off Waterloo" based on the operation of the voice recognition software. Since Q is very large, use of the Dmax measure may not be optimal, as most of the information in Q is irrelevant, and therefore the present invention uses Dmin(q,Q) for establishing the information distance between q and Q. However, for the distance between q and I, we can use Dmax(q,I) to measure the information distance. Thus given I and Q, we wish to find q that minimizes the following function:

Dmin(q, Q) + Dmax(q, I),

where Dmin measures the information distance between q and Q with irrelevant information removed, and Dmax is the information distance between I and q. We know
Dmin(x, y) = min{K(x|y), K(y|x)},
Dmax(x, y) = max{K(x|y), K(y|x)}.
Thus,

Dmin(q, Q) + Dmax(q, I) = K(q|Q) + max{K(I|q), K(q|I)},

because Q is very large and q is just one question. Notice that β is a coefficient that determines how much weight we wish to give to a correct template or pattern in Q. Thus the operation as applied to a set of queries may be expressed as:

βK(q|Q) + max{K(I|q), K(q|I)}.    (4)
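Formula (4) can be turned into a concrete ranking by substituting a compression-based approximation for K. This is a sketch under stated assumptions: zlib stands in for the uncomputable K, a hand-picked β value is used, and a single template stands in for the encoding of q against the full database Q:

```python
import zlib

def C(s: str) -> int:
    # Compressed length as a stand-in for Kolmogorov complexity.
    return len(zlib.compress(s.encode("utf-8"), 9))

def K(x: str, y: str) -> int:
    # Compression approximation of the conditional complexity K(x|y).
    return max(C(y + x) - C(y), 0)

def score(q: str, I: list[str], pattern: str, beta: float = 1.5) -> float:
    """Formula (4): beta*K(q|Q) + max{K(I|q), K(q|I)}, with a single
    pattern standing in for the encoding of q against the database Q."""
    i = " ".join(I)
    return beta * K(q, pattern) + max(K(i, q), K(q, i))

I = ["hole is the mayor of waterloo",
     "whole is the mayor of water",
     "who mayor off waterloo"]
pattern = "who is the mayor of CityName"
candidates = ["who is the mayor of waterloo",
              "hole is the mayor of water"]
best = min(candidates, key=lambda q: score(q, I, pattern))
print(best)
```

The candidate minimizing the score is returned as the corrected question; the β weight trades pattern fidelity against fidelity to the recognizer's outputs, as discussed in the observations below.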
Observations: We need to have β > 0 so that q = I does not trivially minimize formula (4). If β is too large, then q = ε might minimize formula (4). There is a tradeoff: sometimes a less popular pattern (taking more bits in the Dmin term) might fit I better (taking fewer bits in the Dmax term), and a more popular pattern (taking fewer bits in the Dmin term) might miss one or two key words in I, taking more bits to encode in the Dmax term. β is optimized for this tradeoff.

Database Encoding
In another aspect of the invention, the issues outlined in the paragraph above may be resolved using one or more of the following techniques.
• Encode q using Q in the first term. It is a problem to encode an item with respect to a big set.
• Encode q using I, or encode I using q, and take whichever is larger, in the second term.
• Find all the possible candidates q, and the q0 that minimizes Formula (4).
Q is very large and contains different "types" of questions. For each type of question, we could extract one or more question templates. In this way, Q could be considered as a set of templates, and each template, denoted as p, covers a subset of questions from Q. When encoding q, we do not have to encode q from Q directly. Instead we encode q with respect to the patterns or templates of Q. For example, if a pattern p appears N times in Q, then we can use log2(Total/N) bits to encode the index for this pattern. Given the pattern p, we encode q with p by encoding their word mismatches. There will be a tradeoff between the encoding of p and the encoding of q given p. A common pattern may be encoded with a few bits, but it may require more bits to encode a specific question using this pattern. For example, the template "who is the mayor of City Name" requires more bits to encode than the template "who is the mayor of Noun", because the former is a smaller class than the latter. However, the first template will require fewer bits to generate a question "who is the mayor of Waterloo", since it requires fewer bits to encode Waterloo from the class "City Name" than from the class "Noun".
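The two-part cost above can be made concrete: the pattern index costs log2(Total/N) bits and filling a slot costs log2 of the class size. All counts below are hypothetical, chosen only to illustrate the tradeoff:

```python
import math

TOTAL = 35_000_000  # size of the question database Q, per the disclosure

def pattern_index_bits(n: int) -> float:
    """Bits to name a pattern that covers n of the TOTAL questions."""
    return math.log2(TOTAL / n)

def slot_bits(class_size: int) -> float:
    """Bits to pick one member of a slot's word class (uniform assumption)."""
    return math.log2(class_size)

# Hypothetical counts: "who is the mayor of City Name" occurs less often
# than "who is the mayor of Noun", but its slot class is far smaller.
specific = pattern_index_bits(2_000) + slot_bits(10_000)    # City Name slot
general = pattern_index_bits(50_000) + slot_bits(500_000)   # Noun slot
print(round(specific, 1), round(general, 1))
```

With these illustrative counts the specific template wins overall: its pattern index is dearer, but encoding "Waterloo" from the small "City Name" class is far cheaper than from the huge "Noun" class.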
It should be understood that such patterns may be extracted by pre-processing or be extracted dynamically based on analysis of the output queries. In one aspect of the invention, patterns are only extracted from relevant questions based on I, denoted as Q'. Q' may for example be organized in a hierarchical way. Similar questions may be mapped to a cluster, and similar clusters may be mapped to a bigger cluster. One pattern may be extracted from each cluster using, for example, a multiple alignment algorithm. This pattern should be as specific as possible, while at the same time covering all the questions in the cluster. The higher the cluster is in the hierarchy, the more general the pattern will be. So our hierarchical clustering technique, in one aspect, may ensure that all the possible patterns are extracted from relevant questions. It should be understood that this aspect of the operation of the correction engine (24) may use one or more semantic and/or syntactic information techniques, including a POS tagger, Named Entity Recognition, WordNet and Wikipedia. For example, given a cluster of three questions:
• Who is the mayor of Toronto?
• Who is the president of the United States?
• Who is the senator of New York?
The correction engine is operable to extract a pattern such as: Who is the Leader of Location? "Mayor", "president" and "senator" are all mapped to the Leader class, while "Toronto", "United States" and "New York" all belong to the Location class.
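A minimal sketch of this generalization step, with a hand-built word-to-class map (the real system derives classes from resources such as WordNet and YAGO™; the map and normalization here are illustrative assumptions):

```python
CLASSES = {
    "mayor": "Leader", "president": "Leader", "senator": "Leader",
    "toronto": "Location", "united states": "Location", "new york": "Location",
}

def generalize(question: str) -> str:
    """Replace known words and phrases with their class names."""
    q = question.lower().rstrip("?")
    q = q.replace("of the ", "of ")  # drop the article so patterns unify
    # Replace longer phrases first so "new york" is not split apart.
    for phrase, cls in sorted(CLASSES.items(), key=lambda kv: -len(kv[0])):
        q = q.replace(phrase, cls)
    return q

cluster = ["Who is the mayor of Toronto?",
           "Who is the president of the United States?",
           "Who is the senator of New York?"]
patterns = {generalize(q) for q in cluster}
print(patterns)
```

All three questions in the cluster collapse to the single pattern "who is the Leader of Location", matching the example in the text.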
If pattern p is treated by the correction engine (24) as a sentence, the problem of item-to-set encoding reduces to item-to-item encoding, the same as the computation of K(q|p) and K(p|q). In fact, to convert a sentence from another sentence, we only need to encode the word mismatches and the missing words. An optimal alignment between two sentences may be generated, for example, using a standard dynamic programming algorithm. For example, the correction engine (24) encodes a missing word by the minus logarithm of its probability to appear at the said location, and encodes the mismatches by calculating their semantic and morphological similarities. Obviously, it requires fewer bits to encode between synonyms than antonyms.
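A sketch of this sentence-to-sentence cost, using Python's difflib as a stand-in for the dynamic programming aligner, a toy unigram table for missing-word costs, and a toy synonym set for mismatches (all tables are illustrative assumptions):

```python
import difflib
import math
from itertools import zip_longest

# Illustrative probabilities and synonyms; the real system uses corpus
# statistics and semantic resources such as WordNet.
WORD_P = {"the": 0.05, "of": 0.04, "is": 0.03, "mayor": 0.001}
SYNONYMS = {frozenset(("mayor", "leader"))}

def missing_cost(word: str) -> float:
    # A missing word costs minus the logarithm of its probability.
    return -math.log2(WORD_P.get(word, 1e-6))

def mismatch_cost(a: str, b: str) -> float:
    # Synonyms are cheap to encode; otherwise charge by dissimilarity.
    if frozenset((a, b)) in SYNONYMS:
        return 1.0
    sim = difflib.SequenceMatcher(None, a, b).ratio()
    return 1.0 + (1.0 - sim) * 16.0

def sentence_bits(src: list[str], dst: list[str]) -> float:
    """Approximate K(dst|src): bits to edit sentence src into dst."""
    bits = 0.0
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, src, dst).get_opcodes():
        if op == "replace":
            for a, b in zip_longest(src[i1:i2], dst[j1:j2]):
                if a is None or b is None:
                    bits += missing_cost(a or b)
                else:
                    bits += mismatch_cost(a, b)
        elif op in ("insert", "delete"):
            for w in src[i1:i2] + dst[j1:j2]:
                bits += missing_cost(w)
    return bits

s = "who is the mayor of waterloo".split()
print(sentence_bits(s, "who is the leader of waterloo".split()))
```

Swapping "mayor" for its synonym "leader" costs a single bit here, while deleting or mismatching rarer words would cost far more, mirroring the synonym-versus-antonym observation above.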
The last problem to consider in Formula (4) is the selection of candidate questions q. It may not be possible to search through the whole question space. We only consider the possible question candidates that are relevant to the input and that could be matched by at least one of our templates from database (22). Furthermore, a bigram language model may be applied by the correction engine (24) to filter out questions with low plausibility. The language model may be trained on our background question set. The β value is trained by operations of the correction engine (24). In the system of the present invention, the β value may be a function of the lengths of the output question. It is optimal for a voice recognition system to respond instantaneously. The speed requirement forced us to adopt some tradeoffs in the implementation, for example not considering all possible patterns. In other words, the correction engine (24) may apply one or more operations to minimize the number of possible patterns analyzed, using one or more pre-configured thresholds.

Further Details Regarding Database Q
We have performed an experiment to test the hypothesis that Q contains almost all common question types. The test set T contains 300 questions, selected (with the criteria: no more than 11 words or 65 letters, one question in a sentence, no non-English letters) from a Microsoft QA set at http://research.microsoft.com/en-us/downloads/88c0021c-328a-4148-a158-a42d7331c6cf. Although this test set is not included in the database Q, we found that all but three of the questions (99% of them) have corresponding patterns in Q. The only three questions that do not have similar patterns, in a strict sense, in Q are: Why is some sand white, some brown, and some black? Do flying squirrels fly or do they just glide? Was there ever a movement to abolish the electoral college?
Example of the method
The following are possible implementations of the computer system of the present invention, which is referred to as "WILLY" below. A skilled reader will understand that various other implementations and uses are possible. In one implementation, the computer implemented method involves the steps outlined below, with specific examples:
Step 1. Analyze Input
1. Split the questions into words, using for example the Stanford POS Tagger. At the same time, named entities may be extracted from the input questions using Stanford NER™, LingPipe NER™ and/or YAGO™, for example.
2. Find the best alignments between the input questions using dynamic programming.
Words or name entities with similar pronunciations are mapped together. For example, given three questions: "whole is the mayor of Waterloo", "hole is the mayor of Water", and "whole was the mayor of Water", the best word alignment may be: Whole is the mayor of Waterloo
Hole is the mayor of Water
Whole was the mayor of Water
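The alignment above can be sketched as follows. These three hypotheses happen to have equal length, so column-by-column alignment suffices (the general case needs the dynamic programming alignment described earlier); a crude drop-the-vowels key stands in for a real phonetic code such as Metaphone:

```python
def sound_key(word: str) -> str:
    """Very crude pronunciation key: keep the first letter, drop later vowels.
    (A real system would use a phonetic code such as Metaphone.)"""
    w = word.lower()
    return w[0] + "".join(ch for ch in w[1:] if ch not in "aeiou")

hyps = ["whole is the mayor of waterloo",
        "hole is the mayor of water",
        "whole was the mayor of water"]
rows = [h.split() for h in hyps]
columns = list(zip(*rows))  # one aligned word block per column

# "water" and "waterloo" produce nested sound keys, a hint they sound alike.
print([sound_key(w) for w in columns[-1]])

# Choosing one word per block (as in Step 2 below), here by majority vote.
merged = " ".join(max(set(col), key=col.count) for col in columns)
print(merged)
```

Majority voting alone yields "whole is the mayor of water", which is why the later steps still need homophone expansion and pattern matching to recover "Who is the mayor of Waterloo".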
Step 2. Improve input questions

1. Build a question based on the word alignment results from the previous step. For each aligned word block, we choose one word to appear in the result.
2. We assume that a well-formed question should contain a "wh"-word, including "what, who, which, whose, whom, when, where, why, how", or an auxiliary verb, including "be, have, do, will (would), shall (should), may (might), must, need, dare, ought". If the inputs do not contain any such word, we add proper words into the question candidates.
3. Since some correct words may not appear in the input, we further expand the question candidates with homonym dictionaries and metaphone dictionaries.
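This expansion step can be sketched with a small hypothetical homophone dictionary (the real system uses homonym and metaphone dictionaries built offline; the entries below are illustrative):

```python
# Hypothetical homophone dictionary; illustrative entries only.
HOMOPHONES = {
    "whole": ["hole", "who"],
    "water": ["waterloo"],
}

def expand(question: str, limit: int = 50) -> list[str]:
    """Generate candidates by swapping one word at a time for a homophone."""
    words = question.split()
    out = [question]
    for i, w in enumerate(words):
        for alt in HOMOPHONES.get(w, []):
            out.append(" ".join(words[:i] + [alt] + words[i + 1:]))
            if len(out) >= limit:
                return out
    return out

for cand in expand("whole is the mayor of water"):
    print(cand)
```

Even though neither "who" nor "waterloo" appeared in the recognizer output, both now appear among the candidates, where later ranking can select them.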
Step 3. Analyze relevant database patterns
1. Find relevant database questions; they are sorted based on their semantic and syntactic similarity to the improved input questions from Step 2.
2. If there is a database question that is almost the same as one of the input questions, the method may return that question directly. No further steps are needed in this case.
3. In the database questions, there may be many forms and patterns. Similar questions may be grouped together using, for example, a hierarchical clustering algorithm. The distance between two clusters may be calculated based on the syntactic similarity between questions. The algorithm may stop when the minimum distance between clusters reaches a predefined threshold. A skilled reader will understand that this aspect may be implemented in a number of different ways.
4. Once the relevant questions are grouped into clusters, patterns may be extracted from each cluster. Following the algorithm described in Step 1.2, questions may be aligned in each group. Semantic similarities may be used to encode the word distance. Then a group of questions may be converted into a single list with multiple word blocks, each block containing several alternative words from different questions. For example, given questions "Who is the mayor of New York", "Who is the president of United States" and "Which person is the leader of Toronto", a list of word blocks may be obtained after alignment: {who, who, which person}, {is}, {the}, { mayor, leader, president } of { New York, United States, Toronto}.
5. For each aligned word block, tags may be extracted that would best describe the slot. Here YAGO™ may be used to describe the meaning of each word or phrase. The most common facts may be selected as the description of each word block. Then several semantic patterns may be determined using, for example, words and facts from YAGO™.
Step 4. Generate the candidate questions
1. The original input questions that were extracted may be mapped into patterns.
Words in the patterns may be replaced with words from the input. Many candidate questions can be generated by considering the different combinations of word replacements.
2. To reduce the computing complexity, a bigram language model may be trained from our question set and candidate questions with low probability removed.
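The language-model pruning above can be sketched as follows, with a tiny stand-in corpus for the background question set and an arbitrary pruning threshold (both are illustrative assumptions):

```python
import math
from collections import Counter

# Toy background question set; the real model is trained on database Q.
corpus = ["who is the mayor of toronto",
          "who is the mayor of waterloo",
          "how does the atom bomb work"]
unigrams, bigrams = Counter(), Counter()
for q in corpus:
    toks = ["<s>"] + q.split()
    unigrams.update(toks[:-1])          # history counts
    bigrams.update(zip(toks, toks[1:]))

def log_prob(question: str, floor: float = 1e-4) -> float:
    """Bigram log-probability, with a crude floor for unseen pairs."""
    toks = ["<s>"] + question.split()
    return sum(math.log(max(bigrams[(a, b)] / max(unigrams[a], 1), floor))
               for a, b in zip(toks, toks[1:]))

candidates = ["who is the mayor of waterloo",   # fluent word order
              "who mayor the of waterloo"]      # scrambled word order
kept = [q for q in candidates if log_prob(q) > -10.0]
print(kept)
```

The scrambled candidate hits several unseen bigrams and falls far below the threshold, so only the fluent candidate survives for the information-distance ranking in Step 5.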
Step 5. Rank candidate questions using information distance
1. The distances K(q|I) and K(I|q) are then calculated between the candidate questions and the input questions. We align the candidate and input questions together and encode the word mismatches and missing words. A missing word is encoded by the minus logarithm of its probability to appear at the said location, and word mismatches are calculated by their semantic, morphology and metaphone similarities.
2. Calculate the distance between the candidate questions and the patterns, K(q|p). A method similar to that of the previous step is used to calculate the distances between questions and patterns.
3. Rank all the candidates using the ranking equation described herein.
Step 6. Return the candidate with the minimum information distance score as the final result.
In order to improve speed, steps 3.3, 3.4, and 3.5 may be performed offline on the complete database Q.

Possible Experiments
The advantages of the invention may be exemplified by reference to experiments. Two sets of experiments are described below.
GOOGLE™ speech recognition at google.com was used because the Google server has a high success rate and responds quickly.
Experiment Set 1
In one example of a test to illustrate the operation of the invention, we performed 5 sets of experiments by different people, all using the same Microsoft QA subset T, or a subset of it for the non-native speakers.
The Google voice recognition returned 3 options for each question. The system of the present invention allows the speaker to repeat the question, and the system computes the optimal solution using the methods above; however, this experiment did not use that feature. Experiments of this nature are tricky: results depend on people, environment, and questions. We have tried to test the system using different types of people. In the following, "CC" means the voice recognition software (of Google, the same below) returned the correct answer as the first option, and the present system agreed with it; "WC" means the voice recognition software returned the wrong answer as the first option, and the system of the present invention returned the correct answer; "CW" means the voice recognition software returned the correct answer as the first option, and the system of the present invention returned the wrong answer; "WW" means the voice recognition software returned the wrong answer as the first option, and the system of the present invention also returned the wrong answer. All the experiments were performed in quiet environments.
Experiment 1, non-native speaker, male, total 164 questions: 105 CC, 39 WC, 5 CW, 15 WW. In this experiment, the speaker chose only easy questions out of the 300-question Microsoft test set, and tried until success (either voice recognition is correct or WILLY makes it correct). The Appendix contains some examples where WILLY corrected Google's errors for this experiment.
Experiment 2, native speaker, male URA student, total 300 questions: 219 CC, 25 WC, 6 CW, 50 WW. Experiment 3, native speaker, male student, 300 questions: 222 CC, 15 WC, 5 CW, 58 WW. In this case, the speaker tried again if the first time was not successful (i.e. either voice recognition is correct or WILLY is correct).
Experiment 4, non-native speaker, male, 257 questions: 141 CC, 41 WC, 7 CW, 68 WW, improving 34 questions net. In this and the next experiment, the speaker was allowed to try again until success (either voice recognition is correct or WILLY is correct) or failure after several tries.
Experiment 5. Non-native speaker, female, 181 questions. 100 CC, 26 WC, 4 CW, 51 WW, improving 22 questions net.
The ratio of improvement is better for the non-native speakers. For the native speakers, the system of the present invention has also demonstrated a clear advantage. Such an advantage will be amplified in a noisy daily environment. Allowing the speaker to repeat the question will increase the success rate, as the following example (with Google) shows: the present system generated "How many toes does Mary Monroe have?" at the first query and generated "How many titles does Marilyn Monroe have?" at the second query. Putting the two questions together, the present system generates the correct and intended question "How many toes does Marilyn Monroe have?". This feature is implemented in the present system but was not included in this experiment.
Further details of the tests are provided below:
Original question:
How many types of birds are found in North America
Google voice recognition output:
How many pints of birds are formed in North America
How many pints of birds are found in North America
How many pints of birds performed in north America
Output from the present system:
How many types of birds are found in North America
Original question:
how does a bb gun make the bb shot
Google voice recognition output:
how does a bb gun make the pd shot
how does a bb gun make the bd shot
how does a bb gun make the bb shot
Output from the present system:
how does a bb gun make the bb shot
Original question:
How does the atom bomb work
Google voice recognition output:
call does the atom bomb work
all does the atom bomb work
aha does the atom bomb work
Output from the present system:
How does the atom bomb work?
Original question:
Which NFL team won the most superbowls?
Google voice recognition output:
watch nfl team won the most superbowls
watch nfl team won the most superbowls
water nfl team won the most superbowls
Output from the present system:
Which NFL team won the most superbowls?
Original question:
How holidays are celebrated around the world?
Google voice recognition output:
call holidays are celebrated around in the wall
call holidays are celebrated around in the world
how all holidays are celebrated around in the wall
Output from the present system:
how holidays are celebrated around in the world
Experiment Set 2
What follows is a description of an experiment based on a prototype implementation of the technology described, in order to illustrate the functionality of the present invention that permits the correction of speech recognition errors in the QA domain. In particular, the experiment involves non-native speakers. This is an area where the present invention provides significant advantages, given that there are 3 non-native English speakers for each native English speaker in the world. The test described below illustrates the practical effects of applying the technology of the present invention, further testing and justifying the proposed methodology by extending it to the QA domain. In one aspect, the experiment was conducted using GOOGLE™ speech recognition, with a computer terminal and a microphone. The experimenter reads a question, and GOOGLE™ speech recognition returns three options. WILLY uses these three questions as input and computes the most likely question. The set of questions is T, described in the previous section; T contains 300 questions. T was chosen independently and is not contained in Q. Not all questions in T were used by each speaker in the experiments, mostly because non-native speakers and children skipped sentences that contained hard-to-pronounce words. Less proficient English speakers tend to skip more questions. In the experiment, a server with 4 cores, 2.8GHz per core, and 4G memory was used, and WILLY typically required about 500ms to correct one question. That is, the speaker reads each question to the microphone, the GOOGLE™ voice recognition returns 3 questions, and WILLY uses these 3 questions as input and takes about half a second to output one final question.
Attempts were made to remove individual speaker variance by using different people independently performing the experiments. 14 human volunteers were used, including native and non-native English speakers, adults and children, females and males, as shown in the following table.
Table 1. Individuals used in our experiments

          Native Speaker      Non-Native Speaker
          Adult    Child      Adult    Child
Female    0        1          4        1
Male      3        2          3        0

Twelve sets of experiments involving 14 different speakers were performed for the same test set T, or a subset of it. Because of the short attention span of children, the three native English speaking children (2 males and 1 female) completed one set of experiments (Experiment 7), each responsible for 100 questions. A non-native speaking female child, age 12, performed the test (Experiment 10) independently, and she was only able to finish 57 questions. In the following, "CC" signifies that the speech recognition software has returned the correct answer as the first option, and WILLY agrees with it; "WC" signifies that the speech recognition software has returned the wrong answer as the first option, and WILLY has returned the correct answer; "CW" signifies that the speech recognition software has returned the correct answer as the first option, and WILLY has returned the wrong answer; "WW" signifies that the speech recognition software returned the wrong answer as the first option, and WILLY has also returned the wrong answer. All the experiments were performed in quiet environments. In each experiment, the speaker tried again if neither the speech recognition nor WILLY was correct. The results of the experiments are given in Table 2, and some details of each experiment are given below.
Experiment 1. Non-native speaker, male. In this experiment, the speaker chose only easy questions out of a 300-question MICROSOFT™ test set.
Experiment 2. Native speaker, male.
Experiment 3. Native speaker, male.
Experiment 4. Non-native speaker, male.
Experiment 5. Non-native speaker, female.
Experiment 6. Non-native speaker, female.
Experiment 7. Three native English speaking children, one third of the questions each: 8 years old, female; 9 years old, male; and 11 years old, male. In principle, we prefer independent tests with one individual responsible for the complete set of 300 questions. However, we were only able to get each of these children to do 100 questions, skipping the difficult ones. The result is similar to that of adult native English speakers.
Experiment 8. Native English speaker, male.
Experiment 9. Non-native English speaker, male.
Experiment 10. Non-native English speaker, female child, 11 years old. She came to Canada to attend summer camps for the purpose of learning English. Her English is elementary, and consequently she was only able to read 57 questions.
Experiment 11. Non-native English speaker, female.
Experiment 12. Non-native English speaker, female.
Table 2. Experimental results for speech correction

Experiment   Total No. of questions   CC   WC   CW   WW
1. 164 105 39 5 15
2. 300 219 25 6 50
3. 300 222 15 5 58
4. 257 141 41 7 68
5. 181 100 26 4 51
6. 214 125 29 10 50
7. 206 145 19 8 34
8. 298 180 12 4 102
9. 131 77 14 0 40
10. 57 28 4 1 24
11. 63 35 9 1 18
12. 107 62 9 2 34
In the above, the non-native speakers and the children selected relatively easy questions (without hard-to-pronounce names, for example) from T to do the tests. The ratio of improvement is better for the non-native speakers, reducing the number of errors (the WW column) by 30% on average for experiments 1, 4, 5, 6, 9, 10, 11, 12. For the native speakers, WILLY also demonstrated a clear advantage, reducing the number of errors (the WW column) by 16% on average for experiments 2, 3, 7, 8. Such an advantage will be amplified in a noisy real life environment. Allowing the speaker to repeat the question will increase the success rate, as the following example (with GOOGLE™) shows: WILLY generated "How many toes does Mary Monroe have?" at the first query and generated "How many titles does Marilyn Monroe have?" at the second query. Putting the two questions together, WILLY generates the correct and intended question "How many toes does Marilyn Monroe have?".
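As a rough check, the quoted error-reduction figures can be recomputed from Table 2, counting a recognizer-only error when the first option is wrong (WC + WW) and a system error when WILLY's output is wrong (CW + WW); one such computation (an illustrative reading, not necessarily the exact formula used in the disclosure) lands close to the stated 30% and 16% averages:

```python
# Rows of Table 2: experiment -> (total, CC, WC, CW, WW).
table2 = {
    1: (164, 105, 39, 5, 15),   2: (300, 219, 25, 6, 50),
    3: (300, 222, 15, 5, 58),   4: (257, 141, 41, 7, 68),
    5: (181, 100, 26, 4, 51),   6: (214, 125, 29, 10, 50),
    7: (206, 145, 19, 8, 34),   8: (298, 180, 12, 4, 102),
    9: (131, 77, 14, 0, 40),    10: (57, 28, 4, 1, 24),
    11: (63, 35, 9, 1, 18),     12: (107, 62, 9, 2, 34),
}

def reduction(exp: int) -> float:
    """Fraction of first-option errors that WILLY turned into successes."""
    total, cc, wc, cw, ww = table2[exp]
    return ((wc + ww) - (cw + ww)) / (wc + ww)

def average(exps: list[int]) -> float:
    return sum(reduction(e) for e in exps) / len(exps)

non_native = [1, 4, 5, 6, 9, 10, 11, 12]
native = [2, 3, 7, 8]
print(f"non-native: {average(non_native):.0%}, native: {average(native):.0%}")
```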
Advantages

As explained above, the system of the present invention is operable to provide significant improvements over the performance of existing speech recognition software. This enables, for example, a QA system accessible using voice input, which would provide a powerful and convenient tool for people who may be driving, individuals with literacy challenges, people with impaired vision, children who wish their Talking Tom or R2-D2 to really talk, and mobile device users as a whole.
Implementation of the Invention
The system of the present invention is best understood by reference to FIG. 1. A computer implemented method in accordance with the present invention is shown as a workflow in FIG. 2.
Computer device (10) may be any manner of computer device, whether a laptop, desktop computer, tablet computer or otherwise, and is shown as a mobile device (10) in FIG. 1. The computer device (10) includes or is linked to a speech capture utility (14). The computer device (10) may be a mobile device as shown in FIG. 1. The speech capture utility (14) is operable, for example, to record one or more phrases uttered by a user of mobile device (10). The computer device (10) may also include or be linked to one or more speech recognition utilities (16). The speech recognition utilities (16) generate one or more digital outputs conforming to interpretations of the intended sentences or sentence fragments. As explained above, the output from the speech recognition utilities (16) is often inaccurate. For this reason, one or more components are configured to apply the correction techniques described above.
The speech recognition utility may be implemented on the device (10), or the device (10) may be connected to a voice server (18), which in turn may include or be linked to one or more speech recognition utilities (16). It should be understood that database (22) relates preferably to a particular domain. The example illustrated in FIG. 1 illustrates a representative implementation of a QA service, and therefore the input to the correction utility (21) consists of queries. Also, multiple databases (22) may be used, each database (22) relating, for example, to a particular domain.
The system of the present invention may include a classifier that is operable to analyze the output of the speech recognition utilities (16) and, based on such analysis, determine the nature of the output from the speech recognition utilities so as to assign the output to a particular database (22) that matches the domain of the output. For example, the classifier may be operable to determine that the output is a question and therefore apply it to the database (22) that relates to queries, as opposed to, for example, a database comprising commands. The correction utility (21) of the present invention is operable to use the output queries, and is further operable, if required, to construct the intended query, as explained above. The correction utility (21) may include or be linked to one or more utilities configured to support the operations described, including the construction of the output of the system, namely the corrected speech recognition output. For example, the correction utility may include a semantic engine that enables the correction utility (21) to use the entries in database (22) as templates, as illustrated in the examples above.
The correction utility (21) may be implemented, for example, as a web service or a cloud service, as further explained below. The correction utility (21) may be made part of a web server, implemented for example as a server application. One or more computer devices may call on the system of the present invention, via a communication network, to seek improvement of the accuracy of user generated queries, generated using one or more speech recognition routines. It should also be understood that the correction utility (21) may be implemented as part of a QA server that is operable to correct speech recognition output so as to generate corrected text (28), and based on corrected text (28), provide answers that match the corrected text (28) corresponding to queries. The correction utility (21) may also be integrated with existing speech recognition technologies.
The system and method of the present invention may be implemented with various systems and applications for the purpose of enhanced voice command or voice input functionality. For example, it should be understood that the present invention specifically contemplates the incorporation of the technology described into third party platforms, for example call centre application platforms, control systems, access control systems (such as, for example, an access control system used in a car system), a help utility for assisting with device or software functions such as, for example, a smart phone personal assistance system, and so on.
In one implementation of the invention, the computer system of the present invention may be implemented as a knowledge based dialogue computer system that permits a user to dialogue with one or more other users using the computer system. The dialogue may proceed as follows: (A) at least one human user speaks to a network-connected device; (B) means provided to the network-connected device capture the user's voice; (C) the captured voice is transferred to a remote computer system; (D) the captured voice is analyzed to determine its probable meaning, using the natural language processing engine of the present invention; (E) one or more appropriate responses are determined based on the probable meaning, using a logic engine and by constructing the response, optionally based on the determined probable meaning; and (F) the appropriate response is made available to the user using the network-connected device.
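The dialogue flow (A) through (F) above can be sketched as a server-side turn handler. Every function below is a hypothetical stand-in for the components named in the text (recognizer, natural language processing engine, logic engine); the stub outputs are illustrative only.

```python
# Hedged sketch of one dialogue turn; all bodies are illustrative stubs.

def recognize(audio: bytes) -> str:
    # (C) the captured voice reaches the remote system and is recognized
    return "what is the weather today"          # stub recognizer output


def probable_meaning(text: str) -> dict:
    # (D) the natural language processing engine determines probable meaning
    return {"intent": "weather_query"}          # stub interpretation


def construct_response(meaning: dict) -> str:
    # (E) a logic engine constructs an appropriate response
    if meaning["intent"] == "weather_query":
        return "Here is today's forecast."
    return "Sorry, I did not understand."


def dialogue_turn(audio: bytes) -> str:
    # (F) the response is made available to the user's device
    return construct_response(probable_meaning(recognize(audio)))
```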
In another aspect of the invention, a natural user interface may be provided and implemented on a network-connected device such as a smart phone, where the natural user interface enables the processing of speech input, and optionally responses to the speech input. As shown in FIG. 3, the system of the present invention may be configured so as to provide a natural language processing based, multi-language, online QA platform or service. In one particular implementation of the present invention, the computer system enables natural language based interactions in both the English and Chinese languages. As shown in FIG. 3, a computer system may be provided that incorporates the functionality of the present invention and also includes a series of other utilities that implement a multi-language QA service. Many other implementations are possible that leverage the technology described herein.
In one possible implementation, a user (50) may provide speech audio input to a suitable network-connected device (52). A third party speech recognition system (54) may be used to recognize the audio input so as to produce text data. The network-connected device (52) may be configured to connect to a computer platform (56) (referred to as the RSVP server) that includes the question correction functionality of the present invention. In the implementation shown in FIG. 3, means may be provided to determine the language in which text data is communicated to the computer system (58). Question correction components (60) implement the question correction technology of the present invention. FIG. 3 illustrates how the present invention may be utilized to improve the performance of an overall multi-language dialogue computer system. The natural language processing based translation features may use search results from a variety of third party online sources (62).
The RSVP platform may also include a translator (64) for translating queries.
A possible workflow may include: (A) determination of the language of the query; (B) question correction, and optionally translation; and (C) question classification and then processing by a QA service (68). The QA service (68) is usually implemented on a language by language basis. The QA service (68) may rely on the online sources (62).
The output may be an answer selection in the user's intended language. Significantly, the solution described may enable, for example, English language resources to be utilized in order to provide broader data sets for answer selection if a search of Chinese language resources does not provide an answer that is deemed to be correct. The QA service (68) may include, for example, vertical domain answering, knowledge base answering, and community QA searching, in order to find a good answer to a question corrected using the technology of the present invention.
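The workflow (A) through (C) above can be sketched as a small pipeline. The language detection heuristic, the correction placeholder, and the QA service stub are all assumptions introduced for illustration; they are not the operations claimed by the invention.

```python
# Hedged sketch of the RSVP workflow: detect language, correct the
# question, then dispatch to a per-language QA service. All bodies
# are illustrative stubs.

def detect_language(text: str) -> str:
    # crude heuristic: any CJK codepoint -> Chinese, otherwise English
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in text) else "en"


def correct_question(text: str) -> str:
    # placeholder for the question correction utility described herein
    return text.strip().rstrip("?") + "?"


def qa_service(question: str, lang: str) -> str:
    # placeholder for a language-specific QA service backed by online sources
    return f"[{lang} answer to: {question}]"


def handle_query(text: str) -> str:
    lang = detect_language(text)
    question = correct_question(text)
    return qa_service(question, lang)
```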
In one aspect of the invention, a knowledge based dialogue computer system in accordance with the present invention may be implemented using a plan state transition system or a partially observable Markov decision process (or "POMDP").
Related Applications
It should be understood that the method of the present invention may also be used to correct numerous grammatical errors and spelling mistakes in the various questions loaded into database (22), which have been obtained from the Internet and therefore may require correction. Application of the correction method makes the QA process more accurate by providing a more correct database.
The method of the present invention may also be adapted to create an automatic writing assistant, which would be suitable for a well-defined domain. For example, a writing assistant can be used to convert the queries in a keyboard-input QA system to answerable queries.
Computing Environment
The description above discloses, at a high level, the various functions of the proposed solution.
In order to provide additional context for various aspects of the subject innovation, FIG. 4 and the following discussion are intended to provide a brief, general description of a suitable computing environment in which the various aspects of the present invention can be implemented. While the innovation has been described above in the general context of computer-executable instructions that may run on one or more computers, those skilled in the art will recognize that the innovation also can be implemented in combination with other program modules and/or as a combination of hardware and software.
Generally, program modules include routines, programs, components, data structures, etc., that perform particular tasks or implement particular abstract data types. Moreover, those skilled in the art will appreciate that the inventive methods can be practiced with other computer system configurations, including single-processor or multiprocessor computer systems, minicomputers, mainframe computers, as well as personal computers, hand-held computing devices, microprocessor-based or programmable consumer electronics, and the like, each of which can be operatively coupled to one or more associated devices.
The illustrated aspects of the innovation may also be practiced in distributed computing environments where certain tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules can be located in both local and remote memory storage devices. For example, the database (22) may be located remotely from a computer device that includes the other elements of the correction utility, such that the correction utility queries the database for the cluster of related queries as described above; in such a configuration, the information distance operations described herein may be utilized to improve performance.
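The information distance operations referenced in this document are defined in terms of Kolmogorov complexity, which is not computable in general; a common practical proxy substitutes the lengths produced by a real compressor, as in the normalized compression distance. The sketch below, using zlib, is one such approximation offered for illustration only; it is not necessarily the exact Dmin/Dmax operation claimed.

```python
import zlib

def ncd(x: str, y: str) -> float:
    """Normalized compression distance: a compressor-based proxy for
    the information distance between two strings. Smaller means more
    similar; identical strings score near zero."""
    cx = len(zlib.compress(x.encode("utf-8")))
    cy = len(zlib.compress(y.encode("utf-8")))
    cxy = len(zlib.compress((x + y).encode("utf-8")))
    return (cxy - min(cx, cy)) / max(cx, cy)
```

A candidate question could then be ranked against the recognizer outputs by summing such distances and keeping the candidate with the minimum score.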
A computer (such as the computer(s) illustrated in the architecture described above) typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer and includes both volatile and non-volatile media, removable and non-removable media. By way of example, and not limitation, computer-readable media can comprise computer storage media and communication media. Computer storage media includes both volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer. Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and includes any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer-readable media.
The system of the present invention represents a collection of hardware and software elements that enable a user to manage a variety of device and information objects associated with or generated by these devices, leveraging in-the-cloud resources in a new way. What has been described above includes examples of the innovation. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the subject innovation, but one of ordinary skill in the art may recognize that many further combinations and permutations of the innovation are possible. Accordingly, the innovation is intended to embrace all such alterations, modifications and variations that fall within the spirit and scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim.

Cloud Computing Generally
As mentioned above, the correction utility of the present invention may be implemented as part of a cloud service. "Cloud computing" includes Internet based computing where shared resources, software and data are provided on demand. A "cloud" therefore can refer to a collection of resources (e.g., hardware, data and/or software) provided and maintained by an off-site party (e.g. third party), wherein the collection of resources can be accessed by an identified user over a network. The resources can include data storage services, word processing services, and many other general purpose computation (e.g., execution of arbitrary code) and information technological services that are conventionally associated with personal computers or local servers. As used in this application, the terms "component" and "system" are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution. For example, a component can be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a server and the server can be a component. One or more components can reside within a process and/or thread of execution, and a component can be localized on one computer and/or distributed between two or more computers.
In general, the concepts of "virtual" and "cloud computing" include the utilization of a set of shared computing resources (e.g. servers) which are typically consolidated in one or more data center locations. For example, cloud computing systems may be implemented as a web service that enables a user to launch and manage computing resources (e.g., virtual server instances) in third party data centers. In a cloud environment, computer resources may be available in different sizes and configurations so that different resource types can be specified to meet the specific needs of different users. For example, one user may desire to use a small instance as a web server and another user a larger instance as a database server, or an even larger instance for processor intensive applications. Cloud computing offers this type of outsourced flexibility without having to manage the purchase and operation of additional hardware resources within an organization.
A cloud-based computing resource is thought to execute or reside somewhere on the "cloud", which may be an internal corporate network or the public Internet. From the perspective of an application developer or information technology administrator, cloud computing enables the development and deployment of applications that exhibit scalability (e.g., increase or decrease resource utilization as needed), performance (e.g., execute efficiently and fast), and reliability (e.g., never, or at least rarely, fail), all without any regard for the nature or location of the underlying infrastructure.
A number of factors have given rise to an increase in the utilization of cloud computing resources. For example, advances in networking technologies have significantly improved resource connectivity while decreasing connectivity costs. Advances in virtualization technologies have increased the efficiency of computing hardware by improving scalability and making it possible to more closely match computing hardware resources to the requirements of a particular computing task. Additionally, virtualization technologies commonly deployed in cloud computing environments have improved application reliability by enabling failover policies and procedures that reduce disruption due to an application or hardware failure.
It should be understood that the present invention may be extended by linking the invention with other technologies or processes useful in the monitoring, control or management of a variety of devices, for a variety of purposes.

Claims

1. A computer implemented speech recognition method characterized in that the method comprises:
(a) capturing one or more elements of speech using a speech capture means, the elements of speech relating to a domain;
(b) using one or more speech recognition utilities so as to generate one or more sets of speech recognition information based on the one or more elements of speech;
(c) using one or more computers to apply one or more correction routines to the one or more sets of speech recognition information, the one or more correction routines including information distance analysis or compression of the one or more sets of speech recognition information to a set of text elements related to a relevant domain and stored to a database; and
(d) constructing one or more data outputs related to a meaning intended by a user by the one or more elements of speech.
2. The method of claim 1, wherein the method computes a user's intended question or q, and the text elements consist of a database of questions or Q.
3. The method of claim 2, wherein a complementary information distance analysis or compression is used to disregard irrelevant questions.
4. The method of claim 1, comprising the step of calculating q using a Dmin and Dmax operation, with the irrelevant questions from Q being removed.
5. The method of claim 1, wherein q is calculated by concurrently:
(a) minimizing the information distance between q and Q, with irrelevant information removed; and
(b) minimizing the information distance between the speech recognition information consisting of the output queries.
6. The method of claim 1, wherein the method yields as output a determination of q, which may be one of the output queries, or a combination of two or more of the output queries, or one of the questions from Q that may be related to two or more of the output queries.
7. The method of claim 2, comprising the further steps of:
(a) splitting questions into words, and aligning groups of input questions that include words or named entities with similar pronunciations, so as to produce word alignment results;
(b) enhancing the set of input questions by building additional questions based on the word alignment results;
(c) determining a set of relevant questions from the database, based on semantic and/or syntactic similarity, wherein optionally, if the relevant questions yield a question that is the same as one of the input questions, this question is identified as the correct question;
(d) optionally grouping the input questions using one or more hierarchical clustering operations into clusters, and extracting patterns from each cluster ("extracted patterns");
(e) generating candidate questions by mapping the relevant input question into the extracted patterns;
(f) ranking candidate questions using the information distance analysis or compression operations; and
(g) returning as a corrected question the candidate question with the minimum information distance score.
8. A computer implemented system for corrected speech recognition, comprising:
(a) one or more computers including or being linked to a server computer, the server computer implementing a server application, the server application defining a correction utility, where the correction utility includes or is linked to one or more databases each including text elements related to a domain;
(b) the correction utility being operable to receive from one or more speech recognition utilities, linked to the server computer or one or more remote servers connected to the server via the Internet, one or more sets of speech recognition information based on one or more elements of speech captured from a user and associated with an intended meaning, wherein the one or more sets of speech recognition information are associated with a domain; the correction utility applying one or more correction routines to the one or more sets of speech recognition information that include information distance analysis of the one or more sets of speech recognition information to the text elements related to the domain and stored to the database; and the correction utility constructing one or more data outputs related to a meaning intended by the user by the one or more elements of speech.
9. The computer system of claim 8, wherein the computer system computes a user's intended question or q, and the text elements consist of a database of questions or Q.
10. The computer system of claim 9, wherein the correction utility applies one or more complementary information distance analysis or compression operations to disregard irrelevant questions.
11. The computer system of claim 8, wherein the correction utility is configured to calculate q using a Dmin and Dmax operation, with the irrelevant questions from Q being removed.
12. The computer system of claim 11, wherein the correction utility is configured to calculate q by concurrently:
(a) minimizing the information distance between q and Q, with irrelevant information removed; and
(b) minimizing the information distance between the speech recognition information consisting of the output queries.
13. The computer system of claim 8, wherein the correction utility generates as output a determination of q, which may be one of the output queries, or a combination of two or more of the output queries, or one of the questions from Q that may be related to two or more of the output queries.
14. The computer system of claim 9, wherein the correction utility is further operable to:
(a) split questions into words, and align groups of input questions that include words or named entities with similar pronunciations, so as to produce word alignment results;
(b) enhance the set of input questions by building additional questions based on the word alignment results;
(c) determine a set of relevant questions from the database, based on semantic and/or syntactic similarity, wherein optionally, if the relevant questions yield a question that is the same as one of the input questions, this question is identified as the correct question;
(d) optionally group the input questions using one or more hierarchical clustering operations into clusters, and extract patterns from each cluster ("extracted patterns");
(e) generate candidate questions by mapping the relevant input question into the extracted patterns;
(f) rank candidate questions using the information distance analysis or compression operations; and
(g) return as a corrected question the candidate question with the minimum information distance score.
PCT/CA2012/000911 2011-09-30 2012-10-01 System, method and computer program for correcting speech recognition information WO2013056343A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201161626635P 2011-09-30 2011-09-30
US61/626,635 2011-09-30
US201161579397P 2011-12-22 2011-12-22
US61/579,397 2011-12-22

Publications (1)

Publication Number Publication Date
WO2013056343A1 true WO2013056343A1 (en) 2013-04-25

Family

ID=48140257

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CA2012/000911 WO2013056343A1 (en) 2011-09-30 2012-10-01 System, method and computer program for correcting speech recognition information

Country Status (1)

Country Link
WO (1) WO2013056343A1 (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050143999A1 (en) * 2003-12-25 2005-06-30 Yumi Ichimura Question-answering method, system, and program for answering question input by speech
US20070100618A1 (en) * 2005-11-02 2007-05-03 Samsung Electronics Co., Ltd. Apparatus, method, and medium for dialogue speech recognition using topic domain detection


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LI ET AL.: "Answer Validation by Information Distance Calculation", PROCEEDINGS OF THE 2ND WORKSHOP ON INFORMATION RETRIEVAL FOR QUESTION ANSWERING, 2008, pages 42 - 49 *
LIN ET AL.: "Question Answering from the Web Using Knowledge Annotation and Knowledge Mining Techniques", PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, 3 November 2003 (2003-11-03), pages 116 - 123 *
ZHANG ET AL.: "Information Distance from a Question to an Answer", PROCEEDINGS OF THE 13TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 12 August 2007 (2007-08-12), pages 874 - 883 *
ZHANG ET AL.: "New Information Distance Measure and Its Application in Question Answering System", JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, vol. 23, no. ISSUE, July 2008 (2008-07-01), pages 557 - 572 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014180218A1 (en) * 2013-05-07 2014-11-13 华为终端有限公司 Update method, apparatus and system for voice recognition device
CN107077464A (en) * 2014-10-14 2017-08-18 三星电子株式会社 Electronic equipment and the method for its spoken interaction
US10546587B2 (en) 2014-10-14 2020-01-28 Samsung Electronics Co., Ltd. Electronic device and method for spoken interaction thereof
CN107077464B (en) * 2014-10-14 2020-08-07 三星电子株式会社 Electronic device and method for oral interaction thereof
US10423649B2 (en) 2017-04-06 2019-09-24 International Business Machines Corporation Natural question generation from query data using natural language processing system
CN107678964A (en) * 2017-09-29 2018-02-09 云南大学 A kind of component dynamic evolution internal consistency ensuring method
CN107678964B (en) * 2017-09-29 2020-12-29 云南大学 Method for ensuring internal consistency of dynamic evolution of component
CN112562674A (en) * 2021-02-19 2021-03-26 智道网联科技(北京)有限公司 Internet of vehicles intelligent voice processing method and related device

Similar Documents

Publication Publication Date Title
US10192545B2 (en) Language modeling based on spoken and unspeakable corpuses
US10176804B2 (en) Analyzing textual data
US9947317B2 (en) Pronunciation learning through correction logs
US10437929B2 (en) Method and system for processing an input query using a forward and a backward neural network specific to unigrams
US10073673B2 (en) Method and system for robust tagging of named entities in the presence of source or translation errors
US20180061408A1 (en) Using paraphrase in accepting utterances in an automated assistant
CN106463117B (en) Dialog state tracking using WEB-style ranking and multiple language understanding engines
US8090738B2 (en) Multi-modal search wildcards
WO2018223796A1 (en) Speech recognition method, storage medium, and speech recognition device
US11093110B1 (en) Messaging feedback mechanism
US10366690B1 (en) Speech recognition entity resolution
JP6832501B2 (en) Meaning generation method, meaning generation device and program
KR102041621B1 (en) System for providing artificial intelligence based dialogue type corpus analyze service, and building method therefor
JP2015219583A (en) Topic determination device, utterance device, method, and program
WO2013056343A1 (en) System, method and computer program for correcting speech recognition information
KR101677859B1 (en) Method for generating system response using knowledgy base and apparatus for performing the method
US10417345B1 (en) Providing customer service agents with customer-personalized result of spoken language intent
CN111508497B (en) Speech recognition method, device, electronic equipment and storage medium
US9256597B2 (en) System, method and computer program for correcting machine translation information
WO2022057452A1 (en) End-to-end spoken language understanding without full transcripts
US11361761B2 (en) Pattern-based statement attribution
Tang et al. Information distance between what I said and what it heard
KR101971696B1 (en) Apparatus and method for creating optimum acoustic model
JP2019061297A (en) Information processing apparatus, program and retrieval method
CN115577712B (en) Text error correction method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12841860

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12841860

Country of ref document: EP

Kind code of ref document: A1