CN113851150A - Method for selecting among multiple sets of voice recognition results by using confidence score - Google Patents


Info

Publication number: CN113851150A
Application number: CN202111220352.4A
Authority: CN (China)
Other languages: Chinese (zh)
Inventors: 赵浩天, 林锋
Current and original assignee: Mgjia Beijing Technology Co., Ltd. (application filed by Mgjia Beijing Technology Co., Ltd.)
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the statuses, assignees, or dates listed)
Prior art keywords: recognition, online, confidence score, engine, analysis result

Classifications

    All classifications fall under G (Physics), G10 (Musical instruments; acoustics), G10L (Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding):
    • G10L25/69: speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for evaluating synthetic or decoded voice signals
    • G10L15/1822: speech classification or search using natural language modelling; parsing for meaning understanding
    • G10L15/20: speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise or of stress-induced speech
    • G10L15/32: multiple recognisers used in sequence or in parallel; score combination systems therefor, e.g. voting systems
    • G10L25/51: speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for comparison or discrimination

Abstract

The invention relates to the technical field of audio recognition, and in particular to a method for selecting among multiple sets of voice recognition results using confidence scores, comprising the following steps: collecting voice audio from occupants of the vehicle; sending the voice audio to an offline recognition engine and an online recognition engine respectively, and sending each engine's analysis result to a scoring model as soon as it is received to obtain a corresponding confidence score; judging whether the first confidence score output by the scoring model is greater than or equal to a first preset threshold; if the first output confidence score is greater than or equal to the first preset threshold, outputting the analysis result corresponding to that confidence score as the recognition result; and if the first output confidence score is less than the first preset threshold, waiting for the scoring model to output the confidence score corresponding to each engine's analysis result. The invention overcomes the defect that the prior art cannot achieve both speed and accuracy in voice recognition.

Description

Method for selecting among multiple sets of voice recognition results by using confidence score
Technical Field
The invention relates to the technical field of audio recognition, in particular to a method for selecting among multiple sets of voice recognition results by using confidence scores.
Background
Because of the limited computing capability of the vehicle-end chip and the instability of network connections, vehicle-mounted voice assistants usually deploy recognition engines at both the vehicle end and the cloud end. In actual use, one recognition engine's result must be selected, and the selection is generally based either on a network timeout or on the text recognized by each engine. Timeout-based selection means that when the network state is good and the online recognition result returns before a set timeout, the online result is adopted preferentially; otherwise the offline result is used. Text-based selection means training a semantic model, or classifying the recognized text with template rules, and preferring the instruction that the voice assistant can correctly execute. These approaches have the following problems:
On the one hand, in the timeout-based method, because the network condition of a vehicle-mounted system is usually unstable, the timeout is usually set long so that interaction with the cloud completes smoothly, which means the user generally waits a long time before receiving feedback from the vehicle-mounted voice assistant. This ignores the advantage that offline recognition can quickly and accurately recognize certain domain-specific instructions (such as vehicle control commands), so feedback is slow.
On the other hand, the text-based method considers only the semantic information of the recognition result and ignores differences in the acoustic characteristics of speech across vehicle-mounted scenarios (for example, talking to the assistant, talking to other passengers in the same vehicle, or talking to oneself). These differences can change the speech recognition results, so the selected result may deviate substantially from what the user actually said, producing an inaccurate recognition result.
Disclosure of Invention
Therefore, the invention provides a method for selecting among multiple sets of voice recognition results using confidence scores, aiming to overcome the prior art's inability to achieve both speed and accuracy in voice recognition.
According to a first aspect of the present invention, there is provided a method of selecting among multiple sets of speech recognition results using confidence scores, comprising the following steps: collecting voice audio from occupants of the vehicle; sending the voice audio to an offline recognition engine and an online recognition engine respectively, where the offline recognition engine performs voice recognition with a model deployed on the vehicle, and the online recognition engine performs voice recognition with a model deployed in the cloud; sending each engine's analysis result to a scoring model as soon as it is received to obtain a corresponding confidence score, where each analysis result corresponds to one confidence score, and a higher confidence score indicates a more accurate analysis result; judging whether the first confidence score output by the scoring model is greater than or equal to a first preset threshold; if the first output confidence score is greater than or equal to the first preset threshold, outputting the analysis result corresponding to that confidence score as the recognition result; if the first output confidence score is less than the first preset threshold, waiting for the scoring model to output the confidence score corresponding to each engine's analysis result; and integrating the confidence scores corresponding to all analysis results to determine the feedback result of the voice audio.
Optionally, integrating the confidence scores corresponding to the analysis results to determine the feedback result of the voice audio includes: judging whether the confidence score corresponding to each engine's analysis result is greater than or equal to a second preset threshold, and counting the number of confidence scores that reach the second preset threshold; if that number is 0, sending a request to re-collect the voice audio; if that number is 1, outputting the analysis result corresponding to that confidence score as the recognition result; and if multiple confidence scores reach the second preset threshold, using a preset strategy to select the analysis result corresponding to one of them and outputting it as the recognition result.
Optionally, waiting for the scoring model to output the confidence score corresponding to each engine's analysis result includes: if the scoring model has not received the analysis result of a target engine after a preset waiting time, stopping waiting for the scoring model to output that confidence score.
Optionally, the offline recognition engine includes an offline acoustic model, an offline language model, and an offline decoder, and the online recognition engine includes an online acoustic model, an online language model, and an online decoder, where sending the voice audio to the offline recognition engine and the online recognition engine respectively further includes: sending the voice audio to the offline acoustic model to output an offline acoustic probability; obtaining an offline language probability from the offline acoustic probability output; decoding the voice audio with the offline decoder in combination with the offline acoustic probability and the offline language probability to obtain the analysis result of the offline recognition engine; sending the voice audio to the online acoustic model to output an online acoustic probability; obtaining an online language probability from the online acoustic probability output; and decoding the voice audio with the online decoder in combination with the online acoustic probability and the online language probability to obtain the analysis result of the online recognition engine.
Optionally, the analysis result includes a recognition result, a decoding cost score, and an audio frame count, where a higher decoding cost score corresponds to a lower confidence score; the recognition result is the text recognized from the voice audio, and the lower the degree of match between the word count of that text and the audio frame count, the lower the corresponding confidence score.
Optionally, the decoding cost score includes an acoustic cost score and a language cost score: the acoustic cost score is the negative logarithm of the acoustic probability, and the language cost score is the negative logarithm of the language probability.
Optionally, the online recognition engine includes an online vehicle-mounted recognition engine and an online general recognition engine; the language model of the online vehicle-mounted recognition engine is trained on corpora of vehicle-mounted scenarios, and the language model of the online general recognition engine is trained on corpora of general scenarios.
According to a second aspect of the present invention, there is provided a system for selecting among multiple sets of speech recognition results using confidence scores, comprising: an acquisition module for collecting voice audio from occupants of the vehicle; a sending module for sending the voice audio to an offline recognition engine and an online recognition engine respectively, where the offline recognition engine performs voice recognition with a model deployed on the vehicle, and the online recognition engine performs voice recognition with a model deployed in the cloud; a scoring module for sending each engine's analysis result to the scoring model as soon as it is received to obtain a corresponding confidence score, where each analysis result corresponds to one confidence score, and a higher confidence score indicates a more accurate analysis result; a judging module for judging whether the first confidence score output by the scoring model is greater than or equal to a first preset threshold; and a determining module for integrating the confidence scores corresponding to the analysis results to determine the feedback result of the voice audio.
According to a third aspect of the present invention, there is provided a computer device comprising a memory and a processor communicatively connected to each other, the memory storing computer instructions, and the processor executing the computer instructions to perform the above method of selecting among multiple sets of voice recognition results using confidence scores.
According to a fourth aspect of the present invention, there is provided a computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the above-described method of selecting among multiple sets of speech recognition results using confidence scores.
The technical solution of the invention has the following advantages:
1. The invention provides a method for selecting among multiple sets of voice recognition results using confidence scores, comprising: collecting voice audio from occupants of the vehicle; sending the voice audio to an offline recognition engine (a model deployed on the vehicle) and an online recognition engine (a model deployed in the cloud) respectively; sending each engine's analysis result to a scoring model as soon as it is received to obtain a corresponding confidence score, where a higher confidence score indicates a more accurate analysis result; judging whether the first confidence score output by the scoring model is greater than or equal to a first preset threshold; if so, outputting the corresponding analysis result as the recognition result; if not, waiting for the scoring model to output the confidence score corresponding to each engine's analysis result, and integrating those scores to determine the feedback result of the voice audio.
With this arrangement, the occupants' voice audio is collected and sent to both the offline and online recognition engines; each engine forwards its analysis result to the scoring model as soon as it is obtained, and the scoring model assigns each analysis result a confidence score. If the first confidence score output reaches the first preset threshold, its analysis result is output immediately as the recognition result; otherwise the system waits for all engines' confidence scores and integrates them to determine the feedback result, so the method achieves both speed and accuracy in voice recognition.
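The race-then-fallback flow just described can be sketched in Python as follows. This is an illustrative sketch, not the patent's implementation: the engine and scorer interfaces, the value of THRESHOLD_1, and the simplified highest-score fallback in integrate are all assumptions.

```python
import concurrent.futures

THRESHOLD_1 = 0.8  # first preset threshold (illustrative value)

def integrate(scored):
    # Simplified stand-in for the full integration step (second
    # threshold, counting, preset strategy): pick the highest score.
    best_conf, best_parse = max(scored)
    return best_parse

def recognize(audio, engines, score_model):
    """Score each engine's parse as soon as it arrives; return early
    when a confidence clears THRESHOLD_1, else integrate all scores."""
    scored = []
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(e.parse, audio) for e in engines]
        for fut in concurrent.futures.as_completed(futures):
            parse = fut.result()             # one engine's analysis result
            conf = score_model.score(parse)  # its confidence score
            if conf >= THRESHOLD_1:
                return parse                 # fast path: output immediately
            scored.append((conf, parse))
    return integrate(scored)                 # slow path: combine all scores
```

In practice the offline engine usually finishes first, so the fast path corresponds to the offline result clearing the threshold without waiting for the network.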
2. In the provided method, integrating the confidence scores corresponding to the analysis results to determine the feedback result of the voice audio includes: judging whether the confidence score corresponding to each engine's analysis result reaches a second preset threshold and counting how many do; if the count is 0, sending a request to re-collect the voice audio; if the count is 1, outputting the analysis result corresponding to that confidence score as the recognition result; and if multiple confidence scores reach the second preset threshold, using a preset strategy to select one corresponding analysis result and outputting it as the recognition result.
With this arrangement, when no confidence score reaches the first preset threshold, the second-threshold count determines whether to re-collect the audio, output the single qualifying result, or apply the preset strategy among several qualifying results, which guarantees the accuracy of voice recognition.
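The 0 / 1 / many branching above can be sketched as follows. The threshold value and the highest-score tie-break (standing in for the patent's unspecified "preset strategy") are illustrative assumptions.

```python
THRESHOLD_2 = 0.6  # second preset threshold (illustrative value)

def integrate_scores(scored):
    """scored: list of (confidence, parse) pairs. Returns a parse,
    or None to signal that the voice audio should be re-collected."""
    passing = [(conf, parse) for conf, parse in scored if conf >= THRESHOLD_2]
    if len(passing) == 0:
        return None              # count 0: request re-collection of audio
    if len(passing) == 1:
        return passing[0][1]     # count 1: output the single qualifying parse
    return max(passing)[1]       # count >1: preset strategy (here: best score)
```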
3. In the provided method, waiting for the scoring model to output the confidence score corresponding to each engine's analysis result includes: if the scoring model has not received the analysis result of a target engine after a preset waiting time, stopping waiting for that confidence score. With this arrangement, the recognition process exits when the occupants' voice audio cannot be recognized, preventing the system from entering an endless loop.
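The timeout behavior can be sketched with Python's standard concurrent.futures.wait; the wait duration and function name are illustrative assumptions.

```python
import concurrent.futures

WAIT_SECONDS = 3.0  # preset waiting time (illustrative value)

def collect_with_timeout(futures):
    """Gather whatever analysis results arrive within WAIT_SECONDS;
    engines that miss the deadline are dropped rather than waited on."""
    done, not_done = concurrent.futures.wait(futures, timeout=WAIT_SECONDS)
    for fut in not_done:
        fut.cancel()             # stop waiting for the straggler engine
    return [fut.result() for fut in done]
```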
4. In the provided method, the offline recognition engine includes an offline acoustic model, an offline language model, and an offline decoder, and the online recognition engine includes an online acoustic model, an online language model, and an online decoder. Sending the voice audio to the two engines further includes: sending the voice audio to the offline acoustic model to output an offline acoustic probability; obtaining an offline language probability from the offline acoustic probability output; decoding the voice audio with the offline decoder in combination with the offline acoustic and language probabilities to obtain the offline engine's analysis result; and symmetrically, sending the voice audio to the online acoustic model, obtaining the online acoustic and language probabilities, and decoding with the online decoder to obtain the online engine's analysis result. With this arrangement, each engine derives acoustic and language probabilities from its own models and decodes the voice audio with its own decoder to obtain its analysis result.
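The per-engine pipeline (acoustic model, then language model, then decoder) can be outlined schematically. The callables here are placeholders: a real engine would implement them as neural networks and a beam-search or WFST decoder.

```python
def run_engine(audio, acoustic_model, language_model, decoder):
    """One recognition engine's pipeline as described in the claim."""
    acoustic_prob = acoustic_model(audio)          # per-frame acoustic scores
    language_prob = language_model(acoustic_prob)  # language-model rescoring
    # The decoder combines both probabilities to produce the analysis result.
    return decoder(audio, acoustic_prob, language_prob)
```

The offline and online engines share this shape; they differ only in where the models are deployed (vehicle vs. cloud) and in the corpora their language models were trained on.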
5. In the provided method, the analysis result includes the recognition result, a decoding cost score, and an audio frame count; a higher decoding cost score corresponds to a lower confidence score, the recognition result is the text recognized from the voice audio, and a poorer match between that text's word count and the audio frame count corresponds to a lower confidence score. With this arrangement, two outputs beyond the recognition result (the decoding cost score and the audio frame count) are added, making the confidence score judgment more accurate.
6. In the provided method, the decoding cost score includes an acoustic cost score and a language cost score: the acoustic cost score is the negative logarithm of the acoustic probability, and the language cost score is the negative logarithm of the language probability. With this arrangement, the acoustic cost score is conveniently computed from the acoustic probability, and the language cost score from the language probability.
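A minimal illustration of the negative-logarithm definition: a lower probability yields a higher cost score, which in turn yields a lower confidence score.

```python
import math

def cost_score(probability):
    """Negative logarithm of a probability in (0, 1]; a certainty of
    1.0 costs 0, and cost grows as the probability shrinks."""
    return -math.log(probability)
```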
7. In the provided method, the online recognition engine includes an online vehicle-mounted recognition engine and an online general recognition engine; the language model of the former is trained on corpora of vehicle-mounted scenarios, and that of the latter on corpora of general scenarios. With this arrangement, the online recognition engines cover a wider range, so the occupants' voice audio is recognized more accurately.
8. The invention provides a system for selecting among multiple sets of speech recognition results using confidence scores, comprising: an acquisition module for collecting voice audio from occupants of the vehicle; a sending module for sending the voice audio to an offline recognition engine (a model deployed on the vehicle) and an online recognition engine (a model deployed in the cloud) respectively; a scoring module for sending each engine's analysis result to the scoring model as soon as it is received to obtain a corresponding confidence score, where a higher confidence score indicates a more accurate analysis result; a judging module for judging whether the first confidence score output by the scoring model reaches a first preset threshold; and a determining module for integrating the confidence scores corresponding to the analysis results to determine the feedback result of the voice audio.
With this arrangement, the system mirrors the method: the first confidence score that reaches the first preset threshold yields an immediate recognition result, and otherwise the confidence scores of all engines are integrated, so both the speed and the accuracy of voice recognition are achieved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of a method for selecting among a plurality of sets of speech recognition results using confidence scores according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram illustrating step S107 of a method for selecting among a plurality of sets of speech recognition results using confidence scores according to an embodiment of the present application;
FIG. 3 is a block diagram of a system for selecting among multiple sets of speech recognition results using confidence scores according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Description of reference numerals: 1. acquisition module; 2. sending module; 3. scoring module; 4. judging module; 5. determining module.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "coupled" are to be construed broadly: for example, as a fixed connection, a removable connection, or an integral connection; as a mechanical or an electrical connection; as a direct connection, an indirect connection through an intervening medium, or communication between the interiors of two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art on a case-by-case basis.
In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Example 1
This embodiment provides a method for selecting among multiple sets of voice recognition results using confidence scores; compared with traditional approaches, it achieves both the speed and the accuracy of voice recognition. Referring to fig. 1-2, the method includes the following steps:
Step S101, collecting voice audio from occupants of the vehicle.
Step S102, sending the voice audio to an offline recognition engine and an online recognition engine respectively, where the offline recognition engine performs voice recognition with a model deployed on the vehicle, and the online recognition engine performs voice recognition with a model deployed in the cloud.
To increase the speed and accuracy of voice audio recognition, multiple offline recognition engines and online recognition engines may be provided.
Step S103, sending each engine's analysis result to the scoring model as soon as it is received to obtain a corresponding confidence score, where each analysis result corresponds to one confidence score, and a higher confidence score indicates a more accurate analysis result.
The confidence score model uses the decoder's outputs to score how confident the system is that the recognition result is accurate. In general, the confidence score is computed as:
score = f(costs, frame_count, query_text);
where the inputs of f are the three outputs of the decoder: costs is the acoustic and language decoding cost scores, frame_count is the frame count of the input audio, and query_text is the recognition result. The function f is obtained by modeling a large amount of audio data and training a model, which may be a linear regression model, a decision tree, or a more complex deep learning model. Whichever model is used, the trained result must follow the broad trend that a higher score indicates greater confidence that the recognition result is right, while a lower score indicates that the recognition result is probably wrong.
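As an illustrative stand-in for the trained f, the following hand-weighted form obeys the stated trends: a higher decoding cost, or a worse match between word count and frame count, yields a lower score. The weights and the frames-per-word constant are made-up assumptions; the patent obtains f by training on a large amount of audio data.

```python
def confidence_score(costs, frame_count, query_text):
    """Toy f(costs, frame_count, query_text): not the patent's trained
    model, just a monotone sketch of its required behavior."""
    cost_total = sum(costs)                  # acoustic + language cost scores
    # Crude length match: expect roughly N frames per recognized word.
    FRAMES_PER_WORD = 30                     # assumed constant
    expected = max(len(query_text.split()), 1) * FRAMES_PER_WORD
    length_penalty = abs(frame_count - expected) / expected
    return 1.0 / (1.0 + 0.1 * cost_total + length_penalty)
```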
And step S104, judging whether the first confidence score output by the scoring model is greater than or equal to a first preset threshold.
Under normal conditions, due to network fluctuation or network delay, the recognition speed of the online recognition engine is often slower than that of the offline recognition engine, so the first confidence score output by the scoring model is usually the one corresponding to the offline recognition engine; if that confidence score is greater than or equal to the first preset threshold, the speed of speech recognition can be improved.
And step S105, if the first output confidence score is larger than or equal to the first preset threshold, outputting an analysis result corresponding to the first output confidence score as a recognition result.
And step S106, if the first output confidence score is smaller than the first preset threshold, waiting for the scoring model to output the confidence score corresponding to the analysis result of each engine.
And S107, integrating the confidence scores corresponding to each analysis result to determine the feedback result of the voice audio.
Through this arrangement, the voice audio of the persons in the vehicle is first collected, and the collected voice audio is then sent to the offline recognition engine and the online recognition engine. Each recognition engine sends its analysis result to the scoring model as soon as it is obtained, and the scoring model scores each analysis result to obtain a confidence score. If the first confidence score output is greater than or equal to the first preset threshold, the analysis result corresponding to that confidence score is output as the recognition result, which improves the speed of speech recognition. Otherwise, the system waits for the scoring model to output the confidence score corresponding to the analysis result of each engine, and then determines the feedback result of the voice audio by integrating the confidence scores corresponding to all analysis results, which improves the accuracy of speech recognition. In this way, both the speed and the accuracy of speech recognition are taken into account.
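A minimal sketch of this early-exit flow (steps S104-S106) follows; the queue of (result, score) pairs, the threshold value, and the timeout are illustrative assumptions, not from the patent:

```python
import queue

FIRST_THRESHOLD = 0.9  # illustrative value for the first preset threshold

def pick_result(score_queue, n_engines, timeout=2.0):
    """Consume (analysis_result, confidence_score) pairs as the scoring
    model emits them. Early-exit on the first high-confidence score
    (step S105); otherwise gather everything for integration (S106-S107)."""
    scored = []
    for _ in range(n_engines):
        try:
            result, score = score_queue.get(timeout=timeout)
        except queue.Empty:
            break  # a slow engine missed the deadline; stop waiting
        if not scored and score >= FIRST_THRESHOLD:
            return result            # fast path: first score passes
        scored.append((result, score))
    return scored                    # defer to the integration step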
Step S107 further includes the following substeps:
Step S2071, judging whether the confidence score corresponding to the analysis result of each engine is greater than or equal to a second preset threshold, and counting the number of confidence scores greater than or equal to the second preset threshold;
step S2072, if the number of confidence scores greater than or equal to the second preset threshold is 0, sending a request to re-collect the voice audio;
step S2073, if the number of confidence scores greater than or equal to the second preset threshold is 1, outputting the analysis result corresponding to that confidence score as the recognition result;
step S2074, if the number of confidence scores greater than or equal to the second preset threshold is more than one, selecting the analysis result corresponding to one of those confidence scores as the recognition result by using a preset policy and outputting it.
The preset strategies include the following:
First, selecting the highest of the multiple confidence scores and outputting the analysis result corresponding to that confidence score as the recognition result;
Second, adopting different selection strategies according to the actual service scene; for example, obtaining a comprehensive score by combining multiple online semantic models such as a dialect classification model, a named entity recognition model and a semantic rejection model, and comparing the comprehensive scores of the general result and the special result to finally determine which result to use.
Through this setting, if no confidence score is greater than or equal to the first preset threshold, it is judged whether the confidence score corresponding to the analysis result of each engine is greater than or equal to the second preset threshold, and the number of confidence scores greater than or equal to the second preset threshold is counted. If that number is 0, a request to re-collect the voice audio is sent; if that number is 1, the analysis result corresponding to that confidence score is output as the recognition result; if that number is more than one, the analysis result corresponding to one of those confidence scores is selected with a preset strategy and output as the recognition result, thereby guaranteeing the accuracy of speech recognition.
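The 0 / 1 / many branching of the integration step can be sketched as follows; the threshold value is illustrative, and the "many" strategy shown is simply the first preset strategy (highest score wins), not the semantic-model comparison:

```python
SECOND_THRESHOLD = 0.6  # illustrative value for the second preset threshold

def integrate(scored_results):
    """scored_results: list of (analysis_result, confidence_score) pairs.
    Implements the branching of steps S2072-S2074."""
    passing = [(r, s) for r, s in scored_results if s >= SECOND_THRESHOLD]
    if not passing:
        return ("RE_COLLECT", None)       # ask to re-collect the audio
    if len(passing) == 1:
        return ("RESULT", passing[0][0])  # exactly one confident result
    best = max(passing, key=lambda rs: rs[1])
    return ("RESULT", best[0])            # preset strategy: highest score
```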
It should be noted that, if the scoring model has not received the analysis result of a target engine after waiting for a preset time, it stops waiting for that engine's confidence score.
Through this arrangement, when the voice audio of the persons in the vehicle cannot be recognized, the speech recognition process is exited, thereby preventing the system from entering an endless loop.
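One way to realize such a bounded wait (a sketch under the assumption that each engine is a callable taking the raw audio; not the patent's implementation) is a parallel dispatch with a deadline:

```python
from concurrent.futures import ThreadPoolExecutor, wait

def collect_scores(engines, audio, wait_seconds=3.0):
    """Run every engine in parallel and keep only the results that arrive
    within the deadline, so a missing engine cannot stall the pipeline."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = [pool.submit(engine, audio) for engine in engines]
        done, not_done = wait(futures, timeout=wait_seconds)
        for f in not_done:
            f.cancel()  # give up on engines that missed the deadline
        return [f.result() for f in done]
```

Note that `Future.cancel()` only prevents tasks that have not started; a truly stuck engine thread would still be joined at executor shutdown, so production code would typically also enforce per-engine timeouts.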
The offline recognition engine comprises an offline acoustic model, an offline language model and an offline decoder, and the online recognition engine comprises an online acoustic model, an online language model and an online decoder. Sending the voice audio to the offline recognition engine and the online recognition engine respectively further comprises: sending the voice audio to the offline acoustic model and outputting an offline acoustic probability; inputting the offline acoustic probability to the offline language model to obtain an offline language probability; decoding the voice audio with the offline decoder in combination with the offline acoustic probability and the offline language probability to obtain the analysis result of the offline recognition engine; sending the voice audio to the online acoustic model and outputting an online acoustic probability; obtaining an online language probability from the online acoustic probability; and decoding the voice audio with the online decoder in combination with the online language probability to obtain the analysis result of the online recognition engine.
Through the setting, the offline recognition engine obtains the offline acoustic probability through the offline acoustic model, obtains the offline language probability through the offline language model, and then decodes the voice audio by using an offline decoder to obtain an analysis result of the offline recognition engine; the online recognition engine obtains online acoustic probability through an online acoustic model, obtains online language probability through the online language model, and then decodes voice audio by using an online decoder to obtain an analysis result of the online recognition engine.
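The pipeline described above can be sketched as a composition of three stages; the callables below are stand-ins for the real models, not an actual implementation:

```python
def offline_recognize(audio, acoustic_model, language_model, decoder):
    """Sketch of the offline engine pipeline: acoustic model -> language
    model -> decoder. The online engine follows the same shape with
    cloud-hosted models."""
    acoustic_prob = acoustic_model(audio)          # per-frame pronunciation probabilities
    language_prob = language_model(acoustic_prob)  # language probability from the LM
    return decoder(audio, acoustic_prob, language_prob)  # the analysis result
```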
The analysis result comprises a recognition result, a decoding cost score and an audio frame count. The higher the decoding cost score, the lower the corresponding confidence score; the recognition result is the language text recognized from the voice audio, and the lower the matching degree between the number of words in the text and the number of audio frames, the lower the corresponding confidence score. Through this setting, two outputs in addition to the recognition result are used: the decoding cost score and the audio frame count, which makes the judgment of the confidence score more accurate.
The matching degree between the number of words and the number of audio frames means the following: for example, if only 3 words are predicted from 5 seconds of audio, the corresponding confidence score will not be high, and the recognition result is probably incorrect.
The decoding cost score comprises an acoustic cost score and a language cost score; the acoustic cost score is the negative logarithm of the acoustic probability, and the language cost score is the negative logarithm of the language probability. Through this arrangement, the acoustic cost score can be conveniently calculated from the acoustic probability, and the language cost score from the language probability.
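The negative-logarithm conversion above is a one-liner:

```python
import math

def to_cost(probability):
    """Cost score = negative logarithm of the probability, as described.
    A certain prediction (probability 1.0) has cost 0; less likely
    events cost more."""
    return -math.log(probability)
```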
The decoding process of speech recognition is the process of finding the most likely word result by combining an acoustic model, a language model, a pronunciation dictionary and so on. Taking a typical WFST (weighted finite-state transducer) based decoder as an example, the language model combined with the pronunciation dictionary is represented as a weighted directed graph, with an input label (pronunciation), an output label (text) and a weight (language model cost, the negative logarithm of the probability) on each edge of the graph. The decoding process jumps on the graph according to the possible pronunciations of each audio frame predicted by the acoustic model and their probabilities; each jump incurs an acoustic cost (the negative logarithm of the pronunciation probability) and a language cost (the weight of the edge). Finally, the path with the minimum total cost is selected; the output labels on this path are concatenated to form the recognized text result, and the acoustic and language cost scores of each jump on the path are also output as the input of the scoring model.
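A toy illustration of this minimum-total-cost path search follows. It is a deliberately simplified stand-in for a real WFST decoder (no beam pruning, epsilon handling, or lattices); the graph layout and cost values are invented for the example:

```python
import heapq
import math

def best_path(edges, start, final, frames):
    """Toy search for the minimum total-cost path through a word graph.
    edges: {state: [(pronunciation, output_word, language_cost, next_state)]}
    frames: one dict per audio frame mapping pronunciation -> acoustic prob.
    Returns (total_cost, recognized_words)."""
    heap = [(0.0, start, 0, ())]          # (cost, state, frame index, words)
    seen = set()
    while heap:
        cost, state, i, words = heapq.heappop(heap)
        if (state, i) in seen:
            continue
        seen.add((state, i))
        if state == final and i == len(frames):
            return cost, list(words)      # cheapest complete path found
        if i == len(frames):
            continue
        for pron, word, lang_cost, nxt in edges.get(state, []):
            prob = frames[i].get(pron)
            if prob is None:
                continue                  # acoustic model ruled this jump out
            acoustic_cost = -math.log(prob)
            new_words = words + ((word,) if word else ())
            heapq.heappush(
                heap, (cost + acoustic_cost + lang_cost, nxt, i + 1, new_words))
    return math.inf, []
```

Each jump adds an acoustic cost plus the edge's language cost, and the output labels along the winning path are concatenated into the recognized text, mirroring the description above.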
The online recognition engine comprises an online vehicle-mounted recognition engine and an online general recognition engine; the language model of the online vehicle-mounted recognition engine is trained on corpora of the vehicle-mounted scene, and the language model of the online general recognition engine is trained on corpora of general scenes. Through this arrangement, the coverage of the online recognition engines is wider, so that the voice audio of the persons in the vehicle can be recognized more accurately.
Example 2
The system for selecting among a plurality of sets of speech recognition results by using confidence scores provided by the invention is used for implementing the method for selecting among a plurality of sets of speech recognition results by using confidence scores in the above embodiment, and with reference to fig. 3, the system comprises the following modules:
the acquisition module 1 is used for acquiring the voice audio of the persons in the vehicle;
the sending module 2 is used for sending the voice audio to an offline recognition engine and an online recognition engine respectively, wherein the offline recognition engine is an engine whose model is deployed on the vehicle to perform speech recognition, and the online recognition engine is an engine whose model is deployed in the cloud to perform speech recognition;
the scoring module 3 is used for sending the analysis results of each engine to the scoring model immediately after receiving the analysis results to obtain corresponding confidence scores, wherein each analysis result corresponds to one confidence score, and the higher the confidence score is, the more accurate the analysis result is;
the judging module 4 is used for judging whether the first confidence score output by the scoring model is greater than or equal to a first preset threshold;
and the determining module 5 is used for determining the feedback result of the voice audio by integrating the confidence scores corresponding to the analysis results.
Through this arrangement, the voice audio of the persons in the vehicle is first collected and then sent to the offline recognition engine and the online recognition engine. Each recognition engine sends its analysis result to the scoring model as soon as it is obtained, and the scoring model scores each analysis result to obtain a confidence score. If the first confidence score output is greater than or equal to the first preset threshold, the analysis result corresponding to that confidence score is output as the recognition result; otherwise, the system waits for the scoring model to output the confidence score corresponding to the analysis result of each engine and then determines the feedback result of the voice audio by integrating those confidence scores. In this way, both the speed and the accuracy of speech recognition are taken into account.
Example 3
Referring to fig. 4, an embodiment of the present invention further provides a computer device, where the computer device includes a processor and a memory, where the processor and the memory may be connected by a bus or in another manner, and the connection by the bus is taken as an example in the figure.
The processor may be a Central Processing Unit (CPU). The Processor may also be other general-purpose processors, Digital Signal Processors (DSPs), Graphics Processing Units (GPUs), embedded Neural Network Processors (NPUs), or other dedicated deep learning coprocessors, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, or the like, or a combination thereof.
The memory, as a non-transitory computer-readable storage medium, is used for storing non-transitory software programs, non-transitory computer-executable programs and modules; the processor executes the non-transitory software programs, instructions and modules stored in the memory so as to execute the various functional applications and data processing of the processor, namely, implementing the method for selecting among multiple sets of speech recognition results by using confidence scores in the method embodiments.
The memory may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created by the processor, and the like. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and such remote memory may be coupled to the processor via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The details of the computer device can be understood by referring to the corresponding descriptions and effects in the embodiments shown in the figures, and are not described herein again.
Embodiments of the present invention further provide a non-transitory computer storage medium, where the computer storage medium stores computer-executable instructions, and the computer-executable instructions may execute the method for selecting among multiple sets of speech recognition results using confidence scores in any of the above method embodiments. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD), a Solid State Drive (SSD), or the like; the storage medium may also comprise a combination of memories of the kind described above.
It should be understood that the above examples are only for clear illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments exhaustively, and obvious variations or modifications derived therefrom are within the scope of the invention.

Claims (10)

1. A method for selecting among a plurality of sets of speech recognition results using confidence scores, comprising the steps of:
collecting the voice audio of persons in the vehicle;
respectively sending the voice audio to an offline recognition engine and an online recognition engine, wherein the offline recognition engine is an engine for performing voice recognition on a model deployed on a vehicle; the online recognition engine is an engine for performing voice recognition on a model deployed at the cloud;
the method comprises the steps that after an analysis result of each engine is received, the analysis result is immediately sent to a scoring model, and a corresponding confidence score is obtained, wherein each analysis result corresponds to one confidence score, and the higher the confidence score is, the more accurate the analysis result is;
judging whether the first confidence score output by the scoring model is greater than or equal to a first preset threshold;
if the first output confidence score is larger than or equal to the first preset threshold, outputting an analysis result corresponding to the first output confidence score as a recognition result;
if the first output confidence score is smaller than the first preset threshold value, waiting for the scoring model to output the confidence score corresponding to the analysis result of each engine;
and integrating the confidence score corresponding to each analysis result to determine the feedback result of the voice audio.
2. The method of claim 1, wherein the integrating the confidence score corresponding to each analysis result to determine the feedback result of the speech audio comprises:
judging whether the confidence score corresponding to the analysis result of each engine is greater than or equal to a second preset threshold value or not, and counting the number of the confidence scores greater than or equal to the second preset threshold value;
if the number of confidence scores greater than or equal to the second preset threshold is 0, sending a request to re-collect the voice audio;
if the number of the confidence scores which are larger than or equal to the second preset threshold is 1, outputting an analysis result corresponding to the confidence score as a recognition result;
and if the number of the confidence scores which are larger than or equal to the second preset threshold is multiple, selecting an analysis result corresponding to the corresponding confidence score as a recognition result by using a preset strategy and outputting the analysis result.
3. The method of claim 1, wherein waiting for the scoring model to output the confidence score corresponding to the analysis result of each engine comprises:
and if the scoring model does not receive the analysis result of the target engine after waiting for the preset time, stopping waiting for the scoring model to output the confidence score.
4. The method of claim 1, wherein the offline recognition engine comprises an offline acoustic model, an offline language model and an offline decoder, and the online recognition engine comprises an online acoustic model, an online language model and an online decoder, and wherein the sending the voice audio to the offline recognition engine and the online recognition engine respectively further comprises:
sending the voice audio to an offline acoustic model, and outputting an offline acoustic probability;
inputting the offline acoustic probability to the offline language model to obtain an offline language probability;
decoding the voice audio by using the offline decoder in combination with the offline acoustic probability and the language model to obtain an analysis result of the offline recognition engine;
sending the voice audio to an online acoustic model, and outputting online acoustic probability;
obtaining an online language probability according to the online acoustic probability output;
and decoding the voice audio by using the online decoder in combination with the online language probability to obtain an analysis result of the online recognition engine.
5. The method of claim 4, wherein the analysis result comprises a recognition result, a decoding cost score and an audio frame count, wherein a higher decoding cost score corresponds to a lower confidence score, the recognition result is the language text recognized from the voice audio, and a lower matching degree between the number of words in the text and the number of audio frames corresponds to a lower confidence score.
6. The method of claim 5, wherein the decoding cost score comprises an acoustic cost score and a language cost score, the acoustic cost score is a negative logarithm of the acoustic probability, and the language cost score is a negative logarithm of the language probability.
7. The method of claim 1, wherein the online recognition engine comprises an online vehicle-mounted recognition engine and an online general recognition engine, a language model of the online vehicle-mounted recognition engine is trained from a corpus of a vehicle-mounted scene, and a language model of the online general recognition engine is trained from a corpus of a general scene.
8. A system for selecting among a plurality of sets of speech recognition results using confidence scores, comprising:
the acquisition module is used for acquiring the voice audio of persons in the vehicle;
the sending module is used for respectively sending the voice audio to an offline recognition engine and an online recognition engine, wherein the offline recognition engine is an engine for performing voice recognition on a model deployed on a vehicle; the online recognition engine is an engine for performing voice recognition on a model deployed at the cloud;
the scoring module is used for sending the analysis results of each engine to the scoring model immediately after receiving the analysis results of each engine to obtain corresponding confidence scores, wherein each analysis result corresponds to one confidence score, and the higher the confidence score is, the more accurate the analysis result is;
the judging module is used for judging whether the first confidence score output by the scoring model is greater than or equal to a first preset threshold;
and the determining module is used for determining the feedback result of the voice audio by integrating the confidence scores corresponding to the analysis results.
9. A computer device, comprising: a memory and a processor, the memory and the processor being communicatively coupled to each other, the memory having stored therein computer instructions, the processor executing the computer instructions to perform the method of selecting among a plurality of sets of speech recognition results using confidence scores of any of claims 1-7.
10. A computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of selecting among a plurality of sets of speech recognition results using confidence scores of any of claims 1-7.
CN202111220352.4A 2021-10-20 2021-10-20 Method for selecting among multiple sets of voice recognition results by using confidence score Pending CN113851150A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111220352.4A CN113851150A (en) 2021-10-20 2021-10-20 Method for selecting among multiple sets of voice recognition results by using confidence score


Publications (1)

Publication Number Publication Date
CN113851150A true CN113851150A (en) 2021-12-28

Family

ID=78982355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111220352.4A Pending CN113851150A (en) 2021-10-20 2021-10-20 Method for selecting among multiple sets of voice recognition results by using confidence score

Country Status (1)

Country Link
CN (1) CN113851150A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115273818A (en) * 2022-09-27 2022-11-01 小米汽车科技有限公司 Voice processing method, processing device, processing apparatus, vehicle, and medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination