WO2021104102A1 - Speech recognition error correction method, related device, and readable storage medium - Google Patents
Speech recognition error correction method, related device, and readable storage medium
- Publication number: WO2021104102A1
- Application: PCT/CN2020/129314
- Authority: WIPO (PCT)
- Prior art keywords: recognition, recognition result, error correction, speech, result
Classifications
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/1815—Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/26—Speech to text systems
- G10L15/32—Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
- G10L19/005—Correction of errors induced by the transmission channel, if related to the coding algorithm
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G06F40/279—Recognition of textual entities
- G10L2015/088—Word spotting
- G10L2015/226—Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics
- G10L2015/228—Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics of application context
Description
- Speech recognition technology based on deep learning has matured, and the recognition accuracy of traditional speech recognition models in general scenarios has reached satisfactory levels.
- However, in special scenarios (such as professional fields), the voice content usually contains professional vocabulary that appears infrequently in general scenarios, so traditional speech recognition models cover this type of vocabulary poorly.
- When the speech to be recognized contains this type of vocabulary, recognizing it with a traditional speech recognition model is very prone to recognition errors, resulting in low speech recognition accuracy.
- In view of this, this application provides a speech recognition error correction method, related device, and readable storage medium. The specific scheme is as follows:
- A speech recognition error correction method includes: acquiring the voice data to be recognized and its first recognition result; performing a second recognition on the voice data with reference to the context information of the first recognition result to obtain a second recognition result; and determining the final recognition result according to the second recognition result.
- Another speech recognition error correction method includes: acquiring the voice data to be recognized and its first recognition result; extracting keywords from the first recognition result; performing a second recognition on the voice data with reference to the context information of the first recognition result and the keywords to obtain a second recognition result; and determining the final recognition result according to the second recognition result.
- Optionally, extracting keywords from the first recognition result includes: extracting vocabulary with domain characteristics from the first recognition result as the keywords.
- Optionally, performing the second recognition on the voice data with reference to the context information of the first recognition result and the keywords to obtain the second recognition result includes:
- the acoustic characteristics of the voice data, the first recognition result, and the keywords are input into a pre-trained voice error correction recognition model to obtain the second recognition result.
- The voice error correction recognition model is obtained by training a preset model with an error correction training data set;
- the error correction training data set includes at least one group of error correction training data, and each group includes the acoustic features corresponding to a piece of speech data, the text corresponding to the piece of speech data, the first recognition result corresponding to the piece of speech data, and the keywords in the first recognition result.
- the inputting the acoustic features of the voice data, the first recognition result, and the keywords into a pre-trained voice error correction recognition model to obtain the second recognition result includes:
- the voice error correction recognition model is used to encode the acoustic features of the voice data, the first recognition result, and the keywords, to perform attention calculation, and to obtain the second recognition result based on the calculation result.
- Optionally, using the voice error correction recognition model to encode the acoustic features of the voice data, the first recognition result, and the keywords, perform attention calculation, and obtain the second recognition result based on the calculation result includes:
- using the coding layer and the attention layer of the speech error correction recognition model to respectively encode the acoustic features of the speech data, the first recognition result, and the keywords and perform attention calculation to obtain the calculation result;
- the decoding layer of the speech error correction recognition model is used to decode the calculation result to obtain the second recognition result.
- Optionally, using the voice error correction recognition model to encode the acoustic features of the voice data, the first recognition result, and the keywords, perform attention calculation, and obtain the second recognition result includes:
- merging the acoustic features of the voice data, the first recognition result, and the keywords to obtain a merged vector; using the coding layer and the attention layer of the speech error correction recognition model to encode the merged vector and perform attention calculation to obtain the calculation result; and using the decoding layer of the speech error correction recognition model to decode the calculation result to obtain the second recognition result.
- Optionally, using the coding layer and the attention layer of the speech error correction recognition model to respectively encode the acoustic features of the speech data, the first recognition result, and the keywords and perform attention calculation to obtain the calculation result includes:
- using the coding layer of the speech error correction recognition model to separately encode each target object to obtain the high-level features of each target object;
- using the attention layer of the speech error correction recognition model to perform attention calculation on the semantic vector at the previous moment related to each target object and the output result of the speech error correction recognition model at the previous moment, to obtain the hidden layer state related to each target object;
- performing attention calculation on the high-level features of each target object and the hidden layer state related to each target object, to obtain the semantic vector related to each target object;
- wherein the target objects include the acoustic features of the voice data, the first recognition result, and the keywords.
- Optionally, using the coding layer and the attention layer of the speech error correction recognition model to encode the merged vector and perform attention calculation to obtain the calculation result includes:
- using the coding layer of the speech error correction recognition model to encode the merged vector to obtain the high-level features of the merged vector;
- using the attention layer of the speech error correction recognition model to perform attention calculation on the semantic vector at the previous moment related to the merged vector and the output result of the speech error correction recognition model at the previous moment, to obtain the hidden layer state related to the merged vector;
- performing attention calculation on the high-level features of the merged vector and the hidden layer state related to the merged vector, to obtain the semantic vector related to the merged vector.
- Optionally, determining the final recognition result according to the second recognition result includes: obtaining the confidence of the first recognition result and the confidence of the second recognition result; and determining, from the first recognition result and the second recognition result, the recognition result with the higher confidence as the final recognition result.
- A speech recognition error correction device includes:
- the acquiring unit, used to acquire the voice data to be recognized and its first recognition result;
- the first speech recognition unit is configured to refer to the context information of the first recognition result, perform second recognition on the speech data, and obtain the second recognition result;
- the recognition result determining unit is configured to determine the final recognition result according to the second recognition result.
- Another speech recognition error correction device is also provided; the device includes:
- the acquiring unit, used to acquire the voice data to be recognized and its first recognition result;
- the keyword extraction unit, used to extract keywords from the first recognition result;
- the second voice recognition unit is configured to refer to the context information of the first recognition result and the keywords, and perform a second recognition on the voice data to obtain the second recognition result;
- the recognition result determining unit is configured to determine the final recognition result according to the second recognition result.
- the keyword extraction unit includes:
- the domain vocabulary extraction unit is used to extract vocabularies with domain characteristics from the first recognition result as keywords.
- the second speech recognition unit includes:
- the acoustic feature obtaining unit, configured to obtain the acoustic features of the voice data;
- the model processing unit is used to input the acoustic features of the voice data, the first recognition result, and the keywords into a pre-trained voice error correction recognition model to obtain the second recognition result.
- The voice error correction recognition model is obtained by training the preset model using the error correction training data set;
- the error correction training data set includes at least one group of error correction training data, and each group includes the acoustic features corresponding to a piece of speech data, the text corresponding to the piece of speech data, the first recognition result corresponding to the piece of speech data, and the keywords in the first recognition result.
- the model processing unit includes:
- An encoding and attention calculation unit configured to use the voice error correction recognition model to encode the acoustic features of the voice data, the first recognition result, and the keywords, and to perform attention calculation;
- the recognition unit is used to obtain the second recognition result based on the calculation result.
- the encoding and attention calculation unit includes a first encoding and attention calculation unit, and the recognition unit includes a first decoding unit;
- the first coding and attention calculation unit is configured to use the coding layer and the attention layer of the speech error correction recognition model to encode the acoustic features of the speech data, the first recognition result, and the keywords and perform attention calculation to obtain the calculation result;
- the first decoding unit is configured to use the decoding layer of the speech error correction recognition model to decode the calculation result to obtain the second recognition result.
- the model processing unit further includes a merging unit, the encoding and attention calculation unit includes a second encoding and an attention calculation unit, and the recognition unit includes a second decoding unit:
- the merging unit is configured to merge the acoustic features of the voice data, the first recognition result, and the keywords to obtain a merged vector;
- the second coding and attention calculation unit is configured to use the coding layer and the attention layer of the speech error correction recognition model to perform coding and attention calculation on the merged vector to obtain the calculation result;
- the second decoding unit is configured to use the decoding layer of the speech error correction recognition model to decode the calculation result to obtain the second recognition result.
- the first coding and attention calculation unit includes:
- the first coding unit is configured to use the coding layer of the speech error correction recognition model to respectively encode each target object to obtain the high-level features of each target object;
- the first attention calculation unit is configured to use the attention layer of the speech error correction recognition model to perform attention calculation on the semantic vector at the previous moment related to each target object and the output result of the speech error correction recognition model at the previous moment, to obtain the hidden layer state related to each target object; and to use the attention layer of the speech error correction recognition model to perform attention calculation on the high-level features of each target object and the hidden layer state related to each target object, to obtain the semantic vector related to each target object;
- wherein the target objects include the acoustic features of the speech data, the first recognition result, and the keywords.
- the second encoding and attention calculation unit includes:
- the second coding unit is configured to use the coding layer of the speech error correction recognition model to encode the merged vector to obtain the high-level features of the merged vector;
- the second attention calculation unit is configured to use the attention layer of the speech error correction recognition model to perform attention calculation on the semantic vector at the previous moment related to the merged vector and the output result of the speech error correction recognition model at the previous moment, to obtain the hidden layer state related to the merged vector; and to use the attention layer of the speech error correction recognition model to perform attention calculation on the high-level features of the merged vector and the hidden layer state related to the merged vector, to obtain the semantic vector related to the merged vector.
- the recognition result determining unit includes:
- the confidence obtaining unit, configured to obtain the confidence of the first recognition result and the confidence of the second recognition result;
- the determining unit, configured to determine, from the first recognition result and the second recognition result, the recognition result with the higher confidence as the final recognition result.
- A speech recognition error correction system is also provided, including a memory and a processor;
- the memory is used to store programs
- the processor is configured to execute the program to implement each step of the voice recognition error correction method described above.
- a readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, each step of the above-mentioned speech recognition error correction method is realized.
- A computer program product is also provided, which, when running on a terminal device, causes the terminal device to execute each step of the above-mentioned speech recognition error correction method.
- This application discloses a speech recognition error correction method, related device, and readable storage medium, including: acquiring the speech data to be recognized and its first recognition result; recognizing the speech data a second time with reference to the context information of the first recognition result to obtain a second recognition result; and finally determining the final recognition result according to the second recognition result.
- In this scheme, the second recognition of the voice data is performed with reference to the context information of the first recognition result, fully considering the context information of the recognition result and the application scenario of the voice data. If the first recognition result is wrong, the second recognition is used to correct its errors; therefore, the accuracy of speech recognition can be improved.
- FIG. 1 is a schematic flowchart of a voice recognition error correction method disclosed in an embodiment of the application
- FIG. 2 is a schematic flowchart of another voice recognition error correction method disclosed in an embodiment of the application.
- FIG. 3 is a schematic diagram of the topology structure of a preset model for training a speech error correction recognition model disclosed in an embodiment of the application;
- FIG. 4 is a schematic diagram of the topology structure of yet another preset model for training a speech error correction recognition model disclosed in an embodiment of the application;
- FIG. 5 is a schematic structural diagram of a speech recognition error correction device disclosed in an embodiment of the application.
- FIG. 6 is a schematic structural diagram of another voice recognition error correction device disclosed in an embodiment of the application.
- FIG. 7 is a hardware structure block diagram of a speech recognition error correction system disclosed in an embodiment of the application.
- When the first clause in a voice data stream is recognized, the context information is insufficient, so the first clause is prone to being recognized incorrectly; when the clauses after the first clause are recognized, the context information is sufficient, so those clauses can be recognized correctly.
- In other words, the same word may be recognized incorrectly when it appears in the first clause, yet recognized correctly when it appears in a later clause.
- For example, the content of the speech to be recognized is "Scientists from the Salk Institute in California have discovered that autophagy can inhibit the occurrence of cancer. Many people's cognition is just the opposite. Therefore, therapies used to inhibit autophagy may actually bring bad consequences.", but it was recognized as "Scientists from the Salk Institute in California found that the reaction at this time can inhibit the occurrence of cancer. Many people's perceptions are just the opposite. Therefore, therapies used to suppress the autophagy response may have undesirable consequences."
- The inventor of this case found that the context information carried by the recognition result itself has a certain impact on the correctness of the recognition result. Therefore, the voice data to be recognized can be recognized a second time based on the context information of its first recognition result to obtain a second recognition result, in which incorrectly recognized domain vocabulary from the first recognition result may be corrected, thereby improving the accuracy of the speech recognition result.
- the inventor of this case proposes a speech recognition error correction method.
- the speech recognition error correction method provided by the present application will be introduced through the following embodiments.
- FIG. 1 is a schematic flowchart of a voice recognition error correction method disclosed in an embodiment of the application.
- the method may include:
- S101 Acquire voice data to be recognized and its first recognition result.
- the voice data to be recognized is voice data spoken by the user according to application requirements, such as voice data input by the user using a voice input method when sending a text message or chatting.
- the voice data to be recognized may be voice data in a general field, or voice data in a special scene (such as a professional field).
- Obtaining the first recognition result of the voice data to be recognized can be implemented based on a neural network model.
- other methods for obtaining the first recognition result of the voice data to be recognized are also within the protection scope of this application.
- the first recognition result of the voice data to be recognized can be stored in advance, and when needed, it can be directly obtained from the storage medium.
- S102 Referring to the context information of the first recognition result, perform a second recognition on the voice data to obtain a second recognition result.
- As mentioned above, the context information carried by the recognition result itself has a certain impact on the correctness of the recognition result. Therefore, in this embodiment, the voice data can be recognized a second time with reference to the context information of the first recognition result to obtain the second recognition result.
- There are various ways to perform the second recognition of the voice data and obtain the second recognition result; for example, it can be implemented based on a neural network model.
- As another example, the domain vocabulary contained in the first recognition result can be determined and matched against the other words in the first recognition result; words whose matching degree is higher than a set lower limit but that are not exactly the same are filtered out, and the filtered words are replaced with the domain vocabulary to obtain the second recognition result, as sketched below.
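The replacement strategy just described can be made concrete in a few lines of Python. This is a minimal sketch, assuming token lists, a given domain lexicon, and an illustrative similarity threshold of 0.6; none of these choices are fixed by the application.

```python
# Sketch of the domain-vocabulary replacement strategy described above.
# The token granularity, the domain lexicon and the 0.6 threshold are
# illustrative assumptions, not values fixed by the application.
from difflib import SequenceMatcher

def correct_by_domain_words(tokens, domain_words, min_ratio=0.6):
    """Replace tokens that nearly (but not exactly) match a domain word."""
    corrected = []
    for tok in tokens:
        best, best_ratio = tok, 0.0
        for dw in domain_words:
            ratio = SequenceMatcher(None, tok, dw).ratio()
            # "matching degree higher than the set lower limit and not
            # exactly the same"
            if min_ratio <= ratio < 1.0 and ratio > best_ratio:
                best, best_ratio = dw, ratio
        corrected.append(best)
    return corrected

# A near-miss of a domain word is replaced; everything else is kept.
print(correct_by_domain_words(["the", "autofagy", "response"], ["autophagy"]))
# -> ['the', 'autophagy', 'response']
```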
- the second recognition result can be directly determined as the final recognition result.
- However, the second recognition result may not be better than the first recognition result; if the second recognition result were directly determined as the final recognition result, the recognition accuracy could be reduced. Therefore, in this case, an optimal recognition result can be determined from the first recognition result and the second recognition result as the final recognition result.
- As an implementation, the confidence of the first recognition result and the confidence of the second recognition result can be obtained, and the recognition result with the higher confidence is determined as the final recognition result from the first recognition result and the second recognition result.
- Alternatively, manual verification can be used to determine an optimal recognition result from the first recognition result and the second recognition result.
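For the confidence-based selection, the decision step itself is simple; how the two confidences are produced (for example, as posterior scores from each decoder) is left open by the text, so this minimal sketch assumes they are given.

```python
# Minimal sketch: keep whichever recognition result is more confident.
# The confidence values are assumed to be provided by the two recognizers.
def choose_final_result(first, first_conf, second, second_conf):
    return first if first_conf >= second_conf else second

print(choose_final_result("...the reaction...", 0.71, "...autophagy...", 0.92))
# -> '...autophagy...'
```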
- This embodiment discloses a speech recognition error correction method: obtain the voice data to be recognized and its first recognition result; with reference to the context information of the first recognition result, recognize the voice data a second time to obtain the second recognition result; and finally determine the final recognition result according to the second recognition result.
- Because the voice data is recognized a second time with reference to the context information of the first recognition result, the context information of the recognition result and the application scenario of the voice data are fully considered. If the first recognition result is wrong, the second recognition can correct its errors; therefore, the accuracy of speech recognition can be improved.
- In the embodiment of the present application, another speech recognition error correction method is further provided. Based on the above embodiment, keywords can first be extracted from the first recognition result, and the voice data can then be recognized a second time with reference to both the context information of the first recognition result and the keywords, which can further improve the accuracy of the second recognition result. The specific implementation process is shown in Figure 2. The method includes:
- S201 Acquire voice data to be recognized and its first recognition result.
- Step S201 is the same as the aforementioned step S101, and the detailed implementation process can be referred to the foregoing introduction, which will not be repeated here.
- S202 Extract keywords from the first recognition result.
- The keywords may be vocabulary with domain characteristics extracted from the first recognition result; that is, the keywords can be domain-related words that appear in the first recognition result, usually domain-specific words. Examples are autophagy, bone traction, kidney biopsy, and other vocabulary in the medical field, or feedforward neural network, pooling layer, and the like in the computer field.
- S203 Referring to the context information of the first recognition result and the keyword, perform a second recognition on the voice data to obtain a second recognition result.
- There may be multiple ways to perform the second recognition of the voice data and obtain the second recognition result; for example, it can be implemented based on a neural network model.
- Step S204 is the same as the aforementioned step S103, and the detailed implementation process can be referred to the foregoing introduction, and will not be repeated here.
- Compared with the foregoing embodiment, the speech recognition error correction method disclosed in this embodiment further extracts keywords from the first recognition result, where the keywords can be vocabulary with domain characteristics.
- The second recognition of the voice data then refers to both the context information of the first recognition result and the keywords, which can further improve the accuracy of the second recognition result.
- the voice data may be input into a pre-trained voice recognition model to obtain the first recognition result.
- The pre-trained speech recognition model may specifically be a traditional speech recognition model, or a speech recognition model generated by training a preset model on a recognition training data set.
- the recognition training data set includes at least one set of recognition training data, and each set of recognition training data includes a piece of text corresponding to the voice data, and the acoustic characteristics of the piece of voice data.
- the preset model can be any neural network model, which is not limited in this application.
- Each piece of recognition training data in the recognition training data set can be obtained in the following way: obtain a piece of voice data; manually label the voice data to obtain the text corresponding to the voice data; extract the acoustic features of the voice data; and generate a piece of recognition training data that includes the text corresponding to the voice data and the acoustic features of the voice data.
- The voice data can be received through the microphone of a smart terminal, an electronic device with a voice recognition function such as a smartphone, computer, translator, robot, smart home device, or smart home appliance. Pre-stored voice data can also be obtained. Of course, other ways of obtaining voice data are also within the protection scope of this application, and this application does not impose any limitation on this.
- The acoustic features of each piece of voice data may be spectral features of the voice data, such as MFCC (Mel-Frequency Cepstral Coefficients) or FBank features. Any mainstream acoustic feature extraction method can be used to extract the acoustic features of each piece of voice data, which is not limited in this application.
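As a hedged illustration of this step, the sketch below computes MFCC or log mel filterbank (FBank) features with librosa; the 16 kHz sampling rate, 13 coefficients, and 40 mel bands are common choices, not values mandated by the application.

```python
# Sketch: extract MFCC or log-FBank features for one utterance.
import librosa

def extract_acoustic_features(wav_path, kind="mfcc"):
    y, sr = librosa.load(wav_path, sr=16000)  # resample to 16 kHz
    if kind == "mfcc":
        feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    else:  # log mel filterbank (FBank)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)
        feats = librosa.power_to_db(mel)
    return feats.T  # shape: (frames, feature_dim)
```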
- The preset model used when training the speech recognition model can be the traditional attention-based encoder-decoder model structure (encoding and decoding based on the attention mechanism), or it can be another model structure; this application does not impose any restrictions.
- When the preset model is trained based on the recognition training data, the acoustic features of the voice data in each piece of recognition training data are used as the input of the preset model, the text corresponding to the voice data in each piece of recognition training data is used as the training target, and the parameters of the preset model are trained.
- Optionally, NER (Named Entity Recognition) can be used to extract the keywords in the first recognition result.
- Of course, other ways of extracting the keywords in the first recognition result are also within the protection scope of this application.
- a manual method can be used to extract keywords from the first recognition result.
- When NER is implemented based on a neural network model, extracting the keywords in the first recognition result may specifically be implemented as follows: input the first recognition result into a pre-trained keyword extraction model to obtain the keywords in the first recognition result.
- The keyword extraction model can be generated by training a preset model structure on an extraction training data set.
- The extraction training data set includes at least one set of extraction training data, and each set of extraction training data includes a text in which the professional vocabulary with domain characteristics has been marked.
- Each text can be a text from a special scenario.
- Specifically, a manual labeling method can be used to mark the professional vocabulary with domain characteristics appearing in each text.
- The preset model can be a deep-learning-based BiLSTM_CRF (Bidirectional Long Short-Term Memory_Conditional Random Field) model.
- For example, assuming the first recognition result is "autophagy can inhibit the occurrence of cancer. Many people's cognition is just the opposite. Therefore, therapies used to inhibit autophagy may have bad consequences.", the keyword extraction model can output the keywords: autophagy, cancer, therapy.
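The BiLSTM_CRF tagger itself is standard sequence labelling, so the sketch below shows only the post-processing the example implies: turning the model's per-token tags into the keyword list. The B-KW/I-KW/O tagging scheme is an assumption for illustration.

```python
# Sketch: collect keyword spans from BIO tags produced by a trained tagger.
# Tokens may be characters (Chinese) or words; spans are joined directly.
def bio_tags_to_keywords(tokens, tags):
    keywords, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B-KW":                 # a new keyword starts
            if current:
                keywords.append("".join(current))
            current = [tok]
        elif tag == "I-KW" and current:   # keyword continues
            current.append(tok)
        else:                             # outside any keyword
            if current:
                keywords.append("".join(current))
            current = []
    if current:
        keywords.append("".join(current))
    return keywords

print(bio_tags_to_keywords(
    ["autophagy", "can", "inhibit", "cancer"],
    ["B-KW", "O", "O", "B-KW"]))
# -> ['autophagy', 'cancer']
```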
- When NER is implemented based on a statistical model, extracting the keywords in the first recognition result may specifically be implemented as follows: input the first recognition result into the statistical model to obtain the keywords in the first recognition result.
- The construction of statistical models is a mature technology, which will not be repeated in this application.
- When the second recognition of the speech data with reference to the context information of the first recognition result is implemented based on a neural network model, the acoustic features of the speech data and the first recognition result can be input into a pre-trained speech error correction recognition model to obtain the second recognition result.
- The speech error correction recognition model is obtained by training a preset model with an error correction training data set;
- the error correction training data set includes at least one group of error correction training data, and each group includes the acoustic features corresponding to a piece of voice data, the text corresponding to the piece of voice data, and the first recognition result corresponding to the piece of voice data.
- When the speech error correction recognition model is trained, the acoustic features corresponding to the piece of speech data and the first recognition result corresponding to the piece of speech data are the input of the preset speech error correction recognition model structure, and the text corresponding to the piece of speech data is the training target.
- Each group of error correction training data can be obtained in the following way: obtain a piece of voice data; manually label the voice data to obtain the text corresponding to the voice data; extract the acoustic features of the voice data; and input the voice data into a pre-trained voice recognition model to obtain the first recognition result corresponding to the voice data.
- When the second recognition of the speech data with reference to the context information of the first recognition result and the keywords is implemented based on a neural network model, the acoustic features of the voice data, the first recognition result, and the keywords can be input into a pre-trained voice error correction recognition model to obtain the second recognition result.
- In this case, the voice error correction recognition model is likewise obtained by training a preset model with an error correction training data set; the error correction training data set includes at least one group of error correction training data, and each group includes the acoustic features corresponding to a piece of voice data, the text corresponding to the piece of voice data, the first recognition result corresponding to the piece of voice data, and the keywords in the first recognition result.
- When the speech error correction recognition model is trained, the acoustic features corresponding to the piece of speech data, the first recognition result corresponding to the piece of speech data, and the keywords in the first recognition result are the input of the preset speech error correction recognition model structure, and the text corresponding to the piece of speech data is the training target.
- Each group of error correction training data can be obtained in the following way: obtain a piece of voice data; manually label the voice data to obtain the text corresponding to the voice data; extract the acoustic features of the voice data; input the voice data into the pre-trained speech recognition model to obtain the first recognition result corresponding to the voice data; and input the first recognition result into the pre-trained keyword extraction model to obtain the keywords in the first recognition result.
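Putting the pieces together, one group of error correction training data can be represented as below. This is a sketch under stated assumptions: `asr_model.recognize` and `keyword_model.extract` are hypothetical interfaces standing in for the pre-trained speech recognition and keyword extraction models.

```python
# Sketch: one group of error correction training data (second variant,
# with keywords). The two model interfaces are hypothetical stand-ins.
from dataclasses import dataclass
from typing import List

import numpy as np

@dataclass
class ErrorCorrectionExample:
    acoustic_features: np.ndarray  # model input: features of the speech
    first_result: str              # model input: first recognition result
    keywords: List[str]            # model input: keywords in that result
    transcript: str                # training target: manually labelled text

def build_example(feats, transcript, asr_model, keyword_model):
    first = asr_model.recognize(feats)   # first recognition result
    kws = keyword_model.extract(first)   # keywords in the first result
    return ErrorCorrectionExample(feats, first, kws, transcript)
```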
- As can be seen, the embodiment of the present application can obtain the second recognition result in two ways, both implemented based on the speech error correction recognition model.
- The difference lies in the model's input data: in the first way, the input is the acoustic features of the voice data and the first recognition result; in the second way, the input is the acoustic features of the voice data, the first recognition result, and the keywords extracted from the first recognition result.
- That is, compared with the first way, the second way adds keyword information to the model's input data.
- Inputting the acoustic features of the speech data, the first recognition result, and the keywords into a pre-trained speech error correction recognition model to obtain the second recognition result may specifically be implemented as follows:
- the voice error correction recognition model is used to encode the acoustic features of the voice data, the first recognition result, and the keywords, to perform attention calculation, and to obtain the second recognition result based on the calculation result.
- FIG. 3 is a schematic diagram of the topology structure of a preset model for training a speech error correction recognition model disclosed in an embodiment of the application.
- The model includes three layers: an encoding layer, an attention layer, and a decoding layer.
- the function of the encoding layer is to extract high-level features;
- the function of the attention layer is to calculate the correlation between the input of the layer and the final output result
- the input of the decoding layer is the output of the attention layer
- the output of the decoding layer is the output result at the current moment.
- the specific form of the decoding layer may be a single-layer neural network with softmax, which is not limited in this application.
- The coding layer can be further divided into three parts: the first coding module, the second coding module, and the third coding module.
- The specific structure of the first, second, and third coding modules can be a bidirectional RNN (Recurrent Neural Network) with an inverted pyramid structure, or a CNN (Convolutional Neural Network), which is not limited in this application.
- the attention layer can also be further divided into three parts, namely the first attention module, the second attention module, and the third attention module.
- The specific structure of the first, second, and third attention modules may be a bidirectional RNN or a unidirectional RNN, which is not limited in this application.
- The input of the first encoding module is the acoustic feature X corresponding to the voice data to be recognized, and its output is the high-level acoustic feature Ha.
- The input of the second encoding module is the characterization P of the first recognition result corresponding to the voice data to be recognized, and its output is the high-level feature Hw of the first recognition result.
- The input of the third encoding module is the characterization Q of the keywords in the first recognition result of the voice data to be recognized, and its output is the high-level feature Hr of the keywords.
- The output result y_{i-1} at the previous moment is a common input of the first, second, and third attention modules, but each module has its own remaining input and output:
- the input of the first attention module is Ha, and its output is the speech-related hidden layer state sa_i and semantic vector ca_i;
- the input of the second attention module is Hw, and its output is the hidden layer state sw_i and semantic vector cw_i related to the first recognition result;
- the input of the third attention module is Hr, and its output is the hidden layer state sr_i and semantic vector cr_i related to the keywords in the first recognition result.
- The input of the decoding layer is the output of the attention layer, namely sa_i, ca_i, sw_i, cw_i, sr_i, cr_i, and the output of the decoding layer is the output result y_i at the current moment; y_i is the recognition result of the voice data to be recognized.
- Letting P(y_i) denote the probability that the output result at the current moment is y_i:
- P(y_i) = Decode(sa_i, sw_i, sr_i, ca_i, cw_i, cr_i)
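A PyTorch sketch may make the Figure 3 data flow concrete. It is not the application's implementation: each encoder is simplified to a single BiGRU (the text allows pyramidal BiRNNs or CNNs), each attention module is modeled as a GRU cell plus additive attention, Decode is a linear layer with softmax, decoding is teacher-forced, and all dimensions are illustrative.

```python
# Sketch of the Figure 3 topology: three encoders, three attention
# modules sharing y_{i-1}, and a decode layer over the six outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionModule(nn.Module):
    """Updates a hidden state s from (previous semantic vector, previous
    output embedding), then attends over one encoder's features H."""
    def __init__(self, h_dim, y_dim):
        super().__init__()
        self.cell = nn.GRUCell(h_dim + y_dim, h_dim)
        self.score = nn.Linear(2 * h_dim, 1)

    def forward(self, H, c_prev, y_prev, s_prev):
        s = self.cell(torch.cat([c_prev, y_prev], dim=-1), s_prev)
        s_exp = s.unsqueeze(1).expand(-1, H.size(1), -1)
        a = F.softmax(self.score(torch.cat([H, s_exp], dim=-1)), dim=1)
        c = (a * H).sum(dim=1)  # semantic vector
        return s, c             # hidden layer state, semantic vector

class SpeechErrorCorrectionModel(nn.Module):
    def __init__(self, x_dim, p_dim, q_dim, h_dim, y_dim, vocab_size):
        super().__init__()
        # h_dim is assumed even so the BiGRU output width equals h_dim.
        bigru = lambda d: nn.GRU(d, h_dim // 2, bidirectional=True,
                                 batch_first=True)
        self.enc_x, self.enc_p, self.enc_q = bigru(x_dim), bigru(p_dim), bigru(q_dim)
        self.att_x = AttentionModule(h_dim, y_dim)
        self.att_p = AttentionModule(h_dim, y_dim)
        self.att_q = AttentionModule(h_dim, y_dim)
        self.decode = nn.Linear(6 * h_dim, vocab_size)  # Decode(sa,sw,sr,ca,cw,cr)

    def forward(self, X, P, Q, y_embs):
        Ha, _ = self.enc_x(X)  # high-level acoustic features
        Hw, _ = self.enc_p(P)  # high-level features of the first result
        Hr, _ = self.enc_q(Q)  # high-level features of the keywords
        B, h = X.size(0), Ha.size(-1)
        state = [[X.new_zeros(B, h), X.new_zeros(B, h)] for _ in range(3)]
        outputs = []
        for y_prev in y_embs.unbind(dim=1):  # teacher-forced y_{i-1}
            state[0] = self.att_x(Ha, state[0][1], y_prev, state[0][0])
            state[1] = self.att_p(Hw, state[1][1], y_prev, state[1][0])
            state[2] = self.att_q(Hr, state[2][1], y_prev, state[2][0])
            (sa, ca), (sw, cw), (sr, cr) = state
            logits = self.decode(torch.cat([sa, sw, sr, ca, cw, cr], dim=-1))
            outputs.append(F.log_softmax(logits, dim=-1))  # log P(y_i)
        return torch.stack(outputs, dim=1)
```

Note how the six tensors sa_i, sw_i, sr_i, ca_i, cw_i, cr_i are concatenated before the final projection, mirroring P(y_i) = Decode(sa_i, sw_i, sr_i, ca_i, cw_i, cr_i) above.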
- Based on the model structure shown in Figure 3, using the voice error correction recognition model to encode the acoustic features of the voice data, the first recognition result, and the keywords, perform attention calculation, and obtain the second recognition result based on the calculation result may specifically be implemented as follows:
- use the coding layer and the attention layer of the speech error correction recognition model to respectively encode the acoustic features of the speech data, the first recognition result, and the keywords and perform attention calculation to obtain the calculation result; then use the decoding layer of the speech error correction recognition model to decode the calculation result to obtain the second recognition result.
- Specifically, respectively encoding the acoustic features of the speech data, the first recognition result, and the keywords and performing attention calculation to obtain the calculation result can be implemented as follows:
- use the coding layer of the speech error correction recognition model to separately encode each target object to obtain the high-level features of each target object;
- use the attention layer of the speech error correction recognition model to perform attention calculation on the semantic vector at the previous moment related to each target object and the output result of the speech error correction recognition model at the previous moment, to obtain the hidden layer state related to each target object;
- then perform attention calculation on the high-level features of each target object and the hidden layer state related to each target object, to obtain the semantic vector related to each target object;
- the target objects include the acoustic features of the voice data, the first recognition result, and the keywords.
- The above example is an optional processing procedure of the voice error correction recognition model when the input data is the acoustic features of the voice data, the first recognition result, and the keywords.
- When the input data is only the acoustic features of the voice data and the first recognition result, all the model structures and processing procedures related to keywords in Figure 3 can be omitted; that is, the third encoding module and the third attention module are removed from the voice error correction recognition model, and the rest of the model structure remains unchanged. For the specific process, refer to the previous introduction, which will not be repeated here.
- FIG. 4 is a schematic diagram of the topology structure of another preset model for training a speech error correction recognition model disclosed in an embodiment of the application.
- The model includes three layers: the coding layer, the attention layer, and the decoding layer.
- the function of the encoding layer is to extract high-level features;
- the function of the attention layer is to calculate the correlation between the input of this layer and the final output result
- the input of the decoding layer is the output of the attention layer
- the output of the decoding layer is the output result at the current moment.
- the specific form of the decoding layer can be a single-layer neural network with softmax, which is not limited in this application.
- The input of the coding layer is the acoustic feature X corresponding to the voice data to be recognized, the characterization P of the first recognition result corresponding to the voice data to be recognized, and the characterization Q of the keywords in the first recognition result of the voice data to be recognized.
- The output of the coding layer is the merged vector [Ha, Hw, Hr] composed of the high-level feature Ha of the acoustic features, the high-level feature Hw of the first recognition result, and the high-level feature Hr of the keywords.
- The output of the coding layer and the output result y_{i-1} of the model at the previous moment are the input of the attention layer. The output of the attention layer is [sa_i, ca_i, sw_i, cw_i, sr_i, cr_i], composed of the speech-related hidden layer state sa_i and semantic vector ca_i, the hidden layer state sw_i and semantic vector cw_i related to the first recognition result, and the hidden layer state sr_i and semantic vector cr_i related to the keywords in the first recognition result.
- The input of the decoding layer is the output of the attention layer, and the output of the decoding layer is the output result y_i at the current moment; y_i is the recognition result of the voice data to be recognized.
- Based on the model structure shown in Figure 4, using the voice error correction recognition model to encode the acoustic features of the voice data, the first recognition result, and the keywords, perform attention calculation, and obtain the second recognition result based on the calculation result may specifically be implemented as follows:
- merge the acoustic features of the voice data, the first recognition result, and the keywords to obtain a merged vector;
- use the coding layer and the attention layer of the speech error correction recognition model to encode the merged vector and perform attention calculation to obtain the calculation result; then use the decoding layer of the speech error correction recognition model to decode the calculation result to obtain the second recognition result.
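The application does not pin down the merge operation itself. One plausible reading, sketched below, is to project X, P, and Q to a shared dimension and concatenate them along the time axis before the single encoder; both the projections and the concatenation axis are assumptions here.

```python
# Sketch of an assumed merge step for the Figure 4 variant: project each
# input sequence to a shared width d, then concatenate along time.
import torch
import torch.nn as nn

class InputMerger(nn.Module):
    def __init__(self, x_dim, p_dim, q_dim, d):
        super().__init__()
        self.proj_x = nn.Linear(x_dim, d)  # acoustic features X
        self.proj_p = nn.Linear(p_dim, d)  # first-result characterization P
        self.proj_q = nn.Linear(q_dim, d)  # keyword characterization Q

    def forward(self, X, P, Q):
        # (B, Tx+Tp+Tq, d): one sequence for the single coding layer
        return torch.cat([self.proj_x(X), self.proj_p(P), self.proj_q(Q)], dim=1)
```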
- Specifically, encoding the merged vector and performing attention calculation to obtain the calculation result may be implemented as follows:
- use the coding layer of the speech error correction recognition model to encode the merged vector to obtain the high-level features of the merged vector;
- use the attention layer of the speech error correction recognition model to perform attention calculation on the semantic vector at the previous moment related to the merged vector and the output result of the speech error correction recognition model at the previous moment, to obtain the hidden layer state related to the merged vector; then perform attention calculation on the high-level features of the merged vector and the hidden layer state related to the merged vector, to obtain the semantic vector related to the merged vector.
- It should be noted that the attention layer of a traditional speech recognition model mainly focuses on the correlation between the model's output result and the acoustic features of the speech data.
- In contrast, the speech error correction recognition model in this application integrates the first recognition result of the speech data and the keywords in the first recognition result into the attention layer, so that the output result of the speech error correction recognition model can attend to both the error correction information of the recognition result and the context information of the recognition result.
- Through training, the speech error correction recognition model can learn an attention mechanism relating the output result to the context information, as well as an attention mechanism relating the output result to error correction, and it uses these two attention mechanisms to find the context information and error correction information that the current voice data needs to attend to. That is, according to the input voice data, the model can automatically choose whether to pay attention to the first recognition result and the keyword information in the first recognition result, which means the speech error correction recognition model has the ability to automatically correct errors based on the first recognition result and the keywords in the first recognition result.
- the above example is another optional processing procedure of the voice error correction recognition model when the input data is the acoustic feature of the voice data, the first recognition result, and the keyword.
- When the input data is only the acoustic features of the voice data and the first recognition result, the input of the coding layer in FIG. 4 is the merged vector [X, P] composed of the acoustic feature X corresponding to the voice data to be recognized and the characterization P of the first recognition result, and the output of the coding layer is the merged vector [Ha, Hw] composed of the high-level feature Ha of the acoustic features and the high-level feature Hw of the first recognition result.
- The output of the attention layer is [sa_i, ca_i, sw_i, cw_i], composed of the speech-related hidden layer state sa_i and semantic vector ca_i and the hidden layer state sw_i and semantic vector cw_i related to the first recognition result.
- The input of the decoding layer is the output of the attention layer, and the output of the decoding layer is the output result y_i at the current moment; y_i is the recognition result of the voice data to be recognized.
- Compared with the case above, the model's input omits the keyword information; the only difference is that the keyword information is removed from the input merged vector of the coding layer, and the remaining layers of the model process the coding layer's input according to the original processing logic. For the specific process, refer to the previous introduction, which will not be repeated here.
- this application also provides an implementation method for generating a recognition training data set and an error correction training data set, which are specifically as follows:
- First, voice data is collected through smart terminals; a smart terminal is an electronic device with a voice recognition function, such as a smartphone, computer, translator, robot, or smart home device (home appliance).
- Each piece of voice data is manually labeled; that is, each piece of voice data is manually transcribed into its corresponding text.
- Then, the acoustic features of each piece of voice data are extracted; these are generally spectral features of the voice data, such as MFCC or FBank features.
- The specific method for acquiring the acoustic features is an existing method, and the introduction is not repeated here.
- In this way, the acoustic features of each piece of voice data and the manually labeled text corresponding to the voice data are obtained.
- The acoustic features of the voice data obtained in the above steps and the manually labeled texts corresponding to the voice data are divided into two parts: the first part is denoted set A, and the second part is denoted set B.
- For example, if the acoustic features and the corresponding manually labeled texts obtained in the above steps total 1 million groups, these 1 million groups are randomly divided into two sets of equal size, namely set A and set B.
- Both set A and set B include multiple groups of training data, and each group of training data includes the acoustic features corresponding to a piece of voice data and the manually labeled text corresponding to the voice data.
- Set A serves as the recognition training data set: the speech recognition model is obtained by training the preset model on set A. The voice data in set B is then recognized with this speech recognition model, and keywords are extracted from each recognition result, yielding set C.
- Set C includes multiple groups of training data; each group includes the acoustic features corresponding to a piece of voice data, the manually labeled text corresponding to the voice data, the recognition result of the voice data, and the keywords in the recognition result. Set C serves as the error correction training data set: by training the preset model on set C, the speech error correction recognition model is obtained.
- It should be noted that the above recognition training data set and error correction training data set were described as containing keywords. If the input data of the speech error correction recognition model only needs the acoustic features of the speech data and the first recognition result, that is, when no keyword information is involved, the step of obtaining keywords in the above process can be omitted, and the resulting recognition training data set and error correction training data set need not contain keywords.
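The data pipeline of this section can be summarized in a short sketch; `asr_trainer.train`, `asr_model.recognize`, and `keyword_model.extract` are hypothetical stand-ins for whatever toolkit and models are actually used.

```python
# Sketch of the A/B/C pipeline: A trains the recognizer, B (re-recognized
# plus extracted keywords) becomes the error correction set C.
import random

def build_datasets(pairs, asr_trainer, keyword_model):
    """pairs: list of (acoustic_features, manually_labelled_text) tuples."""
    random.shuffle(pairs)
    half = len(pairs) // 2
    set_a, set_b = pairs[:half], pairs[half:]   # e.g. 500k + 500k groups

    asr_model = asr_trainer.train(set_a)        # recognition model from A

    set_c = []
    for feats, text in set_b:
        first = asr_model.recognize(feats)      # first recognition result
        kws = keyword_model.extract(first)      # keywords in that result
        set_c.append((feats, text, first, kws)) # one group of set C
    return asr_model, set_c
```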
- the speech recognition error correction device disclosed in the embodiments of the present application will be described below.
- the speech recognition error correction device described below and the speech recognition error correction method described above can be referred to each other.
- FIG. 5 is a schematic structural diagram of a speech recognition error correction device disclosed in an embodiment of the application.
- the speech recognition error correction device may include:
- the acquiring unit 51 is configured to acquire the voice data to be recognized and the first recognition result thereof;
- the first voice recognition unit 52 is configured to refer to the context information of the first recognition result, perform a second recognition on the voice data, and obtain the second recognition result;
- the recognition result determining unit 53 is configured to determine the final recognition result according to the second recognition result.
- Referring to FIG. 6, another speech recognition error correction device may include:
- the acquiring unit 51 is configured to acquire the voice data to be recognized and the first recognition result thereof;
- the keyword extraction unit 54, used to extract keywords from the first recognition result;
- the second voice recognition unit 55 is configured to refer to the context information of the first recognition result and the keywords, and perform a second recognition on the voice data to obtain the second recognition result;
- the recognition result determining unit 53 is configured to determine the final recognition result according to the second recognition result.
- the keyword extraction unit includes:
- the domain vocabulary extraction unit is used to extract vocabularies with domain characteristics from the first recognition result as keywords.
- the second speech recognition unit includes:
- an acoustic feature obtaining unit, configured to obtain the acoustic features of the voice data;
- the model processing unit is configured to input the acoustic features of the voice data, the first recognition result, and the keywords into a pre-trained speech error correction recognition model to obtain the second recognition result; the speech error correction recognition model is obtained by training a preset model with an error correction training data set;
- the error correction training data set includes at least one set of error correction training data, and each set of error correction training data includes the acoustic feature corresponding to a piece of speech data, the text corresponding to that piece of speech data, the first recognition result corresponding to that piece of speech data, and the keywords in that first recognition result.
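A minimal sketch of how the units above cooperate at inference time follows; `feature_extractor`, `keyword_extractor`, and `model` are hypothetical callables standing in for the acoustic front end, the keyword extraction unit, and the pre-trained speech error correction recognition model.

```python
def second_recognition(voice_data, first_result, feature_extractor,
                       keyword_extractor, model):
    # Acoustic feature obtaining unit: e.g. filterbank or spectral frames.
    acoustic_features = feature_extractor(voice_data)
    # Keyword extraction unit: domain-specific vocabulary from the first pass.
    keywords = keyword_extractor(first_result)
    # Model processing unit: all three inputs go to the pre-trained model.
    return model(acoustic_features, first_result, keywords)
```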
- the model processing unit includes:
- An encoding and attention calculation unit configured to use the voice error correction recognition model to encode the acoustic features of the voice data, the first recognition result, and the keywords, and to perform attention calculation;
- the recognition unit is used to obtain the second recognition result based on the calculation result.
- the encoding and attention calculation unit includes a first encoding and attention calculation unit, and the recognition unit includes a first decoding unit;
- the first coding and attention calculation unit is configured to use the coding layer and the attention layer of the speech error correction recognition model to perform encoding and attention calculation on the acoustic features of the speech data, the first recognition result, and the keywords, respectively, to obtain the calculation result;
- the first decoding unit is configured to use the decoding layer of the speech error correction recognition model to decode the calculation result to obtain the second recognition result.
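Schematically, this separate-encoding path can be written as follows. This is a sketch only: `encode`, `attend`, and `decode` stand for the model's coding layer, attention layer, and decoding layer, whose internals are detailed below.

```python
def separate_encoding_path(targets, encode, attend, decode):
    """targets: [acoustic_features, first_result, keywords]."""
    # Each target object is encoded and attended to independently;
    # the decoding layer consumes the per-object calculation results.
    calculation_results = [attend(encode(target)) for target in targets]
    return decode(calculation_results)   # second recognition result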
- the model processing unit further includes a merging unit, the coding and attention calculation unit includes a second coding and attention calculation unit, and the recognition unit includes a second decoding unit:
- the merging unit is configured to merge the acoustic features of the voice data, the first recognition result, and the keywords to obtain a merged vector;
- the second coding and attention calculation unit is configured to use the coding layer and the attention layer of the speech error correction recognition model to perform coding and attention calculation on the merged vector to obtain the calculation result;
- the second decoding unit is configured to use the decoding layer of the speech error correction recognition model to decode the calculation result to obtain the second recognition result.
- the first coding and attention calculation unit includes:
- the first coding unit is configured to use the coding layer of the speech error correction recognition model to encode each target object respectively, to obtain the acoustic high-level features of each target object;
- the first attention calculation unit is configured to use the attention layer of the speech error correction recognition model to perform attention calculation, for each target object, on the semantic vector of the previous time step related to that target object and the output result of the speech error correction recognition model at the previous time step, to obtain the hidden layer state related to each target object; and to use the attention layer of the speech error correction recognition model to perform attention calculation, for each target object, on the acoustic high-level features of that target object and the hidden layer state related to that target object, to obtain the semantic vector related to each target object; wherein the target objects include the acoustic features of the speech data, the first recognition result, and the keywords.
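One decoding time step of the per-object attention described above might be sketched as follows. The shapes, the bilinear score form, and the toy `rnn_cell` are illustrative assumptions; the application does not fix a particular attention form.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(enc_states, prev_semantic, prev_output, rnn_cell, score_w):
    """enc_states: (T, d) acoustic high-level features of one target object;
    prev_semantic, prev_output: (d,) vectors from the previous time step."""
    # Hidden layer state from the previous semantic vector and previous output.
    state = rnn_cell(np.concatenate([prev_semantic, prev_output]))   # (d,)
    # Attention over the encoder states, queried by the new hidden state
    # (a bilinear score, one common choice).
    weights = softmax(enc_states @ score_w @ state)                  # (T,)
    semantic = weights @ enc_states                                  # (d,)
    return state, semantic

# Toy usage: one step for one target object.
d, T = 4, 6
rng = np.random.default_rng(0)
rnn_cell = lambda x: np.tanh(x[:d] + x[d:])   # stand-in for a recurrent cell
state, semantic = attention_step(rng.normal(size=(T, d)), np.zeros(d),
                                 np.zeros(d), rnn_cell, rng.normal(size=(d, d)))
```

In this sketch the step is run once per target object at each decoding time step, and the resulting per-object semantic vectors are what the decoding layer consumes.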
- the second encoding and attention calculation unit includes:
- the second coding unit is configured to use the coding layer of the speech error correction recognition model to encode the merged vector to obtain the acoustic high-level features of the merged vector;
- the second attention calculation unit is configured to use the attention layer of the speech error correction recognition model to perform attention calculation on the semantic vector of the previous time step related to the merged vector and the output result of the speech error correction recognition model at the previous time step, to obtain the hidden layer state related to the merged vector; and to use the attention layer of the speech error correction recognition model to perform attention calculation on the acoustic high-level features of the merged vector and the hidden layer state related to the merged vector, to obtain the semantic vector related to the merged vector.
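The merged-vector variant differs only in that a single encoding/attention path is run. In the sketch below, concatenation along the time axis is one illustrative choice of merge (the application does not prescribe a specific merge operation), and the first recognition result and keywords are assumed to have already been embedded to the same feature dimension as the acoustic frames.

```python
import numpy as np

def merged_encoding_path(acoustic_feats, first_result_emb, keyword_emb,
                         encode, attend, decode):
    # All inputs are (time, dim) arrays with the same feature dimension.
    merged = np.concatenate([acoustic_feats, first_result_emb, keyword_emb],
                            axis=0)                  # the merged vector
    return decode(attend(encode(merged)))            # second recognition result
```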
- the recognition result determining unit includes:
- a confidence obtaining unit, configured to obtain the confidence of the first recognition result and the confidence of the second recognition result;
- the determining unit is configured to determine, from the first recognition result and the second recognition result, the recognition result with the higher confidence as the final recognition result.
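The selection logic of the recognition result determining unit reduces to a comparison; how the confidences themselves are computed (e.g. normalized model scores) is left open by the application.

```python
def determine_final_result(first_result, first_confidence,
                           second_result, second_confidence):
    # Keep whichever recognition pass is more confident.
    return second_result if second_confidence >= first_confidence else first_result
```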
- Fig. 7 shows a hardware structure block diagram of a speech recognition error correction system.
- the hardware structure of the speech recognition error correction system may include: at least one processor 1, at least one communication interface 2, at least one memory 3, and at least one communication bus 4;
- the processor 1, the communication interface 2, and the memory 3 communicate with one another through the communication bus 4;
- the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;
- the memory 3 may include a high-speed RAM memory, and may also include a non-volatile memory, for example, at least one disk memory;
- the memory stores a program, and the processor may call the program stored in the memory, the program being used for: acquiring the voice data to be recognized and its first recognition result; performing a second recognition on the voice data with reference to the context information of the first recognition result, to obtain a second recognition result; and determining the final recognition result according to the second recognition result.
- Optionally, the program may further be used for: acquiring the voice data to be recognized and its first recognition result; extracting keywords from the first recognition result; performing a second recognition on the voice data with reference to the context information of the first recognition result and the keywords, to obtain a second recognition result; and determining the final recognition result according to the second recognition result.
- The embodiments of the present application further provide a storage medium storing a program executable by a processor, the program being used for: acquiring the voice data to be recognized and its first recognition result; performing a second recognition on the voice data with reference to the context information of the first recognition result, to obtain a second recognition result; and determining the final recognition result according to the second recognition result.
- Optionally, the program may further be used for: acquiring the voice data to be recognized and its first recognition result; extracting keywords from the first recognition result; performing a second recognition on the voice data with reference to the context information of the first recognition result and the keywords, to obtain a second recognition result; and determining the final recognition result according to the second recognition result.
- The embodiments of the present application further provide a computer program product which, when run on a terminal device, causes the terminal device to execute any one of the above speech recognition error correction methods.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Machine Translation (AREA)
Claims (15)
- A speech recognition error correction method, characterized in that the method comprises: acquiring speech data to be recognized and a first recognition result thereof; performing a second recognition on the speech data with reference to context information of the first recognition result, to obtain a second recognition result; and determining a final recognition result according to the second recognition result.
- A speech recognition error correction method, characterized in that the method comprises: acquiring speech data to be recognized and a first recognition result thereof; extracting keywords from the first recognition result; performing a second recognition on the speech data with reference to context information of the first recognition result and the keywords, to obtain a second recognition result; and determining a final recognition result according to the second recognition result.
- The method according to claim 2, characterized in that extracting keywords from the first recognition result comprises: extracting, from the first recognition result, vocabulary with domain characteristics as the keywords.
- The method according to claim 2, characterized in that performing a second recognition on the speech data with reference to the context information of the first recognition result and the keywords, to obtain a second recognition result, comprises: acquiring acoustic features of the speech data; and inputting the acoustic features of the speech data, the first recognition result, and the keywords into a pre-trained speech error correction recognition model to obtain the second recognition result, the speech error correction recognition model being obtained by training a preset model with an error correction training data set; wherein the error correction training data set includes at least one set of error correction training data, and each set of error correction training data includes an acoustic feature corresponding to a piece of speech data, a text corresponding to the piece of speech data, a first recognition result corresponding to the piece of speech data, and keywords in the first recognition result.
- The method according to claim 4, characterized in that inputting the acoustic features of the speech data, the first recognition result, and the keywords into the pre-trained speech error correction recognition model to obtain the second recognition result comprises: using the speech error correction recognition model to perform encoding and attention calculation on the acoustic features of the speech data, the first recognition result, and the keywords, and obtaining the second recognition result based on the calculation result.
- The method according to claim 5, characterized in that using the speech error correction recognition model to perform encoding and attention calculation on the acoustic features of the speech data, the first recognition result, and the keywords, and obtaining the second recognition result based on the calculation result, comprises: using the encoding layer and the attention layer of the speech error correction recognition model to perform encoding and attention calculation on the acoustic features of the speech data, the first recognition result, and the keywords, respectively, to obtain the calculation result; and using the decoding layer of the speech error correction recognition model to decode the calculation result, to obtain the second recognition result.
- The method according to claim 5, characterized in that using the speech error correction recognition model to perform encoding and attention calculation on the acoustic features of the speech data, the first recognition result, and the keywords, and obtaining the second recognition result based on the calculation result, comprises: merging the acoustic features of the speech data, the first recognition result, and the keywords, to obtain a merged vector; using the encoding layer and the attention layer of the speech error correction recognition model to perform encoding and attention calculation on the merged vector, to obtain the calculation result; and using the decoding layer of the speech error correction recognition model to decode the calculation result, to obtain the second recognition result.
- The method according to claim 6, characterized in that using the encoding layer and the attention layer of the speech error correction recognition model to perform encoding and attention calculation on the acoustic features of the speech data, the first recognition result, and the keywords, respectively, to obtain the calculation result, comprises: using the encoding layer of the speech error correction recognition model to encode each target object respectively, to obtain acoustic high-level features of each target object; using the attention layer of the speech error correction recognition model to perform attention calculation, respectively, on the semantic vector of the previous time step related to each target object and the output result of the speech error correction recognition model at the previous time step, to obtain a hidden layer state related to each target object; and using the attention layer of the speech error correction recognition model to perform attention calculation, respectively, on the acoustic high-level features of each target object and the hidden layer state related to each target object, to obtain a semantic vector related to each target object; wherein the target objects include the acoustic features of the speech data, the first recognition result, and the keywords.
- The method according to claim 7, characterized in that using the encoding layer and the attention layer of the speech error correction recognition model to perform encoding and attention calculation on the merged vector, to obtain the calculation result, comprises: using the encoding layer of the speech error correction recognition model to encode the merged vector, to obtain acoustic high-level features of the merged vector; using the attention layer of the speech error correction recognition model to perform attention calculation on the semantic vector of the previous time step related to the merged vector and the output result of the speech error correction recognition model at the previous time step, to obtain a hidden layer state related to the merged vector; and using the attention layer of the speech error correction recognition model to perform attention calculation on the acoustic high-level features of the merged vector and the hidden layer state related to the merged vector, to obtain a semantic vector related to the merged vector.
- The method according to claim 2, characterized in that determining the final recognition result according to the second recognition result comprises: acquiring a confidence of the first recognition result and a confidence of the second recognition result; and determining, from the first recognition result and the second recognition result, the recognition result with the higher confidence as the final recognition result.
- A speech recognition error correction device, characterized in that the device comprises: an acquiring unit, configured to acquire speech data to be recognized and a first recognition result thereof; a first speech recognition unit, configured to perform a second recognition on the speech data with reference to context information of the first recognition result, to obtain a second recognition result; and a recognition result determining unit, configured to determine a final recognition result according to the second recognition result.
- A speech recognition error correction device, characterized in that the device comprises: an acquiring unit, configured to acquire speech data to be recognized and a first recognition result thereof; a keyword extraction unit, configured to extract keywords from the first recognition result; a second speech recognition unit, configured to perform a second recognition on the speech data with reference to context information of the first recognition result and the keywords, to obtain a second recognition result; and a recognition result determining unit, configured to determine a final recognition result according to the second recognition result.
- A speech recognition error correction system, characterized by comprising a memory and a processor; the memory is configured to store a program; and the processor is configured to execute the program to implement the steps of the speech recognition error correction method according to any one of claims 1 to 10.
- A readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the steps of the speech recognition error correction method according to any one of claims 1 to 10 are implemented.
- A computer program product which, when run on a terminal device, causes the terminal device to execute the method according to any one of claims 1 to 10.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2022522366A JP7514920B2 (ja) | 2019-11-25 | 2020-11-17 | 音声認識誤り訂正方法、関連装置及び読取可能な記憶媒体 |
KR1020227005374A KR102648306B1 (ko) | 2019-11-25 | 2020-11-17 | 음성 인식 오류 정정 방법, 관련 디바이스들, 및 판독 가능 저장 매체 |
US17/773,641 US20220383853A1 (en) | 2019-11-25 | 2020-11-17 | Speech recognition error correction method, related devices, and readable storage medium |
EP20894295.3A EP4068280A4 (en) | 2019-11-25 | 2020-11-17 | VOICE RECOGNITION ERROR CORRECTION METHOD, ASSOCIATED APPARATUS AND READABLE STORAGE MEDIUM |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911167009.0 | 2019-11-25 | ||
CN201911167009.0A CN110956959B (zh) | 2019-11-25 | 2019-11-25 | 语音识别纠错方法、相关设备及可读存储介质 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021104102A1 true WO2021104102A1 (zh) | 2021-06-03 |
Family
Family ID: 69978361
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/129314 WO2021104102A1 (zh) | 2019-11-25 | 2020-11-17 | 语音识别纠错方法、相关设备及可读存储介质 |
Country Status (5)
Country | Link |
---|---|
US (1) | US20220383853A1 (zh) |
EP (1) | EP4068280A4 (zh) |
KR (1) | KR102648306B1 (zh) |
CN (1) | CN110956959B (zh) |
WO (1) | WO2021104102A1 (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113421553A (zh) * | 2021-06-15 | 2021-09-21 | 北京天行汇通信息技术有限公司 | 音频挑选的方法、装置、电子设备和可读存储介质 |
Families Citing this family (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110956959B (zh) * | 2019-11-25 | 2023-07-25 | 科大讯飞股份有限公司 | 语音识别纠错方法、相关设备及可读存储介质 |
CN111627457A (zh) * | 2020-05-13 | 2020-09-04 | 广州国音智能科技有限公司 | 语音分离方法、系统及计算机可读存储介质 |
CN111583909B (zh) * | 2020-05-18 | 2024-04-12 | 科大讯飞股份有限公司 | 一种语音识别方法、装置、设备及存储介质 |
CN111754987A (zh) * | 2020-06-23 | 2020-10-09 | 国投(宁夏)大数据产业发展有限公司 | 一种大数据分析语音识别方法 |
CN112016305B (zh) * | 2020-09-09 | 2023-03-28 | 平安科技(深圳)有限公司 | 文本纠错方法、装置、设备及存储介质 |
CN112259100B (zh) * | 2020-09-15 | 2024-04-09 | 科大讯飞华南人工智能研究院(广州)有限公司 | 语音识别方法及相关模型的训练方法和相关设备、装置 |
CN112133453B (zh) * | 2020-09-16 | 2022-08-26 | 成都美透科技有限公司 | 一种基于医美数据和医疗数据的用户咨询辅助分析系统 |
CN112257437B (zh) * | 2020-10-20 | 2024-02-13 | 中国科学技术大学 | 语音识别纠错方法、装置、电子设备和存储介质 |
CN112435671B (zh) * | 2020-11-11 | 2021-06-29 | 深圳市小顺智控科技有限公司 | 汉语精准识别的智能化语音控制方法及系统 |
CN112489651B (zh) * | 2020-11-30 | 2023-02-17 | 科大讯飞股份有限公司 | 语音识别方法和电子设备、存储装置 |
CN114678027A (zh) * | 2020-12-24 | 2022-06-28 | 深圳Tcl新技术有限公司 | 语音识别结果的纠错方法、装置、终端设备及存储介质 |
CN113035175B (zh) * | 2021-03-02 | 2024-04-12 | 科大讯飞股份有限公司 | 一种语音文本重写模型构建方法、语音识别方法 |
CN113257227B (zh) * | 2021-04-25 | 2024-03-01 | 平安科技(深圳)有限公司 | 语音识别模型性能检测方法、装置、设备及存储介质 |
CN113409767B (zh) * | 2021-05-14 | 2023-04-25 | 北京达佳互联信息技术有限公司 | 一种语音处理方法、装置、电子设备及存储介质 |
CN113221580B (zh) * | 2021-07-08 | 2021-10-12 | 广州小鹏汽车科技有限公司 | 语义拒识方法、语义拒识装置、交通工具及介质 |
WO2024029845A1 (ko) * | 2022-08-05 | 2024-02-08 | 삼성전자주식회사 | 전자 장치 및 이의 음성 인식 방법 |
US11657803B1 (en) * | 2022-11-02 | 2023-05-23 | Actionpower Corp. | Method for speech recognition by using feedback information |
CN116991874B (zh) * | 2023-09-26 | 2024-03-01 | 海信集团控股股份有限公司 | 一种文本纠错、基于大模型的sql语句生成方法及设备 |
CN117238276B (zh) * | 2023-11-10 | 2024-01-30 | 深圳市托普思维商业服务有限公司 | 一种基于智能化语音数据识别的分析纠正系统 |
CN117558263B (zh) * | 2024-01-10 | 2024-04-26 | 科大讯飞股份有限公司 | 语音识别方法、装置、设备及可读存储介质 |
Family Cites Families (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4867654B2 (ja) * | 2006-12-28 | 2012-02-01 | 日産自動車株式会社 | 音声認識装置、および音声認識方法 |
JP4709887B2 (ja) * | 2008-04-22 | 2011-06-29 | 株式会社エヌ・ティ・ティ・ドコモ | 音声認識結果訂正装置および音声認識結果訂正方法、ならびに音声認識結果訂正システム |
CN101876975A (zh) * | 2009-11-04 | 2010-11-03 | 中国科学院声学研究所 | 汉语地名的识别方法 |
CN102592595B (zh) * | 2012-03-19 | 2013-05-29 | 安徽科大讯飞信息科技股份有限公司 | 语音识别方法及系统 |
CN103366741B (zh) * | 2012-03-31 | 2019-05-17 | 上海果壳电子有限公司 | 语音输入纠错方法及系统 |
US9818401B2 (en) * | 2013-05-30 | 2017-11-14 | Promptu Systems Corporation | Systems and methods for adaptive proper name entity recognition and understanding |
US9940927B2 (en) * | 2013-08-23 | 2018-04-10 | Nuance Communications, Inc. | Multiple pass automatic speech recognition methods and apparatus |
KR102199444B1 (ko) * | 2014-11-24 | 2021-01-07 | 에스케이텔레콤 주식회사 | 음성 인식 오류에 강인한 의미 추론 방법 및 이를 위한 장치 |
KR102380833B1 (ko) * | 2014-12-02 | 2022-03-31 | 삼성전자주식회사 | 음성 인식 방법 및 음성 인식 장치 |
CN107391504B (zh) * | 2016-05-16 | 2021-01-29 | 华为技术有限公司 | 新词识别方法与装置 |
KR20180071029A (ko) * | 2016-12-19 | 2018-06-27 | 삼성전자주식회사 | 음성 인식 방법 및 장치 |
CN107437416B (zh) * | 2017-05-23 | 2020-11-17 | 创新先进技术有限公司 | 一种基于语音识别的咨询业务处理方法及装置 |
CN107293296B (zh) * | 2017-06-28 | 2020-11-20 | 百度在线网络技术(北京)有限公司 | 语音识别结果纠正方法、装置、设备及存储介质 |
JP2019113636A (ja) | 2017-12-22 | 2019-07-11 | オンキヨー株式会社 | 音声認識システム |
JP6985138B2 (ja) | 2017-12-28 | 2021-12-22 | 株式会社イトーキ | 音声認識システム及び音声認識方法 |
US11482213B2 (en) * | 2018-07-20 | 2022-10-25 | Cisco Technology, Inc. | Automatic speech recognition correction |
CN113016030A (zh) * | 2018-11-06 | 2021-06-22 | 株式会社赛斯特安国际 | 提供语音识别服务的方法及装置 |
US11017778B1 (en) * | 2018-12-04 | 2021-05-25 | Sorenson Ip Holdings, Llc | Switching between speech recognition systems |
CN110110041B (zh) * | 2019-03-15 | 2022-02-15 | 平安科技(深圳)有限公司 | 错词纠正方法、装置、计算机装置及存储介质 |
US11636853B2 (en) * | 2019-08-20 | 2023-04-25 | Soundhound, Inc. | Natural language grammar improvement |
Application timeline
- 2019-11-25: CN application CN201911167009.0A filed; granted as CN110956959B (active)
- 2020-11-17: KR application KR1020227005374A filed (KR102648306B1, IP right granted)
- 2020-11-17: US application US17/773,641 filed (US20220383853A1, pending)
- 2020-11-17: EP application EP20894295.3A filed (pending)
- 2020-11-17: PCT application PCT/CN2020/129314 filed (published as WO2021104102A1)
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003308094A (ja) * | 2002-02-12 | 2003-10-31 | Advanced Telecommunication Research Institute International | 音声認識における認識誤り箇所の訂正方法 |
CN101042867A (zh) * | 2006-03-24 | 2007-09-26 | 株式会社东芝 | 语音识别设备和方法 |
US20140195226A1 (en) * | 2013-01-04 | 2014-07-10 | Electronics And Telecommunications Research Institute | Method and apparatus for correcting error in speech recognition system |
CN106875943A (zh) * | 2017-01-22 | 2017-06-20 | 上海云信留客信息科技有限公司 | 一种用于大数据分析的语音识别系统 |
CN107093423A (zh) * | 2017-05-27 | 2017-08-25 | 努比亚技术有限公司 | 一种语音输入修正方法、装置及计算机可读存储介质 |
CN109065054A (zh) * | 2018-08-31 | 2018-12-21 | 出门问问信息科技有限公司 | 语音识别纠错方法、装置、电子设备及可读存储介质 |
CN110021293A (zh) * | 2019-04-08 | 2019-07-16 | 上海汽车集团股份有限公司 | 语音识别方法及装置、可读存储介质 |
CN110956959A (zh) * | 2019-11-25 | 2020-04-03 | 科大讯飞股份有限公司 | 语音识别纠错方法、相关设备及可读存储介质 |
Non-Patent Citations (1)
Title |
---|
See also references of EP4068280A4 |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113421553A (zh) * | 2021-06-15 | 2021-09-21 | 北京天行汇通信息技术有限公司 | 音频挑选的方法、装置、电子设备和可读存储介质 |
CN113421553B (zh) * | 2021-06-15 | 2023-10-20 | 北京捷通数智科技有限公司 | 音频挑选的方法、装置、电子设备和可读存储介质 |
Also Published As
Publication number | Publication date |
---|---|
JP2022552662A (ja) | 2022-12-19 |
EP4068280A1 (en) | 2022-10-05 |
KR102648306B1 (ko) | 2024-03-15 |
CN110956959B (zh) | 2023-07-25 |
CN110956959A (zh) | 2020-04-03 |
US20220383853A1 (en) | 2022-12-01 |
KR20220035222A (ko) | 2022-03-21 |
EP4068280A4 (en) | 2023-11-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021104102A1 (zh) | 语音识别纠错方法、相关设备及可读存储介质 | |
CN109146610B (zh) | 一种智能保险推荐方法、装置及智能保险机器人设备 | |
CN108899013B (zh) | 语音搜索方法、装置和语音识别系统 | |
US10917758B1 (en) | Voice-based messaging | |
US11093110B1 (en) | Messaging feedback mechanism | |
US20210090570A1 (en) | Automated calling system | |
US10366690B1 (en) | Speech recognition entity resolution | |
Xu et al. | Rescorebert: Discriminative speech recognition rescoring with bert | |
KR20160015218A (ko) | 온라인 음성 번역 방법 및 장치 | |
US10152298B1 (en) | Confidence estimation based on frequency | |
Peyser et al. | Improving performance of end-to-end ASR on numeric sequences | |
JP2008293019A (ja) | 言語理解装置 | |
CN111539199B (zh) | 文本的纠错方法、装置、终端、及存储介质 | |
CN109933773A (zh) | 一种多重语义语句解析系统及方法 | |
CN111128175B (zh) | 口语对话管理方法及系统 | |
WO2012004955A1 (ja) | テキスト補正方法及び認識方法 | |
CN116226338A (zh) | 基于检索和生成融合的多轮对话系统及方法 | |
Arora et al. | Two-pass low latency end-to-end spoken language understanding | |
Ganhotra et al. | Integrating dialog history into end-to-end spoken language understanding systems | |
Wang et al. | Diarizationlm: Speaker diarization post-processing with large language models | |
CN116564330A (zh) | 弱监督语音预训练方法、电子设备和存储介质 | |
CN112989794A (zh) | 模型训练方法、装置、智能机器人和存储介质 | |
JP7514920B2 (ja) | 音声認識誤り訂正方法、関連装置及び読取可能な記憶媒体 | |
JP2024512606A (ja) | 自己アライメントを用いたストリーミングasrモデル遅延の短縮 | |
Bijwadia et al. | Text Injection for Capitalization and Turn-Taking Prediction in Speech Models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20894295; Country of ref document: EP; Kind code of ref document: A1 |
| ENP | Entry into the national phase | Ref document number: 20227005374; Country of ref document: KR; Kind code of ref document: A |
| ENP | Entry into the national phase | Ref document number: 2022522366; Country of ref document: JP; Kind code of ref document: A |
| NENP | Non-entry into the national phase | Ref country code: DE |
| ENP | Entry into the national phase | Ref document number: 2020894295; Country of ref document: EP; Effective date: 20220627 |