WO2021104102A1 - Speech recognition error correction method, related device, and readable storage medium - Google Patents

Speech recognition error correction method, related device, and readable storage medium

Info

Publication number
WO2021104102A1
Authority
WO
WIPO (PCT)
Prior art keywords
recognition
recognition result
error correction
speech
result
Prior art date
Application number
PCT/CN2020/129314
Other languages
English (en)
French (fr)
Inventor
许丽
潘嘉
王智国
胡国平
Original Assignee
科大讯飞股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 科大讯飞股份有限公司 filed Critical 科大讯飞股份有限公司
Priority to JP2022522366A priority Critical patent/JP7514920B2/ja
Priority to KR1020227005374A priority patent/KR102648306B1/ko
Priority to US17/773,641 priority patent/US20220383853A1/en
Priority to EP20894295.3A priority patent/EP4068280A4/en
Publication of WO2021104102A1


Classifications

    • G10L15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/08: Speech classification or search
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L15/1815: Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G10L15/26: Speech to text systems
    • G10L15/32: Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
    • G10L19/005: Correction of errors induced by the transmission channel, if related to the coding algorithm
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the type of extracted parameters
    • G06F40/279: Recognition of textual entities
    • G10L2015/088: Word spotting
    • G10L2015/226: Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics
    • G10L2015/228: Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics of application context

Definitions

  • speech recognition technology based on deep learning has matured.
  • the recognition accuracy of traditional speech recognition models in general scenarios has reached a satisfactory level.
  • speech content usually contains some professional vocabulary that appears less frequently in general scenarios, resulting in poor coverage of this type of vocabulary by traditional speech recognition models.
  • when the speech to be recognized contains this type of vocabulary and a traditional speech recognition model is used to recognize it, recognition errors are very likely to occur, resulting in low speech recognition accuracy.
  • this application is proposed to provide a speech recognition error correction method, related equipment and readable storage medium.
  • the specific plan is as follows:
  • a speech recognition error correction method includes:
  • according to the second recognition result, the final recognition result is determined.
  • extracting keywords from the first recognition result includes:
  • performing a second recognition on the voice data with reference to the context information of the first recognition result and the keyword to obtain the second recognition result includes:
  • the acoustic characteristics of the voice data, the first recognition result, and the keywords are input into a pre-trained voice error correction recognition model to obtain the second recognition result.
  • the voice error correction recognition model is obtained by training a preset model using an error correction training data set;
  • the error correction training data set includes at least one set of error correction training data, and each set of error correction training data includes an acoustic feature corresponding to a piece of speech data, a text corresponding to the piece of speech data, a first recognition result corresponding to the piece of speech data, and the keywords in the first recognition result.
  • inputting the acoustic features of the voice data, the first recognition result, and the keywords into a pre-trained voice error correction recognition model to obtain the second recognition result includes:
  • the voice error correction recognition model is used to encode the acoustic features of the voice data, the first recognition result, and the keyword and to perform attention calculation, and the second recognition result is obtained based on the calculation result.
  • using the voice error correction recognition model to encode the acoustic features of the voice data, the first recognition result, and the keyword, perform attention calculation, and obtain the second recognition result includes:
  • the acoustic features of the speech data, the first recognition result, and the keyword are respectively coded and attention calculated to obtain the calculation result ;
  • the decoding layer of the speech error correction recognition model is used to decode the calculation result to obtain the second recognition result.
  • using the voice error correction recognition model to encode the acoustic features of the voice data, the first recognition result, and the keyword, perform attention calculation, and obtain the second recognition result includes:
  • the decoding layer of the speech error correction recognition model is used to decode the calculation result to obtain the second recognition result.
  • the encoding layer and the attention layer of the speech error correction recognition model are used to respectively encode the acoustic features of the speech data, the first recognition result, and the keywords and to perform attention calculation to obtain the calculation result, including:
  • the attention calculation is performed on the semantic vector at the previous moment related to each target object and the output result of the speech error correction recognition model at the previous moment, respectively, to obtain the hidden layer state related to each target object;
  • the attention calculation is performed on the high-level acoustic features of each target object and the hidden layer state related to each target object to obtain the semantic vector related to each target object;
  • the target object includes the acoustic feature of the voice data, the first recognition result, and the keyword.
  • the using the coding layer and the attention layer of the speech error correction recognition model to perform coding and attention calculation on the merged vector to obtain the calculation result includes:
  • the attention layer of the speech error correction recognition model is used to obtain the semantic vector related to the merged vector.
  • the determining the final recognition result according to the second recognition result includes:
  • a voice recognition error correction device including:
  • the acquiring unit is used to acquire the voice data to be recognized and its first recognition result
  • the first speech recognition unit is configured to refer to the context information of the first recognition result, perform second recognition on the speech data, and obtain the second recognition result;
  • the recognition result determining unit is configured to determine the final recognition result according to the second recognition result.
  • another voice recognition error correction device is provided, and the device includes:
  • the acquiring unit is used to acquire the voice data to be recognized and its first recognition result
  • the keyword extraction unit is used to extract keywords from the first recognition result
  • the second voice recognition unit is configured to refer to the context information of the first recognition result and the keywords, and perform a second recognition on the voice data to obtain the second recognition result;
  • the recognition result determining unit is configured to determine the final recognition result according to the second recognition result.
  • the keyword extraction unit includes:
  • the domain vocabulary extraction unit is used to extract vocabularies with domain characteristics from the first recognition result as keywords.
  • the second speech recognition unit includes:
  • An acoustic feature obtaining unit configured to obtain the acoustic feature of the voice data
  • the model processing unit is used to input the acoustic features of the voice data, the first recognition result, and the keywords into a pre-trained voice error correction recognition model to obtain the second recognition result.
  • the voice error correction recognition model is obtained by training the preset model using the error correction training data set;
  • the error correction training data set includes at least one set of error correction training data, and each set of error correction training data includes an acoustic feature corresponding to a piece of speech data, a text corresponding to the piece of speech data, a first recognition result corresponding to the piece of speech data, and the keywords in the first recognition result.
  • the model processing unit includes:
  • An encoding and attention calculation unit configured to use the voice error correction recognition model to encode the acoustic features of the voice data, the first recognition result, and the keywords, and to perform attention calculation;
  • the recognition unit is used to obtain the second recognition result based on the calculation result.
  • the encoding and attention calculation unit includes a first encoding and attention calculation unit, and the recognition unit includes a first decoding unit;
  • the first coding and attention calculation unit is configured to use the coding layer and the attention layer of the speech error correction recognition model to encode the acoustic features of the speech data, the first recognition result, and the keywords, and to perform attention calculation to obtain the calculation result;
  • the first decoding unit is configured to use the decoding layer of the speech error correction recognition model to decode the calculation result to obtain the second recognition result.
  • the model processing unit further includes a merging unit, the encoding and attention calculation unit includes a second encoding and an attention calculation unit, and the recognition unit includes a second decoding unit:
  • the merging unit is configured to merge the acoustic features of the voice data, the first recognition result, and the keywords to obtain a merged vector;
  • the second coding and attention calculation unit is configured to use the coding layer and the attention layer of the speech error correction recognition model to perform coding and attention calculation on the merged vector to obtain the calculation result;
  • the second decoding unit is configured to use the decoding layer of the speech error correction recognition model to decode the calculation result to obtain the second recognition result.
  • the first coding and attention calculation unit includes:
  • the first coding unit is configured to use the coding layer of the speech error correction recognition model to respectively code each target object to obtain the acoustic high-level features of each target object;
  • the first attention calculation unit is configured to use the attention layer of the speech error correction recognition model to perform attention calculation on the semantic vector at the previous moment related to each target object and the output result of the speech error correction recognition model at the previous moment, to obtain the hidden layer state related to each target object; and to use the attention layer of the speech error correction recognition model to perform attention calculation on the high-level acoustic features of each target object and the hidden layer state related to each target object, to obtain the semantic vector related to each target object; wherein the target object includes the acoustic feature of the speech data, the first recognition result, and the keywords.
  • the second encoding and attention calculation unit includes:
  • the second coding unit is configured to use the coding layer of the speech error correction recognition model to encode the merged vector to obtain the acoustic high-level features of the merged vector;
  • the second attention calculation unit is configured to use the attention layer of the speech error correction recognition model to perform attention calculation on the semantic vector at the previous moment related to the merged vector and the output result of the speech error correction recognition model at the previous moment, to obtain the hidden layer state related to the merged vector; and to use the attention layer of the speech error correction recognition model to perform attention calculation on the high-level acoustic features of the merged vector and the hidden layer state related to the merged vector, to obtain the semantic vector related to the merged vector.
  • the recognition result determining unit includes:
  • a confidence degree obtaining unit configured to obtain the confidence degree of the first recognition result and the confidence degree of the second recognition result
  • the determining unit is configured to determine, from the first recognition result and the second recognition result, the recognition result with the higher confidence as the final recognition result.
  • a speech recognition error correction system including a memory and a processor
  • the memory is used to store programs
  • the processor is configured to execute the program to implement each step of the voice recognition error correction method described above.
  • a readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, each step of the above-mentioned speech recognition error correction method is realized.
  • a computer program product is provided, which, when run on a terminal device, causes the terminal device to execute each step of the above-mentioned voice recognition error correction method.
  • this application discloses a speech recognition error correction method, related device, and readable storage medium, including: acquiring the speech data to be recognized and its first recognition result; referring to the context information of the first recognition result, performing a second recognition on the speech data to obtain the second recognition result; and finally determining the final recognition result according to the second recognition result.
  • the second recognition of the voice data is performed with reference to the context information of the first recognition result, fully considering the context information of the recognition result and the application scenario of the voice data. If the first recognition result is wrong, the second recognition can be used to correct the errors; therefore, the accuracy of speech recognition can be improved.
  • FIG. 1 is a schematic flowchart of a voice recognition error correction method disclosed in an embodiment of the application
  • FIG. 2 is a schematic flowchart of another voice recognition error correction method disclosed in an embodiment of the application.
  • FIG. 3 is a schematic diagram of the topology structure of a preset model for training a speech error correction recognition model disclosed in an embodiment of the application;
  • FIG. 4 is a schematic diagram of the topology structure of yet another preset model for training a speech error correction recognition model disclosed in an embodiment of the application;
  • FIG. 5 is a schematic structural diagram of a speech recognition error correction device disclosed in an embodiment of the application.
  • FIG. 6 is a schematic structural diagram of another voice recognition error correction device disclosed in an embodiment of the application.
  • FIG. 7 is a hardware structure block diagram of a speech recognition error correction system disclosed in an embodiment of the application.
  • for the first clause in a voice data stream, the context information is insufficient, so the first clause may be recognized incorrectly; for the clauses after the first clause, the context information is sufficient, so they can be recognized correctly. In other words, the same word may be recognized incorrectly when it appears in the first clause, yet be recognized correctly when it appears in a later clause.
  • for example, the content of the speech to be recognized is "Scientists from the Salk Institute in California have discovered that autophagy can inhibit the occurrence of cancer. Many people's cognition is just the opposite. Therefore, the therapies used to inhibit autophagy may actually bring bad consequences.", but it was recognized as "Scientists from the Salk Institute in California found that the reaction can inhibit the occurrence of cancer at this time. Many people's perceptions are just the opposite. Therefore, those who use therapies to suppress the autophagy response may have undesirable consequences."
  • the inventor of this case found that the context information carried by the recognition result itself has a certain impact on the correctness of the recognition result. Therefore, based on the context information of the first recognition result of the voice data to be recognized, the voice data can be recognized a second time to obtain the second recognition result. In the second recognition result, the incorrectly recognized domain vocabulary in the first recognition result may be corrected, thereby improving the accuracy of the speech recognition result.
  • the inventor of this case proposes a speech recognition error correction method.
  • the speech recognition error correction method provided by the present application will be introduced through the following embodiments.
  • FIG. 1 is a schematic flowchart of a voice recognition error correction method disclosed in an embodiment of the application.
  • the method may include:
  • S101 Acquire voice data to be recognized and its first recognition result.
  • the voice data to be recognized is voice data spoken by the user according to application requirements, such as voice data input by the user using a voice input method when sending a text message or chatting.
  • the voice data to be recognized may be voice data in a general field, or voice data in a special scene (such as a professional field).
  • obtaining the first recognition result of the voice data to be recognized can be implemented based on a neural network model.
  • other methods for obtaining the first recognition result of the voice data to be recognized are also within the protection scope of this application.
  • the first recognition result of the voice data to be recognized can be stored in advance, and when needed, it can be directly obtained from the storage medium.
  • S102 Referring to the context information of the first recognition result, perform a second recognition on the voice data to obtain a second recognition result.
  • the context information carried by the recognition result itself has a certain impact on the correctness of the recognition result. Therefore, in this embodiment, the context information of the first recognition result can be referred to, the voice data can be recognized a second time, and the second recognition result can be obtained.
  • there are various ways to recognize the voice data a second time and obtain the second recognition result. For example, it can be implemented based on a neural network model.
  • the domain vocabulary contained in the first recognition result can be determined, the domain vocabulary can be matched against the other words in the first recognition result, words whose matching degree is higher than the set lower limit of the matching degree but that are not exactly the same can be selected, and the domain words can be used to replace the selected words to obtain the second recognition result.
  • the second recognition result can be directly determined as the final recognition result.
  • the second recognition result may not be better than the first recognition result. If the second recognition result is directly determined as the final recognition result, the recognition accuracy rate will be reduced. Therefore, in this case, an optimal recognition result can be determined from the first recognition result and the second recognition result as the final recognition result.
  • the confidence of the first recognition result and the confidence of the second recognition result can be obtained; from the first recognition result and the second recognition result, the recognition result with the higher confidence is determined as the final recognition result.
  • manual verification can be used to determine an optimal recognition result from the first recognition result and the second recognition result.
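  • as a minimal illustration of the confidence-based selection described above (the function and variable names below are placeholders, not part of this disclosure), the comparison could look like the following sketch:

```python
def choose_final_result(first_result, first_confidence, second_result, second_confidence):
    """Return the recognition result with the higher confidence as the final result."""
    if second_confidence > first_confidence:
        return second_result
    return first_result

# Illustrative usage with placeholder values
final = choose_final_result("first hypothesis", 0.82, "second hypothesis", 0.91)
```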
  • this embodiment discloses a speech recognition error correction method: obtain the speech data to be recognized and its first recognition result; referring to the context information of the first recognition result, recognize the speech data a second time to obtain the second recognition result; finally, determine the final recognition result according to the second recognition result.
  • the speech data is recognized a second time with reference to the context information of the first recognition result, fully considering the context information of the recognition result and the application scenario of the voice data. If the first recognition result is wrong, the second recognition can correct its errors; therefore, the accuracy of speech recognition can be improved.
  • another voice recognition error correction method is further provided. Based on the above embodiment, keyword extraction can be performed on the first recognition result, and the context information of the first recognition result and the keywords can then be referred to at the same time when the voice data is recognized a second time, which can further improve the accuracy of the second recognition result. Refer to Figure 2 for the specific implementation process. The method includes:
  • S201 Acquire voice data to be recognized and its first recognition result.
  • Step S201 is the same as the aforementioned step S101, and the detailed implementation process can be referred to the foregoing introduction, which will not be repeated here.
  • S202 Extract keywords from the first recognition result. The keyword may be a vocabulary with domain characteristics extracted from the first recognition result; that is, the keywords can be domain-related words that appear in the first recognition result, and are usually domain-specific words, for example: autophagy, bone traction, kidney biopsy and other vocabulary in the medical field; feedforward neural network, pooling layer and other vocabulary in the computer field.
  • S203 Referring to the context information of the first recognition result and the keyword, perform a second recognition on the voice data to obtain a second recognition result.
  • the voice data is recognized for the second time, and there may be multiple implementation ways to obtain the second recognition result. For example, it can be implemented based on a neural network model.
  • Step S204 is the same as the aforementioned step S103, and the detailed implementation process can be referred to the foregoing introduction, and will not be repeated here.
  • the speech recognition error correction method disclosed in this embodiment further extracts keywords from the first recognition result.
  • the keywords can be vocabulary with domain characteristics; referring to both the context information of the first recognition result and the keywords when performing the second recognition of the voice data can further improve the accuracy of the second recognition result.
  • the voice data may be input into a pre-trained voice recognition model to obtain the first recognition result.
  • the pre-trained speech recognition model may specifically be a traditional speech recognition model. It may also be a speech recognition model generated by training a preset model based on the recognition training data set.
  • the recognition training data set includes at least one set of recognition training data, and each set of recognition training data includes a piece of text corresponding to the voice data, and the acoustic characteristics of the piece of voice data.
  • the preset model can be any neural network model, which is not limited in this application.
  • each piece of recognition training data in the recognition training data set can be obtained in the following way: obtain a piece of voice data; manually label the voice data to obtain the text corresponding to the voice data; extract the acoustic features of the voice data; and generate a piece of recognition training data that includes the text corresponding to the voice data and the acoustic features of the voice data.
  • voice data can be received through a microphone of a smart terminal, which is an electronic device with a voice recognition function, such as a smart phone, computer, translator, robot, smart home system, or smart home appliance. Pre-stored voice data can also be obtained. Of course, other ways of obtaining voice data are also within the protection scope of this application, and this application does not impose any limitation on this.
  • the acoustic feature of each piece of voice data may be a spectral feature of the voice data, such as MFCC (Mel-Frequency Cepstral Coefficients) or FBank features. Any mainstream acoustic feature extraction method can be used to extract the acoustic features of each piece of voice data, which is not limited in this application.
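  • as a hedged sketch (not part of this disclosure), MFCC features could be extracted with an off-the-shelf audio library such as librosa; the file name, sampling rate, and number of coefficients below are illustrative assumptions:

```python
# Illustrative sketch: extracting MFCC features for one piece of voice data.
import librosa

waveform, sample_rate = librosa.load("utterance.wav", sr=16000)      # placeholder file, 16 kHz mono
mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=40)   # shape: (40, num_frames)
acoustic_features = mfcc.T                                           # (num_frames, 40), one vector per frame
```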
  • the preset model used when training the speech recognition model can be the traditional attention-based encoder-decoder (encoding and decoding based on attention mechanism) model structure, or it can be another model structure; this application does not impose any restrictions.
  • when the preset model is trained based on the recognition training data, the acoustic features of the speech data in each piece of recognition training data are used as the input of the preset model, the text corresponding to the speech data in each piece of recognition training data is used as the training target, and the parameters of the preset model are trained.
  • optionally, NER (Named Entity Recognition) technology can be used to extract the keywords in the first recognition result.
  • of course, other ways of extracting keywords in the first recognition result are also within the protection scope of this application.
  • for example, a manual method can be used to extract keywords from the first recognition result.
  • when NER is used, extracting the keywords in the first recognition result may be specifically implemented as follows: input the first recognition result into a pre-trained keyword extraction model to obtain the keywords in the first recognition result.
  • the keyword extraction model can be generated by training a preset model structure based on an extraction training data set.
  • the extraction training data set includes at least one set of extraction training data, and each set of extraction training data includes a piece of text in which the professional vocabulary with domain characteristics has been marked.
  • each piece of text can be a text from a special scenario.
  • a manual labeling method can be used to mark the professional vocabulary with domain characteristics appearing in each piece of text.
  • the preset model can be a BiLSTM_CRF (Bidirectional Long Short-term Memory Model_Conditional Random Field) model based on deep learning.
  • the first recognition result is "autophagy can inhibit the occurrence of cancer. Many people have the opposite perception. Therefore, therapies used to inhibit autophagy may have bad consequences.”
  • the keyword extraction model can output keywords: autophagy, cancer, therapy.
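  • as a hedged illustration of such a keyword extraction model, the sketch below shows the BiLSTM portion of a B/I/O sequence tagger in PyTorch; the CRF layer, vocabulary, and all dimensions are assumptions of this sketch rather than requirements of the application, and tokens tagged B/I would be grouped into keyword spans such as "autophagy":

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Minimal BiLSTM sequence tagger sketch (the CRF layer is omitted for brevity)."""
    def __init__(self, vocab_size, num_tags=3, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim // 2, batch_first=True, bidirectional=True)
        self.tag_scores = nn.Linear(hidden_dim, num_tags)   # tags: B / I / O

    def forward(self, token_ids):             # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)  # (batch, seq_len, emb_dim)
        hidden, _ = self.bilstm(embedded)     # (batch, seq_len, hidden_dim)
        return self.tag_scores(hidden)        # per-token tag scores
```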
  • the extraction of the keywords in the first recognition result may also be specifically implemented as follows: input the first recognition result into a statistical model to obtain the keywords in the first recognition result.
  • the construction method of the statistical model is the current mature technology, which will not be repeated in this application.
  • when the second recognition of the speech data with reference to the context information of the first recognition result is implemented based on a neural network model, the acoustic features of the speech data and the first recognition result can be input into a pre-trained speech error correction recognition model to obtain the second recognition result.
  • the speech error correction recognition model is obtained by training a preset model using the error correction training data set;
  • the error correction training data set includes at least one set of error correction training data, and each set of error correction training data includes an acoustic feature corresponding to a piece of voice data, a text corresponding to the piece of voice data, and the first recognition result corresponding to the piece of voice data.
  • the acoustic feature corresponding to the piece of speech data and the first recognition result corresponding to the piece of speech data are the input of the preset speech error correction recognition model structure.
  • the text corresponding to the piece of speech data is a training target of the preset speech error correction recognition model structure.
  • each group of error correction training data can be obtained in the following way: a piece of voice data is obtained. Manually label the voice data to obtain the text corresponding to the voice data. Extract the acoustic features of the voice data. The voice data is input into a pre-trained voice recognition model, and the first recognition result corresponding to the voice data is obtained.
  • when the second recognition of the voice data with reference to the context information of the first recognition result and the keywords is implemented based on a neural network model, the acoustic features of the voice data, the first recognition result, and the keywords can be input into a pre-trained voice error correction recognition model to obtain the second recognition result.
  • the voice error correction recognition model is obtained by training a preset model using the error correction training data set; the error correction training data set includes at least one set of error correction training data, and each set of error correction training data includes an acoustic feature corresponding to a piece of voice data, the text corresponding to the piece of voice data, the first recognition result corresponding to the piece of voice data, and the keywords in the first recognition result.
  • when the speech error correction recognition model is trained, the acoustic features corresponding to the piece of speech data, the first recognition result corresponding to the piece of speech data, and the keywords in the first recognition result are the input of the preset speech error correction recognition model structure.
  • the text corresponding to the piece of speech data is a training target of the preset speech error correction recognition model structure.
  • each group of error correction training data can be obtained in the following way: a piece of voice data is obtained. Manually label the voice data to obtain the text corresponding to the voice data. Extract the acoustic features of the voice data. Input the speech data into the pre-trained speech recognition model to obtain the first recognition result corresponding to the speech data, and input the first recognition result into the pre-trained keyword extraction model to obtain the keywords in the first recognition result .
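  • a hedged sketch of assembling one group of error correction training data along these lines is shown below; extract_acoustic_features, speech_recognition_model, and keyword_extraction_model are assumed stand-ins for the components described above, not concrete APIs of this application:

```python
def build_error_correction_example(voice_data, labeled_text,
                                   extract_acoustic_features,
                                   speech_recognition_model,
                                   keyword_extraction_model):
    """Assemble one group of error correction training data (illustrative only)."""
    acoustic_features = extract_acoustic_features(voice_data)
    first_result = speech_recognition_model(acoustic_features)   # first recognition result
    keywords = keyword_extraction_model(first_result)            # keywords in the first result
    return {
        "acoustic_features": acoustic_features,    # model input
        "first_recognition_result": first_result,  # model input
        "keywords": keywords,                      # model input
        "target_text": labeled_text,               # training target (manual label)
    }
```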
  • the embodiment of the present application can obtain the second recognition result in two ways, both of which are implemented based on the speech error correction recognition model.
  • the difference is that the model input data of the two ways are different: in the first way, the model input is the acoustic features of the voice data and the first recognition result; in the second way, the model input is the acoustic features of the voice data, the first recognition result, and the keywords extracted from the first recognition result.
  • compared with the first way, the data input to the model in the second way additionally contains the keyword information.
  • the acoustic features of the speech data, the first recognition result, and the keywords are input into a pre-trained speech error correction recognition model to obtain the second recognition result.
  • the specific implementation method may be:
  • the voice error correction recognition model is used to encode the acoustic features of the voice data, the first recognition result, and the keyword and to perform attention calculation, and the second recognition result is obtained based on the calculation result.
  • FIG. 3 is a schematic diagram of the topology structure of a preset model for training a speech error correction recognition model disclosed in an embodiment of the application.
  • the model includes three layers, which are an encoding layer, an attention layer, and a decoding layer.
  • the function of the coding layer is to extract advanced features
  • the function of the attention layer is to calculate the correlation between the input of the layer and the final output result
  • the input of the decoding layer is the output of the attention layer
  • the output of the decoding layer is the output result at the current moment.
  • the specific form of the decoding layer may be a single-layer neural network with softmax, which is not limited in this application.
  • the coding layer can be further divided into three parts, namely the first coding module, the second coding module, and the third coding module.
  • the specific structure of the first coding module, the second coding module, and the third coding module can be a bidirectional RNN (Recurrent Neural Network) with an inverted pyramid structure, or it can be a CNN (Convolutional Neural Network), which is not limited in this application.
  • the attention layer can also be further divided into three parts, namely the first attention module, the second attention module, and the third attention module.
  • the specific structure of the first attention module, the second attention module, and the third attention module may be a two-way RNN (Recurrent Neural Network, recurrent neural network) or a one-way RNN, which is not limited in this application.
  • the input of the decoding layer is the output of the attention layer, and the output of the decoding layer is the output result at the current moment.
  • the specific form of Decode may be a single-layer neural network with softmax, which is not limited in this application.
  • the input of the first encoding module is the acoustic feature X corresponding to the voice data to be recognized, and the output is the high-level acoustic feature Ha
  • the input of the second encoding module is the characterization P of the first recognition result corresponding to the voice data to be recognized, and the output is the high-level feature Hw of the first recognition result of the speech data to be recognized.
  • the input of the third encoding module is the characterization Q of the keywords in the first recognition result of the speech data to be recognized, and the output is the high-level feature Hr of Q.
  • the output result y_{i-1} at the previous moment is a common input of the first attention module, the second attention module, and the third attention module.
  • each part has different inputs and outputs.
  • the input of the first attention module is Ha, and the output is the speech-related hidden layer state sa_i and semantic vector ca_i;
  • the input of the second attention module is Hw, and the output is the hidden layer state sw_i and semantic vector cw_i related to the first recognition result;
  • the input of the third attention module is Hr, and the output is the hidden layer state sr_i and semantic vector cr_i related to the keywords in the first recognition result.
  • the input of the decoding layer is the output of the attention layer [sa_i, ca_i, sw_i, cw_i, sr_i, cr_i], and the output of the decoding layer is the output result y_i at the current moment, where y_i is the recognition result of the speech data to be recognized.
  • P(y_i) represents the probability that the output result at the current moment is y_i: P(y_i) = Decode(sa_i, sw_i, sr_i, ca_i, cw_i, cr_i).
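  • the data flow just described might be sketched in PyTorch roughly as follows; the choice of LSTM encoders, single-layer attention, and all dimensions are assumptions of this sketch rather than requirements of the application, and the decoding step mirrors P(y_i) = Decode(sa_i, sw_i, sr_i, ca_i, cw_i, cr_i) from above:

```python
import torch
import torch.nn as nn

class AttentionBranch(nn.Module):
    """One attention module: combine y_{i-1} with the previous semantic vector to get the
    hidden state s_i, then attend over this branch's high-level features H to get c_i."""
    def __init__(self, feat_dim, emb_dim, state_dim):
        super().__init__()
        self.to_state = nn.Linear(emb_dim + feat_dim, state_dim)
        self.query = nn.Linear(state_dim, feat_dim)

    def forward(self, H, prev_c, prev_y_emb):
        # H: (batch, T, feat_dim); prev_c: (batch, feat_dim); prev_y_emb: (batch, emb_dim)
        s = torch.tanh(self.to_state(torch.cat([prev_y_emb, prev_c], dim=-1)))  # hidden state s_i
        scores = torch.bmm(H, self.query(s).unsqueeze(-1)).squeeze(-1)          # (batch, T)
        weights = torch.softmax(scores, dim=-1)
        c = torch.bmm(weights.unsqueeze(1), H).squeeze(1)                       # semantic vector c_i
        return s, c

class Fig3CorrectionModel(nn.Module):
    """Sketch of the Fig. 3 topology: three encoders, three attention branches, one decoder."""
    def __init__(self, acoustic_dim, text_dim, vocab_size, feat_dim=256, emb_dim=128, state_dim=256):
        super().__init__()
        self.enc_audio = nn.LSTM(acoustic_dim, feat_dim, batch_first=True)  # first encoding module
        self.enc_result = nn.LSTM(text_dim, feat_dim, batch_first=True)     # second encoding module
        self.enc_keyword = nn.LSTM(text_dim, feat_dim, batch_first=True)    # third encoding module
        self.att_audio = AttentionBranch(feat_dim, emb_dim, state_dim)
        self.att_result = AttentionBranch(feat_dim, emb_dim, state_dim)
        self.att_keyword = AttentionBranch(feat_dim, emb_dim, state_dim)
        self.decode = nn.Linear(3 * (state_dim + feat_dim), vocab_size)      # single layer + softmax

    def step(self, X, P, Q, prev_y_emb, prev_contexts):
        """One decoding step: X, P, Q are the acoustic features, first-result characterization,
        and keyword characterization; prev_y_emb embeds y_{i-1}; prev_contexts = (ca, cw, cr)."""
        Ha, _ = self.enc_audio(X)     # high-level acoustic features Ha
        Hw, _ = self.enc_result(P)    # high-level features Hw of the first recognition result
        Hr, _ = self.enc_keyword(Q)   # high-level features Hr of the keywords
        sa, ca = self.att_audio(Ha, prev_contexts[0], prev_y_emb)
        sw, cw = self.att_result(Hw, prev_contexts[1], prev_y_emb)
        sr, cr = self.att_keyword(Hr, prev_contexts[2], prev_y_emb)
        probs = torch.softmax(self.decode(torch.cat([sa, ca, sw, cw, sr, cr], dim=-1)), dim=-1)
        return probs, (ca, cw, cr)    # P(y_i) and updated semantic vectors
```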
  • the voice error correction recognition model is used to encode the acoustic features of the voice data, the first recognition result, and the keywords and to perform attention calculation, and the second recognition result is obtained based on the calculation result.
  • the specific implementation method may be: using the coding layer and the attention layer of the speech error correction recognition model to respectively encode the acoustic features of the speech data, the first recognition result, and the keyword and to perform attention calculation to obtain the calculation result; and using the decoding layer of the speech error correction recognition model to decode the calculation result to obtain the second recognition result.
  • respectively encoding the acoustic features of the speech data, the first recognition result, and the keywords and performing attention calculation to obtain the calculation result can be implemented as follows: use the coding layer of the speech error correction recognition model to separately encode each target object to obtain the high-level acoustic features of each target object; use the attention layer of the speech error correction recognition model to perform attention calculation on the semantic vector at the previous moment related to each target object and the output result of the speech error correction recognition model at the previous moment, to obtain the hidden layer state related to each target object; and use the attention layer of the speech error correction recognition model to perform attention calculation on the high-level acoustic features of each target object and the hidden layer state related to each target object, to obtain the semantic vector related to each target object;
  • the target object includes the acoustic feature of the voice data, the first recognition result, and the keyword.
  • the above example is an optional processing procedure of the voice error correction recognition model when the input data is the acoustic feature of the voice data, the first recognition result, and the keyword.
  • when the input data is only the acoustic features of the voice data and the first recognition result, all the model structures and processing procedures related to keywords in Figure 3 can be omitted, that is, the third encoding module and the third attention module are removed from the voice error correction recognition model, and the rest of the model structure is kept unchanged; for the specific process, refer to the previous introduction, which will not be repeated here.
  • FIG. 4 is a schematic diagram of the topology structure of another preset model for training a speech error correction recognition model disclosed in an embodiment of the application.
  • the model includes three layers. They are the coding layer, the attention layer, and the decoding layer.
  • the function of the encoding layer is to extract high-level features
  • the function of the attention layer is to calculate the correlation between the input of this layer and the final output result
  • the input of the decoding layer is the output of the attention layer
  • the output of the decoding layer is the output result at the current moment.
  • the specific form of Decode can be a single-layer neural network with softmax, which is not limited in this application.
  • the input of the coding layer is the merged vector composed of the acoustic feature X corresponding to the voice data to be recognized, the characterization P of the first recognition result corresponding to the voice data to be recognized, and the characterization Q of the keywords in the first recognition result of the voice data to be recognized; the output of the coding layer is the merged vector [Ha, Hw, Hr] composed of the high-level feature Ha of the acoustic feature, the high-level feature Hw of the first recognition result of the voice data to be recognized, and the high-level feature Hr of the keywords in the first recognition result of the voice data to be recognized.
  • the output of the coding layer and the output result y_{i-1} of the model at the previous moment are the input of the attention layer; the output of the attention layer is the vector [sa_i, ca_i, sw_i, cw_i, sr_i, cr_i] composed of the speech-related hidden layer state sa_i and semantic vector ca_i, the hidden layer state sw_i and semantic vector cw_i related to the first recognition result, and the hidden layer state sr_i and semantic vector cr_i related to the keywords in the first recognition result.
  • the input of the decoding layer is the output of the attention layer, and the output of the decoding layer is the output result y_i at the current moment, where y_i is the recognition result of the voice data to be recognized.
  • the voice error correction recognition model is used to encode the acoustic features of the voice data, the first recognition result, and the keywords and to perform attention calculation, and the second recognition result is obtained based on the calculation result.
  • the specific implementation method may be: merging the acoustic features of the voice data, the first recognition result, and the keywords to obtain a merged vector;
  • the coding layer and the attention layer of the speech error correction recognition model are used to perform coding and attention calculation on the merged vector to obtain the calculation result; the decoding layer of the speech error correction recognition model is used to decode the calculation result to obtain the second recognition result.
  • the implementation manner for obtaining the calculation result may be as follows:
  • use the coding layer of the speech error correction recognition model to encode the merged vector to obtain the high-level features of the merged vector, and use the attention layer of the speech error correction recognition model to perform attention calculation to obtain the hidden layer state and the semantic vector related to the merged vector.
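  • for contrast with the Fig. 3 sketch above, the Fig. 4 variant might look like the fragment below, which reuses the AttentionBranch class from that sketch and assumes X, P, and Q have already been projected to a common feature dimension (an assumption of this sketch, not a requirement of the application):

```python
import torch
import torch.nn as nn

class Fig4CorrectionStep(nn.Module):
    """Sketch of the Fig. 4 variant: one shared encoder and one attention branch over the
    merged vector (layer types and dimensions are assumptions of this sketch)."""
    def __init__(self, feat_dim, emb_dim, state_dim, vocab_size):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.attention = AttentionBranch(feat_dim, emb_dim, state_dim)  # defined in the Fig. 3 sketch
        self.decode = nn.Linear(state_dim + feat_dim, vocab_size)

    def forward(self, X, P, Q, prev_c, prev_y_emb):
        merged = torch.cat([X, P, Q], dim=1)          # merged input [X, P, Q] along the time axis
        H, _ = self.encoder(merged)                   # merged high-level features [Ha, Hw, Hr]
        s, c = self.attention(H, prev_c, prev_y_emb)  # hidden layer state and semantic vector
        return torch.softmax(self.decode(torch.cat([s, c], dim=-1)), dim=-1)  # P(y_i)
```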
  • the main focus of the attention layer of the traditional speech recognition model is the correlation between the output result of the traditional speech recognition model and the acoustic characteristics of the speech data.
  • the speech error correction recognition model in this application integrates the first recognition result of the speech data and the keywords in the first recognition result into the attention layer, so that the output result of the speech error correction recognition model can pay attention to the error correction information of the recognition result and the context information of the recognition result.
  • the speech error correction recognition model can learn the attention mechanism related to the output result and the context information, as well as the attention mechanism related to the output result and the error correction information, and through these two attention mechanisms it can find the context information and error correction information that the current voice data needs to pay attention to; that is, according to the input voice data, it can automatically choose whether to pay attention to the first recognition result and the keyword information in the first recognition result, which is equivalent to the speech error correction recognition model having the ability to automatically correct errors on the basis of the first recognition result and the keywords in the first recognition result.
  • the above example is another optional processing procedure of the voice error correction recognition model when the input data is the acoustic feature of the voice data, the first recognition result, and the keyword.
  • when the keyword information is not included in the input data, the input of the coding layer in FIG. 4 is the combination vector [X, P] composed of the acoustic feature X corresponding to the voice data to be recognized and the characterization P of the first recognition result corresponding to the voice data to be recognized; the output of the coding layer is the combination vector [Ha, Hw] composed of the high-level feature Ha of the acoustic feature and the high-level feature Hw of the first recognition result of the speech data to be recognized.
  • the output of the attention layer is the vector [sa_i, ca_i, sw_i, cw_i] composed of the speech-related hidden layer state sa_i and semantic vector ca_i and the hidden layer state sw_i and semantic vector cw_i related to the first recognition result.
  • the input of the decoding layer is the output of the attention layer, and the output of the decoding layer is the output result y i at the current moment, and y i is the recognition result of the voice data to be recognized.
  • when the data input to the model omits the keyword information, the only difference is that the keyword information is removed from the input combination vector of the coding layer; the remaining layers of the model process the input of the coding layer with reference to the original processing logic. For the specific process, refer to the previous introduction, which will not be repeated here.
  • this application also provides an implementation method for generating a recognition training data set and an error correction training data set, which are specifically as follows:
  • the smart terminal is an electronic device with a voice recognition function, such as a smart phone, computer, translator, robot, smart home system, or smart home appliance.
  • each piece of voice data is manually labeled, that is, each piece of voice data is manually transcribed into corresponding text data.
  • the acoustic features extracted from each piece of voice data are generally the spectral features of the voice data, such as MFCC or FBank.
  • the specific method for acquiring the acoustic features is an existing method, and the introduction is not repeated here.
  • the acoustic features of the voice data and the manually labeled text corresponding to the voice data are obtained.
  • the acoustic features of the voice data obtained in the above steps and the manually labeled text corresponding to the voice data are divided into two parts.
  • the first part is represented by the A set
  • the second part is represented by the B set.
  • the acoustic features of the voice data obtained in the above steps and the artificially annotated text corresponding to the voice data total 1 million groups, and these 1 million groups are randomly divided into two sets of equal amount, namely set A and set B.
  • Both the A set and the B set include multiple sets of training data, and each set of training data includes an acoustic feature corresponding to a piece of voice data and a manually labeled text corresponding to the voice data.
  • set C includes multiple sets of training data; each set of training data includes an acoustic feature corresponding to a piece of voice data, a manually labeled text corresponding to the voice data, a recognition result of the voice data, and the keywords in the recognition result.
  • through training, the speech error correction recognition model is obtained.
  • the above-mentioned recognition training data set and error correction training data set contain keywords. If the input data of the speech error correction recognition model only needs the acoustic features of the speech data and the first recognition result, that is, it does not contain the keyword information, the step of obtaining keywords in the above process can be omitted, and the finally obtained recognition training data set and error correction training data set do not need to contain keywords.
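  • the data preparation flow above could be scripted roughly as follows; train_recognition_model, recognize, and extract_keywords are assumed stand-ins for the components described above, and the assignment of which split trains the recognition model is one plausible arrangement rather than something fixed by this excerpt:

```python
import random

def build_training_sets(samples, train_recognition_model, recognize, extract_keywords):
    """samples: list of (acoustic_features, labeled_text) pairs. Illustrative sketch only."""
    random.shuffle(samples)
    half = len(samples) // 2
    set_a, set_b = samples[:half], samples[half:]        # two equal recognition training sets

    recognition_model = train_recognition_model(set_a)   # e.g. train the recognition model on set A
    set_c = []                                           # error correction training data
    for acoustic_features, labeled_text in set_b:
        first_result = recognize(recognition_model, acoustic_features)
        keywords = extract_keywords(first_result)
        set_c.append((acoustic_features, labeled_text, first_result, keywords))
    return set_a, set_b, set_c
```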
  • the speech recognition error correction device disclosed in the embodiments of the present application will be described below.
  • the speech recognition error correction device described below and the speech recognition error correction method described above can be referred to each other.
  • FIG. 5 is a schematic structural diagram of a speech recognition error correction device disclosed in an embodiment of the application.
  • the speech recognition error correction device may include:
  • the acquiring unit 51 is configured to acquire the voice data to be recognized and the first recognition result thereof;
  • the first voice recognition unit 52 is configured to refer to the context information of the first recognition result, perform a second recognition on the voice data, and obtain the second recognition result;
  • the recognition result determining unit 53 is configured to determine the final recognition result according to the second recognition result.
  • the voice recognition error correction device may include:
  • the acquiring unit 51 is configured to acquire the voice data to be recognized and the first recognition result thereof;
  • the keyword extraction unit 54 is used to extract keywords from the first recognition result
  • the second voice recognition unit 55 is configured to refer to the context information of the first recognition result and the keywords, and perform a second recognition on the voice data to obtain the second recognition result;
  • the recognition result determining unit 53 is configured to determine the final recognition result according to the second recognition result.
  • the keyword extraction unit includes:
  • the domain vocabulary extraction unit is used to extract vocabularies with domain characteristics from the first recognition result as keywords.
  • the second speech recognition unit includes:
  • An acoustic feature obtaining unit configured to obtain the acoustic feature of the voice data
  • the model processing unit is used to input the acoustic features of the voice data, the first recognition result, and the keywords into a pre-trained voice error correction recognition model to obtain the second recognition result.
  • the voice error correction recognition model is obtained by training the preset model using the error correction training data set;
  • the error correction training data set includes at least one set of error correction training data, and each set of error correction training data includes an acoustic feature corresponding to a piece of speech data, a text corresponding to the piece of speech data, a first recognition result corresponding to the piece of speech data, and the keywords in the first recognition result.
  • the model processing unit includes:
  • An encoding and attention calculation unit configured to use the voice error correction recognition model to encode the acoustic features of the voice data, the first recognition result, and the keywords, and to perform attention calculation;
  • the recognition unit is used to obtain the second recognition result based on the calculation result.
  • the encoding and attention calculation unit includes a first encoding and attention calculation unit, and the recognition unit includes a first decoding unit;
  • the first coding and attention calculation unit is configured to use the coding layer and the attention layer of the speech error correction recognition model to encode the acoustic features of the speech data, the first recognition result, and the keywords, and to perform attention calculation to obtain the calculation result;
  • the first decoding unit is configured to use the decoding layer of the speech error correction recognition model to decode the calculation result to obtain the second recognition result.
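A hedged sketch of this separate-encoder variant follows, written with generic PyTorch modules; the class is a simplified illustration of one decoding step, not the application's actual network, and all layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class SeparateEncoderCorrector(nn.Module):
    """Illustrative sketch: three encoders (acoustics, first result, keywords) feed one decoder."""

    def __init__(self, feat_dim, vocab_size, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.enc_audio = nn.GRU(feat_dim, hidden, batch_first=True)    # acoustic features X  -> Ha
        self.enc_first = nn.GRU(hidden, hidden, batch_first=True)      # first-pass result P  -> Hw
        self.enc_keyword = nn.GRU(hidden, hidden, batch_first=True)    # keywords Q           -> Hr
        self.attn = nn.ModuleList(
            [nn.MultiheadAttention(hidden, num_heads=4, batch_first=True) for _ in range(3)]
        )
        self.decoder = nn.GRUCell(3 * hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def step(self, feats, first_ids, keyword_ids, prev_token, state):
        """One decoding step: returns logits for the current output token and the new decoder state."""
        # In practice the encoders would run once per utterance; they are shown here for completeness.
        ha, _ = self.enc_audio(feats)                      # (B, Tx, H)
        hw, _ = self.enc_first(self.embed(first_ids))      # (B, Tp, H)
        hr, _ = self.enc_keyword(self.embed(keyword_ids))  # (B, Tq, H)
        # Attention query built from the previous decoder state and the previous output token y_{i-1}.
        query = (state + self.embed(prev_token)).unsqueeze(1)
        contexts = [attn(query, mem, mem)[0].squeeze(1)    # per-stream semantic vectors ca_i, cw_i, cr_i
                    for attn, mem in zip(self.attn, (ha, hw, hr))]
        state = self.decoder(torch.cat(contexts, dim=-1), state)
        return self.out(state), state
```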
  • In another implementation, the model processing unit further includes a merging unit, the encoding and attention calculation unit includes a second encoding and attention calculation unit, and the recognition unit includes a second decoding unit:
  • the merging unit is configured to merge the acoustic features of the speech data, the first recognition result, and the keywords to obtain a merged vector;
  • the second encoding and attention calculation unit is configured to use the encoding layer and the attention layer of the speech error correction recognition model to encode the merged vector and perform attention calculation on it, to obtain the calculation result;
  • the second decoding unit is configured to use the decoding layer of the speech error correction recognition model to decode the calculation result to obtain the second recognition result.
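For contrast with the separate-encoder variant, here is one plausible reading of the merging step, in which the three representations are concatenated into a single sequence before encoding; this interpretation and the helper name are assumptions, not a definitive implementation.

```python
import torch

def build_merged_vector(acoustic_feats, first_result_repr, keyword_repr):
    """Illustrative merge [X, P, Q]: concatenate the three streams along the time axis.

    Assumes all three inputs have already been projected to the same feature width,
    so that a single encoder and attention layer can consume the merged sequence.
    """
    assert acoustic_feats.size(-1) == first_result_repr.size(-1) == keyword_repr.size(-1)
    return torch.cat([acoustic_feats, first_result_repr, keyword_repr], dim=1)  # (B, Tx+Tp+Tq, D)
```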
  • The first encoding and attention calculation unit includes:
  • a first encoding unit, configured to use the encoding layer of the speech error correction recognition model to encode each target object separately, to obtain high-level features of each target object;
  • a first attention calculation unit, configured to use the attention layer of the speech error correction recognition model to perform attention calculation, for each target object, on the semantic vector of the previous time step related to that target object and on the output result of the speech error correction recognition model at the previous time step, to obtain a hidden state related to that target object; and to use the attention layer of the speech error correction recognition model to perform attention calculation, for each target object, on the high-level features of that target object and the hidden state related to that target object, to obtain a semantic vector related to that target object;
  • the target objects are the acoustic features of the speech data, the first recognition result, and the keywords.
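Using the notation of the description (Ha, Hw, Hr for the encoded target objects and y_{i-1} for the previous output), one reading of this unit at time step i is the recurrence below; only the final Decode line is stated explicitly in the description, and the operator names AttnRNN and Attn are ours, standing for the two attention computations.

    sa_i = AttnRNN(ca_{i-1}, y_{i-1})    ca_i = Attn(Ha, sa_i)
    sw_i = AttnRNN(cw_{i-1}, y_{i-1})    cw_i = Attn(Hw, sw_i)
    sr_i = AttnRNN(cr_{i-1}, y_{i-1})    cr_i = Attn(Hr, sr_i)

    P(y_i) = Decode(sa_i, sw_i, sr_i, ca_i, cw_i, cr_i)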
  • The second encoding and attention calculation unit includes:
  • a second encoding unit, configured to use the encoding layer of the speech error correction recognition model to encode the merged vector, to obtain high-level features of the merged vector;
  • a second attention calculation unit, configured to use the attention layer of the speech error correction recognition model to perform attention calculation on the semantic vector of the previous time step related to the merged vector and on the output result of the speech error correction recognition model at the previous time step, to obtain a hidden state related to the merged vector; and to use the attention layer of the speech error correction recognition model to perform attention calculation on the high-level features of the merged vector and the hidden state related to the merged vector, to obtain a semantic vector related to the merged vector.
  • The recognition result determining unit includes:
  • a confidence obtaining unit, configured to obtain the confidence of the first recognition result and the confidence of the second recognition result;
  • a determining unit, configured to determine, from the first recognition result and the second recognition result, the recognition result with the higher confidence as the final recognition result.
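A hedged sketch of this confidence comparison follows; the confidence values are assumed to be supplied by the two recognition passes (for example as decoder posterior scores), since the description does not fix how they are computed.

```python
def pick_by_confidence(first_result, second_result, first_confidence, second_confidence):
    """Return the hypothesis with the higher confidence; ties keep the corrected (second) result."""
    return first_result if first_confidence > second_confidence else second_result
```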
  • FIG. 7 shows a hardware structure block diagram of the speech recognition error correction system.
  • The hardware structure of the speech recognition error correction system may include: at least one processor 1, at least one communication interface 2, at least one memory 3, and at least one communication bus 4;
  • the number of each of the processor 1, the communication interface 2, the memory 3, and the communication bus 4 is at least one, and the processor 1, the communication interface 2, and the memory 3 communicate with one another through the communication bus 4;
  • the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;
  • the memory 3 may include a high-speed RAM memory, and may also include a non-volatile memory, for example at least one disk memory;
  • The memory stores a program, and the processor can call the program stored in the memory; the program is used for:
  • acquiring speech data to be recognized and a first recognition result thereof;
  • performing a second recognition on the speech data with reference to context information of the first recognition result, to obtain a second recognition result;
  • determining a final recognition result according to the second recognition result.
  • Alternatively, the program can be used for:
  • acquiring speech data to be recognized and a first recognition result thereof;
  • extracting keywords from the first recognition result;
  • performing a second recognition on the speech data with reference to the context information of the first recognition result and the keywords, to obtain a second recognition result;
  • determining a final recognition result according to the second recognition result.
  • The embodiments of the present application also provide a storage medium; the storage medium may store a program suitable for execution by a processor, and the program is used for:
  • acquiring speech data to be recognized and a first recognition result thereof;
  • performing a second recognition on the speech data with reference to context information of the first recognition result, to obtain a second recognition result;
  • determining a final recognition result according to the second recognition result.
  • Alternatively, the program can be used for:
  • acquiring speech data to be recognized and a first recognition result thereof;
  • extracting keywords from the first recognition result;
  • performing a second recognition on the speech data with reference to the context information of the first recognition result and the keywords, to obtain a second recognition result;
  • determining a final recognition result according to the second recognition result.
  • The embodiments of the present application also provide a computer program product which, when run on a terminal device, causes the terminal device to execute any one of the above speech recognition error correction methods.


Abstract

A speech recognition error correction method, related devices, and a readable storage medium: speech data to be recognized and a first recognition result thereof are acquired (S101); with reference to context information of the first recognition result, a second recognition is performed on the speech data to obtain a second recognition result (S102); and a final recognition result is determined according to the second recognition result (S103). In this solution, the second recognition of the speech data is performed with reference to the context information of the first recognition result, which fully takes into account the context information of the recognition result and the applicable scenario of the speech data; if the first recognition result is wrong, the second recognition can correct it, so the speech recognition accuracy can be improved. Further, keywords can also be extracted from the first recognition result; on this basis, the second recognition can be performed on the speech data with reference to both the context information of the first recognition result and the keywords, which can further improve the accuracy of the second recognition result.

Description

语音识别纠错方法、相关设备及可读存储介质 技术领域
本申请要求于2019年11月25日提交至中国国家知识产权局、申请号为201911167009.0、发明名称为“语音识别纠错方法、相关设备及可读存储介质”的专利申请的优先权,其全部内容通过引用结合在本申请中。
背景技术
近几年来,人工智能设备逐渐走入普通大众的生活、工作中,成为不可或缺的一部分,这些都得力于人工智能技术的飞速发展。而语音交互作为最自然的一种人机交互方式,广泛应用于各种人工智能设备中,使得人类与机器可以进行无障碍交流。语音交互过程中,基于语音识别技术,可以让机器“听懂”人类的语言,进而为人类提供服务。
目前,基于深度学习的语音识别技术已经日趋成熟,传统语音识别模型在通用场景下的识别准确率已经达到令人满意的效果,但是在一些特殊场景(例如专业领域)下的语音内容中,通常存在一些专业词汇,这类词汇在通用场景下出现的频率较小,导致传统语音识别模型对该类词汇的覆盖较差。在一些特殊场景下,待识别语音中包含该类词汇,采用传统语音识别模型对待识别语音进行识别,非常容易出现识别错误的情况,导致语音识别准确率低下。
因此,如何提升语音识别准确率,成为本领域技术人员亟待解决的技术问题。
发明内容
鉴于上述问题,提出了本申请以便提供一种语音识别纠错方法、相关设备及可读存储介质。具体方案如下:
在本申请的第一方面中,提供了一种语音识别纠错方法,所述方法包括:
获取待识别的语音数据及其第一次识别结果;
参考所述第一次识别结果的上下文信息,对所述语音数据进行第二次 识别,得到第二次识别结果;
根据所述第二次识别结果,确定最终的识别结果。
在本申请的第二方面中,提供了另一种语音识别纠错方法,所述方法包括:
获取待识别的语音数据及其第一次识别结果;
从所述第一次识别结果中提取关键词;
参考所述第一次识别结果的上下文信息以及所述关键词,对所述语音数据进行第二次识别,得到第二次识别结果;
根据所述第二次识别结果,确定最终的识别结果。
可选地,从所述第一次识别结果中提取关键词,包括:
从所述第一识别结果中提取具有领域特性的词汇,作为关键词。
可选地,所述参考所述第一次识别结果的上下文信息以及所述关键词,对所述语音数据进行第二次识别,得到第二次识别结果,包括:
获取所述语音数据的声学特征;
将所述语音数据的声学特征、所述第一次识别结果以及所述关键词,输入预先训练的语音纠错识别模型,得到第二次识别结果,所述语音纠错识别模型是利用纠错训练数据集对预设模型进行训练得到的;
其中,所述纠错训练数据集中包括至少一组纠错训练数据,每组纠错训练数据包括一条语音数据对应的声学特征、所述一条语音数据对应的文本、所述一条语音数据对应的第一次识别结果以及所述第一次识别结果中的关键词。
可选地,所述将所述语音数据的声学特征、所述第一次识别结果以及所述关键词,输入预先训练的语音纠错识别模型,得到第二次识别结果,包括:
利用所述语音纠错识别模型对所述语音数据的声学特征、所述第一次识别结果以及所述关键词进行编码以及注意力计算,并基于计算结果,得到第二次识别结果。
可选地,所述利用所述语音纠错识别模型对所述语音数据的声学特征、 所述第一次识别结果以及所述关键词进行编码以及注意力计算,并基于计算结果,得到第二次识别结果,包括:
利用所述语音纠错识别模型的编码层和注意力层,分别对所述语音数据的声学特征、所述第一次识别结果以及所述关键词进行编码以及注意力计算,得到所述计算结果;
利用所述语音纠错识别模型的解码层,对所述计算结果进行解码,得到第二次识别结果。
可选地,所述利用所述语音纠错识别模型对所述语音数据的声学特征、所述第一次识别结果以及所述关键词进行编码以及注意力计算,并基于计算结果,得到第二次识别结果,包括:
对所述语音数据的声学特征、所述第一次识别结果以及所述关键词进行合并,得到合并向量;
利用所述语音纠错识别模型的编码层和注意力层,对所述合并向量进行编码以及注意力计算,得到所述计算结果;
利用所述语音纠错识别模型的解码层,对所述计算结果进行解码,得到第二次识别结果。
可选地,所述利用所述语音纠错识别模型的编码层和注意力层,分别对所述语音数据的声学特征、所述第一次识别结果以及所述关键词进行编码以及注意力计算,得到所述计算结果,包括:
利用所述语音纠错识别模型的编码层,分别对每一目标对象进行编码,得到所述每一目标对象的声学高级特征;
利用所述语音纠错识别模型的注意力层,分别对所述每一目标对象相关的上一时刻的语义向量以及所述语音纠错识别模型上一时刻的输出结果,进行注意力计算,得到所述每一目标对象相关的隐层状态;
利用所述语音纠错识别模型的注意力层,分别对所述每一目标对象的声学高级特征以及所述每一目标对象相关的隐层状态,进行注意力计算,得到所述每一目标对象相关的语义向量;
其中,所述目标对象包括所述语音数据的声学特征、所述第一次识别 结果以及所述关键词。
可选地,所述利用所述语音纠错识别模型的编码层和注意力层,对所述合并向量进行编码以及注意力计算,得到所述计算结果,包括:
利用所述语音纠错识别模型的编码层,对所述合并向量进行编码,得到所述合并向量的声学高级特征;
利用所述语音纠错识别模型的注意力层,对所述合并向量相关的上一时刻的语义向量以及所述语音纠错识别模型上一时刻的输出结果,进行注意力计算,得到所述合并向量相关的隐层状态;
利用所述语音纠错识别模型的注意力层,对所述合并向量的声学高级特征以及所述合并向量相关的隐层状态,进行注意力计算,得到所述合并向量相关的语义向量。
可选地,所述根据所述第二次识别结果,确定最终的识别结果,包括:
获取所述第一次识别结果的置信度,以及,所述第二次识别结果的置信度;
从所述第一次识别结果以及所述第二次识别结果中,确定置信度高的识别结果为最终的识别结果。
在本申请的第三方面中,提供了一种语音识别纠错装置,所述装置包括:
获取单元,用于获取待识别的语音数据及其第一次识别结果;
第一语音识别单元,用于参考所述第一次识别结果的上下文信息,对所述语音数据进行第二次识别,得到第二次识别结果;
识别结果确定单元,用于根据所述第二次识别结果,确定最终的识别结果。
在本申请的第四方面中,提供了另一种语音识别纠错装置,所述装置包括:
获取单元,用于获取待识别的语音数据及其第一次识别结果;
关键词提取单元,用于从所述第一次识别结果中提取关键词;
第二语音识别单元,用于参考所述第一次识别结果的上下文信息以及 所述关键词,对所述语音数据进行第二次识别,得到第二次识别结果;
识别结果确定单元,用于根据所述第二次识别结果,确定最终的识别结果。
可选的,所述关键词提取单元,包括:
领域词汇提取单元,用于从所述第一识别结果中提取具有领域特性的词汇,作为关键词。
可选地,所述第二语音识别单元,包括:
声学特征获取单元,用于获取所述语音数据的声学特征;
模型处理单元,用于将所述语音数据的声学特征、所述第一次识别结果以及所述关键词,输入预先训练的语音纠错识别模型,得到第二次识别结果,所述语音纠错识别模型是利用纠错训练数据集对预设模型进行训练得到的;
其中,所述纠错训练数据集中包括至少一组纠错训练数据,每组纠错训练数据包括一条语音数据对应的声学特征、所述一条语音数据对应的文本、所述一条语音数据对应的第一次识别结果以及所述第一次识别结果中的关键词。
可选地,模型处理单元,包括:
编码及注意力计算单元,用于利用所述语音纠错识别模型对所述语音数据的声学特征、所述第一次识别结果以及所述关键词进行编码以及注意力计算;
识别单元,用于基于计算结果,得到第二次识别结果。
可选地,所述编码以及注意力计算单元包括第一编码以及注意力计算单元,所述识别单元包括第一解码单元;
所述第一编码以及注意力计算单元,用于利用所述语音纠错识别模型的编码层和注意力层,分别对所述语音数据的声学特征、所述第一次识别结果以及所述关键词进行编码以及注意力计算,得到所述计算结果;
所述第一解码单元,用于利用所述语音纠错识别模型的解码层,对所述计算结果进行解码,得到第二次识别结果。
可选地,所述模型处理单元还包括合并单元,所述编码以及注意力计算单元包括第二编码以及注意力计算单元,所述识别单元包括第二解码单元:
所述合并单元,用于对所述语音数据的声学特征、所述第一次识别结果以及所述关键词进行合并,得到合并向量;
所述第二编码以及注意力计算单元,用于利用所述语音纠错识别模型的编码层和注意力层,对所述合并向量进行编码以及注意力计算,得到所述计算结果;
所述第二解码单元,用于利用所述语音纠错识别模型的解码层,对所述计算结果进行解码,得到第二次识别结果。
可选地,所述第一编码以及注意力计算单元,包括:
第一编码单元,用于利用所述语音纠错识别模型的编码层,分别对每一目标对象进行编码,得到所述每一目标对象的声学高级特征;
第一注意力计算单元,用于利用所述语音纠错识别模型的注意力层,分别对所述每一目标对象相关的上一时刻的语义向量以及所述语音纠错识别模型上一时刻的输出结果,进行注意力计算,得到所述每一目标对象相关的隐层状态;以及,利用所述语音纠错识别模型的注意力层,分别对所述每一目标对象的声学高级特征以及所述每一目标对象相关的隐层状态,进行注意力计算,得到所述每一目标对象相关的语义向量;其中,所述目标对象包括所述语音数据的声学特征、所述第一次识别结果以及所述关键词。
可选地,所述第二编码以及注意力计算单元,包括:
第二编码单元,用于利用所述语音纠错识别模型的编码层,对所述合并向量进行编码,得到所述合并向量的声学高级特征;
第二注意力计算单元,用于利用所述语音纠错识别模型的注意力层,对所述合并向量相关的上一时刻的语义向量以及所述语音纠错识别模型上一时刻的输出结果,进行注意力计算,得到所述合并向量相关的隐层状态;以及,利用所述语音纠错识别模型的注意力层,对所述合并向量的声学高 级特征以及所述合并向量相关的隐层状态,进行注意力计算,得到所述合并向量相关的语义向量。
可选地,所述识别结果确定单元,包括:
置信度获取单元,用于获取所述第一次识别结果的置信度,以及,所述第二次识别结果的置信度;
确定单元,用于从所述第一次识别结果以及所述第二次识别结果中,确定置信度高的识别结果为最终的识别结果。
在本申请的第五方面中,提供了一种语音识别纠错系统,包括存储器和处理器;
所述存储器,用于存储程序;
所述处理器,用于执行所述程序,实现如上所述的语音识别纠错方法的各个步骤。
在本申请的第六方面中,提供了一种可读存储介质,其上存储有计算机程序,所述计算机程序被处理器执行时,实现如上所述的语音识别纠错方法的各个步骤。
在本申请的第七方面中,提供了一种计算机程序产品,所述计算机程序产品在终端设备上运行时,使得所述终端设备执行上述语音识别纠错方法的各个步骤。
借由上述技术方案,本申请公开了一种语音识别纠错方法、相关设备及可读存储介质,包括:获取待识别的语音数据及其第一次识别结果;并参考第一次识别结果的上下文信息,对语音数据进行第二次识别,得到第二次识别结果;最后,根据第二次识别结果,确定最终的识别结果。上述方案中,在参考第一次识别结果的上下文信息,对语音数据进行第二次识别,充分考虑了识别结果的上下文信息以及语音数据的适用场景,如果第一次识别结果有误,即可利用第二次识别对其进行纠错,因此,能够提升语音识别准确率。
在此基础上,进一步的还可以从第一次识别结果中提取关键词,基于此,可以参考第一次识别结果的上下文信息以及所述关键词,对语音数据 进行第二次识别,能够进一步提高第二次识别结果的准确率。
附图说明
通过阅读下文优选实施方式的详细描述,各种其他的优点和益处对于本领域普通技术人员将变得清楚明了。附图仅用于示出优选实施方式的目的,而并不认为是对本申请的限制。而且在整个附图中,用相同的参考符号表示相同的部件。在附图中:
图1为本申请实施例公开的一种语音识别纠错方法的流程示意图;
图2为本申请实施例公开的另一种语音识别纠错方法的流程示意图;
图3为本申请实施例公开的一种用于训练语音纠错识别模型的预设模型的拓扑结构示意图;
图4为本申请实施例公开的又一种用于训练语音纠错识别模型的预设模型的拓扑结构示意图;
图5为本申请实施例公开的一种语音识别纠错装置结构示意图;
图6为本申请实施例公开的另一种语音识别纠错装置结构示意图;
图7为本申请实施例公开的一种语音识别纠错系统的硬件结构框图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
为了提升在特殊场景(例如专业领域)下的语音识别准确率,本案发明人进行研究,起初的思路为:
搜集特殊场景下的包含领域特性词汇的文本作为语料,对传统语音识别模型做针对性优化和定制,采用经过定制和优化后的模型,对该特殊场景下的待识别语音进行识别,能够达到较高的准确率,但是,采用经过定 制和优化后的模型,对通用场景下的待识别语音进行识别,准确率相对于传统语音识别模型却有所下降。
为兼顾通用场景下以及特殊场景下的语音识别准确率,在对待识别语音进行识别之前,需要预先确定待识别语音是在通用场景下产生的,还是在特殊场景下产生的,如果确定待识别语音是在通用场景下产生的,则采用传统语音识别模型对其进行识别,如果确定待识别语音是在特殊场景下产生的,则采用经过定制和优化后的模型对其进行识别,这样,才能既保证在通用场景下的语音识别准确率,又能保证在特殊场景下的语音识别准确率。但是,对于实现语音识别的系统来说,在对待识别语音进行识别之前,是无法预先确定待识别语音是在通用场景下产生的,还是在特殊场景下产生的。
鉴于上述思路存在的问题,本案发明人进行了深入研究发现,现有语音识别技术,通常根据语音数据流给出对应的识别结果,一旦给出识别结果后,不会再去修正。但是,实际应用中会有这种情况,对语音数据流中的第一个子句识别时,由于上下文信息不够充分,导致第一个子句识别错误,而对第一个子句后面的子句识别时,由于上下文信息较为充分,则可以对第一个子句后面的子句识别正确。也就是说,同一个词语,可能在第一个子句中出现时识别错误,在第二个子句中出现时识别正确。
示例如:待识别语音内容为“来自加州索克研究所的科学家们发现,自噬反应可以抑制癌症的发生着和过去,许多人的认知恰好相反,因此那些用来抑制自噬反应的疗法可能反而会带来不好的后果。”,被识别成“来自加州索克研究所的科学家们发现,此时反应可以抑制癌症的发生着和过去,许多人的认知恰好相反,因此那些用来抑制自噬反应的疗法可能反而会带来不好的后果。”。
上述示例中,由于自噬反应第一次出现时,上文并没有太多相关内容,而自噬反应又属于较为生僻的领域词汇,导致识别错误,而自噬反应第二次出现时,上文是抑制,抑制自噬反应这个组合语言模型得分较高,因此识别正确。
基于以上研究,本案发明人发现,识别结果本身携带的上下文信息, 对识别结果的正确与否有一定的影响,因而可以基于待识别语音数据的第一次识别结果的上下文信息,对待识别语音数据进行第二次识别,得到第二次识别结果,第二次识别结果中可能会将第一次识别结果中识别错误的领域词汇纠正过来,从而提升语音识别结果的准确率。
基于以上,本案发明人提出一种语音识别纠错方法。接下来,通过下述实施例对本申请提供的语音识别纠错方法进行介绍。
请参阅图1,图1为本申请实施例公开的一种语音识别纠错方法的流程示意图,该方法可以包括:
S101:获取待识别的语音数据及其第一次识别结果。
本实施例中,待识别的语音数据为用户根据应用需求说出的语音数据,如用户发短信或聊天时使用语音输入法输入的语音数据。待识别的语音数据可以为通用领域的语音数据,也可以为特殊场景(如专业领域)的语音数据。
本申请中,可以采用多种方式获取待识别语音数据的第一次识别结果。比如,可以基于神经网络模型实现。当然,其他的获取待识别的语音数据的第一次识别结果的方式也在本申请的保护范围之内。示例如,可以预先将待识别的语音数据的第一次识别结果进行存储,当需要时,从存储介质中直接获取即可。
S102:参考所述第一次识别结果的上下文信息,对所述语音数据进行第二次识别,得到第二次识别结果。
基于前文介绍的发明人的研究成果可知,识别结果本身携带的上下文信息,对识别结果的正确与否有一定的影响,因此本实施例中可以参考第一次识别结果的上下文信息,对语音数据进行第二次识别,得到第二次识别结果。
本实施例中,参考所述第一次识别结果的上下文信息,对所述语音数据进行第二次识别,得到第二次识别结果的实现方式可以有多种。比如,可基于神经网络模型实现。
当然,其他的实现方式也在本申请的保护范围之内。示例如:可以确定所述第一次识别结果中包含的领域性词汇,并将该领域性词汇与第一次 识别结果中其它词汇进行匹配,筛选出匹配度高于设定匹配度下限值且不完全相同的词汇,利用该领域性词汇替换掉筛选出来的词汇,得到第二次识别结果。
以前述列举的例子进行说明,对于第一次识别结果:“来自加州索克研究所的科学家们发现,此时反应可以抑制癌症的发生着和过去,许多人的认知恰好相反,因此那些用来抑制自噬反应的疗法可能反而会带来不好的后果。”,从中可以提取领域性词汇,如“自噬反应”。进一步将第一次识别结果中各词汇与“自噬反应”进行匹配,发现“此时反应”的匹配度达到了50%,假定设定匹配度下限值为30%,则说明可以利用“自噬反应”替换掉第一次识别结果中的“此时反应”,进而得到第二次识别结果为:“来自加州索克研究所的科学家们发现,自噬反应可以抑制癌症的发生着和过去,许多人的认知恰好相反,因此那些用来抑制自噬反应的疗法可能反而会带来不好的后果。”
S103:根据所述第二次识别结果确定最终的识别结果。
在本申请中,可以直接将所述第二次识别结果确定为最终的识别结果。但是,某些情况下,可能第二次识别结果未必优于第一次识别结果,如果直接将第二次识别结果确定为最终的识别结果,反而会降低识别准确率。因此,这种情况下,可以从第一次识别结果以及第二次识别结果中确定一个最优的识别结果作为最终的识别结果。
从第一次识别结果以及第二次识别结果中确定一个最优的识别结果的方式可以有多种,作为一种可实施方式,可以获取所述第一次识别结果的置信度,以及,所述第二次识别结果的置信度;从所述第一次识别结果以及所述第二次识别结果中,确定置信度高的识别结果为最终的识别结果。
当然,其他方式也在本申请保护范围内,示例如,可以采用人工核查的方式从第一次识别结果以及第二次识别结果中确定一个最优的识别结果。
本实施例中公开了一种语音识别纠错方法,获取待识别的语音数据及其第一次识别结果;参考第一次识别结果的上下文信息,对语音数据进行 第二次识别,得到第二次识别结果;最后,根据第二次识别结果,确定最终的识别结果。上述方法中,在参考第一次识别结果的上下文信息,对语音数据进行第二次识别,充分考虑了识别结果的上下文信息的适用场景,如果第一次识别结果有误,即可利用第二次识别对其进行纠错,因此,能够提升语音识别准确率。
在本申请的另一些实施例中,进一步提供了另一种语音识别纠错方法,在上述实施例基础上,可以对第一次识别结果进行关键词提取,进而可以同时参考第一次识别结果的上下文信息以及关键词,对语音数据进行第二次识别,能够进一步提高第二次识别结果的准确率。具体实现过程可以参考图2所示,方法包括:
S201:获取待识别的语音数据及其第一次识别结果。
步骤S201与前述步骤S101相同,详细实施过程可以参照前文介绍,此处不再赘述。
S202:提取所述第一次识别结果中的关键词。
本实施例中,所述关键词可以是从第一识别结果中提取的具有领域特性的词汇。也即,关键词可以是在第一次识别结果中出现的跟领域相关的词汇,通常是具有领域特性的词汇。示例如:医疗领域中的自噬反应、骨牵引、肾活检等词汇;计算机领域中的前馈神经网络、池化层等。
S203:参考所述第一次识别结果的上下文信息以及所述关键词,对所述语音数据进行第二次识别,得到第二次识别结果。
本实施例中,同时参考所述第一次识别结果的上下文信息以及所述关键词,对所述语音数据进行第二次识别,得到第二次识别结果的实现方式可以有多种。比如,可基于神经网络模型实现。
当然,其他的实现方式也在本申请的保护范围之内。示例如:可以确定所述第一次识别结果中与所述关键词匹配的词汇,筛选出匹配度高于设定匹配度下限值且不完全相同的词汇,利用所述关键词替换掉筛选出来的词汇,得到第二次识别结果。
S204:根据所述第二次识别结果确定最终的识别结果。
步骤S204与前述步骤S103相同,详细实施过程可以参照前文介绍,此处不再赘述。
本实施例中公开了的语音识别纠错方法,进一步从第一次识别结果中提取了关键词,该关键词可以是具有领域特性的词汇,进而可以同时参考第一次识别结果的上下文信息以及所述关键词,对语音数据进行第二次识别,能够进一步提高第二次识别结果的准确率。
本申请中,当获取待识别的语音数据的第一次识别结果的方式为基于神经网络模型实现时,可以将所述语音数据输入预先训练的语音识别模型,得到第一次识别结果。预先训练的语音识别模型具体可以为传统语音识别模型。也可以为基于识别训练数据集对预设模型进行训练生成的语音识别模型。识别训练数据集中包括至少一组识别训练数据,每组识别训练数据中包括一条语音数据对应的文本,以及,该条语音数据的声学特征。预设模型可以为任意神经网络模型,对此,本申请不进行任何限定。
需要说明的是,当预先训练的语音识别模型为基于识别训练数据集对预设模型进行训练生成的语音识别模型时,识别训练数据集中的每个识别训练数据可以通过以下方式得到:获取一条语音数据,对该语音数据进行人工标注,得到与该语音数据对应的文本;提取该语音数据的声学特征;生成一个识别训练数据,所述识别训练数据中包括语音数据对应的文本以及该语音数据的声学特征。
在本申请中,获取语音数据的方式可以有多种,比如可以通过智能终端的麦克风接收语音数据,所述智能终端为具有语音识别功能的电子设备,如智能手机、电脑、翻译机、机器人、智能家居、智能家电等。也可以获取预先存储的语音数据。当然,获取语音数据的其他方式也在本申请的保护范围之内,对此,本申请不进行任何限定。
在本申请中,每个语音数据的声学特征可以为语音数据的频谱特征,如MFCC(Mel-Frequency Cepstral Coefficients,梅尔频率倒谱系数),或者,FBank特征等。本申请中,可以采用主流的任一声学特征提取方法提取每 个语音数据的声学特征,对此,本申请不进行任何限定。
在本申请中,训练语音识别模型时采用的预设模型可以为传统的基于attention的第三编码模块-decoder(基于注意力机制的编码解码)模型结构,也可以为其他模型结构,对此,本申请不进行任何限定。
在本申请中,基于识别训练数据对预设模型进行训练时,是以每个识别训练数据中的语音数据的声学特征作为预设模型的输入,以每个识别训练数据中的语音数据对应的文本为训练目标,对预设模型的参数进行训练的。
在本申请中,可以采用NER(Named Entity Recognition,命名实体识别)技术提取第一次识别结果中的关键词。当然,其他的提取第一次识别结果中的关键词的方式也在本申请的保护范围之内。示例如,可以采用人工的方式从第一次识别结果中提取关键词。
目前,NER(Named Entity Recognition,命名实体识别)技术可以基于神经网络模型实现。这种情况下,提取第一次识别结果中的关键词的实现方式具体可以如下:将第一次识别结果输入预先训练的关键词提取模型,得到所述第一次识别结果中的关键词。
需要说明的是,关键词提取模型可以基于提取训练数据集对预设模型结构进行训练生成,其中,提取训练数据集中包括至少一组提取训练数据,每组提取训练数据中包括一个文本,该文本中出现的具有领域特性的专业词汇已被标注。每个文本可以为特殊场景下的文本,具体可以采用人工标注的方法将每个文本中出现的具有领域特性的专业词汇打上标签实现标注。
预设模型可以为基于深度学习的BiLSTM_CRF(双向长短时记忆模型_条件随机场)模型等。
示例如,第一次识别结果为“自噬反应可以抑制癌症的发生和过去,许多人的认知正好相反,因此那些用来抑制自噬反应的疗法可能反而会带来不好的后果”,将第一次识别结果输入关键词提取模型之后,关键词提取 模型即可输出关键词:自噬反应、癌症、疗法。
另外,NER(Named Entity Recognition,命名实体识别)技术,也可以基于统计模型实现。这种情况下,提取第一次识别结果中的关键词的实现方式具体可以如下:将第一次识别结果输入统计模型,得到所述第一次识别结果中的关键词。统计模型的构建方式为目前成熟技术,对此,本申请不再赘述。
在本申请中,当参考所述第一次识别结果的上下文信息,对所述语音数据进行第二次识别,得到第二次识别结果的实现方式是基于神经网络模型时,可以将所述语音数据的声学特征、所述第一次识别结果,输入预先训练的语音纠错识别模型,得到第二次识别结果,所述语音纠错识别模型是利用纠错训练数据集对预设模型进行训练得到的;所述纠错训练数据集中包括至少一组纠错训练数据,每组纠错训练数据包括一条语音数据对应的声学特征、所述一条语音数据对应的文本及所述一条语音数据对应的第一次识别结果。
需要说明的是,在训练语音纠错识别模型时,所述一条语音数据对应的声学特征及所述一条语音数据对应的第一次识别结果为所述预设语音纠错识别模型结构的输入,所述一条语音数据对应的文本为所述预设语音纠错识别模型结构的训练目标。
其中,每组纠错训练数据可以通过以下方式得到:获取一条语音数据。对该语音数据进行人工标注,得到与该语音数据对应的文本。提取该语音数据的声学特征。将该语音数据输入预先训练的语音识别模型,得到该语音数据对应的第一次识别结果。
另一实施例中,当参考所述第一次识别结果的上下文信息以及所述关键词,对所述语音数据进行第二次识别,得到第二次识别结果的实现方式是基于神经网络模型时,可以将所述语音数据的声学特征、所述第一次识别结果以及所述关键词,输入预先训练的语音纠错识别模型,得到第二次识别结果,所述语音纠错识别模型是利用纠错训练数据集对预设模型进行 训练得到的;所述纠错训练数据集中包括至少一组纠错训练数据,每组纠错训练数据包括一条语音数据对应的声学特征、所述一条语音数据对应的文本、所述一条语音数据对应的第一次识别结果以及所述第一次识别结果中的关键词。
需要说明的是,在训练语音纠错识别模型时,所述一条语音数据对应的声学特征、所述一条语音数据对应的第一次识别结果以及所述第一次识别结果中的关键词为所述预设语音纠错识别模型结构的输入,所述一条语音数据对应的文本为所述预设语音纠错识别模型结构的训练目标。
其中,每组纠错训练数据可以通过以下方式得到:获取一条语音数据。对该语音数据进行人工标注,得到与该语音数据对应的文本。提取该语音数据的声学特征。将该语音数据输入预先训练的语音识别模型,得到该语音数据对应的第一次识别结果,将第一次识别结果输入预先训练的关键词提取模型,得到该第一次识别结果中的关键词。
基于上述可知,本申请实施例可以通过两种方式来获得第二次识别结果,该两种实现方式均是基于语音纠错识别模型实现,区别在于,两种方式的模型输入数据不同,其中第一种方式输入模型的为语音数据的声学特征及第一次识别结果,第二种方式输入模型的为语音数据的声学特征、第一次识别结果及从第一次识别结果中提取的关键词。也即,第二种方式相比第一种方式输入模型的数据更加了关键词的信息。
接下来,以第二种方式为例,对语音纠错识别模型的具体处理过程进行展开说明。
在本申请中,将所述语音数据的声学特征、所述第一次识别结果以及所述关键词,输入预先训练的语音纠错识别模型,得到第二次识别结果的具体实现方式可以为:利用所述语音纠错识别模型对所述语音数据的声学特征、所述第一次识别结果以及所述关键词进行编码以及注意力计算,并基于计算结果,得到第二次识别结果。
请参阅图3,图3为本申请实施例公开的一种用于训练语音纠错识别 模型的预设模型的拓扑结构示意图,该模型包含三层,分别是编码层、注意力层、解码层。编码层的功能是提取高级特征,注意力层的功能是计算该层的输入与最终输出结果的相关性,解码层的输入为注意力层的输出,解码层的输出为当前时刻的输出结果。解码层的具体形式可以是带有softmax的单层神经网络,本申请中不进行任何限定。
编码层可以进一步划分为三个部分,分别是第一编码模块、第二编码模块、第三编码模块。
第一编码模块、第二编码模块,第三编码模块的具体结构可以是倒金字塔结构的双向RNN(Recurrent Neural Network,递归神经网络),也可以是CNN(Convolutional Neural Networks,卷积神经网络),本申请中不进行任何限定。
注意力层也可以进一步划分为三个部分,分别是第一注意力模块,第二注意力模块,第三注意力模块。第一注意力模块,第二注意力模块,第三注意力模块的具体结构可以是双向RNN(Recurrent Neural Network,递归神经网络),也可以是单向RNN,本申请中不进行任何限定。
解码层的输入为注意力层的输出,解码层的输出为当前时刻的输出结果。Decode的具体形式可以是带有softmax的单层神经网络,本申请中不进行任何限定。
第一编码模块的输入是待识别的语音数据对应的声学特征X,输出是声学的高级特征Ha,第二编码模块的输入是待识别的语音数据对应的第一次识别结果的表征P,输出是待识别的语音数据的第一次识别结果的表征P的高级特征Hw,第三编码模块的输入是待识别的语音数据的第一次识别结果中的关键词的表征Q,输出是待识别的语音数据的第一次识别结果中的关键词的表征Q的高级特征Hr。
上一时刻的输出结果y i-1是第一注意力模块,第二注意力模块,第三注意力模块的一个公共的输入,除此之外,每个部分还有不同的输入和输出,其中,第一注意力模块的输入是Ha,输出是语音相关的隐层状态sa i和语义向量ca i,第二注意力模块的输入是Hw,输出是第一次识别结果相 关的隐层状态sw i和语义向量cw i,第三注意力模块的输入是Hr,输出是第一次识别结果中的关键词相关的隐层状态sr i和语义向量cr i
解码层的输入为注意力层的输出sa i、ca i、sw i、cw i、sr i、cr i,解码层的输出为当前时刻的输出结果y i,y i为待识别的语音数据的识别结果。
一般情况下,在训练阶段,P(y i)大于预设阈值时,即可认为训练结束,P(y i)表示当前时刻输出结果是y i的概率,P(y i)=Decode(sa i,sw i,sr i,ca i,cw i,cr i)。
基于上述模型,作为一种可实施方式,在本申请中,利用所述语音纠错识别模型对所述语音数据的声学特征、所述第一次识别结果以及所述关键词进行编码以及注意力计算,并基于计算结果,得到第二次识别结果的具体实现方式可以为:利用所述语音纠错识别模型的编码层和注意力层,分别对所述语音数据的声学特征、所述第一次识别结果以及所述关键词进行编码和注意力计算,得到所述计算结果;利用所述语音纠错识别模型的解码层,对所述计算结果进行解码,得到第二次识别结果。
其中,利用所述语音纠错识别模型的编码层和注意力层,分别对所述语音数据的声学特征、所述第一次识别结果以及所述关键词进行编码以及注意力计算,得到所述计算结果的实现方式可以如下:利用所述语音纠错识别模型的编码层,分别对每一目标对象进行编码,得到所述每一目标对象的声学高级特征;利用所述语音纠错识别模型的注意力层,分别对所述每一目标对象相关的上一时刻的语义向量以及所述语音纠错识别模型上一时刻的输出结果,进行注意力计算,得到所述每一目标对象相关的隐层状态;
利用所述语音纠错识别模型的注意力层,分别对所述每一目标对象的声学高级特征以及所述每一目标对象相关的隐层状态,进行注意力计算,得到所述每一目标对象相关的语义向量;
其中,所述目标对象包括所述语音数据的声学特征、所述第一次识别结果以及所述关键词。
具体过程如下:
利用第一编码模块对所述语音数据的声学特征进行编码,得到所述语音数据的声学高级特征;利用第一注意力模块对与所述语音数据相关的上一时刻的语义向量以及所述语音纠错识别模型上一时刻的输出结果进行注意力计算,得到所述与所述语音数据相关的隐层状态;利用第一注意力模块对所述语音数据的声学高级特征以及所述与所述语音数据相关的隐层状态进行注意力计算,得到所述与所述语音数据相关的语义向量。
利用第二编码模块对所述第一次识别结果进行编码,得到所述第一次识别结果的高级特征;利用第二注意力模块对与所述第一次识别结果相关的上一时刻的语义向量以及所述语音纠错识别模型上一时刻的输出结果进行注意力计算,得到所述与所述第一次识别结果相关的隐层状态;利用第二注意力模块对所述第一次识别结果的高级特征以及所述与所述第一次识别结果相关的隐层状态进行注意力计算,得到所述与所述第一次识别结果相关的语义向量。
利用第三编码模块对所述关键词进行编码,得到所述关键词的高级特征;利用第三注意力模块对与所述关键词相关的上一时刻的语义向量以及所述语音纠错识别模型上一时刻的输出结果进行注意力计算,得到所述与所述关键词相关的隐层状态;利用第三注意力模块对所述关键词的高级特征以及所述与所述关键词相关的隐层状态进行注意力计算,得到所述与所述关键词相关的语义向量。
可以理解的是,上述示例的是输入数据为语音数据的声学特征、所述第一次识别结果以及所述关键词时,语音纠错识别模型的一种可选处理过程。当输入数据为语音数据的声学特征及所述第一次识别结果时,则可以将图3中所有涉及关键词的模型结构及处理流程省略掉,也即语音纠错识别模型中去除第三编码模块及第三注意力模型,其余模型结构保持不变即可,具体流程可以参照前文介绍,此处不再赘述。
进一步,仍以第二种方式为例,参阅图4,图4为本申请实施例公开的又一种用于训练语音纠错识别模型的预设模型的拓扑结构示意图,该模 型包含三层,分别是编码层、注意力层、解码层。Encode层的功能是提取高级特征,注意力层的功能是计算该层的输入与最终输出结果的相关性,解码层的输入为注意力层的输出,解码层的输出为当前时刻的输出结果。Decode的具体形式可以是带有softmax的单层神经网络,本申请中不进行任何限定。
编码层的输入是由待识别的语音数据对应的声学特征X、待识别的语音数据对应的第一次识别结果的表征P以及待识别的语音数据的第一次识别结果中的关键词的表征Q组成的合并向量[X,P,Q]。编码层的输出是由声学特征的高级特征Ha、待识别的语音数据的第一次识别结果的表征P的高级特征Hw以及待识别的语音数据的第一次识别结果中的关键词的表征Q的高级特征Hr组成的合并向量[Ha,Hw,Hr]。
编码层的输出以及模型上一时刻的输出结果y i-1是注意力层的输入,注意力层的输出是由语音相关的隐层状态sa i和语义向量ca i、第一次识别结果相关的隐层状态sw i和语义向量cw i、第一次识别结果中的关键词相关的隐层状态sr i和语义向量cr i组成的向量[sa i,ca i,sw i,cw i,sr i,cr i]。
解码层的输入为注意力层的输出,解码层的输出为当前时刻的输出结果y i,y i为待识别的语音数据的识别结果。
基于上述模型,作为一种可实施方式,在本申请中,利用所述语音纠错识别模型对所述语音数据的声学特征、所述第一次识别结果以及所述关键词进行编码以及注意力计算,并基于计算结果,得到第二次识别结果的具体实现方式可以为:对所述语音数据的声学特征、所述第一次识别结果以及所述关键词进行合并,得到合并向量;利用所述语音纠错识别模型的编码层和注意力层,对所述合并向量进行编码以及注意力计算,得到所述计算结果;利用所述语音纠错识别模型的解码层,对所述计算结果进行解码,得到第二次识别结果。
其中,利用所述语音纠错识别模型的编码层和注意力层,对所述合并向量进行编码以及注意力计算,得到所述计算结果的实现方式可以如下:
利用所述语音纠错识别模型的编码层,对所述合并向量进行编码,得 到所述合并向量的声学高级特征;
利用所述语音纠错识别模型的注意力层,对所述合并向量相关的上一时刻的语义向量以及所述语音纠错识别模型上一时刻的输出结果,进行注意力计算,得到所述合并向量相关的隐层状态;
利用所述语音纠错识别模型的注意力层,对所述合并向量的声学高级特征以及所述合并向量相关的隐层状态,进行注意力计算,得到所述合并向量相关的语义向量。
需要说明的是,传统语音识别模型,注意力层的主要关注点是传统语音识别模型的输出结果和语音数据的声学特征的相关性,本申请语音纠错识别模型中将语音数据的第一次识别结果和第一次识别结果中的关键词融入注意力层,使得语音纠错识别模型的输出结果可以关注到识别结果的纠错信息和识别结果的上下文信息。之所以这么做是希望语音纠错识别模型可以学习到输出结果与上下文信息相关的注意力机制、以及,输出结果与纠错相关的注意力机制,通过上述两个注意力机制找到当前语音数据需要关注的上下文信息和纠错信息,即根据输入语音数据能够自动选择是否需要关注第一次识别结果和第一次识别结果中的关键词信息,也就是相当于语音纠错识别模型具有了根据第一次识别结果和第一次识别结果中的关键词自动纠正错误的能力。
可以理解的是,上述示例的是输入数据为语音数据的声学特征、所述第一次识别结果以及所述关键词时,语音纠错识别模型的另一种可选处理过程。当输入数据为语音数据的声学特征及所述第一次识别结果时,则图4中编码层的输入是由待识别的语音数据对应的声学特征X、待识别的语音数据对应的第一次识别结果的表征P组成的合并向量[X,P],编码层的输出是由声学特征的高级特征Ha及待识别的语音数据的第一次识别结果的表征P的高级特征Hw组成的合并向量[Ha,Hw]。进一步的,注意力层的输出结果为由语音相关的隐层状态sa i和语义向量ca i、第一次识别结果相关的隐层状态sw i和语义向量cw i组成的向量[sa i,ca i,sw i,cw i]。解码层的输入为注意力层的输出,解码层的输出为当前时刻的输出结果y i,y i 为待识别的语音数据的识别结果。
也即,当输入模型的数据减少了关键词的信息时,区别仅在于编码层的输入组合向量中去掉关键词信息,模型的其余各层针对编码层的输入,参考原有处理逻辑进行处理即可,具体流程可以参照前文介绍,此处不再赘述。
另外,本申请还给出了一种生成识别训练数据集以及纠错训练数据集的实现方式,具体如下:
收集用于训练语音识别模型以及语音纠错识别模型的语音数据,这部分语音数据可以通过智能终端的麦克风接收到,智能终端为具有语音识别功能的电子设备,如智能手机、电脑、翻译机、机器人、智能家居(家电)等。再由人工标注每条语音数据,即将每条语音数据通过人工转写成对应的文本数据。再提取每条语音数据的声学特征,所述声学特征一般为语音数据的频谱特征,如MFCC或者FBank等特征,该声学特征的具体获取方法为现有方法,这里不再赘余介绍。最终,得到语音数据的声学特征以及语音数据对应的人工标注文本。
将上述步骤得到的语音数据的声学特征以及语音数据对应的人工标注文本分为两部分,在本申请中,第一部分用A集表示,第二部分用B集表示。示例如,上述步骤得到的语音数据的声学特征以及语音数据对应的人工标注文本共有100万组,随机将这100万组分成等量的两个集合,分别为A集和B集。A集和B集中均包括多组训练数据,每组训练数据包括一条语音数据对应的声学特征以及该语音数据对应的人工标注文本。
使用A集作为识别训练数据集,训练得到语音识别模型。
将B集输入训练好的语音识别模型,得到B集对应的识别结果,再将B集对应的识别结果输入关键词提取模型,得到B集对应的识别结果中的关键词;B集对应的声学特征、人工标注文本、识别结果以及关键词组成C集,C集中包括多组训练数据,每组训练数据包括一条语音数据对应的声学特征、该语音数据对应的人工标注文本、该语音数据对应的识别结果 以及该识别结果中的关键词。
使用C集作为纠错训练数据集,训练得到为语音纠错识别模型。
进一步需要说明的是,还可以将B集输入训练好的语音识别模型,得到B集对应的Nbest的识别结果,再将每个识别结果输入关键词提取模型,得到每个识别结果中的关键词。如果B集数据有n条语音数据,每条语音均有Nbest识别结果,最终能获取到n*N组训练数据。这样处理,能够使纠错训练数据集更为丰富,提升语音纠错识别模型的覆盖度。
可以理解的是,上述识别训练数据集以及纠错训练数据集均包含了关键词,若语音纠错识别模型的输入数据仅需要语音数据的声学特征及第一次识别结果,也即不包含关键词信息时,则可以省略掉上述流程中获得关键词的步骤,则最终得到的识别训练数据集以及纠错训练数据集不需要包含关键词。
下面对本申请实施例公开的语音识别纠错装置进行描述,下文描述的语音识别纠错装置与上文描述的语音识别纠错方法可相互对应参照。
参照图5,图5为本申请实施例公开的一种语音识别纠错装置结构示意图。如图5所示,该语音识别纠错装置可以包括:
获取单元51,用于获取待识别的语音数据及其第一次识别结果;
第一语音识别单元52,用于参考所述第一次识别结果的上下文信息,对所述语音数据进行第二次识别,得到第二次识别结果;
识别结果确定单元53,用于根据所述第二次识别结果,确定最终的识别结果。
在本申请的另一些实施例中,公开了另外一种语音识别纠错装置,如图6所示,该语音识别纠错装置可以包括:
获取单元51,用于获取待识别的语音数据及其第一次识别结果;
关键词提取单元54,用于从所述第一次识别结果中提取关键词;
第二语音识别单元55,用于参考所述第一次识别结果的上下文信息以及所述关键词,对所述语音数据进行第二次识别,得到第二次识别结果;
识别结果确定单元53,用于根据所述第二次识别结果,确定最终的识别结果。
可选的,所述关键词提取单元,包括:
领域词汇提取单元,用于从所述第一识别结果中提取具有领域特性的词汇,作为关键词。
可选地,所述第二语音识别单元,包括:
声学特征获取单元,用于获取所述语音数据的声学特征;
模型处理单元,用于将所述语音数据的声学特征、所述第一次识别结果以及所述关键词,输入预先训练的语音纠错识别模型,得到第二次识别结果,所述语音纠错识别模型是利用纠错训练数据集对预设模型进行训练得到的;
其中,所述纠错训练数据集中包括至少一组纠错训练数据,每组纠错训练数据包括一条语音数据对应的声学特征、所述一条语音数据对应的文本、所述一条语音数据对应的第一次识别结果以及所述第一次识别结果中的关键词。
可选地,模型处理单元,包括:
编码及注意力计算单元,用于利用所述语音纠错识别模型对所述语音数据的声学特征、所述第一次识别结果以及所述关键词进行编码以及注意力计算;
识别单元,用于基于计算结果,得到第二次识别结果。
可选地,所述编码以及注意力计算单元包括第一编码以及注意力计算单元,所述识别单元包括第一解码单元;
所述第一编码以及注意力计算单元,用于利用所述语音纠错识别模型的编码层和注意力层,分别对所述语音数据的声学特征、所述第一次识别结果以及所述关键词进行编码以及注意力计算,得到所述计算结果;
所述第一解码单元,用于利用所述语音纠错识别模型的解码层,对所述计算结果进行解码,得到第二次识别结果。
可选地,所述模型处理单元还包括合并单元,所述编码以及注意力计 算单元包括第二编码以及注意力计算单元,所述识别单元包括第二解码单元:
所述合并单元,用于对所述语音数据的声学特征、所述第一次识别结果以及所述关键词进行合并,得到合并向量;
所述第二编码以及注意力计算单元,用于利用所述语音纠错识别模型的编码层和注意力层,对所述合并向量进行编码以及注意力计算,得到所述计算结果;
所述第二解码单元,用于利用所述语音纠错识别模型的解码层,对所述计算结果进行解码,得到第二次识别结果。
可选地,所述第一编码以及注意力计算单元,包括:
第一编码单元,用于利用所述语音纠错识别模型的编码层,分别对每一目标对象进行编码,得到所述每一目标对象的声学高级特征;
第一注意力计算单元,用于利用所述语音纠错识别模型的注意力层,分别对所述每一目标对象相关的上一时刻的语义向量以及所述语音纠错识别模型上一时刻的输出结果,进行注意力计算,得到所述每一目标对象相关的隐层状态;以及,利用所述语音纠错识别模型的注意力层,分别对所述每一目标对象的声学高级特征以及所述每一目标对象相关的隐层状态,进行注意力计算,得到所述每一目标对象相关的语义向量;其中,所述目标对象包括所述语音数据的声学特征、所述第一次识别结果以及所述关键词。
可选地,所述第二编码以及注意力计算单元,包括:
第二编码单元,用于利用所述语音纠错识别模型的编码层,对所述合并向量进行编码,得到所述合并向量的声学高级特征;
第二注意力计算单元,用于利用所述语音纠错识别模型的注意力层,对所述合并向量相关的上一时刻的语义向量以及所述语音纠错识别模型上一时刻的输出结果,进行注意力计算,得到所述合并向量相关的隐层状态;以及,利用所述语音纠错识别模型的注意力层,对所述合并向量的声学高级特征以及所述合并向量相关的隐层状态,进行注意力计算,得到所述合 并向量相关的语义向量。
可选地,所述识别结果确定单元,包括:
置信度获取单元,用于获取所述第一次识别结果的置信度,以及,所述第二次识别结果的置信度;
确定单元,用于从所述第一次识别结果以及所述第二次识别结果中,确定置信度高的识别结果为最终的识别结果。
图7示出了语音识别纠错系统的硬件结构框图,参照图7,语音识别纠错系统的硬件结构可以包括:至少一个处理器1,至少一个通信接口2,至少一个存储器3和至少一个通信总线4;
在本申请实施例中,处理器1、通信接口2、存储器3、通信总线4的数量为至少一个,且处理器1、通信接口2、存储器3通过通信总线4完成相互间的通信;
处理器1可能是一个中央处理器CPU,或者是特定集成电路ASIC(Application Specific Integrated Circuit),或者是被配置成实施本发明实施例的一个或多个集成电路等;
存储器3可能包含高速RAM存储器,也可能还包括非易失性存储器(non-volatile memory)等,例如至少一个磁盘存储器;
其中,存储器存储有程序,处理器可调用存储器存储的程序,所述程序用于:
获取待识别的语音数据及其第一次识别结果;
参考所述第一次识别结果的上下文信息,对所述语音数据进行第二次识别,得到第二次识别结果;
根据所述第二次识别结果,确定最终的识别结果。
或者,所述程序可以用于:
获取待识别的语音数据及其第一次识别结果;
从所述第一次识别结果中提取关键词;
参考所述第一次识别结果的上下文信息以及所述关键词,对所述语音数据进行第二次识别,得到第二次识别结果;
根据所述第二次识别结果,确定最终的识别结果。
可选的,所述程序的细化功能和扩展功能可参照上文描述。
本申请实施例还提供一种存储介质,该存储介质可存储有适于处理器执行的程序,所述程序用于:
获取待识别的语音数据及其第一次识别结果;
参考所述第一次识别结果的上下文信息,对所述语音数据进行第二次识别,得到第二次识别结果;
根据所述第二次识别结果,确定最终的识别结果。
或者,所述程序可以用于:
获取待识别的语音数据及其第一次识别结果;
从所述第一次识别结果中提取关键词;
参考所述第一次识别结果的上下文信息以及所述关键词,对所述语音数据进行第二次识别,得到第二次识别结果;
根据所述第二次识别结果,确定最终的识别结果。
可选的,所述程序的细化功能和扩展功能可参照上文描述。
进一步地,本申请实施例还提供了一种计算机程序产品,所述计算机程序产品在终端设备上运行时,使得所述终端设备执行上述语音识别纠错方法中的任意一种实现方式。
最后,还需要说明的是,在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。
本说明书中各个实施例采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间可以相互结合,且相同相 似部分互相参见即可。
对所公开的实施例的上述说明,使本领域专业技术人员能够实现或使用本申请。对这些实施例的多种修改对本领域的专业技术人员来说将是显而易见的,本文中所定义的一般原理可以在不脱离本申请的精神或范围的情况下,在其它实施例中实现。因此,本申请将不会被限制于本文所示的这些实施例,而是要符合与本文所公开的原理和新颖特点相一致的最宽的范围。

Claims (15)

  1. 一种语音识别纠错方法,其特征在于,所述方法包括:
    获取待识别的语音数据及其第一次识别结果;
    参考所述第一次识别结果的上下文信息,对所述语音数据进行第二次识别,得到第二次识别结果;
    根据所述第二次识别结果,确定最终的识别结果。
  2. 一种语音识别纠错方法,其特征在于,所述方法包括:
    获取待识别的语音数据及其第一次识别结果;
    从所述第一次识别结果中提取关键词;
    参考所述第一次识别结果的上下文信息以及所述关键词,对所述语音数据进行第二次识别,得到第二次识别结果;
    根据所述第二次识别结果,确定最终的识别结果。
  3. 根据权利要求2所述的方法,其特征在于,所述从所述第一次识别结果中提取关键词,包括:
    从所述第一识别结果中提取具有领域特性的词汇,作为关键词。
  4. 根据权利要求2所述的方法,其特征在于,所述参考所述第一次识别结果的上下文信息以及所述关键词,对所述语音数据进行第二次识别,得到第二次识别结果,包括:
    获取所述语音数据的声学特征;
    将所述语音数据的声学特征、所述第一次识别结果以及所述关键词,输入预先训练的语音纠错识别模型,得到第二次识别结果,所述语音纠错识别模型是利用纠错训练数据集对预设模型进行训练得到的;
    其中,所述纠错训练数据集中包括至少一组纠错训练数据,每组纠错训练数据包括一条语音数据对应的声学特征、所述一条语音数据对应的文本、所述一条语音数据对应的第一次识别结果以及所述第一次识别结果中的关键词。
  5. 根据权利要求4所述的方法,其特征在于,所述将所述语音数据的声学特征、所述第一次识别结果以及所述关键词,输入预先训练的语音纠 错识别模型,得到第二次识别结果,包括:
    利用所述语音纠错识别模型对所述语音数据的声学特征、所述第一次识别结果以及所述关键词进行编码以及注意力计算,并基于计算结果,得到第二次识别结果。
  6. 根据权利要求5所述的方法,其特征在于,所述利用所述语音纠错识别模型对所述语音数据的声学特征、所述第一次识别结果以及所述关键词进行编码以及注意力计算,并基于计算结果,得到第二次识别结果,包括:
    利用所述语音纠错识别模型的编码层和注意力层,分别对所述语音数据的声学特征、所述第一次识别结果以及所述关键词进行编码以及注意力计算,得到所述计算结果;
    利用所述语音纠错识别模型的解码层,对所述计算结果进行解码,得到第二次识别结果。
  7. 根据权利要求5所述的方法,其特征在于,所述利用所述语音纠错识别模型对所述语音数据的声学特征、所述第一次识别结果以及所述关键词进行编码以及注意力计算,并基于计算结果,得到第二次识别结果,包括:
    对所述语音数据的声学特征、所述第一次识别结果以及所述关键词进行合并,得到合并向量;
    利用所述语音纠错识别模型的编码层和注意力层,对所述合并向量进行编码以及注意力计算,得到所述计算结果;
    利用所述语音纠错识别模型的解码层,对所述计算结果进行解码,得到第二次识别结果。
  8. 根据权利要求6所述的方法,其特征在于,所述利用所述语音纠错识别模型的编码层和注意力层,分别对所述语音数据的声学特征、所述第一次识别结果以及所述关键词进行编码以及注意力计算,得到所述计算结果,包括:
    利用所述语音纠错识别模型的编码层,分别对每一目标对象进行编码, 得到所述每一目标对象的声学高级特征;
    利用所述语音纠错识别模型的注意力层,分别对所述每一目标对象相关的上一时刻的语义向量以及所述语音纠错识别模型上一时刻的输出结果,进行注意力计算,得到所述每一目标对象相关的隐层状态;
    利用所述语音纠错识别模型的注意力层,分别对所述每一目标对象的声学高级特征以及所述每一目标对象相关的隐层状态,进行注意力计算,得到所述每一目标对象相关的语义向量;
    其中,所述目标对象包括所述语音数据的声学特征、所述第一次识别结果以及所述关键词。
  9. 根据权利要求7所述的方法,其特征在于,所述利用所述语音纠错识别模型的编码层和注意力层,对所述合并向量进行编码以及注意力计算,得到所述计算结果,包括:
    利用所述语音纠错识别模型的编码层,对所述合并向量进行编码,得到所述合并向量的声学高级特征;
    利用所述语音纠错识别模型的注意力层,对所述合并向量相关的上一时刻的语义向量以及所述语音纠错识别模型上一时刻的输出结果,进行注意力计算,得到所述合并向量相关的隐层状态;
    利用所述语音纠错识别模型的注意力层,对所述合并向量的声学高级特征以及所述合并向量相关的隐层状态,进行注意力计算,得到所述合并向量相关的语义向量。
  10. 根据权利要求2所述的方法,其特征在于,所述根据所述第二次识别结果,确定最终的识别结果,包括:
    获取所述第一次识别结果的置信度,以及,所述第二次识别结果的置信度;
    从所述第一次识别结果以及所述第二次识别结果中,确定置信度高的识别结果为最终的识别结果。
  11. 一种语音识别纠错装置,其特征在于,所述装置包括:
    获取单元,用于获取待识别的语音数据及其第一次识别结果;
    第一语音识别单元,用于参考所述第一次识别结果的上下文信息,对所述语音数据进行第二次识别,得到第二次识别结果;
    识别结果确定单元,用于根据所述第二次识别结果,确定最终的识别结果。
  12. 一种语音识别纠错装置,其特征在于,所述装置包括:
    获取单元,用于获取待识别的语音数据及其第一次识别结果;
    关键词提取单元,用于从所述第一次识别结果中提取关键词;
    第二语音识别单元,用于参考所述第一次识别结果的上下文信息以及所述关键词,对所述语音数据进行第二次识别,得到第二次识别结果;
    识别结果确定单元,用于根据所述第二次识别结果,确定最终的识别结果。
  13. 一种语音识别纠错系统,其特征在于,包括存储器和处理器;
    所述存储器,用于存储程序;
    所述处理器,用于执行所述程序,实现如权利要求1至10中任一项所述的语音识别纠错方法的各个步骤。
  14. 一种可读存储介质,其上存储有计算机程序,其特征在于,所述计算机程序被处理器执行时,实现如权利要求1至10中任一项所述的语音识别纠错方法的各个步骤。
  15. 一种计算机程序产品,所述计算机程序产品在终端设备上运行时,使得所述终端设备执行权利要求1至10中任一项所述的方法。
PCT/CN2020/129314 2019-11-25 2020-11-17 语音识别纠错方法、相关设备及可读存储介质 WO2021104102A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
JP2022522366A JP7514920B2 (ja) 2019-11-25 2020-11-17 音声認識誤り訂正方法、関連装置及び読取可能な記憶媒体
KR1020227005374A KR102648306B1 (ko) 2019-11-25 2020-11-17 음성 인식 오류 정정 방법, 관련 디바이스들, 및 판독 가능 저장 매체
US17/773,641 US20220383853A1 (en) 2019-11-25 2020-11-17 Speech recognition error correction method, related devices, and readable storage medium
EP20894295.3A EP4068280A4 (en) 2019-11-25 2020-11-17 VOICE RECOGNITION ERROR CORRECTION METHOD, ASSOCIATED APPARATUS AND READABLE STORAGE MEDIUM

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911167009.0 2019-11-25
CN201911167009.0A CN110956959B (zh) 2019-11-25 2019-11-25 语音识别纠错方法、相关设备及可读存储介质

Publications (1)

Publication Number Publication Date
WO2021104102A1 true WO2021104102A1 (zh) 2021-06-03

Family

ID=69978361

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/129314 WO2021104102A1 (zh) 2019-11-25 2020-11-17 语音识别纠错方法、相关设备及可读存储介质

Country Status (5)

Country Link
US (1) US20220383853A1 (zh)
EP (1) EP4068280A4 (zh)
KR (1) KR102648306B1 (zh)
CN (1) CN110956959B (zh)
WO (1) WO2021104102A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113421553A (zh) * 2021-06-15 2021-09-21 北京天行汇通信息技术有限公司 音频挑选的方法、装置、电子设备和可读存储介质

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110956959B (zh) * 2019-11-25 2023-07-25 科大讯飞股份有限公司 语音识别纠错方法、相关设备及可读存储介质
CN111627457A (zh) * 2020-05-13 2020-09-04 广州国音智能科技有限公司 语音分离方法、系统及计算机可读存储介质
CN111583909B (zh) * 2020-05-18 2024-04-12 科大讯飞股份有限公司 一种语音识别方法、装置、设备及存储介质
CN111754987A (zh) * 2020-06-23 2020-10-09 国投(宁夏)大数据产业发展有限公司 一种大数据分析语音识别方法
CN112016305B (zh) * 2020-09-09 2023-03-28 平安科技(深圳)有限公司 文本纠错方法、装置、设备及存储介质
CN112259100B (zh) * 2020-09-15 2024-04-09 科大讯飞华南人工智能研究院(广州)有限公司 语音识别方法及相关模型的训练方法和相关设备、装置
CN112133453B (zh) * 2020-09-16 2022-08-26 成都美透科技有限公司 一种基于医美数据和医疗数据的用户咨询辅助分析系统
CN112257437B (zh) * 2020-10-20 2024-02-13 中国科学技术大学 语音识别纠错方法、装置、电子设备和存储介质
CN112435671B (zh) * 2020-11-11 2021-06-29 深圳市小顺智控科技有限公司 汉语精准识别的智能化语音控制方法及系统
CN112489651B (zh) * 2020-11-30 2023-02-17 科大讯飞股份有限公司 语音识别方法和电子设备、存储装置
CN114678027A (zh) * 2020-12-24 2022-06-28 深圳Tcl新技术有限公司 语音识别结果的纠错方法、装置、终端设备及存储介质
CN113035175B (zh) * 2021-03-02 2024-04-12 科大讯飞股份有限公司 一种语音文本重写模型构建方法、语音识别方法
CN113257227B (zh) * 2021-04-25 2024-03-01 平安科技(深圳)有限公司 语音识别模型性能检测方法、装置、设备及存储介质
CN113409767B (zh) * 2021-05-14 2023-04-25 北京达佳互联信息技术有限公司 一种语音处理方法、装置、电子设备及存储介质
CN113221580B (zh) * 2021-07-08 2021-10-12 广州小鹏汽车科技有限公司 语义拒识方法、语义拒识装置、交通工具及介质
WO2024029845A1 (ko) * 2022-08-05 2024-02-08 삼성전자주식회사 전자 장치 및 이의 음성 인식 방법
US11657803B1 (en) * 2022-11-02 2023-05-23 Actionpower Corp. Method for speech recognition by using feedback information
CN116991874B (zh) * 2023-09-26 2024-03-01 海信集团控股股份有限公司 一种文本纠错、基于大模型的sql语句生成方法及设备
CN117238276B (zh) * 2023-11-10 2024-01-30 深圳市托普思维商业服务有限公司 一种基于智能化语音数据识别的分析纠正系统
CN117558263B (zh) * 2024-01-10 2024-04-26 科大讯飞股份有限公司 语音识别方法、装置、设备及可读存储介质


Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4867654B2 (ja) * 2006-12-28 2012-02-01 日産自動車株式会社 音声認識装置、および音声認識方法
JP4709887B2 (ja) * 2008-04-22 2011-06-29 株式会社エヌ・ティ・ティ・ドコモ 音声認識結果訂正装置および音声認識結果訂正方法、ならびに音声認識結果訂正システム
CN101876975A (zh) * 2009-11-04 2010-11-03 中国科学院声学研究所 汉语地名的识别方法
CN102592595B (zh) * 2012-03-19 2013-05-29 安徽科大讯飞信息科技股份有限公司 语音识别方法及系统
CN103366741B (zh) * 2012-03-31 2019-05-17 上海果壳电子有限公司 语音输入纠错方法及系统
US9818401B2 (en) * 2013-05-30 2017-11-14 Promptu Systems Corporation Systems and methods for adaptive proper name entity recognition and understanding
US9940927B2 (en) * 2013-08-23 2018-04-10 Nuance Communications, Inc. Multiple pass automatic speech recognition methods and apparatus
KR102199444B1 (ko) * 2014-11-24 2021-01-07 에스케이텔레콤 주식회사 음성 인식 오류에 강인한 의미 추론 방법 및 이를 위한 장치
KR102380833B1 (ko) * 2014-12-02 2022-03-31 삼성전자주식회사 음성 인식 방법 및 음성 인식 장치
CN107391504B (zh) * 2016-05-16 2021-01-29 华为技术有限公司 新词识别方法与装置
KR20180071029A (ko) * 2016-12-19 2018-06-27 삼성전자주식회사 음성 인식 방법 및 장치
CN107437416B (zh) * 2017-05-23 2020-11-17 创新先进技术有限公司 一种基于语音识别的咨询业务处理方法及装置
CN107293296B (zh) * 2017-06-28 2020-11-20 百度在线网络技术(北京)有限公司 语音识别结果纠正方法、装置、设备及存储介质
JP2019113636A (ja) 2017-12-22 2019-07-11 オンキヨー株式会社 音声認識システム
JP6985138B2 (ja) 2017-12-28 2021-12-22 株式会社イトーキ 音声認識システム及び音声認識方法
US11482213B2 (en) * 2018-07-20 2022-10-25 Cisco Technology, Inc. Automatic speech recognition correction
CN113016030A (zh) * 2018-11-06 2021-06-22 株式会社赛斯特安国际 提供语音识别服务的方法及装置
US11017778B1 (en) * 2018-12-04 2021-05-25 Sorenson Ip Holdings, Llc Switching between speech recognition systems
CN110110041B (zh) * 2019-03-15 2022-02-15 平安科技(深圳)有限公司 错词纠正方法、装置、计算机装置及存储介质
US11636853B2 (en) * 2019-08-20 2023-04-25 Soundhound, Inc. Natural language grammar improvement

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003308094A (ja) * 2002-02-12 2003-10-31 Advanced Telecommunication Research Institute International 音声認識における認識誤り箇所の訂正方法
CN101042867A (zh) * 2006-03-24 2007-09-26 株式会社东芝 语音识别设备和方法
US20140195226A1 (en) * 2013-01-04 2014-07-10 Electronics And Telecommunications Research Institute Method and apparatus for correcting error in speech recognition system
CN106875943A (zh) * 2017-01-22 2017-06-20 上海云信留客信息科技有限公司 一种用于大数据分析的语音识别系统
CN107093423A (zh) * 2017-05-27 2017-08-25 努比亚技术有限公司 一种语音输入修正方法、装置及计算机可读存储介质
CN109065054A (zh) * 2018-08-31 2018-12-21 出门问问信息科技有限公司 语音识别纠错方法、装置、电子设备及可读存储介质
CN110021293A (zh) * 2019-04-08 2019-07-16 上海汽车集团股份有限公司 语音识别方法及装置、可读存储介质
CN110956959A (zh) * 2019-11-25 2020-04-03 科大讯飞股份有限公司 语音识别纠错方法、相关设备及可读存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4068280A4

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113421553A (zh) * 2021-06-15 2021-09-21 北京天行汇通信息技术有限公司 音频挑选的方法、装置、电子设备和可读存储介质
CN113421553B (zh) * 2021-06-15 2023-10-20 北京捷通数智科技有限公司 音频挑选的方法、装置、电子设备和可读存储介质

Also Published As

Publication number Publication date
JP2022552662A (ja) 2022-12-19
EP4068280A1 (en) 2022-10-05
KR102648306B1 (ko) 2024-03-15
CN110956959B (zh) 2023-07-25
CN110956959A (zh) 2020-04-03
US20220383853A1 (en) 2022-12-01
KR20220035222A (ko) 2022-03-21
EP4068280A4 (en) 2023-11-01


Legal Events

121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20894295; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 20227005374; Country of ref document: KR; Kind code of ref document: A)
ENP Entry into the national phase (Ref document number: 2022522366; Country of ref document: JP; Kind code of ref document: A)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2020894295; Country of ref document: EP; Effective date: 20220627)