WO2021136029A1 - Re-scoring model training method and device, and speech recognition method and device - Google Patents

Re-scoring model training method and device, and speech recognition method and device

Info

Publication number
WO2021136029A1
WO2021136029A1 PCT/CN2020/138536 CN2020138536W WO2021136029A1 WO 2021136029 A1 WO2021136029 A1 WO 2021136029A1 CN 2020138536 W CN2020138536 W CN 2020138536W WO 2021136029 A1 WO2021136029 A1 WO 2021136029A1
Authority
WO
WIPO (PCT)
Prior art keywords
recognition result
voice data
sample
speech recognition
voice
Prior art date
Application number
PCT/CN2020/138536
Other languages
English (en)
French (fr)
Inventor
李安
陈江
胡正伦
傅正佳
Original Assignee
百果园技术(新加坡)有限公司
李安
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 百果园技术(新加坡)有限公司 and 李安
Publication of WO2021136029A1 publication Critical patent/WO2021136029A1/zh

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/69 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for evaluating synthetic or decoded voice signals

Definitions

  • the embodiments of the present application relate to the technical field of speech recognition, such as a re-scoring model training method, a re-scoring model training device, a voice recognition method, a voice recognition device, equipment, and storage medium.
  • Automatic Speech Recognition (ASR) is a technology that converts speech into text; ASR can be applied in scenarios such as speech translation, human-computer interaction, and smart home.
  • in the decoding process of speech recognition, multiple speech recognition results can be obtained from one piece of speech data. For example, if the speech content is "我是好学生" ("I am a good student"), decoding may produce multiple candidates such as "握是号学声", "窝时浩学升", "卧室好学生" and "我是好学生", and which one is selected as the most suitable or reasonable result determines the accuracy of speech recognition.
  • in the related art, each speech recognition result is usually scored, and the higher the score, the more reasonable or accurate the speech recognition result is considered to be.
  • however, the accuracy of relying on a single score as the judgment criterion is still relatively low, so approaches have emerged that score each speech recognition result with multiple language models at the same time and then make a comprehensive judgment.
  • the re-scoring mechanism in the related technology directly adds up the multiple scores of each recognition result, or weights each score with manually set weights before calculating a total score; on the one hand, human subjectivity affects the final score and accuracy is poor, and on the other hand, if one scoring mechanism changes, its weight has to be reset, so applicability is poor.
  • the embodiments of the application provide a re-scoring model training method, a re-scoring model training device, a voice recognition method, a voice recognition device, a device, and a storage medium, so as to avoid the strong subjective influence and poor applicability of re-scoring in voice recognition in the related technologies.
  • an embodiment of the present application provides a re-scoring model training method, including: acquiring multiple voice recognition results of each voice data sample and a first label of each voice data sample; acquiring multiple scores of each recognition result under multiple different language models; obtaining sample feature vectors and second labels on that basis; and training a model using the sample feature vectors corresponding to the at least one voice data sample and the second labels to obtain a re-scoring model for re-scoring the voice recognition results.
  • an embodiment of the present application provides a voice recognition method, including: acquiring multiple voice recognition results of voice data to be recognized; acquiring multiple scores of each recognition result under multiple different language models; obtaining a feature vector corresponding to each recognition result; inputting each feature vector into a pre-trained re-scoring model to obtain a final score; and determining the result with the smallest final score as the final recognition result, where the re-scoring model is obtained by training with the re-scoring model training method described in the embodiments of the present application.
  • an embodiment of the present application provides a re-scoring model training device, including:
  • the first acquisition module is configured to acquire multiple voice recognition results of each voice data sample in at least one voice data sample and a first label of each voice data sample, where the first label is the pre-labeled label of each voice data sample;
  • the scoring module is configured to acquire multiple scores of each speech recognition result under multiple different language models;
  • the second acquisition module is configured to obtain, based on the multiple voice recognition results of each voice data sample, the multiple scores and the first label, multiple sample feature vectors and multiple second labels corresponding to each voice data sample, where the sample feature vectors and the second labels are used for training the re-scoring model;
  • the model training module is configured to train a model according to the sample feature vectors corresponding to the at least one voice data sample and the second labels, to obtain a re-scoring model for re-scoring the voice recognition results.
  • an embodiment of the present application provides a voice recognition device, including:
  • the voice recognition result obtaining module is configured to obtain multiple voice recognition results of the voice data to be recognized
  • the initial score acquisition module is set to acquire multiple scores for each speech recognition result under multiple different language models
  • a feature vector obtaining module configured to obtain a feature vector corresponding to each voice recognition result of the voice data to be recognized based on each voice recognition result and the multiple scores;
  • the final score prediction module is configured to input the sample feature vector corresponding to each voice recognition result into the pre-trained re-scoring model to obtain the final score of each voice recognition result;
  • a voice recognition result determination module configured to determine the voice recognition result corresponding to the smallest final score among the final scores of the multiple voice recognition results as the final recognition result of the voice data to be recognized;
  • the re-scoring model is obtained through training of the re-scoring model training method described in any one of the embodiments of the present application.
  • an embodiment of the present application provides a device, and the device includes:
  • at least one processor;
  • a storage device configured to store at least one program;
  • when the at least one program is executed by the at least one processor, the at least one processor implements at least one of the following: the re-scoring model training method described in any embodiment of the present application, and the voice recognition method described in any embodiment of the present application.
  • an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored.
  • when the computer program is executed by a processor, at least one of the following is implemented: the re-scoring model training method described in any embodiment of this application, and the speech recognition method described in any embodiment of this application.
  • FIG. 1 is a flow chart of the steps of a re-scoring model training method provided in Embodiment 1 of the present application;
  • FIG. 2 is a schematic diagram of a weighted directed acyclic graph obtained after decoding voice data samples in an embodiment of the present application
  • FIG. 3 is a flowchart of the steps of a re-scoring model training method provided in the second embodiment of the present application;
  • FIG. 4 is a flowchart of the steps of a voice recognition method provided in the third embodiment of the present application.
  • FIG. 5 is a structural block diagram of a re-scoring model training device provided in the fourth embodiment of the present application.
  • FIG. 6 is a structural block diagram of a voice recognition device provided by Embodiment 5 of the present application.
  • FIG. 7 is a structural block diagram of a device provided in Embodiment 6 of the present application.
  • FIG. 1 is a flowchart of the steps of a re-scoring model training method provided in Embodiment 1 of this application.
  • the embodiment of this application is applicable to the case of training a re-scoring model.
  • the method can be executed by the re-scoring model training device implemented in this application.
  • the re-scoring model training device can be implemented by hardware or software and integrated in the device provided in the embodiment of the present application.
  • the re-scoring model training method in the embodiment of the present application may include steps S101 to S104.
  • multiple voice recognition results of each voice data sample in at least one voice data sample and a first label of each voice data sample are acquired, where the first label is the pre-labeled label of each voice data sample.
  • for one piece of speech data, multiple speech recognition results can be obtained through the speech recognition codec model, and each speech recognition result consists of a series of ordered characters and words; that is, for one piece of speech data, multiple speech recognition decoding paths can be obtained.
  • the result of speech recognition decoding is a weighted directed acyclic graph.
  • Each path in the figure is a representation of the optional word sequence in the speech recognition decoding process.
  • each circle represents a character or word obtained after a piece of speech data is decoded, each edge is provided with a weight, and there are multiple paths from the leftmost circle to the rightmost circle; each path is regarded as one speech recognition result.
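The lattice structure described above can be illustrated with a small, self-contained sketch. The following Python snippet is not part of the application; it builds a made-up weighted directed acyclic graph in the spirit of FIG. 2 and enumerates every left-to-right path together with its accumulated weight, each path corresponding to one candidate recognition result.

```python
# Toy weighted DAG: nodes are decoded characters/words, edges carry weights,
# and every path from <s> to </s> is one candidate recognition result.
lattice = {
    "<s>": [("我", 0.6), ("握", 0.4)],
    "我": [("是", 1.0)], "握": [("是", 1.0)],
    "是": [("好学生", 0.7), ("号学声", 0.3)],
    "好学生": [("</s>", 1.0)], "号学声": [("</s>", 1.0)],
}

def enumerate_paths(node="<s>", prefix=(), weight=1.0):
    """Yield every (path, accumulated weight) pair in the lattice."""
    if node == "</s>":
        yield prefix, weight
        return
    for nxt, w in lattice[node]:
        yield from enumerate_paths(nxt, prefix + (nxt,), weight * w)

for path, w in enumerate_paths():
    print("".join(p for p in path if p != "</s>"), w)
```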
  • the voice data sample has a human-labeled real voice recognition result, and the real voice recognition result is the first label of the voice data sample.
  • for example, if the voice content is "I am a good student", multiple candidate recognition results may be obtained during decoding.
  • multiple voice recognition results of the voice data sample can be obtained through Encoder-Decoder.
  • multiple voice recognition results of the voice data sample can also be obtained in other ways, for example, they can be constructed manually.
  • the language model can construct the probability distribution p(s) of the string s, and p(s) expresses the probability that the string s is a sentence.
  • the probability here refers to the probability that the combination of characters making up the string forms a sentence in natural language (that is, something a human would say).
  • each voice recognition result can be input into multiple different language models to obtain a score of the voice recognition result.
  • the score expresses the probability that the voice recognition result conforms to natural speech.
  • different language models can be acoustic models, n-gram language models and RNNLM models.
  • the acoustic model represents the knowledge of differences in acoustics, phonetics, environmental variables, speaker gender, accent, etc.
  • the acoustic model can be trained with LSTM+CTC to obtain a mapping from speech features to phonemes.
  • the task of the acoustic model is to give, for a given piece of text, the probability of the speech corresponding to that text.
  • the n-gram language model is a statistics-based language model used to predict the n-th word from the preceding (n-1) words, that is, to calculate the probability of a sentence, i.e., the probability of the sequence of words that makes up the sentence.
  • the RNNLM model is a language model trained with an RNN or one of its variant networks, and its task is to predict the next word from the preceding context.
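As a rough illustration of how a language model assigns a score p(s) to a candidate string, the snippet below trains a toy add-one-smoothed bigram model on a two-sentence corpus and scores two hypotheses. It is a minimal sketch rather than any model used in the application; a real system would use a trained acoustic model, an n-gram language model and an RNNLM, and the corpus and function names here are invented for the example.

```python
import math
from collections import Counter

corpus = [["i", "am", "a", "good", "student"],
          ["she", "is", "a", "good", "teacher"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))
vocab_size = len(unigrams)

def bigram_log_prob(sentence):
    """Log of p(s) under an add-one smoothed bigram model."""
    score = 0.0
    for prev, cur in zip(sentence, sentence[1:]):
        score += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return score

hypotheses = [["i", "am", "a", "good", "student"],
              ["eye", "am", "a", "good", "student"]]
for hyp in hypotheses:
    print(" ".join(hyp), round(bigram_log_prob(hyp), 3))  # higher = more natural
```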
  • for each speech recognition result, the speech recognition result can be analyzed to extract its word frequency, character frequency, character or word ordering, sentence length, character count, word count and so on as sentence structure features; the multiple scores of the speech recognition result under multiple different language models and the sentence structure features are combined into the sample feature vector of the speech recognition result; then the character error rate of the speech recognition result is calculated using the speech recognition result and the first label of the voice data sample and used as the second label of the speech recognition result.
  • a model is trained using the sample feature vectors corresponding to the at least one voice data sample and the second labels, to obtain a re-scoring model for re-scoring the voice recognition results.
  • for example, the sample feature vectors may be input into a model with initialized model parameters to obtain an estimated character error rate of each speech recognition result, and the estimated character error rate and the second label of the speech recognition result are used to calculate a loss rate; when the loss rate meets a preset condition, training is stopped.
  • when the loss does not meet the preset condition, the model parameters are adjusted according to the loss rate and the model is trained iteratively until the loss rate meets the preset condition, giving a re-scoring model configured to re-score the speech recognition results; that is, for the multiple speech recognition results of each voice data sample, the character error rate can be predicted through the re-scoring model as the final score, and the speech recognition result with the lowest final score is the best speech recognition result of the voice data sample.
  • after acquiring the multiple voice recognition results of the voice data samples and the first labels of the voice data samples, the embodiment of the application acquires multiple scores of each voice recognition result under multiple different language models, and obtains the sample feature vectors and second labels of the voice data samples based on the voice recognition results, the scores and the first labels; the sample feature vectors and the second labels are used for training the re-scoring model, and training the model with them yields the re-scoring model. This mines the implicit internal correlation between the second label and the scores given by the multiple different language models, so as to obtain the best way of combining the scores of the different language models, which eliminates subjective human factors and ensures the accuracy of the speech recognition results; even if the scoring mechanism of a language model changes, there is no need to modify the weights between the scores, which improves the versatility and universality of the re-scoring model.
  • Fig. 3 is a flowchart of the steps of a re-scoring model training method provided in the second embodiment of the application.
  • the embodiment of the present application is refined on the basis of the foregoing embodiment 1.
  • as shown in FIG. 3, the re-scoring model training method of the embodiment of the present application may include steps S201 to S212.
  • each voice data sample of the at least one voice data sample is input into a decoding model to obtain multiple voice recognition results, and each voice data sample has a pre-labeled first label.
  • the voice data sample can be any voice data
  • the voice data can be input into a voice recognition decoding model (for example, through Encoder-Decoder) to obtain multiple recognition results, and each voice recognition result has one Probability, which expresses the probability that the voice recognition result is a pre-labeled label, and the pre-labeled first label is the real text corresponding to the artificially labeled voice data sample.
  • a preset number of voice recognition results are extracted as multiple voice recognition results for each voice data sample.
  • all voice recognition results may be sorted according to the probability of each voice recognition result, and the K voice recognition results ranked in the top K are extracted as the multiple voice recognition results of the voice data sample.
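A minimal sketch of the TOP-K extraction step, assuming a hypothetical decode() front end that returns (text, probability) pairs; in the application this role is played by the Encoder-Decoder decoding model, which is not reproduced here.

```python
def decode(voice_data):
    # Hypothetical stand-in for an Encoder-Decoder ASR front end: returns
    # candidate transcripts with their decoder probabilities (made-up values).
    return [("I am a good student", 0.62), ("eye am a good student", 0.21),
            ("I am a good stew dent", 0.11), ("I yam a good student", 0.06)]

def top_k_hypotheses(voice_data, k=3):
    hyps = decode(voice_data)
    hyps.sort(key=lambda pair: pair[1], reverse=True)  # sort by decoder probability
    return hyps[:k]                                    # keep the TOP-K results

print(top_k_hypotheses(b"raw-audio-bytes", k=3))
```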
  • each speech recognition result is input into multiple different language models, and multiple scores of the speech recognition result under different language models are obtained.
  • the language model may include three language models: an acoustic model, an n-gram language model, and an RNNLM language model. After obtaining multiple speech recognition results, each speech recognition result can be input into the acoustic model, n-gram language model, and RNNLM language model to obtain 3 scores for each speech recognition result.
  • the speech recognition result is analyzed to extract the sentence structure feature of the speech recognition result.
  • the speech recognition result is composed of a series of ordered characters and words, and the number of characters and the number of words contained in the speech recognition result, the frequencies with which the characters and words occur, the length of the sentence, and the ordering of the characters or words can be counted and used as the sentence structure features.
  • multiple scores of the speech recognition result and the sentence structure features can be concatenated to form a sample feature vector A (score 1, score 2, score 3, word frequency, character frequency, character or word ordering, sentence length, character count, word count), where score 1, score 2, score 3, the word frequency, the character frequency, the character or word ordering, the sentence length, the character count and the word count are respectively the feature values of the sample feature vector A.
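The following sketch assembles the 9-dimensional sample feature vector A described above from three language-model scores and simple sentence-structure statistics. The application does not specify how the character/word ordering feature is encoded, so a placeholder value is used here, and the helper name is invented for the example.

```python
def sample_feature_vector(hypothesis, am_score, ngram_score, rnnlm_score):
    """Build A = (score1, score2, score3, word freq, char freq, ordering,
    sentence length, char count, word count) for one recognition result."""
    words = hypothesis.split()
    chars = [c for w in words for c in w]
    word_freq = len(words) / max(len(set(words)), 1)  # average occurrences per distinct word
    char_freq = len(chars) / max(len(set(chars)), 1)  # average occurrences per distinct character
    ordering = 0.0                                    # placeholder: encoding of ordering is unspecified
    return [am_score, ngram_score, rnnlm_score,
            word_freq, char_freq, ordering,
            len(hypothesis), len(chars), len(words)]

print(sample_feature_vector("I am a good student", -12.3, -9.8, -7.1))
```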
  • Character Error Rate (CER) is a scoring method and a standard for evaluating how good an ASR model is.
  • the character error rate is the sum of the numbers of insertions, deletions and substitutions needed to go from the predicted value to the true value; that is, a voice data sample has a real first label, which is the real text corresponding to the voice data sample, while the voice recognition result is not necessarily that real text, so insertions, deletions and substitutions are needed to turn the voice recognition result into the real text, and the counted number of insertions, deletions and substitutions is the character error rate.
  • for example, if the label is the real text 我是三好学生 ("I am a three-good student") and the voice recognition result is 握是好学生, one character needs to be substituted and one character needs to be inserted, so the character error rate can be determined to be 2.
  • for each speech recognition result, the speech recognition result can be compared with the first label of the voice data sample used in decoding, and the character error rate of each speech recognition result is calculated as the second label of that speech recognition result.
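Character error counting of this kind is ordinary Levenshtein edit distance over characters. The sketch below (the function name is ours, not from the application) reproduces the worked example above: turning 握是好学生 into the reference 我是三好学生 takes one substitution and one insertion, so it returns 2.

```python
def character_error_count(hypothesis, reference):
    """Minimum number of insertions, deletions and substitutions (edit distance)."""
    m, n = len(hypothesis), len(reference)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                                   # delete remaining hypothesis characters
    for j in range(n + 1):
        dp[0][j] = j                                   # insert remaining reference characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hypothesis[i - 1] == reference[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,           # deletion
                           dp[i][j - 1] + 1,           # insertion
                           dp[i - 1][j - 1] + cost)    # substitution or match
    return dp[m][n]

print(character_error_count("握是好学生", "我是三好学生"))  # 2
```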
  • normalization processing is performed on each sample feature vector corresponding to the at least one voice data sample to obtain a normalized sample feature vector.
  • the sample feature vector with the largest modulus and the sample feature vector with the smallest modulus can be determined among all the sample feature vectors corresponding to the at least one voice data sample; the difference between the sample feature vector with the largest modulus and the sample feature vector with the smallest modulus is calculated to obtain a vector difference; and the result obtained by dividing each sample feature vector corresponding to the at least one voice data sample by the modulus of the vector difference is used as the normalized sample feature vector. The normalization calculation formula of the sample feature vector is as follows:
  • x'_i = x_i / ||x_max - x_min||
  • where x_i is the sample feature vector corresponding to the i-th speech recognition result, x'_i is the normalized sample feature vector, and x_max and x_min are the sample feature vector with the largest modulus and the sample feature vector with the smallest modulus among the sample feature vectors of the multiple speech recognition results of the voice data sample.
  • after normalization, the sample feature vectors of the multiple speech recognition results are unified under one dimension, which facilitates the quantitative expression of the sample feature vectors and provides high-quality training data for model training, improving the accuracy of model training.
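A small sketch of the normalization step as described, dividing every sample feature vector by the modulus of the difference between the largest-modulus and smallest-modulus vectors; the toy vectors are made up for the example.

```python
import numpy as np

def normalize_feature_vectors(vectors):
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1)
    x_max = vectors[np.argmax(norms)]              # sample feature vector with the largest modulus
    x_min = vectors[np.argmin(norms)]              # sample feature vector with the smallest modulus
    scale = np.linalg.norm(x_max - x_min) or 1.0   # modulus of the vector difference
    return vectors / scale

print(normalize_feature_vectors([[3.0, 4.0], [0.6, 0.8], [1.0, 1.0]]))
```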
  • the model of the embodiment of this application may be a machine learning algorithm training model such as linear regression, support vector machine, decision tree model, etc.
  • the example of this application takes linear regression as an example, and the modeling equation is: y_i = Σ_{m=1}^{n} a_m · z_m, where a_m is the coefficient corresponding to the m-th feature value of the sample feature vector, z_m is the m-th feature value of the sample feature vector x_i, y_i is the estimated character error rate of the i-th sample feature vector, and n is the number of feature values in the sample feature vector; after a_m is initialized, the purpose of model training is to obtain the optimal a_m so that y_i approaches the second label.
  • the normalized sample feature vectors of the speech recognition results can be input into the initialized model, that is, x'_i from S207 is input into the model, and for each x'_i the model outputs an estimated character error rate y_i.
  • the estimated character error rate of each speech recognition result and the second tag are used to calculate the loss rate.
  • the loss function is the mean square loss function: mse_loss = (1 / (N·K)) · Σ_{i=1}^{N·K} (y_i - ŷ_i)^2, where mse_loss is the loss rate, y_i is the estimated character error rate of the i-th sample feature vector, and ŷ_i is the second label corresponding to the i-th sample feature vector.
  • for example, if the number of speech data samples is N and K is the number of speech recognition results extracted from the multiple speech recognition results of one speech data sample, the total number of extracted speech recognition results is N*K, and the estimated character error rates and the second labels of the N*K speech recognition results are substituted into the above mean square loss function to calculate the loss rate.
  • when the loss rate does not meet a preset condition, the loss rate is used to calculate the gradient; if the calculated loss rate is less than a preset threshold, iteration of the model stops, and when the loss rate is greater than or equal to the preset threshold, the gradient is calculated using the loss rate.
  • for example, a preset gradient algorithm can be used to calculate the gradient; the embodiment of the present application does not impose restrictions on the algorithm used to calculate the gradient.
  • the gradient and a preset learning rate (the learning rate is a hyperparameter of the model) may be used to perform gradient descent on the current parameters of the model to obtain the model with adjusted parameters, and the process returns to S209 to continue iterating the model until the loss rate is less than the preset threshold; training may also be stopped when the number of iterations reaches a preset number, giving the re-scoring model used to re-score the speech recognition results.
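Putting S208 to S212 together, the sketch below trains the linear model y_i = Σ_m a_m · z_m with a mean square loss and plain gradient descent. The feature matrix and character-error-rate labels are randomly generated stand-ins for the normalized sample feature vectors and second labels produced by the earlier steps; the learning rate, threshold and iteration limit are illustrative values, not values from the application.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(40, 9))          # N*K normalized sample feature vectors (made up)
labels = rng.uniform(0, 5, size=40)          # second labels: character error rates (made up)

a = np.zeros(9)                              # S208: initialize model parameters a_m
learning_rate = 0.01
threshold = 1e-3

for step in range(10000):
    predictions = features @ a               # S209: estimated character error rates y_i
    errors = predictions - labels
    mse_loss = np.mean(errors ** 2)          # S210: mean square loss over the N*K results
    if mse_loss < threshold:                 # S211: stop when the preset condition is met
        break
    gradient = 2 * features.T @ errors / len(labels)
    a -= learning_rate * gradient            # S212: gradient descent on the parameters

print(step, round(float(mse_loss), 4))
```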
  • in the embodiment of the present application, the voice data samples are input into the decoding model to obtain multiple voice recognition results, a preset number of voice recognition results are extracted as the multiple voice recognition results of the voice data samples, each voice recognition result is input into multiple different language models respectively to obtain the scores of the voice recognition result under the different language models, each voice recognition result is analyzed to extract its sentence structure features, the multiple scores of the voice recognition result under the multiple different language models and the sentence structure features are combined into the sample feature vector of the voice recognition result, the character error rate calculated using the voice recognition result and the first label of the voice data sample is used as the second label of the voice recognition result, and the re-scoring model is trained with the sample feature vectors and the second labels.
  • this can mine the implicit internal correlation between the second label and the scores given by the multiple different language models, so as to obtain the best way of combining the scores of the different language models, which eliminates subjective factors and ensures the accuracy of the speech recognition results; even if the scoring mechanism of a language model changes, there is no need to modify the weights between the scores, which improves the versatility and universality of the re-scoring model.
  • the speech recognition result and the first label are used to calculate the character error rate as the second label of the speech recognition result, so that the model indirectly learns the character error rate of the voice data sample through the speech recognition result, so as to use the model to obtain a better speech recognition result.
  • Figure 4 is a flowchart of the steps of a voice recognition method provided in the third embodiment of the application.
  • the embodiment of the present application is applicable to the situation of voice recognition.
  • the method can be executed by the voice recognition device implemented in the present application, and the voice recognition device can be implemented by hardware or software and integrated in the device provided in the embodiment of the present application.
  • the voice recognition method of the embodiment of the present application may include steps S301 to S305.
  • the voice data to be recognized may be data that needs to be converted into text, for example, it may be voice data in a short video, voice data on a chat interface of an instant messaging application, etc.
  • the voice data to be recognized can be input into the decoding model to obtain multiple voice recognition results.
  • for this process, reference can be made to obtaining the multiple voice recognition results of a voice data sample in Embodiment 1 or Embodiment 2, which is not described in detail again in this embodiment.
  • the speech recognition results can be input into the acoustic model, the n-gram language model, and the RNNLM language model to obtain 3 scores for each speech recognition result.
  • for an example, reference can be made to S204-S207 in the second embodiment, which is not described in detail here.
  • the feature vector corresponding to each speech recognition result is input into the pre-trained re-scoring model to obtain the final score of each speech recognition result.
  • the re-scoring model can be obtained by training the re-scoring model training method described in any one of Embodiment 1 or Embodiment 2, and the re-scoring model can be performed on multiple speech recognition results of the speech data to be recognized. Re-scoring, after inputting the feature vector into the pre-trained re-scoring model, the final score of each speech recognition result can be obtained.
  • the voice recognition result corresponding to the smallest final score among the final scores of the multiple voice recognition results is determined as the final recognition result of the voice data to be recognized.
  • the final score expresses the character error rate of the speech recognition result relative to the real result.
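At recognition time the re-scoring step reduces to predicting a final score for every candidate and keeping the candidate with the smallest one. The sketch below assumes a hypothetical rescoring_model object with a predict() method standing in for whatever trained model is used; the dummy model and data exist only to make the example runnable.

```python
def pick_final_result(candidates, feature_vectors, rescoring_model):
    """Return the candidate whose predicted character error rate (final score) is lowest."""
    final_scores = [rescoring_model.predict(v) for v in feature_vectors]
    best_index = min(range(len(candidates)), key=lambda i: final_scores[i])
    return candidates[best_index], final_scores[best_index]

class DummyRescorer:
    def predict(self, vector):               # hypothetical stand-in for the trained re-scoring model
        return sum(vector)

cands = ["I am a good student", "eye am a good student"]
vecs = [[0.2, 0.1], [0.9, 0.7]]
print(pick_final_result(cands, vecs, DummyRescorer()))
```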
  • when training the re-scoring model, the sample feature vectors and second labels of the voice data samples are obtained based on the speech recognition results, the scores and the first labels, and the sample feature vectors and second labels are used for training the re-scoring model.
  • this mines the implicit internal correlation between the second label and the scores given by multiple different language models, so as to obtain the best way of combining the scores of the different language models.
  • when the multiple speech recognition results of the speech data to be recognized are re-scored through the re-scoring model, subjective human factors can be eliminated and the accuracy of the speech recognition results is ensured; even if the scoring mechanism of each language model changes, there is no need to modify the weights between the scores, which improves the generality and universality of the re-scoring model.
  • FIG. 5 is a structural block diagram of a re-scoring model training device provided in the fourth embodiment of the present application.
  • the re-scoring model training device in the embodiment of the present application may include a first acquisition module 401, a scoring module 402, a second acquisition module 403, and a model training module 404.
  • the first acquisition module 401 is configured to acquire multiple voice recognition results of each voice data sample in at least one voice data sample and a first label of each voice data sample, where the first label is the pre-labeled label of each voice data sample.
  • the scoring module 402 is configured to obtain multiple scores of each speech recognition result under multiple different language models.
  • the second acquisition module 403 is configured to obtain, based on the multiple voice recognition results of each voice data sample, the multiple scores and the first label, multiple sample feature vectors and multiple second labels corresponding to each voice data sample, where the sample feature vectors and the second labels are used for training the re-scoring model.
  • the model training module 404 is configured to use the sample feature vector corresponding to the at least one voice data sample and the second label training model to obtain a re-scoring model for re-scoring the voice recognition result.
  • the first acquisition module 401 includes a decoding sub-module and a speech recognition result extraction sub-module.
  • the decoding sub-module is configured to input each voice data sample of the at least one voice data sample into a decoding model to obtain multiple voice recognition results, and each voice data sample has a pre-labeled first label.
  • the voice recognition result extraction sub-module is configured to extract a preset number of voice recognition results as multiple voice recognition results for each voice data sample.
  • the scoring module 402 includes:
  • the scoring model input sub-module is configured to input each speech recognition result into multiple different language models to obtain multiple scores of each speech recognition result under different language models.
  • the multiple different language models include an acoustic model, an n-gram language model, and an RNNLM language model.
  • the second acquisition module 403 includes a sentence structure feature acquisition submodule, a feature combination submodule, and a second tag acquisition submodule.
  • the sentence structure feature acquisition sub-module is set to analyze the voice recognition result for each voice recognition result to extract the sentence structure feature of the voice recognition result.
  • the feature combination submodule is configured to combine multiple scores of each voice recognition result under multiple different language models and the sentence structure feature into a sample feature vector corresponding to each voice recognition result.
  • the second label acquisition sub-module is configured to calculate the character error rate of each voice recognition result using each of the voice recognition results and the first label as the second label of each voice recognition result.
  • the sentence structure feature includes at least one of the following:
  • word frequency, character frequency, character or word ordering, sentence length, character count, word count.
  • it also includes:
  • the feature normalization processing module is configured to perform normalization processing on each sample feature vector corresponding to the at least one voice data sample to obtain a normalized sample feature vector.
  • the feature normalization processing module includes a maximum and minimum sample feature vector determination sub-module, a difference calculation sub-module, and a sample feature vector calculation sub-module.
  • the maximum and minimum sample feature vector determining sub-module is configured to determine the sample feature vector with the largest modulus and the sample feature vector with the smallest modulus among all the sample feature vectors corresponding to the at least one voice data sample.
  • the difference calculation sub-module is configured to calculate the difference between the sample eigenvector with the largest modulus and the sample eigenvector with the smallest modulus to obtain a vector difference.
  • the sample feature vector calculation submodule is configured to use the result obtained by dividing each sample feature vector corresponding to the at least one voice data sample by the modulus of the vector difference as the normalized sample feature vector of each sample feature vector corresponding to the at least one voice data sample.
  • the model training module 404 includes an initialization model sub-module, a feature input sub-module, a loss rate calculation sub-module, a gradient calculation sub-module, and a model parameter adjustment sub-module.
  • the feature input sub-module is configured to input the normalized sample feature vector corresponding to each speech recognition result into the model to obtain the estimated character error rate of each speech recognition result.
  • the loss rate calculation sub-module is set to calculate the loss rate using the estimated character error rate of each speech recognition result and the second tag.
  • the gradient calculation sub-module is configured to calculate the gradient using the loss rate when the loss rate does not meet a preset condition.
  • the model parameter adjustment sub-module is set to adjust the model parameters using the gradient and return to the feature input sub-module.
  • the loss rate calculation submodule includes:
  • the loss rate calculation unit is configured to substitute the estimated character error rate of each speech recognition result and the second label into a preset mean square loss function to calculate the loss rate.
  • the re-scoring model training device provided in the embodiment of this application can execute the re-scoring model training method described in Embodiment 1 or Embodiment 2 of this application, and has functional modules corresponding to the execution method.
  • FIG. 6 is a structural block diagram of a voice recognition device provided in the fifth embodiment of the present application.
  • the voice recognition device in the embodiment of the present application may include a voice recognition result obtaining module 501, an initial score obtaining module 502, a feature vector obtaining module 503, a final score prediction module 504, and a speech recognition result determination module 505.
  • the voice recognition result obtaining module 501 is configured to obtain multiple voice recognition results of the voice data to be recognized.
  • the initial score obtaining module 502 is configured to obtain multiple scores of each speech recognition result under multiple different language models.
  • the feature vector obtaining module 503 is configured to obtain a feature vector corresponding to each voice recognition result of the voice data to be recognized based on each voice recognition result and the multiple scores.
  • the final score prediction module 504 is configured to input the feature vector corresponding to each voice recognition result into the pre-trained re-scoring model to obtain the final score of each voice recognition result.
  • the voice recognition result determining module 505 is configured to determine the voice recognition result corresponding to the smallest final score among the final scores of the multiple voice recognition results as the final recognition result of the voice data to be recognized.
  • the re-scoring model is obtained by training the re-scoring model training method described in any embodiment of the present application.
  • the voice recognition device provided in the embodiment of the present application can execute the voice recognition method described in the third embodiment of the present application, and is equipped with functional modules corresponding to the method.
  • the device may include: a processor 60, a memory 61, a display screen 62 with a touch function, an input device 63, an output device 64, and a communication device 65.
  • the number of processors 60 in the device may be at least one, and one processor 60 is taken as an example in FIG. 7.
  • the processor 60, the memory 61, the display screen 62, the input device 63, the output device 64, and the communication device 65 of the device may be connected by a bus or other methods. In FIG. 7, the connection by a bus is taken as an example.
  • the memory 61 can be configured to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the re-scoring model training methods described in Embodiments 1 and 2 of this application (for example, the first acquisition module 401, the scoring module 402, the second acquisition module 403, and the model training module 404 in the re-scoring model training device of the fourth embodiment above), or the program instructions/modules corresponding to the speech recognition method described in the third embodiment of the present application (for example, the speech recognition result acquisition module 501, the initial score acquisition module 502, the feature vector acquisition module 503, and the final score prediction module 504 in the speech recognition device of the fifth embodiment above).
  • Program instructions/modules for example, the speech recognition result acquisition module 501, the initial score acquisition module 502, the feature vector acquisition module 503, and the final score prediction module 504 in the speech recognition device of the fifth embodiment above).
  • the memory 61 may mainly include a storage program area and a storage data area.
  • the storage program area may store an operating device and an application program required by at least one function; the storage data area may store data created according to the use of the device and the like.
  • the memory 61 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage devices.
  • the memory 61 may include a memory remotely provided with respect to the processor 60, and these remote memories may be connected to the device through a network. Examples of the aforementioned networks include but are not limited to the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
  • the display screen 62 is a display screen 62 with a touch function, which may be a capacitive screen, an electromagnetic screen or an infrared screen.
  • the display screen 62 is set to display data according to instructions of the processor 60, and is also set to receive touch operations on the display screen 62 and send corresponding signals to the processor 60 or other devices.
  • when the display screen 62 is an infrared screen, it also includes an infrared touch frame, which is arranged around the display screen 62 and can also be configured to receive an infrared signal and send the infrared signal to the processor 60 or other equipment.
  • the communication device 65 is configured to establish a communication connection with other devices, and it may be a wired communication device and/or a wireless communication device.
  • the input device 63 can be configured to receive input digital or character information, and to generate key signal input related to user settings and function control of the device, and can also be a camera set to obtain images and a sound pickup device to obtain audio data.
  • the output device 64 may include audio equipment such as speakers. It should be noted that the composition of the input device 63 and the output device 64 can be set according to actual conditions.
  • the processor 60 executes various functional applications and data processing of the device by running the software programs, instructions, and modules stored in the memory 61, that is, realizes at least one of the above-mentioned re-scoring model training method and speech recognition method.
  • the processor 60 executes at least one program stored in the memory 61, it implements at least one of the re-scoring model training method and the speech recognition method provided in the embodiment of the present application.
  • An embodiment of the present application also provides a computer-readable storage medium.
  • when the instructions in the storage medium are executed by the processor of the device, the device can execute at least one of the re-scoring model training method and the voice recognition method described in the above method embodiments.
  • this application can be implemented by software and necessary general-purpose hardware, of course, it can also be implemented by hardware, but in many cases the former is a better implementation.
  • the technical solution of this application, in essence or in the part that contributes to the related technology, can be embodied in the form of a software product, and the computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a flash memory (FLASH), a hard disk or an optical disk, including several instructions to make a computer device (which may be a robot, a personal computer, a server, or a network device, etc.) execute the re-scoring model training method and/or the voice recognition method described in any embodiment of the present application.
  • the units and modules included in the above-mentioned re-scoring model training device and speech recognition device are only divided according to functional logic, but are not limited to the above-mentioned division, as long as the corresponding functions can be realized;
  • the names of the functional units are only for the convenience of distinguishing each other, and are not used to limit the protection scope of the present application.
  • each part of this application can be implemented by hardware, software, firmware, or a combination thereof.
  • multiple steps or methods can be implemented by software or firmware stored in a memory and executed by a suitable instruction execution device.
  • for example, if implemented in hardware, as in another implementation, any one of the following technologies known in the art, or a combination of them, can be used: discrete logic circuits having logic gate circuits configured to implement logic functions on data signals, application-specific integrated circuits with suitable combinational logic gate circuits, Programmable Gate Arrays (PGA), Field Programmable Gate Arrays (FPGA), etc.

Abstract

A re-scoring model training method and device, and a speech recognition method and device. The training method includes: acquiring multiple speech recognition results of a voice data sample and a first label of the voice data sample, the first label being a pre-annotated label; acquiring multiple scores of each speech recognition result under multiple different language models; obtaining a sample feature vector and a second label of the voice data sample on the basis of the speech recognition results, the multiple scores and the first label; and training a model with the sample feature vector and the second label to obtain a re-scoring model.

Description

Re-scoring model training method and device, and speech recognition method and device
This application claims priority to Chinese patent application No. 201911413152.3 filed with the Chinese Patent Office on December 31, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the technical field of speech recognition, for example, to a re-scoring model training method, a re-scoring model training device, a speech recognition method, a speech recognition device, a device, and a storage medium.
Background
Automatic Speech Recognition (ASR) is a technology that converts speech into text. ASR can be applied in scenarios such as speech translation, human-computer interaction, and smart home.
In the decoding process of speech recognition, multiple speech recognition results can be obtained from the speech data. For example, if the speech content is "我是好学生" ("I am a good student"), the decoding process may produce multiple recognition results such as "握是号学声", "窝时浩学升", "卧室好学生", "我是好学生" and so on; which one is selected as the most suitable or reasonable result determines the accuracy of the speech recognition result.
In the related art, each speech recognition result is usually scored, and the higher the score, the more reasonable or accurate the speech recognition result. However, relying on a single score as the judgment criterion still gives relatively low accuracy, so approaches have emerged that score each speech recognition result with multiple language models at the same time and then make a comprehensive judgment.
However, the re-scoring mechanism in the related art directly adds up the multiple scores of each recognition result, or weights each score with manually set weights before computing a total score. On the one hand, human subjectivity affects the final score and accuracy is poor; on the other hand, if one scoring mechanism changes, its weight has to be reset, so applicability is poor.
Summary
The embodiments of the present application provide a re-scoring model training method, a re-scoring model training device, a speech recognition method, a speech recognition device, a device, and a storage medium, so as to avoid the situations in the related art in which re-scoring in speech recognition is strongly affected by human subjectivity and has poor applicability.
In a first aspect, an embodiment of the present application provides a re-scoring model training method, including:
acquiring multiple speech recognition results of each voice data sample in at least one voice data sample and a first label of each voice data sample, the first label being a pre-annotated label of each voice data sample;
acquiring multiple scores of each speech recognition result under multiple different language models;
obtaining, on the basis of the multiple speech recognition results of each voice data sample, the multiple scores and the first label, multiple sample feature vectors and multiple second labels of the voice data sample, the sample feature vectors and the second labels being used for training the re-scoring model;
training a model with the sample feature vectors corresponding to the at least one voice data sample and the second labels to obtain a re-scoring model for re-scoring the speech recognition results.
In a second aspect, an embodiment of the present application provides a speech recognition method, including:
acquiring multiple speech recognition results of voice data to be recognized;
acquiring multiple scores of each speech recognition result under multiple different language models;
obtaining, on the basis of each speech recognition result and the multiple scores, a feature vector corresponding to each speech recognition result of the voice data to be recognized;
inputting the feature vector corresponding to each speech recognition result into a pre-trained re-scoring model to obtain a final score of each speech recognition result;
determining the speech recognition result corresponding to the smallest final score among the final scores of the multiple speech recognition results as the final recognition result of the voice data to be recognized;
where the re-scoring model is trained by the re-scoring model training method described in the embodiments of the present application.
In a third aspect, an embodiment of the present application provides a re-scoring model training device, including:
a first acquisition module configured to acquire multiple speech recognition results of each voice data sample in at least one voice data sample and a first label of each voice data sample, the first label being a pre-annotated label of each voice data sample;
a scoring module configured to acquire multiple scores of each speech recognition result under multiple different language models;
a second acquisition module configured to obtain, on the basis of the multiple speech recognition results of each voice data sample, the multiple scores and the first label, multiple sample feature vectors and multiple second labels corresponding to each voice data sample, the sample feature vectors and the second labels being used for training the re-scoring model;
a model training module configured to train a model according to the sample feature vectors corresponding to the at least one voice data sample and the second labels to obtain a re-scoring model for re-scoring the speech recognition results.
In a fourth aspect, an embodiment of the present application provides a speech recognition device, including:
a speech recognition result acquisition module configured to acquire multiple speech recognition results of voice data to be recognized;
an initial score acquisition module configured to acquire multiple scores of each speech recognition result under multiple different language models;
a feature vector acquisition module configured to obtain, on the basis of each speech recognition result and the multiple scores, a feature vector corresponding to each speech recognition result of the voice data to be recognized;
a final score prediction module configured to input the sample feature vector corresponding to each speech recognition result into a pre-trained re-scoring model to obtain a final score of each speech recognition result;
a speech recognition result determination module configured to determine the speech recognition result corresponding to the smallest final score among the final scores of the multiple speech recognition results as the final recognition result of the voice data to be recognized;
where the re-scoring model is trained by the re-scoring model training method described in any one of the embodiments of the present application.
In a fifth aspect, an embodiment of the present application provides a device, the device including:
at least one processor;
a storage device configured to store at least one program,
where, when the at least one program is executed by the at least one processor, the at least one processor implements at least one of the following: the re-scoring model training method described in any embodiment of the present application, and the speech recognition method described in any embodiment of the present application.
In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, where, when the computer program is executed by a processor, at least one of the following is implemented: the re-scoring model training method described in any embodiment of the present application, and the speech recognition method described in any embodiment of the present application.
Brief Description of the Drawings
FIG. 1 is a flowchart of the steps of a re-scoring model training method provided in Embodiment 1 of the present application;
FIG. 2 is a schematic diagram of a weighted directed acyclic graph obtained after decoding a voice data sample in an embodiment of the present application;
FIG. 3 is a flowchart of the steps of a re-scoring model training method provided in Embodiment 2 of the present application;
FIG. 4 is a flowchart of the steps of a speech recognition method provided in Embodiment 3 of the present application;
FIG. 5 is a structural block diagram of a re-scoring model training device provided in Embodiment 4 of the present application;
FIG. 6 is a structural block diagram of a speech recognition device provided in Embodiment 5 of the present application;
FIG. 7 is a structural block diagram of a device provided in Embodiment 6 of the present application.
Detailed Description
Embodiment 1
FIG. 1 is a flowchart of the steps of a re-scoring model training method provided in Embodiment 1 of the present application. The embodiment of the present application is applicable to the case of training a re-scoring model. The method can be executed by the re-scoring model training device implemented in the present application; the re-scoring model training device can be implemented by hardware or software and integrated into the device provided in the embodiments of the present application. As shown in FIG. 1, the re-scoring model training method of the embodiment of the present application may include steps S101 to S104.
In S101, multiple speech recognition results of each voice data sample in at least one voice data sample and a first label of each voice data sample are acquired, where the first label is a pre-annotated label of each voice data sample.
In the embodiment of the present application, for one piece of voice data, multiple speech recognition results can be obtained through a speech recognition encoder-decoder model, and each speech recognition result consists of a series of ordered characters and words; that is, multiple speech recognition decoding paths can be obtained for one piece of voice data.
As shown in FIG. 2, the result of speech recognition decoding is a weighted directed acyclic graph, and each path in the graph is one representation of the selectable word sequence in the speech recognition decoding process. In FIG. 2, the circles represent the characters and words obtained after a piece of voice data is decoded, each edge is provided with a weight, there are multiple paths from the leftmost circle to the rightmost circle, and each path is regarded as one speech recognition result. Meanwhile, the voice data sample has a manually annotated, real speech recognition result, and this real speech recognition result is the first label of the voice data sample. For example, if the speech content is "我是好学生" ("I am a good student"), the decoding process may produce multiple recognition results such as "握是号学声", "窝时浩学升", "卧室好学生" and "我是好学生".
Optionally, for a voice data sample, multiple speech recognition results of the voice data sample can be obtained through an Encoder-Decoder; of course, the multiple speech recognition results of the voice data sample can also be obtained in other ways, for example, they can be constructed manually.
In S102, multiple scores of each speech recognition result under multiple different language models are acquired.
In the embodiment of the present application, a language model can construct a probability distribution p(s) of a character string s, where p(s) expresses the probability that the string s is a sentence; the probability here refers to the probability that the combination of characters making up the string forms a sentence in natural language (i.e., something a human would say).
After obtaining the speech recognition results, the embodiment of the present application can input each speech recognition result into multiple different language models to obtain scores of the speech recognition result, and each score expresses the probability that the speech recognition result conforms to natural speech. Optionally, the different language models can be an acoustic model, an n-gram language model, and an RNNLM model.
The acoustic model is a knowledge representation of the differences in acoustics, phonetics, environmental variables, speaker gender, accent, and so on. The acoustic model can be trained with LSTM+CTC to obtain a mapping from speech features to phonemes. The task of the acoustic model is to give, for a given piece of text, the probability of the speech corresponding to that text.
The n-gram language model is a statistics-based language model used to predict the n-th word from the preceding (n-1) words, that is, to calculate the probability of a sentence, i.e., the probability of the sequence of words that makes up the sentence.
The RNNLM model is a language model trained with an RNN or one of its variant networks, and its task is to predict the next word from the preceding context.
Of course, in practical applications, those skilled in the art can also score each speech recognition result with other language models; the embodiments of the present application place no restriction on which language models are used to score the speech recognition results, nor on the number of language models.
In S103, multiple sample feature vectors and multiple second labels corresponding to each voice data sample are obtained on the basis of the multiple speech recognition results of each voice data sample, the multiple scores and the first label, where the sample feature vectors and the second labels are used for training the re-scoring model.
In an optional embodiment of the present application, for each speech recognition result, the speech recognition result can be analyzed to extract its word frequency, character frequency, character or word ordering, sentence length, character count, word count and so on as sentence structure features; the multiple scores of the speech recognition result under multiple different language models and the sentence structure features are combined into the sample feature vector of the speech recognition result; then the character error rate of the speech recognition result is calculated using the speech recognition result and the first label of the voice data sample and used as the second label of the speech recognition result.
In S104, a model is trained with the sample feature vectors corresponding to the at least one voice data sample and the second labels to obtain a re-scoring model for re-scoring the speech recognition results.
For example, the sample feature vectors can be input into a model with initialized model parameters to obtain an estimated character error rate of each speech recognition result; the loss rate is calculated with the estimated character error rate and the second label of the speech recognition result; when the loss rate meets a preset condition, training is stopped; when the loss does not meet the preset condition, the model parameters are adjusted according to the loss rate and the model is trained iteratively until the loss rate meets the preset condition, so as to obtain a re-scoring model configured to re-score the speech recognition results. That is, for the multiple speech recognition results of each voice data sample, the re-scoring model can re-score them to obtain character error rates as final scores, and the speech recognition result with the lowest final score is the best speech recognition result of the voice data sample.
In the embodiment of the present application, after the multiple speech recognition results of the voice data samples and the first labels of the voice data samples are acquired, multiple scores of each speech recognition result under multiple different language models are acquired, and the sample feature vectors and second labels of the voice data samples are obtained on the basis of the speech recognition results, the multiple scores and the first labels; the sample feature vectors and the second labels are used for training the re-scoring model, and the re-scoring model is obtained by training the model with the sample feature vectors and the second labels. The embodiment of the present application obtains the sample feature vectors and second labels of the voice data samples on the basis of the speech recognition results, the scores and the first labels, and uses them for training the re-scoring model, thereby mining the implicit internal correlation between the second label and the scores given by multiple different language models so as to obtain the best way of combining the scores of the different language models. This eliminates subjective human factors and ensures the accuracy of the speech recognition results; even if the scoring mechanism of a language model changes, there is no need to modify the weights between the scores, which improves the generality and universality of the re-scoring model.
Embodiment 2
FIG. 3 is a flowchart of the steps of a re-scoring model training method provided in Embodiment 2 of the present application. The embodiment of the present application is a refinement on the basis of the foregoing Embodiment 1. As shown in FIG. 3, the re-scoring model training method of the embodiment of the present application may include steps S201 to S212.
In S201, each voice data sample in the at least one voice data sample is input into a decoding model to obtain multiple speech recognition results, and each voice data sample has a pre-annotated first label.
In the embodiment of the present application, a voice data sample can be any voice data. The voice data can be input into a speech recognition decoding model (for example, an Encoder-Decoder) to obtain multiple recognition results, and each speech recognition result has a probability that expresses the probability that the speech recognition result is the pre-annotated label; the pre-annotated first label is the real text corresponding to the manually annotated voice data sample.
In S202, a preset number of speech recognition results are extracted as the multiple speech recognition results of each voice data sample.
For example, all the speech recognition results can be sorted according to the probability of each speech recognition result, and the K speech recognition results ranked in the top K are extracted as the multiple speech recognition results of the voice data sample.
In S203, each speech recognition result is input into multiple different language models respectively to obtain multiple scores of the speech recognition result under the different language models.
In an optional embodiment of the present application, the language models may include three language models: an acoustic model, an n-gram language model, and an RNNLM language model. After the multiple speech recognition results are obtained, each speech recognition result can be input into the acoustic model, the n-gram language model and the RNNLM language model respectively to obtain 3 scores for each speech recognition result. Of course, in practical applications, those skilled in the art can also input the speech recognition results into other language models; the embodiments of the present application place no restriction on the language models or their number.
In S204, for each speech recognition result, the speech recognition result is analyzed to extract sentence structure features of the speech recognition result.
In the embodiment of the present application, the speech recognition result consists of a series of ordered characters and words, and features such as the number of characters and the number of words contained in the speech recognition result, the frequencies with which the characters and words occur, the length of the sentence, and the ordering of the characters or words can be counted as the sentence structure features.
In S205, the multiple scores of each speech recognition result under the multiple different language models and the sentence structure features are combined into a sample feature vector corresponding to each speech recognition result.
For example, the multiple scores of the speech recognition result and the sentence structure features can be concatenated to form a sample feature vector A (score 1, score 2, score 3, word frequency, character frequency, character or word ordering, sentence length, character count, word count), where score 1, score 2, score 3, the word frequency, the character frequency, the character or word ordering, the sentence length, the character count, and the word count are respectively the feature values of the sample feature vector A.
In S206, the character error rate of each speech recognition result is calculated using each speech recognition result and the first label, and is used as the second label of each speech recognition result.
Character Error Rate (CER) is a scoring method and a standard for evaluating how good an ASR model is. The character error rate is the sum of the numbers of insertions, deletions and substitutions needed to go from the predicted value to the true value. That is, a voice data sample has a real first label, which is the real text corresponding to the voice data sample, while the speech recognition result is not necessarily that real text; insertions, deletions and substitutions are needed to turn the speech recognition result into the real text, and the counted number of insertions, deletions and substitutions is the character error rate.
For example, for a label that is the real text "我是三好学生", if the speech recognition result is "握是好学生", one character needs to be substituted and one character needs to be inserted, so its character error rate can be determined to be 2.
For each speech recognition result, the speech recognition result can be compared with the first label of the voice data sample used in decoding, and the character error rate of each speech recognition result is calculated as the second label of that speech recognition result.
In S207, normalization processing is performed on each sample feature vector corresponding to the at least one voice data sample to obtain normalized sample feature vectors.
In an optional embodiment of the present application, the sample feature vector with the largest modulus and the sample feature vector with the smallest modulus can be determined among all the sample feature vectors corresponding to the at least one voice data sample; the difference between the sample feature vector with the largest modulus and the sample feature vector with the smallest modulus is calculated to obtain a vector difference; and the result obtained by dividing each sample feature vector corresponding to the at least one voice data sample by the modulus of the vector difference is used as the normalized sample feature vector of each sample feature vector corresponding to the at least one voice data sample. The normalization calculation formula is as follows: x'_i = x_i / ||x_max - x_min||, where x_i is the sample feature vector corresponding to the i-th speech recognition result, x'_i is the normalized sample feature vector, and x_max and x_min are the sample feature vector with the largest modulus and the sample feature vector with the smallest modulus among the sample feature vectors of the multiple speech recognition results of the voice data sample. Through normalization, the sample feature vectors of the multiple speech recognition results can be unified under one dimension, which facilitates the quantitative expression of the sample feature vectors and provides high-quality training data for model training, thereby improving the accuracy of model training.
In S208, model parameters are initialized.
For example, the model of the embodiment of the present application may be a model trained with a machine learning algorithm such as linear regression, a support vector machine, or a decision tree. The example of the present application takes linear regression as an example, and the modeling equation is: y_i = Σ_{m=1}^{n} a_m · z_m, where a_m is the coefficient corresponding to the m-th feature value of the sample feature vector, z_m is the m-th feature value of the sample feature vector x_i, y_i is the estimated character error rate of the i-th sample feature vector, and n is the number of feature values in the sample feature vector. After a_m is initialized, the purpose of model training is to obtain the optimal a_m so that y_i approaches the second label.
In S209, the normalized sample feature vector corresponding to each speech recognition result is input into the model to obtain the estimated character error rate of each speech recognition result.
For example, the normalized sample feature vectors of the speech recognition results can be input into the initialized model, that is, x'_i from S207 is input into the model, and for each x'_i the model outputs an estimated character error rate y_i.
In S210, the loss rate is calculated using the estimated character error rate of each speech recognition result and the second label.
In the embodiment of the present application, the loss function is the mean square loss function: mse_loss = (1 / (N·K)) · Σ_{i=1}^{N·K} (y_i - ŷ_i)^2, where mse_loss is the loss rate, y_i is the estimated character error rate of the i-th sample feature vector, and ŷ_i is the second label corresponding to the i-th sample feature vector. For example, if the number of voice data samples is N and K is the number of speech recognition results extracted from the multiple speech recognition results of one voice data sample, then the total number of extracted speech recognition results is N*K, and the estimated character error rates and the second labels of the N*K speech recognition results are substituted into the above mean square loss function to calculate the loss rate.
In S211, when the loss rate does not meet a preset condition, the gradient is calculated using the loss rate.
If the calculated loss rate is less than a preset threshold, iteration of the model is stopped; when the loss rate is greater than or equal to the preset threshold, the gradient is calculated using the loss rate. For example, a preset gradient algorithm can be used to calculate the gradient; the embodiment of the present application places no restriction on the algorithm used to calculate the gradient.
In S212, the model parameters are adjusted using the gradient, and the process returns to S209.
For example, the gradient and a preset learning rate (the learning rate is a hyperparameter of the model) can be used to perform gradient descent on the current parameters of the model to obtain the model with adjusted model parameters, and the process returns to S209 to continue iterating the model until the loss rate is less than the preset threshold; of course, training of the model can also be stopped when the number of iterations reaches a preset number, so as to obtain the re-scoring model used to re-score the speech recognition results.
In the embodiment of the present application, the voice data samples are input into the decoding model to obtain multiple speech recognition results, a preset number of speech recognition results are extracted as the multiple speech recognition results of the voice data samples, each speech recognition result is input into multiple different language models respectively to obtain the scores of the speech recognition result under the different language models, each speech recognition result is analyzed to extract the sentence structure features of the speech recognition result, the multiple scores of the speech recognition result under the multiple different language models and the sentence structure features are combined into the sample feature vector of the speech recognition result, the character error rate calculated using the speech recognition result and the first label of the voice data sample is used as the second label of the speech recognition result, and the re-scoring model is trained with the sample feature vectors and the second labels. This can mine the implicit internal correlation between the second label and the scores given by multiple different language models so as to obtain the best way of combining the scores of the different language models, which eliminates subjective human factors and ensures the accuracy of the speech recognition results; even if the scoring mechanism of a language model changes, there is no need to modify the weights between the scores, which improves the generality and universality of the re-scoring model.
Calculating the character error rate using the speech recognition result and the first label as the second label of the speech recognition result enables the model to indirectly learn the character error rate of the voice data samples through the speech recognition results, so that a better speech recognition result can be obtained using the model.
Embodiment 3
FIG. 4 is a flowchart of the steps of a speech recognition method provided in Embodiment 3 of the present application. The embodiment of the present application is applicable to the situation of speech recognition. The method can be executed by the speech recognition device implemented in the present application; the speech recognition device can be implemented by hardware or software and integrated into the device provided in the embodiments of the present application. As shown in FIG. 4, the speech recognition method of the embodiment of the present application may include steps S301 to S305.
In S301, multiple speech recognition results of voice data to be recognized are acquired.
In the embodiment of the present application, the voice data to be recognized can be data whose speech needs to be converted into text; for example, it can be voice data in a short video, voice data on a chat interface of an instant messaging application, and so on. In the embodiment of the present application, the voice data to be recognized can be input into the decoding model to obtain multiple speech recognition results; for this process, reference can be made to obtaining the multiple speech recognition results of a voice data sample in Embodiment 1 or Embodiment 2, which is not described in detail again in this embodiment.
In S302, multiple scores of each speech recognition result under multiple different language models are acquired.
Optionally, the speech recognition results can be input into the acoustic model, the n-gram language model and the RNNLM language model respectively to obtain 3 scores for each speech recognition result.
In S303, a feature vector corresponding to each speech recognition result of the voice data to be recognized is obtained on the basis of each speech recognition result and the multiple scores.
For an example, reference can be made to S204-S207 in Embodiment 2, which is not described in detail here.
In S304, the feature vector corresponding to each speech recognition result is input into the pre-trained re-scoring model to obtain the final score of each speech recognition result.
In the embodiment of the present application, the re-scoring model can be trained by the re-scoring model training method described in either Embodiment 1 or Embodiment 2; the re-scoring model can re-score the multiple speech recognition results of the voice data to be recognized, and after the feature vectors are input into the pre-trained re-scoring model, the final score of each speech recognition result can be obtained.
In S305, the speech recognition result corresponding to the smallest final score among the final scores of the multiple speech recognition results is determined as the final recognition result of the voice data to be recognized.
In the embodiment of the present application, the final score expresses the character error rate of the speech recognition result relative to the real result; the smaller the character error rate, the closer the speech recognition result is to the real result, so the speech recognition result with the smallest final score can be determined as the final recognition result of the voice data to be recognized.
When training the re-scoring model, the embodiment of the present application obtains the sample feature vectors and second labels of the voice data samples on the basis of the speech recognition results, the scores and the first labels; the sample feature vectors and the second labels are used for training the re-scoring model, mining the implicit internal correlation between the second label and the scores given by multiple different language models so as to obtain the best way of combining the scores of the different language models. When the multiple speech recognition results of the voice data to be recognized are re-scored with the re-scoring model, subjective human factors can be eliminated and the accuracy of the speech recognition results is ensured; even if the scoring mechanism of a language model changes, there is no need to modify the weights between the scores, which improves the generality and universality of the re-scoring model.
Embodiment 4
FIG. 5 is a structural block diagram of a re-scoring model training device provided in Embodiment 4 of the present application. As shown in FIG. 5, the re-scoring model training device of the embodiment of the present application may include a first acquisition module 401, a scoring module 402, a second acquisition module 403, and a model training module 404.
The first acquisition module 401 is configured to acquire multiple speech recognition results of each voice data sample in at least one voice data sample and a first label of each voice data sample, where the first label is a pre-annotated label of each voice data sample.
The scoring module 402 is configured to acquire multiple scores of each speech recognition result under multiple different language models.
The second acquisition module 403 is configured to obtain, on the basis of the multiple speech recognition results of each voice data sample, the multiple scores and the first label, multiple sample feature vectors and multiple second labels corresponding to each voice data sample, where the sample feature vectors and the second labels are used for training the re-scoring model.
The model training module 404 is configured to train a model with the sample feature vectors corresponding to the at least one voice data sample and the second labels to obtain a re-scoring model for re-scoring the speech recognition results.
Optionally, the first acquisition module 401 includes a decoding sub-module and a speech recognition result extraction sub-module.
The decoding sub-module is configured to input each voice data sample in the at least one voice data sample into a decoding model to obtain multiple speech recognition results, where each voice data sample has a pre-annotated first label.
The speech recognition result extraction sub-module is configured to extract a preset number of speech recognition results as the multiple speech recognition results of each voice data sample.
Optionally, the scoring module 402 includes:
a scoring model input sub-module configured to input each speech recognition result into multiple different language models respectively to obtain multiple scores of each speech recognition result under the different language models.
Optionally, the multiple different language models include an acoustic model, an n-gram language model, and an RNNLM language model.
Optionally, the second acquisition module 403 includes a sentence structure feature acquisition sub-module, a feature combination sub-module, and a second label acquisition sub-module.
The sentence structure feature acquisition sub-module is configured to analyze, for each speech recognition result, the speech recognition result to extract sentence structure features of the speech recognition result.
The feature combination sub-module is configured to combine the multiple scores of each speech recognition result under the multiple different language models and the sentence structure features into a sample feature vector corresponding to each speech recognition result.
The second label acquisition sub-module is configured to calculate the character error rate of each speech recognition result using each speech recognition result and the first label, as the second label of each speech recognition result.
Optionally, the sentence structure features include at least one of the following:
word frequency, character frequency, character or word ordering, sentence length, character count, word count.
Optionally, the device further includes:
a feature normalization processing module configured to perform normalization processing on each sample feature vector corresponding to the at least one voice data sample to obtain normalized sample feature vectors.
Optionally, the feature normalization processing module includes a maximum and minimum sample feature vector determination sub-module, a difference calculation sub-module, and a sample feature vector calculation sub-module.
The maximum and minimum sample feature vector determination sub-module is configured to determine, among all the sample feature vectors corresponding to the at least one voice data sample, the sample feature vector with the largest modulus and the sample feature vector with the smallest modulus.
The difference calculation sub-module is configured to calculate the difference between the sample feature vector with the largest modulus and the sample feature vector with the smallest modulus to obtain a vector difference.
The sample feature vector calculation sub-module is configured to use the result obtained by dividing each sample feature vector corresponding to the at least one voice data sample by the modulus of the vector difference as the normalized sample feature vector of each sample feature vector corresponding to the at least one voice data sample.
Optionally, the model training module 404 includes an initialization model sub-module, a feature input sub-module, a loss rate calculation sub-module, a gradient calculation sub-module, and a model parameter adjustment sub-module.
The initialization model sub-module is configured to initialize the model parameters.
The feature input sub-module is configured to input the normalized sample feature vector corresponding to each speech recognition result into the model to obtain the estimated character error rate of each speech recognition result.
The loss rate calculation sub-module is configured to calculate the loss rate using the estimated character error rate of each speech recognition result and the second label.
The gradient calculation sub-module is configured to calculate the gradient using the loss rate when the loss rate does not meet a preset condition.
The model parameter adjustment sub-module is configured to adjust the model parameters using the gradient and return to the feature input sub-module.
Optionally, the loss rate calculation sub-module includes:
a loss rate calculation unit configured to substitute the estimated character error rate of each speech recognition result and the second label into a preset mean square loss function to calculate the loss rate.
The re-scoring model training device provided in the embodiment of the present application can execute the re-scoring model training method described in Embodiment 1 or Embodiment 2 of the present application, and has functional modules corresponding to the executed method.
Embodiment 5
FIG. 6 is a structural block diagram of a speech recognition device provided in Embodiment 5 of the present application. As shown in FIG. 6, the speech recognition device of the embodiment of the present application may include a speech recognition result acquisition module 501, an initial score acquisition module 502, a feature vector acquisition module 503, a final score prediction module 504, and a speech recognition result determination module 505.
The speech recognition result acquisition module 501 is configured to acquire multiple speech recognition results of voice data to be recognized.
The initial score acquisition module 502 is configured to acquire multiple scores of each speech recognition result under multiple different language models.
The feature vector acquisition module 503 is configured to obtain, on the basis of each speech recognition result and the multiple scores, a feature vector corresponding to each speech recognition result of the voice data to be recognized.
The final score prediction module 504 is configured to input the feature vector corresponding to each speech recognition result into the pre-trained re-scoring model to obtain the final score of each speech recognition result.
The speech recognition result determination module 505 is configured to determine the speech recognition result corresponding to the smallest final score among the final scores of the multiple speech recognition results as the final recognition result of the voice data to be recognized.
The re-scoring model is trained by the re-scoring model training method described in any embodiment of the present application.
The speech recognition device provided in the embodiment of the present application can execute the speech recognition method described in Embodiment 3 of the present application, and has functional modules corresponding to the executed method.
Embodiment 6
Referring to FIG. 7, a schematic structural diagram of a device in an example of the present application is shown. As shown in FIG. 7, the device may include: a processor 60, a memory 61, a display screen 62 with a touch function, an input device 63, an output device 64, and a communication device 65. The number of processors 60 in the device may be at least one, and one processor 60 is taken as an example in FIG. 7. The processor 60, the memory 61, the display screen 62, the input device 63, the output device 64 and the communication device 65 of the device may be connected by a bus or in other ways; in FIG. 7, connection by a bus is taken as an example.
As a computer-readable storage medium, the memory 61 can be configured to store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the re-scoring model training methods described in Embodiment 1 and Embodiment 2 of the present application (for example, the first acquisition module 401, the scoring module 402, the second acquisition module 403 and the model training module 404 in the re-scoring model training device of Embodiment 4 above), or the program instructions/modules corresponding to the speech recognition method described in Embodiment 3 of the present application (for example, the speech recognition result acquisition module 501, the initial score acquisition module 502, the feature vector acquisition module 503 and the final score prediction module 504 in the speech recognition device of Embodiment 5 above). The memory 61 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the device and the like. In addition, the memory 61 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device or another non-volatile solid-state storage device. In some examples, the memory 61 may include memories remotely arranged with respect to the processor 60, and these remote memories may be connected to the device through a network. Examples of the aforementioned network include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
The display screen 62 is a display screen 62 with a touch function, which may be a capacitive screen, an electromagnetic screen or an infrared screen. In general, the display screen 62 is configured to display data according to instructions of the processor 60, and is also configured to receive touch operations acting on the display screen 62 and send corresponding signals to the processor 60 or other devices. Optionally, when the display screen 62 is an infrared screen, it further includes an infrared touch frame arranged around the display screen 62, which can also be configured to receive infrared signals and send the infrared signals to the processor 60 or other devices.
The communication device 65 is configured to establish a communication connection with other devices, and may be a wired communication device and/or a wireless communication device.
The input device 63 can be configured to receive input digital or character information and to generate key signal inputs related to the user settings and function control of the device, and can also be a camera configured to acquire images and a sound pickup device configured to acquire audio data. The output device 64 may include audio equipment such as a speaker. It should be noted that the composition of the input device 63 and the output device 64 can be set according to actual conditions.
The processor 60 executes various functional applications and data processing of the device by running the software programs, instructions and modules stored in the memory 61, that is, implements at least one of the above-mentioned re-scoring model training method and speech recognition method.
For example, when the processor 60 executes at least one program stored in the memory 61, at least one of the re-scoring model training method and the speech recognition method provided in the embodiments of the present application is implemented.
An embodiment of the present application also provides a computer-readable storage medium; when the instructions in the storage medium are executed by the processor of a device, the device is enabled to execute at least one of the re-scoring model training method and the speech recognition method described in the above method embodiments.
It should be noted that, for the device, equipment and storage medium embodiments, since they are basically similar to the method embodiments, the descriptions are relatively simple, and for related details, reference can be made to the corresponding parts of the descriptions of the method embodiments.
Through the above description of the implementations, those skilled in the art can clearly understand that the present application can be implemented by means of software and necessary general-purpose hardware, and of course can also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the related art, can be embodied in the form of a software product; the computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, a read-only memory (ROM), a random access memory (RAM), a flash memory (FLASH), a hard disk or an optical disk, and includes several instructions to make a computer device (which may be a robot, a personal computer, a server, a network device, or the like) execute the re-scoring model training method and/or the speech recognition method described in any embodiment of the present application.
It is worth noting that, in the above re-scoring model training device and speech recognition device, the units and modules included are only divided according to functional logic, but are not limited to the above division, as long as the corresponding functions can be realized; in addition, the names of the functional units are only for the convenience of distinguishing them from each other, and are not used to limit the protection scope of the present application.
It should be understood that each part of the present application can be implemented by hardware, software, firmware or a combination thereof. In the above implementations, multiple steps or methods can be implemented by software or firmware stored in a memory and executed by a suitable instruction execution device. For example, if implemented by hardware, as in another implementation, any one of the following technologies known in the art, or a combination of them, can be used: discrete logic circuits having logic gate circuits configured to implement logic functions on data signals, application-specific integrated circuits having suitable combinational logic gate circuits, programmable gate arrays (PGA), field programmable gate arrays (FPGA), and so on.
In the description of this specification, descriptions referring to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" and the like mean that the specific features, structures, materials or characteristics described in connection with the embodiment or example are included in at least one embodiment or example of the present application. In this specification, the schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described can be combined in a suitable manner in any at least one embodiment or example.

Claims (15)

  1. A re-scoring model training method, comprising:
    acquiring a plurality of speech recognition results of each voice data sample in at least one voice data sample and a first label of each voice data sample, wherein the first label is a pre-annotated label of each voice data sample;
    acquiring a plurality of scores of each speech recognition result under a plurality of different language models;
    obtaining, based on the plurality of speech recognition results of each voice data sample, the plurality of scores and the first label, a plurality of sample feature vectors and a plurality of second labels corresponding to each voice data sample, wherein the sample feature vectors and the second labels are used for training of the re-scoring model;
    training a model by using the sample feature vectors and the second labels corresponding to the at least one voice data sample, to obtain a re-scoring model used for re-scoring the speech recognition results.
  2. The method according to claim 1, wherein acquiring the plurality of speech recognition results of each voice data sample in the at least one voice data sample and the first label of each voice data sample comprises:
    inputting each voice data sample in the at least one voice data sample into a decoding model to obtain a plurality of speech recognition results, wherein each voice data sample has a pre-annotated first label;
    extracting a preset number of speech recognition results as the plurality of speech recognition results of each voice data sample.
  3. The method according to claim 1, wherein acquiring the plurality of scores of each speech recognition result under the plurality of different language models comprises:
    inputting each speech recognition result into a plurality of different language models respectively, to obtain a plurality of scores of each speech recognition result under the different language models.
  4. The method according to any one of claims 1-3, wherein the plurality of different language models comprise an acoustic model, an n-gram language model and an RNNLM language model.
  5. The method according to claim 1, wherein obtaining the plurality of sample feature vectors and the plurality of second labels of each voice data sample based on the plurality of speech recognition results of each voice data sample, the plurality of scores and the first label comprises:
    for each speech recognition result, analyzing the speech recognition result to extract sentence and word structure features of the speech recognition result;
    combining the plurality of scores of each speech recognition result under the plurality of different language models and the sentence and word structure features into a sample feature vector corresponding to each speech recognition result;
    calculating a character error rate of each speech recognition result by using each speech recognition result and the first label, as a second label of each speech recognition result.
  6. The method according to claim 5, wherein the sentence and word structure features comprise at least one of the following:
    word frequency, character frequency, character or word order, sentence length, number of characters, and number of words.
  7. The method according to claim 1, wherein before training the model by using the sample feature vectors and the second labels corresponding to the at least one voice data sample to obtain the re-scoring model, the method comprises:
    normalizing each sample feature vector corresponding to the at least one voice data sample to obtain normalized sample feature vectors.
  8. The method according to claim 7, wherein normalizing each sample feature vector corresponding to the at least one voice data sample to obtain the normalized sample feature vectors comprises:
    determining, among all sample feature vectors corresponding to the at least one voice data sample, a sample feature vector with a largest modulus and a sample feature vector with a smallest modulus;
    calculating a difference between the sample feature vector with the largest modulus and the sample feature vector with the smallest modulus to obtain a vector difference;
    taking a result obtained by dividing each sample feature vector corresponding to the at least one voice data sample by a modulus of the vector difference as a normalized sample feature vector of each sample feature vector corresponding to the at least one voice data sample.
  9. The method according to claim 7 or 8, wherein training the model by using the sample feature vectors and the second labels corresponding to the at least one voice data sample to obtain the re-scoring model comprises:
    initializing model parameters;
    inputting the normalized sample feature vector corresponding to each speech recognition result into the model to obtain an estimated character error rate of each speech recognition result;
    calculating a loss rate by using the estimated character error rate of each speech recognition result and the second label;
    calculating a gradient by using the loss rate in a case where the loss rate does not satisfy a preset condition;
    adjusting the model parameters by using the gradient, and returning to the step of inputting the normalized sample feature vector of each speech recognition result into the model to obtain the estimated character error rate of each speech recognition result.
  10. The method according to claim 9, wherein calculating the loss rate by using the estimated character error rate of each speech recognition result and the second label comprises:
    substituting the estimated character error rate of each speech recognition result and the second label into a preset mean square loss function for calculation.
  11. A speech recognition method, comprising:
    acquiring a plurality of speech recognition results of voice data to be recognized;
    acquiring a plurality of scores of each speech recognition result under a plurality of different language models;
    obtaining, based on each speech recognition result and the plurality of scores, a feature vector corresponding to each speech recognition result of the voice data to be recognized;
    inputting the feature vector corresponding to each speech recognition result into a pre-trained re-scoring model to obtain a final score of each speech recognition result;
    determining a speech recognition result corresponding to a smallest final score among final scores of the plurality of speech recognition results as a final recognition result of the voice data to be recognized;
    wherein the re-scoring model is obtained through training by the re-scoring model training method according to any one of claims 1-10.
  12. A re-scoring model training apparatus, comprising:
    a first acquisition module, configured to acquire a plurality of speech recognition results of each voice data sample in at least one voice data sample and a first label of each voice data sample, wherein the first label is a pre-annotated label of each voice data sample;
    a scoring module, configured to acquire a plurality of scores of each speech recognition result under a plurality of different language models;
    a second acquisition module, configured to obtain, based on the plurality of speech recognition results of each voice data sample, the plurality of scores and the first label, a plurality of sample feature vectors and a plurality of second labels corresponding to each voice data sample, wherein the sample feature vectors and the second labels are used for training of a re-scoring model;
    a model training module, configured to train a model according to the sample feature vectors and the second labels corresponding to the at least one voice data sample, to obtain a re-scoring model used for re-scoring the speech recognition results.
  13. A speech recognition apparatus, comprising:
    a speech recognition result acquisition module, configured to acquire a plurality of speech recognition results of voice data to be recognized;
    an initial score acquisition module, configured to acquire a plurality of scores of each speech recognition result under a plurality of different language models;
    a feature vector acquisition module, configured to obtain, based on each speech recognition result and the plurality of scores, a feature vector corresponding to each speech recognition result of the voice data to be recognized;
    a final score prediction module, configured to input the feature vector corresponding to each speech recognition result into a pre-trained re-scoring model to obtain a final score of each speech recognition result;
    a speech recognition result determination module, configured to determine a speech recognition result corresponding to a smallest final score among final scores of the plurality of speech recognition results as a final recognition result of the voice data to be recognized;
    wherein the re-scoring model is obtained through training by the re-scoring model training method according to any one of claims 1-10.
  14. A device, comprising:
    at least one processor; and
    a storage apparatus, configured to store at least one program;
    wherein when the at least one program is executed by the at least one processor, the at least one processor is enabled to implement at least one of the following: the re-scoring model training method according to any one of claims 1-10, or the speech recognition method according to claim 11.
  15. A computer-readable storage medium storing a computer program, wherein when the computer program is executed by a processor, at least one of the following is implemented: the re-scoring model training method according to any one of claims 1-10, or the speech recognition method according to claim 11.
PCT/CN2020/138536 2019-12-31 2020-12-23 Re-scoring model training method and apparatus, and speech recognition method and apparatus WO2021136029A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911413152.3A CN111179916B (zh) 2019-12-31 2019-12-31 Re-scoring model training method, speech recognition method and related apparatus
CN201911413152.3 2019-12-31

Publications (1)

Publication Number Publication Date
WO2021136029A1 true WO2021136029A1 (zh) 2021-07-08

Family

ID=70657647

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/138536 WO2021136029A1 (zh) 2019-12-31 2020-12-23 Re-scoring model training method and apparatus, and speech recognition method and apparatus

Country Status (2)

Country Link
CN (1) CN111179916B (zh)
WO (1) WO2021136029A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113793593A (zh) * 2021-11-18 2021-12-14 北京优幕科技有限责任公司 Training data generation method and device suitable for a speech recognition model

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111179916B (zh) * 2019-12-31 2023-10-13 广州市百果园信息技术有限公司 Re-scoring model training method, speech recognition method and related apparatus
CN112562640B (zh) * 2020-12-01 2024-04-12 北京声智科技有限公司 Multilingual speech recognition method, apparatus, system and computer-readable storage medium
CN112700766B (zh) * 2020-12-23 2024-03-19 北京猿力未来科技有限公司 Speech recognition model training method and apparatus, and speech recognition method and apparatus
CN112967720B (zh) * 2021-01-29 2022-12-30 南京迪港科技有限责任公司 End-to-end speech-to-text model optimization method with a small amount of heavily accented data
CN112885336B (zh) * 2021-01-29 2024-02-02 深圳前海微众银行股份有限公司 Training and recognition methods and apparatus for a speech recognition system, and electronic device
CN113380228A (zh) * 2021-06-08 2021-09-10 北京它思智能科技有限公司 Online speech recognition method and system based on a recurrent neural network language model
CN113378586B (zh) * 2021-07-15 2023-03-28 北京有竹居网络技术有限公司 Speech translation method, translation model training method, apparatus, medium and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110144986A1 (en) * 2009-12-10 2011-06-16 Microsoft Corporation Confidence calibration in automatic speech recognition systems
WO2013088707A1 (ja) * 2011-12-16 2013-06-20 日本電気株式会社 Dictionary learning device, pattern matching device, dictionary learning method, and storage medium
CN105845130A (zh) * 2016-03-30 2016-08-10 乐视控股(北京)有限公司 Acoustic model training method and apparatus for speech recognition
CN108415898A (zh) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 Word lattice re-scoring method and system for deep learning language models
CN110349597A (zh) * 2019-07-03 2019-10-18 山东师范大学 Voice detection method and apparatus
CN110580290A (zh) * 2019-09-12 2019-12-17 北京小米智能科技有限公司 Method and apparatus for optimizing a training set for text classification
CN111179916A (zh) * 2019-12-31 2020-05-19 广州市百果园信息技术有限公司 Re-scoring model training method, speech recognition method and related apparatus

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9218339B2 (en) * 2011-11-29 2015-12-22 Educational Testing Service Computer-implemented systems and methods for content scoring of spoken responses
CN103426428B (zh) * 2012-05-18 2016-05-25 华硕电脑股份有限公司 Speech recognition method and system
CN103474061A (zh) * 2013-09-12 2013-12-25 河海大学 Automatic Chinese dialect identification method based on classifier fusion
CN108711422B (zh) * 2018-05-14 2023-04-07 腾讯科技(深圳)有限公司 Speech recognition method and apparatus, computer-readable storage medium and computer device
CN110111775B (zh) * 2019-05-17 2021-06-22 腾讯科技(深圳)有限公司 Streaming speech recognition method, apparatus, device and storage medium

Also Published As

Publication number Publication date
CN111179916A (zh) 2020-05-19
CN111179916B (zh) 2023-10-13

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20908970

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20908970

Country of ref document: EP

Kind code of ref document: A1