CN118553234A - Speech recognition model training, testing and speech recognition method and device - Google Patents

Publication number
CN118553234A
Authority
CN
China
Prior art keywords
training
model
voice
recognition model
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410726281.2A
Other languages
Chinese (zh)
Inventor
赵镜儒
石东升
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202410726281.2A
Publication of CN118553234A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Machine Translation (AREA)

Abstract

The disclosure provides a speech recognition model training method and device, relating to the technical field of artificial intelligence, in particular to speech recognition, deep learning, large models and the like, and applicable to scenarios such as artificial intelligence content generation. The specific implementation scheme is as follows: obtaining a speech sample set comprising at least one speech sample, each speech sample comprising an audio feature sequence and an initial word unit sequence; obtaining an initial speech recognition model, which characterizes the correspondence between an audio feature sequence and a predicted word unit sequence; replacing the language word unit in each initial word unit sequence in the speech sample set with a predicted word unit that characterizes a language, to obtain a training sample set, wherein that predicted word unit is taken from the predicted word unit sequence obtained by inputting a speech sample selected from the speech sample set into the speech recognition model; and training the speech recognition model based on the training sample set to obtain a trained speech recognition model.

Description

Speech recognition model training, testing and speech recognition method and device
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of voice recognition, deep learning, large models and the like, and can be applied to scenes such as content generation of artificial intelligence, and particularly relates to a voice recognition model training method and device, a voice recognition model testing method and device, a voice recognition method and device, electronic equipment, a computer readable storage medium and a computer program product.
Background
At present, speech recognition technology is mature; for widely used languages such as English and Chinese in particular, its accuracy is comparable to that of professionals. For some less widely used languages, however, objective factors such as the small number of training samples leave existing models with weaker speech recognition capability.
Disclosure of Invention
The present disclosure provides a speech recognition model training method and apparatus, a speech recognition model testing method and apparatus, a speech recognition method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
According to a first aspect, there is provided a speech recognition model training method, the method comprising: obtaining a speech sample set comprising at least one speech sample, each speech sample comprising an audio feature sequence and an initial word unit sequence; obtaining an initial speech recognition model, which characterizes the correspondence between an audio feature sequence and a predicted word unit sequence; replacing the language word unit in each initial word unit sequence in the speech sample set with a predicted word unit that characterizes a language, to obtain a training sample set, wherein that predicted word unit is taken from the predicted word unit sequence obtained by inputting a speech sample selected from the speech sample set into the speech recognition model; and training the speech recognition model based on the training sample set to obtain a trained speech recognition model.
According to a second aspect, there is provided a speech recognition model testing method, the method comprising: acquiring a test set and a trained speech recognition model, wherein the test set comprises at least one test sample, and the test sample comprises: an audio feature sequence and a test word unit sequence; the trained speech recognition model is obtained by training a speech recognition model training method as described in any implementation manner of the first aspect, and the speech recognition model comprises: an encoder, a decoder, and a keyword module; selecting a test sample from the test set; inputting the audio feature sequence in the test sample into an encoder to obtain audio intermediate features; inputting the audio intermediate characteristics and the current word unit sequence into a decoder to obtain a predicted word unit output by a voice recognition model; responding to the predicted word unit as a non-ending symbol, updating the current word unit sequence based on the predicted word unit, and continuously inputting the audio intermediate feature and the current word unit sequence into a decoder until the predicted word unit output by the speech recognition model is the ending symbol, so as to obtain all the predicted word units corresponding to the test sample; and detecting whether the voice recognition model is qualified or not based on all the predicted word units corresponding to the test sample and the test word unit sequences in the test sample.
According to a third aspect, there is provided a speech recognition method, the method comprising: acquiring voice to be recognized; processing the voice to be recognized to obtain audio characteristic data; inputting the audio feature data into a voice recognition model to obtain a predicted word unit sequence of the voice to be recognized, wherein the voice recognition model is a trained voice recognition model obtained by adopting the voice recognition model training method described in any implementation manner of the first aspect; based on the predicted word unit sequence, text data of the voice to be recognized is obtained.
According to a fourth aspect, there is provided a speech recognition model training apparatus, the apparatus comprising: a sample acquisition unit configured to acquire a set of speech samples, the set of speech samples including at least one speech sample, the speech sample including: an audio feature sequence and an initial word unit sequence; the model acquisition unit is configured to acquire an initial voice recognition model, wherein the voice recognition model is used for representing the corresponding relation between the audio feature sequence and the predicted word unit sequence; the replacing unit is configured to replace language word units in the initial word unit sequence in the voice sample set by using the predicted word units of the characterization language to obtain a training sample set, wherein the predicted word units are predicted word units in the predicted word unit sequence obtained by inputting voice samples selected from the voice sample set into a voice recognition model; and the training unit is configured to train the voice recognition model based on the training sample set to obtain a trained voice recognition model.
According to a fifth aspect, there is provided a speech recognition model testing apparatus comprising: an information acquisition unit configured to acquire a test set including at least one test sample including: an audio feature sequence and a test word unit sequence; the trained speech recognition model is obtained by training a speech recognition model training device as described in any implementation manner of the fourth aspect, where the speech recognition model includes: an encoder, a decoder, and a keyword module; a selecting unit configured to select a test sample from the test set; the input unit is configured to input the audio feature sequence in the test sample into the encoder to obtain audio intermediate features; the obtaining unit is configured to input the audio intermediate characteristics and the current word unit sequence into the decoder to obtain a predicted word unit output by the voice recognition model; the updating unit is configured to respond to the fact that the predicted word unit is a non-ending symbol, update the current word unit sequence based on the predicted word unit, continuously input the audio intermediate feature and the current word unit sequence into the decoder until the predicted word unit output by the voice recognition model is the ending symbol, and obtain all the predicted word units corresponding to the test sample; and the test unit is configured to detect whether the voice recognition model is qualified or not based on all the prediction word units corresponding to the test sample and the test word unit sequences in the test sample.
According to a sixth aspect there is provided a speech recognition apparatus comprising: a voice acquisition unit configured to acquire a voice to be recognized; the processing unit is configured to process the voice to be recognized to obtain audio characteristic data; the recognition unit is configured to input the audio feature data into a voice recognition model to obtain a predicted word unit sequence of the voice to be recognized, wherein the voice recognition model is a trained voice recognition model obtained by a voice recognition model training device described by any implementation mode of the fourth aspect; and the conversion unit is configured to obtain text data of the voice to be recognized based on the predicted word unit sequence.
According to a seventh aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in any one of the implementations of the first, second, or third aspects.
According to an eighth aspect, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform a method as described in any implementation of the first, second or third aspects.
According to a ninth aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as described in any of the implementations of the first, second or third aspects.
The embodiments of the disclosure provide a speech recognition model training method and device. First, a speech sample set is obtained; the speech sample set comprises at least one speech sample, and each speech sample comprises an audio feature sequence and an initial word unit sequence. Second, an initial speech recognition model is obtained, which characterizes the correspondence between an audio feature sequence and a predicted word unit sequence. Third, the language word unit in each initial word unit sequence in the speech sample set is replaced with a predicted word unit that characterizes a language, to obtain a training sample set, wherein that predicted word unit is taken from the predicted word unit sequence obtained by inputting a speech sample selected from the speech sample set into the speech recognition model. Finally, the speech recognition model is trained based on the training sample set to obtain a trained speech recognition model. Because the language word units in the initial word unit sequences are replaced with the predicted word unit characterizing the language before training, the model predicts from the audio feature sequence in a word unit environment it is already familiar with, which improves both the convergence rate and the training effect of the model.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of one embodiment of a speech recognition model training method according to the present disclosure;
FIG. 2 is a schematic diagram of one architecture of the keyword module training process of the present disclosure;
FIG. 3 is a flow chart of one embodiment of a speech recognition model testing method according to the present disclosure;
FIG. 4 is a schematic diagram of a structure of the speech recognition model testing process of the present disclosure;
FIG. 5 is a flow chart of one embodiment of a speech recognition method according to the present disclosure;
FIG. 6 is a schematic diagram of a structure of one embodiment of a speech recognition model training apparatus according to the present disclosure;
FIG. 7 is a schematic diagram of a structure of one embodiment of a speech recognition model testing apparatus according to the present disclosure;
FIG. 8 is a schematic diagram of a structure of one embodiment of a speech recognition apparatus according to the present disclosure;
FIG. 9 is a block diagram of an electronic device used to implement a speech recognition model training method or a speech recognition model testing method or a speech recognition method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Traditional speech recognition techniques mainly implement speech-to-text conversion based on acoustic models and language models. The acoustic model captures the relation between speech and text; common acoustic models are the hidden Markov model and the Gaussian mixture model. The language model captures the relation among words in the text; a common language model is the N-gram model. Because traditional speech recognition is limited by manually designed speech features and non-end-to-end recognition schemes, its recognition capability cannot match that of humans.
The deep learning-based speech recognition technique is capable of automatically obtaining useful feature representations from raw speech data and learning complex features and structures in speech signals using complex network model structures. This allows for higher accuracy and robustness of speech recognition techniques in processing non-standard speech, accents, dialects, and speech at different speech rates.
With the wide use of model structures such as the Transformer in the field of speech recognition and the rapid development of computer hardware, speech recognition technology has made a qualitative leap. For widely used languages such as English and Chinese, speech recognition is already comparable to professionals; for languages that are not widely used, however, the recognition capability of speech recognition models is still weak.
Based on this, the present disclosure proposes a speech recognition model training method. FIG. 1 shows a flow 100 of one embodiment of the speech recognition model training method according to the present disclosure. The speech recognition model training method comprises the following steps:
Step 101, a speech sample set is obtained.
In this embodiment, the speech sample set is a set of speech samples of the language to be recognized. For example, the language to be recognized is a language with few samples; the speech recognition model recognizes speech in that language to obtain text in that language, which can facilitate the dissemination, learning and study of that language.
In this embodiment, the executing body on which the speech recognition model training method operates may acquire the speech sample set in various manners, for example, the executing body may acquire the speech sample set stored therein from the database server in a wired connection manner or a wireless connection manner. For another example, a user may obtain a set of speech samples collected by a terminal by communicating with the terminal.
Here, the set of speech samples may include at least one speech sample including: an audio feature sequence and an initial word unit sequence; the audio feature sequence is a sequence obtained after feature extraction of audio, such as mel features, fbank (FilterBank) features and the like.
In this embodiment, the initial word unit sequence, also called a token sequence, is the sequence of word units obtained by dividing the text corresponding to the audio feature sequence into language units such as words, punctuation marks, numbers, or pure letters. The initial word unit sequence also contains a language word unit that indicates the language of the text corresponding to the audio feature sequence.
In the technical scheme of the disclosure, the related processing such as collection, storage, use, processing, transmission, provision, disclosure and the like of the voice sample set is performed after authorization, and accords with related laws and regulations.
Step 102, an initial speech recognition model is obtained.
In this embodiment, the speech recognition model is used to characterize the correspondence between the audio feature sequence and the predicted word unit sequence. Specifically, the speech recognition model may be a Whisper model, which is a standard Transformer model. The Whisper model takes audio of fixed length (30 seconds) as input: the fbank features extracted from the audio are first input into the encoder of the Whisper model to generate intermediate features of the audio, and the decoder of the Whisper model then decodes the intermediate features and outputs the predicted word unit sequence.
Optionally, the speech recognition model may also be an MMS (Massively Multilingual Speech) model. The MMS model can recognize more than 4000 spoken languages and, compared with the Whisper model, has more model parameters, more pre-training data and more powerful model functions, but its parameter count is huge and its deployment cost is too high.
Step 103, replacing the language word units in the initial word unit sequence in the voice sample set by the predicted word units of the characterization language to obtain a training sample set.
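The fixed 30-second input length mentioned above can be illustrated with a minimal sketch. The 16 kHz sampling rate and the name `pad_or_trim` are assumptions made for this illustration; the text itself only states the 30-second length.

```python
from typing import List

SAMPLE_RATE = 16_000   # assumed sampling rate in Hz (not stated in the text)
CHUNK_SECONDS = 30     # fixed input length stated in the text
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS


def pad_or_trim(waveform: List[float], length: int = CHUNK_SAMPLES) -> List[float]:
    """Return a copy of `waveform` with exactly `length` samples."""
    if len(waveform) >= length:
        return waveform[:length]          # longer audio is cut to the fixed length
    # shorter audio is right-padded with silence (zeros)
    return waveform + [0.0] * (length - len(waveform))
```

Feature extraction (for example fbank features) would then run on the fixed-length waveform before the encoder is called.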
In this embodiment, the predicted word units are predicted word units in a predicted word unit sequence obtained by inputting a voice sample selected from a voice sample set into a voice recognition model.
In this embodiment, a speech sample is selected from the speech sample set, and the audio feature sequence of that speech sample is input into the speech recognition model to obtain the predicted word unit sequence output by the model. One predicted word unit in this sequence characterizes a language. That predicted word unit replaces the language word unit in each initial word unit sequence in the speech sample set, so that the speech recognition model is more familiar with the resulting training sample set.
In this embodiment, replacing the language word unit in an initial word unit sequence with the predicted word unit characterizing the language yields a new word unit sequence, and a training sample set comprising the audio feature sequences and the new word unit sequences is thereby obtained.
In this embodiment, each training sample includes an audio feature sequence $X = (x_1, x_2, \dots, x_k)$ and its corresponding token sequence $Label = (l_1, l_2, \dots, l_m)$. For the task of recognizing a language with few samples, Label records the text, in that language, corresponding to the current audio. The speech recognition model takes the audio sequence X as input and outputs the predicted token sequence $Output = (o_1, o_2, \dots, o_m)$. During training, the difference between Label and Output is generally measured with a cross entropy loss function, so that Output is kept as consistent with Label as possible. During prediction, the text in that language predicted by the model is obtained from the output of the speech recognition model.
Speech recognition models may be divided into autoregressive models and non-autoregressive models according to how Output is produced. A non-autoregressive model takes the audio sequence X as input and decodes Output in parallel; an autoregressive model takes the audio sequence X as input and decodes Output serially. Parallel decoding is fast, but it easily ignores the contextual relations within the sequence, so the model effect is poorer; serial decoding is slow, but it can model the relations between earlier and later elements, so the model effect is better. Because the language has few samples and training is therefore somewhat difficult, the method adopts the decoding mode of the autoregressive model to ensure the effect of the speech recognition model.
In this embodiment, the predicted word unit sequence output by the speech recognition model (the Output above) is an array, for example [223, 2124, 2323, 45], in which each number represents a word unit. To improve the training effect, during training the whole Output can be obtained in a single step (non-autoregressive), whereas during testing the predicted word units in Output are obtained one at a time, step by step (autoregressive).
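The replacement in step 103 can be sketched as follows. This is only an illustration under stated assumptions: `SpeechSample`, `build_training_set`, the `predict` callable and the token ids are hypothetical names, and choosing the predicted language word unit by majority vote over the samples is an assumption, since the text does not specify how a single unit is selected.

```python
from collections import Counter
from dataclasses import dataclass, replace
from typing import Callable, List, Sequence, Set


@dataclass(frozen=True)
class SpeechSample:
    audio_features: Sequence[Sequence[float]]  # audio feature sequence
    word_units: Sequence[int]                  # initial word unit (token) sequence


def build_training_set(
    samples: Sequence[SpeechSample],
    predict: Callable[[Sequence[Sequence[float]]], Sequence[int]],
    language_units: Set[int],
) -> List[SpeechSample]:
    # Count which language word unit the initial model predicts for each sample.
    votes: Counter = Counter()
    for sample in samples:
        for unit in predict(sample.audio_features):
            if unit in language_units:
                votes[unit] += 1
                break  # only the first language unit of each prediction counts
    if not votes:
        raise ValueError("the model predicted no language word unit")
    predicted_language = votes.most_common(1)[0][0]  # assumption: majority vote

    # Swap every language word unit in the labels for the predicted one.
    return [
        replace(
            s,
            word_units=[predicted_language if u in language_units else u
                        for u in s.word_units],
        )
        for s in samples
    ]
```

The original samples are left unchanged; the function returns new samples whose word unit sequences carry the language word unit the initial model already tends to predict.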
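The step-by-step (autoregressive) decoding used at test time can be sketched as a simple loop. The names `greedy_decode` and `decoder_step`, the end symbol id `EOS = 0`, the start unit and the `max_len` cap are all hypothetical and chosen only for illustration; the toy decoder stands in for a real decoder.

```python
from typing import Callable, List, Sequence

EOS = 0  # hypothetical id of the end symbol


def greedy_decode(
    audio_intermediate: object,
    decoder_step: Callable[[object, Sequence[int]], int],
    start_units: Sequence[int],
    max_len: int = 64,  # safety cap, an assumption of this sketch
) -> List[int]:
    """Feed the growing word unit sequence back into the decoder until EOS."""
    current = list(start_units)
    predicted: List[int] = []
    for _ in range(max_len):
        unit = decoder_step(audio_intermediate, current)
        if unit == EOS:
            break
        predicted.append(unit)
        current.append(unit)  # update the current word unit sequence
    return predicted


def toy_decoder_step(audio_intermediate: object, current: Sequence[int]) -> int:
    """Stand-in decoder that replays a fixed script, one unit per call."""
    script = [223, 2124, 2323, 45, EOS]
    return script[len(current) - 1]


print(greedy_decode(None, toy_decoder_step, start_units=[1]))
```

In training, by contrast, the full label sequence is already known, so the predictions for all positions can be computed in one pass rather than one unit at a time.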
Step 104, training the voice recognition model based on the training sample set to obtain a trained voice recognition model.
In this embodiment, the executing body may select a training sample from the training sample set obtained in step 103, and execute the following training steps from step a to step B, so as to complete iterative training of the speech recognition model. The selection manner and the number of selection of the training samples from the training sample set are not limited in the present application, and the number of iterative training of the speech recognition model is not limited.
Step A, calculating a network loss value of the voice recognition model based on the training sample input to the voice recognition model.
In this embodiment, during each iterative training of the speech recognition model, a training sample is selected from the training sample set, the selected training sample is input into the speech recognition model, and a loss value of the speech recognition model is calculated based on a predicted word unit sequence output by the speech recognition model.
In this embodiment, the loss function of the speech recognition model for calculating the loss value of the speech recognition model may be a cross entropy loss function, which can measure the degree of difference between two different probability distributions in the same random variable, and is expressed as the difference between the true probability distribution and the predicted probability distribution in machine learning. The smaller the value of the cross entropy loss function, the better the predictive effect of the speech recognition model.
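As a minimal sketch of the cross entropy loss described above, the function below averages the negative log-probability that the model assigns to each label word unit. The function name and the representation of the model output as per-position probability lists are assumptions made for illustration.

```python
import math
from typing import Sequence


def cross_entropy(pred_dists: Sequence[Sequence[float]], label: Sequence[int]) -> float:
    """Mean negative log-probability of the label word units.

    pred_dists[t][v] is the predicted probability of word unit v at position t.
    """
    if len(pred_dists) != len(label) or not label:
        raise ValueError("need one predicted distribution per label word unit")
    eps = 1e-12  # avoids log(0) when a label gets zero probability
    total = sum(math.log(max(dist[unit], eps)) for dist, unit in zip(pred_dists, label))
    return -total / len(label)
```

The value is zero when every label word unit receives probability 1 and grows as the predicted distribution moves away from the labels.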
Alternatively, the loss function of the speech recognition model may also be a mean square error function, where the mean square error function is the expectation of the square difference between the predicted word unit sequence of the speech recognition model and the true value (the new word unit sequence in the training sample), and in the iterative training process of the speech recognition model, the loss function of the speech recognition model may be minimized by using a gradient descent algorithm, so as to iteratively optimize the network parameters of the speech recognition model.
The gradient is a vector indicating the direction along which the directional derivative of a loss function at a given point is largest, that is, the direction in which the loss function changes fastest at that point. In deep learning, the main task of a neural network during learning is to find the optimal network parameters (weights and biases), namely the parameters for which the loss function is minimal.
Step B, training the voice recognition model based on the loss value of the voice recognition model to obtain a trained voice recognition model.
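For reference, the standard gradient descent update can be written as below, where the symbols are notation introduced here rather than taken from the text: $\theta_t$ denotes the network parameters at iteration $t$, $\mathcal{L}$ the loss function, and $\eta > 0$ the learning rate.

```latex
\theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} \mathcal{L}(\theta_t)
```

Because the gradient points in the direction of fastest increase of the loss, stepping in the opposite direction decreases the loss for a sufficiently small $\eta$.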
In this embodiment, the trained speech recognition model is a trained speech recognition model obtained by inputting the selected speech sample into the speech recognition model for iterative training, and performing parameter tuning on the speech recognition model.
In this embodiment, whether the speech recognition model meets the training completion condition can be detected through the loss value of the speech recognition model, and after the speech recognition model meets the training completion condition, the trained speech recognition model is obtained.
In this embodiment, the training completion condition includes: the loss value of the speech recognition model is less than a first loss value threshold. The first loss value threshold may be determined based on specific training requirements; for example, the first loss value threshold is 0.01.
Optionally, in this embodiment, in response to the speech recognition model not meeting the training completion condition, the relevant parameters in the speech recognition model are adjusted so that the loss value of the speech recognition model converges, and the training steps a-B are continuously performed based on the adjusted speech recognition model.
In this embodiment, adjusting the relevant parameters of the speech recognition model when it does not meet the training completion condition helps the loss value of the speech recognition model converge.
In this embodiment, when a Whisper model is used to recognize a language with few samples, the token sequence (word unit sequence) received by the Whisper model must include a language token (word unit), from which the model determines the language in which speech recognition is to be performed. However, because the original Whisper model was not trained on data in that language, there is no token for that language. For this reason, the speech recognition model was trained many times in attempts either to replace the token of that language with the token of another language such as Chinese or English, or to assign a new language token to that language, but the model after training did not perform as expected. When the Whisper model was used to test audio in that language, it was found to predict most of that audio as one other language (the predicted word unit characterizing the language); replacing the token of the few-sample language with the token of that predicted language yields a good training effect.
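The iteration of steps A and B until the training completion condition is met can be sketched with a toy example. A one-parameter linear model with a mean square error loss stands in for the speech recognition model here; the function name, learning rate, iteration cap and data are assumptions made for illustration, and only the threshold value 0.01 comes from the text.

```python
from typing import List, Tuple

LOSS_THRESHOLD = 0.01  # example value of the first loss value threshold


def train(
    data: List[Tuple[float, float]],
    lr: float = 0.05,
    max_iters: int = 10_000,
) -> Tuple[float, float, int]:
    """Fit y = w * x by gradient descent; stop once the loss is below the threshold."""
    w = 0.0
    loss = float("inf")
    for step in range(max_iters):
        # Step A: compute the loss value for the current parameters.
        loss = sum((w * x - y) ** 2 for x, y in data) / len(data)
        # Training completion condition: loss below the threshold.
        if loss < LOSS_THRESHOLD:
            return w, loss, step
        # Step B: adjust the parameter along the negative gradient.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w, loss, max_iters


w, loss, steps = train([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
```

The iteration cap keeps the loop finite if the threshold is never reached, which is a design choice of this sketch rather than something the text prescribes.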
In this embodiment, training a speech recognition model based on the loss value of the speech recognition model to obtain a trained speech recognition model includes: and in response to the loss value of the voice recognition model being smaller than the loss threshold value, determining that the voice recognition model meets the training completion condition, and taking the feature recognition sub-network and the classification sub-network in the voice recognition model meeting the training completion condition as the voice recognition model.
The embodiments of the disclosure provide a speech recognition model training method. First, a speech sample set is obtained; the speech sample set comprises at least one speech sample, and each speech sample comprises an audio feature sequence and an initial word unit sequence. Second, an initial speech recognition model is obtained, which characterizes the correspondence between an audio feature sequence and a predicted word unit sequence. Third, the language word unit in each initial word unit sequence in the speech sample set is replaced with a predicted word unit that characterizes a language, to obtain a training sample set, wherein that predicted word unit is taken from the predicted word unit sequence obtained by inputting a speech sample selected from the speech sample set into the speech recognition model. Finally, the speech recognition model is trained based on the training sample set to obtain a trained speech recognition model. Because the language word units in the initial word unit sequences are replaced with the predicted word unit characterizing the language before training, the model predicts from the audio feature sequence in a word unit environment it is already familiar with, which improves both the convergence rate and the training effect of the model.
In some optional implementations of the disclosure, the speech recognition model includes: the system comprises an initial recognition sub-model and a keyword module connected with the initial recognition sub-model; based on the training sample set, training the speech recognition model, the obtaining the trained speech recognition model comprising: training an initial recognition sub-model based on the training sample set to obtain a trained recognition sub-model; training a keyword module based on the training sample set and the trained recognition sub-model to obtain a trained keyword module; and taking the trained recognition sub-model and the trained keyword module as the trained voice recognition model.
As shown in fig. 2, the speech recognition model includes a recognition sub-model and a keyword module. Training samples in the training sample set are input into the recognition sub-model, and the recognition sub-model is trained over multiple iterations to obtain the trained recognition sub-model.
In the method for training the speech recognition model provided by this alternative implementation, the initial recognition sub-model is trained first to obtain a trained recognition sub-model; the keyword module is then trained based on the training sample set and the trained recognition sub-model to obtain a trained keyword module. Training the initial recognition sub-model and the keyword module step by step improves the reliability of the trained speech recognition model.
In some optional implementations of the disclosure, training the keyword module based on the training sample set and the trained recognition sub-model, where obtaining the trained keyword module includes: fixing parameters of the trained recognition sub-model; inputting training samples in the training sample set into the trained recognition sub-model to obtain a first probability distribution output by the trained recognition sub-model and a second probability distribution output by the keyword module; calculating a loss value of an initial word unit in the training sample based on the first probability distribution and the second probability distribution; and obtaining a trained keyword module based on the loss value.
In this alternative implementation, as shown in fig. 2, the recognition sub-model is a Whisper model. To enhance the model's ability to recognize specific keywords (bias words), the keyword module of the present disclosure may use a TCPGen (tree-constrained pointer generator) module, which assists the Whisper model (here, the trained recognition sub-model) in recognizing the keywords. First, a list of the keywords that require focused recognition (the biasing list) is generated. The TCPGen module then builds a keyword prefix tree from the keyword list and, by traversing the tree, generates a probability distribution $P_{ptr}(y_i)$ over all tokens and a generation probability $P_{gen}(y_i)$. Specifically, the TCPGen module first obtains the token sequences corresponding to the keywords by traversing the keyword prefix tree, then obtains the embedding corresponding to each token sequence, and finally generates the probability distribution $P_{ptr}(y_i)$ and the probability $P_{gen}(y_i)$ using a fully connected layer and a softmax function, respectively. Finally, $P_{ptr}(y_i)$ is combined by weighted summation with the token probability distribution $P_{mdl}(y_i)$ generated by the Whisper model (although the parameters of the Whisper model are fixed, the Whisper model can still perform forward prediction), with $P_{gen}(y_i)$ as the weight, to obtain the final token probability distribution $P(y_i)$, as shown in formula (1):
$$P(y_i) = P_{mdl}(y_i)\,\bigl(1 - P_{gen}(y_i)\bigr) + P_{ptr}(y_i)\,P_{gen}(y_i) \tag{1}$$
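Formula (1) can be checked numerically with a small sketch. It assumes a three-token vocabulary and treats the generation probability as a single scalar per decoding step; the function name and the numbers are illustrative only.

```python
def interpolate(p_mdl, p_ptr, p_gen):
    """Formula (1): P(y) = P_mdl(y) * (1 - P_gen) + P_ptr(y) * P_gen."""
    if len(p_mdl) != len(p_ptr):
        raise ValueError("both distributions must cover the same vocabulary")
    return [m * (1.0 - p_gen) + p * p_gen for m, p in zip(p_mdl, p_ptr)]


# Made-up distributions over a three-token vocabulary.
p_mdl = [0.7, 0.2, 0.1]   # distribution from the frozen recognition sub-model
p_ptr = [0.0, 1.0, 0.0]   # pointer distribution: only token 1 continues a keyword
final = interpolate(p_mdl, p_ptr, p_gen=0.5)
```

With a generation probability of 0.5, half of the probability mass moves toward the keyword token, and the result still sums to 1 because it is a convex combination of two distributions.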
In this alternative implementation, the keyword module is trained after training of the recognition sub-model is complete, which ensures the training effect of the keyword module. Specifically, the Whisper model needs to apply the technique of shifting the word unit sequence right by one position during training. However, experiments showed that this technique should not be used when the TCPGen module is trained separately; otherwise the training loss oscillates and the module cannot be trained. This is because the TCPGen module does not predict the token sequence directly but corrects the prediction of the original Whisper model. Therefore, training the TCPGen module after the Whisper model has been trained, without shifting the word unit sequence right by one position, improves the training effect of the TCPGen module.
In the method for training the keyword module provided by this alternative implementation, the parameters of the trained recognition sub-model are first fixed; then, based on the training sample set, a first probability distribution output by the trained recognition sub-model and a second probability distribution output by the keyword module are obtained; a loss value for the initial word units in the training sample is calculated based on the first probability distribution and the second probability distribution, and the trained keyword module is obtained based on the loss value, which improves the training reliability of the keyword module.
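The keyword prefix tree used by the TCPGen module can be sketched as a nested dictionary. This is an illustrative data structure only: the helper names, the `END` marker, and the sample keywords are assumptions, and the sketch covers only the tree lookup, not the embedding or softmax layers.

```python
END = "<end>"  # hypothetical marker for "a keyword ends here"


def build_prefix_tree(keyword_token_seqs):
    """Build a nested-dict prefix tree from tokenized keywords."""
    root = {}
    for seq in keyword_token_seqs:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
        node[END] = {}
    return root


def valid_next_tokens(tree, prefix):
    """Return the tokens that can extend `prefix` toward some keyword."""
    node = tree
    for tok in prefix:
        if tok not in node:
            return set()
        node = node[tok]
    return {tok for tok in node if tok != END}


# Made-up biasing list, already split into tokens.
biasing_list = [["Bai", "du"], ["Bai", "jia"], ["Whis", "per"]]
tree = build_prefix_tree(biasing_list)
after_bai = valid_next_tokens(tree, ["Bai"])
```

At each decoding step, the set returned by `valid_next_tokens` is the subset of the vocabulary over which a pointer distribution could place probability mass.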
In some optional implementations of the disclosure, training the initial recognition sub-model based on the training sample set to obtain the trained recognition sub-model includes: selecting a training sample from the training sample set to obtain a selected sample; shifting the new word units in the new word unit sequence of the selected sample right by one position to obtain a training word unit sequence; and training the initial recognition sub-model based on the audio feature sequence and the training word unit sequence in the selected sample to obtain a trained recognition sub-model.
In this alternative implementation, the new word unit sequence is a word unit sequence obtained by replacing a word unit of a language in the initial word unit sequence in the speech sample set with a predicted word unit of a characterization language, and the new word unit is a word unit in the new word unit sequence.
In this alternative implementation, the recognition sub-model may be a Whisper model. Because Whisper is an autoregressive model, the technique of shifting the word unit sequence right by one position (shift_tokens_right) is used when training the Whisper model, so that the model learns to predict the next token of the token sequence. To achieve this, the token sequence input to the model is shifted one unit to the right. However, experiments showed that this technique should not be used when the TCPGen module is trained separately; otherwise the training loss oscillates and the module cannot be trained. This is because the TCPGen module does not predict the token sequence directly but corrects the predictions of the original Whisper model.
In this alternative implementation, the new word units in the new word unit sequence of the selected sample are shifted right by one position so that the initial recognition sub-model can effectively predict the next token.
In the method for training the initial recognition sub-model provided by this alternative implementation, a training sample is selected from the training sample set, the new word units in the new word unit sequence of the selected sample are shifted right by one position to obtain the training word unit sequence, and the initial recognition sub-model is trained based on the audio feature sequence and the training word unit sequence in the selected sample, which improves the reliability of the trained recognition sub-model and the training effect of the speech recognition model.
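The right shift by one position can be illustrated on a plain Python list. This is a simplified stand-in, not the library routine of the same name: the function below takes a hypothetical start token and a list of token strings, and the example tokens are made up.

```python
def shift_tokens_right(tokens, start_token):
    """Prepend the start token and drop the last token (shift right by one)."""
    return [start_token] + list(tokens[:-1])


# Target (label) sequence the model should learn to emit; "<|yy|>" is a
# placeholder for the language token.
labels = ["<|yy|>", "<|transcribe|>", "he", "llo", "<|endoftext|>"]
decoder_input = shift_tokens_right(labels, "<|startoftranscript|>")

# At position i the decoder sees decoder_input[: i + 1] and must predict labels[i].
pairs = list(zip(decoder_input, labels))
```

Each pair couples the token fed to the decoder with the token it should predict next, which is the alignment that next-token training relies on.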
In some optional implementations of the disclosure, the acquiring a set of speech samples includes: acquiring an initial data set comprising at least one initial voice; preprocessing an initial data set to obtain a processed data set; performing data enhancement on the processed data set to obtain a voice data set; based on the speech data set, a speech sample set is obtained.
In this alternative implementation, the initial speech is speech data of a language, where the initial speech may be unprocessed speech data, and the speech sample set is obtained by processing the initial speech in the initial data set. For example, where the initial speech is speech in a rare language, the resulting speech sample set is a sample set of the rare language from which a speech recognition model may be trained.
In this optional implementation, data preprocessing is the process of preparing the raw data and adapting it to the machine learning model, so that the processed data set can be used to train the machine learning model. Preprocessing the initial data set to obtain the processed data set includes removing unrecognizable information, such as punctuation, from the initial data.
In this alternative implementation, the data enhancement is performed on the processed data set based on the problem that the sample information amount of the language corresponding to the initial data set is small, so as to increase the number of voice data in the voice data set.
In this optional implementation, obtaining the voice sample set based on the voice data set includes: extracting voice features from each piece of voice data in the voice data set; fixing the length of each voice feature to obtain the audio feature sequence of that voice data; and determining the initial word units of each piece of voice data, thereby obtaining a voice sample set including at least one voice sample. The voice feature may be a mel feature or an Fbank (filter bank) feature, and may be extracted by a mature method, which is not described in detail here.
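Fixing the length of a feature sequence can be sketched as padding or trimming a list of frames. The function name, the zero padding value, and the tiny frame sizes are assumptions for illustration; the text does not specify how short inputs are padded.

```python
def pad_or_trim(frames, target_len, n_mels, pad_value=0.0):
    """Return exactly target_len frames, trimming or zero-padding as needed."""
    trimmed = [list(frame) for frame in frames[:target_len]]
    missing = target_len - len(trimmed)  # negative or zero means no padding
    padding = [[pad_value] * n_mels for _ in range(missing)]
    return trimmed + padding


# Made-up features with 2 mel bins per frame.
short_clip = [[1.0, 2.0]]
long_clip = [[float(i), float(i)] for i in range(5)]

fixed_short = pad_or_trim(short_clip, target_len=3, n_mels=2)
fixed_long = pad_or_trim(long_clip, target_len=3, n_mels=2)
```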
The method for acquiring the voice sample set provided by the alternative implementation mode comprises the steps of firstly preprocessing an initial data set to obtain a processed data set; performing data enhancement on the processed data set to obtain a voice data set; based on the voice data set, a voice sample set is obtained, and reliability of the voice sample set is improved.
In some optional implementations of the disclosure, the preprocessing the initial data set to obtain a processed data set includes at least one of: sampling all initial data in the initial data set at the same sampling rate; duplicate and unrecognizable initial data in the initial data set is deleted.
Optionally, preprocessing the initial data set to obtain a processed data set may further include: and processing all the initial data in the initial data set to enable all the initial data to be located in the same audio channel.
Optionally, preprocessing the initial data set to obtain a processed data set may further include: generating new initial data using a speech synthesis technique and adding the new initial data to the processed data set. Because few training samples are available for a language with few samples, synthesized audio in this language is added as training samples to increase the scale of the data set. Experiments showed that the amount of synthesized data should not be too large, otherwise it may harm the training effect, most likely because the quality of synthesized data is difficult to guarantee. For this reason, the amount of new initial data may be kept at about 30% of the total voice sample set.
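The 30% guideline can be turned into a cap on the number of synthesized samples. The sketch below assumes the share is measured by sample count against the total (real plus synthesized); the text does not say whether the share is counted in samples or in audio duration, and the function name is hypothetical.

```python
def max_synthetic(n_real, share_pct=30):
    """Largest s such that s / (n_real + s) <= share_pct / 100."""
    if not 0 <= share_pct < 100:
        raise ValueError("share_pct must be in [0, 100)")
    # s / (n_real + s) <= p  is equivalent to  s <= n_real * p / (1 - p);
    # integer arithmetic avoids floating-point rounding.
    return n_real * share_pct // (100 - share_pct)


cap_for_700 = max_synthetic(700)
cap_for_1000 = max_synthetic(1000)
```

For 700 real samples the cap is 300 synthesized samples, giving exactly 300 out of 1000 in the combined set.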
The alternative implementation mode provides a method for preprocessing an initial data set, which comprises the steps of sampling the initial data set at the same sampling rate, deleting repeated and unrecognizable initial data in the initial data set, and therefore improving the diversity of voice preprocessing.
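The preprocessing steps (a common sampling rate, a single audio channel, and removal of duplicates) can be sketched in plain Python. This is a toy illustration under stated assumptions: audio is a list of float samples, the resampler is naive linear interpolation without an anti-aliasing filter, and all function names are hypothetical. A production pipeline would use a dedicated audio library.

```python
def to_mono(channels):
    """Average equal-length channels sample by sample."""
    return [sum(samples) / len(samples) for samples in zip(*channels)]


def resample_linear(samples, src_rate, dst_rate):
    """Naive linear-interpolation resampler (no anti-aliasing filter)."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


def drop_duplicates(clips):
    """Keep the first occurrence of each clip, preserving order."""
    seen, unique = set(), []
    for clip in clips:
        key = tuple(clip)
        if key not in seen:
            seen.add(key)
            unique.append(clip)
    return unique


mono = to_mono([[1.0, 3.0], [3.0, 5.0]])
halved = resample_linear([0.0, 1.0, 2.0, 3.0], src_rate=4, dst_rate=2)
unique = drop_duplicates([[1, 2], [1, 2], [3]])
```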
In some optional implementations of the disclosure, performing data enhancement on the processed data set to obtain the voice data set includes at least one of: modifying the speech rate of each piece of processed data in the processed data set to obtain a speech-rate data set, and adding the speech-rate data set to the processed data set; adding reverberation to each piece of processed data in the processed data set to obtain a reverberation data set, and adding the reverberation data set to the processed data set; adding noise to each piece of processed data in the processed data set to obtain a noise data set, and adding the noise data set to the processed data set; and modifying the audio spectrogram of each piece of processed data in the processed data set to obtain a spectrum data set, and adding the spectrum data set to the processed data set.
In this alternative implementation, modifying the speech rate of the processed data (audio) in the processed data set increases the size of the training set and makes the data more diverse; randomly adding reverberation to processed data in the processed data set simulates a variety of acoustic scenes; randomly adding noise to processed data in the processed data set increases the complexity of the data; and the SpecAugment technique (a data augmentation technique for speech recognition) can be used to modify the audio spectrograms of the processed data in the processed data set, which improves the quality of the audio features.
The method for performing data enhancement on the processed data set provided by this alternative implementation includes at least one of: modifying the speech rate of the processed data, adding reverberation to the processed data, adding noise to the processed data, and modifying the audio spectrograms of the processed data, which improves the diversity of the voice data set.
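Three of the four enhancement operations can be sketched on toy data. The sketch assumes audio is a list of float samples and a spectrogram is a list of frames; the speed change is nearest-neighbour index resampling, the noise is Gaussian, and the mask is a single time mask in the style of SpecAugment. All function names and parameters are hypothetical, and reverberation (convolution with a room impulse response) is omitted for brevity.

```python
import random


def change_speed(samples, factor):
    """Nearest-neighbour speed change; factor > 1 shortens (speeds up) the clip."""
    n_out = int(len(samples) / factor)
    last = len(samples) - 1
    return [samples[min(int(i * factor), last)] for i in range(n_out)]


def add_noise(samples, scale, rng):
    """Add zero-mean Gaussian noise with standard deviation `scale`."""
    return [s + rng.gauss(0.0, scale) for s in samples]


def mask_time(spectrogram, start, width, fill=0.0):
    """Blank `width` consecutive frames starting at frame index `start`."""
    return [
        [fill] * len(frame) if start <= t < start + width else list(frame)
        for t, frame in enumerate(spectrogram)
    ]


clip = [0, 1, 2, 3, 4, 5, 6, 7]
fast = change_speed(clip, 2.0)
noisy = add_noise(clip, 0.01, random.Random(0))
masked = mask_time([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]], start=1, width=1)
```

Each function returns a new object, so the augmented copies can be added to the data set alongside the originals rather than replacing them.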
Further, based on the voice recognition model training method provided by the embodiment, the disclosure also provides an embodiment of a voice recognition model testing method, and the voice recognition model testing method disclosed by the disclosure combines the artificial intelligence fields of voice recognition, deep learning and the like.
Referring to fig. 3, a flow 300 of one embodiment of a speech recognition model testing method according to the present disclosure is shown, the speech recognition model testing method provided by the present embodiment includes the steps of:
Step 301, a test set and a trained speech recognition model are obtained.
In this embodiment, the test set includes at least one test sample, the test sample including: an audio feature sequence and a test word unit sequence.
In this embodiment, the test sample has the same data structure as the training sample, that is, the test sample includes an audio feature sequence and a new word unit sequence obtained by replacing the language word unit in the initial word unit sequence with the predicted word unit characterizing the language, the new word unit sequence including at least one new word unit. However, the test sample is not used as a training sample for training the speech recognition model.
In this embodiment, the speech recognition model may be a trained speech recognition model obtained using the method described in the embodiment of fig. 1; for the specific training process, refer to the description of the embodiment of fig. 1, which is not repeated here.
In this embodiment, the speech recognition model includes: the system comprises an encoder, a decoder and a keyword module, wherein the encoder and the decoder form a trained recognition sub-model.
Step 302, a test sample is selected from the test set.
In this embodiment, the execution body may select a test sample from the test set obtained in step 301.
In this embodiment, the selection manner and the selection number of selecting the test samples from the test set are not limited in the present application, and the number of iterative training of the speech recognition model is not limited.
Step 303, inputting the audio feature sequence in the test sample into an encoder to obtain audio intermediate features.
In this embodiment, as shown in fig. 4, the audio feature sequence may be the audio mel features extracted from the audio. The audio mel features are input to the encoder of the trained recognition sub-model to obtain the audio intermediate features generated by the encoder. When the decoder predicts for the first time, a generated starting token sequence (not shown in fig. 4) is used as the current word unit sequence. As testing proceeds, the decoder predicts multiple times, and after each prediction the current word unit sequence is replaced with the current word unit sequence updated based on the predicted word unit.
Step 304, inputting the audio intermediate feature and the current word unit sequence into a decoder to obtain a predicted word unit output by the speech recognition model.
In this embodiment, when the decoder predicts for the first time, the audio intermediate features and the starting token sequence are sent together to the decoder for decoding, and the predicted token output by the decoder is obtained. It should be noted that one of the predicted word units output by the decoder may be the predicted word unit characterizing the language.
And step 305, in response to the predicted word unit being a non-ending symbol, updating the current word unit sequence based on the predicted word unit, and continuously inputting the audio intermediate feature and the current word unit sequence into the decoder until the predicted word unit output by the speech recognition model is the ending symbol, thereby obtaining all the predicted word units corresponding to the test sample.
As shown in fig. 4, when the predicted word unit is not the end symbol, that is, not EOT, the current word unit sequence is updated based on the predicted word unit output by the decoder. Specifically, the current word unit sequence may be updated by appending the predicted token (predicted word unit) to the end of the current token sequence. When the decoder predicts for the first time, the predicted word unit output by the decoder is appended to the starting token sequence; the updated sequence is then used as the current word unit sequence for the next prediction of the decoder. As shown in fig. 4, when the predicted word unit is the end symbol, prediction for the audio feature sequence ends.
In this embodiment, after the decoder has performed multiple predictions, if the predicted word unit output by the decoder is EOT, the decoder stops predicting, and the predicted word units output by the decoder from its first output up to, but not including, the EOT are combined to obtain all the predicted word units.
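The decoding loop of steps 304 and 305 can be sketched with a stand-in decoder. The sketch assumes the decoder is a function that returns one predicted token per call; `toy_decoder`, its scripted outputs, and the `max_steps` safety cap are hypothetical and are not part of the described method.

```python
def greedy_decode(decoder_step, audio_features, start_tokens, eot, max_steps=64):
    """Feed predictions back into the decoder until it emits the end symbol."""
    current = list(start_tokens)   # current word unit sequence
    predicted = []                 # all predicted word units, EOT excluded
    for _ in range(max_steps):
        token = decoder_step(audio_features, current)
        if token == eot:
            break
        predicted.append(token)
        current.append(token)      # append the prediction to the sequence end
    return predicted


# Scripted stand-in for a real decoder: output depends only on sequence length.
SCRIPT = ["<|yy|>", "he", "llo", "<|endoftext|>"]


def toy_decoder(audio_features, current):
    return SCRIPT[len(current) - 1]


result = greedy_decode(
    toy_decoder,
    audio_features=None,
    start_tokens=["<|startoftranscript|>"],
    eot="<|endoftext|>",
)
```

In this toy run the first predicted word unit is the language token, matching the remark that one of the predicted word units may characterize the language.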
Step 306, detecting whether the speech recognition model is qualified or not based on all the predicted word units corresponding to the test sample and the test word unit sequences in the test sample.
In this embodiment, all the predicted word units may be arranged together according to the prediction order of the decoder to obtain a predicted word unit sequence, and the predicted word unit sequence is compared with the test word unit sequence in the test sample; in response to the predicted word unit sequence being the same as the test word unit sequence, determining that the speech recognition model is tested to be qualified, and directly applying the speech recognition model to the prediction of the text of the language; and in response to the fact that the predicted word unit sequence is different from the test word unit sequence, determining that the speech recognition model is not qualified in test, and cannot be directly applied to the prediction of the text of the language.
In the speech recognition model testing method provided by the embodiment of the present disclosure, a test set and a speech recognition model are obtained, a test sample is selected from the test set, the audio feature sequence in the test sample is input into the encoder, and, as long as the speech recognition model has not output the end symbol, the current word unit sequence is replaced with the updated word unit sequence. The speech recognition model can therefore keep predicting on the audio intermediate features of the test sample to obtain the predicted word unit sequence, and whether the speech recognition model passes the test is determined from the predicted word unit sequence, which improves the reliability of the speech recognition model test.
In some optional implementations of the disclosure, detecting whether the speech recognition model is qualified based on all the predicted word units corresponding to the test sample and the sequence of the test word units in the test sample includes: sequencing all the predicted word units corresponding to the test sample to obtain a predicted word unit sequence; calculating the word error rate of the test sample based on the predicted word unit sequence and the test word unit sequence in the test sample; and determining that the speech recognition model is qualified in response to the word error rate being less than the error threshold.
In the alternative implementation mode, the predicted word unit sequence is compared with the test word unit sequence, the number of word units in the predicted word unit sequence, which is different from the test word unit sequence, is determined, and the ratio of the number of word units to the total number of the test word units in the test word unit sequence is calculated to obtain the word error rate.
In this alternative implementation, the error threshold may be determined based on the test requirement, for example, the error threshold is 20%, that is, less than 20% of the predicted word units in the predicted word unit sequence are different from the test word units in the test word unit sequence, so as to determine that the speech recognition model is qualified.
According to the method for detecting whether the voice recognition model is qualified or not, the word error rate of the test sample is calculated through the predicted word unit sequence and the test word unit sequence corresponding to the test sample, a reliable implementation means is provided for testing the voice recognition model, and the reliability of testing the voice recognition model is improved.
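The word error rate check can be sketched as follows. The text describes counting the word units that differ and dividing by the number of test word units; the sketch counts the differences with the Levenshtein edit distance, which is the common definition of word error rate but is an assumption here, since the text does not name a counting method. The function names and the sample sequences are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def word_error_rate(ref, hyp):
    """Number of differing word units divided by the reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def is_qualified(ref, hyp, error_threshold=0.20):
    """The model passes when the word error rate is below the threshold."""
    return word_error_rate(ref, hyp) < error_threshold


reference = list("abcdefghij")            # ten test word units
one_error = list("abcdefghiX")            # one substitution
two_errors = list("abcdefghXY")           # two substitutions

wer_one = word_error_rate(reference, one_error)
```

With the 20% threshold from the example in the text, one error in ten word units passes, while two errors give exactly 20% and do not pass because the comparison is strict.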
Further, based on the voice recognition model training method provided by the embodiment, the disclosure also provides an embodiment of a voice recognition method, and the voice recognition method of the disclosure combines the artificial intelligence fields of voice recognition, deep learning and the like.
Referring to fig. 5, a flow 500 is shown according to one embodiment of the disclosed speech recognition method, which includes the steps of:
Step 501, a voice to be recognized is obtained.
In this embodiment, the execution subject of the voice recognition method may acquire the voice to be recognized in various ways. For example, the execution subject may acquire the voice to be recognized stored therein from the database server through a wired connection or a wireless connection. For another example, the executing body may also receive the voice to be recognized collected by the terminal or other devices in real time.
In this embodiment, the voice to be recognized is voice information that needs to be converted from voice to text, and the voice to be recognized may be a voice of a rare language, for example, a voice of a language with a few samples.
Step 502, processing the voice to be recognized to obtain audio feature data.
In this embodiment, the execution subject may process the voice to be recognized obtained in step 501 to obtain the audio feature data.
In this embodiment, the audio feature data is data obtained after feature (e.g., mel feature, or Fbank feature) recognition is performed on the voice to be recognized. Specifically, the step 502 includes fixing the length of the voice to be recognized (for example, 30 seconds), and extracting mel or fbank features of the voice to be recognized under the length to obtain audio feature data with a fixed length.
In this embodiment, the audio feature data is the same data as the data structure of the audio feature sequence.
Step 503, inputting the audio feature data into a speech recognition model to obtain a predicted word unit sequence of the speech to be recognized.
In this embodiment, the speech recognition model may be a trained speech recognition model obtained by training using the method described in the embodiment of fig. 1, and the specific training process may be described in association with the embodiment of fig. 1, which is not described herein.
Step 504, obtaining text data of the voice to be recognized based on the predicted word unit sequence.
In this embodiment, converting a word unit sequence back into text data is a mature technique, so converting the predicted word unit sequence into the text data of the voice to be recognized is likewise a mature technique and is not described in detail here.
The voice recognition method provided by the disclosure can be applied to social, learning, working and other scenes of language with few samples. Firstly, the method can be applied to areas with the language of the category as the main language, and helps people to interact with the intelligent device more conveniently. Second, the model can be used in the educational field to assist students in learning the pronunciation and hearing of the language of the category. Furthermore, in the business field, it can be used for customer service and market research to meet the needs of users of this kind of language.
According to the voice recognition method provided by the embodiment of the disclosure, the voice to be recognized is obtained, the voice to be recognized is processed to obtain the audio feature data, the audio feature data is input into the voice recognition model generated by the voice recognition model training method to obtain the predicted word unit sequence of the audio feature data, and the text data of the voice to be recognized is obtained through the predicted word unit sequence. Therefore, a voice recognition result is generated by adopting a voice recognition model, and the reliability and the accuracy of voice recognition are improved.
With further reference to fig. 6, as an implementation of the method illustrated in the foregoing figures, the present disclosure provides an embodiment of a speech recognition model training apparatus, which corresponds to the method embodiment illustrated in fig. 1, and which is particularly applicable in a variety of electronic devices.
As shown in fig. 6, the speech recognition model training apparatus 600 provided in this embodiment includes: sample acquisition unit 601, model acquisition unit 602, replacement unit 603, training unit 604. Wherein, the sample acquiring unit 601 may be configured to acquire a voice sample set, where the voice sample set includes at least one voice sample, and the voice sample includes: an audio feature sequence and an initial word unit sequence. The model obtaining unit 602 may be configured to obtain an initial speech recognition model, where the speech recognition model is used to characterize a correspondence between the audio feature sequence and the predicted word unit sequence. The replacing unit 603 may be configured to replace the language word units in the initial word unit sequence in the voice sample set with the predicted word units of the language to obtain the training sample set, where the predicted word units are predicted word units in the predicted word unit sequence obtained by inputting the voice sample selected from the voice sample set into the voice recognition model. The training unit 604 may be configured to train the speech recognition model based on the training sample set, resulting in a trained speech recognition model.
In the present embodiment, in the speech recognition model training apparatus 600: the specific processing of the sample acquiring unit 601, the model acquiring unit 602, the replacing unit 603, and the training unit 604 and the technical effects thereof may refer to the relevant descriptions of the steps 101, 102, 103, and 104 in the corresponding embodiment of fig. 1, and are not repeated herein.
In some optional implementations of this embodiment, the speech recognition model includes: the system comprises an initial recognition sub-model and a keyword module connected with the initial recognition sub-model; the training unit 604 is configured to: training an initial recognition sub-model based on the training sample set to obtain a trained recognition sub-model; training a keyword module based on the training sample set and the trained recognition sub-model to obtain a trained keyword module; and taking the trained recognition sub-model and the trained keyword module as the trained voice recognition model.
In some optional implementations of this embodiment, the training unit 604 is further configured to: fixing parameters of the trained recognition sub-model; inputting training samples in the training sample set into the trained recognition sub-model to obtain a first probability distribution output by the trained recognition sub-model and a second probability distribution output by the keyword module; calculating a loss value of an initial word unit in the training sample based on the first probability distribution and the second probability distribution; and obtaining a trained keyword module based on the loss value.
In some optional implementations of this embodiment, the training unit 604 is further configured to: select a training sample from the training sample set to obtain a selected sample; shift the new word units in the new word unit sequence of the selected sample right by one position to obtain a training word unit sequence; and train the initial recognition sub-model based on the audio feature sequence and the training word unit sequence in the selected sample to obtain a trained recognition sub-model.
In some optional implementations of the present embodiment, the sample acquiring unit 601 is configured to: acquiring an initial data set comprising at least one initial voice; preprocessing an initial data set to obtain a processed data set; performing data enhancement on the processed data set to obtain a voice data set; based on the speech data set, a speech sample set is obtained.
In some optional implementations of the present embodiment, the sample acquisition unit 601 is further configured to implement at least one of: sampling all initial data in the initial data set at the same sampling rate; duplicate and unrecognizable initial data in the initial data set is deleted.
In some optional implementations of the present embodiment, the sample acquisition unit 601 is further configured to implement at least one of: modifying the speech rate of each piece of processed data in the processed data set to obtain a speech-rate data set, and adding the speech-rate data set to the processed data set; adding reverberation to each piece of processed data in the processed data set to obtain a reverberation data set, and adding the reverberation data set to the processed data set; adding noise to each piece of processed data in the processed data set to obtain a noise data set, and adding the noise data set to the processed data set; and modifying the audio spectrogram of each piece of processed data in the processed data set to obtain a spectrum data set, and adding the spectrum data set to the processed data set.
An embodiment of the present disclosure provides a speech recognition model training device. First, the sample acquisition unit 601 acquires a voice sample set, the voice sample set including at least one voice sample, and the voice sample including an audio feature sequence and an initial word unit sequence. Second, the model acquisition unit 602 acquires an initial speech recognition model, the speech recognition model being used to characterize the correspondence between the audio feature sequence and the predicted word unit sequence. Third, the replacing unit 603 replaces the language word units in the initial word unit sequences in the voice sample set with predicted word units characterizing the language to obtain a training sample set, where the predicted word units are predicted word units in the predicted word unit sequence obtained by inputting voice samples selected from the voice sample set into the speech recognition model. Finally, the training unit 604 trains the speech recognition model based on the training sample set to obtain a trained speech recognition model. Because the language word units in the initial word unit sequences are replaced with the predicted word units characterizing the language before the speech recognition model is trained, the model can predict the audio feature sequence in a familiar word unit environment, which improves the convergence rate and the training effect of the model.
With further reference to fig. 7, as an implementation of the method shown in the foregoing figures, the present disclosure provides an embodiment of a speech recognition model testing apparatus, which corresponds to the method embodiment shown in fig. 3, and which is particularly applicable to various electronic devices.
As shown in fig. 7, the speech recognition model testing apparatus 700 provided in this embodiment includes: an information acquisition unit 701, a selection unit 702, an input unit 703, an obtaining unit 704, an updating unit 705, and a test unit 706. The information acquisition unit 701 may be configured to obtain a test set and a trained speech recognition model, where the test set includes at least one test sample, and the test sample includes: an audio feature sequence and a test word unit sequence; the trained speech recognition model is obtained by training with the above speech recognition model training device, and the speech recognition model includes: an encoder, a decoder, and a keyword module. The selection unit 702 may be configured to select a test sample from the test set. The input unit 703 may be configured to input the audio feature sequence in the test sample into the encoder to obtain audio intermediate features. The obtaining unit 704 may be configured to input the audio intermediate features and the current word unit sequence into the decoder to obtain a predicted word unit output by the speech recognition model. The updating unit 705 may be configured to, in response to the predicted word unit not being an ending symbol, update the current word unit sequence based on the predicted word unit and continue to input the audio intermediate features and the current word unit sequence into the decoder until the predicted word unit output by the speech recognition model is an ending symbol, thereby obtaining all the predicted word units corresponding to the test sample. The test unit 706 may be configured to detect whether the speech recognition model is qualified based on all the predicted word units corresponding to the test sample and the test word unit sequence in the test sample.
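The encode-once, decode-until-ending-symbol loop described above can be sketched as below. The `encoder` and `decoder` callables, the `<eos>` symbol, and the `max_steps` safety limit are assumptions for illustration; the disclosure does not name them.

```python
END_SYMBOL = "<eos>"  # hypothetical ending symbol

def decode_test_sample(audio_features, encoder, decoder, start_units, max_steps=200):
    """Run the encoder once, then call the decoder repeatedly until it emits END_SYMBOL.

    encoder(audio_features) -> audio intermediate features
    decoder(intermediate, current_units) -> next predicted word unit
    """
    intermediate = encoder(audio_features)
    current_units = list(start_units)
    predicted_units = []
    for _ in range(max_steps):  # guard against a decoder that never stops
        unit = decoder(intermediate, current_units)
        if unit == END_SYMBOL:
            break
        predicted_units.append(unit)
        current_units.append(unit)  # update the current word unit sequence
    return predicted_units

# Toy stand-ins so the loop can be exercised without a real model.
def toy_encoder(features):
    return features

def toy_decoder(intermediate, current_units):
    script = ["hel", "lo"]
    step = len(current_units) - 1  # one start unit precedes the predictions
    return script[step] if step < len(script) else END_SYMBOL

units = decode_test_sample([0.1, 0.2], toy_encoder, toy_decoder, ["<sos>"])
```

Because the audio intermediate features do not change between steps, the encoder runs only once per test sample and only the decoder is called inside the loop.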
In the present embodiment, in the speech recognition model testing apparatus 700: for the specific processing and technical effects of the information acquisition unit 701, the selection unit 702, the input unit 703, the obtaining unit 704, the updating unit 705, and the test unit 706, reference may be made to the descriptions of step 301, step 302, step 303, step 304, step 305, and step 306 in the embodiment corresponding to fig. 3, which are not repeated herein.
In some optional implementations of the present embodiment, the test unit 706 is further configured to: sequencing all the predicted word units corresponding to the test sample to obtain a predicted word unit sequence; calculating the word error rate of the test sample based on the predicted word unit sequence and the test word unit sequence in the test sample; and determining that the speech recognition model is qualified in response to the word error rate being less than the error threshold.
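The word error rate check can be sketched with a standard Levenshtein edit distance over word units. The function names and the example threshold values are assumptions; the disclosure only states that the model is qualified when the word error rate is less than the error threshold.

```python
def edit_distance(reference, hypothesis):
    """Minimum number of substitutions, insertions, and deletions between two sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_unit in enumerate(reference, start=1):
        current = [i]
        for j, hyp_unit in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_unit != hyp_unit),  # substitution or match
            ))
        previous = current
    return previous[-1]

def word_error_rate(test_units, predicted_units):
    """Edit distance normalised by the length of the test word unit sequence."""
    if not test_units:
        raise ValueError("test word unit sequence must not be empty")
    return edit_distance(test_units, predicted_units) / len(test_units)

def is_qualified(test_units, predicted_units, error_threshold):
    """The model passes on this sample when its WER is below the threshold."""
    return word_error_rate(test_units, predicted_units) < error_threshold
```

With a four-unit reference and one substituted unit the word error rate is 0.25, so the sample passes at a threshold of 0.3 and fails at 0.25, since the comparison is strict.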
With further reference to fig. 8, as an implementation of the method shown in the foregoing figures, the present disclosure provides an embodiment of a speech recognition apparatus, which corresponds to the method embodiment shown in fig. 5, and which is particularly applicable in various electronic devices.
As shown in fig. 8, the voice recognition apparatus 800 provided in the present embodiment includes: a voice acquisition unit 801, a processing unit 802, a recognition unit 803, and a conversion unit 804. The voice acquiring unit 801 may be configured to acquire a voice to be recognized. The processing unit 802 may be configured to process the speech to be recognized to obtain audio feature data. The above-mentioned recognition unit 803 may be configured to input the audio feature data into a speech recognition model, to obtain a predicted word unit sequence of the speech to be recognized, where the speech recognition model is a trained speech recognition model obtained by the speech recognition model training device of the above-mentioned embodiment. The conversion unit 804 may be configured to obtain text data of the speech to be recognized based on the predicted word unit sequence.
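The four units of the recognition apparatus form a simple pipeline, sketched below. The callables `extract_features` and `model`, the set of special word units, and the joining rule are all hypothetical placeholders; how word units map to text depends on the tokenizer, which the disclosure does not fix.

```python
def units_to_text(predicted_units, special_units, separator=""):
    """Drop control units (start, end, language tags) and join the rest into text."""
    return separator.join(u for u in predicted_units if u not in special_units)

def recognize(speech, extract_features, model, special_units, separator=""):
    """Acquire -> process -> recognise -> convert, mirroring units 801 to 804."""
    audio_features = extract_features(speech)   # processing unit
    predicted_units = model(audio_features)     # recognition unit
    return units_to_text(predicted_units, special_units, separator)  # conversion unit

# Stub components, only to show the data flow.
text = recognize(
    speech=[0.0, 0.1, -0.1],
    extract_features=lambda samples: [abs(s) for s in samples],
    model=lambda feats: ["<sos>", "<lang:en>", "hello", "world", "<eos>"],
    special_units={"<sos>", "<eos>", "<lang:en>"},
    separator=" ",
)
```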
In the present embodiment, in the voice recognition apparatus 800: for the specific processing and technical effects of the voice acquisition unit 801, the processing unit 802, the recognition unit 803, and the conversion unit 804, reference may be made to the descriptions of step 501, step 502, step 503, and step 504 in the embodiment corresponding to fig. 5, which are not repeated herein.
In the technical solutions of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of users' personal information all comply with relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 9 shows a schematic block diagram of an example electronic device 900 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the device 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. The RAM 903 can also store various programs and data required for the operation of the device 900. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
Various components in device 900 are connected to I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, an optical disk, or the like; and a communication unit 909 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the respective methods and processes described above, such as a speech recognition model training method or a speech recognition model testing method or a speech recognition method. For example, in some embodiments, the speech recognition model training method or the speech recognition model testing method or the speech recognition method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the above-described speech recognition model training method or speech recognition model testing method or speech recognition method may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform a speech recognition model training method or a speech recognition model testing method or a speech recognition method in any other suitable way (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable speech recognition model training device, speech recognition model testing device, or speech recognition device, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (23)

1. A method of training a speech recognition model, the method comprising:
Obtaining a set of speech samples, the set of speech samples comprising at least one speech sample, the speech sample comprising: an audio feature sequence and an initial word unit sequence;
acquiring an initial voice recognition model, wherein the voice recognition model is used for representing the corresponding relation between an audio characteristic sequence and a predicted word unit sequence;
replacing the language word units in the initial word unit sequences in the speech sample set with predicted word units characterizing the language, to obtain a training sample set, wherein the predicted word units are predicted word units in the predicted word unit sequence obtained by inputting a speech sample selected from the speech sample set into the speech recognition model;
And training the voice recognition model based on the training sample set to obtain a trained voice recognition model.
2. The method of claim 1, wherein the speech recognition model comprises: an initial recognition sub-model and a keyword module connected with the initial recognition sub-model; and the training the speech recognition model based on the training sample set to obtain a trained speech recognition model comprises:
training the initial recognition sub-model based on the training sample set to obtain a trained recognition sub-model;
Training the keyword module based on the training sample set and the trained recognition sub-model to obtain a trained keyword module;
and taking the trained recognition sub-model and the trained keyword module as a trained voice recognition model.
3. The method of claim 2, wherein the training the keyword module based on the training sample set and the trained recognition sub-model to obtain a trained keyword module comprises:
Fixing parameters of the trained recognition sub-model;
Inputting training samples in a training sample set into the trained recognition sub-model to obtain a first probability distribution output by the trained recognition sub-model and a second probability distribution output by the keyword module;
calculating a loss value of an initial word unit in a training sample based on the first probability distribution and the second probability distribution;
And obtaining a trained keyword module based on the loss value.
4. The method of claim 2, wherein the training the initial recognition sub-model based on the training sample set, resulting in a trained recognition sub-model comprises:
selecting a training sample from the training sample set to obtain a selected sample;
shifting the new word units in the new word unit sequence in the selected sample to the right by one position to obtain a training word unit sequence;
and training the initial recognition sub-model based on the audio feature sequence and the training word unit sequence in the selected sample to obtain a trained recognition sub-model.
5. The method of claim 1, wherein the acquiring a set of speech samples comprises:
acquiring an initial data set comprising at least one initial voice;
preprocessing the initial data set to obtain a processed data set;
Performing data enhancement on the processed data set to obtain a voice data set;
and obtaining a voice sample set based on the voice data set.
6. The method of claim 5, wherein the preprocessing the initial dataset to obtain a processed dataset comprises at least one of:
sampling all initial data in the initial data set at the same sampling rate;
deleting duplicate and unrecognizable initial data in the initial data set.
7. The method of claim 5, wherein the data enhancing the processed data set to obtain a speech data set comprises at least one of:
modifying the speech speed of each item of processed data in the processed data set to obtain a speed-modified data set, and adding the speed-modified data set to the processed data set;
adding reverberation to each item of processed data in the processed data set to obtain a reverberation data set, and adding the reverberation data set to the processed data set;
adding noise to each item of processed data in the processed data set to obtain a noise data set, and adding the noise data set to the processed data set;
and modifying the audio spectrogram of each item of processed data in the processed data set to obtain a spectrum data set, and adding the spectrum data set to the processed data set.
8. A speech recognition model testing method, the method comprising:
Obtaining a test set and a trained speech recognition model, the test set comprising at least one test sample, the test sample comprising: an audio feature sequence and a test word unit sequence; the trained speech recognition model is obtained by training the speech recognition model training method according to any one of claims 1 to 7, and the speech recognition model comprises: an encoder, a decoder, and a keyword module;
Selecting a test sample from the test set;
inputting the audio feature sequence in the test sample into the encoder to obtain audio intermediate features;
Inputting the audio intermediate characteristics and the current word unit sequence into the decoder to obtain a predicted word unit output by the voice recognition model;
Responding to the predicted word unit as a non-ending symbol, updating a current word unit sequence based on the predicted word unit, and continuously inputting the audio intermediate feature and the current word unit sequence into the decoder until the predicted word unit output by the voice recognition model is the ending symbol, so as to obtain all the predicted word units corresponding to the test sample;
And detecting whether the voice recognition model is qualified or not based on all the predicted word units corresponding to the test sample and the test word unit sequences in the test sample.
9. The method of claim 8, wherein the detecting whether the speech recognition model is qualified based on all the predicted word units corresponding to the test sample and the sequence of the test word units in the test sample comprises:
sequencing all the predicted word units corresponding to the test sample to obtain a predicted word unit sequence;
calculating the word error rate of the test sample based on the predicted word unit sequence and the test word unit sequence in the test sample;
And determining that the speech recognition model is qualified in response to the word error rate being less than an error threshold.
10. A method of speech recognition, the method comprising:
Acquiring voice to be recognized;
processing the voice to be recognized to obtain audio characteristic data;
Inputting the audio feature data into a voice recognition model to obtain a predicted word unit sequence of the voice to be recognized, wherein the voice recognition model is a trained voice recognition model obtained by adopting the voice recognition model training method according to any one of claims 1-7;
And obtaining text data of the voice to be recognized based on the predicted word unit sequence.
11. A speech recognition model training apparatus, the apparatus comprising:
A sample acquisition unit configured to acquire a set of speech samples, the set of speech samples including at least one speech sample, the speech sample comprising: an audio feature sequence and an initial word unit sequence;
the model acquisition unit is configured to acquire an initial voice recognition model, wherein the voice recognition model is used for representing the corresponding relation between the audio feature sequence and the predicted word unit sequence;
a replacing unit configured to replace the language word units in the initial word unit sequences in the speech sample set with predicted word units characterizing the language, to obtain a training sample set, wherein the predicted word units are predicted word units in the predicted word unit sequence obtained by inputting a speech sample selected from the speech sample set into the speech recognition model;
and the training unit is configured to train the voice recognition model based on the training sample set to obtain a trained voice recognition model.
12. The apparatus of claim 11, wherein the speech recognition model comprises: an initial recognition sub-model and a keyword module connected with the initial recognition sub-model; the training unit is configured to: train the initial recognition sub-model based on the training sample set to obtain a trained recognition sub-model; train the keyword module based on the training sample set and the trained recognition sub-model to obtain a trained keyword module; and take the trained recognition sub-model and the trained keyword module as a trained speech recognition model.
13. The apparatus of claim 12, wherein the training unit is further configured to: fixing parameters of the trained recognition sub-model; inputting training samples in a training sample set into the trained recognition sub-model to obtain a first probability distribution output by the trained recognition sub-model and a second probability distribution output by the keyword module; calculating a loss value of an initial word unit in a training sample based on the first probability distribution and the second probability distribution; and obtaining a trained keyword module based on the loss value.
14. The apparatus of claim 12, wherein the training unit is further configured to: select a training sample from the training sample set to obtain a selected sample; shift the new word units in the new word unit sequence in the selected sample to the right by one position to obtain a training word unit sequence; and train the initial recognition sub-model based on the audio feature sequence and the training word unit sequence in the selected sample to obtain a trained recognition sub-model.
15. The apparatus of claim 11, wherein the sample acquisition unit is configured to: acquiring an initial data set comprising at least one initial voice; preprocessing the initial data set to obtain a processed data set; performing data enhancement on the processed data set to obtain a voice data set; and obtaining a voice sample set based on the voice data set.
16. The apparatus of claim 15, wherein the sample acquisition unit is further configured to perform at least one of:
sampling all initial data in the initial data set at the same sampling rate;
deleting duplicate and unrecognizable initial data in the initial data set.
17. The apparatus of claim 15, wherein the sample acquisition unit is further configured to perform at least one of:
modifying the speech speed of each item of processed data in the processed data set to obtain a speed-modified data set, and adding the speed-modified data set to the processed data set;
adding reverberation to each item of processed data in the processed data set to obtain a reverberation data set, and adding the reverberation data set to the processed data set;
adding noise to each item of processed data in the processed data set to obtain a noise data set, and adding the noise data set to the processed data set;
and modifying the audio spectrogram of each item of processed data in the processed data set to obtain a spectrum data set, and adding the spectrum data set to the processed data set.
18. A speech recognition model testing apparatus, the apparatus comprising:
An information acquisition unit configured to acquire a test set and a trained speech recognition model, the test set including at least one test sample, the test sample including: an audio feature sequence and a test word unit sequence; the trained speech recognition model is obtained by training the speech recognition model training device according to any one of claims 11 to 17, and the speech recognition model comprises: an encoder, a decoder, and a keyword module;
a selecting unit configured to select a test sample from the test set;
an input unit configured to input the audio feature sequence in the test sample to the encoder to obtain audio intermediate features;
The obtaining unit is configured to input the audio intermediate characteristics and the current word unit sequence into the decoder to obtain a predicted word unit output by the voice recognition model;
the updating unit is configured to respond to the predicted word unit as a non-ending symbol, update the current word unit sequence based on the predicted word unit, and continuously input the audio intermediate feature and the current word unit sequence into the decoder until the predicted word unit output by the voice recognition model is the ending symbol, so as to obtain all the predicted word units corresponding to the test sample;
And the test unit is configured to detect whether the voice recognition model is qualified or not based on all the prediction word units corresponding to the test sample and the test word unit sequences in the test sample.
19. The apparatus of claim 18, wherein the test unit is further configured to: sequencing all the predicted word units corresponding to the test sample to obtain a predicted word unit sequence; calculating the word error rate of the test sample based on the predicted word unit sequence and the test word unit sequence in the test sample; and determining that the speech recognition model is qualified in response to the word error rate being less than an error threshold.
20. A speech recognition device, the device comprising:
A voice acquisition unit configured to acquire a voice to be recognized;
the processing unit is configured to process the voice to be recognized to obtain audio characteristic data;
A recognition unit configured to input the audio feature data into a speech recognition model to obtain the predicted word unit sequence of the speech to be recognized, wherein the speech recognition model is a trained speech recognition model obtained by using the speech recognition model training device according to any one of claims 11 to 17;
and the conversion unit is configured to obtain text data of the voice to be recognized based on the predicted word unit sequence.
21. An electronic device, comprising:
at least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
22. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-10.
23. A computer program product comprising a computer program which, when executed by a processor, implements the method of any of claims 1-10.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202410726281.2A | 2024-06-05 | 2024-06-05 | Speech recognition model training, testing and speech recognition method and device

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202410726281.2A | 2024-06-05 | 2024-06-05 | Speech recognition model training, testing and speech recognition method and device

Publications (1)

Publication Number | Publication Date
CN118553234A (true) | 2024-08-27

Family

ID=92443702

Family Applications (1)

Application Number | Status | Publication Number | Priority Date | Filing Date | Title
CN202410726281.2A | Pending | CN118553234A (en) | 2024-06-05 | 2024-06-05 | Speech recognition model training, testing and speech recognition method and device

Country Status (1)

Country | Link
CN (1) | CN118553234A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination