CN116994570A - Training method and device of voice recognition model, and voice recognition method and device

Info

Publication number
CN116994570A
Authority
CN
China
Prior art keywords
voice
sample
phoneme
sequence
training
Prior art date
Legal status (assumed; not a legal conclusion)
Pending
Application number
CN202311125742.2A
Other languages
Chinese (zh)
Inventor
马长伟
潘复平
张彬彬
Current Assignee
Beijing Horizon Information Technology Co Ltd
Original Assignee
Beijing Horizon Information Technology Co Ltd
Priority date (assumed; not a legal conclusion)
Filing date
Publication date
Application filed by Beijing Horizon Information Technology Co Ltd filed Critical Beijing Horizon Information Technology Co Ltd
Priority to CN202311125742.2A
Publication of CN116994570A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/26 Speech to text systems
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 Reducing energy consumption in communication networks
    • Y02D 30/70 Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the disclosure provide a training method and device for a speech recognition model, and a speech recognition method and device. The training method includes: obtaining a model to be trained and first training data, where the model to be trained includes a pre-trained base network module and a phoneme re-scoring module to be trained, and the first training data includes a first sample speech and a first sample label; processing the first sample speech with the base network module to obtain a feature coding sequence and a decoding result of the first sample speech, and obtaining text data based on the decoding result; converting the text data of the first sample speech into a phoneme sequence; and training the phoneme re-scoring module to be trained based on the feature coding sequence of the first sample speech, the phoneme sequence of the first sample speech, and the first sample label until a preset training end condition is met, obtaining the speech recognition model. Embodiments of the disclosure can improve the wake-up rate of the speech recognition model while preserving the general recognition capability of the base network.

Description

Training method and device of voice recognition model, and voice recognition method and device
Technical Field
The disclosure relates to the field of speech recognition, and in particular to a training method and device for a speech recognition model, and a speech recognition method and device.
Background
With the popularization of smart devices, voice wake-up technology is becoming increasingly widespread. To further enhance the user experience, many smart devices provide a custom wake-word function.
To avoid building an additional speech recognition model and performing extra decoding work, the custom wake-word function is typically implemented through the hotword function of speech recognition, so that the device can be woken with the custom wake word. Because custom wake words are flexible and variable, their recognition depends heavily on the speech recognition model. When a wake word customized through the hotword function is not a common word, the speech recognition model recognizes it poorly, resulting in a low device wake-up rate and frequent false alarms.
Disclosure of Invention
To solve the above technical problems, the present disclosure provides a training method and apparatus for a speech recognition model, and a speech recognition method and apparatus, so as to improve the wake-up rate of the speech recognition model.
In one aspect of the disclosure, a training method of a speech recognition model is provided, including: obtaining a model to be trained and first training data, where the model to be trained includes a pre-trained base network module and a phoneme re-scoring module to be trained, the first training data includes a first sample speech and a first sample label, the first sample speech includes a wake word, and the first sample label is the phoneme sequence of the first sample speech; encoding the first sample speech with the base network module to obtain a feature coding sequence of the first sample speech, and decoding the feature coding sequence to obtain a decoding result; obtaining text data of the first sample speech based on the decoding result; converting the text data of the first sample speech into a phoneme sequence, obtaining the phoneme sequence of the first sample speech; and training the phoneme re-scoring module to be trained based on the feature coding sequence of the first sample speech, the phoneme sequence of the first sample speech, and the first sample label until a preset training end condition is met, obtaining a speech recognition model from the model to be trained.
In another aspect of the present disclosure, a speech recognition method is provided, including: acquiring speech data; processing the speech data with the base network module of a speech recognition model to obtain a feature coding sequence and a decoding result of the speech data, where the speech recognition model is trained by the above training method; obtaining candidate text data of the speech data based on the decoding result; converting the candidate text data into phoneme sequences to obtain candidate phoneme sequences; processing the feature coding sequence and the candidate phoneme sequences with the phoneme re-scoring module of the speech recognition model to determine a phoneme sequence; and determining, based on the phoneme sequence, whether a wake word is included in the speech data.
In yet another aspect of the present disclosure, a training apparatus for a speech recognition model is provided, including: a first obtaining unit configured to obtain a model to be trained, which includes a pre-trained base network module and a phoneme re-scoring module to be trained, and first training data, which includes a first sample speech containing a wake word and a first sample label that is the phoneme sequence of the first sample speech; a recognition unit configured to encode the first sample speech with the base network module to obtain a feature coding sequence of the first sample speech, and to decode the feature coding sequence to obtain a decoding result; a first text obtaining unit configured to obtain text data of the first sample speech based on the decoding result; a conversion unit configured to convert the text data of the first sample speech into a phoneme sequence, obtaining the phoneme sequence of the first sample speech; and a first training unit configured to train the phoneme re-scoring module to be trained based on the feature coding sequence of the first sample speech, the phoneme sequence of the first sample speech, and the first sample label until a preset training end condition is met, obtaining a speech recognition model from the model to be trained.
In still another aspect of the present disclosure, a speech recognition apparatus is provided, including: a receiving unit configured to acquire speech data; a speech recognition unit configured to process the speech data with the base network module of a speech recognition model to obtain a feature coding sequence and a decoding result of the speech data, where the speech recognition model is trained by the above training method; a second text obtaining unit configured to obtain candidate text data of the speech data based on the decoding result; a phoneme conversion unit configured to convert the candidate text data into phoneme sequences to obtain candidate phoneme sequences; a re-scoring unit configured to process the feature coding sequence and the candidate phoneme sequences with the phoneme re-scoring module of the speech recognition model to determine a phoneme sequence; and a wake-word recognition unit configured to determine, based on the phoneme sequence, whether a wake word is included in the speech data.
In yet another aspect of the present disclosure, a computer-readable storage medium storing a computer program for implementing the above method is provided.
In another aspect of the present disclosure, there is provided an electronic device including: a processor; a memory for storing the processor-executable instructions; the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method described above.
According to the embodiments of the disclosure, a phoneme re-scoring module to be trained and a pre-trained base network module together form the model to be trained, and the first sample speech of the training sample includes a wake word. The base network module processes the first sample speech to obtain a decoding result, from which text data is obtained. The phoneme re-scoring module to be trained is then trained on the feature coding sequence produced by the base network module while processing the first sample speech, the phoneme sequence converted from the text data, and the first sample label of the first sample speech, yielding a speech recognition model with both a general recognition function and a wake-word phoneme recognition function.
In addition, during the training of the model to be trained, because the phoneme re-scoring module learns from the feature coding sequence produced by the base network module and the phoneme sequence converted from the text data output by the base network module, it can quickly learn the correspondence between feature coding sequences and phoneme sequences, which improves the training speed and efficiency.
Drawings
Fig. 1 shows a schematic diagram of an application scenario of an embodiment of the present disclosure.
Fig. 2 is a flow chart of a training method of a speech recognition model according to an exemplary embodiment of the present disclosure.
Fig. 3 is a flow chart of step S250 provided in an exemplary embodiment of the present disclosure.
Fig. 4 is a flow chart of a training method of a speech recognition model according to another exemplary embodiment of the present disclosure.
Fig. 5 is a flowchart illustrating step S330 according to an exemplary embodiment of the present disclosure.
Fig. 6 is a schematic diagram of an application example of a training method of a speech recognition model according to an exemplary embodiment of the present disclosure.
Fig. 7 is a schematic diagram of a speech recognition model according to an exemplary embodiment of the present disclosure.
Fig. 8 is a flow chart illustrating a speech recognition method according to an exemplary embodiment of the present disclosure.
Fig. 9 is a schematic structural diagram of a training device for a speech recognition model according to an exemplary embodiment of the present disclosure.
Fig. 10 is a schematic structural diagram of a training apparatus for a speech recognition model according to another exemplary embodiment of the present disclosure.
Fig. 11 is a schematic structural diagram of a voice recognition apparatus according to an exemplary embodiment of the present disclosure.
Fig. 12 is a block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.
Detailed Description
For the purpose of illustrating the present disclosure, exemplary embodiments will be described in detail below with reference to the drawings. The described embodiments are obviously only some, not all, of the embodiments of the present disclosure, and it should be understood that the present disclosure is not limited to the exemplary embodiments.
It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless it is specifically stated otherwise.
Summary of the application
In the course of implementing the present disclosure, the inventors found through research that, in practical wake-word scenarios, when the wake word is not a common word but a user-defined one, the flexibility and variability of custom wake words make their recognition highly dependent on the speech recognition model. Since the speech recognition model is generally trained on conventional speech-text data, problems such as a low recognition rate and false alarms arise when it recognizes 'unfamiliar' wake words.
Exemplary System
The technical solution of the disclosure can be applied to scenarios in which an electronic device is woken by a wake word, for example, waking home appliances, communication devices, automatic driving systems, and the like with a custom wake word. Specifically, the solution can be applied to training a speech recognition model; after training, the model can recognize wake words, including common wake words, custom wake words, and the like.
Fig. 1 shows a schematic diagram of an application scenario of an embodiment of the present disclosure. As shown in fig. 1, the model to be trained 100 may be deployed on a computing device (e.g., server, computer, notebook, etc.). The model to be trained 100 comprises: a pre-trained base network module 110 and a phone re-scoring module 120 to be trained. The first sample speech including the custom wake-up word may be collected by an audio collection device (e.g., a microphone, etc.), and then the first sample speech is recognized using a pre-trained speech recognition model to obtain a target text of the first sample speech, the target text of the first sample speech is converted into a phoneme sequence of the first sample speech, and the phoneme sequence of the first sample speech is used as a first sample tag.
The first sample speech is input into the model to be trained 100; the base network module 110 of the model to be trained 100 outputs the feature coding sequence and the decoding result of the first sample speech, and the text data of the first sample speech is determined from the decoding result. The text data is converted into phoneme sequences. The feature coding sequence and the phoneme sequences of the first sample speech are input into the phoneme re-scoring module 120 to be trained, which outputs a predicted phoneme sequence of the first sample speech; the module is trained based on the predicted phoneme sequence and the first sample label until a preset training end condition is met, obtaining the speech recognition model. The speech recognition model can be deployed in an electronic device to detect whether input voice data includes the wake word, so as to wake up the device when the wake word is present.
Exemplary method
Fig. 2 is a flow chart of a training method of a speech recognition model according to an exemplary embodiment of the present disclosure. The embodiment can be applied to an electronic device, as shown in fig. 2, and includes the following steps:
step S210, a model to be trained and first training data are obtained.
The model to be trained includes: a pre-trained base network module and a phoneme re-scoring module to be trained.
Here, the base network module is used for speech recognition. The base network module may include, but is not limited to, the WeNet speech recognition model, ASRT (Auto Speech Recognition Tool), the Whisper speech recognition model, and the like.
The phoneme re-scoring module to be trained is used for scoring phoneme sequences. The phoneme re-scoring module to be trained may include, for example, but is not limited to: a CNN (Convolutional Neural Network), RNN (Recurrent Neural Network), DNN (Deep Neural Network), LSTM (Long Short-Term Memory) network, Transformer model, and the like.
Wherein the first training data comprises: a first sample speech including a wake word and a first sample tag that is a sequence of phonemes of the first sample speech.
The wake word in the first sample speech may be a common word, or a wake word customized according to actual requirements; the embodiments of the present disclosure do not limit this. The first sample speech may be an original audio signal collected by an audio collection device (e.g., a microphone), or speech obtained from an original audio signal after front-end signal processing, which is likewise not limited. Front-end signal processing may include, for example, but is not limited to: VAD (Voice Activity Detection), noise reduction, AEC (Acoustic Echo Cancellation), dereverberation, sound source localization, BF (Beam Forming), and the like.
In some embodiments, the first sample speech may be only a wake word, in which case the first sample label is the phoneme sequence of the wake word. Alternatively, in other embodiments, the first sample speech may be a speech sample containing a wake word, in which case the first sample label is the phoneme sequence of the first sample speech, which includes the phoneme sequence of the custom wake word.
Step S220, the basic network module is utilized to encode the first sample voice to obtain a characteristic encoding sequence of the first sample voice, and the characteristic encoding sequence of the first sample voice is decoded to obtain a decoding result.
The base network module may adopt a pre-trained neural network for speech coding to encode the first sample speech; the neural network may be a CNN, RNN, Transformer model, or the like, and can be obtained by training on speech data labeled with feature coding sequences. In some embodiments, the base network module may decode the feature coding sequence of the first sample speech with various decoders, such as a CTC (Connectionist Temporal Classification) decoder, a lattice decoder, or a WFST (Weighted Finite-State Transducer) decoder, to obtain a decoding result of the first sample speech.
Wherein the decoding result of the first sample speech includes: a plurality of candidate paths of the first sample speech and an acoustic score for each word in each candidate path. For example, if the first sample speech is 'Xiaoming please start', where 'Xiaoming' is the wake word, the candidate paths of the first sample speech may include 'Xiaoming - please - start', 'Xiaomeng - please - start', 'Xiaomin - please - start', and the like, the candidates differing in acoustically similar words.
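As an illustration of the data involved, the following minimal Python sketch (not part of the original disclosure; the class and field names are assumptions) shows one way such a decoding result could be represented, pairing each candidate path with its per-word acoustic scores:

```python
# Illustrative sketch only: one possible in-memory form of a decoding result.
from dataclasses import dataclass
from typing import List

@dataclass
class CandidatePath:
    words: List[str]          # e.g. ["Xiaoming", "please", "start"]
    word_scores: List[float]  # acoustic score of each word on this path

@dataclass
class DecodingResult:
    candidates: List[CandidatePath]  # the plurality of candidate paths
```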
Step S230, based on the decoding result of the first sample speech, obtaining text data of the first sample speech.
The text data of the first sample speech may include text corresponding to at least one candidate path in the decoding result. Following the example in step S220, the texts of the first sample speech may be 'Xiaoming - please - start', 'Xiaomeng - please - start', 'Xiaomin - please - start', and the like.
Step S240, converting the text data of the first sample voice into a phoneme sequence to obtain the phoneme sequence of the first sample voice.
The phoneme sequence of the first sample voice may include a phoneme sequence corresponding to each text in the text data of the first sample voice.
In some embodiments, the text data of the first sample speech may be converted into the phoneme sequence of the first sample speech by a pre-trained phoneme conversion model. In a specific implementation, the phoneme conversion model may adopt a neural network model such as an RNN, LSTM, DNN, or Transformer model, trained on text labeled with phoneme sequences; alternatively, a phoneme conversion tool, such as Phonemizer, may be used to convert the text data of the first sample speech into the phoneme sequence of the first sample speech.
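For concreteness, the following is a minimal Python sketch of lexicon-based grapheme-to-phoneme conversion. The toy LEXICON and the pinyin-style phoneme strings are illustrative assumptions; a real system would use a trained phoneme conversion model or a tool such as Phonemizer, as noted above.

```python
# Illustrative sketch: per-character lexicon lookup, producing concatenated
# pinyin-style phoneme strings like "xiaomingqidong". The lexicon is a toy.
LEXICON = {"小": "xiao", "明": "ming", "萌": "meng", "请": "qing", "启": "qi", "动": "dong"}

def text_to_phonemes(text: str) -> str:
    # Unknown characters are skipped; a real G2P model would handle them.
    return "".join(LEXICON.get(ch, "") for ch in text)

# text_to_phonemes("小明启动") -> "xiaomingqidong"
```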
In one specific example, the first sample speech is 'Xiaoming start', where 'Xiaoming' is the wake word. The text data of the first sample speech may then include: 'Xiaoming start', 'Xiaomeng start', and the like; correspondingly, the phoneme sequences of the first sample speech may include: xiaomingqidong, xiaomengqidong, and the like.
Step S250, training the phoneme re-scoring module to be trained, that is, adjusting its model parameters, based on the feature coding sequence of the first sample speech, the phoneme sequence of the first sample speech, and the first sample label, until a preset training end condition is met, obtaining the speech recognition model from the model to be trained.
Wherein the speech recognition model comprises: the base network module and the phoneme re-scoring module.
In one embodiment, the feature code sequence and the phoneme sequence of the first sample voice may be input into a phoneme re-scoring module to be trained, and the phoneme re-scoring module to be trained outputs a predicted phoneme sequence of the first sample voice; according to the predicted phoneme sequence of the first sample voice and the first sample label of the first sample voice, calculating to obtain a loss value by using a preset loss function, and adjusting model parameters of a phoneme re-scoring module to be trained so as to reduce the loss value. In a specific implementation, the process may be iteratively performed until the loss value is no longer decreasing, resulting in a speech recognition model.
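To make the interface concrete, the following PyTorch sketch shows one assumed shape for the phoneme re-scoring module: a small Transformer decoder that attends over the feature coding sequence and scores phoneme sequences. The architecture and dimensions are illustrative assumptions, not taken from the patent.

```python
# Illustrative sketch of a phoneme re-scoring module interface.
import torch
import torch.nn as nn

class PhonemeRescorer(nn.Module):
    def __init__(self, num_phonemes: int, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(num_phonemes, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, num_phonemes)

    def forward(self, phoneme_ids, feature_codes):
        # phoneme_ids: (batch, seq); feature_codes: (batch, frames, d_model)
        # feature_codes must share the model dimension d_model (assumption).
        x = self.decoder(self.embed(phoneme_ids), feature_codes)
        return self.out(x)  # per-step phoneme logits
```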
In the embodiments of the disclosure, a phoneme re-scoring module to be trained and a pre-trained base network module serve as the model to be trained, and the first sample speech of the training sample includes a wake word. The base network module processes the first sample speech to obtain a decoding result, and text data is obtained from the decoding result. The phoneme re-scoring module to be trained is then trained on the feature coding sequence of the first sample speech obtained while the base network module processes it, the phoneme sequence converted from the text data, and the first sample label, yielding a speech recognition model with both a general recognition function and a wake-word phoneme recognition function; because only the phoneme re-scoring module is trained after the base network module is obtained by pre-training, the general recognition capability of the base network is retained. In addition, during training, based on the feature coding sequence produced by the base network module and the phoneme sequence converted from the text data it outputs, the phoneme re-scoring module can quickly learn the correspondence between feature coding sequences and phoneme sequences, which improves the training speed and efficiency.
In some alternative embodiments, the above-mentioned base network module may include: a coding sub-module and a connectionist temporal classification sub-module.
The coding sub-module is used to extract the audio features of the first sample speech and encode the extracted features. Illustratively, the coding sub-module may be implemented with an acoustic model such as a GMM-HMM (Gaussian Mixture Model - Hidden Markov Model) or an HMM (Hidden Markov Model), or with any one of a CNN, RNN, Transformer model, Conformer model, or the like.
The connectionist temporal classification sub-module is used to determine the decoding result of the first sample speech based on its feature coding sequence, that is, to decode the feature coding sequence of the first sample speech to obtain the decoding result. Illustratively, this sub-module may be a CTC decoder, a language model, or the like. In some alternative implementations, step S220 in embodiments of the present disclosure may include: the coding sub-module outputs the feature coding sequence of the first sample speech to the phoneme re-scoring module to be trained and to the connectionist temporal classification sub-module. The audio features, also referred to as acoustic features, may include, for example, but are not limited to, any of the following: LPC (Linear Predictive Coding), MFCC (Mel-Frequency Cepstral Coefficients), FBank (Mel-scale Filter Bank), and the like. The audio features may be represented as feature vectors or feature maps, which the embodiments of the present disclosure do not limit.
In some alternative embodiments, the coding sub-module may extract the audio features of the first sample speech through an MFCC algorithm, a zero-crossing rate algorithm, or the like, and then encode the audio features through a pre-trained neural network for audio feature coding, obtaining the feature coding sequence of the first sample speech.
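A minimal sketch of this feature extraction step, assuming PyTorch/torchaudio and a 16 kHz mono recording; the filename and parameter values are illustrative assumptions, not taken from the patent:

```python
# Illustrative sketch: log-Mel filter bank (FBank-style) features that an
# encoding sub-module might consume.
import torch
import torchaudio

waveform, sample_rate = torchaudio.load("first_sample.wav")  # hypothetical file
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=400, hop_length=160, n_mels=80
)(waveform)
fbank = torch.log(mel + 1e-6).transpose(1, 2)  # (channels, frames, 80)
```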
In some optional implementations, step S220 in the embodiments of the present disclosure may further include: decoding the feature coding sequence of the first sample speech with the connectionist temporal classification sub-module to obtain a decoding result, where the decoding result includes: a plurality of candidate paths and an acoustic score for each word in each candidate path.
In one implementation, the connectionist temporal classification sub-module may decode the feature coding sequence by running a forward process on it, for example the forward process of CTC, to obtain the decoding result; the forward process infers, for each candidate path, the acoustic score of each word on that path. For example, the sub-module may decode the feature coding sequence of the first sample speech to obtain a plurality of candidate paths and the acoustic scores of the words on each path, and then output them. Specifically, a decoding vocabulary may be preset; based on the feature coding sequence of the first sample speech, the probability of each speech frame corresponding to each word in the decoding vocabulary is determined, a plurality of candidate paths are then formed from those words, and the per-word probabilities serve as the acoustic scores along each candidate path. For example, if the decoding vocabulary contains 6000 words and the first sample speech contains 50 speech frames, the decoding result may include the probability of each speech frame corresponding to each of the 6000 words, together with the candidate paths formed from those words, and the probability of each word on a candidate path is taken as that word's acoustic score.
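The following is a minimal greedy sketch of the CTC decoding idea described above: per-frame probabilities over the decoding vocabulary are collapsed into a path with per-word scores. It keeps only the single best path; a real implementation would keep several candidate paths (e.g., by prefix beam search). The blank index and the vocabulary layout are assumptions.

```python
# Illustrative greedy CTC decode: collapse repeats, drop blanks, and record
# the per-frame log-probability of each emitted token as its acoustic score.
import torch

def greedy_ctc_decode(log_probs: torch.Tensor, vocab: list, blank: int = 0):
    # log_probs: (frames, vocab_size) frame-level log-probabilities
    ids = log_probs.argmax(dim=-1).tolist()
    path, scores, prev = [], [], blank
    for t, i in enumerate(ids):
        if i != blank and i != prev:
            path.append(vocab[i])
            scores.append(log_probs[t, i].item())
        prev = i
    return path, scores
```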
In an alternative embodiment, the base network module may further include an attention re-scoring module, which re-scores each word in each candidate path output by the connectionist temporal classification sub-module in combination with the feature coding sequence of the first sample speech, takes the re-scoring results as the acoustic scores of the words, and then outputs the plurality of candidate paths with the acoustic score of each word. The attention re-scoring module may be implemented with Attention Rescore (an attention re-scoring decoder).
In some optional implementations, step S230 in the embodiments of the present disclosure may include: obtaining the acoustic confidence of each candidate path based on the acoustic scores of the words on that path; then, based on the acoustic confidences, obtaining, from the plurality of candidate paths, the n texts corresponding to the n candidate paths with the highest acoustic confidence as the text data of the first sample speech. The acoustic confidence of a candidate path indicates the confidence that the text corresponding to that path is the first sample speech, and n is an integer greater than or equal to 1.
For each candidate path, the acoustic confidence may be obtained from the acoustic scores of the words on the path, for example by summing them or taking their average. Then, using a preset path selection algorithm such as the Viterbi algorithm, the n candidate paths with the highest acoustic confidence are selected from the candidate paths, and the texts corresponding to the n selected paths are taken as the text data of the first sample speech. At the same time, the acoustic confidences of the n candidate paths serve as the confidences of the corresponding texts.
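A minimal sketch of this ranking step, taking the mean of per-word acoustic scores as the path confidence (one of the options the text allows; the data layout is an assumption):

```python
# Illustrative sketch: rank candidate paths by mean acoustic score and keep
# the n best texts with their confidences.
def top_n_texts(candidates, n=3):
    # candidates: iterable of (words, word_scores) pairs
    scored = [("".join(words), sum(s) / len(s)) for words, s in candidates if s]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]  # [(text, acoustic confidence), ...]
```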
Accordingly, in this embodiment, in step S240, the n texts may be converted into phoneme sequences respectively, and the resulting n phoneme sequences taken as the phoneme sequences of the first sample speech. In one implementation, the model to be trained in the embodiments of the present disclosure may further include: a phoneme conversion module for converting text into its corresponding phoneme sequence, for example a trained RNN, LSTM, DNN, or Transformer model for phoneme conversion. Accordingly, in step S240 above, the n texts may be input into the phoneme conversion module in turn, and the module outputs the phoneme sequence corresponding to each text.
In the embodiments of the disclosure, the coding sub-module and the connectionist temporal classification sub-module of the pre-trained base network module process the first sample speech to obtain the feature coding sequence and the decoding result; the acoustic confidence of each candidate path is then determined from the decoding result, the text data is selected by acoustic confidence, and the text data is converted into phoneme sequences. When training the phoneme re-scoring module, the existing coding sub-module processes the first sample speech, so the feature coding sequence used for training can be obtained without building additional models for audio feature extraction and coding. In addition, because the connectionist temporal classification sub-module determines the acoustic confidence of each candidate path from the per-word acoustic scores, and the text data of the first sample speech is selected by that confidence, the trained phoneme re-scoring module, working on phoneme sequences converted from confidence-screened texts, can effectively reduce false alarms on the wake word.
In an alternative embodiment, in step S230 above, before the text data of the first sample speech is converted into phoneme sequences, an LM (Language Model) may be used to correct the word order of each of the n texts of the first sample speech. For example, if the text data of the first sample speech includes the text 'please - Xiaoming - start', the LM can correct it to 'Xiaoming - please - start'.
Wherein LM may include, for example, but is not limited to: a rule language model, an N-Gram language model, or a neural network language model (Neural Network Language Model, NNLM), etc., which are not limited by the disclosed embodiments.
In the embodiments of the disclosure, correcting the word order of the text data of the first sample speech improves the accuracy of the decoding result, which in turn improves the accuracy of the phoneme sequences converted from that text data, and ultimately the recognition performance and voice wake-up performance of the trained speech recognition model.
Fig. 3 shows a schematic flow chart of step S250 in the embodiment of the disclosure, as shown in fig. 3, on the basis of the embodiment shown in fig. 2, step S250 may include the following steps:
Step S251, re-scoring the n phoneme sequences respectively with the phoneme re-scoring module to be trained, and taking the re-scoring results as the confidences of the n phoneme sequences.
Wherein the re-scoring result includes a scoring score for each of the n phoneme sequences. The confidence of the phoneme sequence represents the confidence that the phoneme sequence is the first sample speech.
In some alternative embodiments, the n phoneme sequences and the feature coding sequence of the first sample voice are input into a phoneme re-scoring module to be trained, the phoneme re-scoring module to be trained re-scores the n phoneme sequences based on the feature coding sequence of the first sample voice, and a re-scoring result of the n phoneme sequences is output.
In step S252, based on the confidence levels of the n phoneme sequences, a phoneme sequence with the highest confidence level is determined from the n phoneme sequences, and the phoneme sequence with the highest confidence level is used as an output phoneme sequence, that is, an output phoneme sequence of the first sample speech.
Step S253, determining a loss function value based on a difference between the output phoneme sequence and the corresponding first sample label.
A first predetermined loss function, such as a cross-entropy error function or a mean square error function, may be used as the loss function of the phoneme re-scoring module to be trained. Based on the difference between the output phoneme sequence and the corresponding first sample label, the loss function value of the phoneme re-scoring module to be trained is computed with this loss function.
Step S254, adjusting the parameters of the phoneme re-scoring module to be trained based on the loss function value until the loss function converges, obtaining the speech recognition model from the model to be trained.
The operations of steps S251 to S254 may be performed iteratively, adjusting the parameters of the phoneme re-scoring module to be trained so that the loss function value gradually decreases; when the loss function converges, that is, the loss value no longer drops, training of the phoneme re-scoring module is complete, and the trained phoneme re-scoring module and the base network module form the speech recognition model.
In one embodiment, any parameter optimizer may be used to adjust the parameters of the phoneme re-scoring module to be trained. Parameter optimizers may include, but are not limited to, SGD (Stochastic Gradient Descent), Adagrad (adaptive gradient algorithm), Adam (Adaptive Moment Estimation), RMSprop (Root Mean Square Propagation), L-BFGS (limited-memory Broyden-Fletcher-Goldfarb-Shanno), and the like. Specifically, the optimizer computes the gradient of each parameter of the phoneme re-scoring module and adjusts each parameter along the direction in which the loss function value decreases most; steps S251-S254 are iterated until the loss function value no longer decreases, training is complete, and the trained phoneme re-scoring module and the base network module yield the speech recognition model.
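A hedged sketch of one such training step, reusing the PhonemeRescorer sketch above with cross-entropy loss and the Adam optimizer. It assumes phoneme ID tensors padded to a common length; all names and values are illustrative.

```python
# Illustrative training step for the phoneme re-scoring module (steps S251-S254).
import torch
import torch.nn as nn

rescorer = PhonemeRescorer(num_phonemes=100)       # from the earlier sketch
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(rescorer.parameters(), lr=1e-4)

def train_step(phoneme_ids, feature_codes, label_ids):
    # phoneme_ids, label_ids: (batch, seq) long tensors of phoneme indices
    logits = rescorer(phoneme_ids, feature_codes)  # (batch, seq, num_phonemes)
    loss = criterion(logits.transpose(1, 2), label_ids)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```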
In the embodiments of the disclosure, the phoneme re-scoring module to be trained determines the confidences of the n phoneme sequences of the first sample speech based on the feature coding sequence; the phoneme sequence with the highest confidence is selected as the output phoneme sequence, and the module is trained using the output phoneme sequence and the first sample label, obtaining the speech recognition model. The trained phoneme re-scoring module can accurately score input phoneme sequences and thus accurately determine the output phoneme sequence, further improving the speech recognition and wake-up performance of the model.
Fig. 4 is a flow chart of a training method of a speech recognition model according to another exemplary embodiment of the present disclosure. As shown in fig. 4, on the basis of the embodiment shown in fig. 2, the above-mentioned basic network module may be trained as follows:
step S310, an initial basic network module to be trained, second training data and a test set are obtained.
The initial base network module may be a WeNet speech recognition model, ASRT, Whisper speech recognition model, RNN, DNN, Transformer model, or the like. The second training data includes: a plurality of second sample speeches and their corresponding second sample labels, where each second sample label is the labeled decoding result of the corresponding second sample speech, including: a plurality of candidate paths and an acoustic score for each word in each candidate path. Each second sample speech may be an original audio signal collected by an audio collection device (e.g., a microphone), or speech obtained from an original audio signal after front-end signal processing; the embodiments of the present disclosure do not limit this.
Step S320, training the initial basic network module by using the second training data to obtain a trained first basic network module.
A second predetermined loss function, e.g., a cross-entropy error function or a mean square error function, may be used as the loss function of the initial base network module. Each second sample speech in the second training data is input into the initial base network module to be trained, which outputs a predicted decoding result for each second sample speech. Based on the differences between the predicted decoding results and the corresponding second sample labels, the loss function value of the initial base network module is computed with the second predetermined loss function. These operations are iterated, adjusting the parameters of the initial base network module so that its loss function value decreases, until the loss value no longer drops; training of the initial base network module is then complete, and the trained module is taken as the first base network module.
And step S330, processing the data in the test set by using the first basic network module, and adjusting decoding parameters of the first basic network module according to the processing result to obtain the trained basic network module.
In one embodiment, when the decoding parameters of the first base network module are adjusted, the other model parameters of the first base network module may be frozen first, and the decoding parameters then adjusted based on the test set; for the specific adjustment manner, refer to the training of the initial base network module in the above embodiments, which is not repeated here.
In the embodiment of the disclosure, the initial basic network module is trained by using the second training data to obtain the first basic network module, and then the decoding parameters of the first basic network module are adjusted by the test set to obtain the basic network module. The decoding parameters are further adjusted by using the test set, so that the decoding performance of the basic network module obtained by training is improved, and the performance of the voice recognition model is further improved.
Fig. 5 shows a flow diagram of step S330 in an embodiment of the present disclosure. As shown in fig. 5, step 330 may include the following steps, based on the embodiment shown in fig. 4, described above:
Step S331, the first basic network module is utilized to identify the test voice data in the test set, and a test result is obtained.
The test set in the embodiment of the disclosure includes: test voice data and marking information, wherein the marking information is a marking decoding result of the corresponding test voice data.
In one embodiment, the test voice data includes a wake word. The test voice data may be an original audio signal collected by an audio collection device (e.g., a microphone, etc.), or may be voice of the original audio signal after front-end signal processing, which is not limited by the embodiments of the present disclosure.
The test result includes the predicted decoding result obtained by the first base network module recognizing the test voice data.
In step S332, the model parameters of the first base network module other than the decoding parameters are frozen, that is, kept unchanged.
Step S333, adjusting the decoding parameters in the first basic network module until the difference between the marking information of the test voice data and the test result meets the preset condition.
Specifically, the test voice data in the test set is input into the first base network module, which outputs the test results of the test voice data; the loss function value of the first base network module is computed from the predicted decoding results in the test results and the labeled decoding results in the marking information. With the model parameters other than the decoding parameters frozen, the decoding parameters of the first base network module are adjusted, and the operations of steps S331-S333 are iterated until the loss function value no longer decreases; adjustment of the decoding parameters is then complete, and the first base network module with adjusted decoding parameters is taken as the base network module.
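A minimal PyTorch sketch of the freezing step, assuming the first base network module is an nn.Module whose decoding parameters live in a sub-module named "ctc_decoder"; the variable and sub-module names are assumptions for illustration:

```python
# Illustrative sketch: freeze everything except the decoding parameters,
# then fine-tune only those on the test set (steps S332-S333).
import torch

for name, param in first_base_network.named_parameters():  # hypothetical model
    param.requires_grad = "ctc_decoder" in name             # assumed sub-module name

trainable = [p for p in first_base_network.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-5)
```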
In the embodiments of the disclosure, the parameters of the first base network module other than the decoding parameters are frozen first, and the decoding parameters are then adjusted based on the test voice data and marking results in the test set to obtain the base network module. This improves the accuracy of the base network module's coding of audio features and effectively improves its performance.
Fig. 6 is a schematic diagram showing an application example of the training method of the speech recognition model of the present disclosure, and fig. 7 is a schematic diagram showing the structure of the speech recognition model in the embodiment of the present disclosure.
In this application example, the first training data includes: a plurality of first sample voices and a plurality of first sample tags; the second training data includes: a plurality of second sample voices and a plurality of second sample tags; the test set includes: the system comprises a plurality of test voice data and a plurality of mark information, wherein each of the first training data, each of the first sample tags, each of the test voice data and each of the mark information comprises a wake-up word.
In this application example, the initial base network module to be trained adopts the WeNet speech recognition model. As shown in fig. 7, the WeNet speech recognition model comprises three parts: a Shared Encoder, a CTC Decoder, and Attention Rescore (an attention re-scoring decoder). The Shared Encoder may comprise multiple Transformer or Conformer layers, the CTC Decoder consists of a fully connected layer and a softmax layer, and Attention Rescore comprises multiple Transformer layers. The Shared Encoder of the WeNet speech recognition model serves as the coding sub-module in the embodiments of the disclosure, and the CTC Decoder serves as the connectionist temporal classification sub-module. Attention Rescore is used to re-score each word in the decoding result output by the CTC Decoder based on the feature coding sequence output by the Shared Encoder.
As shown in fig. 6, a plurality of second sample voices are input into an initial basic network module to be trained, the initial basic network module to be trained outputs a prediction decoding result of each second sample voice, and based on the prediction decoding result of each second sample voice and each second sample label, the initial basic network module to be trained is trained to obtain a first basic network module.
And freezing model parameters except decoding parameters in the first basic network module, inputting a plurality of test voice data into the first basic network module, outputting test results of the test voice data by the first basic network module, and optimizing the decoding parameters of the first basic network module based on the test results of the test voice data and the marking information to obtain the basic network module.
Each first sample speech is input into the base network module; the Shared Encoder and the CTC Decoder in the base network module output the feature coding sequence and the decoding result of each first sample speech, respectively, and the text data of each first sample speech, comprising n texts, is then obtained based on its decoding result.
The n texts of each first sample speech are corrected through the LM to obtain n corrected texts, which are then converted into the phoneme sequences of the corresponding first sample speech.
The phoneme re-scoring module to be trained is trained using the phoneme sequences, feature coding sequences, and first sample labels of the first sample speeches until a preset training end condition is met, obtaining the speech recognition model from the model to be trained.
As shown in fig. 7, the speech recognition model includes: the base network module and the phoneme re-scoring module. The Shared Encoder and the CTC Decoder in the underlying network module are respectively connected with the phoneme re-scoring module.
Fig. 8 is a flow chart illustrating a speech recognition method according to an exemplary embodiment of the present disclosure. The embodiment can be applied to electronic devices and mobile carriers; the mobile carriers may be vehicles, ships, unmanned aerial vehicles, and the like. As shown in fig. 8, the speech recognition method of the embodiment of the present disclosure includes the following steps:
in step S410, voice data is acquired.
Wherein voice data may be collected by an audio collection device (e.g., microphone, etc.). In one embodiment, front-end speech signal processing may be performed on the speech data.
And step S420, processing the voice data by utilizing a basic network module in the voice recognition model to obtain a characteristic coding sequence and a decoding result of the voice data.
The speech recognition model may be trained based on the training method of the speech recognition model according to any of the embodiments of the present disclosure. The speech recognition model may include: the base network module and the phoneme re-scoring module. The decoding result of the speech data includes a plurality of candidate paths of the speech data and acoustic scores of words in each candidate path.
In one embodiment, the base network module in the speech recognition model may include: the coding sub-module and the connectionist temporal classification sub-module. The coding sub-module processes the speech data and outputs its feature coding sequence; the connectionist temporal classification sub-module determines the decoding result of the speech data based on that feature coding sequence. For the processing performed by the two sub-modules on the speech data, refer to the description in the related embodiments above, which is not repeated here.
Step S430, based on the decoding result of the voice data, obtaining the candidate text data of the voice data.
The method for obtaining the candidate text data through the decoding result of the voice data may refer to a specific implementation method for obtaining the text data through the decoding result of the first sample voice, which is not described herein again.
Step S440, converting the candidate text data into a phoneme sequence to obtain a candidate phoneme sequence.
In one embodiment, the candidate text data may be converted by a pre-trained phoneme conversion model for phoneme conversion to obtain a candidate phoneme sequence.
In one embodiment, the candidate text data obtained through step S430 may include: n candidate texts of the speech data. In this step S440, n candidate texts in the candidate text data may be converted into corresponding candidate phoneme sequences, respectively.
Step S450, processing the feature coding sequence and the candidate phoneme sequences of the speech data with the phoneme re-scoring module of the speech recognition model to determine the phoneme sequence.
In one embodiment, the feature code sequence and the candidate phoneme sequence of the speech data are input to a phoneme re-scoring module, which outputs the phoneme sequence of the speech data.
Step S460, based on the phoneme sequence of the voice data, determining whether the voice data comprises a wake-up word.
The wake-up words can be common words or user-defined wake-up words.
In some implementations, the phoneme sequence of the voice data is matched against the phoneme sequence of the wake word; when they match, it is determined that the voice data includes the wake word. Alternatively, the phoneme sequence of the voice data can be converted into the corresponding text, and the converted text matched against the wake word; if they match, the voice data is determined to include the wake word. Illustratively, taking the speech recognition model shown in fig. 7 as an example, the custom wake word is 'Xiaoming' and the voice data is 'Xiaoming start'.
The voice data is input into the speech recognition model; the Shared Encoder outputs the feature coding sequence of the voice data, which is passed to the CTC Decoder and to the phoneme re-scoring module. The CTC Decoder outputs the decoding result of the voice data, and candidate text data of the voice data is obtained based on the decoding result, the candidates being three homophone transcriptions: (1) Xiaoming qidong, (2) Xiaomeng qidong, (3) Xiaomin qidong. Phoneme conversion is performed on the candidate text data to obtain the candidate phoneme sequences: (1) xiaomingqidong, (2) xiaomengqidong, (3) xiaominqidong. The phoneme re-scoring module outputs a phoneme sequence, e.g., (1) xiaomingqidong, based on the feature coding sequence of the voice data and the candidate phoneme sequences. Since the phoneme sequence "xiaoming" of the custom wake-up word "Xiaoming" matches "xiaoming" in the phoneme sequence of the voice data, it is determined that the voice data includes the wake-up word.
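A minimal sketch of the phoneme-level matching described above, using the pinyin-style units from the example (a real system might additionally allow fuzzy matching):

```python
def contains_wake_word(utterance: list[str], wake_word: list[str]) -> bool:
    """Return True if the wake-word phoneme sequence occurs in the utterance."""
    n, m = len(utterance), len(wake_word)
    return any(utterance[i:i + m] == wake_word for i in range(n - m + 1))

# The recognized sequence "xiaomingqidong" contains the wake word "xiaoming":
assert contains_wake_word(
    ["x", "iao", "m", "ing", "q", "i", "d", "ong"],
    ["x", "iao", "m", "ing"],
)
```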
In the embodiment of the present disclosure, recognizing voice data with a speech recognition model trained as described above can effectively improve the recognition rate of wake-up words in the voice data. In addition, determining whether the voice data includes a wake-up word from the phoneme sequence recognized by the model avoids false wake-up alarms caused by homophones, mis-written characters, and the like. This improves both the recall and the accuracy of wake-up word recognition, and thus the rate at which the device is woken by the recognized wake-up word.
Any of the training methods of a speech recognition model and any of the speech recognition methods provided by the embodiments of the present disclosure may be performed by any suitable device having data processing capabilities, including, but not limited to: a terminal device, a server, etc. Alternatively, any of these methods may be executed by a processor, for example by the processor invoking corresponding instructions stored in a memory. This is not repeated below.
Exemplary apparatus
FIG. 9 is a block diagram of a training apparatus for a speech recognition model in one embodiment of the present disclosure. As shown in fig. 9, the training device of the speech recognition model includes: a first acquisition unit 500, a recognition unit 510, a first text acquisition unit 520, a conversion unit 530, and a first training unit 540.
A first obtaining unit 500 configured to obtain a model to be trained and first training data, where the model to be trained includes a basic network module obtained by pre-training and a phoneme re-scoring module to be trained, the first training data includes a first sample voice and a first sample tag, the first sample voice includes a wake-up word, and the first sample tag is a phoneme sequence of the first sample voice;
The identifying unit 510 is configured to perform coding processing on the first sample voice by using the basic network module to obtain a feature coding sequence of the first sample voice, and perform decoding processing on the feature coding sequence of the first sample voice to obtain a decoding result;
a first text obtaining unit 520 configured to obtain text data of the first sample voice based on the decoding result;
a conversion unit 530 configured to convert text data of the first sample speech into a phoneme sequence, resulting in a phoneme sequence of the first sample speech;
The first training unit 540 is configured to train the phoneme re-scoring module to be trained based on the feature coding sequence of the first sample speech, the phoneme sequence of the first sample speech, and the first sample tag until a preset training ending condition is met, to obtain a speech recognition model from the model to be trained.
Fig. 10 is a block diagram of a training apparatus of a speech recognition model in another embodiment of the present disclosure. In one embodiment of the present disclosure, the basic network module includes a coding sub-module and a connectionist temporal classification (CTC) sub-module.
as shown in fig. 10, the identification unit 510 in the embodiment of the present disclosure includes:
A first speech processing subunit 511 configured to extract audio features of the first sample speech by using the coding sub-module, and perform encoding processing on the audio features to obtain a feature coding sequence of the first sample speech;
a second speech processing subunit 512 configured to decode the feature coding sequence of the first sample speech by using the CTC sub-module to obtain a decoding result, where the decoding result includes: a plurality of candidate paths and acoustic scores of the words in each candidate path;
the first text obtaining unit 520 is specifically configured to obtain an acoustic confidence of each candidate path based on the acoustic score of each word in the candidate path; and, based on the acoustic confidences, to obtain the n texts corresponding to the n candidate paths with the highest acoustic confidence as the text data of the first sample voice (see the sketch following this list);
the conversion unit 530 is specifically configured to convert the n texts into phoneme sequences, respectively, and take the obtained n phoneme sequences as phoneme sequences of the first sample speech.
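The aggregation of word-level acoustic scores into a path-level acoustic confidence in the first text obtaining unit 520 is left open by the disclosure; the sketch below assumes a simple average of word log-scores and then keeps the n best texts:

```python
def path_confidence(word_log_scores: list[float]) -> float:
    # Illustrative aggregation: mean word-level acoustic log-score.
    return sum(word_log_scores) / max(len(word_log_scores), 1)

def top_n_texts(paths: list[tuple[str, list[float]]], n: int) -> list[str]:
    """Rank candidate paths by acoustic confidence and return the n best texts."""
    ranked = sorted(paths, key=lambda p: path_confidence(p[1]), reverse=True)
    return [text for text, _ in ranked[:n]]

paths = [("xiaoming qidong", [-0.2, -0.1]), ("xiaomeng qidong", [-0.6, -0.1])]
print(top_n_texts(paths, n=2))  # ['xiaoming qidong', 'xiaomeng qidong']
```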
In one embodiment of the present disclosure, the first training unit 540 in the embodiment of the present disclosure includes:
A re-scoring subunit 541 configured to re-score the n phoneme sequences respectively with the phoneme re-scoring module to be trained, and to take the re-scoring results as the confidence levels of the n phoneme sequences;
a determination subunit 542 configured to determine, based on the confidence levels of the n phoneme sequences, the output phoneme sequence with the highest confidence level from the n phoneme sequences;
a loss value determination subunit 543 configured to determine a loss function value based on a difference between the output phoneme sequence and the first sample tag;
a training subunit 544 configured to adjust the parameters of the phoneme re-scoring module to be trained based on the loss function value until the loss function converges, to obtain the speech recognition model.
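Putting the subunits 541 to 544 above together, one possible training step is sketched below, reusing the PhonemeRescorer sketch from earlier; the disclosure does not specify the form of the loss function, so token-level cross-entropy against the first sample tag is assumed purely for illustration:

```python
rescorer = PhonemeRescorer()
optimizer = torch.optim.Adam(rescorer.parameters(), lr=1e-4)

def train_step(enc: torch.Tensor,
               candidate_ids: list[torch.Tensor],
               label_ids: torch.Tensor) -> tuple[torch.Tensor, float]:
    # 1) Re-score candidates; the highest-confidence one is the output sequence.
    scores = [rescorer.score(enc, ids) for ids in candidate_ids]
    output_ids = candidate_ids[max(range(len(scores)), key=scores.__getitem__)]
    # 2) Loss from the difference to the label (assumed here: cross-entropy of
    #    the rescorer's logits at the positions of the label phoneme sequence).
    q = rescorer.embed(label_ids)                  # label_ids: (1, L)
    ctx, _ = rescorer.attn(q, enc, enc)
    logits = rescorer.out(ctx)                     # (1, L, num_phonemes)
    loss = nn.functional.cross_entropy(logits.squeeze(0), label_ids.squeeze(0))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                               # adjust re-scoring parameters
    return output_ids, loss.item()
```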
In one embodiment of the present disclosure, the training device of the speech recognition model in the embodiment of the present disclosure further includes:
a second obtaining unit 550 configured to obtain an initial basic network module to be trained, second training data, and a test set;
a second training unit 560 configured to train the initial base network module with the second training data to obtain a first base network module;
And the parameter adjusting unit 570 is configured to process the data in the test set by using the first basic network module, and adjust the decoding parameters of the first basic network module according to the processing result to obtain the trained basic network module.
In one embodiment of the present disclosure, the data in the test set includes test voice data with label information;
in one embodiment of the present disclosure, the parameter adjustment unit 570 in the embodiment of the present disclosure includes:
an identification subunit 571 configured to identify the test voice data with the first base network module, to obtain a test result, where the data in the test set includes test voice data with tag information;
a freezing subunit 572 configured to freeze model parameters in the first base network module other than the decoding parameters;
an adjustment subunit 573 is configured to adjust the decoding parameter until a difference between the flag information of the test speech data and the test result satisfies a preset condition.
Fig. 11 is a block diagram of a voice recognition apparatus in one embodiment of the present disclosure. As shown in fig. 11, the voice recognition apparatus includes: a receiving unit 600, a voice recognition unit 610, a second text obtaining unit 620, a phoneme conversion unit 630, a re-scoring unit 640, and a wake-up word recognition unit 650;
A receiving unit 600 configured to acquire voice data;
a voice recognition unit 610 configured to process the voice data by using a basic network module in a voice recognition model to obtain a feature coding sequence and a decoding result of the voice data, where the voice recognition model is trained by the training method of the speech recognition model in any of the above embodiments;
a second text obtaining unit 620 configured to obtain candidate text data of the voice data based on the decoding result;
a phoneme conversion unit 630 configured to convert the candidate text data into a phoneme sequence to obtain a candidate phoneme sequence;
a re-scoring unit 640 configured to process the feature coding sequence of the speech data and the candidate phoneme sequence with a phoneme re-scoring module of the speech recognition model to determine a phoneme sequence;
the wake word recognition unit 650 is configured to determine whether wake words are included in the speech data based on the phoneme sequence.
The training device for the speech recognition model in the embodiment of the present disclosure corresponds to the embodiment of the training method for the speech recognition model in the present disclosure, and the relevant contents may be referred to each other and will not be described herein again. The voice recognition device in the embodiment of the present disclosure corresponds to the embodiment of the foregoing voice recognition method in the present disclosure, and the relevant content may be referred to each other and will not be described herein again.
For the beneficial technical effects of the exemplary embodiments of the training device of the speech recognition model and of the speech recognition device of the present disclosure, reference may be made to the corresponding beneficial technical effects in the exemplary method sections above, which are not repeated here.
Exemplary electronic device
Fig. 12 is a block diagram of an electronic device 700 according to an embodiment of the present disclosure. The electronic device 700 includes at least one processor 710 and a memory 720.
The processor 710 may be a central processing unit (CPU) or another form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 700 to perform desired functions.
The memory 720 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory (cache). The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 710 may execute the one or more computer program instructions to implement the training method of the speech recognition model, the speech recognition method, and/or other desired functions of the various embodiments of the present disclosure above.
In one example, the electronic device 700 may further include: an input device 730 and an output device 740, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
The input device 730 may also include, for example, a keyboard, mouse, etc.
The output device 740 may output various information to the outside and may include, for example, a display, a speaker, a printer, and a communication network and remote output devices connected thereto.
Of course, for simplicity, only some of the components of the electronic device 700 that are relevant to the present disclosure are shown in fig. 12; components such as buses and input/output interfaces are omitted. In addition, the electronic device 700 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer readable storage Medium
In addition to the methods and apparatus described above, embodiments of the present disclosure may also provide a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps in the training method, the speech recognition method of the speech recognition model of the various embodiments of the present disclosure described in the "exemplary methods" section above.
The computer program product may carry program code for performing the operations of the embodiments of the present disclosure, written in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++ and conventional procedural programming languages such as the "C" language. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium, having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps in the training method, the speech recognition method of the speech recognition model of the various embodiments of the present disclosure described in the "exemplary methods" section above.
The computer-readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present disclosure have been described above in connection with specific embodiments, but the advantages, benefits, effects, etc. mentioned in this disclosure are merely examples and are not to be considered as necessarily possessed by the various embodiments of the present disclosure. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, since the disclosure is not necessarily limited to practice with the specific details described.
Various modifications and alterations to this disclosure may be made by those skilled in the art without departing from the spirit and scope of the application. Thus, the present disclosure is intended to include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. A method of training a speech recognition model, comprising:
the method comprises the steps of obtaining a model to be trained and first training data, wherein the model to be trained comprises a basic network module obtained through pre-training and a phoneme re-scoring module to be trained, the first training data comprises first sample voice and a first sample tag, the first sample voice comprises a wake-up word, and the first sample tag is a phoneme sequence of the first sample voice;
The basic network module is utilized to encode the first sample voice to obtain a characteristic encoding sequence of the first sample voice, and the characteristic encoding sequence of the first sample voice is decoded to obtain a decoding result;
based on the decoding result, acquiring text data of the first sample voice;
converting the text data of the first sample voice into a phoneme sequence to obtain a phoneme sequence of the first sample voice;
training the phoneme re-scoring module to be trained based on the feature coding sequence of the first sample voice, the phoneme sequence of the first sample voice and the first sample tag until a preset training ending condition is met, and obtaining a voice recognition model from the to-be-trained model.
2. The method of claim 1, wherein the basic network module comprises a coding sub-module and a connectionist temporal classification sub-module;
the step of performing coding processing on the first sample voice by using the basic network module to obtain a feature coding sequence of the first sample voice, and performing decoding processing on the feature coding sequence of the first sample voice to obtain a decoding result, including:
extracting the audio features of the first sample voice by using the coding sub-module, and encoding the extracted audio features to obtain a feature coding sequence of the first sample voice;
and decoding the feature coding sequence of the first sample voice by using the connectionist temporal classification sub-module to obtain a decoding result, wherein the decoding result comprises: a plurality of candidate paths and acoustic scores of words in each candidate path;
the obtaining text data of the first sample voice based on the decoding result includes:
acquiring the acoustic confidence of each candidate path based on the acoustic score of each word in the candidate path;
and based on the acoustic confidence of each candidate path, acquiring n texts corresponding to the n candidate paths with the highest acoustic confidence as the text data of the first sample voice;
the converting the text data of the first sample voice into a phoneme sequence to obtain the phoneme sequence of the first sample voice comprises the following steps:
and respectively converting the n texts into phoneme sequences, and taking the obtained n phoneme sequences as the phoneme sequences of the first sample voice.
3. The method of claim 2, wherein training the phoneme re-scoring module to be trained based on the feature coding sequence of the first sample speech, the phoneme sequence of the first sample speech, and the first sample tag until a preset training end condition is met, to obtain a speech recognition model, comprises:
re-scoring the n phoneme sequences respectively by using the phoneme re-scoring module to be trained, and taking the re-scoring results as the confidence levels of the n phoneme sequences;
based on the confidence degrees of the n phoneme sequences, determining an output phoneme sequence with the highest confidence degree from the n phoneme sequences;
determining a loss function value based on a difference between the output phoneme sequence and the first sample tag;
and adjusting parameters of the phoneme re-scoring module to be trained based on the loss function value until the loss function converges, so as to obtain the speech recognition model.
4. A method according to any one of claims 1 to 3, wherein the basic network module is trained by:
acquiring an initial basic network module to be trained, second training data and a test set;
training the initial basic network module by using the second training data to obtain a first basic network module;
and processing the data in the test set by using the first basic network module, and adjusting decoding parameters of the first basic network module according to a processing result to obtain the trained basic network module.
5. The method of claim 4, wherein the data in the test set comprises test voice data with label information;
The processing the data in the test set by using the first basic network module, and adjusting decoding parameters of the first basic network module according to a processing result to obtain the trained basic network module, including:
identifying the test voice data by using the first basic network module to obtain a test result;
freezing the model parameters in the first basic network module other than the decoding parameters;
and adjusting the decoding parameters until the difference between the label information of the test voice data and the test result satisfies a preset condition.
6. A method of speech recognition, comprising:
acquiring voice data;
processing the voice data by utilizing a basic network module in a voice recognition model to obtain a feature coding sequence and a decoding result of the voice data, wherein the voice recognition model is obtained by training according to the training method of the voice recognition model of any one of claims 1 to 5;
based on the decoding result, acquiring candidate text data of the voice data;
converting the candidate text data into a phoneme sequence to obtain a candidate phoneme sequence;
processing the feature coding sequence and the candidate phoneme sequence by using a phoneme re-scoring module of the speech recognition model to determine a phoneme sequence;
Based on the phoneme sequence, it is determined whether a wake word is included in the speech data.
7. A training device for a speech recognition model, comprising:
the training device comprises a first acquisition unit, a first training unit and a second acquisition unit, wherein the first acquisition unit is configured to acquire a model to be trained and first training data, the model to be trained comprises a basic network module obtained through pre-training and a phoneme re-scoring module to be trained, the first training data comprises first sample voice and a first sample tag, the first sample voice comprises a wake-up word, and the first sample tag is a phoneme sequence of the first sample voice;
the identification unit is configured to encode the first sample voice by utilizing the basic network module to obtain a characteristic encoding sequence of the first sample voice, and decode the characteristic encoding sequence of the first sample voice to obtain a decoding result;
a first text obtaining unit configured to obtain text data of the first sample speech based on the decoding result;
a conversion unit configured to convert text data of the first sample speech into a phoneme sequence, resulting in a phoneme sequence of the first sample speech;
the first training unit is configured to train the phoneme re-scoring module to be trained based on the feature coding sequence of the first sample voice, the phoneme sequence of the first sample voice, and the first sample label until a preset training ending condition is met, to obtain a voice recognition model from the model to be trained.
8. A speech recognition apparatus, comprising:
a receiving unit configured to acquire voice data;
a voice recognition unit configured to process the voice data by using a basic network module in a voice recognition model to obtain a feature coding sequence and a decoding result of the voice data, wherein the voice recognition model is obtained by training according to the training method of the voice recognition model of any one of claims 1 to 5;
a second text obtaining unit configured to obtain candidate text data of the voice data based on the decoding result;
a phoneme conversion unit configured to convert the candidate text data into a phoneme sequence to obtain a candidate phoneme sequence;
a re-scoring unit configured to process the feature coding sequence and the candidate phoneme sequence with a phoneme re-scoring module of the voice recognition model to determine a phoneme sequence;
and a wake-up word recognition unit configured to determine whether a wake-up word is included in the voice data based on the phoneme sequence.
9. A computer readable storage medium storing a computer program for performing the method of any one of the preceding claims 1-6.
10. An electronic device, the electronic device comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method of any of the preceding claims 1-6.
CN202311125742.2A 2023-09-01 2023-09-01 Training method and device of voice recognition model, and voice recognition method and device Pending CN116994570A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311125742.2A CN116994570A (en) 2023-09-01 2023-09-01 Training method and device of voice recognition model, and voice recognition method and device


Publications (1)

Publication Number Publication Date
CN116994570A true CN116994570A (en) 2023-11-03

Family

ID=88532184



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination