WO2022143768A1 - Speech recognition method and apparatus (语音识别方法及装置) - Google Patents

Speech recognition method and apparatus (语音识别方法及装置)

Info

Publication number
WO2022143768A1
Authority
WO
WIPO (PCT)
Prior art keywords
phoneme
neural network
network model
prediction result
recognized
Prior art date
Application number
PCT/CN2021/142470
Other languages
English (en)
French (fr)
Inventor
尹旭贤
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to EP21914500.0A (EP4250285A4)
Priority to US18/258,316 (US20240038223A1)
Publication of WO2022143768A1

Classifications

    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L15/063 Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/26 Speech to text systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G06N3/02 Neural networks
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06N3/09 Supervised learning
    • G06N3/096 Transfer learning

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular, to a speech recognition method and device.
  • Speech recognition, also known as automatic speech recognition (Automatic Speech Recognition, ASR), is an important means of human-computer interaction and is applied in many different fields.
  • Many scenarios require speech recognition technology, such as translation between speech in different languages, voice interaction between intelligent electronic devices and users, and conversion of real-time speech signals into text in instant messaging software.
  • Embodiments of the present application provide a speech recognition method and apparatus.
  • an embodiment of the present application provides a speech recognition method, the method comprising:
  • the terminal device inputs the phoneme to be recognized into the first multi-task neural network model and uses the first multi-task neural network model to output a first prediction result, where the first prediction result includes a character prediction result and a punctuation prediction result corresponding to the phoneme to be recognized;
  • the terminal device displays at least a part of the first prediction result on the display screen of the terminal device according to the first prediction result.
  • the first multi-task neural network model may be deployed on a terminal side (e.g., on a terminal device) or a cloud side.
  • by constructing a neural network model for simultaneously predicting the characters and punctuation corresponding to phonemes (that is, the first multi-task neural network model, where "multi-task" means that the model performs both the task of predicting characters corresponding to phonemes and the task of predicting punctuation corresponding to phonemes), the characters and punctuation corresponding to a phoneme can be predicted at the same time.
  • the phoneme (vector) converted from the speech to be recognized is used as the input of the neural network model for forward inference, and the characters and punctuation corresponding to the phoneme can be output at the same time; the neural network model is small and can be deployed on the device side.
  • the terms "simultaneous", "simultaneous output", etc. used herein mean that two kinds of information (such as character information corresponding to phonemes and punctuation information corresponding to phonemes) can be obtained from the output of the neural network model, rather than only one kind; they do not limit the temporal order in which the two kinds of information are obtained. In other words, "simultaneous" as used herein does not require the exact same moment in time. An illustrative model structure is sketched below.
  • the first multi-task neural network model is obtained by using training samples to train the second multi-task neural network model, and the training samples include: sample sentences, where the sample sentences include characters, and the training samples further include: the phonemes and punctuation corresponding to the characters in the sample sentences.
  • the second multi-task neural network model may be deployed on a terminal side (e.g., on a terminal device) or a cloud side.
  • by constructing a neural network model for simultaneously predicting the characters and punctuation corresponding to phonemes (i.e., the second multi-task neural network model) and constructing a training sample set to train it, the trained neural network model (i.e., the first multi-task neural network model) is obtained.
  • word segmentation processing is not required during training; the phoneme (vector) converted from the speech to be recognized is used as the input of the trained neural network model for forward inference, and the characters and punctuation corresponding to the phoneme can be output at the same time.
  • the length of the character in the sample sentence is the same as the length of the phoneme and the length of the punctuation.
  • after phonetic transcription, the length of the characters in the sample sentence is aligned with the length of the phonemes and the length of the punctuation; after the neural network model is trained using a training sample set constructed in this way, the network model can perform phoneme-to-character conversion and punctuation prediction at the same time, so the predicted character and punctuation results can be output simultaneously (an illustrative aligned sample is sketched below).
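  • As a hypothetical illustration of such an aligned sample (the sentence, the pinyin-style phoneme symbols, and the blank marker are assumptions for illustration only), the three sequences share the same length position by position:

```python
# Hypothetical aligned training sample: index i of each list describes position i.
chars    = ["你",   "好",   "世",   "界"]     # characters of the sample sentence "你好，世界。"
phonemes = ["ni3",  "hao3", "shi4", "jie4"]   # one phoneme per character after phonetic transcription
puncts   = ["",     "，",   "",     "。"]      # punctuation following each character, blank ("") if none

assert len(chars) == len(phonemes) == len(puncts)  # equal lengths enable joint prediction
```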
  • the terminal device inputs the phoneme to be recognized into the first multi-task neural network model and uses the first multi-task neural network model to output the first prediction result, including: the terminal device cyclically sends the phonemes to be recognized into the first multi-task neural network model, and uses the first multi-task neural network model to output the first prediction result based on the length of the currently input phonemes to be recognized.
  • in this way, the prediction result of the phoneme to be recognized refers to both the preceding phonemes and the following phonemes, which improves the accuracy of prediction.
  • the terminal device cyclically sends the phonemes to be recognized into the first multi-task neural network model, and uses the first multi-task neural network model to output the first prediction result based on the length of the currently input phonemes to be recognized, including:
  • before all the phonemes to be recognized have been input into the first multi-task neural network model, if the length of the currently input phonemes is less than the receptive field, the terminal device continues to input the next phoneme;
  • before all the phonemes to be recognized have been input into the first multi-task neural network model, if the length of the currently input phonemes is not less than the receptive field, the terminal device obtains the second prediction result of the first phoneme of the currently input phonemes according to the characters and punctuation of the currently input phonemes, and stores the second prediction result; the terminal device then continues to input the feature vector of the first phoneme, the phonemes other than the first phoneme among the currently input phonemes, and the next phoneme of the phonemes to be recognized into the first multi-task neural network model;
  • when all the phonemes to be recognized have been input into the first multi-task neural network model, the terminal device obtains the second prediction result of the currently input phonemes according to the characters and punctuation of the currently input phonemes;
  • if there is no stored second prediction result, the second prediction result of the currently input phonemes is used as the first prediction result of the phonemes to be recognized;
  • if there is a stored second prediction result, the first prediction result of the phonemes to be recognized is obtained according to the second prediction result of the currently input phonemes and the stored second prediction result.
  • in this way, the phonemes to be recognized output by the acoustic model are cyclically sent into the first multi-task neural network model of the streaming network structure, so that the prediction result of each phoneme refers to both the preceding and the following phonemes, improving prediction accuracy; a minimal sketch of this loop follows.
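  • The sketch below assumes a `model` callable that returns per-position character and punctuation predictions for a list of phonemes; the application additionally carries the first phoneme's feature vector forward, which the sketch omits for brevity:

```python
def streaming_decode(model, phonemes, receptive_field):
    """Sketch of the streaming loop: each phoneme's result is produced only after
    the window holds enough following phonemes, so context on both sides is used."""
    window, results = [], []
    for p in phonemes:
        window.append(p)
        if len(window) < receptive_field:
            continue                              # window not full yet: read the next phoneme
        chars, puncts = model(window)             # predict for the current window
        results.append((chars[0], puncts[0]))     # store the result for the first phoneme
        window = window[1:]                       # slide the window and continue with the rest
    if window:                                    # all phonemes have been fed: flush the tail
        chars, puncts = model(window)
        results.extend(zip(chars, puncts))
    return results
```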
  • the first multi-task neural network model is a non-streaming network structure
  • using the first multi-task neural network model to output the first prediction result includes:
  • the first prediction result is output based on the relationship between the total length of the phonemes to be recognized and the phoneme length threshold using the first multi-task neural network model.
  • using the first multi-task neural network model to output the first prediction result based on the relationship between the total length of the phonemes to be recognized and the phoneme length threshold includes:
  • if the total length of the phonemes to be recognized is less than the phoneme length threshold, the first multi-task neural network model is used to output the first prediction result according to all the phonemes to be recognized;
  • if the total length of the phonemes to be recognized is not less than the phoneme length threshold, then before all the phonemes to be recognized have been input into the first multi-task neural network model: if the length of the currently input phonemes is less than the phoneme length threshold, the terminal device continues to input the next phoneme; if the length of the currently input phonemes is not less than the phoneme length threshold, the terminal device obtains the second prediction result of the first phoneme of the currently input phonemes according to the characters and punctuation of the currently input phonemes and stores the second prediction result, and then continues to input the phonemes other than the first phoneme among the currently input phonemes and the next phoneme of the phonemes to be recognized into the first multi-task neural network model;
  • if the total length of the phonemes to be recognized is not less than the phoneme length threshold, then when all the phonemes to be recognized have been input into the first multi-task neural network model, the second prediction result of the currently input phonemes is obtained according to the characters and punctuation of the currently input phonemes;
  • if there is no stored second prediction result, the second prediction result of the currently input phonemes is used as the first prediction result of the phonemes to be recognized;
  • if there is a stored second prediction result, the first prediction result of the phonemes to be recognized is obtained according to the second prediction result of the currently input phonemes and the stored second prediction result.
  • with the non-streaming network structure, phonemes whose results have already been predicted do not need to be re-input into the network model.
  • the non-streaming network structure also does not need to cache the predicted historical results, which reduces memory usage, can further reduce the size of the neural network model, and makes it easy to deploy on the device side; a sketch of this path follows.
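  • A corresponding sketch of the non-streaming path (same assumed `model` callable; the threshold value is supplied by the caller): inputs shorter than the threshold are decoded in a single pass, and a finished phoneme is neither re-input nor cached:

```python
def non_streaming_decode(model, phonemes, length_threshold):
    """Sketch: a short input is decoded in one forward pass; a long input is decoded
    window by window, and a finished phoneme is neither re-input nor cached."""
    if len(phonemes) < length_threshold:
        chars, puncts = model(phonemes)           # single pass over all phonemes to be recognized
        return list(zip(chars, puncts))
    window, results = [], []
    for p in phonemes:
        window.append(p)
        if len(window) < length_threshold:
            continue
        chars, puncts = model(window)
        results.append((chars[0], puncts[0]))     # result of the first phoneme of the window
        window = window[1:]                       # the first phoneme is not input again
    if window:
        chars, puncts = model(window)
        results.extend(zip(chars, puncts))
    return results
```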
  • the embodiments of the present application provide a method for training a neural network model, the method comprising:
  • the training sample includes: a sample sentence, the sample sentence includes characters, and the training sample further includes: phonemes and punctuations corresponding to the characters in the sample sentence;
  • both the second multi-task neural network model and the first multi-task neural network model can output the first prediction result and display at least part of the first prediction result, where the first prediction result includes a character prediction result and a punctuation prediction result.
  • a neural network model for simultaneously predicting characters and punctuation corresponding to phonemes is constructed, and a training sample set is constructed to train the neural network model, and the trained neural network model is obtained.
  • word segmentation processing is not required; the phoneme (vector) converted from the speech to be recognized is used as the input of the trained neural network model for forward inference, and the characters and punctuation corresponding to the phoneme can be output at the same time; the neural network model is small and can be deployed on the device side.
  • constructing training samples may include:
  • the characters in the sample sentence are phonetically transcribed to obtain the phonemes corresponding to the characters, and the phonemes corresponding to the characters are aligned with the characters and punctuation.
  • the length of the character in the sample sentence is the same as the length of the phoneme and the length of the punctuation.
  • the phoneme corresponding to the character is aligned with the character and punctuation, including:
  • for a polyphonic character in Chinese, one phoneme is chosen from the multiple phonemes corresponding to the polyphonic character as its phoneme; that is to say, after alignment, the phoneme corresponding to a Chinese polyphonic character is any one of the multiple phonemes corresponding to that character;
  • for English characters, an alignment character is added to the character so that its length matches the length of the corresponding phonemes; the aligned English character thus includes an alignment character, and the length of the aligned English character is the same as the length of the phonemes corresponding to the English character; if no punctuation follows a character, the punctuation corresponding to that character is set to blank, so that the length of the punctuation is aligned with the length of the characters; that is, for characters without punctuation before alignment, the aligned punctuation is blank.
  • after phonetic transcription, the length of the characters in the sample sentence is aligned with the length of the phonemes and the length of the punctuation; after the neural network model is trained using a training sample set constructed in this way, the network model can perform phoneme-to-character conversion and punctuation prediction at the same time, so the predicted character and punctuation results can be output simultaneously (a sketch of this alignment step follows).
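  • A minimal sketch of this alignment step (the pad symbol "_", the dictionary format, and the function name are assumptions; the application does not prescribe a particular implementation):

```python
def align_sample(chars, puncts, phonetic_dict, pad="_"):
    """Sketch of building one aligned sample: chars[i] is followed by puncts[i]
    ('' if no punctuation); phonetic_dict[ch] lists candidate pronunciations."""
    a_chars, a_phons, a_puncts = [], [], []
    for ch, pu in zip(chars, puncts):
        pron = phonetic_dict[ch][0]                           # polyphone: keep any one candidate
        phones = pron if isinstance(pron, list) else [pron]
        a_phons.extend(phones)
        a_chars.extend([ch] + [pad] * (len(phones) - 1))      # pad the character side (English words)
        a_puncts.extend([""] * (len(phones) - 1) + [pu])      # blank unless punctuation follows
    assert len(a_chars) == len(a_phons) == len(a_puncts)      # the three sequences stay the same length
    return a_chars, a_phons, a_puncts
```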
  • the training sample is used to train the second multi-task neural network model to obtain the first multi-task neural network model, including:
  • the parameters of the second multi-task neural network model are adjusted according to the weighted cross-entropy loss to obtain the trained first multi-task neural network model.
  • the training method of the multi-task neural network model of the present application can realize the training of the tasks of character prediction and punctuation prediction at the same time.
  • the training method of the multi-task neural network model of the present application can also realize the training of multiple language recognition (prediction) tasks.
  • the multi-task neural network model obtained by the training method of the embodiments of the present application can simultaneously predict multiple languages and punctuation, and the multi-task neural network model is small compared with a traditional acoustic model and can be deployed on the device side; a sketch of one training step is given below.
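  • A minimal sketch of one such training step (assuming PyTorch and a model with a character head and a punctuation head as above; the weight values 0.7/0.3 are illustrative assumptions, not values from the application):

```python
import torch.nn.functional as F

def training_step(model, optimizer, phoneme_ids, char_targets, punct_targets,
                  w_char=0.7, w_punct=0.3):
    """Sketch: combine the two cross-entropy losses with weights and backpropagate."""
    char_logits, punct_logits = model(phoneme_ids)              # (batch, seq, classes) each
    char_loss = F.cross_entropy(char_logits.flatten(0, 1), char_targets.flatten())
    punct_loss = F.cross_entropy(punct_logits.flatten(0, 1), punct_targets.flatten())
    loss = w_char * char_loss + w_punct * punct_loss            # weighted cross-entropy loss
    optimizer.zero_grad()
    loss.backward()                                             # adjust parameters by backpropagation
    optimizer.step()
    return loss.item()
```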
  • an embodiment of the present application provides a speech recognition device, the device comprising:
  • an input module for inputting the phoneme to be recognized into the first multi-task neural network model
  • an inference module configured to use the first multi-task neural network model to output a first prediction result, where the first prediction result includes a character prediction result and a punctuation prediction result corresponding to the phoneme to be recognized;
  • a display module configured to display at least a part of the first prediction result on the display screen of the terminal device according to the first prediction result.
  • the speech recognition apparatus of the embodiments of the present application constructs a neural network model for simultaneously predicting the characters and punctuation corresponding to phonemes, uses the phoneme (vector) converted from the speech to be recognized as the input of the neural network model, and performs forward inference; the characters and punctuation corresponding to the phoneme can be output at the same time, and the neural network model is small and can be deployed on the device side.
  • the first multi-task neural network model is obtained by using training samples to train the second multi-task neural network model, and the training samples include: sample sentences, where the sample sentences include characters, and the training samples further include: the phonemes and punctuation corresponding to the characters in the sample sentences.
  • by constructing a neural network model for simultaneously predicting the characters and punctuation corresponding to phonemes (i.e., the second multi-task neural network model) and constructing a training sample set to train it, the trained neural network model (i.e., the first multi-task neural network model) is obtained.
  • word segmentation processing is not required during training; the phoneme (vector) converted from the speech to be recognized is used as the input of the trained neural network model for forward inference, and the characters and punctuation corresponding to the phoneme can be output at the same time.
  • the length of the characters in the sample sentence is the same as the length of the phoneme and the length of the punctuation.
  • after phonetic transcription, the length of the characters in the sample sentence is aligned with the length of the phonemes and the length of the punctuation; after the neural network model is trained using a training sample set constructed in this way, the network model can perform phoneme-to-character conversion and punctuation prediction at the same time, so the predicted character and punctuation results can be output simultaneously.
  • the first multi-task neural network model is a streaming network structure
  • the input module includes: a first input unit configured to cyclically send the phonemes to be recognized into the first multi-task neural network model;
  • the inference module includes: a first inference unit configured to use the first multi-task neural network model to output the first prediction result based on the length of the currently input phonemes to be recognized.
  • in this way, the prediction result of the phoneme to be recognized refers to both the preceding phonemes and the following phonemes, which improves the accuracy of prediction.
  • the first input unit is further configured to: before all the phonemes to be recognized have been input into the first multi-task neural network model, if the length of the currently input phonemes is less than the receptive field, continue to input the next phoneme; if the length of the currently input phonemes is not less than the receptive field, the first inference unit is used to obtain the second prediction result of the first phoneme of the currently input phonemes according to the characters and punctuation of the currently input phonemes and store the second prediction result; the first input unit is also used to continue to input the feature vector of the first phoneme, the phonemes other than the first phoneme among the currently input phonemes, and the next phoneme of the phonemes to be recognized into the first multi-task neural network model.
  • the first inference unit is also used for: when all the phonemes to be recognized have been input into the first multi-task neural network model, obtaining the second prediction result of the currently input phonemes according to the characters and punctuation of the currently input phonemes; if there is no stored second prediction result, the second prediction result of the currently input phonemes is used as the first prediction result of the phonemes to be recognized; if there is a stored second prediction result, the first prediction result of the phonemes to be recognized is obtained according to the second prediction result of the currently input phonemes and the stored second prediction result.
  • in this way, the phonemes to be recognized output by the acoustic model are cyclically sent into the first multi-task neural network model of the streaming network structure, so that the prediction result of each phoneme refers to both the preceding and the following phonemes, improving prediction accuracy.
  • the first multi-task neural network model is a non-streaming network structure
  • the inference module includes: a second inference unit configured to use the first multi-task neural network model to output the first prediction result based on the relationship between the total length of the phonemes to be recognized and the phoneme length threshold.
  • the second inference unit is further configured to: if the total length of the phonemes to be recognized is less than the phoneme length threshold, use the first multi-task neural network model to output the first prediction result according to all the phonemes to be recognized;
  • if the total length of the phonemes to be recognized is not less than the phoneme length threshold, then before all the phonemes to be recognized have been input into the first multi-task neural network model: if the length of the currently input phonemes is less than the phoneme length threshold, continue to input the next phoneme; if the length of the currently input phonemes is not less than the phoneme length threshold, obtain the second prediction result of the first phoneme of the currently input phonemes according to the characters and punctuation of the currently input phonemes and store the second prediction result, and continue to input the phonemes other than the first phoneme among the currently input phonemes and the next phoneme of the phonemes to be recognized into the first multi-task neural network model;
  • if the total length of the phonemes to be recognized is not less than the phoneme length threshold, then when all the phonemes to be recognized have been input into the first multi-task neural network model, the second prediction result of the currently input phonemes is obtained according to the characters and punctuation of the currently input phonemes;
  • if there is no stored second prediction result, the second prediction result of the currently input phonemes is used as the first prediction result of the phonemes to be recognized;
  • if there is a stored second prediction result, the first prediction result of the phonemes to be recognized is obtained according to the second prediction result of the currently input phonemes and the stored second prediction result.
  • with the non-streaming network structure, phonemes whose results have already been predicted do not need to be re-input into the network model.
  • the non-streaming network structure also does not need to cache the predicted historical results, which reduces memory usage, can further reduce the size of the neural network model, and makes it easy to deploy on the device side.
  • embodiments of the present application provide an apparatus for training a neural network model, the apparatus comprising:
  • the training sample includes: a sample sentence, the sample sentence includes characters, and the training sample further includes: phonemes and punctuation corresponding to the characters in the sample sentence;
  • a training module, used for training the second multi-task neural network model with the training samples to obtain the first multi-task neural network model; wherein both the second multi-task neural network model and the first multi-task neural network model can output the first prediction result and display at least a part of the first prediction result, where the first prediction result includes a character prediction result and a punctuation prediction result.
  • the neural network training device of the embodiment of the present application constructs a neural network model for simultaneously predicting characters and punctuation corresponding to phonemes, and constructs a training sample set to train the neural network model, and obtains the trained neural network model.
  • word segmentation processing is not required; the phoneme (vector) converted from the speech to be recognized is used as the input of the trained neural network model for forward inference, and the characters and punctuation corresponding to the phoneme can be output at the same time; the neural network model is small and can be deployed on the device side.
  • the building module includes:
  • an alignment unit, used for performing phonetic transcription of the characters in the sample sentence according to a phonetic dictionary to obtain the phonemes corresponding to the characters, and aligning the phonemes corresponding to the characters with the characters and the punctuation, so that the length of the characters in the sample sentence, the length of the phonemes, and the length of the punctuation are the same.
  • the alignment unit is further configured to:
  • after alignment, the phoneme corresponding to a Chinese polyphonic character is any one of the multiple phonemes corresponding to that character;
  • the aligned English character includes an alignment character, and the length of the aligned English character is the same as the length of the phoneme corresponding to the English character;
  • after phonetic transcription, the length of the characters in the sample sentence is aligned with the length of the phonemes and the length of the punctuation; after the neural network model is trained using a training sample set constructed in this way, the network model can perform phoneme-to-character conversion and punctuation prediction at the same time, so the predicted character and punctuation results can be output simultaneously.
  • the training module includes:
  • a determining unit for inputting the training sample into the second multi-task neural network model, and determining the character probability matrix and the punctuation probability matrix corresponding to the training sample;
  • a first calculation unit configured to calculate the character cross-entropy loss and the punctuation cross-entropy loss respectively according to the character probability matrix and the punctuation probability matrix;
  • the second calculation unit is configured to calculate the weighted cross entropy loss according to the character cross entropy loss, the first weight corresponding to the character cross entropy loss, the punctuation cross entropy loss, and the second weight corresponding to the punctuation cross entropy loss;
  • the adjustment unit is configured to adjust the parameters of the second multi-task neural network model according to the weighted cross-entropy loss to obtain the trained first multi-task neural network model.
  • the training device of the multi-task neural network model of the present application can realize the training of the tasks of character prediction and punctuation prediction at the same time.
  • the training method of the multi-task neural network model of the present application can also realize the training of multiple language recognition (prediction) tasks.
  • the multi-task neural network model obtained by the training device for the multi-task neural network model according to the embodiments of the present application can simultaneously predict multiple languages and punctuation, and the multi-task neural network model is small compared with a traditional acoustic model and can be deployed on the device side.
  • embodiments of the present application provide a speech recognition apparatus, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to implement, when executing the instructions, the speech recognition method of the first aspect or one or more of the multiple possible implementations of the first aspect.
  • embodiments of the present application provide an apparatus for training a neural network model, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to implement, when executing the instructions, the neural network model training method of the second aspect or one or more of the multiple possible implementations of the second aspect.
  • embodiments of the present application provide a non-volatile computer-readable storage medium on which computer program instructions are stored; when the computer program instructions are executed by a processor, the method of the first aspect or one or more of the multiple possible implementations of the first aspect is implemented.
  • FIG. 1 shows an application scenario of a speech recognition method according to an embodiment of the present application.
  • FIG. 2 is a schematic diagram of the composition and structure of the apparatus for training a speech recognition model provided by an embodiment of the present application.
  • FIG. 3 is a block diagram showing a partial structure of a mobile phone provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a software structure of a mobile phone 100 according to an embodiment of the present application.
  • Figure 5a shows a block diagram of a neural network model according to an embodiment of the present application.
  • Figure 5b shows a schematic diagram illustrating an encoder-decoder model according to an example of the present application.
  • Figure 5c shows a schematic diagram illustrating an encoder model according to an example of the present application.
  • FIG. 6 shows a schematic diagram of a process of constructing a training sample set according to an embodiment of the present application.
  • FIG. 7 shows an example of a process of constructing a training sample set according to an embodiment of the present application.
  • FIG. 8 shows a flowchart of a method for training a multi-task neural network model according to an embodiment of the present application.
  • Fig. 9a shows a schematic diagram of an application scenario of speech recognition performed by the terminal device side according to an embodiment of the present application.
  • Fig. 9b shows a schematic diagram of a speech recognition process according to an example of the prior art.
  • FIG. 10 shows a flowchart of a speech recognition method according to an embodiment of the present application.
  • FIG. 11 shows a flowchart of a speech recognition method according to an embodiment of the present application.
  • FIG. 12 shows a flowchart of a speech recognition method according to an embodiment of the present application.
  • FIG. 13 shows a flowchart of a speech recognition method according to an embodiment of the present application.
  • FIG. 14 shows a block diagram of a speech recognition apparatus according to an embodiment of the present application.
  • FIG. 15 shows a block diagram of an apparatus for training a neural network model according to an embodiment of the present application.
  • Punctuation marks are used as part of words to construct training text and dictionary files, and language models are trained to achieve the effect of outputting punctuation marks while outputting text.
  • the acoustic model uses a ternary grammar (trigram) model, which requires word segmentation of sentences during training.
  • the acoustic model uses Gaussian mixture model and hidden Markov model to align phonemes, etc.
  • the processing process is complicated; as a result, the existing acoustic model is large and cannot be deployed on the device side, and because the acoustic model is used for punctuation prediction, it cannot be adjusted according to the context, so the prediction accuracy is not high.
  • therefore, the related speech recognition technology has the technical problems that the model cannot be deployed on the device side and that the accuracy of punctuation prediction using the acoustic model is not high.
  • FIG. 1 shows an application scenario of a speech recognition method according to an embodiment of the present application.
  • the terminal devices include the terminal device 10-1 and the terminal device 10-2.
  • the terminal device is provided with a speech recognition software client, through which the user can input the corresponding sentence to be recognized; the client can also receive the corresponding speech recognition results, display the received speech recognition results to the user, or perform tasks matching the voice commands.
  • the terminal device is connected to the server 200 through the network 300.
  • the network 300 may be a wide area network or a local area network, or a combination of the two, and a wired or wireless link may be used to realize data transmission.
  • this is an example of an application and does not limit this application in any way.
  • the server 200 is configured to deploy and train a speech recognition model, deploy the trained speech recognition model in a corresponding terminal device, and use the deployed speech recognition model to process the voice information collected in the environment.
  • the speech recognition model may be the second multi-task neural network model or the first multi-task neural network model provided in the embodiments of the present application; the speech recognition model before training, deployed on the server 200, may be the second multi-task neural network model, and the trained speech recognition model deployed in the terminal device may be the first multi-task neural network model.
  • Both the second multi-task neural network model and the first multi-task neural network model incorporate multiple tasks that can simultaneously predict characters and punctuation accurately, and the model size is small and can be deployed on the device side.
  • the speech recognition model also needs to be trained, which specifically includes: constructing a training sample, where the training sample includes a sample sentence, the sample sentence includes characters, and the training sample further includes phonemes and punctuation corresponding to the characters in the sample sentence; and using the training samples to train the second multi-task neural network model to obtain the first multi-task neural network model.
  • the speech recognition method provided by the embodiments of the present application is realized based on artificial intelligence; artificial intelligence (Artificial Intelligence, AI) refers to theories, methods, techniques, and application systems that use a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to achieve optimal results.
  • artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can respond in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive discipline, involving a wide range of fields, including both hardware-level technology and software-level technology.
  • the basic technologies of artificial intelligence generally include technologies such as sensors, special artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • in the embodiments of the present application, the artificial intelligence software technologies mainly involved include the above-mentioned speech processing technology and machine learning.
  • speech technology covers speech recognition (Automatic Speech Recognition, ASR) and its key technologies, such as speech signal preprocessing, speech signal frequency domain analysis, speech signal feature extraction, speech signal feature matching/recognition, and speech training.
  • Machine learning is a multi-disciplinary interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in how computers simulate or realize human learning behaviors to acquire new knowledge or skills, and to reorganize existing knowledge structures to continuously improve their performance.
  • Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and its applications are in all fields of artificial intelligence.
  • Machine learning usually includes technologies such as deep learning (Deep Learning); deep learning includes artificial neural networks, such as convolutional neural networks (Convolutional Neural Network, CNN), recurrent neural networks (Recurrent Neural Network, RNN), and deep neural networks (Deep Neural Network, DNN).
  • the speech recognition model training method and speech recognition method provided by this application can be applied to intelligent devices, and the intelligent device can be any kind of device with a voice command recognition function, such as an intelligent terminal or smart home equipment (for example, smart speakers, smart washing machines, etc.).
  • the intelligent device may also be a smart wearable device such as a smart watch, an in-vehicle intelligent central control system that uses voice commands to wake up applets performing different tasks in the terminal, or AI-based smart medical equipment triggered and woken up by voice commands.
  • the structure of the speech recognition model training device will be described in detail below.
  • the speech recognition model training device may be implemented in various forms, such as a dedicated terminal with a speech recognition model training function, or a server provided with a speech recognition model training function, such as the server 200 in FIG. 1.
  • FIG. 2 is a schematic diagram of the composition and structure of a speech recognition model training device provided by an embodiment of the present application. It can be understood that FIG. 2 only shows an exemplary structure of the speech recognition model training device, not its entire structure; part or all of the structure shown in FIG. 2 may be implemented as required.
  • the apparatus for training a speech recognition model includes: at least one processor 201 , a storage unit 202 , a user interface 203 , and at least one network interface 204 .
  • the various components in the speech recognition model training apparatus are coupled together through the bus system 205 .
  • the bus system 205 is used to implement the connection communication between these components.
  • the bus system 205 also includes a power bus, a control bus and a status signal bus.
  • the various buses are labeled as bus system 205 in FIG. 2 .
  • the user interface 203 may include a display, a keyboard, a mouse, a trackball, a click wheel, keys, buttons, a touch pad or a touch screen, and the like.
  • the storage unit 202 may be a volatile memory or a non-volatile memory, and may also include both volatile and non-volatile memory.
  • the storage unit 202 in this embodiment of the present application can store data to support the operation of the terminal device (eg, 10-1). Examples of such data include: any computer programs, such as operating systems and applications, used to operate on the terminal device (eg 10-1).
  • the operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks.
  • Applications can contain various applications.
  • the speech recognition model training apparatus provided by the embodiments of the present application may be implemented by a combination of software and hardware.
  • as an example, the speech recognition model training apparatus provided by the embodiments of the present application may be a processor in the form of a hardware decoding processor, which is programmed to execute the speech recognition model training method provided by the embodiments of the present application.
  • a processor in the form of a hardware decoding processor may adopt one or more Application Specific Integrated Circuits (ASIC, Application Specific Integrated Circuit), DSP, Programmable Logic Device (PLD, Programmable Logic Device), Complex Programmable Logic Device ( CPLD, Complex Programmable Logic Device), Field Programmable Gate Array (FPGA, Field-Programmable Gate Array) or other electronic components.
  • the speech recognition model training apparatus provided by the embodiments of the present application may be directly embodied as a combination of software modules executed by the processor 201; the software modules may be located in a storage medium, the storage medium is located in the storage unit 202, and the processor 201 reads the executable instructions included in the software modules in the storage unit 202 and, in combination with necessary hardware (for example, including the processor 201 and other components connected to the bus 205), completes the speech recognition model training method provided by the embodiments of the present application.
  • the processor 201 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP, Digital Signal Processor), or other programmable logic devices, discrete gates or transistor logic devices , discrete hardware components, etc., where a general-purpose processor may be a microprocessor or any conventional processor, or the like.
  • the apparatus provided in the embodiment of the present application may be directly executed by a processor 201 in the form of a hardware decoding processor, for example, by one or more Application Specific Integrated Circuit (ASIC, Application Specific Integrated Circuit), DSP, Programmable Logic Device (PLD, Programmable Logic Device), Complex Programmable Logic Device (CPLD, Complex Programmable Logic Device), Field Programmable Gate Array (FPGA, Field -Programmable Gate Array) or other electronic components to implement the speech recognition model training method provided by the embodiments of the present application.
  • the storage unit 202 in this embodiment of the present application is used to store various types of data to support the operation of the apparatus for training the speech recognition model. Examples of such data include any executable instructions for operating on the speech recognition model training device.
  • the speech recognition model training apparatus may be implemented in software.
  • FIG. 2 shows the speech recognition model training apparatus stored in the storage unit 202, which may be software in the form of programs and plug-ins and includes a series of modules; as an example of the program stored in the storage unit 202, the speech recognition model training apparatus may include the following software modules: a building module, used to construct training samples, where the training samples include sample sentences, the sample sentences include characters, and the training samples further include phonemes and punctuation corresponding to the characters in the sample sentences; and a training module, used to train the second multi-task neural network model with the training samples to obtain the first multi-task neural network model; wherein both the second multi-task neural network model and the first multi-task neural network model can output the first prediction result and display at least a part of the first prediction result, and the first prediction result includes a character prediction result and a punctuation prediction result.
  • the speech recognition method provided by the embodiments of the present application can be applied to mobile phones, tablet computers, wearable devices, in-vehicle devices, augmented reality (AR)/virtual reality (VR) devices, notebook computers, ultra-mobile personal computers (UMPC), netbooks, personal digital assistants (PDA), and other terminal equipment; it can also be applied to databases, servers, and service response systems based on terminal artificial intelligence to respond to speech recognition requests. The embodiments of this application do not impose any restrictions on the specific type of the terminal device.
  • the terminal device may be a station (STATION, ST) in a WLAN, a cellular phone, a cordless phone, a Session Initiation Protocol (SIP) phone, a Wireless Local Loop (WLL) station, a Personal Digital Assistant (PDA) device, a handheld device with wireless communication capabilities, a computing device or other processing device connected to a wireless modem, a computer, a laptop computer, a handheld communication device, a handheld computing device, and/or other devices for communicating on wireless systems and next-generation communication systems, for example, a mobile terminal in a 5G network or a mobile terminal in a future evolved Public Land Mobile Network (PLMN) network, etc.
  • when the terminal device is a wearable device, the wearable device may also be a general term for devices developed by applying wearable technology to the intelligent design of daily wear, such as glasses, gloves, watches, clothing, and shoes.
  • a wearable device is a portable device that is worn directly on the body or integrated into the user's clothes or accessories, and collects the user's atrial fibrillation signal by attaching to the user's body. A wearable device is not only a hardware device; it also realizes powerful functions through software support, data interaction, and cloud interaction.
  • generalized wearable smart devices include devices that are full-featured, large-sized, and able to realize complete or partial functions without relying on a smartphone, such as smart watches or smart glasses, as well as devices that focus on only a certain type of application function and need to be used in conjunction with other devices such as smartphones, for example, various smart bracelets and smart jewelry for monitoring physical signs.
  • FIG. 3 is a block diagram showing a partial structure of a mobile phone provided by an embodiment of the present application.
  • the mobile phone includes: a radio frequency (Radio Frequency, RF) circuit 110, a memory 120, an input unit 130, a display unit 140, a sensor 150, an audio circuit 160, a near field communication module 170, a processor 180, and a power supply 190 and other components .
  • the structure shown in FIG. 3 does not constitute a limitation on the mobile phone, which may include more or fewer components than shown, combine some components, or have a different arrangement of components.
  • the RF circuit 110 can be used for receiving and sending signals during sending and receiving of information or during a call.
  • the RF circuit includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like.
  • the RF circuitry 110 may also communicate with networks and other devices via wireless communication.
  • the above-mentioned wireless communication can use any communication standard or protocol, including but not limited to Global System of Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Messaging Service (SMS), etc.; the RF circuit 110 can receive voice signals collected by other terminals, recognize the voice signals, and output corresponding text information.
  • the memory 120 can be used to store software programs and modules, and the processor 180 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 120, such as storing the trained real-time speech recognition algorithm in the memory 120.
  • the memory 120 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the like, and the storage data area may store data created by the use of the mobile phone (such as audio data, a phone book, etc.), and the like.
  • the memory 120 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
  • the input unit 130 may be used to receive input numerical or character information, and generate key signal input related to user settings and function control of the mobile phone 100 .
  • the input unit 130 may include a touch panel 131 and other input devices 132 .
  • the touch panel 131, also referred to as a touch screen, can collect the user's touch operations on or near it (operations performed by the user with a finger, a stylus, or any suitable object or accessory on or near the touch panel 131), and drive the corresponding connection device according to a preset program.
  • the display unit 140 may be used to display information input by the user or information provided to the user and various menus of the mobile phone, such as outputting text information after voice recognition.
  • the display unit 140 may include a display panel 141, and optionally, the display panel 141 may be configured in the form of a liquid crystal display (Liquid Crystal Display, LCD), an organic light-emitting diode (Organic Light-Emitting Diode, OLED), and the like.
  • the touch panel 131 may cover the display panel 141. When the touch panel 131 detects a touch operation on or near it, it transmits the operation to the processor 180 to determine the type of the touch event, and the processor 180 then provides a corresponding visual output on the display panel 141 according to the type of the touch event.
  • although the touch panel 131 and the display panel 141 are shown as two independent components to realize the input and output functions of the mobile phone, in some embodiments the touch panel 131 and the display panel 141 can be integrated to realize the input and output functions of the mobile phone.
  • the cell phone 100 may also include at least one sensor 150, such as a light sensor, a motion sensor, and other sensors.
  • the light sensor may include an ambient light sensor and a proximity sensor. The ambient light sensor may adjust the brightness of the display panel 141 according to the brightness of the ambient light, and the proximity sensor may turn off the display panel 141 and/or the backlight when the mobile phone is moved to the ear.
  • as one kind of motion sensor, the accelerometer sensor can detect the magnitude of acceleration in all directions (usually three axes), and can detect the magnitude and direction of gravity when stationary; it can be used for applications that recognize the attitude of the mobile phone, related games, and magnetometer attitude calibration, as well as for vibration-recognition-related functions (such as a pedometer or tapping). Other sensors such as a gyroscope, barometer, hygrometer, thermometer, and infrared sensor may also be configured on the mobile phone, and details are not repeated here.
  • the audio circuit 160, the speaker 161, and the microphone 162 can provide an audio interface between the user and the mobile phone.
  • on one hand, the audio circuit 160 can transmit the electrical signal converted from received audio data to the speaker 161, and the speaker 161 converts it into a sound signal for output; on the other hand, the microphone 162 converts a collected sound signal into an electrical signal, which is received by the audio circuit 160 and converted into audio data. The audio data is then output to the processor 180 for processing and sent, for example, to another mobile phone through the RF circuit 110, or output to the memory 120 for further processing.
  • the terminal device can collect the target voice signal of the user through the microphone 162, and send the converted electrical signal to the processor of the terminal device for voice recognition.
  • the terminal device can also receive target voice signals sent by other devices through the near field communication module 170.
  • for example, the near field communication module 170 integrates a Bluetooth communication module, establishes a communication connection with a wearable device through the Bluetooth communication module, and receives the target speech signal fed back by the wearable device.
  • although FIG. 3 shows the near field communication module 170, it can be understood that it is not a necessary part of the mobile phone 100 and can be omitted as required without changing the essence of the application.
  • the processor 180 is the control center of the mobile phone; it connects the various parts of the entire mobile phone using various interfaces and lines, and performs the various functions of the mobile phone and processes data by running or executing the software programs and/or modules stored in the memory 120 and calling the data stored in the memory 120.
  • the processor 180 may include one or more processing units; preferably, the processor 180 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the above-mentioned modem processor may also not be integrated into the processor 180.
  • the mobile phone 100 also includes a power source 190 (such as a battery) for supplying power to various components.
  • the power source can be logically connected to the processor 180 through a power management system, so as to implement functions such as charging management, discharging management, and power consumption management through the power management system.
  • FIG. 4 is a schematic diagram of a software structure of a mobile phone 100 according to an embodiment of the present application.
  • the Android system is divided into four layers, which are an application layer, an application framework layer (framework, FWK), a system layer, and a hardware abstraction layer.
  • the layers communicate through software interfaces.
  • the application layer may include a series of application packages, and the application packages may include applications such as short message, calendar, camera, video, navigation, gallery, and call.
  • the speech recognition algorithm can be embedded in an application program; the speech recognition process is started through the relevant controls in the application program, and the collected target speech signal is processed to obtain the corresponding text information.
  • the application framework layer provides an application programming interface (API) and a programming framework for the applications of the application layer.
  • the application framework layer may include some predefined functions, such as functions for receiving events sent by the application framework layer.
  • the application framework layer may include a window manager, a resource manager, a notification manager, and the like.
  • a window manager is used to manage window programs.
  • the window manager can get the size of the display screen, determine whether there is a status bar, lock the screen, take screenshots, etc.
  • Content providers are used to store and retrieve data and make these data accessible to applications.
  • the data may include video, images, audio, calls made and received, browsing history and bookmarks, phone book, etc.
  • the resource manager provides various resources for the application, such as localization strings, icons, pictures, layout files, video files and so on.
  • the notification manager enables applications to display notification information in the status bar, which can be used to convey notification-type messages, and can disappear automatically after a brief pause without user interaction. For example, the notification manager is used to notify download completion, message reminders, etc.
  • the notification manager can also display notifications in the status bar at the top of the system in the form of graphs or scroll bar text, such as notifications of applications running in the background, and notifications on the screen in the form of dialog windows. For example, text information is prompted in the status bar, a prompt sound is issued, the electronic device vibrates, and the indicator light flashes.
  • the application framework layer can also include:
  • the view system includes visual controls, such as controls for displaying text, controls for displaying pictures, and the like. View systems can be used to build applications.
  • a display interface can consist of one or more views.
  • the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures.
  • the phone manager is used to provide the communication functions of the mobile phone 100, for example, management of the call status (including connecting, hanging up, and the like).
  • the system layer can include multiple functional modules. For example: sensor service module, physical state recognition module, 3D graphics processing library (eg: OpenGL ES), etc.
  • the sensor service module is used to monitor the sensor data uploaded by various sensors at the hardware layer and determine the physical state of the mobile phone 100;
  • the physical state recognition module is used to analyze and recognize user gestures, faces, etc.
  • the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, compositing, and layer processing.
  • the system layer can also include:
  • the Surface Manager is used to manage the display subsystem and provides a fusion of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files.
  • the media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
  • the hardware abstraction layer is the layer between hardware and software.
  • the hardware abstraction layer may include a display driver, a camera driver, a sensor driver, a microphone driver, etc., and is used to drive the related hardware of the hardware layer, such as a display screen, a camera, a sensor, and a microphone.
  • for example, the microphone module is activated through the microphone driver to collect the user's target voice information, and the subsequent speech recognition process is then performed directly.
  • the speech recognition method provided in the embodiment of the present application may be performed in any of the above-mentioned layers, which is not limited herein.
  • a neural network model for simultaneously predicting characters and punctuation corresponding to phonemes is constructed, and a training sample set is constructed to train the neural network model, and the trained neural network model is obtained.
  • the training process does not require word segmentation. The phonemes (vectors) converted from the speech to be recognized are used as the input of the trained neural network model, forward inference is performed, and the characters and punctuation corresponding to the phonemes are output at the same time. The neural network model is small in size and can be deployed on the device side.
  • the "simultaneous”, “simultaneous output”, etc. expressed in this article can be understood as being able to obtain two kinds of information (such as character information corresponding to phonemes and punctuation information corresponding to phonemes) from the output of the neural network model, not just Obtaining one kind of information does not limit the time sequence relationship in which the two kinds of information are obtained. In other words, the "simultaneous” described in this article does not limit the time to be the same moment in time.
  • Figure 5a shows a block diagram of a neural network model according to an embodiment of the present application.
  • the input of the neural network model is the label sequence corresponding to the phoneme obtained by converting the speech to be recognized.
  • the neural network model can perform feature extraction on the label sequence. Specifically, the corresponding feature vector can be extracted from the label sequence by an embedding layer, and the characters and punctuation corresponding to each phoneme are then predicted according to the feature vector and output at the same time.
  • the neural network model can complete multiple tasks at the same time, so it is hereinafter referred to as a multi-task neural network model.
  • when the neural network model predicts the characters and punctuation corresponding to the phonemes according to the feature vector, a classifier can be used to predict the character and punctuation corresponding to each phoneme, so that the characters and punctuation are output at the same time; such a multi-task neural network model that simultaneously realizes character and punctuation prediction can be deployed on the device side.
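• As a non-limiting illustration only, the overall shape of such a multi-task model can be sketched as follows (a minimal PyTorch-style sketch; the class name, the layer sizes, and the choice of a 1-D convolutional encoder are assumptions made for illustration, not the network of the embodiments):

```python
import torch
import torch.nn as nn

class MultiTaskPhonemeModel(nn.Module):
    """Hypothetical sketch: one shared encoder and two classification heads (characters, punctuation)."""
    def __init__(self, num_phonemes, num_chars, num_puncts, emb_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(num_phonemes, emb_dim)        # phoneme label sequence -> feature vectors
        self.encoder = nn.Conv1d(emb_dim, emb_dim, kernel_size=3, padding=1)
        self.char_head = nn.Linear(emb_dim, num_chars)              # character classifier
        self.punct_head = nn.Linear(emb_dim, num_puncts)            # punctuation classifier

    def forward(self, phoneme_labels):                              # (B, U) integer labels
        x = self.embedding(phoneme_labels)                          # (B, U, emb_dim)
        x = self.encoder(x.transpose(1, 2)).transpose(1, 2)         # (B, U, emb_dim)
        return self.char_head(x), self.punct_head(x)                # one logit matrix per task, same length U
```

• Applying a softmax over the last dimension of each output yields a character probability matrix and a punctuation probability matrix of the same length as the input, which is what allows characters and punctuation to be output at the same time.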
  • punctuation may include blank, comma, period, question mark, exclamation mark, and the like.
  • commas, periods, question marks and exclamation marks can also be divided into two forms: Chinese full-width and English half-width.
  • a Chinese character may have multiple pinyin readings, and an English word may correspond to multiple English phonemes, which leads to inconsistent lengths of phonemes and characters; moreover, the number of punctuation marks in a sentence may differ from the number of characters and phonemes. That is, the lengths of the input sequence and the output sequence are inconsistent, so the prior art cannot output the prediction results at the same time. An encoder-decoder structure can solve the problem of inconsistent lengths of the input sequence and the output sequence, but the current output must depend on the previous output.
  • Figure 5b shows a schematic diagram illustrating an encoder-decoder model according to an example of the present application.
  • the input sequence of the encoder in the encoder-decoder model is "X1X2X3X4"; the encoder encodes it into a vector C and outputs the vector to the decoder, and the decoder decodes it to obtain an output sequence "Y1Y2Y3" of length 3. Before "Y2" is output, "Y1" must be output first, so "Y1Y2Y3" cannot be output at the same time, which results in poor real-time output of recognition results.
  • Figure 5c shows a schematic diagram illustrating an encoder model according to an example of the present application.
  • the encoder model shown in Figure 5c can include an encoder and a Softmax classifier, where the encoder model is used to encode the input sequence to obtain a feature vector C, and the Softmax classifier can obtain an output sequence according to the feature vector C.
  • the input sequence "X1X2X3X4"
  • "Y1Y2Y3Y4" can be output at the same time, but it can only be applied to the scene where the input sequence and the output sequence have the same length.
  • the embodiment of the present application provides a method for constructing a training sample set.
  • the method for constructing a training sample set in the embodiment of the present application aligns the length of the characters in a sample sentence with the length of the phonemes obtained by phonetic transcription and with the length of the punctuation.
  • the encoder model shown in Fig. 5c can be used to realize the conversion of phonemes to characters and punctuation.
  • in this way, the neural network model can simultaneously perform phoneme-to-character conversion and punctuation prediction in Chinese and English, and solves the technical problem in the above-mentioned related technologies that, when the lengths of the input and the output are different, the results cannot be output at the same time.
  • Embodiments of the present application also provide a method for training a multi-task neural network model, in which training samples in a training sample set are input into a second multi-task neural network model for training to obtain a trained first multi-task neural network model. The second multi-task neural network model and the first multi-task neural network model incorporate punctuation prediction and character prediction: while characters are generated in real time, punctuation is also generated in real time, realizing simultaneous multi-task training. The first multi-task neural network model is small in size and can be deployed on the device side.
  • the speech recognition method of the present application will be described below according to the processes of training sample set construction, neural network model training, and neural network model inference.
  • the neural network model before training is referred to as the second multi-task neural network model
  • the neural network model obtained after training is referred to as the first multi-task neural network model.
  • “first” and “second” are only for distinguishing different features, and do not indicate a specific order or size relationship.
  • FIG. 6 shows a schematic diagram of a process of constructing a training sample set according to an embodiment of the present application.
  • a phonetic dictionary can be constructed, and the constructed phonetic dictionary may include a dictionary and a phoneme-character mapping table.
  • the dictionary may include one language or multiple languages, for example, may include a Chinese dictionary or an English dictionary, or a Chinese-English mixed dictionary, or a mixed dictionary of other languages, which is not limited in this application.
  • for one language, the dictionary may also include multiple different sub-dictionaries, which can be classified according to the characteristics of the language. Taking the Chinese dictionary as an example, it can be further subdivided into a rare-word dictionary, a polyphonic-word dictionary, an idiom dictionary, a name dictionary, and the like. Further subdividing the dictionary according to the characteristics of the language helps to improve the training effect and the accuracy of prediction.
  • the phoneme-character mapping table is used to store the correspondence between characters and corresponding phonemes.
  • a character can correspond to one or more phonemes.
  • the processor can perform phonetic transcription on characters according to the phoneme-character mapping table to obtain the phonemes corresponding to the characters. For example, for Chinese characters, due to the existence of polyphonic words, one Chinese character may correspond to one or more phonemes; for English characters, since some English words include multiple syllables, one English character may also correspond to one or more phonemes. The processor can look up the phoneme-character mapping table according to the character to determine the one or more phonemes corresponding to the character.
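• A minimal sketch of such a lookup is shown below; the table contents are hypothetical examples rather than entries of an actual phonetic dictionary:

```python
# Hypothetical phoneme-character mapping table: one character (or word) maps to one or more phonemes.
phoneme_map = {
    "长": ["chang2", "zhang3"],   # polyphonic Chinese character with several candidate phonemes
    "NBA": ["en", "bi", "ei"],    # English word made of several phonemes
}

def lookup_phonemes(character):
    """Return the phoneme list for a character, or None if the character is not in the dictionary."""
    return phoneme_map.get(character)
```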
  • the corpus used to construct the training sample set can be a single language or a mixed corpus including multiple languages.
  • the processor can phoneticize the characters in the corpus according to the phonetic dictionary to obtain the phonemes corresponding to the characters, and align the phonemes corresponding to the characters with the characters and punctuation.
  • the length of the character and the length of the punctuation are the same as the length of the corresponding phoneme.
  • the processor may perform phonetic transcription on the characters in the corpus one by one to obtain the phoneme corresponding to each character, determine whether the length of the character is the same as the length of the corresponding phoneme, and, if not, align the length of the character with the length of the phoneme corresponding to the character.
  • alternatively, the processor may first perform phonetic transcription on all characters in the corpus to obtain the corresponding phonemes, and then perform alignment processing on the characters and the phonemes corresponding to the characters.
  • the embodiments of the present application do not limit the order in which the phonetic and alignment steps are performed.
  • the alignment may be handled differently for different languages.
  • for Chinese polyphonic characters, the processor can choose one phoneme from the multiple phonemes as the phoneme corresponding to the character; that is to say, the phoneme corresponding to a polyphonic word after alignment is any one of its phonemes. For English characters, the processor can add alignment characters to the characters for alignment; the aligned English characters include alignment characters, and the length of the aligned English characters is the same as that of the phonemes corresponding to the English characters.
  • the position of the alignment character may be located before or after the character, which is not limited in this application.
  • the alignment character may be any symbol other than English letters, for example, the alignment character may be "@", "*", "&", or "%" and so on.
  • if an English character in the corpus is not in the phonetic dictionary, the processor can split the English character to obtain multiple independent sub-characters; since the phonetic dictionary contains characters that are the same as the sub-characters, the processor can then phoneticize and align the sub-characters.
  • one character corresponds to one punctuation.
  • the processor may also perform alignment processing on characters and punctuation.
  • punctuation can include blank as well as comma, period, question mark, exclamation mark, and so on. If there is no punctuation after the original character, the punctuation corresponding to the character can be set to blank; for characters without punctuation before alignment, the punctuation after alignment is blank, so that the length of the punctuation is aligned with the length of the characters, and the punctuation output for such a character is blank.
  • the processor may perform alignment processing on characters, phonemes and punctuation at the same time, or may perform the alignment processing in steps, which is not limited in this application.
  • the method for constructing a training sample set of the present application may also align the lengths of multiple different sentences that are trained simultaneously. For example, when multiple sentences are trained at the same time, that is, when the batch size is greater than 1, and the sentences to be trained at the same time have different lengths, null can be padded after the characters, phonemes, and punctuation corresponding to the shorter sentences, so that after padding the characters, phonemes, and punctuation of the shorter sentences have the same length as those of the longest sentence.
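• The alignment rules described above can be sketched as follows, assuming each token already carries its phoneme list and the punctuation that follows it; the helper names and the rule used to detect English tokens are assumptions made for illustration:

```python
ALIGN = "@"    # alignment character
BLANK = ""     # blank punctuation

def align_sample(tokens, phonemes_per_token, puncts_per_token):
    """Align characters, phonemes and punctuation of one sentence to the same length."""
    chars, phones, puncts = [], [], []
    for tok, phs, p in zip(tokens, phonemes_per_token, puncts_per_token):
        if tok.isascii() and len(phs) > 1:
            # English word with several phonemes: pad the character side with ALIGN symbols in front.
            chars += [ALIGN] * (len(phs) - 1) + [tok]
            puncts += [BLANK] * (len(phs) - 1) + [p]
            phones += phs
        else:
            # Chinese character: keep exactly one phoneme (any one reading of a polyphonic character).
            chars.append(tok)
            phones.append(phs[0])
            puncts.append(p)
    return chars, phones, puncts

def pad_batch(sequences, pad=""):
    """Pad shorter sentences with a null token so that a batch has a uniform length."""
    longest = max(len(s) for s in sequences)
    return [s + [pad] * (longest - len(s)) for s in sequences]
```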
  • before the corpus is phoneticized and aligned, the corpus may also be preprocessed.
  • the specific content of the preprocessing can be determined according to the constructed dictionary and the specific language type. For example, if the dictionary does not include numbers, the preprocessing of the corpus can include converting the numbers in the corpus into Chinese characters (the number regularization shown in FIG. 6); if the English words in the dictionary are capitalized, the preprocessing of the corpus may further include converting the English in the corpus from lowercase to uppercase (the English letter conversion shown in FIG. 6). The preprocessing may also include converting traditional Chinese to simplified Chinese, removing special characters, and so on.
  • the specific method of preprocessing can be determined according to phonetic dictionaries and language characteristics, etc. The specific preprocessing method is not limited in this application.
  • FIG. 7 shows an example of a process of constructing a training sample set according to an embodiment of the present application. For example, as shown in FIG. 7, taking the mixed Chinese-English sentence "Use P30 to open CCTV to watch NBA video." as an example, the corpus can first be preprocessed by converting the numbers into Chinese characters and converting the English from lowercase to uppercase, which yields "Open CCTV with P30 to watch NBA VIDEO.".
  • Pinyin for "Open CCTV with P30 to watch NBA VIDEO.” Pinyin for Chinese, and corresponding English phonemes for English. As shown in Figure 7, the English character “NBA” corresponds to three phonemes “en bi ei” , the English character “VIDEO” corresponds to two phonemes “vi diu”, since the English character “CCTV” is not in the phonetic dictionary, "CCTV” can be split into four independent sub-characters, and the sub-characters are phoneticized according to the phonetic dictionary The corresponding phoneme “see see ti vi” can be obtained, and the final phoneme can be "yong4 pi san1 ling2 da3 kai1 see see ti vi kan4 en bi ei vi diu".
  • the processor may perform alignment processing during the phonetic notation process, or may perform alignment processing uniformly after the phonetic notation, which is not limited in this application.
  • the alignment character used in the example of the present application can be "@". Because "NBA" corresponds to three phonemes, aligning the character "NBA" yields "@@NBA"; because "VIDEO" corresponds to two phonemes, the processor can obtain "@VIDEO" by aligning the character "VIDEO".
  • if the phonetic transcription of a polyphonic Chinese character yields multiple pinyin readings (such as "chang2" and another reading), the processor can randomly select one of them as the final Chinese pinyin for alignment processing.
  • the result of the characters obtained after the characters and phonemes are aligned is shown in Figure 7.
  • the final aligned character sequence is "use P30 to open CCTV to watch @@NBA@VIDEO".
  • after the training sample set is constructed, the second multi-task neural network model is trained with it to obtain the first multi-task neural network model.
  • even for scenarios where the lengths of the input sequence and the output sequence are different, the first multi-task neural network model obtained after training with the alignment processing of the present application can output a prediction result according to the input phonemes to be recognized, and the prediction result can include the characters and punctuation corresponding to the phonemes to be recognized. That is to say, while characters are generated in real time, punctuation can also be generated in real time; moreover, the first multi-task neural network model is small in size and can be deployed on the device side.
  • FIG. 8 shows a flowchart of a method for training a multi-task neural network model according to an embodiment of the present application.
  • the method for training a multi-task neural network model provided by the embodiments of the present application can be applied to the apparatus shown in FIG. 2 .
  • during training, a batch of training samples can be selected from the training sample set as the input of the multi-task neural network, and the size of the batch can be represented as (B, U), where B represents the number of training samples in the batch and U represents the length of the phoneme sequence corresponding to the longest sample in the batch. For example, B can be 128, indicating that one training step uses the phonemes corresponding to 128 sentences, and the length of the phoneme sequence corresponding to the longest of the 128 sentences is U.
  • multiple batches of training samples can be selected and input to the second multi-task neural network model for training.
  • generally, the larger the amount of training data, the more accurate the characters and punctuation predicted by the first multi-task neural network model during inference.
  • the input to a neural network model must be numeric values, not strings, so training samples can be converted to numeric data before training.
  • the numerical value corresponding to each phoneme in the dictionary may be preset as the label of the phoneme.
  • the corresponding labels can be searched according to phonemes, so that the training samples can be converted into label sequences, that is, converted into numerical vectors as the input data of the neural network, and the neural network can be trained.
  • for example, if the input is of size (1, 6), the training sample is converted into a label sequence such as (10, 148, 148, 2456, 30, 40). That is to say, each phoneme has a corresponding label, and the label corresponding to a phoneme can be a number.
  • the sample sequence can be converted into a vector representation to participate in the subsequent calculation process.
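• A minimal sketch of this phoneme-to-label conversion is given below; the label values are illustrative only and do not correspond to an actual dictionary:

```python
# Hypothetical phoneme-to-label table preset from the dictionary.
phoneme_to_label = {"yong4": 10, "da3": 30, "kai1": 40, "kan4": 77}

def to_label_sequence(phonemes):
    """Convert a list of phoneme strings into the numeric label sequence fed to the neural network."""
    return [phoneme_to_label[p] for p in phonemes]

labels = to_label_sequence(["yong4", "da3", "kai1", "kan4"])   # e.g. [10, 30, 40, 77]
```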
  • the multi-task neural network model training method provided by this application may include:
  • Step S801 input the input data into the second multi-task neural network model, and determine the character probability matrix and the punctuation probability matrix corresponding to the input data;
  • Step S802 according to the character probability matrix and the punctuation probability matrix, respectively calculate the character cross-entropy loss and the punctuation cross-entropy loss;
  • Step S803 calculate the weighted cross-entropy loss according to the character cross-entropy loss and the punctuation cross-entropy loss;
  • Step S804 Adjust parameters of the second multi-task neural network model according to the weighted cross-entropy loss to obtain a trained first multi-task neural network model.
  • the input data may be a label sequence after phoneme conversion, that is, a vector corresponding to the phoneme to be recognized.
  • in step S801, the training device may operate on the input data through the second multi-task neural network model to obtain a feature vector of the input data; then, the training device may operate on the feature vector through the second multi-task neural network model to predict the characters and punctuation corresponding to the training samples, obtaining the character probability matrix and the punctuation probability matrix.
  • the input data may be the above-mentioned training samples.
  • the second multi-task neural network model may include the encoder of the encoder model shown in FIG. 5c, and the encoder is used to perform feature extraction on the phonemes (input data) to be recognized to obtain a feature vector.
  • the encoder may include an embedding layer
  • the training device may perform operations on the input data through the embedding layer to extract feature vectors.
  • the training device can perform operations according to the encoding method specifically adopted by the embedding layer and the input data to obtain a feature vector, such as the vector C shown in FIG. 5c .
  • each phoneme can be represented by a one-dimensional vector after encoding, and the length of the vector can be determined according to the number of phonemes in the dictionary. For example, in the example of this application, 512 values can be used to represent one phoneme. In the embodiment of the present application, the correspondence between the label of a phoneme and the encoded vector of the phoneme may be recorded.
  • the dimension of the input data is (1, 6), that is, the labels obtained by converting 6 phonemes.
  • the obtained feature vector can be (1, 6, 512).
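• For example, this shape transformation can be reproduced with an embedding layer as follows (a sketch; the vocabulary size of 5000 is an assumption):

```python
import torch
import torch.nn as nn

labels = torch.tensor([[10, 148, 148, 2456, 30, 40]])               # input data of dimension (1, 6)
embedding = nn.Embedding(num_embeddings=5000, embedding_dim=512)    # hypothetical dictionary size
features = embedding(labels)
print(features.shape)                                               # torch.Size([1, 6, 512])
```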
  • the second multi-task neural network model may further include a classifier (such as the Softmax classifier shown in FIG. 5c). The training device may use the classifier in the second multi-task neural network model to classify the feature vector, obtaining the character probability matrix and the punctuation probability matrix.
  • the character probability matrix represents the first probability of the character corresponding to the phoneme
  • the punctuation probability matrix represents the second probability of the punctuation corresponding to each phoneme. According to the character probability matrix and the punctuation probability matrix, the characters and punctuation corresponding to the phonemes can be obtained.
  • a correspondence between characters and their first index values, and between punctuation and their second index values, can be established in advance to form a vocabulary. The neural network model can then obtain the characters and punctuation corresponding to the phonemes according to the obtained character probability matrix, punctuation probability matrix, and vocabulary.
  • the first index value of the character corresponding to the largest first probability can be obtained through the character probability matrix, and the character corresponding to the phoneme can be obtained according to the first index value and the vocabulary.
  • the second index value of the punctuation corresponding to the largest second probability can be obtained through the punctuation probability matrix, and the punctuation corresponding to the phoneme can be obtained according to the second index value and the vocabulary. That is to say, what the Softmax classifier produces is the probability matrix of the characters corresponding to the phonemes (input data) to be recognized; each first probability in the matrix represents the probability that the character corresponding to the phoneme is the character associated with that probability, so the character corresponding to the largest first probability can be determined to be the character corresponding to the phoneme. The punctuation corresponding to the phoneme can be determined in the same way.
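• A sketch of this decoding step is shown below, assuming the two probability matrices are available as tensors and the vocabularies are index-to-symbol lists:

```python
import torch

def decode(char_probs, punct_probs, char_vocab, punct_vocab):
    """char_probs: (U, num_chars); punct_probs: (U, num_puncts); vocabularies map index -> symbol."""
    char_idx = char_probs.argmax(dim=-1)      # first index value with the largest first probability
    punct_idx = punct_probs.argmax(dim=-1)    # second index value with the largest second probability
    chars = [char_vocab[i] for i in char_idx.tolist()]
    puncts = [punct_vocab[i] for i in punct_idx.tolist()]
    return chars, puncts
```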
  • the training device may calculate the character cross-entropy loss according to the cross-entropy loss function and the character probability matrix.
  • the specific calculation formula is as follows:

  y(C) = -\sum_{i=1}^{n} \log P(c_i)

  • where y(C) represents the cross-entropy loss of all characters, P(c_i) represents the first probability corresponding to the character c_i, i is the subscript of the character, and i ranges from 1 to n, where n is a positive integer. The character cross-entropy loss can be calculated according to the above formula and the character probability matrix.
  • the training device may calculate the punctuation cross-entropy loss according to the cross-entropy loss function and the punctuation probability matrix.
  • the specific calculation formula is as follows:

  y(P) = -\sum_{i=1}^{n} \log P(p_i)

  • where y(P) represents the cross-entropy loss of all punctuation marks and P(p_i) represents the second probability corresponding to the punctuation mark p_i.
  • a first weight corresponding to the character cross-entropy loss and a second weight corresponding to the punctuation cross-entropy loss may be set.
  • the weighted cross-entropy loss may be calculated according to the character cross-entropy loss, the first weight, the punctuation cross-entropy loss, and the second weight.
  • the weighted cross-entropy loss can be calculated according to the following formula:

  y(C+P) = w_1 \cdot y(C) + w_2 \cdot y(P)

  • where y(C+P) represents the weighted cross-entropy loss of characters and punctuation, w_1 represents the first weight, and w_2 represents the second weight.
  • the second weight may be 0.3.
  • the training device may update the weights of the second multi-task neural network model according to the weighted cross-entropy loss through the back-propagation algorithm to obtain the trained first multi-task neural network model.
  • the weight update can be implemented using the Adam optimizer.
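• One training step under this scheme can be sketched as follows (PyTorch-style; the first weight 0.7, the learning rate, and the flattening of the (B, U) dimensions are assumptions for illustration, and `model` is an instance such as the sketch given earlier):

```python
import torch
import torch.nn as nn

w1, w2 = 0.7, 0.3                      # first and second weights (second weight 0.3 as in the example above)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(phoneme_labels, char_targets, punct_targets):
    char_logits, punct_logits = model(phoneme_labels)                     # (B, U, num_chars), (B, U, num_puncts)
    loss_c = loss_fn(char_logits.flatten(0, 1), char_targets.flatten())   # character cross-entropy loss
    loss_p = loss_fn(punct_logits.flatten(0, 1), punct_targets.flatten()) # punctuation cross-entropy loss
    loss = w1 * loss_c + w2 * loss_p                                      # weighted cross-entropy loss
    optimizer.zero_grad()
    loss.backward()                                                       # back-propagation
    optimizer.step()                                                      # Adam weight update
    return loss.item()
```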
  • the training method of the multi-task neural network model of the present application can realize the training of the tasks of character prediction and punctuation prediction at the same time.
  • the training method of the multi-task neural network model of the present application can also realize the training of multiple language recognition (prediction) tasks.
  • the multi-task neural network model obtained by the training method of the embodiments of the present application can simultaneously perform prediction of multiple languages and punctuation, and the multi-task neural network model is small in size compared with a traditional acoustic model and can be deployed on the device side.
  • after the trained first multi-task neural network model is obtained, the phonemes to be recognized can be input into the first multi-task neural network model, and forward inference can be performed to realize simultaneous prediction and output of the characters and punctuation corresponding to the phonemes.
  • the present application also provides a speech recognition method, which can be applied to the terminal device shown in FIG. 1 or FIG. 3 . After the first multi-task neural network model is obtained, the first multi-task neural network model can be deployed in the terminal device.
  • Fig. 9a shows a schematic diagram of an application scenario of speech recognition performed by the terminal device side according to an embodiment of the present application.
  • an acoustic model and a neural network model (a first multi-task neural network model) may be deployed in the terminal device.
  • the terminal device can input the collected speech signal or the received speech signal into the acoustic model, and by processing the speech signal through the acoustic model, the phoneme corresponding to the speech signal can be obtained and output to the first multi-task neural network model.
  • FIG. 10 shows a flowchart of a speech recognition method according to an embodiment of the present application.
  • the speech recognition method of an embodiment provided by the present application may include the following steps:
  • Step S901 input the phoneme to be recognized into a first multi-task neural network model, wherein the first multi-task neural network model is obtained by training the second multi-task neural network model by using training samples.
  • the training samples include: sample sentences, the sample sentences include characters, and the training samples include: phonemes and punctuations corresponding to the characters in the sample sentences.
  • phonemes, characters and punctuation are of the same length.
  • Both the second multi-task neural network model and the first multi-task neural network model can output a first prediction result and display at least a part of the first prediction result, where the first prediction result includes a character prediction result and a punctuation prediction result.
  • the second multi-task neural network model and the first multi-task neural network model can simultaneously predict the characters and punctuation corresponding to the to-be-recognized phoneme according to the to-be-recognized phoneme.
  • the phonemes to be recognized may be obtained by processing the speech signal to be recognized with an acoustic model, and the speech signal to be recognized may be a signal collected or received by the terminal device, which is not limited in this application.
  • for example, the terminal device opens a social APP, detects that the microphone is turned on, and collects a voice signal; if the terminal device detects a conversion request for converting the voice signal into text, it can input the voice signal into the acoustic model.
  • alternatively, the terminal device opens a social APP and receives a voice signal sent by another terminal device; if the terminal device detects a conversion request, it can input the voice signal into the acoustic model. After the acoustic model receives the speech signal, it can process the speech signal to obtain the phonemes to be recognized, and the terminal device can input the phonemes to be recognized into the first multi-task neural network model.
  • the to-be-identified phoneme output by the acoustic model may be a label sequence corresponding to the phoneme.
  • the speech recognition method of an embodiment provided by the present application may further include:
  • Step S902 the terminal device uses the first multi-task neural network model to output a first prediction result, where the first prediction result includes a character prediction result and a punctuation prediction result corresponding to the phoneme to be recognized;
  • Step S903 the terminal device displays at least a part of the first prediction result on the display screen of the terminal device according to the first prediction result.
  • the first multi-task neural network model can perform feature extraction to obtain the feature vector of the phonemes to be recognized, and the classifier then performs classification according to the feature vector, so that the character and punctuation corresponding to each phoneme to be recognized can be predicted. That is, the classifier classifies according to the input phonemes to be recognized, obtains the corresponding characters and punctuation, and outputs the prediction result (the first prediction result).
  • the terminal device can simultaneously display the predicted characters and punctuation.
  • the first multi-task neural network model can use the encoder model shown in FIG. 5c to process the input phoneme to be recognized, and can obtain corresponding characters and punctuation for simultaneous output.
  • Fig. 9b shows a schematic diagram of a speech recognition process according to an example of the prior art.
  • in the traditional method of converting phonemes to characters and punctuation, the phoneme is first mapped to a character, and then the corresponding punctuation is predicted.
  • the phoneme can be mapped to the character through the N-Gram language model, and after the character is obtained, the punctuation can be obtained through the punctuation prediction model.
  • In this way, characters and punctuation need to be predicted by two separate models, characters and punctuation cannot be output at the same time, and the models are too large to be deployed on the device side.
  • the speech recognition method adopted in this application can output characters and punctuation at the same time through a neural network model shown in FIG. 9a, and the model can be deployed on the terminal side due to the simplicity of the model.
  • in the present application, the neural network model is trained through a specially constructed training sample set, and the multi-task neural network model obtained after training is deployed on the device side, so that the predicted characters and punctuation can be output and displayed at the same time.
  • the first multi-task neural network model may have a streaming network structure. In this case, inputting the phonemes to be recognized into the first multi-task neural network model and outputting the first prediction result using the first multi-task neural network model may include: the terminal device cyclically sends the phonemes to be recognized into the first multi-task neural network model, and uses the first multi-task neural network model to output the first prediction result based on the length of the currently input phonemes to be recognized.
  • specifically, the terminal device cyclically sending the phonemes to be recognized into the first multi-task neural network model and outputting the first prediction result based on the length of the currently input phonemes to be recognized may include:
  • before the input of all the phonemes to be recognized into the first multi-task neural network model is completed, if the length of the currently input phonemes is less than the receptive field, the terminal device continues to input the next phoneme;
  • before the input of all the phonemes to be recognized into the first multi-task neural network model is completed, if the length of the currently input phonemes is not less than the receptive field, the terminal device obtains the second prediction result of the first phoneme of the currently input phonemes according to the characters and punctuation of the currently input phonemes, and stores the second prediction result; the terminal device then continues to input the feature vector of the first phoneme, the phonemes other than the first phoneme among the currently input phonemes, and the next phoneme of the phonemes to be recognized into the first multi-task neural network model;
  • when the input of all the phonemes to be recognized into the first multi-task neural network model is completed, the terminal device obtains the second prediction result of the currently input phonemes according to the characters and punctuation of the currently input phonemes; if there is no stored second prediction result, the terminal device uses the second prediction result of the currently input phonemes as the first prediction result of the phonemes to be recognized; otherwise, the first prediction result of the phonemes to be recognized is obtained according to the second prediction result of the currently input phonemes and the stored second prediction results.
  • the second prediction result is the final result of one or several phonemes to be recognized, while the prediction results of the phonemes other than the first phoneme among the currently input phonemes are all temporary prediction results. Therefore, the terminal device stores the second prediction results and finally fuses all the second prediction results to obtain the first prediction result (the final result of all the phonemes to be recognized).
  • FIG. 11 shows a flowchart of a speech recognition method according to an embodiment of the present application.
  • as shown in FIG. 11, cyclically sending the phonemes to be recognized into the first multi-task neural network model for character and punctuation prediction, according to the relationship between the length of the phonemes to be recognized and the receptive field of the first multi-task neural network model, can include the following steps:
  • Step S1100 judging whether the input of all the phonemes to be recognized has been completed; if the input of all the phonemes to be recognized has not been completed, then step S1101 is performed; if the input of all the phonemes to be recognized has been completed, then step S1104 is performed;
  • Step S1101 determine whether the length of the currently input phoneme is less than the receptive field; if the length of the currently input phoneme is less than the receptive field, then perform step S1102; if the length of the currently input phoneme is not less than the receptive field, then perform step S1103;
  • Step S1102 predict the characters and punctuation of the currently input phonemes, obtain the temporary result of the currently input phonemes, continue to input the next phoneme, and return to step S1100;
  • Step S1103 predict the characters and punctuation of the currently input phonemes, obtain the final result of the first phoneme of the currently input phonemes, and store the final result; the terminal device continues to input the feature vector of the first phoneme, the phonemes other than the first phoneme among the currently input phonemes, and the next phoneme of the phonemes to be recognized into the first multi-task neural network model; return to step S1100;
  • Step S1104 predict the characters and punctuation of the currently input phonemes, obtain the final result of the currently input phonemes, and judge whether there is a stored final result; if there is a stored final result, execute step S1105; if there is no stored final result, execute step S1106;
  • Step S1105 fuse the stored final result and the final result of the currently input phonemes to obtain the final result of the phonemes to be recognized, and end the loop;
  • Step S1106 taking the final result of the currently input phoneme as the final result of the phoneme to be recognized, and ending the loop.
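• As a rough illustration, the loop of steps S1100 to S1106 can be sketched as follows; `model.predict(window)` is a hypothetical call that returns one (character, punctuation) pair per phoneme of the current window, and the caching of the first phoneme's feature vector is left out here and sketched separately further below:

```python
def streaming_decode(model, phoneme_stream, receptive_field=8):
    """Cyclically feed phonemes and commit final results once the window reaches the receptive field."""
    window, final_results = [], []
    for phoneme in phoneme_stream:                     # S1100/S1101: phonemes keep arriving
        window.append(phoneme)
        predictions = model.predict(window)            # temporary results for the current window (S1102)
        if len(window) >= receptive_field:
            final_results.append(predictions[0])       # the first phoneme's result becomes final (S1103)
            window.pop(0)                              # its feature vector would be kept as a cache
    final_results.extend(model.predict(window))        # input finished: remaining results are final (S1104)
    return final_results                               # fusion of stored and current finals (S1105/S1106)
```

• This sketch only approximates the flow of FIG. 11: the end of the input is discovered when no further phoneme arrives, which matches the behaviour of the acoustic model controlled by voice endpoint detection described below.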
  • in step S1100, the terminal device can judge whether the input of all the phonemes to be recognized has been completed according to the output of the acoustic model connected upstream: if the acoustic model no longer outputs new phonemes, the terminal device can judge that all the phonemes to be recognized have been input into the first multi-task neural network model; otherwise, the terminal device can judge that the input of all the phonemes to be recognized has not been completed.
  • for example, a VAD (Voice Activity Detection, voice endpoint detection) module can detect when a segment of audio contains a human voice and when the voice ends; after the end of the human voice in the audio is detected, the acoustic model can be controlled to stop outputting phonemes.
  • the length of the currently input phonemes is 1 at the beginning of input, and it gradually increases as more phonemes are input. If the total length of the phonemes to be recognized is greater than or equal to the receptive field, the length of the currently input phonemes stops changing once it has increased to the size of the receptive field; when a new phoneme is input, the first phoneme of the currently input phonemes is no longer input into the first multi-task neural network model. If the total length of the phonemes to be recognized is smaller than the receptive field, the maximum length of the currently input phonemes is smaller than the receptive field.
  • the receptive field of the first multi-task neural network model is 8, and the length of the phoneme to be recognized is 15.
  • when the 1st to 7th phonemes to be recognized are input, the lengths of the currently input phonemes are 1, 2, 3, 4, 5, 6, and 7 respectively, each smaller than the receptive field.
  • when the 8th phoneme to be recognized is input, the length of the currently input phonemes is 8, which is not less than the receptive field.
  • when the 9th phoneme to be recognized is input, the length of the currently input phonemes is still 8, and the currently input phonemes are the 2nd to 9th phonemes.
  • the same applies to the 10th phoneme to be recognized and the subsequent phonemes to be recognized. Alternatively, assuming that the receptive field of the first multi-task neural network model is 8 and the total length of the phonemes to be recognized is 7, the maximum length of the currently input phonemes is 7, which is smaller than the receptive field.
  • the terminal device may execute step S1102 to predict the characters and punctuation of the currently input phoneme to obtain a temporary result of the currently input phoneme.
  • when the length of the currently input phonemes is smaller than the receptive field, the characters and punctuation predicted for the currently input phonemes may still change according to the phonemes input later. Therefore, in this case the terminal device can treat the prediction result for the currently input phonemes as a temporary result.
  • the terminal device may input the next to-be-recognized phoneme predicted by the acoustic model into the first multi-task neural network model, and then returns to step S1100 to continue judging whether the input of all the to-be-recognized phonemes has been completed.
  • for example, if the currently input phonemes are the 1st to 5th phonemes, the length of the currently input phonemes is 5, which is smaller than the receptive field of 8. Therefore, the terminal device can take the prediction results of the characters and punctuation of the 1st to 5th phonemes as temporary results and input the next (6th) phoneme to be recognized, so that the currently input phonemes become the 1st to 6th phonemes, 6 phonemes in total.
  • if the length of the currently input phonemes is not less than the receptive field, the terminal device may execute step S1103 to predict the characters and punctuation of the currently input phonemes, obtain the final result of the first phoneme of the currently input phonemes, and store the final result.
  • the prediction result of the terminal device on the phonemes other than the first phoneme in the currently input phoneme is a temporary result.
  • the terminal device can continue to input the feature vector of the first phoneme extracted in this prediction process, the phonemes other than the first phoneme in the currently input phoneme, and the next phoneme of the phoneme to be recognized into the first multi-task neural network. Model. Then return to step S1100, and continue to judge whether the input of all the phonemes to be recognized has been completed.
  • the terminal device may use the prediction result for the first phoneme as the final result, and store the final result.
  • the terminal device may use the predicted results for the 2nd to 8th phonemes as provisional results.
  • the terminal device may then input the feature vector of the 1st phoneme extracted in this prediction step, together with the 2nd to 8th phonemes and the 9th phoneme, into the first multi-task neural network model, continue the inference, obtain the prediction result of the 2nd phoneme (the first phoneme of the currently input phonemes) as a final result, and store that final result. Next, the terminal device can input the feature vector of the 2nd phoneme extracted in this prediction step, the 3rd to 9th phonemes, and the 10th phoneme into the first multi-task neural network model, continue the inference, and repeat the above process until the input of all the phonemes to be recognized is completed.
  • in step S1103, when the input of the first multi-task neural network model is the currently input phonemes plus the feature vector of the first phoneme of the previously input phonemes, the feature vector of the currently input phonemes can be extracted and spliced with the feature vector of the first phoneme of the previously input phonemes.
  • the terminal device can perform a convolution operation to further extract the feature vector and predict the result according to the extracted feature vector.
  • for example, the current input consists of the 2nd to 9th phonemes and the feature vector of the 1st phoneme. The terminal device may extract the feature vectors of the 2nd to 9th phonemes, and perform a concatenation operation (concat) between the feature vector of the 1st phoneme and the feature vectors of the 2nd to 9th phonemes.
  • the terminal device can perform a convolution operation to further extract the feature vector, and predict the result according to the extracted feature vector.
  • the terminal device may also perform a clipping operation on the feature vector obtained after splicing, and clip the feature vector corresponding to the second phoneme as an input for the next prediction.
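• A sketch of this splice-and-clip step is shown below; `embed` and `conv` stand for the embedding and convolution layers of the streaming model, and the (1, U, 512) shapes follow the convention used earlier:

```python
import torch

def step_with_cache(embed, conv, window_labels, cached_feature=None):
    """One streaming step: embed the current window, splice the cached feature vector in front,
    convolve, and clip the window's first feature vector as the cache for the next step."""
    window_feats = embed(window_labels)                          # (1, U, 512)
    new_cache = window_feats[:, :1, :]                           # feature vector of the window's first phoneme
    feats = window_feats if cached_feature is None else torch.cat([cached_feature, window_feats], dim=1)
    out = conv(feats.transpose(1, 2)).transpose(1, 2)            # further feature extraction by convolution
    return out, new_cache
```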
  • the terminal device may execute step S1104 to predict the characters and punctuation of the currently input phoneme to obtain the final result of the currently input phoneme.
  • the terminal device can then determine whether there is a stored final result: if the total length of the phonemes to be recognized is not less than the receptive field, the terminal device has already stored the final results of the earlier phonemes; if the total length of the phonemes to be recognized is smaller than the receptive field, the terminal device has not stored any final result.
  • the terminal device may perform step S1105 to fuse the stored final result with the final result of the currently input phoneme to obtain the final result of the phoneme to be recognized, and end the loop.
  • the specific fusion method may be that the final result of the currently input phoneme is spliced with the stored final result to obtain the final result of the phoneme to be recognized.
  • the terminal device may execute step S1106 to take the final result of the currently input phoneme as the final result of the phoneme to be recognized, and end the loop.
  • for example, the current input consists of the 8th to 15th phonemes and the feature vector of the 7th phoneme.
  • the terminal device may judge that the input of all the phonemes to be recognized has been completed, and execute step S1104 to predict the characters and punctuation of the 8th to 15th phonemes, and obtain the final result of the 8th to 15th phonemes.
  • the terminal device can determine that the final result of the 1st to 7th phonemes has been stored. Therefore, the terminal device may fuse the final results of the 1st-7th phonemes and the final results of the 8th-15th phonemes to obtain the final results of the 1st-15th phonemes.
  • in the embodiment of the present application, the phonemes to be recognized output by the acoustic model are cyclically sent to the first multi-task neural network model with the streaming network structure, so that the prediction result of a phoneme to be recognized refers both to the preceding phonemes and to the following phonemes, which improves the accuracy of prediction.
  • the streaming network structure is adopted, and the previous input is sent to the network as a buffer, which reduces the calculation amount of the model and realizes fast inference.
  • for example, for a convolutional neural network (CNN) whose real receptive field is 15, taking the center position as a reference, 7 phonemes are needed on each side. The streaming network structure caches all 7 historical phonemes, caching the historical features through the buffer of each layer. Therefore, only 8 phonemes need to be computed in each calculation, and the effective receptive field is 8, which reduces the amount of calculation compared with a receptive field of 15.
  • integrating punctuation prediction and character prediction into one model ensures that punctuation is generated in real time while characters are generated in real time; there is no need to wait for all speech recognition results before performing punctuation prediction, and characters and punctuation can be output at the same time.
  • the multi-task neural network model is smaller in size than traditional acoustic models and can be deployed on the device side.
  • the result obtained when the receptive field is not satisfied is a temporary result.
  • the input satisfies (not less than) the receptive field, that is, when 'chun1 mian2 bu4 jue2 xiao3 chu4 chu4 wen2' is input to the neural network model
  • the length of the currently input phonemes is 8, which is equal to the receptive field of 8, and the output is the characters corresponding to 'chun1 mian2 bu4 jue2 xiao3 chu4 chu4 wen2' ('spring sleep, unaware of dawn, everywhere one hears'). At this time, the result predicted for the first input phoneme 'chun1' is a final result, and the character and punctuation corresponding to the phoneme 'chun1' are stored as the final result.
  • the length of the currently input phonemes is again 8, which is equal to the receptive field of 8.
  • the output is the characters corresponding to 'mian2 bu4 jue2 xiao3 chu4 chu4 wen2 ti2' ('sleep, unaware of dawn, everywhere one hears the cry'). At this time, the character and punctuation predicted for the first input phoneme 'mian2' are stored as the final result.
  • the feature vector of the phoneme 'mian2' is input into the neural network model as a buffer.
  • the currently input phonemes are 'bu4 jue2 xiao3 chu4 chu4 wen2 ti2 niao3'. Since there are no further phonemes at this point, the characters and punctuation predicted from the currently input phonemes are final results, which are merged with the previously stored results to obtain the final output 'Spring sleep, unaware of dawn; everywhere one hears birds singing.'
  • the terminal device may store the predicted temporary results in buffers, and the terminal device may preset the number of temporary buffers (a preset number) used for storing the temporary results.
  • the preset number can be the same as the size of the receptive field.
  • the terminal device can also judge whether the length of the currently input phonemes is smaller than the receptive field by judging whether the preset number of buffers is full: if the preset number of buffers is not full, the length of the currently input phonemes is less than the receptive field; if the preset number of buffers is full, the length of the currently input phonemes is not less than the receptive field.
  • the process of judging whether the length of the currently input phoneme is smaller than the receptive field may be performed after the prediction result is obtained by predicting the currently input phoneme.
  • FIG. 12 shows a flowchart of a speech recognition method according to an embodiment of the present application.
  • the following steps may be included (a simplified code sketch of this loop is given after the step list):
  • Step S1200 judge whether the input of all the phonemes to be recognized has been completed; if the input of all the phonemes to be recognized has not been completed, then step S1201 is performed; if the input of all the phonemes to be recognized has been completed, then step S1204 is performed;
  • Step S1201 predict the characters and punctuation of the currently input phonemes, obtain the temporary result of the currently input phonemes, store the temporary result in the temporary buffer, and judge whether the temporary buffer is full; if the temporary buffer is not full, then go to step S1202; if the temporary buffer is full, go to step S1203;
  • Step S1202 continue to input the next phoneme, and return to step S1200;
  • Step S1203 the prediction result of the first phoneme of the currently input phonemes is used as a final result, and the terminal device can store the final result;
  • the phonemes other than the first phoneme among the currently input phonemes and the next phoneme among the phonemes to be recognized continue to be input into the first multi-task neural network model; return to step S1200;
  • Step S1204 predict the characters and punctuation of the currently input phonemes, obtain the final result of the currently input phonemes, and judge whether there is a stored final result; if there is a stored final result, then execute step S1205; if there is no stored final result, then execute step S1206;
  • Step S1205 fuse the stored final result and the final result of the currently input phonemes to obtain the final result of the phonemes to be recognized, and end the loop;
  • Step S1206 taking the final result of the currently input phoneme as the final result of the phoneme to be recognized, and ending the loop.
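  • The loop of steps S1200-S1206 can be summarized with the short sketch below. It is an illustrative assumption rather than the patent's implementation: the model is replaced by a stand-in predict() function, the preset number of temporary buffers is assumed to equal the receptive field, and the per-layer feature caching of the streaming structure is omitted.

```python
# Minimal, self-contained sketch of the Fig. 12 streaming loop: temporary
# results fill a fixed number of buffers; once full, the result for the first
# buffered phoneme is committed as final.
from collections import deque

RECEPTIVE_FIELD = 8  # preset number of temporary buffers (assumed value)

def predict(phonemes):
    """Stand-in for the first multi-task neural network model: one
    (character, punctuation) pair per input phoneme."""
    return [(f"char({p})", "") for p in phonemes]

def recognize_streaming(phonemes):
    window = deque()          # currently input phonemes (temporary buffer)
    final = []                # stored final results
    for i, p in enumerate(phonemes):
        window.append(p)
        temp = predict(window)                    # S1201: temporary result
        if i < len(phonemes) - 1:                 # S1200: input not finished
            if len(window) < RECEPTIVE_FIELD:     # buffer not yet full
                continue                          # S1202: input next phoneme
            final.append(temp[0])                 # S1203: first result is final
            window.popleft()                      # keep the rest for the next pass
        else:
            current_final = predict(window)       # S1204: all phonemes input
            final.extend(current_final)           # S1205/S1206: fuse results
    return final

print(recognize_streaming([f"p{i}" for i in range(1, 16)]))
```
  • With 15 phonemes and a receptive field of 8, this sketch commits the results of the 1st to 7th phonemes one by one and fuses them with the results of the 8th to 15th phonemes on the last pass, matching the example described earlier.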
  • the difference between steps S1201-S1203 and steps S1101-S1103 lies in the order in which it is judged whether the receptive field is satisfied and the prediction is performed.
  • the first multi-task neural network model may also be a non-flow network structure.
  • the terminal device can sequentially input the phonemes to be recognized, instead of inputting the phonemes whose results have been predicted in a loop.
  • the non-streaming network structure does not need to cache the predicted historical results, reduces the memory space occupied, and can further reduce the size of the neural network model.
  • using the first multi-task neural network model to output the first prediction result may include: using the first multi-task neural network model to output the first prediction result based on the relationship between the total length of the phonemes to be recognized and the phoneme length threshold. Specifically, the following steps may be included:
  • according to the characters and punctuation of the currently input phonemes, the terminal device obtains the second prediction result of the first phoneme of the currently input phonemes and stores it;
  • after storing the second prediction result, the terminal device continues to input the phonemes other than the first phoneme among the currently input phonemes and the next phoneme among the phonemes to be recognized into the first multi-task neural network model; if the total length of the phonemes to be recognized is not less than the phoneme length threshold, when all the phonemes to be recognized have been input into the first multi-task neural network model, the second prediction result of the currently input phonemes is obtained according to the characters and punctuation of the currently input phonemes;
  • the terminal device can set a phoneme length threshold, and when the total length of the phonemes to be recognized is less than the phoneme length threshold, the terminal device can input the phonemes to be recognized into the first multi-task neural network model for inference, and the obtained prediction result is used as the final result.
  • the terminal device may input the phonemes to be recognized one by one into the first multi-task neural network model for inference. When the length of the currently input phonemes is not less than the phoneme length threshold, the prediction result of the first phoneme of the currently input phonemes is stored as a final result and the next phoneme to be recognized continues to be input; inference continues until the last phoneme to be recognized has been input, at which point the prediction result of the currently input phonemes is used as a final result, and the final result of the currently input phonemes and the stored final results are fused to obtain the final result of the phonemes to be recognized.
  • FIG. 13 shows a flowchart of a speech recognition method according to an embodiment of the present application. As shown in FIG. 13, the speech recognition method of this embodiment may include the following steps (a simplified code sketch follows the step list):
  • Step S1300 judge whether the input of all the phonemes to be recognized has been completed; if the input of all the phonemes to be recognized has not been completed, then step S1301 is performed; if the input of all the phonemes to be recognized has been completed, then step S1304 is performed;
  • Step S1301 judge whether the length of the currently input phoneme is less than the phoneme length threshold; if the length of the current input phoneme is less than the phoneme length threshold, then execute step S1302; if the length of the current input phoneme is not less than the phoneme length threshold, then execute step S1303 ;
  • Step S1302 predict the characters and punctuation of the currently input phoneme, obtain the temporary result of the currently input phoneme, and continue to input the next phoneme, and return to step S1300;
  • Step S1303 predict the characters and punctuation of the currently input phonemes, obtain the final result of the first phoneme of the currently input phonemes, and the terminal device can store the final result;
  • the phonemes other than the first phoneme among the currently input phonemes and the next phoneme among the phonemes to be recognized continue to be input into the first multi-task neural network model; return to step S1300;
  • Step S1304 predict the characters and punctuation of the currently input phonemes, obtain the final result of the currently input phonemes, and judge whether there is a stored final result; if there is a stored final result, then execute step S1305; if there is no stored final result, then execute step S1306;
  • Step S1305 fuse the stored final result and the final result of the currently input phonemes to obtain the final result of the phonemes to be recognized, and end the loop;
  • Step S1306 taking the final result of the currently input phoneme as the final result of the phoneme to be recognized, and ending the loop.
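  • As with FIG. 12, the loop of steps S1300-S1306 can be illustrated with a short sketch. The following is a simplified assumption: a stand-in predict() function replaces the model, the threshold value of 32 is taken from the example below, and a committed first phoneme is simply dropped from the window rather than re-fed, reflecting the non-streaming behaviour described in the following paragraphs.

```python
# Minimal sketch of the Fig. 13 flow (illustrative only): the length check
# against the phoneme length threshold happens before prediction, and the
# first phoneme of a full window is committed as final and not re-fed.
PHONEME_LENGTH_THRESHOLD = 32  # example value used later in the text

def predict(phonemes):
    """Stand-in for the first multi-task neural network model."""
    return [(f"char({p})", "") for p in phonemes]

def recognize_threshold(phonemes):
    window, final = [], []
    for i, p in enumerate(phonemes):
        window.append(p)
        if i < len(phonemes) - 1:                       # S1300: input not finished
            if len(window) < PHONEME_LENGTH_THRESHOLD:  # S1301
                predict(window)                         # S1302: temporary result only
            else:                                       # S1303: commit first phoneme
                final.append(predict(window)[0])
                window.pop(0)                           # not used as the next input
        else:                                           # S1304: all phonemes input
            final.extend(predict(window))               # S1305/S1306: fuse
    return final

print(recognize_threshold([f"p{i}" for i in range(1, 11)]))
```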
  • the terminal device determines whether the length of the currently input phonemes is less than the phoneme length threshold, so that when character and punctuation prediction is performed on a phoneme, the phonemes after that phoneme are referred to.
  • the number of phonemes referred to is given by the phoneme length threshold.
  • the terminal device may set the phoneme length threshold to 32. It can be understood that the size of the phoneme length threshold may be set according to actual needs, which is not specifically limited in this application.
  • step S1303 if the length of the currently input phonemes is not less than the phoneme length threshold, the terminal device saves the final result of the first phoneme of the currently input phonemes, but does not use the feature vector of the first phoneme as an input for the next inference. Instead, the phonemes other than the first phoneme among the currently input phonemes and the next phoneme among the phonemes to be recognized continue to be input into the first multi-task neural network model for inference.
  • when the total length of the phonemes to be recognized is less than 32, the terminal device inputs the phonemes to be recognized into the first multi-task neural network model one by one, and the length of the currently input phonemes is always less than 32.
  • the terminal device uses the first multi-task neural network model to perform inference, determining the temporary result of the currently input phonemes to be recognized, and refreshes, according to all the phonemes to be recognized that are currently input, the temporary results of the previously input phonemes. The above process is repeated until all the phonemes to be recognized have been input into the first multi-task neural network model, and the result obtained by inference is then the final result.
  • the input is the sentence 'Spring sleep, unaware of dawn; everywhere one hears birds singing.' ('chun1 mian2 bu4 jue2 xiao3 chu4 chu4 wen2 ti2 niao3').
  • the actual vector input to the first multi-task neural network model is [chun1, 0, 0, 0, ..., 0], with zeros padded up to 32 positions where no phoneme is present.
  • the output is [椿, 0, 0, ..., 0], i.e. a homophone of 'chun1' rather than the intended character '春', because there is no context yet.
  • 'chun1 mian2' can then be sent to the first multi-task neural network model together: the input is [chun1, mian2, 0, 0, ..., 0] and the output is [spring, sleep, 0, 0, ..., 0]; the obtained result refreshes the original temporary result for the phoneme 'chun1'. The above process is repeated until [chun1, mian2, bu4, jue2, xiao3, chu4, chu4, wen2, ti2, niao3, 0, ..., 0] is input into the first multi-task neural network model to obtain the final result.
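  • The padding-and-refresh behaviour of this example can be sketched as follows for the case where the total length is below the threshold. It is a toy illustration under assumptions: the model is a stand-in that maps each non-padding slot to a placeholder character, and the threshold of 32 is taken from the text above.

```python
# Minimal sketch of the non-streaming inference described above: phonemes are
# fed one by one, padded with zeros up to the phoneme length threshold, and
# each new pass refreshes the earlier temporary results.
PHONEME_LENGTH_THRESHOLD = 32  # example threshold from the text

def model(padded):
    """Stand-in for the first multi-task neural network model; one predicted
    character (or 0 for padding) per input slot."""
    return [f"char({p})" if p != 0 else 0 for p in padded]

def recognize_non_streaming(phonemes):
    assert len(phonemes) < PHONEME_LENGTH_THRESHOLD  # case: total length < threshold
    temp = []
    for n in range(1, len(phonemes) + 1):
        padded = phonemes[:n] + [0] * (PHONEME_LENGTH_THRESHOLD - n)
        temp = model(padded)            # refreshes all earlier temporary results
    return [c for c in temp if c != 0]  # the last pass gives the final result

pinyin = ["chun1", "mian2", "bu4", "jue2", "xiao3",
          "chu4", "chu4", "wen2", "ti2", "niao3"]
print(recognize_non_streaming(pinyin))
```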
  • the non-streaming network structure is adopted, and there is no need to re-input the phonemes whose results have been predicted into the network model.
  • compared with the streaming network structure, the non-streaming network structure does not need to cache the predicted historical results, which reduces the memory footprint and can further reduce the size of the neural network model, making it easy to deploy on the device side.
  • although the non-streaming network structure involves a larger amount of computation, there are no operators such as splicing and segmentation in the network, and time-consuming operations such as memory transfers are not required, so inference can be performed quickly.
  • the above introduction of the speech recognition method of the present application shows that it addresses the problems in related speech recognition technologies that the model cannot be deployed on the terminal side and that the accuracy of punctuation prediction using an acoustic model is not high.
  • the present application provides a speech recognition method according to an embodiment, which specifically includes the following steps:
  • the terminal device inputs the phonemes to be recognized into the first multi-task neural network model, wherein the first multi-task neural network model is obtained by using training samples to train the second multi-task neural network model, and the training samples include: sample sentences, the sample sentences include characters, and the training samples further include: phonemes and punctuation corresponding to the characters in the sample sentences;
  • the terminal device uses the first multi-task neural network model to output a first prediction result, where the first prediction result includes a character prediction result and a punctuation prediction result corresponding to the phoneme to be recognized;
  • the terminal device displays at least a part of the first prediction result on the display screen of the terminal device according to the first prediction result.
  • a neural network model for simultaneously predicting characters and punctuation corresponding to phonemes is constructed, and a training sample set is constructed to train the neural network model, and the trained neural network model is obtained.
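  • As an illustration of what such a model could look like, the following is a minimal PyTorch sketch with a shared encoder and two prediction heads, one for characters and one for punctuation. The layer types, dimensions and class name are assumptions chosen for brevity; the patent does not prescribe this particular architecture.

```python
# Minimal multi-task sketch: a shared encoder followed by a character head and
# a punctuation head, so that both outputs are produced in one forward pass.
import torch
import torch.nn as nn

class MultiTaskPhonemeModel(nn.Module):
    def __init__(self, num_phonemes, num_chars, num_puncts, dim=128):
        super().__init__()
        self.embed = nn.Embedding(num_phonemes, dim)
        self.encoder = nn.Conv1d(dim, dim, kernel_size=3, padding=1)  # shared encoder
        self.char_head = nn.Linear(dim, num_chars)     # character prediction task
        self.punct_head = nn.Linear(dim, num_puncts)   # punctuation prediction task

    def forward(self, phoneme_ids):                    # (batch, seq)
        x = self.embed(phoneme_ids).transpose(1, 2)    # (batch, dim, seq)
        h = torch.relu(self.encoder(x)).transpose(1, 2)
        return self.char_head(h), self.punct_head(h)   # character and punctuation logits

model = MultiTaskPhonemeModel(num_phonemes=100, num_chars=5000, num_puncts=5)
chars, puncts = model(torch.randint(0, 100, (1, 8)))
print(chars.shape, puncts.shape)  # torch.Size([1, 8, 5000]) torch.Size([1, 8, 5])
```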
  • in the training process, no word segmentation is required; the phonemes (vectors) converted from the speech to be recognized are used as the input of the trained neural network model, forward inference is performed, and the characters and punctuation corresponding to the phonemes can be output at the same time; in addition, the neural network model is small in size and can be deployed on the device side.
  • in the constructed training samples, the length of the characters in a sample sentence is the same as the length of the phonemes and the length of the punctuation.
  • in the process of constructing the training sample set, the length of the characters in each sample sentence is aligned with the length of the phonemes obtained by phonetic transcription and the length of the punctuation; after the neural network model is trained with a training sample set constructed in this way, it can perform phoneme-to-character conversion and punctuation prediction at the same time, so that the predicted character and punctuation results can be output simultaneously.
  • the first multi-task neural network model is a streaming network structure
  • the terminal device inputting the phonemes to be recognized into the first multi-task neural network model and using the first multi-task neural network model to output the first prediction result may include: the terminal device cyclically feeds the phonemes to be recognized into the first multi-task neural network model, and uses the first multi-task neural network model to output the first prediction result based on the length of the currently input phonemes to be recognized.
  • the prediction result of the phoneme to be recognized refers to both the previous phoneme and the subsequent phoneme, which improves the accuracy of prediction.
  • the terminal device cyclically feeding the phonemes to be recognized into the first multi-task neural network model and using the first multi-task neural network model to output the first prediction result based on the length of the currently input phonemes to be recognized may include:
  • before completing the input of all the phonemes to be recognized into the first multi-task neural network model, if the length of the currently input phonemes is less than the receptive field, the terminal device continues to input the next phoneme;
  • before completing the input of all the phonemes to be recognized into the first multi-task neural network model, if the length of the currently input phonemes is not less than the receptive field, the terminal device obtains the second prediction result of the first phoneme of the currently input phonemes according to the characters and punctuation of the currently input phonemes, and stores the second prediction result; the terminal device continues to feed the feature vector of the first phoneme, the phonemes other than the first phoneme among the currently input phonemes, and the next phoneme among the phonemes to be recognized into the first multi-task neural network model.
  • the terminal device cyclically feeding the phonemes to be recognized into the first multi-task neural network model and using the first multi-task neural network model to output the first prediction result based on the length of the currently input phonemes to be recognized may further include:
  • when the input of all the phonemes to be recognized into the first multi-task neural network model is completed, the terminal device obtains the second prediction result of the currently input phonemes according to the characters and punctuation of the currently input phonemes;
  • if there is no stored second prediction result, the second prediction result of the currently input phonemes is used as the first prediction result of the phonemes to be recognized;
  • if there is a stored second prediction result, the first prediction result of the phonemes to be recognized is obtained according to the second prediction result of the currently input phonemes and the stored second prediction result.
  • the phonemes to be recognized that are output by the acoustic model are cyclically fed into the first multi-task neural network model with a streaming network structure, so that the prediction result for a phoneme refers both to the preceding phonemes and to the following phonemes, which improves the accuracy of prediction.
  • the first multi-task neural network model is a non-streaming network structure
  • the terminal device using the first multi-task neural network model to output a first prediction result may include: the terminal device outputs the first prediction result, using the first multi-task neural network model, based on the relationship between the total length of the phonemes to be recognized and the phoneme length threshold.
  • the terminal device adopts the first multi-task neural network model to output the first prediction result based on the relationship between the total length of the phonemes to be recognized and the phoneme length threshold, which may include:
  • if the total length of the phonemes to be recognized is less than the phoneme length threshold, the first multi-task neural network model is used to output the first prediction result according to all the phonemes to be recognized.
  • the terminal device adopts the first multi-task neural network model to output the first prediction result based on the relationship between the total length of the phonemes to be recognized and the phoneme length threshold, and may further include:
  • if the total length of the phonemes to be recognized is not less than the phoneme length threshold, before all the phonemes to be recognized are input into the first multi-task neural network model: if the length of the currently input phonemes is less than the phoneme length threshold, the terminal device continues to input the next phoneme; if the length of the currently input phonemes is not less than the phoneme length threshold, the terminal device obtains, according to the characters and punctuation of the currently input phonemes, the second prediction result of the first phoneme of the currently input phonemes and stores the second prediction result, and the terminal device continues to input the phonemes other than the first phoneme among the currently input phonemes and the next phoneme among the phonemes to be recognized into the first multi-task neural network model.
  • the first multi-task neural network model is used to output the first prediction result based on the relationship between the total length of the phonemes to be recognized and the phoneme length threshold, and may further include:
  • if the total length of the phonemes to be recognized is not less than the phoneme length threshold, when all the phonemes to be recognized are input into the first multi-task neural network model, the second prediction result of the currently input phonemes is obtained according to the characters and punctuation of the currently input phonemes;
  • if there is no stored second prediction result, the second prediction result of the currently input phonemes is used as the first prediction result of the phonemes to be recognized;
  • if there is a stored second prediction result, the first prediction result of the phonemes to be recognized is obtained according to the second prediction result of the currently input phonemes and the stored second prediction result.
  • the non-streaming network structure is adopted, and there is no need to re-input the phonemes whose results have been predicted into the network model.
  • compared with the streaming network structure, the non-streaming network structure does not need to cache the predicted historical results, which reduces the memory footprint, can further reduce the size of the neural network model, and makes it easy to deploy on the device side.
  • the application also provides a neural network model training method, the method includes:
  • the training sample includes: a sample sentence, the sample sentence includes characters, and the training sample further includes: phonemes and punctuation corresponding to the characters in the sample sentence;
  • both the second multi-task neural network model and the first multi-task neural network model can output a first prediction result and display at least a part of the first prediction result, where the first prediction result includes a character prediction result and a punctuation prediction result, so that the characters and punctuation of the phonemes are predicted at the same time.
  • the trained neural network model is thus obtained; in the training process, no word segmentation is required, the phonemes (vectors) converted from the speech to be recognized are used as the input of the trained neural network model, forward inference is performed, and the characters and punctuation corresponding to the phonemes can be output at the same time; the neural network model is small in size and can be deployed on the device side.
  • the construction of training samples may include:
  • the characters in the sample sentence are phoneticized to obtain the phoneme corresponding to the character, and the phoneme corresponding to the character is aligned with the character and punctuation.
  • the length of the character in the sample sentence is the same as the length of the phoneme and the length of the punctuation.
  • in the process of constructing the training sample set, the length of the characters in each sample sentence is aligned with the length of the phonemes obtained by phonetic transcription and the length of the punctuation; after the neural network model is trained with a training sample set constructed in this way, it can perform phoneme-to-character conversion and punctuation prediction at the same time, so that the predicted character and punctuation results can be output simultaneously.
  • for a polyphonic character in Chinese, the phoneme after alignment is any one of the multiple phonemes corresponding to that polyphonic character;
  • aligned English characters include alignment characters, and the length of the aligned English characters is the same as the length of the phonemes corresponding to the English characters; for characters without punctuation before alignment, the punctuation after alignment is 'blank'.
  • the above-mentioned aligning processing of the phoneme corresponding to the character with the character and punctuation may include:
  • the positions where the alignment characters are added in the characters may be located on both sides of the characters to be aligned, for example, before or after. That is, characters and phonemes can be left-aligned or right-aligned. Right-alignment can be adding the alignment character to the left of the character that needs to be aligned, and left-alignment can be adding the alignment character to the right of the character that needs to be aligned.
  • for the form of the alignment characters and the way they are added, reference may be made to the description of FIG. 7 above; details are not repeated here.
  • the above three steps of the alignment process can be performed separately or simultaneously, which is not limited in this application.
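  • The alignment idea can be illustrated with the toy sketch below, which builds one training sample in which the character, phoneme and punctuation sequences all have the same length. The placeholder 'blank', the alignment character '_', and the example words and phoneme symbols are assumptions for illustration only.

```python
# Illustrative sketch (assumed data, not from the patent) of building one
# aligned training sample: characters, phonemes and punctuation are padded so
# that all three sequences have the same length.
BLANK = "blank"   # punctuation placeholder for characters with no punctuation
ALIGN = "_"       # alignment character appended to (e.g.) English words

def build_sample(tokens):
    """tokens: list of (character, [phonemes], punctuation_or_None)."""
    chars, phones, puncts = [], [], []
    for ch, phs, punct in tokens:
        # pad the character with alignment characters so its length matches
        # the number of phonemes it maps to (here: left alignment)
        chars.extend([ch] + [ALIGN] * (len(phs) - 1))
        phones.extend(phs)
        puncts.extend([BLANK] * (len(phs) - 1) + [punct if punct else BLANK])
    assert len(chars) == len(phones) == len(puncts)
    return chars, phones, puncts

# toy example: a Chinese character, an English word with two phonemes, a period
sample = build_sample([("春", ["chun1"], None),
                       ("hello", ["HH_AH", "L_OW"], "。")])
print(sample)
```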
  • using the training samples to train the second multi-task neural network model to obtain the first multi-task neural network model may include:
  • the parameters of the second multi-task neural network model are adjusted according to the weighted cross-entropy loss to obtain the trained first multi-task neural network model.
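  • A minimal sketch of such a weighted loss is given below, assuming PyTorch, illustrative tensor shapes, and example weight values; the character loss and the punctuation loss are computed separately and combined with two weights, as described for the training module later in this text.

```python
# Minimal sketch (assumed weights and shapes) of a weighted cross-entropy loss
# combining a character loss and a punctuation loss.
import torch
import torch.nn.functional as F

def weighted_loss(char_logits, punct_logits, char_targets, punct_targets,
                  w_char=1.0, w_punct=0.5):
    # char_logits: (batch, seq, num_chars); punct_logits: (batch, seq, num_puncts)
    char_loss = F.cross_entropy(char_logits.flatten(0, 1), char_targets.flatten())
    punct_loss = F.cross_entropy(punct_logits.flatten(0, 1), punct_targets.flatten())
    return w_char * char_loss + w_punct * punct_loss

loss = weighted_loss(torch.randn(2, 8, 5000, requires_grad=True),
                     torch.randn(2, 8, 5, requires_grad=True),
                     torch.randint(0, 5000, (2, 8)),
                     torch.randint(0, 5, (2, 8)))
loss.backward()  # gradients would then be used to adjust the model parameters
```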
  • the training method of the multi-task neural network model of the present application can realize the training of the tasks of character prediction and punctuation prediction at the same time.
  • the training method of the multi-task neural network model of the present application can also realize the training of multiple language recognition (prediction) tasks.
  • the multi-task neural network model obtained by the training method of the multi-task neural network model according to the embodiment of the present application can simultaneously perform prediction of multiple languages and punctuation, and the multi-task neural network model is small in size compared with a traditional acoustic model and can be deployed on the device side.
  • FIG. 14 shows a block diagram of a speech recognition apparatus according to an embodiment of the present application.
  • the speech recognition apparatus may include:
  • the input module 1400 is used for inputting the phoneme to be recognized into the first multi-task neural network model, wherein the first multi-task neural network model is obtained by using training samples to train the second multi-task neural network model,
  • the training samples include: sample sentences, the sample sentences include characters, and the training samples further include: phonemes and punctuations corresponding to the characters in the sample sentences;
  • the reasoning module 1401 is configured to use the first multi-task neural network model to output a first prediction result, where the first prediction result includes a character prediction result and a punctuation prediction result corresponding to the phoneme to be recognized;
  • the display module 1402 is configured to display at least a part of the first prediction result on the display screen of the terminal device according to the first prediction result.
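  • The module split above (input module 1400, reasoning module 1401, display module 1402) can be illustrated with the following toy composition; the class name, method names and the stand-in model are assumptions, not the patent's implementation.

```python
# Illustrative composition of the three modules described above; the model is
# replaced by a stand-in function and "display" simply prints to stdout.
from typing import Callable, List, Tuple

Prediction = List[Tuple[str, str]]  # (character, punctuation) per phoneme

class SpeechRecognitionApparatus:
    def __init__(self, model: Callable[[List[str]], Prediction]):
        self.model = model

    def input_module(self, phonemes: List[str]) -> List[str]:        # module 1400
        return phonemes                                               # feed phonemes to the model

    def reasoning_module(self, phonemes: List[str]) -> Prediction:    # module 1401
        return self.model(phonemes)                                   # first prediction result

    def display_module(self, prediction: Prediction) -> None:         # module 1402
        print("".join(ch + punct for ch, punct in prediction))        # show at least a part

    def recognize(self, phonemes: List[str]) -> None:
        self.display_module(self.reasoning_module(self.input_module(phonemes)))

apparatus = SpeechRecognitionApparatus(lambda ps: [(f"char({p})", "") for p in ps])
apparatus.recognize(["chun1", "mian2"])
```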
  • the trained neural network model is thus obtained; in the training process, no word segmentation is required, the phonemes (vectors) converted from the speech to be recognized are used as the input of the trained neural network model, forward inference is performed, and the characters and punctuation corresponding to the phonemes can be output at the same time; the neural network model is small in size and can be deployed on the device side.
  • the length of the characters in the sample sentence is the same as the length of the phonemes and the length of the punctuation.
  • in the process of constructing the training sample set, the length of the characters in each sample sentence is aligned with the length of the phonemes obtained by phonetic transcription and the length of the punctuation; after the neural network model is trained with a training sample set constructed in this way, it can perform phoneme-to-character conversion and punctuation prediction at the same time, so that the predicted character and punctuation results can be output simultaneously.
  • the first multi-task neural network model is a streaming network structure
  • the input module 1400 may include: a first input unit for cyclically feeding the phonemes to be recognized into the first multi-task neural network model;
  • the inference module 1401 includes: a first inference unit, configured to use the first multi-task neural network model to output the first prediction result based on the length of the currently input phoneme to be recognized.
  • the prediction result of the phoneme to be recognized refers to both the previous phoneme and the subsequent phoneme, which improves the accuracy of prediction.
  • the first input unit is further configured to: before the input of all the phonemes to be recognized into the first multi-task neural network model is completed, if the length of the currently input phonemes is less than the receptive field, continue to input the next phoneme;
  • before the input of all the phonemes to be recognized into the first multi-task neural network model is completed, if the length of the currently input phonemes is not less than the receptive field, the first reasoning unit is configured to obtain the second prediction result of the first phoneme of the currently input phonemes according to the characters and punctuation of the currently input phonemes, and to store the second prediction result; the first input unit is further configured to continue feeding the feature vector of the first phoneme, the phonemes other than the first phoneme among the currently input phonemes, and the next phoneme among the phonemes to be recognized into the first multi-task neural network model.
  • the first reasoning unit is further configured to: when the input of all the phonemes to be recognized into the first multi-task neural network model is completed, obtain the second prediction result of the currently input phonemes according to the characters and punctuation of the currently input phonemes;
  • if there is no stored second prediction result, the second prediction result of the currently input phonemes is used as the first prediction result of the phonemes to be recognized;
  • if there is a stored second prediction result, the first prediction result of the phonemes to be recognized is obtained according to the second prediction result of the currently input phonemes and the stored second prediction result.
  • the phonemes to be recognized that are output by the acoustic model are cyclically fed into the first multi-task neural network model with a streaming network structure, so that the prediction result for a phoneme refers both to the preceding phonemes and to the following phonemes, which improves the accuracy of prediction.
  • the first multi-task neural network model is a non-streaming network structure
  • the reasoning module 1401 includes:
  • the second reasoning unit is configured to use the first multi-task neural network model to output the first prediction result based on the relationship between the total length of the phonemes to be recognized and the phoneme length threshold.
  • the second reasoning unit is further configured to, if the total length of the phonemes to be recognized is less than the phoneme length threshold, use the first multi-task neural network model to output the first prediction result according to all the phonemes to be recognized.
  • the second reasoning unit is further configured to:
  • if the total length of the phonemes to be recognized is not less than the phoneme length threshold, before all the phonemes to be recognized have been input into the first multi-task neural network model: if the length of the currently input phonemes is less than the phoneme length threshold, the next phoneme continues to be input; if the length of the currently input phonemes is not less than the phoneme length threshold, the second prediction result of the first phoneme of the currently input phonemes is obtained according to the characters and punctuation of the currently input phonemes and stored, and the phonemes other than the first phoneme among the currently input phonemes and the next phoneme among the phonemes to be recognized continue to be input into the first multi-task neural network model.
  • the second reasoning unit is further configured to:
  • if the total length of the phonemes to be recognized is not less than the phoneme length threshold, when all the phonemes to be recognized are input into the first multi-task neural network model, the second prediction result of the currently input phonemes is obtained according to the characters and punctuation of the currently input phonemes;
  • if there is no stored second prediction result, the second prediction result of the currently input phonemes is used as the first prediction result of the phonemes to be recognized;
  • if there is a stored second prediction result, the first prediction result of the phonemes to be recognized is obtained according to the second prediction result of the currently input phonemes and the stored second prediction result.
  • the non-streaming network structure is adopted, and there is no need to re-input the phonemes whose results have been predicted into the network model.
  • compared with the streaming network structure, the non-streaming network structure does not need to cache the predicted historical results, which reduces the memory footprint, can further reduce the size of the neural network model, and makes it easy to deploy on the device side.
  • FIG. 15 shows a block diagram of a neural network model training apparatus according to an embodiment of the present application.
  • the neural network model training apparatus may include:
  • the construction module 1500 is used for constructing training samples, the training samples include: sample sentences, the sample sentences include characters, and the training samples further include: phonemes and punctuations corresponding to the characters in the sample sentences;
  • the training module 1501 is used to train the second multi-task neural network model by using the training samples to obtain the first multi-task neural network model; wherein both the second multi-task neural network model and the first multi-task neural network model can output a first prediction result and display at least a part of the first prediction result, where the first prediction result includes a character prediction result and a punctuation prediction result, and the characters and punctuation of the phonemes are predicted at the same time.
  • the neural network training device of the embodiment of the present application constructs a neural network model capable of performing phoneme conversion and punctuation prediction simultaneously, and constructs a training sample set to train the neural network model, and obtains the trained neural network model.
  • the phoneme (vector) after the speech conversion to be recognized is used as the input of the trained neural network model, and forward inference can be performed, and the characters and punctuation corresponding to the phoneme can be output at the same time, and the size of the neural network model is small, It can be deployed on the device side.
  • the building module 1500 includes:
  • an alignment unit, configured to perform phonetic transcription on the characters in the sample sentence according to a phonetic dictionary to obtain the phonemes corresponding to the characters, and to align the phonemes corresponding to the characters with the characters and the punctuation, so that the length of the characters in the sample sentence is the same as the length of the phonemes and the length of the punctuation.
  • for a polyphonic character in Chinese, the phoneme after alignment is any one of the multiple phonemes corresponding to that polyphonic character;
  • aligned English characters include alignment characters, and the length of the aligned English characters is the same as the length of the phonemes corresponding to the English characters; for characters without punctuation before alignment, the punctuation after alignment is 'blank'.
  • the alignment unit is also used for:
  • in the process of constructing the training sample set, the length of the characters in each sample sentence is aligned with the length of the phonemes obtained by phonetic transcription and the length of the punctuation; after the neural network model is trained with a training sample set constructed in this way, it can perform phoneme-to-character conversion and punctuation prediction at the same time, so that the predicted character and punctuation results can be output simultaneously.
  • the training module 1501 includes:
  • a determining unit for inputting the training sample into the second multi-task neural network model, and determining the character probability matrix and the punctuation probability matrix corresponding to the training sample;
  • a first calculation unit configured to calculate the character cross-entropy loss and the punctuation cross-entropy loss respectively according to the character probability matrix and the punctuation probability matrix;
  • the second calculation unit is configured to calculate the weighted cross entropy loss according to the character cross entropy loss, the first weight corresponding to the character cross entropy loss, the punctuation cross entropy loss, and the second weight corresponding to the punctuation cross entropy loss;
  • the adjustment unit is configured to adjust the parameters of the second multi-task neural network model according to the weighted cross-entropy loss to obtain the trained first multi-task neural network model.
  • the training device of the multi-task neural network model of the present application can realize the training of the tasks of character prediction and punctuation prediction at the same time.
  • the training method of the multi-task neural network model of the present application can also realize the training of multiple language recognition (prediction) tasks.
  • the multi-task neural network model obtained by training with the training apparatus for the multi-task neural network model according to the embodiment of the present application can simultaneously predict multiple languages and punctuation, and the multi-task neural network model is small in size compared with a traditional acoustic model and can be deployed on the device side.
  • An embodiment of the present application provides a speech recognition apparatus, including: a processor and a memory for storing instructions executable by the processor; wherein the processor is configured to implement the above method when executing the instructions.
  • An embodiment of the present application provides a neural network model training apparatus, including: a processor and a memory for storing instructions executable by the processor; wherein the processor is configured to implement the above method when executing the instructions.
  • Embodiments of the present application provide a non-volatile computer-readable storage medium on which computer program instructions are stored, and when the computer program instructions are executed by a processor, implement the above method.
  • Embodiments of the present application provide a computer program product, including computer-readable codes, or a non-volatile computer-readable storage medium carrying computer-readable codes, when the computer-readable codes are stored in a processor of an electronic device When running in the electronic device, the processor in the electronic device executes the above method.
  • a computer-readable storage medium may be a tangible device that can hold and store instructions for use by the instruction execution device.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • Computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM) or flash memory, static random-access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital video discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in grooves on which instructions are stored, and any suitable combination of the foregoing.
  • Computer readable program instructions or code described herein may be downloaded to various computing/processing devices from a computer readable storage medium, or to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network and/or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer-readable program instructions from a network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device .
  • the computer program instructions used to perform the operations of the present application may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk or C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
  • the computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (e.g., connected through the Internet using an Internet service provider).
  • electronic circuits, such as programmable logic circuits, Field-Programmable Gate Arrays (FPGA), or Programmable Logic Arrays (PLA), can be personalized by utilizing the state information of the computer-readable program instructions, and the electronic circuits can execute the computer-readable program instructions to implement various aspects of the present application.
  • These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer-readable program instructions can also be stored in a computer-readable storage medium; the instructions cause a computer, a programmable data processing apparatus, and/or other equipment to operate in a specific manner, so that the computer-readable medium on which the instructions are stored comprises an article of manufacture that includes instructions implementing various aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • Computer-readable program instructions can also be loaded onto a computer, another programmable data processing apparatus, or other equipment, so that a series of operational steps are performed on the computer, the other programmable apparatus, or the other equipment to produce a computer-implemented process, such that the instructions executed on the computer, the other programmable apparatus, or the other equipment implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • each block in the flowcharts or block diagrams may represent a module, a segment, or a portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented in hardware that performs the corresponding functions or actions (e.g., circuits or Application-Specific Integrated Circuits (ASICs)), or can be implemented by a combination of hardware and software, such as firmware.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

This application relates to a speech recognition method and apparatus. The speech recognition method includes: a terminal device inputs phonemes to be recognized into a first multi-task neural network model; the first multi-task neural network model outputs a first prediction result, the first prediction result including a character prediction result and a punctuation prediction result corresponding to the phonemes to be recognized; and the terminal device displays at least a part of the first prediction result on a display screen of the terminal device. By constructing one neural network model for simultaneously predicting the characters and punctuation corresponding to the phonemes, the characters and punctuation corresponding to the phonemes can be output at the same time; moreover, the neural network model is small in size and can be deployed on the device side.

Description

语音识别方法及装置 技术领域
本申请涉及人工智能技术领域,尤其涉及一种语音识别方法及装置。
背景技术
语音识别,也称为自动语音识别(英文全称:Automatic Speech Recognition,简称:ASR),是一种通过计算机将语音转换为相应文字的技术。随着终端设备技术的发展,语音识别技术作为人机交互的重要方式,被应用在多个不同的领域。在电子设备应用的很多场景中,需要使用语音识别技术,例如,不同语言的语音之间的翻译、智能电子设备与用户的语音交互、即时通信软件中即时语音信号到文本信息的转换,等等。
发明内容
本申请的实施例提出了一种语音识别方法及装置。
第一方面,本申请的实施例提供了一种语音识别方法,所述方法包括:
终端设备将待识别的音素输入到第一多任务神经网络模型中,终端设备采用第一多任务神经网络模型输出第一预测结果,所述第一预测结果包括所述待识别的音素对应的字符预测结果和标点预测结果,终端设备根据所述第一预测结果将所述第一预测结果的至少一部分显示在所述终端设备的显示屏上。
所述第一多任务神经网络模型可以部署在端侧(如终端设备上)或云侧。
本申请的实施方式的语音识别方法,通过构建一个用于同时预测音素对应的字符和标点的神经网络模型(即第一多任务神经网络模型,所述多任务是指该神经网络模型需执行对音素对应的字符进行预测的任务,和执行对音素对应的标点进行预测的任务),所述神经网络模型能够同时预测音素对应的字符和标点。将待识别的语音转换后的音素(向量)作为神经网络模型的输入,进行正向推理,可以同时输出音素对应的字符和标点,并且神经网络模型尺寸小,可以在端侧部署。本文中所表述的“同时”、“同时输出”等,可以理解为是能够从神经网络模型的输出中获得两种信息(如音素对应的字符信息和音素对应的标点信息),而不仅仅是获得一种信息,并不限制两种信息被获得的时间先后关系,换句话说,本文中所述的“同时”并不限定时间上一定要是相同时刻。
根据第一方面的第一种可能的实现方式中,所述第一多任务神经网络模型为采用训练样本对第二多任务神经网络模型进行训练得到的,所述训练样本包括:样本语句,所述样本语句中包括字符,所述训练样本还包括:样本语句中的字符对应的音素、标点。
所述第二多任务神经网络模型可以部署在端侧(如终端设备上)或云侧。
通过构建一个用于同时预测音素对应的字符和标点的神经网络模型(即第二多任务神经网络模型),并构建训练样本集对神经网络模型进行训练,得到训练后的神经网络模型(即第一多任务神经网络模型),训练过程中可以不需要进行分词处理,将待识别的语音转换后的音素(向量)作为训练后的神经网络模型的输入,进行正向推理,可以同时输出音素对应的字符和标点。
根据第一方面的第一种可能的实现方式,在第二种可能的实现方式中,所述样本语句中字符的长度与音素的长度和标点的长度相同。通过构建训练样本集的过程中,将样本语句中字符的长度和注音后的音素的长度、标点的长度进行对齐,采用本申请的实施方式构建的训练样本集对神经网络模型进行训练后,神经网络模型可以同时进行音素到字符的转换、以及标点预测,从而可以同时输出预测的字符和标点结果。
根据第一方面的第三种可能的实现方式中,终端设备将待识别的音素输入到第一多任务神经网络模型中,采用所述第一多任务神经网络模型输出第一预测结果,包括:所述终端设备将待识别的音素循环送入第一多任务神经网络模型中,采用所述第一多任务神经网络模型基于当前输入的待识别的音素的长度输出所述第一预测结果。使得待识别的音素的预测结果既参考了之前的音素、又参考了之后的音素,提高了预测的准确率。
根据第一方面的第三种可能的实现方式,在第四种可能的实现方式中,所述终端设备将待识别的音素循环送入第一多任务神经网络模型中,采用所述第一多任务神经网络模型基于当前输入的待识别的音素的长度输出所述第一预测结果,包括:
在完成将全部待识别的音素输入第一多任务神经网络模型之前,如果当前输入的音素的长度小于感受野,则终端设备继续输入下一个音素;
在完成将全部待识别的音素输入第一多任务神经网络模型之前,如果当前输入的音素的长度不小于感受野,则终端设备根据当前输入的音素的字符和标点,得到当前输入的音素的第一个音素的第二预测结果,并存储第二预测结果;终端设备将所述第一个音素的特征向量、当前输入的音素中除了第一个音素以外的音素和待识别的音素中的下一个音素继续输入第一多任务神经网络模型;
在完成将全部待识别的音素输入第一多任务神经网络模型时,终端设备根据当前输入的音素的字符和标点,得到当前输入的音素的第二预测结果;
若不存在已存储的第二预测结果,则将当前输入的音素的第二预测结果作为待识别的音素的第一预测结果;
若存在已存储的第二预测结果,则根据当前输入的音素的第二预测结果和已存储的第二预测结果,得到待识别的音素的第一预测结果。
根据本申请上述实施方式的语音识别方法,将声学模型输出的待识别的音素循环送入流式网络结构的第一多任务神经网络模型,使得待识别的音素的预测结果既参考了之前的音素、 又参考了之后的音素,提高了预测的准确率。
根据第一方面的第五种可能的实现方式中,所述第一多任务神经网络模型为非流式网络结构,
采用所述第一多任务神经网络模型输出第一预测结果,包括:
采用所述第一多任务神经网络模型基于待识别的音素的总长度和音素长度阈值的关系,输出所述第一预测结果。
根据第一方面的第五种可能的实现方式,在第六种可能的实现方式中,采用所述第一多任务神经网络模型基于待识别的音素的总长度和音素长度阈值的关系,输出所述第一预测结果,包括:
若待识别的音素的总长度小于音素长度阈值,采用所述第一多任务神经网络模型根据全部的待识别的音素,输出所述第一预测结果;
若待识别的音素的总长度不小于音素长度阈值,在完成将全部待识别的音素输入第一多任务神经网络模型之前:如果当前输入的音素的长度小于音素长度阈值,则终端设备继续输入下一个音素;如果当前输入的音素的长度不小于音素长度阈值,则终端设备根据当前输入的音素的字符和标点,得到当前输入的音素的第一个音素的第二预测结果并存储第二预测结果,终端设备将当前输入的音素中除了第一个音素以外的音素和待识别的音素中的下一个音素继续输入第一多任务神经网络模型;
若待识别的音素的总长度不小于音素长度阈值,在完成将全部待识别的音素输入第一多任务神经网络模型时,根据当前输入的音素的字符和标点,得到当前输入的音素的第二预测结果;
若不存在已存储的第二预测结果,则将当前输入的音素的第二预测结果作为待识别的音素的第一预测结果;
若存在已存储的第二预测结果,则根据当前输入的音素的第二预测结果和已存储的第二预测结果,得到待识别的音素的第一预测结果。
根据本申请上述实施方式的语音识别方法,采用非流式网络结构,无需将已经预测了结果的音素重新输入网络模型中,相比于流式网络结构,非流式网络结构不需要缓存已经预测的历史结果,减少占用内存空间,可以进一步减小神经网络模型的尺寸,易于在端侧进行部署。并且,由于计算过程中,不需要对历史结果和当前输入的音素进行拼接、切分等操作,能够加快推理速度,在长语音识别中,实现实时输出的效果显著。
第二方面,本申请的实施例提供了一种神经网络模型训练方法,所述方法包括:
构建训练样本,所述训练样本包括:样本语句,所述样本语句中包括字符,所述训练样 本还包括:样本语句中的字符对应的音素、标点;
采用所述训练样本对第二多任务神经网络模型进行训练得到第一多任务神经网络模型;其中,第二多任务神经网络模型和第一多任务神经网络模型都能够输出第一预测结果、显示所述第一预测结果的至少一部分,所述第一预测结果包括字符预测结果和标点预测结果。
本申请的实施方式的神经网络训练方法,通过构建一个用于同时预测音素对应的字符和标点的神经网络模型,并构建训练样本集对神经网络模型进行训练,得到训练后的神经网络模型,训练过程中可以不需要进行分词处理,将待识别的语音转换后的音素(向量)作为训练后的神经网络模型的输入,进行正向推理,可以同时输出音素对应的字符和标点,并且神经网络模型尺寸小,可以在端侧部署。
根据第二方面的第一种可能的实现方式中,构建训练样本,可以包括:
根据注音词典对样本语句中的字符进行注音得到字符对应的音素、并对字符对应的音素与字符和标点进行对齐处理,所述样本语句中字符的长度与音素的长度和标点的长度相同。
根据第二方面的第一种可能的实现方式,在第二种可能的实现方式中,对字符对应的音素与字符和标点进行对齐处理,包括:
对于中文中的多音字,从多音字对应的多个音素中任选一个音素作为多音字对应的音素;也就是说,对齐后的中文中的多音字对应的音素为,多音字对应的多个音素中的任意一个;
对于英文字符,在字符中添加对齐字符与字符对应的音素的长度对齐;对齐后的英文字符中包括对齐字符,对齐后的英文字符的长度和英文字符对应的音素的长度相同;若字符之后没有标点,则设置字符对应的标点为blank,使得标点的长度与字符的长度对齐;对于对齐之前没有标点的字符,对齐后的标点为blank。
通过构建训练样本集的过程中,将样本语句中字符的长度和注音后的音素的长度、标点的长度进行对齐,采用本申请的实施方式构建的训练样本集对神经网络模型进行训练后,神经网络模型可以同时进行音素到字符的转换、以及标点预测,从而可以同时输出预测的字符和标点结果。
根据第二方面的第三种可能的实现方式中,采用所述训练样本对第二多任务神经网络模型进行训练得到第一多任务神经网络模型,包括:
将训练样本输入第二多任务神经网络模型,确定所述训练样本对应的字符概率矩阵和标点概率矩阵;
根据字符概率矩阵和标点概率矩阵,分别计算字符交叉熵损失和标点交叉熵损失;
根据字符交叉熵损失、字符交叉熵损失对应的第一权值和标点交叉熵损失、标点交叉熵损失对应的第二权值,计算加权交叉熵损失;
根据所述加权交叉熵损失调整第二多任务神经网络模型的参数,得到训练后的第一多任务神经网络模型。
本申请的多任务神经网络模型的训练方法,可以实现同时对字符预测和标点预测的任务进行训练。另外,由于构建的训练样本集中包括多种语言,因此,本申请的多任务神经网络模型的训练方法还可以实现对多种语言识别(预测)的任务进行训练。根据本申请的实施方式的多任务神经网络模型的训练方法进行训练得到的多任务神经网络模型,可以同时进行多种语言和标点的预测,并且多任务神经网络模型相比于传统的声学模型尺寸小,可以在端侧部署。
第三方面,本申请的实施例提供了一种语音识别装置,所述装置包括:
输入模块,用于将待识别的音素输入到第一多任务神经网络模型中;
推理模块,用于采用所述第一多任务神经网络模型输出第一预测结果,所述第一预测结果包括所述待识别的音素对应的字符预测结果和标点预测结果;
显示模块,用于根据所述第一预测结果将所述第一预测结果的至少一部分显示在所述终端设备的显示屏上。
本申请的实施方式的语音识别装置,通过构建一个用于同时预测音素对应的字符和标点的神经网络模型,将待识别的语音转换后的音素(向量)作为神经网络模型的输入,进行正向推理,可以同时输出音素对应的字符和标点,并且神经网络模型尺寸小,可以在端侧部署。
根据第三方面的第一种可能的实现方式中,所述第一多任务神经网络模型为采用训练样本对第二多任务神经网络模型进行训练得到的,所述训练样本包括:样本语句,所述样本语句中包括字符,所述训练样本还包括:样本语句中的字符对应的音素、标点。
通过构建一个用于同时预测音素对应的字符和标点的神经网络模型(即第二多任务神经网络模型),并构建训练样本集对神经网络模型进行训练,得到训练后的神经网络模型(即第一多任务神经网络模型),训练过程中可以不需要进行分词处理,将待识别的语音转换后的音素(向量)作为训练后的神经网络模型的输入,进行正向推理,可以同时输出音素对应的字符和标点。
根据第三方面的第一种可能的实现方式,在第二种可能的实现方式中,所述样本语句中字符的长度与音素的长度和标点的长度相同。
通过构建训练样本集的过程中,将样本语句中字符的长度和注音后的音素的长度、标点的长度进行对齐,采用本申请的实施方式构建的训练样本集对神经网络模型进行训练后,神经网络模型可以同时进行音素到字符的转换、以及标点预测,从而可以同时输出预测的字符和标点结果。
根据第三方面的第三种可能的实现方式中,所述第一多任务神经网络模型为流式网络结 构,所述输入模块,包括:第一输入单元,用于将待识别的音素循环送入第一多任务神经网络模型中;所述推理模块,包括:第一推理单元,用于采用所述第一多任务神经网络模型基于当前输入的待识别的音素的长度输出所述第一预测结果。使得待识别的音素的预测结果既参考了之前的音素、又参考了之后的音素,提高了预测的准确率。
根据第三方面的第三种可能的实现方式,在第四种可能的实现方式中,所述第一输入单元还用于:在完成将全部待识别的音素输入第一多任务神经网络模型之前,如果当前输入的音素的长度小于感受野,则终端设备继续输入下一个音素;在完成将全部待识别的音素输入第一多任务神经网络模型之前,如果当前输入的音素的长度不小于感受野,则第一推理单元用于根据当前输入的音素的字符和标点,得到当前输入的音素的第一个音素的第二预测结果,并存储第二预测结果;第一输入单元还用于将所述第一个音素的特征向量、当前输入的音素中除了第一个音素以外的音素和待识别的音素中的下一个音素继续输入第一多任务神经网络模型。述第一推理单元还用于:在完成将全部待识别的音素输入第一多任务神经网络模型时,根据当前输入的音素的字符和标点,得到当前输入的音素的第二预测结果;若不存在已存储的第二预测结果,则将当前输入的音素的第二预测结果作为待识别的音素的第以预测结果;若存在已存储的第二预测结果,则根据当前输入的音素的第二预测结果和已存储的第二预测结果,得到待识别的音素的第二预测结果。
根据本申请上述实施方式的语音识别装置,将声学模型输出的待识别的音素循环送入流式网络结构的第一多任务神经网络模型,使得待识别的音素的预测结果既参考了之前的音素、又参考了之后的音素,提高了预测的准确率。
根据第三方面的第五种可能的实现方式中,所述第一多任务神经网络模型为非流式网络结构,所述推理模块,包括:第二推理单元,用于采用所述第一多任务神经网络模型基于待识别的音素的总长度和音素长度阈值的关系,输出所述第一预测结果。
根据第三方面的第五种可能的实现方式,在第六种可能的实现方式中,所述第二推理单元还用于:若待识别的音素的总长度小于音素长度阈值,采用所述第一多任务神经网络模型根据根据全部的待识别的音素,输出所述第一预测结果;
若待识别的音素的总长度不小于音素长度阈值,在完成将全部待识别的音素输入第一多任务神经网络模型之前:如果当前输入的音素的长度小于音素长度阈值,则继续输入下一个音素;如果当前输入的音素的长度不小于音素长度阈值,则根据当前输入的音素的字符和标点,得到当前输入的音素的第一个音素的第二预测结果并存储第二预测结果,将当前输入的音素中除了第一个音素以外的音素和待识别的音素中的下一个音素继续输入第一多任务神经网络模型;
若待识别的音素的总长度不小于音素长度阈值,在完成将全部待识别的音素输入第一多任务神经网络模型时,根据当前输入的音素的字符和标点,得到当前输入的音素的第二预测结果;
若不存在已存储的第二预测结果,则将当前输入的音素的第二预测结果作为待识别的音 素的第一预测结果;
若存在已存储的第二预测结果,则根据当前输入的音素的第二预测结果和已存储的第二预测结果,得到待识别的音素的第一预测结果。
根据本申请上述实施方式的语音识别装置,采用非流式网络结构,无需将已经预测了结果的音素重新输入网络模型中,相比于流式网络结构,非流式网络结构不需要缓存已经预测的历史结果,减少占用内存空间,可以进一步减小神经网络模型的尺寸,易于在端侧进行部署。并且,由于计算过程中,不需要对历史结果和当前输入的音素进行拼接、切分等操作,能够加快推理速度,在长语音识别中,实现实时输出的效果显著。
第四方面,本申请的实施例提供了一种神经网络模型训练装置,所述装置包括:
构建模块,用于构建训练样本,所述训练样本包括:样本语句,所述样本语句中包括字符,所述训练样本还包括:样本语句中的字符对应的音素、标点;
训练模块,用于采用所述训练样本对第二多任务神经网络模型进行训练得到第一多任务神经网络模型;其中,第二多任务神经网络模型和第一多任务神经网络模型都能够输出第一预测结果、显示所述第一预测结果的至少一部分,所述第一预测结果包括字符预测结果和标点预测结果。
本申请的实施方式的神经网络训练装置,通过构建一个用于同时预测音素对应的字符和标点的神经网络模型,并构建训练样本集对神经网络模型进行训练,得到训练后的神经网络模型,训练过程中可以不需要进行分词处理,将待识别的语音转换后的音素(向量)作为训练后的神经网络模型的输入,进行正向推理,可以同时输出音素对应的字符和标点,并且神经网络模型尺寸小,可以在端侧部署。
根据第四方面的第一种可能的实现方式中,所述构建模块,包括:
对齐单元,用于根据注音词典对样本语句中的字符进行注音得到字符对应的音素、并对字符对应的音素与字符和标点进行对齐处理,所述样本语句中字符的长度与音素的长度和标点的长度相同。根据第四方面的第一种可能的实现方式,在第二种可能的实现方式中,所述对齐单元还用于:
对于中文中的多音字,从多音字对应的多个音素中任选一个音素作为多音字对应的音素;对齐后的中文中的多音字对应的音素为,多音字对应的多个音素中的任意一个;
对于英文字符,在字符中添加对齐字符与字符对应的音素的长度对齐;对齐后的英文字符中包括对齐字符,对齐后的英文字符的长度和英文字符对应的音素的长度相同;
若字符之后没有标点,则设置字符对应的标点为blank,使得标点的长度与字符的长度对齐;对于对齐之前没有标点的字符,对齐后的标点为blank。
通过构建训练样本集的过程中,将样本语句中字符的长度和注音后的音素的长度、标点的长度进行对齐,采用本申请的实施方式构建的训练样本集对神经网络模型进行训练后,神经网络模型可以同时进行音素到字符的转换、以及标点预测,从而可以同时输出预测的字符和标点结果。
根据第四方面的第三种可能的实现方式中,所述训练模块,包括:
确定单元,用于将训练样本输入第二多任务神经网络模型,确定所述训练样本对应的字符概率矩阵和标点概率矩阵;
第一计算单元,用于根据字符概率矩阵和标点概率矩阵,分别计算字符交叉熵损失和标点交叉熵损失;
第二计算单元,用于根据字符交叉熵损失、字符交叉熵损失对应的第一权值和标点交叉熵损失、标点交叉熵损失对应的第二权值,计算加权交叉熵损失;
调整单元,用于根据所述加权交叉熵损失调整第二多任务神经网络模型的参数,得到训练后的第一多任务神经网络模型。
本申请的多任务神经网络模型的训练装置,可以实现同时对字符预测和标点预测的任务进行训练。另外,由于构建的训练样本集中包括多种语言,因此,本申请的多任务神经网络模型的训练方法还可以实现对多种语言识别(预测)的任务进行训练。根据本申请的实施方式的多任务神经网络模型的训练装置进行训练得到的多任务神经网络模型,可以同时进行多种语言和标点的预测,并且多任务神经网络模型相比于传统的声学模型尺寸小,可以在端侧部署。
第五方面,本申请的实施例提供了一种语音识别装置,包括:处理器;用于存储处理器可执行指令的存储器;其中,所述处理器被配置为执行所述指令时实现第一方面或者第一方面的多种可能的实现方式中的一种或几种的语音识别方法。
第六方面,本申请的实施例提供了一种神经网络模型训练装置,包括:处理器;用于存储处理器可执行指令的存储器;其中,所述处理器被配置为执行所述指令时实现第二方面或者第二方面的多种可能的实现方式中的一种或几种的神经网络模型训练方法。
第七方面,本申请的实施例提供了一种非易失性计算机可读存储介质,其上存储有计算机程序指令,所述计算机程序指令被处理器执行时实现第一方面或者第一方面的多种可能的实现方式中的一种或几种的语音识别方法,或者,实现第二方面或者第二方面的多种可能的实现方式中的一种或几种的神经网络模型训练方法。
本申请的这些和其他方面在以下(多个)实施例的描述中会更加简明易懂。
附图说明
包含在说明书中并且构成说明书的一部分的附图与说明书一起示出了本申请的示例性实施例、特征和方面,并且用于解释本申请的原理。
图1示出根据本申请一实施方式的语音识别方法的应用场景。
图2为本申请实施例提供的语音识别模型训练装置的组成结构示意图。
图3示出的是与本申请实施例提供的手机的部分结构的框图。
图4是本申请实施例的手机100的软件结构示意图。
图5a示出根据本申请一实施方式的神经网络模型的框图。
图5b示出示出根据本申请一示例的编码器-解码器模型的示意图。
图5c示出示出根据本申请一示例的编码器模型的示意图。
图6示出根据本申请一实施方式的构建训练样本集的过程的示意图。
图7示出根据本申请一实施例的构建训练样本集的过程的示例。
图8示出根据本申请一实施方式的多任务神经网络模型训练方法的流程图。
图9a示出根据本申请一实施方式的终端设备侧进行语音识别的应用场景的示意图。
图9b示出根据本申请一示例的现有技术进行语音识别的过程的示意图。
图10示出根据本申请一实施方式的语音识别方法的流程图。
图11示出根据本申请一实施方式的语音识别方法的流程图。
图12示出根据本申请一实施方式的语音识别方法的流程图。
图13示出根据本申请一实施方式的语音识别方法的流程图。
图14示出根据本申请一实施例的语音识别装置的框图。
图15示出根据本申请一实施例的神经网络模型训练装置的框图。
具体实施方式
以下将参考附图详细说明本申请的各种示例性实施例、特征和方面。附图中相同的附图标记表示功能相同或相似的元件。尽管在附图中示出了实施例的各种方面,但是除非特别指出,不必按比例绘制附图。
在这里专用的词“示例性”意为“用作例子、实施例或说明性”。这里作为“示例性”所说明的任何实施例不必解释为优于或好于其它实施例。
另外,为了更好的说明本申请,在下文的具体实施方式中给出了众多的具体细节。本领域技术人员应当理解,没有某些具体细节,本申请同样可以实施。在一些实例中,对于本领域技术人员熟知的方法、手段、元件和电路未作详细描述,以便于凸显本申请的主旨。
传统语音识别通过基于统计的N-Gram语言模型实现音素转字符(中文指的是拼音转汉字),该方法需要的模型较大,一般为GB级别,无法在端侧部署。
传统的标点预测在语音识别结束之后进行,特别是在长语音识别中,无法在输出转换后的字符的同时输出标点。相关技术中,将标点符号作为词的一部分构造训练文本和词典文件,对语言模型进行训练,达到输出文本的同时输出标点符号的效果。但声学模型采用三元文法模型,在训练过程中需要对句子进行分词处理,声学模型采用高斯混合模型和隐马尔科夫模 型对音素进行对齐处理,等等,处理过程比较复杂,导致声学模型同样存在模型较大,无法在端侧部署;并且由于是采用声学模型进行标点预测,无法根据上下文进行调整,预测准确性不高。
因此,相关的语音识别技术中存在模型无法在端侧部署、采用声学模型预测标点的预测准确性不高的技术问题。
为了解决上述技术问题,本申请提出了一种语音识别方法。图1示出根据本申请一实施方式的语音识别方法的应用场景。如图1所示,终端设备(包括终端设备10-1和终端设备10-2)上设置有语音识别软件的客户端,用户通过所设置的语音识别软件客户端可以输入相应的待语音识别语句,聊天客户端也可以接收相应的语音识别结果,并将所接收的语音识别结果向用户进行展示,或者执行与语音指令相匹配的任务。终端设备通过网络300连接服务器200,网络300可以是广域网或者局域网,又或者是二者的组合,可以使用有线或者无线链路实现数据传输,图1中采用无线链路传输数据的方式仅仅是本申请的一个示例,不以任何方式限制本申请。
作为一个示例,服务器200用于布设语音识别模型并对所述语音识别模型进行训练,并将经过训练的语音识别模型部署在相应的终端设备中,并通过终端设备利用所部署的语音识别模型对媒资类使用环境中的语音信息进行处理。其中,语音识别模型可以是本申请实施例提供的第二多任务神经网络模型或者第一多任务神经网络模型,在服务器200上部署的训练之前的语音识别模型可以为第二多任务神经网络模型,进行训练后并部署在终端设备中的语音识别模型可以为第一多任务神经网络模型。第二多任务神经网络模型和第一多任务神经网络模型都融入了可以同时对字符和标点进行准确预测的多个任务,模型尺寸小,可以部署在端侧。
当然在通过语音识别模型对语音信息进行处理以生成相应的语音识别结果之前,还需要对语音识别模型进行训练,具体包括:构建训练样本,所述训练样本包括:样本语句,所述样本语句中包括字符,所述训练样本还包括:样本语句中的字符对应的音素、标点;采用所述训练样本对第二多任务神经网络模型进行训练得到第一多任务神经网络模型。
其中,本申请实施例所提供的语音识别方法是基于人工智能实现的,人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个综合技术,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。
人工智能技术是一门综合学科,涉及领域广泛,既有硬件层面的技术也有软件层面的技术。人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、语音处理技术、自然语言处理技术以及机器学习/深度学习等几大方向。
在本申请实施例中,主要涉及的人工智能软件技术包括上述语音处理技术和机器学习等方向。例如,可以涉及语音技术(Speech Technology)中的语音识别技术(Automatic Speech Recognition,ASR),其中包括语音信号预处理(Speech signal preprocessing)、语音信号频域分析(Speech signal frequency analyzing)、语音信号特征提取(Speech signal feature extraction)、语音信号特征匹配/识别(Speech signal feature matching/recognition)、语音的训练(Speech training)等。
例如可以涉及机器学习(Machine learning,ML),机器学习是一门多领域交叉学科,涉及概率论、统计学、逼近论、凸分析、算法复杂度理论等多门学科。专门研究计算机怎样模拟或实现人类的学习行为,以获取新的知识或技能,重新组织已有的知识结构使之不断改善自身的性能。机器学习是人工智能的核心,是使计算机具有智能的根本途径,其应用遍及人工智能的各个领域。机器学习通常包括深度学习(Deep Learning)等技术,深度学习包括人工神经网络(artificial neural network),例如卷积神经网络(Convolutional Neural Network,CNN)、循环神经网络(Recurrent Neural Network,RNN)、深度神经网络(Deep neural network,DNN)等。
可以理解的是,本申请提供的语音识别模型训练方法以及语音识别可以应用于智能设备(Intelligent device)上,智能设备可以是任何一种具有语音指令识别功能的设备,例如可以是智能终端、智能家居设备(如智能音箱、智能洗衣机等)、智能穿戴设备(如智能手表)、车载智能中控系统(通过语音指令唤醒终端中执行不同任务的小程序)或者AI智能医疗设备(通过语音指令进行唤醒触发)等。
下面对本申请实施例的语音识别模型训练装置的结构做详细说明,语音识别模型训练装置可以各种形式来实施,如带有语音识别模型训练功能的专用终端,也可以为设置有语音识别模型训练功能的服务器,例如图1中的服务器200。图2为本申请实施例提供的语音识别模型训练装置的组成结构示意图,可以理解,图2仅仅示出了语音识别模型训练装置的示例性结构而非全部结构,根据需要可以实施图2示出的部分结构或全部结构。
本申请实施例提供的语音识别模型训练装置包括:至少一个处理器201、存储单元202、用户接口203和至少一个网络接口204。语音识别模型训练装置中的各个组件通过总线系统205耦合在一起。可以理解,总线系统205用于实现这些组件之间的连接通信。总线系统205除包括数据总线之外,还包括电源总线、控制总线和状态信号总线。但是为了清楚说明起见,在图2中将各种总线都标为总线系统205。
其中,用户接口203可以包括显示器、键盘、鼠标、轨迹球、点击轮、按键、按钮、触感板或者触摸屏等。
可以理解,存储单元202可以是易失性存储器或非易失性存储器,也可包括易失性和非易失性存储器两者。本申请实施例中的存储单元202能够存储数据以支持终端设备(如10-1)的操作。这些数据的示例包括:用于在终端设备(如10-1)上操作的任何计算机程序,如操作系统和应用程序。其中,操作系统包含各种系统程序,例如框架层、核心库层、驱动层等, 用于实现各种基础业务以及处理基于硬件的任务。应用程序可以包含各种应用程序。
在一些实施例中,本申请实施例提供的语音识别模型训练装置可以采用软硬件结合的方式实现,作为示例,本申请实施例提供的语音识别模型训练装置可以是采用硬件译码处理器形式的处理器,其被编程以执行本申请实施例提供的语音识别模型训练方法。例如,硬件译码处理器形式的处理器可以采用一个或多个应用专用集成电路(ASIC,Application Specific Integrated Circuit)、DSP、可编程逻辑器件(PLD,Programmable Logic Device)、复杂可编程逻辑器件(CPLD,Complex Programmable Logic Device)、现场可编程门阵列(FPGA,Field-Programmable Gate Array)或其他电子元件。
作为本申请实施例提供的语音识别模型训练装置采用软硬件结合实施的示例,本申请实施例所提供的语音识别模型训练装置可以直接体现为由处理器201执行的软件模块组合,软件模块可以位于存储介质中,存储介质位于存储单元202,处理器201读取存储单元202中软件模块包括的可执行指令,结合必要的硬件(例如,包括处理器201以及连接到总线205的其他组件)完成本申请实施例提供的语音识别模型训练方法。
作为示例,处理器201可以是一种集成电路芯片,具有信号的处理能力,例如通用处理器、数字信号处理器(DSP,Digital Signal Processor),或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等,其中,通用处理器可以是微处理器或者任何常规的处理器等。
作为本申请实施例提供的语音识别模型训练装置采用硬件实施的示例,本申请实施例所提供的装置可以直接采用硬件译码处理器形式的处理器201来执行完成,例如,被一个或多个应用专用集成电路(ASIC,Application Specific Integrated Circuit)、DSP、可编程逻辑器件(PLD,Programmable Logic Device)、复杂可编程逻辑器件(CPLD,Complex Programmable Logic Device)、现场可编程门阵列(FPGA,Field-Programmable Gate Array)或其他电子元件执行实现本申请实施例提供的语音识别模型训练方法。
本申请实施例中的存储单元202用于存储各种类型的数据以支持语音识别模型训练装置的操作。这些数据的示例包括:用于在语音识别模型训练装置上操作的任何可执行指令。
在另一些实施例中,本申请实施例提供的语音识别模型训练装置可以采用软件方式实现,图2示出了存储在存储单元202中的语音识别模型训练装置,其可以是程序和插件等形式的软件,并包括一系列的模块,作为存储单元202中存储的程序的示例,可以包括语音识别模型训练装置,语音识别模型训练装置中包括以下的软件模块:构建模块,用于构建训练样本,所述训练样本包括:样本语句,所述样本语句中包括字符,所述训练样本还包括:样本语句中的字符对应的音素、标点;训练模块,用于采用所述训练样本对第二多任务神经网络模型进行训练得到第一多任务神经网络模型;其中,第二多任务神经网络模型和第一多任务神经网络模型都能够输出第一预测结果、显示所述第一预测结果的至少一部分,所述第一预测结果包括字符预测结果和标点预测结果。
本申请实施例提供的语音识别的方法可以应用于手机、平板电脑、可穿戴设备、车载设备、增强现实(augmented reality,AR)/虚拟现实(virtual reality,VR)设备、笔记本电脑、超级移动个人计算机(ultra-mobile personal computer,UMPC)、上网本、个人数字助理(personal digital assistant,PDA)等终端设备上,还可以应用于数据库、服务器以及基于终端人工智能的服务响应系统,用于响应语音识别请求,本申请实施例对终端设备的具体类型不作任何限制。
例如,所述终端设备可以是WLAN中的站点(STATION,ST),可以是蜂窝电话、无绳电话、会话启动协议(Session Initiation Protocol,SIP)电话、无线本地环路(Wireless Local Loop,WLL)站、个人数字处理(Personal Digital Assistant,PDA)设备、具有无线通信功能的手持设备、计算设备或连接到无线调制解调器的其它处理设备、电脑、膝上型计算机、手持式通信设备、手持式计算设备、和/或用于在无线系统上进行通信的其它设备以及下一代通信系统,例如,5G网络中的移动终端或者未来演进的公共陆地移动网络(Public Land Mobile Network,PLMN)网络中的移动终端等。
作为示例而非限定,当所述终端设备为可穿戴设备时,该可穿戴设备还可以是应用穿戴式技术对日常穿戴进行智能化设计、开发出可以穿戴的设备的总称,如眼镜、手套、手表、服饰及鞋等。可穿戴设备即直接穿在身上,或是整合到用户的衣服或配件的一种便携式设备,通过附着于用户身上,采集用户的语音信号。可穿戴设备不仅仅是一种硬件设备,更是通过软件支持以及数据交互、云端交互来实现强大的功能。广义穿戴式智能设备包括功能全、尺寸大、可不依赖智能手机实现完整或者部分的功能,如智能手表或智能眼镜等,以及只专注于某一类应用功能,需要和其它设备如智能手机配合使用,如各类进行体征监测的智能手环、智能首饰等。
以所述终端设备为手机为例。图3示出的是本申请实施例提供的手机的部分结构的框图。参考图3,手机包括:射频(Radio Frequency,RF)电路110、存储器120、输入单元130、显示单元140、传感器150、音频电路160、近场通信模块170、处理器180、以及电源190等部件。本领域技术人员可以理解,图3中示出的手机结构并不构成对手机的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
下面结合图3对手机的各个构成部件进行具体的介绍:
RF电路110可用于收发信息或通话过程中,信号的接收和发送,特别地,将基站的下行信息接收后,给处理器180处理;另外,将涉及上行的数据发送给基站。通常,RF电路包括但不限于天线、至少一个放大器、收发信机、耦合器、低噪声放大器(Low Noise Amplifier,LNA)、双工器等。此外,RF电路110还可以通过无线通信与网络和其他设备通信。上述无线通信可以使用任一通信标准或协议,包括但不限于全球移动通讯系统(Global System of Mobile communication,GSM)、通用分组无线服务(General Packet Radio Service,GPRS)、码分多址(Code Division Multiple Access,CDMA)、宽带码分多址(Wideband Code Division Multiple Access,WCDMA)、长期演进(Long Term Evolution,LTE)、电子邮件、短消息服务(Short Messaging Service,SMS)等,通过RF电路110接收其他终端采集的语音信号,并对语音信号进行识别,输出对应的文本信息。
存储器120可用于存储软件程序以及模块,处理器180通过运行存储在存储器120的软件程序以及模块,从而执行手机的各种功能应用以及数据处理,例如将训练好的实时语音识别算法存储于存储器120内。存储器120可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需的应用程序(比如声音播放功能、图像播放功能等)等;存储数据区可存储根据手机的使用所创建的数据(比如音频数据、电话本等)等。此外,存储器120可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他易失性固态存储器件。
输入单元130可用于接收输入的数字或字符信息,以及产生与手机100的用户设置以及功能控制有关的键信号输入。具体地,输入单元130可包括触控面板131以及其他输入设备132。触控面板131,也称为触摸屏,可收集用户在其上或附近的触摸操作(比如用户使用手指、触笔等任何适合的物体或附件在触控面板131上或在触控面板131附近的操作),并根据预先设定的程式驱动相应的连接装置。
显示单元140可用于显示由用户输入的信息或提供给用户的信息以及手机的各种菜单,例如输出语音识别后的文本信息。显示单元140可包括显示面板141,可选的,可以采用液晶显示器(Liquid Crystal Display,LCD)、有机发光二极管(Organic Light-Emitting Diode,OLED)等形式来配置显示面板141。进一步的,触控面板131可覆盖显示面板141,当触控面板131检测到在其上或附近的触摸操作后,传送给处理器180以确定触摸事件的类型,随后处理器180根据触摸事件的类型在显示面板141上提供相应的视觉输出。虽然在图3中,触控面板131与显示面板141是作为两个独立的部件来实现手机的输入和输出功能,但是在某些实施例中,可以将触控面板131与显示面板141集成而实现手机的输入和输出功能。
手机100还可包括至少一种传感器150,比如光传感器、运动传感器以及其他传感器。具体地,光传感器可包括环境光传感器及接近传感器,其中,环境光传感器可根据环境光线的明暗来调节显示面板141的亮度,接近传感器可在手机移动到耳边时,关闭显示面板141和/或背光。作为运动传感器的一种,加速计传感器可检测各个方向上(一般为三轴)加速度的大小,静止时可检测出重力的大小及方向,可用于识别手机姿态的应用(比如横竖屏切换、相关游戏、磁力计姿态校准)、振动识别相关功能(比如计步器、敲击)等;至于手机还可配置的陀螺仪、气压计、湿度计、温度计、红外线传感器等其他传感器,在此不再赘述。
音频电路160、扬声器161,传声器162可提供用户与手机之间的音频接口。音频电路160可将接收到的音频数据转换后的电信号,传输到扬声器161,由扬声器161转换为声音信号输出;另一方面,传声器162将收集的声音信号转换为电信号,由音频电路160接收后转换为音频数据,再将音频数据输出处理器180处理后,经RF电路110以发送给比如另一手机,或者将音频数据输出至存储器120以便进一步处理。例如,终端设备可以通过传声器162,采集用户的目标语音信号,并将转换后的电信号发送给终端设备的处理器进行语音识别。
终端设备可以通过近场通信模块170接收其他设备发送的语音信号,例如该近场通信模块170集成有蓝牙通信模块,通过蓝牙通信模块与可穿戴设备建立通信连接,并接收可穿戴设备反馈的目标语音信号。虽然图3示出了近场通信模块170,但是可以理解的是,其并不属于手机100的必须构成,完全可以根据需要在不改变申请的本质的范围内而省略。
处理器180是手机的控制中心,利用各种接口和线路连接整个手机的各个部分,通过运行或执行存储在存储器120内的软件程序和/或模块,以及调用存储在存储器120内的数据,执行手机的各种功能和处理数据,从而对手机进行整体监控。可选的,处理器180可包括一个或多个处理单元;优选的,处理器180可集成应用处理器和调制解调处理器,其中,应用处理器主要处理操作系统、用户界面和应用程序等,调制解调处理器主要处理无线通信。可以理解的是,上述调制解调处理器也可以不集成到处理器180中。
手机100还包括给各个部件供电的电源190(比如电池),优选的,电源可以通过电源管理系统与处理器180逻辑相连,从而通过电源管理系统实现管理充电、放电、以及功耗管理等功能。
图4是本申请实施例的手机100的软件结构示意图。以手机100操作系统为Android系统为例,在一些实施例中,将Android系统分为四层,分别为应用程序层、应用程序框架层(framework,FWK)、系统层以及硬件抽象层,层与层之间通过软件接口通信。
如图4所示,所述应用程序层可以包括一系列应用程序包,应用程序包可以包括短信息,日历,相机,视频,导航,图库,通话等应用程序。特别地,语音识别算法可以嵌入至应用程序内,通过应用程序内的相关控件启动语音识别流程,并处理采集到的目标语音信号,得到对应的文本信息。
应用程序框架层为应用程序层的应用程序提供应用编程接口(application programming interface,API)和编程框架。应用程序框架层可以包括一些预先定义的函数,例如用于接收应用程序框架层所发送的事件的函数。
如图4所示,应用程序框架层可以包括窗口管理器、资源管理器以及通知管理器等。
窗口管理器用于管理窗口程序。窗口管理器可以获取显示屏大小,判断是否有状态栏,锁定屏幕,截取屏幕等。内容提供器用来存放和获取数据,并使这些数据可以被应用程序访问。所述数据可以包括视频,图像,音频,拨打和接听的电话,浏览历史和书签,电话簿等。
资源管理器为应用程序提供各种资源,比如本地化字符串,图标,图片,布局文件,视频文件等等。
通知管理器使应用程序可以在状态栏中显示通知信息,可以用于传达告知类型的消息,可以短暂停留后自动消失,无需用户交互。比如通知管理器被用于告知下载完成,消息提醒等。通知管理器还可以是以图表或者滚动条文本形式出现在系统顶部状态栏的通知,例如后台运行的应用程序的通知,还可以是以对话窗口形式出现在屏幕上的通知。例如在状态栏提示文本信息,发出提示音,电子设备振动,指示灯闪烁等。
应用程序框架层还可以包括:
视图系统,所述视图系统包括可视控件,例如显示文字的控件,显示图片的控件等。视图系统可用于构建应用程序。显示界面可以由一个或多个视图组成的。例如,包括短信通知图标的显示界面,可以包括显示文字的视图以及显示图片的视图。
电话管理器用于提供手机100的通信功能。例如通话状态的管理(包括接通,挂断等)。
系统层可以包括多个功能模块。例如:传感器服务模块,物理状态识别模块,三维图形处理库(例如:OpenGL ES)等。
传感器服务模块,用于对硬件层各类传感器上传的传感器数据进行监测,确定手机100的物理状态;
物理状态识别模块,用于对用户手势、人脸等进行分析和识别;
三维图形处理库用于实现三维图形绘图,图像渲染,合成,和图层处理等。
系统层还可以包括:
表面管理器用于对显示子系统进行管理,并且为多个应用程序提供了2D和3D图层的融合。
媒体库支持多种常用的音频,视频格式回放和录制,以及静态图像文件等。媒体库可以支持多种音视频编码格式,例如:MPEG4,H.264,MP3,AAC,AMR,JPG,PNG等。
硬件抽象层是硬件和软件之间的层。硬件抽象层可以包括显示驱动、摄像头驱动、传感器驱动、麦克风驱动等,用于驱动硬件层的相关硬件,如显示屏、摄像头、传感器以及麦克风等。特别地,通过麦克风驱动启动麦克风模块,采集用户的目标语音信息,以便执行后续的语音识别流程。
需要说明的是,本申请实施例提供的语音识别的方法可以在上述任一层级中执行,在此不做限定。
本申请的实施方式的语音识别方法,通过构建一个用于同时预测音素对应的字符和标点的神经网络模型,并构建训练样本集对神经网络模型进行训练,得到训练后的神经网络模型,训练过程中可以不需要进行分词处理,将待识别的语音转换后的音素(向量)作为训练后的神经网络模型的输入,进行正向推理,可以同时输出音素对应的字符和标点,并且神经网络模型尺寸小,可以在端侧部署。
本文中所表述的“同时”、“同时输出”等,可以理解为是能够从神经网络模型的输出中获得两种信息(如音素对应的字符信息和音素对应的标点信息),而不仅仅是获得一种信息,并不限制两种信息被获得的时间先后关系,换句话说,本文中所述的“同时”并不限定时间上一定要是相同时刻。
图5a示出根据本申请一实施方式的神经网络模型的框图。如图5a所示,神经网络模型的输入为由待识别的语音转换后得到的音素对应的label序列,神经网络模型可以对label序列进行特征提取,具体地,可以通过embedding层对label序列进行特征提取得到对应的特征向量,然后根据特征向量预测音素对应的字符和标点,同时输出语音对应的字符和标点。在本申请的实施方式中,神经网络模型可以同时完成多个任务,因此,下文中称作多任务神经网络模型。
具体地,在本申请的实施方式中,神经网络模型在根据特征向量预测音素对应的字符和标点时,可以采用分类器预测每个音素对应的字符和标点,从而实现同时输出字符和标点,而且同时实现字符和标点预测的多任务神经网络模型能够部署在端侧。
在本申请的实施方式中,标点可以包括blank以及逗号、句号、问号和感叹号等。其中,逗号、句号、问号和感叹号还可以分为中文的全角和英文的半角两种形式。
一个中文字符可能会有多个拼音,英文字符对应多个英文音素,这样就会导致音素和字符的长度不一致,一个句子中标点的数量可能与字符、音素的长度也不相同,也就是输入序列和输出序列的长度不一致,现有技术无法同时输出预测结果。使用encoder-decoder(编码器-解码器)虽然可以解决输入序列和输出序列长度不一致的问题,但当前的输出必须依赖于前一个输出。图5b示出根据本申请一示例的编码器-解码器模型的示意图。如图5b所示,比如说,编码器-解码器模型中encoder的输入序列为“X1X2X3X4”,encoder编码为向量C输出到decoder,通过decoder进行解码,得到长度为3的输出序列“Y1Y2Y3”,输出“Y2”之前必须先输出“Y1”,不能同时输出“Y1Y2Y3”,这就导致在实时输出识别结果上效果不佳。
图5c示出根据本申请一示例的编码器模型的示意图。如图5c所示,编码器模型可以包括编码器和Softmax分类器,其中编码器用于对输入序列进行编码得到特征向量C,Softmax分类器可以根据特征向量C得到输出序列。在图5c的示例中,根据输入序列“X1X2X3X4”可以同时输出“Y1Y2Y3Y4”,但是只能应用于输入序列和输出序列长度相同的场景。
本申请的实施方式提供了一种训练样本集的构建方法,本申请实施方式的训练样本集的构建方法,将样本语句中字符的长度和注音后的音素的长度、标点的长度进行对齐,本申请的实施方式构建的上述神经网络结构中可以采用图5c所示的编码器模型实现对音素到字符、标点的转换,由于编码器模型适用于输入序列和输出序列长度相同的场景,因此,采用本申请的实施方式构建的训练样本集对神经网络模型进行训练后,神经网络模型可以同时进行中英文音素到字符的转换、以及标点预测,并且解决了上述相关技术中输入和输出的长度不相同的情况下,无法同时输出结果的技术问题。
本申请的实施方式还提供了一种多任务神经网络模型的训练方法,将训练样本集中的训练样本输入到第二多任务神经网络模型中进行训练,得到训练后的第一多任务神经网络模型,其中,第二多任务神经网络模型和第一多任务神经网络模型中融入了标点预测和字符预测,在实时生成字符的同时,也实时生成标点,实现多任务同时训练,并且第一多任务神经网络模型的尺寸小,可以在端侧部署。
下面按照训练样本集构建、神经网络模型训练和神经网络模型推理的过程,对本申请的语音识别方法进行说明。为了清楚的描述本申请提供的各实施方式,将训练之前的神经网络模型称作第二多任务神经网络模型,训练之后得到的神经网络模型称作第一多任务神经网络模型。其中,“第一”和“第二”仅仅是为了区分不同的特征,并不表示特定的顺序或者大小关系。
训练样本集构建
图6示出根据本申请一实施方式的构建训练样本集的过程的示意图。如图6所示,可以构建注音词典,构建的注音词典可以包括词典以及音素字符映射表。其中,词典可以包括一种语言或者多种语言,比如说可以包括中文词典或者英文词典,或者中英文混合词典,或者其他多种语言的混合词典,本申请对此不作限定。对于一种语言的词典,还可以包括多个不同的词典,多个不同的词典可以是根据语言的特点进行分类得到的,以中文词典为例,还可以细分为生僻字词典,多音字词典,成语词典,人名词典等,进一步根据语言的特点对词典进行细分,有助于提高训练效果,提高预测的准确性。
音素字符映射表用于存储字符和对应的音素之间的对应关系,一个字符可以对应一个或多个音素,处理器可以根据音素字符映射表对字符进行注音得到字符对应的音素。举例来说,对于中文字符,由于存在多音字,一个中文字符可以对应一个或多个音素;对于英文字符,由于一些英文单词包括多个音节,一个英文字符也可以对应一个或多个音素;处理器可以根据字符查找音素字符映射表,确定字符对应的一个或多个音素。
用于构建训练样本集的语料可以是单一的一种语言,也可以指包括多种语言的混合语料。处理器可以根据注音词典对语料中的字符进行注音得到字符对应的音素、并对字符对应的音素与字符和标点进行对齐处理,字符的长度和标点的长度与对应的音素的长度相同。
在一种可能的实现方式中,处理器可以逐个为语料中的字符进行注音得到字符对应的音素,并判断字符的长度和字符对应的音素的长度是否相同,如果不相同,则处理器可以将字符的长度和字符对应的音素的长度进行对齐处理。或者,处理器也可以先对语料中的所有字符进行注音得到对应的音素,然后对字符对应的音素和字符进行对齐处理。本申请的实施方式对注音和对齐步骤执行的顺序不作限定。
在一种可能的实现方式中,对于不同的语言,对齐处理的方式可以不同。比如说,对于中文中的多音字,处理器可以从多个音素中任选一个音素作为字符对应的音素,也就是说,对齐后的中文中的多音字对应的音素为,多音字对应的多个音素中的任意一个;对于英文字符,处理器可以在字符中添加对齐字符进行对齐,对齐后的英文字符中包括对齐字符,对齐后的英文字符的长度和英文字符对应的音素的长度相同;在字符中添加对齐字符时,对齐字符的位置可以位于字符之前或者之后,本申请对此不作限定。其中,对齐字符可以为除了英文字母以外的任何符号,比如说,对齐字符可以为“@”、“*”、“&”或者“%”等。
在一种可能的实现方式中,在注音时,若注音词典中没有与语料中的英文字符相同的字符,则处理器可以对语料中的英文字符进行拆分,得到多个独立的子字符;若注音词典中存在与子字符相同的字符,则处理器可以对子字符进行注音并对齐。
在一种可能的实现方式中,一个字符对应一个标点,在本申请的实施方式中,除了对音素和字符进行对齐处理,处理器还可以对字符和标点进行对齐处理,如上文所述,标点可以包括blank以及逗号、句号、问号和感叹号等,若原字符之后没有标点,则可以设置字符对应的标点为blank,对于对齐之前没有标点的字符,对齐后的标点为blank,使得标点的长度与字符的长度对齐,则在输出时该字符对应的标点为blank。处理器可以同时对字符、音素和标点进行对齐处理,也可以分步骤进行,本申请对此不作限定。
在一种可能的实现方式中,本申请的训练样本集构建方法还可以对多条同时进行训练的不同语句的长度进行对齐。比如说,如果多条语句同时进行训练,也就是batch size>1时,若多条同时训练的语句的长度不同,则可以在长度短的语句对应的字符、音素和标点后面补齐Null,长度短的语句的字符、音素和标点在补齐Null后长度与最长的句子的长度相同。
举例来说,有两条语句一起训练,分别为:
你好!
真不错。
“你好!”这条语句只有两个字,长度比较短,可以在语句对应的字符、音素和标点后用“Null”补齐,因此在本实施例中,该语句对应的标点为“[BLANK]![Null]”;“真不错。”的标点是“[BLANK][BLANK]。”,这样可以保证两条语句的长度相同,能够用于训练中。
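作为补充说明,下面给出一个将同批次内较短语句的字符、音素和标点用“Null”补齐到相同长度的示意性代码草图(使用Python编写,其中的函数名与数据内容均为示例性假设,并非本申请的限定实现):

```python
def pad_batch(samples, pad_token="Null"):
    """将同一批次内较短样本的字符/音素/标点序列用 pad_token 补齐到该批次的最大长度。

    samples: 列表,每个元素为 (chars, phones, puncts) 三个等长的列表。
    返回补齐后的批次,保证批内所有样本的长度一致。
    """
    max_len = max(len(chars) for chars, _, _ in samples)
    padded = []
    for chars, phones, puncts in samples:
        pad_n = max_len - len(chars)
        padded.append((
            chars + [pad_token] * pad_n,
            phones + [pad_token] * pad_n,
            puncts + [pad_token] * pad_n,
        ))
    return padded


# 对应正文中同批训练的两条语句(数据内容仅为示例)
batch = [
    (["你", "好"], ["ni3", "hao3"], ["[BLANK]", "!"]),
    (["真", "不", "错"], ["zhen1", "bu2", "cuo4"], ["[BLANK]", "[BLANK]", "。"]),
]
print(pad_batch(batch))
```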
在一种可能的实现方式中,在对语料进行注音并对齐之前,还可以对语料进行预处理。预处理的具体内容可以根据构建的词典和具体的语言类型确定,比如说,如果词典中不包括数字,对语料进行预处理可以包括:将语料中的数字转换成汉字(如图6所示的数字规整);如果词典中的英文单词为大写,则对语料进行预处理还可以包括:将语料中的英文由小写转换为大写(如图6所示的英文字母转换)。预处理还可以包括:繁体转简体、去除特殊字符等处理,预处理的具体方式可以根据注音词典以及语言特点等进行确定,本申请对具体的预处理方式不作限定。
图7示出根据本申请一实施例的构建训练样本集的过程的示例。举例来说,如图7所示,以中英文混合句子“用P30打开CCTV看NBA video。”为例,可以先对语料进行预处理,将数字转换为汉字、将英文由小写转换为大写,可以得到“用P三零打开CCTV看NBA VIDEO。”。
对“用P三零打开CCTV看NBA VIDEO。”进行注音:中文注成拼音,英文注成对应的英文音素,如图7中所示,英文字符“NBA”对应三个音素“en bi ei”,英文字符“VIDEO”对应两个音素“vi diu”,由于英文字符“CCTV”不在注音词典中,可以将“CCTV”拆分为四个独立的子字符,并根据注音词典对子字符分别注音可以得到对应的音素“see see ti vi”,最终得到的音素可以为“yong4 pi san1 ling2 da3 kai1 see see ti vi kan4 en bi ei vi  diu”。
对齐处理:处理器可以在注音过程中进行对齐处理,也可以在注音之后统一进行对齐处理,本申请对此不作限定。如图7所示,本申请的示例中采用的对齐字符可以为“@”,对字符“NBA”进行对齐处理,可以得到“@@NBA”,因为“NBA”对应三个音素;字符“VIDEO”对应两个音素,处理器对字符“VIDEO”进行对齐处理可以得到“@VIDEO”。对于中文的多音字场景,例如:“长头发”,注音可以得到“chang2|zhang3tou2fa4”,这样就会导致汉字“长”对应两个拼音,对于这种中文多音字,本申请的实施方式中,处理器可以随机选择一个作为最终的汉字拼音,以实现对齐处理。字符和音素对齐之后得到的字符的结果如图7中最后一个步骤“用P三零打开CCTV看@@NBA@VIDEO”。
上述示例中,语料最后有一个标点符号:句号,字符VIDEO、音素diu与句号相对应。对于其他没有标点的字符,全部将标点设置为blank,如图7所示,包括15个blank和一个句号,一共16个标点,与16个音素是对齐的。
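为便于理解,下面给出一个体现上述注音与对齐思路的示意性代码草图(假设注音词典lexicon将字符映射到一个或多个候选音素序列,对齐字符采用“@”;其中的函数名与数据结构均为示例性假设,并非本申请的限定实现):

```python
import random

def align(chars_with_punct, lexicon, align_char="@", blank="[BLANK]"):
    """将样本语句中的字符、音素、标点对齐到相同长度。

    chars_with_punct: [(字符, 该字符之后的标点或 None), ...]
    lexicon: 字符 -> 候选音素序列列表,例如
             {"长": [["chang2"], ["zhang3"]], "NBA": [["en", "bi", "ei"]]}
    返回 (对齐后的字符序列, 音素序列, 标点序列),三者长度相同。
    """
    out_chars, out_phones, out_puncts = [], [], []
    for ch, punct in chars_with_punct:
        phones = random.choice(lexicon[ch])     # 中文多音字:从多个注音中任选一个
        pad = len(phones) - 1                   # 英文多音节:需要补入对齐字符
        out_chars += [align_char] * pad + [ch]  # 例如 "NBA" 对齐为 "@","@","NBA"
        out_phones += list(phones)
        out_puncts += [blank] * pad + [punct if punct else blank]  # 无标点处补 blank
    return out_chars, out_phones, out_puncts
```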
通过将字符的长度、标点的长度和字符对应的音素的长度进行对齐处理,采用本申请的实施方式构建得到的训练样本集,对第二多任务神经网络模型进行训练后得到第一多任务神经网络模型,对于输入序列和输出序列的长度不同的场景,通过本申请的对齐处理进行训练后得到的第一多任务神经网络模型也可以根据输入的待识别音素同时输出预测结果,预测结果可以包括待识别音素对应的字符和标点,也就是说,可以实现实时生成字符的同时,也实时生成标点,并且第一多任务神经网络模型的尺寸小,可以在端侧部署。
模型训练
图8示出根据本申请一实施方式的多任务神经网络模型训练方法的流程图。本申请的实施方式提供的多任务神经网络模型训练的方法可以应用于图2所示的装置。
在本申请的实施方式中,进行训练之前,可以从训练样本集中选择训练样本作为多任务神经网络的输入,训练样本的尺寸可以表示为(B,U),其中,B可以表示一次训练的样本数量,U可以表示这批训练样本中最大长度的样本对应的音素的长度。比如说,B可以为128,表示一次训练的样本数量为128句话对应的音素,这128句话中最长的一句话对应的音素的长度为U。
需要说明的是,可以选择多批训练样本输入到第二多任务神经网络模型进行训练,训练的数据量越大,得到的第一多任务神经网络模型在推理时,预测得到的字符和标点准确率更高。
神经网络模型的输入必须是数值,而不能是字符串,因此,在进行训练之前可以将训练样本转换为数值表示的数据。在本申请的实施方式中,可以预先设置词典中每个音素对应的数值作为该音素的标签(label)。在进行训练之前,可以根据音素查找对应的标签,从而将训练样本转换成标签序列,也就是转换成数值表示的向量作为神经网络的输入数据,对神经网络进行训练。
比如说,以一条训练样本(jin1 tian1 tian1 qi4 bu2 cuo4)为例,其尺寸为(1,6),将训练样本转换为label序列可以得到(10,148,148,2456,30,40)。也就是说,每个音素都有对应的标签,音素对应的标签可以为数字,通过将样本序列转换为标签序列,可以将样本序列转换为向量的表示形式参与后续的计算过程。
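例如,可以通过一个音素到标签的映射表完成上述转换,下面是一个示意性代码草图(映射表中的具体数值仅沿用正文示例,并非真实词典的取值):

```python
# 音素 -> 标签的映射表,数值沿用正文示例,仅作说明
phone2label = {"jin1": 10, "tian1": 148, "qi4": 2456, "bu2": 30, "cuo4": 40}

def phones_to_labels(phones, table):
    """把音素序列转换为数值标签序列,作为神经网络的输入数据。"""
    return [table[p] for p in phones]

print(phones_to_labels(["jin1", "tian1", "tian1", "qi4", "bu2", "cuo4"], phone2label))
# 输出: [10, 148, 148, 2456, 30, 40]
```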
如图8所示,本申请提供的多任务神经网络模型训练方法可以包括:
步骤S801,将输入数据输入到第二多任务神经网络模型,确定所述输入数据对应的字符概率矩阵和标点概率矩阵;
步骤S802,根据字符概率矩阵和标点概率矩阵,分别计算字符交叉熵损失和标点交叉熵损失;
步骤S803,根据字符交叉熵损失和标点交叉熵损失,计算加权交叉熵损失;
步骤S804,根据所述加权交叉熵损失调整第二多任务神经网络模型的参数,得到训练后的第一多任务神经网络模型。
其中,在步骤S801中,输入数据可以是对音素转换后的标签序列,也就是待识别的音素对应的向量。
在一种可能的实现方式中,步骤S801中,训练装置可以通过第二多任务神经网络模型对输入数据进行运算,得到输入数据的特征向量;然后,训练装置可以通过第二多任务神经网络模型对特征向量进行运算,预测训练样本对应的字符和标点,得到字符概率矩阵和标点概率矩阵。其中,输入数据可以为上述的训练样本。
在另一种可能的实现方式中,第二多任务神经网络模型中可以包括如图5c所示的编码器模型中的编码器,编码器用于对待识别的音素(输入数据)进行特征提取得到特征向量。
举例来说,编码器可以包括embedding层,训练装置可以通过embedding层对输入数据进行运算,提取特征向量。具体地,训练装置可以根据embedding层具体采用的编码方式以及输入数据进行运算,得到特征向量,如图5c所示的向量C。
在一种可能的实现方式中,对于每一个音素编码后可以通过一个一维向量进行表示,向量的长度可以根据词典中音素的数量确定,比如说,在本申请的示例中,可以采用512个数据表示一个音素。在本申请的实施方式中,可以记录音素对应的标签和音素编码后对应的向量之间的对应关系。
仍然以上述示例为例,输入数据的维度为(1,6),也就是6个音素转换得到的标签,经过编码器处理后,得到的特征向量可以为(1,6,512)。
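以PyTorch为例,上述由标签序列得到特征向量的过程可以用embedding层示意如下(词表大小、向量维度等超参数均为示例性假设,并非本申请的限定实现):

```python
import torch
import torch.nn as nn

vocab_size = 5000                 # 词典中音素(标签)的数量,示例假设
embed = nn.Embedding(vocab_size, 512)

labels = torch.tensor([[10, 148, 148, 2456, 30, 40]])  # 形状 (1, 6) 的标签序列
features = embed(labels)                                # 形状 (1, 6, 512) 的特征向量
print(features.shape)             # torch.Size([1, 6, 512])
```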
在一种可能的实现方式中,第二多任务神经网络模型中还可以包括分类器(如图5c所示的Softmax分类器),训练装置可以通过第二多任务神经网络模型中分类器对特征向量进行分类,可以得到字符概率矩阵和标点概率矩阵。其中,字符概率矩阵中表示的是音素对应的字符的第一概率,标点概率矩阵中表示的是每个音素对应的标点的第二概率。根据字符概率矩阵和标点概率矩阵,可以得到音素对应的字符和标点。在一种可能的实现方式中,可以预先建立字符和字符对应的第一索引值、标点和标点对应的第二索引值的对应关系,形成词表。这样,在对第二多任务神经网络模型进行训练时,或者用第一多任务神经网络模型进行推理时,神经网络模型可以根据得到的字符概率矩阵、标点概率矩阵、以及词表,得到音素对应的字符和标点。
具体地,通过字符概率矩阵可以得到最大的第一概率对应的字符的第一索引值,根据第一索引值和词表可以得到音素对应的字符。通过标点概率矩阵可以得到最大的第二概率对应的标点的第二索引值,根据第二索引值和词表可以得到音素对应的标点。也就是说,通过Softmax分类器得到的是待识别的音素(输入数据)对应的字符的概率矩阵,矩阵中的第一概率表示音素对应的字符为该第一概率对应的字符的概率,可以确定最大的第一概率对应的字符为音素对应的字符。音素对应的标点可以采用同样的方式确定。
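下面给出一个由概率矩阵与词表得到字符和标点的示意性代码草图(其中词表的内容仅为示例性假设):

```python
import numpy as np

# 词表:索引值 -> 字符 / 标点,内容仅为示例假设
char_vocab = ["春", "眠", "不", "觉", "晓"]
punct_vocab = ["[BLANK]", ",", "。", "?", "!"]

def decode(char_probs, punct_probs):
    """char_probs / punct_probs: 形状为 (T, V) 的概率矩阵,每一行对应一个音素。
    取每一行中最大概率对应的索引值,再通过词表得到音素对应的字符和标点。"""
    chars = [char_vocab[i] for i in np.argmax(char_probs, axis=-1)]
    puncts = [punct_vocab[i] for i in np.argmax(punct_probs, axis=-1)]
    return chars, puncts
```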
在一种可能的实现方式中,在步骤S802中,训练装置可以根据交叉熵损失函数和字符概率矩阵,计算字符交叉熵损失。具体的计算公式如下:
y(C) = -∑_{i=1}^{n} log P(c_i)
其中,y(C)表示所有字符的交叉熵损失,P(c_i)可以表示字符c_i对应的第一概率,i表示字符的下标,i的取值范围为1~n,n为正整数。根据以上公式以及字符概率矩阵可以计算得到字符交叉熵损失。
同样的,在步骤S802中,训练装置可以根据交叉熵损失函数和标点概率矩阵,计算标点交叉熵损失。具体的计算公式如下:
y(P) = -∑_{i=1}^{n} log P(p_i)
其中,y(P)表示所有标点的交叉熵损失,P(p_i)可以表示标点p_i对应的第二概率。根据以上公式以及标点概率矩阵可以计算得到标点交叉熵损失。
在一种可能的实现方式中,根据对字符预测的准确率和对标点预测的准确率的要求的不同,可以设置字符交叉熵损失对应的第一权值和标点交叉熵损失对应的第二权值。在步骤S803中,可以根据字符交叉熵损失、第一权值和标点交叉熵损失、第二权值,计算加权交叉熵损失。可以根据以下公式计算加权交叉熵损失:
y(C+P)=w1×y(C)+w2×y(P)
其中,y(C+P)可以表示字符和标点的加权交叉熵损失,w1可以表示第一权值,w2可以表示第二权值。在一种可能的实现方式中,第一权值和第二权值的和为1,也就是说,w2=1-w1。举例来说,假设第一权值为0.7,第二权值可以为0.3。
需要说明的是,以上关于第一权值和第二权值的设置方式和举例仅仅是本申请的一些示例,不以任何方式限制本申请。
得到加权交叉熵损失后,在步骤S804,训练装置可以通过反向传播算法,根据加权交叉熵对第二多任务神经网络模型的权重进行更新,得到训练后的第一多任务神经网络模型。在一种可能的实现方式中,可以使用Adam优化器实现权重更新。
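以PyTorch为例,上述加权交叉熵损失的计算和基于Adam优化器的权重更新可以写成如下示意性代码草图(其中的极简模型结构、第一权值0.7等均为示例性假设,并非本申请的限定实现):

```python
import torch
import torch.nn as nn

class TinyMultiTaskModel(nn.Module):
    """仅用于演示的极简多任务模型:embedding 加上字符、标点两个输出头。"""
    def __init__(self, vocab=5000, dim=512, n_chars=6000, n_puncts=9):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.char_head = nn.Linear(dim, n_chars)
        self.punct_head = nn.Linear(dim, n_puncts)

    def forward(self, x):
        h = self.embed(x)
        return self.char_head(h), self.punct_head(h)

model = TinyMultiTaskModel()
char_loss_fn = nn.CrossEntropyLoss()
punct_loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(inputs, char_targets, punct_targets, w1=0.7):
    """inputs: (B, U) 的标签序列;char_targets/punct_targets: (B, U) 的目标索引。"""
    char_logits, punct_logits = model(inputs)                          # 形状 (B, U, 类别数)
    y_c = char_loss_fn(char_logits.transpose(1, 2), char_targets)      # 字符交叉熵损失 y(C)
    y_p = punct_loss_fn(punct_logits.transpose(1, 2), punct_targets)   # 标点交叉熵损失 y(P)
    loss = w1 * y_c + (1 - w1) * y_p                                   # 加权交叉熵损失 y(C+P)
    optimizer.zero_grad()
    loss.backward()                                                    # 反向传播
    optimizer.step()                                                   # Adam 更新权重
    return loss.item()
```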
本申请的多任务神经网络模型的训练方法,可以实现同时对字符预测和标点预测的任务进行训练。另外,由于构建的训练样本集中包括多种语言,因此,本申请的多任务神经网络模型的训练方法还可以实现对多种语言识别(预测)的任务进行训练。根据本申请的实施方式的多任务神经网络模型的训练方法进行训练得到的多任务神经网络模型,可以同时进行多种语言和标点的预测,并且多任务神经网络模型相比于传统的声学模型尺寸小,可以在端侧部署。
模型推理
在对第二多任务神经网络模型进行训练得到第一多任务神经网络模型后,可以将待识别的音素输入到第一多任务神经网络模型,进行正向推理实现对音素对应的字符和标点进行同时预测和输出。
因此,本申请还提供了一种语音识别方法,可以应用于如图1或者图3所示的终端设备。在得到第一多任务神经网络模型后,可以将第一多任务神经网络模型部署到终端设备中。
图9a示出根据本申请一实施方式的终端设备侧进行语音识别的应用场景的示意图。如图9a所示,终端设备中可以部署有声学模型和神经网络模型(第一多任务神经网络模型)。终端设备可以将采集的语音信号或者接收到的语音信号输入到声学模型中,通过声学模型对语音信号进行处理可以得到语音信号对应的音素并输出到第一多任务神经网络模型中。
图10示出根据本申请一实施方式的语音识别方法的流程图。如图10所示,本申请提供的一种实施方式的语音识别方法可以包括以下步骤:
步骤S901,将待识别的音素输入到第一多任务神经网络模型中,其中,所述第一多任务神经网络模型为采用训练样本对第二多任务神经网络模型进行训练得到的。
所述训练样本包括:样本语句,所述样本语句中包括字符,所述训练样本包括:样本语句中的字符对应的音素、标点。在一种可能的实现方式中,音素、字符和标点的长度相同。
所述第二多任务神经网络模型和第一多任务神经网络模型,都能够输出第一预测结果、显示所述第一预测结果的至少一部分,所述第一预测结果包括字符预测结果和标点预测结果,也就是说,第二多任务神经网络模型和第一多任务神经网络模型可以根据待识别的音素同时 预测待识别的音素对应的字符和标点。
构建训练样本和根据训练样本对第二多任务神经网络模型进行训练得到第一多任务神经网络模型的过程,可以参见上文的描述,不再赘述。
如图9a所示,待识别的音素可以是采用声学模型对待识别的语音信号进行处理后得到的,待识别的语音信号可以是终端设备采集到的信号或者接收到的信号,本申请对此不作限定。
比如说,在一个示例,终端设备打开社交APP,检测到麦克风被打开,并采集到了语音信号;若终端设备检测到了请求将语音信号转换成文字的转换请求,则可以将语音信号输入到声学模型。在另一个示例中,终端设备打开社交APP,接收到了另一终端设备发来的语音信号,并且终端设备检测到了转换请求,则终端设备可以将语音信号输入到声学模型中。声学模型接收到语音信号后,可以对语音信号进行处理得到待识别的音素。终端设备可以将待识别的音素输入到第一多任务神经网络模型中。
在本申请的实施方式中,声学模型输出的待识别的音素可以是音素对应的标签序列。
如图10所示,本申请提供的一种实施方式的语音识别方法还可以包括:
步骤S902,终端设备采用所述第一多任务神经网络模型输出第一预测结果,所述第一预测结果包括所述待识别的音素对应的字符预测结果和标点预测结果;
步骤S903,终端设备根据所述第一预测结果将所述第一预测结果的至少一部分显示在所述终端设备的显示屏上。
根据图5a所示的神经网络模型的框图可知,第一多任务神经网络模型可以进行特征提取,提取待识别音素的特征向量,然后由分类器根据特征向量进行分类,可以预测每一个待识别音素对应的字符和标点,比如说,分类器可以根据输入的待识别音素得到分类,得到对应的字符和标点,并输出预测的结果(第一预测结果)。第一多任务神经网络模型输出第一预测结果后,终端设备可以同时显示预测的字符和标点。或者,第一多任务神经网络模型可以采用图5c所示的编码器模型对输入的待识别音素进行处理,可以得到对应的字符和标点进行同时输出。
图9b示出根据本申请一示例的现有技术进行语音识别的过程的示意图。如图9b所示,传统的音素转字符和标点的方法中可以先将音素映射为字符,然后预测得到对应的标点。在一个示例中,首先可以通过N-Gram语言模型将音素映射为字符,得到字符后,再通过标点预测模型得到标点。需要通过两个模型分别进行字符和标点的预测,无法同时输出字符和标点,并且模型较大,无法在端侧部署。而本申请采用的语音识别方法通过图9a所示的一个神经网络模型可以同时输出字符和标点,并且由于模型简单可以在端侧部署模型。
根据本申请的实施方式提供的语音识别方法,由于在神经网络模型中融入了字符预测和标点预测,通过专门构建的训练样本集对神经网络模型进行训练,并将训练后得到的多任务神经网络模型部署在端侧,即可实现同时输出、显示预测得到的字符和标点。
在一种可能的实现方式中,第一多任务神经网络模型可以为流式网络结构,终端设备将待识别的音素输入到第一多任务神经网络模型中,采用所述第一多任务神经网络模型输出第一预测结果,可以包括:所述终端设备可以将待识别的音素循环送入第一多任务神经网络模型中,采用所述第一多任务神经网络模型基于当前输入的待识别的音素的长度输出所述第一预测结果。
在一种可能的实现方式中,所述终端设备将待识别的音素循环送入第一多任务神经网络模型中,采用所述第一多任务神经网络模型基于当前输入的待识别的音素的长度输出所述第一预测结果,可以包括:
在完成将全部待识别的音素输入第一多任务神经网络模型之前,如果当前输入的音素的长度小于感受野,则终端设备继续输入下一个音素;
在完成将全部待识别的音素输入第一多任务神经网络模型之前,如果当前输入的音素的长度不小于感受野,则终端设备根据当前输入的音素的字符和标点,得到当前输入的音素的第一个音素的第二预测结果,并存储第二预测结果;终端设备将所述第一个音素的特征向量、当前输入的音素中除了第一个音素以外的音素和待识别的音素中的下一个音素继续输入第一多任务神经网络模型;
在完成将全部待识别的音素输入第一多任务神经网络模型时,终端设备根据当前输入的音素的字符和标点,得到当前输入的音素的第二预测结果;
若不存在已存储的第二预测结果,则终端设备将当前输入的音素的第二预测结果作为待识别的音素的第一预测结果;
若存在已存储的第二预测结果,则根据当前输入的音素的第二预测结果和已存储的第二预测结果,得到待识别的音素的第一预测结果。
其中,第二预测结果为一个或几个待识别的音素的最终结果,当前输入的音素中除了第一个音素以外的音素的预测结果都为临时预测结果,因此,终端设备存储第二预测结果,最后对所有的第二预测结果进行融合得到第一预测结果(全部待识别的音素的最终结果)。图11示出根据本申请一实施方式的语音识别方法的流程图。如图11所示,在本申请的实施方式的语音识别方法中,可以根据待识别的音素的长度和第一多任务神经网络模型的感受野的关系,将待识别的音素循环送入第一多任务神经网络模型中进行字符和标点预测。具体可以包括以下过程:
步骤S1100,判断是否已经完成全部待识别的音素的输入;若没有完成全部待识别的音素的输入,则执行步骤S1101;若已经完成全部待识别的音素的输入,则执行步骤S1104;
步骤S1101,判断当前输入的音素的长度是否小于感受野;若当前输入的音素的长度小于感受野,则执行步骤S1102;若当前输入的音素的长度不小于感受野,则执行步骤S1103;
步骤S1102,对当前输入的音素的字符和标点进行预测,得到当前输入的音素的临时结果,并继续输入下一个音素,返回步骤S1100;
步骤S1103,对当前输入的音素的字符和标点进行预测,得到当前输入的音素的第一个音素的最终结果,终端设备可以存储最终结果;终端设备将所述第一个音素的特征向量、当前输入的音素中除了第一个音素以外的音素和待识别的音素中的下一个音素继续输入第一多任务神经网络模型;返回步骤S1100;
步骤S1104,对当前输入的音素的字符和标点进行预测,得到当前输入的音素的最终结果,判断是否存在已存储的最终结果;若存在已存储的最终结果,则执行步骤S1105;若不存在已存储的最终结果,则执行步骤S1106;
步骤S1105,将已存储的最终结果和当前输入的音素的最终结果进行融合得到待识别的音素的最终结果,结束循环;
步骤S1106,将当前输入的音素的最终结果作为待识别的音素的最终结果,结束循环。
在步骤S1100中,终端设备可以根据前面连接的声学模型的输出,判断是否已经完成全部待识别的音素的输入,若声学模型不再输出新的音素,则终端设备可以判断已经将全部的音素输入到了第一多任务神经网络模型;否则,终端设备可以判断没有完成全部待识别的音素的输入。
在一种可能的实现方式中,终端设备上还可以设置有VAD(Voice Activity Detection,语音端点检测),VAD可以检测一段音频什么时候有人声以及什么时候人声结束。在检测到音频中的人声结束后,可以控制声学模型不再输出。
当前输入的音素的长度从刚开始输入时为1,随着逐渐输入更多的音素,当前输入的音素的长度逐渐增加。如果全部的待识别音素的长度大于或者等于感受野,那么,当前输入的音素的长度在增加到与感受野的大小相同时,不再变化,如果有新的输入的音素,那么当前输入的音素的第一个音素不再输入到第一多任务神经网络模型中。如果全部的待识别音素的长度小于感受野,那么当前输入的音素的长度的最大值小于感受野。
举例来说,假设第一多任务神经网络模型的感受野为8,待识别的音素的长度为15。当输入前7个待识别的音素时,当前输入的音素的长度分别为1、2、3、4、5、6、7,而且当前输入的音素的长度小于感受野,从输入第8个待识别的音素开始,当前输入的音素的长度为8,不小于感受野。当输入第9个待识别的音素时,当前输入的音素的长度仍然为8,当前输入的音素分别为2、3、4、5、6、7、8、9。第10个待识别的音素以及以后的待识别的音素也是如此。假设第一多任务神经网络模型的感受野为8,待识别的音素的长度为7,那么,当前输入的音素的长度最大为7,小于感受野。
如果判断当前输入的音素的长度小于感受野,那么终端设备可以执行步骤S1102,对当前输入的音素的字符和标点进行预测,得到当前输入的音素的临时结果。当前输入的音素的长度小于感受野,说明对当前输入的音素进行预测得到的字符和标点还有可能根据之后输入的音素而变化,因此,在当前输入的音素的长度小于感受野时,终端设备可以将对当前输入的音素预测的结果作为临时结果。终端设备可以将声学模型预测的下一个待识别的音素输入到第一多任务神经网络模型中,然后返回步骤S1100,继续判断是否已经完成全部待识别的音素的输入。
举例来说,仍然以上述示例为例,当前输入的音素为第1、2、3、4、5一共5个音素,当前输入的音素的长度为5,小于感受野8,因此,终端设备可以将对第1、2、3、4、5个音素的字符和标点的预测结果作为临时结果,并输入下一个(第6个)待识别的音素,也就是当前输入的音素为第1、2、3、4、5、6一共6个音素。
如果判断当前输入的音素的长度不小于感受野,则终端设备可以执行步骤S1103,对当前输入的音素的字符和标点进行预测,得到当前输入的音素的第一个音素的最终结果,终端设备可以存储最终结果。终端设备对当前输入的音素中除了第一个音素以外的音素的预测结果为临时结果。终端设备可以将在本次预测过程中提取的第一个音素的特征向量、当前输入的音素中除了第一个音素以外的音素和待识别的音素的下一个音素继续输入第一多任务神经网络模型。然后返回步骤S1100,继续判断是否已经完成全部待识别的音素的输入。
举例来说,仍然以上述示例为例,当前输入的音素为第1-8个音素,当前输入的音素的长度为8,等于感受野8,也就是说,不小于感受野8。因此,终端设备可以将对第1个音素的预测的结果作为最终结果,并存储最终结果。终端设备可以将对第2-8个音素的预测的结果作为临时结果。终端设备可以将在本次预测过程中提取的第1个音素的特征向量、第2-8个音素和第9个音素输入到第一多任务神经网络模型中。继续推理,得到第2个(当前输入的音素的第一个音素)的预测的结果作为最终结果,并存储最终结果。终端设备可以将在本次预测过程中提取的第2个音素的特征向量、第3-9个音素和第10个音素输入到第一多任务神经网络模型中,继续推理…,重复以上过程,直到完成全部待识别的音素的输入。
对于步骤S1103,在第一多任务神经网络模型的输入为当前输入的音素以及前一次输入的音素的第一个音素的特征向量时,可以提取当前输入的音素的特征向量,对当前输入的音素的特征向量和前一次输入的音素的第一个音素的特征向量进行拼接操作,对于拼接后得到的特征向量,终端设备可以进行卷积操作进一步提取特征向量、根据提取的特征向量预测结果。举例来说,当前输入为第2-9个音素、以及第1个音素的特征向量。终端设备在进行预测时,可以提取第2-9个音素的特征向量,并对第1个音素的特征向量和第2-9个音素的特征向量进行拼接操作(concat)。对于拼接后得到的特征向量,终端设备可以进行卷积操作进一步提取特征向量、根据提取的特征向量预测结果。终端设备还可以对拼接后得到的特征向量进行剪切操作,将第2个音素对应的特征向量剪切出来,作为下一次预测的输入。
回到步骤S1100,若已经完成了全部待识别的音素的输入,终端设备可以执行步骤S1104,对当前输入的音素的字符和标点进行预测,得到当前输入的音素的最终结果。此时终端设备可以判断是否存在已存储的最终结果,因为,如果全部待识别的音素的长度不小于感受野,那么终端设备已经存储了前面一部分音素的最终结果,如果全部待识别的音素的长度小于感受野,那么终端设备未存储过最终结果。
如果终端设备判断存在已存储的最终结果,终端设备可以执行步骤S1105,将已存储的最终结果和当前输入的音素的最终结果进行融合得到待识别的音素的最终结果,结束循环。具体融合的方式可以是,将当前输入的音素的最终结果与已存储的最终结果进行拼接得到待识别音素的最终结果。如果终端设备判断未存储最终结果,终端设备可以执行步骤S1106,将当前输入的音素的最终结果作为待识别的音素的最终结果,结束循环。
举例来说,仍然以上述示例为例,当前输入的音素为第8、9、10、11、12、13、14、15个音素和第7个音素的特征向量。终端设备可以判断已经完成全部待识别的音素的输入,执行步骤S1104,对第8-15个音素的字符和标点进行预测,得到第8-15个音素的最终结果。终端设备可以判断已存储了第1-7个音素的最终结果。因此,终端设备可以对第1-7个音素的最终结果和第8-15个音素的最终结果进行融合得到第1-15个音素的最终结果。
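上述循环推理的控制流程可以用如下简化的示意性代码草图表示(predict表示第一多任务神经网络模型的一次前向推理,感受野取8仅沿用正文示例;该草图省略了特征向量缓存等细节,并非本申请的限定实现):

```python
RECEPTIVE_FIELD = 8  # 感受野大小,沿用正文示例

def streaming_decode(phonemes, predict):
    """phonemes: 声学模型输出的待识别音素序列。
    predict(window): 对当前输入窗口做一次前向推理,返回窗口内每个位置的 (字符, 标点)。
    返回全部待识别音素的最终预测结果。"""
    final, window = [], []
    for i, p in enumerate(phonemes):
        window.append(p)
        outputs = predict(window)              # 对当前输入的音素进行字符和标点预测
        if i == len(phonemes) - 1:             # 已完成全部待识别音素的输入
            final.extend(outputs)              # 当前结果与已存储的最终结果融合
            break
        if len(window) < RECEPTIVE_FIELD:
            continue                           # 长度小于感受野:本次结果仅为临时结果
        final.append(outputs[0])               # 存储第一个音素的最终结果
        window = window[1:]                    # 第一个音素(的特征)留作缓存,窗口滑动
    return final
```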
根据本申请上述实施方式的语音识别方法,将声学模型输出的待识别的音素循环送入流式网络结构的第一多任务神经网络模型,使得待识别的音素的预测结果既参考了之前的音素、又参考了之后的音素,提高了预测的准确率。
另外,采用流式网络结构,将之前的输入作为buffer送入到网络中,减少模型计算量,实现快速推理。具体地,由于CNN(Convolutional Neural Networks,卷积神经网络)是有感受野的,举例来说,假设卷积层数总共为7层,真实的感受野是15,感受野是以中心位置为参照,左右各需要7个,流式网络结构把历史的7个全部缓存,通过每一层的buffer来缓存历史特征。因此,每次计算时计算8个就可以了,实际的感受野是8,相比于感受野15可以减少计算量。
并且将标点预测和字符预测融入到一个模型中,可以保证在实时生成字符的同时也实时生成标点,不需要等所有的语音识别结果结束之后才进行标点预测,可以同时输出字符和标点。并且多任务神经网络模型相比于传统的声学模型尺寸小,可以在端侧部署。
下面结合一个具体的应用示例,对本申请的语音识别方法进行进一步的说明。
以输入“chun1 mian2 bu4 jue2 xiao3 chu4 chu4 wen2 ti2 niao3”为例,循环输入时,首先输入‘chun1’到神经网络模型中,当前输入的音素的长度为1,小于感受野8,预测得到临时结果‘春。’。输入‘chun1 mian2’到神经网络模型中,当前输入的音素的长度为2,小于感受野8,预测得到临时结果‘春眠。’,输入‘chun1 mian2 bu4 jue2 xiao3’,到神经网络模型中,当前输入的音素的长度为5,小于感受野8,预测得到临时结果‘春眠不觉晓。’。
由于模型感受野为8,不满感受野时得到的结果是临时结果,当输入满足(不小于)感受野时,即输入‘chun1 mian2 bu4 jue2 xiao3 chu4 chu4 wen2’到神经网络模型时,当前输入的音素的长度为8,等于感受野8,输出‘春眠不觉晓,处处闻。’,此时对第一个输入音素‘chun1’预测得到的结果为最终结果,音素‘chun1’对应的字符和标点作为最终结果进行存储。
下次推理时将音素‘chun1’的特征向量作为buffer输入到神经网络模型中,输入‘mian2  bu4 jue2 xiao3 chu4 chu4 wen2 ti2’到神经网络模型时,当前输入的音素的长度为8,等于感受野8,输出为‘眠不觉晓,处处闻啼。’,此时对第一个输入音素‘mian2’预测得到的字符和标点为最终结果进行存储。
将音素‘mian2’的特征向量作为buffer输入到神经网络模型中,当前输入的音素为‘bu4 jue2 xiao3 chu4 chu4 wen2 ti2 niao3’,预测得到‘不觉晓,处处闻啼鸟。’。由于此时再没有音素生成,因此,对当前输入的音素预测得到的字符和标点为最终结果,与之前保存的结果融合在一起得到最终的输出“春眠不觉晓,处处闻啼鸟。”。
在一种可能的实现方式中,终端设备可以将预测得到的临时结果存储在缓存器中,终端设备可以预先设置用于存储临时缓存器的个数(预设个数),预设个数的大小和感受野大小可以相同。这样终端设备也可以通过判断预设个数的缓存器有没有存满,来判断当前输入的音素的长度是否小于感受野,如果预设个数的缓存器未存满,则当前输入的音素的长度小于感受野,如果预设个数的缓存器存满,则当前输入的音素的长度不小于感受野。在这个实施方式中,判断当前输入的音素的长度是否小于感受野的过程,可以在对当前输入的音素进行预测得到预测结果之后。
图12示出根据本申请一实施方式的语音识别方法的流程图,在本实施方式的语音识别方法中,可以包括以下步骤:
步骤S1200,判断是否已经完成全部待识别的音素的输入;若没有完成全部待识别的音素的输入,则执行步骤S1201;若已经完成全部待识别的音素的输入,则执行步骤S1204;
步骤S1201,对当前输入的音素的字符和标点进行预测,得到当前输入的音素的临时结果,将临时结果存入临时缓存器,判断临时缓存器有没有存满;若临时缓存器未存满,则执行步骤S1202;若临时缓存器存满,则执行步骤S1203;
步骤S1202,继续输入下一个音素,返回步骤S1200;
步骤S1203,将当前输入的音素的第一个音素的预测结果作为最终结果,终端设备可以存储最终结果;终端设备将所述第一个音素的特征向量、当前输入的音素中除了第一个音素以外的音素和待识别的音素中的下一个音素继续输入第一多任务神经网络模型;返回步骤S1100;
步骤S1204,对当前输入的音素的字符和标点进行预测,得到当前输入的音素的最终结果,判断是否存在已存储的最终结果;若存在已存储的最终结果,则执行步骤S1205;若不存在已存储的最终结果,则执行步骤S1206;
步骤S1205,将已存储的最终结果和当前输入的音素的最终结果进行融合得到待识别的音素的最终结果,结束循环;
步骤S1206,将当前输入的音素的最终结果作为待识别的音素的最终结果,结束循环。
在上述实施方式中,步骤S1201-S1203与步骤S1101-S1103的区别在于,判断是否满足感受野与进行预测的先后顺序不同。其他内容可以参见图11部分的解释。
在另一种可能的实现方式中,第一多任务神经网络模型也可以是非流式网络结构。终端设备可以顺序输入待识别的音素,不再循环输入已经预测了结果的音素。相比于流式网络结构,非流式网络结构不需要缓存已经预测的历史结果,减少占用内存空间,可以进一步减小神经网络模型的尺寸。
对于非流式网络结构,采用所述第一多任务神经网络模型输出第一预测结果,可以包括:采用所述第一多任务神经网络模型基于待识别的音素的总长度和音素长度阈值的关系,输出所述第一预测结果。具体可以包括以下步骤:
1、若待识别的音素的总长度小于音素长度阈值,采用所述第一多任务神经网络模型根据全部的待识别的音素,输出所述第一预测结果;
2、若待识别的音素的总长度不小于音素长度阈值,在完成将全部待识别的音素输入第一多任务神经网络模型之前:
2.1 如果当前输入的音素的长度小于音素长度阈值,则终端设备继续输入下一个音素;
2.2 如果当前输入的音素的长度不小于音素长度阈值,则终端设备根据当前输入的音素的字符和标点,得到当前输入的音素的第一个音素的第二预测结果并存储第二预测结果,终端设备将当前输入的音素中除了第一个音素以外的音素和待识别的音素中的下一个音素继续输入第一多任务神经网络模型;
2.3 若待识别的音素的总长度不小于音素长度阈值,在完成将全部待识别的音素输入第一多任务神经网络模型时,根据当前输入的音素的字符和标点,得到当前输入的音素的第二预测结果;
2.4 若不存在已存储的第二预测结果,则将当前输入的音素的第二预测结果作为待识别的音素的第一预测结果;
2.5 若存在已存储的第二预测结果,则根据当前输入的音素的第二预测结果和已存储的第二预测结果,得到待识别的音素的第一预测结果。
终端设备可以设置音素长度阈值,在待识别的音素的总长度小于音素长度阈值时,终端设备可以将待识别的音素输入第一多任务神经网络模型进行推理,得到的预测结果作为最终结果。在待识别的音素的总长度大于音素长度阈值时,终端设备可以将待识别的音素逐个输入到第一多任务神经网络模型进行推理,在当前输入的音素的长度不小于音素长度阈值时,将当前输入的音素的第一个音素的预测结果作为最终结果进行存储,并继续输入下一个待识别的音素,继续推理,直到输入最后一个待识别的音素,将当前输入的音素的预测结果作为最终结果,将当前输入的音素的最终结果和已存储的最终结果进行融合,可以得到待识别的音素的最终结果。
图13示出根据本申请一实施方式的语音识别方法的流程图。如图13所示,本实施方式的语音识别方法可以包括以下步骤:
步骤S1300,判断是否已经完成全部待识别的音素的输入;若没有完成全部待识别的音素的输入,则执行步骤S1301;若已经完成全部待识别的音素的输入,则执行步骤S1304;
步骤S1301,判断当前输入的音素的长度是否小于音素长度阈值;若当前输入的音素的长度小于音素长度阈值,则执行步骤S1302;若当前输入的音素的长度不小于音素长度阈值,则执行步骤S1303;
步骤S1302,对当前输入的音素的字符和标点进行预测,得到当前输入的音素的临时结果,并继续输入下一个音素,返回步骤S1300;
步骤S1303,对当前输入的音素的字符和标点进行预测,得到当前输入的音素的第一个音素的最终结果,终端设备可以存储最终结果;终端设备将当前输入的音素中除了第一个音素以外的音素和待识别的音素中的下一个音素继续输入第一多任务神经网络模型;返回步骤S1300;
步骤S1304,对当前输入的音素的字符和标点进行预测,得到当前输入的音素的最终结果,判断是否存在已存储的最终结果;若存在已存储的最终结果,则执行步骤S1305;若不存在已存储的最终结果,则执行步骤S1306;
步骤S1305,将已存储的最终结果和当前输入的音素的最终结果进行融合得到待识别的音素的最终结果,结束循环;
步骤S1306,将当前输入的音素的最终结果作为待识别的音素的最终结果,结束循环。
相比于图11的实施方式,在图13的实施方式中,终端设备判断当前输入的音素的长度是否小于音素长度阈值,对音素进行字符和标点预测时,参考了该音素之后的音素,参考的音素的数量即音素长度阈值。在本申请的实施例中,终端设备可以设置音素长度阈值为32,可以理解的是,音素长度阈值的大小可以根据实际的需求设置,本申请对此不作具体限定。
在步骤S1303中,若当前输入的音素的长度不小于音素长度阈值,终端设备将当前输入的音素的第一个音素的最终结果保存,但不再将该第一个音素的特征向量作为下一次推理的输入。而是将当前输入的音素中除了第一个音素以外的音素和待识别的音素中的下一个音素继续输入第一多任务神经网络模型,进行推理。
举例来说,假设音素长度阈值为32,当待识别的音素的总长度小于32时,终端设备逐个将待识别的音素输入第一多任务神经网络模型,在当前输入的所有音素的长度小于32时,当输入一个新的待识别的音素时,终端设备采用第一多任务神经网络模型进行推理可以确定当前输入的待识别音素的临时结果,并根据当前输入的所有的待识别音素刷新当前输入的待识别的音素之前的音素的临时结果….重复以上过程,直到将全部待识别的音素输入第一多任务神经网络模型后,进行推理得到的结果为最终结果。
假如输入为‘春眠不觉晓,处处闻啼鸟。’,当输入为‘chun1’时,输入到第一多任务神经网络模型的实际向量为[chun1,0,0,0…,0],没有的地方补齐0到32位。输出为[椿,0,0…,0]。输入下一个音素‘mian2’时,可以将‘chun1 mian2’一起送入第一多任务神经网络模型,输入为[chun1,mian2,0,0,…,0],输出为[春,眠,0,0…,0],得到的结果会将音素“chun1”原来的临时结果刷新……重复以上过程,直到将[chun1,mian2,bu4,jue2,xiao3,chu4,chu4,wen2,ti2,niao3,0,0,…,0]输入第一多任务神经网络模型,得到最终结果。
当待识别的音素的总长度不小于32时,逐个将待识别的音素输入到第一多任务神经网络模型中,在当前输入的音素的长度不小于32时,与前述总长度小于32时的过程相同,不再赘述。在当前输入的音素的长度不小于32时,将当前输入的音素的第一个音素的预测结果作为最终结果保存,将当前输入的音素中除了第一个音素以外的音素和待识别的音素中的下一个音素继续输入第一多任务神经网络模型,继续进行推理…,重复这个过程,直到将待识别音素的最后一个音素输入之后进行推理得到最终结果,将之前保存的最终结果与最后一次识别的32个最终结果进行融合得到待识别的音素的最终结果。
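非流式推理的控制流程可以用如下简化的示意性代码草图表示(音素长度阈值取32仅沿用正文示例,predict表示对补零后定长输入的一次前向推理;该草图仅体现控制流程,并非本申请的限定实现):

```python
MAX_LEN = 32  # 音素长度阈值,沿用正文示例

def non_streaming_decode(phonemes, predict):
    """predict(window): 对补零到 MAX_LEN 的定长输入做一次前向推理,
    返回每个位置的 (字符, 标点)。返回全部待识别音素的最终预测结果。"""
    final, window = [], []
    for i, p in enumerate(phonemes):
        window.append(p)
        padded = window + [0] * (MAX_LEN - len(window))   # 不足阈值时补齐 0
        outputs = predict(padded)[: len(window)]
        if i == len(phonemes) - 1:                        # 全部音素输入完成,当前结果即最终结果
            final.extend(outputs)
            break
        if len(window) < MAX_LEN:
            continue                                      # 未达到阈值:本次结果为临时结果
        final.append(outputs[0])                          # 保存第一个音素的最终结果
        window = window[1:]                               # 不缓存特征,仅滑动输入窗口
    return final
```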
根据本申请上述实施方式的语音识别方法,采用非流式网络结构,无需将已经预测了结果的音素重新输入网络模型中,相比于流式网络结构,非流式网络结构不需要缓存已经预测的历史结果,减少占用内存空间,可以进一步减小神经网络模型的尺寸,易于在端侧进行部署。并且,与流式网络结构相比,非流式网络结构虽然计算量大,但是网络中没有拼接、切分等算子,不需要内存搬运等耗时操作,在GPU等高并行计算的设备中,可以快速推理。
根据以上从训练样本集构建到模型推理过程,对本申请的语音识别方法的介绍可知,为了解决相关的语音识别技术中存在模型无法在端侧部署、采用声学模型预测标点的预测准确性不高的技术问题,本申请提供了一种实施方式的语音识别方法,具体包括以下步骤:
终端设备将待识别的音素输入到第一多任务神经网络模型中,其中,所述第一多任务神经网络模型为采用训练样本对第二多任务神经网络模型进行训练得到的,所述训练样本包括:样本语句,所述样本语句中包括字符,所述训练样本包括:样本语句中的字符对应的音素、标点;
终端设备采用所述第一多任务神经网络模型输出第一预测结果,所述第一预测结果包括所述待识别的音素对应的字符预测结果和标点预测结果;
终端设备根据所述第一预测结果将所述第一预测结果的至少一部分显示在所述终端设备的显示屏上。
具体的过程可以参见上文中对图10所述的语音识别方法的过程的说明,不再赘述。
本申请的实施方式的语音识别方法,通过构建一个用于同时预测音素对应的字符和标点的神经网络模型,并构建训练样本集对神经网络模型进行训练,得到训练后的神经网络模型,训练过程中可以不需要进行分词处理,将待识别的语音转换后的音素(向量)作为训练后的神经网络模型的输入,进行正向推理,可以同时输出音素对应的字符和标点,并且神经网络模型尺寸小,可以在端侧部署。
为了进一步解决相关技术中输入和输出的长度不相同的情况下,无法同时输出预测结果的技术问题,本申请的实施方式中,构建的训练样本中的所述样本语句中字符的长度与音素的长度和标点的长度相同。通过构建训练样本集的过程中,将样本语句中字符的长度和注音后的音素的长度、标点的长度进行对齐,采用本申请的实施方式构建的训练样本集对神经网络模型进行训练后,神经网络模型可以同时进行音素到字符的转换、以及标点预测,从而可以同时输出预测的字符和标点结果。
在一种可能的实现方式中,所述第一多任务神经网络模型为流式网络结构,终端设备将待识别的音素输入到第一多任务神经网络模型中,采用所述第一多任务神经网络模型输出第一预测结果,可以包括:所述终端设备将待识别的音素循环送入第一多任务神经网络模型中,采用所述第一多任务神经网络模型基于当前输入的待识别的音素的长度输出所述第一预测结果。使得待识别的音素的预测结果既参考了之前的音素、又参考了之后的音素,提高了预测的准确率。
在一种可能的实现方式中,所述终端设备将待识别的音素循环送入第一多任务神经网络模型中,采用所述第一多任务神经网络模型基于当前输入的待识别的音素的长度输出所述第一预测结果,可以包括:
在完成将全部待识别的音素输入第一多任务神经网络模型之前,如果当前输入的音素的长度小于感受野,则终端设备继续输入下一个音素;
在完成将全部待识别的音素输入第一多任务神经网络模型之前,如果当前输入的音素的长度不小于感受野,则终端设备根据当前输入的音素的字符和标点,得到当前输入的音素的第一个音素的第二预测结果,并存储第二预测结果;终端设备将所述第一个音素的特征向量、当前输入的音素中除了第一个音素以外的音素和待识别的音素中的下一个音素继续输入第一多任务神经网络模型。
在一种可能的实现方式中,所述终端设备将待识别的音素循环送入第一多任务神经网络模型中,采用所述第一多任务神经网络模型基于当前输入的待识别的音素的长度输出所述第一预测结果,还可以包括:
在完成将全部待识别的音素输入第一多任务神经网络模型时,终端设备根据当前输入的音素的字符和标点,得到当前输入的音素的第二预测结果;
若不存在已存储的第二预测结果,则将当前输入的音素的第二预测结果作为待识别的音素的第一预测结果;
若存在已存储的第二预测结果,则根据当前输入的音素的第二预测结果和已存储的第二预测结果,得到待识别的音素的第一预测结果。
关于采用流式网络结构进行推理的过程,具体的示例可以参见上文中图11和图12部分的描述,需要说明的是,图11和图12的推理过程仅仅是本申请的一些实例,其中的步骤的执行顺序以及具体数值不以任何方式限制本申请。
根据本申请上述实施方式的语音识别方法,将声学模型输出的待识别的音素循环送入流式网络结构的第一多任务神经网络模型,使得待识别的音素的预测结果既参考了之前的音素、又参考了之后的音素,提高了预测的准确率。
在一种可能的实现方式中,所述第一多任务神经网络模型为非流式网络结构,所述终端设备采用所述第一多任务神经网络模型输出第一预测结果,可以包括:终端设备采用所述第一多任务神经网络模型基于待识别的音素的总长度和音素长度阈值的关系,输出所述第一预测结果。
在一种可能的实现方式中,终端设备采用所述第一多任务神经网络模型基于待识别的音素的总长度和音素长度阈值的关系,输出所述第一预测结果,可以包括:
若待识别的音素的总长度小于音素长度阈值,采用所述第一多任务神经网络模型根据全部的待识别的音素,输出所述第一预测结果。
在一种可能的实现方式中,终端设备采用所述第一多任务神经网络模型基于待识别的音素的总长度和音素长度阈值的关系,输出所述第一预测结果,还可以包括:
若待识别的音素的总长度不小于音素长度阈值,在完成将全部待识别的音素输入第一多任务神经网络模型之前:如果当前输入的音素的长度小于音素长度阈值,则终端设备继续输入下一个音素;如果当前输入的音素的长度不小于音素长度阈值,则终端设备根据当前输入的音素的字符和标点,得到当前输入的音素的第一个音素的第二预测结果并存储第二预测结果,终端设备将当前输入的音素中除了第一个音素以外的音素和待识别的音素中的下一个音素继续输入第一多任务神经网络模型。
在一种可能的实现方式中,采用所述第一多任务神经网络模型基于待识别的音素的总长度和音素长度阈值的关系,输出所述第一预测结果,还可以包括:
若待识别的音素的总长度不小于音素长度阈值,在完成将全部待识别的音素输入第一多任务神经网络模型时,根据当前输入的音素的字符和标点,得到当前输入的音素的第二预测结果;
若不存在已存储的第二预测结果,则将当前输入的音素的第二预测结果作为待识别的音素的第一预测结果;
若存在已存储的第二预测结果,则根据当前输入的音素的第二预测结果和已存储的第二预测结果,得到待识别的音素的第一预测结果。
关于采用非流式网络结构进行推理的过程,具体的示例可以参见上文中图13部分的描述,需要说明的是,图13的推理过程仅仅是本申请的一些实例,其中的步骤的执行顺序以及具体数值不以任何方式限制本申请。
根据本申请上述实施方式的语音识别方法,采用非流式网络结构,无需将已经预测了结果的音素重新输入网络模型中,相比于流式网络结构,非流式网络结构不需要缓存已经预测的历史结果,减少占用内存空间,可以进一步减小神经网络模型的尺寸,易于在端侧进行部署。并且,由于计算过程中,不需要对历史结果和当前输入的音素进行拼接、切分等操作,能够加快推理速度,在长语音识别中,实现实时输出的效果显著。
本申请还提供了一种神经网络模型训练方法,所述方法包括:
构建训练样本,所述训练样本包括:样本语句,所述样本语句中包括字符,所述训练样本还包括:样本语句中的字符对应的音素、标点;
采用所述训练样本对第二多任务神经网络模型进行训练得到第一多任务神经网络模型;其中,第二多任务神经网络模型和第一多任务神经网络模型都能够输出第一预测结果、显示所述第一预测结果的至少一部分,所述第一预测结果包括字符预测结果和标点预测结果,同时对音素的字符和标点进行预测。
本申请的实施方式的神经网络模型训练方法,通过构建一个用于同时预测音素对应的字符和标点的神经网络模型,并构建训练样本集对神经网络模型进行训练,得到训练后的神经网络模型,训练过程中可以不需要进行分词处理,将待识别的语音转换后的音素(向量)作为训练后的神经网络模型的输入,进行正向推理,可以同时输出音素对应的字符和标点,并且神经网络模型尺寸小,可以在端侧部署。
为了进一步解决相关技术中的输入序列和输出序列长度不同的场景下,无法同时输出预测结果的技术问题,在一种可能的实现方式中,构建训练样本,可以包括:
根据注音词典对样本语句中的字符进行注音得到字符对应的音素、并对字符对应的音素与字符和标点进行对齐处理,所述样本语句中字符的长度与音素的长度和标点的长度相同。
通过构建训练样本集的过程中,将样本语句中字符的长度和注音后的音素的长度、标点的长度进行对齐,采用本申请的实施方式构建的训练样本集对神经网络模型进行训练后,神经网络模型可以同时进行音素到字符的转换、以及标点预测,从而可以同时输出预测的字符和标点结果。
在一种可能的实现方式中,对齐后的中文中的多音字对应的音素为,多音字对应的多个音素中的任意一个;对齐后的英文字符中包括对齐字符,对齐后的英文字符的长度和英文字符对应的音素的长度相同;对于对齐之前没有标点的字符,对齐后的标点为blank。具体地,上述对字符对应的音素与字符和标点进行对齐处理,可以包括:
对于中文中的多音字,从多音字对应的多个音素中任选一个音素作为多音字对应的音素;
对于英文字符,在字符中添加对齐字符与字符对应的音素的长度对齐;
若字符之后没有标点,则设置字符对应的标点为blank,使得标点的长度与字符的长度对齐。
其中,对于英文字符,在字符中添加对齐字符的位置可以位于需要对齐的字符的两侧,比如,之前或者之后。也就是说,可以将字符和音素进行左对齐或者右对齐,右对齐可以是将对齐字符添加在需要对齐的字符的左侧,左对齐可以是将对齐字符添加在需要对齐的字符的右侧。对于对齐字符的形式以及添加方式可以参见上文中图7部分的介绍,不再赘述。并且,以上对齐处理的三步可以分别进行,也可以同时进行,本申请对此不作限定。
在一种可能的实现方式中,采用所述训练样本对第二多任务神经网络模型进行训练得到第一多任务神经网络模型,可以包括:
将训练样本输入第二多任务神经网络模型,确定所述训练样本对应的字符概率矩阵和标点概率矩阵;
根据字符概率矩阵和标点概率矩阵,分别计算字符交叉熵损失和标点交叉熵损失;
根据字符交叉熵损失、字符交叉熵损失对应的第一权值和标点交叉熵损失、标点交叉熵损失对应的第二权值,计算加权交叉熵损失;
根据所述加权交叉熵损失调整第二多任务神经网络模型的参数,得到训练后的第一多任务神经网络模型。
对于训练过程的具体介绍可以参见上文中图8部分的内容,不再赘述。
本申请的多任务神经网络模型的训练方法,可以实现同时对字符预测和标点预测的任务进行训练。另外,由于构建的训练样本集中包括多种语言,因此,本申请的多任务神经网络模型的训练方法还可以实现对多种语言识别(预测)的任务进行训练。根据本申请的实施方式的多任务神经网络模型的训练方法进行训练得到的多任务神经网络模型,可以同时进行多种语言和标点的预测,并且多任务神经网络模型相比于传统的声学模型尺寸小,可以在端侧部署。
本申请的实施例还提供了一种语音识别装置,可以应用于如图1或者图3所示的终端设备。图14示出根据本申请一实施例的语音识别装置的框图,如图14所示,所述语音识别装置可以包括:
输入模块1400,用于将待识别的音素输入到第一多任务神经网络模型中,其中,所述第一多任务神经网络模型为采用训练样本对第二多任务神经网络模型进行训练得到的,所述训练样本包括:样本语句,所述样本语句中包括字符,所述训练样本还包括:样本语句中的字符对应的音素、标点;
推理模块1401,用于采用所述第一多任务神经网络模型输出第一预测结果,所述第一预测结果包括所述待识别的音素对应的字符预测结果和标点预测结果;
显示模块1402,用于根据所述第一预测结果将所述第一预测结果的至少一部分显示在所述终端设备的显示屏上。
本申请的实施方式的语音识别装置,通过构建一个用于同时预测、输出音素对应的字符和标点的神经网络模型,并构建训练样本集对神经网络模型进行训练,得到训练后的神经网络模型,训练过程中可以不需要进行分词处理,将待识别的语音转换后的音素(向量)作为训练后的神经网络模型的输入,进行正向推理,可以同时输出音素对应的字符和标点,并且神经网络模型尺寸小,可以在端侧部署。
在一种可能的实现方式中,所述样本语句中字符的长度与音素的长度和标点的长度相同。
通过构建训练样本集的过程中,将样本语句中字符的长度和注音后的音素的长度、标点的长度进行对齐,采用本申请的实施方式构建的训练样本集对神经网络模型进行训练后,神经网络模型可以同时进行音素到字符的转换、以及标点预测,从而可以同时输出预测的字符和标点结果。
在一种可能的实现方式中,所述第一多任务神经网络模型为流式网络结构,所述输入模块1400,可以包括:第一输入单元,用于将待识别的音素循环送入第一多任务神经网络模型中;所述推理模块1401,包括:第一推理单元,用于采用所述第一多任务神经网络模型基于当前输入的待识别的音素的长度输出所述第一预测结果。使得待识别的音素的预测结果既参考了之前的音素、又参考了之后的音素,提高了预测的准确率。
在一种可能的实现方式中,所述第一输入单元还用于:在完成将全部待识别的音素输入第一多任务神经网络模型之前,如果当前输入的音素的长度小于感受野,则终端设备继续输入下一个音素;
在完成将全部待识别的音素输入第一多任务神经网络模型之前,如果当前输入的音素的长度不小于感受野,则第一推理单元用于根据当前输入的音素的字符和标点,得到当前输入的音素的第一个音素的第二预测结果,并存储第二预测结果;终端设备将所述第一个音素的特征向量、当前输入的音素中除了第一个音素以外的音素和待识别的音素中的下一个音素继续输入第一多任务神经网络模型。
在一种可能的实现方式中,所述第一推理单元还用于:在完成将全部待识别的音素输入第一多任务神经网络模型时根据当前输入的音素的字符和标点,得到当前输入的音素的第二预测结果;
若不存在已存储的第二预测结果,则将当前输入的音素的第二预测结果作为待识别的音素的第一预测结果;
若存在已存储的第二预测结果,则根据当前输入的音素的第二预测结果和已存储的第二预测结果,得到待识别的音素的第一预测结果。
根据本申请上述实施方式的语音识别装置,将声学模型输出的待识别的音素循环送入流式网络结构的第一多任务神经网络模型,使得待识别的音素的预测结果既参考了之前的音素、又参考了之后的音素,提高了预测的准确率。
在一种可能的实现方式中,所述第一多任务神经网络模型为非流式网络结构,所述推理模块1401,包括:
第二推理单元,用于采用所述第一多任务神经网络模型基于待识别的音素的总长度和音素长度阈值的关系,输出所述第一预测结果。
在一种可能的实现方式中,所述第二推理单元还用于若待识别的音素的总长度小于音素长度阈值,采用所述第一多任务神经网络模型根据全部的待识别的音素,输出所述第一预测结果。
在一种可能的实现方式中,所述第二推理单元还用于:
若待识别的音素的总长度不小于音素长度阈值,在完成将全部待识别的音素输入第一多任务神经网络模型之前:如果当前输入的音素的长度小于音素长度阈值,则继续输入下一个音素;如果当前输入的音素的长度不小于音素长度阈值,则根据当前输入的音素的字符和标点,得到当前输入的音素的第一个音素的第二预测结果并存储第二预测结果,终端设备将当前输入的音素中除了第一个音素以外的音素和待识别的音素中的下一个音素继续输入第一多任务神经网络模型。
在一种可能的实现方式中,所述第二推理单元还用于:
若待识别的音素的总长度不小于音素长度阈值,在完成将全部待识别的音素输入第一多任务神经网络模型时,根据当前输入的音素的字符和标点,得到当前输入的音素的第二预测结果;
若不存在已存储的第二预测结果,则将当前输入的音素的第二预测结果作为待识别的音素的第一预测结果;
若存在已存储的第二预测结果,则根据当前输入的音素的第二预测结果和已存储的第二预测结果,得到待识别的音素的第一预测结果。
根据本申请上述实施方式的语音识别装置,采用非流式网络结构,无需将已经预测了结果的音素重新输入网络模型中,相比于流式网络结构,非流式网络结构不需要缓存已经预测的历史结果,减少占用内存空间,可以进一步减小神经网络模型的尺寸,易于在端侧进行部署。并且,由于计算过程中,不需要对历史结果和当前输入的音素进行拼接、切分等操作,能够加快推理速度,在长语音识别中,实现实时输出的效果显著。
本申请的实施例还提供了一种神经网络模型训练装置。图15示出根据本申请一实施例的神经网络模型训练装置的框图,如图15所示,所述神经网络模型训练装置可以包括:
构建模块1500,用于构建训练样本,所述训练样本包括:样本语句,所述样本语句中包括字符,所述训练样本还包括:样本语句中的字符对应的音素、标点;
训练模块1501,用于采用所述训练样本对第二多任务神经网络模型进行训练得到第一多任务神经网络模型;其中,第二多任务神经网络模型和第一多任务神经网络模型都能够输出第一预测结果、显示所述第一预测结果的至少一部分,所述第一预测结果包括字符预测结果和标点预测结果,同时对音素的字符和标点进行预测。
本申请的实施方式的神经网络训练装置,通过构建能够同时进行音素转字符和标点预测的神经网络模型,并构建训练样本集对神经网络模型进行训练,得到训练后的神经网络模型,训练过程中不需要进行分词处理,将待识别的语音转换后的音素(向量)作为训练后的神经网络模型的输入,进行正向推理,可以同时输出音素对应的字符和标点,并且神经网络模型尺寸小,可以在端侧部署。
在一种可能的实现方式中,所述构建模块1500,包括:
对齐单元,用于根据注音词典对样本语句中的字符进行注音得到字符对应的音素、并对字符对应的音素与字符和标点进行对齐处理,所述样本语句中字符的长度与音素的长度和标点的长度相同。
在一种可能的实现方式中,对齐后的中文中的多音字对应的音素为,多音字对应的多个音素中的任意一个;对齐后的英文字符中包括对齐字符,对齐后的英文字符的长度和英文字符对应的音素的长度相同;对于对齐之前没有标点的字符,对齐后的标点为blank。具体地,所述对齐单元还用于:
对于中文中的多音字,从多音字对应的多个音素中任选一个音素作为多音字对应的音素;
对于英文字符,在字符中添加对齐字符与字符对应的音素的长度对齐;
若字符之后没有标点,则设置字符对应的标点为blank,使得标点的长度与字符的长度对齐。
通过构建训练样本集的过程中,将样本语句中字符的长度和注音后的音素的长度、标点的长度进行对齐,采用本申请的实施方式构建的训练样本集对神经网络模型进行训练后,神经网络模型可以同时进行音素到字符的转换、以及标点预测,从而可以同时输出预测的字符和标点结果。
在一种可能的实现方式中,所述训练模块1501,包括:
确定单元,用于将训练样本输入第二多任务神经网络模型,确定所述训练样本对应的字符概率矩阵和标点概率矩阵;
第一计算单元,用于根据字符概率矩阵和标点概率矩阵,分别计算字符交叉熵损失和标点交叉熵损失;
第二计算单元,用于根据字符交叉熵损失、字符交叉熵损失对应的第一权值和标点交叉熵损失、标点交叉熵损失对应的第二权值,计算加权交叉熵损失;
调整单元,用于根据所述加权交叉熵损失调整第二多任务神经网络模型的参数,得到训练后的第一多任务神经网络模型。
本申请的多任务神经网络模型的训练装置,可以实现同时对字符预测和标点预测的任务进行训练。另外,由于构建的训练样本集中包括多种语言,因此,本申请的多任务神经网络模型的训练方法还可以实现对多种语言识别(预测)的任务进行训练。根据本申请的实施方式的多任务神经网络模型的训练装置进行训练得到的多任务神经网络模型,可以同时进行多种语言和标点的预测,并且多任务神经网络模型相比于传统的声学模型尺寸小,可以在端侧部署。
本申请的实施例提供了一种语音识别装置,包括:处理器以及用于存储处理器可执行指令的存储器;其中,所述处理器被配置为执行所述指令时实现上述方法。
本申请的实施例提供了一种神经网络模型训练装置,包括:处理器以及用于存储处理器可执行指令的存储器;其中,所述处理器被配置为执行所述指令时实现上述方法。
本申请的实施例提供了一种非易失性计算机可读存储介质,其上存储有计算机程序指令,所述计算机程序指令被处理器执行时实现上述方法。
本申请的实施例提供了一种计算机程序产品,包括计算机可读代码,或者承载有计算机可读代码的非易失性计算机可读存储介质,当所述计算机可读代码在电子设备的处理器中运行时,所述电子设备中的处理器执行上述方法。
计算机可读存储介质可以是可以保持和存储由指令执行设备使用的指令的有形设备。计算机可读存储介质例如可以是――但不限于――电存储设备、磁存储设备、光存储设备、电磁存储设备、半导体存储设备或者上述的任意合适的组合。计算机可读存储介质的更具体的例子(非穷举的列表)包括:便携式计算机盘、硬盘、随机存取存储器(Random Access Memory,RAM)、只读存储器(Read Only Memory,ROM)、可擦式可编程只读存储器(Electrically Programmable Read-Only-Memory,EPROM或闪存)、静态随机存取存储器(Static Random-Access Memory,SRAM)、便携式压缩盘只读存储器(Compact Disc Read-Only Memory,CD-ROM)、数字多功能盘(Digital Video Disc,DVD)、记忆棒、软盘、机械编码设备、例如其上存储有指令的打孔卡或凹槽内凸起结构、以及上述的任意合适的组合。
这里所描述的计算机可读程序指令或代码可以从计算机可读存储介质下载到各个计算/处理设备,或者通过网络、例如因特网、局域网、广域网和/或无线网下载到外部计算机或外部存储设备。网络可以包括铜传输电缆、光纤传输、无线传输、路由器、防火墙、交换机、网关计算机和/或边缘服务器。每个计算/处理设备中的网络适配卡或者网络接口从网络接收计算机可读程序指令,并转发该计算机可读程序指令,以供存储在各个计算/处理设备中的计算机可读存储介质中。
用于执行本申请操作的计算机程序指令可以是汇编指令、指令集架构(Instruction Set Architecture,ISA)指令、机器指令、机器相关指令、微代码、固件指令、状态设置数据、或者以一种或多种编程语言的任意组合编写的源代码或目标代码,所述编程语言包括面向对象的编程语言—诸如Smalltalk、C++等,以及常规的过程式编程语言—诸如“C”语言或类似的编程语言。计算机可读程序指令可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意 种类的网络—包括局域网(Local Area Network,LAN)或广域网(Wide Area Network,WAN)—连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。在一些实施例中,通过利用计算机可读程序指令的状态信息来个性化定制电子电路,例如可编程逻辑电路、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或可编程逻辑阵列(Programmable Logic Array,PLA),该电子电路可以执行计算机可读程序指令,从而实现本申请的各个方面。
这里参照根据本申请实施例的方法、装置(系统)和计算机程序产品的流程图和/或框图描述了本申请的各个方面。应当理解,流程图和/或框图的每个方框以及流程图和/或框图中各方框的组合,都可以由计算机可读程序指令实现。
这些计算机可读程序指令可以提供给通用计算机、专用计算机或其它可编程数据处理装置的处理器,从而生产出一种机器,使得这些指令在通过计算机或其它可编程数据处理装置的处理器执行时,产生了实现流程图和/或框图中的一个或多个方框中规定的功能/动作的装置。也可以把这些计算机可读程序指令存储在计算机可读存储介质中,这些指令使得计算机、可编程数据处理装置和/或其他设备以特定方式工作,从而,存储有指令的计算机可读介质则包括一个制造品,其包括实现流程图和/或框图中的一个或多个方框中规定的功能/动作的各个方面的指令。
也可以把计算机可读程序指令加载到计算机、其它可编程数据处理装置、或其它设备上,使得在计算机、其它可编程数据处理装置或其它设备上执行一系列操作步骤,以产生计算机实现的过程,从而使得在计算机、其它可编程数据处理装置、或其它设备上执行的指令实现流程图和/或框图中的一个或多个方框中规定的功能/动作。
附图中的流程图和框图显示了根据本申请的多个实施例的装置、系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段或指令的一部分,所述模块、程序段或指令的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个连续的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。
也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行相应的功能或动作的硬件(例如电路或ASIC(Application Specific Integrated Circuit,专用集成电路))来实现,或者可以用硬件和软件的组合,如固件等来实现。
尽管在此结合各实施例对本申请进行了描述,然而,在实施所要求保护的本申请过程中,本领域技术人员通过查看所述附图、公开内容、以及所附权利要求书,可理解并实现所述公开实施例的其它变化。在权利要求中,“包括”(comprising)一词不排除其他组成部分或步骤,“一”或“一个”不排除多个的情况。单个处理器或其它单元可以实现权利要求中列举的若干项功能。相互不同的从属权利要求中记载了某些措施,但这并不表示这些措施不能组合起来产生良好的效果。
以上已经描述了本申请的各实施例,上述说明是示例性的,并非穷尽性的,并且也不限于所披露的各实施例。在不偏离所说明的各实施例的范围和精神的情况下,对于本技术领域的普通技术人员来说许多修改和变更都是显而易见的。本文中所用术语的选择,旨在最好地解释各实施例的原理、实际应用或对市场中的技术的改进,或者使本技术领域的其它普通技 术人员能理解本文披露的各实施例。

Claims (31)

  1. 一种语音识别方法,其特征在于,所述方法包括:
    终端设备将待识别的音素输入到第一多任务神经网络模型中;
    采用所述第一多任务神经网络模型输出第一预测结果,所述第一预测结果包括所述待识别的音素对应的字符预测结果和标点预测结果;
    终端设备根据所述第一预测结果将所述第一预测结果的至少一部分显示在所述终端设备的显示屏上。
  2. 根据权利要求1所述的方法,其特征在于,所述第一多任务神经网络模型为采用训练样本对第二多任务神经网络模型进行训练得到的,所述训练样本包括:样本语句,所述样本语句中包括字符,所述训练样本还包括:样本语句中的字符对应的音素、标点。
  3. 根据权利要求2所述的方法,其特征在于,所述样本语句中字符的长度与音素的长度和标点的长度相同。
  4. 根据权利要求1所述的方法,其特征在于,所述第一多任务神经网络模型为流式网络结构,
    终端设备将待识别的音素输入到第一多任务神经网络模型中,采用所述第一多任务神经网络模型输出第一预测结果,包括:所述终端设备将待识别的音素循环送入第一多任务神经网络模型中,采用所述第一多任务神经网络模型基于当前输入的待识别的音素的长度输出所述第一预测结果。
  5. 根据权利要求4所述的方法,其特征在于,所述终端设备将待识别的音素循环送入第一多任务神经网络模型中,采用所述第一多任务神经网络模型基于当前输入的待识别的音素的长度输出所述第一预测结果,包括:
    在完成将全部待识别的音素输入第一多任务神经网络模型之前,如果当前输入的音素的长度小于感受野,则终端设备继续输入下一个音素;
    在完成将全部待识别的音素输入第一多任务神经网络模型之前,如果当前输入的音素的长度不小于感受野,则终端设备根据当前输入的音素的字符和标点,得到当前输入的音素的第一个音素的第二预测结果,并存储第二预测结果;终端设备将所述第一个音素的特征向量、当前输入的音素中除了第一个音素以外的音素和待识别的音素中的下一个音素继续输入第一多任务神经网络模型。
  6. 根据权利要求4或5所述的方法,其特征在于,所述终端设备将待识别的音素循环送入第一多任务神经网络模型中,采用所述第一多任务神经网络模型基于当前输入的待识别的音素的长度输出所述第一预测结果,还包括:
    在完成将全部待识别的音素输入第一多任务神经网络模型时,终端设备根据当前输入的音素的字符和标点,得到当前输入的音素的第二预测结果;
    若不存在已存储的第二预测结果,则终端设备将当前输入的音素的第二预测结果作为待识别的音素的第一预测结果;
    若存在已存储的第二预测结果,则根据当前输入的音素的第二预测结果和已存储的第二预测结果,得到待识别的音素的第一预测结果。
  7. 根据权利要求1所述的方法,其特征在于,所述第一多任务神经网络模型为非流式网络结构,
    采用所述第一多任务神经网络模型输出第一预测结果,包括:
    采用所述第一多任务神经网络模型基于待识别的音素的总长度和音素长度阈值的关系,输出所述第一预测结果。
  8. 根据权利要求7所述的方法,其特征在于,采用所述第一多任务神经网络模型基于待识别的音素的总长度和音素长度阈值的关系,输出所述第一预测结果,包括:
    若待识别的音素的总长度小于音素长度阈值,采用所述第一多任务神经网络模型根据全部的待识别的音素,输出所述第一预测结果。
  9. 根据权利要求7或8所述的方法,其特征在于,采用所述第一多任务神经网络模型基于待识别的音素的总长度和音素长度阈值的关系,输出所述第一预测结果,包括:
    若待识别的音素的总长度不小于音素长度阈值,在完成将全部待识别的音素输入第一多任务神经网络模型之前:如果当前输入的音素的长度小于音素长度阈值,则终端设备继续输入下一个音素;如果当前输入的音素的长度不小于音素长度阈值,则终端设备根据当前输入的音素的字符和标点,得到当前输入的音素的第一个音素的第二预测结果并存储第二预测结果,终端设备将当前输入的音素中除了第一个音素以外的音素和待识别的音素中的下一个音素继续输入第一多任务神经网络模型。
  10. 根据权利要求9所述的方法,其特征在于,采用所述第一多任务神经网络模型基于待识别的音素的总长度和音素长度阈值的关系,输出所述第一预测结果,还包括:
    若待识别的音素的总长度不小于音素长度阈值,在完成将全部待识别的音素输入第一多任务神经网络模型时,根据当前输入的音素的字符和标点,得到当前输入的音素的第二预测结果;
    若不存在已存储的第二预测结果,则将当前输入的音素的第二预测结果作为待识别的音素的第一预测结果;
    若存在已存储的第二预测结果,则根据当前输入的音素的第二预测结果和已存储的第二预测结果,得到待识别的音素的第一预测结果。
  11. 一种神经网络模型训练方法,其特征在于,所述方法包括:
    构建训练样本,所述训练样本包括:样本语句,所述样本语句中包括字符,所述训练样本还包括:样本语句中的字符对应的音素、标点;
    采用所述训练样本对第二多任务神经网络模型进行训练得到第一多任务神经网络模型;其中,第二多任务神经网络模型和第一多任务神经网络模型都能够输出第一预测结果、显示所述第一预测结果的至少一部分,所述第一预测结果包括字符预测结果和标点预测结果。
  12. 根据权利要求11所述的方法,其特征在于,构建训练样本,包括:
    根据注音词典对样本语句中的字符进行注音得到字符对应的音素、并对字符对应的音素与字符和标点进行对齐处理,所述样本语句中字符的长度与音素的长度和标点的长度相同。
  13. 根据权利要求12所述的方法,其特征在于,
    对齐后的中文中的多音字对应的音素为,多音字对应的多个音素中的任意一个;
    对齐后的英文字符中包括对齐字符,对齐后的英文字符的长度和英文字符对应的音素的长度相同;
    对于对齐之前没有标点的字符,对齐后的标点为blank。
  14. 根据权利要求11所述的方法,其特征在于,采用所述训练样本对第二多任务神经网络模型进行训练得到第一多任务神经网络模型,包括:
    将训练样本输入第二多任务神经网络模型,确定所述训练样本对应的字符概率矩阵和标点概率矩阵;
    根据字符概率矩阵和标点概率矩阵,分别计算字符交叉熵损失和标点交叉熵损失;
    根据字符交叉熵损失、字符交叉熵损失对应的第一权值和标点交叉熵损失、标点交叉熵损失对应的第二权值,计算加权交叉熵损失;
    根据所述加权交叉熵损失调整第二多任务神经网络模型的参数,得到训练后的第一多任务神经网络模型。
  15. 一种语音识别装置,其特征在于,所述装置包括:
    输入模块,用于将待识别的音素输入到第一多任务神经网络模型中;
    推理模块,用于采用所述第一多任务神经网络模型输出第一预测结果,所述第一预测结果包括所述待识别的音素对应的字符预测结果和标点预测结果;
    显示模块,用于根据所述第一预测结果将所述第一预测结果的至少一部分显示在终端设备的显示屏上。
  16. 根据权利要求15所述的装置,其特征在于,所述第一多任务神经网络模型为采用训练样本对第二多任务神经网络模型进行训练得到的,所述训练样本包括:样本语句,所述样本语句中包括字符,所述训练样本还包括:样本语句中的字符对应的音素、标点。
  17. 根据权利要求16所述的装置,其特征在于,所述样本语句中字符的长度与音素的长度和标点的长度相同。
  18. 根据权利要求15所述的装置,其特征在于,所述第一多任务神经网络模型为流式网络结构,
    所述输入模块,包括:第一输入单元,用于将待识别的音素循环送入第一多任务神经网络模型中;
    所述推理模块,包括:
    第一推理单元,用于采用所述第一多任务神经网络模型基于当前输入的待识别的音素的长度输出所述第一预测结果。
  19. 根据权利要求18所述的装置,其特征在于,
    所述第一输入单元还用于:在完成将全部待识别的音素输入第一多任务神经网络模型之前,如果当前输入的音素的长度小于感受野,则终端设备继续输入下一个音素;
    在完成将全部待识别的音素输入第一多任务神经网络模型之前,如果当前输入的音素的长度不小于感受野,则第一推理单元用于根据当前输入的音素的字符和标点,得到当前输入的音素的第一个音素的第二预测结果,并存储第二预测结果;第一输入单元还用于将所述第一个音素的特征向量、当前输入的音素中除了第一个音素以外的音素和待识别的音素中的下一个音素继续输入第一多任务神经网络模型。
  20. 根据权利要求18或19所述的装置,其特征在于,所述第一推理单元还用于:
    在完成将全部待识别的音素输入第一多任务神经网络模型时,根据当前输入的音素的字符和标点,得到当前输入的音素的第二预测结果;
    若不存在已存储的第二预测结果,则将当前输入的音素的第二预测结果作为待识别的音素的第一预测结果;
    若存在已存储的第二预测结果,则根据当前输入的音素的第二预测结果和已存储的第二预测结果,得到待识别的音素的第一预测结果。
  21. 根据权利要求15所述的装置,其特征在于,所述第一多任务神经网络模型为非流式网络结构,
    所述推理模块,包括:
    第二推理单元,用于采用所述第一多任务神经网络模型基于待识别的音素的总长度和音素长度阈值的关系,输出所述第一预测结果。
  22. 根据权利要求21所述的装置,其特征在于,所述第二推理单元还用于若待识别的音素的总长度小于音素长度阈值,采用所述第一多任务神经网络模型根据全部的待识别的音素,输出所述第一预测结果。
  23. 根据权利要求21或22所述的装置,其特征在于,所述第二推理单元还用于:
    若待识别的音素的总长度不小于音素长度阈值,在完成将全部待识别的音素输入第一多任务神经网络模型之前:如果当前输入的音素的长度小于音素长度阈值,则继续输入下一个音素;如果当前输入的音素的长度不小于音素长度阈值,则根据当前输入的音素的字符和标点,得到当前输入的音素的第一个音素的第二预测结果并存储第二预测结果,将当前输入的音素中除了第一个音素以外的音素和待识别的音素中的下一个音素继续输入第一多任务神经网络模型。
  24. 根据权利要求23所述的装置,其特征在于,所述第二推理单元还用于:
    若待识别的音素的总长度不小于音素长度阈值,在完成将全部待识别的音素输入第一多任务神经网络模型时,根据当前输入的音素的字符和标点,得到当前输入的音素的第二预测结果;
    若不存在已存储的第二预测结果,则将当前输入的音素的第二预测结果作为待识别的音素的第一预测结果;
    若存在已存储的第二预测结果,则根据当前输入的音素的第二预测结果和已存储的第二预测结果,得到待识别的音素的第一预测结果。
  25. 一种神经网络模型训练装置,其特征在于,所述装置包括:
    构建模块,用于构建训练样本,所述训练样本包括:样本语句,所述样本语句中包括字符,所述训练样本还包括:样本语句中的字符对应的音素、标点;
    训练模块,用于采用所述训练样本对第二多任务神经网络模型进行训练得到第一多任务神经网络模型;其中,第二多任务神经网络模型和第一多任务神经网络模型都能够输出第一预测结果、显示所述第一预测结果的至少一部分,所述第一预测结果包括字符预测结果和标点预测结果。
  26. 根据权利要求25所述的装置,其特征在于,所述构建模块,包括:
    对齐单元,用于根据注音词典对样本语句中的字符进行注音得到字符对应的音素、并对字符对应的音素与字符和标点进行对齐处理,所述样本语句中字符的长度与音素的长度和标点的长度相同。
  27. 根据权利要求26所述的装置,其特征在于,
    对齐后的中文中的多音字对应的音素为,多音字对应的多个音素中的任意一个;
    对齐后的英文字符中包括对齐字符,对齐后的英文字符的长度和英文字符对应的音素的长度相同;
    对于对齐之前没有标点的字符,对齐后的标点为blank。
  28. 根据权利要求25所述的装置,其特征在于,所述训练模块,包括:
    确定单元,用于将训练样本输入第二多任务神经网络模型,确定所述训练样本对应的字符概率矩阵和标点概率矩阵;
    第一计算单元,用于根据字符概率矩阵和标点概率矩阵,分别计算字符交叉熵损失和标点交叉熵损失;
    第二计算单元,用于根据字符交叉熵损失、字符交叉熵损失对应的第一权值和标点交叉熵损失、标点交叉熵损失对应的第二权值,计算加权交叉熵损失;
    调整单元,用于根据所述加权交叉熵损失调整第二多任务神经网络模型的参数,得到训练后的第一多任务神经网络模型。
  29. 一种语音识别装置,其特征在于,包括:
    处理器;
    用于存储处理器可执行指令的存储器;
    其中,所述处理器被配置为执行所述指令时实现权利要求1-10任意一项所述的方法。
  30. 一种神经网络模型训练装置,其特征在于,包括:
    处理器;
    用于存储处理器可执行指令的存储器;
    其中,所述处理器被配置为执行所述指令时实现权利要求11-14任意一项所述的方法。
  31. 一种非易失性计算机可读存储介质,其上存储有计算机程序指令,其特征在于,所述计算机程序指令被处理器执行时实现权利要求1-10中任意一项所述的方法,或者,实现权利要求11-14任意一项所述的方法。
PCT/CN2021/142470 2020-12-31 2021-12-29 语音识别方法及装置 WO2022143768A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21914500.0A EP4250285A4 (en) 2020-12-31 2021-12-29 METHOD AND DEVICE FOR SPEECH RECOGNITION
US18/258,316 US20240038223A1 (en) 2020-12-31 2021-12-29 Speech recognition method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011625075.0 2020-12-31
CN202011625075.0A CN114694636A (zh) 2020-12-31 2020-12-31 语音识别方法及装置

Publications (1)

Publication Number Publication Date
WO2022143768A1 true WO2022143768A1 (zh) 2022-07-07

Family

ID=82135071

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/142470 WO2022143768A1 (zh) 2020-12-31 2021-12-29 语音识别方法及装置

Country Status (4)

Country Link
US (1) US20240038223A1 (zh)
EP (1) EP4250285A4 (zh)
CN (1) CN114694636A (zh)
WO (1) WO2022143768A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1031498A (ja) * 1996-07-15 1998-02-03 Toshiba Corp 辞書登録装置及び辞書登録方法
CN110010153A (zh) * 2019-03-25 2019-07-12 平安科技(深圳)有限公司 一种基于神经网络的静音检测方法、终端设备及介质
CN111145728A (zh) * 2019-12-05 2020-05-12 厦门快商通科技股份有限公司 语音识别模型训练方法、系统、移动终端及存储介质
CN111261162A (zh) * 2020-03-09 2020-06-09 北京达佳互联信息技术有限公司 语音识别方法、语音识别装置及存储介质
CN112634876A (zh) * 2021-01-04 2021-04-09 北京有竹居网络技术有限公司 语音识别方法、装置、存储介质及电子设备
CN112652291A (zh) * 2020-12-15 2021-04-13 携程旅游网络技术(上海)有限公司 基于神经网络的语音合成方法、系统、设备及存储介质

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102017216571B4 (de) * 2017-09-19 2022-10-06 Volkswagen Aktiengesellschaft Kraftfahrzeug
US11195513B2 (en) * 2017-09-27 2021-12-07 International Business Machines Corporation Generating phonemes of loan words using two converters

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1031498A (ja) * 1996-07-15 1998-02-03 Toshiba Corp 辞書登録装置及び辞書登録方法
CN110010153A (zh) * 2019-03-25 2019-07-12 平安科技(深圳)有限公司 一种基于神经网络的静音检测方法、终端设备及介质
CN111145728A (zh) * 2019-12-05 2020-05-12 厦门快商通科技股份有限公司 语音识别模型训练方法、系统、移动终端及存储介质
CN111261162A (zh) * 2020-03-09 2020-06-09 北京达佳互联信息技术有限公司 语音识别方法、语音识别装置及存储介质
CN112652291A (zh) * 2020-12-15 2021-04-13 携程旅游网络技术(上海)有限公司 基于神经网络的语音合成方法、系统、设备及存储介质
CN112634876A (zh) * 2021-01-04 2021-04-09 北京有竹居网络技术有限公司 语音识别方法、装置、存储介质及电子设备

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4250285A4 *

Also Published As

Publication number Publication date
CN114694636A (zh) 2022-07-01
EP4250285A4 (en) 2024-04-24
EP4250285A1 (en) 2023-09-27
US20240038223A1 (en) 2024-02-01

Similar Documents

Publication Publication Date Title
CN108304846B (zh) 图像识别方法、装置及存储介质
US20220115034A1 (en) Audio response messages
US11610354B2 (en) Joint audio-video facial animation system
WO2021196981A1 (zh) 语音交互方法、装置和终端设备
CN107077841B (zh) 用于文本到语音的超结构循环神经网络
CN111261144B (zh) 一种语音识别的方法、装置、终端以及存储介质
US8370143B1 (en) Selectively processing user input
US11934780B2 (en) Content suggestion system
US11017173B1 (en) Named entity recognition visual context and caption data
US20220351453A1 (en) Animated expressive icon
US20240105159A1 (en) Speech processing method and related device
WO2023207541A1 (zh) 一种语音处理方法及相关设备
EP3803855A1 (en) A highly empathetic tts processing
US20220035495A1 (en) Interactive messaging stickers
CN113821589A (zh) 一种文本标签的确定方法及装置、计算机设备和存储介质
CN112487137A (zh) 使用集成共享资源来流线化对话处理
CN113220848A (zh) 用于人机交互的自动问答方法、装置和智能设备
WO2022143768A1 (zh) 语音识别方法及装置
CN115631251A (zh) 基于文本生成图像的方法、装置、电子设备和介质
KR102446305B1 (ko) 하이라이팅 기능이 포함된 감정 분석 서비스를 위한 방법 및 장치
CN113821609A (zh) 一种答案文本的获取方法及装置、计算机设备和存储介质
US20230341948A1 (en) Multimodal ui with semantic events
CN116959407A (zh) 一种读音预测方法、装置及相关产品
CN113221644A (zh) 槽位词的识别方法、装置、存储介质及电子设备
CN113934501A (zh) 翻译方法、装置、存储介质及电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21914500

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 18258316

Country of ref document: US

ENP Entry into the national phase

Ref document number: 2021914500

Country of ref document: EP

Effective date: 20230620

NENP Non-entry into the national phase

Ref country code: DE