CN115376515A - Voice recognition method and device and electronic equipment


Info

Publication number
CN115376515A
Authority
CN
China
Prior art keywords: real, model, streaming, time, voice
Prior art date
Legal status
Pending
Application number
CN202110552936.5A
Other languages
Chinese (zh)
Inventor
丁科
向鸿雨
万广鲁
Current Assignee
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd
Priority to CN202110552936.5A
Publication of CN115376515A


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/32 Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search

Abstract

The application discloses a voice recognition method, belonging to the field of computer technology, that helps improve the efficiency of voice recognition. The method comprises the following steps: encoding a voice segment acquired in real time through the encoding module of a pre-trained streaming model, and outputting a first hidden layer vector representation of the voice segment; decoding the first hidden layer vector representation through the decoding module of the streaming model, and determining the real-time recognition result corresponding to the voice segment; and, through a pre-trained non-streaming model, re-scoring the real-time recognition result based on the first hidden layer vector representations of the voice segments acquired at each moment of the voice input to which the segment belongs, thereby determining the accuracy score of the real-time recognition result with respect to the voice input as a whole.

Description

Voice recognition method and device and electronic equipment
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a voice recognition method, a voice recognition device, electronic equipment and a computer-readable storage medium.
Background
End-to-end voice recognition is an important voice recognition technology, and in many scenarios it achieves better recognition results than traditional voice recognition schemes. For example, some device-side application scenarios (such as voice search and voice input methods) require streaming recognition of voice data, that is, recognition results are returned while the user is still speaking; applying a streaming end-to-end voice recognition technology enables real-time recognition of the input voice. To improve the performance of streaming end-to-end speech recognition, the prior art usually adopts a two-pass encoding and decoding scheme: one streaming end-to-end model performs the first encoding and decoding pass to obtain the n candidate recognition results with the highest scores, and another non-streaming end-to-end model then re-scores those n candidate recognition results, which constitutes the second encoding and decoding pass. The two-pass scheme of the prior art has at least the following defect: the streaming model used in the first pass and the non-streaming model used in the second pass are two different models, so the input voice must be encoded and decoded twice, making the computation cost of voice recognition large and its efficiency low.
It can be seen that there is a need for improvement in the speech recognition methods of the prior art.
Disclosure of Invention
The embodiment of the application provides a voice recognition method which is beneficial to improving the efficiency of voice recognition.
In a first aspect, an embodiment of the present application provides a speech recognition method, including:
coding a real-time acquired voice segment through a coding module of a pre-trained streaming model, and outputting a first hidden layer vector representation of the real-time acquired voice segment;
decoding the first hidden layer vector representation through a decoding module of the streaming model, and determining a real-time recognition result corresponding to the real-time acquired voice segment;
through a pre-trained non-streaming model, re-scoring the real-time recognition result based on the first hidden layer vector representation of the voice fragment obtained at each moment in the voice input to which the real-time obtained voice fragment belongs, and determining the accuracy score of the real-time recognition result corresponding to the voice input whole; wherein the non-streaming model shares model parameters of the streaming model.
In a second aspect, an embodiment of the present application provides a speech recognition apparatus, including:
the streaming coding module is used for coding the real-time acquired voice segment through a pre-trained coding module of the streaming model and outputting a first hidden layer vector representation of the real-time acquired voice segment;
the streaming decoding module is used for decoding the first hidden layer vector representation through the decoding module of the streaming model and determining a real-time recognition result corresponding to the real-time acquired voice fragment;
the recognition result re-scoring module is used for re-scoring the real-time recognition result based on the first hidden layer vector representation of the voice segment acquired at each moment in the voice input to which the real-time acquired voice segment belongs through a pre-trained non-streaming model, and determining the accuracy score of the real-time recognition result corresponding to the whole voice input; wherein the non-streaming model shares model parameters of the streaming model.
In a third aspect, an embodiment of the present application further discloses an electronic device, which includes a memory, a processor, and a computer program that is stored in the memory and is executable on the processor, and when the processor executes the computer program, the speech recognition method according to the embodiment of the present application is implemented.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor, and the program includes steps of the speech recognition method disclosed in the present application.
The speech recognition method disclosed by the embodiment of the application encodes a speech segment acquired in real time through a pre-trained encoding module of a streaming model, and outputs a first hidden layer vector representation of the speech segment acquired in real time; decoding the first hidden layer vector representation through a decoding module of the streaming model, and determining a real-time recognition result corresponding to the real-time acquired voice segment; through a pre-trained non-streaming model, based on the first hidden layer vector representation of the voice segment acquired at each moment in the voice input to which the voice segment acquired in real time belongs, the real-time recognition result is re-scored, the accuracy score of the real-time recognition result corresponding to the voice input whole is determined, and the efficiency of voice recognition is improved.
The above description is only an overview of the technical solutions of the present application, and the present application may be implemented in accordance with the content of the description so as to make the technical means of the present application more clearly understood, and the detailed description of the present application will be given below in order to make the above and other objects, features, and advantages of the present application more clearly understood.
Drawings
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
FIG. 1 is a flowchart of a speech recognition method according to a first embodiment of the present application;
FIG. 2 is a schematic diagram of a streaming model structure adopted in a testing stage according to an embodiment of the present application;
FIG. 3 is a schematic diagram of data transmission between a streaming model and a non-streaming model in a training stage according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a speech recognition apparatus according to a second embodiment of the present application;
fig. 5 is a second schematic structural diagram of a speech recognition apparatus according to a second embodiment of the present application;
FIG. 6 schematically shows a block diagram of an electronic device for performing a method according to the present application; and
fig. 7 schematically shows a storage unit for holding or carrying program code implementing a method according to the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
Embodiment One
As shown in fig. 1, a speech recognition method disclosed in an embodiment of the present application includes: step 110 to step 130.
Step 110: encode the speech segment acquired in real time through the encoding module of a pre-trained streaming model, and output a first hidden layer vector representation of the speech segment acquired in real time.
The streaming model and the non-streaming model used in the speech recognition method of the embodiment of the present application are end-to-end neural network structures; for example, an AED (attention-based encoder-decoder) model may be used as the basic model. In the prior art, an AED model includes an encoding module and an attention-based decoding module. During speech recognition, the encoding module encodes the input speech signal into a compact hidden representation vector, and the decoding module, taking the hidden representation vector output by the encoding module as input, decodes the recognition result step by step in an autoregressive manner.
In particular, the present application improves upon the prior-art AED model structure. As shown in fig. 2, the streaming model used in the speech recognition method of the embodiment of the present application is an end-to-end neural network structure comprising: an encoding module 210, a decoding module 220, and a prediction module 230, where the decoding module 220 is further composed of an attention submodule 2201 and a decoding submodule 2202. In some embodiments of the present application, the encoding module 210 and the decoding submodule 2202 may employ a typical encoding/decoding structure, such as a recurrent network structure based on LSTM, or a feedforward network structure based on a Transformer, GLU, or the like; the attention submodule 2201 may adopt a typical attention-mechanism network structure, such as a multi-head scaled dot-product attention structure; the prediction module 230 may be built from a fully connected network. The embodiment of the present application does not limit the network structures of the encoding module 210, the decoding submodule 2202, the attention submodule 2201, and the prediction module 230.
The encoding module 210 is configured to encode an input speech segment and output a first hidden layer vector representation. The prediction module 230 is configured to perform length prediction on the first hidden layer vector representation output by the encoding module 210 and output the number of characters to be decoded for the speech segment. The attention submodule 2201 is configured to perform weighted processing on the first hidden layer vector representation output by the encoding module 210 together with the encoding module 210's outputs at previous times, and to control the decoding submodule 2202 to perform one or more rounds of decoding operations according to the number of decoded characters output by the prediction module 230, so as to output that number of decoded characters.
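As a concrete illustration, the following PyTorch-style sketch shows how the four modules just described might be wired together. It is not the patented implementation: the patent leaves the network structures open, so the LSTM encoder, the MLP predictor, the multi-head attention submodule, the LSTMCell decoder, and all dimensions and names here are assumptions.

```python
import torch
import torch.nn as nn

class StreamingModel(nn.Module):
    """Hypothetical skeleton of the streaming model of Fig. 2 (all choices assumed)."""
    def __init__(self, feat_dim=80, hid=256, vocab=4000, max_chars=10):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hid, batch_first=True)       # encoding module 210
        self.predictor = nn.Sequential(                                # prediction module 230
            nn.Linear(hid, hid), nn.ReLU(), nn.Linear(hid, max_chars + 1))
        self.attn = nn.MultiheadAttention(hid, num_heads=4,            # attention submodule 2201
                                          batch_first=True)
        self.dec_cell = nn.LSTMCell(2 * hid, hid)                      # decoding submodule 2202
        self.embed = nn.Embedding(vocab, hid)
        self.out = nn.Linear(hid, vocab)

    def encode_chunk(self, chunk_feats, enc_state=None):
        """Encode one chunk of acoustic features into first hidden layer vectors e_t."""
        e_t, enc_state = self.encoder(chunk_feats, enc_state)          # (1, frames, hid)
        return e_t, enc_state

    def predict_num_chars(self, e_t):
        """Length prediction: how many characters this chunk should decode."""
        return int(self.predictor(e_t[:, -1]).argmax(dim=-1))

model = StreamingModel()
e_t, st = model.encode_chunk(torch.randn(1, 32, 80))
print(e_t.shape, model.predict_num_chars(e_t))   # torch.Size([1, 32, 256]) and an int in 0..10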
In some embodiments of the present application, the non-streaming model is constructed from the same basic model structure as the streaming model (e.g., a prior-art AED model): the encoding and decoding modules of the non-streaming model are configured with network structures consistent with those of the streaming model, and the non-streaming model is configured to share the model parameters of the streaming model's basic network structure. Thus, in the testing stage, the decoding module of the non-streaming model can directly use the hidden layer vectors output by the encoding module of the streaming model for whole-utterance decoding of the input speech; in the training stage, the encoding module and decoding module of the non-streaming model share the parameters of the encoding module and decoding module of the streaming model, which reduces the number of model parameters to be trained and improves model training efficiency.
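One assumed mechanism for this parameter sharing is to let the non-streaming model point at the very same encoder and decoder objects, so a single weight set serves both passes; the sketch below illustrates the idea and is not prescribed by the patent:

```python
import torch.nn as nn

class Codec(nn.Module):
    """Toy encoder/decoder pair standing in for the shared basic network."""
    def __init__(self, feat_dim=80, hid=256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hid, batch_first=True)
        self.decoder = nn.LSTMCell(hid, hid)

streaming = Codec()
non_streaming = Codec()
# Share parameters by aliasing the submodules instead of copying weights:
non_streaming.encoder = streaming.encoder
non_streaming.decoder = streaming.decoder
# Any gradient update through either model now updates the single shared weight set.
assert non_streaming.encoder.weight_ih_l0 is streaming.encoder.weight_ih_l0
```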
Additionally, in the embodiment of the present application, the prior-art AED model is adapted for streaming based on a chunk scheme: the input speech is divided into chunks (speech segments) of equal length, for example speech data of 320 ms per segment, and the set prediction module 230 then predicts the number of characters that each chunk should decode.
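A minimal sketch of the chunking step, assuming 16 kHz single-channel audio (the sampling rate is not specified in the patent; only the 320 ms chunk length is):

```python
import numpy as np

SAMPLE_RATE = 16000          # assumed sampling rate
CHUNK_MS = 320               # chunk length used in the embodiment

def split_into_chunks(waveform: np.ndarray) -> np.ndarray:
    """Split a mono waveform into equal-length 320 ms chunks.

    The last partial chunk is zero-padded here; a real system might
    instead flush it at end-of-utterance.
    """
    chunk_len = SAMPLE_RATE * CHUNK_MS // 1000   # 5120 samples per chunk
    n = -(-len(waveform) // chunk_len)           # ceiling division
    padded = np.zeros(n * chunk_len, dtype=waveform.dtype)
    padded[:len(waveform)] = waveform
    return padded.reshape(n, chunk_len)

chunks = split_into_chunks(np.random.randn(SAMPLE_RATE * 2).astype(np.float32))
print(chunks.shape)   # (7, 5120) for a 2 s utterance
```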
In the testing stage, after the speech segment obtained in real time is input to the streaming model, the encoding module 210 of the streaming model encodes the input speech segment and outputs the first hidden layer vector representation of the speech segment.
The specific implementation by which the encoding module 210 encodes the input speech segment follows the prior art and is not described in this embodiment.
Step 120: decode the first hidden layer vector representation through the decoding module of the streaming model, and determine the real-time recognition result corresponding to the speech segment acquired in real time.
Next, the first hidden layer vector representation output by the encoding module 210 is input to the decoding module 220, and the decoding module 220 performs decoding mapping to obtain a speech recognition result of the speech segment obtained in real time.
In some embodiments of the application, decoding the first hidden layer vector representation through the decoding module of the streaming model and determining the real-time recognition result corresponding to the speech segment obtained in real time includes: performing length prediction on the first hidden layer vector representation through the prediction module of the streaming model, predicting the number of decoded characters matching that representation; and performing, through the decoding module of the streaming model, autoregressive decoding based on the first hidden layer vector representation, the context vector representation of the first hidden layer vector representation, and the number of decoded characters, determining a real-time recognition result for each of that number of characters corresponding to the speech segment obtained in real time.
As described above, in the speech recognition method disclosed in the embodiment of the present application, to improve the accuracy of streaming recognition, the speech input is divided into a number of speech segments, and while speech input is being detected, speech recognition can be performed on the current segment as soon as a complete speech segment has been acquired. The encoding module 210 then performs forward inference on the currently acquired speech segment to obtain the hidden layer vector representation corresponding to the current time. Because the speech input is not yet complete, that is, not all speech segments of the speech input have arrived, the prediction module 230 is needed to predict how many characters should be decoded in the current decoding step. Taking the current time as t and the corresponding hidden layer vector as e_t, the prediction module 230 takes e_t as input and outputs the predicted number of decoded characters n, where n is an integer greater than or equal to 0.
Thereafter, the attention submodule 2201 in the decoding module 220 performs a weighted fusion of the first hidden layer vector representations of all speech segments received so far that belong to the current speech input (e.g., the first hidden layer vector representations e_1, e_2, …, e_t obtained by the encoding module 210 from all speech segments of the current speech input up to time t). The decoding submodule 2202 then performs autoregressive decoding on the weighted, fused hidden layer vector output by the attention submodule 2201, conditioned on the decoding result output at time t-1, to obtain the decoding result.
In some embodiments of the present application, when the number of decoded characters predicted by the prediction module 230 is 1, the attention submodule 2201 and the decoding submodule 2202 execute one round in sequence and output the speech recognition result of the 1 character corresponding to the currently acquired speech segment. In other embodiments of the present application, when the predicted number of decoded characters n is greater than 1, the attention submodule 2201 and the decoding submodule 2202 execute n rounds in sequence, outputting in turn the speech recognition results of the n characters corresponding to the currently acquired speech segment.
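The n-round loop can be pictured as follows. This is an illustrative sketch with random weights and assumed shapes, not the patented implementation:

```python
import torch
import torch.nn as nn

hid, vocab = 256, 4000
attn = nn.MultiheadAttention(hid, num_heads=4, batch_first=True)
dec_cell = nn.LSTMCell(2 * hid, hid)
embed, out = nn.Embedding(vocab, hid), nn.Linear(hid, vocab)

enc_so_far = torch.randn(1, 12, hid)  # e_1..e_t for all chunks of the utterance so far
n = 2                                 # characters the predictor assigned to this chunk
prev_char = torch.tensor([0])         # character decoded at time t-1 (0 = <sos> here)
h = c = torch.zeros(1, hid)

for _ in range(n):                    # one decoding round per predicted character
    ctx, _ = attn(h.unsqueeze(1), enc_so_far, enc_so_far)   # weighted fusion of e_1..e_t
    h, c = dec_cell(torch.cat([ctx.squeeze(1), embed(prev_char)], dim=-1), (h, c))
    prev_char = out(h).argmax(dim=-1)                       # autoregressive feedback
    print("decoded char id:", int(prev_char))
```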
In some embodiments of the present application, after determining the real-time recognition result corresponding to the speech segment obtained in real time, the method further includes: outputting the real-time recognition result with the highest score. For example, after the decoding module 220 outputs several candidate recognition results for a character of the speech segment obtained at time t, the upper-layer application may select the candidate with the highest score as the character recognition result corresponding to that speech segment, and output and display it.
Step 130: through a pre-trained non-streaming model, re-score the real-time recognition result based on the first hidden layer vector representations of the speech segments acquired at each moment of the speech input to which the real-time speech segment belongs, and determine the accuracy score of the real-time recognition result with respect to the speech input as a whole.
Wherein the non-streaming model shares model parameters of the streaming model.
In some embodiments of the present application, the non-streaming model includes, as previously described: an encoding module and a decoding module, wherein the model structure and parameters of the encoding module of the non-streaming model are the same as those of the encoding module 210 of the streaming model, and the model structure and parameters of the decoding module of the non-streaming model are the same as those of the decoding module 220 of the streaming model.
In the testing stage, after the complete voice input has been obtained, a second decoding pass can be executed to re-score the decoding results that were output in real time. In some embodiments of the present application, re-scoring the real-time recognition result through a pre-trained non-streaming model based on the first hidden layer vector representations of the speech segments acquired at each moment of the speech input, and determining the accuracy score of the real-time recognition result with respect to the speech input as a whole, includes: decoding, through the decoding module of the pre-trained non-streaming model, the first hidden layer vector representations of the speech segments acquired at each moment of the speech input to which the real-time speech segment belongs, and determining at least one group of non-streaming recognition results corresponding to the speech input; respectively calculating the error between each group of non-streaming recognition results and the highest-scoring real-time recognition result; and determining, according to the error, the accuracy score of the real-time recognition result relative to the corresponding non-streaming recognition result.
For example, each speech segment in the complete speech input may be first encoded by an encoding module of a pre-trained non-streaming model to obtain a second hidden layer vector representation corresponding to each speech segment, and then each second hidden layer vector representation is decoded by a decoding module of the non-streaming model to determine at least one set of non-streaming recognition results corresponding to the speech input.
In some embodiments of the present application, since the encoding module of the non-streaming model has the same structure and parameters as the encoding module of the streaming model, the decoding module of the non-streaming model may directly use the first hidden layer vector representation corresponding to each speech segment output by the encoding module of the streaming model to perform decoding processing, and determine at least one set of non-streaming recognition results corresponding to the speech input.
For a specific implementation of the non-streaming recognition result of the complete speech input by decoding the input vector representation sequence by the decoding module of the non-streaming model, refer to the prior art (for example, refer to a decoding processing scheme of an AED model in the prior art), which is not described in detail in this embodiment of the present application.
In some embodiments of the present application, each set of non-streaming recognition results output by the non-streaming model includes: a sequence of characters and a probability corresponding to the sequence of characters.
The real-time recognition result in the embodiment of the application is determined from the candidate recognition results output after speech recognition is performed, based on the streaming model, on each speech segment acquired in real time. For example, for the speech segments X_1, X_2, …, X_t acquired in real time at times 1 through t of the current speech input, the candidate recognition results output after speech recognition based on the streaming model are denoted Y_1 = {Y_1^{s1}, …, Y_1^{sm}}, Y_2 = {Y_2^{s1}, …, Y_2^{sm}}, …, Y_t = {Y_t^{s1}, …, Y_t^{sm}}, where the candidate recognition results for the speech segment acquired at each moment comprise m characters (m an integer greater than 1). In some embodiments of the present application, for each speech segment the highest-scoring candidate recognition result is selected as the real-time recognition result of that segment, and arranging the real-time recognition results of all speech segments in order gives the real-time recognition result of the complete speech input. For example, the real-time recognition result of a certain speech input is "Y_1^{s1} Y_2^{s1} … Y_t^{sm}".
As described above, the non-streaming recognition result in this embodiment is obtained by recognizing the complete speech input through the non-streaming model after the complete speech input has been acquired; that is, the non-streaming recognition result is a whole-sentence speech recognition result. To enable the upper-layer application to output a more accurate speech recognition result, in some embodiments of the present application each real-time recognition result is further re-scored against the whole-sentence speech recognition result, and the re-scored result is output.
As described above, after the streaming model outputs the candidate recognition results for a speech segment obtained in real time, the upper-layer application outputs and displays, for each speech segment, the highest-scoring candidate as the real-time recognition result of that segment. For example, when the user speaks "run twelve kilometers", the real-time recognition results may be "run", "step", "hour", "two", "public", "in" character by character, while the final streaming recognition result may be "run twelve kilometers" or "run two kilometers"; the score of each group of recognition results is then obtained by forward inference through the non-streaming model. For example, the output layer of the non-streaming model may use the real-time recognition result of the speech input as the label of the speech input, use each group of non-streaming recognition results as predicted speech recognition values for the speech input, and compute the error between each group of non-streaming recognition results and the label. As can be seen from the foregoing description, if the error between a high-scoring non-streaming recognition result and the real-time recognition result is small, the real-time recognition result is more accurate; if that error is large, the real-time recognition result is less accurate. Therefore, the real-time recognition result can be further re-scored according to the error and the predicted score of each group of non-streaming recognition results. In some embodiments of the present application, a weight negatively correlated with the error may be determined, and the product of that weight and the score of the corresponding non-streaming recognition result may be used as the accuracy score of the real-time recognition result relative to that non-streaming recognition result.
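The patent does not fix the form of the negatively correlated weight. The sketch below assumes exp(-error) and a simple position-wise mismatch count purely for illustration; an edit distance would serve equally well:

```python
import math

def rescore(non_streaming_hyps, realtime_best):
    """Weight each non-streaming hypothesis by exp(-error) against the real-time
    result; both the weight form and the error metric are assumptions."""
    scored = []
    for chars, prob in non_streaming_hyps:
        err = sum(a != b for a, b in zip(chars, realtime_best))
        err += abs(len(chars) - len(realtime_best))     # penalize length mismatch
        scored.append((chars, math.exp(-err) * prob))   # small error -> weight near 1
    return max(scored, key=lambda s: s[1])

hyps = [(["run", "twelve", "kilometers"], 0.42),
        (["run", "two", "kilometers"], 0.38)]
print(rescore(hyps, ["run", "twelve", "kilometers"]))
# (['run', 'twelve', 'kilometers'], 0.42)
```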
To improve speech recognition efficiency and reduce the amount of computation in the prediction stage, in some embodiments of the present application, re-scoring the real-time recognition result through the pre-trained non-streaming model based on the first hidden layer vector representations of the speech segments acquired at each moment of the speech input, and determining the accuracy score of the real-time recognition result with respect to the speech input as a whole, includes: executing a teacher-forcing procedure through the pre-trained non-streaming model, taking as input the first hidden layer vector representations of the speech segments acquired at each moment of the speech input together with the real-time recognition result, performing forward inference to re-score the real-time recognition result, and determining the accuracy score of the real-time recognition result with respect to the speech input as a whole.
Teacher Forcing is a network training technique in which, instead of using the output of the previous time step as the input of the next, the corresponding previous item of the ground truth of the training data is fed directly as the next input. In the embodiment of the present application, in the prediction stage the non-streaming model executes the teacher-forcing procedure, i.e., the forward computation of the training stage: the first hidden layer vector representations of the speech segments acquired at each moment of the speech input serve as the feature input of the non-streaming model at each corresponding moment, and the real-time recognition result at each moment serves as the input of the next moment, instead of the non-streaming model's own output at the previous moment; forward inference is performed in this manner to determine at least one group of non-streaming recognition results. Then, according to the method above, the error between each group of non-streaming recognition results and the real-time recognition result is computed, and the accuracy score of the real-time recognition result with respect to the whole speech input is determined from these errors. In some embodiments of the present application, after determining the accuracy score of the real-time recognition result with respect to the whole speech input, the method further includes: outputting the accuracy score of the real-time recognition result relative to the corresponding non-streaming recognition result. For example, the output layer of the non-streaming model outputs the accuracy score of the real-time recognition result relative to each non-streaming recognition result, and the upper-layer application may further present the non-streaming recognition results to the user together with these accuracy scores, so that the user obtains more information about the speech recognition results.
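A sketch of the teacher-forced rescoring pass: the cached streaming encoder outputs and the real-time result go in, and the accumulated log-probability of that result comes out. Weights are random and all names, shapes, and token ids are assumptions:

```python
import torch
import torch.nn as nn

hid, vocab = 256, 4000
attn = nn.MultiheadAttention(hid, num_heads=4, batch_first=True)
dec_cell = nn.LSTMCell(2 * hid, hid)
embed, out = nn.Embedding(vocab, hid), nn.Linear(hid, vocab)

enc_all = torch.randn(1, 40, hid)   # e_1..e_T cached from the streaming encoder
realtime_ids = [23, 7, 95, 412]     # real-time recognition result as token ids
h = c = torch.zeros(1, hid)
prev = torch.tensor([0])            # <sos>
score = 0.0
for target in realtime_ids:
    ctx, _ = attn(h.unsqueeze(1), enc_all, enc_all)
    h, c = dec_cell(torch.cat([ctx.squeeze(1), embed(prev)], dim=-1), (h, c))
    logp = out(h).log_softmax(dim=-1)
    score += float(logp[0, target])   # log-prob the model assigns to the known character
    prev = torch.tensor([target])     # teacher forcing: feed the known character, not argmax
print("accuracy-style score for the real-time result:", score)
```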
To help the reader understand the innovation of the present application, the training schemes of the streaming model and the non-streaming model are further described below in conjunction with the structure and data-transmission diagram of the two models shown in fig. 3.
In some embodiments of the present application, the streaming model and the non-streaming model are trained as follows: obtain a plurality of training samples, where the sample data of each training sample is a speech segment sequence and the sample label is the true character sequence corresponding to that speech segment sequence; for each training sample, determine a first model loss value according to the first estimated character sequence obtained by streaming encoding and decoding of the sample's speech segment sequence with the streaming model and the true character sequence corresponding to that speech segment sequence, and determine a second model loss value according to the second estimated character sequence obtained by whole-input encoding and decoding of the sample's speech segment sequence with the non-streaming model and the true character sequence; determine a joint loss value of the streaming and non-streaming models from the first and second model loss values of each training sample; and, taking optimization of the joint loss value as the objective, optimize the model parameters of the streaming model and synchronously update the model parameters that the non-streaming model shares with it, so as to jointly train the two models.
As shown in fig. 3, since the non-streaming model is constructed from a prior-art end-to-end model (such as an AED model), the non-streaming model includes an encoding module 310 and a decoding module 320. The streaming model is obtained by modifying the same end-to-end model (such as an AED model), so its main structure is the same; the streaming model includes an encoding module 330, a decoding module 340, and a prediction module 350, and the two models share the parameters of the encoding and decoding modules, so they can be trained on the same training sample set. In this embodiment, the sample data of each training sample used to train the two models is the sequence of speech segments (e.g., 320 ms segments) of a complete speech input at different times, and the sample label is the true character sequence corresponding to the speech segment sequence in the sample data.
In an embodiment of the present application, the non-streaming model and the streaming model are jointly trained on the training sample set. For example, for a training sample x, the speech segments at each moment in x are sequentially encoded and decoded by the streaming model, the candidate recognition results of each speech segment are determined in turn, and the highest-scoring candidate of each segment is selected to form the streaming recognition result of x output by the streaming model, i.e., the first estimated character sequence. Then the decoding module of the non-streaming model performs non-streaming decoding based on the hidden layer vectors output by the encoding module of the streaming model while stream-processing x, determining the non-streaming recognition result of x, i.e., the second estimated character sequence.
For the speech segment at time t, the encoding module 330 of the streaming model first encodes the segment and determines its first hidden layer vector representation; the prediction module 350 then performs prediction on the first hidden layer vector representation, predicting the number of characters to be recognized for the segment at time t; finally, the decoding module 340 of the streaming model performs decoding based on the first hidden layer vector representation and outputs candidate recognition results for that number of characters.
In some embodiments of the present application, the second estimated character sequence is obtained as follows: obtain the first hidden layer vector representation corresponding to each speech segment, output by the encoding module of the streaming model while the streaming model streams through the speech segment sequence of the training sample; and decode the first hidden layer vector representations corresponding to all segments through the decoding module of the non-streaming model to obtain the second estimated character sequence of the training sample. After the streaming model has processed all the speech segments in training sample X, the decoding module 320 of the non-streaming model fuses the first hidden layer vector representations corresponding to all the speech segments in X, performs decoding based on the fused vector, and outputs the non-streaming recognition result of training sample X.
In other embodiments of the present application, the encoding module 310 of the non-streaming model may first encode all speech segments in training sample X to obtain the second hidden layer vector of each segment; the decoding module 320 of the non-streaming model then fuses the second hidden layer vector representations corresponding to all the speech segments in X, performs decoding based on the fused vectors, and outputs the non-streaming recognition result of training sample X.
Since the encoding module 310 of the non-streaming model and the encoding module 330 of the streaming model have the same network structure and model parameters, the second hidden layer vector is identical to the first hidden layer vector. Performing the decoding operation directly on the first hidden layer vectors output by the encoding module 330 of the streaming model therefore reduces the amount of computation and improves model training efficiency.
Next, for each training sample, the prediction loss of the streaming model on that sample is computed from the streaming recognition result, i.e., the error between the first estimated character sequence and the sample label, and taken as the first model loss value of the sample; the prediction loss of the non-streaming model is computed from the non-streaming recognition result, i.e., the error between the second estimated character sequence and the sample label, and taken as the second model loss value. Then, for each training sample, the first and second model loss values may be added, and the sum used as the joint loss of that training sample for the streaming and non-streaming models. Finally, the error of the current prediction round of the streaming and non-streaming models over the training samples is obtained by accumulating the joint losses of all training samples.
Further, a gradient descent method may be used to back-propagate the error of the current prediction round and optimize the model parameters of the non-streaming and streaming models, performing the next round of iterative training until the error no longer decreases, the prediction error converges to a preset range, or the number of training iterations reaches a preset threshold.
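A condensed sketch of one joint-training step, under the assumptions that streaming_model returns per-character logits plus its encoder outputs and that nonstreaming_decoder consumes those cached outputs; both callables and the plain summed cross-entropy loss are illustrative, not mandated by the patent:

```python
import torch
import torch.nn.functional as F

def joint_training_step(streaming_model, nonstreaming_decoder, optimizer, batch):
    """One gradient-descent step on the joint (first + second) model loss.

    batch: iterable of (chunk_sequence, target_ids) training samples.
    """
    optimizer.zero_grad()
    joint_loss = torch.zeros(())
    for chunks, targets in batch:
        logits1, enc_outs = streaming_model(chunks)        # streaming pass: 1st estimated sequence
        logits2 = nonstreaming_decoder(enc_outs, targets)  # whole-input pass reuses e_1..e_T
        loss1 = F.cross_entropy(logits1, targets)          # first model loss value
        loss2 = F.cross_entropy(logits2, targets)          # second model loss value
        joint_loss = joint_loss + loss1 + loss2            # joint loss: simple sum
    joint_loss.backward()     # shared parameters receive gradients from both passes at once
    optimizer.step()
    return float(joint_loss)
```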
In some embodiments of the present application, because the streaming model and the non-streaming model share the model parameters of the basic model structure, the number of model parameters to be optimized is halved compared with the differently structured streaming and non-streaming models of the prior art, which significantly improves model training efficiency.
In some embodiments of the present application, the prediction module is a pre-trained network. For example, the prediction module may be a three-layer MLP (Multilayer Perceptron) network. The training data of the prediction module are data pairs each consisting of a speech segment and an M-class label (M an integer greater than 1), where the M-class label indicates the number of characters corresponding to the speech segment; the label may be annotated manually, or generated automatically by processing the speech segment with a prior-art CTC (connectionist temporal classification) algorithm. For the specific training scheme of the prediction module, refer to prior-art schemes for training multilayer neural networks, which are not repeated in this embodiment.
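A minimal three-layer MLP length predictor, assuming a 256-dimensional chunk representation and M = 11 classes (0 to 10 characters per chunk); all dimensions are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

M = 11   # assumed: a chunk maps to 0..10 characters, an M-way classification

predictor = nn.Sequential(
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, M),             # class i means "decode i characters for this chunk"
)

e_t = torch.randn(8, 256)              # batch of chunk-level hidden vectors
labels = torch.randint(0, M, (8,))     # character counts, e.g. from a CTC alignment
loss = F.cross_entropy(predictor(e_t), labels)
loss.backward()
print(float(loss))
```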
The voice recognition method disclosed in the embodiment of the present application encodes a voice segment acquired in real time through the encoding module of a pre-trained streaming model and outputs a first hidden layer vector representation of that segment; decodes the first hidden layer vector representation through the decoding module of the streaming model and determines the real-time recognition result corresponding to the segment; and, through a pre-trained non-streaming model that shares the model parameters of the streaming model, re-scores the real-time recognition result based on the first hidden layer vector representations of the voice segments acquired at each moment of the voice input, determining the accuracy score of the real-time recognition result with respect to the voice input as a whole. This helps improve the efficiency of voice recognition.
In the voice recognition method disclosed in the embodiment of the present application, the model structure of the end-to-end non-streaming model is modified and the streaming model is built on the basic network of the non-streaming model, so that the two models can share the model parameters of the basic network. In the prediction stage, the non-streaming model performs forward computation on the intermediate outputs produced during the streaming model's streaming encoding and decoding, re-scoring the real-time recognition result; this saves the encoding step of the non-streaming model and effectively improves voice recognition efficiency. Moreover, because the streaming and non-streaming models share the model parameters of the basic network, a joint training mode can be adopted, halving the model parameters to be optimized in the training stage, reducing the amount of computation, improving model training efficiency, and reducing the data processing resources consumed by model training.
Furthermore, because the streaming model and the non-streaming model share the model parameters of the basic network, fewer model parameters are involved in voice recognition, which helps improve the accuracy of voice recognition.
Embodiment Two
As shown in fig. 4, a speech recognition apparatus disclosed in an embodiment of the present application includes:
a streaming coding module 410, configured to code a speech segment obtained in real time through a coding module of a pre-trained streaming model, and output a first hidden layer vector representation of the speech segment obtained in real time;
a streaming decoding module 420, configured to decode, by using the decoding module of the streaming model, the first hidden layer vector representation, and determine a real-time recognition result corresponding to the real-time obtained speech segment;
a recognition result re-scoring module 430, configured to re-score the real-time recognition result based on the first hidden layer vector representation of the speech segment obtained at each time in the speech input to which the real-time obtained speech segment belongs through a pre-trained non-streaming model, and determine an accuracy score of the real-time recognition result corresponding to the whole speech input.
Wherein the non-streaming model shares model parameters of the streaming model.
In some embodiments of the present application, as shown in fig. 5, the apparatus further comprises:
the first recognition result output module 440 is configured to output the real-time recognition result with the highest score.
In some embodiments of the present application, the recognition result re-scoring module 430 is further configured to:
and executing a teacher forcing method through a pre-trained non-streaming model, taking the first hidden layer vector representation of the voice fragment acquired at each moment in the voice input to which the real-time acquired voice fragment belongs and the real-time recognition result as input, performing forward reasoning, re-scoring the real-time recognition result, and determining the accuracy score of the real-time recognition result corresponding to the voice input as a whole.
In some embodiments of the present application, as shown in fig. 5, the apparatus further comprises:
a second recognition result output module 450, configured to output an accuracy score of the real-time recognition result relative to the corresponding non-streaming recognition result.
In some embodiments of the present application, the streaming decoding module 420 is further configured to:
performing length prediction on the first hidden vector representation through a prediction module of the streaming model, and predicting the number of matched decoding characters of the first hidden vector representation;
performing, by the decoding module of the streaming model, auto-regression decoding based on the first hidden layer vector representation, the context vector representation of the first hidden layer vector representation, and the number of decoded characters, and determining a real-time recognition result of each character of the number of decoded characters corresponding to the real-time obtained speech segment.
In some embodiments of the application, the streaming model and the non-streaming model are trained by:
obtaining a plurality of training samples, wherein sample data of the training samples are voice fragment sequences, and sample labels are real values of character sequences corresponding to the voice fragment sequences;
for each training sample, determining a first model loss value corresponding to the training sample according to a first estimated character sequence obtained by performing streaming coding and decoding on a voice fragment sequence included in the training sample by using the streaming model and a character sequence true value corresponding to the voice fragment sequence; determining a second model loss value corresponding to the training sample according to a second estimated character sequence obtained by integrally coding and decoding the voice segment sequence included by the training sample through the non-streaming model and a character sequence real value corresponding to the voice segment sequence;
determining a joint loss value of the streaming model and the non-streaming model according to the first model loss value and the second model loss value corresponding to each training sample;
and optimizing model parameters of the streaming model by taking the optimization of the joint loss value as a target, and synchronously updating the model parameters shared by the non-streaming model and the streaming model so as to perform joint training on the streaming model and the non-streaming model.
The speech recognition device disclosed in the embodiment of the present application is used to implement the speech recognition method described in the first embodiment of the present application, and specific implementation manners of each module of the device are not described again, and reference may be made to specific implementation manners of corresponding steps in the method embodiments.
The speech recognition apparatus disclosed in the embodiment of the present application encodes a speech segment acquired in real time through the encoding module of a pre-trained streaming model and outputs a first hidden layer vector representation of that segment; decodes the first hidden layer vector representation through the decoding module of the streaming model and determines the real-time recognition result corresponding to the segment; and, through a pre-trained non-streaming model, re-scores the real-time recognition result based on the first hidden layer vector representations of the speech segments acquired at each moment of the speech input, determining the accuracy score of the real-time recognition result with respect to the speech input as a whole. The non-streaming model shares the model parameters of the streaming model, which helps improve the efficiency of speech recognition.
In the speech recognition apparatus disclosed in the embodiment of the present application, the model structure of the end-to-end non-streaming model is modified and the streaming model is built on the basic network of the non-streaming model, so that the two models can share the model parameters of the basic network. In the prediction stage, the non-streaming model performs forward computation on the intermediate outputs produced during the streaming model's streaming encoding and decoding, re-scoring the real-time recognition result; this saves the encoding step of the non-streaming model and effectively improves speech recognition efficiency. Moreover, because the streaming and non-streaming models share the model parameters of the basic network, a joint training mode can be adopted, halving the model parameters to be optimized in the training stage, reducing the amount of computation, improving model training efficiency, and reducing the data processing resources consumed by model training.
Furthermore, because the streaming model and the non-streaming model share the model parameters of the basic network, fewer model parameters are involved in speech recognition, which helps improve the accuracy of speech recognition.
The embodiments in the present specification are all described in a progressive manner, and each embodiment focuses on differences from other embodiments, and portions that are the same and similar between the embodiments may be referred to each other. For the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and reference may be made to the partial description of the method embodiment for relevant points.
The foregoing describes in detail a speech recognition method and apparatus provided by the present application, and specific examples are applied herein to explain the principles and embodiments of the present application, and the descriptions of the foregoing examples are only used to help understanding the method and a core idea of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components in an electronic device according to embodiments of the present application. The present application may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present application may be stored on a computer readable medium or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
For example, fig. 6 illustrates an electronic device that may implement a method in accordance with the present application. The electronic device can be a PC, a mobile terminal, a personal digital assistant, a tablet computer and the like. The electronic device conventionally comprises a processor 610 and a memory 620 and program code 630 stored on said memory 620 and executable on the processor 610, said processor 610 implementing the method described in the above embodiments when executing said program code 630. The memory 620 may be a computer program product or a computer readable medium. The memory 620 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. The memory 620 has a storage space 6201 for program code 630 of a computer program for performing any of the method steps described above. For example, the storage space 6201 for the program code 630 may include respective computer programs for implementing the various steps in the above method, respectively. The program code 630 is computer readable code. The computer programs may be read from or written to one or more computer program products. These computer program products comprise a program code carrier such as a hard disk, a Compact Disc (CD), a memory card or a floppy disk. The computer program comprises computer readable code which, when run on an electronic device, causes the electronic device to perform the method according to the above embodiments.
The embodiment of the present application also discloses a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the speech recognition method according to the first embodiment of the present application.
Such a computer program product may be a computer-readable storage medium that may have memory segments, memory spaces, etc. arranged similarly to the memory 620 in the electronic device shown in fig. 6. The program code may be stored in a computer readable storage medium, for example, compressed in a suitable form. The computer readable storage medium is typically a portable or fixed storage unit as described with reference to fig. 7. Typically, the storage unit comprises computer readable code 630', said computer readable code 630' being code read by a processor, which when executed by the processor implements the steps of the method described above.
Reference herein to "one embodiment," "an embodiment," or "one or more embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Furthermore, it is noted that instances of the word "in one embodiment" are not necessarily all referring to the same embodiment.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering; these words may be interpreted as names.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. A speech recognition method, comprising:
coding a voice fragment acquired in real time through a coding module of a pre-trained streaming model, and outputting a first hidden layer vector representation of the voice fragment acquired in real time;
decoding the first hidden layer vector representation through a decoding module of the streaming model, and determining a real-time recognition result corresponding to the real-time acquired voice segment;
through a pre-trained non-streaming model, re-scoring the real-time recognition result based on the first hidden layer vector representation of the voice fragment acquired at each moment in the voice input to which the real-time acquired voice fragment belongs, and determining the accuracy score of the real-time recognition result corresponding to the voice input as a whole; wherein the non-streaming model shares model parameters with the streaming model.
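As a minimal illustration of the two-pass structure in claim 1 — a streaming encoder/decoder emitting results chunk by chunk, plus a non-streaming re-scorer that shares the streaming model's parameters and sees the hidden vectors of the whole utterance — consider the following PyTorch sketch. The class name `TwoPassASR`, the LSTM/attention layers, and all dimensions are assumptions introduced for clarity; the claim prescribes none of them.

```python
import torch
import torch.nn as nn

class TwoPassASR(nn.Module):
    # Sketch only: the streaming pass encodes/decodes each real-time chunk;
    # the non-streaming pass re-scores the streaming hypothesis over the whole
    # utterance, reusing (sharing) the same encoder/decoder parameters.
    def __init__(self, feat_dim=80, hidden_dim=256, vocab_size=4000):
        super().__init__()
        # Unidirectional encoder so it can run incrementally on live audio.
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)  # stand-in decoder head
        # Extra full-context attention used only by the non-streaming pass.
        self.full_attn = nn.MultiheadAttention(hidden_dim, num_heads=4,
                                               batch_first=True)

    def encode_chunk(self, chunk, state=None):
        # Streaming step 1: one chunk -> "first hidden layer vector representation".
        return self.encoder(chunk, state)

    def decode_chunk(self, hidden):
        # Streaming step 2: greedy real-time recognition result for this chunk.
        return self.decoder(hidden).argmax(dim=-1)

    def rescore(self, all_hidden, hypothesis):
        # Non-streaming pass: attend over the hidden vectors of *every* chunk
        # in the utterance, then score the streaming hypothesis as a whole.
        ctx, _ = self.full_attn(all_hidden, all_hidden, all_hidden)
        log_probs = self.decoder(ctx).log_softmax(dim=-1)
        return log_probs.gather(-1, hypothesis.unsqueeze(-1)).sum()

model = TwoPassASR()
state, hiddens, tokens = None, [], []
for chunk in torch.randn(5, 1, 10, 80).unbind(0):   # five 10-frame chunks
    h, state = model.encode_chunk(chunk, state)
    hiddens.append(h)
    tokens.append(model.decode_chunk(h))             # emitted in real time
score = model.rescore(torch.cat(hiddens, dim=1), torch.cat(tokens, dim=1))
```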
2. The method according to claim 1, wherein the step of determining the real-time recognition result corresponding to the real-time acquired voice fragment further comprises:
outputting the real-time recognition result with the highest score.
3. The method according to claim 1, wherein the step of re-scoring, through the pre-trained non-streaming model, the real-time recognition result based on the first hidden layer vector representation of the voice fragment acquired at each moment in the voice input to which the real-time acquired voice fragment belongs, and determining the accuracy score of the real-time recognition result corresponding to the voice input as a whole comprises:
executing a teacher forcing method through the pre-trained non-streaming model, taking as input the first hidden layer vector representation of the voice fragment acquired at each moment in the voice input to which the real-time acquired voice fragment belongs together with the real-time recognition result, performing forward inference to re-score the real-time recognition result, and determining the accuracy score of the real-time recognition result corresponding to the voice input as a whole.
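A minimal sketch of the teacher-forcing re-scoring in claim 3: rather than performing a search, the non-streaming pass feeds the streaming hypothesis itself as the decoder's previous-token inputs, so a single forward inference yields the utterance-level score. The `decoder` callable, its signature, and the toy tensors below are assumptions, not the patent's interface.

```python
import torch

def rescore_by_teacher_forcing(decoder, all_hidden, hypothesis, bos_id=1):
    # Teacher forcing: condition step t on the *known* hypothesis token t-1,
    # so the whole score comes from one forward inference, with no search.
    # `decoder` is assumed to map (previous tokens, encoder hidden) -> logits.
    prev = torch.cat([torch.full_like(hypothesis[:, :1], bos_id),
                      hypothesis[:, :-1]], dim=1)
    log_probs = decoder(prev, all_hidden).log_softmax(dim=-1)    # (B, T, vocab)
    token_scores = log_probs.gather(-1, hypothesis.unsqueeze(-1)).squeeze(-1)
    return token_scores.sum(dim=-1)        # one accuracy score per utterance

# Toy usage with a stand-in decoder that ignores the content of its inputs:
toy = lambda prev, hid: torch.randn(prev.shape[0], prev.shape[1], 4000)
scores = rescore_by_teacher_forcing(toy, torch.randn(2, 50, 256),
                                    torch.randint(0, 4000, (2, 7)))
```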
4. The method of claim 3, wherein the step of determining the accuracy score of the real-time recognition result corresponding to the voice input as a whole further comprises:
outputting the accuracy score of the real-time recognition result relative to the corresponding non-streaming recognition result.
5. The method according to any one of claims 1 to 4, wherein the step of decoding the first hidden layer vector representation through the decoding module of the streaming model and determining the real-time recognition result corresponding to the real-time acquired voice fragment comprises:
through a prediction module of the streaming model, performing length prediction on the first hidden layer vector representation, and predicting the number of decoded characters matching the first hidden layer vector representation;
through the decoding module of the streaming model, performing autoregressive decoding based on the first hidden layer vector representation, the context vector representation of the first hidden layer vector representation, and the number of decoded characters, and determining the real-time recognition result of each of the decoded characters corresponding to the real-time acquired voice fragment.
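The length-prediction-then-autoregressive-decoding flow of claim 5 can be sketched as follows. The `LengthAwareDecoder` class, the GRU cell, the mean-pooled stand-in for the context vector representation, and all sizes are assumptions, not the patented design.

```python
import torch
import torch.nn as nn

class LengthAwareDecoder(nn.Module):
    # Sketch of claim 5: a prediction module first estimates how many
    # characters the current hidden vectors should decode into, and the
    # decoder then runs autoregressively for exactly that many steps.
    def __init__(self, hidden_dim=256, vocab_size=4000, max_len=32):
        super().__init__()
        self.length_head = nn.Linear(hidden_dim, max_len)  # predicts #characters
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.cell = nn.GRUCell(hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, hidden):               # hidden: (B, T, H) from the encoder
        # Crude stand-in for the "context vector representation": mean pooling.
        context = hidden.mean(dim=1)         # (B, H)
        # Length prediction: most likely character count across the batch.
        n_chars = int(self.length_head(context).argmax(dim=-1).max())
        tokens, state, prev = [], context, torch.zeros_like(context)
        for _ in range(max(n_chars, 1)):
            # Autoregressive step: previous character embedding plus context
            # in, one decoded character out.
            state = self.cell(prev + context, state)
            tok = self.out(state).argmax(dim=-1)      # (B,)
            tokens.append(tok)
            prev = self.embed(tok)
        return torch.stack(tokens, dim=1)             # (B, predicted_length)

decoded = LengthAwareDecoder()(torch.randn(2, 50, 256))
```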
6. The method of any of claims 1 to 4, wherein the streaming model and the non-streaming model are trained by:
obtaining a plurality of training samples, wherein the sample data of each training sample is a voice fragment sequence and the sample label is the true value of the character sequence corresponding to the voice fragment sequence;
for each training sample, determining a first model loss value corresponding to the training sample according to a first estimated character sequence, obtained by streaming coding and decoding of the voice fragment sequence included in the training sample through the streaming model, and the true value of the character sequence corresponding to the voice fragment sequence; and determining a second model loss value corresponding to the training sample according to a second estimated character sequence, obtained by coding and decoding the voice fragment sequence included in the training sample as a whole through the non-streaming model, and the true value of the character sequence corresponding to the voice fragment sequence;
determining a joint loss value of the streaming model and the non-streaming model according to the first model loss value and the second model loss value corresponding to each training sample;
optimizing the model parameters of the streaming model with the goal of optimizing the joint loss value, and synchronously updating the model parameters shared by the non-streaming model and the streaming model, so as to jointly train the streaming model and the non-streaming model.
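The joint objective of claim 6 can be illustrated with a simple interpolated loss, L_joint = lam * L_stream + (1 - lam) * L_full. Cross-entropy and the weight `lam` are assumptions; the claim only requires that a joint loss value be formed from the first and second model loss values and optimized. Because the non-streaming model shares the streaming model's parameters, a single backward pass through this joint loss updates both models at once.

```python
import torch
import torch.nn.functional as F

def joint_loss(streaming_logits, non_streaming_logits, targets, lam=0.5):
    # First model loss: streaming (chunk-wise) predictions vs. true characters.
    l_stream = F.cross_entropy(streaming_logits.transpose(1, 2), targets)
    # Second model loss: non-streaming (whole-utterance) predictions vs. truth.
    l_full = F.cross_entropy(non_streaming_logits.transpose(1, 2), targets)
    # Joint loss value; one backward pass updates the shared parameters of both.
    return lam * l_stream + (1.0 - lam) * l_full

loss = joint_loss(torch.randn(2, 7, 4000), torch.randn(2, 7, 4000),
                  torch.randint(0, 4000, (2, 7)))
```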
7. A speech recognition apparatus, comprising:
the streaming coding module is used for coding the voice fragment acquired in real time through a coding module of a pre-trained streaming model and outputting a first hidden layer vector representation of the voice fragment acquired in real time;
the streaming decoding module is used for decoding the first hidden layer vector representation through the decoding module of the streaming model and determining a real-time recognition result corresponding to the real-time acquired voice fragment;
the recognition result re-scoring module is used for re-scoring, through a pre-trained non-streaming model, the real-time recognition result based on the first hidden layer vector representation of the voice fragment acquired at each moment in the voice input to which the real-time acquired voice fragment belongs, and determining the accuracy score of the real-time recognition result corresponding to the voice input as a whole; wherein the non-streaming model shares model parameters with the streaming model.
8. The apparatus of claim 7, wherein the streaming decoding module is further configured to:
through a prediction module of the streaming model, performing length prediction on the first hidden layer vector representation, and predicting the number of decoded characters matching the first hidden layer vector representation;
through the decoding module of the streaming model, performing autoregressive decoding based on the first hidden layer vector representation, the context vector representation of the first hidden layer vector representation, and the number of decoded characters, and determining the real-time recognition result of each of the decoded characters corresponding to the real-time acquired voice fragment.
9. An electronic device comprising a memory, a processor and program code stored on the memory and executable on the processor, wherein the processor implements the speech recognition method of any one of claims 1 to 6 when executing the program code.
10. A computer-readable storage medium on which program code is stored, wherein the program code, when executed by a processor, implements the steps of the speech recognition method of any one of claims 1 to 6.
CN202110552936.5A 2021-05-20 2021-05-20 Voice recognition method and device and electronic equipment Pending CN115376515A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110552936.5A CN115376515A (en) 2021-05-20 2021-05-20 Voice recognition method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110552936.5A CN115376515A (en) 2021-05-20 2021-05-20 Voice recognition method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN115376515A (en) 2022-11-22

Family

ID=84058484

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110552936.5A Pending CN115376515A (en) 2021-05-20 2021-05-20 Voice recognition method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN115376515A (en)

Similar Documents

Publication Publication Date Title
US10762305B2 (en) Method for generating chatting data based on artificial intelligence, computer device and computer-readable storage medium
CN109785824B (en) Training method and device of voice translation model
CN110366734B (en) Optimizing neural network architecture
CN108520741A (en) A kind of whispering voice restoration methods, device, equipment and readable storage medium storing program for executing
CN112257437B (en) Speech recognition error correction method, device, electronic equipment and storage medium
US11514925B2 (en) Using a predictive model to automatically enhance audio having various audio quality issues
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN112767917B (en) Speech recognition method, apparatus and storage medium
CN112084435A (en) Search ranking model training method and device and search ranking method and device
CN113327599B (en) Voice recognition method, device, medium and electronic equipment
CN116884391B (en) Multimode fusion audio generation method and device based on diffusion model
Kim et al. Sequential labeling for tracking dynamic dialog states
CN112802444A (en) Speech synthesis method, apparatus, device and storage medium
CN113793591A (en) Speech synthesis method and related device, electronic equipment and storage medium
CN116341651A (en) Entity recognition model training method and device, electronic equipment and storage medium
CN112951209B (en) Voice recognition method, device, equipment and computer readable storage medium
CN111477212B (en) Content identification, model training and data processing method, system and equipment
CN111651674A (en) Bidirectional searching method and device and electronic equipment
CN115376515A (en) Voice recognition method and device and electronic equipment
US11501759B1 (en) Method, system for speech recognition, electronic device and storage medium
CN112989794A (en) Model training method and device, intelligent robot and storage medium
CN116206616A (en) Speech translation and speech recognition method based on sequence dynamic compression
CN114048301B (en) Satisfaction-based user simulation method and system
CN113409792B (en) Voice recognition method and related equipment thereof
CN113408287B (en) Entity identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination