CN117558265B - Dialect streaming speech recognition method and apparatus, electronic device and storage medium - Google Patents


Info

Publication number: CN117558265B
Application number: CN202410044548.XA
Authority: CN (China)
Prior art keywords: dialect, recognition model, training, voice, loss
Other languages: Chinese (zh)
Other versions: CN117558265A (en)
Inventors: 吕召彪, 赵文博, 肖清, 许程冲
Assignee (original and current): China Unicom Guangdong Industrial Internet Co Ltd
Application filed by China Unicom Guangdong Industrial Internet Co Ltd
Priority to CN202410044548.XA
Publication of application CN117558265A; application granted; publication of grant CN117558265B
Legal status: Active (granted)


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/06: Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/005: Language recognition
    • G10L 15/02: Feature extraction for speech recognition; selection of recognition unit
    • G10L 15/16: Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a dialect streaming speech recognition method and apparatus, an electronic device and a storage medium. The method comprises the following steps: adjusting the attention mechanism and the convolution receptive field of a pre-trained speech recognition model so as to convert the pre-trained speech recognition model into a streaming model; introducing a distillation loss into the streaming pre-trained speech recognition model to realize knowledge migration from the non-streaming model to the streaming model; preprocessing and segmenting dialect speech samples corresponding to the target dialect speech, and fine-tuning the knowledge-migrated pre-trained speech recognition model on the segmented dialect speech samples to obtain a target dialect speech recognition model; and preprocessing and segmenting the target dialect speech, then inputting the segmented target dialect speech into the target dialect speech recognition model to obtain the recognition result of the target dialect speech. By converting the pre-trained speech recognition model into a streaming model and assisting the process with knowledge migration, the invention can significantly improve the recognition accuracy of the streaming model.

Description

Dialect streaming speech recognition method and apparatus, electronic device and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a dialect streaming speech recognition method and apparatus, an electronic device, and a storage medium.
Background
In practical application scenarios of streaming speech recognition, output text can be obtained from the input audio in real time and combined promptly with downstream tasks, which greatly reduces the latency of the overall system.
Currently, streaming speech recognition usually requires a large amount of labeled speech data to train a recognition model before that model can be used to recognize target speech. For Mandarin, tens of thousands of hours of speech data from various domains such as the internet, telephony and lectures are required to achieve practical performance.
For dialects, however, data is scarce, and a large number of annotators who understand the dialect are needed to label and calibrate the collected speech data before it can be used for model training. This makes obtaining a recognition model difficult and makes the streaming recognition performance of such a model hard to guarantee.
Disclosure of Invention
The invention provides a dialect streaming speech recognition method and apparatus, an electronic device and a storage medium, which address the poor recognition performance on dialect speech in the prior art and effectively improve dialect speech recognition performance.
The dialect streaming speech recognition method provided by the invention comprises the following steps:
structurally adjusting the attention mechanism and the convolution receptive field of a pre-trained speech recognition model, so as to convert the pre-trained speech recognition model into a streaming model;
introducing a distillation loss into the streaming pre-trained speech recognition model to realize knowledge migration from the non-streaming pre-trained speech recognition model to the streaming pre-trained speech recognition model;
preprocessing dialect speech samples corresponding to the target dialect speech, segmenting them by audio sampling points, and fine-tuning the knowledge-migrated pre-trained speech recognition model on the segmented dialect speech samples to obtain a target dialect speech recognition model;
and preprocessing the target dialect speech, segmenting it by audio sampling points, and inputting the segmented target dialect speech into the target dialect speech recognition model so as to obtain the recognition result of the target dialect speech.
According to the dialect streaming speech recognition method provided by the invention, the pre-trained speech recognition model comprises a Transformer encoder layer, and the Transformer encoder layer computes attention scores from the dot product of a query vector Q and a key vector K, the result of which weights a value vector V;
accordingly, adjusting the attention mechanism of the pre-trained speech recognition model comprises:
designing a mask matrix for the attention mechanism, and using the mask matrix to limit the range of the value vector V that participates in the computation after the dot product of the query vector Q and the key vector K.
According to the dialect streaming speech recognition method provided by the invention, the pre-trained speech recognition model comprises a one-dimensional convolution layer used for position encoding;
correspondingly, adjusting the convolution receptive field of the pre-trained speech recognition model comprises:
zero-padding the input of the one-dimensional convolution layer and truncating its output, so as to transform the one-dimensional convolution layer into a causal convolution layer.
According to the dialect streaming speech recognition method provided by the invention, introducing a distillation loss into the streaming pre-trained speech recognition model comprises the following steps:
adding a Connectionist Temporal Classification (CTC) module after the high-dimensional representation output by the streaming pre-trained speech recognition model, and adding the distillation loss to the output loss of the CTC module to obtain the overall loss, expressed as follows:
Loss = CTC_loss + α · KD_loss
where Loss represents the overall loss, CTC_loss represents the output loss of the CTC module, KD_loss represents the distillation loss, and α represents a weight coefficient.
According to the dialect streaming speech recognition method provided by the invention, the distillation loss is specifically a cross entropy loss, a mean square error loss or a CTC guide loss;
the cross entropy loss is expressed as follows:
L_CE = -Σ_{i=1}^{N} t_i · log(s_i)
the mean square error loss is expressed as follows:
L_MSE = (1/N) Σ_{i=1}^{N} (t_i - s_i)²
the CTC guide loss is expressed as follows:
L_G = -Σ M(X) ⊙ log P(X)
where L_CE denotes the cross entropy loss, L_MSE denotes the mean square error loss, L_G denotes the CTC guide loss, N denotes the total number of classes, t_i denotes the teacher model output for the i-th class, s_i denotes the output of the student model to be trained for the i-th class, M(X) denotes the mask matrix derived from the non-streaming pre-trained speech recognition model, and P(X) denotes the probability matrix output by the encoder of the streaming pre-trained speech recognition model.
According to the dialect streaming speech recognition method provided by the invention, fine-tuning the knowledge-migrated pre-trained speech recognition model using the segmented dialect speech samples comprises:
traversing the segmented dialect speech samples starting from the first one; inputting each traversed sample into the knowledge-migrated pre-trained speech recognition model and performing forward computation with it to obtain the overall loss; back-propagating the overall loss through the knowledge-migrated pre-trained speech recognition model to fine-tune its parameters; and, once the overall loss falls within a preset range, taking the knowledge-migrated pre-trained speech recognition model under the current parameters as the target dialect speech recognition model.
The invention also provides a dialect streaming speech recognition apparatus, comprising:
a streaming module, configured to structurally adjust the attention mechanism and the convolution receptive field of a pre-trained speech recognition model, so as to convert the pre-trained speech recognition model into a streaming model;
a knowledge migration module, configured to introduce a distillation loss into the streaming pre-trained speech recognition model, so as to realize knowledge migration from the non-streaming pre-trained speech recognition model to the streaming pre-trained speech recognition model;
a fine-tuning training module, configured to preprocess the dialect speech samples corresponding to the target dialect speech, segment them by audio sampling points, and fine-tune the knowledge-migrated pre-trained speech recognition model on the segmented dialect speech samples to obtain a target dialect speech recognition model;
and a classification recognition module, configured to preprocess the target dialect speech, segment it by audio sampling points, and input the segmented target dialect speech into the target dialect speech recognition model so as to obtain the recognition result of the target dialect speech.
The invention also provides an electronic device, comprising a memory, a processor, and a program or instructions stored in the memory and executable on the processor, wherein the processor, when executing the program or instructions, implements the steps of any of the dialect streaming speech recognition methods described above.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a program or instructions which, when executed by a computer, implement the steps of any of the dialect streaming speech recognition methods described above.
The invention also provides a computer program product, comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform any of the dialect streaming speech recognition methods described above.
According to the dialect streaming speech recognition method and apparatus, electronic device and storage medium provided by the invention, the pre-trained speech recognition model is converted into a streaming model and the conversion is assisted by knowledge migration, so that the non-streaming model can be effectively used to guide the training of the streaming model. The training effect of a large data set can thus be achieved with only a small amount of data, significantly improving the recognition accuracy of the streaming model.
Drawings
For a clearer illustration of the invention or of the technical solutions in the prior art, the drawings needed in the embodiments or in the description of the prior art are briefly introduced below. The drawings in the following description show some embodiments of the invention; other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a first schematic flowchart of the dialect streaming speech recognition method provided by the present invention;
FIG. 2 is a schematic diagram of the training process of the pre-trained speech recognition model in the dialect streaming speech recognition method provided by the present invention;
FIG. 3 is a schematic diagram of the attention mechanism mask matrix in the dialect streaming speech recognition method provided by the present invention;
FIG. 4 is a schematic diagram of the causal convolution layer in the dialect streaming speech recognition method provided by the present invention;
FIG. 5 is a second schematic flowchart of the dialect streaming speech recognition method provided by the present invention;
FIG. 6 is a schematic structural diagram of the dialect streaming speech recognition apparatus provided by the present invention;
FIG. 7 is a schematic diagram of the physical structure of the electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
To address the poor dialect speech recognition performance of the prior art, the invention converts a pre-trained speech recognition model into a streaming model and assists the process with knowledge migration, so that the non-streaming model can be effectively used to guide the training of the streaming model. The training effect of a large data set can thus be achieved with a small amount of data, significantly improving the recognition accuracy of the streaming model. The invention is described and illustrated below with reference to the drawings, in particular through a number of embodiments.
FIG. 1 is a first schematic flowchart of the dialect streaming speech recognition method provided by the present invention. As shown in FIG. 1, the method comprises:
S101, structurally adjusting the attention mechanism and the convolution receptive field of the pre-trained speech recognition model, so as to convert the pre-trained speech recognition model into a streaming model.
It can be understood that the pre-trained speech recognition model is made streaming through structural adjustment, so that it becomes suitable for streaming recognition of target dialect speech (for example, low-resource dialects such as Hakka or Teochew (Chaoshan)). The pre-trained speech recognition model may be a speech recognition model trained in advance on a large number of standard-language speech samples (such as Mandarin) according to existing model training methods.
Optionally, the pre-trained speech recognition model of the invention uses wav2vec 2.0, which performs its modeling task by predicting the speech units of the masked portions of the speech. The pre-trained speech recognition model consists mainly of a 7-layer convolutional feature extractor and Transformer-based encoder layers. The wav2vec 2.0 training process is shown in FIG. 2, a schematic diagram of the training process of the pre-trained speech recognition model in the dialect streaming speech recognition method provided by the invention. A product quantization operation is introduced during training: the quantizer selects speech units from a learned unit list as the latent audio representation vectors, while about half of the audio representations are masked. The model must identify the correct quantized speech unit at each masked position and compare it with the masked portion to compute the contrastive and diversity losses.
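By way of illustration only, the following is a minimal PyTorch sketch of the contrastive objective described above. All names (`contrastive_loss`, the tensor shapes, the temperature value) are assumptions for illustration, not the patent's or any library's actual implementation, and the diversity loss over codebook usage is omitted.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, quantized, negatives, temperature=0.1):
    """Contrastive objective over masked positions (illustrative sketch).

    context:   (T, D) Transformer outputs at the masked positions
    quantized: (T, D) true quantized speech units at those positions
    negatives: (T, K, D) K distractor quantized units per position
    """
    # Cosine similarity between each context vector and its true target
    pos = F.cosine_similarity(context, quantized, dim=-1) / temperature               # (T,)
    # Cosine similarity against each distractor, via broadcasting
    neg = F.cosine_similarity(context.unsqueeze(1), negatives, dim=-1) / temperature  # (T, K)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)                                # (T, 1+K)
    # The true quantized unit sits at index 0 of every row
    targets = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, targets)
```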
Since the pre-trained speech recognition model does not itself support streaming speech recognition, the model structure needs to be adjusted during fine-tuning to adapt it to streaming recognition. Specifically, the structure of the attention mechanism in the model is adjusted so that the computation range of attention is controlled during training and whole-sentence inputs are segmented to match the scenario at inference time; optionally, the data can be truncated or the attention mechanism can be restricted.
Optionally, the pre-trained speech recognition model comprises a Transformer encoder layer, which computes attention scores from the dot product of a query vector Q and a key vector K, the result of which weights a value vector V. Accordingly, adjusting the attention mechanism of the pre-trained speech recognition model comprises: designing a mask matrix for the attention mechanism, and using the mask matrix to limit the range of the value vector V that participates in the computation after the dot product of the query vector Q and the key vector K.
It can be understood that in the pre-trained speech recognition model, the Transformer encoder layer needs to compute dot products over the query (Q), key (K) and value (V) vectors when computing attention scores. In the actual scenario of streaming speech recognition, however, speech is input segment by segment, and when computing attention scores only the current segment and the preceding segments are available. This does not match the global attention computation used during training and causes accuracy to drop.
Therefore, to facilitate training and to match the pre-trained speech recognition model, the invention chooses to limit the attention computation. Specifically, a mask matrix for the attention mechanism can be designed to control the computation range of the Q-K-V dot products. The mask matrix used in the invention may optionally be the one shown in FIG. 3, a schematic diagram of the attention mechanism mask matrix in the dialect streaming speech recognition method provided by the invention; it limits the range of V that participates in the computation after the Q-K dot product, thereby adjusting the attention mechanism.
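As a concrete illustration of such a mask, the sketch below constructs a chunk-based attention mask in PyTorch. The function name and the `num_left_chunks` option are assumptions for illustration; the patent's actual matrix is the one defined in FIG. 3.

```python
import torch

def chunk_attention_mask(seq_len: int, chunk_size: int, num_left_chunks: int = -1):
    """Boolean mask of shape (seq_len, seq_len); True means attention is allowed.

    Each frame may attend up to the end of its own chunk, and optionally to
    only a limited number of history chunks (num_left_chunks < 0 means the
    full history is visible).
    """
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for i in range(seq_len):
        chunk_end = ((i // chunk_size) + 1) * chunk_size      # end of the current chunk
        start = 0
        if num_left_chunks >= 0:
            start = max(0, (i // chunk_size - num_left_chunks) * chunk_size)
        mask[i, start:min(chunk_end, seq_len)] = True
    return mask

# During attention, positions where the mask is False are set to -inf before
# the softmax, so V outside the allowed range never contributes to the output.
```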
The pre-trained speech recognition model computes attention per block, with the block length chunk_size equal to the length of each final speech segment fed in, so the theoretical delay of this part is one block, i.e. chunk_size. After the speech of the current block has been computed, the state is cached and used as history information in the computation of the next block, which saves computation. Alternatively, the entire history can be used when computing the current block.
In addition, since for a time series the computation at time t may only take into account t and earlier inputs, the convolution also needs to be adjusted to realize streaming operation. For this, the convolution layer of the pre-trained speech recognition model can be replaced with a causal convolution layer during training, thereby adjusting the convolution receptive field of the pre-trained speech recognition model.
Optionally, the pre-trained speech recognition model comprises a one-dimensional convolution layer used for position encoding; correspondingly, adjusting the convolution receptive field of the pre-trained speech recognition model comprises: zero-padding the input of the one-dimensional convolution layer and truncating its output, so as to transform the one-dimensional convolution layer into a causal convolution layer.
It can be understood that the pre-trained speech recognition model wav2vec 2.0 uses a one-dimensional convolution to realize position encoding. In streaming speech recognition, however, speech is computed block by block, and the resulting position encoding no longer matches the global position encoding used during training, which degrades performance.
Therefore, the invention replaces the convolution layer in the model with a causal convolution layer during training. Causal convolution is a form of convolution that computes over history only. FIG. 4 is a schematic diagram of the causal convolution layer in the dialect streaming speech recognition method provided by the invention: the convolution output is truncated after zero-padding, so that the output of the last layer depends only on the inputs of the previous layer up to the current time.
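A minimal PyTorch sketch of this pad-then-truncate construction follows. The class name and the single-group convolution are illustrative simplifications (the actual wav2vec 2.0 positional convolution is grouped and followed by an activation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """One-dimensional convolution made causal by zero-padding plus truncation."""

    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, T)
        x = F.pad(x, (self.pad, self.pad))   # complement 0 on both sides
        x = self.conv(x)                     # (B, C, T + pad)
        if self.pad == 0:
            return x
        return x[:, :, :-self.pad]           # truncate, keeping only causal outputs
```

With this construction, the output at time t depends only on inputs at times up to t, so block-wise streaming computation matches the training-time computation exactly.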
S102, introducing a distillation loss into the streaming pre-trained speech recognition model to realize knowledge migration from the non-streaming pre-trained speech recognition model to the streaming pre-trained speech recognition model.
It can be understood that the invention adopts a knowledge migration strategy with knowledge distillation as its core method when training the streaming pre-trained speech recognition model, in order to mitigate the performance gap between the non-streaming model and the streaming model. Knowledge distillation is a model compression method that typically requires a better-performing teacher model and a student model to be trained. Specifically, a distillation loss is introduced in addition to the main loss of the streaming pre-trained speech recognition model, and the overall loss including the distillation loss participates in back-propagation during training, thereby realizing knowledge migration of the pre-trained speech recognition model from its non-streaming to its streaming form and yielding a knowledge-migrated pre-trained speech recognition model.
S103, preprocessing the dialect speech samples corresponding to the target dialect speech, segmenting them by audio sampling points, and fine-tuning the knowledge-migrated pre-trained speech recognition model on the segmented dialect speech samples to obtain the target dialect speech recognition model.
It can be understood that after the pre-trained speech recognition model has been made streaming and knowledge-migrated, it is fine-tuned with dialect speech samples of the same dialect as the target dialect speech, so that the recognition model becomes suitable for effectively recognizing the target dialect speech.
Specifically, the dialect speech samples can first be preprocessed, including denoising, pre-emphasis and endpoint detection, and the preprocessed dialect speech samples are then segmented by audio sampling points to obtain the segmented dialect speech samples. The knowledge-migrated pre-trained speech recognition model is then iteratively trained with gradient updates on the segmented dialect speech samples until the training prediction error falls within a preset range, at which point training is complete and the corresponding target dialect speech recognition model is obtained.
Optionally, fine-tuning the knowledge-migrated pre-trained speech recognition model with the segmented dialect speech samples comprises: traversing the segmented dialect speech samples starting from the first one; inputting each traversed sample into the knowledge-migrated pre-trained speech recognition model and performing forward computation to obtain the overall loss; back-propagating the overall loss through the knowledge-migrated pre-trained speech recognition model to fine-tune its parameters; and, once the overall loss falls within a preset range, taking the knowledge-migrated pre-trained speech recognition model under the current parameters as the target dialect speech recognition model.
It can be understood that fine-tuning the pre-trained speech recognition model here means training the adjusted network on the dialect speech samples of the target dialect speech, starting from the already-trained parameters, so that the model parameters become applicable to the target dialect speech. After the segmented dialect speech samples are obtained, they are fed one by one into the knowledge-migrated pre-trained speech recognition model; its prediction is obtained by forward computation and compared with the annotation of the dialect speech sample to calculate the overall prediction error loss. Finally, the model parameters are adjusted backwards according to the calculated error loss, so that the error loss of the recognition model gradually decreases until it reaches the preset range; the recognition model under the model parameters at that point is taken as the target dialect speech recognition model.
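The following is a minimal sketch of this fine-tuning loop, assuming a model wrapper that returns the two loss terms for a sample. All names and the default values of `alpha`, `threshold` and `max_epochs` are assumptions for illustration.

```python
import torch

def finetune(model, segmented_samples, optimizer,
             alpha=0.5, threshold=0.05, max_epochs=50):
    """Fine-tuning loop sketch: forward computation, overall loss, back-propagation.

    `model(audio_chunks, transcript)` is assumed to return (ctc_loss, kd_loss)
    for one segmented dialect sample and its annotation; names are illustrative.
    """
    for epoch in range(max_epochs):
        total = 0.0
        for audio_chunks, transcript in segmented_samples:  # traverse from the first sample
            optimizer.zero_grad()
            ctc_loss, kd_loss = model(audio_chunks, transcript)
            loss = ctc_loss + alpha * kd_loss               # overall loss: CTC_loss + α·KD_loss
            loss.backward()                                 # back-propagate to fine-tune parameters
            optimizer.step()
            total += loss.item()
        if total / len(segmented_samples) < threshold:      # overall loss within the preset range
            break
    return model
```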
By fine-tuning the pre-trained speech recognition model on dialect corpora, converting the pre-trained model into a streaming model, and introducing knowledge migration from the non-streaming model to the streaming model, the recognition performance of the streaming model can be effectively improved.
S104, preprocessing the target dialect speech, segmenting it by audio sampling points, and inputting the segmented target dialect speech into the target dialect speech recognition model so as to obtain the recognition result of the target dialect speech.
It can be understood that once the target dialect speech recognition model has been obtained by fine-tuning the knowledge-migrated pre-trained speech recognition model on the dialect speech samples, it can be applied to the recognition of target dialect speech. Specifically, the target dialect speech is processed with the same preprocessing and segmentation as the dialect speech samples to obtain segmented target dialect speech; the segmented target dialect speech is input into the target dialect speech recognition model for forward computation; and the result output by the encoder of the target dialect speech recognition model is finally sent to a decoder for decoding to obtain the recognition result.
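A minimal sketch of this inference flow is given below. The chunk length, the `preprocess` hook and the `model.encode_chunk`/`decoder` interfaces are assumptions for illustration; any caching of encoder state between chunks is assumed to live inside the model.

```python
import torch

CHUNK_SAMPLES = 8000  # e.g. 0.5 s of audio at a 16 kHz sampling rate (illustrative)

@torch.no_grad()
def recognize(model, decoder, waveform, preprocess=lambda w: w):
    """Streaming inference sketch: preprocess, segment by sampling points, decode."""
    waveform = preprocess(waveform)                  # e.g. denoising, pre-emphasis, endpoint detection
    outputs = []
    for start in range(0, waveform.size(-1), CHUNK_SAMPLES):
        chunk = waveform[..., start:start + CHUNK_SAMPLES]
        outputs.append(model.encode_chunk(chunk))    # per-chunk encoder/CTC output
    probs = torch.cat(outputs, dim=1)                # (B, T, N) probability matrix
    return decoder(probs)                            # e.g. CTC greedy or prefix beam search
```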
According to the dialect streaming speech recognition method provided by the invention, converting the pre-trained speech recognition model into a streaming model and assisting the process with knowledge migration makes it possible to effectively use the non-streaming model to guide the training of the streaming model, achieving the training effect of a large data set with a small amount of data and thereby significantly improving the recognition accuracy of the streaming model.
According to the dialect streaming speech recognition method provided by the foregoing embodiments, optionally, introducing a distillation loss into the streaming pre-trained speech recognition model comprises: adding a Connectionist Temporal Classification (CTC) module after the high-dimensional representation output by the streaming pre-trained speech recognition model, and adding the distillation loss to the output loss of the CTC module to obtain the overall loss, expressed as follows:
Loss = CTC_loss + α · KD_loss
where Loss represents the overall loss, CTC_loss represents the output loss of the CTC module, KD_loss represents the distillation loss, and α represents a weight coefficient.
It can be understood that the invention uses the pre-trained speech recognition model as the speech feature extractor for the recognition of target dialect speech, and adds a Connectionist Temporal Classification (CTC) module on top of the high-dimensional representation output by the original model encoder, the CTC module comprising a classifier layer and a CTC computation unit. The input of the CTC module is the output of the pre-trained speech recognition model encoder, and its output is a probability matrix over the dialect modeling units with dimension B × T × N, where B represents the batch size of one computation, T represents the speech length, and N represents the number of dialect modeling units. When computing the training loss, the contrastive loss of the pre-training phase is no longer used; instead the CTC loss is computed as the main loss. With B⁻¹(y) denoting the set of all full alignment paths π that collapse to the label sequence y, CTC_loss can be expressed as CTC_loss = -ln p(y|x) = -ln Σ_{π∈B⁻¹(y)} p(π|x), and the partial derivatives of this loss with respect to the network outputs need to be calculated during training.
Thus, when performing the non-streaming-to-streaming knowledge migration of the model, the distillation loss KD_loss is introduced directly alongside the main loss CTC_loss with weight α, and the loss that finally participates in the back-propagation update is given by the expression of the overall loss above.
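A minimal PyTorch sketch of such a CTC module (classifier layer plus CTC computation unit) follows; the class name and the blank index are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CTCHead(nn.Module):
    """Classifier layer + CTC loss on top of the encoder output (sketch)."""

    def __init__(self, encoder_dim: int, num_units: int):  # num_units = N dialect modeling units
        super().__init__()
        self.proj = nn.Linear(encoder_dim, num_units)
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, enc_out, enc_lens, targets, target_lens):
        # enc_out: (B, T, D) high-dimensional representation from the encoder
        log_probs = self.proj(enc_out).log_softmax(dim=-1)   # (B, T, N)
        ctc_loss = self.ctc(log_probs.transpose(0, 1),       # nn.CTCLoss expects (T, B, N)
                            targets, enc_lens, target_lens)
        return ctc_loss, log_probs

# Overall training loss, per the expression above (alpha is the weight coefficient):
# loss = ctc_loss + alpha * kd_loss
```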
According to the dialect streaming speech recognition method provided by the foregoing embodiments, optionally, the distillation loss is specifically a cross entropy loss, a mean square error loss or a CTC guide loss;
the cross entropy loss is expressed as follows:
L_CE = -Σ_{i=1}^{N} t_i · log(s_i)
the mean square error loss is expressed as follows:
L_MSE = (1/N) Σ_{i=1}^{N} (t_i - s_i)²
the CTC guide loss is expressed as follows:
L_G = -Σ M(X) ⊙ log P(X)
where L_CE denotes the cross entropy loss, L_MSE denotes the mean square error loss, L_G denotes the CTC guide loss, N denotes the total number of classes, t_i denotes the teacher model output for the i-th class, s_i denotes the output of the student model to be trained for the i-th class, M(X) denotes the mask matrix derived from the non-streaming pre-trained speech recognition model, and P(X) denotes the probability matrix output by the encoder of the streaming pre-trained speech recognition model.
It can be understood that when knowledge migration is performed in the invention, the distillation loss KD_loss can be computed in any of the above ways.
The cross entropy loss (CE_loss) mainly measures the similarity of two distributions: the more similar the distributions, the smaller the cross entropy. In multi-class tasks, the cross entropy loss is typically written as in the cross entropy loss expression above. During knowledge distillation, the cross entropy between the frame-level values of the high-dimensional representations output by the two model encoders can be computed directly, and their average is taken as the distillation loss of the current output.
The mean square error loss (MSE_loss) is computed as the mean of the squared point-wise errors between the predicted data and the original data, as in the mean square error loss expression above. In knowledge distillation, unlike the cross entropy loss, which is computed only on the output of the last layer, the mean square error loss is applied between several intermediate encoder layers: the similarity between the hidden-layer high-dimensional representations is computed and summed as the distillation loss of the current output.
The CTC guide loss takes the CTC spike positions of the teacher model as guidance and matches them against the corresponding positions of the student model to compute similarity. Because of audio segmentation, the non-streaming and streaming models have different spike positions, so a direct frame-by-frame comparison would misalign the outputs of the two models and weaken the distillation effect. At the same time, the CTC spikes of the non-streaming model occur relatively early, so it produces output faster than the streaming model; the corresponding positions in the non-streaming model are therefore selected to generate a mask matrix that is applied to the streaming model, and the loss is computed as in the CTC guide loss expression above.
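The three variants can be sketched as follows, under the stated reading of the formulas; the function names and the spike threshold of 0.5 in the guide loss are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def kd_cross_entropy(teacher_logits, student_logits):
    """Cross entropy between the two encoders' frame-level distributions, averaged."""
    t = teacher_logits.softmax(dim=-1)         # non-streaming (teacher) distribution
    s = student_logits.log_softmax(dim=-1)     # streaming (student) log-distribution
    return -(t * s).sum(dim=-1).mean()

def kd_mse(teacher_hiddens, student_hiddens):
    """Mean square error between hidden representations of several encoder layers."""
    losses = [F.mse_loss(t, s) for t, s in zip(teacher_hiddens, student_hiddens)]
    return torch.stack(losses).mean()

def kd_ctc_guide(teacher_probs, student_log_probs, spike_threshold=0.5):
    """Mask M(X) from the teacher's CTC spike positions applied to student log P(X)."""
    mask = (teacher_probs > spike_threshold).float()   # M(X): teacher spike positions
    return -(mask * student_log_probs).sum() / mask.sum().clamp(min=1.0)
```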
In the knowledge distillation of the invention, the non-streaming model serves as the teacher model and guides the streaming model to learn its representations, which accelerates convergence while improving the performance of the streaming model, bringing it closer to the performance of the non-streaming model.
To further illustrate the solution of the invention, a more detailed description is given below with reference to FIG. 5, without limiting the claimed scope of the invention.
As shown in FIG. 5, the second schematic flowchart of the dialect streaming speech recognition method provided by the invention, the method comprises the following processing steps:
First, the dialect speech samples corresponding to the target dialect speech are preprocessed and segmented to obtain segmented dialect speech samples, which are passed through multi-layer convolution for feature extraction and dimension reduction and then fed into the pre-trained speech recognition model.
Next, by adjusting the scope of the model's attention mechanism and the receptive field of its convolution layer, the structure of the pre-trained speech recognition model is adjusted into the structure required for streaming, realizing the streaming of the model; while the non-streaming pre-trained speech recognition model is converted into a streaming model, knowledge migration from the non-streaming model to the streaming model is added.
Then, as the features extracted and dimension-reduced by the multi-layer convolution propagate through the knowledge-migrated pre-trained speech recognition model, the model is fine-tuned over multiple iterations, and the recognition model whose prediction error loss falls within the preset range is finally obtained as the target dialect speech recognition model.
Then, the target dialect speech is preprocessed and segmented in the same way as the dialect speech samples to obtain segmented target dialect speech, which is input into the target dialect speech recognition model.
Finally, the CTC output of the encoder of the target dialect speech recognition model is sent to a decoder for decoding, and the final recognition result is obtained and output.
By applying the pre-trained speech recognition model to streaming scenarios, the invention resolves the large performance gap between the streamed pre-trained model and the non-streaming pre-trained model in existing methods. Audio information unrelated to the target dialect can thus be used effectively to assist supervised training on low-resource problems such as the target dialect, achieving the training result of a large data set with a small amount of data and saving considerable labeling cost. Meanwhile, by means of knowledge migration, the non-streaming model is used to guide the training of the streaming model, which can significantly improve the recognition accuracy of the streaming model and reduce latency. The method can be extended to recognition tasks for a variety of dialects, so that streaming recognition of dialects becomes possible with only a small amount of data; it has good transferability and can lay a foundation for the popularization of intelligent applications.
Based on the same inventive concept, the invention also provides, according to the above embodiments, a dialect streaming speech recognition apparatus for realizing dialect streaming speech recognition as in the above embodiments. Therefore, the descriptions and definitions in the dialect streaming speech recognition method of the above embodiments can be used for understanding the execution modules of the invention; reference may be made to the above method embodiments, and details are not repeated here.
According to an embodiment of the invention, the structure of the dialect streaming speech recognition apparatus is shown in FIG. 6, a schematic structural diagram of the dialect streaming speech recognition apparatus provided by the invention. The apparatus can be used to realize dialect streaming speech recognition as in the above method embodiments and comprises: a streaming module 601, a knowledge migration module 602, a fine-tuning training module 603, and a classification recognition module 604. Wherein:
the streaming module 601 is configured to structurally adjust the attention mechanism and the convolution receptive field of the pre-trained speech recognition model, so as to convert the pre-trained speech recognition model into a streaming model;
the knowledge migration module 602 is configured to introduce a distillation loss into the streaming pre-trained speech recognition model, so as to realize knowledge migration from the non-streaming pre-trained speech recognition model to the streaming pre-trained speech recognition model;
the fine-tuning training module 603 is configured to preprocess the dialect speech samples corresponding to the target dialect speech, segment them by audio sampling points, and fine-tune the knowledge-migrated pre-trained speech recognition model on the segmented dialect speech samples to obtain a target dialect speech recognition model;
the classification recognition module 604 is configured to preprocess the target dialect speech, segment it by audio sampling points, and input the segmented target dialect speech into the target dialect speech recognition model so as to obtain the recognition result of the target dialect speech.
According to the dialect streaming speech recognition apparatus provided by the invention, converting the pre-trained speech recognition model into a streaming model and assisting the process with knowledge migration makes it possible to effectively use the non-streaming model to guide the training of the streaming model, achieving the training effect of a large data set with a small amount of data and thereby significantly improving the recognition accuracy of the streaming model.
Optionally, the pre-trained speech recognition model comprises a Transformer encoder layer, which computes attention scores from the dot product of a query vector Q and a key vector K, the result of which weights a value vector V;
accordingly, the streaming module, when used to adjust the attention mechanism of the pre-trained speech recognition model, is configured to:
design a mask matrix for the attention mechanism, and use the mask matrix to limit the range of the value vector V that participates in the computation after the dot product of the query vector Q and the key vector K.
Optionally, the pre-trained speech recognition model comprises a one-dimensional convolution layer used for position encoding;
accordingly, the streaming module, when used to adjust the convolution receptive field of the pre-trained speech recognition model, is configured to:
zero-pad the input of the one-dimensional convolution layer and truncate its output, so as to transform the one-dimensional convolution layer into a causal convolution layer.
Optionally, the knowledge migration module, when used to introduce a distillation loss into the streaming pre-trained speech recognition model, is configured to:
add a Connectionist Temporal Classification (CTC) module after the high-dimensional representation output by the streaming pre-trained speech recognition model, and add the distillation loss to the output loss of the CTC module to obtain the overall loss, expressed as follows:
Loss = CTC_loss + α · KD_loss
where Loss represents the overall loss, CTC_loss represents the output loss of the CTC module, KD_loss represents the distillation loss, and α represents a weight coefficient.
Optionally, the distillation loss is specifically a cross entropy loss, a mean square error loss or a CTC guide loss;
the cross entropy loss is expressed as follows:
L_CE = -Σ_{i=1}^{N} t_i · log(s_i)
the mean square error loss is expressed as follows:
L_MSE = (1/N) Σ_{i=1}^{N} (t_i - s_i)²
the CTC guide loss is expressed as follows:
L_G = -Σ M(X) ⊙ log P(X)
where L_CE denotes the cross entropy loss, L_MSE denotes the mean square error loss, L_G denotes the CTC guide loss, N denotes the total number of classes, t_i denotes the teacher model output for the i-th class, s_i denotes the output of the student model to be trained for the i-th class, M(X) denotes the mask matrix derived from the non-streaming pre-trained speech recognition model, and P(X) denotes the probability matrix output by the encoder of the streaming pre-trained speech recognition model.
Optionally, the fine-tuning training module, when used to fine-tune the knowledge-migrated pre-trained speech recognition model with the segmented dialect speech samples, is configured to:
traverse the segmented dialect speech samples starting from the first one; input each traversed sample into the knowledge-migrated pre-trained speech recognition model and perform forward computation to obtain the overall loss; back-propagate the overall loss through the knowledge-migrated pre-trained speech recognition model to fine-tune its parameters; and, once the overall loss falls within a preset range, take the knowledge-migrated pre-trained speech recognition model under the current parameters as the target dialect speech recognition model.
It can be understood that the relevant program modules in the apparatus of the above embodiments can be implemented by a hardware processor in the invention. In addition, the dialect streaming speech recognition apparatus of the invention can realize the dialect streaming speech recognition flow of each method embodiment using the above program modules; when it is used to realize dialect streaming speech recognition as in the method embodiments, the beneficial effects produced by the apparatus are the same as those of the corresponding method embodiments, for which reference may be made to the method embodiments; details are not repeated here.
As a further aspect of the invention, according to the above embodiments, an electronic device is also provided, comprising a memory, a processor, and a program or instructions stored on the memory and executable on the processor, wherein the processor, when executing the program or instructions, implements the steps of the dialect streaming speech recognition method described in the above embodiments.
Further, the electronic device of the invention may also include a communication interface and a bus. Referring to FIG. 7, a schematic diagram of the physical structure of the electronic device provided by the invention, the device comprises: at least one memory 701, at least one processor 702, a communication interface 703, and a bus 704.
The memory 701, the processor 702 and the communication interface 703 communicate with each other through the bus 704, and the communication interface 703 is used for information transmission between the electronic device and a dialect speech acquisition or storage device; the memory 701 stores a program or instructions executable on the processor 702, and when the program or instructions are executed by the processor 702, the steps of the dialect streaming speech recognition method of the above embodiments are implemented.
It should be understood that the electronic device comprises at least the memory 701, the processor 702, the communication interface 703 and the bus 704; the memory 701, the processor 702 and the communication interface 703 form communication connections with one another through the bus 704 and can communicate with one another, for example the processor 702 reads the program instructions of the dialect streaming speech recognition method from the memory 701. In addition, the communication interface 703 can also realize a communication connection between the electronic device and a dialect speech acquisition or storage device and complete information transmission between them, for example reading dialect speech through the communication interface 703.
When the electronic device is running, the processor 702 invokes the program instructions in the memory 701 to perform the methods provided by the above method embodiments, for example including: structurally adjusting the attention mechanism and the convolution receptive field of the pre-trained speech recognition model so as to convert the pre-trained speech recognition model into a streaming model; introducing a distillation loss into the streaming pre-trained speech recognition model to realize knowledge migration from the non-streaming pre-trained speech recognition model to the streaming pre-trained speech recognition model; preprocessing the dialect speech samples corresponding to the target dialect speech, segmenting them by audio sampling points, and fine-tuning the knowledge-migrated pre-trained speech recognition model on the segmented dialect speech samples to obtain a target dialect speech recognition model; and preprocessing the target dialect speech, segmenting it by audio sampling points, and inputting the segmented target dialect speech into the target dialect speech recognition model so as to obtain the recognition result of the target dialect speech.
The program instructions in the memory 701 may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as an independent product. Alternatively, all or part of the steps of the above method embodiments may be implemented by hardware related to the program instructions; the aforementioned program may be stored in a computer-readable storage medium and, when executed, performs the steps including those of the above method embodiments; the aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a program or instructions which, when executed by a computer, implement the steps of the dialect streaming speech recognition method of the above embodiments, for example including: structurally adjusting the attention mechanism and the convolution receptive field of the pre-trained speech recognition model so as to convert the pre-trained speech recognition model into a streaming model; introducing a distillation loss into the streaming pre-trained speech recognition model to realize knowledge migration from the non-streaming pre-trained speech recognition model to the streaming pre-trained speech recognition model; preprocessing the dialect speech samples corresponding to the target dialect speech, segmenting them by audio sampling points, and fine-tuning the knowledge-migrated pre-trained speech recognition model on the segmented dialect speech samples to obtain a target dialect speech recognition model; and preprocessing the target dialect speech, segmenting it by audio sampling points, and inputting the segmented target dialect speech into the target dialect speech recognition model so as to obtain the recognition result of the target dialect speech.
As a further aspect of the invention, according to the above embodiments, a computer program product is also provided, comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the dialect streaming speech recognition method provided by the above method embodiments, the method for example including: structurally adjusting the attention mechanism and the convolution receptive field of the pre-trained speech recognition model so as to convert the pre-trained speech recognition model into a streaming model; introducing a distillation loss into the streaming pre-trained speech recognition model to realize knowledge migration from the non-streaming pre-trained speech recognition model to the streaming pre-trained speech recognition model; preprocessing the dialect speech samples corresponding to the target dialect speech, segmenting them by audio sampling points, and fine-tuning the knowledge-migrated pre-trained speech recognition model on the segmented dialect speech samples to obtain a target dialect speech recognition model; and preprocessing the target dialect speech, segmenting it by audio sampling points, and inputting the segmented target dialect speech into the target dialect speech recognition model so as to obtain the recognition result of the target dialect speech.
According to the electronic device, the non-transitory computer-readable storage medium and the computer program product provided by the invention, by executing the steps of the dialect streaming speech recognition method described in the above embodiments, the pre-trained speech recognition model is converted into a streaming model and the process is assisted by knowledge migration, so that the non-streaming model can be effectively used to guide the training of the streaming model, achieving the training effect of a large data set with a small amount of data and thereby significantly improving the recognition accuracy of the streaming model.
It will be appreciated that the embodiments of the apparatus, electronic device and storage medium described above are merely illustrative, wherein the elements illustrated as separate components may or may not be physically separate, may be located in one place, or may be distributed over different network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course by means of hardware. Based on such understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as a usb disk, a mobile hard disk, a ROM, a RAM, a magnetic disk or an optical disk, etc., and includes several instructions for causing a computer device (such as a personal computer, a server, or a network device, etc.) to execute the method described in the foregoing method embodiments or some parts of the method embodiments.
In addition, it will be understood by those skilled in the art that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the description of the present invention, numerous specific details are set forth. It will be appreciated, however, that embodiments of the invention may be practiced without such specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description. Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will appreciate that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A dialect streaming speech recognition method, comprising:
structurally adjusting the attention mechanism and the convolution receptive field of a pre-trained speech recognition model, respectively, so as to stream the pre-trained speech recognition model;
introducing a distillation loss into the streaming pre-trained speech recognition model, so as to realize knowledge migration from the non-streaming pre-trained speech recognition model to the streaming pre-trained speech recognition model;
preprocessing dialect speech samples corresponding to a target dialect speech, segmenting them according to audio sampling points, and performing fine-tuning training on the knowledge-migrated pre-trained speech recognition model using the segmented dialect speech samples, to obtain a target dialect speech recognition model;
preprocessing the target dialect speech, segmenting it according to audio sampling points, and inputting the segmented target dialect speech into the target dialect speech recognition model, so as to obtain a recognition result of the target dialect speech using the target dialect speech recognition model;
wherein the pre-trained speech recognition model is wav2vec 2.0, which consists of a 7-layer convolutional feature encoder and Transformer-based encoder layers; a product quantizer introduced in wav2vec 2.0 selects speech units from a learned unit inventory as latent audio representation vectors and masks out half of the audio representations among the latent audio representation vectors; wav2vec 2.0 identifies the correct quantized speech unit at each masked position by comparison with the masked part, and the contrastive loss and the diversity loss are calculated to complete the modeling.
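As a reading aid (not part of the claims), the following is a minimal sketch of the "segmenting according to audio sampling points" step in claim 1: the raw waveform is cut into fixed-length runs of samples before being fed to the streaming model. The 16 kHz rate, the 400 ms chunk length, and the function name are illustrative assumptions.

```python
import numpy as np

def segment_by_samples(wav: np.ndarray, samples_per_chunk: int = 6400) -> list:
    """Split a 1-D waveform into chunks of `samples_per_chunk` raw samples
    (400 ms at an assumed 16 kHz); the shorter final remainder is kept."""
    return [wav[i : i + samples_per_chunk]
            for i in range(0, len(wav), samples_per_chunk)]

chunks = segment_by_samples(np.zeros(16000 * 3))  # 3 s of audio -> 8 chunks
```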
2. The dialect streaming speech recognition method of claim 1, wherein the pre-trained speech recognition model comprises a Transformer encoder layer that computes attention scores from the dot product of a query vector Q and a key vector K and uses them to weight a value vector V;
accordingly, adjusting the attention mechanism of the pre-trained speech recognition model comprises:
designing a mask matrix for the attention mechanism, and using the mask matrix to limit the range of the value vector V that participates in the computation following the dot product of the query vector Q and the key vector K.
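A minimal PyTorch sketch of the mask-matrix idea follows. A chunk-wise causal mask is one common streaming choice (the chunk size, tensor shapes, and function names are assumptions, not the patent's exact design): it blocks every query from attending to future chunks, so only the permitted rows of V contribute to the output.

```python
import torch
import torch.nn.functional as F

def chunk_causal_mask(seq_len: int, chunk_size: int) -> torch.Tensor:
    """Boolean mask, True where attention is BLOCKED: each query may attend
    only up to the end of its own chunk, never to future chunks."""
    idx = torch.arange(seq_len)
    chunk_end = (idx // chunk_size + 1) * chunk_size - 1  # last visible index per query
    return idx.unsqueeze(0) > chunk_end.unsqueeze(1)      # (seq_len, seq_len)

def masked_attention(Q, K, V, mask):
    scores = Q @ K.transpose(-2, -1) / K.shape[-1] ** 0.5  # QK^T / sqrt(d)
    scores = scores.masked_fill(mask, float("-inf"))       # hide future positions
    return F.softmax(scores, dim=-1) @ V                   # only visible V rows contribute

T, d = 8, 16
Q = K = V = torch.randn(1, T, d)
out = masked_attention(Q, K, V, chunk_causal_mask(T, chunk_size=4))  # (1, 8, 16)
```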
3. The dialect streaming speech recognition method according to claim 1 or 2, wherein the pre-trained speech recognition model comprises a one-dimensional convolution layer used for position coding;
correspondingly, adjusting the convolution receptive field of the pre-trained speech recognition model comprises:
zero-padding the one-dimensional convolution layer and then truncating its output, so as to transform the one-dimensional convolution layer into a causal convolution layer.
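A sketch of that pad-then-truncate transformation, assuming a PyTorch Conv1d (the class name, channel count, and kernel size are illustrative): zeros are complemented by the convolution's padding, and the trailing output frames that would depend on future samples are truncated.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """Zero-pad, then truncate, so no output frame depends on future samples."""
    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        self.trim = kernel_size - 1
        # the layer itself complements zeros on both sides of the input
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size - 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, channels, T)
        y = self.conv(x)                                  # length T + kernel_size - 1
        return y if self.trim == 0 else y[..., : -self.trim]  # drop future-looking frames

pos_conv = CausalConv1d(channels=768, kernel_size=128)
out = pos_conv(torch.randn(1, 768, 50))  # -> (1, 768, 50)
```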
4. The dialect streaming speech recognition method according to claim 1 or 2, wherein introducing the distillation loss into the streaming pre-trained speech recognition model comprises:
adding a connectionist temporal classification (CTC) module after the high-dimensional representation output by the streaming pre-trained speech recognition model, and adding the distillation loss to the output loss of the CTC module to obtain an overall loss, expressed as follows:

Loss = ctc_loss + α · kd_loss

where Loss represents the overall loss, ctc_loss represents the output loss of the CTC module, kd_loss represents the distillation loss, and α represents a weight coefficient.
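A minimal PyTorch sketch of that combination; the MSE-based KD term, the alpha value, and the tensor shapes are assumptions (claim 5 lists the candidate distillation losses).

```python
import torch
import torch.nn.functional as F

def overall_loss(log_probs, targets, input_lens, target_lens,
                 student_repr, teacher_repr, alpha: float = 0.5):
    """Loss = ctc_loss + alpha * kd_loss.
    log_probs: (T, B, C) log-softmax outputs of the streaming model's CTC head.
    """
    ctc = F.ctc_loss(log_probs, targets, input_lens, target_lens, blank=0)
    kd = F.mse_loss(student_repr, teacher_repr)  # one candidate KD term
    return ctc + alpha * kd
```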
5. The dialect streaming speech recognition method of claim 4, wherein the distillation loss is specifically a cross-entropy loss, a mean-square-error loss, or a CTC guide loss;
the cross-entropy loss is expressed as follows:

L_CE = -Σ_{i=1}^{N} t_i · log(s_i)

the mean-square-error loss is expressed as follows:

L_MSE = (1/N) · Σ_{i=1}^{N} (t_i - s_i)²

the CTC guide loss is expressed as follows:

L_G = -Σ ( M(X) ⊙ P(X) )

where L_CE denotes the cross-entropy loss, L_MSE denotes the mean-square-error loss, L_G denotes the CTC guide loss, N denotes the total number of classes, t_i denotes the teacher-model probability for the i-th class, s_i denotes the probability of the student model to be trained for the i-th class, M(X) denotes the mask matrix derived from the non-streaming pre-trained speech recognition model, ⊙ denotes element-wise multiplication, and P(X) denotes the probability matrix output by the encoder of the streaming pre-trained speech recognition model.
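Sketches of the three candidate losses, written to match the reconstructed formulas above. The teacher/student posteriors t and s and the 0/1 guide mask M are assumed to be per-frame probability tensors over N classes, and the guide loss follows the common "maximize the masked student probability" form; all shapes are assumptions.

```python
import torch

def kd_cross_entropy(t: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    # L_CE = -sum_i t_i * log(s_i), averaged over frames
    return -(t * torch.log(s.clamp_min(1e-8))).sum(dim=-1).mean()

def kd_mse(t: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    # L_MSE = (1/N) * sum_i (t_i - s_i)^2
    return ((t - s) ** 2).mean()

def ctc_guide(M: torch.Tensor, P: torch.Tensor) -> torch.Tensor:
    # L_G = -sum(M ⊙ P): reward student probability mass at the
    # time/label positions the non-streaming teacher marked in M
    return -(M * P).sum(dim=-1).mean()

t = torch.softmax(torch.randn(100, 32), dim=-1)  # teacher posteriors (frames, N)
s = torch.softmax(torch.randn(100, 32), dim=-1)  # student posteriors
print(kd_cross_entropy(t, s), kd_mse(t, s))
```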
6. The dialect streaming speech recognition method of claim 4, wherein performing fine-tuning training on the knowledge-migrated pre-trained speech recognition model using the segmented dialect speech samples comprises:
traversing the segmented dialect speech samples starting from the first segment; inputting each traversed segment into the knowledge-migrated pre-trained speech recognition model; performing a forward computation with the model to obtain the overall loss; and back-propagating the overall loss through the model to fine-tune its parameters, until the overall loss falls within a preset range, at which point the model under its current parameters is taken as the target dialect speech recognition model.
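A compact sketch of that loop; `model.compute_overall_loss`, the optimizer, the threshold, and the epoch cap are hypothetical stand-ins for the claim's "preset range" stopping rule.

```python
import torch

def fine_tune(model, segments, optimizer, loss_threshold: float = 0.1,
              max_epochs: int = 50):
    for epoch in range(max_epochs):
        total = 0.0
        for batch in segments:                        # traverse segmented samples
            optimizer.zero_grad()
            loss = model.compute_overall_loss(batch)  # ctc_loss + alpha * kd_loss
            loss.backward()                           # back-propagate overall loss
            optimizer.step()                          # fine-tune the parameters
            total += loss.item()
        if total / len(segments) < loss_threshold:    # loss within preset range
            break
    return model  # model under current parameters = target dialect model
```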
7. A dialect streaming speech recognition apparatus, comprising:
a streaming module, configured to structurally adjust the attention mechanism and the convolution receptive field of a pre-trained speech recognition model, respectively, so as to stream the pre-trained speech recognition model;
a knowledge migration module, configured to introduce a distillation loss into the streaming pre-trained speech recognition model, so as to realize knowledge migration from the non-streaming pre-trained speech recognition model to the streaming pre-trained speech recognition model;
a fine-tuning training module, configured to preprocess dialect speech samples corresponding to a target dialect speech, segment them according to audio sampling points, and perform fine-tuning training on the knowledge-migrated pre-trained speech recognition model using the segmented dialect speech samples, to obtain a target dialect speech recognition model;
a classification recognition module, configured to preprocess the target dialect speech, segment it according to audio sampling points, and input the segmented target dialect speech into the target dialect speech recognition model, so as to obtain a recognition result of the target dialect speech using the target dialect speech recognition model;
wherein the pre-trained speech recognition model is wav2vec 2.0, which consists of a 7-layer convolutional feature encoder and Transformer-based encoder layers; a product quantizer introduced in wav2vec 2.0 selects speech units from a learned unit inventory as latent audio representation vectors and masks out half of the audio representations among the latent audio representation vectors; wav2vec 2.0 identifies the correct quantized speech unit at each masked position by comparison with the masked part, and the contrastive loss and the diversity loss are calculated to complete the modeling.
8. An electronic device, comprising a memory, a processor, and a program or instructions stored on the memory and executable on the processor, wherein the processor, when executing the program or instructions, performs the steps of the dialect streaming speech recognition method according to any one of claims 1 to 6.
9. A non-transitory computer-readable storage medium having stored thereon a program or instructions which, when executed by a computer, implement the steps of the dialect streaming speech recognition method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, perform the dialect streaming speech recognition method according to any one of claims 1 to 6.
CN202410044548.XA 2024-01-12 2024-01-12 Dialect stream type voice recognition method and device, electronic equipment and storage medium Active CN117558265B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410044548.XA CN117558265B (en) 2024-01-12 2024-01-12 Dialect stream type voice recognition method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117558265A (en) 2024-02-13
CN117558265B (en) 2024-04-19

Family

ID=89813323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410044548.XA Active CN117558265B (en) 2024-01-12 2024-01-12 Dialect stream type voice recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117558265B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110797018A (en) * 2019-08-28 2020-02-14 腾讯科技(深圳)有限公司 Speech recognition method, speech recognition device, speech recognition medium, and speech recognition apparatus
CN112967710A (en) * 2021-03-25 2021-06-15 江西师范大学 Low-resource customer dialect point identification method
CN114255744A (en) * 2021-12-15 2022-03-29 山东新一代信息产业技术研究院有限公司 Online end-to-end automatic voice recognition method
CN115240645A (en) * 2022-07-21 2022-10-25 西安电子科技大学 Stream type voice recognition method based on attention re-scoring
CN115803806A (en) * 2020-10-02 2023-03-14 谷歌有限责任公司 Systems and methods for training dual-mode machine-learned speech recognition models

Similar Documents

Publication Publication Date Title
Zhao et al. Hearing lips: Improving lip reading by distilling speech recognizers
Kameoka et al. ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion
CN114023316A (en) TCN-Transformer-CTC-based end-to-end Chinese voice recognition method
CN114787914A (en) System and method for streaming end-to-end speech recognition with asynchronous decoder
Li et al. End-to-end speech recognition with adaptive computation steps
CN110459208B (en) Knowledge migration-based sequence-to-sequence speech recognition model training method
CN113488028B (en) Speech transcription recognition training decoding method and system based on fast jump decoding
CN111783477B (en) Voice translation method and system
CN113257248B (en) Streaming and non-streaming mixed voice recognition system and streaming voice recognition method
Mun’im et al. Sequence-level knowledge distillation for model compression of attention-based sequence-to-sequence speech recognition
Meng et al. Internal language model adaptation with text-only data for end-to-end speech recognition
CN113327595B (en) Pronunciation deviation detection method and device and storage medium
CN112509560B (en) Voice recognition self-adaption method and system based on cache language model
WO2023197977A1 (en) Speech recognition method and apparatus
CN113241075A (en) Transformer end-to-end speech recognition method based on residual Gaussian self-attention
CN113488029A (en) Non-autoregressive speech recognition training decoding method and system based on parameter sharing
CN113450761A (en) Parallel speech synthesis method and device based on variational self-encoder
Imaizumi et al. Dialect-aware modeling for end-to-end Japanese dialect speech recognition
Tanaka et al. Cross-modal transformer-based neural correction models for automatic speech recognition
WO2022083165A1 (en) Transformer-based automatic speech recognition system incorporating time-reduction layer
CN116863920B (en) Voice recognition method, device, equipment and medium based on double-flow self-supervision network
CN117877460A (en) Speech synthesis method, device, speech synthesis model training method and device
CN117558265B (en) Dialect stream type voice recognition method and device, electronic equipment and storage medium
CN116227503A (en) CTC-based non-autoregressive end-to-end speech translation method
CN116564330A (en) Weak supervision voice pre-training method, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant