CN112951213A - End-to-end online voice detection and recognition method, system and equipment


Info

Publication number
CN112951213A
Authority
CN
China
Prior art keywords
voice
recognition
data
model
detection
Prior art date
Legal status
Granted
Application number
CN202110175961.6A
Other languages
Chinese (zh)
Other versions
CN112951213B (en)
Inventor
周世玉 (Zhou Shiyu)
徐波 (Xu Bo)
李蒙 (Li Meng)
Current Assignee
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date
2021-02-09
Filing date
2021-02-09
Publication date
2021-06-11
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202110175961.6A
Publication of CN112951213A
Application granted
Publication of CN112951213B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/08: Speech classification or search
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L15/26: Speech to text systems
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/87: Detection of discrete points within a voice signal


Abstract

The invention belongs to the field of voice detection and recognition, and particularly relates to an end-to-end online voice detection and recognition method, system and device, aiming at solving the problems that existing online voice recognition technology must train and deploy multiple models, computes inefficiently, requires complex deployment and parameter-tuning procedures, and depends heavily on labeled sample data. The invention comprises the following steps: obtaining a pre-trained wav2vec2.0 model through self-supervised training on unlabeled voice data; fine-tuning the model in two stages, training it on multitask voice data to obtain a multitask model for voice detection and recognition; and, for online audio data, performing blocking and edge splicing, then obtaining real-time voice recognition text through online recognition and edge elimination by the multitask model. The method depends little on labeled data, has few model parameters and a simple structure, reduces computation through joint modeling, can be used in low-resource scenarios with high real-time requirements, and recognizes accurately with high precision.

Description

End-to-end online voice detection and recognition method, system and equipment
Technical Field
The invention belongs to the field of voice detection and recognition, and particularly relates to an end-to-end online voice detection and recognition method, system and device.
Background
With the spread of smart devices and other innovative applications, speech recognition has become an important entry point for human-computer interaction and is now widely applied in scenarios such as voice input, voice search, speech translation, and smart homes. Some of these scenarios, such as voice control and meeting minutes, place high demands on real-time online recognition.
At present, the mainstream online speech recognition approach trains and deploys both a speech detection model and a speech recognition model: the detection model first locates the speech portions of the audio signal, and the recognition model then performs online recognition on those portions. The disadvantages are: the two models must be trained, deployed, and cascaded, so the accuracy of the speech detection model strongly affects the performance of the downstream speech recognition model, and in practical applications the detection model's parameters must be tuned dynamically for each scenario to keep both models performing at their best. In addition, prior-art model training requires a large amount of labeled sample data, depends heavily on those labels, and leaves the robustness and generalization of the models in need of improvement.
Disclosure of Invention
In order to solve the above problems in the prior art, the invention provides an end-to-end online voice detection and recognition method. It integrates the voice detection model and the voice recognition model into a single model, simplifying training and deployment. Multi-task learning jointly optimizes the voice detection module and the voice recognition module inside this unified model, using the complementarity between the tasks to improve the performance of each and to strengthen the robustness of the model. The method makes full use of massive unlabeled audio data: self-supervised learning trains the pre-trained model, which both reduces the supervised stage's dependence on labeled data and increases its robustness. The method therefore suits building online voice recognition systems for low-resource languages, where a system can be set up quickly with only a small amount of labeled data. The end-to-end online voice detection and recognition method comprises the following steps:
step A10, based on the acquired unlabeled voice data, performing self-supervised training with a wav2vec2.0 model to obtain a pre-trained wav2vec2.0 model;
step A20, adding a fully-connected layer on top of the pre-trained wav2vec2.0 model and, based on the acquired labeled voice data, performing supervised training with CTC loss to obtain a voice recognition single-task model;
step A30, taking the acquired multitask voice data as multitask training data and training the voice detection and recognition multitask model by fine-tuning the voice recognition single-task model, obtaining the trained voice detection and recognition multitask model;
step A40, acquiring audio data online in real time and, according to the preset voice detection block size, performing real-time speech/non-speech audio analysis through the voice detection module of the trained voice detection and recognition multitask model;
step A50, performing state judgment based on the audio analysis result, extracting valid voice data based on the state judgment result, edge-splicing the valid voice data, and performing online recognition and edge elimination through the voice recognition module of the trained voice detection and recognition multitask model to obtain real-time voice recognition text.
In some preferred embodiments, the trained multi-task model for speech detection and recognition is trained by:
step B10, for each group of data in the multitask training data, respectively obtaining the predicted output of the corresponding task of each group of data through the voice detection and recognition multitask model; the multitask training data comprises training data of voice detection and voice recognition which are respectively obtained;
step B20, respectively calculating loss values between the predicted output and the true value of the corresponding tasks of each group of data, and obtaining the loss value weight corresponding to each task through model self-learning;
step B30, weighting each loss value by the loss value weight corresponding to each task to obtain the multitask loss value of the multitask model for voice detection and recognition;
step B40, updating the parameters of the voice detection and recognition multitask model along the gradient descent direction of the multitask loss value, and returning to step B10 for iterative training until the set end-of-training condition is reached, obtaining the voice detection and recognition multitask model.
In some preferred embodiments, the multitask training data during training and the audio data during online application are both preprocessed by edge splicing before entering the voice detection and recognition multitask model, as follows:
each piece of the multitask training data is blocked at a random size and/or each piece of the online application audio data is blocked at a preset fixed size, and set audio of a set length is spliced onto the edges of each audio block, yielding the preprocessed multitask training data and/or the preprocessed online application audio data.
In some preferred embodiments, after passing through the voice detection and recognition multitask model, the preprocessed multitask training data and/or the preprocessed online application audio data are post-processed by eliminating the spliced results, as follows:
the preprocessed multitask training data and/or the preprocessed online application audio data are input into the voice detection and recognition multitask model, and the logits output of the spliced audio is removed from the model output, leaving the logits output of the multitask training data and/or the online application audio data.
In some preferred embodiments, a model distillation step is further provided between step A30 and step A40, as follows:
step A40a, using the voice detection and recognition multitask model as the teacher network, and constructing a voice detection and recognition multitask model with fewer parameters and network layers than the teacher network as the student network; the teacher network has M layers and the student network has N layers, with M greater than N;
step A40b, obtaining the distilled student voice detection and recognition multitask model by letting the m-th layer of the student network learn the output distribution of the n-th layer of the teacher network; wherein 1 ≤ n ≤ M, 1 ≤ m ≤ N, m = n/k, and k is a positive integer representing the layer-count relationship between the teacher network and the student network.
In some preferred embodiments, when the m-th layer of the student network learns the output distribution of the n-th layer of the teacher network in step A40b, the losses of the learning process include an attention matrix loss, a hidden layer output loss and a final output layer loss;
the attention matrix loss is the MSE loss with which each attention head of the student network learns the attention score matrix of the corresponding attention head of the teacher network;
the hidden layer output loss is the MSE loss with which each Transformer hidden layer output of the student network learns the corresponding hidden layer output of the teacher network;
the final output layer loss is the cross-entropy loss with which the final output of the student network learns that of the teacher network.
In another aspect of the present invention, an end-to-end online voice detection and recognition system is provided, which includes:
the voice data acquisition module is used for acquiring and inputting a user voice data stream as the data to be detected and recognized;
the data caching unit comprises a voice detection caching module and a voice recognition caching module and is used for caching data to be detected and recognized on line;
the voice preprocessing unit is used for carrying out voice format normalization, denoising and dereverberation preprocessing operations on the data in the voice detection cache module;
the voice detection and recognition unit comprises a voice activation detection module, a state judgment module and a voice recognition module; the voice activation detection module calculates the voice detection classification probability of each frame of preprocessed data in the voice detection cache module; the state judgment module judges the state of each frame according to that probability: if the state is a voice start point, data begin to be fed into the voice recognition module; if the voice recognition block boundary state is activated, voice recognition is performed and the corresponding voice recognition cache is cleared; and if the state is a voice end point, the corresponding voice recognition cache is cleared and the real-time voice recognition text is output;
and the recognition result display module is used for displaying the real-time voice recognition text.
In some preferred embodiments, the state determination module has a state determination method that:
step C10, setting a threshold T on the voice detection classification probability: if the voice detection classification probability of the current frame is greater than T, the current frame is speech; otherwise the current frame is silence;
step C20, setting a voice activation threshold B and a voice ending threshold E: if the length of a run of consecutive speech frames is greater than B, the state is speech; if the length of a run of consecutive silence frames is greater than E, the state is non-speech;
step C30, if the state changes from non-speech to speech, the frame where the state becomes speech is the voice start point; if the state changes from speech to non-speech, the frame where the state was speech is the voice end point; when the accumulated data length in the voice recognition block reaches L, the voice recognition block boundary state is activated; wherein L is the set voice recognition boundary activation length.
In a third aspect of the present invention, an electronic device is provided, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the processor for execution by the processor to implement the end-to-end online voice detection and recognition method described above.
In a fourth aspect of the present invention, a computer-readable storage medium is provided, which stores computer instructions for being executed by the computer to implement the end-to-end online voice detection and recognition method described above.
The invention has the beneficial effects that:
(1) The end-to-end online voice detection and recognition method of the invention fuses the voice detection and voice recognition tasks, needs no additional voice detection model, fully utilizes the complementarity of the two tasks, improves the performance and robustness of each through unified modeling, and simplifies system deployment.
(2) The end-to-end online voice detection and recognition method uses the pre-trained wav2vec2.0 model for the voice detection task and achieves a strong detection effect, improving voice detection performance.
(3) The end-to-end online voice detection and recognition method caches data in cache blocks before detection and recognition; the cache block size can be adjusted flexibly, giving stable, high-performance recognition.
(4) The end-to-end online voice detection and recognition method uses the pre-trained wav2vec2.0 model for the voice recognition task, greatly reducing dependence on the amount of training data, so it can be applied quickly to low-resource voice recognition tasks.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a schematic flow chart of an end-to-end online voice detection and recognition method of the present invention;
FIG. 2 is a schematic diagram of the voice detection and recognition multitask joint modeling of the end-to-end online voice detection and recognition method of the present invention;
FIG. 3 is a schematic diagram of data block splicing according to an embodiment of the end-to-end online voice detection and recognition method of the present invention;
FIG. 4 is a schematic diagram of model distillation for one embodiment of the end-to-end online speech detection and recognition method of the present invention;
FIG. 5 is a system framework diagram of one embodiment of the end-to-end online speech detection and recognition method of the present invention;
FIG. 6 is a schematic diagram of online voice detection and decoding according to an embodiment of the end-to-end online voice detection and recognition method of the present invention.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
The invention provides an end-to-end online voice detection and recognition method based on joint end-to-end modeling of voice detection and recognition on top of a pre-trained model. It unifies voice recognition and voice detection in one model, converts continuous voice input into text output, and simplifies the deployment of online voice recognition. In addition, pre-training on a large amount of unlabeled data reduces dependence on the number of labeled samples, strengthens the robustness of the model, greatly reduces the voice detection task's dependence on model structure and parameters, and improves computational efficiency. In the modeling, the voice recognition task is trained and predicted with blocking and edge splicing, which supports different latency requirements in use and improves low-latency recognition performance.
The model of the invention is fine-tuned in two stages from a pre-trained model whose structure is wav2vec2.0: the first-stage fine-tuning trains a voice recognition single-task model, and the second-stage fine-tuning trains the voice detection and recognition multitask model.
The end-to-end online voice detection and recognition method disclosed by the invention comprises the following steps:
step A10, based on the acquired unlabeled voice data, performing self-supervised training with a wav2vec2.0 model to obtain a pre-trained wav2vec2.0 model;
step A20, adding a fully-connected layer on top of the pre-trained wav2vec2.0 model and, based on the acquired labeled voice data, performing supervised training with CTC loss to obtain a voice recognition single-task model;
step A30, taking the acquired multitask voice data as multitask training data and training the voice detection and recognition multitask model by fine-tuning the voice recognition single-task model, obtaining the trained voice detection and recognition multitask model;
step A40, acquiring audio data online in real time and, according to the preset voice detection block size, performing real-time speech/non-speech audio analysis through the voice detection module of the trained voice detection and recognition multitask model;
step A50, performing state judgment based on the audio analysis result, extracting valid voice data based on the state judgment result, edge-splicing the valid voice data, and performing online recognition and edge elimination through the voice recognition module of the trained voice detection and recognition multitask model to obtain real-time voice recognition text.
In order to more clearly describe the end-to-end online voice detection and recognition method of the present invention, the following describes the steps in the embodiment of the present invention in detail with reference to fig. 1 and 2.
The end-to-end online voice detection and recognition method of the first embodiment of the invention comprises the steps A10-A50, and the steps are described in detail as follows:
and step A10, performing self-supervision training by using a wav2vec2.0 model based on the obtained unlabeled voice data to obtain a pre-training wav2vec2.0 model.
The data selected for wav2vec2.0 self-supervised pre-training are massive unlabeled voice data: the wav2vec2.0 model is trained on large-scale unlabeled data with a contrastive loss, and the resulting feature representations can replace traditional acoustic features, so the model depends little on labeled sample data, computes efficiently, and runs in real time.
Step A20, adding a fully-connected layer on top of the pre-trained wav2vec2.0 model and, based on the acquired labeled voice data, performing supervised training with CTC loss to obtain the voice recognition single-task model.
Fine-tuning from the pre-trained model saves a large amount of computing resources and time, improves computational efficiency, and can even improve accuracy.
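A minimal sketch of this stage-one fine-tuning, using the Hugging Face transformers implementation of wav2vec 2.0; the checkpoint name, vocabulary size, and tensor shapes are illustrative assumptions rather than values from the patent:

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class SpeechRecognitionSingleTask(nn.Module):
    """Pre-trained wav2vec2.0 encoder plus one fully-connected layer (step A20)."""
    def __init__(self, pretrained="facebook/wav2vec2-base", vocab_size=4000):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(pretrained)
        self.head = nn.Linear(self.encoder.config.hidden_size, vocab_size + 1)  # +1: CTC blank

    def forward(self, waveform):                     # waveform: (batch, samples)
        frames = self.encoder(waveform).last_hidden_state
        return self.head(frames)                     # per-frame logits: (batch, frames, vocab+1)

model = SpeechRecognitionSingleTask()
ctc = nn.CTCLoss(blank=model.head.out_features - 1, zero_infinity=True)

waveform = torch.randn(2, 16000)                     # two one-second dummy utterances
log_probs = model(waveform).log_softmax(-1).transpose(0, 1)   # (frames, batch, vocab+1)
targets = torch.randint(0, 4000, (2, 12))            # dummy label sequences
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), log_probs.size(0)),
           target_lengths=torch.full((2,), 12))
loss.backward()                                      # one supervised CTC fine-tuning step
```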
Step A30, taking the acquired multitask voice data as multitask training data and training the voice detection and recognition multitask model by fine-tuning the voice recognition single-task model, obtaining the trained voice detection and recognition multitask model.
The trained voice detection and recognition multitask model is obtained by the following training method:
step B10, for each group of data in the multitask training data, respectively obtaining the predicted output of the corresponding task of each group of data through the voice detection and recognition multitask model; the multitask training data comprises training data of voice detection and voice recognition which are respectively obtained;
step B20, respectively calculating loss values between the predicted output and the true value of the corresponding tasks of each group of data, and obtaining the loss value weight corresponding to each task through model self-learning;
the process of obtaining the weight of the loss value corresponding to each task through model self-learning may refer to a task uncertainty loss learning technology proposed by Kendall, Alex and the like, which is not described in detail herein.
Step B30, weighting each loss value by the loss value weight corresponding to each task to obtain the multitask loss value of the multitask model for voice detection and recognition;
step B40, updating the parameters of the voice detection and recognition multitask model along the gradient descent direction of the multitask loss value, and returning to step B10 for iterative training until the set end-of-training condition is reached, obtaining the voice detection and recognition multitask model.
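A sketch of the self-learned loss weighting of steps B20-B30, following the task-uncertainty formulation of Kendall et al. cited above; the weights are learned as log-variances, and the stand-in loss values are placeholders:

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Learns one log-variance s_i per task; total = sum_i exp(-s_i) * L_i + s_i."""
    def __init__(self, num_tasks=2):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        total = 0.0
        for i, task_loss in enumerate(task_losses):
            total = total + torch.exp(-self.log_vars[i]) * task_loss + self.log_vars[i]
        return total

weighting = UncertaintyWeightedLoss(num_tasks=2)
detection_loss = torch.tensor(0.7, requires_grad=True)    # placeholder task losses
recognition_loss = torch.tensor(2.3, requires_grad=True)
multitask_loss = weighting([detection_loss, recognition_loss])
multitask_loss.backward()                                 # step B40: descend the combined gradient
```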
To support better streaming detection and recognition, a sliding-block edge-splicing method processes the multitask voice data used for model training during both training and prediction. Fig. 3 shows the data block splicing of one embodiment of the end-to-end online voice detection and recognition method: each section of audio in the acquired multitask voice data is blocked at a random size, and set audio of a set length is spliced onto the edges of each block, yielding the training data of the voice detection and recognition multitask model.
For the voice detection and recognition multitask model, the multitask training data during training and the audio data during online application are both preprocessed by edge splicing, as follows:
each piece of the multitask training data is blocked at a random size and/or each piece of the online application audio data is blocked at a preset fixed size, and set audio of a set length is spliced onto the edges of each audio block, yielding the preprocessed multitask training data and/or the preprocessed online application audio data.
After passing through the voice detection and recognition multitask model, the preprocessed multitask training data and/or the preprocessed online application audio data are post-processed by eliminating the spliced results, as follows:
the preprocessed multitask training data and/or the preprocessed online application audio data are input into the voice detection and recognition multitask model, and the logits output of the spliced audio is removed from the model output, leaving the logits output of the multitask training data and/or the online application audio data.
The audio data are blocked randomly in the training stage, while the block size is set independently in the prediction stage; set audio of a certain length is spliced onto the edges of each data block, the block at the beginning of a sample being spliced only on its right edge and the block at the end of the sample only on its left edge;
the edge-spliced data blocks are input into the model, where the outputs corresponding to the edge parts only provide additional context and do not participate in prediction;
the edge parts of each block's prediction result are eliminated, and the results are then merged.
Fig. 4 shows the model distillation of one embodiment of the end-to-end online voice detection and recognition method. A model distillation step is further provided between step A30 and step A40: Transformer network-layer distillation is performed on the wav2vec2.0 structure so that the student network learns the self-attention matrices and hidden layer outputs of the teacher network, and prediction-layer distillation makes the student network's prediction layer learn the output probability distribution of the teacher network's prediction layer, specifically:
step A40a, using the voice detection and recognition multitask model as the teacher network, and constructing a voice detection and recognition multitask model with fewer parameters and network layers than the teacher network as the student network; the teacher network has M layers and the student network has N layers, with M greater than N;
step A40b, obtaining the distilled student voice detection and recognition multitask model by letting the m-th layer of the student network learn the output distribution of the n-th layer of the teacher network; wherein 1 ≤ n ≤ M, 1 ≤ m ≤ N, m = n/k, and k is a positive integer representing the layer-count relationship between the teacher network and the student network. In one embodiment of the invention, k is 4. Here the layer-count relationship between teacher and student in distillation learning uses a multiple mapping; in other application scenarios other mappings may be chosen, for example an exponential mapping or another pre-specified mapping, which the invention does not detail here.
Distillation across every k layers: the m-th layer of the student network learns the n-th layer of the teacher network, and the losses of this learning process include the attention matrix loss and the hidden layer output loss:
attention matrix loss: each attention head of the student network learns, via MSE loss, the attention score matrix of the corresponding attention head of the teacher network;
hidden layer output loss: each Transformer hidden layer output of the student network learns, via MSE loss, the corresponding hidden layer output of the teacher network;
prediction-layer distillation: the final prediction layer of the student network learns the output probability distribution of the teacher network's prediction layer, with cross-entropy loss.
Step A40, acquiring audio data online in real time and, according to the preset voice detection block size, performing real-time speech/non-speech audio analysis through the voice detection module of the trained voice detection and recognition multitask model.
Step A50, performing state judgment based on the audio analysis result, extracting valid voice data based on the state judgment result, edge-splicing the valid voice data, and performing online recognition and edge elimination through the voice recognition module of the trained voice detection and recognition multitask model to obtain the real-time voice recognition text.
Although the foregoing embodiments describe the steps in the above sequential order, those skilled in the art will understand that, in order to achieve the effect of the present embodiments, the steps may not be executed in such an order, and may be executed simultaneously (in parallel) or in an inverse order, and these simple variations are within the scope of the present invention.
As shown in fig. 5, an end-to-end online voice detection and recognition system according to a second embodiment of the present invention includes:
the voice data acquisition module is used for acquiring and inputting a user voice data stream as the data to be detected and recognized;
the data caching unit comprises a voice detection caching module and a voice recognition caching module and is used for caching data to be detected and recognized on line;
the voice preprocessing unit is used for carrying out voice format normalization, denoising and dereverberation preprocessing operations on the data in the voice detection cache module;
the voice detection and recognition unit comprises a voice activation detection module, a state judgment module and a voice recognition module; the voice activation detection module calculates the voice detection classification probability of each frame of preprocessed data in the voice detection cache module; the state judgment module judges the state of each frame according to that probability: if the state is a voice start point, data begin to be fed into the voice recognition module; if the voice recognition block boundary state is activated, voice recognition is performed and the corresponding voice recognition cache is cleared; and if the state is a voice end point, the corresponding voice recognition cache is cleared and the real-time voice recognition text is output;
and the recognition result display module is used for displaying the real-time voice recognition text.
As shown in fig. 6, the online voice detection and decoding of one embodiment of the end-to-end online voice detection and recognition method of the present invention proceeds as follows:
step one, voice signal data are input into the voice detection block for voice activation detection; the network structure used is the bottom CNN structure of wav2vec2.0;
step two, the detection results in the voice detection block are input into a state transition model, which judges the state of each frame (voice start point / voice recognition block boundary / voice end point); the detection result is the voice detection classification probability of each frame;
step three, operation proceeds according to the state obtained in step two. If the state is a voice start point, data begin to be placed into the voice recognition block; if the state is a voice recognition block boundary, the voice in the voice recognition block is recognized, the block is cleared, and data continue to be stored into it; if the state is a voice end point, the data in the current voice recognition block are recognized and cleared, and the recognition result of the complete utterance is recorded. The modeling units of the voice recognition are Chinese characters/letters, and the decoding method is beam search. The external language model may be an N-gram statistical language model or a neural-network language model.
The state judgment method of the state judgment module is as follows:
step C10, setting a threshold T on the voice detection classification probability: if the voice detection classification probability of the current frame is greater than T, the current frame is speech; otherwise the current frame is silence;
step C20, setting a voice activation threshold B and a voice ending threshold E: if the length of a run of consecutive speech frames is greater than B, the state is speech; if the length of a run of consecutive silence frames is greater than E, the state is non-speech;
step C30, if the state changes from non-speech to speech, the frame where the state becomes speech is the voice start point; if the state changes from speech to non-speech, the frame where the state was speech is the voice end point; when the accumulated data length in the voice recognition block reaches L, the voice recognition block boundary state is activated; wherein L is the set voice recognition boundary activation length.
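A sketch of this frame-level state machine; the thresholds T, B, E and the activation length L are deployment parameters, and the values below are placeholders only:

```python
T, B, E, L = 0.5, 5, 20, 480   # probability threshold, start/end run lengths, block length

class SpeechStateMachine:
    """Frame-level state judgment of steps C10-C30."""
    def __init__(self):
        self.in_speech = False
        self.speech_run = 0    # consecutive speech frames (step C20)
        self.silence_run = 0   # consecutive silence frames (step C20)
        self.block_len = 0     # frames accumulated in the recognition block

    def step(self, p_speech):
        """Returns 'start', 'boundary', 'end', or None for one frame."""
        is_speech = p_speech > T                        # step C10
        self.speech_run = self.speech_run + 1 if is_speech else 0
        self.silence_run = 0 if is_speech else self.silence_run + 1
        if not self.in_speech and self.speech_run > B:  # non-speech -> speech
            self.in_speech = True
            self.block_len = self.speech_run
            return "start"                              # voice start point
        if self.in_speech:
            self.block_len += 1
            if self.silence_run > E:                    # speech -> non-speech
                self.in_speech = False
                self.block_len = 0
                return "end"                            # recognize, clear, output text
            if self.block_len >= L:                     # accumulated length reaches L
                self.block_len = 0
                return "boundary"                       # recognize and clear the block
        return None
```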
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiments, and will not be described herein again.
It should be noted that, the end-to-end online voice detection and recognition system provided in the foregoing embodiment is only illustrated by the division of the foregoing functional modules, and in practical applications, the above functions may be allocated to different functional modules according to needs, that is, the modules or steps in the embodiment of the present invention are further decomposed or combined, for example, the modules in the foregoing embodiment may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the above described functions. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.
An electronic apparatus according to a third embodiment of the present invention includes:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the processor for execution by the processor to implement the end-to-end online voice detection and recognition method described above.
A computer-readable storage medium of a fourth embodiment of the present invention stores computer instructions for being executed by the computer to implement the end-to-end online voice detection and recognition method described above.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Those of skill in the art would appreciate that the various illustrative modules, method steps, and modules described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that programs corresponding to the software modules, method steps may be located in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (10)

1. An end-to-end online voice detection and recognition method, characterized in that the online voice detection and recognition method comprises:
step A10, based on the acquired unlabeled voice data, performing self-supervised training with a wav2vec2.0 model to obtain a pre-trained wav2vec2.0 model;
step A20, adding a fully-connected layer on top of the pre-trained wav2vec2.0 model and, based on the acquired labeled voice data, performing supervised training with CTC loss to obtain a voice recognition single-task model;
step A30, taking the acquired multitask voice data as multitask training data and training the voice detection and recognition multitask model by fine-tuning the voice recognition single-task model, obtaining the trained voice detection and recognition multitask model;
step A40, acquiring audio data online in real time and, according to the preset voice detection block size, performing real-time speech/non-speech audio analysis through the voice detection module of the trained voice detection and recognition multitask model;
step A50, performing state judgment based on the audio analysis result, extracting valid voice data based on the state judgment result, edge-splicing the valid voice data, and performing online recognition and edge elimination through the voice recognition module of the trained voice detection and recognition multitask model to obtain real-time voice recognition text.
2. The method for end-to-end online voice detection and recognition according to claim 1, wherein the trained multitask model for voice detection and recognition is trained by:
step B10, for each group of data in the multitask training data, respectively obtaining the predicted output of the corresponding task of each group of data through the voice detection and recognition multitask model; the multitask training data comprises training data of voice detection and voice recognition which are respectively obtained;
step B20, respectively calculating loss values between the predicted output and the true value of the corresponding tasks of each group of data, and obtaining the loss value weight corresponding to each task through model self-learning;
step B30, weighting each loss value by the loss value weight corresponding to each task to obtain the multitask loss value of the multitask model for voice detection and recognition;
step B40, updating the parameters of the voice detection and recognition multitask model along the gradient descent direction of the multitask loss value, and returning to step B10 for iterative training until the set end-of-training condition is reached, obtaining the voice detection and recognition multitask model.
3. The end-to-end online voice detection and recognition method according to claim 1 or 2, wherein the multitask training data during training and the audio data during online application are both preprocessed by edge splicing before entering the voice detection and recognition multitask model, as follows:
each piece of the multitask training data is blocked at a random size and/or each piece of the online application audio data is blocked at a preset fixed size, and set audio of a set length is spliced onto the edges of each audio block, yielding the preprocessed multitask training data and/or the preprocessed online application audio data.
4. The end-to-end online voice detection and recognition method of claim 3, wherein, after passing through the voice detection and recognition multitask model, the preprocessed multitask training data and/or the preprocessed online application audio data are post-processed by eliminating the spliced results, as follows:
the preprocessed multitask training data and/or the preprocessed online application audio data are input into the voice detection and recognition multitask model, and the logits output of the spliced audio is removed from the model output, leaving the logits output of the multitask training data and/or the online application audio data.
5. The end-to-end online voice detection and recognition method of claim 1, wherein a model distillation step is further provided between step A30 and step A40, the method comprising:
step A40a, using the voice detection and recognition multitask model as the teacher network, and constructing a voice detection and recognition multitask model with fewer parameters and network layers than the teacher network as the student network; the teacher network has M layers and the student network has N layers, with M greater than N;
step A40b, obtaining the distilled student voice detection and recognition multitask model by letting the m-th layer of the student network learn the output distribution of the n-th layer of the teacher network; wherein 1 ≤ n ≤ M, 1 ≤ m ≤ N, m = n/k, and k is a positive integer representing the layer-count relationship between the teacher network and the student network.
6. The end-to-end online voice detection and recognition method according to claim 5, wherein, when the m-th layer of the student network learns the output distribution of the n-th layer of the teacher network in step A40b, the losses of the learning process include an attention matrix loss, a hidden layer output loss and a final output layer loss;
the attention matrix loss is the MSE loss with which each attention head of the student network learns the attention score matrix of the corresponding attention head of the teacher network;
the hidden layer output loss is the MSE loss with which each Transformer hidden layer output of the student network learns the corresponding hidden layer output of the teacher network;
the final output layer loss is the cross-entropy loss with which the final output of the student network learns that of the teacher network.
7. An end-to-end online voice detection and recognition system, comprising:
the voice data acquisition module is used for acquiring and inputting a user voice data stream as the data to be detected and recognized;
the data caching unit comprises a voice detection caching module and a voice recognition caching module and is used for caching data to be detected and recognized on line;
the voice preprocessing unit is used for carrying out voice format normalization, denoising and dereverberation preprocessing operations on the data in the voice detection cache module;
the voice detection and recognition unit comprises a voice activation detection module, a state judgment module and a voice recognition module; the voice activation detection module calculates the voice detection classification probability of each frame of preprocessed data in the voice detection cache module; the state judgment module judges the state of each frame according to that probability: if the state is a voice start point, data begin to be fed into the voice recognition module; if the voice recognition block boundary state is activated, voice recognition is performed and the corresponding voice recognition cache is cleared; and if the state is a voice end point, the corresponding voice recognition cache is cleared and the real-time voice recognition text is output;
and the recognition result display module is used for displaying the real-time voice recognition text.
8. The system according to claim 7, wherein the status determining module determines the status by:
step C10, setting a threshold T on the voice detection classification probability: if the voice detection classification probability of the current frame is greater than T, the current frame is speech; otherwise the current frame is silence;
step C20, setting a voice activation threshold B and a voice ending threshold E: if the length of a run of consecutive speech frames is greater than B, the state is speech; if the length of a run of consecutive silence frames is greater than E, the state is non-speech;
step C30, if the state changes from non-speech to speech, the frame where the state becomes speech is the voice start point; if the state changes from speech to non-speech, the frame where the state was speech is the voice end point; when the accumulated data length in the voice recognition block reaches L, the voice recognition block boundary state is activated; wherein L is the set voice recognition boundary activation length.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the processor for execution by the processor to implement the end-to-end online voice detection and recognition method of any of claims 1-6.
10. A computer-readable storage medium storing computer instructions for execution by the computer to implement the end-to-end online voice detection and recognition method of any one of claims 1-6.
CN202110175961.6A, filed 2021-02-09 (priority 2021-02-09): End-to-end online voice detection and recognition method, system and equipment. Granted as CN112951213B; legal status Active.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110175961.6A 2021-02-09 2021-02-09 End-to-end online voice detection and recognition method, system and equipment (granted as CN112951213B)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110175961.6A 2021-02-09 2021-02-09 End-to-end online voice detection and recognition method, system and equipment (granted as CN112951213B)

Publications (2)

Publication Number Publication Date
CN112951213A 2021-06-11
CN112951213B 2022-05-24

Family

ID=76244553

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110175961.6A Active CN112951213B (en) 2021-02-09 2021-02-09 End-to-end online voice detection and recognition method, system and equipment

Country Status (1)

Country Link
CN (1) CN112951213B (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065021A (en) * 2018-10-18 2018-12-21 江苏师范大学 The end-to-end dialect identification method of confrontation network is generated based on condition depth convolution
WO2020146873A1 (en) * 2019-01-11 2020-07-16 Applications Technology (Apptek), Llc System and method for direct speech translation system
US20200372897A1 (en) * 2019-05-23 2020-11-26 Google Llc Variational Embedding Capacity in Expressive End-to-End Speech Synthesis
CN110689879A (en) * 2019-10-10 2020-01-14 中国科学院自动化研究所 Method, system and device for training end-to-end voice transcription model
CN110970031A (en) * 2019-12-16 2020-04-07 苏州思必驰信息科技有限公司 Speech recognition system and method
CN111415667A (en) * 2020-03-25 2020-07-14 极限元(杭州)智能科技股份有限公司 Stream-type end-to-end speech recognition model training and decoding method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LIU Jiawen (刘佳文) et al., "Transformer-based continuous speech recognition for Vietnamese", Journal of Information Engineering University *
LIU Juanhong (刘娟宏) et al., "End-to-end speech recognition with deep convolutional neural networks", Computer Applications and Software *
YANG Hongwu (杨鸿武) et al., "End-to-end Mandarin speech recognition based on an improved hybrid CTC/attention architecture", Journal of Northwest Normal University (Natural Science Edition) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113327595A (en) * 2021-06-16 2021-08-31 北京语言大学 Pronunciation deviation detection method and device and storage medium
CN113782000A (en) * 2021-09-29 2021-12-10 北京中科智加科技有限公司 Language identification method based on multiple tasks
CN113782000B (en) * 2021-09-29 2022-04-12 北京中科智加科技有限公司 Language identification method based on multiple tasks
CN114218424A (en) * 2022-02-22 2022-03-22 杭州一知智能科技有限公司 Voice interaction method and system for tone word insertion based on wav2vec
CN114218424B (en) * 2022-02-22 2022-05-13 杭州一知智能科技有限公司 Voice interaction method and system for tone word insertion based on wav2vec
WO2024023946A1 (en) * 2022-07-26 2024-02-01 日本電信電話株式会社 Speech processing device, speech processing method, and speech processing program
CN115662401A (en) * 2022-12-14 2023-01-31 国家电网有限公司客户服务中心 Customer service call voice recognition method based on continuous learning
CN115662401B (en) * 2022-12-14 2023-03-10 国家电网有限公司客户服务中心 Customer service call voice recognition method based on continuous learning
CN117252213A (en) * 2023-07-06 2023-12-19 天津大学 End-to-end speech translation method using synthesized speech as supervision information
CN117252213B (en) * 2023-07-06 2024-05-31 天津大学 End-to-end speech translation method using synthesized speech as supervision information

Also Published As

Publication number Publication date
CN112951213B (en) 2022-05-24

Similar Documents

Publication Publication Date Title
CN112951213B (en) End-to-end online voice detection and recognition method, system and equipment
CN110728997B (en) Multi-modal depression detection system based on context awareness
CN110718223B (en) Method, apparatus, device and medium for voice interaction control
US11043209B2 (en) System and method for neural network orchestration
CN111931929B (en) Training method and device for multitasking model and storage medium
US9728183B2 (en) System and method for combining frame and segment level processing, via temporal pooling, for phonetic classification
US8676574B2 (en) Method for tone/intonation recognition using auditory attention cues
CN110852215B (en) Multi-mode emotion recognition method and system and storage medium
CN110990543A (en) Intelligent conversation generation method and device, computer equipment and computer storage medium
KR102281676B1 (en) Audio classification method based on neural network for waveform input and analyzing apparatus
KR20210015967A (en) End-to-end streaming keyword detection
CN111653274B (en) Wake-up word recognition method, device and storage medium
KR20220130565A (en) Keyword detection method and apparatus thereof
CN109582839B (en) Writing auxiliary method and writing auxiliary client
CN115512692B (en) Voice recognition method, device, equipment and storage medium
KR20230093826A (en) Video data labeling method and devicd for animal detection and classification
CN113421593A (en) Voice evaluation method and device, computer equipment and storage medium
Cheng et al. Video reasoning for conflict events through feature extraction
KR102462144B1 (en) AI Chatbot System with MR Content for Tutoring
KR102564570B1 (en) System and method for analyzing multimodal emotion
Cao et al. Pyramid Memory Block and Timestep Attention for Speech Emotion Recognition.
CN111210830B (en) Voice awakening method and device based on pinyin and computer equipment
Deonise et al. Improved Speech Activity Detection Model Using Convolutional Neural Networks
CN118053420A (en) Speech recognition method, apparatus, device, medium and program product
Schuller et al. Chain of audio processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant