CN116863920A - Speech recognition method, apparatus, device and medium based on a dual-stream self-supervised network - Google Patents

Speech recognition method, apparatus, device and medium based on a dual-stream self-supervised network

Info

Publication number
CN116863920A
Authority
CN
China
Prior art keywords
module
voice
representation
model
quantizing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310874348.2A
Other languages
Chinese (zh)
Other versions
CN116863920B (en)
Inventor
明悦
范春晓
吕柏阳
胡楠楠
周江琬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202310874348.2A
Publication of CN116863920A
Application granted
Publication of CN116863920B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L2015/0631 - Creating reference templates; Clustering
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G10L15/26 - Speech to text systems
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G06N3/08 - Learning methods
    • G06N3/09 - Supervised learning
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a speech recognition method, apparatus, device and medium based on a dual-stream self-supervised network, comprising the following steps: encoding and quantizing target acoustic features with an encoding and quantization module to obtain speech vectors; performing reconstruction prediction on the speech vectors with a reconstruction prediction module to obtain a first speech representation, and, in parallel, predicting the speech vectors with an autoregressive model in a contrastive prediction module to obtain a second speech representation; fusing the first speech representation and the second speech representation with a feature fusion submodule to obtain a fused speech representation; and, based on the target acoustic features, recognizing the fused speech representation with the first sub-model combined with the connectionist temporal classification (CTC) classifier in the CTC module to obtain a transcribed text. The invention attends both to detailed speech context information and to the difference information among different speech features, improves the robustness of self-supervised learning, and effectively combines the complementary advantages of generative and discriminative self-supervised learning.

Description

Speech recognition method, apparatus, device and medium based on a dual-stream self-supervised network
Technical Field
The present invention relates to the field of speech recognition technology, and in particular to a speech recognition method, apparatus, device and medium based on a dual-stream self-supervised network.
Background
Speech is the most direct and effective carrier of information and the dominant medium for emotional communication and the exchange of ideas. Automatic Speech Recognition (ASR) refers to correctly converting a speech signal into the corresponding text or command, allowing a machine to understand human language and perform the related operations. ASR draws on the frontiers of several disciplines, covering mathematics and statistics, acoustics and phonetics, computer science and artificial intelligence; it is a key link in human-machine language communication and information exchange and has strong practical value. With the widespread use of computers, ASR has become a key technology for simple and convenient human-computer interaction and is widely applied in real scenarios such as search, automatic navigation, self-service, machine translation and autonomous driving, spanning industry, culture, business and other fields.
ASR has gone through two periods of development, the traditional approach and the deep learning approach. The traditional approach mainly combines an acoustic model, a pronunciation model and a language model to find the word sequence most likely to have produced a given speech observation. With the rapid development of deep learning, the performance of speech tasks using deep learning has gradually surpassed that of traditional algorithms. End-to-end speech recognition (E2E ASR) models based on deep neural networks remove the need for alignment preprocessing of labelled speech data and directly learn the mapping between the input speech waveform or features and the output text. E2E ASR simplifies model training and, thanks to its powerful modelling and learning capacity, improves recognition accuracy. Notably, unlike traditional ASR systems, the performance of an E2E model depends heavily on the amount of available labelled target corpora. However, collecting and manually labelling speech data is extremely labour-intensive, and factors such as minority languages or dialects lead to low-resource scenarios with insufficient labelled corpora. This poses a significant challenge for building an effective E2E ASR system. Current end-to-end speech recognition schemes for limited labelled data mainly learn basic speech structure information from large amounts of unlabelled data through a pre-training strategy and then perform supervised training on the limited labelled data. According to the supervised learning mode, the training schemes can be divided into the following categories:
(1) Unsupervised learning. The huge workload of data collection and labelling leads to application scenarios with insufficient labelled corpora, which significantly reduces a model's modelling capacity. Unsupervised learning does not depend on labelled data; it discovers the relations among data samples from the structure or characteristics of the data itself, and can therefore alleviate, to some extent, the performance degradation caused by insufficient labelled data. However, because unsupervised learning only captures the distribution or structure of unlabelled data, supervision information is absent during model prediction, which increases prediction bias and limits its application in practical scenarios with limited labelled data.
(2) Semi-supervised learning. Semi-supervised learning combines supervised and unsupervised learning. Unlike unsupervised learning, it partially labels the unlabelled data in order to alleviate the prediction bias caused by insufficient supervision information. That is, a model is first trained on the labelled data and then used to predict labels for the unlabelled data, creating pseudo-labels. The labelled data and the newly generated pseudo-labelled data are then combined as new training data to alleviate the shortage of supervision information. The performance of semi-supervised training, however, depends heavily on the accuracy of the pseudo-labels predicted by the model.
(3) Self-supervised learning. Self-supervised learning mainly uses auxiliary tasks to mine supervision information from large-scale unlabelled data and trains the model with this constructed supervision. Compared with unsupervised and semi-supervised learning, it can learn more semantic relations and more valuable representations for downstream tasks. However, the speech signal has a complex underlying structure (including phonemes, syllables, words, prosodic features, sentence-level context, etc.) that contains relevant information at different time scales. Existing self-supervised learning schemes cannot simultaneously capture the difference information among different features and the contextual information of the data distribution itself, so their prediction accuracy and robustness are poor.
In summary, in order to promote the application of end-to-end speech recognition in practical scenarios with limited labelled data and to improve the completeness with which self-supervised learning captures basic speech structure information, the above problems need to be studied in depth and a reasonable solution provided.
Disclosure of Invention
The embodiments of the present invention provide a speech recognition method, apparatus, device and medium based on a dual-stream self-supervised network, so as to overcome the defects of the prior art.
In order to achieve the above purpose, the present invention adopts the following technical scheme.
In a first aspect, the present invention provides a speech recognition method based on a dual-stream self-supervised network, comprising:
acquiring target acoustic features and a pre-trained speech recognition model; the pre-trained speech recognition model comprises a first sub-model and a second sub-model, the first sub-model comprises an encoding and quantization module, a reconstruction prediction module and a contrastive prediction module, the contrastive prediction module comprises a feature fusion submodule, and the second sub-model comprises a CTC module;
encoding and quantizing the target acoustic features with the encoding and quantization module to obtain speech vectors;
performing reconstruction prediction on the speech vectors with the reconstruction prediction module to obtain a first speech representation, and, in parallel, predicting the speech vectors with an autoregressive model in the contrastive prediction module to obtain a second speech representation;
fusing the first speech representation and the second speech representation with the feature fusion submodule to obtain a fused speech representation;
and, based on the target acoustic features, recognizing the fused speech representation with the first sub-model combined with the connectionist temporal classification classifier in the CTC module, to obtain a transcribed text.
Optionally, the feature fusion submodule comprises a gated recurrent unit and an adaptive fusion layer;
accordingly, fusing the first speech representation and the second speech representation with the feature fusion submodule to obtain a fused speech representation comprises:
performing feature selection on the first speech representation and the second speech representation with the gated recurrent unit to obtain a first selected feature and a second selected feature, respectively;
and adaptively fusing the first selected feature and the second selected feature with the adaptive fusion layer.
Optionally, the pre-trained speech recognition model is obtained by training in the following manner:
acquiring acoustic feature samples and a pre-constructed speech recognition model;
inputting the acoustic feature samples into the pre-constructed speech recognition model;
calculating a reconstruction loss based on the first speech representation output by the reconstruction prediction module and the acoustic feature samples;
calculating a contrastive loss based on the fused speech representation output by the feature fusion submodule and the acoustic feature samples;
calculating a diversity loss based on the codebook information of the acoustic feature samples;
iteratively updating the initial network parameters of the encoding and quantization module, the reconstruction prediction module and the contrastive prediction module according to the reconstruction loss, the contrastive loss and the diversity loss, to obtain updated network parameters for these modules;
using the updated network parameters as a feature extractor of the CTC module to extract speech representations, and training and decoding the CTC module based on the acoustic feature samples and the labelled data, so as to obtain a trained speech recognition model;
or, iteratively updating the randomly initialized network parameters of the encoding and quantization module, the reconstruction prediction module, the contrastive prediction module and the CTC module according to the reconstruction loss, the contrastive loss and the diversity loss, so as to obtain a trained speech recognition model.
Optionally, the encoding and quantization module includes an encoder and a vector quantization layer, the encoder being based on a Conformer network;
accordingly, encoding and quantizing the target acoustic features with the encoding and quantization module to obtain speech vectors comprises:
encoding the target acoustic features with the encoder to obtain a potential speech representation;
and discretizing the potential speech representation with the vector quantization layer to obtain the speech vectors.
Optionally, the encoder comprises a plurality of Conformer layers, each Conformer layer comprising:
a first feed-forward layer, a first residual-and-normalization module, a multi-head self-attention layer, a second residual-and-normalization module, a convolution module, a third residual-and-normalization module, a second feed-forward layer, a fourth residual-and-normalization module and a LayerNorm layer which are connected in sequence; residual connections are formed between the first and second residual-and-normalization modules, between the second and third residual-and-normalization modules, and between the third and fourth residual-and-normalization modules.
Optionally, the pre-trained speech recognition model further comprises a random masking module;
accordingly, after acquiring the target acoustic features, the method further comprises:
performing time random masking and frequency random masking on the target acoustic features with the random masking module to obtain target masked acoustic features;
accordingly, encoding and quantizing the target acoustic features with the encoding and quantization module to obtain speech vectors comprises:
encoding and quantizing the target masked acoustic features with the encoding and quantization module to obtain the speech vectors.
In a second aspect, the present invention also provides a speech recognition apparatus based on a dual-stream self-supervised network, comprising:
an acoustic feature and model acquisition module, configured to acquire target acoustic features and a pre-trained speech recognition model; the pre-trained speech recognition model comprises a first sub-model and a second sub-model, the first sub-model comprises an encoding and quantization module, a reconstruction prediction module and a contrastive prediction module, the contrastive prediction module comprises a feature fusion submodule, and the second sub-model comprises a CTC module;
an encoding and quantization module, configured to encode and quantize the target acoustic features to obtain speech vectors;
a reconstruction and contrast module, configured to perform reconstruction prediction on the speech vectors with the reconstruction prediction module to obtain a first speech representation, and, in parallel, to predict the speech vectors with an autoregressive model in the contrastive prediction module to obtain a second speech representation;
a fusion module, configured to fuse the first speech representation and the second speech representation with the feature fusion submodule to obtain a fused speech representation;
and a classification module, configured to recognize, based on the target acoustic features, the fused speech representation with the first sub-model combined with the connectionist temporal classification classifier in the CTC module, to obtain a transcribed text.
In a third aspect, the present invention also provides an electronic device comprising a memory and a processor which communicate with each other, the memory storing program instructions executable by the processor, and the processor invoking the program instructions to perform the above speech recognition method based on a dual-stream self-supervised network.
In a fourth aspect, the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the speech recognition method based on a dual-stream self-supervised network described above.
The invention has the following beneficial effects: the invention provides a speech recognition method, apparatus, device and medium based on a dual-stream self-supervised network, in which a two-branch structure is built by placing a Reconstruction Prediction Module (RPM) and a Contrastive Prediction Module (CPM) in parallel after the encoding and quantization module. Reconstruction prediction serves as an auxiliary task to contrastive prediction, and each branch predicts speech frames from the speech vectors, so that the model attends to detailed speech context information while capturing the difference information among different speech features by modelling the relations between different speech representations. In addition, to make effective use of the two-branch speech representations, the representations of the two branches are fused by a feature fusion submodule, which adaptively fuses them through a parameter learning strategy and controls the exposure of the various speech features with learned weights. Finally, the proposed dual-stream self-supervised learning network provides a good initialization of the ASR model weights. Compared with other self-supervised learning methods, the proposed speech recognition method achieves competitive prediction accuracy, and in scenarios with limited labelled data it is comparable to state-of-the-art self-supervised learning methods.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow diagram of a prior art speech recognition method based on mask reconstruction;
FIG. 2 is a flow chart of a prior art contrast prediction-based speech recognition method;
FIG. 3 is a first schematic flow chart of a speech recognition method based on a dual-stream self-supervised network according to an embodiment of the present invention;
FIG. 4 is a second schematic flow chart of a speech recognition method based on a dual-stream self-supervised network according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a feature fusion sub-module according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an encoder according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for explaining the present invention and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
In the prior art, self-supervised learning can learn speech representations that are valuable for downstream tasks, and the trained model can also be used to initialize an ASR task; model parameters learned by self-supervised pre-training have proven to be an effective way to initialize an ASR model. Current self-supervised learning methods fall into two main branches: generative self-supervised learning and discriminative self-supervised learning.
Generative self-supervised learning generates or reconstructs the input data from a limited number of speech frames, including predicting future input from past input, predicting masked frames from unmasked frames, or predicting the original speech from a corrupted version. The following describes the two representative schemes of autoregressive prediction of speech frames and masked prediction of speech frames.
The inspiration for autoregressive prediction comes mainly from language models (LM) of text, extended to the speech domain. Unlike conventional linear regression, an autoregressive model encodes the temporal information of the past acoustic sequence: the model conditions on past speech frames while predicting future speech frames. Because it encodes only the information of previous time steps rather than the entire input, it may lose global context information.
Masked reconstruction is largely inspired by masked language models and was then extended to the speech domain. Some tokens in the input are randomly replaced with mask tokens, and the model then relies on the unmasked tokens to recover the hidden speech features from the corrupted or masked features. The masking strategy is similar to BERT, and masking can generally be performed along both the time and frequency dimensions. This lets the model encode information across the entire input to learn basic speech structure information. However, the speech signal has a complex underlying structure (including phonemes, syllables, words, prosodic features, sentence-level context, etc.), so a masked reconstruction model encodes all the information in the speech signal, including information that is redundant for a specific ASR task.
For the reasons above, learning to reconstruct the original speech signal may not be the best way to discover the underlying structure of speech. A contrastive model learns speech representations by maximizing the similarity between a given speech segment and positive samples while minimizing its similarity to negative samples, thereby distinguishing the target samples (positives) of the given speech from the interference samples (negatives). The following describes methods based on Contrastive Predictive Coding (CPC) and on wav2vec 2.0.
Methods based on contrastive prediction: contrastive predictive coding adopts a unidirectional modelling form in feature space. A non-linear encoder first maps the input speech sequence to a hidden space, where the speech representation has only a low temporal resolution. The latent speech representations are then encoded with an autoregressive model to obtain a speech context representation, and the latent features of future frames are predicted by a prediction network combined with the historical context representation. Finally, the closeness between the prediction and the real features is judged by maximizing the mutual information between the next several frames of audio and their context representation. This not only lets the model learn the basic shared information between different parts of the encoded (high-dimensional) signal, it also discards low-level information and more localized noise. On top of contrastive predictive coding, wav2vec 2.0 learns speech representations using an InfoNCE loss combined with masking, maximizing the similarity between the contextualized representation and the original speech representation. It focuses on learning the mapping between input and output, so it captures the characteristics of the training data itself insufficiently and lacks context information.
Speech recognition based on masked reconstruction trains an acoustic model with a masked-reconstruction self-supervised pre-training strategy and then uses the obtained acoustic model for speech representation extraction or fine-tuning to perform speech recognition. The main algorithm flow is shown in FIG. 1; the specific steps are as follows:
First, each input speech feature is treated as an image of dimension T x F, where T is the number of frames and F is the number of frequency bins. Masking is performed along both dimensions using both the time and the frequency random masking strategies. For time masking, each sequence is masked starting from a frame index T_I over T_n consecutive time steps, i.e. the interval (T_I, T_I + T_n) is randomly masked, and in total 15% of the speech frames are masked without overlap. Of the masked frames, 80% are replaced by zero vectors, 10% are replaced by frames from random positions, and the rest are kept unchanged. Similar to the time mask, the frequency mask randomly sets the values of consecutive frequency-bin blocks to zero across all time steps of the input sequence: the f consecutive mel frequency channels [F_I, F_I + f] are masked, where the width f is sampled uniformly from {0, 1, ..., F} to select the masked frequency block and F_I is chosen randomly from [0, F - f].
In addition, during masked reconstruction part of the features are randomly masked, and autoregressive networks such as RNN/LSTM/Transformer are encouraged, by using the time and frequency masking strategies either singly or in combination, to fully learn the spatio-temporal information in the input features, i.e. the global context information and the spatial information of the speech.
Finally, the speech knowledge obtained by self-supervised learning is incorporated into the speech recognition network through representation extraction or fine-tuning. For representation extraction, the network parameters of the self-supervised autoregressive model are frozen and the model is used as a feature extractor for the ASR network; the extracted speech representations are fed as input features to the ASR network for supervised training to obtain the text output. For fine-tuning, the self-supervised autoregressive model and a randomly initialized ASR network are trained together under supervision and the network parameters are updated to obtain the final text output.
In the above speech recognition method based on masked reconstruction, because the speech signal contains a complex underlying structure (including phonemes, syllables, prosodic features, sentence-level context, etc.), predicting the masked features from context information leads to insufficient capture of information, such as prosodic features, that affects ASR performance.
To fully exploit the context to predict the masked features, all the information in the speech signal has to be encoded in order to learn the intrinsic characteristics of the speech data, which results in higher learning costs and more computational resources than discriminative self-supervised learning.
In addition, reconstruction prediction encodes all the information in the speech signal, including information that is redundant for a specific ASR task, which makes the prediction less robust.
In addition to the above speech recognition based on masked reconstruction, there is a speech recognition method based on contrastive prediction, which combines wav2vec 2.0 with a downstream speech recognition network and performs speech recognition through speech representation extraction or fine-tuning, where wav2vec 2.0 comprises a feature encoder, a quantization module and a Transformer context network. The main algorithm flow is shown in FIG. 2; the specific steps are as follows:
First, a feature extractor consisting of a seven-layer convolutional network encodes the raw audio into a sequence of frame features, and each frame feature is converted into a discrete feature by a vector quantization module to serve as the self-supervision target.
Further, the vector quantization module is used to discretize the output of the feature encoder. It contains G groups of codebooks, each containing V variables. For each continuous vector output by the feature encoder, one variable is selected in each codebook group, the G variables are concatenated, and a linear transformation is applied to obtain the final discrete feature.
The Transformer is then used to obtain the speech context representation. Before being input into the Transformer, the output of the feature encoder undergoes a masking operation, with the masked positions replaced by a trainable embedding token; no masking is performed for vector quantization. A contrastive loss is computed from the context representation and the discrete features, so that the masked positions can be identified in the Transformer output among candidate discrete features that contain distractors sampled from other masked time steps. Finally, the text output of the speech recognition task is obtained through representation extraction or fine-tuning.
However, the speech recognition method based on contrastive prediction has the following drawback: discriminative learning focuses on learning the mapping between input and output by comparing similarity measures of target samples (positives) and interference samples (negatives), without fully considering the inherent structure of the data, so its ability to handle missing masked data is weak.
In summary, the self-supervised learning speech recognition algorithms in the prior art have the following problems:
1. Incomplete capture of basic speech structure information. Generative reconstruction prediction focuses on the data distribution by reconstructing masked data from speech context information, but because of the complex characteristics of speech signals it still fails to capture information, such as prosodic features, that affects ASR performance. Discriminative models focus on the difference information of the data by comparing the similarity of target samples (positives) and interference samples (negatives) and searching for a classification boundary; they concentrate on the mapping between input and output, do not fully consider the autocorrelation characteristics of the speech signal, and are weak at handling missing masked data. Existing self-supervised learning schemes therefore capture basic speech structure information incompletely.
2. The advantages of generative and discriminative self-supervision are not effectively combined. Different types of self-supervised models show different advantages on different downstream tasks, and existing technical schemes lack an effective way to fuse the two kinds of self-supervised learning. To better exploit the potential of both models, an efficient fusion strategy is necessary.
3. Reconstruction prediction encodes information that is redundant for a specific ASR task. In the prior art, reconstruction prediction uses context information to predict the masked features, which requires encoding all the information in the speech signal to learn the intrinsic characteristics of the speech data; this encodes information that is redundant for the ASR task and makes the prediction less robust.
The invention aims to solve the problems of incomplete capture of basic speech structure information and poor robustness of model predictions in self-supervised learning, to remedy the defects and shortcomings of existing self-supervised learning schemes, and to promote wider application in practice.
The following describes the speech recognition method based on a dual-stream self-supervised network according to the present invention with reference to the accompanying drawings.
Term interpretation:
self-supervised learning (Self-Supervised Learning, SSL): and the characteristic characteristics of the label-free data are mined by designing auxiliary tasks to serve as supervision information, so that the learning mode of the characteristic extraction capacity of the model is improved.
End-to-End speech recognition (End-to-End Automatic Speech Recognition, E2E ASR): the voice recognition system based on the end-to-end network model directly maps the input voice waveform column to the output text through a neural network model, and each module in the system is not required to be trained independently like a traditional voice recognition algorithm, so that the voice recognition flow is simplified, the problem of automatic alignment of an input sequence and an output sequence is well solved, and forced alignment processing is not required to be carried out on the input sequence.
Attention mechanism (Attention Mechanism): the attention mechanism refers to a method designed to simulate the characteristic that the human time system can naturally and effectively find salient regions in complex scenes in computing deep learning. The attention mechanism in deep learning includes spatial attention, channel attention, self-attention, and the like.
Example 1
FIG. 3 is a first schematic flow chart of a speech recognition method based on a dual-stream self-supervised network according to an embodiment of the present invention, and FIG. 4 is a second schematic flow chart of the method. As shown in FIG. 3 and FIG. 4, the speech recognition method based on a dual-stream self-supervised network comprises the following steps:
S301, acquiring target acoustic features and a pre-trained speech recognition model.
The pre-trained speech recognition model comprises a first sub-model and a second sub-model; the first sub-model comprises an encoding and quantization module, a reconstruction prediction module and a contrastive prediction module, the contrastive prediction module comprises a feature fusion submodule, and the second sub-model comprises a CTC module. The target acoustic features are the speech data to be recognized.
S302, encoding and quantizing the target acoustic features with the encoding and quantization module to obtain speech vectors.
In this step, the target acoustic features are encoded and quantized by the encoding and quantization module so that more meaningful speech unit information is learned to enrich the speech representation. The speech vectors are then input into the reconstruction prediction module and the contrastive prediction module of the two-branch structure.
S303, performing reconstruction prediction on the speech vectors with the reconstruction prediction module to obtain a first speech representation; and, in parallel, predicting the speech vectors with an autoregressive model in the contrastive prediction module to obtain a second speech representation.
In this step, reconstruction prediction is jointly trained as an auxiliary task to contrastive prediction.
S304, fusing the first speech representation and the second speech representation with the feature fusion submodule to obtain a fused speech representation.
In this step, the feature fusion submodule adaptively fuses the two speech representations through a parameter learning strategy, exploring the relations between the different speech representations while attending to the context information, so as to capture the difference information among different features.
S305, recognizing the fused speech representation, based on the target acoustic features, with the first sub-model combined with the connectionist temporal classification classifier in the CTC module, to obtain a transcribed text.
In this step, the final text output is produced by Connectionist Temporal Classification (CTC).
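For illustration only, the following minimal PyTorch-style sketch shows one way the data flow of steps S301 to S305 could be wired together. All class, layer and parameter names are hypothetical, and every sub-module is reduced to a simple placeholder layer rather than the actual Conformer encoder, quantizer, RPM, CPM or GFF described in this specification.
```python
# Hypothetical skeleton of the dual-stream forward pass (S301-S305); module
# internals are placeholders, not the patent's actual implementation.
import torch
import torch.nn as nn

class DualStreamASR(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab_size=32):
        super().__init__()
        self.encoder = nn.Linear(feat_dim, hidden)            # stands in for the Conformer encoder
        self.quantizer = nn.Linear(hidden, hidden)            # stands in for the vector quantization layer
        self.rpm = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))   # reconstruction prediction module
        self.cpm = nn.GRU(hidden, hidden, batch_first=True)   # autoregressive part of the contrastive module
        self.gff = nn.Linear(2 * hidden, hidden)              # stands in for the GRU feature-fusion submodule
        self.ctc_head = nn.Linear(hidden, vocab_size)         # projects to CTC label logits

    def forward(self, feats):                          # feats: (batch, time, feat_dim)
        v = self.quantizer(self.encoder(feats))        # S302: encode + quantize -> speech vectors
        r = self.rpm(v)                                # S303: first (reconstruction) representation
        c, _ = self.cpm(v)                             # S303: second (autoregressive) representation
        fused = self.gff(torch.cat([r, c], dim=-1))    # S304: fuse the two representations
        return self.ctc_head(fused).log_softmax(-1)    # S305: per-frame CTC log-probabilities

logits = DualStreamASR()(torch.randn(2, 100, 80))
print(logits.shape)  # (2, 100, 32)
```
The placeholder layers can be swapped for the real components without changing this overall flow.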
It should be noted that the speech recognition method provided by the invention can be applied in many real scenarios such as human-computer interaction, machine translation, autonomous driving and smart homes, spanning industry, culture, business and other fields. For example, the smart speakers released by Internet companies such as Google, Amazon, Baidu, Alibaba and iFLYTEK are products of the successful commercial deployment of ASR technology.
In the speech recognition method based on a dual-stream self-supervised network provided by this embodiment, a two-branch structure is built by placing a Reconstruction Prediction Module (RPM) and a Contrastive Prediction Module (CPM) in parallel after the encoding and quantization module. Reconstruction prediction serves as an auxiliary task to contrastive prediction, and each branch predicts speech frames from the speech vectors, so that detailed speech context information is attended to while the difference information among different features is captured by modelling the relations between different speech representations. In addition, to make effective use of the two-branch speech representations, the representations of the two branches are fused by a feature fusion submodule, which adaptively fuses them through a parameter learning strategy and controls the exposure of the various speech features with learned weights. Finally, the proposed dual-stream self-supervised learning network provides a good initialization of the ASR model weights. Compared with other self-supervised learning methods, the proposed speech recognition method achieves competitive prediction accuracy, and in scenarios with limited labelled data it is comparable to state-of-the-art self-supervised learning methods.
Optionally, the pre-trained speech recognition model further comprises a random masking module;
accordingly, after acquiring the target acoustic features, the method further comprises:
performing time random masking and frequency random masking on the target acoustic features with the random masking module to obtain target masked acoustic features. That is, for the target acoustic features x, both the time and the frequency random masking strategies are used to obtain the masked acoustic features.
Specifically, for the time mask, a start index T_I is randomly selected to mask a span with a maximum width of T_n frames, i.e. in each sequence the interval (T_I, T_I + T_n) is randomly masked, covering 15% of the total sequence. In this procedure, 80% of the masked speech frames are replaced by zero vectors and 10% are replaced by other speech frames randomly sampled from the same utterance. Similarly, the frequency mask sets the values of the consecutive frequency channels (F_I, F_I + f) to zero, where the width f is sampled uniformly from {0, 1, ..., F} to select the masked frequency band.
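As an illustration only, the following sketch applies the time and frequency random masking described above to a single utterance; the 15% coverage and the 80%/10%/10% replacement rule follow the text, while the function name, the maximum span length and the maximum frequency width are assumptions.
```python
# Illustrative masking sketch (names, defaults and tensor layout are assumptions).
import torch

def mask_features(x, time_mask_ratio=0.15, max_span=10, max_freq_width=27):
    """x: (T, F) log-Mel features of one utterance; returns a masked copy."""
    x = x.clone()
    T, F = x.shape

    # Time masking: randomly chosen contiguous spans until ~15% of frames are covered.
    n_to_mask = int(T * time_mask_ratio)
    masked = []
    while len(masked) < n_to_mask:
        start = torch.randint(0, T, (1,)).item()              # T_I
        span = torch.randint(1, max_span + 1, (1,)).item()    # up to T_n frames
        masked.extend(range(start, min(start + span, T)))
        masked = list(dict.fromkeys(masked))                  # drop duplicates, keep order
    for t in masked[:n_to_mask]:
        p = torch.rand(1).item()
        if p < 0.8:                                  # 80%: replace with a zero vector
            x[t] = 0.0
        elif p < 0.9:                                # 10%: replace with a random frame of the same utterance
            x[t] = x[torch.randint(0, T, (1,)).item()]
        # remaining 10%: keep the frame unchanged

    # Frequency masking: zero a contiguous band of f channels, f ~ U{0, ..., max_freq_width}.
    f = torch.randint(0, max_freq_width + 1, (1,)).item()
    if 0 < f < F:
        f0 = torch.randint(0, F - f, (1,)).item()             # F_I
        x[:, f0:f0 + f] = 0.0
    return x

masked_feats = mask_features(torch.randn(200, 80))
```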
Accordingly, encoding and quantizing the target acoustic features with the encoding and quantization module to obtain speech vectors comprises:
encoding and quantizing the target masked acoustic features with the encoding and quantization module to obtain the speech vectors.
Optionally, the encoder comprises a plurality of Conformer layers, each Conformer layer comprising:
a first feed-forward layer, a first residual-and-normalization module, a multi-head self-attention layer, a second residual-and-normalization module, a convolution module, a third residual-and-normalization module, a second feed-forward layer, a fourth residual-and-normalization module and a LayerNorm layer which are connected in sequence; residual connections are formed between the first and second residual-and-normalization modules, between the second and third residual-and-normalization modules, and between the third and fourth residual-and-normalization modules.
In particular, the present invention uses a Conformer-based encoder structure consisting of N layers, each layer built from a multi-head self-attention layer (MHSA), a convolution module (Conv), feed-forward layers (FFN) and residual-and-normalization layers (Add & Norm), as shown in FIG. 6. The overall Conformer structure replaces the original single feed-forward layer with two half-step feed-forward layers, one before the multi-head attention layer and the second after the convolution module, and the second feed-forward module is followed by a LayerNorm layer. Thus, given the masked input feature representation X, the Conformer output H_X is defined as follows:
H = X + (1/2) FFN(X) (1)
H' = H + MHSA(H) (2)
H'' = H' + Conv(H') (3)
H_X = LayerNorm(H'' + (1/2) FFN(H'')) (4)
Multi-head self-attention is effectively a multi-channel parallel self-attention mechanism. For the self-attention mechanism, the masked spectrogram feature representation H is first projected into queries, keys and values (Q, K, V) by linear transformations:
Q = H W_Q,  K = H W_K,  V = H W_V (5)
where W_Q, W_K and W_V are learnable parameter matrices. Scaled dot-product attention is then computed with the softmax function:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V (6)
The multi-head self-attention mechanism then splits the attention input evenly into h different attention channels, computes them in parallel, and concatenates the attention results of all channels:
Multihead(Q, K, V) = concat(head_1, ..., head_h) W_o (7)
where W_Q, W_K, W_V and W_o are learnable parameter matrices and sqrt(d_k) is the scaling factor. In general, h = 8 parallel attention spaces (heads) are used. In practice, d_k = d_model / h is always set so that the computational complexity of multi-head attention is the same as that of single-head self-attention, where d_model denotes the dimension of the input vectors. The convolution module consists of a pointwise convolution, a depthwise convolution, a GLU activation layer and a Swish activation layer. The feed-forward layer consists of two linear transformations with a ReLU activation in between:
FFN(x) = max(0, x W_1 + b_1) W_2 + b_2 (8)
where x denotes the input of the feed-forward layer, W_1 and W_2 are learnable parameter matrices, and b_1 and b_2 are bias terms.
Although the linear transformations are identical across different positions, different parameters are used between different layers. In addition, a residual connection is used around each sub-layer, followed by layer normalization, to achieve more stable and faster convergence.
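The following is a hedged PyTorch sketch of a single Conformer layer as described above (two half-step feed-forward layers, multi-head self-attention, a convolution module with pointwise/GLU/depthwise/Swish components, and a final LayerNorm). The hidden sizes, kernel size, the use of ReLU inside the feed-forward layers and the exact placement of the normalization layers are assumptions, not the patent's exact configuration.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvModule(nn.Module):
    """Pointwise conv + GLU, depthwise conv + Swish, pointwise conv."""
    def __init__(self, d_model, kernel_size=31):
        super().__init__()
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, 1)
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.pointwise2 = nn.Conv1d(d_model, d_model, 1)

    def forward(self, x):                            # x: (batch, time, d_model)
        y = x.transpose(1, 2)                        # -> (batch, d_model, time)
        y = F.glu(self.pointwise1(y), dim=1)         # pointwise conv + GLU
        y = F.silu(self.depthwise(y))                # depthwise conv + Swish (SiLU)
        return self.pointwise2(y).transpose(1, 2)    # pointwise conv, back to (batch, time, d_model)

class ConformerLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        make_ffn = lambda: nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                         nn.Linear(4 * d_model, d_model))
        self.ffn1, self.ffn2 = make_ffn(), make_ffn()
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = ConvModule(d_model)
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(5)])

    def forward(self, h):                                   # h: (batch, time, d_model)
        h = self.norms[0](h + 0.5 * self.ffn1(h))           # first half-step FFN with Add & Norm
        h = self.norms[1](h + self.mhsa(h, h, h)[0])        # multi-head self-attention with Add & Norm
        h = self.norms[2](h + self.conv(h))                 # convolution module with Add & Norm
        h = self.norms[3](h + 0.5 * self.ffn2(h))           # second half-step FFN with Add & Norm
        return self.norms[4](h)                             # final LayerNorm

out = ConformerLayer()(torch.randn(2, 50, 256))             # (batch, time, d_model)
```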
To focus more on language/speech unit information, the invention attaches a quantization layer after the Conformer. The potential speech representation H_X output by the Conformer is first mapped through a linear layer to logits l ∈ R^{G×V}, where G is the number of codebooks and V is the size of each codebook. One variable is then selected from each fixed-size codebook C = {C_1, ..., C_V}, the selected vectors are concatenated, and a linear transformation is applied to obtain the discretized speech representation v_t. The probability of selecting the v-th entry in the g-th codebook is defined as:
p_{g,v} = exp(l_{g,v}) / Σ_{k=1}^{V} exp(l_{g,k}) (9)
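A hedged sketch of this product-codebook quantization layer is given below, assuming a PyTorch implementation: G codebook groups of V entries each, a linear layer producing the logits l, differentiable hard selection via Gumbel-softmax (an assumption; the text above only specifies the selection probability), concatenation of the selected codewords and a final linear projection. All dimensions and the temperature are illustrative.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProductQuantizer(nn.Module):
    def __init__(self, d_in=256, groups=2, entries=320, d_code=128, d_out=256, tau=2.0):
        super().__init__()
        self.groups, self.entries, self.tau = groups, entries, tau
        self.to_logits = nn.Linear(d_in, groups * entries)               # logits l in R^{G x V}
        self.codebooks = nn.Parameter(torch.randn(groups, entries, d_code))
        self.project = nn.Linear(groups * d_code, d_out)                 # final linear transformation

    def forward(self, h_x):                                   # h_x: (batch, time, d_in)
        B, T, _ = h_x.shape
        logits = self.to_logits(h_x).view(B, T, self.groups, self.entries)
        onehot = F.gumbel_softmax(logits, tau=self.tau, hard=True, dim=-1)  # one entry per group
        codes = torch.einsum('btgv,gvd->btgd', onehot, self.codebooks)      # gather codebook vectors
        v = self.project(codes.reshape(B, T, -1))             # concatenate the G codewords, then project
        probs = logits.softmax(dim=-1)                        # p_{g,v}, reused later by the diversity loss
        return v, probs

v, probs = ProductQuantizer()(torch.randn(2, 50, 256))
```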
optionally, the pre-trained speech recognition model is obtained by training in the following manner:
acquiring an acoustic characteristic sample and a pre-constructed voice recognition model;
the acoustic feature samples are input to the pre-constructed speech recognition model.
And calculating and obtaining reconstruction loss based on the first voice representation output by the reconstruction prediction module and the acoustic characteristic sample.
And obtaining contrast loss based on the fused voice representation output by the feature fusion submodule and the acoustic feature sample.
And calculating the diversity loss based on the codebook information of the acoustic feature samples.
And carrying out iterative updating on initial network parameters in the coding and quantizing module, the reconstruction prediction module and the comparison prediction module according to the reconstruction loss, the comparison loss and the diversity loss to obtain updated network parameters in the coding and quantizing module, the reconstruction prediction module and the comparison prediction module.
And taking the updated network parameters as voice characterization extracted by a feature extractor of the CTC module, and training and decoding the CTC module based on the acoustic feature sample and the labeling data so as to obtain a trained voice recognition model.
Or, iteratively updating the randomly initialized network parameters in the coding and quantizing module, the reconstruction prediction module, the comparison prediction module and the CTC module according to the reconstruction loss, the comparison loss and the diversity loss, so as to obtain a trained speech recognition model.
In this embodiment, the invention constructs a two-branch structure based on the reconstruction prediction module and the contrastive prediction module after the encoding and quantization module. The reconstruction prediction module mainly consists of a prediction network P_net whose purpose is to reconstruct the acoustic features x_t from the masked features. The prediction network in the invention consists of a position-wise feed-forward network (FFN). An L1 reconstruction loss is then calculated between the input x and the output of P_net to update the network parameters of the encoder and of P_net:
L_Reconstruction = Σ_t | x_t - x̂_t | (10)
where x_t denotes the original speech feature input and x̂_t denotes the reconstruction produced by P_net from the masked speech features (the target masked acoustic features). The parameters of the Conformer encoder in the encoding and quantization module are retained for the ASR task, while the parameters of the prediction network P_net are discarded. By reconstructing the masked speech frames from the context of previous and future content, the reconstruction prediction module effectively improves the accuracy of speech recognition prediction.
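As a sketch only, the reconstruction branch can be written as a position-wise feed-forward prediction network plus an L1 loss, as below; the layer sizes and the use of all frames (rather than only the masked frames) in the loss are assumptions.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, feat_dim = 256, 80
p_net = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                      nn.Linear(4 * d_model, feat_dim))   # position-wise FFN P_net

def reconstruction_loss(speech_vectors, original_feats):
    """L1 distance between the original acoustic features x and the P_net reconstruction."""
    return F.l1_loss(p_net(speech_vectors), original_feats)

loss = reconstruction_loss(torch.randn(2, 100, d_model), torch.randn(2, 100, feat_dim))
```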
The contrastive prediction module uses an autoregressive model to summarize the discrete representations into a new context vector c_t. However, the invention does not use the context vector c_t directly to compute the contrastive prediction; instead, the GFF module is used to fuse the RPM output with the output of the CPM autoregressive network to obtain the speech representation c_GFF, which improves the accuracy of the predicted speech representation. The fused speech representation c_GFF is then used to calculate the contrastive loss, which helps the model learn more comprehensive speech structure information. The model uses the contrastive loss to identify the true discretized speech representation v_t among a set of K + 1 candidate representations, which consist of v_t and K distractors uniformly sampled from other masked time steps of the same utterance. The contrastive loss is defined as:
L_Contrastive = -log( exp(sim(c_GFF, v_t) / κ) / Σ_{ṽ ∈ V_t} exp(sim(c_GFF, ṽ) / κ) ) (11)
In L_Contrastive, sim denotes the cosine similarity between two vectors, V_t denotes the candidate set, and κ is a temperature hyper-parameter. In addition, a diversity loss is used to improve the quantized codebook representation: it balances the probability of using all entries in each codebook by maximizing, over a batch of audio, the entropy of the average softmax distribution over the codebook entries of each codebook p_g,
L_Diversity = (1 / (G·V)) Σ_{g=1}^{G} Σ_{v=1}^{V} p̄_{g,v} log p̄_{g,v} (12)
where p̄_{g,v} denotes the batch-averaged probability of selecting the v-th code in the g-th codebook.
The final training objective L_Total of the invention consists of three parts, the reconstruction loss L_Reconstruction, the contrastive loss L_Contrastive and the diversity loss L_Diversity, so that the two self-supervised tasks can be solved simultaneously. The training loss to be minimized is:
L_Total = L_Contrastive + α L_Diversity + β L_Reconstruction (13)
where α and β are learnable hyper-parameters. L_Contrastive is computed from the speech representation and the acoustic features, with the noise samples of the acoustic features uniformly sampled from other masked positions of the same utterance. For L_Diversity, α is set to 0.1 to balance its weight. L_Reconstruction is computed from the acoustic features X and the reconstructed output.
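The sketch below illustrates how the contrastive, diversity and total losses could be computed, assuming a PyTorch implementation; the negative-sampling interface, the temperature value and treating α and β as fixed scalars (the text describes them as learnable) are assumptions.
```python
import torch
import torch.nn.functional as F

def contrastive_loss(c_gff, targets, negatives, kappa=0.1):
    """InfoNCE over cosine similarities.
    c_gff:     (B, T, D) fused representations at masked steps
    targets:   (B, T, D) true quantized representations v_t
    negatives: (B, T, K, D) distractors drawn from other masked steps of the same utterance
    """
    candidates = torch.cat([targets.unsqueeze(2), negatives], dim=2)     # (B, T, K + 1, D)
    sims = F.cosine_similarity(c_gff.unsqueeze(2), candidates, dim=-1)   # (B, T, K + 1)
    labels = torch.zeros(sims.shape[:2], dtype=torch.long)               # the true target sits at index 0
    return F.cross_entropy((sims / kappa).flatten(0, 1), labels.flatten())

def diversity_loss(probs):
    """probs: (B, T, G, V) codebook selection probabilities; encourage uniform codebook usage."""
    avg = probs.mean(dim=(0, 1))                              # batch-averaged distribution per codebook group
    entropy = -(avg * (avg + 1e-7).log()).sum(dim=-1)         # entropy of each group
    max_entropy = torch.log(torch.tensor(float(probs.shape[-1])))
    return (1.0 - entropy / max_entropy).mean()               # 0 when usage is perfectly uniform

def total_loss(l_contrastive, l_diversity, l_reconstruction, alpha=0.1, beta=1.0):
    return l_contrastive + alpha * l_diversity + beta * l_reconstruction  # L_Total of equation (13)
```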
After the final training objective L_Total is determined, there are two different ways to train the parameters of each module: the speech knowledge learned by the dual-stream self-supervised network can be incorporated into the ASR task for training and decoding either through representation extraction or through fine-tuning, thereby achieving end-to-end speech recognition with limited labelled data.
Representation extraction means that, when training with the downstream ASR task, the parameters of the DSSLNet are frozen and the network is used as a feature extractor for training the CTC module; the extracted speech representation is essentially the hidden state of the last layer of the DSSLNet encoder. The extracted representation is fed to the CTC module as input, replacing features such as FBANK/MFCC, for training and decoding to obtain the text output.
Fine-tuning applies the downstream CTC module to the DSSLNet: the output of the DSSLNet is connected to the CTC module and the parameters of the DSSLNet are not frozen. The trained DSSLNet and the randomly initialized CTC module are then updated together during training and decoding to obtain the text output.
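For illustration, the two training routes can be set up as below, assuming a PyTorch implementation; dsslnet and ctc_head stand for the pretrained dual-stream network and the downstream CTC module and are hypothetical handles, not the patent's actual classes.
```python
import torch
import torch.nn as nn

def build_optimizer(dsslnet, ctc_head, finetune=False, lr=1e-4):
    if not finetune:                          # representation extraction: freeze the pretrained network
        for p in dsslnet.parameters():
            p.requires_grad = False
        params = list(ctc_head.parameters())
    else:                                     # fine-tuning: update DSSLNet and the CTC module jointly
        params = list(dsslnet.parameters()) + list(ctc_head.parameters())
    return torch.optim.Adam(params, lr=lr)

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
# During supervised training, log-probabilities of shape (time, batch, vocab) produced by
# ctc_head(dsslnet(features)) are scored against the labelled transcripts with ctc_loss.
```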
Optionally, the encoding and quantization module includes an encoder and a vector quantization layer, the encoder being obtained based on a Conformer network;
Accordingly, the encoding and quantizing the target acoustic feature by using the encoding and quantizing module to obtain a speech vector includes:
encoding the target acoustic feature with the encoder to obtain a potential speech representation;
discretizing the potential speech representation by the vector quantization layer to obtain the speech vector.
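A brief sketch of this encode-then-quantize step is shown below; the Gumbel-softmax product quantizer used here is an assumption in the style of wav2vec 2.0 (the patent only specifies a Conformer encoder followed by a vector quantization layer), and the dimensions are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GumbelVectorQuantizer(nn.Module):
    def __init__(self, dim=256, groups=2, entries=320):
        super().__init__()
        self.groups, self.entries = groups, entries
        self.to_logits = nn.Linear(dim, groups * entries)                # per-group code logits
        self.codebook = nn.Parameter(torch.randn(groups, entries, dim // groups))

    def forward(self, z, tau=2.0):
        # z: (B, T, dim) latent speech representation produced by the Conformer encoder
        B, T, _ = z.shape
        logits = self.to_logits(z).view(B, T, self.groups, self.entries)
        onehot = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)    # discrete code selection
        q = torch.einsum("btgv,gvd->btgd", onehot, self.codebook)        # look up the chosen codes
        return q.reshape(B, T, -1)                                       # (B, T, dim) speech vector


# usage sketch: speech_vector = GumbelVectorQuantizer()(conformer_encoder(masked_features))
```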
According to the voice recognition method based on the double-flow self-supervision network provided by the embodiment of the invention, a feature fusion module (GFF) consisting of a gated recurrent unit (GRU) and an adaptive fusion layer is provided; adaptive feature fusion is achieved by controlling the exposure of different features, so as to reduce the redundant information generated by reconstruction prediction for the specific ASR task.
Optionally, the feature fusion submodule includes a gating loop unit and an adaptive fusion layer.
Accordingly, the fusing the first voice representation and the second voice representation by using the feature fusion sub-module to obtain a fused voice representation, including:
and respectively carrying out feature selection on the first voice representation and the second voice representation by using the gating circulating unit, and correspondingly obtaining a first selected feature and a second selected feature.
And carrying out self-adaptive fusion on the first selected feature and the second selected feature by using the self-adaptive fusion layer.
Specifically, the feature fusion sub-module (GFF) designed by the invention can avoid a large amount of redundant information in the fused features. The module consists of a gated recurrent unit (GRU) and an adaptive fusion layer, as shown in fig. 5. The workflow of the GFF module is divided into two steps.
First, the first speech representation and the second speech representation are input into the GRU, where the GRU consists of a reset gate r_t and an update gate z_t. Through the gating mechanism of the GRU, the most useful information is selected from a large number of feature maps and then selectively aggregated according to the obtained results, so that the output of this step passes information selectively. Second, the output of the GRU is subjected to feature fusion by the adaptive fusion layer.
Specifically, during the processing of the first speech representation, the two pieces of gating information are obtained from the current RPM output O_Recon (i.e., the first speech representation) and the hidden state h_{t-1} passed from the previous node:
r_t = σ(W_r · [h_{t-1}, O_Recon])   (14)
z_t = σ(W_z · [h_{t-1}, O_Recon])   (15)
where σ is the sigmoid function, and W_r and W_z are the weights of the reset gate and the update gate, respectively.
After the gating information is obtained, O_Recon is concatenated with the reset hidden state r_t ⊙ h_{t-1}, where the reset gate determines how much past information needs to be remembered; the output of the current hidden node is then obtained through the tanh activation function:

h̃_t = tanh(W · [r_t ⊙ h_{t-1}, O_Recon])   (16)

Finally, in the 'update memory' stage, the updated expression is:

h_t = z_t ⊙ h_{t-1} + (1 − z_t) ⊙ h̃_t   (17)

where W denotes a learnable parameter of the GRU, tanh denotes the activation function, h_{t-1} denotes the hidden state passed from the previous time step, and ⊙ denotes element-wise multiplication. The update gate z_t weights the previous hidden state, while the complementary weight (1 − z_t) selects the current candidate state h̃_t.
Similarly, the same calculation is performed for the second speech representation, finally yielding h_q, and is not repeated here; processing the first speech representation as above yields h_p.
After h_p and h_q are obtained, adaptive fusion is performed by the adaptive fusion layer, specifically as follows:
O_GFF = η·h_p + μ·h_q   (18)
where η and μ denote learnable weighting parameters, and h_p and h_q denote the results obtained by processing the RPM output O_Recon and the CPM output O_Con with the GRU, respectively.
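A compact sketch of the GFF sub-module is given below; the use of a single shared GRU cell for both branches, the hidden size, and the initialization of η and μ are assumptions made for illustration:

```python
import torch
import torch.nn as nn


class GatedFeatureFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.gru = nn.GRUCell(dim, dim)             # realizes the gating of eqs. (14)-(17)
        self.eta = nn.Parameter(torch.tensor(0.5))  # learnable fusion weights of eq. (18)
        self.mu = nn.Parameter(torch.tensor(0.5))

    def forward(self, o_recon, o_con):
        # o_recon: (B, T, dim) RPM output; o_con: (B, T, dim) CPM autoregressive output
        B, T, D = o_recon.shape
        h_p = o_recon.new_zeros(B, D)
        h_q = o_con.new_zeros(B, D)
        fused = []
        for t in range(T):                          # step the gating over time
            h_p = self.gru(o_recon[:, t], h_p)      # feature selection on the RPM branch
            h_q = self.gru(o_con[:, t], h_q)        # feature selection on the CPM branch
            fused.append(self.eta * h_p + self.mu * h_q)   # O_GFF = eta*h_p + mu*h_q
        return torch.stack(fused, dim=1)            # (B, T, dim) fused speech representation
```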
Example 2
On the basis of embodiment 1, this embodiment 2 provides a voice recognition device based on a dual-flow self-supervision network. The device corresponds to the voice recognition method based on the dual-flow self-supervision network described above, and comprises:
The acoustic feature and model acquisition module is used for acquiring target acoustic features and a pre-trained voice recognition model; the pre-trained voice recognition model comprises a first sub-model and a second sub-model, the first sub-model comprises a coding and quantizing module, a reconstruction prediction module and a comparison prediction module, the comparison prediction module comprises a feature fusion sub-module, and the second sub-model comprises a CTC module;
the coding and quantizing module is used for coding and quantizing the target acoustic feature by utilizing the coding and quantizing module to obtain a voice vector;
the reconstruction and comparison module is used for carrying out reconstruction prediction on the voice vector by utilizing the reconstruction prediction module to obtain a first voice representation; simultaneously, predicting the voice vector by utilizing an autoregressive model in the contrast prediction module to obtain a second voice representation;
the fusion module is used for fusing the first voice representation and the second voice representation by utilizing the characteristic fusion sub-module to obtain a fused voice representation;
and the classification module is used for identifying the fused voice representation by combining the first sub-model with a connection time sequence classifier in the CTC module based on the target acoustic characteristics to obtain a transcribed text.
Specific details refer to the description of the voice recognition method based on the dual-flow self-supervision network, and are not repeated here.
Example 3
Embodiment 3 of the invention provides an electronic device comprising a memory and a processor that communicate with each other, wherein the memory stores program instructions executable by the processor, and the processor calls the program instructions to execute the voice recognition method based on the double-flow self-supervision network, the method comprising the following steps:
acquiring target acoustic characteristics and a pre-trained voice recognition model; the pre-trained voice recognition model comprises a first sub-model and a second sub-model, the first sub-model comprises a coding and quantizing module, a reconstruction prediction module and a comparison prediction module, the comparison prediction module comprises a feature fusion sub-module, and the second sub-model comprises a CTC module;
encoding and quantizing the target acoustic feature by using the encoding and quantizing module to obtain a voice vector;
carrying out reconstruction prediction on the voice vector by utilizing the reconstruction prediction module to obtain a first voice representation; simultaneously, predicting the voice vector by utilizing an autoregressive model in the contrast prediction module to obtain a second voice representation;
Fusing the first voice representation and the second voice representation by utilizing the characteristic fusion submodule to obtain a fused voice representation;
and based on the target acoustic characteristics, combining the first sub-model with a connection time sequence classifier in the CTC module to recognize the fused voice representation, and obtaining a transcribed text.
Example 4
Embodiment 4 of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements a method for voice recognition based on a dual-flow self-supervision network, the method comprising the following flow steps:
acquiring target acoustic characteristics and a pre-trained voice recognition model; the pre-trained voice recognition model comprises a first sub-model and a second sub-model, the first sub-model comprises a coding and quantizing module, a reconstruction prediction module and a comparison prediction module, the comparison prediction module comprises a feature fusion sub-module, and the second sub-model comprises a CTC module;
encoding and quantizing the target acoustic feature by using the encoding and quantizing module to obtain a voice vector;
carrying out reconstruction prediction on the voice vector by utilizing the reconstruction prediction module to obtain a first voice representation; simultaneously, predicting the voice vector by utilizing an autoregressive model in the contrast prediction module to obtain a second voice representation;
Fusing the first voice representation and the second voice representation by utilizing the characteristic fusion submodule to obtain a fused voice representation;
and based on the target acoustic characteristics, combining the first sub-model with a connection time sequence classifier in the CTC module to recognize the fused voice representation, and obtaining a transcribed text.
In summary, in the voice recognition method based on the dual-stream self-supervision network provided by the embodiment of the invention, a dual-channel structure is designed after the encoding and quantization module by combining the reconstruction prediction module (Reconstruction Prediction Module, RPM) and the contrastive prediction module (Contrastive Prediction Module, CPM) in parallel. Reconstruction prediction serves as an auxiliary task for contrastive prediction, and the two branches predict the speech frames of the speech vectors separately, so that, by modeling the relations among different speech representations, the model captures the difference information of different features while attending to detailed context information. In addition, in order to make effective use of the two-channel speech representation, the speech representations of the two branches are fused by the feature fusion sub-module, which adaptively fuses them through a parameter learning strategy and uses weights to control the exposure of the various speech features. Finally, the dual-stream self-supervised learning network provided by the invention can well initialize the weights of the ASR model. Compared with other self-supervised learning methods, the voice recognition method provided by the invention achieves competitive recognition accuracy; moreover, in the limited labeled data scenario, it is comparable to the most advanced self-supervised learning methods.
Those of ordinary skill in the art will appreciate that: the drawing is a schematic diagram of one embodiment and the modules or flows in the drawing are not necessarily required to practice the invention.
In this specification, the embodiments are described in a progressive manner; identical and similar parts of the embodiments may be referred to each other, and each embodiment mainly describes its differences from the other embodiments. In particular, for the apparatus embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to the corresponding description of the method embodiments. The method and apparatus embodiments described above are merely illustrative, in which units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (9)

1. A voice recognition method based on a dual-flow self-supervision network, comprising:
acquiring target acoustic characteristics and a pre-trained voice recognition model; the pre-trained voice recognition model comprises a first sub-model and a second sub-model, the first sub-model comprises a coding and quantizing module, a reconstruction prediction module and a comparison prediction module, the comparison prediction module comprises a feature fusion sub-module, and the second sub-model comprises a CTC module;
encoding and quantizing the target acoustic feature by using the encoding and quantizing module to obtain a voice vector;
carrying out reconstruction prediction on the voice vector by utilizing the reconstruction prediction module to obtain a first voice representation; simultaneously, predicting the voice vector by utilizing an autoregressive model in the contrast prediction module to obtain a second voice representation;
Fusing the first voice representation and the second voice representation by utilizing the characteristic fusion submodule to obtain a fused voice representation;
and based on the target acoustic characteristics, combining the first sub-model with a connection time sequence classifier in the CTC module to recognize the fused voice representation, and obtaining a transcribed text.
2. The voice recognition method based on the double-flow self-supervision network according to claim 1, wherein the feature fusion submodule comprises a gating circulation unit and an adaptive fusion layer;
accordingly, the fusing the first voice representation and the second voice representation by using the feature fusion sub-module to obtain a fused voice representation, including:
respectively carrying out feature selection on the first voice representation and the second voice representation by using the gating circulating unit, and correspondingly obtaining a first selected feature and a second selected feature;
and carrying out self-adaptive fusion on the first selected feature and the second selected feature by using the self-adaptive fusion layer.
3. The method for voice recognition based on a dual-stream self-supervision network according to claim 1, wherein the pre-trained voice recognition model is trained by:
Acquiring an acoustic characteristic sample and a pre-constructed voice recognition model;
inputting the acoustic feature sample into the pre-constructed speech recognition model;
calculating to obtain reconstruction loss based on the first voice representation output by the reconstruction prediction module and the acoustic feature sample;
obtaining a contrast loss based on the fused voice representation output by the feature fusion submodule and the acoustic feature sample calculation;
calculating to obtain diversity loss based on codebook information of the acoustic feature samples;
performing iterative updating on initial network parameters in the coding and quantizing module, the reconstruction prediction module and the comparison prediction module according to the reconstruction loss, the comparison loss and the diversity loss to obtain updated network parameters in the coding and quantizing module, the reconstruction prediction module and the comparison prediction module;
taking the updated network parameters as voice characterization extracted by a feature extractor of the CTC module, and training and decoding the CTC module based on the acoustic feature sample and the labeling data so as to obtain a trained voice recognition model;
or, iteratively updating the randomly initialized network parameters in the coding and quantizing module, the reconstruction prediction module, the comparison prediction module and the CTC module according to the reconstruction loss, the comparison loss and the diversity loss, so as to obtain a trained speech recognition model.
4. The method for voice recognition based on a dual-stream self-supervision network according to claim 1, wherein the encoding and quantizing module comprises an encoder and a vector quantizing layer, wherein the encoder is obtained based on a Conformer network;
accordingly, the encoding and quantizing the target acoustic feature by using the encoding and quantizing module to obtain a speech vector includes:
encoding the target acoustic feature with the encoder to obtain a potential speech representation;
discretizing the potential speech representation by the vector quantization layer to obtain the speech vector.
5. The dual-stream self-supervision network based speech recognition method of claim 4, wherein the encoder comprises multiple layers of Conformer, each layer of Conformer comprising:
the system comprises a first feedforward layer, a first residual error and standardization module, a multi-head self-attention layer, a second residual error and standardization module, a convolution module, a third residual error and standardization module, a second feedforward layer, a fourth residual error and standardization module and a Layernorm layer which are connected in sequence; the first residual error and the second residual error and the standardized module, the second residual error and the standardized module and the third residual error and the standardized module, and the third residual error and the standardized module and the fourth residual error and the standardized module are in residual error connection.
6. The method for voice recognition based on a dual-stream self-supervision network according to any one of claims 1-5, wherein the pre-trained voice recognition model further comprises a random masking module;
accordingly, after the acquisition of the target acoustic features, the method further comprises:
performing time random masking and frequency random masking processing on the target acoustic features by using the random masking module to obtain target masked acoustic features;
and the encoding and quantizing module is used for encoding and quantizing the target mask acoustic features to obtain a voice vector.
7. A voice recognition device based on a dual-stream self-supervision network, comprising:
the acoustic feature and model acquisition module is used for acquiring target acoustic features and a pre-trained voice recognition model; the pre-trained voice recognition model comprises a first sub-model and a second sub-model, the first sub-model comprises a coding and quantizing module, a reconstruction prediction module and a comparison prediction module, the comparison prediction module comprises a feature fusion sub-module, and the second sub-model comprises a CTC module;
the coding and quantizing module is used for coding and quantizing the target acoustic feature by utilizing the coding and quantizing module to obtain a voice vector;
The reconstruction and comparison module is used for carrying out reconstruction prediction on the voice vector by utilizing the reconstruction prediction module to obtain a first voice representation; simultaneously, predicting the voice vector by utilizing an autoregressive model in the contrast prediction module to obtain a second voice representation;
the fusion module is used for fusing the first voice representation and the second voice representation by utilizing the characteristic fusion sub-module to obtain a fused voice representation;
and the classification module is used for identifying the fused voice representation by combining the first sub-model with a connection time sequence classifier in the CTC module based on the target acoustic characteristics to obtain a transcribed text.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the voice recognition method based on a dual-stream self-supervision network according to any one of claims 1 to 6 when executing the program.
9. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the voice recognition method based on a dual-stream self-supervision network according to any one of claims 1 to 6.
CN202310874348.2A 2023-07-17 2023-07-17 Voice recognition method, device, equipment and medium based on double-flow self-supervision network Active CN116863920B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant