CN110246487B - Optimization method and system for single-channel speech recognition model - Google Patents


Publication number
CN110246487B
CN110246487B (Application CN201910511791.7A)
Authority
CN
China
Prior art keywords
voice
person
model
output
mixed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910511791.7A
Other languages
Chinese (zh)
Other versions
CN110246487A
Inventor
钱彦旻
张王优
常煊恺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AI Speech Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN201910511791.7A
Publication of CN110246487A
Application granted
Publication of CN110246487B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks


Abstract

The embodiment of the invention provides an optimization method for a single-channel speech recognition model. The method comprises the following steps: receiving single-person voices with real label vectors and multi-person mixed voice, and inputting voice features extracted from the single-person voices into a target teacher model to obtain target soft label vectors corresponding to the single-person voices; inputting the multi-person mixed voice into an end-to-end student model and determining the output arrangement; determining the knowledge distillation loss and the direct loss from the output label vector of each person in the multi-person mixed voice with the determined output arrangement; and when the joint error determined from the knowledge distillation loss and the direct loss does not converge, optimizing the end-to-end student model according to the joint error. The embodiment of the invention also provides an optimization system for the single-channel speech recognition model. With the embodiment of the invention, good parameters can be learned more easily and the model is more compact, so that the trained student model achieves better performance.

Description

Optimization method and system for single-channel speech recognition model
Technical Field
The invention relates to the field of voice recognition, in particular to an optimization method and system for a single-channel voice recognition model.
Background
With the development of intelligent speech, more and more devices have a speech recognition function. Because of their different usage scenarios, some devices are equipped with only a single microphone while others are equipped with multiple microphones, i.e., so-called single-channel and multi-channel devices. With only a single microphone, single-channel devices show poor recognition performance when processing cocktail-party-like conversations in which multiple people speak simultaneously and their voices are mixed together. For this purpose, two approaches are generally used for training: a knowledge distillation method for single-channel multi-speaker speech recognition based on a bidirectional long short-term memory recurrent neural network, or an end-to-end single-channel multi-speaker speech recognition system.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the related art:
The knowledge distillation method for single-channel multi-speaker speech recognition based on a bidirectional long short-term memory recurrent neural network uses a traditional model, and its training process is more complex and cumbersome than that of an end-to-end model. As for the end-to-end single-channel multi-speaker speech recognition system: because the speech signals of multiple speakers are present at the same time, the model can only use information from the mixed speech and lacks the speech information of individual speakers during training, so a good result is difficult to obtain and the performance gap compared with a single-speaker speech recognition system is large.
Disclosure of Invention
Embodiments of the invention at least solve the problems in the prior art that the traditional model is complex, the training process is cumbersome, and the training effect and performance are poor.
In a first aspect, an embodiment of the present invention provides an optimization method for a single-channel speech recognition model, including:
receiving single voice with real label vectors and multi-person mixed voice synthesized by the single voice, and respectively inputting voice characteristics extracted from the single voice into a target teacher model to obtain target soft label vectors corresponding to the single voice;
inputting the multi-person mixed voice into an end-to-end student model, outputting output label vectors of each person in the multi-person mixed voice, pairing the output label vectors of each person in the multi-person mixed voice with real label vectors of each single voice through a permutation invariance method (PIT), and determining output arrangement of the output label vectors of each person in the multi-person mixed voice;
determining knowledge distillation loss with each target soft tag vector and direct loss with each single voice real tag vector according to the output tag vector of each person in the multi-person mixed voice which is determined to be output and arranged after pairing;
and when the joint error determined according to the knowledge distillation loss and the direct loss is not converged, performing back propagation on the end-to-end student model according to the joint error to update the end-to-end student model until the joint error is converged, and determining an optimized voice recognition student model for a single channel.
In a second aspect, an embodiment of the present invention provides an optimization system for a single-channel speech recognition model, including:
the target soft label determining program module is used for receiving single voices with real label vectors and multi-person mixed voices synthesized by the single voices, and respectively inputting voice characteristics extracted from the single voices into a target teacher model to obtain target soft label vectors corresponding to the single voices;
an output arrangement determining program module, configured to input the multi-person mixed speech into an end-to-end student model, output an output tag vector of each person in the multi-person mixed speech, pair the output tag vector of each person in the multi-person mixed speech with a real tag vector of each single speech by a permutation invariance method (PIT), and determine an output arrangement of the output tag vector of each person in the multi-person mixed speech;
a loss determination program module for determining a knowledge distillation loss with each target soft tag vector and a direct loss with each single voice true tag vector according to an output tag vector of each person in the multi-person mixed voice which is determined to be output and arranged after pairing;
and the optimization program module is used for reversely propagating the end-to-end student model according to the joint error when the joint error determined according to the knowledge distillation loss and the direct loss is not converged so as to update the end-to-end student model until the joint error is converged and determine the optimized voice recognition student model for the single channel.
In a third aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method for optimizing a speech recognition model for a single channel of any of the embodiments of the present invention.
In a fourth aspect, an embodiment of the present invention provides a storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps of the optimization method for a single-channel speech recognition model according to any one of the embodiments of the present invention.
The embodiment of the invention has the beneficial effects that: the output of the teacher model trained on the single-speaker corpus is used as the target training label, and the voice information of individual speakers is incorporated during training, so the soft labels can provide more information, the student model can learn good parameters more easily, the model is simplified, and the better parameters give the trained student model better performance. In addition, a curriculum learning strategy is adopted: the training data are ordered according to the speakers' signal-to-noise ratio (SNR), so the information in the data is better utilized and the performance of the model is further improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart of a method for optimizing a speech recognition model for a single channel according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an end-to-end single-channel multi-speaker speech recognition model architecture based on knowledge distillation for a method for optimizing a single-channel speech recognition model according to an embodiment of the present invention;
FIG. 3 is a data diagram of a comparison list of end-to-end multi-speaker joint CTC/attention-based encoder-decoder systems for an optimization method for a single-channel speech recognition model according to an embodiment of the present invention;
FIG. 4 is a data diagram listing the performance (average CER & WER) of different curriculum learning strategies on the 2-speaker mixed WSJ0 corpus test dataset for an optimization method for a single-channel speech recognition model according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an optimization system for a single-channel speech recognition model according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of an optimization method for a single-channel speech recognition model according to an embodiment of the present invention, which includes the following steps:
s11: receiving single voice with real label vectors and multi-person mixed voice synthesized by the single voice, and respectively inputting voice characteristics extracted from the single voice into a target teacher model to obtain target soft label vectors corresponding to the single voice;
s12: inputting the multi-person mixed voice into an end-to-end student model, outputting output label vectors of each person in the multi-person mixed voice, pairing the output label vectors of each person in the multi-person mixed voice with real label vectors of each single voice through a permutation invariance method (PIT), and determining output arrangement of the output label vectors of each person in the multi-person mixed voice;
s13: determining knowledge distillation loss with each target soft tag vector and direct loss with each single voice real tag vector according to the output tag vector of each person in the multi-person mixed voice which is determined to be output and arranged after pairing;
s14: and when the joint error determined according to the knowledge distillation loss and the direct loss is not converged, performing back propagation on the end-to-end student model according to the joint error to update the end-to-end student model until the joint error is converged, and determining an optimized voice recognition student model for a single channel.
In this embodiment, note that existing methods generally do not use a teacher model; during training, only the error between the student model's output label vectors and the real label vectors is computed. The present method introduces a teacher model. Teacher models are commonly used in knowledge distillation, where the knowledge of a powerful, well-performing teacher model is transferred to a more compact student model. Although a student model trained directly in a supervised manner cannot match the teacher model, through knowledge distillation the prediction capability of the student model comes closer to that of the teacher model.
For step S11, in order to optimize the recognition effect of the student speech recognition model, a target teacher model to be learned from is first determined, where the target teacher model may be a teacher model trained in advance. Training also requires certain training data, including: some single-person voices with real label vectors, and multi-person mixed voices synthesized from these single-person voices. The labels can be understood as the text corresponding to the speech, mapped through a dictionary so that they can be processed by a computer. The single-person voices are respectively input into the target teacher model to obtain the corresponding target soft label vectors; these soft label vectors contain supplementary information hidden by the overlapping speech as well as the insight of the single-speaker model.
For step S12, the multi-person mixed speech determined in step S11 is input into the end-to-end student model to be trained, which outputs the output label vector of each person in the multi-person mixed speech. The output label vectors of each person are then paired with the real label vectors of each single-person voice through the permutation invariance method (PIT), an algorithm for solving the problem of pairing several predicted labels (output labels) with several real labels. In this example, when the model processes mixed speech, it outputs labels corresponding to the speech of several speakers, but during training the error between each output label and its corresponding real label must be computed. For example, with labels corresponding to 2 speakers, it is not known which speaker each of the model's 2 output labels actually corresponds to (if the two predicted label vectors are P1 and P2 and the real labels are Y1 and Y2, it is not known whether the pairing should be P1-Y1, P2-Y2 or P1-Y2, P2-Y1); the permutation invariance method is therefore employed to assist the pairing. From this pairing, the output arrangement of the output label vectors of each person in the multi-person mixed speech is determined.
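For illustration only, the following Python/PyTorch sketch shows one common way such PIT pairing can be realized (here scored with CTC losses, as this document does later); the tensor shapes, list layout and use of torch.nn.CTCLoss are assumptions, not the patent's implementation.

```python
# Hypothetical PIT pairing sketch (not the patent's code): try every permutation
# of the reference labels, score each assignment with CTC loss, keep the best.
import itertools
import torch

ctc = torch.nn.CTCLoss(blank=0, zero_infinity=True)

def pit_best_permutation(log_probs, refs, in_lens, ref_lens):
    """log_probs: list of S tensors of shape (T, N, V) with per-speaker log-probs;
    refs: list of S reference label tensors. Returns (best permutation, its loss)."""
    S = len(log_probs)
    best_perm, best_loss = None, None
    for perm in itertools.permutations(range(S)):
        # total CTC loss when output stream s is paired with reference perm[s]
        loss = sum(ctc(log_probs[s], refs[perm[s]], in_lens[s], ref_lens[perm[s]])
                   for s in range(S))
        if best_loss is None or loss < best_loss:
            best_perm, best_loss = perm, loss
    return best_perm, best_loss
```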
For step S13, from the output label vectors of each person in the multi-person mixed speech, with the output arrangement determined by the pairing in step S12, the knowledge distillation loss with respect to each target soft label vector and the direct loss with respect to each single-person voice's real label vector are determined respectively. In the optimization process, not only the direct loss produced by the real label vectors, as in the prior art, but also the knowledge distillation loss produced by the teacher model is considered, so that losses from several aspects are taken into account across multiple dimensions.
For step S14, when the joint error determined from the knowledge distillation loss and the direct loss of step S13 has not converged, the computed error is propagated back to each network layer before the output through the back-propagation algorithm (a common algorithm in machine learning) to update the network parameters. This parameter-update process is the training, and it continues until the joint error converges, yielding an optimized student model for single-channel speech recognition.
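A condensed, hypothetical view of the optimization loop of steps S11-S14 might look as follows; `teacher`, `student`, `kd_loss` and `direct_loss` are placeholder objects introduced only for this sketch, and the weighting and convergence test are illustrative rather than the patent's implementation.

```python
# Hypothetical training loop for steps S11-S14 (illustrative only).
import torch

def optimize_student(student, teacher, loader, eta=0.5, tol=1e-4, max_epochs=50):
    opt = torch.optim.Adadelta(student.parameters(), rho=0.95, eps=1e-8)
    prev = float("inf")
    for _ in range(max_epochs):
        total = 0.0
        for mixed_feats, single_feats, true_labels in loader:
            with torch.no_grad():
                soft_targets = [teacher(f) for f in single_feats]   # target soft labels
            outputs, perm = student(mixed_feats, true_labels)       # PIT fixes the pairing
            kd = student.kd_loss(outputs, soft_targets, perm)       # vs. teacher soft labels
            ce = student.direct_loss(outputs, true_labels, perm)    # vs. real labels
            joint = eta * kd + (1.0 - eta) * ce                     # joint error
            opt.zero_grad()
            joint.backward()
            opt.step()
            total += float(joint)
        if abs(prev - total) < tol:                                 # crude convergence test
            break
        prev = total
    return student
```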
It can be seen from this embodiment that the output of the teacher model trained on the single-speaker corpus is used as the target training label. Such soft labels provide more information, so the student model can learn good parameters more easily, the model is more compact, and the better parameters give the trained student model better performance.
As an embodiment, in this embodiment, the inputting the multi-person mixed voice into the end-to-end student model includes:
carrying out feature projection on the voice features of the multi-person mixed voice through a trained neural network in the end-to-end student model, and dividing out the voice features of each person in the multi-person mixed voice;
determining, by an encoder within the end-to-end student model, a corresponding Connectionist Temporal Classification (CTC) score based on the voice characteristics of each person;
transforming, by a decoder within the end-to-end student model, the feature arrangement corresponding to the minimum Connectionist Temporal Classification (CTC) score into a corresponding output label vector; and mapping the label vectors through a dictionary to obtain a corresponding text sequence.
In this embodiment, in the optimization training phase, based on the voice features of each person, the feature permutation combinations corresponding to the teacher model are determined by the encoder in the end-to-end student model, the Connectionist Temporal Classification (CTC) score set corresponding to each feature permutation combination is further determined, and the feature permutation with the minimum total score among the permutation combinations is determined by the permutation invariance method. The feature arrangement corresponding to the minimum total CTC score is then converted into a corresponding output label vector by the decoder in the end-to-end student model.
In the recognition stage, neither the teacher model nor permutation invariance training is needed; the results determined by the decoder are directly arranged in order, and the corresponding decoding result is determined through the calculated score.
According to the embodiment, the characteristic arrangement corresponding to the minimum score is determined through the permutation invariance training, so that the error in the recognition can be reduced to the minimum, and the recognition effect is improved.
As an implementation manner, in this embodiment, after the dividing out the voice feature of each person in the multi-person mixed voice, the method further includes:
through an attention module in the end-to-end student model, performing further feature extraction on the voice features of each person in the multi-person mixed voice and determining a corresponding attention score, so that the multi-person mixed voice is temporally aligned with the individual output label vectors.
In the present embodiment, the attention score is calculated by, after the permutation invariance training, first rearranging the intermediate representation of each person output by the encoder in accordance with the output arrangement obtained by the permutation invariance training, and then calculating the attention score between the intermediate representation corresponding to each speaker and the intermediate representation of the corresponding teacher model.
Through the embodiment, the attention module further extracts the voice features in order to solve the problem that the output text and the input audio in the end-to-end voice recognition system are not aligned in time, so that the recognition effect of the voice recognition model is improved.
As an implementation manner, in this embodiment, after the determining the corresponding attention score, the method further includes:
determining a joint score of each feature arrangement by weighting the Connectionist Temporal Classification (CTC) score and the attention score corresponding to each feature arrangement according to a preset recognition mode;
and converting the feature arrangement with the minimum score among the joint scores into the corresponding output label vectors.
In the present embodiment, different recognition modes have different training directions, and different directions correspond to different weighting factors. Different weighting ratios are determined according to the preset recognition mode, and the CTC score and the attention score are then combined by weighted calculation. For example, when the CTC score of a certain feature arrangement is 2.34, the attention score is 3.22, and the two weights are equal, the score of that feature arrangement is 2.78. After the final score corresponding to each feature arrangement is determined, the feature arrangement with the smallest score, i.e., the feature arrangement with the smallest error, is selected and converted into the corresponding output label vector.
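A toy sketch of this weighted joint scoring, reproducing the numerical example above under the assumption of equal weights, could look like this:

```python
# Toy illustration of the weighted joint score; 0.5/0.5 mirrors the example above.
def joint_score(ctc_score, att_score, ctc_weight=0.5):
    return ctc_weight * ctc_score + (1.0 - ctc_weight) * att_score

print(joint_score(2.34, 3.22))  # ~2.78, matching the numbers in the text
```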
According to the embodiment, various requirements for optimizing the speech recognition model are met by adjusting the weighted ratio, and the recognition effect of the speech recognition model is further improved.
As an embodiment, in this embodiment, after feature projection of the voice features of the multi-person mixed voice by the trained neural network in the end-to-end student model, the method further includes:
and acquiring relevant information of each person in the multi-person mixed voice through a speaker adaptation module newly added to the end-to-end student model, so as to determine the voice characteristics of each person and additionally determine the context variable of each person.
In this embodiment, a sequence summary network is added before the encoder corresponding to each speaker in the end-to-end student model. Its input is the output of the mixed-speech encoder; this output is transformed by a projection to the same dimension as the input and then multiplied with the original input to form a new feature, which is then fed to the encoder corresponding to each person. This is shown in Fig. 2: for example, encoder 1 and encoder 2 in the knowledge-distillation-based end-to-end single-channel multi-speaker speech recognition model architecture of Fig. 2 are each preceded by a sequence summary network with the same structure.
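A minimal sketch of such a sequence-summary adaptation block is given below, under the assumptions stated in the comments (in particular that the summary is averaged over time before scaling the features, which is not specified explicitly above).

```python
# Hypothetical speaker-adaptation block: a sequence summary of the mixture
# encoder output is projected to the input feature dimension and multiplied
# with the original features. Averaging over time is an assumption here.
import torch.nn as nn

class SequenceSummary(nn.Module):
    def __init__(self, enc_dim, feat_dim):
        super().__init__()
        self.proj = nn.Linear(enc_dim, feat_dim)   # project to the input dimension

    def forward(self, mix_enc_out, feats):
        # mix_enc_out: (B, T, enc_dim) output of the mixture encoder
        # feats:       (B, T, feat_dim) original input features
        summary = self.proj(mix_enc_out).mean(dim=1, keepdim=True)  # utterance summary
        return feats * summary                     # scaled features fed to each encoder
```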
Through the embodiment, the training process is completely consistent with the previous training process, and the newly added module learns the information related to each person in the training process, so that the context variable containing the information of each person can be output, and more data improve the recognition effect of the voice recognition model.
As an implementation manner, in this embodiment, after the receiving the single-person voices with the real tag vectors, the method further includes:
determining the signal-to-noise ratio of each single voice through the voice and background noise of the person in each single voice;
and sequencing the multi-person mixed voice data according to the magnitude of the signal-to-noise ratio so as to achieve the progressive optimization of the voice recognition model.
In this embodiment, the snr is a logarithmic value of a ratio of powers of human voice and background noise in the voice signal, and a magnitude thereof represents a relative strength of the voice, and a larger value indicates that the noise is relatively weaker, thereby allowing the voice to be recognized more easily.
The embodiment shows that the training data are sorted according to the signal-to-noise ratio of the speaker, so that the learning process of the speaker is simulated in the training process, namely, the difficulty is gradually increased from a simple sample, and the progressive effect is achieved, so that a better training effect is realized.
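As an illustrative sketch, and assuming separate access to the person's voice and the background noise of each utterance, the SNR computation and ordering could be written as:

```python
# Illustrative SNR (log power ratio of the person's voice to background noise)
# and ascending ordering of the training utterances by that value.
import numpy as np

def snr_db(speech, noise, eps=1e-12):
    p_speech = np.mean(np.asarray(speech, dtype=np.float64) ** 2)
    p_noise = np.mean(np.asarray(noise, dtype=np.float64) ** 2)
    return 10.0 * np.log10((p_speech + eps) / (p_noise + eps))

def sort_by_snr(utterances):
    # utterances: list of dicts with "speech" and "noise" waveforms (assumed layout)
    return sorted(utterances, key=lambda u: snr_db(u["speech"], u["noise"]))
```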
As an embodiment, determining the joint error from the knowledge distillation loss and the direct loss comprises:
and weighting and summing the knowledge distillation loss and the direct loss according to a preset training mode to determine a joint error.
In order to meet different recognition requirements, different training modes can be set according to different use environments in the training process. And then training the speech recognition models meeting different requirements through different weighting ratios.
According to the embodiment, different training modes are set, and in the training process, joint errors of knowledge distillation loss and direct loss are determined according to different weighting ratios, so that the recognition environments with different requirements are met. And further improves the recognition effect of the voice recognition model.
In a further embodiment, the method uses an end-to-end speech recognition model based on a joint CTC/attention encoder-decoder. The advantage of this model is that it uses CTC as an auxiliary task to enhance the alignment capability of the attention-based encoder-decoder. The model is then modified to accommodate the multi-speaker scenario by introducing a separation stage in the encoder. The input speech mixture is first explicitly divided in the encoder into a plurality of vector sequences, each representing one speaker source. These sequences are fed to the decoder to calculate conditional probabilities.
O represents the input speech mix of S speakers. The encoder consists of three stages:
Encoder-Mix ($\mathrm{Encoder}_{\mathrm{Mix}}$), Encoder-SD ($\mathrm{Encoder}_{\mathrm{SD}}$), and Encoder-Rec ($\mathrm{Encoder}_{\mathrm{Rec}}$).
Encoder-Mix: a mixture encoder that encodes O into an intermediate representation H, which is then processed by S independent speaker-dependent (SD) encoders;
Encoder-SD: produces S outputs $H^{s}$ (s = 1, …, S), each corresponding to the representation of one speaker. In the final stage, for each stream s (s = 1, …, S),
Encoder-Rec converts the feature sequence $H^{s}$ into a high-level representation $G^{s}$.
The encoder may be written as follows:
$$H = \mathrm{Encoder}_{\mathrm{Mix}}(O)$$
$$H^{s} = \mathrm{Encoder}_{\mathrm{SD}}^{s}(H), \quad s = 1, \ldots, S$$
$$G^{s} = \mathrm{Encoder}_{\mathrm{Rec}}(H^{s}), \quad s = 1, \ldots, S$$
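A hypothetical PyTorch rendering of this three-stage encoder, with the concrete sub-encoders left as placeholders (e.g. CNN/BLSTMP stacks), could be:

```python
# Hypothetical forward pass mirroring the three encoder stages above.
import torch.nn as nn

class MultiSpeakerEncoder(nn.Module):
    def __init__(self, enc_mix, enc_sd_list, enc_rec):
        super().__init__()
        self.enc_mix = enc_mix                    # Encoder_Mix, shared
        self.enc_sd = nn.ModuleList(enc_sd_list)  # one Encoder_SD per speaker
        self.enc_rec = enc_rec                    # Encoder_Rec, shared

    def forward(self, O):
        H = self.enc_mix(O)                       # intermediate mixture representation
        H_s = [enc(H) for enc in self.enc_sd]     # speaker-dependent separation
        return [self.enc_rec(h) for h in H_s]     # high-level representations G^s
```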
The CTC objective function is connected after the encoder, which has two advantages. The first is to train the encoder of the sequence-to-sequence model with an auxiliary task. The second is that, in the multi-speaker case, the CTC objective function is used to perform permutation-free training, also called PIT (permutation invariant training), as shown in the following formula.
$$\hat{\pi} = \operatorname*{arg\,min}_{\pi} \sum_{s=1}^{S} \mathrm{Loss}_{\mathrm{ctc}}\!\left(Y^{s}, R^{\pi(s)}\right)$$
where $Y^{s}$ is the output sequence variable computed from the representation $G^{s}$, $\pi(s)$ is the s-th element of a permutation $\pi$ of {1, …, S}, and R are the reference labels of the S speakers. Thereafter, the permutation $\hat{\pi}$ with minimal CTC loss is used for the reference labels in the attention-based decoder in order to reduce computational cost.
The attention-based decoder network decodes each stream $G^{s}$ and generates a corresponding output label sequence $Y^{s}$. For each pair of representation and reference label index $(G^{s}, R^{\hat{\pi}(s)})$, the decoding process is described by the following equations:
$$c_{n}^{s} = \mathrm{Attention}\!\left(e_{n-1}^{s}, G^{s}\right)$$
$$e_{n}^{s} = \mathrm{Update}\!\left(e_{n-1}^{s}, c_{n-1}^{s}, y_{n-1}\right)$$
$$y_{n}^{s} \sim \mathrm{Decoder}\!\left(c_{n}^{s}, y_{n-1}\right)$$
$$p_{\mathrm{att}}\!\left(Y^{s} \mid G^{s}\right) = \prod_{n} p_{\mathrm{att}}\!\left(y_{n}^{s} \mid G^{s}, y_{1:n-1}\right)$$
where $c_{n}^{s}$ denotes a context vector, $e_{n}^{s}$ is a hidden state of the decoder, and $y_{n}$ is the n-th element in the reference label sequence. During training, the reference label $y_{n}^{\hat{\pi}(s)}$ in R is used as the teacher-forcing history instead of the predicted sequence history $y_{1:n-1}^{s}$ in the $p_{\mathrm{att}}$ formula.
The target label sequence predicted by the attention-based encoder-decoder is defined as $Y = \{y_{1}, \ldots, y_{N}\}$, where Y denotes a sequence and a subscripted y denotes an element of Y: $y_{n}$ denotes the n-th element in Y and $y_{n-1}$ the (n-1)-th element; correspondingly, $y_{1:n-1}$ denotes the 1st through (n-1)-th elements of Y. At the n-th time step in the $p_{\mathrm{att}}$ equation, the probability of $y_{n}$ depends on the previous sequence $y_{1:n-1}$. The final loss function is defined as:
$$\mathcal{L} = \lambda\, \mathcal{L}_{\mathrm{ctc}} + (1-\lambda)\, \mathcal{L}_{\mathrm{att}}$$
$$\mathcal{L}_{\mathrm{ctc}} = \sum_{s=1}^{S} \mathrm{Loss}_{\mathrm{ctc}}\!\left(Y^{s}, R^{\hat{\pi}(s)}\right)$$
$$\mathcal{L}_{\mathrm{att}} = \sum_{s=1}^{S} \mathrm{Loss}_{\mathrm{att}}\!\left(Y^{s}, R^{\hat{\pi}(s)}\right)$$
where $\lambda$ is an interpolation factor with $0 \leq \lambda \leq 1$.
The modification to the attention-based decoder is called speaker-parallel attention. The motivation is to compensate for the separation capability of the encoder and improve the separation performance of the model. The idea is to filter out noisy information through the selective property of attention, with an individual attention module for each stream:
$$c_{n}^{s} = \mathrm{Attention}^{s}\!\left(e_{n-1}^{s}, G^{s}\right)$$
it is claimed that soft targets can provide additional useful information, and thus better performance, than hard targets used in the cross-entropy criterion. This approach can also be used to improve the accuracy of attention-based decoder networks in multi-voice speech recognition tasks. To obtain soft tag vectors, the speech of a single speaker is passed through a model trained on speech containing only one speaker in parallel. The soft tag vector contains supplementary information hidden by overlapping voices as well as insight into the single speaker model with better modeling capabilities.
The model architecture is shown in Fig. 2. The mixed speech and the corresponding single-speaker speech are denoted as O and $O^{s}$ (s = 1, …, S), respectively. The end-to-end teacher model takes the source speech $O^{s}$ as input to compute the teacher's output for each step in the target sequence, and the corresponding output, denoted $\tilde{Y}^{s}$, is taken as the target distribution of the student model. The loss function of teacher-student learning can thus be expressed as follows:
$$\mathcal{L}_{\mathrm{KD}} = \sum_{s=1}^{S} \mathrm{Loss}_{\mathrm{KD}}\!\left(Y^{s}, \tilde{Y}^{\hat{\pi}(s)}\right)$$
where the knowledge distillation loss $\mathrm{Loss}_{\mathrm{KD}}$ following the attention-based decoder is calculated as the cross entropy between the predictions of the student model and the teacher model, and $\hat{\pi}$ is the optimal alignment determined by the CTC loss.
In the method, the attention-based decoder loss $\mathcal{L}_{\mathrm{att}}$ is modified. The new form is a weighted sum of the original cross-entropy (CE) based loss and a KL-divergence based knowledge distillation term, i.e.:
$$\mathcal{L}_{\mathrm{att}} = (1-\eta)\, \mathcal{L}_{\mathrm{att}}^{\mathrm{CE}} + \eta\, \mathcal{L}_{\mathrm{att}}^{\mathrm{KD}}$$
where $\eta$ is the weighting factor.
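An illustrative implementation of this modified loss, assuming per-token student logits and teacher probabilities, might be:

```python
# Illustrative version of the weighted attention-decoder loss above; the use of
# log-softmax with KL divergence is an implementation assumption.
import torch.nn.functional as F

def kd_attention_loss(student_logits, true_labels, teacher_probs, eta=0.5):
    # student_logits: (N, V), true_labels: (N,), teacher_probs: (N, V)
    ce = F.cross_entropy(student_logits, true_labels)               # CE with real labels
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  teacher_probs, reduction="batchmean")             # KD term vs. teacher
    return (1.0 - eta) * ce + eta * kl
```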
In previous approaches, end-to-end multi-speaker ASR systems were trained regardless of the similarity and diversity of the data. However, some studies claim that the order of the data has an effect on the training process, which is referred to as a curriculum learning strategy. It is therefore desirable to find a pattern in the data that makes the training process more stable and improves performance. One observation is that the signal-to-noise ratio (SNR) between overlapping voices has a large impact on separation performance. In utterances with a small SNR, the speech from different speakers is distorted by similar energies. Conversely, a large SNR means the speech is distorted under unbalanced conditions with one dominant voice.
In the present method, attention is paid to the SNR level of the overlapping speech, which is defined as the energy ratio between the source speech of the two speakers. Other factors could also be used, but the method is the same. When generating the mixed speech, the energy ratio is selected randomly to simulate realistic conditions. When the SNR is large, the high-energy speech is clearer, but the lower-energy speech is not recognized well. Conversely, when the SNR is small, each utterance in the mixed speech can be recognized with similar performance, so the model can learn knowledge from each speaker. The training data are rearranged accordingly: at the beginning of training, the minibatches in the training set are iterated in ascending order of speaker SNR, after which training reverts to the randomly ordered training set.
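A minimal sketch of this schedule, assuming minibatches annotated with their speaker SNR, is given below.

```python
# Illustrative curriculum schedule: ascending speaker-SNR order for the first
# epoch(s), then the usual randomly shuffled order. Data layout is assumed.
import random

def curriculum_batches(batches_with_snr, epoch, warmup_epochs=1):
    """batches_with_snr: list of (snr, minibatch) pairs."""
    if epoch < warmup_epochs:
        ordered = sorted(batches_with_snr, key=lambda x: x[0])  # ascending speaker SNR
    else:
        ordered = list(batches_with_snr)
        random.shuffle(ordered)                                  # back to random order
    return [batch for _, batch in ordered]
```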
To verify the effectiveness of the method, a tool published by MERL was used to artificially generate single-channel two-speaker mixed signals based on the Wall Street Journal (WSJ0) speech corpus. Training, development and evaluation data were taken from WSJ0 SI-84, Dev93 and Eval92, respectively, with the following durations: 88.2 hours of training, 1.1 hours of development and 0.9 hours of evaluation.
The input features are 80-dimensional log-Mel filterbank coefficients with a pitch feature for each frame, concatenated with their delta and delta-delta coefficients. All features were extracted using the Kaldi toolkit and normalized to zero mean and unit variance.
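A rough stand-in for this feature pipeline is sketched below; the original used Kaldi, so librosa here is only an illustrative substitute and the pitch feature is omitted.

```python
# Approximate 80-dim log-Mel filterbank + delta + delta-delta features,
# mean/variance normalized (illustrative substitute for the Kaldi pipeline).
import librosa
import numpy as np

def fbank_features(wav, sr=16000):
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=80)
    logmel = librosa.power_to_db(mel)
    d1 = librosa.feature.delta(logmel, order=1)
    d2 = librosa.feature.delta(logmel, order=2)
    feats = np.vstack([logmel, d1, d2]).T               # (frames, 240)
    return (feats - feats.mean(0)) / (feats.std(0) + 1e-8)
```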
In the present method, the different neural network models have the same depth and similar sizes, so their performance is comparable. The encoder consists of two VGG (Visual Geometry Group)-motivated CNN (Convolutional Neural Network) blocks and three BLSTMP (bidirectional long short-term memory with projection) layers, while the decoder network has only one unidirectional long short-term memory (LSTM) layer with 300 units. All networks are built on the ESPnet framework and trained with an AdaDelta optimizer with ρ = 0.95 and ε = 1e-8. During training, the factor λ is set to 0.2.
For teacher-student training, the end-to-end teacher model is first trained on the original clean-speech training data set of WSJ0. In the method, the WER (Word Error Rate) of the teacher model on WSJ0 Dev93 and Eval92 is 8.0% and 2.1%, respectively. The mixed voice data and the corresponding single-speaker voice data are then fed into the teacher-student module at the same time. The best performance is obtained when the weight coefficient η is set to 0.5.
In the decoding phase, the joint CTC/attention scores are combined, in a shallow-fusion fashion, with the scores of a pre-trained word-level RNN language model (RNNLM) with a 1-layer LSTM of 1000 units, trained on the transcripts of WSJ0 SI-84. The beam width is set to 30, the interpolation factor λ used during decoding is 0.3, and the weight of the RNNLM is 1.0.
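An illustrative scoring function for this shallow-fusion decoding, using the quoted weights, might be:

```python
# Illustrative shallow-fusion score for one decoding hypothesis, using the
# weights quoted above (CTC/attention interpolation 0.3, RNNLM weight 1.0).
def hypothesis_score(ctc_logp, att_logp, lm_logp, ctc_weight=0.3, lm_weight=1.0):
    return (ctc_weight * ctc_logp
            + (1.0 - ctc_weight) * att_logp
            + lm_weight * lm_logp)

# During beam search (beam width 30 above), partial hypotheses would be ranked
# by this combined log-probability at each expansion step.
```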
For the teacher-student training and curriculum learning experiments, the baseline end-to-end methods and the performance of the proposed method are first evaluated on the WSJ0 mixed-speech test data sets. The results are presented in the comparison list of end-to-end multi-speaker joint CTC/attention-based encoder-decoder systems shown in Fig. 3. The first approach is a joint CTC/attention-based encoder-decoder network for multi-speaker speech, in which the attention-decoder module is shared between the representations of each speaker. The second approach extends the single attention module to the speaker-parallel attention module. Both methods are considered baseline systems.
Teacher-student training and curriculum learning are then applied step by step. With teacher-student training, the performance of both baseline systems improves on both the development and evaluation data sets. The speaker-parallel attention method achieves an even larger gain, with relative reductions in average WER of 7% and 6% on the dev and eval data sets, respectively. This demonstrates that the speaker-parallel attention method has a stronger ability to eliminate information unrelated to the current individual speaker and can learn better from the attention output distribution of the teacher model. Next, the curriculum learning strategy is applied on top of the teacher-student framework to further improve performance. As can be seen in Fig. 3, the proposed end-to-end approach combining speaker-parallel attention and curriculum learning with teacher-student training significantly improves the performance of two-speaker mixed speech recognition, with relative improvements in WER and CER (Character Error Rate) of over 15%.
To study the influence of the curriculum learning strategy on model performance, different strategies are explored. End-to-end models with teacher-student training and speaker-parallel attention are tested with two different strategies: sorting the training data in ascending order of SNR, and in descending order of SNR. The experimental results are shown in the table of Fig. 4, which lists the performance (average CER & WER) of the different curriculum learning strategies on the 2-speaker mixed WSJ0 corpus test data set.
When the training data are sorted in descending order of SNR, the model performs worse than the model trained in the reverse order, and even worse than the model trained with randomly ordered data, which confirms the method's conjecture. When the SNR is small, the energy difference between the two speakers is subtle and the model learns the separation ability; data with a larger SNR then improve accuracy.
Sequence-level knowledge distillation and curriculum learning techniques are applied to a multi-speaker end-to-end speech recognition system based on the joint CTC/attention encoder-decoder framework. A single-speaker end-to-end speech recognition teacher model is used to compute soft label vectors as target distributions for the final loss function. To make full use of the training data, the data are further rearranged in ascending order of SNR.
Fig. 5 is a schematic structural diagram of an optimization system for a single-channel speech recognition model according to an embodiment of the present invention, which can execute the optimization method for a single-channel speech recognition model according to any of the above embodiments and is configured in a terminal.
The embodiment provides an optimization system for a single-channel speech recognition model, which comprises: a target soft tag determination program module 11, an output alignment determination program module 12, a loss determination program module 13, and an optimization program module 14.
The target soft tag determining program module 11 is configured to receive single voices with respective real tag vectors and multi-person mixed voices synthesized by the single voices, and input voice features extracted from the single voices to a target teacher model to obtain target soft tag vectors corresponding to the single voices; the output arrangement determining program module 12 is configured to input the multi-person mixed voice to an end-to-end student model, output an output tag vector of each person in the multi-person mixed voice, pair the output tag vector of each person in the multi-person mixed voice with a real tag vector of each single person voice by a permutation invariance method (PIT), and determine an output arrangement of the output tag vector of each person in the multi-person mixed voice; the loss determination program module 13 is configured to determine a knowledge distillation loss with each target soft tag vector and a direct loss with each single voice true tag vector according to the output tag vector of each person in the multi-person mixed voice which is determined to be output and arranged by pairing; the optimization program module 14 is configured to, when a joint error determined according to the knowledge distillation loss and the direct loss does not converge, perform back propagation on the end-to-end student model according to the joint error to update the end-to-end student model until the joint error converges, and determine an optimized voice recognition student model for a single channel.
Further, the output arrangement determining program module is for:
carrying out feature projection on the voice features of the multi-person mixed voice through a trained neural network in the end-to-end student model, and dividing out the voice features of each person in the multi-person mixed voice;
determining, by an encoder within the end-to-end student model, a corresponding Connectionist Temporal Classification (CTC) score based on the voice characteristics of each person;
transforming, by a decoder within the end-to-end student model, the feature arrangement corresponding to the minimum Connectionist Temporal Classification (CTC) score into a corresponding output label vector; and mapping the label vectors through a dictionary to obtain a corresponding text sequence.
Further, the output arrangement determining program module is further configured to:
through an attention module in the end-to-end student model, performing further feature extraction on the voice features of each person in the multi-person mixed voice and determining a corresponding attention score, so that the multi-person mixed voice is temporally aligned with the individual output label vectors.
Further, the output arrangement determining program module is further configured to:
determining a joint score of each feature arrangement by weighting the Connectionist Temporal Classification (CTC) score and the attention score corresponding to each feature arrangement according to a preset recognition mode;
and converting the feature arrangement with the minimum score among the joint scores into the corresponding output label vectors.
Further, the output arrangement determining program module is further configured to:
and acquiring relevant information of each person in the multi-person mixed voice through a speaker adaptation module newly added to the end-to-end student model, so as to determine the voice characteristics of each person and additionally determine the context variable of each person.
Further, the target soft tag determination program module is configured to:
determining the signal-to-noise ratio of each single voice through the voice and background noise of the person in each single voice;
and sequencing the multi-person mixed voice data according to the magnitude of the signal-to-noise ratio so as to achieve the progressive optimization of the voice recognition model.
The embodiment of the invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores computer executable instructions which can execute the optimization method for the single-channel speech recognition model in any method embodiment;
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
receiving single voice with real label vectors and multi-person mixed voice synthesized by the single voice, and respectively inputting voice characteristics extracted from the single voice into a target teacher model to obtain target soft label vectors corresponding to the single voice;
inputting the multi-person mixed voice into an end-to-end student model, outputting output label vectors of each person in the multi-person mixed voice, pairing the output label vectors of each person in the multi-person mixed voice with real label vectors of each single voice through a permutation invariance method (PIT), and determining output arrangement of the output label vectors of each person in the multi-person mixed voice;
determining knowledge distillation loss with each target soft tag vector and direct loss with each single voice real tag vector according to the output tag vector of each person in the multi-person mixed voice which is determined to be output and arranged after pairing;
and when the joint error determined according to the knowledge distillation loss and the direct loss is not converged, performing back propagation on the end-to-end student model according to the joint error to update the end-to-end student model until the joint error is converged, and determining an optimized voice recognition student model for a single channel.
As a non-volatile computer readable storage medium, may be used to store non-volatile software programs, non-volatile computer executable programs, and modules, such as program instructions/modules corresponding to the methods of testing software in embodiments of the present invention. One or more program instructions are stored in a non-transitory computer readable storage medium, which when executed by a processor, perform the optimization method for a single-channel speech recognition model in any of the method embodiments described above.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of a device of test software, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the means for testing software over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method for optimizing a speech recognition model for a single channel of any of the embodiments of the present invention.
The client of the embodiment of the present application exists in various forms, including but not limited to:
(1) mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones, multimedia phones, functional phones, and low-end phones, among others.
(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as tablet computers.
(3) Portable entertainment devices such devices may display and play multimedia content. The devices comprise audio and video players, handheld game consoles, electronic books, intelligent toys and portable vehicle-mounted navigation devices.
(4) Other electronic devices with voice recognition capabilities.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for optimizing a speech recognition model for a single channel, comprising:
receiving single voice with real label vectors and multi-person mixed voice synthesized by the single voice, and respectively inputting voice characteristics extracted from the single voice into a target teacher model to obtain target soft label vectors corresponding to the single voice;
inputting the multi-person mixed voice into an end-to-end student model, outputting output label vectors of each person in the multi-person mixed voice, pairing the output label vectors of each person in the multi-person mixed voice with real label vectors of each single voice through a permutation invariance method (PIT), and determining output arrangement of the output label vectors of each person in the multi-person mixed voice;
determining knowledge distillation loss with each target soft tag vector and direct loss with each single voice real tag vector according to the output tag vector of each person in the multi-person mixed voice which is determined to be output and arranged after pairing;
and when the joint error determined according to the knowledge distillation loss and the direct loss is not converged, performing back propagation on the end-to-end student model according to the joint error to update the end-to-end student model until the joint error is converged, and determining the optimized voice recognition student model for the single channel.
2. The method of claim 1, wherein the inputting the multi-person mixed speech to an end-to-end student model comprises:
carrying out feature projection on the voice features of the multi-person mixed voice through a trained neural network in the end-to-end student model, and dividing out the voice features of each person in the multi-person mixed voice;
determining, by an encoder within the end-to-end student model, a corresponding Connectionist Temporal Classification (CTC) score based on the voice characteristics of each person;
transforming, by a decoder within the end-to-end student model, the feature arrangement corresponding to the minimum Connectionist Temporal Classification (CTC) score into a corresponding output label vector; and mapping the label vectors through a dictionary to obtain a corresponding text sequence.
3. The method of claim 2, wherein after said demarcating voice characteristics of each person within the multi-person mixed voice, the method further comprises:
performing, through an attention module in the end-to-end student model, further feature extraction on the voice features of each person in the multi-person mixed voice and determining a corresponding attention score, so that the multi-person mixed voice is temporally aligned with the individual output label vectors.
4. The method of claim 3, wherein after said determining the corresponding attention score, the method further comprises:
determining a joint score of each feature arrangement by weighting the Connectionist Temporal Classification (CTC) score and the attention score corresponding to each feature arrangement according to a preset recognition mode;
and converting the feature arrangement with the minimum score among the joint scores into the corresponding output label vectors.
5. The method of claim 2, wherein after said feature projecting voice features of the multi-person mixed voice through a trained neural network within the end-to-end student model, the method further comprises:
and acquiring relevant information of each person in the multi-person mixed voice through a speaker adaptation module newly added to the end-to-end student model, so as to determine the voice characteristics of each person and additionally determine the context variable of each person.
6. The method of claim 1, wherein after receiving the single-speaker speech samples, each with a ground-truth label vector, the method further comprises:
determining the signal-to-noise ratio of each single-speaker sample from the speaker's speech and the background noise in that sample;
and ordering the multi-speaker mixed speech data according to the magnitude of the signal-to-noise ratio, so as to optimize the speech recognition model progressively.
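Illustrative sketch (not part of the claims): the signal-to-noise ratio can be estimated from the clean speech and background-noise waveforms, and the mixtures then ordered from highest SNR (easiest) to lowest (hardest); the power-based formula and the eps floor are assumptions of the sketch.

import numpy as np

def snr_db(speech, noise, eps=1e-8):
    # Signal-to-noise ratio in dB from speech and background-noise waveforms.
    return 10.0 * np.log10(np.sum(speech ** 2) / (np.sum(noise ** 2) + eps))

def order_by_snr(mixtures):
    # mixtures: list of dicts holding 'speech' and 'noise' arrays for each sample.
    # Sorted from easiest (highest SNR) to hardest, for progressive optimization.
    return sorted(mixtures, key=lambda m: snr_db(m["speech"], m["noise"]), reverse=True)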
7. The method of claim 1, wherein determining the joint error from the knowledge distillation loss and the direct loss comprises:
weighting and summing the knowledge distillation loss and the direct loss according to a preset training mode to determine the joint error.
8. An optimization system for a single-channel speech recognition model, comprising:
a target soft label determining program module, configured to receive single-speaker speech samples, each with a ground-truth label vector, together with multi-speaker mixed speech synthesized from those single-speaker samples, and to input the speech features extracted from each single-speaker sample into a target teacher model to obtain the target soft label vector corresponding to that sample;
an output permutation determining program module, configured to input the multi-speaker mixed speech into an end-to-end student model to output an output label vector for each speaker in the mixed speech, to pair each speaker's output label vector with the ground-truth label vector of the corresponding single-speaker sample by permutation invariant training (PIT), and to determine the output permutation of the speakers' output label vectors;
a loss determining program module, configured to compute, for the output label vectors arranged in the determined output permutation, a knowledge distillation loss against the target soft label vectors and a direct loss against the ground-truth label vectors of the single-speaker samples;
and an optimization program module, configured to back-propagate the joint error determined from the knowledge distillation loss and the direct loss through the end-to-end student model when the joint error has not converged, so as to update the end-to-end student model until the joint error converges and the optimized single-channel speech recognition student model is obtained.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-7.
10. A storage medium on which a computer program is stored which, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
CN201910511791.7A 2019-06-13 2019-06-13 Optimization method and system for single-channel speech recognition model Active CN110246487B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910511791.7A CN110246487B (en) 2019-06-13 2019-06-13 Optimization method and system for single-channel speech recognition model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910511791.7A CN110246487B (en) 2019-06-13 2019-06-13 Optimization method and system for single-channel speech recognition model

Publications (2)

Publication Number Publication Date
CN110246487A CN110246487A (en) 2019-09-17
CN110246487B true CN110246487B (en) 2021-06-22

Family

ID=67886903

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910511791.7A Active CN110246487B (en) 2019-06-13 2019-06-13 Optimization method and system for single-channel speech recognition model

Country Status (1)

Country Link
CN (1) CN110246487B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110852390A (en) * 2019-11-13 2020-02-28 山东师范大学 Student score classification prediction method and system based on campus behavior sequence
CN111062489B (en) * 2019-12-11 2023-10-20 北京知道创宇信息技术股份有限公司 Multi-language model compression method and device based on knowledge distillation
CN111179911B (en) * 2020-01-02 2022-05-03 腾讯科技(深圳)有限公司 Target voice extraction method, device, equipment, medium and joint training method
CN111199727B (en) * 2020-01-09 2022-12-06 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal and storage medium
CN111261140B (en) * 2020-01-16 2022-09-27 云知声智能科技股份有限公司 Rhythm model training method and device
CN111341341B (en) * 2020-02-11 2021-08-17 腾讯科技(深圳)有限公司 Training method of audio separation network, audio separation method, device and medium
CN111048064B (en) * 2020-03-13 2020-07-07 同盾控股有限公司 Voice cloning method and device based on single speaker voice synthesis data set
CN111506702A (en) * 2020-03-25 2020-08-07 北京万里红科技股份有限公司 Knowledge distillation-based language model training method, text classification method and device
CN111768762B (en) * 2020-06-05 2022-01-21 北京有竹居网络技术有限公司 Voice recognition method and device and electronic equipment
CN111696519A (en) * 2020-06-10 2020-09-22 苏州思必驰信息科技有限公司 Method and system for constructing acoustic feature model of Tibetan language
CN111554268B (en) * 2020-07-13 2020-11-03 腾讯科技(深圳)有限公司 Language identification method based on language model, text classification method and device
CN111899727B (en) * 2020-07-15 2022-05-06 思必驰科技股份有限公司 Training method and system for voice recognition model of multiple speakers
CN112070233B (en) * 2020-08-25 2024-03-22 北京百度网讯科技有限公司 Model joint training method, device, electronic equipment and storage medium
CN111933121B (en) * 2020-08-31 2024-03-12 广州市百果园信息技术有限公司 Acoustic model training method and device
CN112233655A (en) * 2020-09-28 2021-01-15 上海声瀚信息科技有限公司 Neural network training method for improving voice command word recognition performance
CN112365885B (en) * 2021-01-18 2021-05-07 深圳市友杰智新科技有限公司 Training method and device of wake-up model and computer equipment
CN112365886B (en) * 2021-01-18 2021-05-07 深圳市友杰智新科技有限公司 Training method and device of speech recognition model and computer equipment
CN113609965B (en) * 2021-08-03 2024-02-13 同盾科技有限公司 Training method and device of character recognition model, storage medium and electronic equipment
CN113380268A (en) * 2021-08-12 2021-09-10 北京世纪好未来教育科技有限公司 Model training method and device and speech signal processing method and device
CN113707123B (en) * 2021-08-17 2023-10-20 慧言科技(天津)有限公司 Speech synthesis method and device
CN113782006A (en) * 2021-09-03 2021-12-10 清华大学 Voice extraction method, device and equipment
CN116805004B (en) * 2023-08-22 2023-11-14 中国科学院自动化研究所 Zero-resource cross-language dialogue model training method, device, equipment and medium
CN117351997B (en) * 2023-12-05 2024-02-23 清华大学 Synthetic audio detection method and system based on reverse knowledge distillation

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108389576A (en) * 2018-01-10 2018-08-10 苏州思必驰信息科技有限公司 The optimization method and system of compressed speech recognition modeling
CN108986788A (en) * 2018-06-06 2018-12-11 国网安徽省电力有限公司信息通信分公司 A kind of noise robust acoustic modeling method based on aposterior knowledge supervision
CN109616105A (en) * 2018-11-30 2019-04-12 江苏网进科技股份有限公司 A kind of noisy speech recognition methods based on transfer learning
CN109711544A (en) * 2018-12-04 2019-05-03 北京市商汤科技开发有限公司 Method, apparatus, electronic equipment and the computer storage medium of model compression
CN109637546A (en) * 2018-12-29 2019-04-16 苏州思必驰信息科技有限公司 Knowledge distillating method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Chang Xuankai et al.; "Adaptive Permutation Invariant Training with Auxiliary Information for Monaural Multi-Talker Speech Recognition"; 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2018-04-20; pp. 5974-5978 *
Chang Xuankai et al.; "End-to-End Monaural Multi-Speaker ASR System without Pretraining"; International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2019-05-17; pp. 6256-6260 *
Qian Yanmin et al.; "Single-channel multi-talker speech recognition with permutation invariant training"; Speech Communication; 2018-11-30; pp. 1-11 *

Also Published As

Publication number Publication date
CN110246487A (en) 2019-09-17

Similar Documents

Publication Publication Date Title
CN110246487B (en) Optimization method and system for single-channel speech recognition model
CN109637546B (en) Knowledge distillation method and apparatus
CN111899727B (en) Training method and system for voice recognition model of multiple speakers
US20200402497A1 (en) Systems and Methods for Speech Generation
CN108922518B (en) Voice data amplification method and system
CN110706692B (en) Training method and system of child voice recognition model
CN111081259B (en) Speech recognition model training method and system based on speaker expansion
Li et al. Developing far-field speaker system via teacher-student learning
CN105741832B (en) Spoken language evaluation method and system based on deep learning
CN110459240A (en) The more speaker's speech separating methods clustered based on convolutional neural networks and depth
CN111862942B (en) Method and system for training mixed speech recognition model of Mandarin and Sichuan
CN111243576A (en) Speech recognition and model training method, device, equipment and storage medium
CN107871496B (en) Speech recognition method and device
CN110600013B (en) Training method and device for non-parallel corpus voice conversion data enhancement model
Liu et al. End-to-end accent conversion without using native utterances
Du et al. Speaker augmentation for low resource speech recognition
Zhang et al. Improving end-to-end single-channel multi-talker speech recognition
CN103594087A (en) Method and system for improving oral evaluation performance
CN111667728B (en) Voice post-processing module training method and device
CN111862934A (en) Method for improving speech synthesis model and speech synthesis method and device
CN109559749A (en) Combined decoding method and system for speech recognition system
Park et al. Unsupervised data selection for speech recognition with contrastive loss ratios
Tao et al. DNN Online with iVectors Acoustic Modeling and Doc2Vec Distributed Representations for Improving Automated Speech Scoring.
CN110597958A (en) Text classification model training and using method and device
CN114267334A (en) Speech recognition model training method and speech recognition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20200616
Address after: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.
Applicant after: AI SPEECH Ltd.
Applicant after: Shanghai Jiaotong University Intellectual Property Management Co.,Ltd.
Address before: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.
Applicant before: AI SPEECH Ltd.
Applicant before: SHANGHAI JIAO TONG University
TA01 Transfer of patent application right
Effective date of registration: 20201028
Address after: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.
Applicant after: AI SPEECH Ltd.
Address before: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.
Applicant before: AI SPEECH Ltd.
Applicant before: Shanghai Jiaotong University Intellectual Property Management Co.,Ltd.
CB02 Change of applicant information
Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province
Applicant after: Sipic Technology Co.,Ltd.
Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province
Applicant before: AI SPEECH Ltd.
GR01 Patent grant