CN110246487B - Optimization method and system for single-channel speech recognition model - Google Patents


Publication number
CN110246487B
CN110246487B (Application CN201910511791.7A)
Authority
CN
China
Prior art keywords
voice
person
model
output
mixed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910511791.7A
Other languages
Chinese (zh)
Other versions
CN110246487A
Inventor
钱彦旻
张王优
常煊恺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AI Speech Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN201910511791.7A
Publication of CN110246487A
Application granted
Publication of CN110246487B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks


Abstract

The embodiment of the invention provides an optimization method for a single-channel speech recognition model. The method comprises the following steps: receiving single-person voices with real label vectors and multi-person mixed voice, and inputting voice features extracted from the single-person voices into a target teacher model to obtain target soft label vectors corresponding to the single-person voices; inputting the multi-person mixed voice into an end-to-end student model and determining the output arrangement; determining the knowledge distillation loss and the direct loss from the output label vector of each person in the multi-person mixed voice with the determined output arrangement; and when the joint error determined from the knowledge distillation loss and the direct loss does not converge, optimizing the end-to-end student model according to the joint error. The embodiment of the invention also provides an optimization system for the single-channel speech recognition model. With the embodiment of the invention, good parameters can be learned more easily and the model is more compact, so that the trained student model achieves better performance.

Description

Optimization method and system for single-channel speech recognition model
Technical Field
The invention relates to the field of voice recognition, in particular to an optimization method and system for a single-channel voice recognition model.
Background
With the development of intelligent speech, more and more devices have a speech recognition function. Because of their different usage scenarios, some devices are equipped with only a single microphone while others are equipped with multiple microphones, i.e., so-called single-channel and multi-channel devices. With only a single microphone, single-channel devices show poor recognition performance when processing cocktail-party-like conversations in which multiple people speak simultaneously and their voices are mixed together. For this purpose, two approaches are generally used for training: a knowledge distillation method for single-channel multi-speaker speech recognition based on a bidirectional long short-term memory recurrent neural network, or an end-to-end single-channel multi-speaker speech recognition system.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the related art:
The knowledge distillation method for single-channel multi-speaker speech recognition based on a bidirectional long short-term memory recurrent neural network uses a traditional model, and its training process is more complex and cumbersome than that of an end-to-end model. As for the end-to-end single-channel multi-speaker speech recognition system: because the speech signals of multiple speakers are present at the same time, the model can only use information from the mixed speech and lacks the speech information of individual speakers during training, so a good result is difficult to obtain and the performance gap compared with a single-speaker speech recognition system is large.
Disclosure of Invention
Embodiments of the invention at least solve the problems in the prior art that the traditional model is complex, the training process is cumbersome, and the training effect and performance are poor.
In a first aspect, an embodiment of the present invention provides an optimization method for a single-channel speech recognition model, including:
receiving single voice with real label vectors and multi-person mixed voice synthesized by the single voice, and respectively inputting voice characteristics extracted from the single voice into a target teacher model to obtain target soft label vectors corresponding to the single voice;
inputting the multi-person mixed voice into an end-to-end student model, outputting output label vectors of each person in the multi-person mixed voice, pairing the output label vectors of each person in the multi-person mixed voice with real label vectors of each single voice through a permutation invariance method (PIT), and determining output arrangement of the output label vectors of each person in the multi-person mixed voice;
determining knowledge distillation loss with each target soft tag vector and direct loss with each single voice real tag vector according to the output tag vector of each person in the multi-person mixed voice which is determined to be output and arranged after pairing;
and when the joint error determined according to the knowledge distillation loss and the direct loss is not converged, performing back propagation on the end-to-end student model according to the joint error to update the end-to-end student model until the joint error is converged, and determining an optimized voice recognition student model for a single channel.
In a second aspect, an embodiment of the present invention provides an optimization system for a single-channel speech recognition model, including:
the target soft label determining program module is used for receiving single voices with real label vectors and multi-person mixed voices synthesized by the single voices, and respectively inputting voice characteristics extracted from the single voices into a target teacher model to obtain target soft label vectors corresponding to the single voices;
an output arrangement determining program module, configured to input the multi-person mixed speech into an end-to-end student model, output an output tag vector of each person in the multi-person mixed speech, pair the output tag vector of each person in the multi-person mixed speech with a real tag vector of each single speech by a permutation invariance method (PIT), and determine an output arrangement of the output tag vector of each person in the multi-person mixed speech;
a loss determination program module for determining a knowledge distillation loss with each target soft tag vector and a direct loss with each single voice true tag vector according to an output tag vector of each person in the multi-person mixed voice which is determined to be output and arranged after pairing;
and the optimization program module is used for reversely propagating the end-to-end student model according to the joint error when the joint error determined according to the knowledge distillation loss and the direct loss is not converged so as to update the end-to-end student model until the joint error is converged and determine the optimized voice recognition student model for the single channel.
In a third aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method for optimizing a speech recognition model for a single channel of any of the embodiments of the present invention.
In a fourth aspect, an embodiment of the present invention provides a storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps of the optimization method for a single-channel speech recognition model according to any one of the embodiments of the present invention.
The embodiment of the invention has the beneficial effects that: the output of the teacher model trained on the single-speaker corpus is used as the target training label, and the voice information of individual speakers is incorporated during training, so the soft labels can provide more information, the student model can learn good parameters more easily, the model is simplified, and the better parameters give the trained student model better performance. In addition, a curriculum learning strategy is adopted: the training data are ordered according to the speakers' signal-to-noise ratio (SNR), so the information in the data is better utilized and the performance of the model is further improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart of a method for optimizing a speech recognition model for a single channel according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an end-to-end single-channel multi-speaker speech recognition model architecture based on knowledge distillation for a method for optimizing a single-channel speech recognition model according to an embodiment of the present invention;
FIG. 3 is a data diagram of a comparison list of end-to-end multi-speaker joint CTC/attention-based encoder-decoder systems for an optimization method for a single-channel speech recognition model according to an embodiment of the present invention;
FIG. 4 is a data diagram listing the performance (average CER & WER) of different curriculum learning strategies on the 2-speaker mixed WSJ0 corpus test dataset for an optimization method for a single-channel speech recognition model according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an optimization system for a single-channel speech recognition model according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of an optimization method for a single-channel speech recognition model according to an embodiment of the present invention, which includes the following steps:
s11: receiving single voice with real label vectors and multi-person mixed voice synthesized by the single voice, and respectively inputting voice characteristics extracted from the single voice into a target teacher model to obtain target soft label vectors corresponding to the single voice;
s12: inputting the multi-person mixed voice into an end-to-end student model, outputting output label vectors of each person in the multi-person mixed voice, pairing the output label vectors of each person in the multi-person mixed voice with real label vectors of each single voice through a permutation invariance method (PIT), and determining output arrangement of the output label vectors of each person in the multi-person mixed voice;
s13: determining knowledge distillation loss with each target soft tag vector and direct loss with each single voice real tag vector according to the output tag vector of each person in the multi-person mixed voice which is determined to be output and arranged after pairing;
s14: and when the joint error determined according to the knowledge distillation loss and the direct loss is not converged, performing back propagation on the end-to-end student model according to the joint error to update the end-to-end student model until the joint error is converged, and determining an optimized voice recognition student model for a single channel.
In this embodiment, note that existing methods generally do not use a teacher model; during training, only the error between the student model's output label vectors and the real label vectors is computed. The present method introduces a teacher model. Teacher models are commonly used in knowledge distillation, where the knowledge of a powerful, well-performing teacher model is transferred to a more compact student model. Although a student model trained directly in a supervised manner cannot match the teacher model, through knowledge distillation the prediction capability of the student model comes closer to that of the teacher model.
For step S11, in order to optimize the recognition effect of the student speech recognition model, a target teacher model to be learned from is first determined, where the target teacher model may be a teacher model trained in advance. Training also requires certain training data, including: some single-person voices with real label vectors, and multi-person mixed voices synthesized from these single-person voices. The labels can be understood as the text corresponding to the speech, mapped through a dictionary so that they can be processed by a computer. The single-person voices are respectively input into the target teacher model to obtain the corresponding target soft label vectors; these soft label vectors contain supplementary information hidden by the overlapping speech as well as the insight of the single-speaker model.
For step S12, the multi-person mixed speech determined in step S11 is input into the end-to-end student model to be trained, which outputs the output label vector of each person in the multi-person mixed speech. The output label vectors of each person are then paired with the real label vectors of each single-person voice through the permutation invariance method (PIT), an algorithm for solving the problem of pairing several predicted labels (output labels) with several real labels. In this example, when the model processes mixed speech, it outputs labels corresponding to the speech of several speakers, but during training the error between each output label and its corresponding real label must be computed. For example, with labels corresponding to 2 speakers, it is not known which speaker each of the model's 2 output labels actually corresponds to (if the two predicted label vectors are P1 and P2 and the real labels are Y1 and Y2, it is not known whether the pairing should be P1-Y1, P2-Y2 or P1-Y2, P2-Y1); the permutation invariance method is therefore employed to assist the pairing. From this pairing, the output arrangement of the output label vectors of each person in the multi-person mixed speech is determined.
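For illustration only, the following Python/PyTorch sketch shows one common way such PIT pairing can be realized (here scored with CTC losses, as this document does later); the tensor shapes, list layout and use of torch.nn.CTCLoss are assumptions, not the patent's implementation.

```python
# Hypothetical PIT pairing sketch (not the patent's code): try every permutation
# of the reference labels, score each assignment with CTC loss, keep the best.
import itertools
import torch

ctc = torch.nn.CTCLoss(blank=0, zero_infinity=True)

def pit_best_permutation(log_probs, refs, in_lens, ref_lens):
    """log_probs: list of S tensors of shape (T, N, V) with per-speaker log-probs;
    refs: list of S reference label tensors. Returns (best permutation, its loss)."""
    S = len(log_probs)
    best_perm, best_loss = None, None
    for perm in itertools.permutations(range(S)):
        # total CTC loss when output stream s is paired with reference perm[s]
        loss = sum(ctc(log_probs[s], refs[perm[s]], in_lens[s], ref_lens[perm[s]])
                   for s in range(S))
        if best_loss is None or loss < best_loss:
            best_perm, best_loss = perm, loss
    return best_perm, best_loss
```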
For step S13, from the output label vectors of each person in the multi-person mixed speech, with the output arrangement determined by the pairing in step S12, the knowledge distillation loss with respect to each target soft label vector and the direct loss with respect to each single-person voice's real label vector are determined respectively. In the optimization process, not only the direct loss produced by the real label vectors, as in the prior art, but also the knowledge distillation loss produced by the teacher model is considered, so that losses from several aspects are taken into account across multiple dimensions.
For step S14, when the joint error determined from the knowledge distillation loss and the direct loss of step S13 has not converged, the computed error is propagated back to each network layer before the output through the back-propagation algorithm (a common algorithm in machine learning) to update the network parameters. This parameter-update process is the training, and it continues until the joint error converges, yielding an optimized student model for single-channel speech recognition.
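A condensed, hypothetical view of the optimization loop of steps S11-S14 might look as follows; `teacher`, `student`, `kd_loss` and `direct_loss` are placeholder objects introduced only for this sketch, and the weighting and convergence test are illustrative rather than the patent's implementation.

```python
# Hypothetical training loop for steps S11-S14 (illustrative only).
import torch

def optimize_student(student, teacher, loader, eta=0.5, tol=1e-4, max_epochs=50):
    opt = torch.optim.Adadelta(student.parameters(), rho=0.95, eps=1e-8)
    prev = float("inf")
    for _ in range(max_epochs):
        total = 0.0
        for mixed_feats, single_feats, true_labels in loader:
            with torch.no_grad():
                soft_targets = [teacher(f) for f in single_feats]   # target soft labels
            outputs, perm = student(mixed_feats, true_labels)       # PIT fixes the pairing
            kd = student.kd_loss(outputs, soft_targets, perm)       # vs. teacher soft labels
            ce = student.direct_loss(outputs, true_labels, perm)    # vs. real labels
            joint = eta * kd + (1.0 - eta) * ce                     # joint error
            opt.zero_grad()
            joint.backward()
            opt.step()
            total += float(joint)
        if abs(prev - total) < tol:                                 # crude convergence test
            break
        prev = total
    return student
```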
It can be seen from this embodiment that the output of the teacher model trained on the single-speaker corpus is used as the target training label. Such soft labels provide more information, so the student model can learn good parameters more easily, the model is more compact, and the better parameters give the trained student model better performance.
As an embodiment, in this embodiment, the inputting the multi-person mixed voice into the end-to-end student model includes:
carrying out feature projection on the voice features of the multi-person mixed voice through a trained neural network in the end-to-end student model, and dividing out the voice features of each person in the multi-person mixed voice;
determining, by an encoder within the end-to-end student model, a corresponding Connectionist Temporal Classification (CTC) score based on the voice characteristics of each person;
transforming, by a decoder within the end-to-end student model, the feature arrangement corresponding to the minimum Connectionist Temporal Classification (CTC) score into a corresponding output label vector; and mapping the label vectors through a dictionary to obtain a corresponding text sequence.
In this embodiment, in the optimization training phase, based on the voice features of each person, the feature permutation combinations corresponding to the teacher model are determined by the encoder in the end-to-end student model, the Connectionist Temporal Classification (CTC) score set corresponding to each feature permutation combination is further determined, and the feature permutation with the minimum total score among the permutation combinations is determined by the permutation invariance method. The feature arrangement corresponding to the minimum total CTC score is then converted into a corresponding output label vector by the decoder in the end-to-end student model.
In the recognition stage, neither the teacher model nor permutation invariance training is needed; the results determined by the decoder are directly arranged in order, and the corresponding decoding result is determined through the calculated score.
According to the embodiment, the characteristic arrangement corresponding to the minimum score is determined through the permutation invariance training, so that the error in the recognition can be reduced to the minimum, and the recognition effect is improved.
As an implementation manner, in this embodiment, after the dividing out the voice feature of each person in the multi-person mixed voice, the method further includes:
through an attention module in the end-to-end student model, performing further feature extraction on the voice features of each person in the multi-person mixed voice and determining a corresponding attention score, so that the multi-person mixed voice is temporally aligned with the individual output label vectors.
In the present embodiment, the attention score is calculated by, after the permutation invariance training, first rearranging the intermediate representation of each person output by the encoder in accordance with the output arrangement obtained by the permutation invariance training, and then calculating the attention score between the intermediate representation corresponding to each speaker and the intermediate representation of the corresponding teacher model.
Through the embodiment, the attention module further extracts the voice features in order to solve the problem that the output text and the input audio in the end-to-end voice recognition system are not aligned in time, so that the recognition effect of the voice recognition model is improved.
As an implementation manner, in this embodiment, after the determining the corresponding attention score, the method further includes:
determining a joint score of each feature arrangement by weighting the Connectionist Temporal Classification (CTC) score and the attention score corresponding to each feature arrangement according to a preset recognition mode;
and converting the feature arrangement with the minimum score among the joint scores into the corresponding output label vectors.
In the present embodiment, different recognition modes have different training directions, and different directions correspond to different weighting factors. Different weighting ratios are determined according to the preset recognition mode, and the CTC score and the attention score are then combined by weighted calculation. For example, when the CTC score of a certain feature arrangement is 2.34, the attention score is 3.22, and the two weights are equal, the score of that feature arrangement is 2.78. After the final score corresponding to each feature arrangement is determined, the feature arrangement with the smallest score, i.e., the feature arrangement with the smallest error, is selected and converted into the corresponding output label vector.
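A toy sketch of this weighted joint scoring, reproducing the numerical example above under the assumption of equal weights, could look like this:

```python
# Toy illustration of the weighted joint score; 0.5/0.5 mirrors the example above.
def joint_score(ctc_score, att_score, ctc_weight=0.5):
    return ctc_weight * ctc_score + (1.0 - ctc_weight) * att_score

print(joint_score(2.34, 3.22))  # ~2.78, matching the numbers in the text
```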
According to the embodiment, various requirements for optimizing the speech recognition model are met by adjusting the weighted ratio, and the recognition effect of the speech recognition model is further improved.
As an embodiment, in this embodiment, after feature projection of the voice features of the multi-person mixed voice by the trained neural network in the end-to-end student model, the method further includes:
and acquiring relevant information of each person in the multi-person mixed voice through a speaker adaptation module newly added to the end-to-end student model, so as to determine the voice characteristics of each person and additionally determine the context variable of each person.
In this embodiment, a sequence summary network is added before the encoder corresponding to each speaker in the end-to-end student model. Its input is the output of the mixed-speech encoder; this output is transformed by a projection to the same dimension as the input and then multiplied with the original input to form a new feature, which is then fed to the encoder corresponding to each person. This is shown in Fig. 2: for example, encoder 1 and encoder 2 in the knowledge-distillation-based end-to-end single-channel multi-speaker speech recognition model architecture of Fig. 2 are each preceded by a sequence summary network with the same structure.
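A minimal sketch of such a sequence-summary adaptation block is given below, under the assumptions stated in the comments (in particular that the summary is averaged over time before scaling the features, which is not specified explicitly above).

```python
# Hypothetical speaker-adaptation block: a sequence summary of the mixture
# encoder output is projected to the input feature dimension and multiplied
# with the original features. Averaging over time is an assumption here.
import torch.nn as nn

class SequenceSummary(nn.Module):
    def __init__(self, enc_dim, feat_dim):
        super().__init__()
        self.proj = nn.Linear(enc_dim, feat_dim)   # project to the input dimension

    def forward(self, mix_enc_out, feats):
        # mix_enc_out: (B, T, enc_dim) output of the mixture encoder
        # feats:       (B, T, feat_dim) original input features
        summary = self.proj(mix_enc_out).mean(dim=1, keepdim=True)  # utterance summary
        return feats * summary                     # scaled features fed to each encoder
```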
Through the embodiment, the training process is completely consistent with the previous training process, and the newly added module learns the information related to each person in the training process, so that the context variable containing the information of each person can be output, and more data improve the recognition effect of the voice recognition model.
As an implementation manner, in this embodiment, after the receiving the single-person voices with the real tag vectors, the method further includes:
determining the signal-to-noise ratio of each single voice through the voice and background noise of the person in each single voice;
and sequencing the multi-person mixed voice data according to the magnitude of the signal-to-noise ratio so as to achieve the progressive optimization of the voice recognition model.
In this embodiment, the snr is a logarithmic value of a ratio of powers of human voice and background noise in the voice signal, and a magnitude thereof represents a relative strength of the voice, and a larger value indicates that the noise is relatively weaker, thereby allowing the voice to be recognized more easily.
The embodiment shows that the training data are sorted according to the signal-to-noise ratio of the speaker, so that the learning process of the speaker is simulated in the training process, namely, the difficulty is gradually increased from a simple sample, and the progressive effect is achieved, so that a better training effect is realized.
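As an illustrative sketch, and assuming separate access to the person's voice and the background noise of each utterance, the SNR computation and ordering could be written as:

```python
# Illustrative SNR (log power ratio of the person's voice to background noise)
# and ascending ordering of the training utterances by that value.
import numpy as np

def snr_db(speech, noise, eps=1e-12):
    p_speech = np.mean(np.asarray(speech, dtype=np.float64) ** 2)
    p_noise = np.mean(np.asarray(noise, dtype=np.float64) ** 2)
    return 10.0 * np.log10((p_speech + eps) / (p_noise + eps))

def sort_by_snr(utterances):
    # utterances: list of dicts with "speech" and "noise" waveforms (assumed layout)
    return sorted(utterances, key=lambda u: snr_db(u["speech"], u["noise"]))
```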
As an embodiment, determining the joint error from the knowledge distillation loss and the direct loss comprises:
and weighting and summing the knowledge distillation loss and the direct loss according to a preset training mode to determine a joint error.
In order to meet different recognition requirements, different training modes can be set according to different use environments in the training process. And then training the speech recognition models meeting different requirements through different weighting ratios.
According to the embodiment, different training modes are set, and in the training process, joint errors of knowledge distillation loss and direct loss are determined according to different weighting ratios, so that the recognition environments with different requirements are met. And further improves the recognition effect of the voice recognition model.
In a further embodiment, the method uses an end-to-end speech recognition model based on a joint CTC/attention encoder-decoder. The advantage of this model is that it uses CTC as an auxiliary task to enhance the alignment capability of the attention-based encoder-decoder. The model is then modified to accommodate the multi-speaker scenario by introducing a separation stage in the encoder. The input speech mixture is first explicitly divided in the encoder into a plurality of vector sequences, each representing one speaker source. These sequences are fed to the decoder to calculate conditional probabilities.
O represents the input speech mix of S speakers. The encoder consists of three stages:
Encoder-Mix ($\mathrm{Encoder}_{\mathrm{Mix}}$), Encoder-SD ($\mathrm{Encoder}_{\mathrm{SD}}$), and Encoder-Rec ($\mathrm{Encoder}_{\mathrm{Rec}}$).
Encoder-Mix: a mixture encoder that encodes O into an intermediate representation H, which is then processed by S independent speaker-dependent (SD) encoders;
Encoder-SD: produces S outputs $H^{s}$ (s = 1, …, S), each corresponding to the representation of one speaker. In the final stage, for each stream s (s = 1, …, S),
Encoder-Rec converts the feature sequence $H^{s}$ into a high-level representation $G^{s}$.
The encoder may be written as follows:
$$H = \mathrm{Encoder}_{\mathrm{Mix}}(O)$$
$$H^{s} = \mathrm{Encoder}_{\mathrm{SD}}^{s}(H), \quad s = 1, \ldots, S$$
$$G^{s} = \mathrm{Encoder}_{\mathrm{Rec}}(H^{s}), \quad s = 1, \ldots, S$$
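A hypothetical PyTorch rendering of this three-stage encoder, with the concrete sub-encoders left as placeholders (e.g. CNN/BLSTMP stacks), could be:

```python
# Hypothetical forward pass mirroring the three encoder stages above.
import torch.nn as nn

class MultiSpeakerEncoder(nn.Module):
    def __init__(self, enc_mix, enc_sd_list, enc_rec):
        super().__init__()
        self.enc_mix = enc_mix                    # Encoder_Mix, shared
        self.enc_sd = nn.ModuleList(enc_sd_list)  # one Encoder_SD per speaker
        self.enc_rec = enc_rec                    # Encoder_Rec, shared

    def forward(self, O):
        H = self.enc_mix(O)                       # intermediate mixture representation
        H_s = [enc(H) for enc in self.enc_sd]     # speaker-dependent separation
        return [self.enc_rec(h) for h in H_s]     # high-level representations G^s
```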
The CTC objective function is connected after the encoder, which has two advantages. The first is to train the encoder of the sequence-to-sequence model with an auxiliary task. The second is that, in the multi-speaker case, the CTC objective function is used to perform permutation-free training, also called PIT (permutation invariant training), as shown in the following formula.
$$\hat{\pi} = \operatorname*{arg\,min}_{\pi} \sum_{s=1}^{S} \mathrm{Loss}_{\mathrm{ctc}}\!\left(Y^{s}, R^{\pi(s)}\right)$$
where $Y^{s}$ is the output sequence variable computed from the representation $G^{s}$, $\pi(s)$ is the s-th element of a permutation $\pi$ of {1, …, S}, and R are the reference labels of the S speakers. Thereafter, the permutation $\hat{\pi}$ with minimal CTC loss is used for the reference labels in the attention-based decoder in order to reduce computational cost.
The attention-based decoder network decodes each stream $G^{s}$ and generates a corresponding output label sequence $Y^{s}$. For each pair of representation and reference label index $(G^{s}, R^{\hat{\pi}(s)})$, the decoding process is described by the following equations:
$$c_{n}^{s} = \mathrm{Attention}\!\left(e_{n-1}^{s}, G^{s}\right)$$
$$e_{n}^{s} = \mathrm{Update}\!\left(e_{n-1}^{s}, c_{n-1}^{s}, y_{n-1}\right)$$
$$y_{n}^{s} \sim \mathrm{Decoder}\!\left(c_{n}^{s}, y_{n-1}\right)$$
$$p_{\mathrm{att}}\!\left(Y^{s} \mid G^{s}\right) = \prod_{n} p_{\mathrm{att}}\!\left(y_{n}^{s} \mid G^{s}, y_{1:n-1}\right)$$
where $c_{n}^{s}$ denotes a context vector, $e_{n}^{s}$ is a hidden state of the decoder, and $y_{n}$ is the n-th element in the reference label sequence. During training, the reference label $y_{n}^{\hat{\pi}(s)}$ in R is used as the teacher-forcing history instead of the predicted sequence history $y_{1:n-1}^{s}$ in the $p_{\mathrm{att}}$ formula.
The target label sequence predicted by the attention-based encoder-decoder is defined as $Y = \{y_{1}, \ldots, y_{N}\}$, where Y denotes a sequence and a subscripted y denotes an element of Y: $y_{n}$ denotes the n-th element in Y and $y_{n-1}$ the (n-1)-th element; correspondingly, $y_{1:n-1}$ denotes the 1st through (n-1)-th elements of Y. At the n-th time step in the $p_{\mathrm{att}}$ equation, the probability of $y_{n}$ depends on the previous sequence $y_{1:n-1}$. The final loss function is defined as:
$$\mathcal{L} = \lambda\, \mathcal{L}_{\mathrm{ctc}} + (1-\lambda)\, \mathcal{L}_{\mathrm{att}}$$
$$\mathcal{L}_{\mathrm{ctc}} = \sum_{s=1}^{S} \mathrm{Loss}_{\mathrm{ctc}}\!\left(Y^{s}, R^{\hat{\pi}(s)}\right)$$
$$\mathcal{L}_{\mathrm{att}} = \sum_{s=1}^{S} \mathrm{Loss}_{\mathrm{att}}\!\left(Y^{s}, R^{\hat{\pi}(s)}\right)$$
where $\lambda$ is an interpolation factor with $0 \leq \lambda \leq 1$.
The modification to the attention-based decoder is called speaker-parallel attention. The motivation is to compensate for the separation capability of the encoder and improve the separation performance of the model. The idea is to filter out noisy information through the selective property of attention, with an individual attention module for each stream:
$$c_{n}^{s} = \mathrm{Attention}^{s}\!\left(e_{n-1}^{s}, G^{s}\right)$$
it is claimed that soft targets can provide additional useful information, and thus better performance, than hard targets used in the cross-entropy criterion. This approach can also be used to improve the accuracy of attention-based decoder networks in multi-voice speech recognition tasks. To obtain soft tag vectors, the speech of a single speaker is passed through a model trained on speech containing only one speaker in parallel. The soft tag vector contains supplementary information hidden by overlapping voices as well as insight into the single speaker model with better modeling capabilities.
The model architecture is shown in Fig. 2. The mixed speech and the corresponding single-speaker speech are denoted as O and $O^{s}$ (s = 1, …, S), respectively. The end-to-end teacher model takes the source speech $O^{s}$ as input to compute the teacher's output for each step in the target sequence, and the corresponding output, denoted $\tilde{Y}^{s}$, is taken as the target distribution of the student model. The loss function of teacher-student learning can thus be expressed as follows:
$$\mathcal{L}_{\mathrm{KD}} = \sum_{s=1}^{S} \mathrm{Loss}_{\mathrm{KD}}\!\left(Y^{s}, \tilde{Y}^{\hat{\pi}(s)}\right)$$
where the knowledge distillation loss $\mathrm{Loss}_{\mathrm{KD}}$ following the attention-based decoder is calculated as the cross entropy between the predictions of the student model and the teacher model, and $\hat{\pi}$ is the optimal alignment determined by the CTC loss.
In the method, the attention-based decoder loss $\mathcal{L}_{\mathrm{att}}$ is modified. The new form is a weighted sum of the original cross-entropy (CE) based loss and a KL-divergence based knowledge distillation term, i.e.:
$$\mathcal{L}_{\mathrm{att}} = (1-\eta)\, \mathcal{L}_{\mathrm{att}}^{\mathrm{CE}} + \eta\, \mathcal{L}_{\mathrm{att}}^{\mathrm{KD}}$$
where $\eta$ is the weighting factor.
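An illustrative implementation of this modified loss, assuming per-token student logits and teacher probabilities, might be:

```python
# Illustrative version of the weighted attention-decoder loss above; the use of
# log-softmax with KL divergence is an implementation assumption.
import torch.nn.functional as F

def kd_attention_loss(student_logits, true_labels, teacher_probs, eta=0.5):
    # student_logits: (N, V), true_labels: (N,), teacher_probs: (N, V)
    ce = F.cross_entropy(student_logits, true_labels)               # CE with real labels
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  teacher_probs, reduction="batchmean")             # KD term vs. teacher
    return (1.0 - eta) * ce + eta * kl
```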
In previous approaches, end-to-end multi-speaker ASR systems were trained regardless of the similarity and diversity of the data. However, some studies claim that the order of the data has an effect on the training process, which is referred to as a curriculum learning strategy. It is therefore desirable to find a pattern in the data that makes the training process more stable and improves performance. One observation is that the signal-to-noise ratio (SNR) between overlapping voices has a large impact on separation performance. In utterances with a small SNR, the speech from different speakers is distorted by similar energies. Conversely, a large SNR means the speech is distorted under unbalanced conditions with one dominant voice.
In the present method, attention is paid to the SNR level of the overlapping speech, which is defined as the energy ratio between the source speech of the two speakers. Other factors could also be used, but the method is the same. When generating the mixed speech, the energy ratio is selected randomly to simulate realistic conditions. When the SNR is large, the high-energy speech is clearer, but the lower-energy speech is not recognized well. Conversely, when the SNR is small, each utterance in the mixed speech can be recognized with similar performance, so the model can learn knowledge from each speaker. The training data are rearranged accordingly: at the beginning of training, the minibatches in the training set are iterated in ascending order of speaker SNR, after which training reverts to the randomly ordered training set.
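A minimal sketch of this schedule, assuming minibatches annotated with their speaker SNR, is given below.

```python
# Illustrative curriculum schedule: ascending speaker-SNR order for the first
# epoch(s), then the usual randomly shuffled order. Data layout is assumed.
import random

def curriculum_batches(batches_with_snr, epoch, warmup_epochs=1):
    """batches_with_snr: list of (snr, minibatch) pairs."""
    if epoch < warmup_epochs:
        ordered = sorted(batches_with_snr, key=lambda x: x[0])  # ascending speaker SNR
    else:
        ordered = list(batches_with_snr)
        random.shuffle(ordered)                                  # back to random order
    return [batch for _, batch in ordered]
```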
To verify the effectiveness of the method, a tool published by MERL was used to artificially generate single-channel two-speaker mixed signals based on the Wall Street Journal (WSJ0) speech corpus. Training, development and evaluation data were taken from WSJ0 SI-84, Dev93 and Eval92, respectively, with the following durations: 88.2 hours of training, 1.1 hours of development and 0.9 hours of evaluation.
The input features are 80-dimensional log-Mel filterbank coefficients with a pitch feature for each frame, concatenated with their delta and delta-delta coefficients. All features were extracted using the Kaldi toolkit and normalized to zero mean and unit variance.
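A rough stand-in for this feature pipeline is sketched below; the original used Kaldi, so librosa here is only an illustrative substitute and the pitch feature is omitted.

```python
# Approximate 80-dim log-Mel filterbank + delta + delta-delta features,
# mean/variance normalized (illustrative substitute for the Kaldi pipeline).
import librosa
import numpy as np

def fbank_features(wav, sr=16000):
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=80)
    logmel = librosa.power_to_db(mel)
    d1 = librosa.feature.delta(logmel, order=1)
    d2 = librosa.feature.delta(logmel, order=2)
    feats = np.vstack([logmel, d1, d2]).T               # (frames, 240)
    return (feats - feats.mean(0)) / (feats.std(0) + 1e-8)
```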
In the present method, the different neural network models have the same depth and similar sizes, so their performance is comparable. The encoder consists of two VGG (Visual Geometry Group)-motivated CNN (Convolutional Neural Network) blocks and three BLSTMP (bidirectional long short-term memory with projection) layers, while the decoder network has only one unidirectional long short-term memory (LSTM) layer with 300 units. All networks are built on the ESPnet framework and trained with an AdaDelta optimizer with ρ = 0.95 and ε = 1e-8. During training, the factor λ is set to 0.2.
For teacher-student training, the end-to-end teacher model is first trained on the original clean-speech training data set of WSJ0. In the method, the WER (Word Error Rate) of the teacher model on WSJ0 Dev93 and Eval92 is 8.0% and 2.1%, respectively. The mixed voice data and the corresponding single-speaker voice data are then fed into the teacher-student module at the same time. The best performance is obtained when the weight coefficient η is set to 0.5.
In the decoding phase, the joint CTC/attention scores are combined, in a shallow-fusion fashion, with the scores of a pre-trained word-level RNN language model (RNNLM) with a 1-layer LSTM of 1000 units, trained on the transcripts of WSJ0 SI-84. The beam width is set to 30, the interpolation factor λ used during decoding is 0.3, and the weight of the RNNLM is 1.0.
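An illustrative scoring function for this shallow-fusion decoding, using the quoted weights, might be:

```python
# Illustrative shallow-fusion score for one decoding hypothesis, using the
# weights quoted above (CTC/attention interpolation 0.3, RNNLM weight 1.0).
def hypothesis_score(ctc_logp, att_logp, lm_logp, ctc_weight=0.3, lm_weight=1.0):
    return (ctc_weight * ctc_logp
            + (1.0 - ctc_weight) * att_logp
            + lm_weight * lm_logp)

# During beam search (beam width 30 above), partial hypotheses would be ranked
# by this combined log-probability at each expansion step.
```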
For the teacher-student training and curriculum learning experiments, the baseline end-to-end methods and the performance of the proposed method are first evaluated on the WSJ0 mixed-speech test data sets. The results are presented in the comparison list of end-to-end multi-speaker joint CTC/attention-based encoder-decoder systems shown in Fig. 3. The first approach is a joint CTC/attention-based encoder-decoder network for multi-speaker speech, in which the attention-decoder module is shared between the representations of each speaker. The second approach extends the single attention module to the speaker-parallel attention module. Both methods are considered baseline systems.
Teacher-student training and curriculum learning are then applied step by step. With teacher-student training, the performance of both baseline systems improves on both the development and evaluation data sets. The speaker-parallel attention method achieves an even larger gain, with relative reductions in average WER of 7% and 6% on the dev and eval data sets, respectively. This demonstrates that the speaker-parallel attention method has a stronger ability to eliminate information unrelated to the current individual speaker and can learn better from the attention output distribution of the teacher model. Next, the curriculum learning strategy is applied on top of the teacher-student framework to further improve performance. As can be seen in Fig. 3, the proposed end-to-end approach combining speaker-parallel attention and curriculum learning with teacher-student training significantly improves the performance of two-speaker mixed speech recognition, with relative improvements in WER and CER (Character Error Rate) of over 15%.
To study the influence of the curriculum learning strategy on model performance, different strategies are explored. End-to-end models with teacher-student training and speaker-parallel attention are tested with two different strategies: sorting the training data in ascending order of SNR, and in descending order of SNR. The experimental results are shown in the table of Fig. 4, which lists the performance (average CER & WER) of the different curriculum learning strategies on the 2-speaker mixed WSJ0 corpus test data set.
When the training data are sorted in descending order of SNR, the model performs worse than the model trained in the reverse order, and even worse than the model trained with randomly ordered data, which confirms the method's conjecture. When the SNR is small, the energy difference between the two speakers is subtle and the model learns the separation ability; data with a larger SNR then improve accuracy.
Sequence-level knowledge distillation and curriculum learning techniques are applied to a multi-speaker end-to-end speech recognition system based on the joint CTC/attention encoder-decoder framework. A single-speaker end-to-end speech recognition teacher model is used to compute soft label vectors as target distributions for the final loss function. To make full use of the training data, the data are further rearranged in ascending order of SNR.
Fig. 5 is a schematic structural diagram of an optimization system for a single-channel speech recognition model according to an embodiment of the present invention, which can execute the optimization method for a single-channel speech recognition model according to any of the above embodiments and is configured in a terminal.
The embodiment provides an optimization system for a single-channel speech recognition model, which comprises: a target soft tag determination program module 11, an output alignment determination program module 12, a loss determination program module 13, and an optimization program module 14.
The target soft tag determining program module 11 is configured to receive single voices with respective real tag vectors and multi-person mixed voices synthesized by the single voices, and input voice features extracted from the single voices to a target teacher model to obtain target soft tag vectors corresponding to the single voices; the output arrangement determining program module 12 is configured to input the multi-person mixed voice to an end-to-end student model, output an output tag vector of each person in the multi-person mixed voice, pair the output tag vector of each person in the multi-person mixed voice with a real tag vector of each single person voice by a permutation invariance method (PIT), and determine an output arrangement of the output tag vector of each person in the multi-person mixed voice; the loss determination program module 13 is configured to determine a knowledge distillation loss with each target soft tag vector and a direct loss with each single voice true tag vector according to the output tag vector of each person in the multi-person mixed voice which is determined to be output and arranged by pairing; the optimization program module 14 is configured to, when a joint error determined according to the knowledge distillation loss and the direct loss does not converge, perform back propagation on the end-to-end student model according to the joint error to update the end-to-end student model until the joint error converges, and determine an optimized voice recognition student model for a single channel.
Further, the output arrangement determining program module is for:
carrying out feature projection on the voice features of the multi-person mixed voice through a trained neural network in the end-to-end student model, and dividing out the voice features of each person in the multi-person mixed voice;
determining, by an encoder within the end-to-end student model, a corresponding Connectionist Temporal Classification (CTC) score based on the voice characteristics of each person;
transforming, by a decoder within the end-to-end student model, the feature arrangement corresponding to the minimum Connectionist Temporal Classification (CTC) score into a corresponding output label vector; and mapping the label vectors through a dictionary to obtain a corresponding text sequence.
Further, the output arrangement determining program module is further configured to:
through an attention module in the end-to-end student model, performing further feature extraction on the voice features of each person in the multi-person mixed voice and determining a corresponding attention score, so that the multi-person mixed voice is temporally aligned with the individual output label vectors.
Further, the output arrangement determining program module is further configured to:
determining a joint score of each feature arrangement by weighting the Connectionist Temporal Classification (CTC) score and the attention score corresponding to each feature arrangement according to a preset recognition mode;
and converting the feature arrangement with the minimum score among the joint scores into the corresponding output label vectors.
Further, the output arrangement determining program module is further configured to:
and acquiring relevant information of each person in the multi-person mixed voice through a speaker adaptation module newly added to the end-to-end student model, so as to determine the voice characteristics of each person and additionally determine the context variable of each person.
Further, the target soft tag determination program module is configured to:
determining the signal-to-noise ratio of each single voice through the voice and background noise of the person in each single voice;
and sequencing the multi-person mixed voice data according to the magnitude of the signal-to-noise ratio so as to achieve the progressive optimization of the voice recognition model.
The embodiment of the invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores computer executable instructions which can execute the optimization method for the single-channel speech recognition model in any method embodiment;
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
receiving single voice with real label vectors and multi-person mixed voice synthesized by the single voice, and respectively inputting voice characteristics extracted from the single voice into a target teacher model to obtain target soft label vectors corresponding to the single voice;
inputting the multi-person mixed voice into an end-to-end student model, outputting output label vectors of each person in the multi-person mixed voice, pairing the output label vectors of each person in the multi-person mixed voice with real label vectors of each single voice through a permutation invariance method (PIT), and determining output arrangement of the output label vectors of each person in the multi-person mixed voice;
determining knowledge distillation loss with each target soft tag vector and direct loss with each single voice real tag vector according to the output tag vector of each person in the multi-person mixed voice which is determined to be output and arranged after pairing;
and when the joint error determined according to the knowledge distillation loss and the direct loss is not converged, performing back propagation on the end-to-end student model according to the joint error to update the end-to-end student model until the joint error is converged, and determining an optimized voice recognition student model for a single channel.
As a non-volatile computer readable storage medium, may be used to store non-volatile software programs, non-volatile computer executable programs, and modules, such as program instructions/modules corresponding to the methods of testing software in embodiments of the present invention. One or more program instructions are stored in a non-transitory computer readable storage medium, which when executed by a processor, perform the optimization method for a single-channel speech recognition model in any of the method embodiments described above.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of a device of test software, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the means for testing software over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method for optimizing a speech recognition model for a single channel of any of the embodiments of the present invention.
The client of the embodiment of the present application exists in various forms, including but not limited to:
(1) mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones, multimedia phones, functional phones, and low-end phones, among others.
(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as tablet computers.
(3) Portable entertainment devices such devices may display and play multimedia content. The devices comprise audio and video players, handheld game consoles, electronic books, intelligent toys and portable vehicle-mounted navigation devices.
(4) Other electronic devices with voice recognition capabilities.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for optimizing a speech recognition model for a single channel, comprising:
receiving single voice with real label vectors and multi-person mixed voice synthesized by the single voice, and respectively inputting voice characteristics extracted from the single voice into a target teacher model to obtain target soft label vectors corresponding to the single voice;
inputting the multi-person mixed voice into an end-to-end student model, outputting output label vectors of each person in the multi-person mixed voice, pairing the output label vectors of each person in the multi-person mixed voice with real label vectors of each single voice through a permutation invariance method (PIT), and determining output arrangement of the output label vectors of each person in the multi-person mixed voice;
determining knowledge distillation loss with each target soft tag vector and direct loss with each single voice real tag vector according to the output tag vector of each person in the multi-person mixed voice which is determined to be output and arranged after pairing;
and when the joint error determined according to the knowledge distillation loss and the direct loss is not converged, performing back propagation on the end-to-end student model according to the joint error to update the end-to-end student model until the joint error is converged, and determining the optimized voice recognition student model for the single channel.
2. The method of claim 1, wherein the inputting the multi-person mixed speech to an end-to-end student model comprises:
carrying out feature projection on the voice features of the multi-person mixed voice through a trained neural network in the end-to-end student model, and dividing out the voice features of each person in the multi-person mixed voice;
determining, by an encoder within the end-to-end student model, a corresponding Connectionist Temporal Classification (CTC) score based on the voice characteristics of each person;
transforming, by a decoder within the end-to-end student model, the feature arrangement corresponding to the minimum Connectionist Temporal Classification (CTC) score into a corresponding output label vector; and mapping the label vectors through a dictionary to obtain a corresponding text sequence.
3. The method of claim 2, wherein after said demarcating voice characteristics of each person within the multi-person mixed voice, the method further comprises:
performing, through an attention module in the end-to-end student model, further feature extraction on the voice features of each person in the multi-person mixed voice and determining a corresponding attention score, so that the multi-person mixed voice is temporally aligned with the individual output label vectors.
4. The method of claim 3, wherein after said determining the corresponding attention score, the method further comprises:
determining a joint score of each feature arrangement by weighting the Connectionist Temporal Classification (CTC) score and the attention score corresponding to each feature arrangement according to a preset recognition mode;
and converting the feature arrangement with the minimum score among the joint scores into the corresponding output label vectors.
5. The method of claim 2, wherein after said feature projecting voice features of the multi-person mixed voice through a trained neural network within the end-to-end student model, the method further comprises:
and acquiring relevant information of each person in the multi-person mixed voice through a speaker adaptation module newly added to the end-to-end student model, so as to determine the voice characteristics of each person and additionally determine the context variable of each person.
6. The method of claim 1, wherein after receiving the single-speaker speech samples, each with a ground-truth label vector, the method further comprises:
determining the signal-to-noise ratio of each single-speaker sample from the speaker's speech and the background noise in that sample;
and ordering the multi-speaker mixed speech data according to the magnitude of the signal-to-noise ratio, so as to optimize the speech recognition model progressively.
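Illustrative sketch (not part of the claims): the signal-to-noise ratio can be estimated from the clean speech and background-noise waveforms, and the mixtures then ordered from highest SNR (easiest) to lowest (hardest); the power-based formula and the eps floor are assumptions of the sketch.

import numpy as np

def snr_db(speech, noise, eps=1e-8):
    # Signal-to-noise ratio in dB from speech and background-noise waveforms.
    return 10.0 * np.log10(np.sum(speech ** 2) / (np.sum(noise ** 2) + eps))

def order_by_snr(mixtures):
    # mixtures: list of dicts holding 'speech' and 'noise' arrays for each sample.
    # Sorted from easiest (highest SNR) to hardest, for progressive optimization.
    return sorted(mixtures, key=lambda m: snr_db(m["speech"], m["noise"]), reverse=True)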
7. The method of claim 1, wherein determining the joint error from the knowledge distillation loss and the direct loss comprises:
weighting and summing the knowledge distillation loss and the direct loss according to a preset training mode to determine the joint error.
8. An optimization system for a single-channel speech recognition model, comprising:
a target soft label determining program module, configured to receive single-speaker speech samples, each with a ground-truth label vector, together with multi-speaker mixed speech synthesized from those single-speaker samples, and to input the speech features extracted from each single-speaker sample into a target teacher model to obtain the target soft label vector corresponding to that sample;
an output permutation determining program module, configured to input the multi-speaker mixed speech into an end-to-end student model to output an output label vector for each speaker in the mixed speech, to pair each speaker's output label vector with the ground-truth label vector of the corresponding single-speaker sample by permutation invariant training (PIT), and to determine the output permutation of the speakers' output label vectors;
a loss determining program module, configured to compute, for the output label vectors arranged in the determined output permutation, a knowledge distillation loss against the target soft label vectors and a direct loss against the ground-truth label vectors of the single-speaker samples;
and an optimization program module, configured to back-propagate the joint error determined from the knowledge distillation loss and the direct loss through the end-to-end student model when the joint error has not converged, so as to update the end-to-end student model until the joint error converges and the optimized single-channel speech recognition student model is obtained.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-7.
10. A storage medium on which a computer program is stored which, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
CN201910511791.7A 2019-06-13 2019-06-13 Optimization method and system for single-channel speech recognition model Active CN110246487B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910511791.7A CN110246487B (en) 2019-06-13 2019-06-13 Optimization method and system for single-channel speech recognition model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910511791.7A CN110246487B (en) 2019-06-13 2019-06-13 Optimization method and system for single-channel speech recognition model

Publications (2)

Publication Number Publication Date
CN110246487A CN110246487A (en) 2019-09-17
CN110246487B true CN110246487B (en) 2021-06-22

Family

ID=67886903

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910511791.7A Active CN110246487B (en) 2019-06-13 2019-06-13 Optimization method and system for single-channel speech recognition model

Country Status (1)

Country Link
CN (1) CN110246487B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110852390A (en) * 2019-11-13 2020-02-28 山东师范大学 Student score classification prediction method and system based on campus behavior sequence
CN111062489B (en) * 2019-12-11 2023-10-20 北京知道创宇信息技术股份有限公司 Multi-language model compression method and device based on knowledge distillation
CN111179911B (en) * 2020-01-02 2022-05-03 腾讯科技(深圳)有限公司 Target voice extraction method, device, equipment, medium and joint training method
CN111199727B (en) * 2020-01-09 2022-12-06 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal and storage medium
CN111261140B (en) * 2020-01-16 2022-09-27 云知声智能科技股份有限公司 Rhythm model training method and device
CN111341341B (en) * 2020-02-11 2021-08-17 腾讯科技(深圳)有限公司 Training method of audio separation network, audio separation method, device and medium
CN111048064B (en) * 2020-03-13 2020-07-07 同盾控股有限公司 Voice cloning method and device based on single speaker voice synthesis data set
CN111506702A (en) * 2020-03-25 2020-08-07 北京万里红科技股份有限公司 Knowledge distillation-based language model training method, text classification method and device
CN111768762B (en) * 2020-06-05 2022-01-21 北京有竹居网络技术有限公司 Voice recognition method and device and electronic equipment
CN111696519A (en) * 2020-06-10 2020-09-22 苏州思必驰信息科技有限公司 Method and system for constructing acoustic feature model of Tibetan language
CN111554268B (en) * 2020-07-13 2020-11-03 腾讯科技(深圳)有限公司 Language identification method based on language model, text classification method and device
CN111899727B (en) * 2020-07-15 2022-05-06 思必驰科技股份有限公司 Training method and system for voice recognition model of multiple speakers
CN112070233B (en) * 2020-08-25 2024-03-22 北京百度网讯科技有限公司 Model joint training method, device, electronic equipment and storage medium
CN111933121B (en) * 2020-08-31 2024-03-12 广州市百果园信息技术有限公司 Acoustic model training method and device
CN112233655A (en) * 2020-09-28 2021-01-15 上海声瀚信息科技有限公司 Neural network training method for improving voice command word recognition performance
CN112365885B (en) * 2021-01-18 2021-05-07 深圳市友杰智新科技有限公司 Training method and device of wake-up model and computer equipment
CN112365886B (en) * 2021-01-18 2021-05-07 深圳市友杰智新科技有限公司 Training method and device of speech recognition model and computer equipment
CN113609965B (en) * 2021-08-03 2024-02-13 同盾科技有限公司 Training method and device of character recognition model, storage medium and electronic equipment
CN113380268A (en) * 2021-08-12 2021-09-10 北京世纪好未来教育科技有限公司 Model training method and device and speech signal processing method and device
CN113707123B (en) * 2021-08-17 2023-10-20 慧言科技(天津)有限公司 Speech synthesis method and device
CN113782006A (en) * 2021-09-03 2021-12-10 清华大学 Voice extraction method, device and equipment
CN116805004B (en) * 2023-08-22 2023-11-14 中国科学院自动化研究所 Zero-resource cross-language dialogue model training method, device, equipment and medium
CN117351997B (en) * 2023-12-05 2024-02-23 清华大学 Synthetic audio detection method and system based on reverse knowledge distillation

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108389576A (en) * 2018-01-10 2018-08-10 苏州思必驰信息科技有限公司 The optimization method and system of compressed speech recognition modeling
CN108986788A (en) * 2018-06-06 2018-12-11 国网安徽省电力有限公司信息通信分公司 A kind of noise robust acoustic modeling method based on aposterior knowledge supervision
CN109616105A (en) * 2018-11-30 2019-04-12 江苏网进科技股份有限公司 A kind of noisy speech recognition methods based on transfer learning
CN109711544A (en) * 2018-12-04 2019-05-03 北京市商汤科技开发有限公司 Method, apparatus, electronic equipment and the computer storage medium of model compression
CN109637546A (en) * 2018-12-29 2019-04-16 苏州思必驰信息科技有限公司 Knowledge distillating method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Chang Xuankai et al.; "Adaptive Permutation Invariant Training with Auxiliary Information for Monaural Multi-Talker Speech Recognition"; 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2018-04-20; pp. 5974-5978 *
Chang Xuankai et al.; "End-to-End Monaural Multi-Speaker ASR System without Pretraining"; International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2019-05-17; pp. 6256-6260 *
Qian Yanmin et al.; "Single-channel multi-talker speech recognition with permutation invariant training"; Speech Communication; 2018-11-30; pp. 1-11 *

Also Published As

Publication number Publication date
CN110246487A (en) 2019-09-17

Similar Documents

Publication Publication Date Title
CN110246487B (en) Optimization method and system for single-channel speech recognition model
CN109637546B (en) Knowledge distillation method and apparatus
CN111899727B (en) Training method and system for voice recognition model of multiple speakers
US20200402497A1 (en) Systems and Methods for Speech Generation
CN108922518B (en) Voice data amplification method and system
CN110706692B (en) Training method and system of child voice recognition model
CN111081259B (en) Speech recognition model training method and system based on speaker expansion
Li et al. Developing far-field speaker system via teacher-student learning
CN105741832B (en) Spoken language evaluation method and system based on deep learning
CN110459240A (en) The more speaker's speech separating methods clustered based on convolutional neural networks and depth
CN111862942B (en) Method and system for training mixed speech recognition model of Mandarin and Sichuan
CN111243576A (en) Speech recognition and model training method, device, equipment and storage medium
CN107871496B (en) Speech recognition method and device
CN110600013B (en) Training method and device for non-parallel corpus voice conversion data enhancement model
Liu et al. End-to-end accent conversion without using native utterances
Du et al. Speaker augmentation for low resource speech recognition
Zhang et al. Improving end-to-end single-channel multi-talker speech recognition
CN103594087A (en) Method and system for improving oral evaluation performance
CN111667728B (en) Voice post-processing module training method and device
CN111862934A (en) Method for improving speech synthesis model and speech synthesis method and device
CN109559749A (en) Combined decoding method and system for speech recognition system
Park et al. Unsupervised data selection for speech recognition with contrastive loss ratios
Tao et al. DNN Online with iVectors Acoustic Modeling and Doc2Vec Distributed Representations for Improving Automated Speech Scoring.
CN110597958A (en) Text classification model training and using method and device
CN114267334A (en) Speech recognition model training method and speech recognition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20200616
Address after: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.
Applicant after: AI SPEECH Ltd.
Applicant after: Shanghai Jiaotong University Intellectual Property Management Co.,Ltd.
Address before: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.
Applicant before: AI SPEECH Ltd.
Applicant before: SHANGHAI JIAO TONG University
TA01 Transfer of patent application right
Effective date of registration: 20201028
Address after: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.
Applicant after: AI SPEECH Ltd.
Address before: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.
Applicant before: AI SPEECH Ltd.
Applicant before: Shanghai Jiaotong University Intellectual Property Management Co.,Ltd.
CB02 Change of applicant information
Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province
Applicant after: Sipic Technology Co.,Ltd.
Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province
Applicant before: AI SPEECH Ltd.
GR01 Patent grant