CN108389576A

CN108389576A - The optimization method and system of compressed speech recognition modeling

Info

Publication number: CN108389576A
Application number: CN201810021903.6A
Authority: CN
Inventors: 钱彦旻; 游永彬; 陈哲怀; 黄明坤
Original assignee: Shanghai Jiaotong University; Suzhou Speech Information Technology Co Ltd
Current assignee: Sipic Technology Co Ltd
Priority date: 2018-01-10
Filing date: 2018-01-10
Publication date: 2018-08-10
Anticipated expiration: 2038-01-10
Also published as: CN108389576B

Abstract

The embodiment of the present invention provides a kind of optimization method of compressed speech recognition modeling.This method includes：Based on the speech recognition modeling before compression, tutor model is determined, based on the voice data without mark in compressed speech recognition modeling and speech database, generate student model；Speech-sound data sequence of the extraction with mark determines the first posterior probability of student model as training data set by training data set to student model neural network propagated forward in speech database；Forward-backward algorithm calculating is carried out to tutor model by training data set, determines the second posterior probability of tutor model；Compare first and second posterior probability, determines the error of student model and tutor model；When error does not restrain, neural network backpropagation is carried out to optimize student model to student model according to error.The embodiment of the present invention also provides a kind of optimization system of compressed speech recognition modeling.The embodiment of the present invention optimizes model after compression according to source model.

Description

The optimization method and system of compressed speech recognition modeling

Technical field

The present invention relates to the optimization method of field of speech recognition more particularly to a kind of compressed speech recognition modeling and it is System.

Background technology

Speech recognition is the artificial intelligence technology for allowing equipment that voice signal is changed into corresponding text or order.By right The deep learning of speech recognition modeling makes the accuracy of speech recognition obtain very big promotion.

Although deep learning ensure that the accuracy of speech recognition performance, the quantity of parameters in model occupy largely Memory space.On the one hand, the speech recognition modeling needs of big parameter type largely calculate and occupy a large amount of memory sources, another The speech recognition modeling speed of service of aspect, big parameter type is slower.These factors hampers are on the limited system top of resource Affix one's name to the speech recognition modeling of this big parameter type.

In order to allow the limited system of resource that can dispose the speech recognition modeling of big parameter type, usually using model compression Technology compresses the speech recognition modeling of big parameter type, to the speech recognition modeling of compressed big parameter type It can be disposed in the limited system of resource.

In realizing process of the present invention, inventor has found that at least there are the following problems in the related technology：

Using model compression technology, compressed speech recognition modeling generally can not preferably retain the extensive of the preceding model of compression Ability and model accuracy.

Invention content

It generally can not preferably retain the preceding model of compression at least solve compressed speech recognition modeling in the prior art Generalization ability and model accuracy.Those skilled in the art would generally use：To non-optimal in speech recognition modeling after compression Model structure on be finely adjusted to improve compression performance.Applicant is found surprisingly that the transfer learning method pair based on sequence-level Compressed model utilizes neural network propagated forward, to-backcasting before being carried out to the data model before compression, to determine two The sequence error of person carries out neural network backpropagation by using the sequence error to compressed speech recognition modeling, To update compressed speech recognition modeling, updated speech recognition modeling is made mutually to be received with the speech recognition modeling before update It holds back, to solve the above problems.

In a first aspect, the embodiment of the present invention provides a kind of optimization method of compressed speech recognition modeling, including：

Based on the speech recognition modeling before compression, tutor model is determined, be based on compressed speech recognition modeling and voice Voice data without mark in database generates student model；

Speech-sound data sequence of the extraction part with mark passes through institute as training data set in the speech database It states training data set and neural network propagated forward is carried out to the student model, with the First ray of the determination student model Posterior probability；

To-backcasting before being carried out to the tutor model by the training data set, the tutor model is determined The second sequence posterior probability；

Compare the First ray posterior probability and the second sequence posterior probability, with the determination student model and institute State the sequence error of tutor model；

When the sequence error does not restrain, it is reversed that neural network is carried out to the student model according to the sequence error It propagates, to update the student model, generates optimization student model.

Second aspect, the embodiment of the present invention provide a kind of optimization system of compressed speech recognition modeling, including：

Model determines program module, for based on the speech recognition modeling before compression, determining tutor model, after compression Speech recognition modeling and speech database in the voice data without mark, generate student model；

First ray posterior probability determines program module, for language of the extraction part with mark in the speech database Sound data sequence as training data set, by the training data set to the student model carry out before neural network to It propagates, with the First ray posterior probability of the determination student model；

Second sequence posterior probability determines program module, for by the training data set to the tutor model into To-backcasting before row, the second sequence posterior probability of the tutor model is determined；

Sequence error determines program module, general for the First ray posterior probability and the second sequence posteriority Rate, with the sequence error of the determination student model and the tutor model；

Model optimization program module, for when the sequence error does not restrain, according to the sequence error to Raw model carries out neural network backpropagation, to update the student model, generates optimization student model.

The third aspect provides a kind of electronic equipment comprising：At least one processor, and at least one place Manage the memory of device communication connection, wherein the memory is stored with the instruction that can be executed by least one processor, institute It states instruction to be executed by least one processor, so that at least one processor is able to carry out any embodiment of the present invention Compressed speech recognition modeling optimization method the step of.

Fourth aspect, the embodiment of the present invention provide a kind of storage medium, are stored thereon with computer program, feature exists In the optimization method of the compressed speech recognition modeling of realization any embodiment of the present invention when the program is executed by processor Step.

The advantageous effect of the embodiment of the present invention is：

Pass through the sequence posteriority that the sequence posterior probability of student model is gradually passed through into Forward-backward algorithm to tutor model Convergence in probability, to make student model learn to tutor model.And student model study can be made to mould by forward-backward algorithm Each correct state occupation probability in type.It is effectively extensive to data progress using large-sized model, it can more acquire beneficial to phoneme classification Model table is sought peace modeling ability, while effectively moving to the modeling ability of large-sized model in mini Mod, due to the knot of mini Mod Structure customizable degree is high, is easy to change.In the case where losing less model accuracy, use above field on the one hand can be made Speech recognition speed in scape is greatly speeded up, and on the other hand can greatly reduce the calculating of speech recognition algorithm and memory source disappears Consumption.

Description of the drawings

In order to more clearly explain the embodiment of the invention or the technical proposal in the existing technology, to embodiment or will show below There is attached drawing needed in technology description to be briefly described, it should be apparent that, the accompanying drawings in the following description is this hair Some bright embodiments for those of ordinary skill in the art without creative efforts, can be with root Other attached drawings are obtained according to these attached drawings.

Fig. 1 is a kind of flow chart of the optimization method for compressed speech recognition modeling that one embodiment of the invention provides；

Fig. 2 is a kind of flow of the optimization method for compressed speech recognition modeling that another embodiment of the present invention provides Figure；

Fig. 3 is a kind of optimization method for compressed speech recognition modeling that one embodiment of the invention provides based on CTC Performance of the knowledge refinement on interchanger corpus compare table data figure；

Fig. 4 is a kind of optimization method for compressed speech recognition modeling that one embodiment of the invention provides based on CTC Knowledge refinement compared with the performance of Chinese corpus table data figure；

Fig. 5 is a kind of knowledge refinement of the optimization method for compressed speech recognition modeling that one embodiment of the invention provides CTC afterwards finely tunes table data figure；

Fig. 6 is a kind of use label of the optimization method for compressed speech recognition modeling that one embodiment of the invention provides The knowledge refinement table data figure of data and Unlabeled data；

Fig. 7 is a kind of structural representation of the optimization system for compressed speech recognition modeling that one embodiment of the invention provides Figure.

Specific implementation mode

In order to make the object, technical scheme and advantages of the embodiment of the invention clearer, below in conjunction with the embodiment of the present invention In attached drawing, technical scheme in the embodiment of the invention is clearly and completely described, it is clear that described embodiment is A part of the embodiment of the present invention, instead of all the embodiments.Based on the embodiments of the present invention, those of ordinary skill in the art The every other embodiment obtained without creative efforts, shall fall within the protection scope of the present invention.

A kind of stream of the optimization method of the compressed speech recognition modeling provided as shown in Figure 1 for one embodiment of the invention Cheng Tu includes the following steps：

S11：Based on the speech recognition modeling before compression, determine tutor model, based on compressed speech recognition modeling and Voice data without mark in speech database generates student model；

S12：Speech-sound data sequence of the extraction part with mark leads to as training data set in the speech database It crosses the training data set and neural network propagated forward is carried out to the student model, with the first of the determination student model Sequence posterior probability；

S13：To-backcasting before being carried out to the tutor model by the training data set, the teacher is determined Second sequence posterior probability of model；

S14：Compare the First ray posterior probability and the second sequence posterior probability, with the determination student model With the sequence error of the tutor model；

S15：When the sequence error does not restrain, neural network is carried out to the student model according to the sequence error Backpropagation generates optimization student model to update the student model.

In the present embodiment, to the speech recognition modeling of big parameter compression can use transfer learning (knowledge refinement) or TS (teacher-student, teachers and students) training.Such as by taking teachers and students train as an example, it using the knowledge acquired from tutor model come The convergence for helping student model faster and better, in this way, traditional DNN (deep neural networks, depth god Through network) in knowledge less, the lower model of system performance requirements that can be refined into a parameter.

For step S11, voice data without mark and the life of compressed speech recognition modeling of small parameter amount are first passed through At student model, according to the audio of the small parameter amount, student model is generated to compressed speech recognition modeling and is trained, So that it is determined that going out the student model tentatively optimized, the speech recognition modeling before compression is determined as tutor model, determines teacher After model, it can be trained according to TS training methods among the above.

For step S12, voice data of the extraction part with mark, which is contributed a foreword, in the speech database is classified as trained number According to set, which part can be the audio data sequence of small parameter amount, for example, all voice data in training data set The length of sequence is 60 frames, before carrying out neural network to the student model that step S11 is determined by the training data set To propagation, so that it is determined that going out all audio data sequences in training data set, the posteriority of each speech data frame is general in 60 frames Rate, using the posterior probability of each frame in its each sequence as First ray posterior probability.

For step S13, the training data set determined by step S12 carries out the tutor model that step S11 is determined Forward direction-backcasting, so that it is determined that go out tonic train all in training data set, each speech data frame in 60 frames Posterior probability, using the posterior probability of each frame in its each sequence as the second sequence posterior probability.

For step S14, the second sequence that the First ray posterior probability determined in step S12 and step S13 are determined Posterior probability carries out sequence comparison, for example, First ray posterior probability and the second sequence posterior probability are subtracted each other frame by frame, to really Make the sequence error of student model and tutor model；

Step S15 is missed when the sequence error that detecting step S14 is determined does not reach convergence according to the sequence Difference to carry out neural network backpropagation to student model, to update student model, generates optimization student model.

As an implementation, in the present embodiment, continue the sequence posterior probability of optimization student model described in comparison Stop the optimization when sequence error restrains with the sequence error of the sequence posterior probability of the tutor model and learn The update of raw model.

In the present embodiment, the sequence posterior probability of the sequence posterior probability and tutor model of optimization student model is determined Sequence error, if do not reach convergence, repeat according to the sequence error come to student model carry out neural network it is anti- To propagation, to update student model, until the sequence error of updated update student model and the tutor model reaches receipts It holds back, stops optimization.

It can be seen that the sequence posterior probability of the student model of optimization and the sequence of tutor model by the implementation Posterior probability is compared, and when the sequence error of student model and tutor model restrains, optimization terminates, to the effect of optimization into Row limits, and prevents its unlimited repeated optimization, is absorbed in endless loop.

As an implementation, in the present embodiment, it is described by the training data set to the student model Carry out neural network propagated forward and it is described the tutor model is carried out by the training data set before to-backward meter It is parallel processing.

In the present embodiment, by taking training data set carries out neural network propagated forward to the student model as an example. Training data set is combined by extracting voice data of the part with mark in speech database.Such as training data set In have 10 section audio data sequence A, B, C, D, E, F, G, H, I, J, corresponding to voice data length be respectively 22 frames, 17 Frame, 16 frames, 17 frames, 17 frames, 12 frames, 12 frames, 19 frames, 24 frames, 17 frames.

Such as train equipment function can simultaneously 3 sequences of parallel training, then by before and after these audio data sequences Connect, is combined into three unduplicated long sequences.So that the length of final long sequence is substantially close.By this 10 section audio data sequence Row are divided into following three long sequences：A-B-C, D-E-F-G, H-I-J, length are respectively：55 frames, 58 frames, 60 frames.

Later by the length polishing of this three long sequences, for example, 5 frame, 0 data are mended at the end of long sequence A-B-C, in length Mend 2 frame, 0 data in the end of sequence D-E-F-G.Training data set is divided into the long sequence that 3 groups of isometric length are 60 frames in this way Row.

Every time when training, 1 frame in each long sequence is taken to be trained parallel, this training there are 3 long sequences, just instructs every time Practice 3 frames.After training, the predicted value and hidden layer numerical information corresponding to this 3 frame are cached, for each sequence next frame train when It uses.It is finished until by the training of this 60 frame.

It can be seen that by using parallel processing by the implementation, trained efficiency accelerated, it is excellent to improve The efficiency of change.

A kind of optimization method of compressed speech recognition modeling of another embodiment of the present invention offer is provided Flow chart includes the following steps：

S21：Using the compressed speech recognition modeling as source model；

S22：A part of voice data without mark is chosen from speech database as pre-training data acquisition system, passes through institute It states pre-training data acquisition system to train the source model frame by frame, to generate student model.

For step S21, using compressed speech recognition modeling as source model；

For step S22, a part of voice data without mark is chosen from speech database as pre-training data set It closes, by each voice data in pre-training data acquisition system, neural network propagated forward is carried out to source model and tutor model, from And determine the posterior probability frame by frame of source model and tutor model；

By determining the error frame by frame of the posterior probability frame by frame of source model and the posterior probability frame by frame of tutor model, work as determination Error frame by frame when not reaching convergence, neural network backpropagation is carried out to source model according to the error frame by frame, with more New source model generates student model when the error convergence frame by frame of updated source model and tutor model.

The data using no mark are can be seen that compressed speech recognition modeling to teacher's mould by the implementation Type learns, and when compressed modelling effect is poor, can improve the effect of optimization.

Overall description is carried out to the scheme below：Transfer learning used in above-described embodiment (knowledge refinement) or teachers and students (TS) training is a kind of machine learning normal form, and which show the potentiality of model compression.That is, it is utilized from tutor model To knowledge help the student model quickly to restrain or have better solution.In this way, in traditional DNN Knowledge can be refined into a parameter is less, system performance compared with relatively narrow and shallower model.

For training of students model, the KLD (Kullback-Leibler between the reasoning distribution of Faculty and Students' model Divergence, relative entropy, also known as KL divergences) be minimized it is as follows：

Wherein y^(T)And y^(S)It is the distribution of Faculty and Students respectively.Because the first item and student model in above-mentioned formula are excellent Change it is unrelated, so only Section 2 is for optimizing：

CE (cross in above-mentioned formula ASR (automatic speech recognition, automatic speech recognition) Entropy, cross entropy) standard, the soft distribution replacement that hard label is inferred by tutor model from source data.

In ASR, run with frame level.The prediction of Faculty and Students' model in each frame is by identical characteristic It provides.Then it is summed to the frame level KLD of each sequence and is optimized (by taking a training sequence as an example)：

Wherein, y in above-mentioned formula^(T)And y^(S)It is the reasoning distribution of t frames tutor model and student model respectively.However ASR is substantially a sequence mark problem, but traditional method, which is all frame level, not to be run, but tutor model can be logical The discriminate for crossing sequence level is trained.And the method ganged up is using the data transcribed and do not transcribed as Faculty and Students' mould The input of type is uniformly processed.Although a large amount of using may only be partly useful in the data that do not transcribe, if abandoning one If divided data, performance can be reduced.

It is passed by introducing CTC (Connectionist Temporal Classification connect temporal model) The knowledge refinement frame by frame of system and the list type knowledge refinement for being based on MAP (maximum a posterior, maximum a posteriori).

CTC object functionsIt is defined as the negative logarithm conditional probability of the correct labeling of all training sequences

Wherein, in above-mentioned formula, n is the index of training sequence.Wherein, l indicates phoneme notation sequence, and x is corresponding spy Levy sequence.It may be calculated (for a training sequence) relative to every frame：

Wherein, in above-mentioned formula,_yktIt is the output of k-th of phoneme of t frames.P (l | x) it can be by preceding to-calculation afterwards Method effectively calculates：

Wherein, in above-mentioned formula, l' is the modified flags sequence of aligned phoneme sequence l, in the beginning and end of l and each pair of Add blank in gap between adjacent marker.| l'| is length.J indicates the length of modified flag sequence.α_t(j) and β_t(j) The length for being time t is the forward and backward probability of j.Wherein, a is activated_ktFormula it is as follows：

The location sets for marking k to occur in l' in above-mentioned formula are defined as { l'_j=k }.σ_CTC(k, t) is The posterior probability of k-th of phoneme of t frames.Utilize y_ktAnd σ_CTCError signal between (k, t), backpropagation can be used for deriving god Parameter gradients through network.This method is using unidirectional LSTM (long short term memory, shot and long term memory).

Following F-TS-CTC (Frame-wise Knowledge can be expanded to by knowledge refinement frame by frame The framework knowledge of Distillation for CTC, CTC refine)：

Wherein, in above-mentioned formula, y_ktWithIt is the output of k-th of phoneme of student and tutor model t frames respectively.

F-TS-CTC training process is as follows：

(a) a large-scale CTC tutor model is trained with standardization program.

(b) for each small batch data, Faculty and Students' model carries out propagated forward to obtain y_ktWith

(c) pass throughError signal, and backpropagation only is carried out to student model.

(d) step b to d is repeated, until convergence.

S-TS-CTC (Sequence-wise are determined by the series model knowledge refinement of MAP (maximum a posteriori) Knowledge Distillation for CTC's, CTC refines by sequence knowledge).

The posterior probability of student model is optimized still according to above-mentioned formula.It is distributed y from inferring_ktIt (is used for supervision Corresponding phoneme posterior probability (or state acquistion probability) σ of more new model_CTC(k, t)) obtain error signal.With cross entropy In the conventional depth learning model of standard exercise, the quality of supervision is always most important to performance.By legacy system and it is based on CTC Several work for being compared of system also indicate that, σ_CTCThe quality of (k, t) is the bottleneck of convergence rate and performance.

Here, the posterior probability σ obtained from student model_CTCThe σ that (k, t) is obtained from tutor model_CTC ^(T)(k,t) Instead of：

The distribution in above-mentioned formula using tutor model can be used to calculate.In this way, from better tutor model The supervision of acquisition can improve the convergence of student model.In addition, by using tutor model, student model can simulate it and push away Reason distribution, alleviates the burden of directly alignment study and model generalization, and parameter is limited.

S-TS-CTC training process is as follows：

(a) a large-scale CTC tutor model is trained with standardization program.

(b) for each small batch data, Faculty and Students' model propagated forward is to obtain

(c) to-backcasting before being carried out to each sequence in tutor model, to obtain

(d) according toError signal, and carry out backpropagation for student model.

(e) step b to e is repeated, until convergence.

For this method compared with knowledge refinement frame by frame, a crucial difference is that transcription is utilized：Student model is optimized to Simulate the deduction distribution of tutor model, either correct or wrong transcription.However, S-TS-CTC forces student model only Correctly learnt by the state occupation probability obtained from forward direction-backcasting that teacher is distributed transcription.

In the GMMHMM (Gaussian based on EM (expectation maximization, expectation maximization) Mixture Model Hidden Markov Model, gauss hybrid models hidden Markov model) it applies in parameter Estimation Similar strategy, referred to as bimodulus revaluation.EM algorithm iterations alternately, estimate the state for giving "current" model parameter between E steps Rank is aligned and acquistion probability, and estimates the M steps of new parameter by maximizing posteriority acquistion probability, until algorithmic statement Or other stopping criterions are satisfied.In fact, when wanting newer model training poor, the alignment of poor Status Level causes Poor parameter Estimation.Bimodulus revaluation obtains the alignment of Status Level using a well-trained model, which is used for Update the parameter of second model.The strategy contributes to model to restrain, and motivation is similar with the method proposed.The side proposed Method can also regard that it studies the extension with Series Modeling in depth as.

In the present embodiment, the CTC model compressions that knowledge based refines are studied for the first time.Student model is still marked by CTC Quasi- training, but supervise, i.e., posterior probability obtains in being distributed from teacher.As CTC by sequence knowledge refine described in, use One well-trained model is referred to as to obtain the strategy of state adjustment in GMM-HMM to update the parameter of second model Two model revaluations.The method proposed can be regarded as it and study extension with Series Modeling in depth.

Compared with traditional knowledge refinement frame by frame, crucial difference lies in the knowledge refinement proposed is in sequence level It carries out.Pass through the tutor model that training is trained come translation sequence discriminate frame by frame based on KLD.Sequence can be carried out after transcription study Discriminate training is arranged, further improvement is obtained.It is believed that this is the evidence of sequence level training importance.Another area It is not that transcription is utilized.The method proposed forces student model only to be obtained by being distributed forward direction-backcasting of transcription from teacher To state occupation probability learn the alignment of correct sentence.

In order to prove the effect of this method, to being tested on interchanger corpus and larger Chinese corpus.CTC is taught Teacher's model is 5 layers of LSTM, each has the projection layer repeatedly of 1024 storage units and 384 units.In order to compare, base Line hybrid system is trained by CE criterion, and other than the last one layer is 8K polymerization triphones states, other structures are all It is identical.CTC models are initialized by baseline hybrid system above.For student model, we used 3 LSTM layers, Every layer of 400 unit, the projection layer repeatedly with 128 units.Weight random initializtion in all-network be uniformly (- 0.02,0.02 it) is distributed.We cut out gradual change [- 5,5].Learning rate is annealed and early stage stops strategy and used.All LSTM RNN models all use KALDI (card enlightening speech recognition tools packet) and EESEN (End-to-end speech Recognition using deep rnn models and wfst-based decoding, end to end speech recognition make Decoded with deep RNN models and based on WFST) it is trained.

Interchanger corpus is tested：

Interchanger is 310 hours English data set, there is 4870 sound channels.Every 10 milliseconds are extracted from input voice letter More than 36 dimension logarithm Meier filter groups of 25 milliseconds of frames in number.45 single-tone elements and one are predicted by the output layer of neural network A blank (makes a phone call) to be assessed in subset in the interchanger and Callhome of 2000 CTS test sets of NIST.Waveform root It is split according to PEM (partitioned evaluation map, subregion assessment figure) file of NIST.From interchanger corpus Transcription training and with Fisher corpus be inserted into 30k vocabulary language models be used to decode.WER(Word error Rate, character error rate) as measurement.

List display interchanger is compared based on performance of the knowledge refinement of CTC on interchanger corpus in Fig. 3 Performance compares.Front two row shows the performance gap between hybrid system and CTC.2nd row and the 3rd row be respectively tutor model and The baseline system of student model.If parameter increases by 10 times, the WER of tutor model reduces 20%, this is similar to student model.

Last two rows respectively illustrate knowledge refinement (F-TS-CTC) frame by frame and the series model knowledge refinement based on MAP (S-TS-CTC) result.In our preliminary experiment, model is finely adjusted using CTC criterion and is always brought further Slight to improve, this is the same with the observation in being trained in sequence discriminate.Therefore, all numbers shown in table be all CTC fine tunings obtain later.Compared with student model, F-TS-CTC can obtain 11% and 5% WER decrements respectively.With F- TS-CTC is compared, and S-TS-CTC also shows slight decrease but consistent.

In knowledge refinement frame by frame, the soft labels obtained from tutor model can be inserted into hard label.But, this plan It is slightly not particularly suited for this work, reason is：

(1) CTC is a sequence level standard.Therefore, with any hard label come to be inserted into phoneme posteriority in theory be not Reasonably.

(2) unavailable in this work as being aligned in firmly for hard label.In addition, our PRELIMINARY RESULTS is also shown, only It is aligned firmly by phoneme level to be inserted into CTC phoneme posteriority, any benefit will not be brought.That is, utilizing the better of transcription Method is proposed S-TS-CTC.

Chinese corpus experiment：

CTC always needs more data to realize the performance competed with hybrid system.We used one 2000 small When write by hand record Chinese corpus assess the knowledge refinement pattern based on CTC.All language are all from online speech recognition It is extracted in service.Our training set is made of 2,500,000 language, and average duration is 3 seconds.For Chinese corpus, The input of LSTM RNN is the logmel filter bank energies features that every 10ms calculates 40 dimensions.It is reduced using input layer frame-skipping Calculation amount.One random data scan plan is used to accelerate training program.Output layer predicts 121 single-tone elements and a sky In vain.Three gram language models are applied to assess.Assessment collection is also to be extracted from online service, the not repetition of talker.It surveys Examination collection is made of 6500 language.

As in Fig. 4 based on the knowledge refinement of CTC compared with the performance of Chinese corpus the large-scale Chinese of list display Expect the performance in library.Compare student's baseline system of the 4th row and the 5th row and the 3rd row, two kinds of knowledge refinement methods can significantly drop Low WER, and S-TS-CTC and F-TS-CTC shows slight gap but is unanimously improved.

The CTC fine tuning lists that CTC after knowledge refinement finely tunes in such as Fig. 5 after knowledge refinement are studied, and are such as exchanged Consistent improvement is brought as the experiment of machine corpus is described.

As described in refining the framework knowledge of CTC, knowledge refinement frame by frame can be worked in a manner of unsupervised.Therefore, greatly The Unlabeled data of amount can be used for making student model more like tutor model.It is small using other 2000 extracted from same source When data.We carry out knowledge refinement frame by frame by all data in the way of unsupervised, use flag data in figure 6 3.5% is reduced with WER in the 2nd row of knowledge refinement list of Unlabeled data.Later, we continue the S-TS-CTC of the 3rd row Program, and improve 2.1%.This two are improved all with statistical significance.

A kind of knot of the optimization system of compressed speech recognition modeling of one embodiment of the invention offer is provided Structure schematic diagram, the technical solution of the present embodiment are applicable to the optimization method to the compressed speech recognition modeling of equipment, should System can perform the optimization method of the compressed speech recognition modeling described in above-mentioned any embodiment, and configure in the terminal.

A kind of optimization system of compressed speech recognition modeling provided in this embodiment includes：Model determines program module 11, First ray posterior probability determines that program module 12, the second sequence posterior probability determine that program module 13, sequence error determine Program module 14 and model optimization program module 15.

Wherein, model determines program module 11 for based on the speech recognition modeling before compression, determining tutor model, being based on Voice data without mark in compressed speech recognition modeling and speech database generates student model；First ray posteriority Determine the probability program module 12 in the speech database for extracting speech-sound data sequence of the part with mark as training Data acquisition system carries out neural network propagated forward, with determination by the training data set to the student model The First ray posterior probability of raw model；Second sequence posterior probability determine program module 13 for pass through the training dataset It closes before being carried out to the tutor model to-backcasting, determines the second sequence posterior probability of the tutor model；Sequence error Program module 14 is determined for the First ray posterior probability and the second sequence posterior probability, with determination The sequence error of raw model and the tutor model；Model optimization program module 15 is used for when the sequence error does not restrain, Neural network backpropagation is carried out to the student model according to the sequence error, to update the student model, is generated excellent Change student model.

Further, the sequence error determines that program module is additionally operable to：

Continue the sequence posterior probability with the sequence posterior probability of the tutor model of optimization student model described in comparison Sequence error；

Model optimization program module is additionally operable to：When the sequence error restrains, stop the optimization student model Update.

Further, the model determines that program module is used for：

Using the compressed speech recognition modeling as source model；

A part of voice data without mark is chosen from speech database as pre-training data acquisition system, by described pre- Training data set trains the source model frame by frame, to generate student model.

Further, it is described by the training data set to the student model carry out neural network propagated forward with And it is described the tutor model is carried out by the training data set before to-backcasting be parallel processing.

The embodiment of the present invention additionally provides a kind of nonvolatile computer storage media, and computer storage media is stored with meter Calculation machine executable instruction, the computer executable instructions can perform the compressed speech recognition in above-mentioned any means embodiment The optimization method of model；

As an implementation, nonvolatile computer storage media of the invention is stored with the executable finger of computer It enables, computer executable instructions are set as：

As a kind of non-volatile computer readable storage medium storing program for executing, it can be used for storing non-volatile software program, non-volatile Property computer executable program and module, such as the corresponding program instruction/mould of the method for the test software in the embodiment of the present invention Block.One or more program instruction is stored in non-volatile computer readable storage medium storing program for executing, when being executed by a processor, is held The optimization method of compressed speech recognition modeling in the above-mentioned any means embodiment of row.

Non-volatile computer readable storage medium storing program for executing may include storing program area and storage data field, wherein storage journey It sequence area can storage program area, the required application program of at least one function；Storage data field can be stored according to test software Device use created data etc..It is deposited at random in addition, non-volatile computer readable storage medium storing program for executing may include high speed Access to memory, can also include nonvolatile memory, a for example, at least disk memory, flush memory device or other are non- Volatile solid-state part.In some embodiments, it includes relative to place that non-volatile computer readable storage medium storing program for executing is optional The remotely located memory of device is managed, these remote memories can pass through the device of network connection to test software.Above-mentioned network Example include but not limited to internet, intranet, LAN, mobile radio communication and combinations thereof.

The embodiment of the present invention also provides a kind of electronic equipment comprising：At least one processor, and with described at least one The memory of a processor communication connection, wherein the memory is stored with the finger that can be executed by least one processor It enables, described instruction is executed by least one processor, so as to be able to carry out the present invention any at least one processor The step of optimization method of the compressed speech recognition modeling of embodiment.

The client of the embodiment of the present application exists in a variety of forms, including but not limited to:

(1) mobile communication equipment:The characteristics of this kind of equipment is that have mobile communication function, and to provide speech, data Communication is main target.This Terminal Type includes:Smart mobile phone (such as iPhone), multimedia handset, functional mobile phone and low Hold mobile phone etc..

(2) super mobile personal computer equipment:This kind of equipment belongs to the scope of personal computer, there is calculating and processing work( Can, generally also have mobile Internet access characteristic.This Terminal Type includes:PDA, MID and UMPC equipment etc., such as iPad.

(3) portable entertainment device:This kind of equipment can show and play multimedia content.Such equipment includes:Audio, Video player (such as iPod), handheld device, e-book and intelligent toy and portable car-mounted navigation equipment.

(4) other electronic devices with phonetic function.

The apparatus embodiments described above are merely exemplary, wherein the unit illustrated as separating component can It is physically separated with being or may not be, the component shown as unit may or may not be physics list Member, you can be located at a place, or may be distributed over multiple network units.It can be selected according to the actual needs In some or all of module achieve the purpose of the solution of this embodiment.

Herein, relational terms such as first and second and the like be used merely to by an entity or operation with it is another One entity or operation distinguish, and without necessarily requiring or implying between these entities or operation, there are any this reality Relationship or sequence.Moreover, the terms "include", "comprise", include not only those elements, but also include being not explicitly listed Other element, or further include for elements inherent to such a process, method, article, or device.Do not limiting more In the case of system, the element that is limited by sentence " including ... ", it is not excluded that in the process including the element, method, article Or there is also other identical elements in equipment.

The apparatus embodiments described above are merely exemplary, wherein the unit illustrated as separating component can It is physically separated with being or may not be, the component shown as unit may or may not be physics list Member, you can be located at a place, or may be distributed over multiple network units.It can be selected according to the actual needs In some or all of module achieve the purpose of the solution of this embodiment.Those of ordinary skill in the art are not paying creativeness Labour in the case of, you can to understand and implement.

Through the above description of the embodiments, those skilled in the art can be understood that each embodiment can It is realized by the mode of software plus required general hardware platform, naturally it is also possible to pass through hardware.Based on this understanding, on Stating technical solution, substantially the part that contributes to existing technology can be expressed in the form of software products in other words, should Computer software product can store in a computer-readable storage medium, such as ROM/RAM, magnetic disc, CD, including several fingers It enables and using so that a computer equipment (can be personal computer, server or the network equipment etc.) executes each implementation Method described in certain parts of example or embodiment.

Finally it should be noted that：The above embodiments are merely illustrative of the technical solutions of the present invention, rather than its limitations；Although Present invention has been described in detail with reference to the aforementioned embodiments, it will be understood by those of ordinary skill in the art that：It still may be used With technical scheme described in the above embodiments is modified or equivalent replacement of some of the technical features； And these modifications or replacements, various embodiments of the present invention technical solution that it does not separate the essence of the corresponding technical solution spirit and Range.

Claims

1. a kind of optimization method of compressed speech recognition modeling, including：

Based on the speech recognition modeling before compression, tutor model is determined, be based on compressed speech recognition modeling and voice data Voice data without mark in library generates student model；

Speech-sound data sequence of the extraction part with mark passes through the instruction as training data set in the speech database Practice data acquisition system and neural network propagated forward is carried out to the student model, with the First ray posteriority of the determination student model Probability；

To-backcasting before being carried out to the tutor model by the training data set, the of the tutor model is determined Two sequence posterior probability；

Compare the First ray posterior probability and the second sequence posterior probability, with the determination student model and the religion The sequence error of teacher's model；

When the sequence error does not restrain, neural network is carried out to the student model according to the sequence error and is reversely passed It broadcasts, to update the student model, generates optimization student model.

2. according to the method described in claim 1, wherein, the method further includes：

Continue the sequence of the sequence posterior probability and the sequence posterior probability of the tutor model of optimization student model described in comparison Error stops the update of the optimization student model when sequence error restrains.

3. described based in compressed speech recognition modeling and speech database according to the method described in claim 1, wherein Voice data without mark, generating student model includes：

Using the compressed speech recognition modeling as source model；

A part of voice data without mark is chosen from speech database as pre-training data acquisition system, passes through the pre-training Data acquisition system trains the source model frame by frame, to generate student model.

4. described to be carried out to the student model by the training data set according to the method described in claim 1, wherein Neural network propagated forward and it is described the tutor model is carried out by the training data set before to-backcasting be Parallel processing.

5. a kind of optimization system of compressed speech recognition modeling, including：

Model determines program module, for based on the speech recognition modeling before compression, determining tutor model, is based on compressed language Voice data without mark in sound identification model and speech database generates student model；

First ray posterior probability determines program module, for voice number of the extraction part with mark in the speech database According to sequence as training data set, the student model is carried out before neural network to biography by the training data set It broadcasts, with the First ray posterior probability of the determination student model；

Second sequence posterior probability determines program module, before being carried out to the tutor model by the training data set To-backcasting, the second sequence posterior probability of the tutor model is determined；

Sequence error determines program module, is used for the First ray posterior probability and the second sequence posterior probability, With the sequence error of the determination student model and the tutor model；

Model optimization program module, for when the sequence error does not restrain, according to the sequence error to student's mould Type carries out neural network backpropagation, to update the student model, generates optimization student model.

6. system according to claim 5, wherein the sequence error determines that program module is additionally operable to：

Continue the sequence of the sequence posterior probability and the sequence posterior probability of the tutor model of optimization student model described in comparison Error；

Model optimization program module is additionally operable to：When the sequence error restrains, stop the update of the optimization student model.

7. system according to claim 5, wherein the model determines that program module is used for：

Using the compressed speech recognition modeling as source model；

8. system according to claim 5, wherein described to be carried out to the student model by the training data set Neural network propagated forward and it is described the tutor model is carried out by the training data set before to-backcasting be Parallel processing.

9. a kind of electronic equipment comprising：At least one processor, and deposited with what at least one processor communication was connect Reservoir, wherein the memory is stored with the instruction that can be executed by least one processor, described instruction by it is described at least One processor executes, so that at least one processor is able to carry out the step of any one of claim 1-4 the methods Suddenly.

10. a kind of storage medium, is stored thereon with computer program, which is characterized in that the program is realized when being executed by processor The step of any one of claim 1-4 the methods.