CN114708857A - Speech recognition model training method, speech recognition method and corresponding device - Google Patents

Info

Publication number
CN114708857A
CN114708857A (application CN202011629211.3A)
Authority
CN
China
Prior art keywords: training, model, recognition model, speech recognition, data
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011629211.3A
Other languages
Chinese (zh)
Inventor
Fang Jian (方健)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by ZTE Corp
Priority application: CN202011629211.3A
PCT application: PCT/CN2021/142307 (published as WO2022143723A1)
Publication: CN114708857A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/08: Speech classification or search
    • G10L 15/16: Speech classification or search using artificial neural networks


Abstract

The embodiments of the invention disclose a speech recognition model training method, a speech recognition method, and corresponding apparatus. The scheme creates a migration model to be trained in a target domain based on an original speech recognition model trained in a source domain, and extracts a feature data set from sample data of the target domain, where each piece of feature data in the feature data set carries an environmental factor reflecting the likelihood that different sample data are similar in the same physical attributes of speech; the training set in the feature data set is then taken as an input parameter of the migration model for iterative training to obtain a target speech recognition model. Thus, when training data for a new scenario is insufficient, transfer learning accelerates the generation of an effective model, reducing cost and improving training efficiency; meanwhile, because the feature data used by the retrain layer during training carries environmental factors that can influence the physical attributes of speech, the trained target speech recognition model has good extensibility and high recognition accuracy.

Description

Speech recognition model training method, speech recognition method and corresponding device
Technical Field
The embodiment of the invention relates to the technical field of artificial intelligence, in particular to a speech recognition model training method, a speech recognition method and a corresponding device.
Background
With the arrival of the 5G era of large bandwidth and low latency, applications of real-time media such as audio and video are bound to become mainstream, and the massive real-time media data they generate is of great help in analyzing user habits.
Under the traditional machine learning framework, the basic process is to learn a classification model from given, sufficiently large training data, and then use the model to classify and predict test data. However, with the advent of the 5G era, the explosive growth of applications related to real-time media has subdivided each field into ever finer domains, and large amounts of training data for these newly subdivided domains are very difficult to obtain. How to effectively analyze user habits with the limited data of a subdivided domain, and optimize in a targeted manner to improve user experience, is a major challenge facing contact center products.
At present, the traditional machine learning training process needs a large amount of training data, and for some subdivided domains (such as announcements) it is difficult to obtain sufficient training data. Even when a large amount of training data is obtained, labeling it requires considerable manpower and material resources. A contact center has many sites, and announcement data differs from site to site, which may result in a low recognition rate of the existing model in a new scenario. The existing approach is to add the new announcements to the existing training data and train from scratch; this takes a long time, and because the features of the added data are not salient relative to the existing massive data, the recognition effect is mediocre. Although the training data of a new scenario differs more or less from the source data, some examples in the source training data are suitable for the new application scenario; directly discarding the existing trained model and training anew causes repeated training on some data and a great waste of resources.
Therefore, how to improve the training efficiency of speech recognition models and improve speech recognition performance has become an urgent problem.
Disclosure of Invention
One or more embodiments of the present disclosure provide a speech recognition model training method, a speech recognition method, and a corresponding apparatus, which can reduce training cost and training duration of a speech recognition model, and improve training efficiency and model recognition rate.
To solve the above technical problems, one or more embodiments of the present specification are implemented as follows:
In a first aspect, a speech recognition model training method is provided, including: creating a migration model to be trained in a target domain based on an original speech recognition model trained in a source domain, wherein an output layer of the migration model is different from an output layer of the original speech recognition model; extracting a feature data set from the sample data of the target domain, wherein each piece of feature data in the feature data set carries an environmental factor reflecting the likelihood that different sample data are similar in the same physical attributes of speech; and taking the training set in the feature data set as an input parameter of the migration model for iterative training to obtain a target speech recognition model; in the iterative training process, the weights of the output layer of the migration model are randomly adjusted, and the weights of the other layers of the migration model are kept unchanged.
In a second aspect, a speech recognition method is provided, including: determining voice data to be recognized; extracting feature data from the voice data, wherein the feature data carries an environmental factor reflecting the likelihood that different sample data are similar in the same physical attributes of speech; and feeding the feature data into a target speech recognition model for speech recognition, wherein the target speech recognition model is trained by the speech recognition model training method of the first aspect.
In a third aspect, a speech recognition model training apparatus is provided, including: a creating module, configured to create a migration model to be trained in a target domain based on an original speech recognition model trained in a source domain, wherein an output layer of the migration model is different from that of the original speech recognition model; an extracting module, configured to extract a feature data set from the sample data of the target domain, wherein each piece of feature data in the feature data set carries an environmental factor reflecting the likelihood that different sample data are similar in the same physical attributes of speech; and a training module, configured to take the training set in the feature data set as an input parameter of the migration model for iterative training to obtain a target speech recognition model; in the iterative training process, the weights of the output layer of the migration model are randomly adjusted, and the weights of the other layers of the migration model are kept unchanged.
In a fourth aspect, a speech recognition apparatus is provided, including: a determining module, configured to determine voice data to be recognized; an extracting module, configured to extract feature data from the voice data, wherein the feature data carries an environmental factor reflecting the likelihood that different sample data are similar in the same physical attributes of speech; and a recognition module, configured to feed the feature data into a target speech recognition model for speech recognition, wherein the target speech recognition model is trained by the speech recognition model training method of the first aspect.
In a fifth aspect, an electronic device is provided, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program being executed by the processor to: create a migration model to be trained in a target domain based on an original speech recognition model trained in a source domain, wherein an output layer of the migration model is different from an output layer of the original speech recognition model; extract a feature data set from the sample data of the target domain, wherein each piece of feature data in the feature data set carries an environmental factor reflecting the likelihood that different sample data are similar in the same physical attributes of speech; and take the training set in the feature data set as an input parameter of the migration model for iterative training to obtain a target speech recognition model; in the iterative training process, the weights of the output layer of the migration model are randomly adjusted, and the weights of the other layers of the migration model are kept unchanged.
In a sixth aspect, a computer-readable storage medium is provided, the computer-readable storage medium storing one or more programs that, when executed by a server including a plurality of application programs, cause the server to: create a migration model to be trained in a target domain based on an original speech recognition model trained in a source domain, wherein an output layer of the migration model is different from an output layer of the original speech recognition model; extract a feature data set from the sample data of the target domain, wherein each piece of feature data in the feature data set carries an environmental factor reflecting the likelihood that different sample data are similar in the same physical attributes of speech; and take the training set in the feature data set as an input parameter of the migration model for iterative training to obtain a target speech recognition model; in the iterative training process, the weights of the output layer of the migration model are randomly adjusted, and the weights of the other layers of the migration model are kept unchanged.
In a seventh aspect, an electronic device is provided, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program being executed by the processor to: determine voice data to be recognized; extract feature data from the voice data, wherein the feature data carries an environmental factor reflecting the likelihood that different sample data are similar in the same physical attributes of speech; and feed the feature data into a target speech recognition model for speech recognition, wherein the target speech recognition model is trained by the speech recognition model training method of the first aspect.
In an eighth aspect, a computer-readable storage medium is provided, the computer-readable storage medium storing one or more programs that, when executed by a server including a plurality of application programs, cause the server to: determine voice data to be recognized; extract feature data from the voice data, wherein the feature data carries an environmental factor reflecting the likelihood that different sample data are similar in the same physical attributes of speech; and feed the feature data into a target speech recognition model for speech recognition, wherein the target speech recognition model is trained by the speech recognition model training method of the first aspect.
According to the technical solutions provided by one or more embodiments of this specification, a migration model to be trained in a target domain is created based on an original speech recognition model trained in a source domain; a feature data set is extracted from the sample data of the target domain, where each piece of feature data in the feature data set carries an environmental factor reflecting the likelihood that different sample data are similar in the same physical attributes of speech; the training set in the feature data set is taken as an input parameter of the migration model for iterative training to obtain a target speech recognition model; and in the iterative training process, the weights of the output layer of the migration model are randomly adjusted while the weights of the other layers are kept unchanged. Thus, when training data for a new scenario is insufficient, transfer learning accelerates the generation of an effective model, reducing cost and improving training efficiency; meanwhile, because the feature data used by the retrain layer during training carries environmental factors that can influence the physical attributes of speech, the trained target speech recognition model has good extensibility and high recognition accuracy.
Drawings
In order to more clearly illustrate the technical solutions of one or more embodiments of this specification or of the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments described in this specification, and other drawings can be obtained from them by those skilled in the art without inventive effort.
Fig. 1 is a schematic step diagram of a speech recognition model training method according to an embodiment of the present invention.
Fig. 2 is a schematic step diagram of a speech recognition method according to an embodiment of the present invention.
Fig. 3a and fig. 3b are schematic diagrams of recognition results before and after training, respectively, according to an embodiment of the present invention.
Fig. 4 is a flowchart of training and applying a speech recognition model, using announcements as an example, according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of a speech recognition model training apparatus according to an embodiment of the present invention.
Fig. 6 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present invention.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions in this specification, the technical solutions in one or more embodiments of this specification will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this specification, not all of them. All other embodiments obtained by a person skilled in the art from one or more embodiments of this specification without inventive effort shall fall within the protection scope of this document.
Terminology that may be used in describing embodiments of the present disclosure is explained first.
Frozen layers: layers whose parameters are fixed and not changed during training.
Retrain layers: the last few layers of the neural network, which are removed and re-trained on the new data set.
Learning rate: controls the speed of parameter updates during gradient descent (see the one-line update rule after this list). If the learning rate is too low, convergence is guaranteed but model optimization becomes very slow; if it is too high, the global optimum may be overshot, causing the parameters to "oscillate".
Announcement: an alert tone that occurs during a telephone call.
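As a one-line illustration of the learning-rate entry above (notation ours, not from the patent), each gradient descent step scales the gradient by the learning rate η:

```latex
\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)
```

Too small an η makes θ crawl toward the optimum; too large an η overshoots it and makes the parameters oscillate.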
The main purpose of the present invention is to provide a scheme that, across different application scenarios, accelerates the generation of an effective training model through transfer learning when training data for a new scenario is insufficient, reducing cost and improving training efficiency.
It should be noted that the speech recognition model training and recognition scheme of this specification is not limited to the announcement recognition scenario; it may also be applied to other speech recognition scenarios in which the physical attributes of speech differ.
Example one
Referring to fig. 1, a schematic diagram illustrating steps of a speech recognition model training method according to an embodiment of the present invention is shown, where the method may include the following steps:
step 102: and creating a migration model to be trained in a target domain based on an original voice recognition model obtained by training in a source domain, wherein the output layer of the migration model is different from the output layer of the original voice recognition model.
The source domain may be a feature range formed by sample data based on which the original speech recognition model is trained. The sample data in these feature ranges are known and have a sufficient number to be well trained to generate high quality models.
The target domain can be understood as a domain to be learned, and the number of sample data in the domain is limited or less, so that a high-quality model cannot be obtained through independent training.
Here, step 102 performs migration learning using the original speech recognition model trained in the source domain, so that a small amount of sample data for the target domain can be trained based on the migrated model parameters.
Specifically, when step 102 creates a migration model to be trained in the target domain based on the original speech recognition model trained in the source domain, the following two ways may be used:
Mode one:
1.1. Extract the parameters of the original speech recognition model trained in the source domain, the parameters at least comprising the weight of each layer in the neural network model.
1.2. Create a new neural-network-based speech recognition model, and transfer the extracted weights of all fully connected layers except the output layer into the new model, which serves as the migration model to be trained in the target domain.
It should be understood that the speech recognition model referred to in this specification is a deep neural network model. In a specific implementation, the parameters of the original speech recognition model are obtained and the model is loaded, the parameters being mainly the weight matrix of each layer of the neural network. A new speech recognition model is then re-initialized and created, the weights obtained in the previous step are placed into the corresponding layers of the new model, and the last layer of parameters is removed.
Mode two:
2.1. Obtain the original speech recognition model trained in the source domain.
2.2. Remove the weights of the output layer from the original speech recognition model to obtain the migration model to be trained in the target domain.
Mode two differs from mode one in that no new speech recognition model is created; instead, the obtained original speech recognition model is processed directly, and the original model with its output-layer weights removed serves as the migration model.
It should be understood that, in either mode one or mode two, the output layer of the resulting migration model is equivalent to a retrain layer, on which migration training is performed based on the small number of speech samples available in the target domain; the other fully connected layers serve as frozen layers whose parameters are not changed during model training. A minimal sketch of mode one is given below.
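For illustration only (the patent specifies no framework; the layer sizes, the 26-dimensional input standing for 13 MFCC plus 13 MFCC-p components, and the class counts below are our assumptions), a minimal PyTorch sketch of mode one:

```python
import torch.nn as nn

def build_migration_model(original: nn.Sequential, num_new_classes: int) -> nn.Sequential:
    """Copy every layer of the source-domain model except its output layer,
    freeze the copied weights, and attach a freshly initialized output layer."""
    frozen = list(original.children())[:-1]        # drop the old output layer
    for layer in frozen:                           # frozen layers: weights fixed
        for p in layer.parameters():
            p.requires_grad = False
    in_features = [l for l in frozen if isinstance(l, nn.Linear)][-1].out_features
    retrain = nn.Linear(in_features, num_new_classes)   # retrain layer, random init
    return nn.Sequential(*frozen, retrain)

# Hypothetical source-domain model: input is a 26-dim feature (13 MFCC + 13 MFCC-p).
original = nn.Sequential(nn.Linear(26, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(),
                         nn.Linear(256, 1000))     # old source-domain output layer
migration = build_migration_model(original, num_new_classes=40)
```

Mode two would skip the re-creation step and simply replace the last layer of `original` in place; either way, only the new output layer remains trainable.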
Step 104: extract a feature data set from the sample data of the target domain, where each piece of feature data in the feature data set carries an environmental factor reflecting the likelihood that different sample data are similar in the same physical attributes of speech.
The physical attributes of speech mainly comprise four types of attribute elements: pitch, intensity, duration, and timbre; in everyday terms, intonation, accent, and the like.
In this embodiment of the specification, extracting the feature data set from the sample data of the target domain in step 104 specifically includes the following.
First, the sample data of the target domain are respectively preprocessed to obtain the MFCC features.
The preprocessing of sample data can be carried out according to the following steps:
Step 1: prepare the transcription document translation.txt of the original announcements and the corresponding dictionary lexicon.txt; the transcription document needs to be word-segmented.
Step 2: perform pre-emphasis, framing, and windowing on the announcement data.
Step 3: perform a discrete Fourier transform on the announcement data, converting the time-domain sequence of the voice stream into a spectral sequence and extracting the sound spectrum information.
Step 4: configure a triangular filter bank and calculate the output of each triangular filter after it filters the signal magnitude spectrum.
Step 5: take the logarithm of all filter outputs and apply a discrete cosine transform to obtain the MFCC (Mel-Frequency Cepstral Coefficients); a sketch of steps 2-5 is given below.
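A minimal numpy/scipy sketch of steps 2-5 (the sample rate, frame length, hop, FFT size, filter count, and coefficient count are illustrative assumptions, not values from the patent):

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=8000, frame_len=200, hop=80, nfft=256, n_filters=26, n_ceps=13):
    # Step 2: pre-emphasis, framing, windowing.
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # Step 3: discrete Fourier transform -> magnitude spectrum.
    mag = np.abs(np.fft.rfft(frames, n=nfft))                 # (n_frames, nfft//2 + 1)
    # Step 4: triangular mel filter bank applied to the magnitude spectrum.
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_filters + 2)
    bins = np.floor((nfft + 1) * (700 * (10 ** (mel_pts / 2595) - 1)) / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    energies = np.maximum(mag @ fbank.T, 1e-10)
    # Step 5: logarithm, then discrete cosine transform; keep the first n_ceps coefficients.
    return dct(np.log(energies), type=2, axis=1, norm='ortho')[:, :n_ceps]

feats = mfcc(np.random.randn(8000))   # 1 s of fake audio -> (n_frames, 13) MFCC matrix
```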
At this point, the MFCC features conventionally used for speech recognition have been fully extracted. However, to minimize the influence of external conditions on the recognition result, we propose an environmental factor based on the MFCC, named MFCC-p.
Second, based on the MFCC and its likelihood, the environmental factor of the MFCC is determined.
Specifically, the likelihood of the MFCC can be determined from the MFCC and a likelihood function; the weight set obtained when the likelihood converges to its optimum is then determined as the environmental factor of the MFCC.
For the likelihood of the MFCC, denote the MFCC of any sample data as X = {x_1, x_2, ..., x_T}, where X contains T feature components. The likelihood of the MFCC is calculated according to the likelihood function formula
p(x_t | λ) = Σ_{k=1}^{K} w_k p_k(x_t)
where p(x_t | λ) is the likelihood value of the MFCC, w_k is the weight of the k-th component, p_k is the k-th Gaussian density in the Gaussian mixture model, and u_k and Σ_k are respectively the mean and covariance of the k-th Gaussian component; the parameter set of the Gaussian model is λ = {w_k, u_k, Σ_k}.
Third, the MFCC and its corresponding environmental factor MFCC-p are combined into feature data. In a specific combination, the two kinds of dimensions can be combined together additively.
Fourth, the feature data of all the sample data together form the feature data set.
In practice, the second through fourth steps may specifically proceed as follows.
The MFCC feature extracted from a given piece of voice data is X = {x_1, x_2, ..., x_T}, assumed to be of dimension D, and its likelihood can be calculated by the likelihood function formula above. From the GMM (Gaussian mixture model), the extracted feature is modeled by weighting K Gaussian density functions; the mean u_k and covariance Σ_k of each Gaussian component are of dimensions 1 × D and D × D respectively, and the set of weights is λ = {w_k, u_k, Σ_k}. Given an initial value set and an initial base learning rate for λ, a neural network is used to continually reduce the Euclidean distance between the Gaussian-weighted result and the original feature, and the new feature MFCC-p is finally obtained. Adding the newly extracted influence-factor feature MFCC-p on top of the original MFCC yields the features required for the subsequent training.
The MFCC feature achieves recognition by exploiting the fact that the human ear has different sensitivities to low-frequency and high-frequency sounds. But the characteristics of the same dialect differ across situations, even when spoken by the same person; for example, the same dialect spoken by different people may differ in intonation, speed, and accent. To recognize speech more finely and accurately, a likelihood function can be used: through precise analysis of attributes such as intonation, speaking rate, and accent, environmental factors are added and the MFCC is feature-corrected. Compared with the plain MFCC, the MFCC-p feature reflects exactly these influences, and combined with the original MFCC features it enables more accurate, fine-grained recognition of speech, especially similar sounds such as announcements.
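The patent gives no concrete fitting algorithm for λ. The sketch below is one plausible reading under stated assumptions (diagonal covariances, gradient descent on the Euclidean distance between the Gaussian-weighted reconstruction and the original feature), written in PyTorch; all names are ours:

```python
import torch

def mfcc_p(X: torch.Tensor, K: int = 8, steps: int = 200, lr: float = 1e-2) -> torch.Tensor:
    """X: (T, D) MFCC matrix. Returns a (T, D) environment-corrected feature obtained
    from the converged Gaussian-weighted reconstruction of X."""
    T, D = X.shape
    w = torch.full((K,), 1.0 / K, requires_grad=True)            # mixture weights w_k
    mu = X[torch.randint(0, T, (K,))].clone().requires_grad_()   # means u_k, shape (K, D)
    log_var = torch.zeros(K, D, requires_grad=True)              # diagonal Sigma_k
    opt = torch.optim.SGD([w, mu, log_var], lr=lr)
    for _ in range(steps):
        var = log_var.exp()
        dist2 = ((X[:, None, :] - mu[None]) ** 2 / var[None]).sum(-1)   # (T, K)
        dens = torch.softmax(w, 0) * torch.exp(-0.5 * dist2)            # weighted densities
        # Gaussian-weighted reconstruction of each frame; training drives its
        # Euclidean distance to the original feature down, as the text describes.
        recon = (dens[..., None] * mu[None]).sum(1) / (dens.sum(1, keepdim=True) + 1e-8)
        loss = torch.dist(recon, X)
        opt.zero_grad(); loss.backward(); opt.step()
    return recon.detach()
```

The combined training feature is then the original MFCC with MFCC-p appended, e.g. `torch.cat([X, mfcc_p(X)], dim=1)`.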
Step 106: take the training set in the feature data set as an input parameter of the migration model for iterative training to obtain the target speech recognition model.
In the iterative training process, the weight of the output layer of the migration model is randomly adjusted, and the weights of other layers of the migration model are kept unchanged.
Different learning rates are set for data sets of different sizes. The frozen layers were obtained from huge amounts of voice data, so a normal learning rate is set for them, while the newly added retrain layers are trained on the smaller amount of announcement data and need a smaller learning rate. Therefore, when the migration model is iteratively trained, the learning rate set for the output layer is lower than the learning rate set for the other layers.
It should be understood that the learning rates are adjusted at each iteration. Throughout the iterative training process, after each iteration is completed, the error rate of the model test after this iteration can be compared with the error rate after the previous iteration; if the current error rate is greater than the previous one, the learning rate is reduced by a preset amplitude, and if the current error rate is less than or equal to the previous one, the learning rate is increased by the preset amplitude. The preset amplitude may be about 5%.
The recognition error rate is calculated continuously during iteration, and training terminates when the error rate falls below a preset precision, yielding the new trained model. A sketch of this loop follows.
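The sketch below is hedged: the optimizer choice, the exact 5% step, the 10:1 learning-rate ratio, and the data loaders are our assumptions; `migration` is the model from the earlier sketch.

```python
import torch
import torch.nn as nn

def error_rate(model, loader):
    wrong = total = 0
    with torch.no_grad():
        for x, y in loader:
            wrong += (model(x).argmax(1) != y).sum().item()
            total += y.numel()
    return wrong / max(total, 1)

def train(migration, train_loader, test_loader, base_lr=1e-3,
          target_error=0.05, max_epochs=100):
    layers = list(migration.children())
    opt = torch.optim.SGD([
        {"params": [p for l in layers[:-1] for p in l.parameters()], "lr": base_lr},
        {"params": layers[-1].parameters(), "lr": base_lr * 0.1},  # lower LR for output layer
    ])
    loss_fn, prev_err = nn.CrossEntropyLoss(), float("inf")
    for _ in range(max_epochs):
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(migration(x), y).backward()
            opt.step()
        err = error_rate(migration, test_loader)
        scale = 0.95 if err > prev_err else 1.05   # adjust learning rates by ~5%
        for g in opt.param_groups:
            g["lr"] *= scale
        prev_err = err
        if err < target_error:                     # stop below the preset precision
            break
    return migration
```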
According to the above technical solution, a migration model to be trained in a target domain is created based on an original speech recognition model trained in a source domain; a feature data set is extracted from the sample data of the target domain, where each piece of feature data in the feature data set carries an environmental factor reflecting the likelihood that different sample data are similar in the same physical attributes of speech; the training set in the feature data set is taken as an input parameter of the migration model for iterative training to obtain a target speech recognition model; and in the iterative training process, the weights of the output layer of the migration model are randomly adjusted while the weights of the other layers are kept unchanged. Thus, when training data for a new scenario is insufficient, transfer learning accelerates the generation of an effective model, reducing cost and improving training efficiency; meanwhile, because the feature data used by the retrain layer during training carries environmental factors that can influence the physical attributes of speech, the trained target speech recognition model has good extensibility and high recognition accuracy.
Example two
Referring to fig. 2, a schematic step diagram of a speech recognition method according to an embodiment of the present invention is shown, where the method may include:
step 202: determining voice data to be recognized;
step 204: extracting feature data from the voice data, wherein the feature data carries an environmental factor reflecting the likelihood that different sample data are similar in the same physical attributes of speech;
step 206: feeding the feature data into a target speech recognition model for speech recognition, wherein the target speech recognition model is trained based on the method of the first embodiment.
For the manner of extracting the feature data in step 204, refer to the description in the first embodiment of extracting feature data from sample data; it is not repeated here. A minimal sketch of the recognition flow is given below.
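It reuses the (assumed) `mfcc` and `mfcc_p` helpers and the trained `migration` model from the earlier sketches; the input signal is fake:

```python
import numpy as np
import torch

# Step 202: determine the voice data to be recognized (here: 1 s of fake audio).
signal = np.random.randn(8000)

# Step 204: feature data = MFCC plus its environmental factor MFCC-p.
X = torch.as_tensor(mfcc(signal), dtype=torch.float32)   # (T, 13)
feats = torch.cat([X, mfcc_p(X)], dim=1)                 # (T, 26)

# Step 206: feed the features into the trained target speech recognition model.
with torch.no_grad():
    pred = migration(feats).argmax(dim=1)                # per-frame class labels
```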
Compared with the traditional transfer learning process, in order to improve the recognition rate for announcements, the speech recognition model is further subjected to hyper-parameter adjustment using a language model generated by training. Typically, the language model can implement functions such as correcting homophone typos, for example correcting "dialing" in the recognition result to "redialing", so that the coupling between the speech recognition model and the announcement domain is greatly improved.
The recognition results before and after training are shown in fig. 3a and fig. 3b: of the 40 announcements sent back from field sites that the original model could not recognize, 33 can be recognized with the new model, a clear improvement in recognition performance.
In summary, with the embodiments of the present invention, when existing announcement data is insufficient and the existing model has a low recognition rate for announcements, engineering personnel can obtain a training model with an improved recognition rate in less time. There is no need to label a large amount of training data manually, to re-label outdated data, or to train from scratch on data that has already been trained. Engineering personnel only need to extract features from the newly added announcement data and train on those features on top of the original model to obtain a better model. Because only the output layer is trained on the newly added data, training time is effectively shortened. Moreover, this method is exceptionally flexible compared with conventional methods. When the recognition result of the current model in a specific domain is poor, the traditional approach is to add data of that domain to the original data and retrain, which takes a lot of time and does not necessarily give better results. With the method provided by the embodiments of the present invention, a model can be obtained quickly, and a better recognition result can be obtained by flexibly adjusting the initial weights of the output layer. From the perspective of engineering personnel, the training process is simplified and training efficiency is improved; from the perspective of users, a mature solution can be obtained in less time when the existing scheme is incomplete, greatly improving user experience.
Example three
Referring to fig. 4, the overall process of model training and application is described by taking announcements as an example.
Model training
- Extract the MFCC + MFCC-p features of the announcement sample data; the specific extraction method is described in embodiment one.
- Remove the last fully connected layer of the original model to create the migration model.
In fact, the extraction of MFCC + MFCC-p and the creation of the migration model can be performed in either order, or simultaneously, without interfering with each other.
- Keep the weights of the frozen layers unchanged, and randomly adjust the weights of the retrain layer.
- Train the migration model using the training set extracted from MFCC + MFCC-p.
- After each iteration, judge whether the error rate of the model test is smaller than a threshold; if so, end training to obtain the target speech recognition model, otherwise continue iterative training.
Subsequently, the obtained target speech recognition model can be passed to the model application stage.
Model application
- Extract the MFCC + MFCC-p features of the announcement data to be examined.
- Feed the extracted MFCC + MFCC-p into the target speech recognition model obtained from the preceding training stage for speech recognition.
In the training stage, feature extraction is performed on the voice samples of both the original model and the announcement recognition model, and the feature data are mapped so that the features of ordinary voice and of announcements are projected into a new common feature space, guaranteeing that the dimensionality of the data is kept within a certain range (preventing the curse of dimensionality); the pseudo-labels of the samples in the announcement domain are then iteratively optimized in the new feature space until convergence. The pre-trained model is obtained from a large data set, and because ordinary voice and announcements are similar in their features, its structure and weights can be used directly: the DNN without its output layer is taken as a fixed feature extractor, a smaller learning rate is set for the migrated layers, the newly added output layer is updated at a normal learning rate and applied to the new data set, and the final model is obtained by predicting and scoring on the new announcement data set. If the recognition result of the newly generated model is not ideal, the weights of some initial layers of the original model can be kept unchanged while the later layers are retrained to obtain new weights. This process allows the user to make multiple attempts to train the model optimally.
Example four
Referring to fig. 5, a schematic structural diagram of a speech recognition model training apparatus provided in an embodiment of this disclosure is shown; the speech recognition model training apparatus 500 may include:
a creating module 502, configured to create a migration model to be trained in a target domain based on an original speech recognition model obtained by training in a source domain, where an output layer of the migration model is different from an output layer of the original speech recognition model;
an extracting module 504, configured to extract a feature data set from the sample data of the target domain, where each piece of feature data in the feature data set carries an environmental factor reflecting the likelihood that different sample data are similar in the same physical attributes of speech;
a training module 506, configured to take the training set in the feature data set as an input parameter of the migration model for iterative training to obtain a target speech recognition model;
in the iterative training process, the weight of the output layer of the migration model is randomly adjusted, and the weights of other layers of the migration model are kept unchanged.
In one implementation, when creating the migration model to be trained in the target domain based on the original speech recognition model trained in the source domain, the creating module 502 is specifically configured to extract parameters of the original speech recognition model trained in the source domain, the parameters at least comprising the weight of each layer in the neural network model; and to create a new neural-network-based speech recognition model and transfer the extracted weights of all layers except the output layer into the new model, which serves as the migration model to be trained in the target domain.
In another implementation scheme, the creating module 502 is specifically configured to obtain the original speech recognition model obtained by training in the source domain when creating the migration model to be trained in the target domain based on the original speech recognition model obtained by training in the source domain; and eliminating the weight of an output layer in the original speech recognition model to obtain a migration model to be trained of a target domain.
In another implementation of this specification, when extracting the feature data set from the sample data of the target domain, the extracting module 504 is specifically configured to respectively preprocess the sample data of the target domain to obtain Mel-frequency cepstral coefficients (MFCC); determine an environmental factor of the MFCC based on the MFCC and its likelihood; combine the MFCC and the corresponding environmental factor into feature data; and form the feature data of all the sample data into a feature data set.
In yet another implementation of this disclosure, when determining the environmental factor of the MFCC based on the MFCC and its likelihood, the extracting module 504 is specifically configured to determine the likelihood of the MFCC according to the MFCC and a likelihood function, and to determine the weight set obtained when the likelihood converges to its optimum as the environmental factor of the MFCC.
In another implementation of this specification, when determining the likelihood of the MFCC according to the MFCC and the likelihood function, the extracting module 504 is specifically configured to determine the MFCC of any sample data as X = {x_1, x_2, ..., x_T}, where X contains T feature components, and to calculate the likelihood of the MFCC according to the likelihood function formula
p(x_t | λ) = Σ_{k=1}^{K} w_k p_k(x_t)
where p(x_t | λ) is the likelihood value of the MFCC, w_k is the weight of the k-th component, p_k is the k-th Gaussian density in the Gaussian mixture model, and u_k and Σ_k are respectively the mean and covariance of each Gaussian component; the parameter set of the Gaussian model is λ = {w_k, u_k, Σ_k}.
In another implementation of the present specification, when the migration model is iteratively trained, the learning rate set for the output layer is lower than the learning rate set for the other layers.
For the iterative training process, the apparatus further includes a comparison module configured to compare, after each iteration is completed, the error rate of the model test after this iteration with the error rate of the model test after the previous iteration; if the current error rate is greater than the previous one, the learning rate is reduced by a preset amplitude; otherwise, the learning rate is increased by the preset amplitude.
In another implementation of the present disclosure, in the iterative training process, when the error rate is lower than the threshold, the training is ended after the iteration is completed.
According to the above technical solution, a migration model to be trained in a target domain is created based on an original speech recognition model trained in a source domain; a feature data set is extracted from the sample data of the target domain, where each piece of feature data in the feature data set carries an environmental factor reflecting the likelihood that different sample data are similar in the same physical attributes of speech; the training set in the feature data set is taken as an input parameter of the migration model for iterative training to obtain a target speech recognition model; and in the iterative training process, the weights of the output layer of the migration model are randomly adjusted while the weights of the other layers are kept unchanged. Thus, when training data for a new scenario is insufficient, transfer learning accelerates the generation of an effective model, reducing cost and improving training efficiency; meanwhile, because the feature data used by the retrain layer during training carries environmental factors that can influence the physical attributes of speech, the trained target speech recognition model has good extensibility and high recognition accuracy.
Example five
Referring to fig. 6, a schematic structural diagram of a speech recognition apparatus provided in an embodiment of this specification is shown; the apparatus 600 may include:
a determining module 602, configured to determine voice data to be recognized;
an extracting module 604, configured to extract feature data from the voice data, where the feature data carries an environmental factor reflecting the likelihood that different sample data are similar in the same physical attributes of speech;
a recognition module 606, configured to feed the feature data into a target speech recognition model for speech recognition, where the target speech recognition model is trained based on the method described in embodiment one.
According to the above technical solution, a migration model to be trained in a target domain is created based on an original speech recognition model trained in a source domain; a feature data set is extracted from the sample data of the target domain, where each piece of feature data in the feature data set carries an environmental factor reflecting the likelihood that different sample data are similar in the same physical attributes of speech; the training set in the feature data set is taken as an input parameter of the migration model for iterative training to obtain a target speech recognition model; and in the iterative training process, the weights of the output layer of the migration model are randomly adjusted while the weights of the other layers are kept unchanged. Thus, when training data for a new scenario is insufficient, transfer learning accelerates the generation of an effective model, reducing cost and improving training efficiency; meanwhile, because the feature data used by the retrain layer during training carries environmental factors that can influence the physical attributes of speech, the trained target speech recognition model has good extensibility and high recognition accuracy.
Example six
An embodiment of the present specification further provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program being executed by the processor to:
create a migration model to be trained in a target domain based on an original speech recognition model trained in a source domain, wherein an output layer of the migration model is different from an output layer of the original speech recognition model;
extract a feature data set from the sample data of the target domain, wherein each piece of feature data in the feature data set carries an environmental factor reflecting the likelihood that different sample data are similar in the same physical attributes of speech;
take the training set in the feature data set as an input parameter of the migration model for iterative training to obtain a target speech recognition model;
wherein, in the iterative training process, the weights of the output layer of the migration model are randomly adjusted, and the weights of the other layers of the migration model are kept unchanged.
Also, a computer-readable storage medium is provided that stores one or more programs that, when executed by a server including a plurality of application programs, cause the server to perform operations of:
create a migration model to be trained in a target domain based on an original speech recognition model trained in a source domain, wherein an output layer of the migration model is different from an output layer of the original speech recognition model;
extract a feature data set from the sample data of the target domain, wherein each piece of feature data in the feature data set carries an environmental factor reflecting the likelihood that different sample data are similar in the same physical attributes of speech;
take the training set in the feature data set as an input parameter of the migration model for iterative training to obtain a target speech recognition model;
wherein, in the iterative training process, the weights of the output layer of the migration model are randomly adjusted, and the weights of the other layers of the migration model are kept unchanged.
According to the above technical solution, a migration model to be trained in a target domain is created based on an original speech recognition model trained in a source domain; a feature data set is extracted from the sample data of the target domain, where each piece of feature data in the feature data set carries an environmental factor reflecting the likelihood that different sample data are similar in the same physical attributes of speech; the training set in the feature data set is taken as an input parameter of the migration model for iterative training to obtain a target speech recognition model; and in the iterative training process, the weights of the output layer of the migration model are randomly adjusted while the weights of the other layers are kept unchanged. Thus, when training data for a new scenario is insufficient, transfer learning accelerates the generation of an effective model, reducing cost and improving training efficiency; meanwhile, because the feature data used by the retrain layer during training carries environmental factors that can influence the physical attributes of speech, the trained target speech recognition model has good extensibility and high recognition accuracy.
Example seven
An embodiment of the present specification further provides another electronic device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program being executed by the processor to:
determine voice data to be recognized;
extract feature data from the voice data, wherein the feature data carries an environmental factor reflecting the likelihood that different sample data are similar in the same physical attributes of speech;
and feed the feature data into a target speech recognition model for speech recognition, wherein the target speech recognition model is trained based on the method of embodiment one.
Meanwhile, another computer-readable storage medium is also provided, which stores one or more programs that, when executed by a server including a plurality of application programs, cause the server to perform operations of:
determine voice data to be recognized;
extract feature data from the voice data, wherein the feature data carries an environmental factor reflecting the likelihood that different sample data are similar in the same physical attributes of speech;
and feed the feature data into a target speech recognition model for speech recognition, wherein the target speech recognition model is trained based on the method of embodiment one.
According to the above technical solution, a migration model to be trained in a target domain is created based on an original speech recognition model trained in a source domain; a feature data set is extracted from the sample data of the target domain, where each piece of feature data in the feature data set carries an environmental factor reflecting the likelihood that different sample data are similar in the same physical attributes of speech; the training set in the feature data set is taken as an input parameter of the migration model for iterative training to obtain a target speech recognition model; and in the iterative training process, the weights of the output layer of the migration model are randomly adjusted while the weights of the other layers are kept unchanged. Thus, when training data for a new scenario is insufficient, transfer learning accelerates the generation of an effective model, reducing cost and improving training efficiency; meanwhile, because the feature data used by the retrain layer during training carries environmental factors that can influence the physical attributes of speech, the trained target speech recognition model has good extensibility and high recognition accuracy.
The electronic device in the above embodiment may refer to a schematic structural diagram shown in fig. 7.
In short, the above description is only a preferred embodiment of the present disclosure, and is not intended to limit the scope of the present disclosure. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present specification shall be included in the protection scope of the present specification.
The system, apparatus, module or unit illustrated in one or more of the above embodiments may be implemented by a computer chip or an entity, or by an article of manufacture with a certain functionality. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable storage media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

Claims (16)

1. A method of speech recognition model training, comprising:
creating a migration model to be trained in a target domain based on an original speech recognition model trained in a source domain, wherein an output layer of the migration model is different from an output layer of the original speech recognition model;
extracting a feature data set from sample data of the target domain, wherein each piece of feature data in the feature data set carries an environmental factor reflecting the likelihood that different sample data are similar in the same physical attributes of speech;
taking the training set in the feature data set as an input parameter of the migration model for iterative training to obtain a target speech recognition model;
in the iterative training process, the weight of the output layer of the migration model is randomly adjusted, and the weights of other layers of the migration model are kept unchanged.
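By way of illustration only, the following is a minimal PyTorch sketch of the flow of claim 1; the `nn.Sequential` layout, the assumption that the output layer is a final `nn.Linear`, and all hyper-parameter values are assumptions, not part of the claimed method:

```python
import torch
import torch.nn as nn

def build_migration_model(source_model: nn.Sequential, num_target_classes: int) -> nn.Sequential:
    layers = list(source_model.children())
    hidden = layers[:-1]                        # every layer except the source output layer
    for layer in hidden:
        for p in layer.parameters():
            p.requires_grad = False             # freeze: these weights stay unchanged
    # new output layer for the target domain, randomly initialised
    new_output = nn.Linear(layers[-1].in_features, num_target_classes)
    return nn.Sequential(*hidden, new_output)

def train_migration_model(model, train_loader, epochs=10, lr=1e-3):
    criterion = nn.CrossEntropyLoss()
    trainable = [p for p in model.parameters() if p.requires_grad]  # output layer only
    optimizer = torch.optim.SGD(trainable, lr=lr)
    for _ in range(epochs):
        for features, labels in train_loader:   # training set from the feature data set
            optimizer.zero_grad()
            loss = criterion(model(features), labels)
            loss.backward()
            optimizer.step()                    # adjusts only the output-layer weights
    return model
```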
2. The speech recognition model training method according to claim 1, wherein creating the migration model to be trained in the target domain based on the original speech recognition model trained in the source domain comprises:
extracting parameters of the original speech recognition model trained in the source domain, wherein the parameters comprise at least the weights of each layer in the neural network model;
creating a new neural-network-based speech recognition model, and transferring the extracted weights of all layers except the output layer to the new speech recognition model, which serves as the migration model to be trained in the target domain.
3. The speech recognition model training method according to claim 1, wherein creating the migration model to be trained in the target domain based on the original speech recognition model trained in the source domain comprises:
obtaining the original speech recognition model trained in the source domain;
removing the weights of the output layer from the original speech recognition model to obtain the migration model to be trained in the target domain.
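Claims 2 and 3 describe two routes to the same migration model; a hedged PyTorch sketch, in which the output-layer name "output" and both helper names are assumptions:

```python
import copy
import torch.nn as nn

def migrate_by_weight_transfer(source_model: nn.Module, new_model: nn.Module) -> nn.Module:
    # claim 2: extract the source parameters and carry every layer's weights
    # except the output layer over into a freshly created network
    transferable = {k: v for k, v in source_model.state_dict().items()
                    if not k.startswith("output")}   # assumes the output layer is named "output"
    new_model.load_state_dict(transferable, strict=False)
    return new_model

def migrate_by_output_reset(source_model: nn.Module) -> nn.Module:
    # claim 3: copy the original model and discard (re-initialise) only
    # the output-layer weights
    model = copy.deepcopy(source_model)
    for name, module in model.named_modules():
        if name == "output" and isinstance(module, nn.Linear):
            module.reset_parameters()
    return model
```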
4. The speech recognition model training method according to any one of claims 1-3, wherein extracting the feature data set from the sample data of the target domain comprises:
preprocessing each piece of sample data of the target domain to obtain Mel-frequency cepstral coefficients (MFCCs);
determining an environmental factor for the MFCCs based on the MFCCs and a likelihood of the MFCCs;
combining the MFCCs and the corresponding environmental factor into feature data;
assembling the feature data of all the sample data into the feature data set.
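A sketch of the feature extraction in claim 4, assuming librosa for preprocessing; `estimate_environment_factor()` is the hypothetical routine sketched under claim 6, and `gmm_params` stands for a pre-trained background Gaussian mixture:

```python
import librosa

def extract_feature_data(wav_path, gmm_params, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=16000)                  # preprocess one sample
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # shape (n_mfcc, T)
    env = estimate_environment_factor(mfcc.T, *gmm_params)    # weight set at the likelihood optimum
    return {"mfcc": mfcc, "env_factor": env}                  # MFCC combined with its factor

def build_feature_data_set(wav_paths, gmm_params):
    # the feature data of all sample data together form the feature data set
    return [extract_feature_data(p, gmm_params) for p in wav_paths]
```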
5. The speech recognition model training method according to claim 4, wherein determining the environmental factor for the MFCCs based on the MFCCs and the likelihood of the MFCCs comprises:
determining the likelihood of the MFCCs from the MFCCs and a likelihood function;
determining the weight set obtained when the likelihood converges to its optimal value as the environmental factor of the MFCCs.
6. The speech recognition model training method according to claim 5, wherein determining the likelihood of the MFCCs from the MFCCs and the likelihood function comprises:
denoting the MFCCs of any sample data as X, wherein X contains T feature components;
calculating the likelihood of the MFCCs according to the following likelihood function:
$$\mathcal{L}(X)=\sum_{k=1}^{T} w_k \log \sum_{m=1}^{M} c_m\,\mathcal{N}\!\left(x_k \mid \mu_m,\sigma_m^2\right)$$

wherein $\mathcal{L}(X)$ is the likelihood value of the MFCCs, $x_k$ is the k-th feature component of X, $w_k$ is the weight of the k-th feature component, $c_m$, $\mu_m$ and $\sigma_m^2$ are the parameters of the Gaussian model, with $c_m$ being the weight of each Gaussian in the Gaussian mixture model, and $\mu_m$ and $\sigma_m^2$ being respectively the mean and variance of each Gaussian.
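A sketch of claims 5 and 6 under the reconstruction above: score each feature component under a fixed Gaussian mixture, then adjust the weight set until the weighted likelihood stops improving. The multiplicative-weights update is one simple choice of optimiser (the claims do not fix one), and the GMM parameters are assumed inputs:

```python
import numpy as np
from scipy.stats import norm

def frame_log_likelihoods(X, c, mu, var):
    # X: (T, D) feature components; c: (M,) Gaussian weights; mu, var: (M, D)
    ll = np.empty(len(X))
    for t, x in enumerate(X):
        comp = c * np.prod(norm.pdf(x, loc=mu, scale=np.sqrt(var)), axis=1)
        ll[t] = np.log(comp.sum() + 1e-12)     # log-likelihood of component x_t
    return ll

def estimate_environment_factor(X, c, mu, var, steps=100, lr=0.01):
    ll = frame_log_likelihoods(X, c, mu, var)
    w = np.full(len(X), 1.0 / len(X))          # start from uniform weights
    best = -np.inf
    for _ in range(steps):
        w = w * np.exp(lr * (ll - ll.mean()))  # multiplicative-weights ascent step
        w /= w.sum()                           # keep the weight set normalised
        score = float(w @ ll)                  # weighted likelihood L(X)
        if score - best < 1e-6:
            break                              # likelihood has converged to its optimum
        best = score
    return w                                   # the environmental factor
```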
7. The speech recognition model training method according to any one of claims 1-3, wherein, when the migration model is trained iteratively, the learning rate set for the output layer is lower than the learning rate set for the other layers.
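Claim 7's per-layer learning rates map directly onto optimizer parameter groups; a sketch reusing the hypothetical `migration_model` from the claim 1 sketch, with assumed rate values:

```python
import torch

# assumes `migration_model` is the nn.Sequential built in the claim 1 sketch
output_params = list(migration_model[-1].parameters())
other_params = [p for layer in list(migration_model)[:-1] for p in layer.parameters()]
optimizer = torch.optim.SGD([
    {"params": output_params, "lr": 1e-4},  # output layer: lower learning rate
    {"params": other_params, "lr": 1e-3},   # remaining layers (assumed values)
])
```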
8. The speech recognition model training method according to any one of claims 1-3, wherein, during the iterative training, after each iteration is completed, the method further comprises:
comparing the error rate of the model tested after the current iteration with the error rate of the model tested after the previous iteration;
if the current error rate is greater than the previous error rate, reducing the learning rate by a preset amount;
if the current error rate is less than or equal to the previous error rate, increasing the learning rate by a preset amount.
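A sketch of the adjustment rule in claim 8; the scaling factor is an assumed stand-in for the "preset amount":

```python
def adjust_learning_rate(optimizer, err, prev_err, factor=1.1):
    # compare this iteration's test error rate with the previous one and
    # scale every parameter group's learning rate in the corresponding direction
    for group in optimizer.param_groups:
        if err > prev_err:
            group["lr"] /= factor   # error rate grew: reduce the learning rate
        else:
            group["lr"] *= factor   # error rate shrank or held: increase it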
9. The speech recognition model training method according to any one of claims 1-3, wherein, during the iterative training, when the error rate falls below a threshold value, training is terminated after the current iteration is completed.
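Claims 8 and 9 combine into a simple training loop; `train_one_epoch()` and `evaluate()` are hypothetical helpers, and the threshold and iteration budget are assumed values:

```python
THRESHOLD = 0.05        # assumed error-rate threshold
max_epochs = 50         # assumed iteration budget
prev_err = float("inf")
for epoch in range(max_epochs):
    train_one_epoch(migration_model, train_loader, optimizer)  # hypothetical helper
    err = evaluate(migration_model, test_loader)               # hypothetical helper: test error rate
    adjust_learning_rate(optimizer, err, prev_err)             # claim 8 sketch
    prev_err = err
    if err < THRESHOLD:
        break           # terminate training once the current iteration completes
```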
10. A speech recognition method, comprising:
determining speech data to be recognized;
extracting feature data from the speech data, wherein the feature data carries an environmental factor reflecting the likelihood that different sample data are similar with respect to the same physical attribute of speech;
feeding the feature data into a target speech recognition model for speech recognition, wherein the target speech recognition model is trained based on the method of any one of claims 1-9.
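A sketch of the recognition flow in claim 10, reusing the hypothetical `extract_feature_data()` from the claim 4 sketch:

```python
import torch

def recognize(wav_path, target_model, gmm_params):
    feature = extract_feature_data(wav_path, gmm_params)           # MFCC + environmental factor
    frames = torch.tensor(feature["mfcc"].T, dtype=torch.float32)  # one frame per row
    with torch.no_grad():
        logits = target_model(frames)      # feed the feature data into the trained model
    return logits.argmax(dim=-1)           # predicted output unit per frame
```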
11. A speech recognition model training apparatus, comprising:
a creation module, configured to create a migration model to be trained in a target domain based on an original speech recognition model trained in a source domain, wherein an output layer of the migration model differs from that of the original speech recognition model;
an extraction module, configured to extract a feature data set from sample data of the target domain, wherein each piece of feature data in the feature data set carries an environmental factor reflecting the likelihood that different sample data are similar with respect to the same physical attribute of speech;
a training module, configured to feed a training set from the feature data set into the migration model as input for iterative training to obtain a target speech recognition model;
wherein, during the iterative training, the weights of the output layer of the migration model are adjusted randomly while the weights of the other layers of the migration model remain unchanged.
12. A speech recognition apparatus, comprising:
a determination module, configured to determine speech data to be recognized;
an extraction module, configured to extract feature data from the speech data, wherein the feature data carries an environmental factor reflecting the likelihood that different sample data are similar with respect to the same physical attribute of speech;
a recognition module, configured to feed the feature data into a target speech recognition model for speech recognition, wherein the target speech recognition model is trained based on the method of any one of claims 1-9.
13. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, performs the following:
creating a migration model to be trained in a target domain based on an original speech recognition model trained in a source domain, wherein an output layer of the migration model differs from that of the original speech recognition model;
extracting a feature data set from sample data of the target domain, wherein each piece of feature data in the feature data set carries an environmental factor reflecting the likelihood that different sample data are similar with respect to the same physical attribute of speech;
feeding a training set from the feature data set into the migration model as input for iterative training to obtain a target speech recognition model;
wherein, during the iterative training, the weights of the output layer of the migration model are adjusted randomly while the weights of the other layers of the migration model remain unchanged.
14. A computer-readable storage medium storing one or more programs which, when executed by a server comprising a plurality of application programs, cause the server to perform the following:
creating a migration model to be trained in a target domain based on an original speech recognition model trained in a source domain, wherein an output layer of the migration model differs from an output layer of the original speech recognition model;
extracting a feature data set from sample data of the target domain, wherein each piece of feature data in the feature data set carries an environmental factor reflecting the likelihood that different sample data are similar with respect to the same physical attribute of speech;
feeding a training set from the feature data set into the migration model as input for iterative training to obtain a target speech recognition model;
wherein, during the iterative training, the weights of the output layer of the migration model are adjusted randomly while the weights of the other layers of the migration model remain unchanged.
15. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, performs the following:
determining speech data to be recognized;
extracting feature data from the speech data, wherein the feature data carries an environmental factor reflecting the likelihood that different sample data are similar with respect to the same physical attribute of speech;
feeding the feature data into a target speech recognition model for speech recognition, wherein the target speech recognition model is trained based on the method of any one of claims 1-9.
16. A computer-readable storage medium storing one or more programs which, when executed by a server comprising a plurality of application programs, cause the server to perform the following:
determining speech data to be recognized;
extracting feature data from the speech data, wherein the feature data carries an environmental factor reflecting the likelihood that different sample data are similar with respect to the same physical attribute of speech;
feeding the feature data into a target speech recognition model for speech recognition, wherein the target speech recognition model is trained based on the method of any one of claims 1-9.
CN202011629211.3A 2020-12-31 2020-12-31 Speech recognition model training method, speech recognition method and corresponding device Pending CN114708857A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011629211.3A CN114708857A (en) 2020-12-31 2020-12-31 Speech recognition model training method, speech recognition method and corresponding device
PCT/CN2021/142307 WO2022143723A1 (en) 2020-12-31 2021-12-29 Voice recognition model training method, voice recognition method, and corresponding device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011629211.3A CN114708857A (en) 2020-12-31 2020-12-31 Speech recognition model training method, speech recognition method and corresponding device

Publications (1)

Publication Number Publication Date
CN114708857A true CN114708857A (en) 2022-07-05

Family

ID=82167415

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011629211.3A Pending CN114708857A (en) 2020-12-31 2020-12-31 Speech recognition model training method, speech recognition method and corresponding device

Country Status (2)

Country Link
CN (1) CN114708857A (en)
WO (1) WO2022143723A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116503679B (en) * 2023-06-28 2023-09-05 之江实验室 Image classification method, device, equipment and medium based on migration map

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4209122B2 (en) * 2002-03-06 2009-01-14 旭化成株式会社 Wild bird cry and human voice recognition device and recognition method thereof
CN104036774B (en) * 2014-06-20 2018-03-06 国家计算机网络与信息安全管理中心 Tibetan dialect recognition methods and system
CN107610709B (en) * 2017-08-01 2021-03-19 百度在线网络技术(北京)有限公司 Method and system for training voiceprint recognition model
CN111009237B (en) * 2019-12-12 2022-07-01 北京达佳互联信息技术有限公司 Voice recognition method and device, electronic equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116028821A (en) * 2023-03-29 2023-04-28 中电科大数据研究院有限公司 Pre-training model training method integrating domain knowledge and data processing method
CN116028821B (en) * 2023-03-29 2023-06-13 中电科大数据研究院有限公司 Pre-training model training method integrating domain knowledge and data processing method

Also Published As

Publication number Publication date
WO2022143723A1 (en) 2022-07-07

Similar Documents

Publication Publication Date Title
EP3857543B1 (en) Conversational agent pipeline trained on synthetic data
CN110400579B (en) Speech emotion recognition based on direction self-attention mechanism and bidirectional long-time and short-time network
JP6902010B2 (en) Audio evaluation methods, devices, equipment and readable storage media
US20170358306A1 (en) Neural network-based voiceprint information extraction method and apparatus
CN110782872A (en) Language identification method and device based on deep convolutional recurrent neural network
CN111916111B (en) Intelligent voice outbound method and device with emotion, server and storage medium
CN111312245B (en) Voice response method, device and storage medium
TW201935464A (en) Method and device for voiceprint recognition based on memorability bottleneck features
WO2022143723A1 (en) Voice recognition model training method, voice recognition method, and corresponding device
WO2022093386A1 (en) Internal language model for e2e models
US20230343319A1 (en) speech processing system and a method of processing a speech signal
Obin et al. On the generalization of Shannon entropy for speech recognition
CN112216270B (en) Speech phoneme recognition method and system, electronic equipment and storage medium
Gao et al. An investigation of the target approximation model for tone modeling and recognition in continuous Mandarin speech
CN112863476A (en) Method and device for constructing personalized speech synthesis model, method and device for speech synthesis and testing
CN112951270A (en) Voice fluency detection method and device and electronic equipment
Kalita et al. Use of Bidirectional Long Short Term Memory in Spoken Word Detection with reference to the Assamese language
Abd El-Moneim et al. Effect of reverberation phenomena on text-independent speaker recognition based deep learning
Alfaro-Picado et al. An experimental study on fundamental frequency detection in reverberated speech with pre-trained recurrent neural networks
Ma et al. Fine-grained Dynamical Speech Emotion Analysis Utilizing Networks Customized for Acoustic Data
WO2024055752A1 (en) Speech synthesis model training method, speech synthesis method, and related apparatuses
Abdelmoula et al. A Deep Learning-Based Noise-Resilient Keyword Spotting Engine for Embedded Platforms
Thukroo et al. A comparison of cepstral and spectral features using recurrent neural network for spoken language identification
Qu et al. Improved Vocal Tract Length Perturbation for Improving Child Speech Emotion Recognition
Wang A System of English Spoken Pronunciation Learning Based on Speech Recognition and Mobile Phone Platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination