CN113707134B - Model training method and device for model training

Model training method and device for model training

Info

Publication number
CN113707134B
Authority
CN
China
Prior art keywords
voice
model
enhancement
training
sample
Prior art date
Legal status
Active
Application number
CN202110942719.7A
Other languages
Chinese (zh)
Other versions
CN113707134A (en)
Inventor
王森茂
周盼
王智超
王佳文
Current Assignee
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CN202110942719.7A
Publication of CN113707134A
Application granted
Publication of CN113707134B
Legal status: Active (current)
Anticipated expiration


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/20 - Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention provides a model training method, a model training device, and a device for model training. The method comprises the following steps: acquiring voice training samples, wherein each voice training sample comprises a noisy voice sample and a clean voice sample corresponding to the noisy voice sample; and performing iterative joint training on a voice enhancement model and a voice recognition model connected in series based on the voice training samples, adjusting model parameters of the voice enhancement model and/or the voice recognition model in each training round according to a joint loss value of the two models, and obtaining the trained voice enhancement model and voice recognition model when the joint loss value meets a convergence condition. The embodiment of the invention can improve the training efficiency of the voice recognition model, and can improve its recognition performance in noisy scenes without reducing its recognition performance in clean scenes.

Description

Model training method and device for model training
Technical Field
The invention relates to the technical field of intelligent control, and in particular to a model training method, a model training device, and a device for model training.
Background
As voice recognition algorithms have matured, recognition accuracy in clean scenes has improved steadily. In real noisy scenes, however, voice data often does not reach an ideal degree of cleanliness, so the recognition accuracy of a voice recognition model is reduced, and as the signal-to-noise ratio against background noise falls, the recognition performance of the voice recognition model degrades significantly.
Current voice recognition technology is improved mainly at the data level or at the algorithm level. At the data level, training corpora matched to each required scene are added to the training data, and the voice recognition model is then trained on the adjusted training data. In a real scene, however, the degree of match between corpus and scene is variable and well-matched corpora are difficult to obtain; moreover, the added corpora increase the volume of training data, which lengthens the training time of the voice recognition model and reduces its training efficiency. At the algorithm level, a neural network learns to denoise voice data from a noisy scene to obtain denoised clean voice, and voice recognition is then performed on that denoised voice. However, performing recognition only on denoised voice improves the recognition performance of the voice recognition model in real scenes while reducing its performance in clean scenes, so the model suits only a single application scenario.
Disclosure of Invention
The embodiment of the invention provides a model training method, a model training device, and a device for model training, which can improve the training efficiency of a voice recognition model in complex scenes and improve its recognition performance.
In order to solve the above problems, an embodiment of the present invention discloses a model training method, which includes:
Acquiring a voice training sample, wherein the voice training sample comprises a noisy voice sample and a clean voice sample corresponding to the noisy voice sample;
and performing iterative joint training on the voice enhancement model and the voice recognition model which are connected in series based on the voice training sample, adjusting model parameters of the voice enhancement model and/or the voice recognition model in each training round according to a joint loss value of the two models, and obtaining the trained voice enhancement model and voice recognition model when the joint loss value meets a convergence condition.
Optionally, based on the voice training sample, performing iterative joint training on the voice enhancement model and the voice recognition model which are connected in series, including:
In each round of training, selecting a noisy speech sample from the speech training samples, inputting the noisy speech sample into the speech enhancement model for speech enhancement processing, and obtaining a speech enhancement result corresponding to the noisy speech sample;
extracting features of the voice enhancement result to obtain target feature data corresponding to the voice enhancement result;
inputting the target characteristic data into the voice recognition model for voice recognition processing to obtain a voice recognition result of the voice sample with noise;
and determining a joint loss value of the voice enhancement model and the voice recognition model according to the voice enhancement result and the voice recognition result of the noisy voice sample, and adjusting model parameters of the voice enhancement model and/or the voice recognition model according to the joint loss value.
Optionally, the voice training sample further includes text information corresponding to the noisy voice sample, and determining, according to a voice enhancement result of the noisy voice sample and a voice recognition result of the noisy voice sample, a joint loss value of the voice enhancement model and the voice recognition model includes:
Determining a first loss value of the voice enhancement model according to a voice enhancement result of the noisy voice sample and the clean voice sample;
determining a second loss value of the voice recognition model according to the voice recognition result of the voice sample with noise and the text information;
and carrying out weighted summation on the first loss value and the second loss value to obtain a joint loss value of the voice enhancement model and the voice recognition model.
Optionally, the feature extraction of the speech enhancement result includes:
extracting the characteristics of the voice enhancement result frame by frame to obtain the characteristic information of each frame;
and for the current frame in the voice enhancement result, splicing the feature information of the previous frame and the next frame onto the feature information of the current frame to obtain target feature data.
Optionally, the method further comprises:
Acquiring target voice data to be processed;
And inputting the target voice data into a trained voice enhancement model for voice enhancement processing to obtain a voice enhancement result corresponding to the target voice data.
Optionally, the method further comprises:
Inputting the voice enhancement result corresponding to the target voice data into a trained voice recognition model for voice recognition processing, and obtaining the voice recognition result of the target voice data.
Optionally, the method further comprises:
based on a preset voice recognition model, performing iterative joint training on the serially connected voice enhancement model and the preset voice recognition model by using voice training samples of different scenes, to obtain trained voice enhancement models and voice recognition models for the different scenes.
In another aspect, an embodiment of the present invention discloses a model training apparatus, including:
the training sample acquisition module is used for acquiring a voice training sample, wherein the voice training sample comprises a noisy voice sample and a clean voice sample corresponding to the noisy voice sample;
And the model training module is used for performing iterative joint training on the voice enhancement model and the voice recognition model which are connected in series based on the voice training sample, adjusting model parameters of the voice enhancement model and/or the voice recognition model in each training round according to the joint loss value of the two models, and obtaining the trained voice enhancement model and voice recognition model when the joint loss value meets the convergence condition.
Optionally, the model training module includes:
The voice enhancement sub-module is used for selecting a voice sample with noise from the voice training samples in each round of training, inputting the voice sample with noise into the voice enhancement model for voice enhancement processing, and obtaining a voice enhancement result corresponding to the voice sample with noise;
the feature extraction sub-module is used for carrying out feature extraction on the voice enhancement result to obtain target feature data corresponding to the voice enhancement result;
the voice recognition sub-module is used for inputting the target characteristic data into the voice recognition model to perform voice recognition processing to obtain a voice recognition result of the voice sample with noise;
And the loss value determining submodule is used for determining a joint loss value of the voice enhancement model and the voice recognition model according to the voice enhancement result and the voice recognition result of the noisy voice sample, and adjusting model parameters of the voice enhancement model and/or the voice recognition model according to the joint loss value.
Optionally, the voice training sample further includes text information corresponding to the noisy voice sample, and the loss value determining submodule includes:
A first loss value determining unit, configured to determine a first loss value of the speech enhancement model according to a speech enhancement result of the noisy speech sample and the clean speech sample;
a second loss value determining unit, configured to determine a second loss value of the speech recognition model according to a speech recognition result of the noisy speech sample and the text information;
And the joint loss value determining unit is used for carrying out weighted summation on the first loss value and the second loss value to obtain the joint loss value of the voice enhancement model and the voice recognition model.
Optionally, the feature extraction submodule includes:
The feature extraction unit is used for carrying out feature extraction on the voice enhancement result frame by frame to obtain feature information of each frame;
And the frame splicing processing unit is used for splicing, for the current frame in the voice enhancement result, the feature information of the previous frame and the next frame onto the feature information of the current frame to obtain target feature data.
Optionally, the apparatus further comprises:
the target voice data acquisition module is used for acquiring target voice data to be processed;
the voice enhancement module is used for inputting the target voice data into the trained voice enhancement model to carry out voice enhancement processing, and obtaining a voice enhancement result corresponding to the target voice data.
Optionally, the apparatus further comprises:
the voice recognition module is used for inputting the voice enhancement result corresponding to the target voice data into the trained voice recognition model to perform voice recognition processing, so as to obtain the voice recognition result of the target voice data.
Optionally, the apparatus further comprises:
The enhancement model training module is used for performing iterative joint training on the serially connected voice enhancement model and the preset voice recognition model by using voice training samples of different scenes based on the preset voice recognition model, to obtain trained voice enhancement models and voice recognition models for the different scenes.
In yet another aspect, embodiments of the present invention disclose an apparatus for model training, the apparatus comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for performing the model training method as described in one or more of the foregoing.
In yet another aspect, embodiments of the present invention disclose a machine-readable medium having instructions stored thereon that, when executed by one or more processors, cause an apparatus to perform a model training method as described in one or more of the preceding.
The embodiment of the invention has the following advantages:
According to the embodiment of the invention, iterative joint training is carried out on the voice enhancement model and the voice recognition model which are connected in series based on the acquired voice training samples, model parameters of the voice enhancement model and/or the voice recognition model are adjusted in each training round according to the joint loss value of the two models, and the trained voice enhancement model and voice recognition model are obtained when the joint loss value meets the convergence condition. The invention connects the front-end enhancement model and the voice recognition model in series for joint training, and introduces the enhancement error function of the voice enhancement model into the training of the voice recognition model on the basis of the original recognition error function, so that no large amount of training corpus needs to be added, the training time of the voice recognition model is reduced, and the training efficiency of the voice recognition model is improved.
In addition, the voice training sample in the embodiment of the invention comprises a noisy voice sample and a clean voice sample corresponding to the noisy voice sample, and the voice processed by the front-end enhancement model is used as the training corpus of the voice recognition model, so that the recognition performance of the voice recognition model in a clean scene is not reduced while its recognition performance in complex scenes such as noisy scenes is improved; and the lower the signal-to-noise ratio, the greater the improvement in recognition performance.
Furthermore, for the voice enhancement model, the invention introduces the recognition error function of the voice recognition model into the training of the front-end enhancement model on the basis of the original enhancement error function, reducing the information loss in the output voice of the voice enhancement model, so that the trained enhancement model performs AI noise reduction while remaining oriented toward voice recognition, thereby achieving the purpose of enhancing voice recognition performance.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the steps of an embodiment of a model training method of the present invention;
FIG. 2 is an application scenario architecture diagram of a model training method of the present invention;
FIG. 3 is a schematic diagram of a model training process of the present invention;
FIG. 4 is a block diagram of an embodiment of a model training apparatus of the present invention;
FIG. 5 is a block diagram of an apparatus 800 for model training of the present invention;
fig. 6 is a schematic diagram of a server in some embodiments of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Method embodiment
Referring to fig. 1, a flowchart of steps of an embodiment of a model training method of the present invention is shown, and the method may specifically include the steps of:
Step 101, obtaining a voice training sample, wherein the voice training sample comprises a noisy voice sample and a clean voice sample corresponding to the noisy voice sample.
Step 102, based on the voice training sample, performing iterative joint training on the voice enhancement model and the voice recognition model which are connected in series, adjusting model parameters of the voice enhancement model and/or the voice recognition model in each training round according to the joint loss value of the two models, and obtaining the trained voice enhancement model and voice recognition model when the joint loss value meets the convergence condition.
Referring to fig. 2, an application scenario architecture diagram of a model training method provided by an embodiment of the present invention is shown. As shown in fig. 2, an application scenario of an embodiment of the present invention may include a terminal device 201 and a server 202. Wherein the terminal device 201 and the server 202 are connected through a wireless or wired network. The terminal device 201 includes, but is not limited to, intelligent devices such as an intelligent speaker, an intelligent watch, an intelligent home, an intelligent robot, an AI manual customer service, a bank credit card call system, and electronic devices such as a smart phone, a mobile computer, and a tablet computer having a voice interaction function. The server 202 may provide relevant voice services, such as voice recognition, voice synthesis, etc., and the server 202 may be a server, a server cluster formed by several servers, or a cloud computing center. Both the terminal 201 and the server 202 may be used separately to perform the model training method provided in the embodiment of the present invention, and the terminal 201 and the server 202 may also be used cooperatively to perform the model training method provided in the embodiment of the present invention.
In one possible application scenario, the user interacts with the terminal device 201, and the terminal device 201 sends voice data input by the user to the server 202. The server 202 performs voice recognition processing and semantic analysis processing on voice data sent by the terminal device 201, determines a corresponding voice recognition text according to a semantic analysis result, sends the voice recognition text to the terminal device 201, and the terminal device 201 displays or executes instructions corresponding to the voice recognition text.
In another possible application scenario, the terminal device 201 sends a training instruction containing training data to the server 202. The server 202 performs training using the training data: it holds at least a voice enhancement model and a voice recognition model, obtains a joint loss value by jointly training the two models, and adjusts the model parameters of the voice enhancement model and the voice recognition model according to the joint loss value until the convergence condition is met.
It should be noted that, the architecture diagram in the embodiment of the present invention is for example to more clearly illustrate the technical solution in the embodiment of the present invention, and does not constitute a limitation on the technical solution provided by the embodiment of the present invention, and for other application scenario architectures and service applications, the technical method provided by the embodiment of the present invention is also applicable to similar problems.
The embodiment of the invention relates to a joint model for voice processing, which comprises two models that process voice at different stages, namely a voice enhancement model at the front end and a voice recognition model at the back end. Each of the two models may be a machine learning model. A machine learning model is a model that acquires a certain capability after learning from samples; it may specifically be a neural network model, such as a CNN (Convolutional Neural Networks) model or an RNN (Recurrent Neural Networks) model. Of course, other types of machine learning models may also be employed.
It can be understood that, before model training, the model adopted at each stage can be flexibly selected according to conditions such as precision requirements, so that each stage can adopt an optimal configuration without compromising the performance of any other stage. In other words, the voice enhancement model and the voice recognition model of the embodiments of the present invention can each be freely chosen as a specific model that excels in its corresponding field.
The voice enhancement model is used for enhancing the input voice data with noise to obtain relatively clean voice data. The voice recognition model is used for carrying out recognition processing on input voice data and recognizing text information corresponding to the voice data.
According to the model training method provided by the embodiment of the invention, the front-end enhancement model and the voice recognition model are connected in series for joint training, and the enhancement error function of the voice enhancement model is introduced into the training of the voice recognition model on the basis of the original recognition error function, so that no large amount of training corpus needs to be added, the training time of the voice recognition model is reduced, and the training efficiency of the voice recognition model is improved.
In addition, the voice training sample in the embodiment of the invention comprises a noisy voice sample and a clean voice sample corresponding to the noisy voice sample, and the voice processed by the front-end enhancement model is used as the training corpus of the voice recognition model, so that the recognition performance of the voice recognition model in a clean scene is not reduced while its recognition performance in complex scenes such as noisy scenes is improved; and the lower the signal-to-noise ratio, the greater the improvement in recognition performance.
In the embodiment of the invention, the voice enhancement model and the voice recognition model each produce a loss value in every round of training. The embodiment of the invention determines the joint loss value of the two models in each round and adjusts model parameters of the voice enhancement model and/or the voice recognition model according to the joint loss value. The joint loss value is determined by the training results of both models, and since the voice enhancement model is connected in series with the voice recognition model, their training results are correlated; adjusting model parameters of the voice enhancement model and/or the voice recognition model by the joint loss value therefore balances the two models, so that the adjusted voice enhancement model and voice recognition model both reach their best performance. For the voice recognition model, the embodiment of the invention can improve recognition performance in complex scenes such as noisy scenes; for the voice enhancement model, the embodiment of the invention can reduce the information loss in its output voice, so that the trained enhancement model performs AI noise reduction while remaining oriented toward voice recognition, achieving the purpose of enhancing voice recognition performance.
After multiple rounds of training, when the joint loss value meets the convergence condition, the voice enhancement model and the voice recognition model are considered trained. The convergence condition can be set as a preset number of training rounds or as an error limit on the training results. For example, the convergence condition is considered satisfied when the number of training rounds reaches the preset number, or when the error between the training results of successive rounds is smaller than a preset threshold. The trained voice enhancement model and voice recognition model may be used alone or in series.
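As a sketch, the two convergence tests just mentioned might be combined as follows; the round limit and tolerance are illustrative values, not specified by the patent.

```python
def converged(round_idx, loss_history, max_rounds=50, tol=1e-4):
    """Training stops when the preset number of rounds is reached, or when
    the joint loss changes by less than a preset threshold between rounds."""
    if round_idx >= max_rounds:
        return True
    return len(loss_history) >= 2 and abs(loss_history[-1] - loss_history[-2]) < tol
```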
In an alternative embodiment of the present invention, the performing iterative joint training on the speech enhancement model and the speech recognition model based on the speech training samples in step 102 includes:
Step S11, selecting a noisy speech sample from the speech training samples in each round of training, and inputting the noisy speech sample into the speech enhancement model for speech enhancement processing to obtain a speech enhancement result corresponding to the noisy speech sample;
Step S12, extracting features of the voice enhancement result to obtain target feature data corresponding to the voice enhancement result;
Step S13, inputting the target characteristic data into the voice recognition model for voice recognition processing to obtain a voice recognition result of the noisy voice sample;
Step S14, determining a joint loss value of the voice enhancement model and the voice recognition model according to the voice enhancement result and the voice recognition result of the noisy voice sample, and adjusting model parameters of the voice enhancement model and/or the voice recognition model according to the joint loss value.
Referring to fig. 3, a schematic diagram of a model training process according to an embodiment of the present invention is shown. In each training round, a noisy voice sample from the voice training samples is first input into the voice enhancement model, which performs enhancement processing on the noisy voice to obtain the corresponding voice enhancement result. Before the noisy voice sample is input into the voice enhancement model, it can first be framed: the sample is segmented into a number of short segments, each segment being one frame, and the voice enhancement model then performs voice enhancement processing frame by frame.
After the voice enhancement result corresponding to the noisy voice sample is obtained, feature extraction is further performed on it, and the resulting target feature data is used as the input of the voice recognition model, which performs voice recognition processing on the target feature data to obtain the voice recognition result of the noisy voice sample. The extracted target feature data may be, for example, Fbank features of the voice enhancement result.
Finally, a joint loss value of the voice enhancement model and the voice recognition model is determined according to the voice enhancement result and the voice recognition result of the noisy voice sample, and model parameters of the voice enhancement model and/or the voice recognition model are adjusted according to the joint loss value. The model parameters of both models include the weight parameters of their neural networks. In implementation, the weight parameters of the neural network can be initialized; the gradients of the nodes in the network are then computed from the joint loss value, and the weight parameters of the corresponding nodes are adjusted according to those gradients.
After the parameters are adjusted, the next round of training is carried out, until the joint loss value meets the convergence condition, at which point the training of the voice enhancement model and the voice recognition model is determined to be complete.
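To make the training round above concrete, here is a minimal PyTorch-style sketch of steps S11 to S14 over the serially connected models. The network architectures, vocabulary size, data shapes, and helper names are illustrative assumptions (the patent fixes none of them); the 0.8/0.2 loss weights follow the example given in the weighted-summation discussion below, and feature extraction is simplified to an identity here, with an Fbank sketch given later.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Enhancer(nn.Module):
    """Stand-in front-end voice enhancement network (architecture is an assumption)."""
    def __init__(self, dim=257):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 512), nn.ReLU(), nn.Linear(512, dim))

    def forward(self, x):              # x: (batch, frames, bins) noisy spectrogram
        return self.net(x)             # regression toward the clean spectrogram

class Recognizer(nn.Module):
    """Stand-in back-end voice recognition network (architecture is an assumption)."""
    def __init__(self, dim=257, vocab=100):
        super().__init__()
        self.rnn = nn.GRU(dim, 256, batch_first=True)
        self.out = nn.Linear(256, vocab)

    def forward(self, x):              # x: (batch, frames, feat_dim)
        h, _ = self.rnn(x)
        return self.out(h)             # per-frame logits over the vocabulary

enhancer, recognizer = Enhancer(), Recognizer()
optimizer = torch.optim.Adam(
    list(enhancer.parameters()) + list(recognizer.parameters()))

def train_round(noisy, clean, labels, w_enh=0.8, w_asr=0.2):
    """One training round over the serially connected models (steps S11-S14)."""
    enhanced = enhancer(noisy)                      # S11: voice enhancement result
    features = enhanced                             # S12: simplified; see Fbank sketch below
    logits = recognizer(features)                   # S13: voice recognition result
    loss_enh = F.mse_loss(enhanced, clean)          # first loss: enhancement vs. clean sample
    loss_asr = F.cross_entropy(
        logits.transpose(1, 2), labels)             # second loss: recognition vs. text labels
    joint = w_enh * loss_enh + w_asr * loss_asr     # joint loss: weighted sum
    optimizer.zero_grad()
    joint.backward()                                # gradients flow through both serial models
    optimizer.step()                                # adjust parameters of one or both models
    return joint.item()
```

The joint loss returned per round can feed a convergence check such as the converged() sketch above. Note that in a real system the feature-extraction step between the two models must remain differentiable so that the gradient of the recognition loss can reach the parameters of the enhancement model.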
In an optional embodiment of the present invention, the speech training sample further includes text information corresponding to the noisy speech sample, and step S14 of determining, according to a speech enhancement result of the noisy speech sample and a speech recognition result of the noisy speech sample, a joint loss value of the speech enhancement model and the speech recognition model includes:
Sub-step S141, determining a first loss value of the voice enhancement model according to the voice enhancement result of the noisy voice sample and the clean voice sample;
sub-step S142, determining a second loss value of the voice recognition model according to the voice recognition result of the noisy voice sample and the text information;
Sub-step S143, carrying out weighted summation on the first loss value and the second loss value to obtain a joint loss value of the voice enhancement model and the voice recognition model.
In the embodiment of the invention, in each round of training, the first loss value of the voice enhancement model can be determined from the voice enhancement result output by the voice enhancement model and the clean voice sample, and the second loss value of the voice recognition model can be determined from the voice recognition result output by the voice recognition model and the text information corresponding to the noisy sample. The joint loss value of the voice enhancement model and the voice recognition model is then obtained by weighted summation of the first loss value and the second loss value.
The first loss value may be determined from the spectral features of the voice enhancement result and those of the clean voice sample; for example, it may be computed as the mean of the squared errors between corresponding points of the voice enhancement result and the clean voice sample, i.e. the MSE (Mean Square Error). The second loss value may be determined from the cross entropy between the voice recognition result and the text information corresponding to the noisy voice sample.
The weights of the first loss value and the second loss value can be determined from the weights of the voice enhancement model and the voice recognition model, which in turn can be determined by factors such as the training target, the training environment, and the application scene. The higher the performance requirement on a model, the smaller the weight assigned to that model's loss value. In the embodiment of the present invention the main objective is to improve the recognition performance of the voice recognition model, so a smaller weight may be set for the second loss value of the voice recognition model; for example, the weight of the first loss value may be set to 0.8 and the weight of the second loss value to 0.2.
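As an illustrative calculation (the loss values here are hypothetical, not from the patent): if the first loss value of a round is 0.05 and the second loss value is 1.2, the joint loss value under these weights is 0.8 × 0.05 + 0.2 × 1.2 = 0.04 + 0.24 = 0.28.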
In an optional embodiment of the present invention, the feature extraction of the speech enhancement result in step S12 includes:
Step S121, carrying out feature extraction on the voice enhancement result frame by frame to obtain feature information of each frame;
And a substep S122, adding the feature information of the previous frame and the next frame of the current frame to the feature information of the current frame in the voice enhancement result to obtain target feature data.
In the embodiment of the invention, when feature extraction is performed on the voice enhancement result, features are extracted frame by frame to obtain the feature information of each frame. Specifically, the voice enhancement result may first be framed, and a series of pre-processing steps such as pre-emphasis and windowing may further be applied to each frame before features are extracted from the pre-processed frames. The extracted features may be Fbank features: for example, a fast Fourier transform converts each frame from a time-domain signal to a frequency-domain signal; the Fourier-transformed frequency-domain signal is then filtered through a mel filter bank, and taking the logarithm of the filter outputs gives the Fbank features of each frame.
To achieve a better voice smoothing effect, frame splicing can further be applied to the extracted feature information. Specifically, the feature information of the previous frame and the next frame is added to the feature information of the current frame, giving the target feature data corresponding to the voice enhancement result. The target feature data thus reflects context information, which helps improve its accuracy.
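The following is a minimal sketch of the two steps above, assuming torchaudio's Kaldi-compatible Fbank front end (which internally performs the framing, pre-emphasis, windowing, fast Fourier transform, mel filtering and logarithm described here) and a one-frame context on each side; the mel bin count, sample rate and context width are assumptions, as the patent does not fix them.

```python
import torch
import torch.nn.functional as F
import torchaudio

def fbank_with_context(waveform: torch.Tensor, num_mel_bins: int = 40) -> torch.Tensor:
    """Frame-by-frame Fbank extraction followed by frame splicing:
    the previous and next frame features are added onto each current frame."""
    # Framing, pre-emphasis, windowing, FFT, mel filtering and log all happen here.
    feats = torchaudio.compliance.kaldi.fbank(
        waveform, num_mel_bins=num_mel_bins, sample_frequency=16000.0)  # (frames, mel)
    # Replicate-pad one frame at each edge so the first and last frames
    # also have a "previous" and a "next" neighbour.
    padded = F.pad(feats.t().unsqueeze(0), (1, 1), mode="replicate")[0].t()  # (frames+2, mel)
    prev, cur, nxt = padded[:-2], padded[1:-1], padded[2:]
    # Splice the neighbour features onto the current frame's features.
    return torch.cat([prev, cur, nxt], dim=1)  # (frames, 3 * mel)
```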
In an alternative embodiment of the invention, the method further comprises:
Step S21, obtaining target voice data to be processed;
And S22, inputting the target voice data into a trained voice enhancement model for voice enhancement processing, and obtaining a voice enhancement result corresponding to the target voice data.
The voice enhancement model and the voice recognition model trained in the embodiment of the invention can be used independently. For example, voice enhancement processing is performed on the target voice data through the trained voice enhancement model to obtain the corresponding voice enhancement result; or voice recognition processing is performed on the target voice data through the trained voice recognition model to obtain the corresponding voice recognition result.
In an alternative embodiment of the invention, the method further comprises:
Inputting the voice enhancement result corresponding to the target voice data into a trained voice recognition model for voice recognition processing, and obtaining the voice recognition result of the target voice data.
The voice enhancement model and the voice recognition model trained in the embodiment of the invention can also be used in series. Specifically, the target voice data can be input into the trained voice enhancement model, which performs voice enhancement processing to obtain the corresponding voice enhancement result; the voice enhancement result is then input into the trained voice recognition model, which performs voice recognition processing to obtain the voice recognition result of the target voice data.
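Reusing the illustrative enhancer and recognizer from the training sketch earlier, serial use at inference time might look like the following; this is a sketch under those same assumptions, not the patent's implementation.

```python
import torch

def recognize(target_voice: torch.Tensor) -> torch.Tensor:
    """Serial inference: voice enhancement first, then voice recognition."""
    with torch.no_grad():                    # no parameter adjustment at inference time
        enhanced = enhancer(target_voice)    # voice enhancement result
        logits = recognizer(enhanced)        # voice recognition result
    return logits.argmax(dim=-1)             # most likely label index per frame
```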
In an alternative embodiment of the invention, the method further comprises:
based on a preset voice recognition model, performing iterative joint training on the serially connected voice enhancement model and the preset voice recognition model by using voice training samples of different scenes, to obtain trained voice enhancement models and voice recognition models for the different scenes.
The trained voice enhancement model and voice recognition model can process target voice data in a clean scene as well as in complex noisy scenes. In specific applications, the target voice data can be preprocessed according to the application scene, for example by adjusting its signal-to-noise ratio; different voice enhancement models can be selected for different application scenes, or the model structure of the voice enhancement model can be adjusted according to the application scene; iterative joint training is then performed on the serially connected voice enhancement model and the preset voice recognition model using voice training samples of the different scenes, obtaining trained voice enhancement models and voice recognition models for the different scenes.
Compared with traditional training methods for voice recognition models, in which the voice recognition model must be trained separately on training data matched to each application scene, the training cycle is long, and the recognition performance of the trained model cannot be guaranteed across different scenes at the same time, the embodiment of the invention only needs to adjust the front-end voice enhancement model according to the application scene; the model structure of the preset voice recognition model does not need to be changed, and no large amount of scene-matched training corpus needs to be introduced into the training samples, which improves both the training efficiency and the recognition performance of the voice recognition model.
In addition, the trained voice recognition model in the embodiment of the invention can also recognize target voice data containing the voices of several users in a noisy scene. Specifically, the voice characteristic information of a target user can be determined first; the voice data of the target user is then extracted from the target voice data according to that characteristic information and input into the trained voice recognition model to obtain the text information corresponding to the target user's voice data, so that the voice data of the target user can be recognized in a targeted manner in a noisy scene.
In summary, the embodiment of the invention connects a voice enhancement model in series before the voice recognition model and performs iterative joint training on the two models, introducing the enhancement error function of the voice enhancement model into the training of the voice recognition model on the basis of the original recognition error function, so that no large amount of training corpus needs to be added, the training time of the voice recognition model is reduced, and its training efficiency is improved. In addition, the voice training sample in the embodiment of the invention comprises a noisy voice sample and a clean voice sample corresponding to the noisy voice sample, and the voice processed by the front-end enhancement model is used as the training corpus of the voice recognition model, so that the recognition performance of the voice recognition model in a clean scene is not reduced while its recognition performance in complex scenes such as noisy scenes is improved; and the lower the signal-to-noise ratio, the greater the improvement in recognition performance.
Furthermore, for the voice enhancement model, the invention introduces the recognition error function of the voice recognition model into the training of the front-end enhancement model on the basis of the original enhancement error function, reducing the information loss in the output voice of the voice enhancement model, so that the trained enhancement model performs AI noise reduction while remaining oriented toward voice recognition, thereby achieving the purpose of enhancing voice recognition performance.
It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required by the embodiments of the invention.
Device embodiment
Referring to fig. 4, there is shown a block diagram of an embodiment of a model training apparatus of the present invention, which may include:
A training sample obtaining module 201, configured to obtain a voice training sample, where the voice training sample includes a noisy voice sample and a clean voice sample corresponding to the noisy voice sample;
The model training module 202 is configured to perform iterative joint training on the voice enhancement model and the voice recognition model that are connected in series based on the voice training sample, adjust model parameters of the voice enhancement model and/or the voice recognition model in each training round according to the joint loss value of the two models, and obtain the trained voice enhancement model and voice recognition model when the joint loss value meets the convergence condition.
Optionally, the model training module includes:
The voice enhancement sub-module is used for selecting a voice sample with noise from the voice training samples in each round of training, inputting the voice sample with noise into the voice enhancement model for voice enhancement processing, and obtaining a voice enhancement result corresponding to the voice sample with noise;
the feature extraction sub-module is used for carrying out feature extraction on the voice enhancement result to obtain target feature data corresponding to the voice enhancement result;
the voice recognition sub-module is used for inputting the target characteristic data into the voice recognition model to perform voice recognition processing to obtain a voice recognition result of the voice sample with noise;
And the loss value determining submodule is used for determining a joint loss value of the voice enhancement model and the voice recognition model according to the voice enhancement result and the voice recognition result of the noisy voice sample, and adjusting model parameters of the voice enhancement model and/or the voice recognition model according to the joint loss value.
Optionally, the voice training sample further includes text information corresponding to the noisy voice sample, and the loss value determining submodule includes:
A first loss value determining unit, configured to determine a first loss value of the speech enhancement model according to a speech enhancement result of the noisy speech sample and the clean speech sample;
a second loss value determining unit, configured to determine a second loss value of the speech recognition model according to a speech recognition result of the noisy speech sample and the text information;
And the joint loss value determining unit is used for carrying out weighted summation on the first loss value and the second loss value to obtain the joint loss value of the voice enhancement model and the voice recognition model.
Optionally, the feature extraction submodule includes:
The feature extraction unit is used for carrying out feature extraction on the voice enhancement result frame by frame to obtain feature information of each frame;
And the frame splicing processing unit is used for splicing, for the current frame in the voice enhancement result, the feature information of the previous frame and the next frame onto the feature information of the current frame to obtain target feature data.
Optionally, the apparatus further comprises:
the target voice data acquisition module is used for acquiring target voice data to be processed;
the voice enhancement module is used for inputting the target voice data into the trained voice enhancement model to carry out voice enhancement processing, and obtaining a voice enhancement result corresponding to the target voice data.
Optionally, the apparatus further comprises:
the voice recognition module is used for inputting the voice enhancement result corresponding to the target voice data into the trained voice recognition model to perform voice recognition processing, so as to obtain the voice recognition result of the target voice data.
Optionally, the apparatus further comprises:
The enhancement model training module is used for performing iterative joint training on the serially connected voice enhancement model and the preset voice recognition model by using voice training samples of different scenes based on the preset voice recognition model, to obtain trained voice enhancement models and voice recognition models for the different scenes.
In summary, the embodiment of the invention connects a voice enhancement model in series before the voice recognition model and performs iterative joint training on the two models, introducing the enhancement error function of the voice enhancement model into the training of the voice recognition model on the basis of the original recognition error function, so that no large amount of training corpus needs to be added, the training time of the voice recognition model is reduced, and its training efficiency is improved. In addition, the voice training sample in the embodiment of the invention comprises a noisy voice sample and a clean voice sample corresponding to the noisy voice sample, and the voice processed by the front-end enhancement model is used as the training corpus of the voice recognition model, so that the recognition performance of the voice recognition model in a clean scene is not reduced while its recognition performance in complex scenes such as noisy scenes is improved; and the lower the signal-to-noise ratio, the greater the improvement in recognition performance.
Furthermore, for the voice enhancement model, the invention introduces the recognition error function of the voice recognition model into the training of the front-end enhancement model on the basis of the original enhancement error function, reducing the information loss in the output voice of the voice enhancement model, so that the trained enhancement model performs AI noise reduction while remaining oriented toward voice recognition, thereby achieving the purpose of enhancing voice recognition performance.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described by differences from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other.
The specific manner in which the various modules perform the operations in the apparatus of the above embodiments have been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
An embodiment of the invention provides an apparatus for model training, the apparatus comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for: acquiring a voice training sample, wherein the voice training sample comprises a noisy voice sample and a clean voice sample corresponding to the noisy voice sample; and performing iterative joint training on the voice enhancement model and the voice recognition model which are connected in series based on the voice training sample, adjusting model parameters of the voice enhancement model and/or the voice recognition model in each training round according to a joint loss value of the two models, and obtaining the trained voice enhancement model and voice recognition model when the joint loss value meets a convergence condition.
Optionally, based on the voice training sample, performing iterative joint training on the voice enhancement model and the voice recognition model which are connected in series, including:
In each round of training, selecting a noisy speech sample from the speech training samples, inputting the noisy speech sample into the speech enhancement model for speech enhancement processing, and obtaining a speech enhancement result corresponding to the noisy speech sample;
extracting features of the voice enhancement result to obtain target feature data corresponding to the voice enhancement result;
inputting the target characteristic data into the voice recognition model for voice recognition processing to obtain a voice recognition result of the voice sample with noise;
and determining a joint loss value of the voice enhancement model and the voice recognition model according to the voice enhancement result and the voice recognition result of the noisy voice sample, and adjusting model parameters of the voice enhancement model and/or the voice recognition model according to the joint loss value.
Optionally, the voice training sample further includes text information corresponding to the noisy voice sample, and determining, according to a voice enhancement result of the noisy voice sample and a voice recognition result of the noisy voice sample, a joint loss value of the voice enhancement model and the voice recognition model includes:
Determining a first loss value of the voice enhancement model according to a voice enhancement result of the noisy voice sample and the clean voice sample;
determining a second loss value of the voice recognition model according to the voice recognition result of the voice sample with noise and the text information;
and carrying out weighted summation on the first loss value and the second loss value to obtain a joint loss value of the voice enhancement model and the voice recognition model.
Optionally, the feature extraction of the speech enhancement result includes:
extracting the characteristics of the voice enhancement result frame by frame to obtain the characteristic information of each frame;
and for the current frame in the voice enhancement result, splicing the feature information of the previous frame and the next frame onto the feature information of the current frame to obtain target feature data.
Optionally, the apparatus is further configured to have the one or more programs executed by one or more processors, the one or more programs including instructions for:
Acquiring target voice data to be processed;
And inputting the target voice data into a trained voice enhancement model for voice enhancement processing to obtain a voice enhancement result corresponding to the target voice data.
Optionally, the one or more programs are further configured to be executed by the one or more processors and comprise instructions for:
Inputting the voice enhancement result corresponding to the target voice data into a trained voice recognition model for voice recognition processing, and obtaining the voice recognition result of the target voice data.
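At inference time the two trained models can be cascaded directly; a sketch under the same assumptions as above:

```python
import torch

def run_pipeline(enhancer, recognizer, extract_features, target_audio):
    # Deployment cascade: enhance the target voice data first, then
    # recognize on features extracted from the enhanced speech.
    with torch.no_grad():
        enhanced = enhancer(target_audio)       # voice enhancement result
        features = extract_features(enhanced)   # e.g. splice_frames above
        text = recognizer(features)             # voice recognition result
    return enhanced, text
```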
Optionally, the one or more programs are further configured to be executed by the one or more processors and comprise instructions for:
performing, based on a preset voice recognition model, iterative joint training on the serially connected voice enhancement model and the preset voice recognition model using voice training samples from different scenes, to obtain trained voice enhancement models and voice recognition models for the different scenes.
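One way to read this embodiment, sketched with hypothetical helpers (make_enhancer, load_preset_recognizer, and train_jointly do not appear in the patent):

```python
def train_per_scene(scene_datasets, make_enhancer, load_preset_recognizer,
                    train_jointly):
    # scene_datasets: dict mapping scene name -> voice training samples.
    # Each scene gets a fresh enhancer trained jointly with a copy of the
    # same preset recognizer, yielding scene-specific model pairs.
    scene_models = {}
    for scene, samples in scene_datasets.items():
        enhancer = make_enhancer()
        recognizer = load_preset_recognizer()
        train_jointly(enhancer, recognizer, samples)
        scene_models[scene] = (enhancer, recognizer)
    return scene_models
```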
FIG. 5 is a block diagram illustrating an apparatus 800 for model training, according to an example embodiment. For example, apparatus 800 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 5, apparatus 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the apparatus 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interactions between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the device 800. Examples of such data include instructions for any application or method operating on the device 800, contact data, phonebook data, messages, pictures, videos, and the like. The memory 804 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 806 provides power to the various components of the device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the device 800 is in an operational mode, such as a shooting mode or a video mode. Each of the front camera and the rear camera may be a fixed optical lens system or have focus and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the device 800 is in an operational mode, such as a call mode, a recording mode, and a voice information processing mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 814 includes one or more sensors for providing status assessments of various aspects of the apparatus 800. For example, the sensor assembly 814 may detect an on/off state of the device 800 and the relative positioning of components, such as the display and keypad of the device 800; the sensor assembly 814 may also detect a change in position of the device 800 or a component of the device 800, the presence or absence of user contact with the device 800, an orientation or acceleration/deceleration of the device 800, and a change in temperature of the device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the apparatus 800 and other devices. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 804 including instructions executable by processor 820 of apparatus 800 to perform the above-described method. For example, the non-transitory computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
Fig. 6 is a schematic diagram of a server in some embodiments of the invention. The server 1900 may vary considerably in configuration or performance and may include one or more central processing units (CPUs) 1922 (e.g., one or more processors), memory 1932, and one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. The memory 1932 and the storage medium 1930 may be transitory or persistent. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, the central processing unit 1922 may be configured to communicate with the storage medium 1930 to execute, on the server 1900, the series of instruction operations in the storage medium 1930.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
A non-transitory computer readable storage medium is provided, wherein instructions in the storage medium, when executed by a processor of an apparatus (a server or a terminal), enable the apparatus to perform the model training method shown in Fig. 1.
A non-transitory computer readable storage medium is provided, wherein instructions in the storage medium, when executed by a processor of an apparatus (a server or a terminal), cause the apparatus to perform a model training method, the method comprising: acquiring a voice training sample, wherein the voice training sample comprises a noisy voice sample and a clean voice sample corresponding to the noisy voice sample; and performing iterative joint training on the serially connected voice enhancement model and voice recognition model based on the voice training sample, adjusting model parameters of the voice enhancement model and/or the voice recognition model in each training round according to the joint loss value of the voice enhancement model and the voice recognition model, and obtaining the trained voice enhancement model and voice recognition model when the joint loss value meets a convergence condition.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise forms disclosed; any modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within its scope.
The above detailed description of the model training method, the model training apparatus, and the device for model training provided by the invention uses specific examples to illustrate the principles and embodiments of the invention; the above examples are intended only to help in understanding the method and its core ideas. Meanwhile, since those skilled in the art may vary the specific embodiments and the scope of application in accordance with the ideas of the invention, this description should not be construed as limiting the invention.

Claims (9)

1. A method of model training, the method comprising:
Acquiring a voice training sample, wherein the voice training sample comprises a noisy voice sample and a clean voice sample corresponding to the noisy voice sample;
Performing iterative joint training on the serially connected voice enhancement model and voice recognition model based on the voice training sample, adjusting model parameters of the voice enhancement model and the voice recognition model in each training round according to the joint loss value of the voice enhancement model and the voice recognition model, and obtaining the trained voice enhancement model and voice recognition model when the joint loss value meets a convergence condition;
Performing iterative joint training on the serially connected voice enhancement model and voice recognition model based on the voice training sample comprises the following steps: in each round of training, selecting a noisy voice sample from the voice training samples, inputting the noisy voice sample into the voice enhancement model for voice enhancement processing, and obtaining a voice enhancement result corresponding to the noisy voice sample; extracting features of the voice enhancement result frame by frame to obtain feature information of each frame; adding the feature information of the previous frame and the next frame of the current frame to the feature information of the current frame to obtain target feature data corresponding to the voice enhancement result; inputting the target feature data into the voice recognition model for voice recognition processing to obtain a voice recognition result of the noisy voice sample; determining a joint loss value of the voice enhancement model and the voice recognition model according to the voice enhancement result of the noisy voice sample and the voice recognition result of the noisy voice sample, and adjusting model parameters of the voice enhancement model and the voice recognition model according to the joint loss value;
The voice training sample further comprises text information corresponding to the noisy voice sample, and the determining of the joint loss value of the voice enhancement model and the voice recognition model according to the voice enhancement result of the noisy voice sample and the voice recognition result of the noisy voice sample comprises the following steps:
Determining a first loss value of the voice enhancement model according to a voice enhancement result of the noisy voice sample and the clean voice sample;
determining a second loss value of the voice recognition model according to the voice recognition result of the noisy voice sample and the text information;
and carrying out weighted summation on the first loss value and the second loss value to obtain a joint loss value of the voice enhancement model and the voice recognition model.
2. The method according to claim 1, wherein the method further comprises:
Acquiring target voice data to be processed;
And inputting the target voice data into a trained voice enhancement model for voice enhancement processing to obtain a voice enhancement result corresponding to the target voice data.
3. The method according to claim 2, wherein the method further comprises:
Inputting the voice enhancement result corresponding to the target voice data into a trained voice recognition model for voice recognition processing, and obtaining the voice recognition result of the target voice data.
4. The method according to claim 1, wherein the method further comprises:
based on a preset voice recognition model, performing iterative joint training on the serial voice enhancement model and the preset voice recognition model by utilizing voice training samples of different scenes to obtain a voice enhancement model and a voice recognition model under different scenes after training.
5. A model training apparatus, the apparatus comprising:
the training sample acquisition module is used for acquiring a voice training sample, wherein the voice training sample comprises a noisy voice sample and a clean voice sample corresponding to the noisy voice sample;
the model training module is used for carrying out iterative joint training on the voice enhancement model and the voice recognition model which are connected in series based on the voice training sample, adjusting model parameters of the voice enhancement model and the voice recognition model according to joint loss values of the voice enhancement model and the voice recognition model in each training round, and obtaining the voice enhancement model and the voice recognition model which are completed in training when the joint loss values meet convergence conditions;
The model training module comprises: the voice enhancement sub-module is used for selecting a noisy voice sample from the voice training samples in each round of training, inputting the noisy voice sample into the voice enhancement model for voice enhancement processing, and obtaining a voice enhancement result corresponding to the noisy voice sample;
the feature extraction sub-module is used for carrying out feature extraction on the voice enhancement result to obtain target feature data corresponding to the voice enhancement result;
the voice recognition sub-module is used for inputting the target feature data into the voice recognition model to perform voice recognition processing, to obtain a voice recognition result of the noisy voice sample;
The feature extraction sub-module comprises:
The feature extraction unit is used for carrying out feature extraction on the voice enhancement result frame by frame to obtain feature information of each frame;
The frame splicing processing unit is used for, for the current frame in the voice enhancement result, adding the feature information of the previous frame and the next frame of the current frame to the feature information of the current frame to obtain target feature data;
The loss value determining sub-module is used for determining a joint loss value of the voice enhancement model and the voice recognition model according to the voice enhancement result of the noisy voice sample and the voice recognition result of the noisy voice sample, and adjusting model parameters of the voice enhancement model and the voice recognition model according to the joint loss value;
The voice training sample further comprises text information corresponding to the noisy voice sample, and the loss value determining sub-module comprises:
A first loss value determining unit, configured to determine a first loss value of the voice enhancement model according to the voice enhancement result of the noisy voice sample and the clean voice sample;
a second loss value determining unit, configured to determine a second loss value of the voice recognition model according to the voice recognition result of the noisy voice sample and the text information;
And the joint loss value determining unit is used for carrying out weighted summation on the first loss value and the second loss value to obtain the joint loss value of the voice enhancement model and the voice recognition model.
6. The apparatus of claim 5, wherein the apparatus further comprises:
the target voice data acquisition module is used for acquiring target voice data to be processed;
the voice enhancement module is used for inputting the target voice data into the trained voice enhancement model to carry out voice enhancement processing, and obtaining a voice enhancement result corresponding to the target voice data.
7. The apparatus of claim 6, wherein the apparatus further comprises:
the voice recognition module is used for inputting the voice enhancement result corresponding to the target voice data into the trained voice recognition model to perform voice recognition processing, so as to obtain the voice recognition result of the target voice data.
8. An apparatus for model training, the apparatus comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for performing the model training method of any of claims 1-4.
9. A machine readable medium having instructions stored thereon, which when executed by one or more processors, cause the processors to perform the model training method of any of claims 1 to 4.
CN202110942719.7A 2021-08-17 2021-08-17 Model training method and device for model training Active CN113707134B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110942719.7A CN113707134B (en) 2021-08-17 2021-08-17 Model training method and device for model training

Publications (2)

Publication Number Publication Date
CN113707134A CN113707134A (en) 2021-11-26
CN113707134B true CN113707134B (en) 2024-05-17

Family

ID=78653014

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110942719.7A Active CN113707134B (en) 2021-08-17 2021-08-17 Model training method and device for model training

Country Status (1)

Country Link
CN (1) CN113707134B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114512136B (en) * 2022-03-18 2023-09-26 北京百度网讯科技有限公司 Model training method, audio processing method, device, equipment, storage medium and program
CN115240648B (en) * 2022-07-18 2023-04-07 四川大学 Controller voice enhancement method and device facing voice recognition
CN116631427B (en) * 2023-07-24 2023-09-29 美智纵横科技有限责任公司 Training method of noise reduction model, noise reduction processing method, device and chip

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2126380A1 (en) * 1993-07-22 1995-01-23 Wu Chou Minimum Error Rate Training of Combined String Models
WO2019100998A1 (en) * 2017-11-24 2019-05-31 腾讯科技(深圳)有限公司 Voice signal processing model training method, electronic device, and storage medium
CN110580910A (en) * 2018-06-08 2019-12-17 北京搜狗科技发展有限公司 Audio processing method, device and equipment and readable storage medium
CN110970015A (en) * 2018-09-30 2020-04-07 北京搜狗科技发展有限公司 Voice processing method and device and electronic equipment
CN110600017A (en) * 2019-09-12 2019-12-20 腾讯科技(深圳)有限公司 Training method of voice processing model, voice recognition method, system and device
CN111261145A (en) * 2020-01-15 2020-06-09 腾讯科技(深圳)有限公司 Voice processing device, equipment and training method thereof
CN111243576A (en) * 2020-01-16 2020-06-05 腾讯科技(深圳)有限公司 Speech recognition and model training method, device, equipment and storage medium
CN111261146A (en) * 2020-01-16 2020-06-09 腾讯科技(深圳)有限公司 Speech recognition and model training method, device and computer readable storage medium
WO2021143327A1 (en) * 2020-01-16 2021-07-22 腾讯科技(深圳)有限公司 Voice recognition method, device, and computer-readable storage medium
CN113178192A (en) * 2021-04-30 2021-07-27 平安科技(深圳)有限公司 Training method, device and equipment of speech recognition model and storage medium

Also Published As

Publication number Publication date
CN113707134A (en) 2021-11-26

Similar Documents

Publication Publication Date Title
CN110826344B (en) Neural network model compression method, corpus translation method and apparatus thereof
CN113707134B (en) Model training method and device for model training
CN110210535B (en) Neural network training method and device and image processing method and device
CN108256555B (en) Image content identification method and device and terminal
CN109871896B (en) Data classification method and device, electronic equipment and storage medium
CN107291690B (en) Punctuation adding method and device and punctuation adding device
CN113362812B (en) Voice recognition method and device and electronic equipment
CN111160448B (en) Training method and device for image classification model
CN110931028B (en) Voice processing method and device and electronic equipment
CN112150457A (en) Video detection method, device and computer readable storage medium
CN110889489A (en) Neural network training method, image recognition method and device
US11354520B2 (en) Data processing method and apparatus providing translation based on acoustic model, and storage medium
CN111104807B (en) Data processing method and device and electronic equipment
CN113362813A (en) Voice recognition method and device and electronic equipment
CN110970015B (en) Voice processing method and device and electronic equipment
WO2022147692A1 (en) Voice command recognition method, electronic device and non-transitory computer-readable storage medium
CN112445906A (en) Method and device for generating reply message
CN110580910B (en) Audio processing method, device, equipment and readable storage medium
CN116403599A (en) Efficient voice separation method and model building method thereof
CN113077808B (en) Voice processing method and device for voice processing
CN112308588A (en) Advertisement putting method and device and storage medium
CN113345452B (en) Voice conversion method, training method, device and medium of voice conversion model
CN113889105A (en) Voice translation method and device for voice translation
CN114694685A (en) Voice quality evaluation method, device and storage medium
CN114154395A (en) Model processing method and device for model processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant