CN113707134A - Model training method and device for model training

Model training method and device for model training

Info

Publication number
CN113707134A
CN113707134A (application number CN202110942719.7A)
Authority
CN
China
Prior art keywords
voice
model
enhancement
speech
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110942719.7A
Other languages
Chinese (zh)
Other versions
CN113707134B (en)
Inventor
王森茂
周盼
王智超
王佳文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CN202110942719.7A
Publication of CN113707134A
Application granted
Publication of CN113707134B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention provides a model training method and device and a device for model training. The method comprises the following steps: acquiring a speech training sample, wherein the speech training sample comprises a noisy speech sample and a clean speech sample corresponding to the noisy speech sample; and carrying out iterative joint training on the serially connected speech enhancement model and speech recognition model based on the speech training sample, adjusting model parameters of the speech enhancement model and/or the speech recognition model in each round of training according to the joint loss value of the speech enhancement model and the speech recognition model, and obtaining the trained speech enhancement model and the trained speech recognition model when the joint loss value meets the convergence condition. The embodiment of the invention can improve the training efficiency of the speech recognition model and improve its recognition performance in noisy scenes without reducing its recognition performance in clean scenes.

Description

Model training method and device for model training
Technical Field
The invention relates to the technical field of intelligent control, in particular to a model training method and device and a device for model training.
Background
As speech recognition algorithms have matured, recognition accuracy in clean scenes has improved steadily. In real noisy scenes, however, speech data often falls short of an ideal level of cleanness, which reduces the recognition accuracy of the speech recognition model; as background noise grows and the signal-to-noise ratio falls, the recognition performance of the speech recognition model degrades markedly.
Current speech recognition technology is mainly improved at the data level or the algorithm level. At the data level, training corpora matching each demand scenario are added to the training data, and the speech recognition model is then trained on the adjusted data. In a real scenario, however, the degree of match between corpus and scene varies, corpora highly matched to a scene are hard to obtain, and the added corpora enlarge the training data, which lengthens the training time of the speech recognition model and lowers its training efficiency. At the algorithm level, a neural network learns to denoise speech data from a noisy scene to obtain noise-reduced clean speech, on which speech recognition is then performed. However, while recognizing noise-reduced clean speech improves the recognition performance of the speech recognition model in real scenes, it reduces its recognition performance in clean scenes, so the applicable scenarios of the speech recognition model remain narrow.
Disclosure of Invention
The embodiment of the invention provides a model training method and device and a device for model training, which can improve the training efficiency of a speech recognition model and improve its recognition performance in complex scenes.
In order to solve the above problem, an embodiment of the present invention discloses a model training method, including:
acquiring a voice training sample, wherein the voice training sample comprises a voice sample with noise and a clean voice sample corresponding to the voice sample with noise;
and carrying out iterative joint training on the serially connected speech enhancement model and speech recognition model based on the speech training sample, adjusting model parameters of the speech enhancement model and/or the speech recognition model in each round of training according to the joint loss value of the speech enhancement model and the speech recognition model, and obtaining the trained speech enhancement model and the trained speech recognition model when the joint loss value meets the convergence condition.
Optionally, the iteratively training the speech enhancement model and the speech recognition model in series based on the speech training sample includes:
in each round of training, selecting a voice sample with noise from the voice training samples, inputting the voice sample with noise into the voice enhancement model, and performing voice enhancement processing to obtain a voice enhancement result corresponding to the voice sample with noise;
performing feature extraction on the voice enhancement result to obtain target feature data corresponding to the voice enhancement result;
inputting the target characteristic data into the voice recognition model to perform voice recognition processing to obtain a voice recognition result of the voice sample with noise;
and determining a joint loss value of the speech enhancement model and the speech recognition model according to the speech enhancement result of the noisy speech sample and the speech recognition result of the noisy speech sample, and adjusting model parameters of the speech enhancement model and/or the speech recognition model according to the joint loss value.
Optionally, the speech training sample further includes text information corresponding to the noisy speech sample, and the determining a joint loss value of the speech enhancement model and the speech recognition model according to the speech enhancement result of the noisy speech sample and the speech recognition result of the noisy speech sample includes:
determining a first loss value of the speech enhancement model according to the speech enhancement result of the noisy speech sample and the clean speech sample;
determining a second loss value of the voice recognition model according to the voice recognition result of the voice sample with the noise and the text information;
and carrying out weighted summation on the first loss value and the second loss value to obtain a joint loss value of the voice enhancement model and the voice recognition model.
Optionally, the performing feature extraction on the speech enhancement result includes:
carrying out feature extraction on the voice enhancement result frame by frame to obtain feature information of each frame;
and for the current frame in the voice enhancement result, adding the feature information of the previous frame and the next frame of the current frame into the feature information of the current frame to obtain target feature data.
Optionally, the method further comprises:
acquiring target voice data to be processed;
and inputting the target voice data into the trained voice enhancement model for voice enhancement processing to obtain a voice enhancement result corresponding to the target voice data.
Optionally, the method further comprises:
and inputting the voice enhancement result corresponding to the target voice data into the trained voice recognition model for voice recognition processing to obtain the voice recognition result of the target voice data.
Optionally, the method further comprises:
based on a preset voice recognition model, performing iterative joint training on the serially connected voice enhancement model and the preset voice recognition model by using voice training samples of different scenes to obtain the trained voice enhancement model and the trained voice recognition model under different scenes.
In another aspect, an embodiment of the present invention discloses a model training apparatus, including:
the system comprises a training sample acquisition module, a comparison module and a comparison module, wherein the training sample acquisition module is used for acquiring a voice training sample, and the voice training sample comprises a voice sample with noise and a clean voice sample corresponding to the voice sample with noise;
and the model training module is used for carrying out iterative joint training on the serially connected speech enhancement model and speech recognition model based on the speech training sample, adjusting model parameters of the speech enhancement model and/or the speech recognition model in each round of training according to the joint loss value of the speech enhancement model and the speech recognition model, and obtaining the trained speech enhancement model and the trained speech recognition model when the joint loss value meets the convergence condition.
Optionally, the model training module includes:
the voice enhancement submodule is used for selecting a voice sample with noise from the voice training samples in each round of training, inputting the voice sample with noise into the voice enhancement model for voice enhancement processing, and obtaining a voice enhancement result corresponding to the voice sample with noise;
the feature extraction submodule is used for extracting features of the voice enhancement result to obtain target feature data corresponding to the voice enhancement result;
the voice recognition submodule is used for inputting the target characteristic data into the voice recognition model to perform voice recognition processing so as to obtain a voice recognition result of the voice sample with the noise;
and the loss value determining submodule is used for determining a joint loss value of the speech enhancement model and the speech recognition model according to the speech enhancement result of the noisy speech sample and the speech recognition result of the noisy speech sample, and adjusting model parameters of the speech enhancement model and/or the speech recognition model according to the joint loss value.
Optionally, the speech training sample further includes text information corresponding to the noisy speech sample, and the loss value determination sub-module includes:
a first loss value determining unit, configured to determine a first loss value of the speech enhancement model according to the speech enhancement result of the noisy speech sample and the clean speech sample;
a second loss value determining unit, configured to determine a second loss value of the speech recognition model according to the speech recognition result of the noisy speech sample and the text information;
and the joint loss value determining unit is used for weighting and summing the first loss value and the second loss value to obtain a joint loss value of the voice enhancement model and the voice recognition model.
Optionally, the feature extraction sub-module includes:
the feature extraction unit is used for extracting features of the voice enhancement result frame by frame to obtain feature information of each frame;
and the frame splicing processing unit is used for adding the characteristic information of the previous frame and the next frame of the current frame to the characteristic information of the current frame in the voice enhancement result to obtain target characteristic data.
Optionally, the apparatus further comprises:
the target voice data acquisition module is used for acquiring target voice data to be processed;
and the voice enhancement module is used for inputting the target voice data into the trained voice enhancement model for voice enhancement processing to obtain a voice enhancement result corresponding to the target voice data.
Optionally, the apparatus further comprises:
and the voice recognition module is used for inputting the voice enhancement result corresponding to the target voice data into the trained voice recognition model for voice recognition processing to obtain the voice recognition result of the target voice data.
Optionally, the apparatus further comprises:
and the enhancement module training module is used for carrying out iterative joint training on the serially connected voice enhancement model and the preset voice recognition model by utilizing voice training samples of different scenes based on the preset voice recognition model to obtain the trained voice enhancement model and the trained voice recognition model under different scenes.
In yet another aspect, an embodiment of the present invention discloses an apparatus for model training, the apparatus comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for performing the model training method described in one or more of the foregoing embodiments.
In yet another aspect, embodiments of the invention disclose a machine-readable medium having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform a model training method as described in one or more of the preceding.
The embodiment of the invention has the following advantages:
the method and the device for the speech enhancement model and the speech recognition model carry out iterative joint training on the speech enhancement model and the speech recognition model which are connected in series based on the obtained speech training samples, the speech enhancement model and/or model parameters of the speech recognition model are/is adjusted in each round of training according to joint loss values of the speech enhancement model and the speech recognition model, and the trained speech enhancement model and the trained speech recognition model are obtained when the joint loss values meet convergence conditions. The front-end enhanced model and the voice recognition model are connected in series for combined training, the enhanced error function of the voice enhanced model is introduced to train the voice recognition model on the basis of the original recognition error function of the voice recognition, a large amount of training corpora are not required to be added, the training time of the voice recognition model is shortened, and the training efficiency of the voice recognition model is improved.
In addition, the speech training samples of the embodiment of the invention comprise noisy speech samples and clean speech samples corresponding to the noisy speech samples, and the speech processed by the front-end enhancement model is used as the training corpus of the speech recognition model. This does not reduce the recognition performance of the speech recognition model in clean scenes, while it improves the recognition performance in complex scenes such as noisy ones; the lower the signal-to-noise ratio, the greater the improvement in recognition performance.
Moreover, for the speech enhancement model, the recognition error function of the recognition model is introduced, on top of the original enhancement error function, to train the front-end enhancement model, which reduces the information loss in the output speech of the speech enhancement model; the trained enhancement model, while performing AI noise reduction, is thus better matched to speech recognition, achieving the purpose of enhancing speech recognition performance.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a flow chart of the steps of one embodiment of a model training method of the present invention;
FIG. 2 is an application scenario architecture diagram of a model training method of the present invention;
FIG. 3 is a schematic diagram of a model training process of the present invention;
FIG. 4 is a block diagram of a model training apparatus according to an embodiment of the present invention;
FIG. 5 is a block diagram of an apparatus 800 for model training of the present invention;
fig. 6 is a schematic diagram of a server in some embodiments of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Method embodiment
Referring to fig. 1, a flowchart illustrating steps of an embodiment of a model training method according to the present invention is shown, where the method may specifically include the following steps:
step 101, obtaining a voice training sample, where the voice training sample includes a noisy voice sample and a clean voice sample corresponding to the noisy voice sample.
And 102, carrying out iterative joint training on the serially connected speech enhancement model and speech recognition model based on the speech training sample, adjusting model parameters of the speech enhancement model and/or the speech recognition model in each round of training according to the joint loss value of the speech enhancement model and the speech recognition model, and obtaining the trained speech enhancement model and the trained speech recognition model when the joint loss value meets the convergence condition.
Referring to fig. 2, an application scenario architecture diagram of the model training method provided by the embodiment of the present invention is shown. As shown in fig. 2, an application scenario of the embodiment of the present invention may include a terminal device 201 and a server 202. The terminal device 201 and the server 202 are connected through a wireless or wired network. The terminal device 201 includes, but is not limited to, smart devices such as a smart speaker, a smart watch, a smart home, a smart robot, an AI manual customer service, a bank credit card billing phone system, and electronic devices such as a smart phone, a mobile computer, and a tablet computer with a voice interaction function. The server 202 may provide related voice services, such as voice recognition, voice synthesis, and the like, and the server 202 may be a server, a server cluster composed of several servers, or a cloud computing center. The terminal 201 and the server 202 may be used separately to execute the model training method provided in the embodiment of the present invention, and the terminal 201 and the server 202 may also be used cooperatively to execute the model training method provided in the embodiment of the present invention.
In one possible application scenario, a user interacts with the terminal device 201, and the terminal device 201 transmits voice data input by the user to the server 202. The server 202 performs voice recognition processing and semantic parsing processing on the voice data sent by the terminal device 201, determines a corresponding voice recognition text according to a semantic parsing result, sends the voice recognition text to the terminal device 201, and the terminal device 201 displays or executes an instruction corresponding to the voice recognition text.
In another possible application scenario, the terminal device 201 sends a training instruction to the server 202, where the training instruction includes training data. The server 202 performs training by using the training data, the server 202 at least includes a speech enhancement model and a speech recognition model, obtains a joint loss value by performing joint training on the speech enhancement model and the speech recognition model, and adjusts model parameters of the speech enhancement model and the speech recognition model by using the joint loss value until a convergence condition is satisfied.
It should be noted that the architecture diagram in the embodiment of the present invention is used as an example to more clearly illustrate the technical solution in the embodiment of the present invention, and does not limit the technical solution provided in the embodiment of the present invention, and for other application scenario architectures and service applications, the technical method provided in the embodiment of the present invention is also applicable to similar problems.
The embodiment of the invention relates to a combined model for speech processing, which comprises two models covering different links of speech processing, specifically a front-end speech enhancement model and a back-end speech recognition model. Each of the two models may be a machine learning model, that is, a model that acquires certain capabilities through learning from samples, and may specifically be a neural network model, such as a CNN (Convolutional Neural Network) model or an RNN (Recurrent Neural Network) model. Of course, the machine learning model may also adopt other types of models.
It can be understood that before the model training, the model adopted by each link can be flexibly selected according to the conditions such as precision requirements and the like, so that each link can adopt the optimal configuration without compromising the performance of any link. In other words, the speech enhancement model and the speech recognition model related in the embodiment of the present invention may respectively freely select a dedicated model good in the corresponding field.
The voice enhancement model is used for enhancing the input voice data with noise to obtain relatively clean voice data. The voice recognition model is used for recognizing the input voice data and recognizing text information corresponding to the voice data.
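For concreteness in the sketches that follow, the snippet below defines minimal PyTorch stand-ins for these two models; the layer choices, sizes, and the waveform-in/waveform-out convention of the enhancer are illustrative assumptions, not the architectures claimed by the disclosure:

```python
import torch
import torch.nn as nn

class SpeechEnhancer(nn.Module):
    """Illustrative front-end enhancement model operating directly on
    the waveform: maps noisy speech to enhanced speech of equal length."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(64, 1, kernel_size=9, padding=4))

    def forward(self, noisy):                 # (batch, 1, samples)
        return self.net(noisy)                # enhanced waveform

class SpeechRecognizer(nn.Module):
    """Illustrative back-end recognition model: maps feature frames to
    per-frame scores over output units (e.g. characters)."""
    def __init__(self, feat_dim=240, vocab_size=5000):
        super().__init__()
        self.net = nn.GRU(feat_dim, 512, num_layers=3, batch_first=True)
        self.out = nn.Linear(512, vocab_size)

    def forward(self, feats):                 # (batch, frames, feat_dim)
        hidden, _ = self.net(feats)
        return self.out(hidden)               # (batch, frames, vocab_size)
```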
The model training method provided by the embodiment of the invention connects the front-end enhancement model and the voice recognition model in series for combined training, introduces the enhancement error function of the voice enhancement model to train the voice recognition model on the basis of the recognition error function of the original voice recognition, does not need to add a large amount of training corpora, reduces the training time of the voice recognition model, and improves the training efficiency of the voice recognition model.
In addition, the speech training samples of the embodiment of the invention comprise noisy speech samples and clean speech samples corresponding to the noisy speech samples, and the speech processed by the front-end enhancement model is used as the training corpus of the speech recognition model. This does not reduce the recognition performance of the speech recognition model in clean scenes, while it improves the recognition performance in complex scenes such as noisy ones; the lower the signal-to-noise ratio, the greater the improvement in recognition performance.
In the embodiment of the invention, the speech enhancement model and the speech recognition model each generate a loss value in each round of training. The embodiment of the invention determines the joint loss value of the two models in each round of training and adjusts model parameters of the speech enhancement model and/or the speech recognition model according to the joint loss value. Because the joint loss value is determined by the training results of both models, and because the serially connected models produce correlated training results, adjusting model parameters through the joint loss value balances the performance of the two models, so that the adjusted speech enhancement model and speech recognition model perform at their best together. For the speech recognition model, the embodiment of the invention can improve recognition performance in complex scenes such as noisy ones; for the speech enhancement model, the embodiment of the invention can reduce the information loss in its output speech, so that the trained enhancement model, while performing AI noise reduction, is better matched to speech recognition, achieving the purpose of enhancing speech recognition performance.
After multiple rounds of training, when the joint loss value meets the convergence condition, the speech enhancement model and the speech recognition model are considered fully trained. The convergence condition may be set as a preset number of training rounds or as an error limit on the training results. For example, the convergence condition is considered satisfied when the number of training rounds reaches the preset number, or when the error between training results in successive rounds falls below a preset threshold. The trained speech enhancement model and the trained speech recognition model can be used independently or in series.
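As a minimal illustration of this convergence test, the following Python sketch checks both criteria; the names and thresholds (`max_rounds`, `loss_delta_threshold`) are illustrative assumptions, not values fixed by the disclosure:

```python
def converged(round_idx, loss_history, max_rounds=100,
              loss_delta_threshold=1e-4):
    """Stop when the preset number of rounds is reached, or when the
    joint loss changes between consecutive rounds by less than a
    preset threshold (both values are illustrative)."""
    if round_idx >= max_rounds:
        return True
    if len(loss_history) >= 2 and \
            abs(loss_history[-1] - loss_history[-2]) < loss_delta_threshold:
        return True
    return False
```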
In an optional embodiment of the present invention, the iteratively training the speech enhancement model and the speech recognition model in series based on the speech training sample in step 102 includes:
step S11, in each round of training, selecting a noisy speech sample from the speech training samples, inputting the noisy speech sample into the speech enhancement model, and performing speech enhancement processing to obtain a speech enhancement result corresponding to the noisy speech sample;
step S12, extracting the features of the voice enhancement result to obtain target feature data corresponding to the voice enhancement result;
step S13, inputting the target characteristic data into the voice recognition model for voice recognition processing to obtain a voice recognition result of the voice sample with noise;
step S14, determining a joint loss value of the speech enhancement model and the speech recognition model according to the speech enhancement result of the noisy speech sample and the speech recognition result of the noisy speech sample, and adjusting the speech enhancement model and/or the model parameters of the speech recognition model according to the joint loss value.
Referring to fig. 3, a schematic diagram of a model training process according to an embodiment of the present invention is shown. In each round of training, the noisy speech sample in the speech training sample is first input into the speech enhancement model, which enhances the noisy speech to obtain a speech enhancement result corresponding to the noisy speech sample. Before the noisy speech sample is input into the speech enhancement model, it may be framed, that is, cut into a plurality of segments with each segment being one frame, and the speech enhancement model then performs speech enhancement processing frame by frame.
And after a voice enhancement result corresponding to the voice sample with the noise is obtained, further performing feature extraction on the voice enhancement result, taking the obtained target feature data as the input of a voice recognition model, and performing voice recognition processing on the target feature data by using the voice recognition model to obtain the voice recognition result of the voice sample with the noise. The extracted target feature data may be an Fbank feature in the speech enhancement result.
Finally, a joint loss value of the speech enhancement model and the speech recognition model is determined according to the speech enhancement result and the speech recognition result of the noisy speech sample, and model parameters of the speech enhancement model and/or the speech recognition model are adjusted according to the joint loss value. Here, the model parameters of the speech enhancement model and the speech recognition model include weight parameters in the neural networks. In a specific implementation, each weight parameter in the neural network may be initialized, the gradients of the nodes in the neural network may then be computed from the joint loss value, and the weight parameters of the corresponding nodes may be adjusted according to the gradients.
And after the parameters are adjusted, entering the next round of training until the joint loss value meets the convergence condition, and determining that the training of the voice enhancement model and the voice recognition model is finished.
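The PyTorch sketch below traces one such training round through steps S11 to S14; it is a minimal sketch under the assumption that `enhancer`, `recognizer`, `extract_features`, and `joint_loss` are the illustrative stand-ins defined in the other sketches of this description, not the patented implementation itself:

```python
import torch

def train_one_round(enhancer, recognizer, extract_features, joint_loss,
                    noisy_batch, clean_batch, text_batch, optimizer):
    """One round of joint training over the serially connected models."""
    optimizer.zero_grad()
    enhanced = enhancer(noisy_batch)            # S11: speech enhancement
    features = extract_features(enhanced)       # S12: feature extraction
    logits = recognizer(features)               # S13: speech recognition
    # S14: joint loss from both results, then adjust model parameters of
    # the enhancement model and/or the recognition model via backprop
    loss = joint_loss(enhanced, clean_batch, logits, text_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```

A single optimizer spanning both models, for example `torch.optim.Adam(list(enhancer.parameters()) + list(recognizer.parameters()))`, lets one backward pass through the joint loss update the parameters of either model or of both.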
In an optional embodiment of the present invention, the determining, in step S14, a joint loss value of the speech enhancement model and the speech recognition model according to the speech enhancement result of the noisy speech sample and the speech recognition result of the noisy speech sample includes:
substep S141, determining a first loss value of the speech enhancement model according to the speech enhancement result of the noisy speech sample and the clean speech sample;
substep S142, determining a second loss value of the speech recognition model according to the speech recognition result of the noisy speech sample and the text information;
and a substep S143 of performing weighted summation on the first loss value and the second loss value to obtain a joint loss value of the speech enhancement model and the speech recognition model.
In the embodiment of the present invention, a first loss value of the speech enhancement model may be determined according to the speech enhancement result output by the speech enhancement model and the clean speech sample in each round of training, and a second loss value of the speech recognition model may be determined according to the speech recognition result output by the speech recognition model and the text information corresponding to the noisy sample. Then, the first loss value and the second loss value are weighted and summed to obtain a joint loss value of the voice enhancement model and the voice recognition model.
The first loss value may be determined according to the spectral features of the speech enhancement result and the spectral features of the clean speech sample. For example, the first loss value may be obtained by calculating the mean square error (MSE) between corresponding points of the speech enhancement result and the clean speech sample. The second loss value may be determined based on the cross entropy between the speech recognition result and the text information corresponding to the noisy speech sample.
The weights of the first loss value and the second loss value may be determined according to the weights of the speech enhancement model and the speech recognition model, which in turn may be determined according to factors such as the training target, the training environment, and the application scenario. The higher the performance requirement on one of the models, the smaller the weight set for the loss value corresponding to that model. In the embodiment of the present invention, the main objective is to improve the recognition performance of the speech recognition model, and therefore a smaller weight may be set for the second loss value of the speech recognition model; for example, the weight of the first loss value may be set to 0.8 and the weight of the second loss value to 0.2.
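A minimal sketch of this weighted joint loss follows, assuming frame-aligned label indices for the cross-entropy term; the 0.8/0.2 weights are the example values above, not values fixed by the disclosure:

```python
import torch.nn.functional as F

def joint_loss(enhanced, clean, logits, targets, w_enh=0.8, w_asr=0.2):
    """Weighted sum of the enhancement loss (MSE between the enhancement
    result and the clean sample) and the recognition loss (cross entropy
    between recognizer output and the text labels)."""
    first_loss = F.mse_loss(enhanced, clean)
    second_loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),    # (batch*frames, vocab)
        targets.reshape(-1))                    # frame-aligned label ids
    return w_enh * first_loss + w_asr * second_loss
```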
In an optional embodiment of the present invention, the performing feature extraction on the speech enhancement result in step S12 includes:
substep S121, performing feature extraction on the voice enhancement result frame by frame to obtain feature information of each frame;
and a substep S122, adding the feature information of the previous frame and the next frame of the current frame to the feature information of the current frame in the voice enhancement result to obtain target feature data.
In the embodiment of the invention, when feature extraction is performed on the speech enhancement result, features are likewise extracted frame by frame to obtain the feature information of each frame. Specifically, the speech enhancement result may first be framed, each frame may undergo a series of preprocessing steps such as pre-emphasis and windowing, and feature extraction is then performed on each preprocessed frame. The extracted features may be Fbank features: for example, a fast Fourier transform is applied to each frame to convert the data from a time-domain signal to a frequency-domain signal; the Fourier-transformed frequency-domain signal is then filtered through a Mel filter bank, and the logarithm of the filtering result is taken to obtain the Fbank feature of each frame.
In order to realize better voice smoothing effect, the extracted feature information can be further subjected to frame splicing processing. Specifically, the feature information of the previous frame and the next frame of the current frame is added to the feature information of the current frame to obtain target feature data corresponding to the speech enhancement result. Thus, the target feature data reflecting the context information can be obtained, which is helpful for improving the accuracy of the extracted target feature data.
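The sketch below illustrates this pipeline with a log-Mel (Fbank-style) front end followed by frame splicing; the use of `torchaudio`, the parameter values, and the edge-frame padding are illustrative assumptions rather than the exact front end of the disclosure:

```python
import torch
import torchaudio

_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=80)

def extract_features(wave, context=1):
    """Frame-by-frame log-Mel (Fbank-style) features, followed by frame
    splicing: the features of `context` previous and next frames are
    appended to each current frame."""
    feats = torch.log(_mel(wave) + 1e-6)        # (batch, 1, n_mels, frames)
    feats = feats.squeeze(1).transpose(1, 2)    # (batch, frames, n_mels)
    first = feats[:, :1].repeat(1, context, 1)  # pad by repeating edge frames
    last = feats[:, -1:].repeat(1, context, 1)
    padded = torch.cat([first, feats, last], dim=1)
    n = feats.size(1)
    return torch.cat([padded[:, i:i + n]        # (batch, frames, 3*n_mels)
                      for i in range(2 * context + 1)], dim=2)
```

With `context=1`, each output frame concatenates the features of the previous frame, the current frame, and the next frame, so the feature dimension triples, which is why the illustrative recognizer above assumes an input dimension of 240 = 3 × 80.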
In an optional embodiment of the invention, the method further comprises:
step S21, acquiring target voice data to be processed;
and step S22, inputting the target voice data into the trained voice enhancement model for voice enhancement processing to obtain a voice enhancement result corresponding to the target voice data.
The trained speech enhancement model and the trained speech recognition model in the embodiment of the invention can be used independently. For example, speech enhancement processing is performed on the target speech data through the trained speech enhancement model to obtain a speech enhancement result corresponding to the target speech data; or speech recognition processing is performed on the target speech data through the trained speech recognition model to obtain a speech recognition result corresponding to the target speech data.
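A minimal sketch of the standalone enhancement use follows (the `enhancer` stand-in and tensor shape are assumptions carried over from the earlier sketches):

```python
import torch

@torch.no_grad()
def enhance(enhancer, target_speech):
    """Apply the trained enhancement model on its own to target speech
    data of shape (batch, 1, samples)."""
    enhancer.eval()
    return enhancer(target_speech)   # speech enhancement result
```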
In an optional embodiment of the invention, the method further comprises:
and inputting the voice enhancement result corresponding to the target voice data into the trained voice recognition model for voice recognition processing to obtain the voice recognition result of the target voice data.
The trained speech enhancement model and the trained speech recognition model in the embodiment of the invention can also be used in series. Specifically, the target voice data may be input into the trained voice enhancement model, and the target voice data is subjected to voice enhancement processing by using the voice enhancement model to obtain a voice enhancement result corresponding to the target voice data; and then, inputting the voice enhancement result into the trained voice recognition model, and performing voice recognition processing on the voice enhancement result by using the voice recognition model to obtain a voice recognition result of the target voice data.
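Correspondingly, a sketch of the serial use of both trained models, with greedy argmax decoding standing in (as an assumption) for whatever decoding the recognizer actually applies:

```python
import torch

@torch.no_grad()
def recognize(enhancer, recognizer, extract_features, target_speech):
    """Enhance the target speech, extract features, then recognize."""
    enhancer.eval()
    recognizer.eval()
    enhanced = enhancer(target_speech)          # speech enhancement result
    logits = recognizer(extract_features(enhanced))
    return logits.argmax(dim=-1)                # per-frame token indices
```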
In an optional embodiment of the invention, the method further comprises:
based on a preset voice recognition model, performing iterative joint training on the serially connected voice enhancement model and the preset voice recognition model by using voice training samples of different scenes to obtain the trained voice enhancement model and the trained voice recognition model under different scenes.
The trained speech enhancement model and the trained speech recognition model can perform speech processing on target speech data in a clean scene as well as in a complex noisy scene. In a specific application, the target speech data may also be preprocessed according to the application scenario, for example by adjusting its signal-to-noise ratio; different speech enhancement models may be selected for different application scenarios, and the model structure of the speech enhancement model may be adjusted according to the application scenario. The serially connected speech enhancement model and the preset speech recognition model are then iteratively and jointly trained with speech training samples of different scenarios, yielding trained speech enhancement models and trained speech recognition models for the different scenarios, as the sketch after this paragraph illustrates.
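This minimal sketch assumes each scenario gets its own enhancement model while every scenario starts from the same preset recognition model; the per-scenario deep copy, the data-loader interface, and the hyper-parameters are illustrative assumptions:

```python
import copy
import torch

def train_scene_models(make_enhancer, preset_recognizer, scene_loaders,
                       extract_features, joint_loss, lr=1e-4):
    """Jointly train a scenario-specific enhancer together with a copy
    of the preset recognizer on each scenario's training samples."""
    trained = {}
    for scene, loader in scene_loaders.items():
        enhancer = make_enhancer(scene)                 # scenario-specific
        recognizer = copy.deepcopy(preset_recognizer)   # start from preset
        optimizer = torch.optim.Adam(
            list(enhancer.parameters()) + list(recognizer.parameters()),
            lr=lr)
        for noisy, clean, text in loader:
            optimizer.zero_grad()
            enhanced = enhancer(noisy)
            logits = recognizer(extract_features(enhanced))
            loss = joint_loss(enhanced, clean, logits, text)
            loss.backward()
            optimizer.step()
        trained[scene] = (enhancer, recognizer)
    return trained
```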
In contrast with the traditional training method, in which a speech recognition model needs to be trained separately on training data matched to each application scenario, the training period is long, and the recognition performance of the trained speech recognition model cannot be guaranteed across different scenarios at the same time, the embodiment of the invention starts from a preset speech recognition model and jointly trains it with a scenario-specific speech enhancement model, which shortens the training period and yields models suited to each scenario.
In addition, the trained voice recognition model in the embodiment of the invention can also recognize target voice data containing a plurality of user voices in a noisy scene. Specifically, the voice feature information of the target user can be determined, then the voice data of the target user is extracted from the target voice data according to the voice feature information of the target user, and the voice data of the target user is input into the trained voice recognition model, so that the text information corresponding to the voice data of the target user can be obtained, and the voice data of the target user can be recognized in a targeted manner under a noisy scene.
To sum up, in the embodiment of the present invention, a speech enhancement model is connected in series before a speech recognition model and the two are iteratively and jointly trained; the enhancement error function of the speech enhancement model is introduced, on top of the original recognition error function of speech recognition, to train the speech recognition model, so a large amount of additional training corpora is not required, the training time of the speech recognition model is reduced, and the training efficiency of the speech recognition model is improved. In addition, the speech training samples of the embodiment of the invention comprise noisy speech samples and clean speech samples corresponding to the noisy speech samples, and the speech processed by the front-end enhancement model is used as the training corpus of the speech recognition model. This does not reduce the recognition performance of the speech recognition model in clean scenes, while it improves the recognition performance in complex scenes such as noisy ones; the lower the signal-to-noise ratio, the greater the improvement in recognition performance.
Moreover, for the speech enhancement model, the recognition error function of the recognition model is introduced, on top of the original enhancement error function, to train the front-end enhancement model, which reduces the information loss in the output speech of the speech enhancement model; the trained enhancement model, while performing AI noise reduction, is thus better matched to speech recognition, achieving the purpose of enhancing speech recognition performance.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Device embodiment
Referring to fig. 4, a block diagram of a model training apparatus according to an embodiment of the present invention is shown, where the apparatus may include:
a training sample obtaining module 201, configured to obtain a voice training sample, where the voice training sample includes a noisy voice sample and a clean voice sample corresponding to the noisy voice sample;
and the model training module 202 is configured to perform iterative joint training on the serially connected speech enhancement model and speech recognition model based on the speech training sample, adjust model parameters of the speech enhancement model and/or the speech recognition model in each round of training according to the joint loss value of the speech enhancement model and the speech recognition model, and obtain the trained speech enhancement model and the trained speech recognition model when the joint loss value satisfies the convergence condition.
Optionally, the model training module includes:
the voice enhancement submodule is used for selecting a voice sample with noise from the voice training samples in each round of training, inputting the voice sample with noise into the voice enhancement model for voice enhancement processing, and obtaining a voice enhancement result corresponding to the voice sample with noise;
the feature extraction submodule is used for extracting features of the voice enhancement result to obtain target feature data corresponding to the voice enhancement result;
the voice recognition submodule is used for inputting the target characteristic data into the voice recognition model to perform voice recognition processing so as to obtain a voice recognition result of the voice sample with the noise;
and the loss value determining submodule is used for determining a joint loss value of the speech enhancement model and the speech recognition model according to the speech enhancement result of the noisy speech sample and the speech recognition result of the noisy speech sample, and adjusting model parameters of the speech enhancement model and/or the speech recognition model according to the joint loss value.
Optionally, the speech training sample further includes text information corresponding to the noisy speech sample, and the loss value determination sub-module includes:
a first loss value determining unit, configured to determine a first loss value of the speech enhancement model according to the speech enhancement result of the noisy speech sample and the clean speech sample;
a second loss value determining unit, configured to determine a second loss value of the speech recognition model according to the speech recognition result of the noisy speech sample and the text information;
and the joint loss value determining unit is used for weighting and summing the first loss value and the second loss value to obtain a joint loss value of the voice enhancement model and the voice recognition model.
Optionally, the feature extraction sub-module includes:
the feature extraction unit is used for extracting features of the voice enhancement result frame by frame to obtain feature information of each frame;
and the frame splicing processing unit is used for adding the characteristic information of the previous frame and the next frame of the current frame to the characteristic information of the current frame in the voice enhancement result to obtain target characteristic data.
Optionally, the apparatus further comprises:
the target voice data acquisition module is used for acquiring target voice data to be processed;
and the voice enhancement module is used for inputting the target voice data into the trained voice enhancement model for voice enhancement processing to obtain a voice enhancement result corresponding to the target voice data.
Optionally, the apparatus further comprises:
and the voice recognition module is used for inputting the voice enhancement result corresponding to the target voice data into the trained voice recognition model for voice recognition processing to obtain the voice recognition result of the target voice data.
Optionally, the apparatus further comprises:
and the enhancement module training module is used for carrying out iterative joint training on the serially connected voice enhancement model and the preset voice recognition model by utilizing voice training samples of different scenes based on the preset voice recognition model to obtain the trained voice enhancement model and the trained voice recognition model under different scenes.
To sum up, in the embodiment of the present invention, a speech enhancement model is connected in series before a speech recognition model and the two are iteratively and jointly trained; the enhancement error function of the speech enhancement model is introduced, on top of the original recognition error function of speech recognition, to train the speech recognition model, so a large amount of additional training corpora is not required, the training time of the speech recognition model is reduced, and the training efficiency of the speech recognition model is improved. In addition, the speech training samples of the embodiment of the invention comprise noisy speech samples and clean speech samples corresponding to the noisy speech samples, and the speech processed by the front-end enhancement model is used as the training corpus of the speech recognition model. This does not reduce the recognition performance of the speech recognition model in clean scenes, while it improves the recognition performance in complex scenes such as noisy ones; the lower the signal-to-noise ratio, the greater the improvement in recognition performance.
Moreover, for the speech enhancement model, the recognition error function of the recognition model is introduced, on top of the original enhancement error function, to train the front-end enhancement model, which reduces the information loss in the output speech of the speech enhancement model; the trained enhancement model, while performing AI noise reduction, is thus better matched to speech recognition, achieving the purpose of enhancing speech recognition performance.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
An embodiment of the present invention provides an apparatus for model training, the apparatus comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for: acquiring a speech training sample, wherein the speech training sample comprises a noisy speech sample and a clean speech sample corresponding to the noisy speech sample; and carrying out iterative joint training on the serially connected speech enhancement model and speech recognition model based on the speech training sample, adjusting model parameters of the speech enhancement model and/or the speech recognition model in each round of training according to the joint loss value of the speech enhancement model and the speech recognition model, and obtaining the trained speech enhancement model and the trained speech recognition model when the joint loss value meets the convergence condition.
Optionally, the iteratively training the speech enhancement model and the speech recognition model in series based on the speech training sample includes:
in each round of training, selecting a voice sample with noise from the voice training samples, inputting the voice sample with noise into the voice enhancement model, and performing voice enhancement processing to obtain a voice enhancement result corresponding to the voice sample with noise;
performing feature extraction on the voice enhancement result to obtain target feature data corresponding to the voice enhancement result;
inputting the target characteristic data into the voice recognition model to perform voice recognition processing to obtain a voice recognition result of the voice sample with noise;
and determining a joint loss value of the speech enhancement model and the speech recognition model according to the speech enhancement result of the noisy speech sample and the speech recognition result of the noisy speech sample, and adjusting model parameters of the speech enhancement model and/or the speech recognition model according to the joint loss value.
Optionally, the speech training sample further includes text information corresponding to the noisy speech sample, and the determining a joint loss value of the speech enhancement model and the speech recognition model according to the speech enhancement result of the noisy speech sample and the speech recognition result of the noisy speech sample includes:
determining a first loss value of the speech enhancement model according to the speech enhancement result of the noisy speech sample and the clean speech sample;
determining a second loss value of the voice recognition model according to the voice recognition result of the voice sample with the noise and the text information;
and carrying out weighted summation on the first loss value and the second loss value to obtain a joint loss value of the voice enhancement model and the voice recognition model.
Optionally, the performing feature extraction on the speech enhancement result includes:
carrying out feature extraction on the voice enhancement result frame by frame to obtain feature information of each frame;
and for the current frame in the voice enhancement result, adding the feature information of the previous frame and the next frame of the current frame into the feature information of the current frame to obtain target feature data.
Optionally, the one or more programs are further configured to be executed by the one or more processors and include instructions for:
acquiring target voice data to be processed;
and inputting the target voice data into the trained voice enhancement model for voice enhancement processing to obtain a voice enhancement result corresponding to the target voice data.
Optionally, the one or more programs are further configured to be executed by the one or more processors and include instructions for:
and inputting the voice enhancement result corresponding to the target voice data into the trained voice recognition model for voice recognition processing to obtain the voice recognition result of the target voice data.
Optionally, the one or more programs are further configured to be executed by the one or more processors, the one or more programs including instructions for:
performing, based on a preset speech recognition model, iterative joint training on the serially connected speech enhancement model and the preset speech recognition model by using speech training samples from different scenarios, to obtain a trained speech enhancement model and a trained speech recognition model for each scenario.
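One possible (assumed) realization: each scenario starts from a copy of the same preset recognition model and jointly trains a fresh enhancement model on that scenario's samples. The loader and factory names are hypothetical, and train_round stands for a joint-training step like the one sketched earlier, here taken to accept the models explicitly.

```python
import copy

def train_per_scenario(preset_recognizer, make_enhancer, scenario_loaders, train_round):
    # scenario_loaders: e.g. {"in_car": loader, "meeting": loader} -- hypothetical names.
    models = {}
    for scenario, loader in scenario_loaders.items():
        recognizer = copy.deepcopy(preset_recognizer)  # start from the preset recognition model
        enhancer = make_enhancer()                     # fresh enhancement model per scenario
        for noisy, clean, labels in loader:
            train_round(enhancer, recognizer, noisy, clean, labels)
        models[scenario] = (enhancer, recognizer)      # scenario-specific trained pair
    return models
```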
FIG. 5 is a block diagram illustrating an apparatus 800 for model training in accordance with an exemplary embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 5, the apparatus 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power components 806 provide power to the various components of device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front-facing camera and/or a rear-facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each of the front-facing camera and the rear-facing camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operating mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, the audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed state of the device 800 and the relative positioning of components, such as the display and keypad of the apparatus 800; the sensor assembly 814 may also detect a change in position of the apparatus 800 or a component of the apparatus 800, the presence or absence of user contact with the apparatus 800, the orientation or acceleration/deceleration of the apparatus 800, and a change in temperature of the apparatus 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the apparatus 800 and other devices. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 6 is a schematic diagram of a server in some embodiments of the invention. The server 1900 may vary considerably in configuration or performance, and may include one or more Central Processing Units (CPUs) 1922 (e.g., one or more processors), memory 1932, and one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. The memory 1932 and the storage medium 1930 may be transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the central processing unit 1922 may be configured to communicate with the storage medium 1930 and execute, on the server 1900, the series of instruction operations stored in the storage medium 1930.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
There is also provided a non-transitory computer-readable storage medium in which instructions, when executed by a processor of an apparatus (a server or a terminal), enable the apparatus to perform the model training method shown in fig. 1.
There is also provided a non-transitory computer-readable storage medium in which instructions, when executed by a processor of an apparatus (a server or a terminal), enable the apparatus to perform a model training method, the method comprising: acquiring speech training samples, wherein the speech training samples comprise a noisy speech sample and a clean speech sample corresponding to the noisy speech sample; and performing iterative joint training on a serially connected speech enhancement model and speech recognition model based on the speech training samples, adjusting model parameters of the speech enhancement model and/or the speech recognition model according to a joint loss value of the speech enhancement model and the speech recognition model in each round of training, and obtaining the trained speech enhancement model and the trained speech recognition model when the joint loss value satisfies a convergence condition.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
The model training method, the model training apparatus, and the device for model training provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the invention, and the description of the above embodiments is intended only to aid understanding of the method and its core ideas. Meanwhile, a person skilled in the art may, following the ideas of the present invention, make changes to the specific implementations and the application scope. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (15)

1. A method of model training, the method comprising:
acquiring speech training samples, wherein the speech training samples comprise a noisy speech sample and a clean speech sample corresponding to the noisy speech sample;
and performing iterative joint training on a serially connected speech enhancement model and speech recognition model based on the speech training samples, adjusting model parameters of the speech enhancement model and/or the speech recognition model according to a joint loss value of the speech enhancement model and the speech recognition model in each round of training, and obtaining the trained speech enhancement model and the trained speech recognition model when the joint loss value satisfies a convergence condition.
2. The method of claim 1, wherein the performing iterative joint training on the serially connected speech enhancement model and speech recognition model based on the speech training samples comprises:
in each round of training, selecting a noisy speech sample from the speech training samples, and inputting the noisy speech sample into the speech enhancement model for speech enhancement processing to obtain a speech enhancement result corresponding to the noisy speech sample;
performing feature extraction on the speech enhancement result to obtain target feature data corresponding to the speech enhancement result;
inputting the target feature data into the speech recognition model for speech recognition processing to obtain a speech recognition result of the noisy speech sample;
and determining a joint loss value of the speech enhancement model and the speech recognition model according to the speech enhancement result of the noisy speech sample and the speech recognition result of the noisy speech sample, and adjusting model parameters of the speech enhancement model and/or the speech recognition model according to the joint loss value.
3. The method of claim 2, wherein the speech training samples further include text information corresponding to the noisy speech sample, and wherein the determining a joint loss value of the speech enhancement model and the speech recognition model according to the speech enhancement result of the noisy speech sample and the speech recognition result of the noisy speech sample comprises:
determining a first loss value of the speech enhancement model according to the speech enhancement result of the noisy speech sample and the clean speech sample;
determining a second loss value of the speech recognition model according to the speech recognition result of the noisy speech sample and the text information;
and performing weighted summation on the first loss value and the second loss value to obtain the joint loss value of the speech enhancement model and the speech recognition model.
4. The method of claim 2, wherein the performing feature extraction on the speech enhancement result comprises:
performing feature extraction on the speech enhancement result frame by frame to obtain feature information of each frame;
and for the current frame in the speech enhancement result, appending the feature information of the frames immediately preceding and following the current frame to the feature information of the current frame to obtain the target feature data.
5. The method of claim 1, further comprising:
acquiring target speech data to be processed;
and inputting the target speech data into the trained speech enhancement model for speech enhancement processing to obtain a speech enhancement result corresponding to the target speech data.
6. The method of claim 5, further comprising:
and inputting the speech enhancement result corresponding to the target speech data into the trained speech recognition model for speech recognition processing to obtain a speech recognition result of the target speech data.
7. The method of claim 1, further comprising:
performing, based on a preset speech recognition model, iterative joint training on the serially connected speech enhancement model and the preset speech recognition model by using speech training samples from different scenarios, to obtain a trained speech enhancement model and a trained speech recognition model for each scenario.
8. A model training apparatus, the apparatus comprising:
a training sample acquisition module, configured to acquire speech training samples, wherein the speech training samples comprise a noisy speech sample and a clean speech sample corresponding to the noisy speech sample;
and a model training module, configured to perform iterative joint training on a serially connected speech enhancement model and speech recognition model based on the speech training samples, adjust model parameters of the speech enhancement model and/or the speech recognition model according to a joint loss value of the speech enhancement model and the speech recognition model in each round of training, and obtain the trained speech enhancement model and the trained speech recognition model when the joint loss value satisfies a convergence condition.
9. The apparatus of claim 8, wherein the model training module comprises:
a speech enhancement submodule, configured to, in each round of training, select a noisy speech sample from the speech training samples and input the noisy speech sample into the speech enhancement model for speech enhancement processing to obtain a speech enhancement result corresponding to the noisy speech sample;
a feature extraction submodule, configured to perform feature extraction on the speech enhancement result to obtain target feature data corresponding to the speech enhancement result;
a speech recognition submodule, configured to input the target feature data into the speech recognition model for speech recognition processing to obtain a speech recognition result of the noisy speech sample;
and a loss value determining submodule, configured to determine a joint loss value of the speech enhancement model and the speech recognition model according to the speech enhancement result of the noisy speech sample and the speech recognition result of the noisy speech sample, and adjust model parameters of the speech enhancement model and/or the speech recognition model according to the joint loss value.
10. The apparatus of claim 9, wherein the speech training samples further include text information corresponding to the noisy speech sample, and wherein the loss value determining submodule comprises:
a first loss value determining unit, configured to determine a first loss value of the speech enhancement model according to the speech enhancement result of the noisy speech sample and the clean speech sample;
a second loss value determining unit, configured to determine a second loss value of the speech recognition model according to the speech recognition result of the noisy speech sample and the text information;
and a joint loss value determining unit, configured to perform weighted summation on the first loss value and the second loss value to obtain the joint loss value of the speech enhancement model and the speech recognition model.
11. The apparatus of claim 9, wherein the feature extraction submodule comprises:
a feature extraction unit, configured to perform feature extraction on the speech enhancement result frame by frame to obtain feature information of each frame;
and a frame splicing unit, configured to, for the current frame in the speech enhancement result, append the feature information of the frames immediately preceding and following the current frame to the feature information of the current frame to obtain the target feature data.
12. The apparatus of claim 8, further comprising:
a target speech data acquisition module, configured to acquire target speech data to be processed;
and a speech enhancement module, configured to input the target speech data into the trained speech enhancement model for speech enhancement processing to obtain a speech enhancement result corresponding to the target speech data.
13. The apparatus of claim 12, further comprising:
and a speech recognition module, configured to input the speech enhancement result corresponding to the target speech data into the trained speech recognition model for speech recognition processing to obtain a speech recognition result of the target speech data.
14. An apparatus for model training, the apparatus comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for performing the model training method of any one of claims 1 to 7.
15. A machine-readable medium having stored thereon instructions, which when executed by one or more processors, cause an apparatus to perform the model training method of any of claims 1 to 7.
CN202110942719.7A 2021-08-17 2021-08-17 Model training method and device for model training Active CN113707134B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110942719.7A CN113707134B (en) 2021-08-17 2021-08-17 Model training method and device for model training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110942719.7A CN113707134B (en) 2021-08-17 2021-08-17 Model training method and device for model training

Publications (2)

Publication Number Publication Date
CN113707134A 2021-11-26
CN113707134B CN113707134B (en) 2024-05-17

Family

ID=78653014

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110942719.7A Active CN113707134B (en) 2021-08-17 2021-08-17 Model training method and device for model training

Country Status (1)

Country Link
CN (1) CN113707134B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2126380A1 (en) * 1993-07-22 1995-01-23 Wu Chou Minimum Error Rate Training of Combined String Models
WO2019100998A1 (en) * 2017-11-24 2019-05-31 腾讯科技(深圳)有限公司 Voice signal processing model training method, electronic device, and storage medium
CN110580910A (en) * 2018-06-08 2019-12-17 北京搜狗科技发展有限公司 Audio processing method, device and equipment and readable storage medium
CN110970015A (en) * 2018-09-30 2020-04-07 北京搜狗科技发展有限公司 Voice processing method and device and electronic equipment
CN110600017A (en) * 2019-09-12 2019-12-20 腾讯科技(深圳)有限公司 Training method of voice processing model, voice recognition method, system and device
CN111261145A (en) * 2020-01-15 2020-06-09 腾讯科技(深圳)有限公司 Voice processing device, equipment and training method thereof
CN111243576A (en) * 2020-01-16 2020-06-05 腾讯科技(深圳)有限公司 Speech recognition and model training method, device, equipment and storage medium
CN111261146A (en) * 2020-01-16 2020-06-09 腾讯科技(深圳)有限公司 Speech recognition and model training method, device and computer readable storage medium
WO2021143327A1 (en) * 2020-01-16 2021-07-22 腾讯科技(深圳)有限公司 Voice recognition method, device, and computer-readable storage medium
CN113178192A (en) * 2021-04-30 2021-07-27 平安科技(深圳)有限公司 Training method, device and equipment of speech recognition model and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114512136A (en) * 2022-03-18 2022-05-17 北京百度网讯科技有限公司 Model training method, audio processing method, device, apparatus, storage medium, and program
CN114512136B (en) * 2022-03-18 2023-09-26 北京百度网讯科技有限公司 Model training method, audio processing method, device, equipment, storage medium and program
CN115240648A (en) * 2022-07-18 2022-10-25 四川大学 Controller voice enhancement method and device facing voice recognition
CN116631427A (en) * 2023-07-24 2023-08-22 美智纵横科技有限责任公司 Training method of noise reduction model, noise reduction processing method, device and chip
CN116631427B (en) * 2023-07-24 2023-09-29 美智纵横科技有限责任公司 Training method of noise reduction model, noise reduction processing method, device and chip

Also Published As

Publication number Publication date
CN113707134B (en) 2024-05-17

Similar Documents

Publication Publication Date Title
US20210326587A1 (en) Human face and hand association detecting method and a device, and storage medium
CN110210535B (en) Neural network training method and device and image processing method and device
CN109800737B (en) Face recognition method and device, electronic equipment and storage medium
CN110390394B (en) Batch normalization data processing method and device, electronic equipment and storage medium
CN108256555B (en) Image content identification method and device and terminal
CN113707134B (en) Model training method and device for model training
CN109871896B (en) Data classification method and device, electronic equipment and storage medium
CN113362812B (en) Voice recognition method and device and electronic equipment
CN111160448B (en) Training method and device for image classification model
JP2022522551A (en) Image processing methods and devices, electronic devices and storage media
CN110931028B (en) Voice processing method and device and electronic equipment
CN111242303A (en) Network training method and device, and image processing method and device
CN111104807B (en) Data processing method and device and electronic equipment
US20210089726A1 (en) Data processing method, device and apparatus for data processing
CN109003272B (en) Image processing method, device and system
CN110970015B (en) Voice processing method and device and electronic equipment
CN112445906A (en) Method and device for generating reply message
US20230386470A1 (en) Speech instruction recognition method, electronic device, and non-transient computer readable storage medium
CN112201267A (en) Audio processing method and device, electronic equipment and storage medium
CN110580910B (en) Audio processing method, device, equipment and readable storage medium
CN116403599A (en) Efficient voice separation method and model building method thereof
CN113077808B (en) Voice processing method and device for voice processing
CN112308588A (en) Advertisement putting method and device and storage medium
CN114154395A (en) Model processing method and device for model processing
CN113345461A (en) Voice processing method and device for voice processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant