CN112364993B - Model joint training method and device, computer equipment and storage medium - Google Patents

Model joint training method and device, computer equipment and storage medium

Info

Publication number
CN112364993B
Authority
CN
China
Prior art keywords
network
feature matrix
model
training
dimensional feature
Prior art date
Legal status
Active
Application number
CN202110044163.XA
Other languages
Chinese (zh)
Other versions
CN112364993A
Inventor
徐泓洋
王广新
杨汉丹
Current Assignee
Shenzhen Youjie Zhixin Technology Co., Ltd.
Original Assignee
Shenzhen Youjie Zhixin Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Shenzhen Youjie Zhixin Technology Co., Ltd.
Priority to CN202110044163.XA
Publication of CN112364993A
Application granted
Publication of CN112364993B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks


Abstract

The application provides a model joint training method, apparatus, computer equipment and storage medium, comprising the following steps: constructing a first acoustic feature matrix of audio training data; inputting the first acoustic feature matrix into an encoding network to obtain a first high-dimensional feature matrix; inputting the first high-dimensional feature matrix into a decoding network to obtain a second acoustic feature matrix; inputting the second acoustic feature matrix into the encoding network to obtain a second high-dimensional feature matrix; and inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network respectively, and adjusting network parameters of the encoding network, the decoding network and the classification network based on a back-propagation algorithm to obtain a trained wake-up model and a trained noise reduction model. In this application, the decoding network outputs the second acoustic feature matrix, which increases the amount of training data; the wake-up model and the noise reduction model are trained jointly, the results are better than when each model is trained alone, training is fast, and training cost is low.

Description

Model joint training method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a model joint training method and apparatus, a computer device, and a storage medium.
Background
Currently, wake-up models and noise reduction models are generally trained on a collected clean speech data set and a collected noise data set. During training, data enhancement techniques that simulate real scenes are applied to increase the diversity of the training data and improve the noise robustness of the model in real scenes.
Obtaining a good noise reduction model requires substantially more training data than training a wake-up model does. When the training data is limited, with only noisy speech or only a small amount of clean speech available, a good noise reduction model cannot be obtained; the effect achievable by directly training a wake-up model is also relatively limited, and further improving the wake-up model is difficult.
Disclosure of Invention
The application mainly aims to provide a model joint training method and apparatus, a computer device, and a storage medium, so as to overcome the defect that a model trained on little data performs poorly.
In order to achieve the above object, the present application provides a model joint training method, including the following steps:
constructing a first acoustic feature matrix of the audio training data;
inputting the first acoustic feature matrix into an encoding network to obtain a first high-dimensional feature matrix;
inputting the first high-dimensional feature matrix into a decoding network to obtain a second acoustic feature matrix;
inputting the second acoustic feature matrix into the encoding network to obtain a second high-dimensional feature matrix;
inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network respectively, and adjusting network parameters of the encoding network, the decoding network and the classification network based on a back-propagation algorithm to obtain a trained wake-up model and a trained noise reduction model; the encoding network and the classification network form the wake-up model, and the encoding network and the decoding network form the noise reduction model.
Further, the classification network comprises a fully connected layer and a softmax function, and the loss function used is a cross-entropy loss function.
Further, the step of inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network, and adjusting network parameters of the encoding network, the decoding network and the classification network based on a back-propagation algorithm to obtain a trained wake-up model and a trained noise reduction model comprises:
inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into the fully connected layer for computation, and then calculating a cross-entropy loss value with the loss function;
adjusting the network parameters of the encoding network, the decoding network and the classification network by a gradient-descent back-propagation algorithm so as to minimize the cross-entropy loss value;
and, after iterative training, when the cross-entropy loss value no longer decreases, the model has converged, yielding the trained wake-up model and noise reduction model.
Further, the audio training data comprises positive sample audio and negative sample audio;
before the step of constructing the first acoustic feature matrix of the audio training data, the method comprises:
acquiring noise audio as the negative sample audio;
acquiring clean wake-up speech, namely noise-free speech that contains the wake-up word;
and mixing the clean wake-up speech with the noise audio at a preset signal-to-noise ratio to obtain noisy wake-up speech as the positive sample audio.
The application also provides a model joint training apparatus, comprising:
a construction unit, configured to construct a first acoustic feature matrix of audio training data;
a first encoding unit, configured to input the first acoustic feature matrix into an encoding network to obtain a first high-dimensional feature matrix;
a decoding unit, configured to input the first high-dimensional feature matrix into a decoding network to obtain a second acoustic feature matrix;
a second encoding unit, configured to input the second acoustic feature matrix into the encoding network to obtain a second high-dimensional feature matrix;
and a training unit, configured to input the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network respectively, and to adjust network parameters of the encoding network, the decoding network and the classification network based on a back-propagation algorithm to obtain a trained wake-up model and a trained noise reduction model; the encoding network and the classification network form the wake-up model, and the encoding network and the decoding network form the noise reduction model.
Further, the classification network comprises a fully connected layer and a softmax function, and the loss function used is a cross-entropy loss function.
Further, the training unit is specifically configured to:
input the first high-dimensional feature matrix and the second high-dimensional feature matrix into the fully connected layer for computation, and then calculate a cross-entropy loss value with the loss function;
adjust the network parameters of the encoding network, the decoding network and the classification network by a gradient-descent back-propagation algorithm so as to minimize the cross-entropy loss value;
and, after iterative training, when the cross-entropy loss value no longer decreases, conclude that the model has converged, yielding the trained wake-up model and noise reduction model.
Further, the audio training data comprises positive sample audio and negative sample audio, and the apparatus further comprises:
a first acquisition unit, configured to acquire noise audio as the negative sample audio;
a second acquisition unit, configured to acquire clean wake-up speech, namely noise-free speech that contains the wake-up word;
and a mixing unit, configured to mix the clean wake-up speech with the noise audio at a preset signal-to-noise ratio to obtain noisy wake-up speech as the positive sample audio.
The present application further provides a computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of any one of the above methods when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of any of the above.
The model joint training method, apparatus, computer device and storage medium provided by the application comprise: constructing a first acoustic feature matrix of audio training data; inputting the first acoustic feature matrix into an encoding network to obtain a first high-dimensional feature matrix; inputting the first high-dimensional feature matrix into a decoding network to obtain a second acoustic feature matrix; inputting the second acoustic feature matrix into the encoding network to obtain a second high-dimensional feature matrix; and inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network respectively, and adjusting network parameters of the encoding network, the decoding network and the classification network based on a back-propagation algorithm to obtain a trained wake-up model and a trained noise reduction model. In the application, the decoding network outputs the second acoustic feature matrix, which increases the amount of training data, and the wake-up model and the noise reduction model are trained jointly; because the two models share one encoding network, both gain the ability to extract target information from noisy audio more accurately, so the results are better than when each model is trained alone, training is fast, and training cost is low.
Drawings
FIG. 1 is a schematic diagram of the steps of a model joint training method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of the principle of a model joint training method according to an embodiment of the present application;
FIG. 3 is a schematic block diagram of the structure of a model joint training apparatus according to an embodiment of the present application;
FIG. 4 is a schematic block diagram of the structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the present application will be further described with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Referring to fig. 1, an embodiment of the present application provides a model joint training method, including the following steps:
step S1, constructing a first acoustic feature matrix of the audio training data;
step S2, inputting the first acoustic feature matrix into an encoding network to obtain a first high-dimensional feature matrix;
step S3, inputting the first high-dimensional feature matrix into a decoding network to obtain a second acoustic feature matrix;
step S4, inputting the second acoustic feature matrix into the encoding network to obtain a second high-dimensional feature matrix;
step S5, inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network respectively, and adjusting network parameters of the encoding network, the decoding network and the classification network based on a back-propagation algorithm to obtain a trained wake-up model and a trained noise reduction model; the encoding network and the classification network form the wake-up model, and the encoding network and the decoding network form the noise reduction model.
In this embodiment, the model joint training method is applied to scenes with little training data and improves the effect of the trained models. While training yields the wake-up model, a noise reduction model with a certain noise reduction capability is obtained at the same time, which provides a feasible scheme for constructing a noise reduction model when training data is insufficient. Here, little training data means that clean wake-up speech is scarce or absent.
Specifically, as described in step S1 above, the audio training data is audio data, typically noisy wake-up speech data, labeled with corresponding labels for training the neural network model. Before being input into the neural network model, a first acoustic feature matrix of the audio training data must be constructed; a linear transformation network can generally be used to extract the feature matrix.
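The patent leaves the concrete acoustic features and the extraction step unspecified. As one common, purely illustrative choice, the sketch below builds the first acoustic feature matrix from log-mel filterbank features with torchaudio; the sampling rate, window, hop and mel-bin values are assumptions, not values from the patent.

```python
import torch
import torchaudio

# Hypothetical front end: 16 kHz audio, 25 ms windows, 10 ms hop, 40 mel bins.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=40)

def build_acoustic_feature_matrix(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (1, num_samples) mono audio -> (num_frames, 40) feature matrix."""
    feats = mel(waveform)                     # (1, n_mels, num_frames)
    feats = torch.log(feats + 1e-6)           # log compression for numerical stability
    return feats.squeeze(0).transpose(0, 1)   # (num_frames, n_mels)
```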
As described in step S2 above, the encoding network (kws_net) is a neural network for extracting a high-dimensional audio feature matrix: it takes the acoustic feature matrix of the audio as input and outputs a feature matrix in a high-dimensional space.
As described in step S3 above, the decoding network (decode_net) is a neural network for decoding the high-dimensional feature matrix back into an acoustic feature matrix; after passing through the decoding network, a new acoustic feature matrix, namely the second acoustic feature matrix, is generated. It can be understood that, in this embodiment, as shown in fig. 2, the encoding network serves as the common part of the noise reduction model and the wake-up model. When processing input data, the encoding network of the wake-up model mainly extracts information related to the speech content in the noisy sound, while the encoding network of the noise reduction model mainly separates out the target sound features, from which the target speech is then generated. What the two have in common is that both must extract the feature information of the target speech; the encoding network of the noise reduction model therefore retains the speech information when processing noisy audio, and the audio generated by the decoding network can still trigger wake-up when passed through the wake-up network.
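As a minimal sketch of this shared-encoder data flow, the PyTorch fragment below wires one encoder into both models. The GRU encoder, linear decoder, layer sizes, and the use of the last frame's state for classification are all assumptions; the patent only requires that the encoder and decoder be neural networks such as a DNN, CNN, or RNN.

```python
import torch
import torch.nn as nn

FEAT_DIM, HID_DIM, NUM_CLASSES = 40, 128, 2   # assumed dimensions

encoder = nn.GRU(FEAT_DIM, HID_DIM, batch_first=True)  # kws_net: shared encoding network
decoder = nn.Linear(HID_DIM, FEAT_DIM)                 # decode_net: high-dim features -> acoustic features
classifier = nn.Linear(HID_DIM, NUM_CLASSES)           # fully connected layer (softmax applied in the loss)

def joint_forward(x: torch.Tensor):
    """x: (batch, frames, FEAT_DIM), the first acoustic feature matrix."""
    h1, _ = encoder(x)               # first high-dimensional feature matrix
    x2 = decoder(h1)                 # second acoustic feature matrix (noise reduction output)
    h2, _ = encoder(x2)              # second high-dimensional feature matrix
    logits1 = classifier(h1[:, -1])  # wake-up classification of the original audio
    logits2 = classifier(h2[:, -1])  # wake-up classification of the regenerated audio
    return logits1, logits2
```

Here the wake-up model is the encoder plus the classifier, and the noise reduction model is the encoder plus the decoder, matching the composition stated above.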
As described in step S4 above, the audio generated by the decoding network can still trigger wake-up after passing through the wake-up network; therefore, the second acoustic feature matrix obtained by the decoding network can also serve as training data and is input into the encoding network to obtain the second high-dimensional feature matrix.
As described in step S5 above, the first high-dimensional feature matrix and the second high-dimensional feature matrix are respectively input into the classification network, and the network parameters (network weights) of the encoding network, the decoding network and the classification network are continuously adjusted based on a back-propagation algorithm to obtain the trained wake-up model and noise reduction model.
In this embodiment, the encoding network and the classification network form the wake-up model, and the encoding network and the decoding network form the noise reduction model; after the encoding network, the decoding network and the classification network are iteratively trained to convergence, the wake-up model and the noise reduction model are obtained. The decoding network outputs the second acoustic feature matrix, which increases the amount of training data, and the wake-up model and the noise reduction model are trained jointly; because the two models share one encoding network, both gain the ability to extract target information from noisy audio more accurately, so the results are better than when each model is trained alone, training is fast, and training cost is low.
In summary, the model joint training method of this embodiment is suitable for scenes with insufficient training data, that is, when there is not enough data to build a noise reduction model to help improve the wake-up model. The wake-up network and the noise reduction network share one encoding network, and the two networks are trained jointly, directly on the noisy wake-up speech training data and the noise data set. In addition, while the loss function of the wake-up model trains the wake-up model, the noise reduction model is trained along with it; the resulting noise reduction model has a certain noise reduction effect even without any clean speech, which provides a feasible scheme for constructing a noise reduction model when data is insufficient.
In an embodiment, the encoding network comprises any one or more neural networks such as a DNN, CNN, or RNN, any of which can encode the acoustic feature matrix; this is not limited here.
In an embodiment, the decoding network comprises any one or more neural networks such as a DNN, CNN, or RNN, any of which can decode the high-dimensional feature matrix; this is not limited here.
In an embodiment, the classification network comprises a full connectivity layer and a softmax function, and the loss function used is a cross-entropy loss function.
In this embodiment, the classification network of the wake-up model is an ordinary classification model whose target is a class label. Since the audio produced by the decoding network is input into the encoding network and the classification network again as a sample, its target is still a class label, so the joint training of the two networks has only one loss function, namely the cross-entropy loss function commonly used by classification models, with the formula:
Total_loss = ce_loss;
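For clarity, the cross-entropy loss named above takes the standard form used by classification models (the expanded formula is supplied here for reference and is not written out in the patent):
ce_loss = -(1/N) * Σ_{i=1..N} Σ_{c=1..C} y_{i,c} * log(p_{i,c});
where N is the number of samples, C is the number of classes, y_{i,c} is 1 if sample i belongs to class c and 0 otherwise, and p_{i,c} is the softmax probability the classification network assigns to class c for sample i.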
in an embodiment, the step of inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network, and adjusting network parameters of the coding network, the decoding network, and the classification network based on a back propagation algorithm to obtain a trained wake-up model and a trained noise reduction model includes:
after the first high-dimensional feature matrix and the second high-dimensional feature matrix are input into the full-connection layer for calculation, calculating a cross entropy loss value based on the loss function;
adjusting network parameters of the encoding network, decoding network, and classification network using a gradient descent back propagation algorithm to minimize the cross entropy loss value;
and after iterative training, when the cross entropy loss value does not decrease any more, the model converges to obtain the trained awakening model and the noise reduction model.
In the iterative training process of this embodiment, the classification result is predicted, and the cross entropy loss value between the predicted classification result and the real label is calculated by the loss function. And then, continuously adjusting network parameters of the coding network, the decoding network and the classification network, namely network weights, by adopting a gradient descending back propagation algorithm so as to minimize a cross entropy loss value calculated by a loss function, and converging the model when the cross entropy loss value does not descend any more, thereby obtaining the trained awakening model and the trained noise reduction model.
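Continuing the hypothetical sketch above, one training step of this procedure could look as follows. The Adam optimizer and learning rate are assumptions (the patent only specifies gradient-descent back propagation minimizing the cross-entropy loss), and applying the single loss to both classification outputs is one plausible reading of Total_loss = ce_loss; `encoder`, `decoder`, `classifier` and `joint_forward` are the hypothetical names introduced earlier.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()   # softmax + cross entropy in one call
params = (list(encoder.parameters()) + list(decoder.parameters())
          + list(classifier.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-3)  # assumed optimizer and learning rate

def train_step(x: torch.Tensor, labels: torch.Tensor) -> float:
    logits1, logits2 = joint_forward(x)
    # One cross-entropy loss for the joint training, applied to both the
    # first and the second high-dimensional feature paths.
    loss = criterion(logits1, labels) + criterion(logits2, labels)
    optimizer.zero_grad()
    loss.backward()    # back propagation through encoder, decoder and classifier
    optimizer.step()   # gradient-descent update of the shared weights
    return loss.item()
```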
In an embodiment, the audio training data comprises positive sample audio and negative sample audio;
before the step of constructing the first acoustic feature matrix of the audio training data, the method comprises:
acquiring noise audio as the negative sample audio;
and acquiring noisy wake-up speech as the positive sample audio.
In an embodiment, the step of acquiring noisy wake-up speech as the positive sample audio comprises:
acquiring clean wake-up speech, namely noise-free speech that contains the wake-up word;
and mixing the clean wake-up speech with the noise audio at a preset signal-to-noise ratio to obtain the noisy wake-up speech as the positive sample audio, as in the sketch below.
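The mixing at a preset signal-to-noise ratio can be sketched as follows; `mix_at_snr` is a hypothetical helper, and the 10 dB default is an example value rather than one taken from the patent.

```python
import torch

def mix_at_snr(clean: torch.Tensor, noise: torch.Tensor,
               snr_db: float = 10.0) -> torch.Tensor:
    """Mix clean wake-up speech with noise at a preset signal-to-noise ratio (dB)."""
    noise = noise[..., : clean.shape[-1]]            # trim noise to the speech length
    clean_power = clean.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp_min(1e-10)
    # Scale the noise so that 10 * log10(clean_power / scaled_noise_power) == snr_db.
    scale = torch.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise   # noisy wake-up speech (positive sample)
```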
Referring to fig. 3, an embodiment of the present application further provides a model joint training apparatus, including:
a construction unit 10, configured to construct a first acoustic feature matrix of the audio training data;
the first encoding unit 20 is configured to input the first acoustic feature matrix to an encoding network to obtain a first high-dimensional feature matrix;
a decoding unit 30, configured to input the first high-dimensional feature matrix into a decoding network to obtain a second acoustic feature matrix;
a second encoding unit 40, configured to input the second acoustic feature matrix to the encoding network to obtain a second high-dimensional feature matrix;
the training unit 50 is configured to input the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network, and adjust network parameters of the coding network, the decoding network, and the classification network based on a back propagation algorithm to obtain a trained wake-up model and a trained noise reduction model; the coding network and the classification network form a wake-up model, and the coding network and the decoding network form a noise reduction model.
In this embodiment, the model joint training device is applied to a scene with less training data, so as to improve the effect of the trained model. The method obtains the noise reduction model with a certain noise reduction effect while obtaining the awakening model through training, and provides a feasible scheme for constructing the noise reduction model under the condition of insufficient training data. The training data is less, which means that the pure wake-up voice is less or no.
Specifically, as described in the above building unit 10, the audio training data is audio data, typically noisy wake-up voice data, and the audio data is labeled with a corresponding label for training the neural network model. Before the training of the input value neural network model, a first acoustic feature matrix of the audio training data needs to be constructed, and a linear transformation network can be generally adopted for extracting the feature matrix.
As described in the first encoding unit 20, the encoding network (kws net) is a neural network for extracting a high-dimensional feature matrix of audio, and the encoding network inputs an acoustic feature matrix of audio and outputs a feature matrix of a high-dimensional space.
As described in the decoding unit 30, the decoding network (decode _ net) is a neural network for decoding a high-dimensional feature matrix into an acoustic feature matrix, and after passing through the decoding network, a new acoustic feature matrix, that is, the second acoustic feature matrix is generated. It can be understood that, in the present embodiment, as shown in fig. 2, the coding network described above is used as a common part of the noise reduction model and the wake-up network, wherein the coding network of the wake-up model mainly extracts information related to the speech content in the noisy sound when processing the input data, and the coding network of the noise reduction model mainly separates the target sound feature when processing the input data, and then generates the target speech according to the target sound feature. The same point of the two methods is that feature information of the target voice needs to be extracted, so that the coding network of the noise reduction model can keep the voice information when processing the noisy audio, and the audio generated by the decoding network can be awakened after the audio is awakened through the awakening network.
As described in the second encoding unit 40, the audio generated by the decoding network can be awakened after passing through the wake-up network; therefore, the second acoustic feature matrix obtained by decoding through the decoding network can also be used as training data, and the training data is input into the coding network to obtain a second high-dimensional feature matrix.
As stated in the training unit 50, the first high-dimensional feature matrix and the second high-dimensional feature matrix are respectively input into the classification network, and network parameters (network weights) of the coding network, the decoding network and the classification network are continuously adjusted based on a back propagation algorithm, so as to obtain a trained wake-up model and a trained noise reduction model.
In this embodiment, the coding network and the classification network form a wake-up model, and the coding network and the decoding network form a noise reduction model; after the coding network, the decoding network and the classification network are iteratively trained, the awakening model and the noise reduction model can be obtained after the model is converged. In this embodiment, the decoding network outputs the second acoustic feature matrix, which increases the data size of the training sample, and jointly trains the wake-up model and the noise reduction model; the two models share one coding network, so that the two models have the capability of more accurately extracting target information from noisy audio, the effect is better than that when the models are trained independently, the training speed is high, and the training cost is low.
To sum up, for the model joint training device in this application embodiment, be applicable to the not enough scene of training data volume, when there is not enough data to construct the noise reduction model promptly and assist the effect that promotes the awakening model, make the awakening network and the noise reduction network share a coding network, directly carry out the joint training to two networks on the awakening voice training data that make an uproar and the noise data set, through this kind of training mode, make the coding network possess the ability of accurately extracting the target information from the voice that makes an uproar, thereby make the model effect better than when training alone, and training speed is fast, training is with low costs. In addition, when the loss function of the awakening model is used for training the awakening model, the noise reduction model is trained in a sequential manner, the obtained noise reduction model has a certain noise reduction effect under the condition that pure voice does not exist, and a feasible scheme is provided for constructing the noise reduction model under the condition that data is insufficient.
In one embodiment, the encoding network comprises any one or more neural networks such as a DNN, CNN, or RNN.
In one embodiment, the decoding network comprises any one or more neural networks such as a DNN, CNN, or RNN.
In an embodiment, the classification network comprises a fully connected layer and a softmax function, and the loss function used is a cross-entropy loss function.
In an embodiment, the training unit 50 is specifically configured to:
input the first high-dimensional feature matrix and the second high-dimensional feature matrix into the fully connected layer for computation, and then calculate a cross-entropy loss value with the loss function;
adjust the network parameters of the encoding network, the decoding network and the classification network by a gradient-descent back-propagation algorithm so as to minimize the cross-entropy loss value;
and, after iterative training, when the cross-entropy loss value no longer decreases, conclude that the model has converged, yielding the trained wake-up model and noise reduction model.
In an embodiment, the audio training data comprises positive sample audio and negative sample audio;
the model joint training device further comprises:
a first acquisition unit, configured to acquire noise audio as the negative sample audio;
and a second acquisition unit, configured to acquire noisy wake-up speech as the positive sample audio.
In an embodiment, the second acquisition unit is specifically configured to:
acquire clean wake-up speech, namely noise-free speech that contains the wake-up word;
and mix the clean wake-up speech with the noise audio at a preset signal-to-noise ratio to obtain the noisy wake-up speech as the positive sample audio.
For the specific implementation of each unit in the model joint training apparatus of this embodiment, please refer to the methods described in the embodiments above; details are not repeated here.
Referring to fig. 4, an embodiment of the present application also provides a computer device, which may be a server and whose internal structure may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computation and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing models and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the model joint training method.
Those skilled in the art will appreciate that the architecture shown in fig. 4 is only a block diagram of some of the structures associated with the present solution and does not limit the computer devices to which the present solution can be applied.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the model joint training method. It is to be understood that the computer-readable storage medium in this embodiment may be a volatile or a non-volatile readable storage medium.
In summary, the model joint training method, apparatus, computer device and storage medium provided in the embodiments of the present application comprise: constructing a first acoustic feature matrix of audio training data; inputting the first acoustic feature matrix into an encoding network to obtain a first high-dimensional feature matrix; inputting the first high-dimensional feature matrix into a decoding network to obtain a second acoustic feature matrix; inputting the second acoustic feature matrix into the encoding network to obtain a second high-dimensional feature matrix; and inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network respectively, and adjusting network parameters of the encoding network, the decoding network and the classification network based on a back-propagation algorithm to obtain a trained wake-up model and a trained noise reduction model. In the application, the decoding network outputs the second acoustic feature matrix, which increases the amount of training data, and the wake-up model and the noise reduction model are trained jointly; because the two models share one encoding network, both gain the ability to extract target information from noisy audio more accurately, so the results are better than when each model is trained alone, training is fast, and training cost is low.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by instructing the relevant hardware through a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus dynamic RAM (RDRAM), and direct Rambus dynamic RAM (DRDRAM).
It should be noted that, in this document, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such a process, apparatus, article, or method. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, apparatus, article, or method that includes the element.
The above description covers only preferred embodiments of the present application and is not intended to limit the scope of the present application. All equivalent structural or process changes made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, are likewise included within the patent protection scope of the present application.

Claims (10)

1. A model joint training method is characterized by comprising the following steps:
constructing a first acoustic feature matrix of the audio training data;
inputting the first acoustic feature matrix into an encoding network to obtain a first high-dimensional feature matrix;
inputting the first high-dimensional feature matrix into a decoding network to obtain a second acoustic feature matrix;
inputting the second acoustic feature matrix into the encoding network to obtain a second high-dimensional feature matrix;
inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network respectively, and adjusting network parameters of the encoding network, the decoding network and the classification network based on a back-propagation algorithm to obtain a trained wake-up model and a trained noise reduction model; the encoding network and the classification network form the wake-up model, and the encoding network and the decoding network form the noise reduction model.
2. The model joint training method of claim 1, wherein the classification network comprises a fully connected layer and a softmax function, and the loss function used is a cross-entropy loss function.
3. The model joint training method of claim 2, wherein the step of inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network, and adjusting the network parameters of the encoding network, the decoding network and the classification network based on a back-propagation algorithm to obtain a trained wake-up model and a noise reduction model comprises:
inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into the fully connected layer for computation, and then calculating a cross-entropy loss value with the loss function;
adjusting the network parameters of the encoding network, the decoding network and the classification network by a gradient-descent back-propagation algorithm so as to minimize the cross-entropy loss value;
and, after iterative training, when the cross-entropy loss value no longer decreases, the model has converged, yielding the trained wake-up model and noise reduction model.
4. The model joint training method of claim 1, wherein the audio training data comprises positive sample audio and negative sample audio;
before the step of constructing the first acoustic feature matrix of the audio training data, the method comprises:
acquiring noise audio as the negative sample audio;
acquiring clean wake-up speech, the clean wake-up speech being noise-free speech that contains the wake-up word;
and mixing the clean wake-up speech with the noise audio at a preset signal-to-noise ratio to obtain noisy wake-up speech as the positive sample audio.
5. A model joint training apparatus, comprising:
a construction unit, configured to construct a first acoustic feature matrix of audio training data;
a first encoding unit, configured to input the first acoustic feature matrix into an encoding network to obtain a first high-dimensional feature matrix;
a decoding unit, configured to input the first high-dimensional feature matrix into a decoding network to obtain a second acoustic feature matrix;
a second encoding unit, configured to input the second acoustic feature matrix into the encoding network to obtain a second high-dimensional feature matrix;
and a training unit, configured to input the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network respectively, and to adjust network parameters of the encoding network, the decoding network and the classification network based on a back-propagation algorithm to obtain a trained wake-up model and a trained noise reduction model; the encoding network and the classification network form the wake-up model, and the encoding network and the decoding network form the noise reduction model.
6. The model joint training apparatus of claim 5, wherein the classification network comprises a fully connected layer and a softmax function, and the loss function used is a cross-entropy loss function.
7. The model joint training apparatus of claim 6, wherein the training unit is specifically configured to:
input the first high-dimensional feature matrix and the second high-dimensional feature matrix into the fully connected layer for computation, and then calculate a cross-entropy loss value with the loss function;
adjust the network parameters of the encoding network, the decoding network and the classification network by a gradient-descent back-propagation algorithm so as to minimize the cross-entropy loss value;
and, after iterative training, when the cross-entropy loss value no longer decreases, conclude that the model has converged, yielding the trained wake-up model and noise reduction model.
8. The model joint training apparatus of claim 5, wherein the audio training data comprises positive sample audio and negative sample audio, and the apparatus further comprises:
a first acquisition unit, configured to acquire noise audio as the negative sample audio;
a second acquisition unit, configured to acquire clean wake-up speech, the clean wake-up speech being noise-free speech that contains the wake-up word;
and a mixing unit, configured to mix the clean wake-up speech with the noise audio at a preset signal-to-noise ratio to obtain noisy wake-up speech as the positive sample audio.
9. A computer device comprising a memory and a processor, the memory having stored therein a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method according to any of claims 1 to 4.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 4.
CN202110044163.XA 2021-01-13 2021-01-13 Model joint training method and device, computer equipment and storage medium Active CN112364993B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110044163.XA CN112364993B (en) 2021-01-13 2021-01-13 Model joint training method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110044163.XA CN112364993B (en) 2021-01-13 2021-01-13 Model joint training method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112364993A CN112364993A (en) 2021-02-12
CN112364993B true CN112364993B (en) 2021-04-30

Family

ID=74534933

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110044163.XA Active CN112364993B (en) 2021-01-13 2021-01-13 Model joint training method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112364993B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114512136B (en) * 2022-03-18 2023-09-26 北京百度网讯科技有限公司 Model training method, audio processing method, device, equipment, storage medium and program
CN116074150B (en) * 2023-03-02 2023-06-09 广东浩博特科技股份有限公司 Switch control method and device for intelligent home and intelligent home

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109977212A * 2019-03-28 2019-07-05 清华大学深圳研究生院 Reply content generation method for a dialogue robot, and terminal device
CN110009025A * 2019-03-27 2019-07-12 河南工业大学 Semi-supervised additive-noise autoencoder for speech lie detection
CN110503981A * 2019-08-26 2019-11-26 苏州科达科技股份有限公司 No-reference objective audio quality assessment method, device and storage medium
CN110619885A * 2019-08-15 2019-12-27 西北工业大学 Speech enhancement method based on a deep fully convolutional generative adversarial network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463953B * 2017-07-21 2019-11-19 上海媒智科技有限公司 Image classification method and system based on quality embedding under noisy labels

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110009025A * 2019-03-27 2019-07-12 河南工业大学 Semi-supervised additive-noise autoencoder for speech lie detection
CN109977212A * 2019-03-28 2019-07-05 清华大学深圳研究生院 Reply content generation method for a dialogue robot, and terminal device
CN110619885A * 2019-08-15 2019-12-27 西北工业大学 Speech enhancement method based on a deep fully convolutional generative adversarial network
CN110503981A * 2019-08-26 2019-11-26 苏州科达科技股份有限公司 No-reference objective audio quality assessment method, device and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MUN, S. et al., "Deep neural network based learning and transferring mid-level audio features for acoustic scene classification," 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 1-2. *
XIA, Qing et al., "Research progress of deep-learning-based digital geometry processing and analysis" (基于深度学习的数字几何处理与分析技术研究进展), Journal of Computer Research and Development (计算机研究与发展), vol. 56, no. 1, January 2019, pp. 1-2. *

Also Published As

Publication number Publication date
CN112364993A (en) 2021-02-12

Similar Documents

Publication Publication Date Title
Matsubara et al. Head network distillation: Splitting distilled deep neural networks for resource-constrained edge computing systems
CN112364993B (en) Model joint training method and device, computer equipment and storage medium
CN109523014B (en) News comment automatic generation method and system based on generative confrontation network model
CN112365885B (en) Training method and device of wake-up model and computer equipment
CN110119447B (en) Self-coding neural network processing method, device, computer equipment and storage medium
CN112214604A (en) Training method of text classification model, text classification method, device and equipment
CN112435656A (en) Model training method, voice recognition method, device, equipment and storage medium
CN112331183B (en) Non-parallel corpus voice conversion method and system based on autoregressive network
CN111428771B (en) Video scene classification method and device and computer-readable storage medium
CN112712813B (en) Voice processing method, device, equipment and storage medium
CN112634867A (en) Model training method, dialect recognition method, device, server and storage medium
CN110069611B (en) Topic-enhanced chat robot reply generation method and device
JP6908302B2 (en) Learning device, identification device and program
CN111583911B (en) Speech recognition method, device, terminal and medium based on label smoothing
CN113128232B (en) Named entity identification method based on ALBERT and multiple word information embedding
CN111598213A (en) Network training method, data identification method, device, equipment and medium
CN112735389A (en) Voice training method, device and equipment based on deep learning and storage medium
CN107463928A (en) Word sequence error correction algorithm, system and its equipment based on OCR and two-way LSTM
CN114360502A (en) Processing method of voice recognition model, voice recognition method and device
CN113360610A (en) Dialog generation method and system based on Transformer model
CN113626610A (en) Knowledge graph embedding method and device, computer equipment and storage medium
WO2022246986A1 (en) Data processing method, apparatus and device, and computer-readable storage medium
WO2022121188A1 (en) Keyword detection method and apparatus, device and storage medium
CN113052257A (en) Deep reinforcement learning method and device based on visual converter
Naik et al. Indian monsoon rainfall classification and prediction using robust back propagation artificial neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant