CN112364993B - Model joint training method and device, computer equipment and storage medium - Google Patents

Model joint training method and device, computer equipment and storage medium

Info

Publication number
CN112364993B
Authority
CN
China
Prior art keywords
network
feature matrix
model
training
dimensional feature
Prior art date
Legal status
Active
Application number
CN202110044163.XA
Other languages
Chinese (zh)
Other versions
CN112364993A
Inventor
徐泓洋
王广新
杨汉丹
Current Assignee
Shenzhen Youjie Zhixin Technology Co., Ltd.
Original Assignee
Shenzhen Youjie Zhixin Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Shenzhen Youjie Zhixin Technology Co., Ltd.
Priority to CN202110044163.XA
Publication of CN112364993A
Application granted
Publication of CN112364993B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks


Abstract

The application provides a model joint training method, apparatus, computer equipment and storage medium, comprising the following steps: constructing a first acoustic feature matrix of audio training data; inputting the first acoustic feature matrix into an encoding network to obtain a first high-dimensional feature matrix; inputting the first high-dimensional feature matrix into a decoding network to obtain a second acoustic feature matrix; inputting the second acoustic feature matrix into the encoding network to obtain a second high-dimensional feature matrix; and inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network respectively, and adjusting network parameters of the encoding network, the decoding network and the classification network based on a back-propagation algorithm to obtain a trained wake-up model and a trained noise reduction model. In this application, the decoding network outputs the second acoustic feature matrix, which increases the amount of training data; the wake-up model and the noise reduction model are trained jointly, the results are better than when each model is trained alone, training is fast, and training cost is low.

Description

Model joint training method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a model joint training method and apparatus, a computer device, and a storage medium.
Background
Currently, wake-up models and noise reduction models are generally trained on a collected clean speech data set and a collected noise data set. During training, data enhancement techniques that simulate real scenes are applied to increase the diversity of the training data and improve the noise robustness of the model in real scenes.
Obtaining a good noise reduction model requires substantially more training data than training a wake-up model does. When the training data is limited, with only noisy speech or only a small amount of clean speech available, a good noise reduction model cannot be obtained; the effect achievable by directly training a wake-up model is also relatively limited, and further improving the wake-up model is difficult.
Disclosure of Invention
The application mainly aims to provide a model joint training method and apparatus, a computer device, and a storage medium, so as to overcome the defect that a model trained on little data performs poorly.
In order to achieve the above object, the present application provides a model joint training method, including the following steps:
constructing a first acoustic feature matrix of the audio training data;
inputting the first acoustic feature matrix into an encoding network to obtain a first high-dimensional feature matrix;
inputting the first high-dimensional feature matrix into a decoding network to obtain a second acoustic feature matrix;
inputting the second acoustic feature matrix into the encoding network to obtain a second high-dimensional feature matrix;
inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network respectively, and adjusting network parameters of the encoding network, the decoding network and the classification network based on a back-propagation algorithm to obtain a trained wake-up model and a trained noise reduction model; the encoding network and the classification network form the wake-up model, and the encoding network and the decoding network form the noise reduction model.
Further, the classification network comprises a fully connected layer and a softmax function, and the loss function used is a cross-entropy loss function.
Further, the step of inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network, and adjusting network parameters of the encoding network, the decoding network and the classification network based on a back-propagation algorithm to obtain a trained wake-up model and a trained noise reduction model comprises:
inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into the fully connected layer for computation, and then calculating a cross-entropy loss value with the loss function;
adjusting the network parameters of the encoding network, the decoding network and the classification network by a gradient-descent back-propagation algorithm so as to minimize the cross-entropy loss value;
and, after iterative training, when the cross-entropy loss value no longer decreases, the model has converged, yielding the trained wake-up model and noise reduction model.
Further, the audio training data comprises positive sample audio and negative sample audio;
before the step of constructing the first acoustic feature matrix of the audio training data, the method comprises:
acquiring noise audio as the negative sample audio;
acquiring clean wake-up speech, namely noise-free speech that contains the wake-up word;
and mixing the clean wake-up speech with the noise audio at a preset signal-to-noise ratio to obtain noisy wake-up speech as the positive sample audio.
The application also provides a model joint training apparatus, comprising:
a construction unit, configured to construct a first acoustic feature matrix of audio training data;
a first encoding unit, configured to input the first acoustic feature matrix into an encoding network to obtain a first high-dimensional feature matrix;
a decoding unit, configured to input the first high-dimensional feature matrix into a decoding network to obtain a second acoustic feature matrix;
a second encoding unit, configured to input the second acoustic feature matrix into the encoding network to obtain a second high-dimensional feature matrix;
and a training unit, configured to input the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network respectively, and to adjust network parameters of the encoding network, the decoding network and the classification network based on a back-propagation algorithm to obtain a trained wake-up model and a trained noise reduction model; the encoding network and the classification network form the wake-up model, and the encoding network and the decoding network form the noise reduction model.
Further, the classification network comprises a fully connected layer and a softmax function, and the loss function used is a cross-entropy loss function.
Further, the training unit is specifically configured to:
input the first high-dimensional feature matrix and the second high-dimensional feature matrix into the fully connected layer for computation, and then calculate a cross-entropy loss value with the loss function;
adjust the network parameters of the encoding network, the decoding network and the classification network by a gradient-descent back-propagation algorithm so as to minimize the cross-entropy loss value;
and, after iterative training, when the cross-entropy loss value no longer decreases, conclude that the model has converged, yielding the trained wake-up model and noise reduction model.
Further, the audio training data comprises positive sample audio and negative sample audio, and the apparatus further comprises:
a first acquisition unit, configured to acquire noise audio as the negative sample audio;
a second acquisition unit, configured to acquire clean wake-up speech, namely noise-free speech that contains the wake-up word;
and a mixing unit, configured to mix the clean wake-up speech with the noise audio at a preset signal-to-noise ratio to obtain noisy wake-up speech as the positive sample audio.
The present application further provides a computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of any one of the above methods when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of any of the above.
The model joint training method, apparatus, computer device and storage medium provided by the application comprise: constructing a first acoustic feature matrix of audio training data; inputting the first acoustic feature matrix into an encoding network to obtain a first high-dimensional feature matrix; inputting the first high-dimensional feature matrix into a decoding network to obtain a second acoustic feature matrix; inputting the second acoustic feature matrix into the encoding network to obtain a second high-dimensional feature matrix; and inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network respectively, and adjusting network parameters of the encoding network, the decoding network and the classification network based on a back-propagation algorithm to obtain a trained wake-up model and a trained noise reduction model. In the application, the decoding network outputs the second acoustic feature matrix, which increases the amount of training data, and the wake-up model and the noise reduction model are trained jointly; because the two models share one encoding network, both gain the ability to extract target information from noisy audio more accurately, so the results are better than when each model is trained alone, training is fast, and training cost is low.
Drawings
FIG. 1 is a schematic diagram of the steps of a model joint training method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of the principle of a model joint training method according to an embodiment of the present application;
FIG. 3 is a schematic block diagram of the structure of a model joint training apparatus according to an embodiment of the present application;
FIG. 4 is a schematic block diagram of the structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the present application will be further described with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Referring to fig. 1, an embodiment of the present application provides a model joint training method, including the following steps:
step S1, constructing a first acoustic feature matrix of the audio training data;
step S2, inputting the first acoustic feature matrix into an encoding network to obtain a first high-dimensional feature matrix;
step S3, inputting the first high-dimensional feature matrix into a decoding network to obtain a second acoustic feature matrix;
step S4, inputting the second acoustic feature matrix into the encoding network to obtain a second high-dimensional feature matrix;
step S5, inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network respectively, and adjusting network parameters of the encoding network, the decoding network and the classification network based on a back-propagation algorithm to obtain a trained wake-up model and a trained noise reduction model; the encoding network and the classification network form the wake-up model, and the encoding network and the decoding network form the noise reduction model.
In this embodiment, the model joint training method is applied to scenes with little training data and improves the effect of the trained models. While training yields the wake-up model, a noise reduction model with a certain noise reduction capability is obtained at the same time, which provides a feasible scheme for constructing a noise reduction model when training data is insufficient. Here, little training data means that clean wake-up speech is scarce or absent.
Specifically, as described in step S1 above, the audio training data is audio data, typically noisy wake-up speech data, labeled with corresponding labels for training the neural network model. Before being input into the neural network model, a first acoustic feature matrix of the audio training data must be constructed; a linear transformation network can generally be used to extract the feature matrix.
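The patent leaves the concrete acoustic features and the extraction step unspecified. As one common, purely illustrative choice, the sketch below builds the first acoustic feature matrix from log-mel filterbank features with torchaudio; the sampling rate, window, hop and mel-bin values are assumptions, not values from the patent.

```python
import torch
import torchaudio

# Hypothetical front end: 16 kHz audio, 25 ms windows, 10 ms hop, 40 mel bins.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=40)

def build_acoustic_feature_matrix(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (1, num_samples) mono audio -> (num_frames, 40) feature matrix."""
    feats = mel(waveform)                     # (1, n_mels, num_frames)
    feats = torch.log(feats + 1e-6)           # log compression for numerical stability
    return feats.squeeze(0).transpose(0, 1)   # (num_frames, n_mels)
```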
As described in step S2 above, the encoding network (kws_net) is a neural network for extracting a high-dimensional audio feature matrix: it takes the acoustic feature matrix of the audio as input and outputs a feature matrix in a high-dimensional space.
As described in step S3 above, the decoding network (decode_net) is a neural network for decoding the high-dimensional feature matrix back into an acoustic feature matrix; after passing through the decoding network, a new acoustic feature matrix, namely the second acoustic feature matrix, is generated. It can be understood that, in this embodiment, as shown in fig. 2, the encoding network serves as the common part of the noise reduction model and the wake-up model. When processing input data, the encoding network of the wake-up model mainly extracts information related to the speech content in the noisy sound, while the encoding network of the noise reduction model mainly separates out the target sound features, from which the target speech is then generated. What the two have in common is that both must extract the feature information of the target speech; the encoding network of the noise reduction model therefore retains the speech information when processing noisy audio, and the audio generated by the decoding network can still trigger wake-up when passed through the wake-up network.
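As a minimal sketch of this shared-encoder data flow, the PyTorch fragment below wires one encoder into both models. The GRU encoder, linear decoder, layer sizes, and the use of the last frame's state for classification are all assumptions; the patent only requires that the encoder and decoder be neural networks such as a DNN, CNN, or RNN.

```python
import torch
import torch.nn as nn

FEAT_DIM, HID_DIM, NUM_CLASSES = 40, 128, 2   # assumed dimensions

encoder = nn.GRU(FEAT_DIM, HID_DIM, batch_first=True)  # kws_net: shared encoding network
decoder = nn.Linear(HID_DIM, FEAT_DIM)                 # decode_net: high-dim features -> acoustic features
classifier = nn.Linear(HID_DIM, NUM_CLASSES)           # fully connected layer (softmax applied in the loss)

def joint_forward(x: torch.Tensor):
    """x: (batch, frames, FEAT_DIM), the first acoustic feature matrix."""
    h1, _ = encoder(x)               # first high-dimensional feature matrix
    x2 = decoder(h1)                 # second acoustic feature matrix (noise reduction output)
    h2, _ = encoder(x2)              # second high-dimensional feature matrix
    logits1 = classifier(h1[:, -1])  # wake-up classification of the original audio
    logits2 = classifier(h2[:, -1])  # wake-up classification of the regenerated audio
    return logits1, logits2
```

Here the wake-up model is the encoder plus the classifier, and the noise reduction model is the encoder plus the decoder, matching the composition stated above.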
As described in step S4 above, the audio generated by the decoding network can still trigger wake-up after passing through the wake-up network; therefore, the second acoustic feature matrix obtained by the decoding network can also serve as training data and is input into the encoding network to obtain the second high-dimensional feature matrix.
As described in step S5 above, the first high-dimensional feature matrix and the second high-dimensional feature matrix are respectively input into the classification network, and the network parameters (network weights) of the encoding network, the decoding network and the classification network are continuously adjusted based on a back-propagation algorithm to obtain the trained wake-up model and noise reduction model.
In this embodiment, the encoding network and the classification network form the wake-up model, and the encoding network and the decoding network form the noise reduction model; after the encoding network, the decoding network and the classification network are iteratively trained to convergence, the wake-up model and the noise reduction model are obtained. The decoding network outputs the second acoustic feature matrix, which increases the amount of training data, and the wake-up model and the noise reduction model are trained jointly; because the two models share one encoding network, both gain the ability to extract target information from noisy audio more accurately, so the results are better than when each model is trained alone, training is fast, and training cost is low.
In summary, the model joint training method of this embodiment is suitable for scenes with insufficient training data, that is, when there is not enough data to build a noise reduction model to help improve the wake-up model. The wake-up network and the noise reduction network share one encoding network, and the two networks are trained jointly, directly on the noisy wake-up speech training data and the noise data set. In addition, while the loss function of the wake-up model trains the wake-up model, the noise reduction model is trained along with it; the resulting noise reduction model has a certain noise reduction effect even without any clean speech, which provides a feasible scheme for constructing a noise reduction model when data is insufficient.
In an embodiment, the encoding network comprises any one or more neural networks such as a DNN, CNN, or RNN, any of which can encode the acoustic feature matrix; this is not limited here.
In an embodiment, the decoding network comprises any one or more neural networks such as a DNN, CNN, or RNN, any of which can decode the high-dimensional feature matrix; this is not limited here.
In an embodiment, the classification network comprises a full connectivity layer and a softmax function, and the loss function used is a cross-entropy loss function.
In this embodiment, the classification network of the wake-up model is an ordinary classification model whose target is a class label. Since the audio produced by the decoding network is input into the encoding network and the classification network again as a sample, its target is still a class label, so the joint training of the two networks has only one loss function, namely the cross-entropy loss function commonly used by classification models, with the formula:
Total_loss = ce_loss;
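For clarity, the cross-entropy loss named above takes the standard form used by classification models (the expanded formula is supplied here for reference and is not written out in the patent):
ce_loss = -(1/N) * Σ_{i=1..N} Σ_{c=1..C} y_{i,c} * log(p_{i,c});
where N is the number of samples, C is the number of classes, y_{i,c} is 1 if sample i belongs to class c and 0 otherwise, and p_{i,c} is the softmax probability the classification network assigns to class c for sample i.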
in an embodiment, the step of inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network, and adjusting network parameters of the coding network, the decoding network, and the classification network based on a back propagation algorithm to obtain a trained wake-up model and a trained noise reduction model includes:
after the first high-dimensional feature matrix and the second high-dimensional feature matrix are input into the full-connection layer for calculation, calculating a cross entropy loss value based on the loss function;
adjusting network parameters of the encoding network, decoding network, and classification network using a gradient descent back propagation algorithm to minimize the cross entropy loss value;
and after iterative training, when the cross entropy loss value does not decrease any more, the model converges to obtain the trained awakening model and the noise reduction model.
In the iterative training process of this embodiment, the classification result is predicted, and the cross entropy loss value between the predicted classification result and the real label is calculated by the loss function. And then, continuously adjusting network parameters of the coding network, the decoding network and the classification network, namely network weights, by adopting a gradient descending back propagation algorithm so as to minimize a cross entropy loss value calculated by a loss function, and converging the model when the cross entropy loss value does not descend any more, thereby obtaining the trained awakening model and the trained noise reduction model.
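Continuing the hypothetical sketch above, one training step of this procedure could look as follows. The Adam optimizer and learning rate are assumptions (the patent only specifies gradient-descent back propagation minimizing the cross-entropy loss), and applying the single loss to both classification outputs is one plausible reading of Total_loss = ce_loss; `encoder`, `decoder`, `classifier` and `joint_forward` are the hypothetical names introduced earlier.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()   # softmax + cross entropy in one call
params = (list(encoder.parameters()) + list(decoder.parameters())
          + list(classifier.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-3)  # assumed optimizer and learning rate

def train_step(x: torch.Tensor, labels: torch.Tensor) -> float:
    logits1, logits2 = joint_forward(x)
    # One cross-entropy loss for the joint training, applied to both the
    # first and the second high-dimensional feature paths.
    loss = criterion(logits1, labels) + criterion(logits2, labels)
    optimizer.zero_grad()
    loss.backward()    # back propagation through encoder, decoder and classifier
    optimizer.step()   # gradient-descent update of the shared weights
    return loss.item()
```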
In an embodiment, the audio training data comprises positive sample audio and negative sample audio;
before the step of constructing the first acoustic feature matrix of the audio training data, the method comprises:
acquiring noise audio as the negative sample audio;
and acquiring noisy wake-up speech as the positive sample audio.
In an embodiment, the step of acquiring noisy wake-up speech as the positive sample audio comprises:
acquiring clean wake-up speech, namely noise-free speech that contains the wake-up word;
and mixing the clean wake-up speech with the noise audio at a preset signal-to-noise ratio to obtain the noisy wake-up speech as the positive sample audio, as in the sketch below.
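The mixing at a preset signal-to-noise ratio can be sketched as follows; `mix_at_snr` is a hypothetical helper, and the 10 dB default is an example value rather than one taken from the patent.

```python
import torch

def mix_at_snr(clean: torch.Tensor, noise: torch.Tensor,
               snr_db: float = 10.0) -> torch.Tensor:
    """Mix clean wake-up speech with noise at a preset signal-to-noise ratio (dB)."""
    noise = noise[..., : clean.shape[-1]]            # trim noise to the speech length
    clean_power = clean.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp_min(1e-10)
    # Scale the noise so that 10 * log10(clean_power / scaled_noise_power) == snr_db.
    scale = torch.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise   # noisy wake-up speech (positive sample)
```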
Referring to fig. 3, an embodiment of the present application further provides a model joint training apparatus, including:
a construction unit 10, configured to construct a first acoustic feature matrix of the audio training data;
the first encoding unit 20 is configured to input the first acoustic feature matrix to an encoding network to obtain a first high-dimensional feature matrix;
a decoding unit 30, configured to input the first high-dimensional feature matrix into a decoding network to obtain a second acoustic feature matrix;
a second encoding unit 40, configured to input the second acoustic feature matrix to the encoding network to obtain a second high-dimensional feature matrix;
the training unit 50 is configured to input the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network, and adjust network parameters of the coding network, the decoding network, and the classification network based on a back propagation algorithm to obtain a trained wake-up model and a trained noise reduction model; the coding network and the classification network form a wake-up model, and the coding network and the decoding network form a noise reduction model.
In this embodiment, the model joint training device is applied to a scene with less training data, so as to improve the effect of the trained model. The method obtains the noise reduction model with a certain noise reduction effect while obtaining the awakening model through training, and provides a feasible scheme for constructing the noise reduction model under the condition of insufficient training data. The training data is less, which means that the pure wake-up voice is less or no.
Specifically, as described in the above building unit 10, the audio training data is audio data, typically noisy wake-up voice data, and the audio data is labeled with a corresponding label for training the neural network model. Before the training of the input value neural network model, a first acoustic feature matrix of the audio training data needs to be constructed, and a linear transformation network can be generally adopted for extracting the feature matrix.
As described in the first encoding unit 20, the encoding network (kws net) is a neural network for extracting a high-dimensional feature matrix of audio, and the encoding network inputs an acoustic feature matrix of audio and outputs a feature matrix of a high-dimensional space.
As described in the decoding unit 30, the decoding network (decode _ net) is a neural network for decoding a high-dimensional feature matrix into an acoustic feature matrix, and after passing through the decoding network, a new acoustic feature matrix, that is, the second acoustic feature matrix is generated. It can be understood that, in the present embodiment, as shown in fig. 2, the coding network described above is used as a common part of the noise reduction model and the wake-up network, wherein the coding network of the wake-up model mainly extracts information related to the speech content in the noisy sound when processing the input data, and the coding network of the noise reduction model mainly separates the target sound feature when processing the input data, and then generates the target speech according to the target sound feature. The same point of the two methods is that feature information of the target voice needs to be extracted, so that the coding network of the noise reduction model can keep the voice information when processing the noisy audio, and the audio generated by the decoding network can be awakened after the audio is awakened through the awakening network.
As described in the second encoding unit 40, the audio generated by the decoding network can be awakened after passing through the wake-up network; therefore, the second acoustic feature matrix obtained by decoding through the decoding network can also be used as training data, and the training data is input into the coding network to obtain a second high-dimensional feature matrix.
As stated in the training unit 50, the first high-dimensional feature matrix and the second high-dimensional feature matrix are respectively input into the classification network, and network parameters (network weights) of the coding network, the decoding network and the classification network are continuously adjusted based on a back propagation algorithm, so as to obtain a trained wake-up model and a trained noise reduction model.
In this embodiment, the coding network and the classification network form a wake-up model, and the coding network and the decoding network form a noise reduction model; after the coding network, the decoding network and the classification network are iteratively trained, the awakening model and the noise reduction model can be obtained after the model is converged. In this embodiment, the decoding network outputs the second acoustic feature matrix, which increases the data size of the training sample, and jointly trains the wake-up model and the noise reduction model; the two models share one coding network, so that the two models have the capability of more accurately extracting target information from noisy audio, the effect is better than that when the models are trained independently, the training speed is high, and the training cost is low.
To sum up, for the model joint training device in this application embodiment, be applicable to the not enough scene of training data volume, when there is not enough data to construct the noise reduction model promptly and assist the effect that promotes the awakening model, make the awakening network and the noise reduction network share a coding network, directly carry out the joint training to two networks on the awakening voice training data that make an uproar and the noise data set, through this kind of training mode, make the coding network possess the ability of accurately extracting the target information from the voice that makes an uproar, thereby make the model effect better than when training alone, and training speed is fast, training is with low costs. In addition, when the loss function of the awakening model is used for training the awakening model, the noise reduction model is trained in a sequential manner, the obtained noise reduction model has a certain noise reduction effect under the condition that pure voice does not exist, and a feasible scheme is provided for constructing the noise reduction model under the condition that data is insufficient.
In one embodiment, the encoding network comprises any one or more neural networks such as a DNN, CNN, or RNN.
In one embodiment, the decoding network comprises any one or more neural networks such as a DNN, CNN, or RNN.
In an embodiment, the classification network comprises a fully connected layer and a softmax function, and the loss function used is a cross-entropy loss function.
In an embodiment, the training unit 50 is specifically configured to:
input the first high-dimensional feature matrix and the second high-dimensional feature matrix into the fully connected layer for computation, and then calculate a cross-entropy loss value with the loss function;
adjust the network parameters of the encoding network, the decoding network and the classification network by a gradient-descent back-propagation algorithm so as to minimize the cross-entropy loss value;
and, after iterative training, when the cross-entropy loss value no longer decreases, conclude that the model has converged, yielding the trained wake-up model and noise reduction model.
In an embodiment, the audio training data comprises positive sample audio and negative sample audio;
the model joint training device further comprises:
a first acquisition unit, configured to acquire noise audio as the negative sample audio;
and a second acquisition unit, configured to acquire noisy wake-up speech as the positive sample audio.
In an embodiment, the second acquisition unit is specifically configured to:
acquire clean wake-up speech, namely noise-free speech that contains the wake-up word;
and mix the clean wake-up speech with the noise audio at a preset signal-to-noise ratio to obtain the noisy wake-up speech as the positive sample audio.
For the specific implementation of each unit in the model joint training apparatus of this embodiment, please refer to the methods described in the embodiments above; details are not repeated here.
Referring to fig. 4, an embodiment of the present application also provides a computer device, which may be a server and whose internal structure may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computation and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing models and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the model joint training method.
Those skilled in the art will appreciate that the architecture shown in fig. 4 is only a block diagram of some of the structures associated with the present solution and does not limit the computer devices to which the present solution can be applied.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the model joint training method. It is to be understood that the computer-readable storage medium in this embodiment may be a volatile or a non-volatile readable storage medium.
In summary, the model joint training method, apparatus, computer device and storage medium provided in the embodiments of the present application comprise: constructing a first acoustic feature matrix of audio training data; inputting the first acoustic feature matrix into an encoding network to obtain a first high-dimensional feature matrix; inputting the first high-dimensional feature matrix into a decoding network to obtain a second acoustic feature matrix; inputting the second acoustic feature matrix into the encoding network to obtain a second high-dimensional feature matrix; and inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network respectively, and adjusting network parameters of the encoding network, the decoding network and the classification network based on a back-propagation algorithm to obtain a trained wake-up model and a trained noise reduction model. In the application, the decoding network outputs the second acoustic feature matrix, which increases the amount of training data, and the wake-up model and the noise reduction model are trained jointly; because the two models share one encoding network, both gain the ability to extract target information from noisy audio more accurately, so the results are better than when each model is trained alone, training is fast, and training cost is low.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by instructing the relevant hardware through a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus dynamic RAM (RDRAM), and direct Rambus dynamic RAM (DRDRAM).
It should be noted that, in this document, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such a process, apparatus, article, or method. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, apparatus, article, or method that includes the element.
The above description covers only preferred embodiments of the present application and is not intended to limit the scope of the present application. All equivalent structural or process changes made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, are likewise included within the patent protection scope of the present application.

Claims (10)

1. A model joint training method is characterized by comprising the following steps:
constructing a first acoustic feature matrix of the audio training data;
inputting the first acoustic feature matrix into an encoding network to obtain a first high-dimensional feature matrix;
inputting the first high-dimensional feature matrix into a decoding network to obtain a second acoustic feature matrix;
inputting the second acoustic feature matrix into the encoding network to obtain a second high-dimensional feature matrix;
inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network respectively, and adjusting network parameters of the encoding network, the decoding network and the classification network based on a back-propagation algorithm to obtain a trained wake-up model and a trained noise reduction model; the encoding network and the classification network form the wake-up model, and the encoding network and the decoding network form the noise reduction model.
2. The model joint training method of claim 1, wherein the classification network comprises a fully connected layer and a softmax function, and the loss function used is a cross-entropy loss function.
3. The model joint training method of claim 2, wherein the step of inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network, and adjusting the network parameters of the encoding network, the decoding network and the classification network based on a back-propagation algorithm to obtain a trained wake-up model and a noise reduction model comprises:
inputting the first high-dimensional feature matrix and the second high-dimensional feature matrix into the fully connected layer for computation, and then calculating a cross-entropy loss value with the loss function;
adjusting the network parameters of the encoding network, the decoding network and the classification network by a gradient-descent back-propagation algorithm so as to minimize the cross-entropy loss value;
and, after iterative training, when the cross-entropy loss value no longer decreases, the model has converged, yielding the trained wake-up model and noise reduction model.
4. The model joint training method of claim 1, wherein the audio training data comprises positive sample audio and negative sample audio;
before the step of constructing the first acoustic feature matrix of the audio training data, the method comprises:
acquiring noise audio as the negative sample audio;
acquiring clean wake-up speech, the clean wake-up speech being noise-free speech that contains the wake-up word;
and mixing the clean wake-up speech with the noise audio at a preset signal-to-noise ratio to obtain noisy wake-up speech as the positive sample audio.
5. A model joint training apparatus, comprising:
a construction unit, configured to construct a first acoustic feature matrix of audio training data;
a first encoding unit, configured to input the first acoustic feature matrix into an encoding network to obtain a first high-dimensional feature matrix;
a decoding unit, configured to input the first high-dimensional feature matrix into a decoding network to obtain a second acoustic feature matrix;
a second encoding unit, configured to input the second acoustic feature matrix into the encoding network to obtain a second high-dimensional feature matrix;
and a training unit, configured to input the first high-dimensional feature matrix and the second high-dimensional feature matrix into a classification network respectively, and to adjust network parameters of the encoding network, the decoding network and the classification network based on a back-propagation algorithm to obtain a trained wake-up model and a trained noise reduction model; the encoding network and the classification network form the wake-up model, and the encoding network and the decoding network form the noise reduction model.
6. The model joint training apparatus of claim 5, wherein the classification network comprises a fully connected layer and a softmax function, and the loss function used is a cross-entropy loss function.
7. The model joint training apparatus of claim 6, wherein the training unit is specifically configured to:
input the first high-dimensional feature matrix and the second high-dimensional feature matrix into the fully connected layer for computation, and then calculate a cross-entropy loss value with the loss function;
adjust the network parameters of the encoding network, the decoding network and the classification network by a gradient-descent back-propagation algorithm so as to minimize the cross-entropy loss value;
and, after iterative training, when the cross-entropy loss value no longer decreases, conclude that the model has converged, yielding the trained wake-up model and noise reduction model.
8. The model joint training apparatus of claim 5, wherein the audio training data comprises positive sample audio and negative sample audio, and the apparatus further comprises:
a first acquisition unit, configured to acquire noise audio as the negative sample audio;
a second acquisition unit, configured to acquire clean wake-up speech, the clean wake-up speech being noise-free speech that contains the wake-up word;
and a mixing unit, configured to mix the clean wake-up speech with the noise audio at a preset signal-to-noise ratio to obtain noisy wake-up speech as the positive sample audio.
9. A computer device comprising a memory and a processor, the memory having stored therein a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method according to any of claims 1 to 4.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 4.
CN202110044163.XA 2021-01-13 2021-01-13 Model joint training method and device, computer equipment and storage medium Active CN112364993B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110044163.XA CN112364993B (en) 2021-01-13 2021-01-13 Model joint training method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110044163.XA CN112364993B (en) 2021-01-13 2021-01-13 Model joint training method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112364993A CN112364993A (en) 2021-02-12
CN112364993B true CN112364993B (en) 2021-04-30

Family

ID=74534933

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110044163.XA Active CN112364993B (en) 2021-01-13 2021-01-13 Model joint training method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112364993B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114512136B (en) * 2022-03-18 2023-09-26 北京百度网讯科技有限公司 Model training method, audio processing method, device, equipment, storage medium and program
CN116074150B (en) * 2023-03-02 2023-06-09 广东浩博特科技股份有限公司 Switch control method and device for intelligent home and intelligent home

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109977212A * 2019-03-28 2019-07-05 清华大学深圳研究生院 Reply content generation method for a dialogue robot, and terminal device
CN110009025A * 2019-03-27 2019-07-12 河南工业大学 Semi-supervised additive-noise autoencoder for speech lie detection
CN110503981A * 2019-08-26 2019-11-26 苏州科达科技股份有限公司 No-reference objective audio quality assessment method, device and storage medium
CN110619885A * 2019-08-15 2019-12-27 西北工业大学 Speech enhancement method based on a deep fully convolutional generative adversarial network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463953B * 2017-07-21 2019-11-19 上海媒智科技有限公司 Image classification method and system based on quality embedding under noisy labels

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110009025A * 2019-03-27 2019-07-12 河南工业大学 Semi-supervised additive-noise autoencoder for speech lie detection
CN109977212A * 2019-03-28 2019-07-05 清华大学深圳研究生院 Reply content generation method for a dialogue robot, and terminal device
CN110619885A * 2019-08-15 2019-12-27 西北工业大学 Speech enhancement method based on a deep fully convolutional generative adversarial network
CN110503981A * 2019-08-26 2019-11-26 苏州科达科技股份有限公司 No-reference objective audio quality assessment method, device and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MUN, S. et al., "Deep neural network based learning and transferring mid-level audio features for acoustic scene classification," 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 1-2. *
XIA, Qing et al., "Research progress of deep-learning-based digital geometry processing and analysis" (基于深度学习的数字几何处理与分析技术研究进展), Journal of Computer Research and Development (计算机研究与发展), vol. 56, no. 1, January 2019, pp. 1-2. *

Also Published As

Publication number Publication date
CN112364993A (en) 2021-02-12

Similar Documents

Publication Publication Date Title
Matsubara et al. Head network distillation: Splitting distilled deep neural networks for resource-constrained edge computing systems
CN112364993B (en) Model joint training method and device, computer equipment and storage medium
CN109523014B (en) News comment automatic generation method and system based on generative confrontation network model
CN112365885B (en) Training method and device of wake-up model and computer equipment
CN110119447B (en) Self-coding neural network processing method, device, computer equipment and storage medium
CN112214604A (en) Training method of text classification model, text classification method, device and equipment
CN112435656A (en) Model training method, voice recognition method, device, equipment and storage medium
CN112331183B (en) Non-parallel corpus voice conversion method and system based on autoregressive network
CN111428771B (en) Video scene classification method and device and computer-readable storage medium
CN112712813B (en) Voice processing method, device, equipment and storage medium
CN112634867A (en) Model training method, dialect recognition method, device, server and storage medium
CN110069611B (en) Topic-enhanced chat robot reply generation method and device
JP6908302B2 (en) Learning device, identification device and program
CN111583911B (en) Speech recognition method, device, terminal and medium based on label smoothing
CN113128232B (en) Named entity identification method based on ALBERT and multiple word information embedding
CN111598213A (en) Network training method, data identification method, device, equipment and medium
CN112735389A (en) Voice training method, device and equipment based on deep learning and storage medium
CN107463928A (en) Word sequence error correction algorithm, system and its equipment based on OCR and two-way LSTM
CN114360502A (en) Processing method of voice recognition model, voice recognition method and device
CN113360610A (en) Dialog generation method and system based on Transformer model
CN113626610A (en) Knowledge graph embedding method and device, computer equipment and storage medium
WO2022246986A1 (en) Data processing method, apparatus and device, and computer-readable storage medium
WO2022121188A1 (en) Keyword detection method and apparatus, device and storage medium
CN113052257A (en) Deep reinforcement learning method and device based on visual converter
Naik et al. Indian monsoon rainfall classification and prediction using robust back propagation artificial neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant