CN114155883B - Progressive type based speech deep neural network training method and device - Google Patents

Progressive type based speech deep neural network training method and device

Info

Publication number
CN114155883B
CN114155883B (application CN202210116109.6A)
Authority
CN
China
Prior art keywords
voice
neural network
speech
deep neural
target
Prior art date
Legal status
Active
Application number
CN202210116109.6A
Other languages
Chinese (zh)
Other versions
CN114155883A (en)
Inventor
史慧宇
欧阳鹏
Current Assignee
Beijing Qingwei Intelligent Information Technology Co ltd
Original Assignee
Beijing Qingwei Intelligent Information Technology Co ltd
Application filed by Beijing Qingwei Intelligent Information Technology Co ltd
Priority to CN202210116109.6A
Publication of CN114155883A
Application granted
Publication of CN114155883B

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 — Detection of presence or absence of voice signals
    • G10L25/84 — Detection of presence or absence of voice signals for discriminating voice from noise
    • G10L25/27 — Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 — Speech or voice analysis techniques characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a progressive speech deep neural network training method and device, a storage medium and an electronic device. The progressive speech deep neural network training method comprises the following steps: acquiring a mixed speech sample and a target sample speech, wherein the mixed speech sample comprises the target speech and noise speech; inputting the mixed speech sample into a preset speech deep neural network model, which comprises a progressive extractor, an encoder and a reconstructor, to obtain a predicted target speech; and determining the preset speech deep neural network model as the target speech deep neural network model when a loss function determined from the target sample speech and the predicted target speech satisfies a preset condition. The speech deep neural network trained according to this scheme, comprising a progressive extractor, an encoder and a reconstructor, solves the technical problem in the prior art that the target speech cannot be effectively separated from mixed speech.

Description

Progressive type based speech deep neural network training method and device
Technical Field
The invention relates to the field of speech signal processing, and in particular to a progressive speech deep neural network training method and device, a storage medium and an electronic device.
Background
Intelligent devices such as smart speakers, hearing aids and intelligent headsets have become an indispensable part of daily life. The rapid development of these devices has benefited from continuous improvements in voice interaction technology in recent years. During voice interaction, the speaker often issues spoken commands in a complex scene, so the speaker's voice is frequently interfered with by noise, reverberation or other speakers. If the background noise or overlapping speech cannot be removed in time, back-end applications such as speech recognition, semantic understanding or wake-up are severely affected. Speech extraction and separation techniques have therefore become a focus of speech signal processing. Compared with multi-channel speech separation, single-channel speech separation has the advantages of low hardware requirements, low cost and low computation, but is harder to design algorithmically, because it relies mainly on the signal collected by a single microphone and must model the differences in time-frequency acoustic and statistical characteristics between the target speech and the interfering signals.
In recent years, the rapid development of neural networks and deep learning has driven extensive research on speech separation. The basic idea of deep-learning-based speech separation is: build a separation model, extract feature parameters from the mixed speech, learn through network training a mapping between these features and the features of the target speech signal, and then, for any input mixture, output the target speech signal through the trained model, thereby achieving speech separation. A large body of work exists on end-to-end time-domain and frequency-domain algorithms: frequency-domain algorithms include Deep Clustering, DANet, uPIT and Deep CASA, while time-domain algorithms include Conv-TasNet, BLSTM-TasNet, FurcaNeXt and Wavesplit. Most of these algorithms are designed for separating clean speech; although their separation performance is good, their accuracy degrades sharply when they are applied in complex scenes. Real-life scenes, however, are usually accompanied by background noise, reverberation and other speakers' voices, so speech separation research inevitably has to handle mixtures containing more interference factors, and algorithms designed for such mixtures are more accurate and efficient in practice.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiments of the invention provide a progressive speech deep neural network training method and device, a storage medium and an electronic device, which at least solve the technical problem that target speech cannot be effectively separated from mixed speech in the prior art.
According to an aspect of an embodiment of the present invention, there is provided a progressive speech deep neural network training method, comprising: acquiring a mixed speech sample and a target sample speech, wherein the mixed speech sample comprises the target speech and noise speech; inputting the mixed speech sample into a preset speech deep neural network model to obtain a predicted target speech, wherein the preset speech deep neural network model comprises a progressive extractor, a reconstructor and an encoder, the encoder is used for performing feature extraction on the mixed speech to obtain a first feature, the progressive extractor is used for calculating a high-dimensional mapping relation feature from the first feature, and the reconstructor is used for obtaining the predicted target speech in the mixed speech sample from the high-dimensional mapping relation feature; and determining the preset speech deep neural network model as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies a preset condition.
Optionally, the encoder performing feature extraction on the mixed speech to obtain the first feature comprises: inputting the mixed speech sample into the preset speech deep neural network model, and obtaining the first feature through the encoder's two convolutional layers, ReLU activation function and batch normalization.
Optionally, the progressive extractor calculating the high-dimensional mapping relation feature from the first feature comprises: the progressive extractor comprises a plurality of progressive units, each progressive unit comprising a time-delay neural network, a ReLU activation function, batch normalization, a second time-delay neural network, a pooling layer, batch normalization and a graph convolution layer; and each element in the first feature is input into a corresponding progressive unit to obtain the high-dimensional mapping relation feature.
Optionally, inputting each element in the first feature into a corresponding progressive unit to obtain the high-dimensional mapping relation feature comprises: denoting the first feature as H = {h0, …, hi, …, hM-1}, where i = 0 to M-1, and the M progressive units as J = {j0, …, ji, …, jM-1}; h0 is input into the first progressive unit to obtain the corresponding output p0; the sum of h1 and p0 is input into the second progressive unit to obtain the output p1 corresponding to the position of h1; the sum of h2 and p1 is input into the third progressive unit to obtain the output p2 corresponding to the position of h2; and the calculation proceeds in this way at each position until the final hM-1 is added to pM-2 to obtain the corresponding output pM-1, giving the high-dimensional mapping relation feature P = {p0, …, pM-1}.
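Written compactly, with ji denoting the i-th progressive unit, the recurrence described above is:

```latex
p_0 = j_0(h_0), \qquad p_i = j_i\left(h_i + p_{i-1}\right), \quad i = 1, \dots, M-1, \qquad P = \{p_0, \dots, p_{M-1}\}
```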
Optionally, the reconstructor obtaining the predicted target speech in the mixed speech sample from the high-dimensional mapping relation feature comprises: inputting the mapping relation P into the reconstructor, and obtaining the predicted target speech in the mixed speech sample after two convolutional network layers, a ReLU activation function and batch normalization.
Optionally, determining the preset speech deep neural network model as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies a preset condition comprises: calculating the scale-invariant signal-to-noise ratio (SI-SNR) of the target sample speech and the predicted target speech, and determining the loss function from the scale-invariant signal-to-noise ratio; adjusting the weights and biases of the parameters of the preset speech deep neural network model by gradient descent according to the loss value of the loss function; and determining the preset speech deep neural network model as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies the preset condition.
According to another aspect of the embodiments of the present invention, there is provided a progressive speech deep neural network training device, comprising: an acquisition unit, used for acquiring a mixed speech sample and a target sample speech, wherein the mixed speech sample comprises the target speech and noise speech; a prediction unit, used for inputting the mixed speech sample into a preset speech deep neural network model to obtain a predicted target speech, wherein the preset speech deep neural network model comprises a progressive extractor, a reconstructor and an encoder, the encoder is used for performing feature extraction on the mixed speech to obtain a first feature, the progressive extractor is used for calculating a high-dimensional mapping relation feature from the first feature, and the reconstructor is used for obtaining the predicted target speech in the mixed speech sample from the high-dimensional mapping relation feature; and a determining unit, used for determining the preset speech deep neural network model as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies a preset condition.
Optionally, the prediction unit comprises: an encoding module, used for inputting the mixed speech sample into the preset speech deep neural network model and obtaining the first feature through the encoder's two convolutional layers, ReLU activation function and batch normalization.
Optionally, the prediction unit is further configured to perform the following operations: the progressive extractor comprises a plurality of progressive units, each progressive unit comprising a time-delay neural network, a ReLU activation function, batch normalization, a second time-delay neural network, a pooling layer, batch normalization and a graph convolution layer; and each element in the first feature is input into a corresponding progressive unit to obtain the high-dimensional mapping relation feature.
Optionally, the prediction unit is further configured to perform the following operations: denoting the first feature as H = {h0, …, hi, …, hM-1}, where i = 0 to M-1, and the M progressive units as J = {j0, …, ji, …, jM-1}; h0 is input into the first progressive unit to obtain the corresponding output p0; the sum of h1 and p0 is input into the second progressive unit to obtain the output p1 corresponding to the position of h1; the sum of h2 and p1 is input into the third progressive unit to obtain the output p2 corresponding to the position of h2; and the calculation proceeds in this way at each position until the final hM-1 is added to pM-2 to obtain the corresponding output pM-1, giving the high-dimensional mapping relation feature P = {p0, …, pM-1}.
Optionally, the prediction unit is further configured to perform the following operation: inputting the mapping relation P into the reconstructor, and obtaining the predicted target speech in the mixed speech sample after two convolutional network layers, a ReLU activation function and batch normalization.
Optionally, the determining unit comprises: a calculation module, used for calculating the scale-invariant signal-to-noise ratio (SI-SNR) of the target sample speech and the predicted target speech and determining the loss function from it; an adjusting module, used for adjusting the weights and biases of the parameters of the preset speech deep neural network model by gradient descent according to the loss value of the loss function; and a determining module, used for determining the preset speech deep neural network model as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies a preset condition.
According to another aspect of the embodiments of the present application, there is provided a computer-readable storage medium in which a computer program is stored, wherein the computer program is configured to execute the above progressive speech deep neural network training method when run.
According to another aspect of the embodiments of the present application, there is provided an electronic device comprising a memory and a processor, wherein the memory stores a computer program and the processor is configured to run the computer program to perform the above progressive speech deep neural network training method.
In the embodiments of the invention, a mixed speech sample and a target sample speech are acquired, wherein the mixed speech sample comprises the target speech and noise speech; the mixed speech sample is input into a preset speech deep neural network model to obtain a predicted target speech, wherein the preset speech deep neural network model comprises a progressive extractor, a reconstructor and an encoder, the encoder performs feature extraction on the mixed speech to obtain a first feature, the progressive extractor calculates a high-dimensional mapping relation feature from the first feature, and the reconstructor obtains the predicted target speech in the mixed speech sample from the high-dimensional mapping relation feature; and the preset speech deep neural network model is determined as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies a preset condition. Training a speech deep neural network comprising a progressive extractor, an encoder and a reconstructor in this way solves the technical problem that the target speech cannot be effectively separated from mixed speech in the prior art.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a block diagram of the hardware structure of a mobile terminal for an alternative progressive speech deep neural network training method according to an embodiment of the present invention;
FIG. 2 is a flow chart of an alternative progressive speech deep neural network training method according to an embodiment of the present invention;
FIG. 3 is an overall structure diagram of an alternative progressive speech extraction network according to an embodiment of the present invention;
FIG. 4 is a block diagram of an alternative encoder in accordance with an embodiment of the present invention;
FIG. 5 is a block diagram of an alternative progressive unit in accordance with an embodiment of the present invention;
FIG. 6 is a block diagram of an alternative progressive extractor in accordance with an embodiment of the present invention;
FIG. 7 is a block diagram of an alternative reconstructor according to an embodiment of the invention;
FIG. 8 is a block diagram of an alternative progressive speech deep neural network training apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a sequence of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For a better understanding of the present application, some of the names are now described below:
The progressive speech deep neural network training method provided in the embodiments of the present application can be executed on a mobile terminal, a computer terminal or a similar computing device. Taking execution on a mobile terminal as an example, fig. 1 is a block diagram of the hardware structure of a mobile terminal for the progressive speech deep neural network training method according to an embodiment of the present invention. As shown in fig. 1, the mobile terminal 10 may include one or more processors 102 (only one is shown in fig. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data, and may optionally further include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only illustrative and does not limit the structure of the mobile terminal. For example, the mobile terminal 10 may include more or fewer components than shown in fig. 1, or have a different configuration from that shown in fig. 1.
The memory 104 may be used to store a computer program, for example a software program and module of application software, such as a computer program corresponding to the progressive speech deep neural network training method according to an embodiment of the present invention; the processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, i.e., to implement the method. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal 10. In one example, the transmission device 106 includes a Network adapter (NIC), which can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
In this embodiment, a progressive speech deep neural network training method is also provided. Fig. 2 is a flowchart of a progressive speech deep neural network training method according to an embodiment of the present invention. As shown in fig. 2, the progressive speech deep neural network training method includes the following steps:
Step S202: a mixed speech sample and a target sample speech are acquired, where the mixed speech sample includes the target speech and noise speech.
Step S204: the mixed speech sample is input into a preset speech deep neural network model to obtain a predicted target speech, wherein the preset speech deep neural network model comprises a progressive extractor, a reconstructor and an encoder; the encoder performs feature extraction on the mixed speech to obtain a first feature, the progressive extractor calculates a high-dimensional mapping relation feature from the first feature, and the reconstructor obtains the predicted target speech in the mixed speech sample from the high-dimensional mapping relation feature.
Step S206: the preset speech deep neural network model is determined as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies the preset condition.
In this embodiment, the invention addresses the problem of extracting a target speaker against a background containing background noise, reverberation and interference from other speakers, using a single-channel speech separation algorithm that extracts the target speaker progressively. Compared with other algorithms of the same type, this algorithm progressively enhances the extracted features of the target speech, which greatly improves the speech separation accuracy in noisy and reverberant scenes, reduces the distortion after speech extraction, and improves the intelligibility of the speech.
The noise speech may include, but is not limited to, speech from other speakers conversing with the target user, and may also include other sounds in the environment.
According to the embodiment provided by the present application, a mixed speech sample and a target sample speech are acquired, wherein the mixed speech sample comprises the target speech and noise speech; the mixed speech sample is input into a preset speech deep neural network model to obtain a predicted target speech, wherein the preset speech deep neural network model comprises a progressive extractor, a reconstructor and an encoder, the encoder performs feature extraction on the mixed speech to obtain a first feature, the progressive extractor calculates a high-dimensional mapping relation feature from the first feature, and the reconstructor obtains the predicted target speech in the mixed speech sample from the high-dimensional mapping relation feature; and the preset speech deep neural network model is determined as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies a preset condition. Training a speech deep neural network model comprising a progressive extractor, an encoder and a reconstructor in this way solves the technical problem that the target speech cannot be effectively separated from mixed speech in the prior art.
Optionally, the encoder performing feature extraction on the mixed speech to obtain the first feature may include: inputting the mixed speech sample into the preset speech deep neural network model, and obtaining the first feature through the encoder's two convolutional layers, ReLU activation function and batch normalization.
Optionally, the progressive extractor calculating the high-dimensional mapping relation feature from the first feature may include: the progressive extractor comprises a plurality of progressive units, each progressive unit comprising a time-delay neural network, a ReLU activation function, batch normalization, a second time-delay neural network, a pooling layer, batch normalization and a graph convolution layer; and each element in the first feature is input into a corresponding progressive unit to obtain the high-dimensional mapping relation feature.
Optionally, inputting each element in the first feature into a corresponding progressive unit to obtain the high-dimensional mapping relation feature may include: denoting the first feature as H = {h0, …, hi, …, hM-1}, where i = 0 to M-1, and the M progressive units as J = {j0, …, ji, …, jM-1}; h0 is input into the first progressive unit to obtain the corresponding output p0; the sum of h1 and p0 is input into the second progressive unit to obtain the output p1 corresponding to the position of h1; the sum of h2 and p1 is input into the third progressive unit to obtain the output p2 corresponding to the position of h2; and the calculation proceeds in this way at each position until the final hM-1 is added to pM-2 to obtain the corresponding output pM-1, giving the high-dimensional mapping relation feature P = {p0, …, pM-1}.
Optionally, the reconstructor obtaining the predicted target speech in the mixed speech sample from the high-dimensional mapping relation feature may include: inputting the mapping relation P into the reconstructor, and obtaining the predicted target speech in the mixed speech sample after two convolutional network layers, a ReLU activation function and batch normalization.
Optionally, determining the preset speech deep neural network model as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies a preset condition may include: calculating the scale-invariant signal-to-noise ratio (SI-SNR) of the target sample speech and the predicted target speech, and determining the loss function from it; adjusting the weights and biases of the parameters of the preset speech deep neural network model by gradient descent according to the loss value of the loss function; and determining the preset speech deep neural network model as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies the preset condition.
As an alternative embodiment, the present application further provides a progressive speech extraction algorithm. The method comprises the following steps.
As shown in fig. 3, the overall structure of the progressive speech extraction network is illustrated. In this embodiment, the progressive speech extraction algorithm comprises a progressive extractor, an encoder and a reconstructor. As shown in fig. 4, the encoder mainly comprises two convolutional layers (CNN) and one pooling layer (Pooling). As shown in fig. 5, each progressive unit may comprise: a time-delay neural network, a ReLU activation function, batch normalization, a second time-delay neural network, a pooling layer, batch normalization and a graph convolution layer. As shown in fig. 6, the progressive extractor mainly comprises two time-delay neural network layers (TDNN), a pooling layer and a graph convolution network layer (GCN). As shown in fig. 7, the reconstructor mainly consists of two deconvolution network layers (DCNN). The method mainly comprises the following parts:
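As an illustration only, the overall composition of fig. 3 could be organised as follows in PyTorch; the component modules are sketched in the later sections, and none of the class or parameter names below come from the patent itself — they are assumptions.

```python
import torch.nn as nn

class ProgressiveSpeechNet(nn.Module):
    """Overall composition of fig. 3: encoder -> progressive extractor -> reconstructor."""
    def __init__(self, encoder, extractor, reconstructor):
        super().__init__()
        self.encoder = encoder              # fig. 4: two CNN layers + pooling
        self.extractor = extractor          # figs. 5-6: progressive units (TDNN / pooling / GCN)
        self.reconstructor = reconstructor  # fig. 7: two deconvolution (DCNN) layers

    def forward(self, mixture):
        h = self.encoder(mixture)           # first feature H extracted from the mixed speech
        p = self.extractor(h)               # high-dimensional mapping relation feature P
        return self.reconstructor(p)        # predicted target speech
```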
The first part: preprocessing the mixed speech samples required for training and testing;
The second part: training the constructed progressive extraction deep neural network with the loss function to obtain the progressive extraction deep neural network model;
The third part: preprocessing the speech sample to be tested and performing speech separation with the trained progressive extraction deep neural network model to obtain the separation result.
Each of the portions will be described in detail below.
The first part specifically includes:
Step 1: the time-domain signals of the speech samples and noise samples are resampled to 8 kHz; the voices of different speakers are randomly mixed at signal-to-noise ratios between 0 and 5 dB, and the result is mixed with randomly selected noise samples at signal-to-noise ratios between -6 and 3 dB; reverberation is then applied for different room and microphone conditions according to a room response function, giving the final mixed speech signal y.
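A minimal sketch of the mixing in step 1, assuming time-domain NumPy arrays already resampled to 8 kHz. The SNR ranges come from the step above; the scaling formula, function name and placeholder signals are illustrative assumptions, and the room-impulse-response reverberation step is omitted here.

```python
import numpy as np

def mix_at_snr(target, interference, snr_db):
    """Scale `interference` so the target-to-interference power ratio equals snr_db, then add them."""
    n = min(len(target), len(interference))
    target, interference = target[:n], interference[:n]
    p_t = np.mean(target ** 2) + 1e-8
    p_i = np.mean(interference ** 2) + 1e-8
    scale = np.sqrt(p_t / (p_i * 10.0 ** (snr_db / 10.0)))
    return target + scale * interference

rng = np.random.default_rng(0)
s1, s2, noise = (rng.standard_normal(8000 * 4) for _ in range(3))  # placeholders for 4 s clips at 8 kHz
spk_mix = mix_at_snr(s1, s2, rng.uniform(0.0, 5.0))     # speaker-vs-speaker SNR drawn from 0-5 dB
y = mix_at_snr(spk_mix, noise, rng.uniform(-6.0, 3.0))  # speech-vs-noise SNR drawn from -6 to 3 dB
```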
and 2, dividing the whole database obtained in the step into a training set, a verification set and a test set. The mixed voice is used as the input of the advanced extraction deep neural network, and one speaker voice in the mixed voice is used as the training target of the network.
The second part specifically comprises:
step 1, establishing a progressive extraction deep neural network model, which comprises an encoder, a progressive extractor and a reconstructor. The encoder is composed of two convolutional layers (CNN) and one Pooling layer (Pooling), as shown in fig. 4. The progressive extractor consists of two layers of time-delay neural network (TDNN), one pooling layer and one Graph Convolution Network (GCN), as shown in fig. 6. The reconstructor is made up of two deconvolution network layers (DCNN), as shown in fig. 7.
Step 2: the parameters of the progressive extraction deep neural network are randomly initialized, including the weights and biases between network neuron nodes.
Step 3: the deep neural network performs forward propagation. During forward propagation, activation functions introduce non-linearity between layers, so that a non-linear mapping between the input and the output is finally produced.
Step 4: the deep neural network is trained in a supervised manner using the parameters initialized in step 2 and the network training targets of the first part. In this embodiment, the weights and biases are updated by back-propagation with gradient descent using a loss function, which is:
loss = -10·log10(||s_target||² / ||e_noise||²), where s_target = (<ŝ, s> / ||s||²)·s and e_noise = ŝ - s_target,
wherein s is the ideal target voice, ŝ is the estimated target voice, <·,·> represents the dot product between two vectors, and ||·|| represents the Euclidean norm.
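A sketch of this loss in PyTorch, following the standard scale-invariant SNR definition that the formula above corresponds to; the zero-mean normalisation and the small epsilon are common practice and are assumptions here rather than details given by the patent.

```python
import torch

def si_snr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SNR between estimated and ideal target speech (time-domain tensors)."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference: s_target = (<s_hat, s> / ||s||^2) * s
    dot = torch.sum(est * ref, dim=-1, keepdim=True)
    s_target = dot * ref / (torch.sum(ref ** 2, dim=-1, keepdim=True) + eps)
    e_noise = est - s_target
    ratio = torch.sum(s_target ** 2, dim=-1) / (torch.sum(e_noise ** 2, dim=-1) + eps)
    si_snr = 10.0 * torch.log10(ratio + eps)
    return -si_snr.mean()   # minimising this loss maximises the SI-SNR
```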
Step 5: the parameters of the deep neural network are updated by gradient descent:
a. within a given iteration, the parameters in the network are held fixed and the gradient of the loss function at the output layer is calculated;
b. the gradient corresponding to each layer is calculated in turn for layers l = L-1, L-2, …, 2;
c. the weights and biases of the entire network are updated.
Step 6: after training is completed, the deep neural network model is obtained from the training result.
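For illustration, steps 2 to 6 could be combined into a training loop along the following lines, using the si_snr_loss sketch above; the optimiser (plain SGD standing in for the gradient-descent update), learning rate, epoch count and data-loader format are all assumptions.

```python
import torch

def train(model, loader, epochs=50, lr=1e-3):
    """Supervised training of the progressive extraction network with the SI-SNR loss above."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)   # gradient-descent update of weights and biases
    for _ in range(epochs):
        for mixture, target in loader:          # (y, target-speaker speech) pairs from the training set
            estimate = model(mixture)           # forward propagation
            loss = si_snr_loss(estimate, target)
            optimizer.zero_grad()
            loss.backward()                     # back-propagate gradients layer by layer
            optimizer.step()                    # update weights and biases
    return model
```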
Encoder section: the mixed audio y is input to the network, and preliminary feature extraction of the target speech is performed through a two-layer convolutional network, a ReLU activation function and batch normalization (BN), giving H = {h0, …, hi, …, hM-1}, where i = 0 to M-1 and M is the output length corresponding to the last network layer of the encoder.
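A possible realisation of this encoder section (fig. 4); the channel count, kernel sizes, strides and the pooling kernel are not specified by the patent and are assumptions.

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Two 1-D convolutions, each with ReLU + batch normalisation, followed by pooling (fig. 4)."""
    def __init__(self, out_channels=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, out_channels, kernel_size=16, stride=8),
            nn.ReLU(),
            nn.BatchNorm1d(out_channels),
            nn.Conv1d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.BatchNorm1d(out_channels),
            nn.AvgPool1d(kernel_size=2),       # pooling layer of fig. 4
        )

    def forward(self, y):                      # y: (batch, 1, samples) mixed waveform
        return self.net(y)                     # H: (batch, out_channels, M)
```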
Progressive extractor section: this part is composed of a plurality of progressive units (the overall extractor structure is shown in fig. 6), and the calculation within each unit is shown in fig. 5. Each progressive unit comprises: a time-delay neural network, a ReLU activation function, batch normalization, a second time-delay neural network, a pooling layer, batch normalization and a graph convolution layer. H is input to this module: h0 enters the first progressive unit directly, giving the corresponding output p0; the sum of h1 and p0 enters a progressive unit for calculation, giving the output p1 corresponding to the position of h1; h2 is added to p1 and enters a progressive unit, giving the output p2 corresponding to the position of h2; each subsequent position is calculated in the same way until the final hM-1 is added to pM-2, giving the corresponding output pM-1. All the output results together form the high-dimensional extraction mapping P = {p0, …, pM-1} corresponding to the target voice.
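A sketch of the progressive extractor (figs. 5 and 6). The TDNN layers are approximated by 1-D convolutions, the graph convolution layer is replaced by a pointwise-convolution placeholder because the patent does not fix a graph structure, and how H is split into elements (frames or chunks) is left open; all sizes are assumptions.

```python
import torch
import torch.nn as nn

class ProgressiveUnit(nn.Module):
    """Stand-in for the fig. 5 unit: TDNN, ReLU, BN, TDNN, pooling, BN, graph convolution."""
    def __init__(self, channels):
        super().__init__()
        self.tdnn1 = nn.Conv1d(channels, channels, 3, padding=1)              # TDNN approximated by a 1-D conv
        self.tdnn2 = nn.Conv1d(channels, channels, 3, padding=2, dilation=2)  # second TDNN with wider context
        self.bn1 = nn.BatchNorm1d(channels)
        self.bn2 = nn.BatchNorm1d(channels)
        self.pool = nn.AvgPool1d(kernel_size=3, stride=1, padding=1)          # length-preserving pooling
        self.gcn = nn.Conv1d(channels, channels, 1)                           # placeholder for the graph conv layer

    def forward(self, x):                    # x: (batch, channels, frames), one element h_i (plus p_{i-1})
        x = self.bn1(torch.relu(self.tdnn1(x)))
        x = self.bn2(self.pool(self.tdnn2(x)))
        return self.gcn(x)                   # p_i, same shape as the input

class ProgressiveExtractor(nn.Module):
    """Fig. 6: one progressive unit per element of H, each fed h_i plus the previous output p_{i-1}."""
    def __init__(self, channels, num_units):
        super().__init__()
        self.units = nn.ModuleList(ProgressiveUnit(channels) for _ in range(num_units))

    def forward(self, elements):             # elements: list [h_0, ..., h_{M-1}] of equally shaped tensors
        outputs, prev = [], None
        for h_i, unit in zip(elements, self.units):
            p_i = unit(h_i if prev is None else h_i + prev)
            outputs.append(p_i)
            prev = p_i
        return outputs                       # P = {p_0, ..., p_{M-1}}
```

The explicit loop makes the progressive accumulation visible: each unit refines the sum of the current element and the previous unit's output, which is the step-by-step enhancement described above.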
Reconstructor section: P is input to this module, and after two deconvolution network layers, a ReLU activation function and batch normalization, the estimated speech ŝ corresponding to each speaker is obtained.
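A possible realisation of the reconstructor (fig. 7); the channel count, kernel sizes and stride are assumptions.

```python
import torch.nn as nn

class Reconstructor(nn.Module):
    """Fig. 7: two transposed (de)convolution layers with ReLU + batch normalisation."""
    def __init__(self, in_channels=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(in_channels, in_channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.BatchNorm1d(in_channels),
            nn.ConvTranspose1d(in_channels, 1, kernel_size=16, stride=8),
        )

    def forward(self, p):                    # p: (batch, in_channels, M) high-dimensional mapping P
        return self.net(p)                   # estimated target waveform (batch, 1, samples)
```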
The speech reconstruction operation of the third part is: the speech samples to be tested from the first part are input into the trained progressive extraction and separation network model, and the speech separation result for the target speaker is obtained directly by calculation.
Through the embodiment provided by the present application, the single-channel speech separation algorithm that extracts the target speaker progressively addresses the difficulty of extracting the target speaker's voice and the attenuation of the separation performance under background noise, reverberation and interference from other speakers. Compared with other single-channel speech separation methods, it can progressively and effectively extract the useful information of the target speech, improve the accuracy of speech separation, reduce the distortion rate of the speech and improve its intelligibility.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention or portions thereof contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (which may be a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
This embodiment also provides a progressive speech deep neural network training apparatus, which is used to implement the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 8 is a block diagram of a progressive speech deep neural network training apparatus according to an embodiment of the present invention. As shown in fig. 8, the progressive speech deep neural network training apparatus includes:
an obtaining unit 81 is configured to obtain a mixed voice sample and a target sample voice, where the mixed voice sample includes the target voice and a noise voice.
The prediction unit 83 is configured to input the mixed speech sample into a preset speech deep neural network model to obtain a predicted target speech, where the preset speech deep neural network model includes a progressive extractor, a reconstructor and an encoder; the encoder performs feature extraction on the mixed speech to obtain a first feature, the progressive extractor calculates a high-dimensional mapping relation feature from the first feature, and the reconstructor obtains the predicted target speech in the mixed speech sample from the high-dimensional mapping relation feature.
The determining unit 85 is configured to determine the preset speech deep neural network model as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies a preset condition.
According to the embodiment provided by the present application, the obtaining unit 81 obtains a mixed speech sample and a target sample speech, wherein the mixed speech sample comprises the target speech and noise speech; the prediction unit 83 inputs the mixed speech sample into a preset speech deep neural network model to obtain a predicted target speech, wherein the preset speech deep neural network model comprises a progressive extractor, a reconstructor and an encoder, the encoder performs feature extraction on the mixed speech to obtain a first feature, the progressive extractor calculates a high-dimensional mapping relation feature from the first feature, and the reconstructor obtains the predicted target speech in the mixed speech sample from the high-dimensional mapping relation feature; and the determining unit 85 determines the preset speech deep neural network model as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies a preset condition. Training a speech deep neural network model comprising a progressive extractor, an encoder and a reconstructor in this way solves the technical problem that the target speech cannot be effectively separated from mixed speech in the prior art.
Optionally, the prediction unit 83 may include: an encoding module, used for inputting the mixed speech sample into the preset speech deep neural network model and obtaining the first feature through the encoder's two convolutional layers, ReLU activation function and batch normalization.
Optionally, the prediction unit 83 may be further configured to perform the following operations: the progressive extractor comprises a plurality of progressive units, each progressive unit comprising a time-delay neural network, a ReLU activation function, batch normalization, a second time-delay neural network, a pooling layer, batch normalization and a graph convolution layer; and each element in the first feature is input into a corresponding progressive unit to obtain the high-dimensional mapping relation feature.
Optionally, the prediction unit 83 may be further configured to perform the following operations: denoting the first feature as H = {h0, …, hi, …, hM-1}, where i = 0 to M-1, and the M progressive units as J = {j0, …, ji, …, jM-1}; h0 is input into the first progressive unit to obtain the corresponding output p0; the sum of h1 and p0 is input into the second progressive unit to obtain the output p1 corresponding to the position of h1; the sum of h2 and p1 is input into the third progressive unit to obtain the output p2 corresponding to the position of h2; and the calculation proceeds in this way at each position until the final hM-1 is added to pM-2 to obtain the corresponding output pM-1, giving the high-dimensional mapping relation feature P = {p0, …, pM-1}.
Optionally, the prediction unit 83 may be further configured to perform the following operation: inputting the mapping relation P into the reconstructor, and obtaining the predicted target speech in the mixed speech sample after two convolutional network layers, a ReLU activation function and batch normalization.
Optionally, the determining unit 85 may include: a calculation module, used for calculating the scale-invariant signal-to-noise ratio (SI-SNR) of the target sample speech and the predicted target speech and determining the loss function from it; an adjusting module, used for adjusting the weights and biases of the parameters of the preset speech deep neural network model by gradient descent according to the loss value of the loss function; and a determining module, used for determining the preset speech deep neural network model as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies a preset condition.
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.
Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:
S1: a mixed speech sample and a target sample speech are acquired, wherein the mixed speech sample comprises the target speech and noise speech;
S2: the mixed speech sample is input into a preset speech deep neural network model to obtain a predicted target speech, wherein the preset speech deep neural network model comprises a progressive extractor, a reconstructor and an encoder, the encoder performs feature extraction on the mixed speech to obtain a first feature, the progressive extractor calculates a high-dimensional mapping relation feature from the first feature, and the reconstructor obtains the predicted target speech in the mixed speech sample from the high-dimensional mapping relation feature;
S3: the preset speech deep neural network model is determined as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies the preset condition.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
S1: a mixed speech sample and a target sample speech are acquired, wherein the mixed speech sample comprises the target speech and noise speech;
S2: the mixed speech sample is input into a preset speech deep neural network model to obtain a predicted target speech, wherein the preset speech deep neural network model comprises a progressive extractor, a reconstructor and an encoder, the encoder performs feature extraction on the mixed speech to obtain a first feature, the progressive extractor calculates a high-dimensional mapping relation feature from the first feature, and the reconstructor obtains the predicted target speech in the mixed speech sample from the high-dimensional mapping relation feature;
S3: the preset speech deep neural network model is determined as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies the preset condition.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A progressive speech deep neural network training method, characterized by comprising the following steps:
acquiring a mixed voice sample and target sample voice, wherein the mixed voice sample comprises the target voice and noise voice;
inputting the mixed speech sample into a preset speech deep neural network model to obtain a predicted target speech, wherein the preset speech deep neural network model comprises a progressive extractor, a reconstructor and an encoder, the encoder is used for performing feature extraction on the mixed speech to obtain a first feature, the progressive extractor is used for calculating a high-dimensional mapping relation feature from the first feature, and the reconstructor is used for obtaining the predicted target speech in the mixed speech sample from the high-dimensional mapping relation feature;
determining the preset speech deep neural network model as a target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies a preset condition;
wherein the encoder performing feature extraction on the mixed speech to obtain the first feature comprises:
inputting the mixed speech sample into the preset speech deep neural network model, and obtaining the first feature through the encoder's two convolutional layers, ReLU activation function and batch normalization.
2. The method of claim 1, wherein the progressive extractor calculating the high-dimensional mapping relation feature from the first feature comprises:
the progressive extractor comprises a plurality of progressive units, each progressive unit comprising: a time-delay neural network, a ReLU activation function, batch normalization, a second time-delay neural network, a pooling layer, batch normalization and a graph convolution layer;
and inputting each element in the first feature into a corresponding progressive unit to obtain the high-dimensional mapping relation feature.
3. The method according to claim 2, wherein inputting each element in the first feature into a corresponding progressive unit to obtain the high-dimensional mapping relation feature comprises:
denoting the first feature as H = {h0, …, hi, …, hM-1}, where i = 0 to M-1, and the M progressive units as J = {j0, …, ji, …, jM-1};
inputting h0 into the first progressive unit to obtain the corresponding output p0;
inputting the sum of h1 and p0 into the second progressive unit for calculation to obtain the output p1 corresponding to the position of h1;
adding h2 and p1 and inputting the sum into the third progressive unit to obtain the output p2 corresponding to the position of h2;
and continuing the calculation at each position in the same way until the final hM-1 is added to pM-2 to obtain the corresponding output pM-1, giving the high-dimensional mapping relation feature P = {p0, …, pM-1}.
4. The method according to claim 3, wherein the reconstructor obtaining the predicted target speech in the mixed speech sample from the high-dimensional mapping relation feature comprises:
inputting the mapping relation P into the reconstructor, and obtaining the predicted target speech in the mixed speech sample after two convolutional network layers, a ReLU activation function and batch normalization.
5. The method according to claim 1, wherein determining the preset speech deep neural network model as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies a preset condition comprises:
calculating the scale-invariant signal-to-noise ratio (SI-SNR) of the target sample speech and the predicted target speech, and determining the loss function from the scale-invariant signal-to-noise ratio;
adjusting the weights and biases of the parameters of the preset speech deep neural network model by gradient descent according to the loss value of the loss function;
and determining the preset speech deep neural network model as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies the preset condition.
6. A progressive speech deep neural network training device, characterized by comprising:
an acquisition unit, used for acquiring a mixed speech sample and a target sample speech, wherein the mixed speech sample comprises the target speech and noise speech;
a prediction unit, used for inputting the mixed speech sample into a preset speech deep neural network model to obtain a predicted target speech, wherein the preset speech deep neural network model comprises a progressive extractor, a reconstructor and an encoder, the encoder is used for performing feature extraction on the mixed speech to obtain a first feature, the progressive extractor is used for calculating a high-dimensional mapping relation feature from the first feature, and the reconstructor is used for obtaining the predicted target speech in the mixed speech sample from the high-dimensional mapping relation feature;
a determining unit, used for determining the preset speech deep neural network model as a target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies a preset condition;
wherein the prediction unit comprises:
an encoding module, used for inputting the mixed speech sample into the preset speech deep neural network model and obtaining the first feature through the encoder's two convolutional layers, ReLU activation function and batch normalization.
7. The apparatus of claim 6, wherein the prediction unit is further configured to:
the progressive extractor comprises a plurality of progressive units, each progressive unit comprising: a time-delay neural network, a ReLU activation function, batch normalization, a second time-delay neural network, a pooling layer, batch normalization and a graph convolution layer;
and inputting each element in the first feature into a corresponding progressive unit to obtain the high-dimensional mapping relation feature.
8. The apparatus of claim 7, wherein the prediction unit is further configured to:
denoting the first feature as H = {h0, …, hi, …, hM-1}, where i = 0 to M-1, and the M progressive units as J = {j0, …, ji, …, jM-1};
inputting h0 into the first progressive unit to obtain the corresponding output p0;
inputting the sum of h1 and p0 into the second progressive unit for calculation to obtain the output p1 corresponding to the position of h1;
adding h2 and p1 and inputting the sum into the third progressive unit to obtain the output p2 corresponding to the position of h2;
and continuing the calculation at each position in the same way until the final hM-1 is added to pM-2 to obtain the corresponding output pM-1, giving the high-dimensional mapping relation feature P = {p0, …, pM-1}.
9. The apparatus of claim 8, wherein the prediction unit is further configured to:
and inputting the mapping relation P into the reconstructor, and obtaining the predicted target speech in the mixed speech sample after two convolutional network layers, a ReLU activation function and batch normalization.
10. The apparatus of claim 6, wherein the determining unit comprises:
a calculation module, used for calculating the scale-invariant signal-to-noise ratio (SI-SNR) of the target sample speech and the predicted target speech and determining the loss function from it;
an adjusting module, used for adjusting the weights and biases of the parameters of the preset speech deep neural network model by gradient descent according to the loss value of the loss function;
and a determining module, used for determining the preset speech deep neural network model as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies a preset condition.
CN202210116109.6A 2022-02-07 2022-02-07 Progressive type based speech deep neural network training method and device Active CN114155883B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210116109.6A CN114155883B (en) 2022-02-07 2022-02-07 Progressive type based speech deep neural network training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210116109.6A CN114155883B (en) 2022-02-07 2022-02-07 Progressive type based speech deep neural network training method and device

Publications (2)

Publication Number Publication Date
CN114155883A CN114155883A (en) 2022-03-08
CN114155883B (en) 2022-12-02

Family

ID=80450374

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210116109.6A Active CN114155883B (en) 2022-02-07 2022-02-07 Progressive type based speech deep neural network training method and device

Country Status (1)

Country Link
CN (1) CN114155883B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107221320A (en) * 2017-05-19 2017-09-29 百度在线网络技术(北京)有限公司 Train method, device, equipment and the computer-readable storage medium of acoustic feature extraction model
US11257507B2 (en) * 2019-01-17 2022-02-22 Deepmind Technologies Limited Speech coding using content latent embedding vectors and speaker latent embedding vectors
KR102294639B1 (en) * 2019-07-16 2021-08-27 한양대학교 산학협력단 Deep neural network based non-autoregressive speech synthesizer method and system using multiple decoder
CN112382308A (en) * 2020-11-02 2021-02-19 天津大学 Zero-order voice conversion system and method based on deep learning and simple acoustic features

Also Published As

Publication number Publication date
CN114155883A (en) 2022-03-08

Similar Documents

Publication Publication Date Title
JP6671020B2 (en) Dialogue act estimation method, dialogue act estimation device and program
WO2021143327A1 (en) Voice recognition method, device, and computer-readable storage medium
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
US10964337B2 (en) Method, device, and storage medium for evaluating speech quality
CN110223673B (en) Voice processing method and device, storage medium and electronic equipment
Pawar et al. Convolution neural network based automatic speech emotion recognition using Mel-frequency Cepstrum coefficients
CN113053407B (en) Single-channel voice separation method and system for multiple speakers
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
CN111862951B (en) Voice endpoint detection method and device, storage medium and electronic equipment
CN108877823A (en) Sound enhancement method and device
CN113314119B (en) Voice recognition intelligent household control method and device
CN113241064B (en) Speech recognition, model training method and device, electronic equipment and storage medium
CN111785288A (en) Voice enhancement method, device, equipment and storage medium
CN110751960B (en) Method and device for determining noise data
CN115457938A (en) Method, device, storage medium and electronic device for identifying awakening words
CN113555007B (en) Voice splicing point detection method and storage medium
JP6910002B2 (en) Dialogue estimation method, dialogue activity estimation device and program
WO2018001125A1 (en) Method and device for audio recognition
CN114155883B (en) Progressive type based speech deep neural network training method and device
CN106971731B (en) Correction method for voiceprint recognition
CN111833897B (en) Voice enhancement method for interactive education
CN110164418B (en) Automatic speech recognition acceleration method based on convolution grid long-time memory recurrent neural network
CN114067785B (en) Voice deep neural network training method and device, storage medium and electronic device
CN113707172A (en) Single-channel voice separation method, system and computer equipment of sparse orthogonal network
CN112735394A (en) Semantic parsing method and device for voice

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant