CN112259120B - Single-channel human voice and background voice separation method based on convolution cyclic neural network - Google Patents

Single-channel human voice and background voice separation method based on convolution cyclic neural network

Info

Publication number
CN112259120B
Authority
CN
China
Prior art keywords
voice
neural network
time
background
human voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011119804.5A
Other languages
Chinese (zh)
Other versions
CN112259120A (en)
Inventor
孙超 (Sun Chao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Guiji Intelligent Technology Co ltd
Original Assignee
Nanjing Guiji Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Guiji Intelligent Technology Co ltd filed Critical Nanjing Guiji Intelligent Technology Co ltd
Priority to CN202011119804.5A priority Critical patent/CN112259120B/en
Publication of CN112259120A publication Critical patent/CN112259120A/en
Application granted granted Critical
Publication of CN112259120B publication Critical patent/CN112259120B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272: Voice signal separating
    • G10L21/0308: Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • G10L25/45: Speech or voice analysis techniques characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Filters That Use Time-Delay Elements (AREA)

Abstract

The invention discloses a method for separating single-channel human voice from background voice based on a convolution cyclic neural network, which comprises the following steps: S1, acquiring an original mixed voice signal; S2, obtaining the amplitude spectrum and the phase spectrum of the original mixed signal; S3, inputting the original mixed signal amplitude spectrum into a convolutional neural network; S4, inputting the resulting low-resolution feature map together with the original mixed signal amplitude spectrum into a recurrent neural network and applying a time-frequency mask to obtain the predicted value of the human voice after the time-frequency mask and the predicted value of the background voice after the time-frequency mask; and S5, combining the predicted value of the human voice after the time-frequency mask and the predicted value of the background voice after the time-frequency mask, respectively, with the phase spectrum of the original mixed signal to obtain a predicted human voice signal and a predicted background voice signal. Compared with the prior art, the separation method provided by the invention captures both the time-domain and frequency-domain information of the voice and generates multi-scale features to separate the human voice signal and the background voice signal of the mixed voice.

Description

Single-channel human voice and background voice separation method based on convolution cyclic neural network
Technical Field
The invention relates to human voice and background voice separation, in particular to a single-channel human voice and background voice separation method based on a convolution cyclic neural network.
Background
The purpose of voice separation is to separate the target voice from background interference. Since the voice collected by a microphone may contain noise, other speakers' voices, background music and other interference, performing recognition directly without separation may reduce recognition accuracy. Source separation therefore has important value in speech signal processing and automatic speech recognition, and the single-channel separation of human voice from background music is a basic and important branch of voice separation.
In recent years, with improvements in software and hardware performance and the popularization of machine learning algorithms, deep learning has gradually achieved very strong results in fields such as natural language processing and image processing. Deep-learning-based speech separation learns the characteristics of speech, speakers and noise from training data and constructs an overall neural network to achieve the goal of speech separation. Voice information is embodied simultaneously in the time domain and the frequency domain, and both the time-domain and frequency-domain information of voice are valuable feature information. For voice separation, however, most deep learning methods rely on a single convolutional neural network or a single recurrent neural network, there is no unified general framework, the time-domain and frequency-domain information in the mixed voice cannot be accurately extracted, and the separation of the human voice and the background voice of the mixed voice is poor.
Disclosure of Invention
The invention aims to overcome the defects that the prior art cannot accurately extract the time-domain and frequency-domain information in voice and that the separation of human voice and background sound in mixed voice is poor, and provides a single-channel human voice and background voice separation method based on a convolution cyclic neural network.
The purpose of the invention is mainly realized by the following technical scheme:
a single-channel human voice and background voice separation method based on a convolution cyclic neural network comprises the following steps:
s1, acquiring an original mixed voice signal, wherein the original mixed voice signal is a single-channel mixed signal of human voice and background voice;
s2, performing framing windowing and time-frequency conversion on the obtained original mixed voice signal to obtain an original mixed signal amplitude spectrum and an original mixed signal phase spectrum;
s3, inputting the original mixed signal amplitude spectrum into a convolutional neural network, wherein the convolutional neural network comprises a convolutional layer and a pooling layer which are sequentially arranged; the convolution layer obtains local characteristics of an original mixed signal amplitude spectrum, and the pooling layer reduces the dimension of the characteristics, converts the characteristics into a low-resolution characteristic diagram and outputs the characteristic diagram; the convolutional layer comprises two layers, and convolutional kernels in the two layers of convolutional layers are different in size;
s4, inputting the low-resolution feature map and the original mixed signal amplitude spectrum into a recurrent neural network, and combining a time-frequency mask to obtain a predicted value of human voice after the human voice passes through the time-frequency mask and a predicted value of background voice after the background voice passes through the time-frequency mask;
s5, respectively combining the predicted value of the human voice after passing through the time-frequency mask and the predicted value of the background voice after passing through the time-frequency mask with the phase spectrum of the original mixed signal, and respectively obtaining a predicted human voice signal and a predicted background voice signal through inverse Fourier transform;
and the convolutional neural network and the cyclic neural network are both provided with original mixed signal amplitude spectrum channels.
In the prior art, when deep learning is adopted for voice separation, the time-domain and frequency-domain information of the voice in the mixed voice cannot be accurately extracted, and the mixed voice separation effect is poor. To solve this technical problem, the technical scheme provides a voice separation method: a convolutional neural network serves as the front end and receives the single-channel mixed signal of human voice and background voice, its purpose being to reduce the dimensionality of the spectrogram and extract its local features; the extracted features are then combined with the original spectrogram into multi-scale features and sent together into a back-end recurrent neural network; finally, a time-frequency mask is used to obtain the predicted spectrograms of the human voice and the background music. According to the characteristics of audio signals, two convolution kernels of different sizes are designed in the convolutional neural network in order to capture the context information of the time domain and the frequency domain from the input spectrogram, so that the human voice signal and the background voice signal of the mixed voice can be accurately separated. When the convolution operation is performed on the original mixed signal amplitude spectrum, the first convolution acts on the time domain and the frequency domain of the spectrogram separately, and the two resulting features are fused immediately after the first convolution is completed to facilitate the subsequent convolution operations. As the number of convolutional layers increases, the network becomes deeper, so the fused time-domain and frequency-domain features are further expressed at a deeper level. Specifically, on the basis of the obtained audio-signal features, the technical scheme adds a direct connection channel and an original mixed signal amplitude spectrum channel, and sends the "multi-scale information" formed by combining the extracted features with the original spectrogram into the back-end recurrent neural network. Here, scale refers to the resolution of the image: the resolution of the original spectrogram is 513 × 64, and the image after the convolution and pooling operations becomes 256 × 64, i.e., unchanged in the time dimension and halved in the frequency dimension. The two are combined by aligning the time dimension and concatenating along the frequency dimension, so the combined dimension is (513 + 256) × 64. This increases the input of lower-resolution features while maintaining the integrity of the original overall feature information, reflecting the complementarity of the two. In the design of the convolutional neural network, in order to compress the amount of data and the number of parameters, a pooling layer is used after the convolutional layers for feature dimensionality reduction, which reduces overfitting of the network and improves the generalization of the model. The neural network model of the technical scheme does not change the phase of the original voice spectrogram; instead, the predicted value of the human voice after passing through the time-frequency mask and the predicted value of the background voice after passing through the time-frequency mask are respectively combined with the phase spectrum of the original mixed signal, the predicted human voice signal and the predicted background voice signal are obtained through the inverse Fourier transform, and the time-frequency mask adds the constraint that the sum of the prediction results equals the original mixed signal. In summary, the technical scheme designs two filters of different shapes in the convolutional neural network to capture the time-domain and frequency-domain information of the voice, uses a pooling layer to reduce the feature dimensionality and extract local features, and combines these features with the original mixed signal amplitude spectrum into multi-scale features that are input into the recurrent neural network, so that the human voice signal and the background voice signal of the mixed voice can be accurately separated.
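The following is a minimal NumPy sketch of the multi-scale combination step just described, assuming arrays shaped (frequency, time); the variable names are illustrative only and are not taken from the patent.

import numpy as np

# Shapes follow the text: the original magnitude spectrogram is 513 x 64
# (frequency x time) and the convolution + pooling output is 256 x 64.
original_spec = np.random.rand(513, 64).astype(np.float32)    # original amplitude spectrum
pooled_features = np.random.rand(256, 64).astype(np.float32)  # low-resolution feature map

# The time axes are already aligned (both have 64 frames), so the two inputs
# are stacked along the frequency axis, giving a (513 + 256) x 64 = 769 x 64
# input for the recurrent back end.
multi_scale = np.concatenate([original_spec, pooled_features], axis=0)
assert multi_scale.shape == (769, 64)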
It should be noted that in the technical scheme the time-frequency conversion adopts the short-time Fourier transform, and the structure adopted by the recurrent neural network is the GRU; the GRU model is simpler than the standard LSTM model, has fewer parameters and is less prone to overfitting. The convolution cyclic neural network is a neural network that jointly employs a convolutional neural network and a recurrent neural network.
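A minimal PyTorch sketch of such a GRU back end is given below; the hidden size, the number of layers and the two linear output heads are illustrative assumptions, since the text only states that a GRU is used and that the network produces outputs for the human voice and the background sound.

import torch
import torch.nn as nn

class RecurrentBackEnd(nn.Module):
    def __init__(self, input_dim=769, hidden_dim=512, num_layers=2):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, num_layers, batch_first=True)
        # Two linear heads: one produces the network output for the human
        # voice, the other for the background sound (513 frequency bins each).
        self.voice_head = nn.Linear(hidden_dim, 513)
        self.background_head = nn.Linear(hidden_dim, 513)

    def forward(self, x):                  # x: (batch, time, 769) multi-scale features
        h, _ = self.gru(x)                 # h: (batch, time, hidden_dim)
        return self.voice_head(h), self.background_head(h)

# Example: a batch of 8 utterances, 64 frames of 769-dimensional features.
z1, z2 = RecurrentBackEnd()(torch.randn(8, 64, 769))
print(z1.shape, z2.shape)                  # torch.Size([8, 64, 513]) for both outputs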
Furthermore, the convolution kernels in the two convolutional layers are rectangular strip-shaped filters. Because an ordinary square convolution kernel cannot make good use of the time-frequency feature information of audio, the convolution kernels in the technical scheme are two rectangular strip-shaped filters.
Further, the convolution kernel size of the first convolutional layer is 2 × 10, and the convolution kernel size of the second convolutional layer is 10 × 2. When processing sequential data such as voice, a common convolution kernel such as 3 × 3 cannot make full and effective use of the characteristics of voice: a 3 × 3 kernel is intended for ordinary images, whose horizontal and vertical axes have no specific physical meaning, whereas in the amplitude spectrum of the original mixed signal the horizontal coordinate represents time and the vertical coordinate represents frequency. The technical scheme therefore uses the 2 × 10 convolution kernel to extract the frequency-domain features of the voice signal and the 10 × 2 convolution kernel to extract the time-domain features.
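A minimal PyTorch sketch of the two rectangular kernels is given below; the number of output channels and the fusion by channel concatenation are assumptions, since the text fixes only the kernel sizes.

import torch
import torch.nn as nn

# Two rectangular strip-shaped convolution kernels, 2 x 10 and 10 x 2.
freq_conv = nn.Conv2d(1, 16, kernel_size=(2, 10), padding=(1, 5))
time_conv = nn.Conv2d(1, 16, kernel_size=(10, 2), padding=(5, 1))

spec = torch.randn(1, 1, 513, 64)          # (batch, channel, frequency, time)
f_freq = freq_conv(spec)
f_time = time_conv(spec)

# Crop back to the input size (one extra row/column comes from the even
# kernel sizes) and fuse the two feature maps before the next layer.
fused = torch.cat([f_freq[..., :513, :64], f_time[..., :513, :64]], dim=1)
print(fused.shape)                          # torch.Size([1, 32, 513, 64])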
Further, a batch normalization layer is arranged after the two convolutional layers, and the batch normalization layer uses a Leaky-ReLU activation function, whose formula is as follows:

f(x) = x when x ≥ 0, and f(x) = x/a when x < 0,

where x is the independent variable and a is a fixed parameter in the interval (1, +∞). Setting a batch normalization layer shortens the training time of the model, so that the model converges more quickly.
Further, the convolution kernel size of the pooling layer is 2 × 1. In the technical scheme, setting the pooling kernel to 2 × 1 means that after the features pass through the pooling layer the time dimension is unchanged and the frequency dimension is halved. This reduces the data volume and continuously shrinks the spatial size of the data, thereby reducing the number of parameters and the amount of computation and controlling overfitting to a certain extent.
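The following sketch chains the batch normalization layer, the Leaky-ReLU activation and the 2 × 1 max pooling; the channel count and the value a = 5 are illustrative assumptions, since the text only requires a to lie in (1, +∞).

import torch
import torch.nn as nn

a = 5.0
stage = nn.Sequential(
    nn.BatchNorm2d(32),                       # 32 channels assumed from the fused conv output
    nn.LeakyReLU(negative_slope=1.0 / a),     # f(x) = x for x >= 0, x / a for x < 0
    nn.MaxPool2d(kernel_size=(2, 1)),         # halves the frequency axis, keeps the time axis
)

x = torch.randn(1, 32, 513, 64)               # (batch, channels, frequency, time)
y = stage(x)
print(y.shape)                                # torch.Size([1, 32, 256, 64])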
Further, the spectrogram size of the input of the convolutional neural network in S3 is 10 × 513, where 513=1024/2+1 and the sampling rate is 16000 Hz.
The original mixed speech signal read from the audio file is a one-dimensional array whose length is determined by the audio duration and the sampling rate. After framing and windowing, a number of frames are obtained; each frame undergoes a fast Fourier transform, converting the time-domain signal into a frequency-domain signal, and the frequency-domain results of all frames are stacked in time to obtain the spectrogram. The Fourier transform has N frequency points; due to its symmetry, N/2+1 points are taken when N is even and (N+1)/2 points when N is odd. In the technical scheme, 10-frame inputs are used to model and train the convolution cyclic neural network, the Fourier transform window length n_fft is set to 1024 points, and a 50% overlap is used to extract the spectral representation. The input to the neural network is therefore a spectrogram of size 10 × 513, where 513 = 1024/2 + 1, and the sampling rate is 16000 Hz.
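A minimal sketch of this time-frequency conversion (and of the phase-preserving reconstruction used in S5) is shown below using librosa; any STFT implementation with the same settings could be substituted, and the random array only stands in for a real mixture.

import numpy as np
import librosa

sr = 16000
mixture = np.random.randn(sr * 4).astype(np.float32)          # stand-in for a 4-second mixed signal

# n_fft = 1024 with 50% overlap (hop length 512) gives 1024/2 + 1 = 513 bins per frame.
stft = librosa.stft(mixture, n_fft=1024, hop_length=512, win_length=1024)
magnitude, phase = np.abs(stft), np.angle(stft)
print(magnitude.shape)                                          # (513, number_of_frames)

# S5: a predicted magnitude spectrum is recombined with the ORIGINAL phase and
# inverted; here the unmodified magnitude is used as a placeholder prediction.
predicted_magnitude = magnitude
reconstructed = librosa.istft(predicted_magnitude * np.exp(1j * phase),
                              hop_length=512, win_length=1024)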
Furthermore, an attention layer is arranged between the convolutional layers and the pooling layer of the convolutional neural network. The attention layer automatically learns the importance of each feature channel, increases the weight of useful feature channels according to this importance, and reduces the weight of feature channels that are of little use to the current task.
In the prior art, when a neural network is used for separation, network performance is generally improved along the spatial dimension, i.e., by combining multi-scale feature information or feature maps of different resolutions, while the relationship among feature channels receives little attention. Through research on mixed voice separation, the inventor found that the number of output channels of the network can be increased by setting the number of convolution kernels; however, not all channels are equally important, and too many redundant feature channels degrade the expressive capability of the network.
Further, the attention layer performs global pooling using the maximum pooling method.
In the technical scheme, the attention layer adopts maximum pooling for global pooling so that the weights corresponding to different feature channels are distinguishable. The convolutional recurrent neural network structure provided by the technical scheme can be divided into convolutional layers, an attention layer, a pooling layer and a recurrent layer, where the attention layer adopts global maximum pooling, the pooling layer adopts maximum pooling only, and the recurrent layer is the recurrent neural network.
Further, the convolutional neural network and the cyclic neural network use a mean square error loss function, as follows:

J_MSE = ‖ŷ₁(t) − y₁(t)‖² + ‖ŷ₂(t) − y₂(t)‖²

where ŷ₁(t) is the predicted value of the human voice after passing through the time-frequency mask, ŷ₂(t) is the predicted value of the background sound after passing through the time-frequency mask, and y₁(t) and y₂(t) respectively represent the real values of the human voice and the background sound; or a loss function combining the mean square error and the source-to-interference ratio is adopted, as follows:

J = ‖ŷ₁(t) − y₁(t)‖² + ‖ŷ₂(t) − y₂(t)‖² − γ(‖ŷ₁(t) − y₂(t)‖² + ‖ŷ₂(t) − y₁(t)‖²)

where γ is a hyper-parameter, ŷ₁(t) is the predicted value of the human voice after passing through the time-frequency mask, ŷ₂(t) is the predicted value of the background sound after passing through the time-frequency mask, and y₁(t) and y₂(t) respectively represent the real values of the human voice and the background sound. Here time t refers to the t-th frame.
Preferably, the technical scheme adopts the loss function combining the mean square error and the source-to-interference ratio, which not only makes the predicted human voice signal closer to the real human voice signal but also makes the predicted human voice signal contain less background signal; preferably, γ is 0.05.
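The following PyTorch sketch implements the two loss functions written above with γ = 0.05; the tensor shapes are illustrative assumptions, and the cross-term form of the combined loss follows the standard discriminative objective described here.

import torch

def mse_loss(y1_hat, y2_hat, y1, y2):
    # Mean square error between masked predictions and the real sources.
    return ((y1_hat - y1) ** 2).sum() + ((y2_hat - y2) ** 2).sum()

def mse_sir_loss(y1_hat, y2_hat, y1, y2, gamma=0.05):
    # Combined loss: same-source error minus gamma times the cross-source error.
    same = ((y1_hat - y1) ** 2).sum() + ((y2_hat - y2) ** 2).sum()
    cross = ((y1_hat - y2) ** 2).sum() + ((y2_hat - y1) ** 2).sum()
    return same - gamma * cross

y1_hat, y2_hat = torch.rand(8, 64, 513), torch.rand(8, 64, 513)
y1, y2 = torch.rand(8, 64, 513), torch.rand(8, 64, 513)
print(mse_loss(y1_hat, y2_hat, y1, y2), mse_sir_loss(y1_hat, y2_hat, y1, y2))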
Further, in S4, the predicted value of the human voice after passing through the time-frequency mask and the predicted value of the background sound after passing through the time-frequency mask are calculated as follows:

ŷ₁(t) = ( |z₁(t)| / (|z₁(t)| + |z₂(t)|) ) ⊙ X(t)

ŷ₂(t) = ( |z₂(t)| / (|z₁(t)| + |z₂(t)|) ) ⊙ X(t)

where ⊙ is defined as element-wise multiplication, ŷ₁(t) is the predicted value of the human voice after passing through the time-frequency mask, ŷ₂(t) is the predicted value of the background sound after passing through the time-frequency mask, X(t) is the amplitude spectrum of the original mixed signal, z₁(t) is the output for the human voice at time t predicted by the convolution cyclic neural network, z₂(t) is the output of the convolution cyclic neural network for the background sound at time t, |z₁(t)|/(|z₁(t)| + |z₂(t)|) is the time-frequency mask of the predicted value of the human voice, and |z₂(t)|/(|z₁(t)| + |z₂(t)|) is the time-frequency mask of the predicted value of the background sound. In the technical scheme, the time-frequency masking technique is used to further smooth the source separation result, so that the sum of the prediction results satisfies the constraint of being equal to the original mixture.
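A minimal PyTorch sketch of this masking step follows; z1 and z2 stand for the network outputs for the human voice and the background sound, and the small epsilon term is an implementation detail added here for numerical stability rather than part of the formula above.

import torch

def apply_masks(z1, z2, X, eps=1e-8):
    denom = z1.abs() + z2.abs() + eps
    m1 = z1.abs() / denom                   # time-frequency mask for the human voice
    m2 = z2.abs() / denom                   # time-frequency mask for the background sound
    y1_hat = m1 * X                         # element-wise multiplication with the mixture spectrum
    y2_hat = m2 * X
    return y1_hat, y2_hat                   # y1_hat + y2_hat equals X up to eps, by construction

z1, z2 = torch.randn(8, 64, 513), torch.randn(8, 64, 513)
X = torch.rand(8, 64, 513)
y1_hat, y2_hat = apply_masks(z1, z2, X)
print(torch.allclose(y1_hat + y2_hat, X, atol=1e-5))   # True (up to eps)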
In conclusion, compared with the prior art, the invention has the following beneficial effects:
1. The invention provides a voice separation method in which a convolutional neural network serves as the front end and receives the single-channel mixed signal of human voice and background voice, its purpose being to reduce the dimensionality of the spectrogram and extract its local features; the extracted features are then combined with the original spectrogram into multi-scale features and sent together into a back-end recurrent neural network; finally, the predicted spectrograms of the human voice and the background music are obtained using a time-frequency mask. The invention designs two convolution kernels of different shapes in the convolutional neural network according to the characteristics of audio signals, in order to capture the context information of the time domain and the frequency domain from the input spectrogram, so that the human voice signal and the background voice signal of the mixed voice can be accurately separated.
2. By adding the attention layer, the method automatically learns the importance of each feature channel, then promotes useful features according to this importance and suppresses features that are of little use to the current task, so that the feature channels after convolution have different weights and the weights corresponding to redundant feature channels are correspondingly reduced, thereby improving the expressive capability of the network.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
FIG. 1 is a flow chart of a single-channel human voice and background voice separation method based on a convolution cyclic neural network;
fig. 2 is a schematic diagram of an attention layer of a single-channel human voice and background voice separation method based on a convolution cyclic neural network.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.
Example 1:
a single-channel human voice and background voice separation method based on a convolution cyclic neural network comprises the following steps:
s1, acquiring an original mixed voice signal, wherein the original mixed voice signal is a single-channel mixed signal of human voice and background voice;
s2, performing framing windowing and time-frequency conversion on the obtained original mixed voice signal to obtain an original mixed signal amplitude spectrum and an original mixed signal phase spectrum;
s3, inputting the original mixed signal amplitude spectrum into a convolutional neural network, wherein the convolutional neural network comprises a convolutional layer and a pooling layer which are sequentially arranged; the convolution layer obtains local characteristics of an original mixed signal amplitude spectrum, and the pooling layer reduces the dimension of the characteristics, converts the characteristics into a low-resolution characteristic diagram and outputs the characteristic diagram; the convolutional layer comprises two layers, and convolutional kernels in the two layers of convolutional layers are different in size;
s4, inputting the low-resolution feature map and the original mixed signal amplitude spectrum into a recurrent neural network, and combining a time-frequency mask to obtain a predicted value of human voice after the human voice passes through the time-frequency mask and a predicted value of background voice after the background voice passes through the time-frequency mask;
s5, respectively combining the predicted value of the human voice after passing through the time-frequency mask and the predicted value of the background voice after passing through the time-frequency mask with the phase spectrum of the original mixed signal, and respectively obtaining a predicted human voice signal and a predicted background voice signal through inverse Fourier transform;
and the convolutional neural network and the cyclic neural network are both provided with original mixed signal amplitude spectrum channels.
Preferably, the convolution kernels in the two convolution layers are both rectangular strip-shaped filters.
Preferably, the convolution kernel size of the first layer of convolutional layers is 2 × 10, and the convolution kernel size of the second layer of convolutional layers is 10 × 2.
Preferably, a batch normalization layer is arranged after the two convolutional layers, and the batch normalization layer uses a Leaky-ReLU activation function, whose formula is as follows:

f(x) = x when x ≥ 0, and f(x) = x/a when x < 0,

where x is the independent variable and a is a fixed parameter in the interval (1, +∞).
Preferably, the pooled layer convolution kernel size is 2 × 1.
Preferably, the spectrogram size of the input of the convolutional neural network in S3 is 10 × 513, where 513=1024/2+1 and the sampling rate is 16000 Hz.
Preferably, the convolutional neural network and the cyclic neural network use a mean square error loss function, as follows:

J_MSE = ‖ŷ₁(t) − y₁(t)‖² + ‖ŷ₂(t) − y₂(t)‖²

where ŷ₁(t) is the predicted value of the human voice after passing through the time-frequency mask, ŷ₂(t) is the predicted value of the background sound after passing through the time-frequency mask, and y₁(t) and y₂(t) respectively represent the real values of the human voice and the background sound; or a loss function combining the mean square error and the source-to-interference ratio is adopted, as follows:

J = ‖ŷ₁(t) − y₁(t)‖² + ‖ŷ₂(t) − y₂(t)‖² − γ(‖ŷ₁(t) − y₂(t)‖² + ‖ŷ₂(t) − y₁(t)‖²)

where γ is a hyper-parameter, ŷ₁(t) is the predicted value of the human voice after passing through the time-frequency mask, ŷ₂(t) is the predicted value of the background sound after passing through the time-frequency mask, and y₁(t) and y₂(t) respectively represent the real values of the human voice and the background sound. Preferably, the loss function combining the mean square error and the source-to-interference ratio is adopted, which not only makes the predicted human voice signal closer to the real human voice signal but also makes the predicted human voice signal contain less background signal; preferably, γ is 0.05.
Preferably, in S4, the predicted value of the human voice after passing through the time-frequency mask and the predicted value of the background sound after passing through the time-frequency mask are calculated as follows:

ŷ₁(t) = ( |z₁(t)| / (|z₁(t)| + |z₂(t)|) ) ⊙ X(t)

ŷ₂(t) = ( |z₂(t)| / (|z₁(t)| + |z₂(t)|) ) ⊙ X(t)

where ⊙ is defined as element-wise multiplication, ŷ₁(t) is the predicted value of the human voice after passing through the time-frequency mask, ŷ₂(t) is the predicted value of the background sound after passing through the time-frequency mask, X(t) is the amplitude spectrum of the original mixed signal, z₁(t) is the output for the human voice at time t predicted by the convolution cyclic neural network, z₂(t) is the output of the convolution cyclic neural network for the background sound at time t, |z₁(t)|/(|z₁(t)| + |z₂(t)|) is the time-frequency mask of the predicted value of the human voice, and |z₂(t)|/(|z₁(t)| + |z₂(t)|) is the time-frequency mask of the predicted value of the background sound.
The voice separation method provided by this embodiment uses a convolutional neural network as the front end and inputs the single-channel mixed signal of human voice and background voice, with the purpose of reducing the dimensionality of the spectrogram and extracting its local features; the extracted features are then combined with the original spectrogram into multi-scale features and sent into the back-end recurrent neural network; finally, the predicted spectrograms of the human voice and the background music are obtained using a time-frequency mask. The number of convolution kernels can be set to increase the output channels of the network, and in this embodiment two convolution kernels of different sizes are designed in the convolutional neural network according to the characteristics of the audio signal, in order to capture the context information of the time domain and the frequency domain from the input spectrogram and accurately separate the human voice signal and the background voice signal of the mixed voice. Specifically, this embodiment selects two rectangular strip-shaped filters as convolution kernels, sets their number to increase the output channels of the network, and improves the expressive capability of the network through the choice of kernel sizes. On the basis of the obtained audio-signal features, a direct connection channel and an original mixed signal amplitude spectrum channel are added, and the multi-scale information formed by combining the extracted features with the original spectrogram is used, so that the input of lower-resolution features is increased while the integrity of the original overall feature information is maintained, reflecting the complementarity of the two. In the design of the convolutional neural network, in order to compress the amount of data and the number of parameters, a pooling layer is used after the convolutional layers for feature dimensionality reduction, which reduces overfitting of the network and improves the generalization of the model, and a batch normalization layer is set to shorten the training time of the model so that it converges more quickly. The neural network model of this embodiment does not change the phase of the original speech spectrogram; instead, the predicted value of the human voice after passing through the time-frequency mask and the predicted value of the background voice after passing through the time-frequency mask are respectively combined with the phase spectrum of the original mixed signal, the predicted human voice signal and the predicted background voice signal are obtained through the inverse Fourier transform, and the time-frequency mask is used to add the constraint that the sum of the prediction results equals the original mixed signal. In this embodiment, two filters of different shapes are designed in the convolutional neural network to capture the time-domain and frequency-domain information of the voice, a pooling layer is used to reduce the feature dimensionality and extract local features, and these features are combined with the original mixed signal amplitude spectrum into multi-scale features that are input into the recurrent neural network, so that the human voice signal and the background voice signal of the mixed voice can be accurately separated. A complete forward pass assembling these stages is sketched below.
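The sketch below assembles the stages of this embodiment into a single forward pass; the channel counts, the hidden size, the value a = 5 and the way the pooled feature channels are collapsed before concatenation are illustrative assumptions, while the kernel and pooling sizes, the 513-bin spectra and the masking constraint follow the text.

import torch
import torch.nn as nn

class ConvRecurrentSeparator(nn.Module):
    def __init__(self, freq_bins=513, channels=16, hidden=512, a=5.0):
        super().__init__()
        self.freq_conv = nn.Conv2d(1, channels, (2, 10), padding=(1, 5))   # frequency-oriented kernel
        self.time_conv = nn.Conv2d(1, channels, (10, 2), padding=(5, 1))   # time-oriented kernel
        self.post = nn.Sequential(
            nn.BatchNorm2d(2 * channels),
            nn.LeakyReLU(1.0 / a),
            nn.MaxPool2d((2, 1)),                        # frequency halved, time unchanged
        )
        self.gru = nn.GRU(freq_bins + freq_bins // 2, hidden, batch_first=True)
        self.voice_head = nn.Linear(hidden, freq_bins)
        self.background_head = nn.Linear(hidden, freq_bins)

    def forward(self, X):                                # X: (batch, 513, time) amplitude spectrum
        b, f, t = X.shape
        x = X.unsqueeze(1)
        feats = torch.cat([self.freq_conv(x)[..., :f, :t],
                           self.time_conv(x)[..., :f, :t]], dim=1)
        feats = self.post(feats)                         # (batch, 2*channels, 256, time)
        low_res = feats.mean(dim=1)                      # collapse channels to one 256 x time map (assumption)
        multi_scale = torch.cat([X, low_res], dim=1)     # (batch, 769, time) multi-scale features
        h, _ = self.gru(multi_scale.transpose(1, 2))     # (batch, time, hidden)
        z1 = self.voice_head(h)                          # network output for the human voice
        z2 = self.background_head(h)                     # network output for the background sound
        denom = z1.abs() + z2.abs() + 1e-8
        Xt = X.transpose(1, 2)
        return (z1.abs() / denom) * Xt, (z2.abs() / denom) * Xt   # masked predictions

model = ConvRecurrentSeparator()
voice, background = model(torch.rand(2, 513, 64))
print(voice.shape, background.shape)                      # torch.Size([2, 64, 513]) each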
Example 2:
As shown in fig. 1 and 2, the present embodiment further includes, on the basis of embodiment 1: an attention layer arranged between the convolutional layers and the pooling layer of the convolutional neural network. The attention layer automatically learns the importance of each feature channel, increases the weight of useful feature channels according to this importance, and suppresses feature channels that are of little use to the current task. The attention layer is shown in fig. 2.
Preferably, the attention layer is globally pooled using a max-pooling method.
Fig. 2 provides a schematic diagram of the attention layer in this embodiment. Given an input x with c_1 feature channels, a feature with c_2 channels is obtained through a series of general transformations such as convolution. Unlike a conventional CNN, this embodiment then re-calibrates the obtained features through three operations. The first is the squeeze operation, which compresses the features along the spatial dimension and turns each two-dimensional feature channel into a single real number. This real number has, to some extent, a global receptive field, and the output dimension matches the number of input feature channels; it characterizes the global distribution of responses over the feature channels and allows layers close to the input to obtain a global receptive field, which is very useful in many tasks. The second is the excitation operation, a mechanism similar to the gates in a recurrent neural network, which generates a weight for each feature channel through a learned parameter w that explicitly models the correlation between feature channels. The final step is re-weighting: the output weights of the excitation operation are regarded as the importance of each feature channel after feature selection, and the original features are recalibrated in the channel dimension by weighting the previous features channel by channel through multiplication.
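A minimal PyTorch sketch of such a channel-attention layer with global max pooling follows; the two fully connected layers with reduction ratio r follow the common squeeze-and-excitation design and are an assumption here, since the text fixes only the three operations and the use of max pooling.

import torch
import torch.nn as nn

class MaxPoolChannelAttention(nn.Module):
    def __init__(self, channels, r=64):
        super().__init__()
        self.squeeze = nn.AdaptiveMaxPool2d(1)            # global max pooling per channel
        self.excite = nn.Sequential(
            nn.Linear(channels, max(channels // r, 1)),
            nn.ReLU(inplace=True),
            nn.Linear(max(channels // r, 1), channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                                  # x: (batch, channels, freq, time)
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c))        # per-channel importance weights
        return x * w.view(b, c, 1, 1)                      # re-weight (recalibrate) each channel

attn = MaxPoolChannelAttention(channels=32, r=4)            # r = 4 only for this toy example
y = attn(torch.randn(2, 32, 513, 64))
print(y.shape)                                              # torch.Size([2, 32, 513, 64])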
As can be seen from fig. 2, this embodiment takes into account that different channels may have different importance, which previous networks did not consider, treating all channels as equally important instead. The importance of the different channels is scaled by a learned set of weights, which is equivalent to re-calibrating the original features after the weights are applied.
The combined use of the two convolutional layers in embodiment 1 already improves on a single convolution kernel, and after studying mixed speech separation the inventor found that the number of output channels of the network can be increased by setting the number of convolution kernels; however, not all channels are equally important, and too many redundant feature channels affect the expressive capability of the network. In this embodiment, after the attention module is added on the basis of embodiment 1, the performance is further improved. In addition, the inventor adds the attention module after the last convolutional layer of the convolutional recurrent neural network and uses a maximum pooling function for global pooling, so that the weights corresponding to different channels are distinguishable, whereas using average pooling may weaken the distinction of importance between different channels.
1. Verification and comparative test: to verify the separation effect of the method of embodiment 2, the inventor used the MIR-1K data set. The audio of the MIR-1K data set is organized in two folders: UndividedWavfile, containing 1000 audio clips of 4 to 13 seconds, and Wavfile, containing 110 audio clips. These clips were extracted from 110 Chinese karaoke songs sung by both male and female singers. The inventor used a specific male singer and a specific female singer as the training set, containing 175 clips in total; the remaining 825 clips served as the test set. The sampling rate is 16000 Hz with 16-bit samples.
(1) Performance evaluation indexes: bss_eval_sources in the mir_eval package is used for evaluation, and the separation effect is assessed with the following four indexes.

Source-to-distortion ratio (SDR):

SDR = 10·log₁₀( ‖s_target‖² / ‖e_interf + e_noise + e_artif‖² )

Source-to-interference ratio (SIR):

SIR = 10·log₁₀( ‖s_target‖² / ‖e_interf‖² )

Source-to-noise ratio (SNR):

SNR = 10·log₁₀( ‖s_target + e_interf‖² / ‖e_noise‖² )

Source-to-artifact ratio (SAR):

SAR = 10·log₁₀( ‖s_target + e_interf + e_noise‖² / ‖e_artif‖² )

where s_target is the target (prediction) signal component, e_interf is the interference signal, e_noise is the noise signal, and e_artif are the artifacts introduced by the algorithm. SDR evaluates the separation effect of the algorithm from a relatively comprehensive angle, SIR analyzes it from the angle of interference, SNR from the angle of noise, and SAR from the angle of artifacts; the larger the values of SDR, SIR, SNR and SAR, the better the separation of the human voice and the background music. Global NSDR (GNSDR), global SIR (GSIR) and global SAR (GSAR) are weighted averages of NSDR, SIR and SAR respectively, weighted by source length. The normalized SDR (NSDR) is defined as:

NSDR(T_e, T_o, T_m) = SDR(T_e, T_o) − SDR(T_m, T_o)

where T_e is the estimated human voice/background music produced by the model, T_o is the pure human voice/background music in the original signal, and T_m is the original mixed signal.
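A minimal sketch of this evaluation using the mir_eval package follows; the random arrays are placeholders for the MIR-1K reference sources, the model estimates and the mixture.

import numpy as np
from mir_eval.separation import bss_eval_sources

rng = np.random.RandomState(0)
references = rng.randn(2, 16000 * 4)                     # [human voice, background music]
estimates = references + 0.1 * rng.randn(2, 16000 * 4)   # stand-in for the model output
mixture = references.sum(axis=0)

sdr, sir, sar, _ = bss_eval_sources(references, estimates)

# NSDR: improvement of the estimate's SDR over the SDR of the raw mixture.
mix_sdr, _, _, _ = bss_eval_sources(references, np.stack([mixture, mixture]))
nsdr = sdr - mix_sdr
print(sdr, sir, sar, nsdr)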
Table: algorithm comparison under loss function of mean square error and source-to-interference ratio combination
Figure 20091DEST_PATH_IMAGE040
In the above table, methods 1 to 8 are conventional mixed speech separation methods; method 9 replaces the two convolutional layers of embodiment 1 with a single convolutional layer; method 10 is the method of embodiment 1; method 11 is the method of embodiment 2; method 12, on the basis of embodiment 2, replaces the maximum pooling adopted by the attention layer with average pooling and replaces the two convolutional layers with a single convolutional layer; method 13, on the basis of embodiment 1, replaces the two convolutional layers with a single convolutional layer and replaces the maximum pooling adopted by the attention layer with average pooling; method 14, on the basis of embodiment 2, replaces the maximum pooling adopted by the attention layer with average pooling.
As can be seen from the table, the mixed speech separation results obtained by the method of embodiment 1 (method 10) and the method of embodiment 2 (method 11) are all comparable to those obtained by the prior-art methods 1 to 8; from methods 9 and 10, the effect of combining the two convolutional layers is improved compared with a single convolution kernel, and after the attention module is added in method 11 the performance is further improved. In addition, the inventor compared the two pooling methods, average pooling and maximum pooling, in the above table: comparing the mixed voice separation results of method 11 and method 14, the effect with average pooling is not as good as that with maximum pooling, because the attention module added after the last convolutional layer of the convolutional recurrent neural network uses a maximum pooling function for global pooling, which makes the weights corresponding to different channels distinguishable, whereas average pooling may weaken the distinction of importance between different channels. It can be seen that the mixed voice separation effect is better when the attention layer of embodiment 2 with maximum pooling is used.
The inventor also found during the research that the reduction ratio r is an important hyper-parameter in the attention layer: when the value of r is small, it does not help performance, whereas setting r = 64 improves performance on the three indexes GNSDR, GSIR and GSAR.
In addition, after studying different values of γ in the combined mean square error and source-to-interference ratio loss function for the convolutional neural network and the recurrent neural network, the inventor found that a value of γ of 0.05 gives a balance among GNSDR, GSIR and GSAR; the larger these three evaluation indexes, the higher the signal-to-noise ratio and the better the separation. However, the three indexes do not all increase or decrease together as the hyper-parameter changes, and 0.05 is selected in order to balance the three indexes, i.e., to make all of them relatively large, rather than optimizing only one of them.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (9)

1. A single-channel human voice and background voice separation method based on a convolution cyclic neural network is characterized by comprising the following steps:
s1, acquiring an original mixed voice signal, wherein the original mixed voice signal is a single-channel mixed signal of human voice and background voice;
s2, performing framing windowing and time-frequency conversion on the obtained original mixed voice signal to obtain an original mixed signal amplitude spectrum and an original mixed signal phase spectrum;
s3, inputting the original mixed signal amplitude spectrum into a convolutional neural network, wherein the convolutional neural network comprises a convolutional layer and a pooling layer which are sequentially arranged; the convolution layer obtains local characteristics of an original mixed signal amplitude spectrum, and the pooling layer reduces the dimension of the characteristics, converts the characteristics into a low-resolution characteristic diagram and outputs the characteristic diagram; the convolutional layer comprises two layers, and convolutional kernels in the two layers of convolutional layers are different in size;
s4, inputting the low-resolution feature map and the original mixed signal amplitude spectrum into a recurrent neural network, and combining a time-frequency mask to obtain a predicted value of human voice after the human voice passes through the time-frequency mask and a predicted value of background voice after the background voice passes through the time-frequency mask;
s5, respectively combining the predicted value of the human voice after passing through the time-frequency mask and the predicted value of the background voice after passing through the time-frequency mask with the phase spectrum of the original mixed signal, and respectively obtaining a predicted human voice signal and a predicted background voice signal through inverse Fourier transform;
the convolutional neural network and the cyclic neural network are both provided with an original mixed signal amplitude spectrum channel, an attention layer is arranged between a convolutional layer and a pooling layer of the convolutional neural network, the attention layer automatically acquires the importance degree of each characteristic channel in a learning mode, the weight of the useful characteristic channel is improved according to the importance degree, and the weight of the characteristic channel which is not used for the current task is reduced.
2. The method of claim 1, wherein the convolution kernels in the two convolutional layers are both rectangular strip filters.
3. The method of claim 2, wherein the convolution kernel size of the first convolutional layer is 2 x 10, and the convolution kernel size of the second convolutional layer is 10 x 2.
4. The method for separating the single-channel human voice from the background voice based on the convolutional recurrent neural network as claimed in claim 2, wherein a batch normalization layer is arranged after the two convolutional layers, the batch normalization layer uses a Leaky-relu activation function, and the formula of the Leaky-relu activation function is as follows:

f(x) = x when x ≥ 0, and f(x) = x/a when x < 0,

wherein x is an independent variable and a is a fixed parameter in the interval (1, +∞).
5. The convolution-cyclic neural network-based method for separating single-channel human voice from background sound according to claim 1, wherein the pooling layer convolution kernel size is 2 x 1.
6. The method for separating a single-channel human voice from a background voice based on a convolutional circular neural network as claimed in claim 1, wherein the spectrogram size of the input of the convolutional neural network in S3 is 10 × 513, wherein 513=1024/2+1 and the sampling rate is 16000 Hz.
7. The convolution-cyclic neural network-based single-channel human voice and background voice separation method of claim 1, wherein the attention layer is globally pooled by a max-pooling method.
8. The method for separating the single-channel human voice from the background voice based on the convolutional recurrent neural network as claimed in claim 1, wherein the convolutional neural network and the recurrent neural network use a mean square error loss function as follows:

J_MSE = ‖ŷ₁(t) − y₁(t)‖² + ‖ŷ₂(t) − y₂(t)‖²

wherein ŷ₁(t) is the predicted value of the human voice after passing through the time-frequency mask, ŷ₂(t) is the predicted value of the background sound after passing through the time-frequency mask, and y₁(t) and y₂(t) respectively represent the real values of the human voice and the background sound;

or a loss function combining the mean square error and the source-to-interference ratio is adopted, as follows:

J = ‖ŷ₁(t) − y₁(t)‖² + ‖ŷ₂(t) − y₂(t)‖² − γ(‖ŷ₁(t) − y₂(t)‖² + ‖ŷ₂(t) − y₁(t)‖²)

wherein γ is a hyper-parameter, ŷ₁(t) is the predicted value of the human voice after passing through the time-frequency mask, ŷ₂(t) is the predicted value of the background sound after passing through the time-frequency mask, and y₁(t) and y₂(t) respectively represent the real values of the human voice and the background sound.
9. The method for separating the single-channel human voice from the background voice based on the convolutional recurrent neural network as claimed in claim 1, wherein in S4 the predicted value of the human voice after passing through the time-frequency mask and the predicted value of the background sound after passing through the time-frequency mask are calculated as follows:

ŷ₁(t) = ( |z₁(t)| / (|z₁(t)| + |z₂(t)|) ) ⊙ X(t)

ŷ₂(t) = ( |z₂(t)| / (|z₁(t)| + |z₂(t)|) ) ⊙ X(t)

wherein ⊙ is defined as element-wise multiplication, ŷ₁(t) is the predicted value of the human voice after passing through the time-frequency mask, ŷ₂(t) is the predicted value of the background sound after passing through the time-frequency mask, X(t) is the amplitude spectrum of the original mixed signal, z₁(t) is the output for the human voice at time t predicted by the convolution cyclic neural network, z₂(t) is the output of the convolution cyclic neural network for the background sound at time t, |z₁(t)|/(|z₁(t)| + |z₂(t)|) is the time-frequency mask of the predicted value of the human voice, and |z₂(t)|/(|z₁(t)| + |z₂(t)|) is the time-frequency mask of the predicted value of the background sound.
CN202011119804.5A 2020-10-19 2020-10-19 Single-channel human voice and background voice separation method based on convolution cyclic neural network Active CN112259120B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011119804.5A CN112259120B (en) 2020-10-19 2020-10-19 Single-channel human voice and background voice separation method based on convolution cyclic neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011119804.5A CN112259120B (en) 2020-10-19 2020-10-19 Single-channel human voice and background voice separation method based on convolution cyclic neural network

Publications (2)

Publication Number Publication Date
CN112259120A CN112259120A (en) 2021-01-22
CN112259120B (en) 2021-06-29

Family

ID=74243874

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011119804.5A Active CN112259120B (en) 2020-10-19 2020-10-19 Single-channel human voice and background voice separation method based on convolution cyclic neural network

Country Status (1)

Country Link
CN (1) CN112259120B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113808607A (en) * 2021-03-05 2021-12-17 北京沃东天骏信息技术有限公司 Voice enhancement method and device based on neural network and electronic equipment
CN112802484B (en) * 2021-04-12 2021-06-18 四川大学 Panda sound event detection method and system under mixed audio frequency
CN113113041B (en) * 2021-04-29 2022-10-11 电子科技大学 Voice separation method based on time-frequency cross-domain feature selection
CN113257267B (en) * 2021-05-31 2021-10-15 北京达佳互联信息技术有限公司 Method for training interference signal elimination model and method and equipment for eliminating interference signal
CN113241092A (en) * 2021-06-15 2021-08-10 新疆大学 Sound source separation method based on double-attention mechanism and multi-stage hybrid convolution network
CN113990033A (en) * 2021-09-10 2022-01-28 南京融才交通科技研究院有限公司 Vehicle traffic accident remote take-over rescue method and system based on 5G internet of vehicles
CN113903355B (en) * 2021-12-09 2022-03-01 北京世纪好未来教育科技有限公司 Voice acquisition method and device, electronic equipment and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106653048A (en) * 2016-12-28 2017-05-10 上海语知义信息技术有限公司 Method for separating sound of single channels on basis of human sound models
CN106847302A (en) * 2017-02-17 2017-06-13 大连理工大学 Single channel mixing voice time-domain seperation method based on convolutional neural networks
CN110097049A (en) * 2019-04-03 2019-08-06 中国科学院计算技术研究所 A kind of natural scene Method for text detection and system
CN110148400A (en) * 2018-07-18 2019-08-20 腾讯科技(深圳)有限公司 The pronunciation recognition methods of type, the training method of model, device and equipment
US10452902B1 (en) * 2018-12-21 2019-10-22 Capital One Services, Llc Patent application image generation systems
US10540961B2 (en) * 2017-03-13 2020-01-21 Baidu Usa Llc Convolutional recurrent neural networks for small-footprint keyword spotting
CN110992987A (en) * 2019-10-23 2020-04-10 大连东软信息学院 Parallel feature extraction system and method for general specific voice in voice signal
CN111179957A (en) * 2020-01-07 2020-05-19 腾讯科技(深圳)有限公司 Voice call processing method and related device
CN111275046A (en) * 2020-01-10 2020-06-12 中科鼎富(北京)科技发展有限公司 Character image recognition method and device, electronic equipment and storage medium
CN111667819A (en) * 2019-03-08 2020-09-15 北京京东尚科信息技术有限公司 CRNN-based speech recognition method, system, storage medium and electronic equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10361673B1 (en) * 2018-07-24 2019-07-23 Sony Interactive Entertainment Inc. Ambient sound activated headphone
US11645745B2 (en) * 2019-02-15 2023-05-09 Surgical Safety Technologies Inc. System and method for adverse event detection or severity estimation from surgical data

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106653048A (en) * 2016-12-28 2017-05-10 上海语知义信息技术有限公司 Method for separating sound of single channels on basis of human sound models
CN106847302A (en) * 2017-02-17 2017-06-13 大连理工大学 Single channel mixing voice time-domain seperation method based on convolutional neural networks
US10540961B2 (en) * 2017-03-13 2020-01-21 Baidu Usa Llc Convolutional recurrent neural networks for small-footprint keyword spotting
CN110148400A (en) * 2018-07-18 2019-08-20 腾讯科技(深圳)有限公司 The pronunciation recognition methods of type, the training method of model, device and equipment
US10452902B1 (en) * 2018-12-21 2019-10-22 Capital One Services, Llc Patent application image generation systems
CN111667819A (en) * 2019-03-08 2020-09-15 北京京东尚科信息技术有限公司 CRNN-based speech recognition method, system, storage medium and electronic equipment
CN110097049A (en) * 2019-04-03 2019-08-06 中国科学院计算技术研究所 A kind of natural scene Method for text detection and system
CN110992987A (en) * 2019-10-23 2020-04-10 大连东软信息学院 Parallel feature extraction system and method for general specific voice in voice signal
CN111179957A (en) * 2020-01-07 2020-05-19 腾讯科技(深圳)有限公司 Voice call processing method and related device
CN111275046A (en) * 2020-01-10 2020-06-12 中科鼎富(北京)科技发展有限公司 Character image recognition method and device, electronic equipment and storage medium

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
"Crnn-Ctc Based Mandarin Keywords Spotting";Haikang Yan;《ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing》;20200504;全文 *
"Hybrid Approach to Combining Conventional and Deep Learning Techniques for Single-Channel Speech Enhancement and Recognition";Yan-Hui Tu;《2018 IEEE International Conference on Acoustics, Speech and Signal Processing 》;20180420;全文 *
"Single Channel Speech Enhancement Using Temporal Convolutional Recurrent Neural Networks";Jingdong Li;《Proceedings of APSIPA Annual Summit and Conference》;20191221;全文 *
"SOUND EVENT LOCALIZATION AND DETECTION USING CONVOLUTIONAL RECURRENT NEURAL NETWORK";Wen Jie Jee1;《Detection and Classification of Acoustic Scenes and Events 2019》;20191231;全文 *
"Sound Event Localization Based on Sound Intensity Vector Refined by Dnn-Based Denoising and Source Separation";Masahiro Yasuda;《ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing》;20200508;全文 *
"基于深度神经网络的单通道语音增强方法回顾";鲍长春;《信号处理》;20191231;第35卷(第12期);全文 *
"结合深度卷积循环网络和时频注意力机制的单通道语音增强算法";闫昭宇;《信号处理》;20200630;第36卷(第6期);全文 *

Also Published As

Publication number Publication date
CN112259120A (en) 2021-01-22

Similar Documents

Publication Publication Date Title
CN112259120B (en) Single-channel human voice and background voice separation method based on convolution cyclic neural network
CN109890043B (en) Wireless signal noise reduction method based on generative countermeasure network
CN112712812B (en) Audio signal generation method, device, equipment and storage medium
Wang et al. Specaugment++: A hidden space data augmentation method for acoustic scene classification
CN110246510B (en) End-to-end voice enhancement method based on RefineNet
CN110197665B (en) Voice separation and tracking method for public security criminal investigation monitoring
CN108091345B (en) Double-ear voice separation method based on support vector machine
CN112349297A (en) Depression detection method based on microphone array
CN113470671B (en) Audio-visual voice enhancement method and system fully utilizing vision and voice connection
CN112259119B (en) Music source separation method based on stacked hourglass network
CN111986661A (en) Deep neural network speech recognition method based on speech enhancement in complex environment
Fan et al. Utterance-level permutation invariant training with discriminative learning for single channel speech separation
CN108806725A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN111724806A (en) Double-visual-angle single-channel voice separation method based on deep neural network
CN111968669B (en) Multi-element mixed sound signal separation method and device
Şimşekli et al. Non-negative tensor factorization models for Bayesian audio processing
Abdulatif et al. Investigating cross-domain losses for speech enhancement
Sun Digital audio scene recognition method based on machine learning technology
CN110136741A (en) A kind of single-channel voice Enhancement Method based on multiple dimensioned context
CN113229842B (en) Heart and lung sound automatic separation method based on complex deep neural network
CN114898773A (en) Synthetic speech detection method based on deep self-attention neural network classifier
Ali et al. Enhancing Embeddings for Speech Classification in Noisy Conditions.
TWI749547B (en) Speech enhancement system based on deep learning
Xie et al. Cross-corpus open set bird species recognition by vocalization
Dou et al. Cochleagram-based identification of electronic disguised voice with pitch scaling in the noisy environment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20210204

Address after: No. 1418, 14th floor, building 1, No. 1166, Tianfu 3rd Street, Chengdu hi tech Zone, China (Sichuan) pilot Free Trade Zone, Chengdu, Sichuan 610000

Applicant after: Chengdu Yuejian Technology Co.,Ltd.

Address before: 610000 Chengdu, Sichuan, Shuangliu District, Dongsheng Street, long bridge 6, 129, 1 units, 9 level 902.

Applicant before: CHENGDU MINGJIE TECHNOLOGY Co.,Ltd.

TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20210611

Address after: 210012 4th floor, building C, Wanbo Science Park, 20 Fengxin Road, Yuhuatai District, Nanjing City, Jiangsu Province

Applicant after: NANJING GUIJI INTELLIGENT TECHNOLOGY Co.,Ltd.

Address before: No. 1418, 14th floor, building 1, No. 1166, Tianfu 3rd Street, Chengdu hi tech Zone, China (Sichuan) pilot Free Trade Zone, Chengdu, Sichuan 610000

Applicant before: Chengdu Yuejian Technology Co.,Ltd.

GR01 Patent grant
GR01 Patent grant