CN113113041A - Voice separation method based on time-frequency cross-domain feature selection - Google Patents
- Publication number
- CN113113041A (application number CN202110471865.6A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L19/18—Vocoders using multiple modes
- G10L21/0208—Noise filtering
- G10L21/0224—Noise filtering characterised by the method used for estimating noise; Processing in the time domain
- G10L21/0232—Noise filtering characterised by the method used for estimating noise; Processing in the frequency domain
- G10L2021/02087—Noise filtering the noise being separate speech, e.g. cocktail party
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The invention discloses a voice separation method based on time-frequency cross-domain feature selection, belonging to the field of single-channel speech separation. The method comprises the following steps: encoding the single-channel speech with a one-dimensional convolutional neural network and with a short-time Fourier transform, respectively; fusing the feature maps obtained by the two encodings; computing a mask for each speaker with a separation network from the fused feature map; applying each mask to the fused feature map to obtain a separated feature map for each speaker; and recovering each speaker's speech from that speaker's separated feature map. By performing feature selection during separation, the time-domain and frequency-domain features complement one another, so the characteristic signals of speech can be captured accurately in a multi-speaker noise environment. This addresses the under-utilization of time-domain and time-frequency-domain features in the field, as well as the previously poor separation in non-stationary noise environments.
Description
Technical Field
The invention belongs to the field of single-channel speech separation, and in particular relates to a voice separation method based on time-frequency cross-domain feature selection.
Background
Speech separation technology is a branch of the natural language processing field that addresses the problem of recovering usable speech information in a multi-speaker noise environment. The goal of speech separation is to separate the target speech from background interference.
With the development of deep learning, many new neural-network-based algorithms have emerged; Deep Clustering (DC) and Permutation Invariant Training (PIT) surpassed traditional methods. Building on deep clustering and permutation invariant training, the Deep Attractor Network (DANet) achieved unprecedented success by using an attractor mechanism to estimate a mask for each speech source. Unlike separation networks that take the magnitude spectrum as feature input, the Time-domain Audio Separation Network (TasNet) and the fully convolutional Conv-TasNet take the time-domain signal as network input and are currently among the most prominent models. The core idea of TasNet is to capture features of the time-domain signal with a one-dimensional convolution instead of a fixed transform such as the short-time Fourier transform, which is not optimal for the separation task; an encoder well suited to capturing time-domain features is obtained by training the network end to end.
Building on these algorithms that use time-domain signal features directly, researchers have proposed combined approaches that embed and cluster time-domain and frequency-domain features. In the encoding stage, the features of the two domains (the convolution-extracted time-domain features and the Fourier-transform magnitude spectrum) are computed in parallel and concatenated along the channel dimension. A separation network maps the features to a high-dimensional embedding space, and a mask is then generated for each speech source by an attractor mechanism. In the decoder, the processed speech signals are obtained by applying a transposed convolution and an inverse Fourier transform to the two masked feature maps of the respective domains. Experiments show that concatenating time-domain and frequency-domain features outperforms using time-domain features alone.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a voice separation method based on time-frequency cross-domain feature selection. The method is implemented with a voice separation network based on time-frequency cross-domain feature selection, which mainly comprises a feature encoder, a voice separator and a decoder, the feature encoder being a voice time-frequency-domain cross-domain feature encoder.
The method comprises the following steps:
Step 1, sampling the voices of n speakers with a single recording device to obtain a single-channel speech signal containing the n speakers' voices, and encoding the single-channel speech with a one-dimensional convolutional neural network and with a short-time Fourier transform, respectively, in the voice time-frequency-domain cross-domain feature encoder, obtaining a feature map encoded with the one-dimensional convolutional neural network and a feature map encoded with the short-time Fourier transform, where n is a positive integer greater than 1;
Step 2, performing cross-domain feature fusion of the feature map encoded with the one-dimensional convolutional neural network and the feature map encoded with the short-time Fourier transform, using a time-frequency cross-domain feature selection method, to obtain a cross-domain fused feature map;
Step 3, computing, with the voice separator, a mask for each speaker in the single-channel speech from the cross-domain fused feature map, and applying the mask to the cross-domain fused feature map to obtain a separated feature map for each speaker in the single-channel speech;
Step 4, reconstructing the speech signal from each speaker's separated feature map with a one-dimensional transposed convolutional neural network in the decoder, finally obtaining the speech of each speaker in the single-channel speech.
Further, the specific process of encoding the single-channel speech with the one-dimensional convolutional neural network and the short-time Fourier transform in the voice time-frequency-domain cross-domain feature encoder in step 1 comprises the following steps:
Step 1-1: applying one-dimensional convolutional neural network 1 in the encoder to the single-channel speech to obtain a time-domain feature map, which serves as the feature map encoded with the one-dimensional convolutional neural network;
Step 1-2: applying the short-time Fourier transform in the encoder to the single-channel speech to obtain a magnitude spectrum, which serves as the time-frequency-domain feature map;
Step 1-3: linearly transforming the time-frequency-domain feature map to the same feature dimension as the time-domain feature map with fully connected network 1 in the encoder, giving a transformed time-frequency-domain feature map; then applying a nonlinear transformation to it with one-dimensional convolutional neural network 2 in the encoder, giving the nonlinearly transformed time-frequency-domain feature map, which serves as the feature map encoded with the short-time Fourier transform.
The numbers of input and output channels of one-dimensional convolutional neural network 1 and one-dimensional convolutional neural network 2 differ.
Further, the specific process of cross-domain feature fusion in step 2 comprises the following steps:
Step 2-1: adding corresponding elements of the nonlinearly transformed time-frequency-domain feature map and the time-domain feature map to obtain a summed feature map;
Step 2-2: from the summed feature map, averaging each feature channel along the time dimension by global pooling to obtain a global feature descriptor, the number of channels of the global feature descriptor being the same as that of the summed feature map;
Step 2-3: compressing the global feature descriptor of step 2-2 with fully connected network 2 in the voice time-frequency-domain cross-domain feature encoder, reducing its feature dimension to obtain a compressed feature descriptor;
Step 2-4: expanding the compressed feature descriptor with fully connected network 3 and fully connected network 4 in the encoder, restoring the feature dimension of the global feature descriptor, to obtain a time-domain feature descriptor and a time-frequency-domain feature descriptor, respectively; fully connected networks 3 and 4 have the same number of parameters but different parameter values;
Step 2-5: multiplying the time-domain feature descriptor into the time-domain feature map element-wise, and multiplying the time-frequency-domain feature descriptor into the nonlinearly transformed time-frequency-domain feature map element-wise, completing the cross-domain feature selection and yielding two feature maps after cross-domain feature selection; finally, adding the two feature maps element-wise, completing the cross-domain feature fusion and obtaining the cross-domain fused feature map.
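The selection-and-fusion procedure of steps 2-1 to 2-5 can be sketched as follows. This is an illustrative NumPy sketch only: the channel count C, frame count T, compressed dimension m, random weights, and the softmax form of the two selection vectors are assumptions, not the patent's exact parameters.

```python
# Minimal sketch of cross-domain feature selection and fusion.
import numpy as np

rng = np.random.default_rng(1)
C, T, m = 8, 50, 4
F_conv = rng.standard_normal((C, T))       # time-domain feature map
F_spec = rng.standard_normal((C, T))       # nonlinearly transformed T-F map

U = F_conv + F_spec                        # step 2-1: element-wise sum
g = U.mean(axis=1)                         # step 2-2: global pooling -> (C,)
W = rng.standard_normal((m, C)) * 0.1
z = 1.0 / (1.0 + np.exp(-(W @ g)))         # step 2-3: compress + sigmoid
A = rng.standard_normal((C, m))            # step 2-4: two expansion layers
B = rng.standard_normal((C, m))            #  (same size, different values)
ea, eb = np.exp(A @ z), np.exp(B @ z)
a, b = ea / (ea + eb), eb / (ea + eb)      # per-channel selection weights
H = a[:, None] * F_conv + b[:, None] * F_spec   # step 2-5: select and fuse
```

Because a and b are normalized against each other per channel, each channel of H is a convex combination of the corresponding time-domain and time-frequency-domain channels.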
Further, step 3 specifically comprises the following steps:
Step 3-1: further extracting features from the cross-domain fused feature map with the stacked convolutional neural network in the voice separator to obtain a transformed cross-domain fused feature map;
Step 3-2: raising the dimension of the transformed cross-domain fused feature map with fully connected network 5 in the voice separator, transforming it into a tensor with three dimensions: time, feature and embedding;
Step 3-3: computing the mask of each speaker in the single-channel speech with an attractor mechanism, based on the tensor with time, feature and embedding dimensions obtained by the transformation;
Step 3-4: multiplying the mask of each speaker element-wise with the cross-domain fused feature map obtained in step 2-5 to obtain the separated feature map of each speaker in the single-channel speech.
Further, the speech signal reconstruction in step 4 specifically comprises: converting each speaker's separated feature map into the corresponding speaker's speech signal with the one-dimensional transposed convolutional neural network in the decoder, thereby completing the voice separation.
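Steps 3-4 and 4 together amount to masking the fused map and mapping each masked map back to a waveform. The sketch below illustrates this with NumPy; the random "masks" (normalized so the speakers' masks sum to one per element, an assumption of this sketch) and the random decoder kernels stand in for quantities the network would learn.

```python
# Sketch: per-speaker masking of the fused feature map, then waveform
# reconstruction by a transposed 1-D convolution (overlap-add).
import numpy as np

rng = np.random.default_rng(2)
n, C, T, K, stride = 2, 16, 40, 8, 4
H = rng.standard_normal((C, T))                      # fused feature map
logits = rng.standard_normal((n, C, T))
M = np.exp(logits) / np.exp(logits).sum(axis=0)      # masks sum to 1 per element

def transposed_conv1d(S, kernels, stride):
    """Decoder: each frame projects back to K samples, overlap-added."""
    _, T = S.shape
    out = np.zeros(stride * (T - 1) + kernels.shape[1])
    for t in range(T):
        out[t*stride : t*stride + kernels.shape[1]] += S[:, t] @ kernels
    return out

kernels = rng.standard_normal((C, K))                # assumed decoder basis
speech = [transposed_conv1d(M[i] * H, kernels, stride) for i in range(n)]
```

Each element of `speech` is one speaker's reconstructed waveform; its length follows the usual transposed-convolution formula stride*(T-1)+K.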
The fully connected networks 1-5 are all different in structure and parameter values.
The invention performs feature selection during voice separation: through cross-domain mixing, the time-domain and frequency-domain features complement each other, so the characteristic signals of speech can be captured accurately in a multi-speaker noise environment. This overcomes the under-utilization of time-domain and time-frequency-domain features in the field and provides a novel feature encoder that effectively extracts the useful information in speech. High-dimensional features of the speech signal are extracted by the stacked convolutional neural network, speaker-specific signal clusters are built by deep clustering with the attractor sub-network, and the signals of non-corresponding speakers are filtered out adaptively, improving the robustness of the model and overcoming the previously poor separation in non-stationary noise environments.
Drawings
Fig. 1 is a schematic diagram of an overall structure of a voice separation network based on time-frequency cross-domain feature selection.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings.
The invention provides a voice separation method based on time-frequency cross-domain feature selection, mainly directed at the voice separation problem in a multi-speaker environment, which can effectively mix time-domain and time-frequency-domain features.
As shown in fig. 1, the present invention is implemented by using a voice separation network based on time-frequency cross-domain feature selection, and the voice separation network based on time-frequency cross-domain feature selection mainly includes three parts, namely, a feature encoder, a voice separator, and a decoder.
The feature encoder is a voice time-frequency-domain cross-domain feature encoder. It extracts a time-domain feature map and a time-frequency-domain feature map from the single-channel speech containing multiple speakers' voices, through one-dimensional convolutional neural network 1 and the short-time Fourier transform respectively; the time-domain feature map serves as the feature map encoded with the one-dimensional convolutional neural network. Fully connected network 1 in the encoder linearly transforms the time-frequency-domain feature map to the same feature dimension as the time-domain feature map, giving a transformed time-frequency-domain feature map; one-dimensional convolutional neural network 2 then applies a nonlinear transformation to it, and the resulting nonlinearly transformed time-frequency-domain feature map serves as the feature map encoded with the short-time Fourier transform. Corresponding elements of the nonlinearly transformed time-frequency-domain feature map and the time-domain feature map are added to obtain a summed feature map. The summed feature map passes through global pooling and three fully connected networks (fully connected network 2 for compression, fully connected networks 3 and 4 for restoration) to give a time-domain feature descriptor and a time-frequency-domain feature descriptor. The time-domain feature descriptor is multiplied element-wise into the time-domain feature map, and the time-frequency-domain feature descriptor is multiplied element-wise into the nonlinearly transformed time-frequency-domain feature map, yielding two feature maps after cross-domain feature selection; finally these two feature maps are added element-wise to obtain the cross-domain fused feature map.
Given a mixed speech signal x containing n speakers (i.e., a single-channel speech signal containing n speakers' voices), with n a positive integer greater than 1, the time-frequency-domain feature map is obtained from x by the short-time Fourier transform, while the time-domain feature map is obtained from x by the one-dimensional convolution of one-dimensional convolutional neural network 1:
F_spec = S(x),  F_conv = F(x)
where S(·) denotes the short-time Fourier transform, F(·) denotes the one-dimensional convolution, F_spec is the time-frequency-domain feature map obtained with the short-time Fourier transform, and F_conv is the time-domain feature map obtained with the one-dimensional convolutional neural network encoding.
F_spec then undergoes a linear transformation through fully connected network 1, which maps the time-frequency-domain feature map to the same feature dimension as the time-domain feature map, giving the transformed time-frequency-domain feature map. One-dimensional convolutional neural network 2, with convolution kernel size 3, then applies a nonlinear transformation that encodes the transformed map into the same latent representation space as the time-domain features, giving the nonlinearly transformed time-frequency-domain feature map, denoted F̂_spec.
The number of input and output channels of the one-dimensional convolutional neural network 1 is different from that of the one-dimensional convolutional neural network 2.
The nonlinearly transformed time-frequency-domain feature map F̂_spec and the time-domain feature map F_conv are added element-wise to obtain the summed feature map U:
U = F̂_spec ⊕ F_conv
where ⊕ denotes the element-wise addition operation.
Global pooling of the summed feature map yields the global feature descriptor g ∈ R^C, where C is the total number of feature channels (the feature dimension) and T is the length of the time dimension; g is used to compute the time-domain and time-frequency-domain feature descriptors. With U_t denoting the summed feature map at time t:
g = (1/T) Σ_{t=1}^{T} U_t
The global feature descriptor is compressed through fully connected network 2 to obtain the compressed feature descriptor z ∈ R^m, where m is the compressed feature dimension; the compressed feature descriptor guides the feature selection. With N denoting the operation of fully connected network 2, δ a sigmoid activation function, W the weight matrix of fully connected network 2, and g the global feature descriptor:
z = δ(N(Wg))
then, two full-connection layers (a full-connection network 3 and a full-connection network 4) are used for carrying out feature dimension reduction on the compressed feature descriptors to respectively obtain time domain feature descriptorsAnd time-frequency domain feature descriptor
ajA characteristic selection value b representing that the time domain characteristic descriptor a is positioned in the jth channel in the time domain characteristic diagramjRepresenting the time-frequency domain feature descriptor b after the nonlinear transformationTo middleFeature selection values for j channels, j ═ 1,2, …, C:
wherein(where C denotes the total number of eigen-channels, i.e. eigen-dimensions) are the weight matrices for fully-connected network 3 and fully-connected network 4, respectively, AjAnd BjRespectively representing the weights of the two weight matrixes in the jth row, and e represents a natural base number.
The cross-domain fused feature map (i.e., the cross-domain selected feature map) H is computed as:
H = a ⊙ F_conv ⊕ b ⊙ F̂_spec
where ⊙ denotes the element-wise multiplication operation (each descriptor weighting the channels of its feature map) and ⊕ denotes element-wise addition.
The voice separator further extracts features from the cross-domain fused feature map with a stacked convolutional neural network to obtain the transformed cross-domain fused feature map. Fully connected network 5 then maps this to the high-dimensional cross-domain fused features (a tensor with the three dimensions time, feature and embedding). Based on these features, an attractor mechanism produces the mask of each speaker in the single-channel speech, and each speaker's mask is multiplied element-wise with the cross-domain fused feature map to obtain that speaker's separated feature map.
Further feature extraction on the cross-domain fused feature map H is performed by a stacked convolutional neural network formed by stacking 8 to 32 one-dimensional convolutional neural networks, giving the transformed cross-domain fused feature map, which fully connected network 5 maps to a high-dimensional space:
V = W_emb T_s(H)
where V ∈ R^{T×C×D}, C is the feature dimension, T is the length of the time dimension, and D is the embedding dimension, with D = 20; W_emb is the weight of fully connected network 5 and T_s(·) is the operation of the stacked convolutional neural network.
Then a mask for each speaker in the single-channel speech is obtained with the attractor mechanism. With M_i denoting the mask of the i-th speaker produced by the voice separator, i = 1, 2, …, n, each speaker's mask is multiplied element-wise with the cross-domain fused feature map H to obtain H_i, the separated feature map of the i-th speaker:
H_i = M_i ⊙ H
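The attractor step can be sketched as follows. This NumPy sketch is illustrative only: the attractors are random stand-ins for the cluster centers that deep clustering would learn, the sizes T, C, n are arbitrary, and using a dot-product similarity followed by a softmax is an assumption of the sketch (D = 20 follows the text).

```python
# Sketch of attractor-based mask estimation: each (time, feature) bin has a
# D-dim embedding; similarity to one attractor per speaker gives soft masks.
import numpy as np

rng = np.random.default_rng(3)
T, C, D, n = 30, 16, 20, 2                 # D = 20 embedding dims, as in the text
V = rng.standard_normal((T * C, D))        # one D-dim embedding per (t, c) bin
attractors = rng.standard_normal((n, D))   # stand-ins for learned attractors

scores = V @ attractors.T                  # (T*C, n) similarity to each attractor
M = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
M = M.T.reshape(n, T, C)                   # per-speaker soft masks over the map
```

At every (t, c) bin the n masks sum to one, so each bin of the fused map is softly assigned across speakers.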
the decoder uses a one-dimensional transposition convolution neural network to reconstruct a voice signal, and converts the separation characteristic diagram of each speaker in the single-channel voice into the voice signal of the corresponding speaker, thereby completing voice separation.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (6)
1. A voice separation method based on time-frequency cross-domain feature selection, characterized in that the method is implemented with a voice separation network based on time-frequency cross-domain feature selection, the network mainly comprising a feature encoder, a voice separator and a decoder, the feature encoder being a voice time-frequency-domain cross-domain feature encoder; the method comprises the following steps:
Step 1: sampling the voices of n speakers with a single recording device to obtain a single-channel speech signal containing the n speakers' voices, and encoding the single-channel speech with a one-dimensional convolutional neural network and with a short-time Fourier transform, respectively, in the voice time-frequency-domain cross-domain feature encoder, obtaining a feature map encoded with the one-dimensional convolutional neural network and a feature map encoded with the short-time Fourier transform, where n is a positive integer greater than 1;
Step 2: performing cross-domain feature fusion of the feature map encoded with the one-dimensional convolutional neural network and the feature map encoded with the short-time Fourier transform, using a time-frequency cross-domain feature selection method, to obtain a cross-domain fused feature map;
Step 3: computing, with the voice separator, a mask for each speaker in the single-channel speech from the cross-domain fused feature map, and applying the mask to the cross-domain fused feature map to obtain a separated feature map for each speaker in the single-channel speech;
Step 4: reconstructing the speech signal from each speaker's separated feature map with a one-dimensional transposed convolutional neural network in the decoder, finally obtaining the speech of each speaker in the single-channel speech.
2. The voice separation method based on time-frequency cross-domain feature selection according to claim 1, characterized in that the process of separately encoding the mono voice with the one-dimensional convolutional neural network and the short-time Fourier transform in step 1 comprises the following steps:
step 1-1: applying one-dimensional convolutional neural network 1 in the voice time-frequency cross-domain feature encoder to the mono voice to obtain a time-domain feature map, which serves as the feature map encoded by the one-dimensional convolutional neural network;
step 1-2: applying the short-time Fourier transform in the voice time-frequency cross-domain feature encoder to the mono voice to obtain a magnitude spectrum, which serves as a time-frequency-domain feature map;
step 1-3: linearly transforming the time-frequency-domain feature map to the same feature dimension as the time-domain feature map with fully-connected network 1 in the voice time-frequency cross-domain feature encoder to obtain a transformed time-frequency-domain feature map; applying a nonlinear transformation to the transformed time-frequency-domain feature map with one-dimensional convolutional neural network 2 in the encoder to obtain a nonlinearly transformed time-frequency-domain feature map, which serves as the feature map encoded by the short-time Fourier transform;
the numbers of input and output channels of one-dimensional convolutional neural network 1 differ from those of one-dimensional convolutional neural network 2.
3. The voice separation method based on time-frequency cross-domain feature selection according to claim 2, characterized in that the cross-domain feature fusion in step 2 comprises the following steps:
step 2-1: adding the nonlinearly transformed time-frequency-domain feature map and the time-domain feature map element-wise to obtain a sum feature map;
step 2-2: applying global average pooling to the sum feature map along the time dimension of each feature channel to obtain a global feature descriptor, the number of channels of the global feature descriptor being the same as that of the sum feature map;
step 2-3: compressing the global feature descriptor obtained in step 2-2 with fully-connected network 2 in the voice time-frequency cross-domain feature encoder, reducing its feature dimension to obtain a compressed feature descriptor;
step 2-4: expanding the compressed feature descriptor separately with fully-connected network 3 and fully-connected network 4 in the voice time-frequency cross-domain feature encoder, restoring the feature dimension of the global feature descriptor, to obtain a time-domain feature descriptor and a time-frequency-domain feature descriptor respectively; fully-connected networks 3 and 4 have the same number of parameters but different parameter values;
step 2-5: multiplying the time-domain feature descriptor onto the time-domain feature map element-wise, and multiplying the time-frequency-domain feature descriptor onto the nonlinearly transformed time-frequency-domain feature map element-wise, completing the cross-domain feature selection and yielding two feature maps after cross-domain feature selection; finally adding the two feature maps element-wise, completing the cross-domain feature fusion and yielding the cross-domain fused feature map.
4. The voice separation method based on time-frequency cross-domain feature selection according to claim 3, characterized in that step 3 specifically comprises the following steps:
step 3-1: further extracting features from the cross-domain fused feature map with a stacked convolutional neural network in the voice separator to obtain a transformed cross-domain fused feature map;
step 3-2: raising the dimensionality of the transformed cross-domain fused feature map with fully-connected network 5 in the voice separator, transforming it into a tensor with the three dimensions of time, feature and embedding;
step 3-3: computing the mask of each speaker in the mono voice with an attractor mechanism based on the tensor with the three dimensions of time, feature and embedding obtained by the transformation;
step 3-4: multiplying the mask of each speaker element-wise with the cross-domain fused feature map obtained in step 2-5 to obtain the separated feature map of each speaker in the mono voice.
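The attractor mechanism of step 3-3 (in the spirit of deep attractor networks) can be sketched as follows. All sizes are hypothetical, the embedding tensor is random, and the speaker-dominance assignment is a random stand-in for the ideal labels available during training; at inference an attractor network would estimate the attractors instead:

```python
import numpy as np

rng = np.random.default_rng(3)
T, C, E, n = 50, 32, 20, 2               # time, feature, embedding dims, speakers

fused = np.abs(rng.standard_normal((T, C)))    # cross-domain fused map (step 2-5)
emb = rng.standard_normal((T, C, E))           # step 3-2 tensor: (time, feature, embedding)

# Step 3-3: each speaker's attractor is the mean embedding of the units that
# speaker dominates; dominance is a random assignment here, for illustration.
assign = rng.integers(0, n, size=(T, C))       # which speaker "owns" each unit
attractors = np.stack([emb[assign == s].mean(axis=0) for s in range(n)])  # (n, E)

# Masks: softmax over speakers of embedding-attractor similarity.
sim = np.einsum('tce,se->stc', emb, attractors)    # (n, T, C)
masks = np.exp(sim) / np.exp(sim).sum(axis=0)

# Step 3-4: apply each speaker's mask to the fused map element-wise.
separated = masks * fused                           # (n, T, C)

assert np.allclose(masks.sum(axis=0), 1.0)
```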
5. The voice separation method based on time-frequency cross-domain feature selection according to claim 4, characterized in that the voice signal reconstruction in step 4 specifically comprises:
converting the separated feature map of each speaker in the mono voice into the voice signal of the corresponding speaker with the one-dimensional transposed convolutional neural network in the decoder, thereby completing the voice separation.
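The decoder of claim 5 can be sketched as explicit overlap-add, which is the operation a one-dimensional transposed convolution performs; kernel length, hop and channel count are hypothetical, and a random matrix stands in for the learned decoder kernels:

```python
import numpy as np

rng = np.random.default_rng(4)
T, C, L, hop = 100, 64, 16, 8            # hypothetical sizes

sep = np.abs(rng.standard_normal((T, C)))          # one speaker's separated map
basis = rng.standard_normal((C, L)) / np.sqrt(C)   # stand-in transposed-conv kernels

# Transposed 1-D convolution: project each feature vector back to a length-L
# waveform segment, then overlap-add the segments at the encoder hop size.
segments = sep @ basis                              # (T, L)
y = np.zeros((T - 1) * hop + L)
for t, seg in enumerate(segments):
    y[t * hop : t * hop + L] += seg

assert y.shape[0] == (T - 1) * hop + L              # reconstructed waveform length
```

Running this once per speaker's separated feature map yields one waveform per speaker, completing the separation.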
6. The method according to claim 5, characterized in that fully-connected networks 1 to 5 differ in structure and parameter values.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110471865.6A CN113113041B (en) | 2021-04-29 | 2021-04-29 | Voice separation method based on time-frequency cross-domain feature selection |
NL2029780A NL2029780B1 (en) | 2021-04-29 | 2021-11-17 | Speech separation method based on time-frequency cross-domain feature selection |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113113041A true CN113113041A (en) | 2021-07-13 |
CN113113041B CN113113041B (en) | 2022-10-11 |
Family
ID=76720916
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110471865.6A Expired - Fee Related CN113113041B (en) | 2021-04-29 | 2021-04-29 | Voice separation method based on time-frequency cross-domain feature selection |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113113041B (en) |
NL (1) | NL2029780B1 (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070021958A1 (en) * | 2005-07-22 | 2007-01-25 | Erik Visser | Robust separation of speech signals in a noisy environment |
CN107305774A (en) * | 2016-04-22 | 2017-10-31 | 腾讯科技(深圳)有限公司 | Speech detection method and device |
US20190066713A1 (en) * | 2016-06-14 | 2019-02-28 | The Trustees Of Columbia University In The City Of New York | Systems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments |
US20190139563A1 (en) * | 2017-11-06 | 2019-05-09 | Microsoft Technology Licensing, Llc | Multi-channel speech separation |
CN110459240A (en) * | 2019-08-12 | 2019-11-15 | 新疆大学 | The more speaker's speech separating methods clustered based on convolutional neural networks and depth |
CN110619887A (en) * | 2019-09-25 | 2019-12-27 | 电子科技大学 | Multi-speaker voice separation method based on convolutional neural network |
CN110808061A (en) * | 2019-11-11 | 2020-02-18 | 广州国音智能科技有限公司 | Voice separation method and device, mobile terminal and computer readable storage medium |
CN110970053A (en) * | 2019-12-04 | 2020-04-07 | 西北工业大学深圳研究院 | Multichannel speaker-independent voice separation method based on deep clustering |
CN111883166A (en) * | 2020-07-17 | 2020-11-03 | 北京百度网讯科技有限公司 | Voice signal processing method, device, equipment and storage medium |
CN112242149A (en) * | 2020-12-03 | 2021-01-19 | 北京声智科技有限公司 | Audio data processing method and device, earphone and computer readable storage medium |
CN112259120A (en) * | 2020-10-19 | 2021-01-22 | 成都明杰科技有限公司 | Single-channel human voice and background voice separation method based on convolution cyclic neural network |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210012767A1 (en) * | 2020-09-25 | 2021-01-14 | Intel Corporation | Real-time dynamic noise reduction using convolutional networks |
- 2021-04-29 CN CN202110471865.6A patent/CN113113041B/en not_active Expired - Fee Related
- 2021-11-17 NL NL2029780A patent/NL2029780B1/en active
Non-Patent Citations (3)
Title |
---|
LAN T: ""Deep Attractor with Convolutional Network for Monaural Speech Separation"", 《2020 11TH IEEE ANNUAL UBIQUITOUS COMPUTING, ELECTRONICS & MOBILE COMMUNICATION CONFERENCE (UEMCON)》 * |
LI M: ""Multi-layer Attention Mechanism Based Speech Separation Model"", 《2019 IEEE 19TH INTERNATIONAL CONFERENCE ON COMMUNICATION TECHNOLOGY (ICCT)》 * |
蓝天: ""采用上下文相关的注意力机制及循环神经网络的语音增强方法"", 《声学学报》 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113539292A (en) * | 2021-07-28 | 2021-10-22 | 联想(北京)有限公司 | Voice separation method and device |
CN113555031A (en) * | 2021-07-30 | 2021-10-26 | 北京达佳互联信息技术有限公司 | Training method and device of voice enhancement model and voice enhancement method and device |
CN113555031B (en) * | 2021-07-30 | 2024-02-23 | 北京达佳互联信息技术有限公司 | Training method and device of voice enhancement model, and voice enhancement method and device |
Also Published As
Publication number | Publication date |
---|---|
NL2029780B1 (en) | 2023-03-14 |
CN113113041B (en) | 2022-10-11 |
NL2029780A (en) | 2022-11-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109671442B (en) | Many-to-many speaker conversion method based on STARGAN and x vectors | |
CN111243620A (en) | Voice separation model training method and device, storage medium and computer equipment | |
CN113113041B (en) | Voice separation method based on time-frequency cross-domain feature selection | |
CN112071330B (en) | Audio data processing method and device and computer readable storage medium | |
CN105872855A (en) | Labeling method and device for video files | |
CN112989107B (en) | Audio classification and separation method and device, electronic equipment and storage medium | |
CN110675891A (en) | Voice separation method and module based on multilayer attention mechanism | |
CN116092501B (en) | Speech enhancement method, speech recognition method, speaker recognition method and speaker recognition system | |
CN112633175A (en) | Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment | |
Shi et al. | End-to-End Monaural Speech Separation with Multi-Scale Dynamic Weighted Gated Dilated Convolutional Pyramid Network. | |
CN115602165A (en) | Digital staff intelligent system based on financial system | |
CN115101085A (en) | Multi-speaker time-domain voice separation method for enhancing external attention through convolution | |
CN112382308A (en) | Zero-order voice conversion system and method based on deep learning and simple acoustic features | |
CN114141237A (en) | Speech recognition method, speech recognition device, computer equipment and storage medium | |
CN106875944A (en) | A kind of system of Voice command home intelligent terminal | |
Routray et al. | Deep-sound field analysis for upscaling ambisonic signals | |
CN113593588A (en) | Multi-singer singing voice synthesis method and system based on generation countermeasure network | |
CN116612779A (en) | Single-channel voice separation method based on deep learning | |
CN116682463A (en) | Multi-mode emotion recognition method and system | |
CN116959468A (en) | Voice enhancement method, system and equipment based on DCCTN network model | |
CN115881156A (en) | Multi-scale-based multi-modal time domain voice separation method | |
CN115910091A (en) | Method and device for separating generated voice by introducing fundamental frequency clues | |
CN113488069A (en) | Method and device for quickly extracting high-dimensional voice features based on generative countermeasure network | |
Xu et al. | Speaker-Aware Monaural Speech Separation. | |
Wu et al. | Stacked sparse autoencoder for audio object coding |
Legal Events
Code | Title | Description
---|---|---
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20221011