CN113113041A - Voice separation method based on time-frequency cross-domain feature selection - Google Patents

Voice separation method based on time-frequency cross-domain feature selection

Info

Publication number
CN113113041A
Authority
CN
China
Prior art keywords: time, domain, feature, voice, cross
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110471865.6A
Other languages
Chinese (zh)
Other versions
CN113113041B (en)
Inventor
蓝天
刘峤
吴祖峰
钱宇欣
吕忆蓝
李佳佳
冯雨佳
陈聪
康宏博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202110471865.6A (granted as CN113113041B)
Publication of CN113113041A
Priority to NL2029780A (granted as NL2029780B1)
Application granted
Publication of CN113113041B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques using predictive techniques
    • G10L19/16 Vocoder architecture
    • G10L19/18 Vocoders using multiple modes
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G10L21/0232 Processing in the frequency domain
    • G10L21/0272 Voice signal separating
    • G10L2021/02087 Noise filtering, the noise being separate speech, e.g. cocktail party
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Stereophonic System (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a voice separation method based on time-frequency cross-domain feature selection, belonging to the field of single-channel voice separation. The method comprises the following steps: encoding the single-channel voice with a one-dimensional convolutional neural network and with a short-time Fourier transform, respectively; fusing the feature maps obtained by the two encodings; computing a mask for each speaker with a separation network from the fused feature map; applying the obtained masks to the fused feature map to obtain a separation feature map for each speaker; and recovering the voice of each speaker from that speaker's separation feature map. The invention performs feature selection during voice separation so that time-domain and frequency-domain features complement each other, allowing the characteristic signal of each voice to be captured accurately in a multi-speaker noise environment. This addresses the insufficient exploitation of time-domain and time-frequency-domain features in existing systems and the poor separation previously obtained in non-stationary noise environments.

Description

Voice separation method based on time-frequency cross-domain feature selection
Technical Field
The invention belongs to the field of single-channel (monaural) voice separation, and particularly relates to a voice separation method based on time-frequency cross-domain feature selection.
Background
Speech separation technology is a branch of the natural language processing field that addresses the problem that useful speech information cannot be identified in a multi-speaker noise environment. The goal of speech separation is to separate the target speech from background interference.
With the development of deep learning, many new neural-network-based algorithms have emerged, and Deep Clustering (DC) and Permutation Invariant Training (PIT) have surpassed traditional methods. Building on deep clustering and permutation invariant training, the deep attractor network (DANet) achieved unprecedented success by using an attractor mechanism to estimate a mask for each speech source. Unlike separation networks that take the magnitude spectrum as the feature input, the time-domain audio separation network (TasNet) and the fully convolutional time-domain audio separation network (Conv-TasNet) use the time-domain signal as the network input, and they are among the most prominent models at present. The core idea of TasNet is to capture features of the time-domain signal with a one-dimensional convolution instead of a fixed transform such as the short-time Fourier transform, which is not optimal for the separation task; training the network end to end yields a convolutional encoder optimized for capturing time-domain signal features.
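For illustration only, the following minimal sketch (assuming Python with PyTorch; the window, hop and channel sizes are arbitrary choices for the example, not values from TasNet or from the present invention) contrasts the fixed short-time Fourier analysis with a learnable one-dimensional convolutional encoder applied to the same mono mixture:

```python
import torch
import torch.nn as nn

# one second of 16 kHz single-channel audio: (batch, channel, samples)
mixture = torch.randn(1, 1, 16000)

# fixed transform: magnitude spectrogram via the short-time Fourier transform
spec = torch.stft(mixture.squeeze(1), 512, hop_length=128,
                  window=torch.hann_window(512), center=False, return_complex=True)
magnitude = spec.abs()                               # (batch, freq_bins, frames)

# learnable transform: a 1-D convolution trained end to end with the separator
conv_encoder = nn.Conv1d(in_channels=1, out_channels=256, kernel_size=512, stride=128)
features = torch.relu(conv_encoder(mixture))         # (batch, 256, frames)
```

When the convolutional encoder is trained jointly with the separation objective, its filters can adapt to the task, which is the property the time-domain models above exploit.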
Building on these algorithms that operate directly on time-domain signal features, researchers have proposed a combined approach that embeds and clusters time-domain and frequency-domain features jointly. In the encoding stage, the features of the two domains, namely the convolution-extracted time-domain features and the Fourier-transformed magnitude spectrum, are computed in parallel and concatenated along the channel dimension. A separation network maps these features to a high-dimensional embedding space, and an attractor mechanism then generates a mask for each speech source. In the decoder, the two masked feature maps from the different domains are passed through a transposed convolution and an inverse Fourier transform, respectively, to obtain the processed speech signals. Experiments show that concatenating time-domain and frequency-domain features performs better than using time-domain features alone.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a voice separation method based on time-frequency cross-domain feature selection. The method is implemented with a voice separation network based on time-frequency cross-domain feature selection, which mainly comprises a feature encoder, a voice separator and a decoder, the feature encoder being a voice time-frequency-domain cross-domain feature encoder.
the method comprises the following steps:
step 1, sampling the voices of a plurality of speakers with a single recording device to obtain a single-channel voice containing the voices of the plurality of speakers, and encoding the single-channel voice with a one-dimensional convolutional neural network and with a short-time Fourier transform, respectively, in the voice time-frequency-domain cross-domain feature encoder, to obtain a feature map encoded with the one-dimensional convolutional neural network and a feature map encoded with the short-time Fourier transform;
step 2, performing cross-domain feature fusion on the feature map encoded with the one-dimensional convolutional neural network and the feature map encoded with the short-time Fourier transform, using a time-frequency cross-domain feature selection method, to obtain a cross-domain fused feature map;
step 3, computing a mask for each speaker in the single-channel voice with the voice separator, based on the cross-domain fused feature map, and applying the masks to the cross-domain fused feature map to obtain a separation feature map for each speaker in the single-channel voice;
and step 4, reconstructing the voice signal with a one-dimensional transposed convolutional neural network in the decoder, based on the separation feature map of each speaker in the single-channel voice, to finally obtain the voice of each speaker in the single-channel voice.
Further, the specific process of respectively encoding the single-channel voice with the one-dimensional convolutional neural network and the short-time Fourier transform in the voice time-frequency-domain cross-domain feature encoder in step 1 includes the following steps:
step 1-1: calculating the single-channel voice by using a one-dimensional convolution neural network 1 in a voice time-frequency domain cross-domain feature encoder to obtain a time-domain feature map which is used as a feature map coded by using the one-dimensional convolution neural network;
step 1-2: calculating the single-channel voice by using short-time Fourier transform in a voice time-frequency domain cross-domain feature encoder to obtain a magnitude spectrum as a time-frequency domain feature map;
step 1-3: linearly transforming the time-frequency domain characteristic diagram to the same characteristic dimension as the time-domain characteristic diagram by using a full-connection network 1 in a voice time-frequency domain cross-domain characteristic encoder to obtain a transformed time-frequency domain characteristic diagram, carrying out nonlinear transformation on the transformed time-frequency domain characteristic diagram by using a one-dimensional convolutional neural network 2 in the voice time-frequency domain cross-domain characteristic encoder to obtain a time-frequency domain characteristic diagram after nonlinear transformation, and taking the time-frequency domain characteristic diagram after nonlinear transformation as a characteristic diagram coded by using short-time Fourier transform;
the number of input and output channels of the one-dimensional convolutional neural network 1 is different from that of the one-dimensional convolutional neural network 2.
Further, the specific process of performing cross-domain feature fusion in step 2 includes the following steps:
step 2-1: corresponding elements of the time-frequency domain characteristic diagram and the time-domain characteristic diagram after the nonlinear transformation are added to obtain a sum characteristic diagram;
step 2-2: based on the addition feature map, performing average calculation on each feature channel along a time dimension by adopting global pooling to obtain a global feature descriptor, wherein the number of channels of the global feature descriptor is the same as that of the addition feature map;
step 2-3: compressing the global feature descriptor obtained in the step 2-2 by adopting a full-connection network 2 in the voice time-frequency domain cross-domain feature encoder, reducing the feature dimension of the global feature descriptor and obtaining a compressed feature descriptor;
step 2-4: respectively expanding the compressed feature descriptor by using a fully-connected network 3 and a fully-connected network 4 in the voice time-frequency domain cross-domain feature encoder, restoring it to the feature dimension of the global feature descriptor, to obtain a time-domain feature descriptor and a time-frequency-domain feature descriptor respectively, wherein the fully-connected network 3 and the fully-connected network 4 have the same number of parameters but different parameter values;
step 2-5: multiplying the time-domain feature descriptor onto the time-domain feature map in the form of corresponding-element multiplication, and multiplying the time-frequency-domain feature descriptor onto the nonlinearly transformed time-frequency-domain feature map in the form of corresponding-element multiplication, completing cross-domain feature selection and obtaining two feature maps after cross-domain feature selection, and finally adding the two cross-domain-selected feature maps by corresponding elements, completing cross-domain feature fusion and obtaining a cross-domain fused feature map.
Further, the step 3 specifically includes the following steps:
step 3-1: adopting a stacked convolutional neural network in a voice separator to further extract the characteristics of the cross-domain fusion characteristic diagram to obtain a transformed cross-domain fusion characteristic diagram;
step 3-2: adopting a fully-connected network 5 in the voice separator to raise the dimensionality of the transformed cross-domain fused feature map, transforming it into a tensor with the three dimensions of time, feature and embedding;
step 3-3: calculating the mask of each speaker in the single-channel voice by adopting an attractor mechanism, based on the transformed tensor with the three dimensions of time, feature and embedding;
step 3-4: and (3) multiplying the mask of each speaker in the single-channel voice by the cross-domain fusion feature map obtained in the step (2-5) in a form of multiplying corresponding elements respectively to obtain a separation feature map of each speaker in the single-channel voice.
Further, the speech signal reconstruction in step 4 specifically includes:
a one-dimensional transposition convolution neural network in a decoder is adopted to convert the separation characteristic diagram of each speaker in the single-channel voice into a voice signal of the corresponding speaker, so that voice separation is completed.
The fully connected networks 1-5 are all different in structure and parameter values.
The invention performs feature selection during voice separation: through cross-domain mixing, the time-domain and frequency-domain features complement each other, so that the characteristic signal of each voice can be captured accurately in a multi-speaker noise environment. This overcomes the insufficient exploitation of time-domain and time-frequency-domain features in existing systems, and the novel feature encoder constructed here can effectively extract the useful information in the speech. High-dimensional features of the speech signal are extracted by the stacked convolutional neural network, speaker-specific signal clusters are constructed by the deep clustering of the attractor sub-network, and the signals of non-corresponding speakers are filtered adaptively, which improves the robustness of the model and solves the prior-art problem of poor separation in non-stationary noise environments.
Drawings
Fig. 1 is a schematic diagram of an overall structure of a voice separation network based on time-frequency cross-domain feature selection.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings.
The invention provides a voice separation method based on time-frequency cross-domain feature selection, which is mainly used for the voice separation problem in a multi-speaker environment and can effectively mix time-frequency domain features.
As shown in fig. 1, the present invention is implemented by using a voice separation network based on time-frequency cross-domain feature selection, and the voice separation network based on time-frequency cross-domain feature selection mainly includes three parts, namely, a feature encoder, a voice separator, and a decoder.
The feature encoder is a voice time-frequency-domain cross-domain feature encoder. It extracts a time-domain feature map and a time-frequency-domain feature map from the single-channel voice containing several speakers' voices, using one-dimensional convolutional neural network 1 and a short-time Fourier transform respectively; the time-domain feature map serves as the feature map encoded with the one-dimensional convolutional neural network. Fully connected network 1 in the encoder linearly transforms the time-frequency-domain feature map to the same feature dimension as the time-domain feature map, giving the transformed time-frequency-domain feature map; one-dimensional convolutional neural network 2 in the encoder then applies a nonlinear transformation to it, and the resulting nonlinearly transformed time-frequency-domain feature map serves as the feature map encoded with the short-time Fourier transform. The nonlinearly transformed time-frequency-domain feature map and the time-domain feature map are added element by element to obtain the summed feature map. The summed feature map is passed through global pooling and through the compression and restoration of three fully connected networks (fully connected network 2, fully connected network 3 and fully connected network 4) to obtain the time-domain feature descriptor and the time-frequency-domain feature descriptor. The time-domain feature descriptor is multiplied element-wise onto the time-domain feature map, and the time-frequency-domain feature descriptor is multiplied element-wise onto the nonlinearly transformed time-frequency-domain feature map, giving two cross-domain-selected feature maps; finally, the two cross-domain-selected feature maps are added element by element to obtain the cross-domain fused feature map.
Given a mixed speech signal x containing n speakers (i.e., a single-channel voice containing n speakers' voices), where n is a positive integer greater than 1, the time-frequency-domain feature map is obtained from x by the short-time Fourier transform, and the time-domain feature map is obtained from x by the one-dimensional convolution of one-dimensional convolutional neural network 1:
$F_{spec} = S(x), \quad F_{conv} = F(x)$

where $S(\cdot)$ denotes the short-time Fourier transform operation, $F(\cdot)$ the one-dimensional convolution operation, $F_{spec}$ the time-frequency-domain feature map obtained with the short-time Fourier transform, and $F_{conv}$ the time-domain feature map obtained with the one-dimensional convolutional neural network encoding.
$F_{spec}$ is then linearly transformed by fully connected network 1 to the same feature dimension as the time-domain feature map, giving the transformed time-frequency-domain feature map; one-dimensional convolutional neural network 2, with convolution kernel size 3, then applies a nonlinear transformation that encodes it into the same latent representation space as the time-domain features, yielding the nonlinearly transformed time-frequency-domain feature map $\hat{F}_{spec}$.
The number of input and output channels of the one-dimensional convolutional neural network 1 is different from that of the one-dimensional convolutional neural network 2.
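As a non-limiting illustration of the encoder just described, the following sketch (PyTorch assumed; the FFT size, hop length and channel count are assumptions, and the member names conv1, fc1 and conv2 simply mirror one-dimensional convolutional neural network 1, fully connected network 1 and one-dimensional convolutional neural network 2 in the text) produces the time-domain feature map and the nonlinearly transformed time-frequency-domain feature map:

```python
import torch
import torch.nn as nn

class CrossDomainEncoder(nn.Module):
    """Sketch of the cross-domain feature encoder; sizes are illustrative assumptions."""
    def __init__(self, n_fft=512, hop=128, channels=256):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        # one-dimensional convolutional neural network 1: time-domain feature map F_conv
        self.conv1 = nn.Conv1d(1, channels, kernel_size=n_fft, stride=hop)
        # fully connected network 1: maps the magnitude spectrum to the same feature dimension
        self.fc1 = nn.Linear(n_fft // 2 + 1, channels)
        # one-dimensional convolutional neural network 2 (kernel size 3): nonlinear transform
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):                                  # x: (batch, 1, samples)
        f_conv = torch.relu(self.conv1(x))                 # time-domain feature map (batch, C, T)
        spec = torch.stft(x.squeeze(1), self.n_fft, self.hop,
                          window=torch.hann_window(self.n_fft, device=x.device),
                          center=False, return_complex=True)
        mag = spec.abs().transpose(1, 2)                   # magnitude spectrum (batch, frames, freq_bins)
        f_spec = self.fc1(mag).transpose(1, 2)             # linear transform to (batch, C, frames)
        f_spec_hat = torch.relu(self.conv2(f_spec))        # nonlinearly transformed F_spec
        return f_conv, f_spec_hat
```

With center=False, the short-time Fourier transform and the strided convolution produce the same number of frames, so the two feature maps can later be added element by element.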
The nonlinearly transformed time-frequency-domain feature map $\hat{F}_{spec}$ and the time-domain feature map $F_{conv}$ are added element by element to obtain the summed feature map $U$:

$U = F_{conv} \oplus \hat{F}_{spec}$

where $\oplus$ denotes the corresponding-element addition operation.
Global pooling of the summed feature map yields the global feature descriptor $g \in \mathbb{R}^{C}$, where $C$ is the total number of feature channels, i.e. the feature dimension; $g$ is used to compute the time-domain and time-frequency-domain feature descriptors. With $T$ the length of the time dimension and $U_t$ the summed feature map at time $t$:

$g = \frac{1}{T}\sum_{t=1}^{T} U_t$
The global feature descriptor $g \in \mathbb{R}^{C}$ is compressed by fully connected network 2 into the compressed feature descriptor $z \in \mathbb{R}^{m}$, where $m$ is the compressed feature dimension; the compressed feature descriptor is used to guide feature selection. The computation is as follows, where $N$ denotes the operation of fully connected network 2, $\delta$ the sigmoid activation function, and $W$ the weight matrix of fully connected network 2:

$z = \delta(N(Wg))$
Two fully connected layers (fully connected network 3 and fully connected network 4) then expand the compressed feature descriptor back to the feature dimension $C$, giving the time-domain feature descriptor $a \in \mathbb{R}^{C}$ and the time-frequency-domain feature descriptor $b \in \mathbb{R}^{C}$. Here $a_j$ is the feature-selection value of the $j$-th channel of the time-domain feature map and $b_j$ the feature-selection value of the $j$-th channel of the nonlinearly transformed time-frequency-domain feature map $\hat{F}_{spec}$, for $j = 1, 2, \ldots, C$:

$a_j = \frac{e^{A_j z}}{e^{A_j z} + e^{B_j z}}, \qquad b_j = \frac{e^{B_j z}}{e^{A_j z} + e^{B_j z}}$

where $A$ and $B$ are the weight matrices of fully connected network 3 and fully connected network 4 respectively ($C$ again denoting the total number of feature channels, i.e. the feature dimension), $A_j$ and $B_j$ are the $j$-th rows of the two weight matrices, and $e$ is the natural base.
The cross-domain fused feature map (i.e. the cross-domain selected feature map) $H$ is calculated as:

$H = (a \odot F_{conv}) \oplus (b \odot \hat{F}_{spec})$

where $\odot$ denotes the corresponding-element multiplication operation.
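The feature-selection and fusion computation above can be sketched as follows (PyTorch assumed; the compressed dimension m = 64 is an assumption not fixed by the description, and the member names fc2, fc3 and fc4 mirror fully connected networks 2, 3 and 4 in the text):

```python
import torch
import torch.nn as nn

class CrossDomainSelection(nn.Module):
    """Sketch of the cross-domain feature selection and fusion; m is an assumed value."""
    def __init__(self, channels=256, m=64):
        super().__init__()
        self.fc2 = nn.Linear(channels, m)      # compression to the descriptor z
        self.fc3 = nn.Linear(m, channels)      # time-domain branch (rows A_j)
        self.fc4 = nn.Linear(m, channels)      # time-frequency-domain branch (rows B_j)

    def forward(self, f_conv, f_spec_hat):     # both (batch, C, T)
        u = f_conv + f_spec_hat                           # summed feature map U
        g = u.mean(dim=-1)                                # global pooling over time -> (batch, C)
        z = torch.sigmoid(self.fc2(g))                    # compressed feature descriptor z
        logits = torch.stack([self.fc3(z), self.fc4(z)])  # (2, batch, C)
        weights = torch.softmax(logits, dim=0)            # a_j and b_j sum to 1 per channel
        a, b = weights[0].unsqueeze(-1), weights[1].unsqueeze(-1)
        return a * f_conv + b * f_spec_hat                # cross-domain fused feature map H
```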
The voice separator further extracts features from the cross-domain fused feature map with a stacked convolutional neural network, giving the transformed cross-domain fused feature map. Fully connected network 5 then extracts high-dimensional cross-domain fused features from it (a tensor with the three dimensions of time, feature and embedding). Based on these high-dimensional cross-domain fused features, an attractor mechanism produces a mask for each speaker in the single-channel voice, and each speaker's mask is multiplied element-wise with the cross-domain fused feature map to obtain that speaker's separation feature map.
The cross-domain fused feature map $H$ is further processed by a stacked convolutional neural network formed by stacking 8 to 32 one-dimensional convolutional neural networks, giving the transformed cross-domain fused feature map, which fully connected network 5 then maps to a high-dimensional space:

$V = W_{emb} T_s(H)$

where $V$ is a tensor with the three dimensions of time, feature and embedding, $C$ denotes the feature dimension, $T$ the length of the time dimension, and $D$ the embedding dimension, with $D = 20$; $W_{emb}$ is the weight of fully connected network 5, and $T_s(\cdot)$ is the operation of the stacked convolutional neural network.
A mask is then obtained for each speaker in the single-channel voice through the attractor mechanism. Let $M_i$ denote the mask of the $i$-th speaker produced by the voice separator, $i = 1, 2, \ldots, n$. Each speaker's mask is multiplied element-wise with the cross-domain fused feature map $H$ to obtain $H_i$, the separation feature map of the $i$-th speaker after separation with the mask:

$H_i = M_i \odot H$
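A sketch of the separator stage is given below (PyTorch assumed). The stacked network is realised here with dilated one-dimensional convolutions, and the attractor mechanism is replaced by a plain linear mask head followed by a softmax, so the sketch illustrates only the tensor shapes (time, feature, embedding with D = 20) and the element-wise application of the masks M_i to H, not the attractor-based deep clustering of the invention:

```python
import torch
import torch.nn as nn

class Separator(nn.Module):
    """Sketch of the separator; the attractor mechanism is replaced by a linear mask head."""
    def __init__(self, channels=256, depth=8, emb_dim=20, n_speakers=2):
        super().__init__()
        blocks = []
        for i in range(depth):                            # stacked one-dimensional convolutions
            blocks += [nn.Conv1d(channels, channels, kernel_size=3,
                                 padding=2 ** i, dilation=2 ** i), nn.PReLU()]
        self.stack = nn.Sequential(*blocks)
        self.fc5 = nn.Linear(channels, channels * emb_dim)  # fully connected network 5
        self.mask_head = nn.Linear(emb_dim, n_speakers)     # stand-in for the attractor mechanism
        self.channels, self.emb_dim = channels, emb_dim

    def forward(self, h):                                 # h: (batch, C, T), fused feature map
        feats = self.stack(h).transpose(1, 2)             # transformed feature map (batch, T, C)
        v = self.fc5(feats).view(h.size(0), -1, self.channels, self.emb_dim)  # (batch, T, C, D)
        masks = torch.softmax(self.mask_head(v), dim=-1)  # per-speaker masks (batch, T, C, n_spk)
        masks = masks.permute(3, 0, 2, 1)                 # (n_speakers, batch, C, T)
        return [m * h for m in masks]                     # separation feature maps M_i * H
```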
the decoder uses a one-dimensional transposition convolution neural network to reconstruct a voice signal, and converts the separation characteristic diagram of each speaker in the single-channel voice into the voice signal of the corresponding speaker, thereby completing voice separation.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (6)

1. A voice separation method based on time-frequency cross-domain feature selection is characterized in that the method is realized by adopting a voice separation network based on time-frequency cross-domain feature selection, the voice separation network based on the time-frequency cross-domain feature selection mainly comprises a feature encoder, a voice separator and a decoder, and the feature encoder is a voice time-frequency domain cross-domain feature encoder; the method comprises the following steps:
step 1: sampling the voices of n speakers by using a single recording device to obtain mono voices containing the voices of the n speakers, and respectively encoding the mono voices by using a one-dimensional convolutional neural network and short-time Fourier transform in a voice time-frequency domain cross-domain feature encoder to obtain a feature map encoded by using the one-dimensional convolutional neural network and a feature map encoded by using the short-time Fourier transform, wherein n is a positive integer greater than 1;
step 2: performing cross-domain feature fusion on the obtained feature graph coded in the mode of using the one-dimensional convolutional neural network and the feature graph coded in the mode of using the short-time Fourier transform by adopting a time-frequency cross-domain feature selection method to obtain a cross-domain fused feature graph;
and step 3: according to the feature diagram of cross-domain fusion, a voice separator is adopted to calculate a mask for each speaker in the single-channel voice, and the mask is acted on the feature diagram of cross-domain fusion to obtain a separation feature diagram of each speaker in the single-channel voice;
and 4, step 4: based on the separation characteristic diagram of each speaker in the single-channel voice, a one-dimensional transposition convolution neural network in a decoder is adopted to reconstruct the voice signal, and finally the voice of each speaker in the single-channel voice is obtained.
2. The method for separating speech based on time-frequency cross-domain feature selection according to claim 1, wherein the specific process of respectively encoding the mono speech by using the one-dimensional convolutional neural network and the short-time fourier transform in the speech time-frequency domain cross-domain feature encoder in step 1 comprises the following steps:
step 1-1: calculating the single-channel voice by using a one-dimensional convolution neural network 1 in a voice time-frequency domain cross-domain feature encoder to obtain a time-domain feature map which is used as a feature map coded by using the one-dimensional convolution neural network;
step 1-2: calculating the single-channel voice by using short-time Fourier transform in a voice time-frequency domain cross-domain feature encoder to obtain a magnitude spectrum as a time-frequency domain feature map;
step 1-3: linearly transforming the time-frequency domain characteristic diagram to the same characteristic dimension as the time-domain characteristic diagram by using a full-connection network 1 in a voice time-frequency domain cross-domain characteristic encoder to obtain a transformed time-frequency domain characteristic diagram, carrying out nonlinear transformation on the transformed time-frequency domain characteristic diagram by using a one-dimensional convolutional neural network 2 in the voice time-frequency domain cross-domain characteristic encoder to obtain a time-frequency domain characteristic diagram after nonlinear transformation, and taking the time-frequency domain characteristic diagram after nonlinear transformation as a characteristic diagram coded by using short-time Fourier transform;
the number of input and output channels of the one-dimensional convolutional neural network 1 is different from that of the one-dimensional convolutional neural network 2.
3. The method for separating speech based on time-frequency cross-domain feature selection according to claim 2, wherein the specific process of performing cross-domain feature fusion in step 2 comprises the following steps:
step 2-1: corresponding elements of the time-frequency domain characteristic diagram and the time-domain characteristic diagram after the nonlinear transformation are added to obtain a sum characteristic diagram;
step 2-2: based on the addition feature map, performing average calculation on each feature channel along a time dimension by adopting global pooling to obtain a global feature descriptor, wherein the number of channels of the global feature descriptor is the same as that of the addition feature map;
step 2-3: compressing the global feature descriptor obtained in the step 2-2 by adopting a full-connection network 2 in the voice time-frequency domain cross-domain feature encoder, reducing the feature dimension of the global feature descriptor and obtaining a compressed feature descriptor;
step 2-4: respectively expanding the compressed feature descriptor by using a fully-connected network 3 and a fully-connected network 4 in the voice time-frequency domain cross-domain feature encoder, restoring it to the feature dimension of the global feature descriptor, to obtain a time-domain feature descriptor and a time-frequency-domain feature descriptor respectively, wherein the fully-connected network 3 and the fully-connected network 4 have the same number of parameters but different parameter values;
step 2-5: multiplying the time-domain feature descriptor onto the time-domain feature map in the form of corresponding-element multiplication, and multiplying the time-frequency-domain feature descriptor onto the nonlinearly transformed time-frequency-domain feature map in the form of corresponding-element multiplication, completing cross-domain feature selection and obtaining two feature maps after cross-domain feature selection, and finally adding the two cross-domain-selected feature maps by corresponding elements, completing cross-domain feature fusion and obtaining a cross-domain fused feature map.
4. The method for separating speech based on time-frequency cross-domain feature selection according to claim 3, wherein the step 3 specifically comprises the following steps:
step 3-1: adopting a stacked convolutional neural network in a voice separator to further extract the characteristics of the cross-domain fusion characteristic diagram to obtain a transformed cross-domain fusion characteristic diagram;
step 3-2: adopting a fully-connected network 5 in the voice separator to raise the dimensionality of the transformed cross-domain fused feature map, transforming it into a tensor with the three dimensions of time, feature and embedding;
step 3-3: calculating the mask of each speaker in the single-channel voice by adopting an attractor mechanism, based on the transformed tensor with the three dimensions of time, feature and embedding;
step 3-4: and (3) multiplying the mask of each speaker in the single-channel voice by the cross-domain fusion feature map obtained in the step (2-5) in a form of multiplying corresponding elements respectively to obtain a separation feature map of each speaker in the single-channel voice.
5. The method for separating speech based on time-frequency cross-domain feature selection according to claim 4, wherein the speech signal reconstruction in step 4 specifically comprises:
a one-dimensional transposition convolution neural network in a decoder is adopted to convert the separation characteristic diagram of each speaker in the single-channel voice into a voice signal of the corresponding speaker, so that voice separation is completed.
6. The method according to claim 5, wherein the structure and parameter values of the fully connected networks 1-5 are different.
CN202110471865.6A 2021-04-29 2021-04-29 Voice separation method based on time-frequency cross-domain feature selection Expired - Fee Related CN113113041B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110471865.6A CN113113041B (en) 2021-04-29 2021-04-29 Voice separation method based on time-frequency cross-domain feature selection
NL2029780A NL2029780B1 (en) 2021-04-29 2021-11-17 Speech separation method based on time-frequency cross-domain feature selection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110471865.6A CN113113041B (en) 2021-04-29 2021-04-29 Voice separation method based on time-frequency cross-domain feature selection

Publications (2)

Publication Number Publication Date
CN113113041A true CN113113041A (en) 2021-07-13
CN113113041B CN113113041B (en) 2022-10-11

Family

ID=76720916

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110471865.6A Expired - Fee Related CN113113041B (en) 2021-04-29 2021-04-29 Voice separation method based on time-frequency cross-domain feature selection

Country Status (2)

Country Link
CN (1) CN113113041B (en)
NL (1) NL2029780B1 (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210012767A1 (en) * 2020-09-25 2021-01-14 Intel Corporation Real-time dynamic noise reduction using convolutional networks

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070021958A1 (en) * 2005-07-22 2007-01-25 Erik Visser Robust separation of speech signals in a noisy environment
CN107305774A (en) * 2016-04-22 2017-10-31 腾讯科技(深圳)有限公司 Speech detection method and device
US20190066713A1 (en) * 2016-06-14 2019-02-28 The Trustees Of Columbia University In The City Of New York Systems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments
US20190139563A1 (en) * 2017-11-06 2019-05-09 Microsoft Technology Licensing, Llc Multi-channel speech separation
CN110459240A (en) * 2019-08-12 2019-11-15 新疆大学 The more speaker's speech separating methods clustered based on convolutional neural networks and depth
CN110619887A (en) * 2019-09-25 2019-12-27 电子科技大学 Multi-speaker voice separation method based on convolutional neural network
CN110808061A (en) * 2019-11-11 2020-02-18 广州国音智能科技有限公司 Voice separation method and device, mobile terminal and computer readable storage medium
CN110970053A (en) * 2019-12-04 2020-04-07 西北工业大学深圳研究院 Multichannel speaker-independent voice separation method based on deep clustering
CN111883166A (en) * 2020-07-17 2020-11-03 北京百度网讯科技有限公司 Voice signal processing method, device, equipment and storage medium
CN112259120A (en) * 2020-10-19 2021-01-22 成都明杰科技有限公司 Single-channel human voice and background voice separation method based on convolution cyclic neural network
CN112242149A (en) * 2020-12-03 2021-01-19 北京声智科技有限公司 Audio data processing method and device, earphone and computer readable storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LAN T., "Deep Attractor with Convolutional Network for Monaural Speech Separation", 2020 11th IEEE Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON). *
LI M., "Multi-layer Attention Mechanism Based Speech Separation Model", 2019 IEEE 19th International Conference on Communication Technology (ICCT). *
LAN Tian (蓝天), "A speech enhancement method using a context-dependent attention mechanism and recurrent neural networks", Acta Acustica (《声学学报》). *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113539292A (en) * 2021-07-28 2021-10-22 联想(北京)有限公司 Voice separation method and device
CN113555031A (en) * 2021-07-30 2021-10-26 北京达佳互联信息技术有限公司 Training method and device of voice enhancement model and voice enhancement method and device
CN113555031B (en) * 2021-07-30 2024-02-23 北京达佳互联信息技术有限公司 Training method and device of voice enhancement model, and voice enhancement method and device

Also Published As

Publication number Publication date
NL2029780B1 (en) 2023-03-14
CN113113041B (en) 2022-10-11
NL2029780A (en) 2022-11-15

Similar Documents

Publication Publication Date Title
CN109671442B (en) Many-to-many speaker conversion method based on STARGAN and x vectors
CN111243620A (en) Voice separation model training method and device, storage medium and computer equipment
CN113113041B (en) Voice separation method based on time-frequency cross-domain feature selection
CN112071330B (en) Audio data processing method and device and computer readable storage medium
CN105872855A (en) Labeling method and device for video files
CN112989107B (en) Audio classification and separation method and device, electronic equipment and storage medium
CN110675891A (en) Voice separation method and module based on multilayer attention mechanism
CN116092501B (en) Speech enhancement method, speech recognition method, speaker recognition method and speaker recognition system
CN112633175A (en) Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment
Shi et al. End-to-End Monaural Speech Separation with Multi-Scale Dynamic Weighted Gated Dilated Convolutional Pyramid Network.
CN115602165A (en) Digital staff intelligent system based on financial system
CN115101085A (en) Multi-speaker time-domain voice separation method for enhancing external attention through convolution
CN112382308A (en) Zero-order voice conversion system and method based on deep learning and simple acoustic features
CN114141237A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN106875944A (en) A kind of system of Voice command home intelligent terminal
Routray et al. Deep-sound field analysis for upscaling ambisonic signals
CN113593588A (en) Multi-singer singing voice synthesis method and system based on generation countermeasure network
CN116612779A (en) Single-channel voice separation method based on deep learning
CN116682463A (en) Multi-mode emotion recognition method and system
CN116959468A (en) Voice enhancement method, system and equipment based on DCCTN network model
CN115881156A (en) Multi-scale-based multi-modal time domain voice separation method
CN115910091A (en) Method and device for separating generated voice by introducing fundamental frequency clues
CN113488069A (en) Method and device for quickly extracting high-dimensional voice features based on generative countermeasure network
Xu et al. Speaker-Aware Monaural Speech Separation.
Wu et al. Stacked sparse autoencoder for audio object coding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee (granted publication date: 2022-10-11)