CN111128209A - Speech enhancement method based on mixed masking learning target - Google Patents

Speech enhancement method based on mixed masking learning target

Info

Publication number
CN111128209A
CN111128209A (application CN201911385421.XA)
Authority
CN
China
Prior art keywords
dimensional
masking
learning target
mixed
voice signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911385421.XA
Other languages
Chinese (zh)
Other versions
CN111128209B (en)
Inventor
张涛
王泽宇
朱诚诚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201911385421.XA priority Critical patent/CN111128209B/en
Publication of CN111128209A publication Critical patent/CN111128209A/en
Application granted granted Critical
Publication of CN111128209B publication Critical patent/CN111128209B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique, using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)

Abstract

A speech enhancement method based on a mixed masking learning target: traditional feature extraction is performed on the speech signals, which includes dividing the acquired speech signals into a training set and a test set and extracting the traditional features of the training-set and test-set speech signals respectively; the amplitude spectrum characteristics of the STFT domain are extracted for the training-set and test-set speech signals; a deep stacked residual network is constructed; a learning target is constructed; the deep stacked residual network is trained with the extracted traditional features of the training set, the amplitude spectrum characteristics of the STFT domain and the learning target; and the extracted traditional features of the test set and the amplitude spectrum characteristics of the STFT domain are input into the trained deep stacked residual network to obtain the predicted learning target, the enhanced speech signal is obtained from the predicted learning target through the ISTFT, and the PESQ value of the enhanced speech is calculated. The invention retains no noise information in speech-dominated time-frequency units, reduces the amount of computation, and makes the neural network easier to train, thereby improving the intelligibility and quality of speech.

Description

Speech enhancement method based on mixed masking learning target
Technical Field
The invention relates to hybrid masking learning targets, and more particularly to a speech enhancement method based on a hybrid masking learning target.
Background
At present there are many speech enhancement methods based on deep learning, and the key technology mainly concerns three aspects: which features to extract, which model to adopt, and which target to learn. Like feature design, the study of the learning target is highly valuable: given the same training data, features and learning model, a better learning target allows the model to be trained better.
In a speech enhancement system using a supervised neural network, the learning target is generally computed from the background noise and the clean speech, and an effective learning target has an important influence on the learning capability of the speech enhancement model and on the generalization of the system.
The speech enhancement learning targets in current use fall mainly into two categories: training targets based on time-frequency masking and targets based on speech magnitude spectrum estimation. The former reflect the energy relationship between the clean speech signal and the background noise in the mixed signal, while the latter are the magnitude spectrum features of the clean target speech. Common time-frequency masking targets include the Ideal Binary Mask (IBM), the Ideal Ratio Mask (IRM, also called the ideal floating-value mask), the Target Binary Mask (TBM), and so on. The most commonly used learning targets are the ideal binary mask and the ideal ratio mask, but these two learning targets suffer, respectively, from drawbacks such as inaccurate prediction and poor generalization.
When the learning target is the IBM, the model only needs to classify each time-frequency unit as noise-dominated or target-speech-dominated (0 or 1), which means noise information is retained inside the speech-dominated time-frequency units, and this residual noise seriously degrades the intelligibility and quality of the speech. When the learning target is the IRM, the model must predict a coefficient in every time-frequency unit; in noise-dominated time-frequency units the extracted features cannot represent the target speech in that unit well, so it is difficult for the model to predict the coefficients of those units accurately.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a speech enhancement method based on a mixed masking learning target, which can improve the intelligibility and quality of speech.
The technical scheme adopted by the invention is as follows: a speech enhancement method based on a mixed masking learning objective comprises the following steps:
1) performing traditional feature extraction on the voice signals, wherein the traditional feature extraction comprises dividing the acquired voice signals into a training set and a test set, and respectively extracting traditional features of the voice signals of the training set and the test set;
2) respectively extracting the amplitude spectrum characteristics of the STFT domain of the voice signals of the training set and the test set;
3) constructing a deep stacked residual network;
4) constructing a learning target;
5) training the deep stacked residual network using the extracted traditional features of the training set, the amplitude spectrum characteristics of the STFT domain and the learning target;
6) inputting the extracted traditional features of the test set and the amplitude spectrum characteristics of the STFT domain into the trained deep stacked residual network to obtain the predicted learning target, obtaining the enhanced speech signal from the predicted learning target through the ISTFT, and calculating the PESQ value of the enhanced speech signal.
The speech enhancement method based on the mixed masking learning target combines the advantages of the ideal binary mask and ideal ratio mask learning targets. First, it ensures that no noise information is retained in speech-dominated time-frequency units. In noise-dominated time-frequency units the learning target is set directly to 0; although a small amount of speech information may be lost, the reduced data redundancy gives better performance than the IRM learning target and lowers the computational cost. Moreover, because the mixed mask contains zero-valued time-frequency units, the fitting ability and prediction accuracy of the neural network are further improved compared with the IRM learning target, and the network is easier to train, thereby improving the intelligibility and quality of speech.
Detailed Description
The following describes a speech enhancement method based on a hybrid masking learning objective in detail with reference to the embodiments.
The invention discloses a voice enhancement method based on a mixed masking learning target, which comprises the following steps:
1) performing traditional feature extraction on the voice signals, wherein the traditional feature extraction comprises dividing the acquired voice signals into a training set and a test set, and respectively extracting traditional features of the voice signals of the training set and the test set;
the method comprises the following steps: randomly extracting 1500 sections of voice from a training part of a TIMIT corpus, randomly mixing the 1500 sections of voice with 9 kinds of noise extracted from a NOISEX-92 corpus, generating 1500 sections of mixed voice signals to form a training set under a continuously changing signal-to-noise ratio of-5 dB, randomly selecting 500 sections of pure voice from a testing part of the TIMIT corpus, randomly mixing the 500 sections of pure voice with 15 kinds of voice extracted from the NOISEX-92 corpus, and generating 500 sections of mixed voice signals to form a testing set under 10-8-6-4-2-0-2-4-6-8 dB different signal-to-noise ratios.
The traditional feature extraction procedure is the same for the training-set and test-set speech signals and yields the following feature vectors:
(1) a 512-point short-time Fourier transform is applied to the mixed speech signal sampled at 16 kHz; a 31-dimensional MFCC feature vector is extracted using a Hamming window with a 20 ms frame length and a 10 ms frame shift, and the first-order derivative (delta) of the 31-dimensional MFCC feature vector is calculated;
(2) the 16 kHz mixed speech signal is full-wave rectified to extract its envelope and then downsampled by a factor of four; framing uses a Hamming window with a 32 ms frame length and a 10 ms frame shift, a 15-dimensional AMS feature vector is obtained using 15 triangular windows whose center frequencies are uniformly distributed over 15.6-400 Hz, and the first-order derivative of the 15-dimensional AMS feature vector is calculated;
(3) the 16 kHz mixed speech signal is decomposed with a 64-channel Gammatone filter bank; each decomposed output is sampled at a rate of 100 Hz and its amplitude is compressed by a cube-root operation, a 64-dimensional Gammatone feature vector is extracted, and the first-order derivative of the 64-dimensional Gammatone feature vector is calculated;
(4) the power spectrum of the 16 kHz mixed speech signal is mapped onto a 20-channel Bark scale with a trapezoidal filter bank; equal-loudness pre-emphasis is applied, followed by the intensity-loudness power law and a 12th-order linear prediction model, to obtain a 13-dimensional PLP feature vector, and the first-order derivative of the 13-dimensional PLP feature vector is calculated;
the 31-dimensional MFCC, 15-dimensional AMS, 64-dimensional Gammatone and 13-dimensional PLP feature vectors are concatenated into a 123-dimensional feature vector; their first-order derivatives are likewise concatenated into a second 123-dimensional feature vector; the two 123-dimensional feature vectors are then concatenated into a 246-dimensional feature vector;
finally, the zero-crossing rate, root-mean-square energy and spectral centroid features of the 16 kHz mixed speech signal are extracted and, together with the 246-dimensional feature vector, form a 269-dimensional feature vector, which is fed into a fireworks-algorithm feature selector for feature dimensionality reduction, with the number of initial fireworks N set to 400 and the feature-subset dimension M set to 50, 70 and 90.
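A partial sketch of this feature pipeline is given below; it covers the MFCCs with deltas, the zero-crossing rate, the RMS energy and the spectral centroid using librosa, while the AMS, Gammatone and PLP features and the fireworks-algorithm selector are omitted, and every parameter beyond those stated above is an assumption.

import numpy as np
import librosa

def traditional_features(mixture, sr=16000):
    """Partial traditional-feature sketch: 31-dim MFCCs with deltas plus
    zero-crossing rate, RMS energy and spectral centroid.

    Frame settings follow the description (20 ms Hamming window, 10 ms shift,
    512-point FFT); AMS, Gammatone and PLP features are not reproduced here.
    """
    frame, hop = int(0.020 * sr), int(0.010 * sr)   # 320 / 160 samples
    mfcc = librosa.feature.mfcc(y=mixture, sr=sr, n_mfcc=31, n_fft=512,
                                win_length=frame, hop_length=hop,
                                window='hamming')             # (31, frames)
    mfcc_delta = librosa.feature.delta(mfcc)                   # first-order derivative
    zcr = librosa.feature.zero_crossing_rate(mixture, frame_length=frame,
                                             hop_length=hop)   # (1, frames)
    rms = librosa.feature.rms(y=mixture, frame_length=frame, hop_length=hop)
    centroid = librosa.feature.spectral_centroid(y=mixture, sr=sr, n_fft=512,
                                                 win_length=frame, hop_length=hop,
                                                 window='hamming')
    return np.vstack([mfcc, mfcc_delta, zcr, rms, centroid]).T  # (frames, dims)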
2) Respectively extracting the amplitude spectrum characteristics of the STFT domain of the voice signals of the training set and the test set;
the method for extracting the amplitude spectrum characteristics of the STFT domain of the speech signals of the training set and the test set is the same as that of the STFT domain of the speech signals of the training set and the test set, and comprises the following steps: the method comprises the steps of carrying out short-time Fourier transform on a mixed voice signal with a sampling rate of 16kHz, framing the mixed voice signal by adopting a Hamming window with a frame length of 25ms and a frame shift of 10ms in the transformation process, adding the amplitude spectrums of two frames adjacent to the left and the right of a single frame when the amplitude spectrum of each single frame corresponding to the traditional characteristics is input, wherein the total number of the frames is 5, the dimensionality of the amplitude spectrum of each frame is 200, and obtaining the amplitude spectrum characteristics of an STFT domain with the input dimensionality of 1000.
3) Constructing a deep stacked residual network, wherein:
the deep stack residual error network comprises: the device comprises an input channel I, an input channel II and a full-connection residual error network module connected with the output end of the input channel I and the input channel II after being connected, wherein,
the input channel I: the convolution error network module is composed of three convolution layers and three normalization layers which are formed by combining the convolution error network module through error networks, the dimensionality of each convolution kernel is set to be 2, the step length of each convolution kernel is set to be 1, and a 0 complementing mode is adopted, the size of the convolution kernel of the first convolution layer from top to bottom in the three convolution layers is 1 x 1, and the number of output channels is 32; the convolution kernel size of the second convolution layer is 3 x 3, and the number of output channels is 32; the convolution kernel size of the third layer of convolution layer is 1 x 1, the number of output channels is 64, and the activation functions of the three layers of convolution layer are all Relu activation functions;
the second input channel: the neural network is composed of a normalization layer and a full-link layer which are combined through a residual error network, wherein the full-link layer is provided with 1024 neurons, and the full-link layer uses a Relu activation function;
the fully connected residual error network module: is composed of a normalization layer and a full-connection layer with 4096 neurons, and the full-connection layer uses a Sigmoid activation function.
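Because the text leaves the exact residual wiring, tensor shapes and output projection open, the PyTorch sketch below is only one plausible reading of this two-channel architecture; the input dimensions, the flattening of the convolutional branch, and the omission of the skip connections and of a final mask-sized projection are assumptions.

import torch
import torch.nn as nn

class DualChannelMaskNet(nn.Module):
    """One plausible reading of the two-channel deep stacked residual network.

    Channel I: 1x1/3x3/1x1 convolutions with 32, 32 and 64 output channels,
    each followed by batch normalization and ReLU, applied to the 5-frame STFT
    magnitude context. Channel II: batch normalization plus a 1024-unit ReLU
    fully connected layer applied to the traditional features. The branches
    are concatenated and fed to a 4096-unit Sigmoid layer as in the text; the
    residual (skip) connections and any mask-sized projection are not
    specified there and are omitted here.
    """

    def __init__(self, trad_dim=90, context_frames=5, freq_bins=200):
        super().__init__()
        self.conv_branch = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=1),             nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=1),            nn.BatchNorm2d(64), nn.ReLU(),
        )
        self.fc_branch = nn.Sequential(
            nn.BatchNorm1d(trad_dim), nn.Linear(trad_dim, 1024), nn.ReLU(),
        )
        fused = 64 * context_frames * freq_bins + 1024
        self.head = nn.Sequential(
            nn.BatchNorm1d(fused), nn.Linear(fused, 4096), nn.Sigmoid(),
        )

    def forward(self, stft_context, trad_feats):
        # stft_context: (batch, 1, context_frames, freq_bins); trad_feats: (batch, trad_dim)
        a = self.conv_branch(stft_context).flatten(1)
        b = self.fc_branch(trad_feats)
        return self.head(torch.cat([a, b], dim=1))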
4) Constructing a learning target; the method comprises the following steps:
(1) the ideal binary mask learning target IBM and the ideal ratio mask learning target IRM of the mixed speech signals of the training set are calculated using the following formulas:
IBM(m, f) = 1 if SNR(m, f) > LC, and IBM(m, f) = 0 otherwise

IRM(m, f) = [ S(m, f)^2 / ( S(m, f)^2 + N(m, f)^2 ) ]^(1/2)  (square-root form of the ideal ratio mask)
where LC is set to 20 dB; SNR(m, f) is the local signal-to-noise ratio of the time-frequency unit at time frame m and frequency f, with f ranging from 80 Hz to 5000 Hz; and S(m, f)^2 and N(m, f)^2 denote the speech energy and the noise energy, respectively, at the m-th time frame and frequency f;
The IBM is a binary time-frequency masking matrix computed from the clean speech and the noise: for each time-frequency unit, if the local signal-to-noise ratio SNR(m, f) exceeds the local criterion LC, the corresponding element of the masking matrix is set to 1, otherwise it is set to 0. The IRM is a widely used training target in supervised speech separation.
(2) To combine the advantages of both masks, the invention proposes a learning target based on the mixed mask (MM). In noise-dominated time-frequency units the mask value agrees with the IBM, i.e. it equals 0; in time-frequency units dominated by the target speech it agrees with the IRM. Specifically, the ideal binary mask learning target IBM and the ideal ratio mask learning target IRM are multiplied element by element (point multiplication) to obtain the mixed masking learning target MM, which forms the final learning target:
MM = [ x_{i,j} ] ⊙ [ y_{i,j} ] = [ x_{i,j} · y_{i,j} ],  i = 1, ..., m;  j = 1, ..., n, where ⊙ denotes the element-wise (Hadamard) product
where x_{1,1}, ..., x_{m,n} denote the ideal ratio mask values in each time-frequency unit of a segment of mixed speech signal; x_{1,1}, ..., x_{m,1} denote the ideal ratio mask values of the first frame of the mixed speech signal; y_{1,1}, ..., y_{m,n} denote the ideal binary mask values in each time-frequency unit of the segment; y_{1,1}, ..., y_{m,1} denote the ideal binary mask values of the first frame of the mixed speech signal; and x_{1,1}·y_{1,1}, ..., x_{m,n}·y_{m,n} denote the ideal mixed mask values in each time-frequency unit of the mixed speech signal.
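A compact sketch of how the three masks could be computed from aligned clean-speech and noise magnitude spectrograms is shown below; the IRM exponent of 0.5 follows the form common in the cited training-target literature and, like the small epsilon, is an assumption rather than a value stated in the patent.

import numpy as np

def compute_masks(speech_mag, noise_mag, lc_db=20.0, eps=1e-12):
    """Compute IBM, IRM and the mixed mask (MM) from clean-speech and noise
    magnitude spectrograms, both shaped (frames, freq_bins).

    lc_db follows the LC value quoted in the description.
    """
    s2 = speech_mag ** 2
    n2 = noise_mag ** 2
    snr_db = 10.0 * np.log10((s2 + eps) / (n2 + eps))

    ibm = (snr_db > lc_db).astype(np.float32)   # 1 where speech dominates
    irm = np.sqrt(s2 / (s2 + n2 + eps))          # soft mask in [0, 1]
    mm = ibm * irm                                # element-wise (Hadamard) product
    return ibm, irm, mm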
5) Training the deep stacked residual network using the extracted traditional features of the training set, the amplitude spectrum characteristics of the STFT domain and the learning target;
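A minimal training-loop sketch under the usual supervised-masking setup (mean-squared error between the predicted mask and the mixed-mask target) might look like this; the optimizer, learning rate, epoch count and data-loader format are illustrative choices, not taken from the patent.

import torch
import torch.nn as nn

def train(model, loader, epochs=50, lr=1e-3):
    """Fit the network to mixed-mask targets with an MSE loss (assumed setup).

    Each batch from `loader` is assumed to yield the STFT context tensor, the
    traditional-feature tensor and the mixed-mask target tensor.
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for stft_ctx, trad_feats, mm_target in loader:
            opt.zero_grad()
            pred = model(stft_ctx, trad_feats)
            loss_fn(pred, mm_target).backward()
            opt.step()
    return model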
6) The extracted traditional features of the test set and the amplitude spectrum characteristics of the STFT domain are input into the trained deep stacked residual network to obtain the predicted learning target; the enhanced speech signal is obtained from the predicted learning target through the ISTFT, and the PESQ value of the enhanced speech signal is calculated.
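Reconstruction of the enhanced waveform from a predicted mask could be sketched as follows, reusing the noisy phase and the 25 ms / 10 ms framing; the FFT size, the mask shape and the use of the third-party pesq package for scoring are assumptions.

import numpy as np
import librosa

def reconstruct_enhanced(mixture, predicted_mask, sr=16000):
    """Apply a predicted T-F mask to the mixture STFT and resynthesize via ISTFT.

    predicted_mask is assumed shaped (frames, bins) to match the STFT; the
    noisy phase is reused, the usual choice for magnitude-domain masking.
    """
    stft = librosa.stft(mixture, n_fft=400, hop_length=160,
                        win_length=400, window='hamming')       # (bins, frames)
    enhanced_stft = predicted_mask.T * stft
    enhanced = librosa.istft(enhanced_stft, hop_length=160,
                             win_length=400, window='hamming')
    return enhanced

# PESQ can then be computed against the clean reference, e.g. with the
# third-party 'pesq' package: pesq(sr, clean, enhanced, 'wb')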
With the speech enhancement method based on the mixed masking learning target, the PESQ score improves markedly: the quality of the enhanced speech is 1.6% higher than with the ideal ratio mask learning target, as shown in Table 1.
TABLE 1 PESQ values for speech signals of two learning objectives
[Table 1 is shown as an image in the original publication; its values are not reproduced here.]

Claims (6)

1. A speech enhancement method based on a hybrid masking learning objective is characterized by comprising the following steps:
1) performing traditional feature extraction on the voice signals, wherein the traditional feature extraction comprises dividing the acquired voice signals into a training set and a test set, and respectively extracting traditional features of the voice signals of the training set and the test set;
2) respectively extracting the amplitude spectrum characteristics of the STFT domain of the voice signals of the training set and the test set;
3) constructing a deep stacked residual network;
4) constructing a learning target;
5) training the deep stacked residual network using the extracted traditional features of the training set, the amplitude spectrum characteristics of the STFT domain and the learning target;
6) inputting the extracted traditional features of the test set and the amplitude spectrum characteristics of the STFT domain into the trained deep stacked residual network to obtain the predicted learning target, obtaining the enhanced speech signal from the predicted learning target through the ISTFT, and calculating the PESQ value of the enhanced speech signal.
2. The method for enhancing speech based on the hybrid masking learning objective as claimed in claim 1, wherein the step 1) comprises: randomly extracting 1500 utterances from the training part of the TIMIT corpus and randomly mixing them with 9 kinds of noise taken from the NOISEX-92 corpus under a continuously varying signal-to-noise ratio of -5 dB to generate 1500 mixed speech signals forming the training set; and randomly selecting 500 clean utterances from the test part of the TIMIT corpus and randomly mixing them with 15 kinds of noise taken from the NOISEX-92 corpus at signal-to-noise ratios of -10, -8, -6, -4, -2, 0, 2, 4, 6 and 8 dB to generate 500 mixed speech signals forming the test set.
3. The method according to claim 2, wherein the conventional feature processes for extracting the speech signals of the training set and the test set in step 1) are the same, and each process includes obtaining the following different feature vectors:
(1) applying a 512-point short-time Fourier transform to the mixed speech signal sampled at 16 kHz, extracting a 31-dimensional MFCC feature vector using a Hamming window with a 20 ms frame length and a 10 ms frame shift, and calculating the first-order derivative (delta) of the 31-dimensional MFCC feature vector;
(2) full-wave rectifying the 16 kHz mixed speech signal to extract its envelope and downsampling it by a factor of four, framing with a Hamming window with a 32 ms frame length and a 10 ms frame shift, obtaining a 15-dimensional AMS feature vector using 15 triangular windows whose center frequencies are uniformly distributed over 15.6-400 Hz, and calculating the first-order derivative of the 15-dimensional AMS feature vector;
(3) decomposing the 16 kHz mixed speech signal with a 64-channel Gammatone filter bank, sampling each decomposed output at a rate of 100 Hz, compressing the amplitude of the resulting samples with a cube-root operation, extracting a 64-dimensional Gammatone feature vector, and calculating the first-order derivative of the 64-dimensional Gammatone feature vector;
(4) mapping the power spectrum of the 16 kHz mixed speech signal onto a 20-channel Bark scale with a trapezoidal filter bank, applying equal-loudness pre-emphasis, then applying the intensity-loudness power law and a 12th-order linear prediction model to obtain a 13-dimensional PLP feature vector, and calculating the first-order derivative of the 13-dimensional PLP feature vector;
concatenating the 31-dimensional MFCC, 15-dimensional AMS, 64-dimensional Gammatone and 13-dimensional PLP feature vectors into a 123-dimensional feature vector, concatenating their first-order derivatives into a second 123-dimensional feature vector, and concatenating the two 123-dimensional feature vectors into a 246-dimensional feature vector;
and extracting the zero-crossing rate, root-mean-square energy and spectral centroid features of the 16 kHz mixed speech signal, forming a 269-dimensional feature vector together with the 246-dimensional feature vector, and feeding the 269-dimensional feature vector into a fireworks-algorithm feature selector for feature dimensionality reduction, with the number of initial fireworks N set to 400 and the feature-subset dimension M set to 50, 70 and 90.
4. The method of claim 1, wherein the amplitude spectrum characteristics of the STFT domain are extracted identically for the training-set and test-set speech signals in step 2), the extraction comprising: applying a short-time Fourier transform to the mixed speech signal sampled at 16 kHz, framed with a Hamming window with a 25 ms frame length and a 10 ms frame shift; when the magnitude spectrum of the single frame corresponding to each traditional feature vector is used as input, appending the magnitude spectra of the two adjacent frames on the left and the two on the right, giving 5 frames in total; the magnitude spectrum of each frame having dimension 200, so that the STFT-domain amplitude spectrum features have an input dimension of 1000.
5. The method according to claim 1, wherein the deep stacked residual network of step 3) comprises an input channel I, an input channel II, and a fully connected residual network module connected to the concatenated outputs of input channel I and input channel II, wherein:
the input channel I is a convolutional residual module composed of three convolution layers and three normalization layers combined through residual connections; each convolution kernel is two-dimensional, the stride of each convolution kernel is set to 1, and zero padding is used; from top to bottom, the first convolution layer has a 1 x 1 kernel and 32 output channels, the second convolution layer has a 3 x 3 kernel and 32 output channels, and the third convolution layer has a 1 x 1 kernel and 64 output channels; all three convolution layers use the ReLU activation function;
the input channel II is composed of a normalization layer and a fully connected layer combined through a residual connection; the fully connected layer has 1024 neurons and uses the ReLU activation function;
the fully connected residual network module is composed of a normalization layer and a fully connected layer with 4096 neurons; the fully connected layer uses the Sigmoid activation function.
6. The method of claim 1, wherein the step 4) of constructing the learning objective comprises:
(1) calculating the ideal binary mask learning target IBM and the ideal ratio mask learning target IRM of the mixed speech signals of the training set using the following formulas:
IBM(m, f) = 1 if SNR(m, f) > LC, and IBM(m, f) = 0 otherwise

IRM(m, f) = [ S(m, f)^2 / ( S(m, f)^2 + N(m, f)^2 ) ]^(1/2)  (square-root form of the ideal ratio mask)
where LC is set to 20 dB; SNR(m, f) is the local signal-to-noise ratio of the time-frequency unit at time frame m and frequency f, with f ranging from 80 Hz to 5000 Hz; and S(m, f)^2 and N(m, f)^2 denote the speech energy and the noise energy, respectively, at the m-th time frame and frequency f;
(2) multiplying the ideal binary mask learning target IBM and the ideal ratio mask learning target IRM element by element (point multiplication) to obtain the mixed masking learning target MM, forming the final learning target:
MM = [ x_{i,j} ] ⊙ [ y_{i,j} ] = [ x_{i,j} · y_{i,j} ],  i = 1, ..., m;  j = 1, ..., n, where ⊙ denotes the element-wise (Hadamard) product
where x_{1,1}, ..., x_{m,n} denote the ideal ratio mask values in each time-frequency unit of a segment of mixed speech signal; x_{1,1}, ..., x_{m,1} denote the ideal ratio mask values of the first frame of the mixed speech signal; y_{1,1}, ..., y_{m,n} denote the ideal binary mask values in each time-frequency unit of the segment; y_{1,1}, ..., y_{m,1} denote the ideal binary mask values of the first frame of the mixed speech signal; and x_{1,1}·y_{1,1}, ..., x_{m,n}·y_{m,n} denote the ideal mixed mask values in each time-frequency unit of the mixed speech signal.
CN201911385421.XA 2019-12-28 2019-12-28 Speech enhancement method based on mixed masking learning target Active CN111128209B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911385421.XA CN111128209B (en) 2019-12-28 2019-12-28 Speech enhancement method based on mixed masking learning target

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911385421.XA CN111128209B (en) 2019-12-28 2019-12-28 Speech enhancement method based on mixed masking learning target

Publications (2)

Publication Number Publication Date
CN111128209A true CN111128209A (en) 2020-05-08
CN111128209B CN111128209B (en) 2022-05-10

Family

ID=70504227

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911385421.XA Active CN111128209B (en) 2019-12-28 2019-12-28 Speech enhancement method based on mixed masking learning target

Country Status (1)

Country Link
CN (1) CN111128209B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111583954A (en) * 2020-05-12 2020-08-25 中国人民解放军国防科技大学 Speaker independent single-channel voice separation method
CN111653287A (en) * 2020-06-04 2020-09-11 重庆邮电大学 Single-channel speech enhancement algorithm based on DNN and in-band cross-correlation coefficient
CN111899750A (en) * 2020-07-29 2020-11-06 哈尔滨理工大学 Speech enhancement algorithm combining cochlear speech features and hopping deep neural network
CN112562706A (en) * 2020-11-30 2021-03-26 哈尔滨工程大学 Target voice extraction method based on time potential domain specific speaker information
CN113257267A (en) * 2021-05-31 2021-08-13 北京达佳互联信息技术有限公司 Method for training interference signal elimination model and method and equipment for eliminating interference signal
CN113470671A (en) * 2021-06-28 2021-10-01 安徽大学 Audio-visual voice enhancement method and system by fully utilizing visual and voice connection
CN114495968A (en) * 2022-03-30 2022-05-13 北京世纪好未来教育科技有限公司 Voice processing method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20080049385A (en) * 2006-11-30 2008-06-04 Electronics and Telecommunications Research Institute Pre-processing method and device for clean speech feature estimation based on masking probability
CN101237303A (en) * 2007-01-30 2008-08-06 Huawei Technologies Co., Ltd. Data transmission method, system, transmitter and receiver
US20150124987A1 (en) * 2013-11-07 2015-05-07 The Board Of Regents Of The University Of Texas System Enhancement of reverberant speech by binary mask estimation
CN107845389A (en) * 2017-12-21 2018-03-27 Beijing University of Technology Speech enhancement method based on multi-resolution auditory cepstral coefficients and deep convolutional neural networks
CN110120227A (en) * 2019-04-26 2019-08-13 Tianjin University Speech separation method based on a deep stacked residual network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20080049385A (en) * 2006-11-30 2008-06-04 Electronics and Telecommunications Research Institute Pre-processing method and device for clean speech feature estimation based on masking probability
CN101237303A (en) * 2007-01-30 2008-08-06 Huawei Technologies Co., Ltd. Data transmission method, system, transmitter and receiver
US20150124987A1 (en) * 2013-11-07 2015-05-07 The Board Of Regents Of The University of Texas System Enhancement of reverberant speech by binary mask estimation
CN107845389A (en) * 2017-12-21 2018-03-27 Beijing University of Technology Speech enhancement method based on multi-resolution auditory cepstral coefficients and deep convolutional neural networks
CN110120227A (en) * 2019-04-26 2019-08-13 Tianjin University Speech separation method based on a deep stacked residual network

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
SHASHA XIA et al.: "Using optimal ratio mask as training target for supervised speech separation", 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference *
YAN ZHAO et al.: "Perceptually Guided Speech Enhancement Using Deep Neural Networks", 2018 IEEE International Conference on Acoustics, Speech and Signal Processing *
YUXUAN WANG et al.: "On Training Targets for Supervised Speech Separation", IEEE/ACM Transactions on Audio, Speech, and Language Processing *
XIA Shasha et al.: "Supervised speech separation based on optimized ratio masking", Acta Automatica Sinica *
LI Ruwei et al.: "Speech enhancement algorithm with auditory cepstral coefficients based on deep learning", Journal of Huazhong University of Science and Technology (Natural Science Edition) *
LIANG Shan et al.: "A generalization algorithm from binary time-frequency masking to ratio masking based on noise tracking", Acta Acustica *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111583954A (en) * 2020-05-12 2020-08-25 中国人民解放军国防科技大学 Speaker independent single-channel voice separation method
CN111653287A (en) * 2020-06-04 2020-09-11 重庆邮电大学 Single-channel speech enhancement algorithm based on DNN and in-band cross-correlation coefficient
CN111899750A (en) * 2020-07-29 2020-11-06 哈尔滨理工大学 Speech enhancement algorithm combining cochlear speech features and hopping deep neural network
CN111899750B (en) * 2020-07-29 2022-06-14 哈尔滨理工大学 Speech enhancement algorithm combining cochlear speech features and hopping deep neural network
CN112562706A (en) * 2020-11-30 2021-03-26 哈尔滨工程大学 Target voice extraction method based on time potential domain specific speaker information
CN112562706B (en) * 2020-11-30 2023-05-05 哈尔滨工程大学 Target voice extraction method based on time potential domain specific speaker information
CN113257267A (en) * 2021-05-31 2021-08-13 北京达佳互联信息技术有限公司 Method for training interference signal elimination model and method and equipment for eliminating interference signal
CN113470671A (en) * 2021-06-28 2021-10-01 安徽大学 Audio-visual voice enhancement method and system by fully utilizing visual and voice connection
CN113470671B (en) * 2021-06-28 2024-01-23 安徽大学 Audio-visual voice enhancement method and system fully utilizing vision and voice connection
CN114495968A (en) * 2022-03-30 2022-05-13 北京世纪好未来教育科技有限公司 Voice processing method and device, electronic equipment and storage medium
CN114495968B (en) * 2022-03-30 2022-06-14 北京世纪好未来教育科技有限公司 Voice processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111128209B (en) 2022-05-10

Similar Documents

Publication Publication Date Title
CN111128209B (en) Speech enhancement method based on mixed masking learning target
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
CN109524014A (en) Voiceprint recognition analysis method based on deep convolutional neural networks
CN107146601A (en) Back-end i-vector enhancement method for speaker recognition systems
CN107633842A (en) Audio recognition method, device, computer equipment and storage medium
CN106952649A (en) Speaker recognition method based on convolutional neural networks and spectrograms
CN107331384A (en) Audio recognition method, device, computer equipment and storage medium
CN102968990B (en) Speaker identifying method and system
Graciarena et al. All for one: feature combination for highly channel-degraded speech activity detection.
CN108777146A (en) Speech model training method, speaker recognition method, device, equipment and medium
CN110120227A (en) Speech separation method based on a deep stacked residual network
Fan et al. End-to-end post-filter for speech separation with deep attention fusion features
CN108922513A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN103117059A (en) Voice signal characteristics extracting method based on tensor decomposition
CN106531174A (en) Animal sound recognition method based on wavelet packet decomposition and spectrogram features
CN111192598A (en) Voice enhancement method for jump connection deep neural network
CN106024010A (en) Speech signal dynamic characteristic extraction method based on formant curves
CN113539293B (en) Single-channel voice separation method based on convolutional neural network and joint optimization
CN111899750B (en) Speech enhancement algorithm combining cochlear speech features and hopping deep neural network
CN108806725A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN109036470A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN110136746B (en) Method for identifying mobile phone source in additive noise environment based on fusion features
CN108364641A (en) A kind of speech emotional characteristic extraction method based on the estimation of long time frame ambient noise
Fan et al. Deep attention fusion feature for speech separation with end-to-end post-filter method
Mowlaee et al. Improved single-channel speech separation using sinusoidal modeling

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant