CN113129872A - Voice enhancement method based on deep compressed sensing - Google Patents
- Publication number
- CN113129872A (application CN202110367869.XA)
- Authority
- CN
- China
- Prior art keywords
- voice
- model
- signal
- speech
- enhancement
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0224—Processing in the time domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/60—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a speech enhancement method based on deep compressed sensing, comprising the following steps. Step 1: preprocess the training data to obtain time-domain speech signal sequences. Step 2: construct a speech enhancement model based on deep compressed sensing (SEDCS) and train it jointly. Step 3: preprocess the noisy speech test set, denoise and reconstruct it with the trained SEDCS model, and save the result to complete the speech enhancement task. Step 4: evaluate the quality and intelligibility of the enhanced speech signal with several evaluation indexes. By combining compressed sensing with deep learning, the invention removes the sparsity constraint that traditional compressed sensing places on speech signals and thereby avoids the reduced intelligibility of speech reconstructed by traditional compressed sensing; by taking the observation of the speech signal as the optimization object, it effectively improves speech enhancement efficiency, reduces model complexity, and realizes speech enhancement more simply, conveniently and flexibly.
Description
Technical Field
The invention relates to the technical field of voice enhancement of voice signal processing, in particular to a voice enhancement method based on deep compressed sensing.
Background
Speech is the most natural, fast and efficient way for people to communicate, but in real life it is often disturbed by various noises, such as environmental noise and mechanical noise. These noises degrade speech quality to different extents and reduce speech intelligibility. To address these problems, speech enhancement is required. Speech enhancement is a technique for extracting clean speech from noisy speech; it is an important component of speech recognition systems, and it has two main purposes: improving speech quality and improving speech intelligibility.
Existing speech enhancement means mainly comprise traditional methods such as spectral subtraction, subspace methods and wiener filtering. Although these traditional methods can effectively remove noise and improve speech quality, they generally rest on specific assumptions, for example that the noise is stationary, and their enhancement performance is poor at low signal-to-noise ratios and under non-stationary noise. To address this problem, speech enhancement methods based on deep learning have been proposed; common ones are based on convolutional neural networks (CNN), recurrent neural networks (RNN) and generative adversarial networks (GAN). The CNN-based method is the most common and completes the speech enhancement task by training a speech enhancement model, but it has a large number of model parameters, and if the enhancement is performed in the time-frequency domain it suffers from problems such as loss of phase information, which lowers the enhancement quality. RNN-based methods have also attracted attention, but compared with CNN methods they have even more parameters and more complex models. Generative adversarial networks (GAN) have provided a new approach to speech enhancement that enables end-to-end enhancement and completes the speech enhancement task directly in the time domain.
With the development of a compressed sensing technology, a new exploration field is provided for speech enhancement, and although the method can solve the problem that the speech enhancement effect is poor under non-stationary noise in the traditional method, the compressed sensing requires that a speech signal needs to meet a specific structure, for example, the speech signal needs to be sparse, and the speech signal may cause effective information loss in the sparse process, so that the intelligibility of reconstructed speech is reduced.
Most of the existing voice enhancement technologies are realized in a time-frequency domain, and the problems of phase information loss and the like are easily caused after data processing; although speech enhancement is realized in the time domain by many deep learning-based speech enhancement techniques, the models thereof are complex and the original speech signal is taken as an optimization object, resulting in a reduction in the enhancement rate; the speech enhancement method based on the traditional compressed sensing technology is influenced by the sparsity of speech signals, so that the intelligibility of reconstructed speech is reduced.
Disclosure of Invention
To solve the problems of complex models and slow enhancement rate in existing speech enhancement technology, and the reduced intelligibility of speech reconstructed by the traditional compressed sensing method, the method mainly takes an observation of the speech signal as the optimization object. The aim of the invention is to provide a compressed sensing speech enhancement method combined with deep learning that completes the speech enhancement task, improves the speech enhancement rate, and overcomes the loss of intelligibility in the speech reconstructed by traditional compressed sensing.
The purpose of the invention is realized by the following technical scheme.
A speech enhancement method based on deep compressed sensing comprises the following steps.
Step 1: preprocessing training data: and pre-emphasis, pairing and framing processing are carried out on the training data to obtain a time domain voice signal sequence.
Step 2: constructing and training the model: establish a speech enhancement model based on deep compressed sensing (SEDCS), set a suitable error function, input the preprocessed speech training set into the model for joint training, and deploy the trained SEDCS model to a server.
Step 3: testing the model: preprocess the noisy speech test set, denoise it with the trained SEDCS model, reconstruct the denoised speech signal, and complete the speech enhancement task.
Step 4: evaluating the model: evaluate the quality and intelligibility of the enhanced speech signal with several evaluation indexes.
The model is called as an SEDCS model, a voice enhancement task can be completed in a time domain by training the SEDCS model, the problems of phase information loss and the like after time-frequency domain data processing are avoided, and a user can quickly obtain enhanced voice only by providing a voice file with noise; the invention can get rid of the sparsity constraint on the voice signal in the traditional compressed sensing method, solves the problems of the traditional compressed sensing method such as the reduction of the intelligibility of the reconstructed voice and the like, and more conveniently and flexibly realizes the voice enhancement.
Further, the SEDCS model in step 2 is constructed from two deep neural network models, called the generative model G and the measurement model F:
the generative model G replaces the signal sparsification process in compressed sensing; it maps the input noisy speech signal and reconstructs a generated speech signal related to the clean speech signal;
the measurement model F replaces the measurement matrix in compressed sensing and realizes the observation dimension-reduction process, producing the observations of the clean speech signal and of the generated speech signal; these observations serve as the optimization object.
Further, the two models forming the SEDCS model in step 2 are trained jointly. Before training, the noisy speech of the training set is first optimized, which lets the model converge faster and shortens the training period. The optimization uses gradient descent, with an objective function of the form

ŷ = argmin_y ||F(x) − F(G(y))||₂²

where ŷ represents the optimized noisy speech signal, x the clean speech signal, y the noisy speech signal, G(y) the generated speech signal obtained by passing the noisy speech through the generative model, F(x) and F(G(y)) the observation signals obtained by passing the respective speech signals through the measurement model, and λ is a weight coefficient. The generative model and the measurement model are not updated while the data are being optimized, and the number of optimization iterations can be specified.
Further, the optimized noisy speech signal is input into the generative model G and then passed through the measurement model F to obtain an observation signal. Taking this observation as the optimization object, the two models are jointly trained with objective functions of the form

L_G = ||F(x) − F(G(ŷ))||₂² + λ||x − G(ŷ)||₁,   L_F = ||F(x) − F(G(ŷ))||₂²

where λ is the weight factor of the added L1 regularization term, x represents the clean speech signal, and G(ŷ) the reconstructed speech signal; the aim of the deep-compressed-sensing-based speech enhancement method is to make G(ŷ) approach x, i.e. to minimize the model's overall objective function.
The joint training mode takes the observation signal of the signal as an optimization object, utilizes the advantage of compressed sensing, and simultaneously converges two models by using less data, thereby simplifying the training process, enabling the reconstructed voice signal to quickly approach a clean voice signal, and effectively solving the problem of slow enhancement rate of the existing voice enhancement technology.
In the traditional compressed sensing method the intelligibility of reconstructed speech is reduced because the speech signal must be made sparse. The SEDCS model of the invention uses deep neural networks to replace the sparsification process and the observation dimension-reduction process, so that no sparse basis or measurement matrix needs to be chosen, which effectively solves the problem of reduced intelligibility of reconstructed speech in the traditional compressed sensing method.
Further, step 3 includes the following substeps.
Step 3-1: and preprocessing the test data, and performing pre-emphasis and framing processing on the noisy speech signals of the test set, wherein the pre-emphasis factor and the size of each frame are the same as those of the processed training data.
Step 3-2: enhancing the voice: inputting the preprocessed voice signals with noise into the trained SEDCS model, denoising each small section of voice by the model, and splicing and reconstructing the small section of voice according to the sequence of the original clean voice signals to obtain the denoised voice signals.
Step 3-3: storing the result: de-emphasis is performed on the denoised speech signal, and the final denoised speech signal is stored in a specified directory.
Further, in the data preprocessing of step 1 and step 3, the pre-emphasis factor is set to 0.95, the frame length to 16384 sampling points and the frame overlap to 1/2; that is, a window of size 16384 × 1 with a sliding step of 0.5 is used to sample and frame the speech signal, and frames shorter than 16384 points are padded with zeros.
Furthermore, when the trained model is used for carrying out voice denoising processing in the step 3, noise conditions of different noisy voice test data may be different, and the model can complete a voice enhancement task when dealing with unknown noise conditions, which shows that the method can adapt to different noise scenes and has practicability.
Further, the evaluation indexes in step 4 include an index for evaluating speech intelligibility, STOI, and indexes for evaluating speech quality: PESQ, CSIG, CBAK, COVL and SSNR. STOI is short-time objective intelligibility, PESQ is the perceptual evaluation of speech quality, CSIG is the mean opinion score of speech signal distortion, CBAK is the mean opinion score of background noise intrusiveness, COVL is the mean opinion score of overall enhancement effect, and SSNR is the segmental signal-to-noise ratio. With these indexes the model can be evaluated accurately and effectively.
By adopting the scheme, the invention has the following beneficial effects.
1. The invention provides a voice enhancement method based on deep compressed sensing, which effectively utilizes the respective advantages of a deep learning method and a compressed sensing technology to construct a compressed sensing voice enhancement model combined with deep learning, takes an observation signal of a voice signal as an optimization object, effectively improves the voice enhancement efficiency and reduces the model complexity.
2. The invention adopts a joint training mode to train the model, so that the voice signal with noise can be fitted with a clean voice signal, and the voice enhancement quality and the intelligibility are effectively improved.
3. The invention can complete the voice enhancement under different noise conditions, and has stronger adaptability and certain practicability.
Drawings
To further understand the technical solutions of the embodiments of the present invention, the drawings are described herein, and the drawings herein form a part of the present application and do not form a limitation of the embodiments of the present invention.
In the drawings: fig. 1 is a schematic diagram of a speech enhancement technique according to an embodiment of the present invention.
Detailed Description
The purpose, technical solution and advantages of the embodiments of the present invention will be fully described in detail herein with reference to the accompanying drawings. The embodiments described herein are some, but not all embodiments of the inventions. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
The embodiment of the invention provides a voice enhancement method, which obtains an SEDCS model through a combined training mode, completes a voice enhancement task in a time domain, solves the problems of complex model, low enhancement rate and the like existing in the existing voice enhancement technology, solves the problems of reduced intelligibility of reconstructed voice of a traditional compressed sensing method and the like, and realizes voice enhancement more simply, conveniently and flexibly.
As shown in fig. 1, a route diagram of a speech enhancement technology provided by an embodiment of the present invention includes the following steps.
Step 1: preprocessing training data: and pre-emphasis, pairing and framing processing are carried out on the training data to obtain a time domain voice signal sequence.
The main role of pre-emphasis in the data preprocessing is to boost the high-frequency components so that the quality of the reconstructed speech is not affected; correspondingly, de-emphasis is performed at the output. The pre-emphasis factor is set to 0.95, the frame length to 16384 sampling points and the frame overlap to 1/2; that is, a window of size 16384 × 1 with a sliding step of 0.5 is used to sample and frame the speech signal, and frames shorter than 16384 points are padded with zeros.
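As an illustrative sketch only (plain Python; the function names and list-based signal representation are ours, not the patent's), the pre-emphasis and zero-padded framing described above can be expressed as:

```python
def preemphasize(signal, alpha=0.95):
    """Boost high frequencies: y[n] = x[n] - alpha * x[n-1], with y[0] = x[0]."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_signal(signal, frame_len=16384, overlap=0.5):
    """Slice the signal into frames of frame_len with 50% overlap; a short
    final frame is padded with zeros, as the preprocessing step specifies."""
    hop = int(frame_len * (1 - overlap))
    frames = []
    start = 0
    while start < len(signal):
        frame = signal[start:start + frame_len]
        frame = frame + [0.0] * (frame_len - len(frame))  # zero-pad last frame
        frames.append(frame)
        start += hop
    return frames

frames = frame_signal(preemphasize([float(i) for i in range(20000)]))
```

In practice the same windowing would be done with a tensor library, but the 50% hop and zero-padding shown here match the stated parameters.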
Step 2: constructing a model and training: and constructing a voice enhancement model based on deep compressed sensing, setting a proper error function, performing joint training on the preprocessed voice training set input model, and deploying the trained SEDCS model into a server.
The SEDCS model is constructed from two deep neural network models, the generative model G and the measurement model F.
The generative model G replaces the signal sparsification process in compressed sensing; it maps the input noisy speech signal and reconstructs a generated speech signal related to the clean speech signal.
The measurement model F replaces the measurement matrix in compressed sensing to realize the observation dimension-reduction process; its inputs comprise the clean speech signal and the generated speech signal, its aim is to obtain the observations of the clean speech signal and of the generated speech signal, and these observations serve as the optimization object.
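A minimal structural sketch of the two roles — a generative network G mapping a noisy frame to a generated frame, and a measurement network F mapping a frame to a lower-dimensional observation — is shown below. The patent does not disclose layer sizes or architectures, so the toy dimensions and untrained random weights here are purely illustrative:

```python
import math
import random

random.seed(0)

def linear(in_dim, out_dim):
    """A dense layer with small random weights (stand-in for trained parameters)."""
    return [[random.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)]

def apply(layer, vec):
    """Matrix-vector product followed by tanh."""
    return [math.tanh(sum(w * v for w, v in zip(row, vec))) for row in layer]

FRAME, MEAS = 8, 4  # toy sizes; the patent's frames are 16384 samples

G = linear(FRAME, FRAME)  # generative model: noisy frame -> generated frame
F = linear(FRAME, MEAS)   # measurement model: frame -> low-dim observation

noisy = [random.uniform(-1.0, 1.0) for _ in range(FRAME)]
generated = apply(G, noisy)        # G(y): reconstructed speech frame
observation = apply(F, generated)  # F(G(y)): dimension-reduced observation
```

The key structural point is that F outputs fewer values than it receives, playing the role of the measurement matrix in classical compressed sensing.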
After the model is built, training is carried out in a joint training mode, the noisy speech of the training set is optimized before training, the generated model and the measurement model are not updated during the period of optimizing the noisy speech, and the optimization times can be specified.
This process lets the model converge faster and shortens the training period. The optimization uses gradient descent, with an objective function of the form

ŷ = argmin_y ||F(x) − F(G(y))||₂²

where ŷ represents the optimized noisy speech signal, x the clean speech signal, y the noisy speech signal, G(y) the generated speech signal obtained by passing the noisy speech through the generative model, F(x) and F(G(y)) the observation signals obtained by passing the respective speech signals through the measurement model, and λ is a weight coefficient.
The optimized noisy speech signal is input into the generative model G and then passed through the measurement model F to obtain an observation signal. Taking this observation as the optimization object, the two models are jointly trained with objective functions of the form

L_G = ||F(x) − F(G(ŷ))||₂² + λ||x − G(ŷ)||₁,   L_F = ||F(x) − F(G(ŷ))||₂²

where λ is the weight factor of the added L1 regularization term, x represents the clean speech signal, and G(ŷ) the reconstructed speech signal; the aim of the deep-compressed-sensing-based speech enhancement method is to make G(ŷ) approach x, i.e. to minimize the model's overall objective function.
The joint training mode takes the observation signal of the signal as an optimization object, utilizes the advantage of compressed sensing, and simultaneously converges two models by using less data, thereby simplifying the training process, enabling the reconstructed voice signal to quickly approach a clean voice signal, and effectively solving the problem of slow enhancement rate of the existing voice enhancement technology.
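Assuming, as the symbol definitions suggest, that the joint objectives combine a measurement-space L2 error with a λ-weighted L1 term pulling the generated speech toward the clean speech, the per-frame losses can be sketched as follows (plain Python; the function names are ours):

```python
def l2_sq(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def l1(a, b):
    """L1 distance between two equal-length vectors."""
    return sum(abs(u - v) for u, v in zip(a, b))

def generator_loss(Fx, FGy, x, Gy, lam=0.01):
    """Observation-space error plus a lam-weighted L1 term toward clean x."""
    return l2_sq(Fx, FGy) + lam * l1(x, Gy)

def measurement_loss(Fx, FGy):
    """The measurement model is trained on the same observation-space error."""
    return l2_sq(Fx, FGy)
```

Because both losses are driven by the low-dimensional observations F(x) and F(G(ŷ)) rather than the raw 16384-sample frames, each gradient step is cheap, which is the efficiency advantage the joint training scheme claims.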
And step 3: testing the model: and preprocessing the voice data with the noise test set, denoising the voice data by using the trained SEDCS model, reconstructing to obtain a denoised voice signal, and completing a voice enhancement task.
Step 3-1: and preprocessing the test data, and performing pre-emphasis and framing processing on the noisy speech signals of the test set, wherein the pre-emphasis factor and the size of each frame are the same as those of the processed training data.
The noisy test set is preprocessed in the same way as the training set: the pre-emphasis factor is still 0.95, the frame length is still 16384 sampling points and the frame overlap 1/2; a window of size 16384 × 1 with a sliding step of 0.5 is used to sample and frame the speech signal, and frames shorter than 16384 points are padded with zeros.
Step 3-2: enhancing the voice: inputting the preprocessed voice signals with noise into the trained SEDCS model, denoising each small section of voice by the model, and splicing and reconstructing the small section of voice according to the sequence of the original clean voice signals to obtain the denoised voice signals.
The noise conditions of different noisy speech test data may be different, and when the model deals with unknown noise conditions, the model can also complete the speech enhancement task.
Step 3-3: storing the result: de-emphasis is performed on the denoised speech signal, and the final denoised speech signal is stored in a specified directory.
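The reconstruction in steps 3-2 and 3-3 — splicing the denoised frames back in their original order and then de-emphasizing — can be sketched as follows. The patent does not detail the splicing rule, so a simple first-hop splice consistent with the 50% overlap is assumed here:

```python
def deframe(frames, hop):
    """Splice 50%-overlapped frames back into one sequence by keeping the
    first `hop` samples of each frame and the whole final frame."""
    out = []
    for frame in frames[:-1]:
        out.extend(frame[:hop])
    out.extend(frames[-1])
    return out

def deemphasize(signal, alpha=0.95):
    """Invert pre-emphasis: x[n] = y[n] + alpha * x[n-1]."""
    out, prev = [], 0.0
    for v in signal:
        prev = v + alpha * prev
        out.append(prev)
    return out
```

With the pre-emphasis convention y[0] = x[0], this de-emphasis exactly inverts the preprocessing, so a frame that passes through the model unchanged would round-trip back to the original samples.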
And 4, step 4: and (3) evaluating the model: and evaluating the quality and intelligibility of the enhanced voice signal by adopting a plurality of evaluation indexes.
And evaluating the stored denoised voice so as to evaluate the performance of the model.
The evaluation indexes include an index for evaluating speech intelligibility, STOI, and indexes for evaluating speech quality: PESQ, CSIG, CBAK, COVL and SSNR. STOI is short-time objective intelligibility, PESQ is the perceptual evaluation of speech quality, CSIG is the mean opinion score of speech signal distortion, CBAK is the mean opinion score of background noise intrusiveness, COVL is the mean opinion score of overall enhancement effect, and SSNR is the segmental signal-to-noise ratio. With these indexes the model can be evaluated accurately and effectively.
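Of these indexes, the segmental SNR is simple enough to sketch directly. The version below is a common formulation with per-frame clipping to [-10, 35] dB; published implementations differ in details such as frame length, so treat it as illustrative:

```python
import math

def segmental_snr(clean, enhanced, frame_len=256, floor=-10.0, ceil=35.0):
    """Average per-frame SNR in dB over non-overlapping frames, each frame's
    SNR clipped to [floor, ceil]; degenerate frames are skipped."""
    snrs = []
    for s in range(0, len(clean) - frame_len + 1, frame_len):
        sig = sum(c * c for c in clean[s:s + frame_len])
        err = sum((c - e) ** 2
                  for c, e in zip(clean[s:s + frame_len],
                                  enhanced[s:s + frame_len]))
        if err == 0 or sig == 0:
            continue  # identical or silent frame: no finite SNR
        snrs.append(min(ceil, max(floor, 10.0 * math.log10(sig / err))))
    return sum(snrs) / len(snrs) if snrs else 0.0
```

STOI, PESQ and the composite MOS measures (CSIG, CBAK, COVL) are defined by their own standards and reference implementations and are not reproduced here.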
In one embodiment of the invention, the evaluation model employs two noisy test sets.
The noise types in test set one are five kinds of environmental noise different from those in the training set; this simulates the enhancement obtained when the model faces unknown environmental noise and demonstrates the effectiveness and feasibility of the model. The results of the embodiment are shown in table 1.
The noise types in test set two are white, volvo and babble; white noise simulates a stationary noise environment and the other noises simulate non-stationary noise environments. This test set is used to evaluate whether the reduced intelligibility of denoised speech in the traditional compressed sensing method has been overcome. The results of the embodiment are shown in table 2.
Table 1: scores of each index on test set one.
Table 2: PESQ and STOI scores on test set two.
To demonstrate the effectiveness and feasibility of the invention, this embodiment was also compared with the wiener filtering speech enhancement method. As shown in table 1, although the PESQ of this embodiment is 0.01 lower than that of the wiener method, the scores of the other indexes are all better, which indicates that this embodiment can effectively suppress noise, improve speech quality, and adapt to different noise environments.
As shown in table 2, the scores in this embodiment are all better and improved to some extent, which indicates that this embodiment can solve the problem of reduced intelligibility in the conventional compressed sensing method.
The above-mentioned embodiments further illustrate the objects, technical solutions and advantages of the present invention. They are only preferred embodiments and should not be construed as limiting the invention; any modifications, improvements and the like made within the spirit and principle of the present invention are included in its scope.
Claims (8)
1. A speech enhancement method based on deep compressed sensing is characterized by comprising the following steps:
step 1: preprocessing training data: pre-emphasis, pairing and framing processing are carried out on training data to obtain a time domain voice signal sequence;
step 2: constructing a model and training: constructing a voice enhancement model (SEDCS) based on deep compressed sensing, setting a proper error function, inputting the processed training set voice signals into a model for joint training, and deploying the trained SEDCS model into a server;
step 3: testing the model: preprocessing a noisy speech test set, denoising it with the trained SEDCS model, reconstructing a denoised speech signal, and completing the speech enhancement task;
step 4: evaluating the model: evaluating the quality and intelligibility of the enhanced speech signal with several evaluation indexes.
2. The method of claim 1, wherein the SEDCS model of step 2 is constructed from two deep neural network models, namely a generative model G and a measurement model F:
the generative model G replaces the signal sparsification process in compressed sensing; it maps the input noisy speech signal and reconstructs a generated speech signal related to the clean speech signal;
3. The method of claim 2, wherein the two models forming the SEDCS model are trained jointly; before training, the noisy speech of the training set is first optimized by gradient descent, with an objective function of the form

ŷ = argmin_y ||F(x) − F(G(y))||₂²

where ŷ represents the optimized noisy speech signal, x the clean speech signal, y the noisy speech signal, G(y) the generated speech signal obtained by passing the noisy speech through the generative model, F(x) and F(G(y)) the observation signals obtained by passing the respective speech signals through the measurement model, and λ is a weight coefficient.
4. The method of claim 3, wherein the optimized noisy speech signal is input into the generative model G and then passed through the measurement model F to obtain an observation signal; taking the observation signal as the optimization object, the two models are jointly trained with objective functions of the form

L_G = ||F(x) − F(G(ŷ))||₂² + λ||x − G(ŷ)||₁,   L_F = ||F(x) − F(G(ŷ))||₂²

where λ is the weight factor of the added L1 regularization term.
5. The method for enhancing speech based on deep compressed sensing of claim 1, wherein step 3 comprises the following sub-steps:
step 3-1: test-data preprocessing: perform pre-emphasis and framing on the noisy speech signals of the test set, with the same pre-emphasis factor and frame size as the training data;
step 3-2: speech enhancement: input the preprocessed noisy speech signal into the trained SEDCS model, which denoises each short speech segment; the segments are then spliced in the order of the original clean speech signal to reconstruct the denoised speech signal;
step 3-3: result storage: apply de-emphasis to the denoised speech signal and store the final denoised speech signal at the specified location.
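Steps 3-2 and 3-3 can be sketched as an overlap-add splice followed by de-emphasis. The patent does not specify the splicing weights, so averaging the overlapped halves is an assumption here; the function name and hop size are likewise illustrative (the hop matches the 1/2-overlap framing of claim 6).

```python
import numpy as np

def reconstruct(frames, hop=8192, pre_emphasis=0.95):
    """Overlap-add denoised frames back into one signal (step 3-2) and undo
    pre-emphasis (step 3-3). Averaging overlapped regions is an assumption."""
    n_frames, frame_len = frames.shape
    total = (n_frames - 1) * hop + frame_len
    out = np.zeros(total)
    weight = np.zeros(total)
    for i, f in enumerate(frames):
        out[i * hop : i * hop + frame_len] += f
        weight[i * hop : i * hop + frame_len] += 1.0
    out /= np.maximum(weight, 1.0)               # average the overlapped halves
    # De-emphasis inverts y[n] = x[n] - a*x[n-1]  =>  x[n] = y[n] + a*x[n-1]
    x = np.empty_like(out)
    x[0] = out[0]
    for t in range(1, total):
        x[t] = out[t] + pre_emphasis * x[t - 1]
    return x
```

The de-emphasis recursion exactly inverts the pre-emphasis filter, so a round trip through pre-emphasis and `reconstruct` recovers the original samples.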
6. The method as claimed in claims 1 and 5, wherein the pre-emphasis factor in the data preprocessing of steps 1 and 3 is set to 0.95, the frame length is set to 16384 samples, and the frame overlap is set to 1/2; that is, a window of size 16384 × 1 with a sliding step of half a frame is used to sample and frame the speech signal, and any final frame shorter than the frame length is zero-padded.
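The preprocessing of claim 6 can be sketched directly from its stated parameters (factor 0.95, frame length 16384, hop of half a frame, zero-padded tail); only the function name and return layout are illustrative.

```python
import numpy as np

def preprocess(signal, pre_emphasis=0.95, frame_len=16384):
    """Pre-emphasis and framing per claim 6: factor 0.95, 16384-sample frames,
    1/2 frame overlap, and zero-padding of the final short frame."""
    # Pre-emphasis: y[n] = x[n] - 0.95 * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    hop = frame_len // 2                         # sliding step of half a frame
    n_frames = max(1, int(np.ceil((len(emphasized) - frame_len) / hop)) + 1)
    pad = (n_frames - 1) * hop + frame_len - len(emphasized)
    padded = np.append(emphasized, np.zeros(max(0, pad)))   # zero-pad the tail
    return np.stack([padded[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])
```

For a 20000-sample input this yields two overlapping frames, with the second frame's tail filled by zeros.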
7. The method as claimed in claim 1, wherein, when the trained model performs the speech denoising of step 3, the noise conditions of different noisy test utterances may differ, and the model can complete the speech enhancement task even under unknown noise conditions.
8. The method according to claim 1, wherein the evaluation indexes of step 4 comprise: an index for evaluating speech intelligibility, STOI; and indexes for evaluating speech quality, PESQ, CSIG, CBAK, COVL, and SSNR.
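Of the claim-8 metrics, STOI, PESQ, CSIG, CBAK, and COVL require dedicated implementations (e.g. third-party packages), but segmental SNR (SSNR) is simple enough to sketch directly. The per-frame clipping range of [-10, 35] dB below is the common convention in the speech-enhancement literature, not a value stated in the patent.

```python
import numpy as np

def segmental_snr(clean, estimate, frame_len=256, eps=1e-10):
    """SSNR: average per-frame SNR in dB between a clean reference and an
    enhanced estimate, each frame clipped to [-10, 35] dB (conventional range)."""
    n_frames = len(clean) // frame_len
    snrs = []
    for i in range(n_frames):
        s = clean[i * frame_len : (i + 1) * frame_len]
        e = s - estimate[i * frame_len : (i + 1) * frame_len]   # residual noise
        snr = 10.0 * np.log10((np.sum(s ** 2) + eps) / (np.sum(e ** 2) + eps))
        snrs.append(np.clip(snr, -10.0, 35.0))
    return float(np.mean(snrs))
```

A perfect estimate saturates at the 35 dB ceiling, and any residual noise lowers the score, which makes SSNR a quick sanity check alongside the perceptual metrics.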
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110367869.XA CN113129872B (en) | 2021-04-06 | 2021-04-06 | Voice enhancement method based on deep compressed sensing |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113129872A true CN113129872A (en) | 2021-07-16 |
CN113129872B CN113129872B (en) | 2023-03-14 |
Family
ID=76774973
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102081928A (en) * | 2010-11-24 | 2011-06-01 | 南京邮电大学 | Method for separating single-channel mixed voice based on compressed sensing and K-SVD |
CN103559888A (en) * | 2013-11-07 | 2014-02-05 | 航空电子系统综合技术重点实验室 | Speech enhancement method based on non-negative low-rank and sparse matrix decomposition principle |
CN103745727A (en) * | 2013-12-25 | 2014-04-23 | 南京邮电大学 | Compressed sensing method of noise-containing voice signal |
EP3090574A1 (en) * | 2014-01-03 | 2016-11-09 | Samsung Electronics Co., Ltd. | Method and apparatus for improved ambisonic decoding |
CN115410589A (en) * | 2022-09-05 | 2022-11-29 | 新疆大学 | Attention generation confrontation voice enhancement method based on joint perception loss |
Non-Patent Citations (4)
Title |
---|
HOURIA HANECHE et al.: "A new way to enhance speech signal based on compressed sensing", Measurement * |
KANG ZHENG et al.: "Speech Enhancement Using U-Net with Compressed Sensing", Applied Sciences * |
ZHANG Jian: "Research on Speech Signal Modeling Technology Based on Compressed Sensing", China Masters' Theses Full-text Database (Information Science and Technology) * |
HUANG Zhihua et al.: "Analysis of speech enhancement algorithms based on the sparsity of noise", Technical Acoustics * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |