CN113129872A - Voice enhancement method based on deep compressed sensing - Google Patents

Voice enhancement method based on deep compressed sensing

Info

Publication number: CN113129872A (application CN202110367869.XA); granted as CN113129872B
Original language: Chinese (zh)
Inventors: 康峥, 黄志华, 赖惠成
Applicant and current assignee: Xinjiang University
Priority/filing date: 2021-04-06; CN113129872A published 2021-07-16; CN113129872B (grant) published 2023-03-14
Legal status: Active (granted)

Classifications

    • G10L15/063: Creation of reference templates; training of speech recognition systems
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L21/0208: Speech enhancement by noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L21/0224: Processing in the time domain
    • G10L25/60: Speech or voice analysis for measuring the quality of voice signals
    • Y02T10/40: Engine management systems


Abstract

The invention discloses a voice enhancement method based on deep compressed sensing, comprising the following steps. Step 1: preprocess the training data to obtain time-domain speech signal sequences. Step 2: construct a speech enhancement model based on deep compressed sensing (SEDCS) and train it jointly. Step 3: preprocess the noisy speech test set, denoise and reconstruct it with the trained SEDCS model, and save the result to complete the speech enhancement task. Step 4: evaluate the quality and intelligibility of the enhanced speech signal with multiple evaluation indexes. By combining compressed sensing with deep learning, the invention removes the sparsity constraint that traditional compressed sensing places on speech signals and thus resolves the reduced intelligibility of speech reconstructed by traditional compressed sensing methods. Taking the observed signal of the speech signal as the optimization object effectively improves enhancement efficiency, reduces model complexity, and makes speech enhancement simpler and more flexible.

Description

Voice enhancement method based on deep compressed sensing
Technical Field
The invention relates to speech enhancement within the field of speech signal processing, and in particular to a speech enhancement method based on deep compressed sensing.
Background
Speech is the most natural, fast and efficient way for people to communicate, but in real life it is often disturbed by various noises, such as environmental noise and mechanical noise. These noises degrade speech quality to different extents and reduce speech intelligibility. Speech enhancement addresses these problems: it is a technique for extracting clean speech from noisy speech, is an important component of speech recognition systems, and has two main goals, improving speech quality and improving speech intelligibility.
Existing speech enhancement approaches fall into two groups. Traditional methods include spectral subtraction, subspace methods and Wiener filtering. Although they can remove noise effectively and improve speech quality, they generally rest on specific assumptions, for example that the noise is stationary, and their enhancement performance deteriorates at low signal-to-noise ratios and under non-stationary noise. To address this, speech enhancement methods based on deep learning have been proposed; common ones build on convolutional neural networks (CNN), recurrent neural networks (RNN) and generative adversarial networks (GAN). CNN-based speech enhancement is widely used and completes the enhancement task by training a model, but its parameter count is large, and if enhancement is performed in the time-frequency domain there are problems such as loss of phase information, which degrades enhancement quality. RNN-based methods have also drawn attention, but compared with CNN methods they have even more parameters and more complex models. GANs opened a new route for speech enhancement: they enhance the speech signal end to end and complete the task directly in the time domain. The development of compressed sensing offers yet another direction for speech enhancement, and although it can overcome the poor performance of traditional methods under non-stationary noise, compressed sensing requires the speech signal to satisfy a specific structure, for example sparsity, and sparsifying the signal may lose useful information, reducing the intelligibility of the reconstructed speech.
Most existing speech enhancement techniques operate in the time-frequency domain, where data processing easily loses phase information. Many deep-learning-based techniques do enhance in the time domain, but their models are complex and take the original speech signal as the optimization object, which slows enhancement. Speech enhancement based on traditional compressed sensing is limited by the sparsity of speech signals, which reduces the intelligibility of the reconstructed speech.
Disclosure of Invention
The method takes the observed signal of the speech signal as the optimization object in order to overcome the complex models and slow enhancement of existing speech enhancement techniques and the reduced intelligibility of speech reconstructed by traditional compressed sensing. The aim of the invention is a compressed sensing speech enhancement method combined with deep learning that completes the speech enhancement task, raises the enhancement rate, and resolves the intelligibility loss of the traditional compressed sensing approach.
The purpose of the invention is realized by the following technical scheme.
A speech enhancement method based on deep compressed sensing comprises the following steps.
Step 1: preprocessing the training data: pre-emphasis, pairing and framing are applied to the training data to obtain time-domain speech signal sequences.
Step 2: constructing and training the model: a speech enhancement model based on deep compressed sensing (SEDCS) is established, a suitable error function is set, the preprocessed speech training set is fed into the model for joint training, and the trained SEDCS model is deployed to a server.
Step 3: testing the model: the noisy speech test set is preprocessed, denoised with the trained SEDCS model, and reconstructed to obtain the denoised speech signal, completing the speech enhancement task.
Step 4: evaluating the model: the quality and intelligibility of the enhanced speech signal are evaluated with multiple evaluation indexes.
By training the SEDCS model, the speech enhancement task is completed in the time domain, avoiding problems such as the loss of phase information after time-frequency-domain processing; a user only needs to supply a noisy speech file to quickly obtain the enhanced speech. The invention removes the sparsity constraint that traditional compressed sensing places on speech signals, resolves the reduced intelligibility of speech reconstructed by the traditional compressed sensing method, and realizes speech enhancement more simply and flexibly.
Further, the SEDCS model in step 2 is constructed from two deep neural network models, a generation model $G$ and a measurement model $F$. The generation model $G$ replaces the signal sparsification process of compressed sensing: it maps the input noisy speech signal and reconstructs a generated speech signal close to the clean speech signal. The measurement model $F$ replaces the measurement matrix of compressed sensing and realizes the dimension-reducing observation of a signal: it produces the observed signals of the clean speech signal and of the generated speech signal, and these observed signals serve as the optimization objects.
Further, the two models forming the SEDCS model in step 2 are trained jointly. Before training, the noisy speech of the training set is first optimized, which lets the model converge faster and shortens the training period. The optimization uses gradient descent with the objective function

$$\hat{x} = \arg\min_{x}\ \lVert F(s) - F(G(x)) \rVert_2^2 + \lambda\,\lVert s - G(x) \rVert_2^2,$$

where $\hat{x}$ denotes the optimized noisy speech signal, $s$ the clean speech signal, $x$ the noisy speech signal, $G(x)$ the generated speech signal obtained by passing the noisy speech through the generation model, $F(s)$ and $F(G(x))$ the observed signals obtained by passing the respective speech signals through the measurement model, and $\lambda$ a weight coefficient. The generation model and the measurement model are not updated while the data are being optimized, and the number of optimization steps can be specified.
Further, the optimized noisy speech signal is fed into the generation model $G$ and then through the measurement model $F$ to obtain an observed signal. With this observed signal as the optimization object, the two models are jointly trained and optimized with the objective functions

$$\mathcal{L}_G = \lVert F(s) - F(G(\hat{x})) \rVert_2^2 + \beta\,\lVert s - G(\hat{x}) \rVert_1,$$

$$\mathcal{L}_F = \lVert F(s) - F(G(\hat{x})) \rVert_2^2,$$

where $\beta$ denotes the weight factor of the added L1 regularization term, $s$ the clean speech signal, and $\hat{s} = G(\hat{x})$ the reconstructed speech signal. The deep-compressed-sensing-based speech enhancement method aims at $\hat{s} \approx s$, i.e. at minimizing the global objective function of the model.
This joint training takes the observed signal of the speech signal as the optimization object and exploits the advantage of compressed sensing: both models converge simultaneously on relatively little data, which simplifies training, lets the reconstructed speech signal approach the clean speech signal quickly, and effectively addresses the slow enhancement rate of existing speech enhancement techniques.
In the traditional compressed sensing method, sparsifying the speech signal reduces the intelligibility of the reconstructed speech. The SEDCS model of the invention replaces the sparsification and dimension-reducing observation processes with deep neural networks, so no choice of sparse basis or measurement matrix needs to be considered, effectively resolving the intelligibility loss of the traditional compressed sensing method.
Further, step 3 includes the following substeps.
Step 3-1: preprocessing the test data: pre-emphasis and framing are applied to the noisy speech signals of the test set, with the same pre-emphasis factor and frame size as used for the training data.
Step 3-2: enhancing the speech: the preprocessed noisy speech signals are input into the trained SEDCS model; the model denoises each short segment of speech, and the segments are spliced back in the order of the original signal to reconstruct the denoised speech signal.
Step 3-3: storing the result: de-emphasis is applied to the denoised speech signal, and the final denoised speech signal is stored in a specified directory.
Further, in the data preprocessing of step 1 and step 3, the pre-emphasis factor is set to 0.95, the frame length is set to 16384 sampling points, and the frame overlap is set to 1/2; that is, a window of size 16384 x 1 with a sliding step of 0.5 is used to sample and frame the speech signal, and any frame that falls short is zero-padded.
Furthermore, when the trained model performs speech denoising in step 3, the noise conditions of different noisy test utterances may differ; the model can still complete the speech enhancement task under unknown noise conditions, which shows that the method adapts to different noise scenes and is practical.
Further, the evaluation indexes in step 4 include an index for speech intelligibility, STOI, and indexes for speech quality: PESQ, CSIG, CBAK, COVL and SSNR. STOI is short-time objective intelligibility; PESQ is the perceptual evaluation of speech quality; CSIG is the mean opinion score for speech signal distortion; CBAK is the mean opinion score for background noise intrusiveness; COVL is the mean opinion score for the overall enhancement effect; SSNR is the segmental signal-to-noise ratio. Together these indexes evaluate the model accurately and effectively.
By adopting the scheme, the invention has the following beneficial effects.
1. The invention provides a speech enhancement method based on deep compressed sensing. It exploits the respective strengths of deep learning and compressed sensing to build a compressed sensing speech enhancement model combined with deep learning, takes the observed signal of the speech signal as the optimization object, effectively improves enhancement efficiency, and reduces model complexity.
2. The invention adopts a joint training mode to train the model, so that the voice signal with noise can be fitted with a clean voice signal, and the voice enhancement quality and the intelligibility are effectively improved.
3. The invention can complete the voice enhancement under different noise conditions, and has stronger adaptability and certain practicability.
Drawings
For a fuller understanding of the technical solutions of the embodiments of the invention, drawings are provided; they form part of the present application and do not limit the embodiments of the invention.
In the drawings: fig. 1 is a schematic diagram of a speech enhancement technique according to an embodiment of the present invention.
Detailed Description
The purpose, technical solutions and advantages of the embodiments of the invention are described in detail herein with reference to the accompanying drawings. The embodiments described are some, but not all, embodiments of the invention. All other embodiments obtainable by a person skilled in the art without any inventive step, based on the embodiments of the invention, fall within the scope of the invention.
The embodiment of the invention provides a speech enhancement method that obtains an SEDCS model through joint training and completes the speech enhancement task in the time domain. It addresses the complex models and slow enhancement of existing techniques, resolves the reduced intelligibility of speech reconstructed by the traditional compressed sensing method, and realizes speech enhancement more simply and flexibly.
Fig. 1 shows the technical route of the speech enhancement method provided by an embodiment of the invention, which includes the following steps.
Step 1: preprocessing the training data: pre-emphasis, pairing and framing are applied to the training data to obtain time-domain speech signal sequences.
In the data preprocessing, pre-emphasis mainly boosts the high-frequency components so that the quality of the reconstructed speech is not degraded; correspondingly, de-emphasis is performed at the output. The pre-emphasis factor is set to 0.95, the frame length to 16384 sampling points, and the frame overlap to 1/2; that is, a window of size 16384 x 1 with a sliding step of 0.5 is used to sample and frame the speech signal, and any frame that falls short is zero-padded.
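As a concrete illustration, this preprocessing can be sketched in Python with NumPy; this is a minimal sketch under the stated settings (factor 0.95, frame length 16384, 50% overlap), and the function names are illustrative rather than taken from the patent:

```python
import numpy as np

def pre_emphasis(x, alpha=0.95):
    """y[n] = x[n] - alpha * x[n-1]: boosts the high-frequency components."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def de_emphasis(y, alpha=0.95):
    """Inverse of pre_emphasis; applied to the enhanced output signal."""
    x = np.zeros_like(y)
    x[0] = y[0]
    for n in range(1, len(y)):
        x[n] = y[n] + alpha * x[n - 1]
    return x

def frame_signal(x, frame_len=16384, overlap=0.5):
    """Slice x into frames of frame_len with 50% overlap, zero-padding the tail."""
    hop = int(frame_len * (1 - overlap))                     # sliding step 0.5 -> hop 8192
    n_frames = int(np.ceil(max(len(x) - frame_len, 0) / hop)) + 1
    padded = np.zeros(frame_len + (n_frames - 1) * hop)      # short tail is padded with 0
    padded[:len(x)] = x
    return np.stack([padded[i * hop:i * hop + frame_len] for i in range(n_frames)])
```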
Step 2: constructing and training the model: a speech enhancement model based on deep compressed sensing is constructed, a suitable error function is set, the preprocessed speech training set is fed into the model for joint training, and the trained SEDCS model is deployed to a server.
The SEDCS model is constructed from two deep neural network models, a generation model $G$ and a measurement model $F$. The generation model $G$ replaces the signal sparsification process of compressed sensing: it maps the input noisy speech signal and reconstructs a generated speech signal close to the clean speech signal. The measurement model $F$ replaces the measurement matrix of compressed sensing to realize the dimension-reducing observation of a signal; its inputs are the clean speech signal and the generated speech signal, the aim being to obtain their observed signals, which serve as the optimization objects.
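The patent does not disclose the internal architectures of the two networks, so the following PyTorch sketch only illustrates the two-model structure under assumed layer choices: a fully convolutional generation model $G$ that maps a noisy 16384-sample frame to a generated frame, and a measurement model $F$ that maps a frame to a low-dimensional observation vector (the observation length m is likewise an assumption):

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """G: maps a noisy time-domain frame to a generated (enhanced) frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(                         # (B, 1, 16384) -> (B, 1, 16384)
            nn.Conv1d(1, 16, 31, padding=15), nn.PReLU(),
            nn.Conv1d(16, 32, 31, padding=15), nn.PReLU(),
            nn.Conv1d(32, 16, 31, padding=15), nn.PReLU(),
            nn.Conv1d(16, 1, 31, padding=15), nn.Tanh(),
        )

    def forward(self, x):
        return self.net(x)

class Measurement(nn.Module):
    """F: replaces the CS measurement matrix; reduces a frame to an observation."""
    def __init__(self, m=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, 31, stride=4, padding=15), nn.PReLU(),   # 16384 -> 4096
            nn.Conv1d(16, 32, 31, stride=4, padding=15), nn.PReLU(),  # 4096 -> 1024
            nn.Flatten(),
            nn.Linear(32 * 1024, m),                      # observed signal of length m
        )

    def forward(self, x):
        return self.net(x)
```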
After the model is built, it is trained jointly. The noisy speech of the training set is optimized before training; during this optimization the generation model and the measurement model are not updated, and the number of optimization steps can be specified. This lets the model converge faster and shortens the training period. The optimization uses gradient descent with the objective function

$$\hat{x} = \arg\min_{x}\ \lVert F(s) - F(G(x)) \rVert_2^2 + \lambda\,\lVert s - G(x) \rVert_2^2,$$

where $\hat{x}$ denotes the optimized noisy speech signal, $s$ the clean speech signal, $x$ the noisy speech signal, $G(x)$ the generated speech signal obtained by passing the noisy speech through the generation model, $F(s)$ and $F(G(x))$ the observed signals obtained by passing the respective speech signals through the measurement model, and $\lambda$ a weight coefficient.
The optimized noisy speech signal is fed into the generation model $G$ and then through the measurement model $F$ to obtain an observed signal. With this observed signal as the optimization object, the two models are jointly trained and optimized with the objective functions

$$\mathcal{L}_G = \lVert F(s) - F(G(\hat{x})) \rVert_2^2 + \beta\,\lVert s - G(\hat{x}) \rVert_1,$$

$$\mathcal{L}_F = \lVert F(s) - F(G(\hat{x})) \rVert_2^2,$$

where $\beta$ denotes the weight factor of the added L1 regularization term, $s$ the clean speech signal, and $\hat{s} = G(\hat{x})$ the reconstructed speech signal. The deep-compressed-sensing-based speech enhancement method aims at $\hat{s} \approx s$, i.e. at minimizing the global objective function of the model.
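A matching joint-training step, again as a hedged sketch reusing optimize_noisy from above: both networks are driven by the same observation-matching loss, with the beta-weighted L1 term (assumed here to pull the generated frame toward the clean frame) added on the generator side:

```python
def train_step(x_noisy, s_clean, G, F, opt_G, opt_F, beta=1.0):
    """One joint update of G and F on a batch of paired frames."""
    x_hat = optimize_noisy(x_noisy, s_clean, G, F)     # refined input (see sketch above)

    # Generation-model update: match observations plus weighted L1 regularization.
    opt_G.zero_grad()
    g = G(x_hat)
    loss_G = ((F(s_clean) - F(g)) ** 2).sum() + beta * (s_clean - g).abs().sum()
    loss_G.backward()
    opt_G.step()

    # Measurement-model update on the same observation-matching objective.
    opt_F.zero_grad()
    loss_F = ((F(s_clean) - F(G(x_hat))) ** 2).sum()
    loss_F.backward()
    opt_F.step()
    return loss_G.item(), loss_F.item()
```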
This joint training takes the observed signal of the speech signal as the optimization object and exploits the advantage of compressed sensing: both models converge simultaneously on relatively little data, which simplifies training, lets the reconstructed speech signal approach the clean speech signal quickly, and effectively addresses the slow enhancement rate of existing speech enhancement techniques.
Step 3: testing the model: the noisy speech test set is preprocessed, denoised with the trained SEDCS model, and reconstructed to obtain the denoised speech signal, completing the speech enhancement task.
Step 3-1: preprocessing the test data: pre-emphasis and framing are applied to the noisy speech signals of the test set, with the same pre-emphasis factor and frame size as used for the training data.
The noisy test set is preprocessed in the same way as the training set: the pre-emphasis factor is still 0.95, the frame length is still 16384 sampling points, and the frame overlap is still 1/2; a window of size 16384 x 1 with a sliding step of 0.5 is used to sample and frame the speech signal, and any frame that falls short is zero-padded.
Step 3-2: enhancing the speech: the preprocessed noisy speech signals are input into the trained SEDCS model; the model denoises each short segment of speech, and the segments are spliced back in the order of the original signal to reconstruct the denoised speech signal.
The noise conditions of different noisy test utterances may differ, yet the model can still complete the speech enhancement task under unknown noise conditions.
Step 3-3: storing the result: de-emphasis is applied to the denoised speech signal, and the final denoised speech signal is stored in a specified directory.
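Steps 3-1 to 3-3 can be strung together as below, reusing the preprocessing helpers from step 1. This is a sketch with assumptions: soundfile is used for I/O, the input is mono, and the splicing rule (keeping the first half of each overlapped frame) is one plausible reading, since the patent only says the segments are spliced in their original order:

```python
import soundfile as sf    # assumed I/O library

def enhance_file(in_path, out_path, G, frame_len=16384):
    x, fs = sf.read(in_path)                                # mono noisy speech assumed
    n = len(x)
    frames = frame_signal(pre_emphasis(x), frame_len)       # step 3-1: preprocess
    with torch.no_grad():                                   # step 3-2: denoise segments
        t = torch.tensor(frames, dtype=torch.float32).unsqueeze(1)
        den = G(t).squeeze(1).numpy()
    hop = frame_len // 2
    out = np.zeros(hop * (len(den) - 1) + frame_len)
    out[:frame_len] = den[0]
    for i in range(1, len(den)):                            # splice in original order
        out[i * hop + hop : i * hop + frame_len] = den[i][hop:]
    y = de_emphasis(out[:n])                                # step 3-3: de-emphasize, trim
    sf.write(out_path, y, fs)                               # store in the target directory
```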
Step 4: evaluating the model: the quality and intelligibility of the enhanced speech signal are evaluated with multiple evaluation indexes.
The stored denoised speech is evaluated in order to assess the performance of the model.
The evaluation indexes include an index for speech intelligibility, STOI, and indexes for speech quality: PESQ, CSIG, CBAK, COVL and SSNR. STOI is short-time objective intelligibility; PESQ is the perceptual evaluation of speech quality; CSIG is the mean opinion score for speech signal distortion; CBAK is the mean opinion score for background noise intrusiveness; COVL is the mean opinion score for the overall enhancement effect; SSNR is the segmental signal-to-noise ratio. Together these indexes evaluate the model accurately and effectively.
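As a sketch of how such scores can be computed, the open-source pystoi and pesq Python packages provide STOI and PESQ; the composite CSIG/CBAK/COVL measures and SSNR come from separate evaluation scripts and are not shown here:

```python
from pesq import pesq        # pip install pesq
from pystoi import stoi      # pip install pystoi

def evaluate(clean, enhanced, fs=16000):
    """Return (PESQ, STOI) for one utterance; inputs are 1-D float arrays."""
    return pesq(fs, clean, enhanced, 'wb'), stoi(clean, enhanced, fs, extended=False)
```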
In one embodiment of the invention, the model is evaluated on two noisy test sets.
Test set one contains five environmental noise types different from those in the training set; it simulates the enhancement effect of the model against unknown environmental noise and demonstrates the effectiveness and feasibility of the model. The results are shown in Table 1.
Test set two contains white, volvo and babble noise; white noise simulates a stationary noise environment, and the others simulate non-stationary noise environments. This set evaluates whether the reduced intelligibility of denoised speech in the traditional compressed sensing method is resolved. The results are shown in Table 2.
Table 1: scores of each evaluation index on test set one.
Table 2: PESQ and STOI scores on test set two.
To demonstrate the effectiveness and feasibility of the invention, this embodiment is also compared with a Wiener-filtering speech enhancement method. As shown in Table 1, although the PESQ of this embodiment is 0.01 lower than that of the Wiener method, the scores on all other indexes are better, indicating that this embodiment effectively suppresses noise, improves speech quality, and adapts to different noise environments.
As shown in Table 2, the scores of this embodiment are all better and improved to some extent, indicating that this embodiment resolves the reduced intelligibility of the traditional compressed sensing method.
The above embodiments further illustrate the objects, technical solutions and advantages of the invention. They are only preferred embodiments and do not limit the invention; any modifications, improvements and the like made within the spirit and principles of the invention fall within its scope.

Claims (8)

1. A speech enhancement method based on deep compressed sensing is characterized by comprising the following steps:
step 1: preprocessing training data: pre-emphasis, pairing and framing processing are carried out on training data to obtain a time domain voice signal sequence;
step 2: constructing a model and training: constructing a voice enhancement model (SEDCS) based on deep compressed sensing, setting a proper error function, inputting the processed training set voice signals into a model for joint training, and deploying the trained SEDCS model into a server;
step 3: testing the model: preprocessing the noisy speech test set, denoising it with the trained SEDCS model, and reconstructing the denoised speech signal to complete the speech enhancement task;
step 4: evaluating the model: evaluating the quality and intelligibility of the enhanced speech signal with multiple evaluation indexes.
2. The method of claim 1, wherein the SEDCS model of step 2 is constructed from two deep neural network models, a generation model $G$ and a measurement model $F$; the generation model $G$ replaces the signal sparsification process of compressed sensing, mapping the input noisy speech signal and reconstructing a generated speech signal close to the clean speech signal; the measurement model $F$ replaces the measurement matrix of compressed sensing to realize the dimension-reducing observation of a signal, obtaining the observed signals of the clean speech signal and the generated speech signal, which serve as the optimization objects.
3. The method of claim 2, wherein the two models forming the SEDCS model are trained jointly; before training, the noisy speech of the training set is first optimized by gradient descent with the objective function

$$\hat{x} = \arg\min_{x}\ \lVert F(s) - F(G(x)) \rVert_2^2 + \lambda\,\lVert s - G(x) \rVert_2^2,$$

where $\hat{x}$ denotes the optimized noisy speech signal, $s$ the clean speech signal, $x$ the noisy speech signal, $G(x)$ the generated speech signal obtained by passing the noisy speech through the generation model, $F(s)$ and $F(G(x))$ the observed signals obtained by passing the respective speech signals through the measurement model, and $\lambda$ a weight coefficient.
4. The method of claim 3, wherein the optimized noisy speech signal is fed into the generation model $G$ and then through the measurement model $F$ to obtain an observed signal; with this observed signal as the optimization object, the two models are jointly trained and optimized with the objective functions

$$\mathcal{L}_G = \lVert F(s) - F(G(\hat{x})) \rVert_2^2 + \beta\,\lVert s - G(\hat{x}) \rVert_1,$$

$$\mathcal{L}_F = \lVert F(s) - F(G(\hat{x})) \rVert_2^2,$$

where $\beta$ denotes the weight factor of the added L1 regularization term, $s$ the clean speech signal, and $\hat{s} = G(\hat{x})$ the reconstructed speech signal; the deep-compressed-sensing-based speech enhancement method aims at $\hat{s} \approx s$, i.e. at minimizing the global objective function of the model.
5. The method for enhancing speech based on deep compressed sensing of claim 1, wherein the step 3 comprises the following sub-steps:
step 3-1: preprocessing test data, performing pre-emphasis and framing processing on the noisy speech signals of the test set, wherein pre-emphasis factors and the size of each frame are the same as those of the training data;
step 3-2: enhancing the voice: inputting the preprocessed voice signal with noise into a trained SEDCS model, denoising each small segment of voice by the model, and splicing and reconstructing the small segments of voice according to the sequence of the original clean voice signal to obtain a denoised voice signal;
step 3-3: and (4) storing the result: de-emphasis is carried out on the de-noised voice signal, and the finally obtained de-noised voice signal is stored at a specified position.
6. The method as claimed in claim 1 or 5, wherein in the data preprocessing of step 1 and step 3 the pre-emphasis factor is set to 0.95, the frame length is set to 16384 sampling points, and the frame overlap is set to 1/2; that is, a window of size 16384 x 1 with a sliding step of 0.5 is used to sample and frame the speech signal, and any frame that falls short is zero-padded.
7. The method as claimed in claim 1, wherein when the trained model is used for performing the speech denoising process in step 3, noise conditions of different noisy speech test data may be different, and the model can complete a speech enhancement task when dealing with unknown noise conditions.
8. The method according to claim 1, wherein the evaluation indexes in step 4 comprise an index for speech intelligibility, STOI, and indexes for speech quality: PESQ, CSIG, CBAK, COVL and SSNR.



Legal Events

  • PB01: Publication
  • SE01: Entry into force of request for substantive examination
  • GR01: Patent grant