CN111383652B - Single-channel voice enhancement method based on double-layer dictionary learning - Google Patents

Publication number: CN111383652B (granted; earlier publication CN111383652A)
Application number: CN201911021192.3A
Legal status: Active
Inventors: 孙林慧, 吴子皓, 谢可丽
Applicant and current assignee: Nanjing University of Posts and Telecommunications
Original language: Chinese (zh)

Classifications

    • G10L21/0208 — Speech or voice signal processing techniques; speech enhancement, e.g. noise reduction or echo cancellation; noise filtering
    • G06F18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/28 — Pattern recognition; determining representative reference patterns, e.g. by averaging or distorting; generating dictionaries

Abstract

A single-channel speech enhancement method based on double-layer dictionary learning comprises the following steps: S1, collecting, preprocessing and mixing input speech and noise samples, training sparse sub-dictionaries from the speech and noise samples, and obtaining a single-layer joint dictionary and a double-layer joint dictionary from the sub-dictionaries through a constrained objective optimization function; S2, enhancing the noisy speech by projecting it on the first layer of the double-layer joint dictionary, and judging whether to project on the second layer of the double-layer joint dictionary by comparing the energy of the enhanced noisy speech with a preset energy threshold; and S3, evaluating the performance of the single-channel speech enhancement method based on double-layer dictionary learning. According to the method, whether the more redundant joint dictionary is needed is decided from the commonality of speech and noise in the signal enhanced by the single-layer dictionary, which effectively reduces the cross-projection phenomenon and improves the discriminability of the joint dictionary.

Description

Single-channel voice enhancement method based on double-layer dictionary learning
Technical Field
The invention belongs to the technical field of voice signal processing, and particularly relates to a single-channel voice enhancement method based on double-layer dictionary learning.
Background
In a practical environment, a speech signal is often interfered with by various noises. The purpose of speech enhancement is to remove the noise from the noisy speech signal as effectively as possible and to estimate the speech signal without distortion. The most traditional speech enhancement algorithms are spectral subtraction, statistical-model-based methods and subspace methods. In real life, however, the noise is often non-stationary and possibly similar to clean speech, which exposes the drawbacks of the conventional methods. In recent years, sparse-model-based speech enhancement algorithms have proven effective in reducing the interference of non-stationary noise and perform well in speech enhancement and source separation. Thus, various speech-signal-processing algorithms based on sparse models continue to draw close attention from researchers. Sigg proposed a speech enhancement algorithm based on joint dictionary learning, in which clean speech is estimated by sparse coding of the noisy speech signal on a joint dictionary consisting of a clean-speech sub-dictionary and a noise sub-dictionary.
When the noisy mixed signal is sparsely represented on the joint dictionary, part of the speech component is still projected onto the interfering noise sub-dictionary. Similarly, part of the noise component may be projected onto the interfering speech sub-dictionary; this "cross projection" causes more source aliasing. To some extent, there are also similar components between the speech signal and the noise signal, and the commonality is especially prominent between non-stationary noise such as car noise and speech signals.
Disclosure of Invention
The invention aims to solve the technical problem of overcoming the defects of the prior art, and provides a single-channel voice enhancement method based on double-layer dictionary learning.
The invention provides a single-channel voice enhancement method based on double-layer dictionary learning, which comprises the following steps,
step S1, acquiring, preprocessing and mixing input voice and noise samples, training the voice and noise samples into a sparse sub-dictionary, and obtaining a single-layer joint dictionary and a double-layer joint dictionary through a constraint target optimization function by the sparse sub-dictionary;
and S2, enhancing the noisy speech, projecting the noisy speech on a first layer of the double-layer combined dictionary, and judging whether to project the noisy speech on a second layer of the double-layer combined dictionary by comparing the energy of the enhanced noisy speech with a preset threshold.
As a further technical solution of the present invention, the step S1 includes the following steps,
step S11, sampling each sentence of input time-domain continuous pure voice and noise samples, and adjusting the amplitude of the noise signal according to the value of the preset signal-to-noise ratio to obtain M frames of voice signals, noise signals and noise-carrying voice signals for training;
step S12, respectively carrying out K-SVD dictionary training on the voice signal and the noise signal, and reconstructing the signal by adopting BP algorithm to obtain a voice sub-dictionary and a noise sub-dictionary;
and S13, splicing the voice sub-dictionary and the noise sub-dictionary into an initial single-layer joint dictionary and a double-layer joint dictionary, and obtaining the optimal single-layer joint dictionary and double-layer joint dictionary through constraint objective optimization functions.
Further, the specific steps of obtaining the optimal single-layer joint dictionary and the optimal double-layer joint dictionary are that,
step S131, the voice sub-dictionary and the noise sub-dictionary are spliced into a joint dictionary D;
step S132, projecting the trained noisy speech signal on an initial joint dictionary D through a BP algorithm to obtain a sparse coding coefficient;
s133, calculating the gradient of a target optimization function required in the L-BFGS algorithm;
and step S134, iteratively solving an optimization function through an L-BFGS algorithm to obtain an optimal joint dictionary.
Further, in step S131, in learning the single-layer joint dictionary, the first-layer speech sub-dictionary D_s and noise sub-dictionary D_n are concatenated into the single-layer joint dictionary D = [D_s, D_n]; in learning the double-layer joint dictionary, the first-layer learned speech sub-dictionary D_s and noise sub-dictionary D_n are fixed, the second-layer learned speech sub-dictionary E_s and noise sub-dictionary E_n are optimized, and finally all sub-dictionaries are spliced to obtain the double-layer joint dictionary D = [D_s, E_s, D_n, E_n].
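The splicing of sub-dictionaries into the single- and double-layer joint dictionaries can be illustrated as below. The atom dimension and sub-dictionary widths (256 × 256 each, giving the 256 × 512 and 256 × 1024 joint dictionaries stated later in the embodiment) are taken from the description; the code itself is an illustrative sketch, not the patent's implementation.

```python
import numpy as np

# Stand-in trained sub-dictionaries (in the patent these come from K-SVD).
D_s, D_n = np.random.randn(256, 256), np.random.randn(256, 256)  # first layer
E_s, E_n = np.random.randn(256, 256), np.random.randn(256, 256)  # second layer

D_single = np.hstack([D_s, D_n])            # D = [D_s, D_n],          256 x 512
D_double = np.hstack([D_s, E_s, D_n, E_n])  # D = [D_s, E_s, D_n, E_n], 256 x 1024
```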
Further, in step S134, in the optimization of the single-layer joint dictionary, the optimization function comprises four error constraints:
the first term minimizes the sparse representation error of the training noisy speech on the joint dictionary D = [D_s, D_n];
the second term minimizes the sparse representation error of the clean speech signal on the speech sub-dictionary D_s;
the third term minimizes the sparse representation error of the noise signal on the noise sub-dictionary D_n;
the fourth term minimizes the inner product of the speech sub-dictionary and the noise sub-dictionary.
In the optimization of the double-layer joint dictionary, the optimization function comprises five error constraints:
the first term minimizes the sparse representation error of the training noisy speech on the joint dictionary D = [D_s, E_s, D_n, E_n];
the second term minimizes the sparse representation error of the clean speech signal and the speech component of the training noisy speech on the corresponding sub-dictionaries D_s and E_s;
the third term minimizes the sparse representation error of the clean noise signal and the noise component of the training noisy speech on the corresponding sub-dictionaries D_n and E_n;
the fourth term minimizes the inner product of the first-layer learned speech sub-dictionary D_s and the second-layer learned sub-dictionary E_s that expresses speech components;
the fifth term minimizes the inner product of the first-layer learned noise sub-dictionary D_n and the second-layer learned sub-dictionary E_n that expresses noise components.
Further, the step S2 includes the steps of,
s21, projecting the tested noisy speech on a first-layer joint dictionary, and multiplying the sparse coefficient matrix by the first-layer joint dictionary to obtain an enhanced speech signal based on the first-layer joint dictionary learning;
step S22, comparing the energy of the voice with noise which is enhanced by the first-layer combined dictionary learning with a preset energy threshold, if the energy is smaller than the energy threshold, indicating that the phenomenon of cross projection in the voice enhanced by the first-layer combined dictionary is serious, and executing step S23 to project the voice with noise in the second-layer combined dictionary; and if the energy threshold value is larger than the energy threshold value, the enhanced voice signal is learned based on the first-layer dictionary and is output as the final enhanced voice.
And S23, projecting the noisy speech on the second-layer joint dictionary, multiplying the sparse coefficient matrix by the second-layer joint dictionary to obtain an enhanced speech signal based on the second-layer joint dictionary learning, and outputting the enhanced speech signal as final enhanced speech.
Further, the method also comprises a step S3 of evaluating the performance of the single-channel voice enhancement method based on double-layer dictionary learning.
The invention adaptively selects the double-layer dictionary through the energy threshold, which greatly improves system performance; the constrained objective optimization function suppresses the generation of cross projection and improves the discriminability of the joint dictionary, so that speech components are projected more onto the corresponding speech dictionary and the enhancement effect on the noisy mixed signal is improved.
Compared with the prior art, the joint-dictionary extraction adopted by the method produces less cross projection and a more discriminative joint dictionary.
Drawings
FIG. 1 is a block diagram of the operational flow of the present invention;
FIG. 2 is a block diagram of a training dual-layer dictionary process of the present invention;
FIG. 3 is a schematic diagram of a single-layer dictionary with noise speech signal enhancement module according to the present invention;
FIG. 4 is a schematic diagram of a dual-layer dictionary with noisy speech signal enhancement module of the present invention.
Detailed Description
Referring to fig. 1, the present embodiment provides a single-channel speech enhancement method based on dual-layer dictionary learning, which includes the following steps,
step 1: and acquiring, preprocessing and mixing the input pure voice and noise samples, respectively training the pure voice and noise samples into sparse sub-dictionaries, and then obtaining a single-layer joint dictionary and a double-layer joint dictionary through constraint target optimization functions.
Step 1-1: preprocessing of the input signal
A speech signal and a noise signal in .wav format are input, each sentence of the input time-domain continuous signal is sampled, and the signals are then preprocessed. The preprocessing mainly comprises pre-emphasis, framing and windowing. Each segment of speech is preprocessed to obtain M frame signals, which are mixed at the preset signal-to-noise ratio to obtain the noisy speech signal for training.
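The mixing and framing described above can be sketched as follows. The Hamming window, hop size and function names are illustrative assumptions, not taken from the patent (the embodiment later uses a rectangular window of length 256):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech + noise has the target SNR in dB."""
    noise = noise[:len(speech)]
    p_s = np.sum(speech ** 2)
    p_n = np.sum(noise ** 2)
    # Gain g such that 10*log10(p_s / (g^2 * p_n)) == snr_db.
    g = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10.0)))
    return speech + g * noise

def frame_signal(x, frame_len=256, hop=128):
    """Split a 1-D signal into M overlapping windowed frames (M x frame_len)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    win = np.hamming(frame_len)
    return np.stack([x[i * hop:i * hop + frame_len] * win
                     for i in range(n_frames)])
```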
Step 1-2, learning based on sparse dictionary
The sparse dictionaries are trained from the speech signal and the noise signal based on the K-SVD algorithm, yielding a speech sub-dictionary and a noise sub-dictionary. The K-SVD algorithm consists of two stages: sparse coding and dictionary updating. First, using any sparse representation method (the present invention uses the OMP method), the sparse coefficient vectors of the samples T on a given dictionary D are calculated. Then, using the sparse representation obtained in the first step, the column atoms of the dictionary matrix are updated to better fit the signal. Each atom of the dictionary is updated in sequence, i.e. column by column, while the other atoms are kept fixed.
Finally, the two steps are repeated until the condition of algorithm convergence is reached.
The basic content of the dictionary-updating step is as follows. First, the overall representation error matrix E_l is computed as

E_l = T − Σ_{i≠l} d_i x_i

where d_i is the i-th column atom of the dictionary matrix D, x_i is the i-th row of the coding coefficient matrix Γ, and E_l represents the residual when the l-th atom is deleted from the dictionary matrix. Then SVD is applied to E_l, and the column vector corresponding to the largest singular value is used to find the replacement d_l and x_l, so that d_l x_l approximates E_l by a rank-1 matrix. When all atoms in the dictionary have been updated, one iteration ends. On the next iteration, the learned dictionary acts as the initial dictionary for sparse coding, until the iteration terminates.
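The K-SVD atom update just described can be sketched in a few lines. This is an illustrative implementation of the standard update (restricting E_l to the signals that actually use atom l, as in the usual K-SVD formulation); names and shapes are assumptions, not the patent's code:

```python
import numpy as np

def ksvd_update_atom(D, X, T, l):
    """One K-SVD dictionary-update step for atom l (updates D and X in place).

    E_l = T - sum_{i != l} d_i x_i is the residual with atom l removed;
    the rank-1 SVD approximation of E_l gives the new atom d_l and its
    coefficient row, so the restricted representation error cannot increase.
    """
    using = np.nonzero(X[l, :])[0]          # signals whose code uses atom l
    if using.size == 0:
        return D, X
    X[l, using] = 0.0
    E_l = T[:, using] - D @ X[:, using]     # residual without atom l's contribution
    U, s, Vt = np.linalg.svd(E_l, full_matrices=False)
    D[:, l] = U[:, 0]                       # new unit-norm atom
    X[l, using] = s[0] * Vt[0, :]           # matching coefficient row
    return D, X
```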
Step 1-3, obtaining an initial single-layer joint dictionary
Similar to general dictionary-learning methods, each iterative optimization of the dictionary can be divided into two phases: first, the single-layer joint dictionary D is fixed and the signals are sparsely represented under the current dictionary to obtain X_1; then the sparse vector matrix X_1 is fixed and the single-layer joint dictionary D is updated so that the objective function is minimized.
The first step of dictionary learning is to initialize the joint dictionary D. In the invention, the first-layer speech sub-dictionary D_s and noise sub-dictionary D_n obtained in step 1-2 are concatenated into the single-layer joint dictionary D = [D_s, D_n], of size 256 × 512.
The second step of dictionary learning is the code update phase. When the initial D is fixed, the coding coefficients will be updated with the following optimization function:
min ||X_1||_1   s.t.   M = D X_1
the sparse coding method BP algorithm which can well reconstruct signals is used for obtaining the coding vector matrix.
The third step of dictionary learning is dictionary updating. When fixing the coding coefficients, the joint dictionary is updated using the following optimization function:
to jointly optimize each sub-dictionary, consider introducing a matrix Q i ,i=1,2,3,4。 And->Wherein 0 represents an all-zero matrixI represents an identity matrix. Thus, the above-mentioned transformation is:
when the joint dictionary is updated in the new error function, the speech sub-dictionary and the noise sub-dictionary can be updated simultaneously. Similar to the way the objective function value is minimized, the finite memory BFGS algorithm (L-BFGS) in quasi-Newton's method is still employed to solve the optimization problem. Meanwhile, it is necessary to find the gradient function of the above-mentioned objective function. Therefore, the gradient function is:
through the method, the distinguishing single-layer combined dictionary with the size of 256 x 512 can be learned.
Step 1-4, obtaining an initial double-layer joint dictionary
Training a dual-layer dictionary is similar to steps 1-3, and includes three processes of initializing the dictionary, updating the code, and updating the dictionary.
In the stage of initializing the double-layer joint dictionary, the training samples are first used to obtain [D_s, D_n] in the first-layer dictionary learning; then, in the second-layer dictionary learning, the first-layer learned sub-dictionaries D_s and D_n are fixed and only the sub-dictionaries E_s and E_n are optimized and updated. The initial second-layer dictionaries E_s and E_n are learned from the training samples of the corresponding sources as in step 1-2. Finally, the sub-dictionaries are spliced into the double-layer joint dictionary D = [D_s, E_s, D_n, E_n], of size 256 × 1024.
In the code-updating stage, the sparse coding matrix is solved by the BP algorithm. With the initial D fixed, the sparse coding problem is still solved with the following optimization function:

min ||X_1||_1   s.t.   M = D X_1

It is worth emphasizing that, unlike single-layer dictionary learning, the dictionary here is the double-layer joint dictionary D = [D_s, E_s, D_n, E_n] rather than [D_s, D_n].
In the dictionary updating phase, when the coding coefficients are fixed, the joint dictionary is updated by an optimization function of the following formula:
the following variables are defined herein:
substituting the above formula into the objective function, the objective optimization function can be written as follows:
for joint optimization of the second layer joint dictionary, define e= [ E s ,E n ],And introducing a matrix:and->Where 0 represents an all-zero matrix and I represents an identity matrix. Thus, it can be written as:
the optimization problem is solved by adopting a limited memory BFGS algorithm (L-BFGS) in a quasi-Newton method. Meanwhile, it is necessary to find its bias derivative for the above-mentioned objective function. Therefore, the partial derivative function is:
finally, the second layer dictionary [ E ] obtained by learning s ,E n ]With a first layer dictionary [ D ] s ,D n ]Combined and spliced double-layer joint dictionary D= [ D ] s ,E s ,D n ,E n ]The dictionary size is 256 x 1024.
As shown in fig. 2, a sample of the process of training a dual-layer dictionary is shown.
Step 2: and carrying out enhancement processing on the noisy speech by adopting a speech enhancement method based on dictionary learning, projecting the tested noisy speech on the first layer of joint dictionary, and then selecting whether to project the second layer of dictionary according to the condition that the energy of the enhanced speech signal is compared with a preset energy threshold value.
Step 2-1, projection on first layer joint dictionary
In the speech enhancement stage, the sparse coefficients X are obtained by the BP method. Once the code matrix X of the mixed signal s projected on the joint dictionary D = [D_s, D_n] is obtained, the speech signal can be reconstructed from the first-layer speech sub-dictionary D_s and its response. Define X = [(X_s)^T, (X_n)^T]^T, where X_s and X_n represent the coding coefficients corresponding to the sub-dictionaries D_s and D_n respectively. Given the noisy speech signal s and the joint dictionary D, the sparse coding problem can be written as min ||X||_1 s.t. s = DX.
The enhanced speech signal may be reconstructed as ŝ = D_s X_s.
As shown in fig. 3, a sample of single layer dictionary noisy speech signal enhancement is shown.
Step 2-2, comparing with a preset energy threshold
The energy of the speech signal after first-layer dictionary enhancement is calculated with the energy formula and compared with an energy threshold derived from experiments. When the speech energy is larger than the threshold, the speech enhanced by the first-layer dictionary already has a good enhancement effect and the additional gain from the double-layer dictionary would be small, so the time and computation required by the enhancement process are saved. When the speech energy is smaller than the threshold, more "cross projection" still exists in the speech enhanced by the first-layer dictionary; to some extent, there are similar components between the speech signal and the noise signal. To suppress cross projection, second-layer dictionary learning is required after training the first-layer joint dictionary, mainly so that the components the first layer cannot distinguish are interpreted by the second-layer dictionary.
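The energy-threshold decision can be sketched as follows. The threshold value itself is experiment-derived in the patent and left as a parameter here; the function name is an assumption:

```python
import numpy as np

def select_layer(enhanced_first_layer, energy_threshold):
    """Decide whether second-layer projection is needed (illustrative sketch).

    Low energy of the first-layer enhanced speech indicates severe cross
    projection, so the second-layer joint dictionary should be used.
    """
    energy = np.sum(enhanced_first_layer ** 2)
    return "second_layer" if energy < energy_threshold else "first_layer"
```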
Step 2-3, projection on a second layer joint dictionary
The joint dictionary D = [D_s, E_s, D_n, E_n] is a double-layer dictionary comprising four sub-dictionaries, where D_s and D_n form a discriminative dictionary expressing the distinguishable components of the speech and noise signals, and E_s and E_n express the components of the speech and noise signals that interfere with each other. X is the sparse coefficient matrix of the mixed signal s on the joint dictionary D = [D_s, E_s, D_n, E_n], obtained by the BP method, the same coding algorithm as in dictionary optimization. Once the code matrix X of the mixed signal s projected on D = [D_s, E_s, D_n, E_n] is obtained, the speech signal can be reconstructed from the responses of the first-layer speech sub-dictionary D_s and the second-layer speech sub-dictionary E_s. Specifically, the estimated speech signal is ŝ = D_s X_s + E_s X_Es, where X_s and X_Es denote the coding coefficients corresponding to D_s and E_s.
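The reconstruction step, which keeps only the responses of the two speech sub-dictionaries from the code on D = [D_s, E_s, D_n, E_n], can be sketched as below; the row-splitting helper and shapes are illustrative assumptions:

```python
import numpy as np

def reconstruct_speech(D_s, E_s, D_n, E_n, X):
    """Reconstruct enhanced speech from the double-layer joint-dictionary code.

    X is the sparse code of the mixture on D = [D_s, E_s, D_n, E_n]; the
    speech estimate keeps only the speech sub-dictionary responses:
    s_hat = D_s X_s + E_s X_es.
    """
    k1, k2 = D_s.shape[1], E_s.shape[1]
    X_s = X[:k1]                 # rows of X matching D_s
    X_es = X[k1:k1 + k2]         # rows of X matching E_s
    return D_s @ X_s + E_s @ X_es
```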
As shown in fig. 4, a sample of noisy speech signal enhancement is shown.
Step 3: and performing performance evaluation on the single-channel voice enhancement method based on double-layer dictionary learning.
The invention selects male speech signals from the CASIA Chinese speech corpus for single-channel speech denoising. All utterances are sampled at 16 kHz. To obtain noisy speech signals, three background noises taken from the NOISEX-92 database, namely white noise (white), cockpit noise (f16) and vehicle noise (volvo), are artificially added to the corpus utterances at signal-to-noise ratios of −5 dB, 0 dB and 5 dB. In the training stage, the durations of the speech and noise signals are set equal, and the speech training set consists of 60 sentences from male speakers. The mixed signal is formed by additively combining clean speech data and noise data at a given signal-to-noise ratio. In the test stage, ten male utterances and noise signals are randomly selected from the corpus and additively combined at a given signal-to-noise ratio; the simulation results are the average over all experiments. The speech used in the test stage does not coincide with the training set. In the experiments, the time-domain amplitude spectra of the speech signal and the noise are used, and a rectangular window of length 256 is used to frame and window the input signal, yielding clean speech, noise and noisy speech training sets of size 256 × 10000 for each corresponding signal-to-noise ratio.
To verify the effectiveness of the method, two objective evaluation criteria, the global signal-to-noise ratio (SNR) and the perceptual evaluation of speech quality (PESQ), are adopted to measure the effect of single-channel speech noise reduction. The global signal-to-noise ratio evaluates overall speech quality by comparing the energies of the clean speech and the interfering components. PESQ evaluates the hearing quality of the denoised speech; a higher PESQ indicates better speech quality.
The quality of the single-channel speech noise-reduction algorithm is evaluated with the global signal-to-noise-ratio improvement value after noise reduction, where the improvement value is the difference between the index calculated after noise reduction and the index calculated from the unprocessed noisy signal. Table 1 shows the speech noise-reduction results under different noise environments.
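The global SNR and the improvement (lifting) value defined above can be computed as follows; this is a minimal sketch and the function names are assumptions:

```python
import numpy as np

def global_snr_db(clean, estimate):
    """Global SNR of an estimate against the clean reference, in dB."""
    interference = estimate - clean
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(interference ** 2))

def snr_improvement(clean, noisy, enhanced):
    """SNR improvement value: output SNR minus input SNR (as in Table 1)."""
    return global_snr_db(clean, enhanced) - global_snr_db(clean, noisy)
```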
TABLE 1 comparison of the noise reduction and signal to noise ratio improvement values of voices under different noise environments
For the white noise at 5 dB and the F16 noise at −5, 0 and 5 dB, the energy after single-layer-dictionary enhancement is larger than the energy threshold; the signal-to-noise ratio after double-layer-dictionary enhancement is nevertheless also calculated, in order to observe the effect of enhancing with the double-layer dictionary after the energy-threshold judgment.
From the table the following conclusions can be drawn. Comparing single-channel speech noise reduction based on double-layer dictionary learning with that based on the K-SVD dictionary and on discriminative single-layer dictionary learning, the method of the invention generally achieves a higher signal-to-noise-ratio improvement value. In particular, under vehicle noise (volvo) and white noise, the improvement before and after noise reduction is clearly higher than that of the other single-layer dictionary-learning methods at every input signal-to-noise ratio. The improvement of the K-SVD dictionary method is lower than that of the other methods, mainly because its joint dictionary is not sufficiently discriminative: "cross projection" occurs between the sub-dictionaries, leading to poor noise reduction. When the proposed method trains the dictionary, the double-layer dictionary learning lets the distinguishable components of the source signals be sparsely expressed on the corresponding first-layer sub-dictionaries while the confusable components are expressed on the second-layer sub-dictionaries, so cross projection is suppressed and speech quality is improved. The table also shows that the noise-reduction effect of the algorithm is good in low signal-to-noise-ratio environments, while its performance at high signal-to-noise ratios has certain limitations. This can be explained by the fact that at high signal-to-noise ratio the amplitude of the noise signal is small and strongly interfered with by the speech signal, which affects the noise-reduction performance to some extent.
It can also be seen from the table that when the speech energy after first-layer dictionary enhancement is greater than the energy threshold, a good enhancement effect has already been obtained and the further gain from the double-layer dictionary is small. In summary, in terms of the signal-to-noise-ratio improvement value, the algorithm reduces the noise in noisy speech more effectively than the comparison algorithms under interfering noise, offers better dictionary selectivity under the energy-threshold judgment, particularly at low signal-to-noise ratios, and reduces the computational load of the algorithm.
To better compare single-channel speech noise reduction, the noise-reduction performance of the algorithm of the present invention is next measured by another indicator, PESQ. Speech is often corrupted by various noises, which not only reduce speech intelligibility but also degrade the performance of speech-processing systems; the task of speech noise reduction is therefore to improve the clarity and intelligibility of speech against a noisy background. The PESQ score of speech is proportional to its clarity, i.e. the higher the score, the better the hearing effect. The noisy speech refers to input mixed signals at signal-to-noise ratios of −5, 0 and 5 dB under the different noise types. The PESQ results of each algorithm are shown in Table 2.
TABLE 2 PESQ values for different methods in different noise environments
For the white noise at 5 dB and the F16 noise at −5, 0 and 5 dB, the energy after single-layer-dictionary enhancement is larger than the energy threshold; the PESQ value after double-layer-dictionary enhancement is nevertheless also calculated, in order to observe the clarity gained by enhancing with the double-layer dictionary after the energy-threshold judgment.
To compare the noise-reduction effect of the different methods under different noises and SNR environments, their PESQ values are computed. In every noise environment, the double-layer dictionary algorithm achieves a higher overall PESQ value than the other methods, improving the listening quality to some extent; in particular, under vehicle noise its PESQ value is far higher than that of the other methods at all input SNRs. Although its PESQ is only comparable to that of the single-layer discriminative dictionary under F16 noise at an input SNR of 5 dB, it provides a clear improvement in the other environments. The table also shows that when the speech energy after first-layer enhancement exceeds the energy threshold, a good result has already been obtained, and the PESQ after second-layer enhancement is essentially unchanged. From the PESQ values it can be concluded that the proposed single-channel speech noise-reduction method based on double-layer dictionary learning achieves a good noise-reduction effect, that the energy-threshold decision provides good dictionary selectivity, and that it reduces the computational load of the algorithm.
The foregoing has shown and described the basic principles, principal features, and advantages of the invention. It will be understood by those skilled in the art that the invention is not limited to the specific embodiments described above; the embodiments and descriptions merely illustrate its principles, and various changes and modifications may be made without departing from the spirit and scope of the invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (6)

1. A single-channel speech enhancement method based on double-layer dictionary learning, characterized by comprising the following steps:
Step S1: acquire, preprocess, and mix input speech and noise samples, train the speech and noise samples into sparse sub-dictionaries, and obtain a single-layer joint dictionary and a double-layer joint dictionary from the sub-dictionaries via a constrained objective optimization function;
Step S2: enhance the noisy speech by projecting it on the first layer of the double-layer joint dictionary, and decide whether to project it on the second layer by comparing the energy of the enhanced speech with a preset threshold;
wherein step S1 comprises:
Step S11: sample each sentence of the input time-domain continuous clean speech and the noise samples, and scale the amplitude of the noise signal according to a preset signal-to-noise ratio, to obtain M frames of speech, noise, and noisy-speech signals for training;
Step S12: perform K-SVD dictionary training on the speech signal and the noise signal respectively, reconstructing the signals with the BP algorithm, to obtain a speech sub-dictionary and a noise sub-dictionary;
Step S13: splice the speech sub-dictionary and the noise sub-dictionary into an initial single-layer joint dictionary and an initial double-layer joint dictionary, and obtain the optimal single-layer and double-layer joint dictionaries via the constrained objective optimization function.
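Step S1 can be sketched with a minimal K-SVD implementation. This is an illustrative sketch on random stand-in frames, not the patent's training code: OMP is used here in place of the BP algorithm for brevity, and dictionary sizes, sparsity, and iteration counts are arbitrary choices.

```python
import numpy as np

def omp(D, y, k):
    """Orthogonal Matching Pursuit: k-sparse code of y over dictionary D."""
    residual, idx = y.copy(), []
    for _ in range(k):
        idx.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, idx], y, rcond=None)
        residual = y - D[:, idx] @ coef
    x = np.zeros(D.shape[1])
    x[idx] = coef
    return x

def ksvd(Y, n_atoms, k, n_iter=10, seed=0):
    """Minimal K-SVD: alternate sparse coding and rank-1 atom updates via SVD."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((Y.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_iter):
        X = np.column_stack([omp(D, y, k) for y in Y.T])
        for j in range(n_atoms):
            users = np.nonzero(X[j])[0]
            if users.size == 0:
                continue
            # Error matrix excluding atom j's contribution, restricted to frames using it.
            E = Y[:, users] - D @ X[:, users] + np.outer(D[:, j], X[j, users])
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, j], X[j, users] = U[:, 0], s[0] * Vt[0]
    return D

# Train a speech and a noise sub-dictionary, then splice the joint dictionary.
rng = np.random.default_rng(1)
S = rng.standard_normal((16, 200))   # stand-in training frames (speech)
N = rng.standard_normal((16, 200))   # stand-in training frames (noise)
Ds, Dn = ksvd(S, 32, 3), ksvd(N, 32, 3)
D = np.hstack([Ds, Dn])              # single-layer joint dictionary D = [Ds, Dn]
print(D.shape)
```

The splice in the last step mirrors step S13; the subsequent L-BFGS refinement of the joint dictionary (claim 2) is not shown here.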
2. The single-channel speech enhancement method based on double-layer dictionary learning according to claim 1, wherein the optimal single-layer and double-layer joint dictionaries are obtained by the following steps:
Step S131: splice the speech sub-dictionary and the noise sub-dictionary into a joint dictionary D;
Step S132: project the training noisy-speech signal on the initial joint dictionary D via the BP algorithm to obtain sparse coding coefficients;
Step S133: compute the gradient of the objective optimization function required by the L-BFGS algorithm;
Step S134: iteratively solve the optimization function with the L-BFGS algorithm to obtain the optimal joint dictionary.
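The L-BFGS refinement of steps S133-S134 can be sketched with `scipy.optimize.minimize`. This is a simplified illustration, assuming fixed sparse codes C and keeping only the first (reconstruction) term of the patent's objective; the sub-dictionary and incoherence terms would be added analogously, each contributing to the gradient.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
m, n_atoms, n_frames = 16, 24, 100
Y = rng.standard_normal((m, n_frames))  # stand-in noisy training frames
# Stand-in sparse codes, as would be produced by BP in step S132.
C = rng.standard_normal((n_atoms, n_frames)) * (rng.random((n_atoms, n_frames)) < 0.1)

def objective(d_flat):
    """Reconstruction error ||Y - D C||_F^2 and its gradient w.r.t. D (step S133)."""
    D = d_flat.reshape(m, n_atoms)
    R = Y - D @ C
    return np.sum(R ** 2), (-2.0 * R @ C.T).ravel()

D0 = rng.standard_normal((m, n_atoms))
res = minimize(objective, D0.ravel(), jac=True, method="L-BFGS-B")  # step S134
D_opt = res.x.reshape(m, n_atoms)
print(res.fun < objective(D0.ravel())[0])
```

Supplying the analytic gradient via `jac=True` is what makes L-BFGS practical here; the full method would alternate this dictionary update with re-running BP for the codes.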
3. The single-channel speech enhancement method according to claim 2, wherein in step S131, for single-layer joint dictionary learning, the first-layer speech sub-dictionary D_s and noise sub-dictionary D_n are concatenated into the single-layer joint dictionary D = [D_s, D_n]; for double-layer joint dictionary learning, the speech sub-dictionary D_s and noise sub-dictionary D_n learned in the first layer are fixed, the second-layer speech sub-dictionary E_s and noise sub-dictionary E_n are optimized, and finally all sub-dictionaries are concatenated into the double-layer joint dictionary D = [D_s, E_s, D_n, E_n].
4. The single-channel speech enhancement method according to claim 2, wherein in step S134, in the optimization of the single-layer joint dictionary, the optimization function consists of four error constraints:
the first term minimizes the sparse representation error of the training noisy speech over the joint dictionary D = [D_s, D_n];
the second term minimizes the sparse representation error of the clean speech signal over the speech sub-dictionary D_s;
the third term minimizes the sparse representation error of the noise signal over the noise sub-dictionary D_n;
the fourth term minimizes the inner product of the speech sub-dictionary and the noise sub-dictionary;
in the optimization of the double-layer joint dictionary, the optimization function consists of five error constraints:
the first term minimizes the sparse representation error of the training noisy speech over the joint dictionary D = [D_s, E_s, D_n, E_n];
the second term minimizes the sparse representation error of the clean speech signal and of the speech component of the training noisy-speech signal over the corresponding sub-dictionaries D_s and E_s;
the third term minimizes the sparse representation error of the clean noise signal and of the noise component of the training noisy-speech signal over the corresponding sub-dictionaries D_n and E_n;
the fourth term minimizes the inner-product error between the speech sub-dictionary D_s learned in the first layer and the sub-dictionary E_s learned in the second layer to express the speech component;
the fifth term minimizes the inner-product error between the noise sub-dictionary D_n learned in the first layer and the sub-dictionary E_n learned in the second layer to express the noise component.
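One plausible formalization of the constraints enumerated in claim 4, assuming Frobenius norms, coefficient blocks C (with speech part C_s and noise part C_n), and nonnegative weights λ_i that are not specified in the claim text, is:

```latex
% Single-layer joint dictionary (four terms; weights lambda_i assumed)
\min_{D_s,\,D_n}\;
\bigl\| Y - [D_s, D_n]\,C \bigr\|_F^2
+ \lambda_1 \bigl\| S - D_s C_s \bigr\|_F^2
+ \lambda_2 \bigl\| N - D_n C_n \bigr\|_F^2
+ \lambda_3 \bigl\| D_s^{\top} D_n \bigr\|_F^2

% Double-layer joint dictionary (five terms; D_s, D_n fixed from the first layer)
\min_{E_s,\,E_n}\;
\bigl\| Y - [D_s, E_s, D_n, E_n]\,C \bigr\|_F^2
+ \lambda_1 \bigl\| S - [D_s, E_s]\,C_s \bigr\|_F^2
+ \lambda_2 \bigl\| N - [D_n, E_n]\,C_n \bigr\|_F^2
+ \lambda_3 \bigl\| D_s^{\top} E_s \bigr\|_F^2
+ \lambda_4 \bigl\| D_n^{\top} E_n \bigr\|_F^2
```

Here Y, S, and N denote the training noisy-speech, clean-speech, and noise frames; the exact norms and weighting are an interpretation of the claim, not a quotation from it.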
5. The single-channel speech enhancement method based on double-layer dictionary learning according to claim 1, wherein step S2 comprises:
Step S21: project the noisy test speech on the first-layer joint dictionary, and multiply the sparse coefficient matrix by the first-layer joint dictionary to obtain the enhanced speech signal based on first-layer joint dictionary learning;
Step S22: compare the energy of the speech enhanced by first-layer joint dictionary learning with a preset energy threshold; if the energy is below the threshold, cross-projection in the first-layer enhancement is severe, and step S23 is executed to project the noisy speech on the second-layer joint dictionary; if the energy is above the threshold, the speech signal enhanced by first-layer dictionary learning is output as the final enhanced speech;
Step S23: project the noisy speech on the second-layer joint dictionary, and multiply the sparse coefficient matrix by the second-layer joint dictionary to obtain the enhanced speech signal based on second-layer joint dictionary learning, which is output as the final enhanced speech.
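The two-stage decision of steps S21-S23 can be sketched per frame as follows. This is an illustrative sketch with random stand-in dictionaries and a hypothetical threshold value: OMP replaces the BP projection, and in practice the dictionaries would come from the training of claim 1.

```python
import numpy as np

def omp(D, y, k):
    """Orthogonal Matching Pursuit: k-sparse code of y over dictionary D."""
    residual, idx = y.copy(), []
    for _ in range(k):
        idx.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, idx], y, rcond=None)
        residual = y - D[:, idx] @ coef
    x = np.zeros(D.shape[1])
    x[idx] = coef
    return x

def enhance(y, Ds, Dn, Es, En, threshold, k=4):
    """Steps S21-S23: project on layer 1; fall back to layer 2 when the
    enhanced-speech energy is below the threshold (severe cross-projection)."""
    D1 = np.hstack([Ds, Dn])                 # first-layer joint dictionary
    x1 = omp(D1, y, k)
    s1 = Ds @ x1[:Ds.shape[1]]               # speech part of the layer-1 result
    if np.sum(s1 ** 2) >= threshold:
        return s1                            # step S22: layer 1 suffices
    D2 = np.hstack([Ds, Es, Dn, En])         # second-layer joint dictionary
    x2 = omp(D2, y, k)
    ns = Ds.shape[1] + Es.shape[1]
    return D2[:, :ns] @ x2[:ns]              # step S23: speech part over [Ds, Es]

rng = np.random.default_rng(2)
Ds, Dn, Es, En = (rng.standard_normal((16, 20)) for _ in range(4))
y = rng.standard_normal(16)                  # stand-in noisy frame
print(enhance(y, Ds, Dn, Es, En, threshold=1e6).shape)
```

Setting the threshold very high forces the layer-2 path here; a real threshold would be tuned on training data as the claims describe.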
6. The single-channel speech enhancement method based on double-layer dictionary learning according to claim 1, further comprising step S3: evaluating the performance of the proposed single-channel speech enhancement method based on double-layer dictionary learning.
CN201911021192.3A 2019-10-25 2019-10-25 Single-channel voice enhancement method based on double-layer dictionary learning Active CN111383652B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911021192.3A CN111383652B (en) 2019-10-25 2019-10-25 Single-channel voice enhancement method based on double-layer dictionary learning


Publications (2)

Publication Number Publication Date
CN111383652A CN111383652A (en) 2020-07-07
CN111383652B true CN111383652B (en) 2023-09-12

Family

ID=71218489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911021192.3A Active CN111383652B (en) 2019-10-25 2019-10-25 Single-channel voice enhancement method based on double-layer dictionary learning

Country Status (1)

Country Link
CN (1) CN111383652B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102081928A (en) * 2010-11-24 2011-06-01 南京邮电大学 Method for separating single-channel mixed voice based on compressed sensing and K-SVD
CN106663446A (en) * 2014-07-02 2017-05-10 微软技术许可有限责任公司 User environment aware acoustic noise reduction
CN109256144A (en) * 2018-11-20 2019-01-22 中国科学技术大学 Sound enhancement method based on integrated study and noise perception training
CN110189761A (en) * 2019-05-21 2019-08-30 哈尔滨工程大学 A kind of single channel speech dereverberation method based on greedy depth dictionary learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8165215B2 (en) * 2005-04-04 2012-04-24 Technion Research And Development Foundation Ltd. System and method for designing of dictionaries for sparse representation
US10013975B2 (en) * 2014-02-27 2018-07-03 Qualcomm Incorporated Systems and methods for speaker dictionary based speech modeling




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant