CN114220449A - Voice signal noise reduction processing method and device and computer readable medium - Google Patents
- Publication number
- CN114220449A (application number CN202111607290.2A)
- Authority
- CN
- China
- Prior art keywords
- neural network
- pass filter
- mel
- band
- computational model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Human Computer Interaction (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Quality & Reliability (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Circuit For Audible Band Transducer (AREA)
- Soundproofing, Sound Blocking, And Sound Damping (AREA)
Abstract
The invention provides a speech signal noise reduction processing method, an apparatus, and a computer readable medium. The method comprises the following steps: acquiring a first neural network computational model; generating a mel band-pass filter and obtaining the inverse transform of the mel band-pass filter based on it; using the mel band-pass filter as the input layer of the first neural network computational model and the inverse transform of the mel band-pass filter as its output layer, thereby forming a second neural network computational model; preprocessing the speech signal to obtain a first input signal; inputting the first input signal into the second neural network computational model to obtain a second output signal; and obtaining a noise-reduced speech signal based on the first input signal and the second output signal. The invention achieves effective noise reduction of speech signals.
Description
Technical Field
The present invention relates to the field of information technology, and in particular, to a method and an apparatus for noise reduction processing of a speech signal, and a computer readable medium.
Background
Speech noise reduction is widely applied in scenarios such as video recording, mobile phone calls, outdoor live streaming, Bluetooth headset calls, and in-vehicle calls. The cochlea of the human ear essentially acts as a filter bank whose filtering operates on a logarithmic frequency axis: the scale is approximately linear below a certain frequency and logarithmic above it, which makes the human ear more sensitive to low-frequency signals. Based on this characteristic, a speech signal can be processed so as to achieve better noise reduction.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method, an apparatus, and a computer readable medium for noise reduction processing of a speech signal, which achieve more effective noise reduction while preserving the clarity of the useful components of the speech signal.
In order to solve the above technical problem, the present invention provides a speech signal noise reduction processing method, which comprises the following steps:
acquiring a first neural network computational model; generating a mel band-pass filter, and obtaining the inverse transform of the mel band-pass filter based on it; using the mel band-pass filter as the input layer of the first neural network computational model and the inverse transform of the mel band-pass filter as its output layer, thereby forming a second neural network computational model, wherein the second neural network computational model is trained using the same data set and training method as the first neural network computational model; preprocessing the speech signal to obtain a first input signal; inputting the first input signal into the second neural network computational model loaded with the trained network weight parameters to obtain a second output signal; and obtaining a noise-reduced speech signal based on the first input signal and the second output signal.
In an embodiment of the present invention, forming a second neural network computational model by using the mel-band-pass filter as an input layer of the first neural network computational model and an inverse transform of the mel-band-pass filter as an output layer of the first neural network computational model includes:
If the first neural network computational model performs neither a down-sampling nor an up-sampling operation on the frequency dimension of the speech signal, the mel band-pass filter is connected before the input end of the first neural network computational model as a new input layer, the inverse transform of the mel band-pass filter is connected after the output end of the first neural network computational model as a new output layer, and the resulting new neural network serves as the second neural network computational model.
In an embodiment of the present invention, forming a second neural network computational model by using the mel-band-pass filter as an input layer of the first neural network computational model and an inverse transform of the mel-band-pass filter as an output layer of the first neural network computational model includes:
If the first neural network computational model performs down-sampling and up-sampling operations on the frequency dimension of the speech signal, but the sampling operations do not reach the limiting condition in which only one frequency point remains in the frequency dimension, the mel band-pass filter is connected before the input end of the first neural network computational model as a new input layer, the inverse transform of the mel band-pass filter is connected after the output end of the first neural network computational model as a new output layer, and the resulting new neural network serves as the second neural network computational model.
In an embodiment of the present invention, forming a second neural network computational model by using the mel-band-pass filter as an input layer of the first neural network computational model and an inverse transform of the mel-band-pass filter as an output layer of the first neural network computational model includes:
If the first neural network computational model performs down-sampling and up-sampling operations on the frequency dimension of the speech signal and the sampling operations reach the limiting condition in which only one frequency point remains in the frequency dimension, the mel band-pass filter is connected before the input end of the first neural network computational model as a new input layer and the network layer with the down-sampling function is removed; the inverse transform of the mel band-pass filter is connected after the output end of the first neural network computational model as a new output layer and the network layer with the up-sampling function is removed; the resulting new neural network serves as the second neural network computational model.
In an embodiment of the present invention, forming a second neural network computational model by using the mel-band-pass filter as an input layer of the first neural network computational model and an inverse transform of the mel-band-pass filter as an output layer of the first neural network computational model includes:
The original input layer of the first neural network computational model is replaced with the mel band-pass filter as a new input layer, the original output layer of the first neural network computational model is replaced with the inverse transform of the mel band-pass filter as a new output layer, and the resulting new neural network serves as the second neural network computational model.
In an embodiment of the invention, the inverse transformation of the mel-band-pass filter is obtained by solving an inverse matrix of a mel-band-pass filter coefficient matrix corresponding to the mel-band-pass filter.
In an embodiment of the present invention, the inverse transform of the mel band-pass filter is obtained by using a fully-connected neural network to learn the inverse matrix of the mel band-pass filter coefficient matrix corresponding to the mel band-pass filter.
In an embodiment of the present invention, a post-processing convolutional layer is connected after the fully-connected neural network, and the parameters of the fully-connected neural network are adjusted to serve as the inverse transform of the mel band-pass filter;
wherein the feature dimension of the voice signal frequency dimension remains unchanged after the voice signal passes through the post-processing convolutional layer.
In an embodiment of the present invention, preprocessing the voice signal to obtain the first input signal includes: performing framing operation, band-pass filtering operation and fast Fourier transform on the voice signal to obtain short-time Fourier transform characteristic data of the voice signal; and extracting amplitude spectrum characteristics from the short-time Fourier transform characteristic data to form the first input signal.
In an embodiment of the present invention, obtaining the noise-reduced speech signal based on the first input signal and the second output signal includes:
multiplying the first input signal and the second output signal to obtain a pre-output signal;
and carrying out post-processing on the pre-output signal to obtain the noise-reduced voice signal.
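As an illustrative sketch of this multiply-and-post-process step (the function and variable names below are hypothetical, not from the patent), the second output signal can be treated as a spectral mask applied element-wise to the input magnitude spectrum:

```python
import numpy as np

# Hypothetical sketch: the second output signal acts as a per-frequency mask
# that is multiplied element-wise with the first input signal (the magnitude
# spectrum) to produce the pre-output signal.
def apply_mask(magnitude_spectrum: np.ndarray, mask: np.ndarray) -> np.ndarray:
    # Clipping to [0, 1] assumes the mask is a gain between "all noise" and
    # "all speech"; the patent does not specify this range.
    return magnitude_spectrum * np.clip(mask, 0.0, 1.0)

mag = np.array([1.0, 2.0, 4.0])     # toy first input signal (magnitude spectrum)
mask = np.array([0.5, 1.0, 0.25])   # toy second output signal (estimated mask)
pre_output = apply_mask(mag, mask)  # element-wise product
```

The post-processing that follows (e.g., restoring phase and inverting the transform) would then turn the pre-output back into a time-domain signal.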
The present invention also provides a speech signal noise reduction processing apparatus, comprising:
a memory for storing instructions executable by the processor; and
a processor for executing the instructions to implement the method of any preceding claim.
The invention also provides a computer readable medium having stored thereon computer program code which, when executed by a processor, implements a method as in any of the preceding.
Compared with the prior art, the invention has the following advantages: the technical solution of the present application removes stationary noise components from the speech signal more cleanly, better preserves the harmonics in the speech signal, and, while performing noise reduction, retains the clarity of the useful signal in the original speech.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the principle of the application. In the drawings:
fig. 1 is a flowchart of a speech signal noise reduction processing method according to an embodiment of the present application.
Fig. 2 is a schematic diagram of frequency spectrums of filters in a mel filter bank according to an embodiment of the present application.
FIG. 3 is a schematic diagram of a process for forming a second neural network computational model from a first neural network computational model according to an embodiment of the present application.
FIG. 4 is a schematic diagram of a process for forming a second neural network computational model from a first neural network computational model according to an embodiment of the present application.
Fig. 5 is a schematic process diagram of a speech signal noise reduction processing method according to an embodiment of the present application.
Fig. 6 is a schematic diagram of a speech signal and its corresponding frequency spectrum.
Fig. 7 is a schematic frequency spectrum diagram of a speech signal (a noise-reduced speech signal) after being processed by the scheme of the embodiment of the present application.
FIG. 8 is a schematic diagram of a frequency spectrum of a speech signal obtained after processing the speech signal by a method of speech noise reduction.
FIG. 9 is a process diagram of a method of speech noise reduction.
Fig. 10 is a schematic diagram of a speech signal noise reduction processing apparatus according to an embodiment of the present application.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description are merely examples or embodiments of the application, based on which the application can also be applied to other similar scenarios without inventive effort by a person skilled in the art. Unless otherwise apparent from the context or otherwise indicated, like reference numbers in the figures refer to the same structure or operation.
As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; these steps and elements do not form an exclusive list, and a method or apparatus may also include other steps or elements.
The relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present application unless specifically stated otherwise. In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
In the description of the present application, it is to be understood that the orientation or positional relationship indicated by the directional terms such as "front, rear, upper, lower, left, right", "lateral, vertical, horizontal" and "top, bottom", etc., are generally based on the orientation or positional relationship shown in the drawings, and are used for convenience of description and simplicity of description only, and in the case of not making a reverse description, these directional terms do not indicate and imply that the device or element being referred to must have a particular orientation or be constructed and operated in a particular orientation, and therefore, should not be considered as limiting the scope of the present application; the terms "inner and outer" refer to the inner and outer relative to the profile of the respective component itself.
Furthermore, it should be noted that the terms "first", "second", etc. are used only to conveniently distinguish the corresponding components or assemblies; unless otherwise stated, these terms have no special meaning and should not be construed as limiting the scope of protection of the present application. In addition, although the terms used in the present application are selected from publicly known and commonly used terms, some of them may have been chosen by the applicant at his or her discretion, and their detailed meanings are set forth in the relevant parts of the description herein. The present application should therefore be understood not only through the actual terms used but also through the meaning each term carries.
Flow charts are used herein to illustrate operations performed by systems according to embodiments of the present application. It should be understood that these operations are not necessarily performed in the exact order shown. Rather, various steps may be processed in reverse order or simultaneously, and other operations may be added to or removed from these processes.
Embodiments of the present application describe a speech signal noise reduction processing method, apparatus, and computer readable medium.
Fig. 1 is a flowchart of a speech signal noise reduction processing method according to an embodiment of the present application.
As shown in fig. 1, the speech signal noise reduction processing method of the present application includes: step 101, obtaining a first neural network computational model; step 102, generating a mel band-pass filter and obtaining its inverse transform based on the mel band-pass filter; step 103, using the mel band-pass filter as the input layer of the first neural network computational model and the inverse transform of the mel band-pass filter as its output layer to form a second neural network computational model, wherein the second neural network computational model is trained using the same data set and training method as the first neural network computational model; step 104, preprocessing the speech signal to obtain a first input signal; step 105, inputting the first input signal into the second neural network computational model loaded with the trained network weight parameters to obtain a second output signal; and step 106, obtaining a noise-reduced speech signal based on the first input signal and the second output signal.
Specifically, at step 101, a first neural network computational model is obtained. The first neural network may be a complex network structure composed of basic units such as fully connected layers, convolutional layers, deconvolution layers, pooling layers, LSTM (Long Short-Term Memory) units, GRU (Gated Recurrent Unit) units, attention mechanisms, normalization layers, and activation layers.
In step 102, a mel-band-pass filter is generated and an inverse transform of the mel-band-pass filter is obtained based on the mel-band-pass filter.
The mel frequency filter bank is a filter bank that acts like the cochlea: it converts the linear-frequency characteristics of a speech signal to the mel scale, thereby reducing the audibility of noise to human ears.
An example of generating a mel-band pass filter is listed below.
Set the sampling frequency fs to 8000 Hz, the lowest frequency fl of each filter's frequency range in the filter bank to 0, and the highest frequency fh to 4000 Hz; set the number of filters M to 10 and the FFT (fast Fourier transform) length NL to 256.
The mel scale is related to linear frequency by equation (1):

f_mel = 2595 · log10(1 + f / 700)    (1)

Substituting fh = 4000 for f in equation (1) yields the maximum mel frequency fmax.

The interval [fl, fmax] on the mel scale is divided into equal parts according to the number of filters M, giving the mel frequencies f_mel(m); each f_mel(m) is then converted back to the corresponding linear frequency f(m) via equation (1).

The center frequencies of the mel band-pass filters are thus uniformly spaced on the mel scale given by equation (1). Each filter in the filter bank has a triangular filtering characteristic centered at f(m), with the transfer function

H_m(k) = 0, for k < f(m−1);
H_m(k) = (k − f(m−1)) / (f(m) − f(m−1)), for f(m−1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) − k) / (f(m+1) − f(m)), for f(m) ≤ k ≤ f(m+1);
H_m(k) = 0, for k > f(m+1),

where k is the signal input frequency.
Through the above calculation, for example, a mel filter bank as shown in fig. 2 is obtained. The spectrum of each filter in the mel-filter bank is characterized in fig. 2, and may also be referred to as a triangular filter bank corresponding to the mel-frequency spectrum.
Fig. 2 is a schematic diagram of the frequency spectra of the filters in a mel filter bank according to an embodiment of the present application. In fig. 2, the horizontal axis represents frequency and the vertical axis represents the filter response (which may be a normalized response value). The response spectrum shown in fig. 2 contains M triangular spectra, one for each filter in the mel filter bank.
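The filter-bank construction above can be sketched in Python. This is a minimal illustration under common conventions (the mel formula of equation (1) and a bin mapping via floor((NL + 1) · f / fs)); the exact rounding used in the patent may differ.

```python
import numpy as np

def hz_to_mel(f):
    # Equation (1): linear frequency to mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse of equation (1)
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(fs=8000, fl=0.0, fh=4000.0, M=10, nfft=256):
    # M + 2 equally spaced points on the mel scale give M triangular filters
    mel_points = np.linspace(hz_to_mel(fl), hz_to_mel(fh), M + 2)
    hz_points = mel_to_hz(mel_points)
    # Map each edge frequency to the nearest FFT bin (a common convention)
    bins = np.floor((nfft + 1) * hz_points / fs).astype(int)
    fbank = np.zeros((M, nfft // 2 + 1))
    for m in range(1, M + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):          # rising edge of the triangle
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling edge of the triangle
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

fbank = mel_filter_bank()   # 10 filters over 129 FFT bins
```

Plotting the rows of `fbank` against frequency reproduces the M triangular spectra of fig. 2.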
Next, an inverse transform (abbreviated as Mel2Linear) of the Mel band pass filter is obtained based on the Mel band pass filter.
In some embodiments, the inverse transform of the mel band-pass filter is obtained by solving for the inverse of the mel band-pass filter coefficient matrix corresponding to the mel band-pass filter. For example, the mel band-pass filter coefficient matrix M_Mel and its corresponding inverse matrix M_Mel.T satisfy M_Mel.T × M_Mel = Diag, a diagonal matrix whose diagonal entries are 1. When the mel band-pass filter coefficient matrix is not algebraically invertible, the inverse matrix corresponding to it may instead be a pseudo-inverse matrix.
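A small numerical sketch of this embodiment: since the coefficient matrix is rectangular (M bands × K frequency bins), it is generally not invertible in the algebraic sense, and the Moore–Penrose pseudo-inverse stands in for the inverse. The toy matrix below is illustrative, not the patent's filter bank.

```python
import numpy as np

# Toy 2-band "mel" coefficient matrix over 4 frequency bins
M_mel = np.array([[0.5, 0.5, 0.0, 0.0],
                  [0.0, 0.0, 0.5, 0.5]])

# Pseudo-inverse used as the Mel2Linear transform (shape: bins x bands)
M_inv = np.linalg.pinv(M_mel)

# Round trip band -> bins -> band should recover the identity
round_trip = M_mel @ M_inv
```

For a well-conditioned filter bank, `round_trip` is (close to) the M × M identity, which is the diagonal-matrix condition stated above.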
In some embodiments, the inverse transformation of the mel-band pass filter is obtained by learning using a fully-connected neural network and obtaining an inverse matrix of a mel-band pass filter coefficient matrix corresponding to the mel-band pass filter.
In one embodiment, a post-processing convolutional layer is connected after the fully-connected neural network, and the parameters of the fully-connected neural network are adjusted to serve as the inverse transform of the mel band-pass filter;
after the speech signal passes through the post-processing convolutional layer, the feature dimension of the frequency dimension remains unchanged; for example, the post-processing convolutional layer uses a 1 × 1 kernel.
In step 103, using the mel-band pass filter as an input layer of the first neural network computational model, and using the inverse transformation of the mel-band pass filter as an output layer of the first neural network computational model to form a second neural network computational model; wherein the second neural network computational model is formed by training using the same data set and training method as the first neural network computational model.
In some embodiments, taking the mel-band-pass filter as an input layer of the first neural network computational model and taking an inverse transform of the mel-band-pass filter as an output layer of the first neural network computational model, forming a second neural network computational model comprises:
If the first neural network computational model performs neither a down-sampling nor an up-sampling operation on the frequency dimension of the speech signal, the mel band-pass filter is connected before the input end of the first neural network computational model as a new input layer, the inverse transform of the mel band-pass filter is connected after the output end of the first neural network computational model as a new output layer, and the resulting new neural network serves as the second neural network computational model.
FIG. 3 is a schematic diagram of a process for forming a second neural network computational model 302 from a first neural network computational model 301 according to an embodiment of the present application.
In fig. 3, Conv1, Conv2, Conv3, … denote, for example, the first, second, and third convolutional network layers (counted from the front, following the operation order of the neural network). Deconv1, Deconv2, Deconv3, … denote, for example, the first, second, and third deconvolution network layers (counted from the end, in the operation order of the neural network). The connecting lines between the convolutional layers and the deconvolution layers in the figure indicate the correspondence between the convolution and deconvolution computations in the network.
In some embodiments, taking the mel-band-pass filter as an input layer of the first neural network computational model and taking an inverse transform of the mel-band-pass filter as an output layer of the first neural network computational model, forming a second neural network computational model comprises:
If the first neural network computational model performs down-sampling and up-sampling operations on the frequency dimension of the speech signal, but the sampling operations do not reach the limiting condition in which only one frequency point remains in the frequency dimension, the mel band-pass filter is connected before the input end of the first neural network computational model as a new input layer, the inverse transform of the mel band-pass filter is connected after the output end of the first neural network computational model as a new output layer, and the resulting new neural network serves as the second neural network computational model.
In these embodiments, the process of forming the second neural network computational model from the first may also refer to fig. 3. The ellipses in fig. 3 represent the remaining convolutional and deconvolution layers, or other network layers such as LSTM layers, pooling layers, attention layers, and the like.
In some embodiments, taking the mel-band-pass filter as an input layer of the first neural network computational model and taking an inverse transform of the mel-band-pass filter as an output layer of the first neural network computational model, forming a second neural network computational model comprises:
If the first neural network computational model performs down-sampling and up-sampling operations on the frequency dimension of the speech signal and the sampling operations reach the limiting condition in which only one frequency point remains in the frequency dimension, the mel band-pass filter is connected before the input end of the first neural network computational model as a new input layer and the network layer with the down-sampling function is removed; the inverse transform of the mel band-pass filter is connected after the output end of the first neural network computational model as a new output layer and the network layer with the up-sampling function is removed; the resulting new neural network serves as the second neural network computational model.
In one embodiment, the removed network layer with the down-sampling function is, for example, a pooling layer or a convolutional layer with a convolution stride of 2 or more. The removed network layer with the up-sampling function is similar, for example, a deconvolution (transposed convolution) layer.
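As an illustrative sketch (not code from the patent), the effect of such a down-sampling layer on the frequency dimension can be shown with a simple average-pooling operation in numpy; repeated application is what eventually reduces the frequency dimension to the single remaining frequency point named in the sampling limit condition:

```python
import numpy as np

def pool_freq(spec: np.ndarray, factor: int = 2) -> np.ndarray:
    """Average-pool a (frames, freq_bins) spectrogram along the frequency
    axis, reducing freq_bins by `factor` (a typical down-sampling layer)."""
    frames, bins_ = spec.shape
    return spec.reshape(frames, bins_ // factor, factor).mean(axis=2)

spec = np.random.rand(10, 256)   # 10 frames, 256 frequency points (assumed sizes)
while spec.shape[1] > 1:         # repeated down-sampling reaches the
    spec = pool_freq(spec)       # "one remaining frequency point" limit
print(spec.shape)                # (10, 1)
```

An up-sampling layer (e.g., a deconvolution layer) would perform the reverse expansion of the frequency axis.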
In some embodiments, taking the Mel band-pass filter as an input layer of the first neural network computational model and taking the inverse transform of the Mel band-pass filter as an output layer of the first neural network computational model to form the second neural network computational model comprises:
The Mel band-pass filter replaces the original input layer of the first neural network computational model as the new input layer, and the inverse transform of the Mel band-pass filter replaces the original output layer of the first neural network computational model as the new output layer; the new neural network thus formed serves as the second neural network computational model.
FIG. 4 is a schematic diagram of a process for forming a second neural network computational model 402 from a first neural network computational model 401 according to an embodiment of the present application.
In fig. 4, the Mel band-pass filter (Mel) directly replaces the original input layer Conv1 of the first neural network computational model 401, and the inverse transform of the Mel band-pass filter (Mel2Linear) directly replaces the original output layer Deconv1 of the first neural network computational model 401, forming the second neural network computational model 402. When the Mel band-pass filter (Mel) and its inverse transform (Mel2Linear) replace the corresponding network layers, the dimensions of the input and output parameters are adjusted accordingly so that the network operates normally.
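This replacement can be sketched in numpy as follows (illustrative only: the filter count, FFT size, and sample rate are assumed values, not taken from the patent). Because the Mel band-pass filter coefficient matrix is non-square, the "inverse matrix" is realized here as the Moore-Penrose pseudo-inverse, one plausible reading of claim 6:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=40, n_fft_bins=257, sr=8000):
    """Triangular Mel band-pass filter coefficient matrix, (n_mels, n_fft_bins)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bin_pts = np.floor((n_fft_bins - 1) * mel_to_hz(mel_pts) / (sr / 2)).astype(int)
    fb = np.zeros((n_mels, n_fft_bins))
    for i in range(n_mels):
        l, c, r = bin_pts[i], bin_pts[i + 1], bin_pts[i + 2]
        for b in range(l, c):                 # rising edge of the triangle
            fb[i, b] = (b - l) / max(c - l, 1)
        for b in range(c, r):                 # falling edge of the triangle
            fb[i, b] = (r - b) / max(r - c, 1)
    return fb

mel = mel_filterbank()            # new input layer: linear spectrum -> Mel bands
mel_inv = np.linalg.pinv(mel)     # new output layer (Mel2Linear): Mel -> linear
x = np.random.rand(257)           # one frame of magnitude spectrum
y = mel_inv @ (mel @ x)           # frequency dimension restored to 257 bins
print(mel.shape, mel_inv.shape, y.shape)   # (40, 257) (257, 40) (257,)
```

The input/output dimension change (257 bins to 40 Mel bands and back) is exactly the parameter-dimension adjustment described above.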
In step 104, the speech signal is preprocessed to obtain a first input signal.
In some embodiments, preprocessing the speech signal to obtain the first input signal comprises: step 501, performing a framing operation, a band-pass filtering operation (also referred to as a windowing operation, using, for example, a Hanning window or a Hamming window), and a fast Fourier transform on the speech signal to obtain short-time Fourier transform feature data of the speech signal; and step 502, extracting magnitude spectrum features from the short-time Fourier transform feature data to form the first input signal.
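Steps 501 and 502 can be sketched in numpy as follows (the frame length, hop size, and signal length are assumed values for illustration, not specified by the patent):

```python
import numpy as np

def preprocess(signal, frame_len=512, hop=256):
    """Step 501: frame, window (Hamming), FFT -> STFT feature data.
    Step 502: take the magnitude spectrum as the first input signal."""
    win = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * win
                       for i in range(n_frames)])
    stft = np.fft.rfft(frames, axis=1)   # short-time Fourier transform
    magnitude = np.abs(stft)             # first input signal
    phase = np.angle(stft)               # kept for later reconstruction
    return magnitude, phase

sig = np.random.randn(8000)              # 1 s of audio at an assumed 8 kHz
mag, phase = preprocess(sig)
print(mag.shape)                         # (30, 257): n_frames x (frame_len//2 + 1)
```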
In step 105, the first input signal is input into the second neural network computational model loaded with trained network weight parameters to obtain a second output signal.
Fig. 5 is a schematic process diagram of a speech signal noise reduction processing method according to an embodiment of the present application. In fig. 5, 531 denotes a signal preprocessing module, and 532 denotes a first neural network computational model; as described above, the second neural network computational model is obtained by combining the first neural network computational model with the Mel band-pass filter and its inverse transform according to the different cases. The second neural network computational model is denoted, for example, by 542.
In fig. 5, the speech signal is indicated, for example, by 561, the first input signal by 534, and the second output signal by 535.
In step 106, a noise-reduced speech signal is obtained based on the first input signal and the second output signal.
In some embodiments, obtaining a noise-reduced speech signal based on the first input signal and the second output signal comprises: step 601, multiplying the first input signal by the second output signal to obtain a pre-output signal; and step 602, post-processing the pre-output signal to obtain the noise-reduced speech signal. The post-processing comprises, for example, an inverse short-time Fourier transform.
In fig. 5, 551 indicates, for example, a post-processing module. A noise-reduced speech signal 581 is derived based on said first input signal 534 and said second output signal 535.
Specifically, in step 601, the first input signal is multiplied by the second output signal to obtain a pre-output signal 563. In step 602, the pre-output signal 563 is post-processed to obtain the noise-reduced speech signal 581. The post-processing comprises, for example, an inverse short-time Fourier transform.
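Steps 601 and 602 can be sketched in numpy as follows (an illustrative mask-and-overlap-add reconstruction; the frame parameters are assumed, and reusing the noisy phase for the inverse transform is common practice rather than a detail stated by the patent):

```python
import numpy as np

def postprocess(magnitude, mask, phase, frame_len=512, hop=256):
    """Step 601: multiply the input magnitude (first input signal) by the
    network's mask output (second output signal) to get the pre-output.
    Step 602: inverse STFT with overlap-add to recover the waveform."""
    pre_output = magnitude * mask                                  # step 601
    frames = np.fft.irfft(pre_output * np.exp(1j * phase),
                          n=frame_len, axis=1)                     # step 602
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for i, f in enumerate(frames):                                 # overlap-add
        out[i * hop:i * hop + frame_len] += f
    return out

mag = np.abs(np.random.randn(30, 257))            # first input signal
phase = np.random.uniform(-np.pi, np.pi, (30, 257))
mask = np.clip(np.random.rand(30, 257), 0, 1)     # second output signal
denoised = postprocess(mag, mask, phase)
print(denoised.shape)                             # (7936,)
```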
The technical solution of the present application achieves cleaner elimination of stationary noise signals in a speech signal while better protecting the harmonics in the speech signal, so that the clarity of the effective signal in the original speech signal is preserved during noise reduction.
Compared with a neural network for processing speech signals formed by simply inserting a Mel band-pass filter into a neural network, the technical solution of the present application achieves cleaner elimination of the stationary noise signals in the speech signal together with better protection of the harmonics in the speech signal.
Fig. 6 is a schematic diagram of a speech signal and its corresponding frequency spectrum. Fig. 7 is a schematic frequency spectrum diagram of a speech signal processed by the technical solution of the present application (a noise-reduced speech signal). Fig. 8 is a schematic frequency spectrum diagram of a speech signal processed by a neural network formed by simply attaching a Mel band-pass filter to a neural network for processing speech signals.
In the graphs (a) in fig. 6 to 8, the horizontal axis represents time t, and the vertical axis represents signal intensity a. In the graphs (b) in fig. 6 to 8, the horizontal axis represents time t, and the vertical axis represents the frequency f of the spectrum; the frequency may also be quantized into frequency points, for example, quantizing the frequency range of 0 to 4000 Hz into 200 or 300 frequency points. In the graphs (b) in fig. 6 to 8, the shade of the spectrum represents the intensity value of the spectrum at the corresponding frequency value or frequency point, with darker shades indicating higher intensity.
Fig. 9 is a schematic diagram of a speech noise reduction method in which a Mel band-pass filter is simply attached to a neural network to form a neural network for processing speech signals, which then processes the speech signal to obtain a processed speech signal. In fig. 9, 903 denotes the input speech signal, 901 denotes a preprocessing module, 902 denotes a post-processing module, and 904 denotes the processed speech signal. 923 denotes the neural network.
In fig. 6 (a) to fig. 8 (a), the original speech signals are all indicated by 611.
Comparing the region 660 in fig. 6, the region 662 in fig. 8 and the region 661 in fig. 7, it can be seen that the technical solution of the present application can achieve cleaner elimination of stationary noise signals in speech signals. Comparing the region 672 in fig. 8 with the region 671 in fig. 7, it can be seen that the technical solution of the present application can achieve better protection effect on harmonics in a speech signal.
In some embodiments of the present application, a comparison table of evaluation indexes for one embodiment is obtained with specifically set model parameters.
TABLE 1
Evaluation index | PESQ | STOI | SISNR |
---|---|---|---|
The scheme shown in FIG. 9 | 2.13 | 0.85 | 13.24 |
Technical scheme of the application | 2.17 | 0.86 | 13.73 |
In table 1, PESQ (Perceptual Evaluation of Speech Quality) is an objective speech quality assessment with a full score of 4.5; higher is better. STOI (Short-Time Objective Intelligibility) has a full score of 1; higher is better, and even an improvement of 0.01 is difficult to achieve. SISNR (Scale-Invariant Signal-to-Noise Ratio) is likewise higher-is-better. It can be seen that the technical solution of the present application shows a clear technical improvement over the processed speech signal obtained by simply attaching the Mel band-pass filter to the fully convolutional neural network and processing the speech signal with the resulting network.
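Of the three metrics, PESQ and STOI require full reference implementations, but SI-SNR has a simple closed form; the following numpy sketch (illustrative, not code from the patent) shows how it is computed:

```python
import numpy as np

def si_snr(estimate: np.ndarray, target: np.ndarray) -> float:
    """Scale-invariant signal-to-noise ratio in dB; higher is better."""
    target = target - target.mean()            # remove DC offsets
    estimate = estimate - estimate.mean()
    # Project the estimate onto the target to get the "true signal" part.
    s_target = np.dot(estimate, target) / np.dot(target, target) * target
    e_noise = estimate - s_target              # everything else is noise
    return 10.0 * np.log10(np.dot(s_target, s_target) / np.dot(e_noise, e_noise))

clean = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)   # assumed test tone
noisy = clean + 0.1 * np.random.randn(8000)
print(si_snr(noisy, clean))   # roughly 17 dB (signal power 0.5, noise power 0.01)
```

Because the metric is invariant to rescaling of the estimate, it rewards structural fidelity rather than output gain, which is why it is a common objective measure for speech enhancement.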
The present application further provides a speech signal noise reduction processing apparatus, including: a memory for storing instructions executable by the processor; and a processor for executing the instructions to implement the method as previously described.
Fig. 10 is a schematic diagram of a speech signal noise reduction processing apparatus according to an embodiment of the present application. The speech signal noise reduction processing apparatus 410 may include an internal communication bus 411, a processor (Processor) 412, a read-only memory (ROM) 413, a random access memory (RAM) 414, and a communication port 415. The speech signal noise reduction processing apparatus 410 is connected to a network through the communication port 415 and can thereby be connected to other devices. The internal communication bus 411 enables data communication among the components of the speech signal noise reduction processing apparatus 410. The processor 412 may make determinations and issue prompts. In some embodiments, the processor 412 may be composed of one or more processors. The communication port 415 enables the sending and receiving of information and data over the network. The speech signal noise reduction processing apparatus 410 may also include various forms of program storage units and data storage units, such as the read-only memory (ROM) 413 and the random access memory (RAM) 414, capable of storing various data files used for computer processing and/or communication, as well as possible program instructions executed by the processor 412. The processor 412 executes these instructions to implement the main parts of the method. The results processed by the processor 412 may be transmitted to a user device through the communication port 415 and displayed on a user interface.
The speech signal noise reduction processing apparatus 410 may be implemented as a computer program, stored in a memory, and loaded into the processor 412 for execution, so as to implement the speech signal noise reduction processing method of the present application.
The present application also provides a computer readable medium having stored thereon computer program code which, when executed by a processor, implements a speech signal noise reduction processing method as described above.
Aspects of the present application may be embodied entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.), or in a combination of hardware and software. The above hardware or software may be referred to as a "data block," "module," "engine," "unit," "component," or "system." The processor may be one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, or a combination thereof. Furthermore, aspects of the present application may be represented as a computer product, including computer-readable program code, embodied in one or more computer-readable media. For example, computer-readable media may include, but are not limited to, magnetic storage devices (e.g., hard disks, floppy disks, magnetic strips), optical disks (e.g., compact disks (CDs), digital versatile disks (DVDs)), smart cards, and flash memory devices (e.g., cards, sticks, key drives).
The computer readable medium may comprise a propagated data signal with the computer program code embodied therein, for example, on a baseband or as part of a carrier wave. The propagated signal may take any of a variety of forms, including electromagnetic, optical, and the like, or any suitable combination. The computer readable medium can be any computer readable medium that can communicate, propagate, or transport the program for use by or in connection with an instruction execution system, apparatus, or device. Program code on a computer readable medium may be propagated over any suitable medium, including radio, electrical cable, fiber optic cable, radio frequency signals, or the like, or any combination of the preceding.
Similarly, it should be noted that in the preceding description of embodiments of the application, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the embodiments. This method of disclosure, however, is not intended to require more features than are expressly recited in the claims. Indeed, an embodiment may be characterized as having fewer than all of the features of a single embodiment disclosed above.
Although the present application has been described with reference to the present specific embodiments, those skilled in the art will recognize that the foregoing embodiments are merely illustrative of the present application, and that various changes and substitutions of equivalents may be made without departing from the spirit of the application; therefore, all changes and modifications to the above-described embodiments that come within the spirit of the application are intended to fall within the scope of the claims of the application.
Claims (12)
1. A speech signal noise reduction processing method comprises the following steps:
acquiring a first neural network calculation model;
generating a Mel band pass filter, and obtaining the inverse transformation of the Mel band pass filter based on the Mel band pass filter;
taking the Mel band-pass filter as an input layer of the first neural network computational model, taking the inverse transformation of the Mel band-pass filter as an output layer of the first neural network computational model, and forming a second neural network computational model; wherein the second neural network computational model is formed by training using the same data set and training method as the first neural network computational model;
preprocessing the voice signal to obtain a first input signal;
inputting the first input signal into the second neural network calculation model loaded with trained network weight parameters to obtain a second output signal;
and obtaining a noise-reduced voice signal based on the first input signal and the second output signal.
2. The method of noise reduction processing for a speech signal according to claim 1, wherein forming a second neural network computational model by using the mel-band pass filter as an input layer of the first neural network computational model and an inverse transform of the mel-band pass filter as an output layer of the first neural network computational model comprises:
if the first neural network computational model does not perform down-sampling and up-sampling operations on the frequency dimension of the voice signal, connecting the Mel band-pass filter before the input end of the first neural network computational model as a new input layer, connecting the inverse transform of the Mel band-pass filter after the output end of the first neural network computational model as a new output layer, and using the new neural network thus formed as the second neural network computational model.
3. The method of noise reduction processing for a speech signal according to claim 1, wherein forming a second neural network computational model by using the mel-band pass filter as an input layer of the first neural network computational model and an inverse transform of the mel-band pass filter as an output layer of the first neural network computational model comprises:
if the first neural network computational model performs down-sampling and up-sampling operations on the frequency dimension of the voice signal but the number of frequency points of the sampling operations does not reach the sampling limit condition in which the frequency dimension is reduced to a single remaining frequency point, connecting the Mel band-pass filter before the input end of the first neural network computational model as a new input layer, connecting the inverse transform of the Mel band-pass filter after the output end of the first neural network computational model as a new output layer, and using the new neural network thus formed as the second neural network computational model.
4. The method of noise reduction processing for a speech signal according to claim 1, wherein forming a second neural network computational model by using the mel-band pass filter as an input layer of the first neural network computational model and an inverse transform of the mel-band pass filter as an output layer of the first neural network computational model comprises:
if the first neural network computational model performs down-sampling and up-sampling operations on the frequency dimension of the voice signal and the number of frequency points of the sampling operations reaches the sampling limit condition in which the frequency dimension is reduced to a single remaining frequency point, connecting the Mel band-pass filter before the input end of the first neural network computational model as a new input layer and removing the network layer with the down-sampling function, connecting the inverse transform of the Mel band-pass filter after the output end of the first neural network computational model as a new output layer and removing the network layer with the up-sampling function, and using the new neural network thus formed as the second neural network computational model.
5. The method of noise reduction processing for a speech signal according to claim 1, wherein forming a second neural network computational model by using the mel-band pass filter as an input layer of the first neural network computational model and an inverse transform of the mel-band pass filter as an output layer of the first neural network computational model comprises:
replacing the original input layer of the first neural network computational model with the Mel band-pass filter as a new input layer, replacing the original output layer of the first neural network computational model with the inverse transform of the Mel band-pass filter as a new output layer, and using the new neural network thus formed as the second neural network computational model.
6. The method for noise reduction processing of a speech signal according to claim 1, wherein the inverse transformation of the mel band-pass filter is obtained by solving an inverse matrix of a mel band-pass filter coefficient matrix corresponding to the mel band-pass filter.
7. The method of noise reduction processing for a speech signal according to claim 1, wherein the inverse transformation of the Mel band-pass filter is obtained by using a fully-connected neural network to learn an inverse matrix of the Mel band-pass filter coefficient matrix corresponding to the Mel band-pass filter.
8. The method of claim 7, wherein a post-processing convolutional layer is connected after the fully-connected neural network, and the parameters of the fully-connected neural network are adjusted so that it realizes the inverse transformation of the Mel band-pass filter;
wherein the feature dimension of the voice signal frequency dimension remains unchanged after the voice signal passes through the post-processing convolutional layer.
9. The method of claim 1, wherein preprocessing the speech signal to obtain the first input signal comprises:
performing framing operation, band-pass filtering operation and fast Fourier transform on the voice signal to obtain short-time Fourier transform characteristic data of the voice signal;
and extracting amplitude spectrum characteristics from the short-time Fourier transform characteristic data to form the first input signal.
10. The method for noise reduction processing of a speech signal according to claim 1, wherein obtaining a noise-reduced speech signal based on the first input signal and the second output signal comprises:
multiplying the first input signal and the second output signal to obtain a pre-output signal;
and carrying out post-processing on the pre-output signal to obtain the noise-reduced voice signal.
11. A speech signal noise reduction processing apparatus comprising:
a memory for storing instructions executable by the processor; and
a processor for executing the instructions to implement the method of any one of claims 1-10.
12. A computer-readable medium having stored thereon computer program code which, when executed by a processor, implements the method of any of claims 1-10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111607290.2A CN114220449A (en) | 2021-12-24 | 2021-12-24 | Voice signal noise reduction processing method and device and computer readable medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111607290.2A CN114220449A (en) | 2021-12-24 | 2021-12-24 | Voice signal noise reduction processing method and device and computer readable medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114220449A true CN114220449A (en) | 2022-03-22 |
Family
ID=80705857
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111607290.2A Pending CN114220449A (en) | 2021-12-24 | 2021-12-24 | Voice signal noise reduction processing method and device and computer readable medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114220449A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114743016A (en) * | 2022-04-15 | 2022-07-12 | 成都新希望金融信息有限公司 | Certificate authenticity identification method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107845389B (en) | Speech enhancement method based on multi-resolution auditory cepstrum coefficient and deep convolutional neural network | |
CN109326299B (en) | Speech enhancement method, device and storage medium based on full convolution neural network | |
WO2019227586A1 (en) | Voice model training method, speaker recognition method, apparatus, device and medium | |
CN110956957B (en) | Training method and system of speech enhancement model | |
CN110111769B (en) | Electronic cochlea control method and device, readable storage medium and electronic cochlea | |
CN108766454A (en) | A kind of voice noise suppressing method and device | |
CN111223493A (en) | Voice signal noise reduction processing method, microphone and electronic equipment | |
US8359195B2 (en) | Method and apparatus for processing audio and speech signals | |
CN110120225A (en) | A kind of audio defeat system and method for the structure based on GRU network | |
CN110755108A (en) | Heart sound classification method, system and device based on intelligent stethoscope and readable storage medium | |
WO2022141868A1 (en) | Method and apparatus for extracting speech features, terminal, and storage medium | |
CN110942766A (en) | Audio event detection method, system, mobile terminal and storage medium | |
CN111986660A (en) | Single-channel speech enhancement method, system and storage medium for neural network sub-band modeling | |
CN116994564B (en) | Voice data processing method and processing device | |
CN110556125A (en) | Feature extraction method and device based on voice signal and computer storage medium | |
CN111968651A (en) | WT (WT) -based voiceprint recognition method and system | |
CN109065043A (en) | A kind of order word recognition method and computer storage medium | |
CN112382302A (en) | Baby cry identification method and terminal equipment | |
CN110970044B (en) | Speech enhancement method oriented to speech recognition | |
CN114220449A (en) | Voice signal noise reduction processing method and device and computer readable medium | |
CN111862978A (en) | Voice awakening method and system based on improved MFCC (Mel frequency cepstrum coefficient) | |
CN116052706B (en) | Low-complexity voice enhancement method based on neural network | |
CN110197657B (en) | Dynamic sound feature extraction method based on cosine similarity | |
CN111261192A (en) | Audio detection method based on LSTM network, electronic equipment and storage medium | |
CN113611321B (en) | Voice enhancement method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |