CN116312616A - Processing recovery method and control system for noisy speech signals - Google Patents

Processing recovery method and control system for noisy speech signals

Info

Publication number
CN116312616A
Authority
CN
China
Prior art keywords
learning network
noise suppression
noise
frequency
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211678470.4A
Other languages
Chinese (zh)
Inventor
李倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bestechnic Shanghai Co Ltd
Original Assignee
Bestechnic Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bestechnic Shanghai Co Ltd filed Critical Bestechnic Shanghai Co Ltd
Priority to CN202211678470.4A priority Critical patent/CN116312616A/en
Publication of CN116312616A publication Critical patent/CN116312616A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10K - SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
    • G10K11/00 - Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K11/16 - Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K11/175 - Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound
    • G10K11/1752 - Masking
    • G10K11/1754 - Speech masking
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/0017 - Lossless audio signal coding; Perfect reconstruction of coded audio signal by transmission of coding error
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/12 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being prediction coefficients
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The application relates to a processing recovery method and a control system for noisy speech signals. The method comprises: acquiring a noisy speech signal and performing an STFT (short-time Fourier transform) on it to obtain a spectrogram; determining time-frequency speech features based on the spectrogram and estimating the masking value of each frequency point with a noise suppression learning network; determining the noise-suppressed frequency-domain speech signal based on the masking values and the spectrogram, performing LPC processing, and predicting the linear part and the residual part of the noise-suppressed time-domain speech signal; performing an ISTFT on the frequency-domain speech signal to obtain the noise-suppressed time-domain speech signal; recovering an enhanced residual part with a recovery learning network based on the noise-suppressed time-domain speech signal, the linear part and the residual part; and summing the predicted linear part and the enhanced residual part to obtain the recovered speech signal. In this way, LPC technology can be effectively combined with adaptive learning networks on a small chip, and efficient, rapid noise reduction and recovery of noisy speech signals in variable noise environments can be achieved.

Description

Processing recovery method and control system for noisy speech signals
Technical Field
The present application relates to the field of wireless communications, and more particularly, to a processing recovery method and control system for noisy speech signals in wireless communications.
Background
With the development of the Internet of Things, people frequently use, in addition to mobile phones, various miniaturized portable smart devices such as smart glasses, wireless Bluetooth earphones and wireless Bluetooth speakers, and make voice calls against a wide range of changing noise backgrounds, such as subways, busy commercial districts, sports venues and outdoor sites. Unlike mobile phones, these miniaturized portable smart devices often have stringent cost and size requirements, and the chips they carry are correspondingly small, with limited memory space and computing power; processing on such devices is also known as "edge computing".
At present, although some voice-call noise reduction technologies exist, they usually suppress the intensity of frequency components with high noise energy in the frequency domain. Under heavy noise this often sacrifices speech clarity, so the speech quality after noise reduction is poor, the speech is inevitably damaged, and the user's listening experience suffers. In addition, these noise reduction techniques are limited by the chip configuration of miniaturized portable smart devices: the algorithms are generally either coarse or so slow that audible lag results, and they cannot meet users' requirements for high speech quality and real-time performance.
Disclosure of Invention
The present application is provided to address the above-mentioned deficiencies in the prior art. What is needed is a processing recovery method and control system for noisy speech signals that can effectively deploy adaptive learning networks on a small edge-computing chip in combination with LPC (linear predictive coding) technology, achieve efficient and rapid noise reduction of noisy speech signals in variable noise environments, and recover speech signals that are lossless, highly intelligible and real-time.
According to a first aspect of the present application, a processing recovery method for a noisy speech signal is provided. The method includes the following steps. A noisy speech signal to be processed is acquired. An STFT is performed on the noisy speech signal to obtain a spectrogram. Time-frequency speech features are determined based on the spectrogram. Based on the time-frequency speech features, the masking value of each frequency point is estimated with a noise suppression learning network and used as the noise suppression amount of that frequency point. The noise-suppressed frequency-domain speech signal is determined based on the masking values and the spectrogram. A power spectral density is calculated based on the frequency-domain speech signal. Based on the power spectral density, the linear part and the residual part of the noise-suppressed time-domain speech signal are predicted by performing LPC processing. An ISTFT is performed on the frequency-domain speech signal to obtain the noise-suppressed time-domain speech signal. An enhanced residual part is recovered with a recovery learning network based on the noise-suppressed time-domain speech signal, the linear part and the residual part. The predicted linear part and the enhanced residual part are summed to obtain a recovered speech signal whose speech intelligibility is above a predetermined threshold.
According to a second aspect of the present application, a control system for processing recovery of noisy speech signals is provided. The control system includes an interface, a processing unit and a memory. The interface is configured to acquire a noisy speech signal to be processed. The processing unit is configured to perform the processing recovery method for noisy speech signals according to the various embodiments of the present application, which includes the following steps. A noisy speech signal to be processed is acquired. An STFT is performed on the noisy speech signal to obtain a spectrogram. Time-frequency speech features are determined based on the spectrogram. Based on the time-frequency speech features, the masking value of each frequency point is estimated with a noise suppression learning network and used as the noise suppression amount of that frequency point. The noise-suppressed frequency-domain speech signal is determined based on the masking values and the spectrogram. A power spectral density is calculated based on the frequency-domain speech signal. Based on the power spectral density, the linear part and the residual part of the noise-suppressed time-domain speech signal are predicted by performing LPC processing. An ISTFT is performed on the frequency-domain speech signal to obtain the noise-suppressed time-domain speech signal. An enhanced residual part is recovered with a recovery learning network based on the noise-suppressed time-domain speech signal, the linear part and the residual part. The predicted linear part and the enhanced residual part are summed to obtain a recovered speech signal whose speech intelligibility is above a predetermined threshold. The memory is configured to store the trained noise suppression learning network and recovery learning network.
According to the processing recovery method and control system for noisy speech signals provided by the embodiments of the present application, an adaptive learning network can be effectively deployed on a small edge-computing chip in combination with LPC (linear predictive coding) technology, so that efficient and rapid noise reduction of noisy speech signals in variable noise environments is achieved and speech signals that are lossless, highly intelligible and real-time can be recovered.
Drawings
Features, advantages, and technical and industrial significance of exemplary embodiments of the present invention will be described below with reference to the accompanying drawings, in which like reference numerals denote like elements, and in which:
FIG. 1 shows a flow chart of a process restoration method for noisy speech signals according to an embodiment of the present application;
FIG. 2 shows a block diagram of a control system for processing recovery of noisy speech signals according to an embodiment of the present application; and
FIG. 3 shows a flowchart of an example of a processing recovery method for a noisy speech signal according to an embodiment of the present application.
Detailed Description
In order to better understand the technical solutions of the present application, the following detailed description of the present application is provided with reference to the accompanying drawings and the specific embodiments. Embodiments of the present application will now be described in further detail with reference to the accompanying drawings and specific examples, but are not intended to be limiting of the present application.
The terms "first," "second," and the like, as used herein, do not denote any order, quantity or importance, but are used to distinguish one element from another. The word "comprising" or "comprises" means that the elements preceding the word encompass the elements recited after the word, and does not exclude the possibility of also encompassing other elements. The order of the steps shown by arrows in the drawings of the present application is merely an example and does not mean that the steps must be performed in that order; unless specifically indicated, the steps may be combined or executed in a different order than indicated by the arrows, as long as the logical relationship between the steps is not affected. A group of fully connected layers in the present application may be one layer or several layers, and is not particularly limited herein. The technical term "residual" in this application means the remaining part of the speech signal after the predicted (linear) part has been removed.
Fig. 1 shows a flowchart of a processing recovery method for noisy speech signals according to an embodiment of the present application. The method is particularly suitable for the various small chips that perform edge computing, which are generally small in size and limited in memory space and computing power. Referring to fig. 2, these chips (also referred to as control systems) are commonly used in various miniaturized portable smart devices, such as smart glasses, wireless Bluetooth earphones, wireless Bluetooth speakers, multifunctional smart charging boxes (e.g., Bluetooth earphone charging boxes), smart watches (e.g., without limitation, children's multifunctional positioning and monitoring watches), and the like. Note that saying the processing recovery method of the embodiments of the present application is particularly suitable for small edge-computing chips does not mean that it can only be performed on such chips; it can certainly also be performed on chips with larger processing capacity and storage space, such as those of mobile phones, or even on processors such as CPUs. Rather, the processing steps are particularly friendly to small edge-computing chips, so that the limitations of computing power and storage space can be overcome and efficient, rapid noise reduction of noisy speech signals in variable noise environments can be ensured.
As shown in fig. 1, the process recovery method starts with step 101, acquiring a noisy speech signal to be processed. The noisy speech signal to be processed may be collected by a microphone and analog-to-digital converted.
In step 102, an STFT is performed on the noisy speech signal to obtain a spectrogram. The STFT, i.e. the short-time Fourier transform, first frames the noisy speech signal. For example, for a sampling rate of 16 kHz, the time length of each frame is 8 ms and the frame interval is 8 ms. The noisy speech data of each frame is then windowed and Fourier transformed (FFT), and the transform results of the frames are concatenated to obtain the spectrogram. For example, each segment of data fed to the Fourier transform has a total length of 16 ms, i.e. 256 sampling points. The components of a noisy speech signal in the time and frequency domains usually vary rather than remain stationary; by performing the STFT, a spectrogram can be obtained that reflects which frequency states occur at which times, that is, the spectrogram shows the joint distribution of the noisy speech signal over the time and frequency domains.
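For illustration only (not part of the original disclosure), a minimal STFT sketch under the above example settings is given below. The function names and the use of a Hann window are assumptions; note also that a 256-point one-sided FFT yields 129 bins, whereas the example of fig. 3 works with 127 mask values, so the exact bin bookkeeping is likewise an assumption.

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    """Minimal STFT sketch: 16 ms (256-sample) windows, 8 ms (128-sample) hop at 16 kHz."""
    window = np.hanning(frame_len)                      # analysis window per frame
    n_frames = 1 + (len(x) - frame_len) // hop
    spec = np.empty((n_frames, frame_len // 2 + 1), dtype=complex)
    for m in range(n_frames):
        frame = x[m * hop : m * hop + frame_len] * window
        spec[m] = np.fft.rfft(frame)                    # one-sided spectrum of this frame
    return spec                                         # time-frequency spectrogram X(m, w)

# Example: 1 s of white noise as a stand-in for a noisy speech signal
x = np.random.randn(16000)
X = stft(x)
print(X.shape)  # (n_frames, 129)
```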
In step 103, time-frequency speech features are determined based on the spectrogram. The time-frequency speech features are extracted on top of the spectrogram, which reflects the joint time-frequency distribution of the noisy speech signal; by taking the human auditory mechanism into account, feature parameters that better match the hearing characteristics of the human ear can be extracted, so that recognition performance remains good even when the signal-to-noise ratio drops. In some embodiments, the time-frequency speech features that match the hearing characteristics of the human ear include at least one of MFCC features, BFCC (Bark-frequency cepstral coefficient) features, Fbank features (Mel filter bank based features) and the like, particularly MFCC features. The extraction of MFCCs, i.e. Mel cepstral coefficients, includes pre-emphasis filtering, framing, windowing, FFT, Mel filter bank filtering, logarithm, discrete cosine transform (DCT) and dynamic (differential) feature extraction. The Mel filter bank simulates the hearing mechanism of the cilia in the human cochlea: its low-frequency resolution is high, its high-frequency resolution is low, and its mapping to linear frequency is approximately logarithmic, which is not described in detail here. In step 102, the FFT results of the segments of noisy speech data have already been obtained. In step 103, the magnitude can therefore be taken and filtered by the Mel filter bank, and the filtering result subjected to a natural logarithm. The DCT can then be performed to obtain the MFCC parameters and MFCC differential parameters, which together form the MFCC features.
In step 104, based on the time-frequency speech features, the masking value of each frequency point is estimated with the noise suppression learning network and used as the noise suppression amount of that frequency point. The masking value, also called Mask, represents the amount of noise suppression for each frequency bin at each time. In step 105, the noise-suppressed frequency-domain speech signal is determined based on the masking values and the spectrogram. In some embodiments, the Mask can be used as a noise suppression coefficient for the frequency-domain component of the spectrogram at each time; by multiplying the estimated Mask with the frequency-domain component at each time, a frequency-domain speech signal whose noise has been suppressed across the whole frequency domain at each time is obtained. The noise suppression learning network may be implemented with various RNN neural networks, such as, but not limited to, a GRU neural network or an LSTM neural network. With these RNN neural networks, a more accurate Mask can be estimated by taking into account the interaction between adjacent points of the time-frequency speech features in the time and frequency domains. With an LSTM neural network in particular, the influence of points that are far apart in the time and frequency domains can be forgotten while the interaction between adjacent points is still considered, which accommodates the randomness that noise introduces into noisy speech, makes the Mask estimate more accurate and makes the estimation converge faster. The inventors have found that the LSTM neural network can be kept to 2-4 layers; an LSTM of this scale can be stored in the memory of a small chip, and the processing unit of the small chip can also handle the workload of the Mask estimation computation.
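In implementation terms, steps 104 and 105 reduce to an element-wise multiplication of the estimated Mask with the spectrogram. A minimal sketch follows (illustrative only; all names are assumptions, and the toy spectrogram stands in for the STFT output of step 102):

```python
import numpy as np

def apply_mask(spectrogram, mask):
    """Element-wise noise suppression: Y(m, w) = Mask(m, w) * X(m, w).

    spectrogram: complex STFT of the noisy speech, shape (n_frames, n_bins)
    mask:        real-valued suppression amounts in [0, 1], same shape
    """
    mask = np.clip(mask, 0.0, 1.0)        # masks act as per-bin suppression coefficients
    return mask * spectrogram             # noise-suppressed frequency-domain signal Y

# Toy usage
X = np.fft.rfft(np.random.randn(10, 256), axis=1)   # stand-in spectrogram, shape (10, 129)
mask = np.random.rand(*X.shape)                      # stand-in for the network's Mask estimate
Y = apply_mask(X, mask)
```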
In step 106, a power spectral density is calculated based on the frequency domain speech signal. For example, the frequency domain voice signal includes signal amplitude and phase information corresponding to each frequency, and the power spectral density can be obtained by squaring the amplitude.
In step 107, the linear and residual portions of the denoised time domain speech signal are predicted by performing an LPC process based on the power spectral density. Performing an LPC (linear predictive coding) process based on the power spectral density is a conventional noise reduction technique that predicts the linear part of speech to obtain a linear part and a residual part, which are not described in detail herein.
At step 108, an ISTFT transform (i.e., an inverse STFT transform) is performed on the frequency domain speech signal to obtain a denoised time domain speech signal.
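A minimal inverse-STFT sketch using overlap-add is given below for illustration; the synthesis window and the window-squared normalization are assumptions consistent with the stft sketch above, not a description of the original implementation.

```python
import numpy as np

def istft(spec, frame_len=256, hop=128):
    """Minimal ISTFT sketch: per-frame inverse FFT, windowing and overlap-add."""
    window = np.hanning(frame_len)
    n_frames = spec.shape[0]
    out = np.zeros(hop * (n_frames - 1) + frame_len)
    norm = np.zeros_like(out)
    for m in range(n_frames):
        frame = np.fft.irfft(spec[m], n=frame_len) * window
        out[m * hop : m * hop + frame_len] += frame
        norm[m * hop : m * hop + frame_len] += window ** 2
    return out / np.maximum(norm, 1e-8)    # normalize by the summed squared window

# Toy usage: Y stands in for the noise-suppressed spectrogram of step 105
Y = np.fft.rfft(np.random.randn(10, 256), axis=1)
y = istft(Y)                               # noise-suppressed time-domain speech signal y(n)
```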
In step 109, an enhanced residual part is recovered with the recovery learning network based on the noise-suppressed time-domain speech signal, the linear part and the residual part. The recovery learning network may be implemented with various RNN neural networks, such as, but not limited to, a GRU neural network or an LSTM neural network. Generally, on small chips with limited computing power (e.g. a single core) and memory space, a 2-4 layer GRU neural network may be used to save computing power and memory, preferably so as to ensure that the noise suppression learning network in step 104 can use an LSTM neural network of sufficient size. The inventors have found that the GRU neural network can be kept to 2-4 layers; a GRU of this scale can be stored in the memory of a small chip, and the processing unit of the small chip is also fully capable of performing the recovery and enhancement computation of the residual part. Further, on a single-core chip, a 2-4 layer GRU neural network and a 2-4 layer LSTM neural network of this scale can work together in a streaming fashion, performing the noise suppression processing and the recovery and enhancement of the residual part in turn.
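The original does not publish network code; the following PyTorch sketch is an assumption-laden illustration of a small 2-layer GRU recovery network that consumes per-frame y(n), z(n) and e(n) and outputs the enhanced residual L(n). The frame layout, hidden size and output head are all assumptions.

```python
import torch
import torch.nn as nn

class ResidualRecoveryNet(nn.Module):
    """Sketch of a small recovery learning network: 2-layer GRU over per-frame features."""
    def __init__(self, frame_len=128, hidden=64):
        super().__init__()
        # Input per frame: concatenation of y(n), z(n) and e(n) samples (assumed layout)
        self.gru = nn.GRU(input_size=3 * frame_len, hidden_size=hidden,
                          num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, frame_len)   # enhanced residual L(n) for the frame

    def forward(self, y, z, e):
        # y, z, e: tensors of shape (batch, n_frames, frame_len)
        feats = torch.cat([y, z, e], dim=-1)
        h, _ = self.gru(feats)
        return self.out(h)

# Toy usage
net = ResidualRecoveryNet()
y = torch.randn(1, 10, 128); z = torch.randn(1, 10, 128); e = torch.randn(1, 10, 128)
L = net(y, z, e)            # enhanced residual part
s = z + L                   # recovered speech signal, per frame
```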
In step 110, the predicted linear part and the enhanced residual part are summed to obtain a recovered speech signal whose speech intelligibility is above a predetermined threshold. Through this processing procedure, the noise suppression neural network with learning capability removes as much of the noise influence as possible in both the time and frequency domains; the LPC technique then predicts the linear part of the clean speech, which is the main component of the clean speech signal, leaving only a relatively small nonlinear residual part to be recovered and enhanced by the recovery neural network. As a result, even a system on chip for edge computing (even a single-core design) can, while keeping the recovery neural network small, achieve efficient, rapid and lossless noise reduction of noisy speech signals in variable noise environments (especially heavy noise, poorly structured noise and the like) and recover speech signals that are lossless, highly intelligible and real-time. The processing recovery method of the present application thus improves the speech intelligibility of noisy speech signals in variable noise environments (especially heavy noise, multi-source complex noise and the like).
An example of the processing recovery method for noisy speech signals is described in detail below with reference to fig. 3. As shown in fig. 3, a noisy speech signal x(n) is acquired by a microphone, where n denotes the current sampling instant. First, an STFT is performed on the noisy speech signal (step 301), and then MFCC features are calculated (step 302). The MFCC features are used as feature values, noise suppression is performed on the noisy speech according to these feature values, and y(n) is generated, which further reduces the amount of data to be processed. For data at a 16 kHz sampling rate, the frame length is 8 ms, the frame interval is 8 ms, and the FFT data length is 16 ms (256 sampling points); the MFCC feature obtained after extraction has 32 dimensions, of which 22 are MFCC features, 6 are first-order MFCC differential features and 4 are second-order MFCC differential features.
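For illustration only, a sketch reproducing the 22 + 6 + 4 = 32-dimension feature layout with librosa follows; the actual filter bank, delta settings and the choice of which delta coefficients are kept are not disclosed in the original, so librosa's defaults and the slicing below are assumptions.

```python
import numpy as np
import librosa

def mfcc_32d(x, fs=16000):
    """Sketch of the 32-dim feature: 22 MFCCs + 6 first-order + 4 second-order deltas."""
    mfcc = librosa.feature.mfcc(y=x, sr=fs, n_mfcc=22,
                                n_fft=256, hop_length=128)   # 16 ms FFT, 8 ms hop
    d1 = librosa.feature.delta(mfcc, order=1)[:6]            # first 6 first-order deltas
    d2 = librosa.feature.delta(mfcc, order=2)[:4]            # first 4 second-order deltas
    return np.concatenate([mfcc, d1, d2], axis=0)            # shape (32, n_frames)

feats = mfcc_32d(np.random.randn(16000).astype(np.float32))
print(feats.shape)  # (32, n_frames)
```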
Then, based on the MFCC features, the noise suppression learning network is used to estimate the masking value of each frequency point. Specifically, the MFCC features may be fed to a first group of fully connected layers 303 for dimension reduction and then to the noise suppression learning network, i.e. the RNN/GRU/LSTM neural network 304. The first group of fully connected layers 303 reduces the 32-dimensional input to a 16-dimensional output, which further reduces the input dimension of the RNN/GRU/LSTM neural network 304 and hence the network size. After the MFCC features have been reduced in dimension, the noise suppression learning network, i.e. the RNN/GRU/LSTM neural network 304, performs a preliminary estimate, and the estimate is fed to a second group of fully connected layers 305 for dimension expansion, yielding the masking value of each frequency point, i.e. the masking value estimate (result) 306. For example, the output of the RNN/GRU/LSTM neural network 304 is fed to the second group of fully connected layers 305, which expands the 16-dimensional input to a 127-dimensional output, giving the Mask values (i.e. noise suppression amounts) of 127 frequency points, from which the noise-reduced frequency-domain speech signal Y(w) is obtained.
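A PyTorch sketch of this FC(32 to 16), LSTM, FC(16 to 127) pipeline is given below for illustration; only the widths 32, 16 and 127 come from the example above, while the hidden size, number of LSTM layers and the sigmoid output activation are assumptions.

```python
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    """Sketch of the noise suppression network: FC(32->16) -> LSTM -> FC(16->127)."""
    def __init__(self, n_feat=32, reduced=16, n_bins=127, layers=2):
        super().__init__()
        self.reduce = nn.Linear(n_feat, reduced)                 # first FC group (303)
        self.lstm = nn.LSTM(reduced, reduced, num_layers=layers,
                            batch_first=True)                    # learning network (304)
        self.expand = nn.Linear(reduced, n_bins)                 # second FC group (305)

    def forward(self, feats):
        # feats: (batch, n_frames, 32) MFCC features
        h = torch.relu(self.reduce(feats))
        h, _ = self.lstm(h)
        return torch.sigmoid(self.expand(h))                     # Mask in [0, 1] per bin (306)

masks = MaskEstimator()(torch.randn(1, 10, 32))
print(masks.shape)  # (1, 10, 127)
```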
The noise-reduced frequency-domain speech signal Y(w) is processed along two paths. One path is sent to the LPC calculation module, which performs LPC processing to compute the LPC coefficients and linearly predicts the speech signal: the linear part z(n) of the final speech-enhanced signal s(n) at the current instant is predicted from the previously enhanced signals s(n-1), s(n-2), ..., s(n-16) combined with the linear prediction coefficients. The other path is subjected to the ISTFT 307 to obtain the denoised time-domain signal y(n), which is fed to the recovery neural network 312 (e.g. a GRU/RNN) together with the linear part z(n) and the residual part e(n) (the nonlinear part) obtained from the LPC path; the GRU/RNN corrects the residual part of the speech, and the corrected, recovered and enhanced residual part L(n) is then added (313) to the linear part z(n) to obtain the final speech-enhanced signal s(n).
The noise suppression learning network used in step 304 and the recovery learning network used in step 312 may be trained on a server, and the parameters of the trained noise suppression learning network and recovery learning network may be transferred from the server to a control system based on a system on chip that performs the edge computation. In some embodiments, the control system may request updated parameters of the noise suppression learning network and the recovery learning network from the server via the communication interface. In some embodiments, the parameters of the trained noise suppression learning network and recovery learning network may also be stored in the memory before the control system leaves the factory and used from then on.
In some embodiments, when the server can support the computational load, the noise suppression learning network and the recovery learning network may be trained jointly to obtain coordinated and better optimized noise suppression and residual correction. In some embodiments, when the computational load on the server is limited, the noise suppression learning network may be trained first on the server, and the output of the trained noise suppression learning network then used as training data to train the recovery learning network, balancing training speed and training quality. For the noise suppression learning network, MFCC features extracted from a speech signal with preset noise can serve as the network input, and a Mask value with a good noise suppression effect, obtained through testing, can serve as the network output; together they form one training sample, and the noise suppression learning network can be trained on each sample of the training data set, for example with batch gradient descent or stochastic gradient descent. For training the recovery learning network, the noise-suppressed speech signal y(n) of the speech signal with preset noise and the linearly predicted z(n) and e(n) can be taken as the network input and the true residual value as the network output, together forming one training sample; the recovery learning network can then be trained on each sample of the training data set, again for example with batch gradient descent or stochastic gradient descent. When the noise suppression learning network is trained first and its output then used as training data for the recovery learning network, steps 301 to 307 can be executed in sequence with the trained noise suppression learning network to obtain y(n), and steps 308 to 311 (described in detail later) executed in sequence to obtain e(n) and z(n); the resulting y(n), the linearly predicted z(n) and e(n) are used as the network input and the true residual values as the network output to generate the training data, with which the recovery learning network is trained. A sketch of this two-stage schedule is shown after this paragraph.
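The sketch below is illustrative only and reuses the MaskEstimator and ResidualRecoveryNet sketches above; the loss functions, optimizers and data-loader fields are assumptions of this illustration, not part of the original disclosure.

```python
import torch
import torch.nn as nn

def train_two_stage(mask_net, recovery_net, mask_loader, recovery_loader, epochs=10):
    """Two-stage schedule: train the noise suppression network first, then freeze it
    and train the recovery network on samples built from its outputs."""
    mse = nn.MSELoss()

    # Stage 1: noise suppression learning network
    opt1 = torch.optim.Adam(mask_net.parameters(), lr=1e-3)
    for _ in range(epochs):
        for feats, mask_target in mask_loader:        # (MFCC features, reference Mask)
            opt1.zero_grad()
            loss = mse(mask_net(feats), mask_target)
            loss.backward()
            opt1.step()

    # Stage 2: recovery learning network, trained on y(n), z(n), e(n) derived from stage 1
    mask_net.eval()
    opt2 = torch.optim.Adam(recovery_net.parameters(), lr=1e-3)
    for _ in range(epochs):
        for y, z, e, residual_target in recovery_loader:
            opt2.zero_grad()
            loss = mse(recovery_net(y, z, e), residual_target)
            loss.backward()
            opt2.step()
```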
Turning to fig. 3, the LPC processing will be described in detail.
In step 308, a power spectral density calculation is performed on the noise-reduced frequency-domain signal Y(w).
At step 309, an IFFT transformation is performed on the power spectral density to obtain autocorrelation coefficients.
In step 310, the LPC linear prediction coefficients are calculated with the Levinson-Durbin algorithm from the autocorrelation coefficients, so that the linear part of the noise-suppressed time-domain speech signal, i.e. the speech linear part z(n), is predicted (step 311) and the residual part e(n) is obtained. The residual part e(n) is enhanced by the nonlinear prediction of the recovery neural network, and the final speech signal s(n) is obtained by summing the enhanced residual data L(n) and the linear prediction data z(n).
Specifically, the LPC linear prediction coefficients are solved according to the following equation (1):
$$\begin{bmatrix} R_n(0) & R_n(1) & \cdots & R_n(p-1) \\ R_n(1) & R_n(0) & \cdots & R_n(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ R_n(p-1) & R_n(p-2) & \cdots & R_n(0) \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix} = \begin{bmatrix} R_n(1) \\ R_n(2) \\ \vdots \\ R_n(p) \end{bmatrix} \qquad (1)$$

where $R_n(j)$, $j = 1, \ldots, p$, is the autocorrelation function of the speech signal, $p$ is the number of LPC linear prediction coefficients, and $a_1, \ldots, a_p$ are the LPC linear prediction coefficients. The above equation is also called the Yule-Walker equation; the matrix on the left is a Toeplitz matrix, symmetric about the main diagonal, with equal element values along each direction parallel to the main diagonal, and the equation can be solved with the Levinson-Durbin recursive algorithm.
The speech linear portion z (n) can be calculated according to equation (2):
$$z(n) = \sum_{j=1}^{p} a_j \, s(n-j) \qquad (2)$$
the residual portion e (n) may be calculated according to equation (3):
$$e(n) = y(n) - z(n) = y(n) - \sum_{j=1}^{p} a_j \, s(n-j) \qquad (3)$$
In this way, the calculated speech linear part z(n) and residual part e(n) can be fed to the recovery neural network to restore and enhance the nonlinear residual part.
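The following sketch of steps 308-311 is illustrative only; it assumes p = 16 as in the example above and obtains the autocorrelation from the PSD by an inverse FFT (the Wiener-Khinchin relation), then solves equation (1) with the Levinson-Durbin recursion and forms z(n) per equation (2). All names and the stand-in data are assumptions.

```python
import numpy as np

def levinson_durbin(r, p=16):
    """Solve the Yule-Walker system of equation (1) by the Levinson-Durbin recursion.

    r: autocorrelation values r[0..p]. Returns the prediction coefficients
    a_1..a_p used in equation (2), z(n) = sum_j a_j * s(n - j).
    """
    a = np.zeros(p + 1)                      # error-filter coefficients, a[0] fixed to 1
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                       # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]  # update previous coefficients
        a[i] = k
        err *= (1.0 - k * k)                 # updated prediction error
    return -a[1:]                            # prediction coefficients a_1..a_p

def lpc_predict(lpc, history):
    """Equation (2): z(n) = sum_{j=1..p} a_j * s(n - j); history = [s(n-1), ..., s(n-p)]."""
    return float(np.dot(lpc, history))

# Steps 308-310 on one frame: PSD of the noise-suppressed spectrum, then IFFT -> autocorrelation
psd = np.abs(np.fft.rfft(np.random.randn(256))) ** 2   # stand-in for |Y(w)|^2 of one frame
r = np.fft.irfft(psd)[:17]                              # autocorrelation r[0..16] (step 309)
a = levinson_durbin(r, p=16)                            # LPC coefficients (step 310)
history = np.random.randn(16)                           # stand-in for s(n-1)..s(n-16)
z_n = lpc_predict(a, history)                           # linear part z(n) (step 311)
```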
Returning to fig. 2, a control system 200 for processing recovery of noisy speech signals according to various embodiments of the present application is illustrated. The control system 200 may include an interface 201, a processing unit 202, and a memory 203. The interface 201 may be configured to obtain a noisy speech signal to be processed.
The processing unit 202 may be configured to perform the process restoration method for noisy speech signals according to various embodiments of the present application. Returning to fig. 1, the process restoration method may include the following steps.
In step 102, an STFT is performed on the noisy speech signal to obtain a spectrogram. The STFT, i.e. the short-time Fourier transform, first frames the noisy speech signal. For example, for a sampling rate of 16 kHz, the time length of each frame is 8 ms and the frame interval is 8 ms. The noisy speech data of each frame is then windowed and Fourier transformed (FFT), and the transform results of the frames are concatenated to obtain the spectrogram. For example, each segment of data fed to the Fourier transform has a total length of 16 ms, i.e. 256 sampling points. The components of a noisy speech signal in the time and frequency domains usually vary rather than remain stationary; by performing the STFT, a spectrogram can be obtained that reflects which frequency states occur at which times, that is, the spectrogram shows the joint distribution of the noisy speech signal over the time and frequency domains.
In step 103, time-frequency speech features are determined based on the spectrogram. The time-frequency speech features are extracted on top of the spectrogram, which reflects the joint time-frequency distribution of the noisy speech signal; by taking the human auditory mechanism into account, feature parameters that better match the hearing characteristics of the human ear can be extracted, so that recognition performance remains good even when the signal-to-noise ratio drops. In some embodiments, the time-frequency speech features that match the hearing characteristics of the human ear include at least one of MFCC features, BFCC (Bark-frequency cepstral coefficient) features, Fbank features (Mel filter bank based features) and the like, particularly MFCC features. The extraction of MFCCs, i.e. Mel cepstral coefficients, includes pre-emphasis filtering, framing, windowing, FFT, Mel filter bank filtering, logarithm, discrete cosine transform (DCT) and dynamic (differential) feature extraction. The Mel filter bank simulates the hearing mechanism of the cilia in the human cochlea: its low-frequency resolution is high, its high-frequency resolution is low, and its mapping to linear frequency is approximately logarithmic, which is not described in detail here. In step 102, the FFT results of the segments of noisy speech data have already been obtained. In step 103, the magnitude can therefore be taken and filtered by the Mel filter bank, and the filtering result subjected to a natural logarithm. The DCT can then be performed to obtain the MFCC parameters and MFCC differential parameters, which together form the MFCC features.
In step 104, based on the time-frequency speech features, the masking value of each frequency point is estimated with the noise suppression learning network and used as the noise suppression amount of that frequency point. The masking value, also called Mask, represents the amount of noise suppression for each frequency bin at each time. In step 105, the noise-suppressed frequency-domain speech signal is determined based on the masking values and the spectrogram. In some embodiments, the estimated Mask is multiplied by the frequency-domain component of the spectrogram at each time, yielding a frequency-domain speech signal whose noise has been suppressed across the whole frequency domain at each time. The noise suppression learning network may be implemented with various RNN neural networks, such as, but not limited to, a GRU neural network or an LSTM neural network. With these RNN neural networks, a more accurate Mask can be estimated by taking into account the interaction between adjacent points of the time-frequency speech features in the time and frequency domains. With an LSTM neural network in particular, the influence of points that are far apart in the time and frequency domains can be forgotten while the interaction between adjacent points is still considered, which accommodates the randomness that noise introduces into noisy speech, makes the Mask estimate more accurate and makes the estimation converge faster. The inventors have found that the LSTM neural network can be kept to 2-4 layers; an LSTM of this scale can be stored in the memory of a small chip, and the processing unit of the small chip can also handle the workload of the Mask estimation computation.
In step 106, a power spectral density is calculated based on the frequency domain speech signal. For example, the frequency domain voice signal includes signal amplitude and phase information corresponding to each frequency, and the power spectral density can be obtained by squaring the amplitude.
In step 107, the linear and residual portions of the denoised time domain speech signal are predicted by performing an LPC process based on the power spectral density. Performing an LPC (linear predictive coding) process based on the power spectral density is a conventional noise reduction technique that predicts the linear part of speech to obtain a linear part and a residual part, which are not described in detail herein.
At step 108, an ISTFT transform (i.e., an inverse STFT transform) is performed on the frequency domain speech signal to obtain a denoised time domain speech signal.
In step 109, an enhanced residual part is recovered with the recovery learning network based on the noise-suppressed time-domain speech signal, the linear part and the residual part. The recovery learning network may be implemented with various RNN neural networks, such as, but not limited to, a GRU neural network or an LSTM neural network. Generally, on small chips with limited computing power (e.g. a single core) and memory space, a 2-4 layer GRU neural network may be used to save computing power and memory, preferably so as to ensure that the noise suppression learning network in step 104 can use an LSTM neural network of sufficient size. The inventors have found that the GRU neural network can be kept to 2-4 layers; a GRU of this scale can be stored in the memory of a small chip, and the processing unit of the small chip is also fully capable of performing the recovery and enhancement computation of the residual part. Further, on a single-core chip, a 2-4 layer GRU neural network and a 2-4 layer LSTM neural network of this scale can work together in a streaming fashion, performing the noise suppression processing and the recovery and enhancement of the residual part in turn.
In step 110, the predicted linear part and the enhanced residual part are summed to obtain a recovered speech signal whose speech intelligibility is above a predetermined threshold. Through this processing procedure, the noise suppression neural network with learning capability removes as much of the noise influence as possible in both the time and frequency domains; the LPC technique then predicts the linear part of the clean speech, which is the main component of the clean speech signal, leaving only a relatively small nonlinear residual part to be recovered and enhanced by the recovery neural network. As a result, a control system for edge computing (even one realized as a single-core system on chip) can, while keeping the recovery neural network small, achieve efficient and rapid noise reduction of noisy speech signals in variable noise environments (especially heavy noise, poorly structured noise and the like) and recover speech signals that are lossless, highly intelligible and real-time. The processing recovery method of the present application thus improves the speech intelligibility of noisy speech signals in variable noise environments (especially heavy noise, multi-source complex noise and the like). Note that the examples of the respective steps of the processing recovery method according to the embodiments of the present application may be incorporated here and are not repeated.
The memory 203 may be configured to store a trained noise suppression learning network and a recovery learning network.
In some embodiments, when the processing unit 202 is single-core, it is configured to perform the processing of the noise suppression learning network and the processing of the recovery learning network in a streaming fashion; when the processing unit is dual-core, the processing of the noise suppression learning network and the processing of the recovery learning network are executed in parallel. Thus, when the computing power of a single-core device is limited, the processing of the noise suppression learning network can be performed first and the processing of the recovery learning network afterwards; the input of the recovery learning network depends on the output of the noise suppression learning network in any case, the structure of the noise suppression learning network adopted here can cope with limited computing power, and the dimension reduction of the fully connected layer further reduces the amount of data to be processed, so that even streamed, alternating execution does not affect the real-time behavior of the restoration, and the user still enjoys a listening experience with lossless, highly intelligible and real-time speech.
In some embodiments, various RISC (reduced instruction set computer) processor IPs purchased from ARM or similar vendors may be used as the processing unit 202 of the control system of the present application to perform the corresponding functions, with an embedded system (such as, but not limited to, an SOC) implementing the processing recovery of noisy speech signals. In particular, many modules (IPs) are available on the market, such as, but not limited to, memory (the memory 203 may be an on-IP memory or an externally expanded memory), various communication modules (e.g. a Bluetooth module), codecs, buffers and the like. Other devices such as an antenna, microphone or speaker may be external to the chip. The interface 201 may be used to connect an external microphone for collecting the noisy speech signal. A user can implement the various communication modules, codecs and the steps of the processing recovery method of the present application by constructing an ASIC (application-specific integrated circuit) based on purchased IP or independently developed modules, so as to reduce power consumption and cost. Note that "control system" in this application means a system that controls the target device in which it is located; it may typically be a chip, such as an ASIC implemented as an SOC, but is not limited thereto, and any hardware circuit, software-plus-processor configuration, or combined hardware-software firmware capable of performing the control may be used to implement the control system. For example, the processing performed by the processing unit 202 may be implemented as executable instructions executed by a RISC processor, formed as distinct hardware circuit modules, or formed as combined hardware-software firmware, which is not described further here.
Furthermore, although exemplary embodiments have been described herein, the scope thereof includes any and all embodiments having equivalent elements, modifications, omissions, combinations (e.g., of the various embodiments across), adaptations or alterations as pertains to the present application. Elements in the claims are to be construed broadly based on the language employed in the claims and are not limited to examples described in the present specification or during the practice of the present application, which examples are to be construed as non-exclusive. It is intended, therefore, that the specification and examples be considered as exemplary only, with a true scope and spirit being indicated by the following claims and their full scope of equivalents.
The above description is intended to be illustrative and not restrictive. For example, the above-described examples (or one or more aspects thereof) may be used in combination with each other, and other embodiments may be devised by those of ordinary skill in the art upon reading the above description. In addition, in the above detailed description, various features may be grouped together to streamline the application. This should not be interpreted as intending that an unclaimed feature is essential to any claim; rather, inventive subject matter may lie in less than all features of a particular disclosed embodiment. Thus, the following claims are hereby incorporated into the detailed description as examples or embodiments, with each claim standing on its own as a separate embodiment, and it is contemplated that these embodiments may be combined with one another in various combinations or permutations. The scope of the invention should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.
The above embodiments are only exemplary embodiments of the present application and are not intended to limit the present invention, the scope of which is defined by the claims. Various modifications and equivalent arrangements may be made to the present invention by those skilled in the art, which modifications and equivalents are also considered to be within the scope of the present invention.

Claims (10)

1. A process restoration method for a noisy speech signal, comprising:
acquiring a voice signal with noise to be processed;
performing STFT (short-time Fourier transform) on the voice signal with noise to obtain a spectrogram;
determining time-frequency voice characteristics based on the spectrogram;
estimating masking values of all frequency points by using a noise suppression learning network based on the time-frequency voice characteristics, and taking the masking values as noise suppression amounts of all frequency points;
determining a frequency domain voice signal after noise suppression based on the masking value and the spectrogram of each frequency point;
calculating a power spectral density based on the frequency domain speech signal;
predicting a linear portion and a residual portion of the denoised time domain speech signal by performing an LPC process based on the power spectral density;
ISTFT conversion is carried out on the frequency domain voice signals so as to obtain time domain voice signals after noise suppression;
recovering an enhanced residual error part by using a recovery learning network based on the noise-suppressed time domain speech signal, the linear part and the residual error part;
the predicted linear portion and the enhanced residual portion are summed to obtain a recovered speech signal having a speech intelligibility above a predetermined threshold.
2. The process restoration method according to claim 1, wherein the time-frequency voice characteristics include MFCC characteristics.
3. The process restoration method according to claim 1, wherein the noise suppression learning network is an LSTM neural network and the restoration learning network is a GRU neural network.
4. The process restoration method according to claim 1, wherein estimating the masking values of the respective frequency points using a noise suppression learning network based on the time-frequency speech characteristics specifically includes: the voice features are fed to a first group of full-connection layers for dimension reduction processing, and then fed to the noise suppression learning network.
5. The process restoration method according to claim 1 or 4, wherein estimating the masking values of the respective frequency points using a noise suppression learning network based on the time-frequency speech characteristics specifically further comprises: and estimating by utilizing the noise suppression learning network based on the time-frequency voice characteristics, and feeding the estimation result to a second group of full-connection layers for dimension-lifting processing to obtain masking values of all frequency points.
6. The process restoration method according to claim 1, further comprising: and firstly executing the training of the noise suppression learning network on a server, and then training the recovery learning network by using the output of the trained noise suppression learning network as training data.
7. The process restoration method according to claim 1, wherein performing the LPC process specifically includes:
performing an IFFT transformation on the power spectral density to obtain an autocorrelation coefficient;
based on the autocorrelation coefficient, an LPC linear prediction coefficient is calculated by utilizing a Levinson-Durbin algorithm, so that the linear part of the time domain voice signal after noise suppression is predicted, and a residual part is obtained.
8. A control system for processing recovery of noisy speech signals, comprising:
an interface configured to obtain a noisy speech signal to be processed;
a processing unit configured to:
performing a process restoration method for a noisy speech signal according to any of claims 1 to 7; and
a memory configured to: and storing the trained noise suppression learning network and the recovery learning network.
9. The control system of claim 8, wherein, in the case that the processing unit is a single core, it is configured to perform the processing of the noise suppression learning network and the processing of the recovery learning network in a streaming manner; and in the case that the processing unit is dual-core, the processing of the noise suppression learning network and the processing of the recovery learning network are executed in parallel.
10. The control system of claim 8, wherein the control system is implemented based on a system-on-chip that performs edge calculations.
CN202211678470.4A 2022-12-26 2022-12-26 Processing recovery method and control system for noisy speech signals Pending CN116312616A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211678470.4A CN116312616A (en) 2022-12-26 2022-12-26 Processing recovery method and control system for noisy speech signals

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211678470.4A CN116312616A (en) 2022-12-26 2022-12-26 Processing recovery method and control system for noisy speech signals

Publications (1)

Publication Number Publication Date
CN116312616A (en) 2023-06-23

Family

ID=86782312

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211678470.4A Pending CN116312616A (en) 2022-12-26 2022-12-26 Processing recovery method and control system for noisy speech signals

Country Status (1)

Country Link
CN (1) CN116312616A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117690421A (en) * 2024-02-02 2024-03-12 深圳市友杰智新科技有限公司 Speech recognition method, device, equipment and medium of noise reduction recognition combined network
CN117690421B (en) * 2024-02-02 2024-06-04 深圳市友杰智新科技有限公司 Speech recognition method, device, equipment and medium of noise reduction recognition combined network

Similar Documents

Publication Publication Date Title
CN111223493B (en) Voice signal noise reduction processing method, microphone and electronic equipment
CN109767783B (en) Voice enhancement method, device, equipment and storage medium
Qian et al. Speech Enhancement Using Bayesian Wavenet.
JP6480644B1 (en) Adaptive audio enhancement for multi-channel speech recognition
WO2019113130A1 (en) Voice activity detection systems and methods
US11902759B2 (en) Systems and methods for audio signal generation
CN108172231A (en) A kind of dereverberation method and system based on Kalman filtering
Wang et al. Recurrent deep stacking networks for supervised speech separation
CN113077806B (en) Audio processing method and device, model training method and device, medium and equipment
US20190378529A1 (en) Voice processing method, apparatus, device and storage medium
CN110556125A (en) Feature extraction method and device based on voice signal and computer storage medium
Ueda et al. Environment-dependent denoising autoencoder for distant-talking speech recognition
CN116312616A (en) Processing recovery method and control system for noisy speech signals
Girirajan et al. Real-Time Speech Enhancement Based on Convolutional Recurrent Neural Network.
US11915718B2 (en) Position detection method, apparatus, electronic device and computer readable storage medium
CN113782044A (en) Voice enhancement method and device
CN113096679A (en) Audio data processing method and device
CN117219102A (en) Low-complexity voice enhancement method based on auditory perception
CN111755010A (en) Signal processing method and device combining voice enhancement and keyword recognition
WO2020015546A1 (en) Far-field speech recognition method, speech recognition model training method, and server
CN110875037A (en) Voice data processing method and device and electronic equipment
WO2024139120A1 (en) Noisy voice signal processing recovery method and control system
Zhao et al. Time Domain Speech Enhancement using self-attention-based subspace projection
CN114566179A (en) Time delay controllable voice noise reduction method
Razani et al. A reduced complexity MFCC-based deep neural network approach for speech enhancement

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination