WO2022183806A1 - Neural network-based speech enhancement method, apparatus and electronic device - Google Patents

Neural network-based speech enhancement method, apparatus and electronic device

Info

Publication number
WO2022183806A1
Authority
WO
WIPO (PCT)
Prior art keywords
original
time
frequency
amplitude spectrum
speech signal
Prior art date
Application number
PCT/CN2021/137973
Other languages
English (en)
French (fr)
Inventor
陈泽华
吴俊仪
蔡玉玉
雪巍
杨帆
丁国宏
何晓冬
Original Assignee
北京沃东天骏信息技术有限公司
北京京东世纪贸易有限公司
Priority date
Filing date
Publication date
Application filed by 北京沃东天骏信息技术有限公司 and 北京京东世纪贸易有限公司
Publication of WO2022183806A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L21/0224: Processing in the time domain
    • G10L21/0232: Processing in the frequency domain
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques using neural networks

Definitions

  • the present disclosure relates to the field of speech signal processing, and in particular, to a neural network-based speech enhancement method, a speech enhancement apparatus, a computer-readable storage medium, and an electronic device.
  • speech recognition technology can be mainly applied to scenarios such as intelligent customer service, conference recording transcription, and intelligent hardware.
  • in noisy environments, however, speech recognition technology may fail to accurately identify the speaker's semantics, which in turn affects the overall accuracy of speech recognition.
  • a neural network-based speech enhancement method comprising:
  • Feature extraction is performed on the original amplitude spectrum by using a time-dimensional convolution kernel to obtain a time-domain smooth feature map;
  • Feature extraction is performed on the original amplitude spectrum by using a frequency-dimensional convolution kernel to obtain a smoothed feature map in the frequency domain;
  • An inverse time-frequency transform is performed on the enhanced amplitude spectrum to obtain an enhanced speech signal.
  • the feature extraction is performed on the original amplitude spectrum by using a time-dimension convolution kernel to obtain a time-domain smoothed feature map, including:
  • a convolution operation is performed on the weight matrix of the time-dimension convolution kernel and the original amplitude spectrum to obtain the time-domain smoothing feature map.
  • the feature extraction is performed on the original amplitude spectrum by using a frequency-dimensional convolution kernel to obtain a smoothed feature map in the frequency domain, including:
  • a convolution operation is performed on the weight matrix of the frequency-dimensional convolution kernel and the transposed matrix of the original amplitude spectrum to obtain the frequency-domain smoothing feature map.
  • the combined feature extraction is performed on the original amplitude spectrum, the time-domain smoothed feature map, and the frequency-domain smoothed feature map to obtain the enhanced amplitude spectrum of the original speech signal, including:
  • the weight matrix of the time-dimension convolution kernel and the weight matrix of the frequency-dimension convolution kernel are trained by using the back-propagation algorithm;
  • the combined feature extraction is performed on the speech signal to be enhanced according to the weight matrix obtained by training, and the enhanced amplitude spectrum of the original speech signal is obtained.
  • performing time-frequency transform on the original speech signal to obtain the original amplitude spectrum of the original speech signal includes:
  • Windowing and framing processing is performed on the original voice signal to obtain the framed speech signal;
  • Discrete Fourier transform is performed on each frame of speech signal, and modulo operation is performed on the transformed speech signal to obtain the original amplitude spectrum of the original speech signal.
  • performing an inverse time-frequency transform on the enhanced amplitude spectrum to obtain an enhanced speech signal includes:
  • Inverse time-frequency transform is performed on the enhanced amplitude spectrum and the original phase spectrum of the original voice signal to obtain the enhanced voice signal.
  • the original amplitude spectrum of the original speech signal obeys a two-dimensional Gaussian distribution in the complex domain.
  • a speech enhancement device comprising:
  • a signal transformation module for performing time-frequency transformation on the original speech signal to obtain the original amplitude spectrum of the original speech signal
  • a time-domain smoothing feature extraction module for extracting features from the original amplitude spectrum using a time-dimension convolution kernel to obtain a time-domain smoothing feature map
  • a frequency-domain smoothing feature extraction module used for extracting features from the original amplitude spectrum by using a frequency-dimensional convolution kernel to obtain a frequency-domain smoothing feature map
  • a combined feature extraction module configured to perform combined feature extraction on the original amplitude spectrum, the time-domain smoothed feature map and the frequency-domain smoothed feature map by using a deep neural network algorithm to obtain an enhanced amplitude spectrum of the original speech signal
  • a signal inverse transformation module configured to perform time-frequency inverse transformation on the enhanced amplitude spectrum to obtain an enhanced speech signal.
  • a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, implements any one of the methods described above.
  • an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to execute the executable instructions to perform any of the methods described above.
  • FIG. 1 shows a schematic diagram of an exemplary system architecture to which a speech enhancement method and apparatus according to an embodiment of the present disclosure can be applied;
  • FIG. 2 shows a schematic structural diagram of a computer system suitable for implementing an electronic device according to an embodiment of the present disclosure
  • FIG. 3 schematically shows a flowchart of a speech enhancement method according to an embodiment of the present disclosure
  • FIG. 4 schematically shows a flowchart of temporal smoothing feature extraction according to an embodiment of the present disclosure
  • FIG. 5 schematically shows a flowchart of frequency-domain smoothing feature extraction according to an embodiment of the present disclosure
  • FIG. 6 schematically shows a flowchart of enhanced amplitude spectrum acquisition according to an embodiment of the present disclosure
  • FIG. 7 schematically shows a flow chart of speech enhancement according to an embodiment of the present disclosure
  • FIGS. 8A-8B schematically show a schematic diagram of the combination of the TFDAL module and the U-Net deep neural network according to a specific embodiment of the present disclosure
  • FIG. 9 schematically shows a block diagram of a speech enhancement apparatus according to an embodiment of the present disclosure.
  • Example embodiments will now be described more fully with reference to the accompanying drawings.
  • Example embodiments can be embodied in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
  • the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
  • numerous specific details are provided in order to give a thorough understanding of the embodiments of the present disclosure.
  • those skilled in the art will appreciate that the technical solutions of the present disclosure may be practiced without one or more of the specific details, or other methods, components, devices, steps, etc. may be employed.
  • well-known solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
  • FIG. 1 shows a schematic diagram of a system architecture of an exemplary application environment to which a speech enhancement method and apparatus according to embodiments of the present disclosure can be applied.
  • the system architecture 100 may include one or more of terminal devices 101 , 102 , 103 , a network 104 and a server 105 .
  • the network 104 is a medium used to provide a communication link between the terminal devices 101 , 102 , 103 and the server 105 .
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
  • the terminal devices 101, 102, 103 may be various electronic devices with display screens, including but not limited to desktop computers, portable computers, smart phones, tablet computers, and the like. It should be understood that the numbers of terminal devices, networks and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks and servers according to implementation needs.
  • the server 105 may be a server cluster composed of multiple servers, or the like.
  • the speech enhancement method provided by the embodiment of the present disclosure is generally executed by the server 105 , and accordingly, the speech enhancement apparatus is generally set in the server 105 .
  • the voice enhancement method provided by the embodiments of the present disclosure can also be executed by the terminal devices 101, 102, and 103; correspondingly, the voice enhancement apparatus can also be provided in the terminal devices 101, 102, and 103, which is not specially limited in this exemplary embodiment.
  • FIG. 2 shows a schematic structural diagram of a computer system suitable for implementing an electronic device according to an embodiment of the present disclosure.
  • a computer system 200 includes a central processing unit (CPU) 201, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 202 or a program loaded from a storage section 208 into a random access memory (RAM) 203.
  • In the RAM 203, various programs and data required for system operation are also stored.
  • the CPU 201, the ROM 202, and the RAM 203 are connected to each other through a bus 204.
  • An input/output (I/O) interface 205 is also connected to the bus 204 .
  • the following components are connected to the I/O interface 205: an input section 206 including a keyboard, a mouse, etc.; an output section 207 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, etc.; a storage section 208 including a hard disk, etc.; and a communication section 209 including a network interface card such as a LAN card, a modem, and the like. The communication section 209 performs communication processing via a network such as the Internet.
  • a drive 210 is also connected to the I/O interface 205 as needed.
  • a removable medium 211 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is mounted on the drive 210 as needed so that a computer program read therefrom is installed into the storage section 208 as needed.
  • embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via the communication portion 209 and/or installed from the removable medium 211 .
  • when the computer program is executed by the central processing unit (CPU) 201, various functions defined in the method and apparatus of the present disclosure are performed.
  • the present disclosure also provides a computer-readable medium.
  • the computer-readable medium may be included in the electronic device described in the above embodiments; it may also exist alone without being assembled into the electronic device.
  • the above-mentioned computer-readable medium carries one or more programs, and when the above-mentioned one or more programs are executed by an electronic device, causes the electronic device to implement the methods described in the following embodiments. For example, the electronic device can implement various steps as shown in FIG. 3 to FIG. 7 .
  • the computer-readable medium shown in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above. More specific examples of computer-readable storage media may include, but are not limited to, electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disc read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device .
  • Program code embodied on a computer readable medium may be transmitted using any suitable medium including, but not limited to, wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • after the actually observed noisy speech signal y(n) is obtained, y(n) can be transformed from a one-dimensional time-domain signal into two-dimensional time-frequency-domain complex STFT parameters through the Short-Time Fourier Transform (STFT). Since the STFT is invertible and its transform matrix is full-rank, no speech information is lost in the conversion.
  • the actual observed speech signal can be expressed as the sum of the pure speech signal and the noise signal, namely:

y(n) = x(n) + w(n)

  • where y(n) represents the actually observed noisy speech signal, x(n) represents the pure speech signal in the time domain, and w(n) represents the noise signal in the time domain.
  • since the STFT is linear, the same additive relationship holds in the time-frequency domain (see the sketch below):

Y(k,l) = X(k,l) + W(k,l)

  • where Y(k,l) represents the STFT parameters of the noisy speech signal, X(k,l) represents the STFT parameters of the pure speech signal, W(k,l) represents the STFT parameters of the noise signal, k represents the k-th frequency grid on the frequency axis, and l represents the l-th time frame on the time axis.
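  • As an illustration of the additive model above, the following minimal numpy/scipy sketch (a sine wave stands in for real speech; all signal choices here are ours, not the patent's) verifies that the linearity of the STFT carries y(n) = x(n) + w(n) into Y(k,l) = X(k,l) + W(k,l):

```python
import numpy as np
from scipy.signal import stft

fs = 16000                                  # 16 kHz sampling rate, as preferred below
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)             # stand-in "pure" speech x(n)
w = 0.1 * np.random.randn(fs)               # Gaussian white noise w(n)
y = x + w                                   # observed noisy signal y(n) = x(n) + w(n)

# 32 ms frames (512 samples at 16 kHz) with half-window overlap
f, l, Y = stft(y, fs=fs, nperseg=512, noverlap=256)
_, _, X = stft(x, fs=fs, nperseg=512, noverlap=256)
_, _, W = stft(w, fs=fs, nperseg=512, noverlap=256)

# The STFT is linear, so Y(k,l) = X(k,l) + W(k,l) holds in the time-frequency domain.
assert np.allclose(Y, X + W)
```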
  • assuming the noise signal is Gaussian white noise, the time-domain amplitude of the noise signal obeys a Gaussian distribution, i.e. w ~ N(0, σ^2), where σ^2 is the variance of the distribution;
  • W(k,l) is isotropic in the time-frequency domain, that is, the Gaussian white noise has the same properties along the time axis T and the frequency axis F;
  • accordingly, the probability density function (PDF) of the noise W(k,l) obeys a two-dimensional Gaussian distribution in the complex domain.
  • the noise reduction of the speech signal can be achieved by solving the gain function G(k,l).
  • the gain function can be set as a time-varying and frequency-dependent function, that is, corresponding to different time frames l and frequency grids k, there are different gain function values.
  • the STFT parameters of the predicted pure speech signal can then be obtained according to:

X̂(k,l) = G(k,l) · Y(k,l)
  • the gain function G(k,l) is related to the probability of speech presence; correspondingly, there may be speech-absent segments and speech-present segments. Denote the speech-absent hypothesis in the k-th frequency grid and the l-th time frame as H_0(k,l) and the speech-present hypothesis as H_1(k,l): when only the noise signal is present, the segment is a speech-absent segment; when a pure speech signal is present on top of the noise signal, the segment is a speech-present segment. The observed noisy speech signal can thus be written segment-wise as:

H_0(k,l): Y(k,l) = W(k,l)
H_1(k,l): Y(k,l) = X(k,l) + W(k,l)

  • P(H_0(k,l) | Y(k,l)) is the posterior probability of speech absence at each frequency point estimated from Y(k,l), and P(H_1(k,l) | Y(k,l)) is the posterior probability of speech presence at each frequency point estimated from Y(k,l); that is, the speech-present and speech-absent segments can be determined from Y(k,l).
  • the predicted pure speech signal can be obtained according to different gain functions G(k,l).
  • the prediction process applies the gain function, selected according to the speech presence probability, to the noisy signal, where p(k,l) is the posterior probability of speech presence, that is, the probability that speech is present given Y(k,l). It can be seen that, in different time frames and frequency grids, adjusting the gain function G(k,l) realizes different noise reduction under different speech presence probabilities; that is, different smoothing strategies can be applied in speech-present segments and speech-absent segments, yielding a time-varying and frequency-dependent smoothing algorithm (see the sketch below).
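  • The gating idea above can be sketched as follows; the linear interpolation between a speech-segment gain and a noise-segment gain is an illustrative assumption, not the patent's exact formula:

```python
import numpy as np

def apply_gain(Y, p, g_speech=1.0, g_noise=0.1):
    """Apply a time-varying, frequency-dependent gain to the noisy STFT Y(k,l).

    p is the speech-presence probability per (k,l) bin; the gain interpolates
    between a speech-segment gain and a noise-segment gain (our assumption).
    """
    G = p * g_speech + (1.0 - p) * g_noise   # G(k,l): larger where speech is likely
    return G * Y                             # X_hat(k,l) = G(k,l) * Y(k,l)

# usage with random stand-in data
Y = np.random.randn(257, 100) + 1j * np.random.randn(257, 100)
p = np.random.rand(257, 100)
X_hat = apply_gain(Y, p)
```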
  • however, since the calculation formula of the gain function G(k,l) and its time-varying, frequency-dependent update rules are all developed from expert knowledge, the algorithm becomes limited in enhancing the speech signal as the types of noise multiply and the amount of data grows.
  • as for speech enhancement algorithms based on Deep Neural Networks (DNN), they also have shortcomings such as failing to incorporate expert knowledge, lacking model interpretability, and lacking pertinence in model structure design.
  • this exemplary embodiment provides a neural network-based speech enhancement method, which can be applied to the above server 105 or to one or more of the above terminal devices 101, 102, and 103, which is not particularly limited in this exemplary embodiment.
  • the speech enhancement method may include the following steps S310 to S350:
  • Step S310 Perform time-frequency transformation on the original voice signal to obtain the original amplitude spectrum of the original voice signal
  • Step S320 Use the time dimension convolution kernel to perform feature extraction on the original amplitude spectrum to obtain a time domain smoothing feature map
  • Step S330 Use the frequency dimension convolution kernel to perform feature extraction on the original amplitude spectrum to obtain a frequency domain smoothing feature map
  • Step S340 Perform combined feature extraction on the original amplitude spectrum, the time-domain smoothing feature map and the frequency-domain smoothing feature map to obtain an enhanced amplitude spectrum of the original speech signal;
  • Step S350 Perform inverse time-frequency transform on the enhanced amplitude spectrum to obtain an enhanced speech signal.
  • the original amplitude spectrum of the original speech signal is obtained by performing time-frequency transformation on the original speech signal; a time-dimension convolution kernel is used to perform feature extraction on the original amplitude spectrum to obtain a time-domain smoothed feature map; a frequency-dimension convolution kernel is used to perform feature extraction on the original amplitude spectrum to obtain a frequency-domain smoothed feature map; combined feature extraction is performed on the original amplitude spectrum, the time-domain smoothed feature map and the frequency-domain smoothed feature map to obtain the enhanced amplitude spectrum of the original speech signal;
  • the enhanced speech signal is obtained by performing inverse time-frequency transform on the enhanced amplitude spectrum.
  • on the one hand, time-frequency smoothing features are extracted through the convolutional neural network jointly along the two dimensions of the time axis and the frequency axis and, combined with the deep neural network, self-learning of the noise reduction parameters can be realized, further improving the quality of the speech signal; on the other hand, based on the statistical characteristics of the speech signal on the time axis and the frequency axis, dual-axis noise reduction on the time and frequency axes can be realized, achieving speech enhancement in a variety of complex noise environments.
  • step S310 the original speech signal is subjected to time-frequency transformation to obtain the original amplitude spectrum of the original speech signal.
  • the interference of environmental noise is inevitable in the process of voice communication.
  • the actual observed original voice signal is generally a noisy voice signal, which is a non-stationary and time-varying signal.
  • the time-domain analysis of the original speech signal processes the speech waveform to obtain a series of characteristics that change with time.
  • speech enhancement is generally performed on the time-frequency-domain speech signal; therefore, the one-dimensional time-domain speech signal can be transformed into a two-dimensional time-frequency-domain speech signal, so as to extract the pure speech signal from the noisy speech signal.
  • the original speech signal can be transformed into a time-frequency domain speech signal through short-time Fourier transform.
  • the original voice signal can be divided into frames, and the specific frame length can be set according to the actual situation.
  • the frame length can be set to 32 ms, that is, the sampling points within every 32 ms form one frame of signal; if the sampling rate is 8 kHz, one frame corresponds to 256 sampling points. In this embodiment, the preferred sampling rate is 16 kHz, so one frame is 512 sampling points.
  • Short-time Fourier transform has the characteristics of fast transformation speed and small amount of calculation.
  • the time-frequency-domain speech signal can also be obtained from the time-domain speech signal by discrete cosine transform, or by filtering the original speech signal through an auditory filter bank such as a Gammatone filter bank, which can then reflect how the frequency spectrum of the speech signal changes within a certain time period.
  • the original speech signal may be divided into a plurality of short periods by windowing, each short period is called a frame, and the signals of each frame are overlapped.
  • a window function can be used to intercept the signal in the time domain, and Fourier transform can be performed on the intercepted local signal.
  • the time window function can be used to multiply the original speech signal to intercept the signal to obtain a multi-frame speech signal.
  • the time window function may be a Rectangular window (rectangular window), a Hamming window (Hamming window), a Hanning window (Hanning window), a Bartlett window (Bartlett window), etc.
  • a sliding window can also be used, that is, there is a certain overlap between frames, which is called window shift, and the window shift can take half of the window length.
  • window shift can also be 10ms.
  • the discrete Fourier transform can be performed on each frame of the voice signal; for example, the center position of the time window function can be moved continuously to obtain the Fourier transform of each frame. Due to the symmetry of the discrete Fourier transform, only half of the discrete Fourier transform result needs to be taken as the short-time Fourier transform result of each frame of the speech signal.
  • the set of short-time Fourier transform results over all frames is the time-frequency transform result of the original speech signal (see the sketch below).
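  • A minimal numpy sketch of the windowing, framing, and one-sided DFT just described; the Hamming window and hop size are example choices, not mandated by the patent:

```python
import numpy as np

def frame_and_transform(signal, frame_len=512, hop=256):
    """Window each frame with a Hamming window and take the one-sided DFT.

    frame_len=512 matches 32 ms at 16 kHz; hop=256 is a half-window shift.
    """
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    spectra = []
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len] * window
        # symmetry of the DFT of a real signal: keep only the first half
        spectra.append(np.fft.rfft(frame))
    return np.stack(spectra, axis=1)   # shape (frame_len//2 + 1, n_frames) = (k, l)

Y = frame_and_transform(np.random.randn(16000))
```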
  • the value of the time-frequency-domain speech signal at each frequency point is a complex number, which is determined by its modulus and argument, so the time-frequency-domain speech signal can be decomposed into an amplitude spectrum and a phase spectrum.
  • the magnitude spectrum is a function of the modulus of the complex number as a function of frequency
  • the phase spectrum is a function of the argument of the complex number as a function of frequency.
  • the modulo operation can be performed on the time-frequency-domain speech signal Y(k,l) to obtain the original amplitude spectrum of the original speech signal, namely:

|Y(k,l)| = sqrt( Real(Y(k,l))^2 + Img(Y(k,l))^2 )

  • where |Y(k,l)| is the original amplitude spectrum of the speech signal in the time-frequency domain (the information of the speech signal being lossless after the Fourier transform), Real(Y(k,l)) is the real part of the time-frequency-domain speech signal, and Img(Y(k,l)) is the imaginary part of the time-frequency-domain speech signal.
  • the original amplitude spectrum of the original speech signal obeys a two-dimensional Gaussian distribution in the complex domain.
  • for the noise signal contained in it, such as a white noise signal, the probability density distribution on the time axis and the frequency axis obeys a two-dimensional Gaussian distribution; that is, the noise has statistical regularity on both axes, which facilitates noise reduction along the time axis and the frequency axis.
  • the original amplitude spectrum of the original speech signal can be input into the deep neural network to extract different time-varying and frequency-dependent features. For example, based on the correlation between adjacent frames and adjacent frequency bands of the time-frequency domain speech signal, the local features of the time-frequency domain speech signal can be calculated by performing smoothing in the two dimensions of time and frequency.
  • the deep neural network model can be used for speech enhancement, and the smoothing algorithm can be incorporated into the two-dimensional convolution module of the deep neural network when noise reduction is performed on the time-frequency domain speech signal through the smoothing algorithm.
  • since a single convolution module corresponds to the extraction of a single feature, with its weights kept unchanged during the sliding process, only single-feature extraction can be achieved over the entire input Y(k,l). To achieve time-varying, frequency-dependent segmentation and extraction of different features, multiple convolution kernels can first be used to extract features, followed by feature combination.
  • the two-dimensional convolution module may be a TFDAL (Time-Frequency Domain Averaging Layer) module;
  • the TFDAL module may include two sub-modules, a Time-Dimensional Averaging Module (TAM) and a Frequency-Dimensional Averaging Module (FAM), corresponding respectively to noise smoothing along the time-axis dimension and noise smoothing along the frequency-axis dimension.
  • step S320 feature extraction is performed on the original amplitude spectrum using a temporal convolution kernel to obtain a temporal smooth feature map.
  • the original amplitude spectrum can be used as the input of the TAM module, and the original speech signal can be filtered through the TAM module, that is, noise smoothing in the time axis dimension is performed.
  • the weighted moving average method can be used to predict the amplitude spectrum information to be smoothed at each time point on the time axis, where the weighted moving average method predicts future values according to the degree of influence (corresponding to different weights) that data at different times within the same moving segment exert on the predicted value.
  • noise smoothing in the time axis dimension can be performed according to steps S410 to S430:
  • Step S410 Determine the time-domain smoothing parameter matrix according to the convolution sliding window and the time-domain smoothing factor.
  • the smoothing of the time domain by the TAM module can be implemented with a sliding window, and the corresponding smoothing algorithm can be written as:

T_α(k,l) = Σ_{d=0}^{D-1} (1-α) α^d |Y(k, l-d)|

  • where l represents the l-th time frame on the time axis, and k represents the k-th frequency grid on the frequency axis;
  • D represents the width of the sliding window, which can be set according to the actual situation; in this example, the width of the sliding window can preferably be set to 32 frames;
  • α is the smoothing factor, which indicates the degree to which the amplitude spectra of historical time frames within the sliding-window width are utilized when the signal is smoothed along the time axis; [α_0 ... α_N] are different smoothing factors, each with a value range of [0, 1], and corresponding to the N values of α, the number of convolution kernels in the TAM module can be N;
  • within the sliding window, the amplitude spectrum of each historical time frame can be utilized; for example, the smoothed amplitude spectrum at the 32nd frame can be composed from the amplitude spectra of the previous 31 frames within the sliding-window width;
  • T_α(k,l) denotes the new amplitude spectrum obtained by superimposing the amplitude spectra of the historical time frames within the sliding-window width, i.e. the amplitude spectrum obtained by time-domain smoothing.
  • Step S420 Perform a product operation on the time-domain smoothing parameter matrix to obtain a weight matrix of the time-dimension convolution kernel.
  • the weight matrix of the time-dimension convolution kernel may be determined first. For a sliding window of width D, the corresponding first time-domain smoothing parameter matrix may be [α^0, α^1, ..., α^(D-1)]; combining it with the second time-domain smoothing parameter matrix [1-α], for example by performing a product operation on the first and second time-domain smoothing parameter matrices, gives the final weight matrix of the time-dimension convolution kernel N(α) = (1-α)·[α^0, α^1, ..., α^(D-1)].
  • Step S430 Perform a convolution operation on the weight matrix of the temporal convolution kernel and the original amplitude spectrum to obtain the temporal smoothing feature map.
  • since the original amplitude spectrum of the time-frequency-domain speech signal is in two-dimensional form, it can be used as the original input image, in the form of a frequency-domain map. A statistical method can then be used to construct and extract features: all pixels in the original input image are smoothed sequentially to obtain a filtered image. To ensure that the filtered image has the same size as the original input image, for edge pixels whose neighborhood would exceed the image area, padding operations such as zero-filling or symmetrically replicating adjacent pixels can be performed in advance.
  • the original amplitude spectrum of the speech signal in the time-frequency domain, i.e. the spectrogram, can be used as the original input image; the spectrogram can be a T × F two-dimensional image matrix, where T is the time dimension and F is the frequency dimension (a sketch of this module follows below).
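  • As a concrete reading of steps S410 to S430, the following PyTorch sketch builds the TAM as a convolution whose kernels are initialized from the exponential weight matrix described above; the class and parameter names are ours, not the patent's, and the kernels remain trainable:

```python
import torch
import torch.nn as nn

class TAM(nn.Module):
    """Time-Dimensional Averaging Module: one convolution kernel per smoothing factor.

    Sketch under our reading of the patent: each kernel is initialized to the
    exponential weights (1 - alpha) * [alpha^(D-1), ..., alpha^0] along the time
    axis (newest frame weighted most) and stays trainable, so back-propagation
    can refine it.
    """
    def __init__(self, alphas=(0.2, 0.5, 0.8), window=32):
        super().__init__()
        exps = torch.arange(window - 1, -1, -1, dtype=torch.float32)  # D-1 ... 0
        weights = torch.stack([(1 - a) * a ** exps for a in alphas])
        self.conv = nn.Conv2d(1, len(alphas), kernel_size=(1, window),
                              padding=(0, window - 1), bias=False)
        # kernel shape (N, 1, 1, D): slides along the time (last) axis only
        self.conv.weight = nn.Parameter(weights.view(len(alphas), 1, 1, window))

    def forward(self, mag):                  # mag: (batch, 1, F, T) amplitude spectrum
        out = self.conv(mag)                 # padded on both sides of the time axis
        return out[..., : mag.shape[-1]]     # keep the causal part: (batch, N, F, T)

t_smoothed = TAM()(torch.rand(1, 1, 257, 100))   # -> shape (1, 3, 257, 100)
```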
  • step S330 a frequency dimension convolution kernel is used to perform feature extraction on the original amplitude spectrum to obtain a frequency domain smooth feature map.
  • the original amplitude spectrum can also be used as the input of the FAM module, and the original speech signal can be filtered through the FAM module, that is, noise smoothing in the frequency axis dimension is performed.
  • the weighted moving average method can be used to predict the amplitude spectrum information of each frequency grid on the frequency axis to be smoothed. Referring to FIG. 5 , the weighted moving average method can be used to smooth the noise in the frequency axis dimension according to steps S510 to S530:
  • Step S510 Determine the frequency domain smoothing parameter matrix according to the convolution sliding window and the frequency domain smoothing factor.
  • the smoothing of the frequency domain by the FAM module can likewise be implemented with a sliding window, and the corresponding smoothing algorithm can be written as:

F_β(k,l) = Σ_{d=0}^{D-1} (1-β) β^d |Y(k-d, l)|

  • where k represents the k-th frequency grid on the frequency axis, and l represents the l-th time frame on the time axis;
  • D represents the width of the sliding window, which can be set according to the actual situation; in this example, the width of the sliding window can preferably be set to 32 frequency grids;
  • β is the smoothing factor, which indicates the degree to which the historical amplitude spectra within the sliding-window width are utilized when the signal is smoothed along the frequency axis; [β_0 ... β_M] are different smoothing factors, each with a value range of [0, 1], and corresponding to the M values of β, the number of convolution kernels in the FAM module can be M;
  • within the sliding window, each historical amplitude spectrum can be utilized; for example, the smoothed amplitude spectrum at the 32nd frequency grid within the sliding-window width can be composed from the amplitude spectra of the previous 31 frequency grids within the window;
  • F_β(k,l) denotes the new amplitude spectrum obtained by superimposing the historical amplitude spectra within the sliding-window width, i.e. the amplitude spectrum obtained by frequency-domain smoothing.
  • Step S520 Perform a product operation on the frequency-domain smoothing parameter matrix to obtain a weight matrix of the frequency-dimensional convolution kernel.
  • as the distribution of the frequency-domain map changes, a corresponding feature vector can be constructed, each dimension representing the distribution characteristics of a different region.
  • the weight matrix of the frequency-dimensional convolution kernel may be determined before the frequency domain feature extraction is performed on the original input image.
  • for a sliding window of width D, the corresponding first frequency-domain smoothing parameter matrix can be [β^0, β^1, ..., β^(D-1)]; combining it with the second frequency-domain smoothing parameter matrix [1-β], for example by performing a product operation on the first and second frequency-domain smoothing parameter matrices, gives the final weight matrix of the frequency-dimension convolution kernel M(β) = (1-β)·[β^0, β^1, ..., β^(D-1)].
  • Step S530 Perform a convolution operation on the weight matrix of the frequency-dimensional convolution kernel and the transposed matrix of the original amplitude spectrum to obtain the frequency-domain smoothing feature map.
  • the transposed matrix of the original amplitude spectrum of the time-frequency-domain speech signal can be used as the original input image and convolved with a sliding window, the window of the convolution kernel of each channel sliding continuously to perform multiple convolution operations on the original input image.
  • the transposed matrix of the original amplitude spectrum can be an F × T two-dimensional image matrix, where F is the frequency dimension and T is the time dimension; performing the convolution operation on this two-dimensional image matrix with the weight matrix of the frequency-dimension convolution kernel yields the frequency-domain smoothed feature map.
  • in this way, the idea of the convolution kernel in convolutional neural networks is borrowed, and the noise reduction algorithm itself is packaged into convolution kernels used for noise reduction.
  • since the probability density function of the noise W(k,l) is a two-dimensional Gaussian distribution with statistical regularity on both the time axis and the frequency axis, dual-axis noise reduction on the time and frequency axes can be realized (a FAM sketch follows below).
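  • The FAM is the same construction rotated onto the frequency axis; a sketch under the same assumptions as the TAM above, with the kernel M(β) sliding along frequency, which is equivalent to convolving the transposed amplitude spectrum:

```python
import torch
import torch.nn as nn

class FAM(nn.Module):
    """Frequency-Dimensional Averaging Module: the TAM idea applied along frequency."""
    def __init__(self, betas=(0.3, 0.6, 0.9), window=32):
        super().__init__()
        exps = torch.arange(window - 1, -1, -1, dtype=torch.float32)
        weights = torch.stack([(1 - b) * b ** exps for b in betas])
        self.conv = nn.Conv2d(1, len(betas), kernel_size=(window, 1),
                              padding=(window - 1, 0), bias=False)
        # kernel shape (M, 1, D, 1): slides along the frequency axis only
        self.conv.weight = nn.Parameter(weights.view(len(betas), 1, window, 1))

    def forward(self, mag):                  # mag: (batch, 1, F, T)
        out = self.conv(mag)
        return out[:, :, : mag.shape[2], :]  # (batch, M, F, T)

f_smoothed = FAM()(torch.rand(1, 1, 257, 100))   # -> shape (1, 3, 257, 100)
```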
  • step S340 combined feature extraction is performed on the original amplitude spectrum, the time-domain smoothed feature map, and the frequency-domain smoothed feature map to obtain an enhanced amplitude spectrum of the original speech signal.
  • the enhanced amplitude spectrum of the original speech signal can be obtained according to steps S610 to S630:
  • Step S610 Combine the original amplitude spectrum of the original speech signal, the time-domain smoothing feature map and the frequency-domain smoothing feature map to obtain the speech signal to be enhanced.
  • the noisy speech signal Y(k,l) processed by the TAM module and the FAM module has its noise component W(k,l) smoothed along both the time axis T and the frequency axis F.
  • the features of the original input Y(k,l) can be spliced with the output of the TFDAL module, which not only retains the features of the original speech signal but also allows deep-level features to be learned.
  • the input of the deep neural network can then be changed from the original input Y(k,l) to a combined input, which can be a three-dimensional tensor C_i(k,l) obtained by channel-wise concatenation (see the sketch below):

C_i(k,l) = concat( |Y(k,l)|, T_α(k,l), F_β(k,l) )

  • |Y(k,l)| is a 1 × F × T tensor, equivalent to the output of a filter whose smoothing factor is 0, i.e. the original information is kept unchanged;
  • T_α(k,l) is an N × F × T three-dimensional tensor;
  • F_β(k,l) is an M × F × T three-dimensional tensor;
  • the combined speech signal to be enhanced, C_i(k,l), is therefore an (M + N + 1) × F × T three-dimensional tensor.
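  • A sketch of assembling the combined input C_i(k,l) by channel-wise concatenation, reusing the TAM and FAM sketches above (shapes are illustrative):

```python
import torch

# mag: (batch, 1, F, T) original amplitude spectrum |Y(k,l)|
mag = torch.rand(1, 1, 257, 100)
t_smoothed, f_smoothed = TAM()(mag), FAM()(mag)   # modules sketched above

# combined input C_i(k,l): concatenate along the channel dimension
combined = torch.cat([mag, t_smoothed, f_smoothed], dim=1)
print(combined.shape)   # torch.Size([1, 7, 257, 100]) for N = M = 3
```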
  • the TFDAL module augments the input of the neural network, giving it more input information; moreover, the TFDAL module combines the interpretability of noise reduction algorithms developed from expert knowledge with the strong fitting ability gained once it is incorporated into the neural network, an example of classical signal processing algorithms combined with deep neural networks.
  • Step S620 Using the voice signal to be enhanced as the input of the deep neural network, use the back-propagation algorithm to train the weight matrix of the time-dimension convolution kernel and the weight matrix of the frequency-dimension convolution kernel.
  • the TFDAL module can be incorporated into a deep neural network model so as to train the weight matrix of the time-dimension convolution kernel and the weight matrix of the frequency-dimension convolution kernel, together with the weighting factors of the layers in the model.
  • the TFDAL module can be combined with network models such as convolutional neural networks, recurrent neural networks, and fully-connected neural networks to realize gradient conduction. Understandably, the training objective of the neural network can determine the final value of each element in the convolution kernel.
  • a back-propagation algorithm may be used in the training process of the neural network model, parameters may be randomly initialized, and the parameters may be continuously updated as the training deepens.
  • the BP (error Back Propagation) algorithm can be used.
  • in the forward pass, the output of the output layer is computed from front to back from the original input; the difference between the current output and the target output, i.e. the loss function, is then calculated;
  • the loss function can be minimized using the gradient descent algorithm, the Adam optimization algorithm, etc., with the parameters updated sequentially from back to front; that is, the weight matrix of the time-dimension convolution kernel and the weight matrix of the frequency-dimension convolution kernel are updated in turn.
  • the gradient descent algorithm can be stochastic gradient descent, mini-batch gradient descent or batch gradient descent to minimize the error between the noisy speech and the pure speech.
  • the batch gradient descent method uses all samples to update each parameter; the stochastic gradient descent method uses one sample per update, updating many times, and when the sample size is very large the optimal solution can be approached iteratively with only a small number of samples; the mini-batch gradient descent method uses a subset of samples per update and combines the characteristics of the stochastic and batch gradient descent methods (see the training sketch below).
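  • A minimal joint-training sketch: the stand-in network, the MSE objective, and the random data are our assumptions; the point is that back-propagation updates the TFDAL kernels together with the downstream network:

```python
import torch
import torch.nn as nn

tam, fam = TAM(), FAM()                   # modules sketched above
net = nn.Sequential(                      # stand-in for the deep neural network
    nn.Conv2d(7, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)
params = list(tam.parameters()) + list(fam.parameters()) + list(net.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.MSELoss()                    # error between enhanced and clean spectra

for noisy_mag, clean_mag in [(torch.rand(8, 1, 257, 100), torch.rand(8, 1, 257, 100))]:
    combined = torch.cat([noisy_mag, tam(noisy_mag), fam(noisy_mag)], dim=1)
    enhanced = net(combined)              # predicted enhanced amplitude spectrum
    loss = loss_fn(enhanced, clean_mag)
    optimizer.zero_grad()
    loss.backward()                       # gradients flow back into the TFDAL kernels
    optimizer.step()
```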
  • Step S630 Perform combined feature extraction on the speech signal to be enhanced according to the weight matrix obtained by training to obtain an enhanced amplitude spectrum of the original speech signal.
  • after training, the learned weight matrices can be applied to the original input: combined feature extraction is performed on the original input Y(k,l), the time-domain smoothed feature maps of each layer in T_α(k,l) output by the TAM module, and the frequency-domain smoothed feature maps of each layer in F_β(k,l) output by the FAM module, so as to obtain the enhanced amplitude spectrum of the original speech signal and thereby achieve different smoothing effects in speech-present segments and speech-absent segments.
  • the two-dimensional convolutional structure can be successfully incorporated into the deep neural network model, and can be combined with convolutional neural networks, recurrent neural networks, and fully-connected neural networks to achieve gradient conduction.
  • the convolution kernel parameters within the TFDAL module, that is, the parameters of the noise reduction algorithm, can thus be driven by data, and the statistically optimal values can be obtained without expert knowledge as prior information.
  • step S350 an inverse time-frequency transform is performed on the enhanced amplitude spectrum to obtain an enhanced speech signal.
  • speech enhancement aims to recover the pure speech signal, for which both the amplitude spectrum and the phase spectrum would be predicted. Since the phase spectrum has little influence on the denoising effect, in an example implementation only the original amplitude spectrum of the time-frequency-domain speech signal is enhanced, and the phase of Y(k,l) is reused; therefore, the original phase spectrum of Y(k,l) can be obtained first.
  • the original phase spectrum of the original speech signal can be obtained by performing a phase-angle operation on the transformed speech signal:

θ_Y(k,l) = arctan( Img(Y(k,l)) / Real(Y(k,l)) )

  • where θ_Y(k,l) is the original phase spectrum of the time-frequency-domain speech signal, Real(Y(k,l)) is the real part of the time-frequency-domain speech signal, and Img(Y(k,l)) is its imaginary part.
  • the enhanced amplitude spectrum and the original phase spectrum of the original speech signal can be inverse time-frequency transformed to obtain the enhanced speech signal.
  • the enhanced amplitude spectrum and the original phase spectrum can be synthesized into a complex-domain spectrum, whose dimensions are the same as those of the real-part and imaginary-part spectra.
  • an inverse discrete Fourier transform is performed on the complex-domain spectrum to obtain the corresponding time-domain speech signal, and the enhanced speech signal can be obtained using the overlap-add method (see the sketch below).
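  • A numpy/scipy sketch of the resynthesis step: the enhanced amplitude spectrum (a simple scaling stands in for the network output) is combined with the original noisy phase and inverted by overlap-add ISTFT:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
y = np.random.randn(fs)                       # stand-in noisy signal
f, l, Y = stft(y, fs=fs, nperseg=512, noverlap=256)

mag, phase = np.abs(Y), np.angle(Y)           # |Y(k,l)| and theta_Y(k,l)
enhanced_mag = 0.8 * mag                      # stand-in for the network's output

# resynthesize: enhanced magnitude + original noisy phase, then overlap-add ISTFT
Y_enhanced = enhanced_mag * np.exp(1j * phase)
_, y_enhanced = istft(Y_enhanced, fs=fs, nperseg=512, noverlap=256)
```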
  • FIG. 7 schematically shows a flow chart of speech enhancement including a TFDAL module and a deep neural network, wherein the TFDAL module includes a TAM module and a FAM module, and the process may include steps S701 to S708:
  • Step S701. Input speech signal y(n), which is a noisy speech signal;
  • Step S702. Perform STFT transformation on the speech signal: perform STFT transformation on the noisy speech signal y(n) to obtain the time-frequency domain speech signal Y(k,l);
  • Step S703. Modulo operation: perform a modulo operation on the time-frequency-domain speech signal Y(k,l) to obtain the amplitude information of the speech signal, that is, the original amplitude spectrum |Y(k,l)|;
  • Step S704. Input the original amplitude spectrum into the TAM module, extract the time-domain smoothing feature from the original amplitude spectrum, and obtain the amplitude spectrum T(k, l) after noise reduction along the time axis;
  • Step S705. Input the original amplitude spectrum into the FAM module, extract the frequency domain smoothing feature from the transposed matrix of the original amplitude spectrum, and obtain the amplitude spectrum F(k, l) after noise reduction along the frequency axis;
  • Step S706. Combine the original amplitude spectrum with T(k,l) and F(k,l), and input the combination into the deep neural network to obtain the enhanced amplitude spectrum;
  • Step S707. Obtain the phase information: perform a phase-angle operation on the time-frequency-domain speech signal Y(k,l) to obtain the noisy phase spectrum θ_Y(k,l) of the speech signal;
  • Step S708 Perform ISTFT transformation on the enhanced amplitude spectrum and the noisy phase spectrum of the speech signal to obtain an enhanced speech signal.
  • the time-frequency smoothing feature extraction in the two-dimensional combination of the time axis and the frequency axis can be achieved through the convolutional neural network.
  • the TFDAL module is incorporated into the neural network model, which enables self-learning of the smoothing parameters, that is, the weights of the convolution kernels, through gradient back-propagation; an end-to-end sketch follows.
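  • Gluing the pieces together, a sketch of the S701 to S708 flow; shapes and names are illustrative, and `model` stands in for the trained TFDAL plus deep-neural-network stack:

```python
import numpy as np
import torch
from scipy.signal import stft, istft

def enhance(y, fs=16000, model=None):
    """End-to-end sketch of steps S701-S708 using the pieces sketched above."""
    _, _, Y = stft(y, fs=fs, nperseg=512, noverlap=256)              # S702: STFT
    mag = torch.tensor(np.abs(Y), dtype=torch.float32)[None, None]   # S703: |Y(k,l)|
    with torch.no_grad():
        combined = torch.cat([mag, TAM()(mag), FAM()(mag)], dim=1)   # S704-S706
        enhanced_mag = model(combined)[0, 0].numpy() if model else mag[0, 0].numpy()
    phase = np.angle(Y)                                              # S707: noisy phase
    _, y_hat = istft(enhanced_mag * np.exp(1j * phase),
                     fs=fs, nperseg=512, noverlap=256)               # S708: ISTFT
    return y_hat

y_hat = enhance(np.random.randn(16000))
```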
  • FIG. 8A schematically shows the combination of a TFDAL module and a U-Net deep neural network; that is, a U-Net convolutional neural network with an encoder-decoder structure can be constructed as the speech enhancement model. The U-Net convolutional neural network model can include a full-convolution part (encoder) and a deconvolution part (decoder). The full-convolution part extracts features and obtains a low-resolution feature map; the deconvolution part upsamples the small-sized feature map back to the original size. Upsampling improves the resolution of the image and can, for example, be accomplished by resampling and interpolation, such as using bilinear interpolation to fill in the remaining points.
  • the original speech signal is time-frequency transformed to obtain the original input; the original input is fed into the TAM(α) convolution module and the FAM(β) convolution module respectively, and the original input together with the outputs of the TAM(α) and FAM(β) convolution modules is combined and input into the U-Net convolutional neural network model.
  • the U-Net model extracts the combined features of the original input, the TAM output and the FAM output, achieving different smoothing effects in speech-present and speech-absent segments, and finally outputs the enhanced speech signal.
  • FIG. 8B presents a schematic diagram of a combination of the TFDAL module and the U-Net deep neural network.
  • the U-Net deep neural network model can be a convolutional neural network structure with a 4-layer encoder and a 4-layer decoder.
  • the encoder can extract time-frequency domain smoothing features by downsampling the time dimension and frequency dimension.
  • each encoder block can include a convolutional layer with a convolution kernel size of 3 × 3, a pooling layer, and a nonlinear layer whose activation function is the ReLU (Rectified Linear Unit).
  • the time and frequency dimensions are down-sampled layer by layer, and a 3 × 3 convolution kernel can be used for feature extraction, so that the number of channels expands layer by layer to 64, 128, 256, and 256.
  • in the decoder, a 3 × 3 convolution kernel can be used for the upsampling operations; each upsampling step adds the feature map from the corresponding encoder, and the number of channels changes layer by layer from 256 to 512, 256, and 128, until an image of the same size as the input is restored.
  • the activation function of the last layer can be the Tanh (hyperbolic tangent) activation function; a compact sketch of this U-Net follows.
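  • A compact PyTorch sketch of the 4-layer encoder / 4-layer decoder U-Net just described; the channel counts follow the text above, while the exact layer composition of FIG. 8B is not reproduced and remains an assumption:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    # 3x3 convolution + ReLU, then 2x2 max-pooling for downsampling
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        skip = self.conv(x)
        return self.pool(skip), skip

class DecoderBlock(nn.Module):
    # upsample, concatenate the corresponding encoder feature map, 3x3 conv + ReLU
    def __init__(self, c_in, c_skip, c_out):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.conv = nn.Sequential(
            nn.Conv2d(c_in + c_skip, c_out, 3, padding=1), nn.ReLU())

    def forward(self, x, skip):
        return self.conv(torch.cat([self.up(x), skip], dim=1))

class UNet(nn.Module):
    """4-layer encoder / 4-layer decoder over the (M+N+1)-channel combined input."""
    def __init__(self, c_in=7):
        super().__init__()
        chans = [64, 128, 256, 256]
        self.encoders = nn.ModuleList(
            [EncoderBlock(c, n) for c, n in zip([c_in] + chans[:-1], chans)])
        # 512 channels appear where the 256-channel skip joins the 256-channel path
        self.decoders = nn.ModuleList([
            DecoderBlock(256, 256, 256), DecoderBlock(256, 256, 128),
            DecoderBlock(128, 128, 64), DecoderBlock(64, 64, 1)])
        self.out = nn.Tanh()                 # Tanh activation for the last layer

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            x, skip = enc(x)
            skips.append(skip)
        for dec, skip in zip(self.decoders, reversed(skips)):
            x = dec(x, skip)
        return self.out(x)

out = UNet()(torch.rand(1, 7, 256, 96))      # input dims divisible by 2**4
print(out.shape)                             # torch.Size([1, 1, 256, 96])
```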
  • the original amplitude spectrum may be used as the original input image
  • the original input image may be a T × F two-dimensional image matrix, where T is the time dimension and F is the frequency dimension.
  • the original input image passes sequentially through the time-frequency feature extraction layer, the encoder, the decoder, and the output layer.
  • the original input image can be preprocessed, and the time-spectral features are relatively independent in time and frequency.
  • the time-frequency feature extraction layer can be used for convolution and smoothing along the time axis and frequency axis respectively.
  • the original input image can be input into the time recursive smoothing layer in the U-Net deep neural network, where a convolution operation between the two-dimensional image matrix and the weight matrix N(α) of the time-dimension convolution kernel yields the time-domain smoothed feature map; the original input image can also be transposed and input into the frequency recursive smoothing layer, where a convolution operation between the transposed two-dimensional image matrix and the weight matrix M(β) of the frequency-dimension convolution kernel yields the frequency-domain smoothed feature map.
  • the time-frequency feature extraction layer can fuse features from the dimension level.
  • the encoder can perform four convolutions on the combined output time-frequency domain smoothed feature map and the original input image.
  • the size of the time-dimension convolution kernel can be 32 × 201.
  • the window of the convolution kernel of each channel can be slid continuously to perform multiple convolution operations on the original input image, yielding feature maps of four different sizes: 51 × 51, 13 × 13, 4 × 4, and 1 × 1.
  • the encoder can extract high-dimensional features in the original speech signal.
  • the high-dimensional encoded features output by the encoder are used as the input of the decoder, and the decoder and encoder have a symmetric structure.
  • upsampling or deconvolution can be performed on the 1 × 1 feature map to obtain a 4 × 4 feature map.
  • this 4 × 4 feature map is spliced with the earlier 4 × 4 feature map along the channel dimension; the spliced feature map is then convolved and upsampled to obtain a 13 × 13 feature map, which is in turn spliced with the earlier 13 × 13 feature map, convolved, and upsampled.
  • after a total of four upsampling steps, a 201 × 201 prediction result with the same size as the input image is obtained.
  • the decoder can restore high-dimensional features to low-dimensional features with more sound information, and the output layer can restore the enhanced time-spectral features.
  • the two-dimensional TFDAL module can be successfully incorporated into the deep neural network model, and can be ideally combined with convolutional neural networks, recurrent neural networks, and fully-connected neural networks to achieve gradient conduction.
  • the parameters of the convolution kernel in the TFDAL module that is, the parameters of the noise reduction algorithm, can be driven by data, and the optimal value in the statistical sense can be obtained without expert knowledge as prior information.
  • the TFDAL module combines the interpretability of algorithms developed from expert knowledge with the strong fitting ability gained after being incorporated into the neural network; it is an interpretable neural network module that can effectively denoise speech, an example of classical signal processing algorithms in the field combined with deep neural networks.
  • the enhancement effect on the noisy speech can be measured by PESQ (Perceptual Evaluation of Speech Quality), STOI (Short-Time Objective Intelligibility), and the signal-to-noise ratio (SNR), as sketched below.
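  • The SNR can be computed directly; PESQ and STOI typically come from third-party packages, noted here as assumptions rather than part of the patent:

```python
import numpy as np

def snr_db(clean, enhanced):
    """Signal-to-noise ratio in dB between a clean reference and an enhanced signal."""
    noise = clean - enhanced
    return 10 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

clean = np.random.randn(16000)
enhanced = clean + 0.05 * np.random.randn(16000)
print(f"SNR: {snr_db(clean, enhanced):.1f} dB")

# PESQ and STOI are typically computed with the third-party `pesq` and `pystoi`
# packages (assumed installed):
#   from pesq import pesq;  pesq(16000, clean, enhanced, 'wb')
#   from pystoi import stoi;  stoi(clean, enhanced, 16000)
```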
  • the original amplitude spectrum of the original speech signal is obtained by performing time-frequency transformation on the original speech signal; a time-dimension convolution kernel is used to perform feature extraction on the original amplitude spectrum to obtain a time-domain smoothed feature map; a frequency-dimension convolution kernel is used to perform feature extraction on the original amplitude spectrum to obtain a frequency-domain smoothed feature map; combined feature extraction is performed on the original amplitude spectrum, the time-domain smoothed feature map and the frequency-domain smoothed feature map to obtain the enhanced amplitude spectrum of the original speech signal;
  • the enhanced speech signal is obtained by performing inverse time-frequency transform on the enhanced amplitude spectrum.
  • on the one hand, time-frequency smoothing features are extracted through the convolutional neural network jointly along the two dimensions of the time axis and the frequency axis and, combined with the deep neural network, self-learning of the noise reduction parameters can be realized, further improving the quality of the speech signal; on the other hand, based on the statistical characteristics of the speech signal on the time axis and the frequency axis, dual-axis noise reduction on the time and frequency axes can be realized, achieving speech enhancement in a variety of complex noise environments.
  • a neural network-based voice enhancement apparatus is also provided, and the apparatus can be applied to a server or a terminal device.
  • the speech enhancement apparatus 900 may include a signal transformation module 910, a time domain smoothing feature extraction module 920, a frequency domain smoothing feature extraction module 930, a combined feature extraction module 940, and a signal inverse transformation module 950, wherein:
  • a signal transformation module 910 configured to perform time-frequency transformation on the original speech signal to obtain the original amplitude spectrum of the original speech signal
  • a time-domain smoothing feature extraction module 920 configured to perform feature extraction on the original amplitude spectrum by using a time-dimension convolution kernel to obtain a time-domain smoothing feature map
  • the frequency-domain smoothing feature extraction module 930, configured to perform feature extraction on the original amplitude spectrum by using a frequency-dimension convolution kernel to obtain a frequency-domain smoothing feature map;
  • the combined feature extraction module 940 is configured to perform combined feature extraction on the original amplitude spectrum, the time-domain smoothed feature map and the frequency-domain smoothed feature map to obtain an enhanced amplitude spectrum of the original speech signal;
  • the signal inverse transformation module 950 is configured to perform time-frequency inverse transformation on the enhanced amplitude spectrum to obtain an enhanced speech signal.
  • the time-domain smoothing feature extraction module 920 includes:
  • a time-domain smoothing parameter matrix determination module configured to determine the time-domain smoothing parameter matrix according to the convolution sliding window and the time-domain smoothing factor;
  • a first weight matrix determination module configured to perform a product operation on the time-domain smoothing parameter matrix to obtain a weight matrix of the time-dimension convolution kernel
  • a time-domain operation module configured to perform a product operation on the weight matrix of the time-dimension convolution kernel and the original amplitude spectrum to obtain the time-domain smoothing feature map.
  • the frequency domain smoothing feature extraction module 930 includes:
  • a frequency-domain smoothing parameter matrix determination module configured to determine the frequency-domain smoothing parameter matrix according to the convolution sliding window and the frequency-domain smoothing factor;
  • a second weight matrix determination module configured to perform a product operation on the frequency-domain smoothing parameter matrix to obtain a weight matrix of the frequency-dimensional convolution kernel
  • a frequency domain operation module configured to perform a product operation on the weight matrix of the frequency dimension convolution kernel and the transposed matrix of the original amplitude spectrum to obtain the frequency domain smoothing feature map.
  • the combined feature extraction module 940 includes:
  • an input signal acquisition module configured to combine the original amplitude spectrum of the original speech signal, the time-domain smoothing feature map and the frequency-domain smoothing feature map to obtain the speech signal to be enhanced;
  • a weight matrix training module configured to take the to-be-enhanced speech signal as the input of the deep neural network, and to train the weight matrices of the time-dimension convolution kernel and the frequency-dimension convolution kernel by using the back-propagation algorithm;
  • the enhanced amplitude spectrum acquisition module is configured to perform combined feature extraction on the to-be-enhanced speech signal according to the weight matrix obtained by training to obtain the enhanced amplitude spectrum of the original speech signal.
  • the signal transformation module 910 includes:
  • a signal preprocessing module configured to perform windowing and framing processing on the original speech signal to obtain framed speech signals;
  • an original amplitude spectrum acquisition module configured to perform a discrete Fourier transform on each frame of the speech signal, and to perform a modulus operation on the transformed speech signal to obtain the original amplitude spectrum of the original speech signal.
  • the signal inverse transformation module 950 includes:
  • an original phase spectrum acquisition module configured to perform a phase-angle operation on the transformed speech signal to obtain the original phase spectrum of the original speech signal;
  • the enhanced speech signal acquisition module is configured to perform inverse time-frequency transform on the enhanced amplitude spectrum and the original phase spectrum of the original speech signal to obtain the enhanced speech signal.
  • the speech enhancement apparatus 900 is further configured such that:
  • the original amplitude spectrum of the original speech signal obeys a two-dimensional Gaussian distribution in the complex domain.
  • Although several modules or units of the apparatus for action execution are mentioned in the above detailed description, this division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided into multiple modules or units to be embodied.

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)
  • Complex Calculations (AREA)

Abstract

A neural network-based speech enhancement method and apparatus, a storage medium, and an electronic device, relating to the field of speech signal processing. The method includes: performing a time-frequency transform on an original speech signal to obtain an original amplitude spectrum of the original speech signal (S310); performing feature extraction on the original amplitude spectrum by using a time-dimension convolution kernel to obtain a time-domain smoothed feature map (S320); performing feature extraction on the original amplitude spectrum by using a frequency-dimension convolution kernel to obtain a frequency-domain smoothed feature map (S330); performing combined feature extraction on the original amplitude spectrum, the time-domain smoothed feature map and the frequency-domain smoothed feature map to obtain an enhanced amplitude spectrum of the original speech signal (S340); and performing an inverse time-frequency transform on the enhanced amplitude spectrum to obtain an enhanced speech signal (S350). By extracting time-frequency smoothing features from the original speech signal, dual-axis noise reduction on the time axis and the frequency axis can be realized, and, combined with a deep neural network, self-learning of the noise reduction parameters can be realized, further improving the quality of the speech signal.

Description

Neural network-based speech enhancement method and apparatus, and electronic device
CROSS-REFERENCE TO RELATED APPLICATIONS
The present disclosure claims priority to Chinese Patent Application No. 202110245564.1, entitled "Neural Network-Based Speech Enhancement Method, Apparatus and Electronic Device" and filed on March 5, 2021, the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
The present disclosure relates to the field of speech signal processing, and in particular to a neural network-based speech enhancement method, a speech enhancement apparatus, a computer-readable storage medium, and an electronic device.
BACKGROUND
In recent years, with the rapid development of deep learning technology, the recognition performance of speech recognition technology has also improved greatly; in noise-free scenarios, its recognition accuracy has reached a standard at which it can replace manual work.
At present, speech recognition technology can mainly be applied to scenarios such as intelligent customer service, transcription of meeting recordings, and intelligent hardware. However, when the background environment is noisy, for example the ambient noise around a user during an intelligent customer service call or the background noise in a recorded meeting, speech recognition technology may, under the influence of such noise, be unable to accurately recognize the speaker's semantics, which in turn affects the overall accuracy of speech recognition.
Therefore, how to improve speech recognition accuracy in noisy conditions has become the next difficulty that speech recognition technology needs to overcome.
It should be noted that the information disclosed in this Background section is only intended to enhance the understanding of the background of the present disclosure, and therefore may include information that does not constitute prior art known to a person of ordinary skill in the art.
SUMMARY
According to a first aspect of the present disclosure, a neural network-based speech enhancement method is provided, including:
performing a time-frequency transform on an original speech signal to obtain an original amplitude spectrum of the original speech signal;
performing feature extraction on the original amplitude spectrum by using a time-dimension convolution kernel to obtain a time-domain smoothed feature map;
performing feature extraction on the original amplitude spectrum by using a frequency-dimension convolution kernel to obtain a frequency-domain smoothed feature map;
performing combined feature extraction on the original amplitude spectrum, the time-domain smoothed feature map and the frequency-domain smoothed feature map to obtain an enhanced amplitude spectrum of the original speech signal;
performing an inverse time-frequency transform on the enhanced amplitude spectrum to obtain an enhanced speech signal.
In an exemplary embodiment of the present disclosure, performing feature extraction on the original amplitude spectrum by using the time-dimension convolution kernel to obtain the time-domain smoothed feature map includes:
determining a time-domain smoothing parameter matrix according to a convolution sliding window and a time-domain smoothing factor;
performing a product operation on the time-domain smoothing parameter matrix to obtain a weight matrix of the time-dimension convolution kernel;
performing a convolution operation on the weight matrix of the time-dimension convolution kernel and the original amplitude spectrum to obtain the time-domain smoothed feature map.
In an exemplary embodiment of the present disclosure, performing feature extraction on the original amplitude spectrum by using the frequency-dimension convolution kernel to obtain the frequency-domain smoothed feature map includes:
determining a frequency-domain smoothing parameter matrix according to a convolution sliding window and a frequency-domain smoothing factor;
performing a product operation on the frequency-domain smoothing parameter matrix to obtain a weight matrix of the frequency-dimension convolution kernel;
performing a convolution operation on the weight matrix of the frequency-dimension convolution kernel and the transposed matrix of the original amplitude spectrum to obtain the frequency-domain smoothed feature map.
In an exemplary embodiment of the present disclosure, performing combined feature extraction on the original amplitude spectrum, the time-domain smoothed feature map and the frequency-domain smoothed feature map to obtain the enhanced amplitude spectrum of the original speech signal includes:
merging the original amplitude spectrum of the original speech signal, the time-domain smoothed feature map and the frequency-domain smoothed feature map to obtain a to-be-enhanced speech signal;
taking the to-be-enhanced speech signal as the input of a deep neural network, and training the weight matrix of the time-dimension convolution kernel and the weight matrix of the frequency-dimension convolution kernel by using a back-propagation algorithm;
performing combined feature extraction on the to-be-enhanced speech signal according to the trained weight matrices to obtain the enhanced amplitude spectrum of the original speech signal.
In an exemplary embodiment of the present disclosure, performing the time-frequency transform on the original speech signal to obtain the original amplitude spectrum of the original speech signal includes:
performing windowing and framing processing on the original speech signal to obtain framed speech signals;
performing a discrete Fourier transform on each frame of the speech signal, and performing a modulus operation on the transformed speech signal to obtain the original amplitude spectrum of the original speech signal.
In an exemplary embodiment of the present disclosure, performing the inverse time-frequency transform on the enhanced amplitude spectrum to obtain the enhanced speech signal includes:
performing a phase-angle operation on the transformed speech signal to obtain an original phase spectrum of the original speech signal;
performing an inverse time-frequency transform on the enhanced amplitude spectrum of the original speech signal and the original phase spectrum to obtain the enhanced speech signal.
In an exemplary embodiment of the present disclosure, the original amplitude spectrum of the original speech signal obeys a two-dimensional Gaussian distribution in the complex domain.
According to a second aspect of the present disclosure, a speech enhancement apparatus is provided, including:
a signal transformation module configured to perform a time-frequency transform on an original speech signal to obtain an original amplitude spectrum of the original speech signal;
a time-domain smoothing feature extraction module configured to perform feature extraction on the original amplitude spectrum by using a time-dimension convolution kernel to obtain a time-domain smoothed feature map;
a frequency-domain smoothing feature extraction module configured to perform feature extraction on the original amplitude spectrum by using a frequency-dimension convolution kernel to obtain a frequency-domain smoothed feature map;
a combined feature extraction module configured to perform combined feature extraction on the original amplitude spectrum, the time-domain smoothed feature map and the frequency-domain smoothed feature map by using a deep neural network algorithm to obtain an enhanced amplitude spectrum of the original speech signal;
a signal inverse transformation module configured to perform an inverse time-frequency transform on the enhanced amplitude spectrum to obtain an enhanced speech signal.
According to a third aspect of the present disclosure, a computer-readable storage medium storing a computer program is provided, where the computer program, when executed by a processor, implements any one of the methods described above.
According to a fourth aspect of the present disclosure, an electronic device is provided, including: a processor; and a memory configured to store executable instructions of the processor; where the processor is configured to perform any one of the methods described above by executing the executable instructions.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings herein are incorporated into and constitute a part of this specification, illustrate embodiments consistent with the present disclosure, and serve, together with the specification, to explain the principles of the present disclosure. Obviously, the drawings described below are only some embodiments of the present disclosure, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 shows a schematic diagram of an exemplary system architecture to which a speech enhancement method and apparatus according to embodiments of the present disclosure can be applied;
Fig. 2 shows a schematic structural diagram of a computer system suitable for implementing an electronic device according to embodiments of the present disclosure;
Fig. 3 schematically shows a flowchart of a speech enhancement method according to an embodiment of the present disclosure;
Fig. 4 schematically shows a flowchart of time-domain smoothing feature extraction according to an embodiment of the present disclosure;
Fig. 5 schematically shows a flowchart of frequency-domain smoothing feature extraction according to an embodiment of the present disclosure;
Fig. 6 schematically shows a flowchart of enhanced amplitude spectrum acquisition according to an embodiment of the present disclosure;
Fig. 7 schematically shows a flowchart of speech enhancement according to an embodiment of the present disclosure;
Figs. 8A-8B schematically show the combination of a TFDAL module with a U-Net deep neural network according to a specific embodiment of the present disclosure;
Fig. 9 schematically shows a block diagram of a speech enhancement apparatus according to an embodiment of the present disclosure.
DETAILED DESCRIPTION
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the example embodiments can be implemented in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that the present disclosure will be more thorough and complete and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures or characteristics may be combined in one or more embodiments in any suitable manner. In the following description, many specific details are provided to give a full understanding of the embodiments of the present disclosure. However, those skilled in the art will appreciate that the technical solutions of the present disclosure may be practiced while omitting one or more of the specific details, or other methods, components, apparatuses, steps, and the like may be adopted. In other cases, well-known technical solutions are not shown or described in detail to avoid obscuring aspects of the present disclosure.
In addition, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and their repeated description will be omitted. Some of the block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor apparatuses and/or microcontroller apparatuses.
Fig. 1 shows a schematic diagram of the system architecture of an exemplary application environment to which a speech enhancement method and apparatus according to embodiments of the present disclosure can be applied.
As shown in Fig. 1, the system architecture 100 may include one or more of terminal devices 101, 102 and 103, a network 104, and a server 105. The network 104 is the medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber-optic cables. The terminal devices 101, 102, 103 may be various electronic devices with display screens, including but not limited to desktop computers, portable computers, smartphones and tablets. It should be understood that the numbers of terminal devices, networks and servers in Fig. 1 are merely illustrative; there may be any number of terminal devices, networks and servers according to implementation needs. For example, the server 105 may be a server cluster composed of multiple servers.
The speech enhancement method provided by the embodiments of the present disclosure is generally executed by the server 105, and accordingly, the speech enhancement apparatus is generally provided in the server 105. However, it is easily understood by those skilled in the art that the speech enhancement method provided by the embodiments of the present disclosure may also be executed by the terminal devices 101, 102, 103, and accordingly, the speech enhancement apparatus may also be provided in the terminal devices 101, 102, 103, which is not particularly limited in this exemplary embodiment.
Fig. 2 shows a schematic structural diagram of a computer system suitable for implementing an electronic device according to embodiments of the present disclosure.
It should be noted that the computer system 200 of the electronic device shown in Fig. 2 is merely an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in Fig. 2, the computer system 200 includes a central processing unit (CPU) 201, which can execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 202 or a program loaded from a storage section 208 into a random access memory (RAM) 203. Various programs and data required for system operation are also stored in the RAM 203. The CPU 201, the ROM 202 and the RAM 203 are connected to one another through a bus 204. An input/output (I/O) interface 205 is also connected to the bus 204.
The following components are connected to the I/O interface 205: an input section 206 including a keyboard, a mouse and the like; an output section 207 including a cathode-ray tube (CRT), a liquid crystal display (LCD) and the like, as well as a speaker; a storage section 208 including a hard disk and the like; and a communication section 209 including a network interface card such as a LAN card or a modem. The communication section 209 performs communication processing via a network such as the Internet. A drive 210 is also connected to the I/O interface 205 as needed. A removable medium 211, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is installed on the drive 210 as needed, so that a computer program read therefrom can be installed into the storage section 208 as needed.
In particular, according to embodiments of the present disclosure, the processes described below with reference to the flowcharts may be implemented as computer software programs. For example, an embodiment of the present disclosure includes a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the methods shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 209, and/or installed from the removable medium 211. When the computer program is executed by the central processing unit (CPU) 201, the various functions defined in the method and apparatus of the present disclosure are performed.
As another aspect, the present disclosure also provides a computer-readable medium. The computer-readable medium may be included in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the methods described in the following embodiments. For example, the electronic device can implement the steps shown in Figs. 3 to 7.
It should be noted that the computer-readable medium shown in the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example but not limited to, an electric, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination thereof. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in combination with an instruction execution system, apparatus or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate or transmit a program for use by or in combination with an instruction execution system, apparatus or device. The program code contained on the computer-readable medium may be transmitted by any appropriate medium, including but not limited to: wireless, wire, optical cable, RF, etc., or any suitable combination of the above.
The technical solutions of the embodiments of the present disclosure are elaborated below:
After the actually observed noisy speech signal y(n) is obtained, y(n) can be converted from a one-dimensional time-domain signal into two-dimensional time-frequency-domain STFT complex parameters through the Short-Time Fourier Transform (STFT). Since the STFT conversion process is invertible and the transform matrix is a full-rank matrix, the speech information is lossless.
In the time domain, the actually observed speech signal can be expressed as the sum of a clean speech signal and a noise signal, that is:
y(n) = x(n) + w(n)
where y(n) denotes the actually observed noisy speech signal, x(n) denotes the clean speech signal in the time domain, and w(n) denotes the noise signal in the time domain. In the time-frequency domain after the STFT, correspondingly:
Y(k, l) = X(k, l) + W(k, l)
where Y(k, l) denotes the STFT parameters of the noisy speech signal, X(k, l) denotes the STFT parameters of the clean speech signal, W(k, l) denotes the STFT parameters of the noise signal, k denotes the k-th frequency bin on the frequency axis, and l denotes the l-th time frame on the time axis.
Assuming the noise signal is Gaussian white noise, its time-domain amplitude obeys a zero-mean Gaussian distribution, i.e., its probability density function is
p(w) = (1 / sqrt(2πN)) · exp(−w² / (2N))
where w denotes the time-domain amplitude and N denotes the variance. In this case, after the STFT, W(k, l) is isotropic in the time-frequency domain, i.e., Gaussian white noise has the same properties along the time axis T and the frequency axis F. Similarly, a general assumption can also be made about the STFT parameters W(k, l) of the noise: their probability density function (PDF) obeys a two-dimensional Gaussian distribution in the complex domain.
When the probability density function of the noise signal obeys a two-dimensional Gaussian distribution along the time axis and the frequency axis, noise reduction of the speech signal can be achieved by solving for a gain function G(k, l). The gain function can be set as a time-varying and frequency-dependent function, i.e., it takes a different value for each time frame l and each frequency bin k. From this gain function and the noisy speech signal Y(k, l), the STFT parameters X̂(k, l) of the predicted clean speech signal x̂(n) can be obtained, namely:
X̂(k, l) = G(k, l) · Y(k, l)
The gain function G(k, l) is related to the speech presence probability, and correspondingly there can be speech-absent segments and speech-present segments. Suppose the speech-absent hypothesis in the k-th frequency bin and the l-th time frame is H0(k, l), and the speech-present hypothesis is H1(k, l). When only the noise signal exists, the segment is a speech-absent segment; when a clean speech signal exists in addition to the noise signal, the segment is a speech-present segment. The observed noisy speech signal can then be expressed piecewise as:
H0(k, l): Y(k, l) = W(k, l)
H1(k, l): Y(k, l) = X(k, l) + W(k, l)
Correspondingly, the noisy speech signal Y(k, l) can be described by the conditional probabilities p(H0(k, l) | Y(k, l)) and p(H1(k, l) | Y(k, l)), where p(H0(k, l) | Y(k, l)) is the posterior probability of speech absence at each time-frequency point estimated from Y(k, l), and p(H1(k, l) | Y(k, l)) is the posterior probability of speech presence at each time-frequency point estimated from Y(k, l); that is, the speech-present segments and speech-absent segments can be determined from Y(k, l). The STFT parameters X̂(k, l) of the predicted clean speech signal can thus be obtained with different gain functions G(k, l). Specifically, X̂(k, l) is predicted by applying to Y(k, l) a gain weighted by the speech presence posterior p(k, l), where p(k, l) is the posterior probability of speech presence, i.e., the probability that speech is present given Y(k, l). It follows that, by adjusting the gain function G(k, l) over different time segments and frequency bins, different noise reduction methods can be realized for different speech presence probabilities, i.e., different smoothing strategies in speech-present segments and speech-absent segments, thereby realizing a time-varying, frequency-dependent smoothing algorithm.
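As a small, hedged illustration of this gain mechanism, the following NumPy sketch applies a time-varying, frequency-dependent gain to a noisy STFT matrix. The specific form of G below (a linear interpolation toward a floor gain G_min driven by the speech presence posterior) is our own stand-in for illustration, not the formula of the source:

```python
import numpy as np

rng = np.random.default_rng(0)
# Dummy noisy STFT parameters Y(k, l): 257 frequency bins x 100 time frames
Y = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))
p = rng.uniform(size=Y.shape)        # stand-in for the speech presence posterior p(k, l)

G_min = 0.1                          # floor gain applied in speech-absent segments (assumed)
G = G_min + (1.0 - G_min) * p        # gain grows with the speech presence probability
X_hat = G * Y                        # predicted clean STFT parameters X̂(k, l)
```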
In this method, since the calculation formula of the gain function G(k, l) and the time-varying, frequency-dependent way in which it changes are algorithms developed from expert knowledge, the algorithm has limitations in enhancing the speech signal when the variety of noise types and the amount of data grow. In addition, speech enhancement algorithms based on deep neural networks (DNNs) have shortcomings of their own, such as a lack of expert knowledge, limited model interpretability, and a lack of targeted model structure design.
To address one or more of the above problems, this example embodiment provides a neural network-based speech enhancement method. The method may be applied to the above server 105, or to one or more of the above terminal devices 101, 102 and 103, which is not particularly limited in this exemplary embodiment. Referring to Fig. 3, the speech enhancement method may include the following steps S310 to S350:
Step S310. Perform a time-frequency transform on an original speech signal to obtain an original amplitude spectrum of the original speech signal;
Step S320. Perform feature extraction on the original amplitude spectrum by using a time-dimension convolution kernel to obtain a time-domain smoothed feature map;
Step S330. Perform feature extraction on the original amplitude spectrum by using a frequency-dimension convolution kernel to obtain a frequency-domain smoothed feature map;
Step S340. Perform combined feature extraction on the original amplitude spectrum, the time-domain smoothed feature map and the frequency-domain smoothed feature map to obtain an enhanced amplitude spectrum of the original speech signal;
Step S350. Perform an inverse time-frequency transform on the enhanced amplitude spectrum to obtain an enhanced speech signal.
In the speech enhancement method provided by this example embodiment of the present disclosure, the original amplitude spectrum of the original speech signal is obtained by performing a time-frequency transform on the original speech signal; feature extraction is performed on the original amplitude spectrum by using a time-dimension convolution kernel to obtain a time-domain smoothed feature map; feature extraction is performed on the original amplitude spectrum by using a frequency-dimension convolution kernel to obtain a frequency-domain smoothed feature map; combined feature extraction is performed on the original amplitude spectrum, the time-domain smoothed feature map and the frequency-domain smoothed feature map to obtain an enhanced amplitude spectrum of the original speech signal; and an inverse time-frequency transform is performed on the enhanced amplitude spectrum to obtain an enhanced speech signal. On the one hand, time-frequency smoothing features are extracted over the two-dimensional combination of the time axis and the frequency axis by a convolutional neural network, and, combined with a deep neural network, self-learning of the noise reduction parameters can be realized, further improving the quality of the speech signal; on the other hand, based on the statistical characteristics of the speech signal on the time axis and the frequency axis, dual-axis noise reduction on the time axis and the frequency axis can be realized, thereby achieving speech enhancement in a variety of complex noise environments.
The above steps of this example embodiment are described in more detail below.
In step S310, a time-frequency transform is performed on the original speech signal to obtain the original amplitude spectrum of the original speech signal.
Interference from environmental noise is unavoidable in speech communication, and the actually observed original speech signal is generally a noisy speech signal, which is a non-stationary, time-varying signal. Time-domain analysis of the original speech signal processes the speech waveform to obtain a series of time-varying features; it relies on the short-time invariance of the speech signal, i.e., the various characteristics of the speech signal remain unchanged within a short period. However, enhancement is generally performed on the time-frequency-domain speech signal; therefore, the one-dimensional time-domain speech signal can be transformed into a two-dimensional time-frequency-domain speech signal, so as to facilitate the extraction of clean speech from the noisy speech signal.
In an example embodiment, since the Fourier transform does not change the statistical characteristics of the speech signal, the original speech signal can be transformed into a time-frequency-domain speech signal through, for example, the short-time Fourier transform. For short-time analysis, the original speech signal can be divided into frames. The specific frame length can be set according to the actual situation; for example, the frame length can be set to 32 ms, i.e., the sample points of every 32 ms form one frame. At a sampling rate of 8 kHz, one frame corresponds to 256 sample points; in this embodiment, a sampling rate of 16 kHz is preferred, so that one frame contains 512 sample points. The short-time Fourier transform is fast and computationally light. In other example embodiments, the time-domain speech signal can also be converted into a time-frequency-domain speech signal through the discrete cosine transform, or the original speech signal can be filtered by an auditory filter bank such as a Gammatone filter bank to obtain a time-frequency-domain speech signal, which can then reflect how the spectrum of the speech signal changes within a certain period of time.
Illustratively, the original speech signal can be divided into multiple short segments by windowing; each short segment is called a frame, and adjacent frames overlap. For example, a window function can be used in the time domain to intercept the signal, and a Fourier transform is applied to the intercepted local signal. Specifically, the original speech signal can be multiplied by a time window function to intercept the signal and obtain multiple frames of the speech signal. The time window function may be a rectangular window, a Hamming window, a Hanning window, a Bartlett window, or the like. In addition, in order to avoid losing information about the dynamic changes of the speech signal as much as possible, a sliding window can be used, i.e., there is a certain overlap between frames, called the window shift; the window shift can be half the window length. Illustratively, when the window length is 25 ms, the window shift can be 10 ms.
After the original speech signal is framed, a discrete Fourier transform can be performed on each frame of the speech signal; for instance, the center position of the time window function can be moved continuously to obtain the Fourier transform of each frame. Because the discrete Fourier transform is symmetric, only half of the points of the discrete Fourier transform result of each frame need to be taken as the short-time Fourier transform result of that frame, and the collection of short-time Fourier transform results is the time-frequency transform result of the original speech signal.
After the original speech signal has been short-time-Fourier-transformed into a time-frequency-domain speech signal, the value of the time-frequency-domain speech signal at each frequency point is a complex number, which can be determined by its modulus and argument, so the time-frequency-domain speech signal can be decomposed into an amplitude spectrum and a phase spectrum, where the amplitude spectrum is the modulus of the complex number as a function of frequency, and the phase spectrum is the argument of the complex number as a function of frequency.
For example, a modulus operation can be performed on the time-frequency-domain speech signal Y(k, l) to obtain the original amplitude spectrum of the original speech signal, i.e.:
|Y(k, l)|² = Img(Y(k, l))² + Real(Y(k, l))²
where |Y(k, l)| is the original amplitude spectrum of the time-frequency-domain speech signal; since the speech information is lossless after the Fourier transform, |Y(k, l)| is also the original amplitude spectrum of the original speech signal. Real(Y(k, l)) is the real part of the time-frequency-domain speech signal, and Img(Y(k, l)) is its imaginary part.
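A minimal SciPy sketch of this step, assuming a 16 kHz mono signal in a NumPy array (the dummy random signal and the variable names are ours, for illustration only):

```python
import numpy as np
from scipy.signal import stft

fs = 16000                       # preferred sampling rate in this embodiment
frame_len = 512                  # 32 ms frame at 16 kHz
hop = frame_len // 2             # window shift of half the window length

# y stands for the observed noisy signal y(n); here a dummy one-second signal
y = np.random.randn(fs).astype(np.float32)

# STFT: Y(k, l) with k frequency bins and l time frames
_, _, Y = stft(y, fs=fs, window='hamming', nperseg=frame_len, noverlap=frame_len - hop)

mag = np.abs(Y)                  # original amplitude spectrum |Y(k, l)|
phase = np.angle(Y)              # original phase spectrum, reused later in step S350
```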
In addition, the original amplitude spectrum of the original speech signal can be assumed to obey a two-dimensional Gaussian distribution in the complex domain. It can be understood that, for a noise signal contained therein such as white noise, which is one kind of stationary noise, its probability density distribution can also be assumed to obey a two-dimensional Gaussian distribution along the time axis and the frequency axis, i.e., it has statistical properties on both the time axis and the frequency axis, which facilitates noise reduction on both axes.
After the original amplitude spectrum of the original speech signal is obtained, it can be fed into a deep neural network to extract time-varying, frequency-dependent features. For example, based on the correlation between adjacent frames and adjacent frequency bands of the time-frequency-domain speech signal, local features of the signal can be computed by smoothing in both the time and frequency dimensions.
In an example embodiment, after the noisy speech signal is converted from the one-dimensional time-domain signal to the two-dimensional frequency-domain STFT parameters, i.e., from y(n) = x(n) + w(n) to Y(k, l) = X(k, l) + W(k, l), noise reduction can be performed on the time-frequency-domain speech signal to enhance the original speech signal. For example, a deep neural network model can be used for speech enhancement; when a smoothing algorithm is used to denoise the time-frequency-domain speech signal, the smoothing algorithm can be merged into a two-dimensional convolution module of the deep neural network. Since a single convolution module corresponds to the extraction of a single feature, with weights kept unchanged during sliding, single-feature extraction can be performed on the whole input Y(k, l). To realize time-varying, frequency-dependent, piecewise extraction of different features, multiple convolution kernels can first be used to extract features, and the features can then be combined.
Illustratively, this two-dimensional convolution module can be a TFDAL (Time-Frequency Domain Averaging Layer) module, and the TFDAL module can in turn contain two sub-modules, a Time-Dimensional Averaging Module (TAM) and a Frequency-Dimensional Averaging Module (FAM), corresponding to noise smoothing along the time axis and along the frequency axis, respectively.
In step S320, feature extraction is performed on the original amplitude spectrum by using a time-dimension convolution kernel to obtain a time-domain smoothed feature map.
The original amplitude spectrum can be taken as the input of the TAM module, and the original speech signal can be filtered by the TAM module, i.e., noise smoothing along the time axis is performed. For example, a weighted moving average can be used to predict the amplitude spectrum information at each time point along the time axis to be smoothed, where the weighted moving average predicts future values according to the degree of influence (corresponding to different weights) that data at different times within the same moving segment have on the predicted value.
Referring to Fig. 4, noise smoothing along the time axis can be performed according to steps S410 to S430:
Step S410. Determine a time-domain smoothing parameter matrix according to a convolution sliding window and a time-domain smoothing factor.
In an example embodiment, the smoothing of the time domain by the TAM module can be realized through a sliding window, and the corresponding smoothing algorithm can be:
T_α(k, l) = (1 − α) · Σ_{i=1..D} α^(D−i) · Y(k, l − D + i)
where:
l denotes the l-th time frame on the time axis;
k denotes the k-th frequency bin on the frequency axis;
D denotes the sliding-window width, which can be set according to the actual situation; in this example, the sliding-window width is preferably set to 32 frames;
α is the smoothing factor, which indicates the degree to which the amplitude spectra of the historical time frames within the sliding-window width are utilized when the signal is smoothed along the time axis; [α_0 … α_N] are different smoothing factors, each taking a value in [0, 1], and, corresponding to the values of α, the number of convolution kernels in the TAM module can be N;
Y(k, l − D + i), with i ∈ [1, D], denotes the amplitude spectra of the historical time frames within the sliding-window width. In this example, the amplitude spectrum of every historical time frame can be utilized; illustratively, the smoothed amplitude spectrum at frame 32 can be composed of the amplitude spectra of the preceding 31 frames within the sliding window;
in addition, the farther a time point is from l, the smaller the value of α^(D−i) and the smaller the weight of the amplitude spectrum at that time point; the closer it is to l, the larger the value of α^(D−i) and the larger the weight;
T_α(k, l) denotes the new amplitude spectrum obtained by superimposing the amplitude spectra of the historical time frames within the sliding-window width, i.e., the amplitude spectrum obtained by time-domain smoothing.
It can be understood that, in the TAM module, the time-domain smoothing parameter matrix can be determined according to the convolution sliding window and the time-domain smoothing factor; that is, a first time-domain smoothing parameter matrix [α^0 … α^(D−1)] and a second time-domain smoothing parameter matrix [1 − α] can be determined from the sliding-window width D and the time-domain smoothing factor α = [α_0 … α_N].
Step S420. Perform a product operation on the time-domain smoothing parameter matrix to obtain the weight matrix of the time-dimension convolution kernel.
In an example embodiment, before time-domain feature extraction is performed on the original input image, the weight matrix of the time-dimension convolution kernel can be determined first. When the time axis is smoothed, there can correspondingly be N convolution kernels in the TAM module, each corresponding to a different smoothing factor, where the smoothing factors are α = [α_0 … α_N]; the first time-domain smoothing parameter matrix corresponding to each convolution kernel can be [α^0 … α^(D−1)], and, combined with the second time-domain smoothing parameter matrix [1 − α], the final weight matrix of the time-dimension convolution kernel can be obtained by, for example, taking the product of the first and second time-domain smoothing parameter matrices:
N(α) = (1 − α) · [α^(D−1), …, α^1, α^0]
Step S430. Perform a convolution operation on the weight matrix of the time-dimension convolution kernel and the original amplitude spectrum to obtain the time-domain smoothed feature map.
Since the original amplitude spectrum of the time-frequency-domain speech signal has the same size as the original input image and is also two-dimensional, it can be taken as the frequency-domain map of the original input image. Then, features can be constructed and extracted by statistical methods; specifically, all pixels in the original input image can be smoothed in turn to obtain the filtered image. To ensure that the filtered image has the same size as the original input image, edge pixels of the original input image whose neighborhood would exceed the image region can be pre-padded by zero-padding, symmetric padding with adjacent pixels, or similar methods.
In an example embodiment, the original amplitude spectrum of the time-frequency-domain speech signal can be taken as the original input image; this spectrogram can be a T × F two-dimensional image matrix, with T the time dimension and F the frequency dimension, and a product operation can be performed on this two-dimensional image matrix and the weight matrix N(α) of the time-dimension convolution kernel to obtain the time-domain smoothed feature map.
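A minimal NumPy/SciPy sketch of this time-axis smoothing, reusing `mag` from the STFT sketch above (the helper name is ours; the weighted moving average is realized as an equivalent causal FIR filter, with the frames before l = 0 implicitly zero-padded, matching the pre-padding described above):

```python
import numpy as np
from scipy.signal import lfilter

def time_smooth(mag, alpha, D=32):
    """TAM-style weighted moving average along the time axis.

    Computes T_alpha(k, l) = (1 - alpha) * sum_{i=1..D} alpha**(D - i) * mag[k, l - D + i],
    i.e. a causal FIR filter whose taps are (1 - alpha) * alpha**j for j = 0..D-1.
    """
    taps = (1.0 - alpha) * alpha ** np.arange(D)  # weight of the frame j steps in the past
    return lfilter(taps, [1.0], mag, axis=1)      # mag has shape (F, T)

# one smoothed feature map per smoothing factor, mirroring the N kernels of the TAM module
T_maps = np.stack([time_smooth(mag, a) for a in (0.5, 0.8, 0.95)])
```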
In step S330, feature extraction is performed on the original amplitude spectrum by using a frequency-dimension convolution kernel to obtain a frequency-domain smoothed feature map.
Meanwhile, the original amplitude spectrum can also be taken as the input of the FAM module, and the original speech signal can be filtered by the FAM module, i.e., noise smoothing along the frequency axis is performed. For example, a weighted moving average can be used to predict the amplitude spectrum information of each frequency bin along the frequency axis to be smoothed. Referring to Fig. 5, noise smoothing along the frequency axis can be performed with the weighted moving average according to steps S510 to S530:
Step S510. Determine a frequency-domain smoothing parameter matrix according to a convolution sliding window and a frequency-domain smoothing factor.
In an example embodiment, the smoothing of the frequency domain by the FAM module can be realized through a sliding window, and the corresponding smoothing algorithm can be:
F_β(k, l) = (1 − β) · Σ_{i=1..D} β^(D−i) · Y(k − D + i, l)
where:
k denotes the k-th frequency bin on the frequency axis;
l denotes the l-th time frame on the time axis;
D denotes the sliding-window width, which can be set according to the actual situation; in this example, the sliding-window width is preferably set to 32 frames;
β is the smoothing factor, which indicates the degree to which the historical amplitude spectra within the sliding-window width are utilized when the signal is smoothed along the frequency axis; [β_0 … β_M] are different smoothing factors, each taking a value in [0, 1], and, corresponding to the values of β, the number of convolution kernels in the FAM module can be M;
Y(k − D + i, l), with i ∈ [1, D], denotes the historical amplitude spectra within the sliding-window width. In this example, every historical amplitude spectrum can be utilized; illustratively, the smoothed amplitude spectrum at the 32nd bin within the sliding window can be composed of the amplitude spectra of the preceding 31 bins within the window;
likewise, the farther a frequency bin is from k, the smaller the value of β^(D−i) and the smaller the weight of the corresponding historical amplitude spectrum; the closer it is to k, the larger the value of β^(D−i) and the larger the weight;
F_β(k, l) denotes the new amplitude spectrum obtained by superimposing the historical amplitude spectra within the sliding-window width, i.e., the amplitude spectrum obtained by frequency-domain smoothing.
It can be understood that, in the FAM module, the frequency-domain smoothing parameter matrix can be determined according to the convolution sliding window and the frequency-domain smoothing factor; that is, a first frequency-domain smoothing parameter matrix [β^0 … β^(D−1)] and a second frequency-domain smoothing parameter matrix [1 − β] can be determined from the sliding-window width D and the frequency-domain smoothing factor β = [β_0 … β_M].
Step S520. Perform a product operation on the frequency-domain smoothing parameter matrix to obtain the weight matrix of the frequency-dimension convolution kernel.
As the frequency changes, the distribution of the frequency-domain map also changes; corresponding feature vectors can be constructed, with each dimension representing the distribution characteristics of a different region. In an example embodiment, before frequency-domain feature extraction is performed on the original input image, the weight matrix of the frequency-dimension convolution kernel can be determined first. When the frequency axis is smoothed, there can correspondingly be M convolution kernels in the FAM module, each corresponding to a different smoothing factor, where the smoothing factors are β = [β_0 … β_M]; the first frequency-domain smoothing parameter matrix corresponding to each convolution kernel can be [β^0 … β^(D−1)], and, combined with the second frequency-domain smoothing parameter matrix [1 − β], the final weight matrix of the frequency-dimension convolution kernel can be obtained by, for example, taking the product of the first and second frequency-domain smoothing parameter matrices:
M(β) = (1 − β) · [β^(D−1), …, β^1, β^0]
Step S530. Perform a convolution operation on the weight matrix of the frequency-dimension convolution kernel and the transposed matrix of the original amplitude spectrum to obtain the frequency-domain smoothed feature map.
In an example embodiment, the transposed matrix of the original amplitude spectrum of the time-frequency-domain speech signal can be taken as the original input image, and a sliding-window convolution can be applied to it; the window of the convolution kernel of each channel is slid continuously so that multiple convolution operations are performed on the original input image. Illustratively, the transposed matrix of the original amplitude spectrum can be an F × T two-dimensional image matrix, with F the frequency dimension and T the time dimension, and a product operation can be performed on this two-dimensional image matrix and the weight matrix M(β) of the frequency-dimension convolution kernel to obtain the frequency-domain smoothed feature map.
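Since this is the same operation applied to the transposed spectrum, the `time_smooth` helper from the sketch above can simply be reused (again, our own naming):

```python
import numpy as np

def freq_smooth(mag, beta, D=32):
    # FAM: the same causal smoothing applied along the frequency axis,
    # realized by transposing the spectrum, filtering, and transposing back
    return time_smooth(mag.T, beta, D).T

F_maps = np.stack([freq_smooth(mag, b) for b in (0.5, 0.8, 0.95)])
```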
In this method, drawing on the idea of convolution kernels in convolutional neural networks, the noise reduction algorithm is made into convolution kernels, and, through the combination of multiple convolution kernels, time-varying, frequency-dependent noise reduction of the speech signal is realized within the neural network. In addition, since the probability density function of the noise W(k, l) is a two-dimensional Gaussian distribution with statistical properties on both the time axis and the frequency axis, noise reduction can be realized on both axes.
In step S340, combined feature extraction is performed on the original amplitude spectrum, the time-domain smoothed feature map and the frequency-domain smoothed feature map to obtain the enhanced amplitude spectrum of the original speech signal.
Referring to Fig. 6, the enhanced amplitude spectrum of the original speech signal can be obtained according to steps S610 to S630:
Step S610. Merge the original amplitude spectrum of the original speech signal, the time-domain smoothed feature map and the frequency-domain smoothed feature map to obtain the to-be-enhanced speech signal.
In the noisy speech signal Y(k, l) smoothed by the TAM and FAM modules, the noise signal W(k, l) is smoothed along both the time axis T and the frequency axis F. To better preserve the speech features of the original input, the features of the original input Y(k, l) can be concatenated with the output of the TFDAL module, which both preserves the features of the original speech signal and allows deep features to be learned.
Correspondingly, the input of the deep neural network can change from the original input Y(k, l) to a combined input, which can be a three-dimensional tensor C_i(k, l), stacked along the channel dimension:
C_i(k, l) = [Y(k, l); T_α(k, l); F_β(k, l)]
where Y(k, l) is a 1 × F × T tensor, equivalent to a filter with a smoothing factor of 0, i.e., the original information is left unprocessed and unchanged; T_α(k, l) is an M × F × T three-dimensional tensor; F_β(k, l) is an N × F × T three-dimensional tensor; and the merged to-be-enhanced speech signal C_i(k, l) is an (M + N + 1) × F × T three-dimensional tensor.
In this example, the TFDAL module augments the input of the neural network, giving the neural network more input information. Moreover, the TFDAL module combines the interpretability of a noise reduction algorithm developed from expert knowledge with the strong fitting ability gained after incorporation into a neural network; it is an interpretable neural network module that can effectively combine advanced signal processing algorithms from the speech noise reduction field with deep neural networks.
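A one-line sketch of this channel stacking, using the arrays built in the sketches above:

```python
import numpy as np

# C has shape (M + N + 1, F, T): the untouched spectrum plus every smoothed map
C = np.concatenate([mag[np.newaxis], T_maps, F_maps], axis=0)
```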
Step S620. Take the to-be-enhanced speech signal as the input of the deep neural network, and train the weight matrix of the time-dimension convolution kernel and the weight matrix of the frequency-dimension convolution kernel by using the back-propagation algorithm.
In an example embodiment, the TFDAL module can be incorporated into a deep neural network model so as to train the weight matrix N(α) of the time-dimension convolution kernel and the weight matrix M(β) of the frequency-dimension convolution kernel, as well as the weight factors of each layer in the model. For example, after the weight matrices of the time-dimension and frequency-dimension convolution kernels in the TFDAL module are initialized, the TFDAL module can be combined with network models such as convolutional neural networks, recurrent neural networks and fully connected neural networks to realize gradient propagation. It can be understood that the training objective of the neural network determines the final value of every element in the convolution kernels.
Illustratively, the training process of the neural network model can adopt the back-propagation algorithm: the parameters can be initialized randomly and updated continuously as training deepens. For example, the BP (error Back Propagation) algorithm can be adopted. Specifically, the output of the output layer can be computed from the original input, layer by layer from front to back; the gap between the current output and the target output, i.e., the loss function, can then be computed; the loss function can be minimized with the gradient descent algorithm, the Adam optimization algorithm and the like, and the parameters are updated in turn from back to front, that is, the weight matrix N(α) of the time-dimension convolution kernel and the weight matrix M(β) of the frequency-dimension convolution kernel are updated in turn.
The gradient descent algorithm can be stochastic gradient descent, mini-batch gradient descent or batch gradient descent, used to minimize the error between the noisy speech and the clean speech. Batch gradient descent uses all samples to update each parameter; stochastic gradient descent uses one sample for each update and performs many updates, so that when the sample size is large, the optimum can be approached iteratively using only a small portion of the samples; mini-batch gradient descent uses a portion of the samples for each update, combining the characteristics of stochastic gradient descent and batch gradient descent.
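A hedged PyTorch sketch of this self-learning of the smoothing factors (module, parameter and variable names are ours; the real model would be the U-Net described later, replaced here by a single 1×1 convolution just to show the gradient flowing back into α and β):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TFDAL(nn.Module):
    """Builds TAM/FAM kernels from learnable smoothing factors on every forward pass."""
    def __init__(self, n_time=3, n_freq=3, D=32):
        super().__init__()
        self.D = D
        self.alpha_raw = nn.Parameter(torch.randn(n_time))  # sigmoid keeps factors in (0, 1)
        self.beta_raw = nn.Parameter(torch.randn(n_freq))

    def _smooth(self, x, factor_raw, dim):
        factor = torch.sigmoid(factor_raw)
        j = torch.arange(self.D, device=x.device, dtype=x.dtype)
        taps = (1 - factor)[:, None] * factor[:, None] ** j    # (n, D) FIR taps
        k = taps.flip(-1).view(-1, 1, 1, self.D)               # causal kernels for conv2d
        x = x.transpose(-1, -2) if dim == "freq" else x        # smooth along the last dim
        x = F.conv2d(F.pad(x, (self.D - 1, 0)), k)             # left-pad => causal filtering
        return x.transpose(-1, -2) if dim == "freq" else x

    def forward(self, mag):                                    # mag: (B, 1, F, T)
        t = self._smooth(mag, self.alpha_raw, "time")
        f = self._smooth(mag, self.beta_raw, "freq")
        return torch.cat([mag, t, f], dim=1)                   # (B, M + N + 1, F, T)

tfdal = TFDAL()
head = nn.Conv2d(7, 1, kernel_size=1)   # stand-in for the downstream network
opt = torch.optim.Adam(list(tfdal.parameters()) + list(head.parameters()), lr=1e-3)

noisy = torch.rand(4, 1, 201, 201)      # dummy noisy amplitude spectra |Y(k, l)|
clean = torch.rand(4, 1, 201, 201)      # dummy clean targets

opt.zero_grad()
loss = F.mse_loss(head(tfdal(noisy)), clean)  # error between enhanced and clean speech
loss.backward()                               # back-propagation reaches alpha and beta too
opt.step()
```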
Step S630. Perform combined feature extraction on the to-be-enhanced speech signal according to the trained weight matrices to obtain the enhanced amplitude spectrum of the original speech signal.
When the to-be-enhanced speech signal C_i(k, l) is taken as the input of the neural network model, during training the model can, by learning the weight matrix of the original input, the weight matrix N(α) of the time-dimension convolution kernel and the weight matrix M(β) of the frequency-dimension convolution kernel, adaptively perform combined feature extraction on the original input Y(k, l), on the time-domain smoothed feature maps of each layer in T_α(k, l) output by the TAM module, and on the frequency-domain smoothed feature maps of each layer in F_β(k, l) output by the FAM module, so as to obtain the enhanced amplitude spectrum of the original speech signal and thereby realize different smoothing effects in speech-present segments and speech-absent segments.
In this example, the two-dimensional convolution structure can be successfully incorporated into a deep neural network model and can be combined with convolutional neural networks, recurrent neural networks and fully connected neural networks to realize gradient propagation. As a result, the convolution kernel parameters inside the TFDAL module, i.e., the parameters of the noise reduction algorithm, can be driven by data, and a statistically optimal value can be obtained without expert knowledge as prior information.
In step S350, an inverse time-frequency transform is performed on the enhanced amplitude spectrum to obtain the enhanced speech signal.
Speech enhancement predicts the amplitude spectrum and phase spectrum of the clean speech signal x̂(n). Since the phase spectrum has little influence on the denoising effect, in an example embodiment only the original amplitude spectrum of the time-frequency-domain speech signal is enhanced, while the phase of Y(k, l) is reused; therefore, the original phase spectrum of Y(k, l) can be obtained first.
For example, a phase-angle operation can be performed on the transformed speech signal to obtain the original phase spectrum of the original speech signal:
∠Y(k, l) = arctan( Img(Y(k, l)) / Real(Y(k, l)) )
where ∠Y(k, l) is the original phase spectrum of the time-frequency-domain speech signal, Real(Y(k, l)) is the real part of the time-frequency-domain speech signal, and Img(Y(k, l)) is its imaginary part.
After the original phase spectrum of the original speech signal is obtained, an inverse time-frequency transform can be performed on the enhanced amplitude spectrum and the original phase spectrum of the original speech signal to obtain the enhanced speech signal. Specifically, the enhanced amplitude spectrum and the original phase spectrum can be synthesized into a complex-domain spectrum, whose dimensions are the same as those of the real-part and imaginary-part spectra. Then, an inverse discrete Fourier transform is performed on the complex-domain spectrum to obtain the corresponding time-domain speech signal, and the enhanced speech signal can be obtained by the overlap-add method.
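Continuing the SciPy sketch above (`enhanced_mag` stands for the amplitude spectrum produced by the network; its name and the placeholder value are ours):

```python
import numpy as np
from scipy.signal import istft

enhanced_mag = mag                      # placeholder for the network's enhanced output
Z = enhanced_mag * np.exp(1j * phase)   # synthesize the complex spectrum with the original phase
_, y_enhanced = istft(Z, fs=fs, window='hamming', nperseg=frame_len, noverlap=frame_len - hop)
# scipy's istft performs the inverse DFT plus the overlap-add reconstruction described above
```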
Fig. 7 schematically shows a flowchart of speech enhancement involving the TFDAL module and a deep neural network, where the TFDAL module includes the TAM module and the FAM module. The process can include steps S701 to S708:
Step S701. Input the speech signal y(n), which is a noisy speech signal;
Step S702. Apply the STFT to the speech signal: perform the STFT on the noisy speech signal y(n) to obtain the time-frequency-domain speech signal Y(k, l);
Step S703. Modulus operation: perform a modulus operation on the time-frequency-domain speech signal Y(k, l) to obtain the amplitude information of the speech signal, i.e., the original amplitude spectrum |Y(k, l)|;
Step S704. Feed the original amplitude spectrum into the TAM module and extract time-domain smoothing features from it to obtain the amplitude spectrum T(k, l) denoised along the time axis;
Step S705. Feed the original amplitude spectrum into the FAM module and extract frequency-domain smoothing features from its transposed matrix to obtain the amplitude spectrum F(k, l) denoised along the frequency axis;
Step S706. Merge the original amplitude spectrum |Y(k, l)|, the time-axis-denoised amplitude spectrum T(k, l) and the frequency-axis-denoised amplitude spectrum F(k, l), and feed them into the deep neural network for combined feature extraction to obtain the enhanced amplitude spectrum of the speech signal;
Step S707. Take the phase information: perform a phase-angle operation on the time-frequency-domain speech signal Y(k, l) to obtain the noisy phase spectrum ∠Y(k, l) of the speech signal;
Step S708. Perform the ISTFT on the enhanced amplitude spectrum and the noisy phase spectrum of the speech signal to obtain the enhanced speech signal.
In this example, during speech enhancement, time-frequency smoothing feature extraction over the two-dimensional combination of the time axis and the frequency axis can be realized through the convolutional neural network; by incorporating the TFDAL module into the neural network model, self-learning of the smoothing parameters (i.e., the weights of the convolution kernels) can be realized through gradient back-propagation, without manual setting.
In a specific example embodiment, Fig. 8A schematically shows the combination of a TFDAL module with a U-Net deep neural network; that is, a U-Net convolutional neural network model with an encoder-decoder structure can be built as the speech enhancement model. The U-Net convolutional neural network model can include a fully convolutional part (the encoder layers) and a deconvolutional part (the decoder layers). The fully convolutional part can be used to extract features and obtain low-resolution feature maps; the deconvolutional part can upsample the small feature maps to feature maps of the original size. Upsampling can improve the resolution of the image and can, illustratively, be accomplished by resampling and interpolation, e.g., interpolating the remaining points using bilinear interpolation.
First, the original input can be obtained by time-frequency-transforming the original speech signal; the original input is fed separately into the TAM(α) convolution module and the FAM(β) convolution module, and the original input together with the outputs of the TAM(α) and FAM(β) convolution modules are merged and fed into the U-NET convolutional neural network model. After the weight factors are trained, combined feature extraction can be performed on the original input, the TAM module output and the FAM module output, thereby realizing different smoothing effects in speech-present segments and speech-absent segments, and finally the enhanced speech signal is output.
In this example, after the TFDAL module is fully combined with the U-Net deep neural network, the two sets of smoothing parameters α = [α_0 … α_N] and β = [β_0 … β_M] in the time-frequency feature extraction layer can be learned during training; statistically better smoothing factor values are obtained under the drive of the training data, which further helps the neural network extract features in the TFDAL module and combine high-level features.
Fig. 8B shows a schematic diagram of the combination of the TFDAL module with the U-Net deep neural network. The U-Net deep neural network model can be a convolutional neural network structure with a 4-layer encoder and a 4-layer decoder. The encoder can extract time-frequency-domain smoothing features by downsampling the time and frequency dimensions; each encoder can contain a convolution layer with a 3 × 3 kernel, a pooling layer, and a non-linear layer with the ReLU (Rectified Linear Unit) activation function. The time and frequency dimensions are downsampled layer by layer, and features can be extracted with 3 × 3 convolution kernels, expanding the number of channels layer by layer to 64, 128, 256 and 256. Symmetrically, upsampling can be performed with 3 × 3 convolution kernels; at each upsampling step, the feature map from the corresponding encoder is added, and the number of channels changes layer by layer from 256 to 512, 256 and 128, until an image of the same size as the input is recovered. In addition, the activation function of the last layer can be the Tanh (hyperbolic tangent) activation function.
Specifically, the original amplitude spectrum can be taken as the original input image, which can be a T × F two-dimensional image matrix, with T the time dimension and F the frequency dimension. The original input image passes through, in sequence, the time-frequency feature extraction layer, the encoder, the decoder and the output layer.
First, the original input image can be preprocessed. The time-frequency spectral features are relatively independent in time and frequency, so convolution smoothing can be applied along the time axis and the frequency axis separately through the time-frequency feature extraction layer. Correspondingly, the original input image can be fed into the time-recursive smoothing layer of the U-Net deep neural network, and a convolution operation can be performed on the two-dimensional image matrix and the weight matrix N(α) of the time-dimension convolution kernel to obtain the time-domain smoothed feature map; the transposed original input image can be fed into the frequency-recursive smoothing layer of the U-Net deep neural network, and a convolution operation can be performed on the transposed two-dimensional image matrix and the weight matrix M(β) of the frequency-dimension convolution kernel to obtain the frequency-domain smoothed feature map. The time-frequency feature extraction layer can fuse features at the dimension level.
Then, the encoder can apply four convolution stages to the combined output of the time-frequency-domain smoothed feature maps and the original input image. For an original input image of size 201 × 201, the size of the time-dimension convolution kernel can be 32 × 201; a sliding-window convolution is applied to the two-dimensional original input image, the window of the convolution kernel of each channel being slid continuously so that multiple convolution operations are performed on the image, successively producing feature maps of four different sizes: 51 × 51, 13 × 13, 4 × 4 and 1 × 1. The encoder can extract high-dimensional features from the original speech signal.
Finally, the high-dimensional encoded features output by the encoder are taken as the input of the decoder; the decoder and the encoder have symmetric structures. For example, the 1 × 1 feature map can be upsampled or deconvolved to obtain a 4 × 4 feature map; this 4 × 4 feature map is concatenated along the channel dimension with the earlier 4 × 4 feature map, after which the concatenated feature map is convolved and upsampled to obtain a 13 × 13 feature map, which is in turn concatenated with the earlier 13 × 13 feature map, convolved, and upsampled again. After four upsampling steps in total, a 201 × 201 prediction result of the same size as the input image can be obtained. The decoder can restore the high-dimensional features to low-dimensional features carrying more acoustic information, and the output layer can recover the enhanced time-frequency spectral features.
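A simplified PyTorch sketch of such an encoder-decoder with skip connections (layer sizes and names are ours; it uses stride-2 pooling rather than reproducing the exact 201→51→13→4→1 schedule, so it illustrates the wiring, not the precise geometry of the described network):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def block(cin, cout):
    # 3x3 convolution + ReLU, as in the described encoder/decoder stages
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))

class MiniUNet(nn.Module):
    def __init__(self, cin=7):
        super().__init__()
        chans = [64, 128, 256, 256]          # encoder channel schedule from the text
        self.enc = nn.ModuleList()
        c_prev = cin
        for c in chans:
            self.enc.append(block(c_prev, c))
            c_prev = c
        self.dec = nn.ModuleList()
        for c_skip, c_out in zip(chans[-2::-1], [256, 128, 64]):
            self.dec.append(block(c_prev + c_skip, c_out))   # 512 -> 256, 384 -> 128, ...
            c_prev = c_out
        self.out = nn.Sequential(nn.Conv2d(c_prev, 1, 3, padding=1), nn.Tanh())

    def forward(self, x):
        skips = []
        for i, enc in enumerate(self.enc):
            x = enc(x)
            if i < len(self.enc) - 1:
                skips.append(x)
                x = F.max_pool2d(x, 2, ceil_mode=True)       # downsample time and frequency
        for dec, skip in zip(self.dec, reversed(skips)):
            x = F.interpolate(x, size=skip.shape[-2:], mode='bilinear', align_corners=False)
            x = dec(torch.cat([x, skip], dim=1))             # skip connection from the encoder
        return self.out(x)                                   # enhanced amplitude spectrum

net = MiniUNet()
print(net(torch.rand(1, 7, 201, 201)).shape)                 # -> torch.Size([1, 1, 201, 201])
```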
In this example, the two-dimensional TFDAL module can be successfully incorporated into a deep neural network model and combines well with convolutional neural networks, recurrent neural networks and fully connected neural networks, realizing gradient propagation. As a result, the convolution kernel parameters inside the TFDAL module, i.e., the parameters of the noise reduction algorithm, can be driven by data, and a statistically optimal value can be obtained without expert knowledge as prior information. Moreover, the TFDAL module combines the interpretability of an algorithm developed from expert knowledge with the strong fitting ability gained after incorporation into a neural network; it is an interpretable neural network module that can effectively combine advanced signal processing algorithms from the speech noise reduction field with deep neural networks.
In addition, the enhancement of the noisy signal can be measured by PESQ (Perceptual Evaluation of Speech Quality), STOI (Short-Time Objective Intelligibility) and the signal-to-noise ratio SNR. In this example, compared with the original U-Net neural network, the U-Net neural network with TFDAL added achieves larger improvements in the PESQ, STOI and SNR speech enhancement evaluation metrics under signal-to-noise ratio conditions of −5 dB, 0 dB, 5 dB, 10 dB and 15 dB.
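For instance, SNR can be computed directly, while PESQ and STOI are available in the third-party `pesq` and `pystoi` Python packages; the call signatures below match those packages' documented APIs, but treat them as an assumption to verify against the installed versions:

```python
import numpy as np
from pesq import pesq      # PyPI 'pesq' package (ITU-T P.862 implementation)
from pystoi import stoi    # PyPI 'pystoi' package

def snr_db(clean, enhanced):
    """Signal-to-noise ratio of the enhanced signal against the clean reference."""
    noise = clean - enhanced
    return 10 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

fs = 16000
clean = np.random.randn(fs).astype(np.float32)                    # dummy reference
enhanced = clean + 0.1 * np.random.randn(fs).astype(np.float32)   # dummy enhanced output

print(snr_db(clean, enhanced))
print(pesq(fs, clean, enhanced, 'wb'))              # wideband PESQ at 16 kHz
print(stoi(clean, enhanced, fs, extended=False))    # STOI score in [0, 1]
```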
In the speech enhancement method provided by this example embodiment of the present disclosure, the original amplitude spectrum of the original speech signal is obtained by performing a time-frequency transform on the original speech signal; feature extraction is performed on the original amplitude spectrum by using a time-dimension convolution kernel to obtain a time-domain smoothed feature map; feature extraction is performed on the original amplitude spectrum by using a frequency-dimension convolution kernel to obtain a frequency-domain smoothed feature map; combined feature extraction is performed on the original amplitude spectrum, the time-domain smoothed feature map and the frequency-domain smoothed feature map to obtain an enhanced amplitude spectrum of the original speech signal; and an inverse time-frequency transform is performed on the enhanced amplitude spectrum to obtain an enhanced speech signal. On the one hand, time-frequency smoothing features are extracted over the two-dimensional combination of the time axis and the frequency axis by a convolutional neural network, and, combined with a deep neural network, self-learning of the noise reduction parameters can be realized, further improving the quality of the speech signal; on the other hand, based on the statistical characteristics of the speech signal on the time axis and the frequency axis, dual-axis noise reduction on the time axis and the frequency axis can be realized, thereby achieving speech enhancement in a variety of complex noise environments.
It should be noted that although the steps of the method in the present disclosure are described in a particular order in the drawings, this does not require or imply that the steps must be executed in that particular order, or that all of the illustrated steps must be executed to achieve the desired result. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step for execution, and/or one step may be decomposed into multiple steps for execution, etc.
Further, this example embodiment also provides a neural network-based speech enhancement apparatus, which can be applied to a server or a terminal device. Referring to Fig. 9, the speech enhancement apparatus 900 can include a signal transformation module 910, a time-domain smoothing feature extraction module 920, a frequency-domain smoothing feature extraction module 930, a combined feature extraction module 940 and a signal inverse transformation module 950, where:
the signal transformation module 910 is configured to perform a time-frequency transform on the original speech signal to obtain the original amplitude spectrum of the original speech signal;
the time-domain smoothing feature extraction module 920 is configured to perform feature extraction on the original amplitude spectrum by using the time-dimension convolution kernel to obtain the time-domain smoothed feature map;
the frequency-domain smoothing feature extraction module 930 is configured to perform feature extraction on the original amplitude spectrum by using the frequency-dimension convolution kernel to obtain the frequency-domain smoothed feature map;
the combined feature extraction module 940 is configured to perform combined feature extraction on the original amplitude spectrum, the time-domain smoothed feature map and the frequency-domain smoothed feature map to obtain the enhanced amplitude spectrum of the original speech signal;
the signal inverse transformation module 950 is configured to perform an inverse time-frequency transform on the enhanced amplitude spectrum to obtain the enhanced speech signal.
In an optional embodiment, the time-domain smoothing feature extraction module 920 includes:
a time-domain smoothing parameter matrix determination module configured to determine the time-domain smoothing parameter matrix according to the convolution sliding window and the time-domain smoothing factor;
a first weight matrix determination module configured to perform a product operation on the time-domain smoothing parameter matrix to obtain the weight matrix of the time-dimension convolution kernel;
a time-domain operation module configured to perform a product operation on the weight matrix of the time-dimension convolution kernel and the original amplitude spectrum to obtain the time-domain smoothed feature map.
In an optional embodiment, the frequency-domain smoothing feature extraction module 930 includes:
a frequency-domain smoothing parameter matrix determination module configured to determine the frequency-domain smoothing parameter matrix according to the convolution sliding window and the frequency-domain smoothing factor;
a second weight matrix determination module configured to perform a product operation on the frequency-domain smoothing parameter matrix to obtain the weight matrix of the frequency-dimension convolution kernel;
a frequency-domain operation module configured to perform a product operation on the weight matrix of the frequency-dimension convolution kernel and the transposed matrix of the original amplitude spectrum to obtain the frequency-domain smoothed feature map.
In an optional embodiment, the combined feature extraction module 940 includes:
an input signal acquisition module configured to merge the original amplitude spectrum of the original speech signal, the time-domain smoothed feature map and the frequency-domain smoothed feature map to obtain the to-be-enhanced speech signal;
a weight matrix training module configured to take the to-be-enhanced speech signal as the input of the deep neural network and to train the weight matrices of the time-dimension convolution kernel and the frequency-dimension convolution kernel by using the back-propagation algorithm;
an enhanced amplitude spectrum acquisition module configured to perform combined feature extraction on the to-be-enhanced speech signal according to the trained weight matrices to obtain the enhanced amplitude spectrum of the original speech signal.
In an optional embodiment, the signal transformation module 910 includes:
a signal preprocessing module configured to perform windowing and framing on the original speech signal to obtain framed speech signals;
an original amplitude spectrum acquisition module configured to perform a discrete Fourier transform on each frame of the speech signal and to perform a modulus operation on the transformed speech signal to obtain the original amplitude spectrum of the original speech signal.
In an optional embodiment, the signal inverse transformation module 950 includes:
an original phase spectrum acquisition module configured to perform a phase-angle operation on the transformed speech signal to obtain the original phase spectrum of the original speech signal;
an enhanced speech signal acquisition module configured to perform an inverse time-frequency transform on the enhanced amplitude spectrum of the original speech signal and the original phase spectrum to obtain the enhanced speech signal.
In an optional embodiment, the speech enhancement apparatus 900 is further configured such that:
the original amplitude spectrum of the original speech signal obeys a two-dimensional Gaussian distribution in the complex domain.
The specific details of each module in the above speech enhancement apparatus have been described in detail in the corresponding speech enhancement method, and are therefore not repeated here.
It should be noted that although several modules or units of the apparatus for action execution are mentioned in the above detailed description, this division is not mandatory. Indeed, according to the embodiments of the present disclosure, the features and functions of two or more of the modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided into multiple modules or units to be embodied.
It should be understood that the present disclosure is not limited to the precise structure described above and shown in the drawings, and various modifications and changes can be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

  1. A neural network-based speech enhancement method, comprising:
    performing a time-frequency transform on an original speech signal to obtain an original amplitude spectrum of the original speech signal;
    performing feature extraction on the original amplitude spectrum by using a time-dimension convolution kernel to obtain a time-domain smoothed feature map;
    performing feature extraction on the original amplitude spectrum by using a frequency-dimension convolution kernel to obtain a frequency-domain smoothed feature map;
    performing combined feature extraction on the original amplitude spectrum, the time-domain smoothed feature map and the frequency-domain smoothed feature map to obtain an enhanced amplitude spectrum of the original speech signal;
    performing an inverse time-frequency transform on the enhanced amplitude spectrum to obtain an enhanced speech signal.
  2. The speech enhancement method according to claim 1, wherein performing feature extraction on the original amplitude spectrum by using the time-dimension convolution kernel to obtain the time-domain smoothed feature map comprises:
    determining a time-domain smoothing parameter matrix according to a convolution sliding window and a time-domain smoothing factor;
    performing a product operation on the time-domain smoothing parameter matrix to obtain a weight matrix of the time-dimension convolution kernel;
    performing a convolution operation on the weight matrix of the time-dimension convolution kernel and the original amplitude spectrum to obtain the time-domain smoothed feature map.
  3. The speech enhancement method according to claim 1, wherein performing feature extraction on the original amplitude spectrum by using the frequency-dimension convolution kernel to obtain the frequency-domain smoothed feature map comprises:
    determining a frequency-domain smoothing parameter matrix according to a convolution sliding window and a frequency-domain smoothing factor;
    performing a product operation on the frequency-domain smoothing parameter matrix to obtain a weight matrix of the frequency-dimension convolution kernel;
    performing a convolution operation on the weight matrix of the frequency-dimension convolution kernel and the transposed matrix of the original amplitude spectrum to obtain the frequency-domain smoothed feature map.
  4. The speech enhancement method according to claim 1, wherein performing combined feature extraction on the original amplitude spectrum, the time-domain smoothed feature map and the frequency-domain smoothed feature map to obtain the enhanced amplitude spectrum of the original speech signal comprises:
    merging the original amplitude spectrum of the original speech signal, the time-domain smoothed feature map and the frequency-domain smoothed feature map to obtain a to-be-enhanced speech signal;
    taking the to-be-enhanced speech signal as the input of a deep neural network, and training the weight matrix of the time-dimension convolution kernel and the weight matrix of the frequency-dimension convolution kernel by using a back-propagation algorithm;
    performing combined feature extraction on the to-be-enhanced speech signal according to the trained weight matrices to obtain the enhanced amplitude spectrum of the original speech signal.
  5. The speech enhancement method according to claim 1, wherein performing the time-frequency transform on the original speech signal to obtain the original amplitude spectrum of the original speech signal comprises:
    performing windowing and framing on the original speech signal to obtain framed speech signals;
    performing a discrete Fourier transform on each frame of the speech signal, and performing a modulus operation on the transformed speech signal to obtain the original amplitude spectrum of the original speech signal.
  6. The speech enhancement method according to claim 5, wherein performing the inverse time-frequency transform on the enhanced amplitude spectrum to obtain the enhanced speech signal comprises:
    performing a phase-angle operation on the transformed speech signal to obtain an original phase spectrum of the original speech signal;
    performing an inverse time-frequency transform on the enhanced amplitude spectrum of the original speech signal and the original phase spectrum to obtain the enhanced speech signal.
  7. The speech enhancement method according to claim 1, wherein the original amplitude spectrum of the original speech signal obeys a two-dimensional Gaussian distribution in the complex domain.
  8. A neural network-based speech enhancement apparatus, comprising:
    a signal transformation module configured to perform a time-frequency transform on an original speech signal to obtain an original amplitude spectrum of the original speech signal;
    a time-domain smoothing feature extraction module configured to perform feature extraction on the original amplitude spectrum by using a time-dimension convolution kernel to obtain a time-domain smoothed feature map;
    a frequency-domain smoothing feature extraction module configured to perform feature extraction on the original amplitude spectrum by using a frequency-dimension convolution kernel to obtain a frequency-domain smoothed feature map;
    a combined feature extraction module configured to perform combined feature extraction on the original amplitude spectrum, the time-domain smoothed feature map and the frequency-domain smoothed feature map to obtain an enhanced amplitude spectrum of the original speech signal;
    a signal inverse transformation module configured to perform an inverse time-frequency transform on the enhanced amplitude spectrum to obtain an enhanced speech signal.
  9. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-7.
  10. An electronic device, comprising:
    a processor; and
    a memory configured to store executable instructions of the processor;
    wherein the processor is configured to perform the method according to any one of claims 1-7 by executing the executable instructions.
PCT/CN2021/137973 2021-03-05 2021-12-14 Neural network-based speech enhancement method and apparatus, and electronic device WO2022183806A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110245564.1A CN113808607A (zh) 2021-03-05 2021-03-05 Neural network-based speech enhancement method and apparatus, and electronic device
CN202110245564.1 2021-03-05

Publications (1)

Publication Number Publication Date
WO2022183806A1 true WO2022183806A1 (zh) 2022-09-09

Family

ID=78892966

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/137973 WO2022183806A1 (zh) 2021-03-05 2021-12-14 Neural network-based speech enhancement method and apparatus, and electronic device

Country Status (2)

Country Link
CN (1) CN113808607A (zh)
WO (1) WO2022183806A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114093380B * 2022-01-24 2022-07-05 北京荣耀终端有限公司 Speech enhancement method, electronic device, chip system and readable storage medium
CN114897033B * 2022-07-13 2022-09-27 中国人民解放军海军工程大学 Three-dimensional convolution kernel group calculation method for multi-beam narrowband history data
CN116631410B * 2023-07-25 2023-10-24 陈志丰 Speech recognition method based on deep learning
CN117116289B * 2023-10-24 2023-12-26 吉林大学 Ward medical intercom management system and method

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108231086A (zh) * 2017-12-24 2018-06-29 航天恒星科技有限公司 FPGA-based deep learning speech enhancer and method
CN109215674A (zh) * 2018-08-10 2019-01-15 上海大学 Real-time speech enhancement method
CN109360581A (zh) * 2018-10-12 2019-02-19 平安科技(深圳)有限公司 Neural network-based speech enhancement method, readable storage medium and terminal device
CN110503967A (zh) * 2018-05-17 2019-11-26 中国移动通信有限公司研究院 Speech enhancement method, apparatus, medium and device
CN111081268A (zh) * 2019-12-18 2020-04-28 浙江大学 Phase-dependent shared deep convolutional neural network speech enhancement method
US20210012767A1 * 2020-09-25 2021-01-14 Intel Corporation Real-time dynamic noise reduction using convolutional networks
CN112259120A (zh) * 2020-10-19 2021-01-22 成都明杰科技有限公司 Single-channel separation method for vocals and background sounds based on a convolutional recurrent neural network
CN112331224A (zh) * 2020-11-24 2021-02-05 深圳信息职业技术学院 Lightweight time-domain convolutional network speech enhancement method and system

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009035614A1 (en) * 2007-09-12 2009-03-19 Dolby Laboratories Licensing Corporation Speech enhancement with voice clarity
EP2226794B1 (en) * 2009-03-06 2017-11-08 Harman Becker Automotive Systems GmbH Background noise estimation
US9431987B2 (en) * 2013-06-04 2016-08-30 Sony Interactive Entertainment America Llc Sound synthesis with fixed partition size convolution of audio signals
CN103559887B (zh) * 2013-11-04 2016-08-17 深港产学研基地 Background noise estimation method for a speech enhancement system
US10381020B2 (en) * 2017-06-16 2019-08-13 Apple Inc. Speech model-based neural network-assisted signal enhancement
BR112020008216A2 2017-10-27 2020-10-27 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method thereof for generating an enhanced audio signal, system for processing an audio signal
CN108447498B (zh) * 2018-03-19 2022-04-19 中国科学技术大学 Speech enhancement method applied to a microphone array
CN108564963B (zh) * 2018-04-23 2019-10-18 百度在线网络技术(北京)有限公司 Method and apparatus for enhancing speech
CN108711433B (zh) * 2018-05-18 2020-08-14 歌尔科技有限公司 Echo cancellation method and apparatus
CN109584895B (zh) * 2018-12-24 2019-10-25 龙马智芯(珠海横琴)科技有限公司 Speech noise reduction method and apparatus
CN110148420A (zh) 2019-06-30 2019-08-20 桂林电子科技大学 Speech recognition method suitable for noisy environments
CN112309421B (zh) * 2019-07-29 2024-03-19 中国科学院声学研究所 Speech enhancement method and system fusing the dual objectives of signal-to-noise ratio and intelligibility
CN112289333B (zh) * 2020-12-25 2021-04-13 北京达佳互联信息技术有限公司 Training method and apparatus for a speech enhancement model, and speech enhancement method and apparatus


Also Published As

Publication number Publication date
CN113808607A (zh) 2021-12-17

Similar Documents

Publication Publication Date Title
WO2022183806A1 (zh) Neural network-based speech enhancement method and apparatus, and electronic device
US11462209B2 (en) Spectrogram to waveform synthesis using convolutional networks
Pandey et al. Dense CNN with self-attention for time-domain speech enhancement
EP3806089B1 (en) Mixed speech recognition method and apparatus, and computer readable storage medium
US7725314B2 (en) Method and apparatus for constructing a speech filter using estimates of clean speech and noise
WO2021179424A1 (zh) Speech enhancement method and system combining an AI model, electronic device and medium
KR20190005217A (ko) Frequency-based audio analysis using neural networks
WO2018223727A1 (zh) Voiceprint recognition method, apparatus, device and medium
CN110164467A (zh) Speech noise reduction method and apparatus, computing device and computer-readable storage medium
WO2022126924A1 (zh) Training method and apparatus for a domain-separation-based voice conversion model
TR201810466T4 (tr) Apparatus and method for processing an audio signal to improve speech using feature extraction.
WO2022161277A1 (zh) Speech enhancement method, model training method, and related device
CN113345460B (zh) Audio signal processing method, apparatus, device and storage medium
WO2022213825A1 (zh) Neural network-based end-to-end speech enhancement method and apparatus
CN114898762A (zh) Real-time speech noise reduction method and apparatus for a target speaker, and electronic device
Götz et al. Neural network for multi-exponential sound energy decay analysis
Dash et al. Speech intelligibility based enhancement system using modified deep neural network and adaptive multi-band spectral subtraction
CN115223583A (zh) Speech enhancement method, apparatus, device and medium
CN113327594B (zh) Speech recognition model training method, apparatus, device and storage medium
CN116403594B (zh) Speech enhancement method and apparatus based on a noise update factor
Lee et al. Real-time neural speech enhancement based on temporal refinement network and channel-wise gating methods
CN117496990A (zh) Speech denoising method, apparatus, computer device and storage medium
CN116913307A (zh) Speech processing method, apparatus, communication device and readable storage medium
CN112687284B (zh) Reverberation suppression method and apparatus for reverberant speech
CN114783455A (zh) Method, apparatus, electronic device and computer-readable medium for speech noise reduction

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21928877

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM1205A DATED 16.01.2024)