CN110931031A - Deep learning voice extraction and noise reduction method fusing bone vibration sensor and microphone signals - Google Patents

Info

Publication number
CN110931031A
Authority
CN
China
Prior art keywords
vibration sensor
bone vibration
microphone
audio signal
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910953534.9A
Other languages
Chinese (zh)
Inventor
闫永杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Elephant Acoustical (shenzhen) Technology Co Ltd
Original Assignee
Elephant Acoustical (shenzhen) Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Elephant Acoustical (shenzhen) Technology Co Ltd
Priority to CN201910953534.9A
Publication of CN110931031A
Priority to TW109134873A (patent TWI763073B)
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise

Abstract

The invention relates to a deep learning noise reduction method that fuses bone vibration sensor and microphone signals, comprising the following steps: a bone vibration sensor and a microphone collect audio signals to obtain a bone vibration sensor audio signal and a microphone audio signal, respectively; the bone vibration sensor audio signal is input into a high-pass filtering module and high-pass filtered; the high-pass filtered bone vibration sensor audio signal, or that signal after band widening, is input together with the microphone audio signal into a deep neural network module; and the deep neural network module predicts the noise-reduced voice. By combining the signals of the bone vibration sensor and a conventional microphone and exploiting the strong modeling capability of the deep neural network, the invention achieves high voice fidelity and strong noise suppression, can solve the voice extraction problem in complex noise scenes by extracting the target voice and reducing interfering noise, and can use a single-microphone structure to reduce cost. In addition, the band-widened bone vibration sensor audio signal can be used directly as the output.

Description

Deep learning voice extraction and noise reduction method fusing bone vibration sensor and microphone signals
Technical Field
The invention relates to the technical field of voice noise reduction for electronic equipment, and in particular to a deep learning noise reduction method fusing bone vibration sensor and microphone signals.
Background
Voice noise reduction technology separates a voice signal from a noisy voice signal and is widely used. It is generally divided into single-microphone and multi-microphone noise reduction, and both traditional approaches have shortcomings. Traditional single-microphone noise reduction assumes in advance that the noise is stationary, so its adaptability is poor and its limitations are significant. Traditional multi-microphone noise reduction requires two or more microphones, which increases cost and places higher demands on, and therefore constrains, the structural design of the product. In addition, multi-microphone noise reduction relies on direction information, so noise arriving from the direction of the target voice cannot be suppressed. These shortcomings are worth addressing.
Traditional multi-microphone and single-microphone noise reduction technologies for voice calls have the following shortcomings:
1. cost increases linearly with the number of microphones: the more microphones, the higher the cost;
2. multiple microphones place higher demands on the structural design of the product and thus constrain it;
3. multi-microphone noise reduction relies on direction information, so noise from a direction close to that of the target voice cannot be suppressed;
4. single-microphone noise reduction relies on noise estimation and assumes in advance that the noise is stationary, which limits its applicability.
The invention combines the signals of a bone vibration sensor and a conventional microphone and fuses them by deep learning to achieve noise reduction, extracting the target voice and reducing interfering noise in a variety of noise environments. The technology can be applied to call scenarios in which the device fits against the ear (or another body part), such as earphones and mobile phones. In contrast to techniques that use only one or more microphones for noise reduction, the combination with a bone vibration sensor can be used in environments with a very low signal-to-noise ratio, such as subways or wind noise, while still maintaining a good call experience. Compared with traditional single-microphone noise reduction, the technology makes no assumption about the noise (traditional single-microphone noise reduction assumes stationary noise in advance); it exploits the strong modeling capability of the deep neural network, offers good voice fidelity and strong noise suppression, and can solve the voice extraction problem in complex noise scenes. Compared with traditional multi-microphone noise reduction, which needs two or more microphones for beamforming, the technology uses a single microphone.
Compared with an air conduction microphone, a bone vibration sensor captures mainly the low-frequency range, but it is not disturbed by air-conducted noise. Unlike other noise reduction approaches that combine a bone vibration sensor with an air conduction microphone and use the bone vibration sensor signal only as a voice activity detection flag, this technology takes the bone conduction signal as a low-frequency input signal and, after optional high-frequency reconstruction, feeds it together with the microphone signal into a deep neural network for overall fusion to achieve noise reduction. The bone vibration sensor provides a high-quality low-frequency signal, which greatly improves the accuracy of the deep neural network prediction and therefore the noise reduction effect.
Compared with patent application No. 201710594168.3 (entitled a universal single-channel real-time noise reduction method), the present method introduces the bone vibration sensor signal and exploits the fact that the bone vibration sensor is not disturbed by airborne noise, fusing the bone vibration sensor signal and the air conduction microphone signal with a deep neural network, so that high-quality noise reduction can be achieved even at an extremely low signal-to-noise ratio.
Whereas patent application No. 201811199154.2 (entitled a system for recognizing the voice of the user through human body vibration to control the electronic equipment) uses the bone vibration sensor signal as a voice activity detection flag, the present method uses the bone vibration sensor signal and the microphone signal together as the input of a deep neural network to fuse them at the signal level, thereby achieving high-quality noise reduction.
Disclosure of Invention
The invention aims to solve the technical problems of the prior art, in which multiple microphones constrain the product structure and raise cost and traditional single-microphone noise reduction is limited, by providing a deep learning noise reduction method that fuses bone vibration sensor and microphone signals. Unlike other technologies that combine a bone vibration sensor with an air conduction microphone and use the bone vibration sensor signal only as a voice activity detection flag, this technology exploits the fact that the bone vibration sensor signal is not disturbed by air-conducted noise: it takes the bone conduction signal as a direct input signal and, after optional high-frequency reconstruction, feeds it together with the microphone signal into a deep neural network for overall fusion and noise reduction. The bone vibration sensor provides a high-quality low-frequency signal, which greatly improves the accuracy of the deep neural network prediction and therefore the noise reduction effect.
The technical solution adopted by the invention is as follows: a deep learning noise reduction method fusing bone vibration sensor and microphone signals is constructed, which combines the respective advantages of the bone vibration sensor signal and the conventional microphone signal and uses deep learning voice extraction and noise reduction to extract the target voice and reduce interfering noise in a variety of noise environments. The technology can be applied to call scenarios in which the device fits against the ear (or another body part), such as earphones and mobile phones, and is low in cost and easy to implement.
The deep learning noise reduction method fusing bone vibration sensor and microphone signals comprises the following steps:
S1, the bone vibration sensor and the microphone collect audio signals to obtain a bone vibration sensor audio signal and a microphone audio signal, respectively;
S2, the bone vibration sensor audio signal is input into a high-pass filtering module and high-pass filtered;
S3, the high-pass filtered bone vibration sensor audio signal and the microphone audio signal are input into a deep neural network module;
S4, the deep neural network module predicts the fused, noise-reduced voice.
In the deep learning noise reduction method fusing bone vibration sensor and microphone signals, the high-pass filtering module corrects the DC offset of the bone vibration sensor audio signal and filters out low-frequency clutter.
In the deep learning noise reduction method fusing bone vibration sensor and microphone signals, after high-pass filtering, the bone vibration sensor audio signal is preferably further widened in frequency range by high-frequency reconstruction, that is, a band widening method, so that the signal is widened to above 2 kHz before being input into the deep neural network module.
Further, only the bone vibration signal with the broadened frequency band may be used as the final output signal, thereby eliminating the need to rely on a microphone signal.
In the deep learning noise reduction method for fusing the bone vibration sensor and the microphone signal, the deep neural network module further comprises a fusion module, and the fusion module fuses and reduces noise of the microphone audio signal and the bone vibration sensor audio signal.
In the deep learning noise reduction method fusing bone vibration sensor and microphone signals, one implementation of the deep neural network module is a convolutional recurrent neural network, which predicts a clean speech magnitude spectrum.
In the deep learning noise reduction method fusing bone vibration sensor and microphone signals, the deep neural network module consists of several convolutional layers, several long short-term memory (LSTM) layers, and a corresponding number of deconvolution layers.
In the deep learning noise reduction method fusing bone vibration sensor and microphone signals, the training target of the deep neural network module is the clean speech magnitude spectrum. The clean speech is first subjected to a short-time Fourier transform, and the resulting clean speech magnitude spectrum is used as the training target, that is, the target magnitude spectrum.
In the deep learning noise reduction method fusing bone vibration sensor and microphone signals, the input signal of the deep neural network module is formed by stacking the magnitude spectrum of the bone vibration sensor audio signal (or its magnitude spectrum after band widening) and the magnitude spectrum of the microphone audio signal;
the bone vibration sensor audio signal and the microphone audio signal are first each subjected to a short-time Fourier transform to obtain two magnitude spectra, which are then stacked.
In the deep learning noise reduction method fusing bone vibration sensor and microphone signals, the stacked magnitude spectra pass through the deep neural network module to obtain a predicted magnitude spectrum, which is output.
In the deep learning noise reduction method fusing bone vibration sensor and microphone signals, the mean square error between the target magnitude spectrum and the predicted magnitude spectrum is computed.
The beneficial effects of the invention are as follows. The deep learning voice extraction and noise reduction method fusing bone vibration sensor and microphone signals exploits the strong modeling capability of a deep neural network, achieves good voice fidelity and strong noise suppression, and can solve the voice extraction problem in complex noise scenes. Because the bone vibration sensor is not disturbed by air-conducted noise, the invention can be used in environments with an extremely low signal-to-noise ratio, such as subways or wind noise, while still maintaining a good call experience. The use of a single microphone also significantly simplifies implementation and reduces cost. Unlike other noise reduction approaches that combine a bone vibration sensor with an air conduction microphone and use the bone vibration sensor signal only as a voice activity detection flag, the invention exploits the fact that the bone vibration sensor signal is not disturbed by air-conducted noise: it takes the bone conduction signal as a low-frequency input signal and, after optional high-frequency reconstruction, feeds it together with the microphone signal into a deep neural network for overall fusion to obtain the voice. The bone vibration sensor provides a high-quality low-frequency signal, which greatly improves the accuracy with which the deep neural network predicts the voice and therefore the noise reduction effect.
Drawings
The invention will be further explained with reference to the drawings and the embodiments. In the drawings:
FIG. 1 is a block flow diagram of a method of deep learning noise reduction incorporating bone vibration sensor and microphone signals in accordance with the present invention;
FIG. 2 is a functional block diagram of a method of high frequency reconstruction;
FIG. 3 is a block diagram of a deep neural network fusion module structure of a deep learning noise reduction method for fusing bone vibration sensors and microphone signals according to the present invention;
FIG. 4 is a schematic diagram of a frequency spectrum of an audio signal collected by a bone vibration sensor according to the deep learning noise reduction method for integrating a bone vibration sensor and a microphone signal;
FIG. 5 is a schematic diagram of a frequency spectrum of an audio signal collected by a microphone of the deep learning noise reduction method for fusing bone vibration sensors and microphone signals according to the present invention;
FIG. 6 is a schematic diagram of a frequency spectrum of an audio signal processed by a deep learning noise reduction method for fusing bone vibration sensors and microphone signals according to the present invention;
FIG. 7 is a comparison of the noise reduction effect of the deep learning noise reduction method fusing bone vibration sensor and microphone signals according to the present invention with that of a corresponding single-channel real-time deep learning noise reduction method without a bone vibration sensor.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, the present invention is a deep learning speech extraction and noise reduction method fusing bone vibration sensor and microphone signals, comprising the following steps:
S1, the bone vibration sensor and the microphone collect audio signals to obtain a bone vibration sensor audio signal and a microphone audio signal, respectively;
S2, the bone vibration sensor audio signal is input into a high-pass filtering module and high-pass filtered;
S3, the high-pass filtered bone vibration sensor audio signal and the microphone audio signal are input into a deep neural network module;
S4, the deep neural network module predicts the fused, noise-reduced voice. The invention introduces a bone vibration sensor and, exploiting the fact that the bone vibration sensor is not disturbed by airborne noise, fuses the bone vibration sensor signal and the air conduction microphone signal with a deep neural network, so that an ideal noise reduction effect can be achieved even at an extremely low signal-to-noise ratio.
The most advanced practical speech noise reduction scheme to date has been a feedforward deep neural network (DNN) trained on a large amount of data. Although this scheme can separate a specific voice from noisy speech not seen in training, the model does not reduce noise well for non-specific voices. The most effective way to improve noise reduction for non-specific voices is to add the voices of multiple speakers to the training set; however, this can cause the DNN to confuse voice with background noise, so that it tends to misclassify noise as voice.
The published patent application No. 201710594168.3 (entitled a universal single-channel real-time noise reduction method) relates to a general single-channel real-time noise reduction method comprising the following steps: receiving noisy speech in electronic form, the noisy speech containing voice and non-voice interfering noise; extracting the short-time Fourier magnitude spectrum from the received sound frame by frame as the acoustic feature; generating a ratio mask frame by frame using a deep recurrent neural network with long short-term memory; masking the magnitude spectrum of the noisy speech with the generated ratio mask; and resynthesizing the speech waveform by inverse Fourier transform from the masked magnitude spectrum and the original phase of the noisy speech. That method performs voice noise reduction by supervised learning and estimates an ideal ratio mask using a recurrent neural network with long short-term memory. Its recurrent neural network is trained on a large amount of noisy speech covering various real acoustic scenes and microphone impulse responses, and ultimately achieves general voice noise reduction that is independent of background noise, speaker, and transmission channel. Single-channel noise reduction means processing the signal acquired by a single microphone; compared with beamforming microphone-array noise reduction, it is more widely applicable and lower in cost.
The invention of that application introduces a technique that eliminates the dependence on future time frames, enabling efficient computation of the recurrent neural network model during noise reduction; by further simplifying the computation without affecting noise reduction performance, it constructs a very small recurrent neural network model and thus achieves real-time voice noise reduction.
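The masking step described for the prior-art method can be illustrated with a minimal sketch. The text does not give a formula for the ideal ratio mask; the definition used here (square root of the speech-to-total power ratio per time-frequency bin) is one common choice and is an assumption, as are the function names `ideal_ratio_mask` and `apply_mask` and the use of NumPy.

```python
import numpy as np


def ideal_ratio_mask(speech_mag, noise_mag, eps=1e-8):
    """One common ideal-ratio-mask definition (an assumption, not from the patent).

    Each bin gets sqrt(S^2 / (S^2 + N^2)), a value between 0 and 1.
    `eps` avoids division by zero in silent bins.
    """
    speech_pow = np.asarray(speech_mag, dtype=float) ** 2
    noise_pow = np.asarray(noise_mag, dtype=float) ** 2
    return np.sqrt(speech_pow / (speech_pow + noise_pow + eps))


def apply_mask(noisy_mag, mask):
    """Scale the noisy magnitude spectrum bin by bin with a mask clipped to [0, 1]."""
    return np.clip(mask, 0.0, 1.0) * np.asarray(noisy_mag, dtype=float)


# Toy magnitudes for a single frame with two frequency bins.
speech = np.array([[3.0, 0.0]])
noise = np.array([[4.0, 2.0]])
mask = ideal_ratio_mask(speech, noise)
print(mask)
```

During training the mask is computed from known clean speech and noise; at run time a network estimates it from the noisy input alone.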
Further, the present invention introduces a bone vibration sensor. The bone vibration sensor can capture low-frequency voice and is not disturbed by airborne noise. Fusing the bone vibration sensor signal and the air conduction microphone signal with a deep neural network achieves an ideal full-band noise reduction effect even at an extremely low signal-to-noise ratio. The bone vibration sensor used in the present embodiment is prior art.
Speech signals are strongly correlated in the time dimension, and this correlation is very helpful for speech separation. To exploit this context information and improve separation performance, methods based on deep neural networks splice the current frame with the consecutive preceding and following frames into a higher-dimensional vector that serves as the input feature. The method is executed by a computer program, which extracts acoustic features from the noisy speech, estimates an ideal time-frequency ratio mask, and resynthesizes the noise-reduced speech waveform. The method comprises one or more program modules, which may be executed by any system or hardware device capable of executing computer program instructions.
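The frame splicing described above can be sketched as follows. This is an illustrative sketch only: the function name `splice_context`, the context widths, and the edge handling (repeating the first and last frame) are assumptions not specified in the text.

```python
import numpy as np


def splice_context(frames, left=2, right=2):
    """Concatenate each frame with its `left` preceding and `right` following frames.

    frames: array of shape (num_frames, num_features).
    Edges are padded by repeating the first/last frame (an assumed convention).
    Returns an array of shape (num_frames, (left + 1 + right) * num_features).
    """
    frames = np.asarray(frames, dtype=float)
    num_frames = frames.shape[0]
    padded = np.concatenate(
        [
            np.repeat(frames[:1], left, axis=0),
            frames,
            np.repeat(frames[-1:], right, axis=0),
        ],
        axis=0,
    )
    # Slice number `offset` holds, for every frame t, frame t + offset - left.
    windows = [padded[offset:offset + num_frames] for offset in range(left + 1 + right)]
    return np.concatenate(windows, axis=1)


features = np.arange(12).reshape(4, 3)  # 4 frames, 3 features each
spliced = splice_context(features, left=1, right=1)
print(spliced.shape)
```

A causal, real-time variant would set `right=0` so that no future frame is required.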
Furthermore, the high-pass filtering module corrects the DC offset of the bone vibration sensor audio signal and filters out low-frequency clutter.
Further, the high-pass filtering module may be implemented as a digital filter.
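As one possible digital filter for this module, the sketch below uses a first-order DC-blocking high-pass filter. The patent does not specify the filter type, order, or cutoff; the recurrence y[n] = x[n] - x[n-1] + pole * y[n-1], the function name `dc_blocking_highpass`, and the default `pole=0.995` are assumptions chosen for illustration.

```python
import math


def dc_blocking_highpass(samples, pole=0.995):
    """First-order DC blocker: y[n] = x[n] - x[n-1] + pole * y[n-1].

    `pole` must lie strictly between 0 and 1; values closer to 1 give a
    lower cutoff frequency. The default is a hypothetical choice.
    """
    if not 0.0 < pole < 1.0:
        raise ValueError("pole must lie strictly between 0 and 1")
    filtered = []
    prev_in = 0.0
    prev_out = 0.0
    for x in samples:
        y = x - prev_in + pole * prev_out
        filtered.append(y)
        prev_in, prev_out = x, y
    return filtered


# A 200 Hz tone riding on a constant offset, sampled at an assumed 16 kHz.
sample_rate = 16000
biased = [0.5 + 0.1 * math.sin(2 * math.pi * 200 * n / sample_rate)
          for n in range(sample_rate)]
cleaned = dc_blocking_highpass(biased)
tail = cleaned[-800:]
print("mean before filtering:", sum(biased) / len(biased))
print("mean of last 50 ms after filtering:", sum(tail) / len(tail))
```

A higher-order design would give a sharper cutoff for low-frequency clutter; the first-order form is shown only because it is the simplest filter that removes a DC offset.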
Further, after high-pass filtering, the bone vibration sensor audio signal is preferably subjected to high-frequency reconstruction. That is, a band widening method is used to further widen its frequency range to above 2 kHz before it is input into the deep neural network module.
Further, the high-frequency reconstruction module, which further widens the bandwidth of the bone vibration signal, is optional.
Further, there are many methods for high-frequency reconstruction; deep neural networks are currently the most effective, and one deep neural network structure is given as an example in this embodiment.
The bone vibration sensor audio signal is high-pass filtered to correct the DC offset of the bone conduction signal and filter out low-frequency noise; the bone vibration signal is widened to above 2 kHz by a band widening (high-frequency reconstruction) method, this step being optional, so that the original bone vibration signal from step S1 can also be used directly; the output of step S2 and the microphone signal are sent to the deep neural network module; and the deep neural network module predicts the fused, noise-reduced voice.
As shown in fig. 2, high-frequency reconstruction further widens the frequency range of the bone vibration signal and may be performed by a deep neural network. Such a network may be implemented in various ways; fig. 2 shows one of them (without limitation), a high-frequency reconstruction method based on a deep recurrent neural network with long short-term memory.
The published patent application No. 201811199154.2 (entitled a system for recognizing the voice of the user through human body vibration to control the electronic equipment) includes a human body vibration sensor for sensing human body vibration of a user; a processing circuit, coupled with the human body vibration sensor, for controlling the sound pickup equipment to start sound pickup when the output signal of the human body vibration sensor is determined to contain a user voice signal; and a communication module, coupled with the processing circuit and the sound pickup equipment, for communication between the processing circuit and the sound pickup equipment. Unlike that application, in which the bone vibration sensor signal is used as a voice activity detection flag, the present invention uses the bone vibration sensor signal and the microphone signal together as the input of a deep neural network for deep fusion at the signal level, thereby achieving an excellent noise reduction effect.
Furthermore, the deep neural network module also comprises a fusion module, and the fusion module based on the deep neural network is used for completing the fusion of the microphone audio signal and the bone vibration sensor audio signal and reducing the noise.
Further, one implementation of the deep neural network module is a convolutional recurrent neural network, which predicts a clean speech magnitude spectrum (Speech Magnitude Spectrum).
Furthermore, the network structure in the fusion module based on the deep neural network takes the convolutional recurrent neural network as an example; it may also be replaced by structures such as a long short-term memory network or a deep fully convolutional network.
As an example, the deep neural network module may be composed of three layers of convolutional networks, three layers of long-short term memory networks, and three layers of deconvolution networks.
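To make the three-convolution, three-deconvolution symmetry concrete, the sketch below tracks how the frequency dimension shrinks through the encoder and is restored by the decoder. It uses the standard output-size formulas for strided convolution and transposed convolution (dilation 1). The kernel size, stride, and 161 frequency bins are hypothetical values; the patent does not specify layer hyperparameters.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output length of a strided convolution along one axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride, padding=0, output_padding=0):
    """Output length of a transposed convolution along one axis (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def plan_encoder_decoder(freq_bins, kernel=3, stride=2, layers=3):
    """Return encoder sizes and, per decoder layer, (input, output, output_padding).

    The output padding is whatever each deconvolution needs to mirror the
    matching encoder layer exactly.
    """
    encoder_sizes = [freq_bins]
    for _ in range(layers):
        encoder_sizes.append(conv_out(encoder_sizes[-1], kernel, stride))
    decoder_plan = []
    size = encoder_sizes[-1]
    for target in reversed(encoder_sizes[:-1]):
        output_padding = target - deconv_out(size, kernel, stride)
        decoder_plan.append((size, target, output_padding))
        size = target
    return encoder_sizes, decoder_plan


# 161 bins corresponds to an assumed 320-point FFT.
encoder, decoder = plan_encoder_decoder(161)
print("encoder sizes:", encoder)
print("decoder plan:", decoder)
```

The long short-term memory layers sit between the encoder and decoder and operate along the time axis, so they do not change the frequency sizes computed here.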
Fig. 3 shows a block diagram of the deep neural network fusion module of the deep learning noise reduction method fusing bone vibration sensor and microphone signals, illustrating a convolutional recurrent neural network implementation of the deep neural network module. The training target (Training Target) of the deep neural network module is the clean speech magnitude spectrum (Speech Magnitude Spectrum): the clean speech (Clean Speech) is subjected to a short-time Fourier transform (STFT), and the resulting clean speech magnitude spectrum is used as the training target, that is, the target magnitude spectrum (Target Magnitude Spectrum).
Further, the input signal of the deep neural network module is formed by stacking (Stacking) the magnitude spectrum of the bone vibration sensor audio signal and the magnitude spectrum of the microphone audio signal;
the bone vibration sensor audio signal and the microphone audio signal are first each subjected to a short-time Fourier transform (STFT) to obtain two magnitude spectra (Magnitude Spectrum), which are then stacked (Stacking).
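The two-channel input described above can be sketched with NumPy as follows. The frame length, hop size, Hann window, 16 kHz sample rate, and stacking along a leading channel axis are all assumptions; the text only states that the two magnitude spectra are stacked.

```python
import numpy as np


def stft(signal, frame_len=320, hop=160):
    """Short-time Fourier transform with a Hann window.

    Returns a complex array of shape (num_frames, frame_len // 2 + 1).
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop:i * hop + frame_len] * window for i in range(num_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def stacked_magnitudes(bone, mic, frame_len=320, hop=160):
    """Stack the bone-sensor and microphone magnitude spectra.

    Output shape: (2, num_frames, num_bins), bone channel first.
    """
    bone_mag = np.abs(stft(bone, frame_len, hop))
    mic_mag = np.abs(stft(mic, frame_len, hop))
    return np.stack([bone_mag, mic_mag], axis=0)


# Synthetic one-second test tones standing in for the two sensor signals.
t = np.arange(16000) / 16000.0
bone_signal = np.sin(2 * np.pi * 200 * t)
mic_signal = np.sin(2 * np.pi * 1000 * t)
features = stacked_magnitudes(bone_signal, mic_signal)
print(features.shape)
```

Stacking along a channel axis lets a convolutional front end treat the two sensors like the channels of an image; concatenating along the frequency axis would be an equally plausible reading of the text.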
Further, the stacked magnitude spectra are passed through the deep neural network module to obtain a predicted magnitude spectrum (Estimated Magnitude Spectrum), which is output.
Further, the mean-square error (MSE) between the target magnitude spectrum and the predicted magnitude spectrum (Estimated Magnitude Spectrum) is computed; the MSE is a measure of the difference between an estimate and the quantity being estimated. Furthermore, the training process (Training) updates the network parameters by backpropagation and gradient descent, continuously feeding in training data and updating the network parameters until the network converges.
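The loss and the update rule can be illustrated with a deliberately tiny stand-in for the network. In the sketch below a single scalar gain replaces all network parameters, and its gradient is obtained by the chain rule, as backpropagation would do layer by layer. The toy data, learning rate, and iteration count are assumptions for illustration only.

```python
import numpy as np


def mse(target_mag, estimated_mag):
    """Mean-square error between target and estimated magnitude spectra."""
    target_mag = np.asarray(target_mag, dtype=float)
    estimated_mag = np.asarray(estimated_mag, dtype=float)
    return float(np.mean((estimated_mag - target_mag) ** 2))


def mse_gradient(target_mag, estimated_mag):
    """Gradient of the MSE with respect to each estimated magnitude."""
    target_mag = np.asarray(target_mag, dtype=float)
    estimated_mag = np.asarray(estimated_mag, dtype=float)
    return 2.0 * (estimated_mag - target_mag) / estimated_mag.size


# Toy setup: the "noisy" spectrum is exactly twice the target, so the
# ideal value of the single trainable gain is 0.5.
target = np.linspace(0.1, 1.0, 40).reshape(5, 8)
noisy = 2.0 * target
gain = 1.0
learning_rate = 0.1
losses = []
for _ in range(50):
    estimate = gain * noisy                      # forward pass
    losses.append(mse(target, estimate))
    # Chain rule: dL/dgain = sum(dL/destimate * destimate/dgain).
    grad_gain = float(np.sum(mse_gradient(target, estimate) * noisy))
    gain -= learning_rate * grad_gain            # gradient descent step

print("learned gain:", gain)
print("first and last loss:", losses[0], losses[-1])
```

In a real network the same two steps apply, with an automatic differentiation framework computing the gradient for every parameter.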
Further, the inference process (Inference) recovers the predicted clean speech (Clean Speech) by combining the phase of the short-time Fourier transform (STFT) of the microphone data with the predicted magnitude spectrum (Estimated Magnitude Spectrum).
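A minimal sketch of this reconstruction step follows. The square-root Hann analysis and synthesis windows with 50% overlap are an assumed choice that makes overlap-add exact away from the signal edges; the frame length and function names are likewise hypothetical.

```python
import numpy as np


def _sqrt_hann(frame_len):
    """Square root of a periodic Hann window (sums to 1 at 50% overlap when squared)."""
    n = np.arange(frame_len)
    return np.sqrt(0.5 - 0.5 * np.cos(2.0 * np.pi * n / frame_len))


def stft(signal, frame_len=320):
    """STFT with 50% overlap; returns shape (num_frames, frame_len // 2 + 1)."""
    signal = np.asarray(signal, dtype=float)
    hop = frame_len // 2
    window = _sqrt_hann(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop:i * hop + frame_len] * window for i in range(num_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def istft(spectrum, frame_len=320):
    """Inverse STFT by windowed overlap-add with 50% overlap."""
    hop = frame_len // 2
    window = _sqrt_hann(frame_len)
    frames = np.fft.irfft(spectrum, n=frame_len, axis=1) * window
    output = np.zeros(hop * (spectrum.shape[0] - 1) + frame_len)
    for i, frame in enumerate(frames):
        output[i * hop:i * hop + frame_len] += frame
    return output


def reconstruct(estimated_mag, mic_signal, frame_len=320):
    """Combine the estimated magnitude with the microphone phase, then invert."""
    mic_phase = np.angle(stft(mic_signal, frame_len))
    return istft(estimated_mag * np.exp(1j * mic_phase), frame_len)


# Sanity check: feeding the microphone's own magnitude returns the microphone signal.
mic = np.sin(2 * np.pi * 440 * np.arange(1600) / 16000.0)
restored = reconstruct(np.abs(stft(mic)), mic)
print(np.max(np.abs(restored[160:-160] - mic[160:-160])))
```

Reusing the noisy microphone phase is a simplification: only the magnitude is enhanced, so some residual phase distortion remains in the output.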
Compared with the traditional multi-microphone noise reduction technology, the single-microphone noise reduction method adopts a single microphone as input. Therefore, the method has the characteristics of strong robustness, controllable cost, low requirement on product structure design and the like. In this embodiment, robustness means that the noise reduction performance of the noise reduction system is interfered by microphone consistency and the like, and strong robustness means that no requirements are made on microphone consistency, microphone placement and the like, and the noise reduction system can adapt to various microphones.
As shown in fig. 7, a comparison graph of noise reduction effects of a deep learning noise reduction method fusing bone vibration sensors and microphone signals and a corresponding monaural deep learning noise reduction method without bone vibration sensors is shown. Specifically, the results of processing by using the method (Only-Mic) in the general monophonic real-time noise reduction method (application number: 201710594168.3) and the method (Sensor-Mic) according to the present technology in 8 noise scenes are compared, and the objective test results shown in fig. 7 are obtained. The eight types of noise are: bar noise, highway noise, intersection noise, train station noise, car noise traveling at 130km/h, cafe noise, noise on tables, and office noise. The test criteria was subjective speech quality assessment (PESQ), with values ranging from [ -0.5,4.5 ]. As can be seen from the table, the PESQ scores are greatly improved after the processing of the technology in each scene, and the average improvement of eight scenes is 0.26. This shows that the technology has higher voice restoration degree and stronger noise suppression capability. The method utilizes the characteristic that the bone vibration sensor is not interfered by air noise, and fuses the signals of the bone vibration sensor and the air conduction microphone by using the deep neural network, thereby achieving the ideal noise reduction effect under the condition of extremely low signal to noise ratio.
Furthermore, unlike traditional single-microphone noise reduction techniques, the method makes no assumptions about the noise (traditional single-microphone techniques generally presuppose that the noise is stationary). It exploits the strong modeling capability of a deep neural network, achieves good speech restoration and strong noise suppression, can solve the speech extraction problem in complex noise scenes, and can be applied to communication scenarios in which a device such as an earphone or mobile phone rests against the ear (or another body part). Other noise reduction approaches that combine a bone vibration sensor with an air-conduction microphone use the bone vibration sensor signal only as a voice activity detection flag. In contrast, this method exploits the fact that the bone vibration sensor signal is not disturbed by air-conducted noise: it takes the bone-conducted signal as the low-frequency input signal and, after optional high-frequency reconstruction, feeds it together with the microphone signal into the deep neural network for joint noise reduction and fusion. The bone vibration sensor provides a high-quality low-frequency signal, which greatly improves the accuracy of the deep neural network's prediction and therefore the noise reduction effect. The band-widened bone vibration sensor signal can also be used directly as the output.
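As an illustration only, the input path described above and in the claims (high-pass filtering of the bone vibration sensor signal, short-time Fourier transform of both signals, stacking of the two magnitude spectra) might be sketched as follows. The 16 kHz sampling rate, the 320-sample frame with 160-sample hop, the one-pole DC-blocking filter with coefficient r, the function names and the synthetic test signals are all assumptions made for this sketch; the embodiment does not specify them.

```python
import numpy as np

def dc_blocking_highpass(x, r=0.995):
    """Assumed one-pole DC blocker: y[n] = x[n] - x[n-1] + r * y[n-1]."""
    y = np.zeros(len(x), dtype=float)
    prev_x = 0.0
    prev_y = 0.0
    for n, sample in enumerate(x):
        prev_y = sample - prev_x + r * prev_y
        prev_x = sample
        y[n] = prev_y
    return y

def stft_magnitude(x, frame_len=320, hop=160):
    """Magnitude of Hann-windowed frames, shape (frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop:i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))

def build_network_input(bone, mic):
    """Stack bone-sensor and microphone magnitude spectra on a channel axis."""
    bone_mag = stft_magnitude(dc_blocking_highpass(bone))
    mic_mag = stft_magnitude(mic)
    return np.stack([bone_mag, mic_mag])

# Synthetic stand-in signals (hypothetical, not real recordings).
fs = 16000
t = np.arange(fs) / fs
bone = 0.3 + 0.5 * np.sin(2 * np.pi * 200 * t)        # tone with a DC offset
rng = np.random.default_rng(0)
mic = np.sin(2 * np.pi * 200 * t) + 0.5 * rng.standard_normal(fs)

features = build_network_input(bone, mic)
print(features.shape)  # (2, 99, 161)
```

With these assumed settings, one second of audio yields a stacked array of two channels, 99 frames and 161 frequency bins per frame, which is the kind of input the deep neural network module would receive.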
In this embodiment, the high-frequency reconstruction module is an optional module that further widens the bandwidth of the bone vibration signal. Many methods exist for high-frequency reconstruction; deep neural networks are a recent and the most effective approach, and the specific embodiment gives one deep neural network structure as an example. In this embodiment, the fusion module based on a deep neural network uses a convolutional recurrent neural network as an example; this network structure can also be replaced by structures such as a long short-term memory network or a deep fully convolutional network.
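To illustrate why a convolutional recurrent structure pairs each convolutional layer with a corresponding deconvolution layer (as claim 7 puts it), the following sketch traces the size of the frequency axis through an encoder and back through a mirrored decoder using the standard output-length formulas for strided and transposed convolutions. The kernel size of 3, stride of 2, five layers and 161 input frequency bins are assumed values chosen for illustration, not parameters taken from the embodiment.

```python
def conv_out(n, kernel=3, stride=2, padding=0):
    """Output length of a strided convolution along one axis."""
    return (n + 2 * padding - kernel) // stride + 1

def deconv_out(n, kernel=3, stride=2, padding=0, output_padding=0):
    """Output length of a transposed convolution along one axis."""
    return (n - 1) * stride - 2 * padding + kernel + output_padding

def plan_encoder_decoder(freq_bins, num_layers):
    """Return encoder sizes and, per decoder layer, (size, output_padding)."""
    encoder = [freq_bins]
    for _ in range(num_layers):
        encoder.append(conv_out(encoder[-1]))
    decoder = []
    size = encoder[-1]
    for target in reversed(encoder[:-1]):
        # Bins lost to floor division in the matching encoder layer.
        extra = target - deconv_out(size)
        size = deconv_out(size, output_padding=extra)
        decoder.append((size, extra))
    return encoder, decoder

encoder_sizes, decoder_plan = plan_encoder_decoder(freq_bins=161, num_layers=5)
print(encoder_sizes)  # [161, 80, 39, 19, 9, 4]
print(decoder_plan)   # [(9, 0), (19, 0), (39, 0), (80, 1), (161, 0)]
```

Under these assumptions the decoder restores the original 161 bins, so the predicted magnitude spectrum has the same shape as the target; the recurrent layers would sit between the last encoder layer and the first decoder layer.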
The invention provides a deep learning voice extraction and noise reduction method fusing bone vibration sensor and microphone signals. It combines the respective advantages of the bone vibration sensor signal and the traditional microphone signal and uses the strong modeling capability of a deep neural network to achieve high speech restoration and strong noise suppression. It can solve the speech extraction problem in complex noise scenes, extracting the target speech and reducing interfering noise, and it adopts a single-microphone structure to reduce implementation complexity and cost.
Although the present invention has been described with reference to the above embodiments, the scope of the present invention is not limited thereto, and modifications, substitutions and the like of the above members are intended to fall within the scope of the claims of the present invention without departing from the spirit of the present invention.

Claims (11)

1. A deep learning noise reduction method fusing bone vibration sensor and microphone signals, characterized by comprising the following steps:
S1, collecting audio signals with the bone vibration sensor and the microphone to obtain a bone vibration sensor audio signal and a microphone audio signal, respectively;
S2, inputting the bone vibration sensor audio signal into a high-pass filtering module and performing high-pass filtering;
S3, inputting the high-pass-filtered bone vibration sensor audio signal and the microphone audio signal into a deep neural network module;
S4, predicting, by the deep neural network module, the fused and noise-reduced speech.
2. The method of claim 1, wherein the high-pass filtering module corrects a DC offset of the audio signal of the bone vibration sensor and filters out low-frequency noise signals.
3. The method for deep learning noise reduction by fusing bone vibration sensor and microphone signals according to claim 2, wherein, after high-pass filtering, the frequency range of the bone vibration sensor audio signal is preferably further widened by high-frequency reconstruction (i.e. band widening) to above two kilohertz before the signal is input into the deep neural network module.
4. The method according to claim 3, wherein the result of the high-frequency reconstruction (band widening) of the bone vibration sensor signal can also be output directly.
5. The method of claim 1, wherein the deep neural network module further comprises a fusion module, and the fusion module fuses and denoises the microphone audio signal and the bone vibration sensor audio signal.
6. The method for deep learning noise reduction by fusing bone vibration sensor and microphone signals according to claim 5, characterized in that one implementation of the deep neural network module is a convolutional recurrent neural network, which predicts the clean speech magnitude spectrum.
7. The method as claimed in claim 1, wherein the deep neural network module comprises several layers of convolutional networks, several layers of long-short term memory networks and several layers of corresponding deconvolution networks.
8. The method as claimed in claim 6, wherein the training target of the deep neural network module is the clean speech magnitude spectrum: the clean speech is first subjected to a short-time Fourier transform, and the resulting clean speech magnitude spectrum is used as the training target, i.e. the target magnitude spectrum.
9. The method for deep learning noise reduction by fusing bone vibration sensor and microphone signals according to claim 6, wherein the input signal of the deep neural network module is formed by stacking the magnitude spectrum of the bone vibration sensor audio signal and the magnitude spectrum of the microphone audio signal;
the bone vibration sensor audio signal and the microphone audio signal are first each subjected to a short-time Fourier transform to obtain two magnitude spectra, which are then stacked.
10. The method as claimed in claim 9, wherein the stacked magnitude spectrum is passed through the deep neural network module to obtain a predicted magnitude spectrum, and the predicted magnitude spectrum is output.
11. The method for deep learning noise reduction by fusing bone vibration sensor and microphone signals according to claim 8 or 10, characterized in that the mean square error between the target magnitude spectrum and the predicted magnitude spectrum is computed.
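As a minimal sketch of the quantity named in claim 11, the mean square error between a target magnitude spectrum and a predicted magnitude spectrum could be computed as below. The function name and the small 2-by-2 arrays are hypothetical illustration values, not data from the patent.

```python
import numpy as np

def magnitude_mse(target_mag, predicted_mag):
    """Mean square error between two magnitude spectra of equal shape."""
    target_mag = np.asarray(target_mag, dtype=float)
    predicted_mag = np.asarray(predicted_mag, dtype=float)
    if target_mag.shape != predicted_mag.shape:
        raise ValueError("magnitude spectra must have the same shape")
    return float(np.mean((target_mag - predicted_mag) ** 2))

# Toy spectra with 2 frames and 2 frequency bins (illustrative values only).
target = np.array([[1.0, 2.0], [3.0, 4.0]])
predicted = np.array([[1.0, 2.5], [2.0, 4.0]])

loss = magnitude_mse(target, predicted)
print(loss)  # 0.3125
```

In training, a value of this kind would typically serve as the loss to be minimized over the network parameters.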

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910953534.9A CN110931031A (en) 2019-10-09 2019-10-09 Deep learning voice extraction and noise reduction method fusing bone vibration sensor and microphone signals
TW109134873A TWI763073B (en) 2019-10-09 2020-10-08 Deep learning based noise reduction method using both bone-conduction sensor and microphone signals

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910953534.9A CN110931031A (en) 2019-10-09 2019-10-09 Deep learning voice extraction and noise reduction method fusing bone vibration sensor and microphone signals

Publications (1)

Publication Number Publication Date
CN110931031A true CN110931031A (en) 2020-03-27

Family

ID=69849105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910953534.9A Pending CN110931031A (en) 2019-10-09 2019-10-09 Deep learning voice extraction and noise reduction method fusing bone vibration sensor and microphone signals

Country Status (2)

Country Link
CN (1) CN110931031A (en)
TW (1) TWI763073B (en)

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101199006A (en) * 2005-06-20 2008-06-11 微软公司 Multi-sensory speech enhancement using a clean speech prior
CN101510905A (en) * 2004-02-24 2009-08-19 微软公司 Method and apparatus for multi-sensory speech enhancement on a mobile device
CN104780486A (en) * 2014-01-13 2015-07-15 Dsp集团有限公司 Use of microphones with vsensors for wearable devices
CN107300971A (en) * 2017-06-09 2017-10-27 深圳大学 The intelligent input method and system propagated based on osteoacusis vibration signal
CN107452389A (en) * 2017-07-20 2017-12-08 大象声科(深圳)科技有限公司 A kind of general monophonic real-time noise-reducing method
CN107886967A (en) * 2017-11-18 2018-04-06 中国人民解放军陆军工程大学 A kind of bone conduction sound enhancement method of depth bidirectional gate recurrent neural network
CN108681709A (en) * 2018-05-16 2018-10-19 深圳大学 Intelligent input method and system based on osteoacusis vibration and machine learning
CN108986834A (en) * 2018-08-22 2018-12-11 中国人民解放军陆军工程大学 The blind Enhancement Method of bone conduction voice based on codec framework and recurrent neural network
CN109151635A (en) * 2018-08-15 2019-01-04 恒玄科技(上海)有限公司 Realize the automatic switchover system and method that active noise reduction and the outer sound of ear are picked up
CN109195042A (en) * 2018-07-16 2019-01-11 恒玄科技(上海)有限公司 The high-efficient noise-reducing earphone and noise reduction system of low-power consumption
US20190045298A1 (en) * 2018-01-12 2019-02-07 Intel Corporation Apparatus and methods for bone conduction context detection
CN109346075A (en) * 2018-10-15 2019-02-15 华为技术有限公司 Identify user speech with the method and system of controlling electronic devices by human body vibration
CN109841226A (en) * 2018-08-31 2019-06-04 大象声科(深圳)科技有限公司 A kind of single channel real-time noise-reducing method based on convolution recurrent neural network
US10313782B2 (en) * 2017-05-04 2019-06-04 Apple Inc. Automatic speech recognition triggering system
US10433075B2 (en) * 2017-09-12 2019-10-01 Whisper.Ai, Inc. Low latency audio enhancement

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7593535B2 (en) * 2006-08-01 2009-09-22 Dts, Inc. Neural network filtering techniques for compensating linear and non-linear distortion of an audio transducer
US11007081B2 (en) * 2018-03-05 2021-05-18 Intel Corporation Hearing protection and communication apparatus using vibration sensors
CN110010143B (en) * 2019-04-19 2020-06-09 出门问问信息科技有限公司 Voice signal enhancement system, method and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
朱颖莉: ""基于多传感器的语音增强技术研究"", 《中国优秀硕士学位论文全文数据库(信息科技辑)》 *
李敏杰: ""骨导和气导结合的语音增强系统搭建"", 《中国优秀硕士学位论文全文数据库(信息科技辑)》 *

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4141867A4 (en) * 2020-05-29 2023-06-14 Huawei Technologies Co., Ltd. Voice signal processing method and related device therefor
CN111916101B (en) * 2020-08-06 2022-01-21 大象声科(深圳)科技有限公司 Deep learning noise reduction method and system fusing bone vibration sensor and double-microphone signals
CN111916101A (en) * 2020-08-06 2020-11-10 大象声科(深圳)科技有限公司 Deep learning noise reduction method and system fusing bone vibration sensor and double-microphone signals
WO2022027423A1 (en) * 2020-08-06 2022-02-10 大象声科(深圳)科技有限公司 Deep learning noise reduction method and system fusing signal of bone vibration sensor with signals of two microphones
CN112055278A (en) * 2020-08-17 2020-12-08 大象声科(深圳)科技有限公司 Deep learning noise reduction method and device integrating in-ear microphone and out-of-ear microphone
CN112055278B (en) * 2020-08-17 2022-03-08 大象声科(深圳)科技有限公司 Deep learning noise reduction device integrated with in-ear microphone and out-of-ear microphone
WO2022036761A1 (en) * 2020-08-17 2022-02-24 大象声科(深圳)科技有限公司 Deep learning noise reduction method that fuses in-ear microphone and on-ear microphone, and device
CN111741419B (en) * 2020-08-21 2020-12-04 瑶芯微电子科技(上海)有限公司 Bone conduction sound processing system, bone conduction microphone and signal processing method thereof
CN111741419A (en) * 2020-08-21 2020-10-02 瑶芯微电子科技(上海)有限公司 Bone conduction sound processing system, bone conduction microphone and signal processing method thereof
CN111988702B (en) * 2020-08-25 2022-02-25 歌尔科技有限公司 Audio signal processing method, electronic device and storage medium
CN111988702A (en) * 2020-08-25 2020-11-24 歌尔科技有限公司 Audio signal processing method, electronic device and storage medium
US11622208B2 (en) 2020-09-08 2023-04-04 British Cayman Islands Intelligo Technology Inc. Apparatus and method for own voice suppression
TWI767696B (en) * 2020-09-08 2022-06-11 英屬開曼群島商意騰科技股份有限公司 Apparatus and method for own voice suppression
CN112019967B (en) * 2020-09-09 2022-07-22 歌尔科技有限公司 Earphone noise reduction method and device, earphone equipment and storage medium
CN112019967A (en) * 2020-09-09 2020-12-01 歌尔科技有限公司 Earphone noise reduction method and device, earphone equipment and storage medium
CN112017687B (en) * 2020-09-11 2024-03-29 歌尔科技有限公司 Voice processing method, device and medium of bone conduction equipment
CN112017687A (en) * 2020-09-11 2020-12-01 歌尔科技有限公司 Voice processing method, device and medium of bone conduction equipment
CN112412538A (en) * 2020-11-11 2021-02-26 中煤科工开采研究院有限公司 Rock burst monitoring and early warning system
WO2022160593A1 (en) * 2021-01-28 2022-08-04 歌尔股份有限公司 Speech enhancement method, apparatus and system, and computer-readable storage medium
CN113113001A (en) * 2021-04-20 2021-07-13 深圳市友杰智新科技有限公司 Human voice activation detection method and device, computer equipment and storage medium
CN113411698A (en) * 2021-06-21 2021-09-17 歌尔科技有限公司 Audio signal processing method and intelligent sound box
CN113421583B (en) * 2021-08-23 2021-11-05 深圳市中科蓝讯科技股份有限公司 Noise reduction method, storage medium, chip and electronic device
CN113421580A (en) * 2021-08-23 2021-09-21 深圳市中科蓝讯科技股份有限公司 Noise reduction method, storage medium, chip and electronic device
US11664003B2 (en) 2021-08-23 2023-05-30 Shenzhen Bluetrum Technology Co., Ltd. Method for reducing noise, storage medium, chip and electronic equipment
US11670279B2 (en) 2021-08-23 2023-06-06 Shenzhen Bluetrum Technology Co., Ltd. Method for reducing noise, storage medium, chip and electronic equipment
CN113421583A (en) * 2021-08-23 2021-09-21 深圳市中科蓝讯科技股份有限公司 Noise reduction method, storage medium, chip and electronic device
CN113421580B (en) * 2021-08-23 2021-11-05 深圳市中科蓝讯科技股份有限公司 Noise reduction method, storage medium, chip and electronic device
CN114167315A (en) * 2021-11-18 2022-03-11 广东亿嘉和科技有限公司 Intelligent online monitoring system and method for transformer

Also Published As

Publication number Publication date
TWI763073B (en) 2022-05-01
TW202115718A (en) 2021-04-16

Similar Documents

Publication Publication Date Title
CN110931031A (en) Deep learning voice extraction and noise reduction method fusing bone vibration sensor and microphone signals
CN109065067B (en) Conference terminal voice noise reduction method based on neural network model
US11825279B2 (en) Robust estimation of sound source localization
WO2021068120A1 (en) Deep learning speech extraction and noise reduction method fusing signals of bone vibration sensor and microphone
CN107479030B (en) Frequency division and improved generalized cross-correlation based binaural time delay estimation method
JP6703525B2 (en) Method and device for enhancing sound source
CN111916101B (en) Deep learning noise reduction method and system fusing bone vibration sensor and double-microphone signals
US8781137B1 (en) Wind noise detection and suppression
KR20130108063A (en) Multi-microphone robust noise suppression
US9378754B1 (en) Adaptive spatial classifier for multi-microphone systems
CN110600050A (en) Microphone array voice enhancement method and system based on deep neural network
CN110931027A (en) Audio processing method and device, electronic equipment and computer readable storage medium
EP3726529A1 (en) Method and apparatus for determining a deep filter
Liu et al. DRC-NET: Densely connected recurrent convolutional neural network for speech dereverberation
CN106328160B (en) Noise reduction method based on double microphones
Stachurski et al. Sound source localization for video surveillance camera
WO2022027423A1 (en) Deep learning noise reduction method and system fusing signal of bone vibration sensor with signals of two microphones
JP2007251354A (en) Microphone and sound generation method
Wang et al. Distributed microphone speech enhancement based on deep learning
KR20200128684A (en) Audio noise reduction method and apparatus
Bagekar et al. Dual channel coherence based speech enhancement with wavelet denoising
KR101022457B1 (en) Method to combine CASA and soft mask for single-channel speech separation
RU2788939C1 (en) Method and apparatus for defining a deep filter
CN113936687B (en) Method for real-time voice separation voice transcription
Azarpour et al. Adaptive binaural noise reduction based on matched-filter equalization and post-filtering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 533, podium building 12, Shenzhen Bay science and technology ecological park, No.18, South Keji Road, high tech community, Yuehai street, Nanshan District, Shenzhen, Guangdong 518000

Applicant after: ELEVOC TECHNOLOGY Co.,Ltd.

Address before: 2206, phase I, International Students Pioneer Building, 29 Gaoxin South Ring Road, Yuehai street, Nanshan District, Shenzhen, Guangdong 518000

Applicant before: ELEVOC TECHNOLOGY Co.,Ltd.