CN110931031A

CN110931031A - Deep learning voice extraction and noise reduction method fusing bone vibration sensor and microphone signals

Info

Publication number: CN110931031A
Application number: CN201910953534.9A
Authority: CN
Inventors: 闫永杰
Original assignee: Elephant Acoustical (shenzhen) Technology Co Ltd
Current assignee: Elephant Acoustical (shenzhen) Technology Co Ltd
Priority date: 2019-10-09
Filing date: 2019-10-09
Publication date: 2020-03-27
Also published as: TWI763073B; TW202115718A

Abstract

The invention relates to a deep learning noise reduction method that fuses bone vibration sensor and microphone signals, comprising the following steps: a bone vibration sensor and a microphone collect audio signals to obtain a bone vibration sensor audio signal and a microphone audio signal, respectively; the bone vibration sensor audio signal is input into a high-pass filtering module and high-pass filtered; the high-pass filtered bone vibration sensor audio signal, or that signal after frequency band widening, is input together with the microphone audio signal into a deep neural network module; and the deep neural network module predicts the noise-reduced speech. The invention combines the signals of the bone vibration sensor and a conventional microphone and uses the strong modeling capability of the deep neural network to achieve high speech fidelity and strong noise suppression. It can solve the speech extraction problem in complex noise scenes, extracting the target speech and reducing interfering noise, and it can use a single-microphone structure to reduce cost. In addition, the band-widened bone vibration sensor audio signal can be used directly as the output.

Description

Deep learning voice extraction and noise reduction method fusing bone vibration sensor and microphone signals

Technical Field

The invention relates to the technical field of voice noise reduction of electronic equipment, in particular to a deep learning noise reduction method fusing bone vibration sensors and microphone signals.

Background

Speech noise reduction separates a speech signal from noisy speech and is widely used. It is generally divided into single-microphone and multi-microphone techniques, and the traditional forms of both have shortcomings. Traditional single-microphone noise reduction assumes in advance that the noise is stationary, so its adaptability is poor and its applicability limited. Traditional multi-microphone noise reduction needs two or more microphones, which raises cost and places higher demands on, and therefore constrains, the structural design of the product. In addition, multi-microphone noise reduction relies on direction information, so it cannot suppress noise arriving from the direction of the target speech.

The traditional multi-microphone and single-microphone communication noise reduction technology has the following defects:

1. the number of the microphones and the cost are in a linear relation, and the more the number of the microphones is, the higher the cost is;

2. the multi-microphone has higher requirements on the structural design of the product, and the structural design of the product is limited;

3. the multi-microphone noise reduction technology is used for noise reduction depending on direction information, and noise from a direction close to a target human voice cannot be suppressed;

4. the single-microphone noise reduction technology relies on noise estimation and assumes in advance that the noise is stationary, which limits its applicability.

The invention combines the signals of the bone vibration sensor and a conventional microphone and fuses them by deep learning to achieve noise reduction, extracting the target speech and reducing interfering noise in a variety of noise environments. The technology can be applied to communication scenarios in which the device, such as an earphone or a mobile phone, rests against the ear (or another body part). In contrast to techniques that use only one or more microphones for noise reduction, combining a bone vibration sensor maintains a good call experience even in environments with a very low signal-to-noise ratio, such as subways or wind noise. Compared with the traditional single-microphone noise reduction technology, the technology makes no assumption about the noise (the traditional single-microphone technology assumes in advance that the noise is stationary); it uses the strong modeling capability of the deep neural network to achieve high speech fidelity and strong noise suppression, and can solve the speech extraction problem in complex noise scenes. Compared with traditional multi-microphone noise reduction, which needs two or more microphones for beamforming, the technology uses a single microphone.

Compared with an air conduction microphone, a bone vibration sensor captures mainly the low-frequency range, but it is not disturbed by air-conducted noise. Other noise reduction approaches that combine a bone vibration sensor with an air conduction microphone use the bone vibration sensor signal only as a voice activity detection flag. In contrast, this technology takes the bone conduction signal as a low-frequency input signal and, after optional high-frequency reconstruction, feeds it together with the microphone signal into a deep neural network for joint fusion and noise reduction. The bone vibration sensor provides a high-quality low-frequency signal, which greatly improves the accuracy of the deep neural network's prediction and therefore the noise reduction effect.

Compared with the patent with application number 201710594168.3 (entitled a universal single-channel real-time noise reduction method), the present method introduces a bone vibration sensor signal. Because the bone vibration sensor is not disturbed by air-conducted noise, fusing its signal with the air conduction microphone signal in a deep neural network achieves high-quality noise reduction even at an extremely low signal-to-noise ratio.

Unlike the patent with application number 201811199154.2 (entitled a system for recognizing the voice of the user through human body vibration to control the electronic equipment), which uses the bone vibration sensor signal as a voice activity detection flag, the present method takes the bone vibration sensor signal and the microphone signal together as the input of a deep neural network and fuses them at the signal level, achieving a high-quality noise reduction effect.

Disclosure of Invention

The invention aims to solve the technical problems of the prior art, namely that multiple microphones constrain the product structure and cost too much and that traditional single-microphone noise reduction is limited, by providing a deep learning noise reduction method that fuses bone vibration sensor and microphone signals. Unlike other technologies that combine a bone vibration sensor with an air conduction microphone and use the bone vibration sensor signal only as an activity detection flag, this technology exploits the fact that the bone vibration sensor signal is not disturbed by air-conducted noise: it takes the bone conduction signal as a direct input signal and, after optional high-frequency reconstruction, feeds it together with the microphone signal into a deep neural network for joint fusion and noise reduction. The bone vibration sensor provides a high-quality low-frequency signal, which greatly improves the accuracy of the deep neural network's prediction and therefore the noise reduction effect.

The technical scheme adopted by the invention to solve the technical problems is as follows: a deep learning noise reduction method fusing bone vibration sensor and microphone signals is constructed, which combines the respective advantages of the bone vibration sensor signal and the conventional microphone signal and uses deep learning speech extraction and noise reduction to extract the target speech and reduce interfering noise in a variety of noise environments. The technology can be applied to communication scenarios in which the device, such as an earphone or a mobile phone, rests against the ear (or another body part), and is low in cost and easy to implement.

In the deep learning noise reduction method for fusing bone vibration sensor and microphone signal, the method comprises the following steps:

S1, collecting audio signals with the bone vibration sensor and the microphone to obtain a bone vibration sensor audio signal and a microphone audio signal, respectively;

S2, inputting the bone vibration sensor audio signal into a high-pass filtering module for high-pass filtering;

S3, inputting the high-pass filtered bone vibration sensor audio signal and the microphone audio signal into a deep neural network module;

S4, predicting, by the deep neural network module, the fused and noise-reduced speech.

In the deep learning noise reduction method fusing the bone vibration sensor and the microphone signal, the high-pass filtering module corrects the audio signal direct current offset of the bone vibration sensor and filters out low-frequency clutter signals.

In the deep learning noise reduction method for fusing the bone vibration sensor and the microphone signal, the audio signal of the bone vibration sensor is high-pass filtered and, preferably, its frequency range is then further widened through high-frequency reconstruction, that is, a frequency band widening method, so that the audio signal of the bone vibration sensor is widened to more than two kilohertz before it is input into the deep neural network module.

Further, only the bone vibration signal with the broadened frequency band may be used as the final output signal, thereby eliminating the need to rely on a microphone signal.

In the deep learning noise reduction method for fusing the bone vibration sensor and the microphone signal, the deep neural network module further comprises a fusion module, and the fusion module fuses and reduces noise of the microphone audio signal and the bone vibration sensor audio signal.

In the deep learning noise reduction method fusing the bone vibration sensor and the microphone signal, one implementation of the deep neural network module is a convolutional recurrent neural network, which predicts the clean speech magnitude spectrum.

In the deep learning noise reduction method fusing the bone vibration sensor and the microphone signal, the deep neural network module is composed of several convolutional layers, several long short-term memory layers, and a corresponding number of deconvolution layers.

In the deep learning noise reduction method fusing the bone vibration sensor and the microphone signals, the training target of the deep neural network module is the clean speech magnitude spectrum. A short-time Fourier transform is first applied to the clean speech, and the resulting clean speech amplitude spectrum is used as the training target, namely the target amplitude spectrum.

In the deep learning noise reduction method fusing the bone vibration sensor and the microphone signal, an input signal of a deep neural network module is formed by stacking the amplitude spectrum of the audio signal of the bone vibration sensor (or the amplitude spectrum after the frequency band is widened) and the amplitude spectrum of the audio signal of the microphone;

firstly, respectively carrying out short-time Fourier transform on an audio signal of the bone vibration sensor and an audio signal of the microphone, respectively obtaining two paths of amplitude spectrums, and stacking.

In the deep learning noise reduction method fusing the bone vibration sensor and the microphone signals, the stacked amplitude spectrum passes through a deep neural network module to obtain a predicted amplitude spectrum, and the predicted amplitude spectrum is output.

In the deep learning noise reduction method fusing the bone vibration sensor and the microphone signals, the mean square error between the target amplitude spectrum and the predicted amplitude spectrum is computed.

The beneficial effects of the invention are as follows. It provides a deep learning speech extraction and noise reduction method fusing bone vibration sensor and microphone signals, which uses the strong modeling capability of the deep neural network to achieve high speech fidelity and strong noise suppression, and can solve the speech extraction problem in complex noise scenes. Because the bone vibration sensor is not disturbed by air-conducted noise, the method maintains a good call experience even in environments with an extremely low signal-to-noise ratio, such as subways or wind noise. Using a single microphone significantly simplifies implementation and reduces cost. Other noise reduction approaches that combine a bone vibration sensor with an air conduction microphone use the bone vibration sensor signal only as an activity detection flag. In contrast, the method exploits the fact that the bone vibration sensor signal is not disturbed by air-conducted noise: it takes the bone conduction signal as a low-frequency input signal and, after optional high-frequency reconstruction, feeds it together with the microphone signal into a deep neural network for joint fusion to obtain the speech. The bone vibration sensor provides a high-quality low-frequency signal, which greatly improves the accuracy with which the deep neural network predicts the speech and therefore the noise reduction effect.

Drawings

The invention will be further explained with reference to the drawings and the embodiments. In the drawings:

FIG. 1 is a block flow diagram of a method of deep learning noise reduction incorporating bone vibration sensor and microphone signals in accordance with the present invention;

FIG. 2 is a functional block diagram of a method of high frequency reconstruction;

FIG. 3 is a block diagram of a deep neural network fusion module structure of a deep learning noise reduction method for fusing bone vibration sensors and microphone signals according to the present invention;

FIG. 4 is a schematic diagram of a frequency spectrum of an audio signal collected by a bone vibration sensor according to the deep learning noise reduction method for integrating a bone vibration sensor and a microphone signal;

FIG. 5 is a schematic diagram of a frequency spectrum of an audio signal collected by a microphone of the deep learning noise reduction method for fusing bone vibration sensors and microphone signals according to the present invention;

FIG. 6 is a schematic diagram of a frequency spectrum of an audio signal processed by a deep learning noise reduction method for fusing bone vibration sensors and microphone signals according to the present invention;

FIG. 7 is a graph comparing the noise reduction effect of the deep learning noise reduction method fusing bone vibration sensor and microphone signals according to the present invention with that of the corresponding monaural real-time deep learning noise reduction method without a bone vibration sensor.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

As shown in fig. 1, the present invention is a deep learning speech extraction and noise reduction method fusing bone vibration sensor and microphone signals, comprising the following steps:

S1, collecting audio signals with the bone vibration sensor and the microphone to obtain a bone vibration sensor audio signal and a microphone audio signal, respectively;

S2, inputting the bone vibration sensor audio signal into a high-pass filtering module and carrying out high-pass filtering;

S3, inputting the high-pass filtered bone vibration sensor audio signal and the microphone audio signal into a deep neural network module;

S4, predicting, by the deep neural network module, the fused and noise-reduced speech.

The invention introduces a bone vibration sensor and, because the bone vibration sensor is not disturbed by air-conducted noise, fuses the bone vibration sensor signal and the air conduction microphone signal with a deep neural network, so that an ideal noise reduction effect can be achieved even at an extremely low signal-to-noise ratio.

The most advanced practical speech noise reduction scheme to date has been a feedforward deep neural network (DNN) trained on a large amount of data. Although this scheme can separate a specific speaker's voice from noisy speech not seen in training, the model does not reduce noise well for non-specific speakers. The most effective way to improve noise reduction for non-specific speakers is to add the voices of multiple speakers to the training set; however, this can cause the DNN to confuse speech with background noise and tends to make it misclassify noise as speech.

The published patent application No. 201710594168.3 (entitled a general monaural real-time noise reduction method) relates to a general monaural real-time noise reduction method comprising the following steps: receiving noisy speech in electronic form, the noisy speech comprising speech and non-speech interfering noise; extracting the short-time Fourier magnitude spectrum from the received sound frame by frame as the acoustic feature; generating a ratio mask frame by frame using a deep recurrent neural network with long short-term memory; masking the magnitude spectrum of the noisy speech with the generated ratio mask; and resynthesizing the speech waveform by inverse Fourier transform from the masked magnitude spectrum and the original phase of the noisy speech. That method performs speech noise reduction by supervised learning and estimates the ideal ratio mask with a long short-term memory recurrent neural network. Its recurrent neural network is trained on a large amount of noisy speech covering a variety of real acoustic scenes and microphone impulse responses, and thereby achieves general speech noise reduction independent of background noise, speaker, and transmission channel. Monaural noise reduction means processing the signal acquired by a single microphone; compared with beamforming microphone-array methods, it is more widely applicable and lower in cost. That invention also introduces a technique that removes the dependence on future time frames, enabling efficient computation of the recurrent neural network model during noise reduction; by further simplifying the computation without affecting noise reduction performance, it constructs a very small recurrent neural network model and achieves real-time speech noise reduction.
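The ratio-mask idea described for the cited monaural method can be illustrated with a minimal sketch. It assumes one common definition of the ideal ratio mask, the square root of speech energy over speech-plus-noise energy per time-frequency bin; the cited application may define its mask differently, and the function names `ideal_ratio_mask` and `apply_mask` are hypothetical.

```python
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag, eps=1e-8):
    """One common ideal ratio mask: sqrt(S^2 / (S^2 + N^2)) per time-frequency bin.

    This definition is an assumption for illustration, not taken from the cited patent.
    """
    speech_pow = np.asarray(speech_mag, dtype=np.float64) ** 2
    noise_pow = np.asarray(noise_mag, dtype=np.float64) ** 2
    return np.sqrt(speech_pow / (speech_pow + noise_pow + eps))

def apply_mask(noisy_mag, mask):
    """Mask the noisy magnitude spectrum element-wise."""
    return np.asarray(noisy_mag, dtype=np.float64) * mask

# Toy example with two bins: one dominated by speech plus noise, one pure noise.
speech = np.array([3.0, 0.0])
noise = np.array([4.0, 1.0])
noisy = np.array([5.0, 1.0])   # magnitudes chosen so that 3^2 + 4^2 = 5^2
mask = ideal_ratio_mask(speech, noise)
masked = apply_mask(noisy, mask)
```

In a trained system the mask would be predicted by the network from the noisy features rather than computed from the clean speech and noise, which are only available when building training targets.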

Further, the present invention introduces a bone vibration sensor. The bone vibration sensor can collect low-frequency speech and is not disturbed by air-conducted noise. Fusing the bone vibration sensor signal and the air conduction microphone signal with a deep neural network achieves an ideal full-band noise reduction effect even at an extremely low signal-to-noise ratio. The bone vibration sensor used in this embodiment is an existing, known device.

Speech signals are strongly correlated along the time dimension, and this correlation is very helpful for speech separation. To exploit this context information and improve separation performance, the deep-neural-network-based method splices the current frame and its consecutive preceding and following frames into a higher-dimensional vector that serves as the input feature. The method is executed by a computer program: it extracts acoustic features from the noisy speech, estimates an ideal time-frequency ratio mask, and resynthesizes the noise-reduced speech waveform. The method comprises one or more program modules and can run on any system or hardware device capable of executing the computer program instructions of those modules.
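The frame splicing described above can be sketched as follows. The context widths and the edge handling (repeating the first and last frame) are assumptions made for illustration, and `splice_context` is a hypothetical helper name.

```python
import numpy as np

def splice_context(features, left=2, right=2):
    """Concatenate each frame with `left` preceding and `right` following frames.

    features: array of shape (n_frames, n_dims).
    Edge frames are padded by repeating the first/last frame (an assumption).
    Returns an array of shape (n_frames, (left + 1 + right) * n_dims).
    """
    features = np.asarray(features, dtype=np.float64)
    padded = np.pad(features, ((left, right), (0, 0)), mode="edge")
    n_frames = features.shape[0]
    shifted = [padded[i:i + n_frames] for i in range(left + right + 1)]
    return np.concatenate(shifted, axis=1)

# Six frames of two features each, spliced with one frame of context per side.
frames = np.arange(12, dtype=np.float64).reshape(6, 2)
spliced = splice_context(frames, left=1, right=1)
```

Each output row holds the previous, current, and next frame side by side, so a feedforward network sees temporal context without needing recurrence.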

Furthermore, the high-pass filtering module corrects the direct current offset of the audio signal of the bone vibration sensor and filters out low-frequency clutter signals.

Further, the high-pass filtering module may be implemented with a digital filter.
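As one possible digital filter for this purpose, the sketch below uses a first-order DC-blocking high-pass filter. The text does not specify the filter type or cutoff, so the filter structure, the coefficient `r`, and the 16 kHz sampling rate are all assumptions, and `dc_block_highpass` is a hypothetical name.

```python
import numpy as np

def dc_block_highpass(x, r=0.995):
    """First-order DC blocker: y[n] = x[n] - x[n-1] + r * y[n-1].

    r close to 1 gives a low cutoff frequency; the value here is an assumed example.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.zeros_like(x)
    prev_x = 0.0
    prev_y = 0.0
    for n, sample in enumerate(x):
        prev_y = sample - prev_x + r * prev_y
        prev_x = sample
        y[n] = prev_y
    return y

# Synthetic bone-sensor signal: a 200 Hz tone riding on a constant offset of 0.5.
fs = 16000                      # assumed sampling rate in Hz
t = np.arange(fs) / fs
bone = 0.5 + 0.1 * np.sin(2 * np.pi * 200 * t)
filtered = dc_block_highpass(bone)
```

After the initial transient decays, the constant offset is removed while the 200 Hz component passes almost unchanged. A higher-order filter would give a sharper cutoff for suppressing low-frequency noise.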

Further, the bone vibration sensor audio signal is high-pass filtered and, preferably, then undergoes high-frequency reconstruction. That is, a frequency band widening method further widens its frequency range so that the bone vibration sensor audio signal is widened to more than two kilohertz before it is input into the deep neural network module.

Further, the high-frequency reconstruction module is used for further widening the bandwidth of the bone vibration signal and is an optional module.

Further, there are many methods for high frequency reconstruction, and a deep neural network is the most effective method at present, and the structure of a deep neural network is given as an example in this embodiment.

In summary, the audio signal of the bone vibration sensor is high-pass filtered to correct the direct current offset of the bone conduction signal and filter out low-frequency noise. The bone vibration signal is then widened to more than 2 kHz by a frequency band widening (high-frequency reconstruction) method; this step is optional, and the original bone vibration signal from step S1 can be used directly. The output of step S2 and the microphone signal are sent to the deep neural network module, and the deep neural network module predicts the fused and noise-reduced speech.

As shown in fig. 2, high-frequency reconstruction further widens the frequency range of the bone vibration signal. It may be performed with a deep neural network, which can be implemented in various ways; fig. 2 shows one of them (the method is not limited to this network), namely a high-frequency reconstruction method based on a deep recurrent neural network with long short-term memory.

The published patent application No. 201811199154.2 (entitled system for recognizing user's voice by human body vibration to control electronic equipment) comprises: a human body vibration sensor for sensing the human body vibration of a user; a processing circuit, coupled with the human body vibration sensor, for controlling a sound pickup device to start picking up sound when the output signal of the human body vibration sensor is determined to include a user voice signal; and a communication module, coupled with the processing circuit and the sound pickup device, for communication between them. Unlike that patent, which uses the bone vibration sensor signal as a voice activity detection flag, the present method takes the bone vibration sensor signal and the microphone signal together as the input of a deep neural network and fuses them deeply at the signal level, achieving an excellent noise reduction effect.

Furthermore, the deep neural network module also comprises a fusion module, and the fusion module based on the deep neural network is used for completing the fusion of the microphone audio signal and the bone vibration sensor audio signal and reducing the noise.

Further, one implementation of the deep neural network module is a convolutional recurrent neural network, which predicts the clean speech magnitude spectrum (Speech Magnitude Spectrum).

Furthermore, the network structure in the deep-neural-network-based fusion module takes the convolutional recurrent neural network as an example; it can also be replaced by other structures such as a long short-term memory network or a deep fully convolutional network.

As an example, the deep neural network module may be composed of three layers of convolutional networks, three layers of long-short term memory networks, and three layers of deconvolution networks.
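To make the three-convolution, three-LSTM, three-deconvolution layout more concrete, the sketch below only tracks how the frequency dimension could shrink through the encoder and be restored by the decoder. The kernel size, stride, padding, channel counts, and the 257-bin input are assumptions chosen so that the sizes round-trip; the text does not specify them.

```python
def conv_out(n, kernel, stride, pad):
    """Output length of a strided convolution along one axis."""
    return (n + 2 * pad - kernel) // stride + 1

def deconv_out(n, kernel, stride, pad):
    """Output length of a transposed convolution along one axis (no output padding)."""
    return (n - 1) * stride - 2 * pad + kernel

# Assumed hyperparameters, for illustration only.
freq_bins = 257                 # e.g. a 512-point STFT
kernel, stride, pad = 3, 2, 1
channels = [2, 16, 32, 64]      # 2 input channels: bone and microphone magnitudes

# Encoder: three convolution layers reduce the frequency axis.
encoder_sizes = [freq_bins]
for _ in range(3):
    encoder_sizes.append(conv_out(encoder_sizes[-1], kernel, stride, pad))

# Bottleneck: three LSTM layers would run over time on per-frame vectors of this size.
lstm_input_size = channels[-1] * encoder_sizes[-1]

# Decoder: three deconvolution layers restore the frequency axis.
decoder_sizes = [encoder_sizes[-1]]
for _ in range(3):
    decoder_sizes.append(deconv_out(decoder_sizes[-1], kernel, stride, pad))
```

With these assumed settings the decoder output has the same number of frequency bins as the input, so the predicted magnitude spectrum can be compared bin by bin with the target.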

Fig. 3 shows a structural block diagram of the deep neural network fusion module of the deep learning noise reduction method fusing bone vibration sensor and microphone signals, illustrating a convolutional recurrent neural network implementation of the deep neural network module. The training target (Training Target) of the deep neural network module is the clean speech magnitude spectrum (Speech Magnitude Spectrum): a short-time Fourier transform (STFT) is applied to the clean speech (Clean Speech), and the resulting clean speech magnitude spectrum is used as the training target, that is, the target magnitude spectrum (Target Magnitude Spectrum).

Further, the input signal of the deep neural network module is formed by Stacking (Stacking) the amplitude spectrum of the bone vibration sensor audio signal and the amplitude spectrum of the microphone audio signal;

first, a short-time Fourier transform (STFT) is applied to the bone vibration sensor audio signal and to the microphone audio signal separately, giving two magnitude spectra (Magnitude Spectra), which are then stacked (Stacking).
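The input construction described above can be sketched with NumPy. The frame length, hop size, Hann window, and the choice to stack along a leading channel axis are assumptions; `stft` and `stacked_magnitudes` are hypothetical helper names, and random noise stands in for real recordings.

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Short-time Fourier transform with a Hann window.

    Returns a complex array of shape (n_frames, frame_len // 2 + 1).
    """
    x = np.asarray(x, dtype=np.float64)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def stacked_magnitudes(bone, mic, frame_len=512, hop=256):
    """Stack the two magnitude spectra into one (2, n_frames, n_bins) input."""
    bone_mag = np.abs(stft(bone, frame_len, hop))
    mic_mag = np.abs(stft(mic, frame_len, hop))
    return np.stack([bone_mag, mic_mag], axis=0)

# One second of synthetic data at an assumed 16 kHz sampling rate.
rng = np.random.default_rng(0)
bone = rng.standard_normal(16000)
mic = rng.standard_normal(16000)
features = stacked_magnitudes(bone, mic)
```

Stacking along a channel axis lets a convolutional front end treat the two sensors like the channels of an image; concatenating along the frequency axis would be an equally plausible reading of "stacking".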

Further, the stacked magnitude spectrum is passed through the deep neural network module to obtain the predicted magnitude spectrum (Estimated Magnitude Spectrum), which is output.

Further, the mean square error (MSE) between the target magnitude spectrum and the predicted magnitude spectrum (Estimated Magnitude Spectrum) is computed; the MSE is a measure of the degree of difference between an estimate and the quantity being estimated. Furthermore, the training process (Training) updates the network parameters by back propagation and gradient descent, repeatedly feeding in training data and updating the parameters until the network converges.
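The training loop described above can be illustrated on a toy problem. The sketch below replaces the deep network with a single linear layer so that the MSE gradient can be written by hand; the data are synthetic, and the sizes, learning rate, and step count are assumptions. It is meant only to show the loss and the gradient descent update, not the actual network.

```python
import numpy as np

def mse(target, estimate):
    """Mean square error between target and estimated magnitude spectra."""
    return float(np.mean((target - estimate) ** 2))

rng = np.random.default_rng(0)
n_frames, n_bins = 64, 8

# Synthetic stand-ins: stacked bone + microphone magnitudes and a "clean" target.
x = rng.random((n_frames, 2 * n_bins))
true_w = rng.random((2 * n_bins, n_bins)) * 0.1
target = x @ true_w

# A single linear layer stands in for the deep network (an assumption).
w = np.zeros((2 * n_bins, n_bins))
learning_rate = 0.5
losses = []
for step in range(200):
    estimate = x @ w                                   # forward pass
    losses.append(mse(target, estimate))
    grad = 2.0 * x.T @ (estimate - target) / estimate.size   # dMSE/dw
    w -= learning_rate * grad                          # gradient descent update
```

In practice a deep learning framework would compute the gradients of the full network automatically, and training would iterate over mini-batches of real recordings.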

Further, the inference process (Inference) recovers the predicted clean speech (Clean Speech) by combining the phase of the short-time Fourier transform (STFT) of the microphone data with the predicted magnitude spectrum (Estimated Magnitude Spectrum).
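The reconstruction step can be sketched as an inverse STFT that pairs the predicted magnitude with the microphone phase. The window, frame length, hop size, and the weighted overlap-add normalization are assumptions, and `stft`, `istft`, and `reconstruct` are hypothetical helper names. The usage lines pass the microphone's own magnitude in place of a network prediction, purely as a sanity check.

```python
import numpy as np

FRAME_LEN, HOP = 512, 256                       # assumed analysis parameters
WINDOW = np.hanning(FRAME_LEN + 1)[:-1]         # periodic Hann window

def stft(x):
    """STFT returning a complex array of shape (n_frames, FRAME_LEN // 2 + 1)."""
    n_frames = 1 + (len(x) - FRAME_LEN) // HOP
    frames = np.stack([x[i * HOP:i * HOP + FRAME_LEN] * WINDOW
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec):
    """Weighted overlap-add inverse STFT, normalized by the summed squared window."""
    n_frames = spec.shape[0]
    out = np.zeros(FRAME_LEN + HOP * (n_frames - 1))
    norm = np.zeros_like(out)
    frames = np.fft.irfft(spec, n=FRAME_LEN, axis=1)
    for i in range(n_frames):
        start = i * HOP
        out[start:start + FRAME_LEN] += frames[i] * WINDOW
        norm[start:start + FRAME_LEN] += WINDOW ** 2
    return out / np.maximum(norm, 1e-8)

def reconstruct(estimated_mag, mic_spec):
    """Combine the estimated magnitude with the phase of the microphone STFT."""
    phase = np.angle(mic_spec)
    return istft(estimated_mag * np.exp(1j * phase))

# Sanity check on synthetic data: using the microphone's own magnitude
# should give back the microphone signal away from the edges.
rng = np.random.default_rng(0)
mic = rng.standard_normal(16000)
mic_spec = stft(mic)
recovered = reconstruct(np.abs(mic_spec), mic_spec)
```

Reusing the noisy microphone phase is a common simplification because the network predicts only magnitudes; the first and last half-frame are covered by a single window and are reconstructed less reliably.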

Compared with the traditional multi-microphone noise reduction technology, the method takes a single microphone as input. It is therefore robust, controllable in cost, and undemanding of the product's structural design. In this embodiment, robustness refers to how strongly the performance of the noise reduction system is affected by factors such as microphone consistency; strong robustness means that the method places no requirements on microphone consistency, microphone placement, and the like, and can adapt to a variety of microphones.

Fig. 7 compares the noise reduction effect of the deep learning noise reduction method fusing bone vibration sensor and microphone signals with that of the corresponding monaural deep learning noise reduction method without a bone vibration sensor. Specifically, the results of processing with the method of the general monaural real-time noise reduction patent (application number: 201710594168.3), labeled Only-Mic, and with the method of the present technology, labeled Sensor-Mic, are compared in eight noise scenes, giving the objective test results shown in fig. 7. The eight types of noise are: bar noise, highway noise, intersection noise, train station noise, noise in a car traveling at 130 km/h, cafe noise, noise on tables, and office noise. The test metric is the Perceptual Evaluation of Speech Quality (PESQ), an objective measure whose values range over [-0.5, 4.5]. The results show that the PESQ score improves in every scene after processing with the present technology, with an average improvement of 0.26 over the eight scenes. This shows that the technology has higher speech fidelity and stronger noise suppression capability. Because the bone vibration sensor is not disturbed by air-conducted noise, fusing its signal with the air conduction microphone signal in a deep neural network achieves an ideal noise reduction effect even at an extremely low signal-to-noise ratio.

Furthermore, compared with the traditional single-microphone noise reduction technology, the method makes no assumption about the noise (the traditional single-microphone technology generally presupposes that the noise is stationary). It uses the strong modeling capability of a deep neural network to achieve high speech fidelity and strong noise suppression, can solve the speech extraction problem in complex noise scenes, and can be applied to communication scenarios in which a device such as an earphone or a mobile phone rests against the ear (or another body part). Other noise reduction approaches that combine a bone vibration sensor with an air conduction microphone use the bone vibration sensor signal only as an activity detection flag. In contrast, the method exploits the fact that the bone vibration sensor signal is not disturbed by air-conducted noise: it takes the bone conduction signal as a low-frequency input signal and, after optional high-frequency reconstruction, feeds it together with the microphone signal into a deep neural network for joint noise reduction and fusion. The bone vibration sensor provides a high-quality low-frequency signal, which greatly improves the accuracy of the deep neural network's prediction and therefore the noise reduction effect. The band-widened bone vibration sensor signal can also be used directly as the output.

In this embodiment, the high-frequency reconstruction module is an optional module for further widening the bandwidth of the bone vibration signal. There are many methods for high-frequency reconstruction; the deep neural network is a recent method with the best effect, and the embodiment gives one deep neural network structure only as an example. Likewise, the network structure in the deep-neural-network-based fusion module takes the convolutional recurrent neural network as an example; it can also be replaced by other structures such as a long short-term memory network or a deep fully convolutional network.

The invention provides a deep learning voice extraction and noise reduction method fusing bone vibration sensor and microphone signals. It combines the respective advantages of the bone vibration sensor signal and the traditional microphone signal, achieves high speech fidelity and strong noise suppression by exploiting the strong modeling capability of a deep neural network, can solve the voice extraction problem in complex noise scenes by extracting the target voice and reducing interfering noise, and can adopt a single-microphone structure to reduce implementation complexity and cost.

Although the present invention has been described with reference to the above embodiments, its scope is not limited thereto; modifications, substitutions and the like of the above components that do not depart from the spirit of the present invention are intended to fall within the scope of the claims.

Claims

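1. A deep learning noise reduction method fusing bone vibration sensor and microphone signals, characterized by comprising the following steps:

S1, collecting audio signals with the bone vibration sensor and the microphone to obtain a bone vibration sensor audio signal and a microphone audio signal, respectively;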

S2, inputting the bone vibration sensor audio signal into a high-pass filtering module and performing high-pass filtering;

S3, inputting the high-pass-filtered bone vibration sensor audio signal and the microphone audio signal into a deep neural network module;

and S4, predicting by the deep neural network module to obtain the fused and noise-reduced voice.

2. The method of claim 1, wherein the high-pass filtering module corrects a DC offset of the audio signal of the bone vibration sensor and filters out low-frequency noise signals.

3. The method for deep learning noise reduction by fusing bone vibration sensor and microphone signals according to claim 2, wherein, after high-pass filtering, the frequency range of the bone vibration sensor audio signal is preferably further widened by high-frequency reconstruction, i.e. frequency band widening, so that the bone vibration sensor audio signal extends to above two kilohertz before it is input into the deep neural network module.

4. The method according to claim 3, wherein the result of the high-frequency reconstruction (frequency band widening) of the bone vibration sensor signal can also be output directly.

5. The method of claim 1, wherein the deep neural network module further comprises a fusion module, and the fusion module fuses and denoises the microphone audio signal and the bone vibration sensor audio signal.

6. The method for deep learning noise reduction by fusing bone vibration sensor and microphone signals according to claim 5, characterized in that one implementation of the deep neural network module is a convolutional recurrent neural network, and a clean speech magnitude spectrum is obtained by prediction.

7. The method as claimed in claim 1, wherein the deep neural network module comprises several layers of convolutional networks, several layers of long-short term memory networks and several layers of corresponding deconvolution networks.

8. The method as claimed in claim 6, wherein the training target of the deep neural network module is the clean speech magnitude spectrum: the clean speech is first subjected to a short-time Fourier transform, and its magnitude spectrum is then taken as the training target, i.e. the target magnitude spectrum.

9. The method for deep learning noise reduction by fusing bone vibration sensor and microphone signals according to claim 6, wherein the input signal of the deep neural network module is formed by stacking the magnitude spectrum of the bone vibration sensor audio signal and the magnitude spectrum of the microphone audio signal;

first, a short-time Fourier transform is applied separately to the bone vibration sensor audio signal and the microphone audio signal to obtain two magnitude spectra, which are then stacked.

10. The method as claimed in claim 9, wherein the stacked magnitude spectrum is passed through the deep neural network module to obtain a predicted magnitude spectrum, and the predicted magnitude spectrum is output.

11. The method for deep learning noise reduction by fusing bone vibration sensor and microphone signals according to claim 8 or 10, characterized in that a mean square error is computed between the target magnitude spectrum and the predicted magnitude spectrum.
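
For illustration, the mean square error between a target magnitude spectrum and a predicted magnitude spectrum can be sketched as follows. This is a minimal sketch under stated assumptions, not the training code of the invention: the function name `magnitude_mse`, the list-of-frames data layout, and the averaging over all time-frequency bins are choices made for this example only.

```python
def magnitude_mse(target, predicted):
    """Mean squared error over all time-frequency bins.

    Both arguments are lists of frames, each frame a list of magnitudes
    (an assumed layout chosen for this sketch).
    """
    if len(target) != len(predicted):
        raise ValueError("frame count mismatch")
    total = 0.0
    count = 0
    for t_frame, p_frame in zip(target, predicted):
        if len(t_frame) != len(p_frame):
            raise ValueError("bin count mismatch")
        for t, p in zip(t_frame, p_frame):
            total += (t - p) ** 2
            count += 1
    if count == 0:
        raise ValueError("empty spectrogram")
    return total / count


# Example: one of four bins differs by 2.0, so the error is 4 / 4 = 1.0.
target_spec = [[1.0, 2.0], [3.0, 4.0]]
predicted_spec = [[1.0, 2.0], [3.0, 2.0]]
loss = magnitude_mse(target_spec, predicted_spec)
```

In a training setting, a deep learning framework's built-in mean square error loss would normally play this role so that gradients can be propagated through the network.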