CN114822573A - Speech enhancement method, speech enhancement device, earphone device and computer-readable storage medium - Google Patents
Speech enhancement method, speech enhancement device, earphone device and computer-readable storage medium
- Publication number
- CN114822573A (application CN202210461573.9A)
- Authority
- CN
- China
- Prior art keywords
- signal
- frequency
- noise
- voice
- speech enhancement
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R1/00—Details of transducers, loudspeakers or microphones
- H04R1/10—Earpieces; Attachments therefor ; Earphones; Monophonic headphones
- H04R1/1083—Reduction of ambient noise
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L2021/02082—Noise filtering the noise being echo, reverberation of the speech
Abstract
The invention discloses a speech enhancement method, a speech enhancement device, earphone equipment and a computer-readable storage medium, applied to earphone equipment that comprises a vibration sensor. The method comprises: when a sound signal is collected, judging the signal type of the sound signal according to a bone conduction signal in the sound signal, wherein the bone conduction signal is collected by the vibration sensor; if the signal type is a voice signal, acquiring a posterior signal-to-noise ratio and comparing it with a preset threshold; and when the posterior signal-to-noise ratio is less than or equal to the preset threshold, performing first noise elimination processing on a high-frequency signal in the voice signal based on deep learning, and performing second noise elimination processing in the frequency domain on a low-frequency signal in the voice signal, so as to enhance the voice signal. The invention enables efficient enhancement of voice signals on low-power, low-resource earphone equipment, thereby effectively eliminating the background noise of the voice signal received by the user through the earphone equipment.
Description
Technical Field
The present invention relates to the field of audio signal processing technologies, and in particular, to a speech enhancement method and apparatus, an earphone device, and a computer-readable storage medium.
Background
In addition, when a user wearing an earphone device receives a voice signal, the signal can reach the user's inner ear through two paths: air conduction and bone conduction. However, the air-conducted voice signal is highly susceptible to contamination by environmental noise, and, owing to limitations of the prior art, only the low-frequency part of the voice signal can be conducted through bone, so the voice signal transmitted to the user's inner ear sounds relatively muffled.
Based on this, speech enhancement algorithms based on deep learning with neural network models have been developed in the prior art. Although they can effectively eliminate noise and thereby achieve speech enhancement (extracting as pure a speech signal as possible from a noisy speech signal at the receiving end, reducing listener fatigue and improving intelligibility), deep learning requires extracting high-dimensional feature vectors from the speech signal, so the overall algorithm is complex and computationally expensive, and is difficult to run in real time on low-power, low-resource embedded hardware such as earphone devices.
In summary, existing speech enhancement algorithms are complex and computationally expensive, and are difficult to run in real time on a headset device to enhance the speech signal received by a user through the headset device.
Disclosure of Invention
The present invention is directed to a method and apparatus for speech enhancement, a headset device, and a computer-readable storage medium, and aims to implement a speech enhancement algorithm on an embedded hardware device, such as a headset device, to effectively eliminate background noise of a speech signal received by a user through the headset device.
To achieve the above object, the present invention provides a speech enhancement method applied to a headphone apparatus including a vibration sensor, the speech enhancement method including:
when a sound signal is collected, judging the signal type of the sound signal according to a bone conduction signal in the sound signal, wherein the bone conduction signal is collected based on the vibration sensor;
if the signal type is a voice signal, acquiring a posterior signal-to-noise ratio and determining the size between the posterior signal-to-noise ratio and a preset threshold value;
and when the posterior signal-to-noise ratio is determined to be smaller than or equal to the preset threshold, performing first denoising processing on a high-frequency signal in the voice signal based on a deep learning mode, and performing second denoising processing on a low-frequency signal in the voice signal in a frequency domain to enhance the voice signal.
Optionally, the earphone device further includes an inner ear microphone, the low-frequency signal is acquired based on the inner ear microphone, and the step of performing second noise cancellation processing on the low-frequency signal in the voice signal in a frequency domain includes:
detecting a signal bandwidth of the low frequency signal;
when the signal bandwidth meets the preset bandwidth design condition, carrying out second noise elimination processing of a frequency domain on the low-frequency signal;
or,
and when the signal bandwidth does not meet the bandwidth design condition, performing bandwidth expansion on the low-frequency signal, and performing second noise elimination on the low-frequency signal subjected to bandwidth expansion.
Optionally, the step of determining the signal type of the sound signal according to the bone conduction signal in the sound signal includes:
calculating time domain information according to a bone conduction signal in the sound signal, wherein the time domain information comprises a zero crossing rate and a pitch period;
performing time-frequency transformation processing on the bone conduction signal to obtain a frequency domain signal, and calculating frequency domain information according to the frequency domain signal, wherein the frequency domain information comprises spectrum energy and spectrum centroid;
and performing fusion judgment on the zero crossing rate, the pitch period, the spectrum energy and the spectrum centroid to judge and obtain the signal type of the sound signal.
Optionally, after the step of enhancing the speech signal, the method further comprises:
filtering the high-frequency signal subjected to the first denoising processing and the low-frequency signal subjected to the second denoising processing respectively to obtain corresponding signals to be output;
and fusing the signals to be output, and outputting the fused signals to be output after dynamic range control.
Optionally, after the step of determining the signal type of the sound signal according to the bone conduction signal in the sound signal, the method further includes:
if the signal type is background noise, updating respective noise power spectrum estimation of the microphone and the inner ear microphone;
and setting the output signal corresponding to the background noise to be zero, or converting the background noise into a preset comfortable noise signal.
Optionally, after the step of determining the magnitude between the a posteriori signal-to-noise ratio and a preset threshold, the method further comprises:
and when the posterior signal-to-noise ratio is determined to be larger than the preset threshold value, performing third noise elimination processing on the voice signal based on a preset signal processing mode so as to enhance the voice signal.
Optionally, the speech enhancement method further comprises:
and when the sound signal is acquired, after echo cancellation processing is carried out on the sound signal, the step of judging the signal type of the sound signal according to the bone conduction signal in the sound signal and subsequent steps are executed.
To achieve the above object, the present invention further provides a speech enhancement apparatus applied to a headphone device including a vibration sensor, the speech enhancement apparatus including:
the detection module is used for judging the signal type of the sound signal according to a bone conduction signal in the sound signal when the sound signal is collected, wherein the bone conduction signal is collected based on the vibration sensor;
the determining module is used for acquiring a posterior signal-to-noise ratio and determining the size between the posterior signal-to-noise ratio and a preset threshold value if the signal type is a voice signal;
and the enhancement module is used for carrying out first noise elimination processing on a high-frequency signal in the voice signal based on a deep learning mode and carrying out second noise elimination processing on a low-frequency signal in the voice signal in a frequency domain to enhance the voice signal when the posterior signal-to-noise ratio is determined to be less than or equal to the preset threshold.
The functional modules of the speech enhancement device of the present invention implement the steps of the speech enhancement method as described above when running.
To achieve the above object, the present invention also provides an earphone device, including: a memory, a processor and a speech enhancement program stored on the memory and executable on the processor, the speech enhancement program, when executed by the processor, implementing the steps of the speech enhancement method as described above.
Furthermore, to achieve the above object, the present invention also provides a computer readable storage medium, on which a speech enhancement program is stored, which when executed by a processor implements the steps of the speech enhancement method as described above.
In the embodiment of the invention, in the process of collecting and enhancing the voice signal uttered by the wearer, the earphone device, upon collecting a sound signal, immediately judges the signal type of that sound signal according to the bone conduction signal collected by its vibration sensor. When the signal type is judged to be a voice signal, the device further obtains the posterior signal-to-noise ratio and compares it with a preset threshold. Finally, when the posterior signal-to-noise ratio is less than or equal to the preset threshold, the device performs first noise elimination processing on the high-frequency signal in the voice signal based on deep learning and, at the same time, performs second noise elimination processing in the frequency domain on the low-frequency signal in the voice signal, thereby completing the enhancement of the voice signal.
Compared with existing speech enhancement approaches, the invention judges the signal type from the bone conduction signal collected by the vibration sensor configured on the earphone device, so that when the wearer utters a voice signal, the denoising and enhancement scheme is selected according to the posterior signal-to-noise ratio: when the posterior signal-to-noise ratio is small, a deep-learning-based scheme is used to denoise the high-frequency signal of the voice signal while the low-frequency signal is denoised in the frequency domain. In this way, on a low-power, low-resource earphone device, the voice signal is efficiently enhanced based on judging the signal type from the bone conduction signal and determining the posterior signal-to-noise ratio, and the background noise of the voice signal received by the user through the earphone device is effectively eliminated.
Drawings
FIG. 1 is a flowchart illustrating a speech enhancement method according to an embodiment of the present invention;
FIG. 2 is a diagram of an application scenario involved in a speech enhancement method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an application scenario for determining a signal type according to an embodiment of the speech enhancement method of the present invention;
FIG. 4 is a schematic diagram illustrating the division of frequency domain sub-bands according to an embodiment of the speech enhancement method of the present invention;
FIG. 5 is a schematic diagram of a noise-canceling and expanding structure based on deep learning according to an embodiment of the speech enhancement method of the present invention;
FIG. 6 is a schematic structural diagram of an LSTM (Long Short-Term Memory) according to an embodiment of the speech enhancement method of the present invention;
- FIG. 7 is a schematic structural diagram of a GRU (Gated Recurrent Unit) according to an embodiment of the speech enhancement method of the present invention;
FIG. 8 is a block diagram of functional modules of a speech enhancement device according to a preferred embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the invention provides an embodiment of a voice enhancement method.
Referring to fig. 1, fig. 1 is a flowchart illustrating a speech enhancement method according to a first embodiment of the present invention. It should be noted that, although a logical order is shown in the flow chart, in some cases, the steps shown or described may be performed in an order different than that shown or described herein.
In this embodiment, the speech enhancement method provided by the present invention may be specifically applied to an earphone device equipped with a vibration sensor. The vibration sensor may be a bone voiceprint sensor or a supplementary vibration sensor; the sound signal it acquires in the low frequency band (0 to 4 kHz) has excellent integrity and fidelity, and this low-frequency-band signal is not easily affected by noise.
The speech enhancement method of the invention comprises the following steps:
step S10: when a sound signal is collected, judging the signal type of the sound signal according to a bone conduction signal in the sound signal, wherein the bone conduction signal is collected based on the vibration sensor;
In this embodiment, when the earphone device is used by a wearer for a voice call, it continuously collects, through its own sound collection devices (which include a vibration sensor), a sound signal comprising the voice signal uttered by the wearer and/or the noise signal of the wearer's environment. Whenever such a sound signal is collected, the earphone device immediately judges its signal type from the bone conduction signal collected by the vibration sensor, i.e. judges whether the sound signal is a voice signal or a noise signal.
It should be noted that, in this embodiment, the earphone device is configured with a normal microphone and an inner ear microphone in addition to the vibration sensor, the working mechanism of the inner ear microphone is similar to that of the vibration sensor, and the integrity and fidelity of the collected sound signal are also high in the low frequency band. In this way, the earphone device collects the sound signals through the vibration sensor, the normal microphone and the inner ear microphone at the same time.
Illustratively, as shown in fig. 2, the earphone device acquires sound signals in real time simultaneously through a vibration sensor, two ordinary microphones and an inner ear microphone, obtaining: a bone conduction signal collected by the vibration sensor, two high-frequency signals collected by the two ordinary microphones (shown as microphone signal 1 and microphone signal 2), and a low-frequency signal collected by the inner ear microphone (shown as the inner ear microphone signal). The earphone device then takes the sound signal comprising the bone conduction signal, the two high-frequency signals and the low-frequency signal as the input of the current noise cancellation process for speech enhancement, and further performs voice activity detection (VAD) on the bone conduction signal collected by the vibration sensor to determine whether the currently collected sound signal is a voice signal or a noise signal.
Further, in some possible embodiments, in the step S10, the step of determining the signal type of the sound signal according to the bone conduction signal in the sound signal may specifically include:
step S101: calculating time domain information according to a bone conduction signal in the sound signal, wherein the time domain information comprises a zero crossing rate and a pitch period;
in the present embodiment, when the earphone device determines the signal type of the sound signal from the bone conduction signal collected by the vibration sensor in the collected sound signal, time domain information, that is, a zero crossing rate and a pitch period are calculated based on the bone conduction signal.
It should be noted that, in this embodiment, the bone conduction signal collected by the earphone device through the vibration sensor, the high-frequency signal collected by the ordinary microphone, and the low-frequency signal collected by the inner ear microphone are all time-domain signals.
Illustratively, as shown in fig. 3, the earphone device uses a bone conduction signal collected by a vibration sensor as an input for calculating time domain information from a currently collected sound signal, so as to find a zero crossing rate and a pitch period.
It should be noted that, in this embodiment, the earphone device may specifically obtain the time-domain zero crossing rate Zn by using the following formula:

Zn = (1/2) · Σ_{n=m1..m2} | sgn(x(n)) - sgn(x(n-1)) |,

where sgn is the sign function, m1 and m2 are time-domain sample indices in the time-domain bone conduction signal, n ≤ N, and N is the length of the bone conduction signal x(n). In addition, the zero crossing rate is normalized as:

ZCR = Zn / (m2 - m1 + 1).
In addition, in this embodiment, the headphone device may specifically obtain the time-domain pitch period Pitch from the short-time autocorrelation R(m) of the bone conduction signal:

Pitch = max{ R(m) }.
Step S102: performing time-frequency transformation processing on the bone conduction signal to obtain a frequency domain signal, and calculating frequency domain information according to the frequency domain signal, wherein the frequency domain information comprises spectrum energy and spectrum centroid;
In this embodiment, when the earphone device judges the signal type of the collected sound signal from the bone conduction signal collected by the vibration sensor, in addition to directly calculating the time-domain information (zero crossing rate and pitch period) from the bone conduction signal, it also performs time-frequency transform processing on the bone conduction signal to obtain a frequency-domain signal and calculates frequency-domain information from it, namely the spectral energy and the spectral centroid.
Exemplarily, as shown in fig. 3, the earphone device uses the fast Fourier transform (FFT) to perform a time-frequency transformation on the bone conduction signal Y collected by the vibration sensor in the currently collected sound signal, obtaining the frequency-domain signal Y(k), and then uses Y(k) as the input for calculating the frequency-domain information, so as to obtain the spectral energy and the spectral centroid.
It should be noted that, in this embodiment, the frequency-domain spectral energy is the logarithmic spectral energy Eg of a specified bandwidth. In this embodiment, assuming that the earphone device divides the 8 kHz bandwidth into 128 sub-bands after the FFT of the bone conduction signal Y and takes the energy of the lowest 24 sub-bands, the formula for obtaining the frequency-domain logarithmic spectral energy Eg may specifically be:

Eg = log( Σ_{k=0..23} |Y(k)|² ).
In addition, in this embodiment, the headphone apparatus may specifically obtain the frequency-domain spectral centroid Brightness through the following formula:

Brightness = ( Σ_{k=0..N-1} f(k) · E(k) ) / ( Σ_{k=0..N-1} E(k) ),

where N denotes the number of frequency points (128 here), f(k) denotes the frequency of point k, E(k) denotes the spectral energy, and E(k) = |Y(k)|².
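A small sketch of these two frequency-domain features follows; the 256-point FFT of a 16 kHz frame (giving 128 bins across 0 to 8 kHz), the Hann window and the small epsilon are assumptions for illustration.

```python
import numpy as np

def spectral_features(frame, fs=16000, n_fft=256, low_bins=24):
    """Log energy of the low-frequency bins and spectral centroid of one frame
    (frame is assumed to contain at most n_fft samples)."""
    windowed = frame * np.hanning(len(frame))
    Y = np.fft.rfft(windowed, n_fft)[:n_fft // 2]       # 128 bins over 0-8 kHz
    E = np.abs(Y) ** 2                                   # E(k) = |Y(k)|^2
    f = np.arange(n_fft // 2) * fs / n_fft               # bin center frequencies f(k)
    eg = np.log10(np.sum(E[:low_bins]) + 1e-12)          # low-band log energy Eg
    centroid = np.sum(f * E) / (np.sum(E) + 1e-12)       # spectral centroid
    return eg, centroid
```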
Step S103: and performing fusion judgment on the zero crossing rate, the pitch period, the spectrum energy and the spectrum centroid to judge and obtain the signal type of the sound signal.
In this embodiment, after the earphone device calculates the time-domain zero-crossing rate and pitch period from the bone conduction signal, and calculates the frequency-domain spectral energy and spectral centroid from the frequency-domain signal obtained by time-frequency transformation of the bone conduction signal, it further uses the zero-crossing rate, pitch period, spectral energy and spectral centroid as features in a fusion decision to obtain the result of voice activity detection: the signal type of the currently collected sound signal is either a voice signal or a noise signal.
Illustratively, as shown in fig. 3, the headphone apparatus compares the obtained zero-crossing rate, pitch period, spectral energy and spectral centroid with their corresponding feature thresholds: the spectral energy Eg is compared with threshold K1 (K1 may be -9), the zero-crossing rate Zn with threshold K2 (K2 may be 0.6), the spectral centroid Brightness with threshold K5 (K5 may be 3), and the pitch period Pitch with thresholds K3 (K3 may be 143) and K4 (K4 may be 8). The VAD flag identifying the current sound signal as a voice signal is output only when Eg ≥ K1, Zn ≤ K2, Brightness ≤ K5, and the pitch period Pitch lies between K4 and K3; in all other cases the VAD flag identifying the sound signal as a noise signal is output.
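The fusion decision itself reduces to a joint threshold test, as in the sketch below; it assumes the four features have already been computed on scales matching the example thresholds quoted above (the patent does not spell out the feature normalization, so the constants should be treated as placeholders).

```python
def vad_decision(eg, zcr, centroid, pitch,
                 K1=-9.0, K2=0.6, K3=143, K4=8, K5=3.0):
    """Fusion decision: output 1 (speech) only if every feature passes its threshold."""
    is_speech = (eg >= K1) and (zcr <= K2) and (centroid <= K5) and (K4 <= pitch <= K3)
    return 1 if is_speech else 0   # VAD flag: 1 = voice signal, 0 = noise signal
```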
Step S20: if the signal type is a voice signal, acquiring a posterior signal-to-noise ratio and determining the size between the posterior signal-to-noise ratio and a preset threshold value;
In this embodiment, when the earphone device performs voice activity detection on the bone conduction signal collected by the vibration sensor from the currently collected sound signal and determines that the signal type of the sound signal is a voice signal, it further calculates the posterior signal-to-noise ratio and compares it with the preset threshold.
It should be noted that, in this embodiment, the preset threshold is a posterior signal-to-noise ratio threshold pre-stored locally by the headset device. It should be understood that, based on the different design requirements of practical applications, the posterior SNR threshold may take different values in different feasible embodiments; the speech enhancement method of the present invention does not limit its specific value.
In addition, when the headset device calculates the posterior signal-to-noise ratio of the currently collected voice signal uttered by the wearer and the noise signal in the wearer's environment, the following noise cancellation scheme may specifically be adopted:

Ŝ_t(k) = H_t(k) · Y_t(k),

where Ŝ_t(k) represents the enhanced spectral signal, Y_t(k) is the spectrum of the collected signal, k is the spectral index with a value range of 0 to 127, H_t(k) is the gain function, and λ is the over-subtraction factor used in H_t(k), a constant usually set to 0.9.

Thus, the earphone device can specifically obtain the posterior signal-to-noise ratio post_snr by calculating according to the following formula:

post_snr(k) = |Y_t(k)|² / P_n(k, t),

where P_n(k, t) is the estimated noise power spectrum.
In this embodiment, since deep-learning-based speech enhancement algorithms are usually performed on sub-bands, and one possible frequency-domain sub-band division is shown in fig. 4, the functional relationship between the frequency f and the frequency-domain sub-band bark can be expressed by the standard Bark-scale mapping:

bark = 13 · arctan(0.00076 · f) + 3.5 · arctan( (f / 7500)² ).
the relationship between the frequency f and the spectrum serial number k is as follows: k is f 128/8000.
Thus, assume that mode one operates on individual frequency points k in the frequency domain and mode two operates on the frequency-domain sub-bands bark, with the gain G selected in each case from H_t(k) and g(b). Then:

In mode one, if post_snr(k) ≥ Coef_k, then G = H_t(k); if post_snr(k) < Coef_k, then G = g(b) → g(k), i.e. the sub-band gain mapped back to frequency point k.

In mode two, if 2/3 of the frequency-domain sub-bands bark satisfy the condition post_snr(k) ≥ Coef_k, then G = H_t(k) → H_t(b), i.e. the conventional gain mapped to the sub-band; and where post_snr(k) < Coef_k, G = g(b).
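A sketch of the per-bin (mode one) selection is shown below: the posterior SNR is computed from the current spectrum and the running noise power estimate, and each bin takes the conventional gain H_t(k) where the posterior SNR clears the coefficient, and the deep-learning sub-band gain g(b) mapped back to the bin otherwise. The array shapes, the single scalar coefficient and the epsilon are assumptions.

```python
import numpy as np

def select_gain_mode_one(Y, P_n, H_t, g_sub, band_of_bin, coef=4.0):
    """Mode-one gain selection per frequency bin k.
    Y:           complex spectrum of the current frame, shape (128,)
    P_n:         estimated noise power spectrum, shape (128,)
    H_t:         conventional (spectral-subtraction) gain per bin, shape (128,)
    g_sub:       deep-learning gain per sub-band b
    band_of_bin: sub-band index of each bin, shape (128,)
    """
    post_snr = np.abs(Y) ** 2 / (P_n + 1e-12)       # posterior SNR per bin
    g_bin = g_sub[band_of_bin]                       # map g(b) -> g(k)
    return np.where(post_snr >= coef, H_t, g_bin)    # final gain G(k)
```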
Step S30: and when the posterior signal-to-noise ratio is determined to be smaller than or equal to the preset threshold, performing first denoising processing on a high-frequency signal in the voice signal based on a deep learning mode, and performing second denoising processing on a low-frequency signal in the voice signal in a frequency domain to enhance the voice signal.
In this embodiment, when the comparison between the currently obtained posterior signal-to-noise ratio and the preset threshold shows that the posterior signal-to-noise ratio is less than or equal to the preset threshold, the earphone device performs first noise elimination processing, based on deep learning, on the high-frequency signal in the speech signal, and performs second noise elimination processing in the frequency domain on the low-frequency signal collected by the inner ear microphone, so as to complete the enhancement of the speech signal.
For example, as shown in fig. 2, when the headset device determines that the signal type of the currently acquired sound signal is a voice signal and the posterior signal-to-noise ratio is less than or equal to the preset threshold, the first noise elimination processing is performed, based on deep learning, on the air conduction signal in the voice signal, i.e. DNN-based noise cancellation of the high-frequency signal acquired by the ordinary microphones; at the same time, the second noise elimination processing in the frequency domain is performed on the low-frequency signal (the inner ear microphone signal) acquired by the inner ear microphone.
Further, in some possible embodiments, in the step S30, the step of performing the second denoising process in the frequency domain on the low-frequency signal in the speech signal may specifically include:
step S301: detecting a signal bandwidth of the low frequency signal;
step S302: when the signal bandwidth meets the preset bandwidth design condition, carrying out second noise elimination processing of a frequency domain on the low-frequency signal;
or,
step S303: and when the signal bandwidth does not meet the bandwidth design condition, performing bandwidth expansion on the low-frequency signal, and performing second noise elimination on the low-frequency signal subjected to bandwidth expansion.
In this embodiment, when performing second noise cancellation processing of a frequency domain on a low-frequency signal collected by an inner ear microphone in a speech signal, the earphone device first detects a signal bandwidth size of the low-frequency signal, so that when the signal bandwidth size is wide enough to meet a preset bandwidth design condition, the earphone device directly performs second noise cancellation processing of the frequency domain on the low-frequency signal; or, when the signal bandwidth is too narrow to meet the bandwidth design condition, performing bandwidth extension on the low-frequency signal, and then performing second denoising processing of a frequency domain on the low-frequency signal subjected to bandwidth extension.
It should be noted that, in this embodiment, the preset bandwidth design condition is based on a bandwidth threshold stored locally in advance by the headphone apparatus and determined from the hardware performance of the inner ear microphone: when the signal bandwidth of the low-frequency signal is below the bandwidth threshold, the bandwidth is too narrow to meet the bandwidth design condition; when it is at or above the bandwidth threshold, the bandwidth is wide enough to meet the bandwidth design condition. It should be understood that the bandwidth threshold may of course be set to different values in different feasible embodiments based on the different design requirements of practical applications, and the speech enhancement method of the present invention does not limit its specific value.
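For illustration, the sketch below estimates the effective bandwidth of an inner-ear-microphone frame as the highest frequency bin whose energy exceeds a fraction of the spectral peak and compares it against a bandwidth threshold; the relative floor and the 2 kHz threshold are illustrative assumptions, not values from the patent.

```python
import numpy as np

def effective_bandwidth(frame, fs=16000, n_fft=256, rel_floor=1e-3):
    """Highest frequency whose bin energy exceeds rel_floor times the spectral peak."""
    E = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    active = np.nonzero(E > rel_floor * E.max())[0]
    return active[-1] * fs / n_fft if active.size else 0.0

def needs_bandwidth_extension(frame, bw_threshold_hz=2000.0):
    """True if the detected bandwidth is below the preset threshold (value illustrative)."""
    return effective_bandwidth(frame) < bw_threshold_hz
```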
In addition, in this embodiment, when performing bandwidth extension on a low-frequency signal, the earphone device may specifically perform bandwidth extension on the low-frequency signal based on DNN. That is, the headphone apparatus employs a deep neural network model corresponding to the structure shown in fig. 5 to perform bandwidth extension on the low-frequency signal, where N1, N2, N3, N4, and N5 represent the number of nodes or computing units.
In the deep neural network model adopted by the headphone apparatus above, the input signal is a feature vector composed of specific features (the feature parameters include but are not limited to cepstral parameters such as MFCC (mel-frequency cepstral coefficients), BFCC (bark-frequency cepstral coefficients) and LPCC (LPC cepstral coefficients)), and the output signal is a gain g in the frequency domain.
Dense represents a fully connected network, i.e. every node has connections to every dimension of the input data.
The activation function may select Tanh and RELU, where:
RELU(x)=max(0,x)。
The structure of the LSTM (Long Short-Term Memory) is shown in FIG. 6, and the mathematical calculation process of the LSTM is:

1. Forget gate: f_t = σ(W_f · [h_{t-1}, x_t] + b_f);

2. Input gate: i_t = σ(W_i · [h_{t-1}, x_t] + b_i);

3. Candidate cell state: C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C);

4. Cell state update: C_t = f_t * C_{t-1} + i_t * C̃_t;

5. Output gate: O_t = σ(W_O · [h_{t-1}, x_t] + b_O);

6. Final output: h_t = O_t * tanh(C_t).
In addition, the earphone device may also adopt a GRU (Gated Recurrent Unit) training framework instead of the LSTM described above. Because the design concept of the GRU is similar to that of the LSTM but the GRU has one fewer internal gate, the GRU requires fewer parameters than the LSTM while approaching or matching its performance. Thus, in scenarios where the computing power of the earphone device hardware is limited and the time cost of training the network must be kept low, the GRU is more practical than the LSTM.
In this embodiment, the structure of the GRU used by the earphone device may be as shown in fig. 7, and the mathematical calculation process of the GRU is as follows:

1. Compute the reset gate: r_t = σ(W_r · x_t + U_r · h_{t-1});

2. Compute the update gate: z_t = σ(W_z · x_t + U_z · h_{t-1});

3. Compute the candidate hidden state: h̃_t = tanh(W_h · x_t + U_h · (r_t * h_{t-1}));

4. Final output: h_t = (1 - z_t) * h_{t-1} + z_t * h̃_t.
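A matching NumPy sketch of one GRU step follows; the candidate-state and interpolation equations (steps 3 and 4 above) use a common GRU convention, which is an assumption rather than a detail stated in the patent.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wr, Ur, Wz, Uz, Wh, Uh):
    """One GRU step with reset gate r_t and update gate z_t."""
    r = sigmoid(Wr @ x_t + Ur @ h_prev)               # 1. reset gate
    z = sigmoid(Wz @ x_t + Uz @ h_prev)               # 2. update gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev))   # 3. candidate hidden state
    return (1.0 - z) * h_prev + z * h_tilde           # 4. final output h_t
```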
Based on this, for the deep neural network model that the earphone device adopts to perform the first noise elimination processing, based on deep learning, on the high-frequency signal (the air conduction signal collected by the ordinary microphones) in the currently collected voice signal, the calculation of the sub-band gain g and the steps of training the LSTM (or GRU) network are as follows:
1. Actually record a noise signal n and an air-conduction voice signal s, compute their mixed signal s_mix, then perform time-frequency transformation (such as FFT) on s, n and s_mix respectively to obtain the frequency-domain signals S(k), N(k) and S_mix(k), and calculate the characteristic parameters of these signals through the feature extraction module;
2. dividing the signal into sub-bands (such as 18 sub-bands) in the frequency domain, wherein the sub-band division mode can adopt a mel frequency division mode or a bark sub-band division mode;
3. Compute the voice signal energy and the mixed signal energy on each sub-band, i.e. the voice signal energy E_s(b) and the mixed signal energy E_s_mix(b), where:

E_s(b) = Σ_{k in sub-band b} |S(k)|², E_s_mix(b) = Σ_{k in sub-band b} |S_mix(k)|²,

and b is the sub-band index, b = 0, 1, ..., 17;

4. Compute the true sub-band gain g(b) on each sub-band from these energies, for example g(b) = sqrt( E_s(b) / E_s_mix(b) );
5. Train the neural network: take the true feature signals obtained in step 1 as the input signal and the true sub-band gains g obtained in step 4 as the output signal, and continuously train and adjust the weight coefficients W, U and the bias b of the deep neural network so that the output g' is made ever closer to the true value g. When the error between g' and g is smaller than a predetermined value, the network training is considered successful, and the network parameters at that point are the parameters adopted in practical application.
It should be noted that, in this embodiment, the network training process described above may specifically be completed in advance using a framework such as TensorFlow, Keras, MXNet, and the like.
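As a rough illustration of steps 1 to 5, the sketch below computes true sub-band gains from clean and mixed spectra and fits a small Dense + GRU network with Keras to map per-frame features to those gains; the layer sizes, feature dimension, equal-width band layout and the random placeholder data are all assumptions, not the patent's actual configuration.

```python
import numpy as np
import tensorflow as tf

N_BINS, N_BANDS, N_FEAT = 128, 18, 42     # illustrative sizes

def band_edges(n_bins, n_bands):
    """Illustrative equal-width sub-band layout; the patent allows mel or bark splits."""
    return np.linspace(0, n_bins, n_bands + 1, dtype=int)

def true_subband_gains(S, S_mix, edges):
    """Step 4: true gain per sub-band, here g(b) = sqrt(E_s(b) / E_s_mix(b))."""
    g = np.zeros(len(edges) - 1)
    for b in range(len(edges) - 1):
        lo, hi = edges[b], edges[b + 1]
        e_s = np.sum(np.abs(S[lo:hi]) ** 2)              # speech energy E_s(b)
        e_mix = np.sum(np.abs(S_mix[lo:hi]) ** 2) + 1e-12
        g[b] = np.sqrt(min(e_s / e_mix, 1.0))
    return g

# Step 5: a small Dense + GRU network mapping per-frame features to sub-band gains.
inputs = tf.keras.Input(shape=(None, N_FEAT))
x = tf.keras.layers.Dense(64, activation="tanh")(inputs)
x = tf.keras.layers.GRU(96, return_sequences=True)(x)
outputs = tf.keras.layers.Dense(N_BANDS, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")

# Placeholder training data standing in for real features and gains from steps 1-4.
feats = np.random.rand(32, 100, N_FEAT).astype("float32")
g_true = np.random.rand(32, 100, N_BANDS).astype("float32")
model.fit(feats, g_true, epochs=2, batch_size=8, verbose=0)
```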
In addition, in the deep learning model adopted by the earphone device for performing bandwidth extension on the low-frequency signal which has a narrow signal bandwidth and does not meet the bandwidth design condition, the calculation process and the training process of the subband gain g are as follows:
1. Actually record the residual noise signal ng of the inner ear microphone after noise reduction and the voice signal sg of the inner ear microphone, obtain their mixed signal sg_mix, perform time-frequency transformation (such as FFT) on sg, ng and sg_mix respectively to obtain the frequency-domain signals Sg(k), Ng(k) and Sg_mix(k), and calculate the characteristic parameters of these signals through the feature extraction module.
2. The signal is divided into sub-bands (for example, 5 sub-bands) in the frequency domain, and the sub-band division mode may adopt a mel frequency division mode or a bark sub-band division mode.
3. Compute the voice signal energy E_sg(b) and the mixed signal energy E_sg_mix(b) on each sub-band, where:

E_sg(b) = Σ_{k in sub-band b} |Sg(k)|², E_sg_mix(b) = Σ_{k in sub-band b} |Sg_mix(k)|²,

and b is the sub-band index, b = 0, 1, ..., 4;

4. Compute the true sub-band gain g(b) on each sub-band from these energies, for example g(b) = sqrt( E_sg(b) / E_sg_mix(b) );
5. Train the neural network: take the true feature signals obtained in step 1 as the input signal and the true sub-band gains g obtained in step 4 as the output signal, and continuously train and adjust the weight coefficients W, U and the bias b of the deep neural network so that the output g' is made ever closer to the true value g. When the error between g' and g is smaller than a predetermined value, the network training is considered successful, and the network parameters at that point are the parameters adopted in practical application.
Further, in some possible embodiments, after the step of determining the magnitude between the a posteriori snr and the preset threshold, the speech enhancement method of the present invention may further include:
step A: and when the posterior signal-to-noise ratio is determined to be larger than the preset threshold value, performing third noise elimination processing on the voice signal based on a preset signal processing mode so as to enhance the voice signal.
It should be noted that, in this embodiment, the preset signal processing mode may specifically be a mode of performing noise cancellation processing on the speech signal based on spectral subtraction. In addition, when the posterior signal-to-noise ratio is high, this processing specifically acts on the high-frequency sub-band air conduction signal collected by the ordinary microphones.
In this embodiment, as shown in fig. 2, when the earphone device determines by calculation that the currently obtained posterior signal-to-noise ratio is greater than the preset threshold, it performs a third noise elimination processing based on spectral subtraction (shown as conventional noise cancellation) on the currently acquired voice signal to denoise it, so as to complete the enhancement of the voice signal.
In this embodiment, when the posterior signal-to-noise ratio is high, the earphone device uses a speech enhancement method based on conventional signal processing of spectral subtraction for the high-frequency sub-band air-conduction signal, so that the complexity of enhancing the speech signal by the earphone device can be further reduced.
Further, in other possible embodiments, after the step of determining the signal type of the sound signal according to the bone conduction signal in the sound signal, the speech enhancement method of the present invention may further include:
and B: if the signal type is background noise, updating respective noise power spectrum estimation of the microphone and the inner ear microphone;
In this embodiment, as shown in fig. 2, when the earphone device performs voice activity detection on the bone conduction signal collected by the vibration sensor from the currently collected sound signal and determines that the signal type of the sound signal is background noise, it immediately updates the noise power spectrum estimate of the ordinary microphone configured in the earphone device, namely the microphone noise signal power spectrum, and the noise power spectrum estimate of the inner ear microphone configured in the earphone device, namely the inner ear microphone noise signal power spectrum.
It should be noted that, in this embodiment, when the earphone device updates the noise power spectrum, the following formula may be specifically used:
P_n1(k, t) = β · P_n1(k, t-1) + (1 - β) · |Y_1(k, t)|²;

P_n2(k, t) = β · P_n2(k, t-1) + (1 - β) · |Y_2(k, t)|².
where P_n1(k, t) and P_n2(k, t) respectively denote the estimated noise power spectra of the signals received by the ordinary microphone and by the inner ear microphone, the subscript t denotes the t-th frame, k is the frequency-domain index, and β is the iteration (smoothing) factor, usually 0.9.
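The recursive update above amounts to one exponential-smoothing step per microphone, run only on frames the VAD has labelled as noise; a short NumPy sketch with β = 0.9 as in the text is given below.

```python
import numpy as np

def update_noise_psd(P_n_prev, Y, beta=0.9):
    """P_n(k, t) = beta * P_n(k, t-1) + (1 - beta) * |Y(k, t)|^2, per frequency bin k."""
    return beta * P_n_prev + (1.0 - beta) * np.abs(Y) ** 2
```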
And C: and setting the output signal corresponding to the background noise to be zero, or converting the background noise into a preset comfortable noise signal.
It should be noted that, in this embodiment, the preset comfort noise signal may specifically be a sound signal that has a relaxing effect on the listener; comfort noise neither irritates the listener nor harms the human body. For example, the preset comfort noise signal may specifically be a low-energy white noise signal.
In this embodiment, as shown in fig. 2, when the earphone device performs voice activity detection on the bone conduction signal collected by the vibration sensor from the currently collected sound signal and determines that the signal type of the sound signal is background noise, then, because the sound signal currently collected by the earphone device is noise, the earphone device may directly set the output signal corresponding to the background noise to zero, or convert the background noise into a comfort noise signal through sound signal processing techniques.
Further, in some possible embodiments, in the above step S30: after enhancing the speech signal, the speech enhancement method of the present invention may further include:
step S40: filtering the high-frequency signal subjected to the first denoising processing and the low-frequency signal subjected to the second denoising processing respectively to obtain corresponding signals to be output;
in this embodiment, after performing first denoising processing on a high-frequency signal in a voice signal and performing second denoising processing on a low-frequency signal in the voice signal, the earphone device further filters the high-frequency signal subjected to the first denoising processing and the low-frequency signal subjected to the second denoising processing through a preset filter to obtain corresponding signals to be output.
Illustratively, as shown in fig. 2, the headphone apparatus filters the high-frequency signal subjected to the first noise cancellation processing with a high-pass filter to obtain the output signal out1, and filters the low-frequency signal subjected to the second noise cancellation processing with a low-pass filter to obtain the output signal out2.
In this embodiment, the high-pass filter and the low-pass filter used by the earphone device may specifically be formed by connecting 5 biquad (second-order IIR) filters in series, and the coefficients of the high-pass and low-pass filters may be generated with MATLAB (commercial mathematics software from MathWorks, USA).
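The sketch below builds an equivalent pair of 10th-order IIR filters (five cascaded biquads each) with SciPy instead of MATLAB; the Butterworth design and the 1 kHz crossover frequency are illustrative assumptions.

```python
from scipy import signal

FS = 16000
CROSSOVER_HZ = 1000   # illustrative split between the high- and low-frequency paths

# A 10th-order Butterworth filter in second-order sections = 5 cascaded biquads.
sos_hp = signal.butter(10, CROSSOVER_HZ, btype="highpass", fs=FS, output="sos")
sos_lp = signal.butter(10, CROSSOVER_HZ, btype="lowpass", fs=FS, output="sos")

def filter_paths(denoised_high, denoised_low):
    """out1 = high-pass of the first-denoised path, out2 = low-pass of the second-denoised path."""
    out1 = signal.sosfilt(sos_hp, denoised_high)
    out2 = signal.sosfilt(sos_lp, denoised_low)
    return out1, out2
```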
In addition, in this embodiment, since the earphone device already performs a time-frequency transform on the sound signal using the FFT when judging the signal type of the currently collected sound signal, the earphone device needs to first apply an inverse fast Fourier transform (IFFT) to the noise-cancelled voice signal (including the high-frequency signal and the low-frequency signal) before filtering it.
Step S50: and fusing the signals to be output, and outputting the fused signals to be output after dynamic range control.
In this embodiment, after filtering the voice signals subjected to the noise cancellation processing to obtain output signals, the headphone device further fuses the output signals, and then performs dynamic range control on the fused output signals and outputs the output signals.
For example, as shown in fig. 2, the headphone apparatus fuses the filtered output signals out1 and out2 to obtain a fused signal out = k1 · out1 + k2 · out2. The headphone apparatus then performs dynamic range control (DRC) on the fused signal out, that is, adjusts its output volume, frequency response and the like to the standard range in which the human ear receives a voice signal, and then outputs the signal to the terminal apparatus currently connected to the headphone apparatus for the voice call.
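A compact sketch of the fusion and DRC stage follows; the equal fusion weights and the simple static compressor used as the dynamic range control are assumptions, since the patent does not specify the DRC curve.

```python
import numpy as np

def fuse_and_drc(out1, out2, k1=0.5, k2=0.5, threshold=0.5, ratio=4.0):
    """Fuse the two filtered paths (out = k1*out1 + k2*out2), then apply a simple
    static compressor above `threshold` as the dynamic range control stage."""
    out = k1 * np.asarray(out1) + k2 * np.asarray(out2)
    mag = np.abs(out)
    gain = np.ones_like(mag)
    over = mag > threshold
    gain[over] = (threshold + (mag[over] - threshold) / ratio) / mag[over]
    return out * gain
```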
Further, in any of the foregoing possible embodiments, the speech enhancement method of the present invention may further include:
and when the sound signal is acquired, after echo cancellation processing is carried out on the sound signal, the step of judging the signal type of the sound signal according to the bone conduction signal in the sound signal and subsequent steps are executed.
In this embodiment, as shown in fig. 2, after the earphone device collects sound signals in real time through the vibration sensor, the two ordinary microphones and the inner ear microphone, the earphone device immediately performs echo cancellation processing on four signals, namely, a bone conduction signal collected by the vibration sensor, two high frequency signals (shown as a microphone signal 1 and a microphone signal 2) collected by the two ordinary microphones, and a low frequency signal (shown as an inner ear microphone signal) collected by the inner ear microphone, according to a far-end signal.
In addition, in order to further improve the effect of performing noise cancellation processing on the voice signal, the headset device may further perform beam forming on two high-frequency signals, namely two microphone signals, acquired by two common microphones in the sound signal after performing echo cancellation processing on the acquired sound signal, so as to suppress noise outside the directivity.
It should be noted that, in this embodiment, the earphone device may specifically perform echo cancellation and beam forming on the collected sound signal in any mature echo cancellation manner and beam forming manner, respectively. The speech enhancement method of the present invention is not limited to the specific implementation procedures of the echo cancellation and the beam forming.
In this embodiment, when the earphone device is used by a wearer for a voice call, it continuously collects, through its own sound collection devices (which include a vibration sensor), a sound signal comprising the voice signal uttered by the wearer and/or the noise signal of the wearer's environment. Whenever such a sound signal is collected, the earphone device immediately judges its signal type from the bone conduction signal collected by the vibration sensor, i.e. judges whether the sound signal is a voice signal or a noise signal. Then, when the signal type is judged to be a voice signal, the posterior signal-to-noise ratio is calculated and compared with the preset threshold. Finally, when the posterior signal-to-noise ratio is determined to be less than or equal to the preset threshold, the earphone device performs the first noise elimination processing, based on deep learning, on the high-frequency signal in the voice signal and performs the second noise elimination processing in the frequency domain on the low-frequency signal collected by the inner ear microphone, so that the enhancement of the voice signal is completed.
Thus, compared with existing speech enhancement approaches, the invention judges the signal type from the bone conduction signal collected by the vibration sensor configured in the earphone device, so that when the wearer utters a voice signal, the denoising and enhancement scheme is selected according to the posterior signal-to-noise ratio: when the posterior signal-to-noise ratio is small, a deep-learning-based scheme is used to denoise the high-frequency signal of the voice signal while the low-frequency signal is denoised in the frequency domain. In this way, on a low-power, low-resource earphone device, the signal type is judged from the bone conduction signal, the magnitude of the posterior signal-to-noise ratio is determined, the voice signal is efficiently enhanced, and the background noise of the voice signal received by the user through the earphone device is effectively eliminated.
In addition, an embodiment of the present invention further provides a speech enhancement apparatus, where the speech enhancement apparatus of the present invention is applied to an earphone device, the earphone device includes a vibration sensor, and referring to fig. 8, the speech enhancement apparatus of the present invention includes:
the detection module 10 is configured to, when a sound signal is acquired, judge a signal type of the sound signal according to a bone conduction signal in the sound signal, where the bone conduction signal is acquired based on the vibration sensor;
a determining module 20, configured to obtain a posterior signal-to-noise ratio and determine a size between the posterior signal-to-noise ratio and a preset threshold if the signal type is a voice signal;
and the enhancing module 30 is configured to, when it is determined that the posterior signal-to-noise ratio is smaller than or equal to the preset threshold, perform first denoising processing on a high-frequency signal in the voice signal based on a deep learning manner, and perform second denoising processing in a frequency domain on a low-frequency signal in the voice signal, so as to enhance the voice signal.
Optionally, the earphone device further includes an inner ear microphone, the low frequency signal is acquired based on the inner ear microphone, and the enhancing module 30 includes:
a detection unit for detecting a signal bandwidth of the low frequency signal;
the first denoising unit is used for performing second denoising processing of a frequency domain on the low-frequency signal when the signal bandwidth meets a preset bandwidth design condition;
and the second noise elimination unit is used for performing bandwidth expansion on the low-frequency signal and performing second noise elimination on the low-frequency signal subjected to bandwidth expansion when the signal bandwidth does not meet the bandwidth design condition.
Optionally, the detection module 10 includes:
the first calculation unit is used for calculating time domain information according to a bone conduction signal in the sound signal, wherein the time domain information comprises a zero crossing rate and a pitch period;
the time-frequency transformation unit is used for performing time-frequency transformation processing on the bone conduction signal to obtain a frequency domain signal and calculating frequency domain information according to the frequency domain signal, wherein the frequency domain information comprises frequency spectrum energy and a frequency spectrum centroid;
and the signal type judging unit is configured to perform a fusion judgment on the zero crossing rate, the pitch period, the spectrum energy and the spectrum centroid to determine the signal type of the sound signal (a sketch of these features follows this list).
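The sketch below shows one straightforward way to compute the four features named above (zero crossing rate, pitch period, spectrum energy, spectrum centroid) and to combine them in a simple rule-based fusion judgment. The thresholds in the fusion rule are purely illustrative assumptions; the description does not specify how the features are weighted.

```python
import numpy as np

def bone_conduction_features(frame, fs=16000):
    """Zero crossing rate, pitch period, spectral energy and spectral centroid of one frame.
    Assumes a frame of at least ~fs/60 samples (e.g., a 20-32 ms frame)."""
    # Time domain: zero crossing rate and pitch period (autocorrelation peak in 60-400 Hz).
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / 400), min(int(fs / 60), len(frame) - 1)
    pitch_period = (lo + np.argmax(ac[lo:hi])) / fs
    # Frequency domain: spectral energy and spectral centroid.
    psd = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    energy = float(psd.sum())
    centroid = float((freqs * psd).sum() / max(energy, 1e-12))
    return zcr, pitch_period, energy, centroid

def is_voice_frame(frame, fs=16000, zcr_max=0.25, energy_min=1e-4, centroid_max=1500.0):
    """Toy fusion rule; all thresholds are illustrative assumptions, not values from the patent."""
    zcr, _pitch, energy, centroid = bone_conduction_features(frame, fs)
    return zcr < zcr_max and energy > energy_min and centroid < centroid_max
```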
Optionally, the speech enhancement apparatus of the present invention further includes:
the filtering module is used for respectively filtering the high-frequency signal subjected to the first denoising processing and the low-frequency signal subjected to the second denoising processing to obtain corresponding signals to be output;
and the fusion output module is configured to fuse the signals to be output and, after dynamic range control, output the fused signal (a sketch of this step follows).
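A minimal sketch of the filtering, fusion, and dynamic range control step is given below, assuming a fourth-order Butterworth band split at 1 kHz and a simple peak limiter in place of a full dynamic range controller; both choices are assumptions made for the example.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def fuse_and_output(high_denoised, low_denoised, fs=16000, split_hz=1000.0, limit=0.95):
    """Band-limit each denoised path, sum them, then apply a simple peak limiter
    as a stand-in for dynamic range control (split frequency and limit are assumed)."""
    sos_hp = butter(4, split_hz, btype="highpass", fs=fs, output="sos")
    sos_lp = butter(4, split_hz, btype="lowpass", fs=fs, output="sos")
    fused = sosfilt(sos_hp, high_denoised) + sosfilt(sos_lp, low_denoised)
    peak = float(np.max(np.abs(fused)))
    return fused * (limit / peak) if peak > limit else fused
```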
Optionally, the enhancing module 30 of the speech enhancement apparatus of the present invention is further configured to, when it is determined that the posterior signal-to-noise ratio is greater than the preset threshold, perform a third noise elimination processing on the voice signal based on a preset signal processing manner to enhance the voice signal (one conventional choice is sketched below).
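The description leaves the "preset signal processing manner" open; the following sketch assumes a conventional per-bin Wiener-type spectral gain as one possible realization of the third noise elimination processing.

```python
import numpy as np

def wiener_denoise(frame, noise_psd, n_fft=512, gain_floor=0.1):
    """A conventional frequency-domain denoiser: per-bin Wiener-type gain.
    noise_psd is assumed to have n_fft // 2 + 1 bins."""
    spectrum = np.fft.rfft(frame, n_fft)
    post_snr = (np.abs(spectrum) ** 2) / np.maximum(noise_psd, 1e-12)
    prior_snr = np.maximum(post_snr - 1.0, 0.0)        # crude maximum-likelihood estimate
    gain = np.maximum(prior_snr / (prior_snr + 1.0), gain_floor)
    return np.fft.irfft(gain * spectrum, n_fft)[: len(frame)]
```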
Optionally, the speech enhancement apparatus of the present invention further includes:
the frequency spectrum updating module is configured to update the respective noise power spectrum estimates of the microphone and the inner ear microphone if the signal type is background noise;
and the noise processing module is configured to set the output signal corresponding to the background noise to zero, or to convert the background noise into a preset comfort noise signal (a sketch of both operations follows this list).
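The sketch below illustrates the two background-noise operations under common assumptions: the noise power spectrum is updated by recursive averaging (smoothing factor assumed to be 0.9), and the comfort noise is synthesized from the estimated spectral shape with random phase at an assumed output level.

```python
import numpy as np

def update_noise_psd(noise_psd, noise_frame, n_fft=512, alpha=0.9):
    """Recursive averaging of a microphone's noise power spectrum during noise-only frames."""
    frame_psd = np.abs(np.fft.rfft(noise_frame, n_fft)) ** 2
    return alpha * noise_psd + (1.0 - alpha) * frame_psd

def comfort_noise_frame(noise_psd, n_fft=512, level=0.1):
    """Comfort noise with the estimated spectral shape and random phase (output level assumed)."""
    phase = np.exp(1j * 2.0 * np.pi * np.random.rand(noise_psd.shape[0]))
    return level * np.fft.irfft(np.sqrt(noise_psd) * phase, n_fft)
```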
Optionally, the detection module 10 of the speech enhancement apparatus of the present invention is further configured to, when the sound signal is collected, first perform echo cancellation processing on the sound signal and then execute the step of determining the signal type of the sound signal according to the bone conduction signal in the sound signal and its subsequent steps (an adaptive-filter sketch of the echo cancellation follows).
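The echo cancellation algorithm is not specified in the description; the sketch below assumes a standard normalized-LMS (NLMS) adaptive filter driven by the loudspeaker (far-end) reference signal, with the filter length and step size chosen arbitrarily for the example.

```python
import numpy as np

def nlms_echo_cancel(mic, far_end, taps=128, mu=0.5, eps=1e-6):
    """Normalized LMS echo canceller: subtract the adaptively estimated echo of the
    loudspeaker (far-end) signal from the microphone signal before classification."""
    w = np.zeros(taps)
    out = np.copy(mic)
    for n in range(taps, len(mic)):
        x = far_end[n - taps:n][::-1]             # most recent reference samples, newest first
        err = mic[n] - np.dot(w, x)               # microphone sample minus estimated echo
        w += mu * err * x / (np.dot(x, x) + eps)  # normalized weight update
        out[n] = err
    return out
```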
The specific implementation of the speech enhancement apparatus of the present invention is basically the same as the foregoing embodiments of the speech enhancement method, and is not described herein again.
In addition, an embodiment of the present invention further provides an earphone device, which includes a structural shell, a communication module, a main control module (for example, a micro control unit, MCU), a speaker, a microphone, an inner ear microphone, a vibration sensor, a memory, and the like. The main control module may include a microprocessor, an audio decoding unit, a power supply and power management unit, sensors, and other active or passive devices required by the system (which may be replaced, removed, or added according to the actual functions), so as to realize the wireless audio receiving and playing functions.
The memory of the headset device of the present invention may have stored therein a speech enhancement program that can be invoked by a microprocessor in the headset device and that performs the following operations:
when a sound signal is collected, judging the signal type of the sound signal according to a bone conduction signal in the sound signal, wherein the bone conduction signal is collected based on the vibration sensor;
if the signal type is a voice signal, acquiring a posterior signal-to-noise ratio and comparing it with a preset threshold;
and when the posterior signal-to-noise ratio is determined to be smaller than or equal to the preset threshold, performing first denoising processing on a high-frequency signal in the voice signal based on a deep learning mode, and performing second denoising processing on a low-frequency signal in the voice signal in a frequency domain to enhance the voice signal.
Optionally, the low-frequency signal is acquired based on the inner ear microphone, and the speech enhancement program stored in the memory of the headset device of the present invention may be called by the microprocessor in the headset device to further perform the following operations:
detecting a signal bandwidth of the low frequency signal;
when the signal bandwidth meets a preset bandwidth design condition, performing the second, frequency-domain noise elimination processing on the low-frequency signal;
or,
and when the signal bandwidth does not meet the bandwidth design condition, performing bandwidth expansion on the low-frequency signal, and performing second noise elimination on the low-frequency signal subjected to bandwidth expansion.
Optionally, the speech enhancement program stored in the memory of the headset device of the present invention may be invoked by the microprocessor in the headset device to further perform the following operations:
calculating time domain information according to a bone conduction signal in the sound signal, wherein the time domain information comprises a zero crossing rate and a pitch period;
performing time-frequency transformation processing on the bone conduction signal to obtain a frequency domain signal, and calculating frequency domain information according to the frequency domain signal, wherein the frequency domain information comprises spectrum energy and spectrum centroid;
and performing a fusion judgment on the zero crossing rate, the pitch period, the spectrum energy and the spectrum centroid to determine the signal type of the sound signal.
Optionally, the speech enhancement program stored in the memory of the headset device of the present invention may be called by the microprocessor in the headset device to further perform the following operations after the voice signal has been enhanced:
filtering the high-frequency signal subjected to the first denoising processing and the low-frequency signal subjected to the second denoising processing respectively to obtain corresponding signals to be output;
and fusing the signals to be output and, after dynamic range control, outputting the fused signal.
Optionally, the speech enhancement program stored in the memory of the headset device of the present invention may be invoked by the microprocessor in the headset device to further perform the following operations after the posterior signal-to-noise ratio has been compared with the preset threshold:
and when the posterior signal-to-noise ratio is determined to be larger than the preset threshold value, performing third noise elimination processing on the voice signal based on a preset signal processing mode so as to enhance the voice signal.
Alternatively, the speech enhancement program stored in the memory of the earphone device of the present invention may be called by the microprocessor in the earphone device to further perform the following operations after the signal type of the sound signal has been determined according to the bone conduction signal in the sound signal:
if the signal type is background noise, updating the respective noise power spectrum estimates of the microphone and the inner ear microphone;
and setting the output signal corresponding to the background noise to zero, or converting the background noise into a preset comfort noise signal.
Optionally, the speech enhancement program stored in the memory of the headset device of the present invention may be invoked by the microprocessor in the headset device to further perform the following operations:
and when the sound signal is acquired, after echo cancellation processing is carried out on the sound signal, the step of judging the signal type of the sound signal according to the bone conduction signal in the sound signal and subsequent steps are executed.
Furthermore, the present invention also provides a computer readable storage medium, on which a speech enhancement program is stored, which when executed by a processor implements the steps of the speech enhancement method of the present invention as described above.
The embodiments of the headset device and the computer-readable storage medium of the present invention can refer to the embodiments of the speech enhancement method of the present invention, and are not described herein again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.
Claims (10)
1. A speech enhancement method applied to a headset device including a vibration sensor, the speech enhancement method comprising:
when a sound signal is collected, judging the signal type of the sound signal according to a bone conduction signal in the sound signal, wherein the bone conduction signal is collected based on the vibration sensor;
if the signal type is a voice signal, acquiring a posterior signal-to-noise ratio and comparing the posterior signal-to-noise ratio with a preset threshold;
and when the posterior signal-to-noise ratio is determined to be smaller than or equal to the preset threshold, performing first denoising processing on a high-frequency signal in the voice signal based on a deep learning mode, and performing second denoising processing on a low-frequency signal in the voice signal in a frequency domain to enhance the voice signal.
2. The speech enhancement method of claim 1, wherein the headset device further comprises an inner ear microphone, the low-frequency signal is acquired based on the inner ear microphone, and the step of performing the second noise cancellation processing on the low-frequency signal in the frequency domain in the speech signal comprises:
detecting a signal bandwidth of the low frequency signal;
when the signal bandwidth meets a preset bandwidth design condition, performing the second, frequency-domain noise elimination processing on the low-frequency signal;
or,
and when the signal bandwidth does not meet the bandwidth design condition, performing bandwidth expansion on the low-frequency signal, and performing second noise elimination on the low-frequency signal subjected to bandwidth expansion.
3. The speech enhancement method of claim 1 wherein the step of determining the signal type of the sound signal based on the bone conduction signal in the sound signal comprises:
calculating time domain information according to a bone conduction signal in the sound signal, wherein the time domain information comprises a zero crossing rate and a pitch period;
performing time-frequency transformation processing on the bone conduction signal to obtain a frequency domain signal, and calculating frequency domain information according to the frequency domain signal, wherein the frequency domain information comprises spectrum energy and spectrum centroid;
and performing a fusion judgment on the zero crossing rate, the pitch period, the spectrum energy and the spectrum centroid to determine the signal type of the sound signal.
4. The speech enhancement method of claim 1 wherein after the step of enhancing the speech signal, the method further comprises:
filtering the high-frequency signal subjected to the first denoising processing and the low-frequency signal subjected to the second denoising processing respectively to obtain corresponding signals to be output;
and fusing the signals to be output and, after dynamic range control, outputting the fused signal.
5. The speech enhancement method of claim 1, wherein after the step of comparing the posterior signal-to-noise ratio with the preset threshold, the method further comprises:
and when the posterior signal-to-noise ratio is determined to be larger than the preset threshold value, performing third noise elimination processing on the voice signal based on a preset signal processing mode so as to enhance the voice signal.
6. The speech enhancement method of claim 1 wherein after the step of determining the signal type of the sound signal from the bone conduction signal in the sound signal, the method further comprises:
if the signal type is background noise, updating the respective noise power spectrum estimates of the microphone and the inner ear microphone;
and setting the output signal corresponding to the background noise to zero, or converting the background noise into a preset comfort noise signal.
7. The speech enhancement method of any one of claims 1 to 6, further comprising:
and when the sound signal is acquired, after echo cancellation processing is carried out on the sound signal, the step of judging the signal type of the sound signal according to the bone conduction signal in the sound signal and subsequent steps are executed.
8. A speech enhancement apparatus, applied to a headset device including a vibration sensor, the speech enhancement apparatus comprising:
the detection module is used for judging the signal type of the sound signal according to a bone conduction signal in the sound signal when the sound signal is collected, wherein the bone conduction signal is collected based on the vibration sensor;
the determining module is used for, if the signal type is a voice signal, acquiring a posterior signal-to-noise ratio and comparing the posterior signal-to-noise ratio with a preset threshold;
and the enhancement module is used for carrying out first noise elimination processing on a high-frequency signal in the voice signal based on a deep learning mode and carrying out second noise elimination processing on a low-frequency signal in the voice signal in a frequency domain to enhance the voice signal when the posterior signal-to-noise ratio is determined to be less than or equal to the preset threshold.
9. An earphone device, characterized in that the earphone device comprises: memory, processor and a speech enhancement program stored on the memory and executable on the processor, the speech enhancement program when executed by the processor implementing the steps of the speech enhancement method according to any of claims 1 to 7.
10. A computer-readable storage medium, in which a speech enhancement program is stored, which when executed by a processor implements the steps of the speech enhancement method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210461573.9A CN114822573B (en) | 2022-04-28 | 2022-04-28 | Voice enhancement method, device, earphone device and computer readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210461573.9A CN114822573B (en) | 2022-04-28 | 2022-04-28 | Voice enhancement method, device, earphone device and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114822573A true CN114822573A (en) | 2022-07-29 |
CN114822573B CN114822573B (en) | 2024-10-11 |
Family
ID=82510314
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210461573.9A Active CN114822573B (en) | 2022-04-28 | 2022-04-28 | Voice enhancement method, device, earphone device and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114822573B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140363020A1 (en) * | 2013-06-07 | 2014-12-11 | Fujitsu Limited | Sound correcting apparatus and sound correcting method |
US20180277135A1 (en) * | 2017-03-24 | 2018-09-27 | Hyundai Motor Company | Audio signal quality enhancement based on quantitative snr analysis and adaptive wiener filtering |
US20180367882A1 (en) * | 2017-06-16 | 2018-12-20 | Cirrus Logic International Semiconductor Ltd. | Earbud speech estimation |
CN110970051A (en) * | 2019-12-06 | 2020-04-07 | 广州国音智能科技有限公司 | Voice data acquisition method, terminal and readable storage medium |
CN112767963A (en) * | 2021-01-28 | 2021-05-07 | 歌尔科技有限公司 | Voice enhancement method, device and system and computer readable storage medium |
CN112951259A (en) * | 2021-03-01 | 2021-06-11 | 杭州网易云音乐科技有限公司 | Audio noise reduction method and device, electronic equipment and computer readable storage medium |
CN113593612A (en) * | 2021-08-24 | 2021-11-02 | 歌尔科技有限公司 | Voice signal processing method, apparatus, medium, and computer program product |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140363020A1 (en) * | 2013-06-07 | 2014-12-11 | Fujitsu Limited | Sound correcting apparatus and sound correcting method |
US20180277135A1 (en) * | 2017-03-24 | 2018-09-27 | Hyundai Motor Company | Audio signal quality enhancement based on quantitative snr analysis and adaptive wiener filtering |
CN108630221A (en) * | 2017-03-24 | 2018-10-09 | 现代自动车株式会社 | Audio Signal Quality Enhancement Based on Quantized SNR Analysis and Adaptive Wiener Filtering |
US20180367882A1 (en) * | 2017-06-16 | 2018-12-20 | Cirrus Logic International Semiconductor Ltd. | Earbud speech estimation |
CN110970051A (en) * | 2019-12-06 | 2020-04-07 | 广州国音智能科技有限公司 | Voice data acquisition method, terminal and readable storage medium |
CN112767963A (en) * | 2021-01-28 | 2021-05-07 | 歌尔科技有限公司 | Voice enhancement method, device and system and computer readable storage medium |
CN112951259A (en) * | 2021-03-01 | 2021-06-11 | 杭州网易云音乐科技有限公司 | Audio noise reduction method and device, electronic equipment and computer readable storage medium |
CN113593612A (en) * | 2021-08-24 | 2021-11-02 | 歌尔科技有限公司 | Voice signal processing method, apparatus, medium, and computer program product |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2024125012A1 (en) * | 2022-12-16 | 2024-06-20 | 华为技术有限公司 | Audio signal restoration method and apparatus, device, storage medium, and computer program |
Also Published As
Publication number | Publication date |
---|---|
CN114822573B (en) | 2024-10-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020177371A1 (en) | Environment adaptive neural network noise reduction method and system for digital hearing aids, and storage medium | |
CN112767963B (en) | Voice enhancement method, device and system and computer readable storage medium | |
CN109036460B (en) | Voice processing method and device based on multi-model neural network | |
JP7314279B2 (en) | Apparatus and method for source separation using sound quality estimation and control | |
CN113593612B (en) | Speech signal processing method, device, medium and computer program product | |
WO2011128723A1 (en) | Audio communication device, method for outputting an audio signal, and communication system | |
WO2007001821A2 (en) | Multi-sensory speech enhancement using a speech-state model | |
CN105448302B (en) | A kind of the speech reverberation removing method and system of environment self-adaption | |
CN113838471A (en) | Noise reduction method and system based on neural network, electronic device and storage medium | |
Do et al. | Speech source separation using variational autoencoder and bandpass filter | |
WO2022256577A1 (en) | A method of speech enhancement and a mobile computing device implementing the method | |
CN114822573B (en) | Voice enhancement method, device, earphone device and computer readable storage medium | |
Min et al. | Mask estimate through Itakura-Saito nonnegative RPCA for speech enhancement | |
CN107360497A (en) | Estimate the computational methods and device of reverberation component | |
JP2007251354A (en) | Microphone and sound generation method | |
Xia et al. | Ava: An adaptive audio filtering architecture for enhancing mobile, embedded, and cyber-physical systems | |
CN113132885B (en) | Method for judging wearing state of earphone based on energy difference of double microphones | |
Sun et al. | An RNN-based speech enhancement method for a binaural hearing aid system | |
Srinivasarao | An efficient recurrent Rats function network (Rrfn) based speech enhancement through noise reduction | |
Pathrose et al. | MASTER: Microphone Array Source Time Difference Eco Canceller via Reconstructed Spiking Neural Network | |
CN116386654A (en) | Wind noise suppression method, device, equipment and computer readable storage medium | |
US20230276182A1 (en) | Mobile device that provides sound enhancement for hearing device | |
CN116453536A (en) | Wind noise suppression method, device, equipment and computer readable storage medium | |
Nisa et al. | Meta-Heuristic Application in Suppression of Noise | |
Parameswaran | Objective assessment of machine learning algorithms for speech enhancement in hearing aids |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||