CN108986834B - Bone conduction voice blind enhancement method based on codec framework and recurrent neural network - Google Patents

Bone conduction voice blind enhancement method based on codec framework and recurrent neural network

Info

Publication number
CN108986834B
CN108986834B
Authority
CN
China
Prior art keywords
voice
training
neural network
speech
bone conduction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810960512.0A
Other languages
Chinese (zh)
Other versions
CN108986834A (en)
Inventor
张雄伟
单东晶
郑昌艳
曹铁勇
李莉
杨吉斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Army Engineering University of PLA
Original Assignee
Army Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Army Engineering University of PLA filed Critical Army Engineering University of PLA
Priority to CN201810960512.0A
Publication of CN108986834A
Application granted
Publication of CN108986834B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/005 - Correction of errors induced by the transmission channel, if related to the coding algorithm
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention discloses a bone conduction speech blind enhancement method based on a codec framework and a recurrent neural network. The method first extracts air conduction and bone conduction speech features and performs alignment preprocessing on the extracted feature data; an encoder is then pre-trained with the bone conduction speech features as the training input and the air conduction speech dictionary combination coefficients as the training target, and the trained parameters are saved as the initialization parameters of the encoder for the next step. A decoder model based on a local attention mechanism is constructed; with the encoder output as the decoder input and the air conduction speech features as the training target, the codec model is trained jointly and the model parameters are saved. Finally, the features of the bone conduction speech to be enhanced are extracted, feature conversion is performed with the trained codec neural network, and inverse normalization and inverse feature transformation are applied to the network output to finally obtain the enhanced time-domain speech. The invention solves problems such as the recovery of high-frequency components, the recovery of bone conduction silence, and recovery under a strong noise background, and improves the enhancement quality of bone conduction speech.

Description

Bone conduction voice blind enhancement method based on codec framework and recurrent neural network
Technical Field
The invention belongs to the technical field of speech signal processing, and relates to a bone conduction speech blind enhancement method based on a codec framework and a deep long short-term memory recurrent neural network.
Background
Bone conduction microphones are non-acoustic sensor devices: when a person speaks, vocal cord vibrations are transmitted to the larynx and skull, and the microphone acquires speech by collecting these vibration signals and converting them into electrical signals. Unlike conventional air conduction microphone speech, non-acoustic sensors are hardly affected by background noise, so bone conduction speech shields noise at the sound source, has strong anti-noise performance, and is used in both military and civilian applications. For example, many countries include bone conduction communication systems in military equipment such as helicopters and tanks, and bone conduction headsets are an important communication tool in the "future fighter" individual soldier combat system of the United States. In civilian applications, the American company iASUS has developed throat microphones, bone conduction headsets and other devices for extreme sports such as car and motorcycle racing, and Japanese companies such as Panasonic and Sony have developed various bone conduction communication products applied in fire fighting, forestry, oil exploration and mining, mines, emergency rescue, special duty, engineering construction and other fields.
Although bone conduction speech can effectively resist interference from environmental noise, the low-pass nature of transmission through the human body and the inherent characteristics of vibration signals cause the loss of high-frequency components, muffled mid-frequency components, and missing airflow and nasal sounds, so the speech sounds dull and is not clear enough, which seriously affects auditory perception. In addition, bone conduction speech may also be mixed with non-acoustic physical noise, such as friction noise generated by close contact between the device and the skin, strong wind friction noise during extreme sports, and noise introduced when a person chews or the teeth collide, which further degrades communication quality. Therefore, research on bone conduction speech enhancement algorithms has important theoretical significance and practical value for advancing the practical adoption of bone conduction microphone products and improving speech communication quality in strong noise environments.
At present, there are three typical methods for blind enhancement of bone conduction speech: unsupervised spectrum extension, equalization, and spectral envelope conversion.
The unsupervised spectrum extension method (Bouserhal R E, Falk T H, Voix J. In-ear microphone speech quality enhancement via adaptive filtering and artificial bandwidth extension [J]. Journal of the Acoustical Society of America, 2017) assumes that bone conduction speech and air conduction speech share a consistent formant structure, or that the low-frequency and high-frequency parts of speech share a consistent harmonic structure; using these structural characteristics, the low-frequency spectrum can be extended directly to obtain the enhanced high-frequency formant or harmonic structure, i.e., blind enhancement of bone conduction speech is achieved.
The idea of the equalization method is to find the inverse transform g(t) of the transmission channel transfer function h(t) and recover the air conduction speech signal from the bone conduction speech signal. The equalization method was first proposed by Shimamura (Shimamura T, Tamiya T. A reconstruction filter for bone-conducted speech [C]. Circuits and Systems, 2005. Midwest Symposium on, 2005: 1847-1850), which models g(t) and constructs an inverse filter to achieve bone conduction speech enhancement. The equalization method can maintain the low-frequency harmonic structure of the speech and effectively compress excessive energy in bone conduction speech, but it has difficulty recovering the high-frequency components of bone conduction speech.
At present, most blind enhancement of bone conduction speech adopts methods based on spectral envelope conversion (Turan M A T, Erzin E. Source and Filter Estimation for Throat-Microphone Speech Enhancement [J], 2016; Mohammadi S H, Kain A. An overview of voice conversion systems [J], 2017). The basic idea of spectral envelope conversion is to decompose speech into excitation source features and spectral envelope features according to a source-filter model of speech. In the training stage, the bone conduction and air conduction speech data are processed with an analysis-synthesis model, excitation features and spectral envelope features are extracted, and the conversion relation between spectral envelope features is established by training a conversion model. In the enhancement stage, the bone conduction speech to be enhanced is decomposed to obtain excitation and spectral envelope features, the trained model estimates the air conduction spectral envelope features from the bone conduction spectral envelope features, and the enhanced speech is then synthesized from the estimated envelope and the original excitation features of the bone conduction speech.
Decomposition-synthesis methods based on the source-filter model have made some progress in bone conduction speech enhancement, but they generally suffer from difficult feature selection, unsatisfactory performance in high-noise environments, and inaccurate recovery of the pitch-frequency components of speech, so the enhanced speech tends to have heavy low frequencies, insufficient clarity and intelligibility, processing noise, and other defects. Some research has begun to adopt decomposition-synthesis methods based on a signal model, dividing the speech signal into a high-dimensional magnitude spectrum and a phase, and using deep learning techniques to establish the relation between the high-dimensional magnitude spectra of bone conduction speech and clean air conduction speech, achieving good results in recovering bone conduction speech.
Disclosure of Invention
The invention aims to provide a bone conduction speech blind enhancement method based on a codec framework and a recurrent neural network that is data-driven, obtains the model parameters through training, and uses the trained model to enhance bone conduction speech, solving problems such as the recovery of high-frequency components, the recovery of bone conduction silence, and recovery under a strong noise background, thereby improving the clarity and intelligibility of bone conduction speech and further improving its enhancement quality.
The technical solution for realizing the purpose of the invention is as follows: a bone conduction voice blind enhancement method based on a codec architecture and a recurrent neural network comprises the following steps:
Data preprocessing: extracting air conduction and bone conduction speech features, performing alignment preprocessing on the extracted speech feature data, and computing an air conduction speech dictionary on the air conduction speech feature data with sparse non-negative matrix factorization (Sparse NMF);
Encoder pre-training: taking the bone conduction speech features as the training input and the air conduction speech dictionary combination coefficients as the training target, training an encoder model based on a non-negative sparse long short-term memory recurrent neural network (NS-LSTM), and saving the trained deep neural network parameters as the initialization parameters of the encoder in the next step;
Codec joint training: constructing a decoder model based on a local attention mechanism, taking the encoder output as the decoder input and the air conduction speech features as the training target, jointly training the codec model, and saving the model parameters;
Speech enhancement: extracting the features of the bone conduction speech to be enhanced, performing feature conversion with the codec neural network trained in the previous steps, and then applying inverse normalization and inverse feature transformation to the network output to finally obtain the enhanced time-domain speech.
Compared with the prior art, the invention has the following notable advantages: a speech dictionary and a non-negative sparse recurrent neural network are applied to the bone conduction speech enhancement task, a codec framework based on a local attention mechanism is constructed, the network model parameters are obtained from data through training, and the trained model effectively improves the bone conduction speech enhancement quality. Specifically:
(1) The structured information provided by the speech dictionary calculated by sparse nonnegative matrix factorization is effectively utilized, and the high-frequency components of the speech are better reconstructed;
(2) The output of the encoder is the linear combination coefficients over the speech dictionary; since the speech dictionary is extracted from real clean air conduction speech with Sparse NMF, the encoder has better anti-noise capability and can automatically remove noise from the bone conduction speech during encoding;
(3) The sparse non-negative recurrent neural network is effectively utilized to model the complex nonlinear relation of the feature conversion from bone conduction speech to air conduction speech on the basis of the dictionary; compared with a traditional neural network, the non-negative sparse neural network can effectively learn the long-term dependencies of the sequence through its specially designed unit structure and establish the mapping relation with the speech dictionary;
(4) The decoder network based on the local attention mechanism determines the input content of the decoder through training, so that the decoder network has the capability of recovering bone conduction voice silence (corresponding air conduction voice is not necessarily silent), has the anti-interference capability on strong noise, and can further improve the bone conduction voice recovery quality.
The present invention is described in further detail below with reference to the attached drawing figures.
Drawings
FIG. 1 is a flow chart of a blind enhancement method for bone conduction speech based on codec architecture and recurrent neural network according to the present invention.
FIG. 2 is a schematic diagram of a codec architecture used in the present invention.
FIG. 3 is a schematic diagram of a non-negative sparse NS-LSTM cell structure.
Fig. 4-1 and 4-2 are diagrams of an embodiment of the bone conduction voice blind enhancement of the present invention.
Detailed Description
With reference to fig. 1 and fig. 2, the implementation of the bone conduction speech blind enhancement method based on codec architecture and recurrent neural network of the present invention is divided into two stages: a training phase and an enhancement phase. The training phase comprises a first step, a second step and a third step, and the enhancement phase comprises a fourth step and a fifth step. The training phase and the enhancement phase speech data are not repeated, i.e. there are no sentences with the same utterance content.
The first phase is a training phase: the neural network model is trained by the training data.
The method comprises the following steps: extract the air conduction (AC) and bone conduction (BC) speech magnitude spectrum features and perform data preprocessing on the extracted features so that they meet the input requirements of the neural network. The first two processing steps are consistent with the data preprocessing step of the patent "Single-channel unsupervised speech and noise separation method based on low-rank and sparse matrix decomposition" (CN102915742B); to reduce the dynamic range of the extracted magnitude spectrum features, logarithmic magnitude spectrum features are adopted. The steps are as follows:
(1) The speech data are AC and BC speech pairs recorded by the same person wearing AC and BC microphone devices at the same time; the AC speech is denoted A and the BC speech is denoted B. The AC and BC speech time-domain signals y(A) and y(B) are transformed to the time-frequency domain with the short-time Fourier transform (STFT), specifically:
(1) frame and window the speech time-domain signals y(A) and y(B), where the window function is a Hamming window, the frame length is N (an integer power of 2), and the inter-frame shift is H;
(2) apply a K-point discrete Fourier transform to each frame to obtain the time-frequency spectra Y_A(k,t) and Y_B(k,t) of the speech, calculated as:

Y(k,t) = Σ_{n=0}^{N-1} y(n + tH) · h(n) · e^{-j2πkn/K}

where k = 0, 1, …, K-1 denotes the discrete frequency bin, K is the number of points in the discrete Fourier transform, K = N, t = 0, 1, …, T-1 denotes the frame index, T is the total number of frames, and h(n) is the Hamming window function;
(2) Take the absolute value of the spectrum Y(k,t) to obtain the magnitude spectra M_A and M_B, calculated as:

M(k,t) = |Y(k,t)|

(3) Take the natural logarithm (ln) of the magnitude spectrum M(k,t) to obtain the log-magnitude spectra L_A and L_B, calculated as:

L(k,t) = ln M(k,t)

(4) Compute the air conduction speech dictionary D on the log-magnitude spectrum feature matrix of the clean air conduction speech with Sparse Non-negative Matrix Factorization (Sparse NMF).
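The preprocessing in this step amounts to a standard STFT analysis followed by a dictionary-learning step. The following is a minimal Python/NumPy sketch of that pipeline; it is an illustration, not the patent's reference implementation, and the frame length, hop, number of dictionary atoms and sparsity weight are assumed values.

```python
import numpy as np

def log_magnitude_spectrum(y, frame_len=256, hop=80):
    """Frame, apply a Hamming window and a K-point DFT (K = frame_len), then take
    the magnitude and log-magnitude; each frame yields K//2 + 1 frequency bins."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[t * hop: t * hop + frame_len] * window
                       for t in range(n_frames)], axis=1)   # shape (N, T)
    Y = np.fft.rfft(frames, n=frame_len, axis=0)             # time-frequency spectrum Y(k, t)
    M = np.abs(Y)                                             # magnitude spectrum M(k, t)
    L = np.log(M + 1e-8)                                      # log-magnitude spectrum L(k, t)
    return Y, M, L

def sparse_nmf_dictionary(V, n_atoms=64, sparsity=0.1, n_iter=200, seed=0):
    """Tiny sparse NMF via multiplicative updates with an L1 penalty on the
    activations: V ~= D @ C with D >= 0, C >= 0; returns the dictionary D.
    (The patent computes D on the log-magnitude features; the magnitude spectrum
    is used here simply so that every entry is non-negative.)"""
    rng = np.random.default_rng(seed)
    K, T = V.shape
    D = rng.random((K, n_atoms)) + 1e-3
    C = rng.random((n_atoms, T)) + 1e-3
    for _ in range(n_iter):
        C *= (D.T @ V) / (D.T @ (D @ C) + sparsity + 1e-9)
        D *= (V @ C.T) / ((D @ C) @ C.T + 1e-9)
        D /= np.maximum(D.sum(axis=0, keepdims=True), 1e-9)  # normalise dictionary atoms
    return D

if __name__ == "__main__":
    fs = 8000
    ac = np.random.randn(fs * 2)              # stand-in for an air-conduction recording
    _, M_A, L_A = log_magnitude_spectrum(ac)
    D = sparse_nmf_dictionary(M_A, n_atoms=64)
    print(L_A.shape, D.shape)                 # (129, T) and (129, 64)
```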
Step two: encoder pre-training. The pre-trained encoder network consists of three layers (see FIG. 2): a linear layer (Linear), a long short-term memory recurrent network layer (LSTM), and a non-negative sparse long short-term memory recurrent neural network layer (NS-LSTM). During training, the normalized log-magnitude spectrum features of the bone conduction speech are used as the training input and the log-magnitude spectrum features of the air conduction speech as the training target; the neural network model is trained with the back-propagation through time (BPTT) algorithm, and the trained neural network parameters are saved. The NS-LSTM unit structure and the pre-training process of the encoder network are as follows:
(1) The non-negative sparse long short-term memory neural network model is a variant of the long short-term memory (LSTM) model; by introducing non-negative and sparse control variables it can produce output vectors that satisfy the constraints. Its constituent unit is shown in FIG. 3 and can be represented by the following formulas (a sketch of one time step of this unit is given after them):

f_t = σ(W_fx X_t + W_fh h_{t-1} + b_f)
i_t = σ(W_ix X_t + W_ih h_{t-1} + b_i)
g_t = σ(W_gx X_t + W_gh h_{t-1} + b_g)
o_t = σ(W_ox X_t + W_oh h_{t-1} + b_o)
S_t = g_t ⊙ i_t + S_{t-1} ⊙ f_t
h_t = sh_(D,u)(ψ(φ(S_t)) ⊙ o_t)

where σ(x) = 1/(1 + e^{-x}) is the sigmoid function, φ(x) = tanh(x), ψ(x) = ReLU(x) = max(0, x) is the non-negative constraint, and sh_(D,u)(x) = D(tanh(x + u) + tanh(x - u)) is the sparse activation function;
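The following NumPy sketch runs one time step of such a unit with randomly initialised weights; the parameters D and u of the sparse activation are assumed scalar values, and the layer sizes are illustrative only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sh(x, D=1.0, u=0.5):
    """Sparse activation sh_(D,u)(x) = D*(tanh(x+u) + tanh(x-u))."""
    return D * (np.tanh(x + u) + np.tanh(x - u))

def ns_lstm_step(x_t, h_prev, s_prev, W, b, D=1.0, u=0.5):
    """One NS-LSTM time step following the unit equations above:
    f, i, g, o gates -> cell state S_t -> non-negative sparse output h_t."""
    f_t = sigmoid(W["fx"] @ x_t + W["fh"] @ h_prev + b["f"])
    i_t = sigmoid(W["ix"] @ x_t + W["ih"] @ h_prev + b["i"])
    g_t = sigmoid(W["gx"] @ x_t + W["gh"] @ h_prev + b["g"])
    o_t = sigmoid(W["ox"] @ x_t + W["oh"] @ h_prev + b["o"])
    s_t = g_t * i_t + s_prev * f_t
    # h_t = sh_(D,u)(ReLU(tanh(S_t)) ⊙ o_t): non-negative gated state, then sparse activation
    h_t = sh(np.maximum(np.tanh(s_t), 0.0) * o_t, D, u)
    return h_t, s_t

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_in, n_hid = 129, 64
    W = {k: rng.standard_normal((n_hid, n_in if k.endswith("x") else n_hid)) * 0.1
         for k in ["fx", "fh", "ix", "ih", "gx", "gh", "ox", "oh"]}
    b = {k: np.zeros(n_hid) for k in ["f", "i", "g", "o"]}
    h, s = np.zeros(n_hid), np.zeros(n_hid)
    h, s = ns_lstm_step(rng.standard_normal(n_in), h, s, W, b)
    print(h.shape, float(h.min()))   # hidden state is non-negative by construction
```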
(2) Dropout regularization technique: to improve the robustness of the model, dropout regularization is applied during neural network training; it improves generalization by randomly removing neural units. With the dropout ratio set to p (e.g., p = 0.2), the dropout formulas are:

r_j^(l) ~ Bernoulli(p)
ỹ^(l) = r^(l) ⊙ y^(l)
y_i^(l+1) = f(w_i^(l+1) ỹ^(l) + b_i^(l+1))

where r_j^(l) indicates the presence of the j-th neuron in layer l, Bernoulli(p) is a Bernoulli distribution with probability p (it takes the value 1 with probability p and 0 with probability 1 - p), y_j^(l) is the output value of the j-th neuron in layer l, ỹ_j^(l) is y_j^(l) multiplied by r_j^(l), i.e. it equals y_j^(l) or 0, w_i^(l+1) is the network weight, b_i^(l+1) is the bias, f denotes the activation unit, and y_i^(l+1) is the neuron output after the activation function. A small sketch of this masking step follows.
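The sketch below illustrates the masking formulas only; the keep probability and layer size are arbitrary example values.

```python
import numpy as np

def dropout_forward(y_l, keep_p=0.8, rng=None):
    """Dropout as in the formulas above: r ~ Bernoulli(keep_p) masks the layer
    output, so each unit survives with probability keep_p."""
    rng = rng or np.random.default_rng()
    r = rng.binomial(1, keep_p, size=y_l.shape)   # presence indicator r_j^(l)
    return r * y_l                                # ỹ^(l) = r^(l) ⊙ y^(l)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y_l = rng.standard_normal(10)
    y_tilde = dropout_forward(y_l, keep_p=0.8, rng=rng)
    # the next layer then computes y^(l+1) = f(w^(l+1) @ ỹ^(l) + b^(l+1))
    print(y_tilde)
```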
(3) Training of the encoder neural network: the training loss objective is the mean square error between the dictionary reconstruction of the network output and the corresponding AC speech log-magnitude spectrum:

Loss = (1/T) Σ_{t=1}^{T} ‖ L_A(:,t) - M c_t ‖²

where c_t is the dictionary coefficient vector output by the encoder at frame t and M = [D, I, -I] is the concatenation of the air conduction speech dictionary and the compensation dictionaries, I being the diagonal matrix with diagonal elements 1 and remaining elements 0; the compensation dictionaries improve the representation precision in the linear combination of dictionary elements. During training, b% of the training data (e.g., b between 10 and 20) is used as validation set data and the loss function is minimized; the network weights are randomly initialized in [-0.1, 0.1]. Specifically, the RMSProp (Root Mean Square Propagation) variant of stochastic gradient descent (SGD) is used with the initial learning rate set to lr (e.g., lr = 0.01); when the validation loss does not decrease, the learning rate is multiplied by a factor ratio (e.g., ratio = 0.1), with momentum momentum (e.g., momentum = 0.9); when the validation loss does not decrease for i consecutive rounds (e.g., i = 3), training is stopped, and the neural network model parameters with the minimum validation loss are saved, denoted S'. A sketch of this loss is given below.
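The compensated dictionary and the mean-square loss can be assembled as follows; this is a sketch in which the encoder producing the coefficients is stood in by a random non-negative matrix, and all dimensions are assumed.

```python
import numpy as np

def compensated_dictionary(D):
    """M = [D, I, -I]: speech dictionary plus the two compensation dictionaries."""
    K = D.shape[0]
    I = np.eye(K)
    return np.concatenate([D, I, -I], axis=1)        # shape (K, n_atoms + 2K)

def encoder_loss(L_A, C, M):
    """Mean square error between the dictionary reconstruction M @ C and the
    AC log-magnitude spectrum L_A (both of shape (K, T))."""
    return np.mean((L_A - M @ C) ** 2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, T, n_atoms = 129, 200, 64
    D = rng.random((K, n_atoms))                      # dictionary from step one
    M = compensated_dictionary(D)
    L_A = rng.standard_normal((K, T))                 # stand-in AC log-magnitude spectrum
    C = np.abs(rng.standard_normal((M.shape[1], T)))  # stand-in encoder output: non-negative coefficients
    print(encoder_loss(L_A, C, M))
```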
Step three: codec joint training. The decoder structure is shown in FIG. 2 and consists of two network layers, a recurrent network layer (LSTM) and a linear layer (Linear). a_i denotes the decoder network input based on the local attention mechanism:

a_i = Σ_{j ∈ N(i)} ω_ij e_j

where e_j is the j-th output of the encoder and N(i) denotes the neighbourhood of the i-th encoder output, typically 10 to 20 outputs. ω_ij are the weighted combination coefficients of these neighbouring inputs, computed as:

ω_ij = exp(score_ij) / Σ_{j' ∈ N(i)} exp(score_ij')

score_ij is the weighted score of the j-th encoder output e_j with respect to the i-th decoder input a_i, and the normalization above yields the weights of the linear combination. W_a is the parameter matrix of a linear layer: the decoder output at time i - 1 is passed through this linear layer and an inner product with e_j is taken to compute the weighted score score_ij. A sketch of this local attention computation is given below.
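The following sketch computes one decoder input a_i from the encoder outputs; the dimensions, the initialisation of W_a and the neighbourhood size are assumed (the window here spans 17 outputs, within the 10 to 20 range mentioned above).

```python
import numpy as np

def local_attention_input(enc_outputs, dec_state_prev, W_a, i, half_window=8):
    """a_i = sum_{j in N(i)} w_ij * e_j, with w_ij = softmax(score_ij) over N(i)
    and score_ij = (W_a @ d_{i-1}) . e_j (inner product after a linear layer)."""
    T = enc_outputs.shape[0]
    lo, hi = max(0, i - half_window), min(T, i + half_window + 1)  # neighbourhood N(i)
    query = W_a @ dec_state_prev                                   # linear layer on d_{i-1}
    scores = enc_outputs[lo:hi] @ query                            # score_ij for j in N(i)
    scores -= scores.max()                                         # numerical stability
    w = np.exp(scores) / np.exp(scores).sum()                      # softmax weights w_ij
    return w @ enc_outputs[lo:hi]                                  # weighted combination a_i

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d_enc, d_dec = 100, 64, 64
    enc = rng.standard_normal((T, d_enc))      # encoder outputs e_j
    d_prev = rng.standard_normal(d_dec)        # decoder output at time i-1
    W_a = rng.standard_normal((d_enc, d_dec)) * 0.1
    a_i = local_attention_input(enc, d_prev, W_a, i=50)
    print(a_i.shape)                           # (64,)
```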
The purpose of the decoder is to obtain, through training, a speech signal closer to the real one on the basis of the dictionary-coded synthesized speech. The decoder is optimized jointly with the encoder: a mean square error loss function is constructed with the log-magnitude spectrum features of the real speech signal, optimized period by period, and the optimal network parameters are obtained by gradient descent and saved locally, denoted S.
The second phase is the enhancement phase: the bone conduction speech is enhanced with the trained codec network model.
Step four: extract the features of the bone conduction speech to be enhanced and, according to the statistics of the aligned log-magnitude spectrum L_B obtained in step one, namely its mean μ_B and variance σ_B², perform data normalization.

First, the BC speech data B_E to be enhanced are transformed from the time-domain waveform to the time-frequency domain Y_{B_E}(k,t) by the Fourier transform. The process of extracting the features of the BC speech to be enhanced is shown in the enhancement part of FIG. 1; compared with the feature extraction in step one, the phase must also be extracted. From the time-frequency spectrum Y_{B_E}(k,t), both the magnitude spectrum M_{B_E}(k,t) and the phase θ_{B_E}(k,t) are computed:

M_{B_E}(k,t) = |Y_{B_E}(k,t)|
θ_{B_E}(k,t) = atan2(imag(Y_{B_E}(k,t)), real(Y_{B_E}(k,t)))

where atan2(·) is the four-quadrant arctangent function, and imag(·) and real(·) denote the imaginary and real parts of the time-frequency spectrum. From the magnitude spectrum M_{B_E}, the log-magnitude spectrum L_{B_E} is computed; then, according to the mean μ_B and variance σ_B² of the BC log-magnitude spectrum obtained in the training stage, L_{B_E} is normalized:

L̂_{B_E}(k,t) = (L_{B_E}(k,t) - μ_B) / σ_B

A sketch of this enhancement-stage feature extraction is given below.
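The enhancement-stage feature extraction differs from step one only in that the phase is kept and the normalisation uses the training-stage statistics. In the sketch below, per-frequency-bin statistics, the frame parameters and the input waveform are assumed values.

```python
import numpy as np

def stft(y, frame_len=256, hop=80):
    window = np.hamming(frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[t * hop: t * hop + frame_len] * window
                       for t in range(n_frames)], axis=1)
    return np.fft.rfft(frames, n=frame_len, axis=0)      # Y(k, t)

def enhancement_features(b_e, mu_B, sigma_B, frame_len=256, hop=80):
    """Step four: magnitude, phase and normalised log-magnitude of the BC speech
    to be enhanced, using the mean/variance saved in the training stage."""
    Y = stft(b_e, frame_len, hop)
    M = np.abs(Y)                                         # magnitude spectrum
    phase = np.arctan2(Y.imag, Y.real)                    # atan2(imag, real)
    L = np.log(M + 1e-8)                                  # log-magnitude spectrum
    L_norm = (L - mu_B[:, None]) / (sigma_B[:, None] + 1e-8)
    return L_norm, phase

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    b_e = rng.standard_normal(8000 * 3)                   # stand-in BC recording, 3 s at 8 kHz
    mu_B, sigma_B = np.zeros(129), np.ones(129)           # assumed training statistics
    L_norm, phase = enhancement_features(b_e, mu_B, sigma_B)
    print(L_norm.shape, phase.shape)
```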
Step five: during enhancement, convert the bone conduction speech features extracted in step four with the codec neural network trained in the first stage, and then apply inverse normalization and inverse feature transformation to the network output to finally obtain the enhanced time-domain speech.

First, the normalized log-magnitude spectrum L̂_{B_E} is input into the trained codec neural network model S, and the network output, i.e. the enhanced feature L̂_{A_E}, is obtained.

Second, the enhanced feature L̂_{A_E} is inverse-normalized and inverse-transformed to finally obtain the enhanced time-domain speech, as follows:

(1) According to the mean μ_A and variance σ_A² of the AC speech log-magnitude spectrum from the training stage, inverse-normalize the network output L̂_{A_E} to obtain the log-magnitude spectrum L_{A_E}:

L_{A_E}(k,t) = σ_A · L̂_{A_E}(k,t) + μ_A

(2) Apply the exponential operation to the log-magnitude spectrum L_{A_E} to obtain the magnitude spectrum M_{A_E}:

M_{A_E}(k,t) = exp(L_{A_E}(k,t))

(3) Use the magnitude spectrum M_{A_E} and the phase information θ_{B_E} to compute the time-frequency spectrum Y_{A_E}:

Y_{A_E}(k,t) = M_{A_E}(k,t) · e^{j θ_{B_E}(k,t)}

(4) Convert the spectrum Y_{A_E} to the time domain with the inverse Fourier transform and the overlap-add formula after speech framing, finally obtaining the enhanced time-domain speech signal y(B_E). A sketch of this resynthesis is given below.
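The sketch below reverses the feature pipeline: it denormalises a stand-in network output with assumed training-stage AC statistics, exponentiates, reattaches the BC phase, and resynthesises the waveform with inverse FFT and overlap-add; window, hop and statistics are assumed values.

```python
import numpy as np

def istft_overlap_add(Y, frame_len=256, hop=80):
    """Inverse DFT per frame followed by windowed overlap-add resynthesis."""
    window = np.hamming(frame_len)
    n_frames = Y.shape[1]
    out = np.zeros(frame_len + (n_frames - 1) * hop)
    norm = np.zeros_like(out)
    for t in range(n_frames):
        frame = np.fft.irfft(Y[:, t], n=frame_len)
        out[t * hop: t * hop + frame_len] += frame * window
        norm[t * hop: t * hop + frame_len] += window ** 2
    return out / np.maximum(norm, 1e-8)

def resynthesize(L_hat_norm, phase_B, mu_A, sigma_A, frame_len=256, hop=80):
    """Inverse normalisation -> exp -> combine with BC phase -> time domain."""
    L_A = L_hat_norm * sigma_A[:, None] + mu_A[:, None]   # inverse normalisation
    M_A = np.exp(L_A)                                      # magnitude spectrum
    Y_A = M_A * np.exp(1j * phase_B)                       # M(k,t) * e^{j*theta(k,t)}
    return istft_overlap_add(Y_A, frame_len, hop)          # enhanced waveform y(B_E)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, T = 129, 200
    L_hat = rng.standard_normal((K, T))                    # stand-in network output
    phase = rng.uniform(-np.pi, np.pi, size=(K, T))        # phase saved in step four
    mu_A, sigma_A = np.zeros(K), np.ones(K)                # assumed AC statistics
    y_e = resynthesize(L_hat, phase, mu_A, sigma_A)
    print(y_e.shape)
```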
Examples
Fig. 4-1 and 4-2 are diagrams of embodiments of the invention; the example speech lengths are about 3.5 s and 4 s, the speech sampling frequency is 8 kHz, the frame length is set to 32 ms, the frame shift to 10 ms, a discrete Fourier transform with K = 256 frequency points is applied to each frame, and the resulting log-magnitude spectrum has 129 dimensions. In Fig. 4-1 and 4-2, (a) is the spectrogram of the bone conduction speech, (b) is the spectrogram of the speech enhanced with an LSTM deep neural network, and (c) is the spectrogram of the speech enhanced by the invention. It can be seen that the missing high-frequency components and signals such as breath sounds and fricatives are effectively recovered in the enhanced speech, an improvement over the LSTM algorithm; in addition, subjective test results also show that the invention achieves a good bone conduction speech enhancement effect.
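As a quick check of the dimensions quoted in this example (a hypothetical script, not part of the patent):

```python
fs = 8000                      # sampling frequency (Hz)
frame_len = int(0.032 * fs)    # 32 ms frame -> 256 samples
hop = int(0.010 * fs)          # 10 ms shift -> 80 samples
K = frame_len                  # DFT points, K = N = 256
n_bins = K // 2 + 1            # one-sided spectrum -> 129 log-magnitude dimensions
print(frame_len, hop, n_bins)  # 256 80 129
```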

Claims (6)

1. A bone conduction voice blind enhancement method based on a codec architecture and a recurrent neural network is characterized by comprising the following steps:
data preprocessing: extracting air conduction AC and bone conduction BC voice amplitude spectrum characteristics, performing alignment pretreatment on the extracted voice characteristic data, and calculating an air conduction voice dictionary by using sparse nonnegative matrix decomposition on the air conduction voice characteristic data;
pre-training of the encoder: taking the bone conduction voice characteristics as training input, taking the air conduction voice dictionary combination coefficient as a training target, adopting a nonnegative and sparse long-term memory recurrent neural network training encoder model, and storing the trained deep neural network parameters as initialization parameters of an encoder in the next step;
joint training of codecs: constructing a decoder model based on a local attention mechanism, taking the output of a coder as the input of a decoder, taking the air conduction voice characteristic as a training target, jointly training a codec model, and storing model parameters;
and (3) speech enhancement: and extracting the features of the bone conduction speech to be enhanced, realizing feature conversion by utilizing the trained coding and decoding neural network in the steps, and then carrying out inverse normalization and inverse feature transformation on the output of the neural network to finally obtain the enhanced time domain speech.
2. The method of claim 1, characterized in that air conduction and bone conduction speech magnitude spectral features are extracted and the extracted speech features are data preprocessed to fit the input requirements of the neural network, wherein to reduce the dynamic range of the extracted magnitude spectral features, logarithmic magnitude spectral features are used:
(1) The voice data is an AC and BC voice pair recorded by the same person wearing AC and BC microphone equipment at the same time, the AC voice is represented as A, the BC voice is represented as B, and AC and BC voice time domain signals y (A) and y (B) are respectively transformed to a time frequency domain by short-time Fourier transform, and the method specifically comprises the following steps:
(1) respectively carrying out frame division and windowing on voice time domain signals y (A) and y (B), wherein a window function is a Hamming window, the frame length is N, N is an integer power of 2, and the inter-frame moving length is H;
(2) performing a K-point discrete Fourier transform on each frame to obtain the time-frequency spectra Y_A(k,t) and Y_B(k,t) of the speech, calculated as:

Y(k,t) = Σ_{n=0}^{N-1} y(n + tH) · h(n) · e^{-j2πkn/K}

where k = 0, 1, …, K-1 denotes the discrete frequency bin, K is the number of points in the discrete Fourier transform, K = N, t = 0, 1, …, T-1 denotes the frame index, T is the total number of frames, and h(n) is the Hamming window function;
(2) Taking the absolute value of the spectrum Y(k,t) to obtain the magnitude spectra M_A and M_B, calculated as:

M(k,t) = |Y(k,t)|

(3) Taking the natural logarithm (ln) of the magnitude spectrum M(k,t) to obtain the log-magnitude spectra L_A and L_B, calculated as:
L(k,t)=lnM(k,t)
(4) And calculating the air conduction voice dictionary D on the log-amplitude spectrum characteristic matrix of the pure air conduction voice by adopting sparse nonnegative matrix decomposition.
3. The method of claim 2, wherein in the pre-training of the encoder, the pre-trained encoder network consists of three layers: a linear layer (Linear), a long short-term memory recurrent network layer (LSTM), and a non-negative sparse long short-term memory recurrent neural network layer (NS-LSTM); during training, the normalized log-magnitude spectrum features of the bone conduction speech are used as the training input and the log-magnitude spectrum features of the air conduction speech as the training target, the neural network model is trained with the back-propagation through time algorithm, and the trained neural network parameters are saved, the NS-LSTM unit structure and the pre-training process of the encoder network being as follows:
(1) The non-negative sparse long short-term memory neural network model is a variant of the long short-term memory (LSTM) model; by introducing non-negative and sparse control variables it produces output vectors that satisfy the constraints, its constituent unit being represented by the following formulas:

f_t = σ(W_fx X_t + W_fh h_{t-1} + b_f)
i_t = σ(W_ix X_t + W_ih h_{t-1} + b_i)
g_t = σ(W_gx X_t + W_gh h_{t-1} + b_g)
o_t = σ(W_ox X_t + W_oh h_{t-1} + b_o)
S_t = g_t ⊙ i_t + S_{t-1} ⊙ f_t
h_t = sh_(D,u)(ψ(φ(S_t)) ⊙ o_t)

where σ(x) = 1/(1 + e^{-x}) is the sigmoid function, φ(x) = tanh(x), ψ(x) = ReLU(x) = max(0, x) is the non-negative constraint, and sh_(D,u)(x) = D(tanh(x + u) + tanh(x - u)) is the sparse activation function;
(2) Dropout regularization technique: with the dropout ratio set to p, the dropout formulas are:

r_j^(l) ~ Bernoulli(p)
ỹ^(l) = r^(l) ⊙ y^(l)
y_i^(l+1) = f(w_i^(l+1) ỹ^(l) + b_i^(l+1))

where r_j^(l) indicates the presence of the j-th neuron in layer l, Bernoulli(p) is a Bernoulli distribution with probability p (taking the value 1 with probability p and 0 with probability 1 - p), y_j^(l) is the output value of the j-th neuron in layer l, ỹ_j^(l) is y_j^(l) multiplied by r_j^(l), i.e. it equals y_j^(l) or 0, w_i^(l+1) is the network weight, b_i^(l+1) is the bias, f denotes the activation unit, and y_i^(l+1) is the neuron output after the activation function;
(3) Training of the encoder neural network: the training loss objective is the mean square error between the dictionary reconstruction of the network output and the corresponding AC speech log-magnitude spectrum:

Loss = (1/T) Σ_{t=1}^{T} ‖ L_A(:,t) - M c_t ‖²

where c is the dictionary coefficient vector, M = [D, I, -I] is the concatenation of the air conduction speech dictionary and the compensation dictionaries, D is the air conduction speech dictionary, and I is the diagonal matrix with diagonal elements 1 and remaining elements 0, compensating and improving the representation precision in the linear combination of dictionary elements; b% of the training data is used as validation set data, the loss function is minimized during training, and the network weights are randomly initialized in [-0.1, 0.1]; specifically, the RMSProp variant of the stochastic gradient descent algorithm is adopted with the initial learning rate set to lr, when the validation loss does not decrease the learning rate is multiplied by a factor ratio with momentum momentum, when the validation loss does not decrease for i consecutive training rounds the training is stopped, and the neural network model parameters with the minimum validation loss are saved, denoted S'.
4. The method of claim 3, wherein in the codec joint training, the decoder comprises two network layers, a recurrent network layer (LSTM) and a linear layer (Linear), and a_i denotes the decoder network input based on the local attention mechanism:

a_i = Σ_{j ∈ N(i)} ω_ij e_j

where e_j is the j-th output of the encoder, N(i) denotes the neighbourhood of the i-th encoder output, and ω_ij are the weighted combination coefficients of the neighbouring inputs, calculated as:

ω_ij = exp(score_ij) / Σ_{j' ∈ N(i)} exp(score_ij')

score_ij is the weighted score of the j-th encoder output e_j with respect to the i-th decoder input a_i, normalized as above to obtain the weights of the linear combination; W_a is the parameter matrix of a linear layer, and the decoder output at time i - 1 is passed through this linear layer and an inner product with e_j is taken to compute the weighted score score_ij;

the decoder is used to obtain, through training, a speech signal closer to the real one on the basis of the dictionary-coded synthesized speech; the decoder is optimized jointly with the encoder, a mean square error loss function is constructed with the log-magnitude spectrum features of the real speech signal in a period-by-period optimization manner, and the optimal network parameters are obtained by gradient descent and stored locally, denoted S.
5. The method of claim 4, wherein the bone conduction speech features to be enhanced are extracted and, according to the statistics of the aligned log-magnitude spectrum L_B, namely its mean μ_B and variance σ_B², data normalization is performed:

first, the BC speech data B_E to be enhanced are transformed from the time-domain waveform to the time-frequency domain Y_{B_E}(k,t) by the Fourier transform; compared with the feature extraction, the process of extracting the features of the BC speech to be enhanced additionally extracts the phase, i.e. after obtaining the time-frequency spectrum Y_{B_E}(k,t), both the magnitude spectrum M_{B_E}(k,t) and the phase θ_{B_E}(k,t) are calculated:

M_{B_E}(k,t) = |Y_{B_E}(k,t)|
θ_{B_E}(k,t) = atan2(imag(Y_{B_E}(k,t)), real(Y_{B_E}(k,t)))

where atan2(·) is the four-quadrant arctangent function, and imag(·) and real(·) denote the imaginary and real parts of the time-frequency spectrum; the log-magnitude spectrum L_{B_E} is calculated from the magnitude spectrum M_{B_E}, and then, according to the mean μ_B and variance σ_B² of the BC log-magnitude spectrum obtained in the training stage, L_{B_E} is normalized:

L̂_{B_E}(k,t) = (L_{B_E}(k,t) - μ_B) / σ_B
6. The method of claim 5, wherein during enhancement the extracted bone conduction speech features are converted with the codec neural network trained in the first stage, and then inverse normalization and inverse feature transformation are applied to the network output to finally obtain the enhanced time-domain speech:

first, the normalized log-magnitude spectrum L̂_{B_E} is input into the trained codec neural network model S, and the network output, i.e. the enhanced feature L̂_{A_E}, is obtained;

second, the enhanced feature L̂_{A_E} is inverse-normalized and inverse-transformed to finally obtain the enhanced time-domain speech, as follows:

(1) According to the mean μ_A and variance σ_A² of the AC speech log-magnitude spectrum from the training stage, inverse-normalize the network output L̂_{A_E} to obtain the log-magnitude spectrum L_{A_E}:

L_{A_E}(k,t) = σ_A · L̂_{A_E}(k,t) + μ_A

(2) Apply the exponential operation to the log-magnitude spectrum L_{A_E} to obtain the magnitude spectrum M_{A_E}:

M_{A_E}(k,t) = exp(L_{A_E}(k,t))

(3) Use the magnitude spectrum M_{A_E} and the phase information θ_{B_E} to calculate the time-frequency spectrum Y_{A_E}:

Y_{A_E}(k,t) = M_{A_E}(k,t) · e^{j θ_{B_E}(k,t)}

(4) Convert the spectrum Y_{A_E} to the time domain with the inverse Fourier transform and the overlap-add formula after speech framing, finally obtaining the enhanced time-domain speech signal y(B_E).
CN201810960512.0A 2018-08-22 2018-08-22 Bone conduction voice blind enhancement method based on codec framework and recurrent neural network Active CN108986834B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810960512.0A CN108986834B (en) 2018-08-22 2018-08-22 Bone conduction voice blind enhancement method based on codec framework and recurrent neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810960512.0A CN108986834B (en) 2018-08-22 2018-08-22 Bone conduction voice blind enhancement method based on codec framework and recurrent neural network

Publications (2)

Publication Number Publication Date
CN108986834A CN108986834A (en) 2018-12-11
CN108986834B true CN108986834B (en) 2023-04-07

Family

ID=64547287

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810960512.0A Active CN108986834B (en) 2018-08-22 2018-08-22 Bone conduction voice blind enhancement method based on codec framework and recurrent neural network

Country Status (1)

Country Link
CN (1) CN108986834B (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109793511A (en) * 2019-01-16 2019-05-24 成都蓝景信息技术有限公司 Electrocardiosignal noise detection algorithm based on depth learning technology
CN109975702B (en) * 2019-03-22 2021-08-10 华南理工大学 Direct-current gear reduction motor quality inspection method based on circulation network classification model
CN110111803B (en) * 2019-05-09 2021-02-19 南京工程学院 Transfer learning voice enhancement method based on self-attention multi-kernel maximum mean difference
CN110085249B (en) * 2019-05-09 2021-03-16 南京工程学院 Single-channel speech enhancement method of recurrent neural network based on attention gating
CN110136731B (en) * 2019-05-13 2021-12-24 天津大学 Cavity causal convolution generation confrontation network end-to-end bone conduction voice blind enhancement method
CN110164465B (en) * 2019-05-15 2021-06-29 上海大学 Deep-circulation neural network-based voice enhancement method and device
CN110648684B (en) * 2019-07-02 2022-02-18 中国人民解放军陆军工程大学 Bone conduction voice enhancement waveform generation method based on WaveNet
CN110390945B (en) * 2019-07-25 2021-09-21 华南理工大学 Dual-sensor voice enhancement method and implementation device
CN110675888A (en) * 2019-09-25 2020-01-10 电子科技大学 Speech enhancement method based on RefineNet and evaluation loss
KR102429152B1 (en) * 2019-10-09 2022-08-03 엘레복 테크놀로지 컴퍼니 리미티드 Deep learning voice extraction and noise reduction method by fusion of bone vibration sensor and microphone signal
CN110931031A (en) * 2019-10-09 2020-03-27 大象声科(深圳)科技有限公司 Deep learning voice extraction and noise reduction method fusing bone vibration sensor and microphone signals
CN110867192A (en) * 2019-10-23 2020-03-06 北京计算机技术及应用研究所 Speech enhancement method based on gated cyclic coding and decoding network
CN110808063A (en) * 2019-11-29 2020-02-18 北京搜狗科技发展有限公司 Voice processing method and device for processing voice
CN111242976B (en) * 2020-01-08 2021-05-11 北京天睿空间科技股份有限公司 Aircraft detection tracking method using attention mechanism
CN111312270B (en) * 2020-02-10 2022-11-22 腾讯科技(深圳)有限公司 Voice enhancement method and device, electronic equipment and computer readable storage medium
CN111833843B (en) 2020-07-21 2022-05-10 思必驰科技股份有限公司 Speech synthesis method and system
CN112185405B (en) * 2020-09-10 2024-02-09 中国科学技术大学 Bone conduction voice enhancement method based on differential operation and combined dictionary learning
CN111899757B (en) * 2020-09-29 2021-01-12 南京蕴智科技有限公司 Single-channel voice separation method and system for target speaker extraction
CN112562704B (en) * 2020-11-17 2023-08-18 中国人民解放军陆军工程大学 Frequency division topological anti-noise voice conversion method based on BLSTM
CN112786064B (en) * 2020-12-30 2023-09-08 西北工业大学 End-to-end bone qi conduction voice joint enhancement method
CN113642714B (en) * 2021-08-27 2024-02-09 国网湖南省电力有限公司 Insulator pollution discharge state identification method and system based on small sample learning
WO2023102930A1 (en) * 2021-12-10 2023-06-15 清华大学深圳国际研究生院 Speech enhancement method, electronic device, program product, and storage medium
CN114495909B (en) * 2022-02-20 2024-04-30 西北工业大学 End-to-end bone-qi guiding voice joint recognition method
CN115033734B (en) * 2022-08-11 2022-11-11 腾讯科技(深圳)有限公司 Audio data processing method and device, computer equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003264883A (en) * 2002-03-08 2003-09-19 Denso Corp Voice processing apparatus and voice processing method
CN102915742B (en) * 2012-10-30 2014-07-30 中国人民解放军理工大学 Single-channel monitor-free voice and noise separating method based on low-rank and sparse matrix decomposition
CN103559888B (en) * 2013-11-07 2016-10-05 航空电子系统综合技术重点实验室 Based on non-negative low-rank and the sound enhancement method of sparse matrix decomposition principle
US10013975B2 (en) * 2014-02-27 2018-07-03 Qualcomm Incorporated Systems and methods for speaker dictionary based speech modeling
US10249292B2 (en) * 2016-12-14 2019-04-02 International Business Machines Corporation Using long short-term memory recurrent neural network for speaker diarization segmentation
CN107886967B (en) * 2017-11-18 2018-11-13 中国人民解放军陆军工程大学 A kind of bone conduction sound enhancement method of depth bidirectional gate recurrent neural network

Also Published As

Publication number Publication date
CN108986834A (en) 2018-12-11

Similar Documents

Publication Publication Date Title
CN108986834B (en) Bone conduction voice blind enhancement method based on codec framework and recurrent neural network
US10777215B2 (en) Method and system for enhancing a speech signal of a human speaker in a video using visual information
Dave Feature extraction methods LPC, PLP and MFCC in speech recognition
CN107886967B (en) A kind of bone conduction sound enhancement method of depth bidirectional gate recurrent neural network
Akbari et al. Lip2audspec: Speech reconstruction from silent lip movements video
EP1536414B1 (en) Method and apparatus for multi-sensory speech enhancement
CN110085245B (en) Voice definition enhancing method based on acoustic feature conversion
CN111833896B (en) Voice enhancement method, system, device and storage medium for fusing feedback signals
CN109767756B (en) Sound characteristic extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficient
CN108108357B (en) Accent conversion method and device and electronic equipment
EP1250700A1 (en) Speech parameter compression
GB2560174A (en) A feature extraction system, an automatic speech recognition system, a feature extraction method, an automatic speech recognition method and a method of train
CN111192598A (en) Voice enhancement method for jump connection deep neural network
CN110867192A (en) Speech enhancement method based on gated cyclic coding and decoding network
CN111986679A (en) Speaker confirmation method, system and storage medium for responding to complex acoustic environment
Sharma et al. Study of robust feature extraction techniques for speech recognition system
Chetouani et al. Investigation on LP-residual representations for speaker identification
CN105679321A (en) Speech recognition method and device and terminal
CN112185405B (en) Bone conduction voice enhancement method based on differential operation and combined dictionary learning
CN109215635B (en) Broadband voice frequency spectrum gradient characteristic parameter reconstruction method for voice definition enhancement
Ganapathy et al. Robust spectro-temporal features based on autoregressive models of hilbert envelopes
CN111968627B (en) Bone conduction voice enhancement method based on joint dictionary learning and sparse representation
Zheng et al. Throat microphone speech enhancement via progressive learning of spectral mapping based on lstm-rnn
Pickersgill et al. Investigation of DNN prediction of power spectral envelopes for speech coding & ASR
Zheng et al. Bandwidth extension WaveNet for bone-conducted speech enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant