CN108986834A - Blind enhancement method for bone-conducted speech based on a codec framework and a recurrent neural network - Google Patents
- Publication number
- CN108986834A CN108986834A CN201810960512.0A CN201810960512A CN108986834A CN 108986834 A CN108986834 A CN 108986834A CN 201810960512 A CN201810960512 A CN 201810960512A CN 108986834 A CN108986834 A CN 108986834A
- Authority
- CN
- China
- Prior art keywords
- voice
- training
- bone conduction
- spectrum
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/005—Correction of errors induced by the transmission channel, if related to the coding algorithm
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention discloses a blind enhancement method for bone-conducted (BC) speech based on a codec (encoder-decoder) framework and a recurrent neural network. First, air-conducted (AC) and bone-conducted speech features are extracted and the extracted feature data are aligned in pre-processing. The encoder is then pre-trained with the BC speech features as training input and the AC speech-dictionary combination coefficients as the training target, and the stored parameters initialize the encoder for the next step. A decoder model based on a local attention mechanism is constructed; taking the encoder output as the decoder input and the AC speech features as the training target, the codec models are trained jointly and the model parameters are stored. Finally, the BC speech features to be enhanced are extracted, converted by the trained codec network, and the network output is de-normalized and inverse-transformed to obtain the enhanced time-domain speech. The invention addresses the recovery of high-frequency components, the restoration of BC unvoiced segments, and recovery under strong background noise, improving the enhancement quality of bone-conducted speech.
Description
Technical field
The invention belongs to the field of speech signal processing, and relates to a blind enhancement method for bone-conducted speech based on a codec framework and a deep long short-term memory recurrent neural network.
Background art
A bone-conduction microphone is a non-acoustic sensor device: when a person speaks, vocal-fold vibration is transmitted to the larynx and skull, and this kind of microphone picks up that vibration signal and converts it into an electrical signal to obtain speech. Unlike conventional air-conduction microphone speech, ambient noise can hardly affect such non-acoustic sensors, so bone-conducted speech shields noise at the source and has very strong noise immunity; it is applied both militarily and in civilian use. For example, many countries equip military hardware such as armed helicopters and tanks with bone-conduction communication systems, and in the U.S. "Future Soldier" individual-soldier system the bone-conduction headset is an important communication tool. On the civilian side, the U.S. company iASUS has developed throat microphones and bone-conduction headsets for extreme sports such as car and motorcycle racing, and Japanese companies such as Panasonic and Sony have developed a variety of bone-conduction communication products applied in fields such as fire fighting, forestry, petroleum exploration and extraction, mining, emergency rescue, special duty, and engineering construction.
Although bone-conducted speech effectively resists ambient-noise interference, the low-pass nature of signal conduction through the human body and the inherent characteristics of the vibration signal leave BC speech with thick, heavy low frequencies, missing high frequencies, attenuated mid frequencies, and missing breath and nasal sounds; the speech therefore sounds dull and insufficiently clear, seriously degrading auditory perception. In addition, BC speech can pick up non-acoustic physical noise, for example friction between the device and the skin it touches, strong wind friction during extreme sports, or the noise introduced by chewing or tooth impacts; these noises also reduce communication quality. Research on BC speech enhancement algorithms therefore has important theoretical significance and practical value for pushing bone-conduction microphone products further toward practical use and for improving voice communication quality in high-noise environments.
At present there are three comparatively typical methods for blind BC speech enhancement: unsupervised spectrum extension, equalization, and spectral-envelope conversion.
Unsupervised spectrum extension (Bouserhal R E, Falk T H, Voix J. In-ear microphone speech quality enhancement via adaptive filtering and artificial bandwidth extension [J]. Journal of the Acoustical Society of America, 2017) assumes that BC and AC speech share a consistent formant structure, or that the low- and high-frequency bands of speech share a consistent harmonic structure. Exploiting this structural property, the low-frequency spectrum can be extended directly to obtain enhanced high-frequency formants or harmonics, realizing blind BC speech enhancement.
The idea of equalization is to find the inverse g(t) of the transmission-channel transfer function h(t) and thereby recover the AC speech signal from the BC speech signal. Equalization was first proposed by Shimamura (Shimamura T, Tamiya T. A reconstruction filter for bone-conducted speech [C]. Circuits and Systems, 2005. Midwest Symposium on, 2005: 1847-1850): g(t) is modeled and an inverse filter is constructed to realize BC speech enhancement. Equalization preserves the low-frequency harmonic structure of speech and effectively compresses the excessive low-frequency energy of BC speech, but it has difficulty recovering the high-frequency components.
At present most blind BC enhancement uses methods based on spectral-envelope conversion (Turan M A T, Erzin E. Source and Filter Estimation for Throat-Microphone Speech Enhancement [J], 2016; Mohammadi S H, Kain A. An overview of voice conversion systems [J], 2017). The basic idea of the spectral-envelope approach follows the source-filter model of speech: speech is decomposed into excitation features and spectral-envelope features. In the training stage, BC and AC speech data are analyzed to extract excitation and spectral-envelope features, and the conversion relationship between the spectral envelopes is established by training a conversion model; in the enhancement stage, the BC speech to be enhanced is decomposed into excitation and spectral-envelope features, the AC spectral envelope is estimated from the BC envelope with the trained model, and the estimated envelope is combined with the original BC excitation to synthesize the enhanced speech.
The above analysis-synthesis approaches based on the source-filter model have made some progress in BC speech enhancement, but they generally suffer from difficult feature selection, poor performance in high-noise environments, and inaccurate recovery of the high-frequency components, leaving the enhanced speech with heavy low frequencies, insufficient clarity and intelligibility, and processing noise. Some research has begun to use analysis-synthesis based on a signal model: the speech signal is divided into a high-dimensional amplitude spectrum and a phase, and deep-learning techniques establish the relationship between BC speech and the clean AC high-dimensional amplitude spectrum, achieving good results in restoring BC speech. However, because no dictionary is used to introduce structural information, a series of problems remain: heavy low frequencies, incomplete recovery of high-frequency information, and insufficiently clear speech.
Summary of the invention
The purpose of the present invention is to provide a blind enhancement method for bone-conducted speech based on a codec framework and a recurrent neural network. The method is data-driven: model parameters are obtained by training, and the trained model is then used to enhance BC speech. It addresses the recovery of high-frequency components, the restoration of BC unvoiced segments, and recovery under strong background noise, thereby promoting the clarity and intelligibility of BC speech and further improving its enhancement quality.
The technical solution that realizes the aim of the invention is a blind enhancement method for bone-conducted speech based on a codec framework and a recurrent neural network, comprising the following steps:
Data pre-processing: extract the AC and BC speech features, align the extracted feature data in pre-processing, and compute the AC speech dictionary by sparse non-negative matrix factorization (Sparse NMF) on the AC speech feature data;
Pre-training of the encoder: with the BC speech features as training input and the AC speech-dictionary combination coefficients as training target, train the encoder model with the non-negative sparse long short-term memory recurrent network (NS-LSTM), and store the trained deep neural network parameters as the initialization of the encoder in the next step;
Joint training of the codec: construct the decoder model based on the local attention mechanism, take the encoder output as the decoder input and the AC speech features as the training target, train the codec models jointly, and store the model parameters;
Speech enhancement: extract the BC speech features to be enhanced, convert them with the codec network trained in the preceding steps, then de-normalize and inverse-transform the network output to obtain the enhanced time-domain speech.
Compared with the prior art, the remarkable advantages of the invention are that the speech dictionary and the non-negative sparse recurrent neural network are applied to the BC speech enhancement task; a codec framework based on a local attention mechanism is constructed; the approach is data-driven, the network model parameters are obtained by training, and the trained model effectively improves BC speech enhancement quality. Specifically:
(1) the structured information provided by the speech dictionary computed by sparse NMF is exploited to better rebuild the high-frequency components of speech;
(2) the encoder output is a vector of linear combination coefficients over the speech dictionary, and the dictionary is extracted from genuinely clean AC speech by Sparse NMF, so the encoder has good noise robustness and can automatically remove the noise in BC speech during encoding;
(3) the sparse non-negative recurrent network models, on the basis of the dictionary, the complex non-linear relationship of the BC-to-AC feature conversion; compared with conventional neural networks, its specially designed cell structure can effectively learn long-term dependencies in the sequence and establish the mapping to the speech dictionary;
(4) the decoder network based on the local attention mechanism determines the decoder's input content through training, giving it the ability to restore BC unvoiced segments (whose corresponding AC speech is not necessarily silent) and the same robustness against strong noise, further improving BC speech recovery quality.
The invention is described in further detail below with reference to the accompanying drawings.
Description of the drawings
Fig. 1 is the flow chart of the blind BC speech enhancement method of the invention based on the codec framework and recurrent neural network.
Fig. 2 is a diagram of the codec architecture used by the invention.
Fig. 3 is a schematic diagram of the non-negative sparse NS-LSTM cell structure.
Fig. 4 is an example of blind BC speech enhancement by the invention.
Detailed description
With reference to Figs. 1 and 2, the blind BC speech enhancement method of the invention is implemented in two stages: a training stage and an enhancement stage. The training stage comprises steps 1, 2, and 3; the enhancement stage comprises steps 4 and 5. The speech data of the two stages do not overlap, i.e. no sentence has identical speech content in both stages.
The first stage is the training stage: the neural network model is trained on the training data.
Step 1: extract the air-conduction (Air Conduction, AC) and bone-conduction (Bone Conduction, BC) speech amplitude-spectrum features, and pre-process the extracted features to meet the input requirements of the neural network. The first two processing stages are consistent with the data pre-processing steps of the patent "Single-channel unsupervised speech-noise separation method based on low-rank and sparse matrix decomposition" (CN 102915742B). To reduce the dynamic range of the extracted amplitude-spectrum features, log-amplitude spectral features are used. The steps are:
(1) The speech data are AC/BC pairs recorded by the same speaker wearing AC and BC microphone devices simultaneously; the AC speech is denoted A and the BC speech B. The time-domain signals y(A) and y(B) are transformed to the time-frequency domain with the short-time Fourier transform (STFT):
1. Frame and window the time-domain signals y(A), y(B); the window function is a Hamming window, the frame length N is a power of two, and the inter-frame hop is H.
2. Apply a K-point discrete Fourier transform to each frame to obtain the time-frequency spectra Y_A(k,t), Y_B(k,t):
Y(k,t) = Σ_{n=0}^{N-1} y(n + tH) h(n) e^{-j2πnk/K}
where k = 0, 1, ..., K-1 indexes the discrete frequencies, K is the number of DFT frequency points with K = N, t = 0, 1, ..., T-1 indexes the frames, T is the total number of frames, and h(n) is the Hamming window function;
(2) Take the absolute value of the spectrum Y(k,t) to obtain the amplitude spectra M_A, M_B:
M(k,t) = |Y(k,t)|
(3) Take the natural logarithm (ln) of the amplitude spectrum M(k,t) to obtain the log-amplitude spectra L_A, L_B:
L(k,t) = ln M(k,t)
(4) Compute the AC speech dictionary D by sparse non-negative matrix factorization (Sparse Non-negative Matrix Factorization, Sparse NMF) on the clean AC log-amplitude spectral feature matrix.
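As a concrete illustration, the framing, windowing, DFT, and log steps of Step 1 can be sketched in Python with NumPy. The function name and the small ε floor are our own; the frame parameters follow the embodiment's 8 kHz / 32 ms frame / 10 ms hop setting, and the Sparse NMF dictionary step is omitted:

```python
import numpy as np

def log_magnitude_spectrogram(y, frame_len=256, hop=80):
    """Frame the signal, apply a Hamming window, and take a K-point DFT
    with K = N, as in the patent. rfft keeps the K/2 + 1 non-redundant
    bins (129 for K = 256); the phase is returned for later resynthesis."""
    win = np.hamming(frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[t*hop : t*hop + frame_len] * win
                       for t in range(n_frames)])
    Y = np.fft.rfft(frames, n=frame_len, axis=1)  # time-frequency spectrum Y(k,t)
    M = np.abs(Y)                                 # amplitude spectrum M(k,t)
    L = np.log(M + 1e-10)                         # log-amplitude; eps avoids log(0)
    return L, np.angle(Y)
```

With the embodiment's settings this yields one 129-dimensional log-amplitude vector per 10 ms frame, which matches the feature dimension quoted for Fig. 4.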
Step 2: pre-training of the encoder. The encoder (Encoder) network has three layers (see Fig. 2): a linear layer (Linear), a long short-term memory recurrent layer (LSTM), and a non-negative sparse long short-term memory recurrent layer (NS-LSTM). During training, the normalized (Normalization) BC log-amplitude spectral features are the training input and the AC log-amplitude spectral features are the training target; the neural network model is trained with back-propagation through time (Back Propagation Through Time, BPTT), and the trained network parameters are stored. The NS-LSTM cell structure and the encoder pre-training procedure are as follows:
(1) The non-negative sparse long short-term memory neural network is a variant of the long short-term memory model (LSTM): by introducing non-negativity and sparsity control variables, it produces output vectors that satisfy these constraints. Its cell (shown in Fig. 3) is described by the following equations:
f_t = σ(W_fx x_t + W_fh h_{t-1} + b_f)
i_t = σ(W_ix x_t + W_ih h_{t-1} + b_i)
g_t = σ(W_gx x_t + W_gh h_{t-1} + b_g)
o_t = σ(W_ox x_t + W_oh h_{t-1} + b_o)
S_t = g_t ⊙ i_t + S_{t-1} ⊙ f_t
h_t = sh_{D,u}(ψ(φ(S_t)) ⊙ o_t)
where σ(·) is the logistic sigmoid, φ(x) = tanh(x), ψ(x) = ReLU(x) = max(0, x) is the non-negativity constraint, and sh_{D,u}(x) = D(tanh(x + u) + tanh(x - u)) is the sparse activation function;
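Under the stated update equations, a single NS-LSTM step might look like the sketch below. The parameter layout (a dict of eight weight matrices and four biases), the scalar treatment of D in sh_{D,u}, and u = 1 are our assumptions; the patent defines only the update rules:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ns_lstm_cell(x, h_prev, s_prev, W, b, D=1.0, u=1.0):
    """One step of the non-negative sparse LSTM cell (hypothetical layout)."""
    f = sigmoid(W['fx'] @ x + W['fh'] @ h_prev + b['f'])  # forget gate
    i = sigmoid(W['ix'] @ x + W['ih'] @ h_prev + b['i'])  # input gate
    g = sigmoid(W['gx'] @ x + W['gh'] @ h_prev + b['g'])  # candidate
    o = sigmoid(W['ox'] @ x + W['oh'] @ h_prev + b['o'])  # output gate
    s = g * i + s_prev * f                                # cell state S_t
    z = np.maximum(0.0, np.tanh(s)) * o                   # psi(phi(S_t)) ⊙ o_t
    h = D * (np.tanh(z + u) + np.tanh(z - u))             # sparse activation sh_{D,u}
    return h, s
```

Note that since z ≥ 0 after the ReLU, tanh(z + u) + tanh(z - u) ≥ 0 for u ≥ 0, so the cell output is non-negative by construction, which is the point of the constraint.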
(2) Dropout regularization: to improve the robustness of the model, dropout regularization is applied during network training; it improves generalization by randomly removing neural units. With drop ratio p (e.g. p set to 0.2), the dropout equations are:
r_j^l ~ Bernoulli(1 - p)
ỹ_j^l = r_j^l · y_j^l
z_j^{l+1} = w_j^{l+1} ỹ^l + b_j^{l+1}
y_j^{l+1} = f(z_j^{l+1})
where r_j^l indicates whether the j-th neuron of layer l is kept; the Bernoulli(1 - p) distribution takes the value 1 with probability 1 - p and 0 with probability p; y_j^l is the output value of the j-th neuron of layer l; ỹ_j^l is y_j^l multiplied by r_j^l, i.e. it equals y_j^l or 0; w_j^{l+1} is the network weight and b_j^{l+1} the bias; f denotes the activation unit, so y_j^{l+1} is the neuron output after the activation function.
(3) Encoder training: the training loss objective is the mean squared error between the network output and the corresponding AC log-amplitude spectrum:
Loss = (1/T) Σ_t ‖ M c_t - L_A(·, t) ‖²
where c is the dictionary-coefficient vector output by the network, M = [D, I, -I] is the set consisting of the AC speech dictionary and the compensation dictionaries, and I is the diagonal matrix with diagonal elements 1 and remaining elements 0, which compensates the linear combination of dictionary atoms and improves representation accuracy. b% of the training data (e.g. b set between 10 and 20) is held out as a validation set, and the loss function is minimized during training. The network weights are initialized at random in [-0.1, 0.1]. The optimizer is RMSProp (Root Mean Square Propagation), a variant of stochastic gradient descent (Stochastic Gradient Descent, SGD), with initial learning rate lr (e.g. lr set to 0.01); when the validation loss stops decreasing, the learning rate is multiplied by a factor ratio (e.g. ratio set to 0.1), with momentum set to momentum (e.g. 0.9). Training stops when the validation loss has not decreased for i consecutive epochs (e.g. i set to 3), and the neural network model parameters with the lowest validation loss are saved, denoted S'.
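The role of M = [D, I, -I] can be made concrete: the encoder emits a coefficient vector c per frame, and the loss compares the reconstruction M c against the AC log-amplitude frame. A sketch with hypothetical dimensions (the real dictionary would come from Sparse NMF):

```python
import numpy as np

def dictionary_synthesis(D, c):
    """Reconstruct a log-amplitude frame from coefficients c using the
    augmented dictionary M = [D, I, -I]; the identity blocks add a signed
    per-bin compensation term on top of the learned AC atoms."""
    K = D.shape[0]
    M = np.hstack([D, np.eye(K), -np.eye(K)])
    return M @ c

def encoder_loss(D, C, L_ac):
    """MSE between dictionary reconstruction and the AC log-amplitude
    spectrum; C holds one coefficient column per frame."""
    K = D.shape[0]
    M = np.hstack([D, np.eye(K), -np.eye(K)])
    return np.mean((M @ C - L_ac) ** 2)
```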
Step 3: joint training of the codec. The decoder (Decoder) structure is shown in Fig. 2 and comprises a two-layer network: a recurrent layer (LSTM) and a linear layer (Linear). a_i denotes the input of the decoder network based on the local attention mechanism:
a_i = Σ_{j∈N(i)} ω_ij e_j
where e_j is the j-th encoder output and N(i) denotes a neighbourhood of encoder outputs around the i-th position; 10-20 values may be taken. ω_ij are the weighted combination coefficients of these neighbouring outputs:
ω_ij = exp(score_ij) / Σ_{k∈N(i)} exp(score_ik)
score_ij is the weighted score of the j-th encoder output e_j for the i-th decoder input a_i, normalized to give the linear-combination weights. W_a is the parameter matrix of a linear layer: the decoder output at time i-1 is passed through this linear layer and its inner product with e_j yields the weighted score, score_ij = (W_a d_{i-1})ᵀ e_j.
The role of the decoder is to learn, by training, a signal closer to the true speech on the basis of the speech synthesized by the dictionary coding. The decoder is trained jointly with the encoder and optimized epoch by epoch; the loss function is the mean squared error against the log-amplitude spectral features of the real speech signal, and the optimal network parameters are obtained by gradient descent and stored locally, denoted S.
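The local-attention computation above can be sketched as follows; the handling of the window at sequence edges, the symmetric window shape, and the array shapes are our assumptions on top of the patent's formulas:

```python
import numpy as np

def local_attention_input(E, d_prev, i, Wa, width=10):
    """Decoder input a_i as a softmax-weighted sum of encoder outputs e_j
    in a local window N(i) around position i (patent: 10-20 values).
    Scores are inner products of the projected previous decoder output
    W_a d_{i-1} with each e_j."""
    T = E.shape[0]
    lo, hi = max(0, i - width // 2), min(T, i + width // 2 + 1)
    q = Wa @ d_prev                      # project previous decoder output
    scores = E[lo:hi] @ q                # score_ij = <e_j, W_a d_{i-1}>
    w = np.exp(scores - scores.max())
    w /= w.sum()                         # softmax over the neighbourhood
    return w @ E[lo:hi]                  # a_i = sum_j omega_ij e_j
```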
The second stage is the enhancement stage: the trained codec network model is used to enhance the BC speech.
Step 4: extract the BC speech features to be enhanced, and normalize them with the statistics of the aligned BC log-amplitude spectra obtained in step 1, namely the mean μ_B and variance σ_B².
First, the BC speech data B_E to be enhanced is transformed from the time-domain waveform to the time-frequency domain Y_{B_E} by the Fourier transform; the feature-extraction process is shown in the enhancement part of Fig. 1. Compared with the feature extraction of step 1, this step additionally extracts the phase: after obtaining the time-frequency spectrum Y_{B_E}, not only the amplitude spectrum M_{B_E} but also the phase P_{B_E} is computed:
M_{B_E}(k,t) = |Y_{B_E}(k,t)|
P_{B_E}(k,t) = atan2(imag(Y_{B_E}(k,t)), real(Y_{B_E}(k,t)))
where atan2(·) is the four-quadrant arctangent and imag(·) and real(·) are the imaginary and real parts of the time-frequency spectrum. The log-amplitude spectrum L_{B_E} = ln M_{B_E} is then normalized with the mean μ_B and standard deviation σ_B of the BC log-amplitude spectra obtained in the training stage:
L̄_{B_E} = (L_{B_E} - μ_B) / σ_B
Step 5: during enhancement, the codec neural network trained in the first stage converts the BC speech features extracted in step 4; the network output is then de-normalized and inverse-transformed to obtain the enhanced time-domain speech.
First, the normalized feature L̄_{B_E} is fed into the trained codec neural network model S, and the network output, i.e. the enhanced feature L̂, is computed.
Then the enhanced feature L̂ is de-normalized and inverse-transformed as follows:
(1) With the mean μ_A and standard deviation σ_A of the AC log-amplitude spectra from the training stage, the network output L̂ is de-normalized to obtain the log-amplitude spectrum L̃:
L̃(k,t) = L̂(k,t) · σ_A + μ_A
(2) Exponentiating the log-amplitude spectrum L̃ gives the amplitude spectrum M̂:
M̂(k,t) = e^{L̃(k,t)}
(3) The amplitude spectrum M̂ and the phase information P_{B_E} give the time-frequency spectrum Ŷ:
Ŷ(k,t) = M̂(k,t) e^{jP_{B_E}(k,t)}
(4) The spectrum Ŷ is transformed back to the time domain by the inverse Fourier transform and the de-framing overlap-add formula, finally yielding the enhanced time-domain speech signal y(B_E).
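Steps (1)-(4) of the enhancement stage map directly onto a de-normalize / exponentiate / phase-reattach / overlap-add pipeline. A sketch, where the weighted overlap-add normalization is our simplification of the de-framing formula and the frame parameters follow the embodiment:

```python
import numpy as np

def reconstruct(L_hat_norm, phase, mean, std, frame_len=256, hop=80):
    """De-normalize the enhanced log-amplitude spectrum, rebuild the
    complex spectrum with the BC phase, and overlap-add the inverse
    DFT frames back to a time-domain waveform."""
    L_hat = L_hat_norm * std + mean          # (1) de-normalize
    M_hat = np.exp(L_hat)                    # (2) amplitude spectrum
    Y_hat = M_hat * np.exp(1j * phase)       # (3) time-frequency spectrum
    frames = np.fft.irfft(Y_hat, n=frame_len, axis=1)
    win = np.hamming(frame_len)
    y = np.zeros(hop * (len(frames) - 1) + frame_len)
    norm = np.zeros_like(y)
    for t, fr in enumerate(frames):          # (4) weighted overlap-add
        y[t*hop : t*hop + frame_len] += fr * win
        norm[t*hop : t*hop + frame_len] += win ** 2
    return y / np.maximum(norm, 1e-10)
```

Paired with the forward transform of step 1, this pipeline round-trips a signal almost exactly when the spectrum is left unmodified, which is a useful sanity check before inserting the network.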
Embodiment
Fig. 4 illustrates an embodiment of the present invention. The example utterances are about 3.5 s and 4 s long, the speech sampling frequency is 8 kHz, the frame length is set to 32 ms with a 10 ms frame shift, and a discrete Fourier transform with K = 256 frequency points is applied to each frame, giving a 129-dimensional log-amplitude spectrum. In Fig. 4-1 and Fig. 4-2, (a) is the spectrogram of the bone-conduction speech, (b) is the spectrogram after enhancement with an LSTM deep neural network, and (c) is the spectrogram after enhancement by the present invention. It can be seen that the high-frequency components of the enhanced speech, as well as the missing aspirated and fricative sounds, are effectively recovered, with a clear performance gain over the LSTM algorithm; subjective listening tests likewise indicate that the present invention achieves a good bone-conduction speech enhancement effect.
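The embodiment's dimensions follow directly from the sampling rate: at 8 kHz, a 32 ms frame is 256 samples, a 10 ms shift is 80 samples, and a 256-point DFT of a real-valued frame has K/2 + 1 = 129 unique bins. A quick check:

```python
fs = 8000                      # sampling frequency, Hz
frame_len = int(0.032 * fs)    # 32 ms frame -> 256 samples
hop = int(0.010 * fs)          # 10 ms shift -> 80 samples
K = frame_len                  # DFT size K = N = 256
n_bins = K // 2 + 1            # one-sided spectrum: 129 log-amplitude dimensions
print(frame_len, hop, n_bins)
```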
Claims (6)
1. A blind bone-conduction speech enhancement method based on a codec framework and a recurrent neural network, characterized by the following steps:
Data preprocessing: extract air-conduction (AC) and bone-conduction (BC) speech amplitude-spectrum features, align and preprocess the extracted speech feature data, and compute an air-conduction speech dictionary on the air-conduction speech feature data using sparse non-negative matrix factorization;
Encoder pre-training: using the bone-conduction speech features as training input and the air-conduction dictionary combination coefficients as training target, train the encoder model with a non-negative, sparse long short-term memory recurrent neural network, and store the trained deep neural network parameters as the initialization parameters of the encoder in the next step;
Codec joint training: build a decoder model based on a local attention mechanism, take the encoder output as the decoder input and the air-conduction speech features as training target, jointly train the codec model, and store the model parameters;
Speech enhancement: extract the bone-conduction speech features to be enhanced, perform the feature conversion with the codec neural network trained in the above steps, then denormalize and inverse-transform the network output, finally obtaining the enhanced time-domain speech.
2. The method according to claim 1, characterized in that the air-conduction and bone-conduction speech amplitude-spectrum features are extracted, and the extracted speech features undergo data preprocessing to meet the input requirements of the neural network, wherein log-amplitude spectrum features are used in order to reduce the dynamic range of the extracted amplitude-spectrum features:
(1) The speech data are AC/BC speech pairs recorded by the same person simultaneously wearing AC and BC microphone devices; the AC speech is denoted A and the BC speech is denoted B. The AC and BC time-domain signals y(A) and y(B) are each transformed to the time-frequency domain by the short-time Fourier transform, with the following specific steps:
1. Frame and window the speech time-domain signals y(A), y(B); the window function is a Hamming window, the frame length is N (N is an integer power of 2), and the inter-frame shift is H;
2. Apply a K-point discrete Fourier transform to the framed speech to obtain the time-frequency spectra Y_A(k, t), Y_B(k, t):

Y(k, t) = Σ_{n=0}^{N−1} y(tH + n)·h(n)·e^{−j2πkn/N}

where k = 0, 1, …, K−1 denotes the discrete frequency points, K is the number of frequency points of the discrete Fourier transform with K = N, t = 0, 1, …, T−1 denotes the frame index, T is the total number of frames, and h(n) is the Hamming window function;
(2) Take the absolute value of the spectrum Y(k, t) to obtain the amplitude spectra M_A, M_B:
M(k, t) = |Y(k, t)|
(3) Take the logarithm with base e (ln) of the amplitude spectrum M(k, t) to obtain the log-amplitude spectra L_A, L_B:
L(k, t) = ln M(k, t)
(4) Compute the air-conduction speech dictionary D on the clean air-conduction log-amplitude-spectrum feature matrix using sparse non-negative matrix factorization.
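The claim does not spell out the sparse non-negative matrix factorization of step (4). One common formulation uses multiplicative updates with an L1 penalty on the coefficients; the sketch below is that generic algorithm under our own assumptions (update rules, penalty weight, unit-norm atoms), not the patent's exact procedure:

```python
import numpy as np

def sparse_nmf(V, n_atoms, n_iter=100, lam=0.1, seed=0):
    """Factor a non-negative matrix V (features x frames) as V ~= D @ C,
    with an L1 penalty lam encouraging sparse coefficients C."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    D = rng.random((m, n_atoms)) + 1e-3
    C = rng.random((n_atoms, n)) + 1e-3
    for _ in range(n_iter):
        C *= (D.T @ V) / (D.T @ D @ C + lam + 1e-9)   # coefficient update (+lam: sparsity)
        D *= (V @ C.T) / (D @ (C @ C.T) + 1e-9)       # dictionary update
        D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-9  # unit-norm atoms
    return D, C
```

Since log-amplitude spectra can be negative, some non-negativity shift or offset of the feature matrix would be needed in practice; the patent does not specify one.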
3. The method according to claim 1, characterized in that, in the encoder pre-training, the pre-trained encoder network consists of three layers: a linear layer (Linear), a long short-term memory recurrent layer (LSTM), and a non-negative sparse long short-term memory recurrent layer (NS-LSTM). During training, the normalized log-amplitude spectrum features of the bone-conduction speech are used as training input and the log-amplitude spectrum features of the air-conduction speech as training target; the neural network model is trained with the back-propagation-through-time algorithm, and the trained neural network parameters are stored. The NS-LSTM cell structure and the encoder network pre-training process are as follows:
(1) The non-negative sparse long short-term memory neural network model is a variant of the long short-term memory model (LSTM); by introducing non-negative and sparse control variables, it produces output vectors satisfying the constraints. Its cell is represented by the following formulas:
f_t = σ(W_fx·x_t + W_fh·h_{t−1} + b_f)
i_t = σ(W_ix·x_t + W_ih·h_{t−1} + b_i)
g_t = σ(W_gx·x_t + W_gh·h_{t−1} + b_g)
o_t = σ(W_ox·x_t + W_oh·h_{t−1} + b_o)
where σ(x) = 1/(1 + e^{−x}) is the sigmoid function, φ(x) = tanh(x), ψ(x) = ReLU(x) = max(0, x) is the non-negativity constraint, and s_{h(D,u)}(x) = D(tanh(x + u) + tanh(x − u)) is the sparse activation function;
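A minimal NumPy sketch of one NS-LSTM step follows. The four gates use the formulas above; applying the ReLU non-negativity constraint and the sparse activation s_{h(D,u)} to the hidden output is our reading of the claim, since the cell-state and output equations were not reproduced in the text:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sparse_act(x, D=1.0, u=1.0):
    # s_{h(D,u)}(x) = D*(tanh(x+u) + tanh(x-u)): near-zero plateau around x = 0
    return D * (np.tanh(x + u) + np.tanh(x - u))

def ns_lstm_step(x, h, c, W, b):
    """One step of a non-negative sparse LSTM cell (illustrative).
    W: dict of weight matrices, b: dict of bias vectors."""
    f = sigmoid(W['fx'] @ x + W['fh'] @ h + b['f'])   # forget gate
    i = sigmoid(W['ix'] @ x + W['ih'] @ h + b['i'])   # input gate
    g = sigmoid(W['gx'] @ x + W['gh'] @ h + b['g'])   # candidate (sigma, as in the claim)
    o = sigmoid(W['ox'] @ x + W['oh'] @ h + b['o'])   # output gate
    c_new = f * c + i * g                             # cell state update
    h_new = np.maximum(0.0, sparse_act(o * np.tanh(c_new)))  # non-negative, sparse output
    return h_new, c_new
```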
(2) Dropout regularization: the drop ratio is set by p, and the dropout regularization formulas are:

r_j^(l) ∼ Bernoulli(p)
ỹ^(l) = r^(l) ⊙ y^(l)
z_j^(l+1) = w_j^(l+1)·ỹ^(l) + b_j^(l+1)
y_j^(l+1) = f(z_j^(l+1))

where r_j^(l) indicates whether the j-th neuron of layer l is kept; Bernoulli(p) denotes the Bernoulli distribution with probability p, which takes the value 1 with probability p and 0 with probability 1 − p; y_j^(l) is the output value of the j-th neuron of layer l; ỹ_j^(l) is y_j^(l) multiplied by r_j^(l), i.e. equal to y_j^(l) or 0; w_j^(l+1) is the network weight, b_j^(l+1) is the bias, f denotes the activation function, and y_j^(l+1) is the neuron output after the activation function;
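Note the claim's convention: p is the probability that a neuron is kept (the mask is 1 with probability p), not the probability that it is dropped. A minimal sketch of that masking step:

```python
import numpy as np

def dropout(y, p, rng=np.random.default_rng(0)):
    """Keep each neuron output with probability p, zero it otherwise
    (Bernoulli(p) mask, following the claim's convention for p)."""
    r = (rng.random(y.shape) < p).astype(y.dtype)  # r = 1 with probability p
    return r * y                                   # y_tilde equals y or 0
```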
(3) Encoder neural network training: the training loss objective function is the mean square error between the network output and the corresponding AC-speech log-amplitude spectrum:

Loss = (1/T)·Σ_t ‖M·c_t − L_A(t)‖²

where c is the dictionary coefficient vector, L_A(t) is the AC log-amplitude spectrum of frame t, M = [D, I, −I] is the set formed by the air-conduction speech dictionary and the compensation dictionaries, and I is the diagonal matrix whose diagonal elements are 1 and remaining elements are 0; the compensation dictionaries refine the linear combination of dictionary atoms to improve the representation precision. During training, b% of the training data serve as the validation set, and the loss function is minimized; the network weights are randomly initialized in [−0.1, 0.1]. Specifically, a variant of the stochastic gradient descent algorithm, root-mean-square propagation (RMSProp), is used: the initial learning rate is set to lr, the learning rate is multiplied by the factor ratio when the validation loss does not decrease, and the momentum is momentum; training stops when the validation loss does not decrease for i consecutive training epochs, and the parameters of the neural network model with the smallest validation loss are saved and denoted S'.
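Root-mean-square propagation, the optimizer named in the claim, divides each gradient by a running root-mean-square of past gradients. The self-contained sketch below shows the core update on a toy quadratic; the hyperparameters are illustrative, and the claim's plateau-based learning-rate decay and early stopping are omitted for brevity:

```python
import numpy as np

def rmsprop(grad_fn, w0, lr=0.01, decay=0.9, eps=1e-8, steps=2000):
    """Minimize a function given its gradient with RMSProp updates."""
    w, s = np.asarray(w0, dtype=float), 0.0
    for _ in range(steps):
        g = grad_fn(w)
        s = decay * s + (1 - decay) * g * g   # running mean of squared gradients
        w = w - lr * g / (np.sqrt(s) + eps)   # scale step by the root-mean-square
    return w

# toy use: minimize (w - 3)^2, whose gradient is 2*(w - 3)
w_opt = rmsprop(lambda w: 2 * (w - 3), np.array(0.0))
```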
4. The method according to claim 1, characterized in that, in the codec joint training, the decoder comprises a two-layer network structure, namely an LSTM recurrent layer and a linear layer (Linear). a_i denotes the input of the decoder network based on the local attention mechanism:

a_i = Σ_{j∈N(i)} ω_ij·e_j

where e_j is the j-th encoder output, N(i) denotes the neighborhood of the i-th encoder output, and ω_ij is the linear combination coefficient of these neighboring outputs, calculated as:

ω_ij = exp(score_ij) / Σ_{j'∈N(i)} exp(score_ij')

score_ij is the weighted score of the j-th encoder output e_j for the i-th decoder input a_i, and the linear combination weights are obtained after normalization; W_a is the parameter matrix of a linear layer, and the decoder output at moment i − 1, passed through this linear layer, is taken in inner product with e_j to compute the weighted score score_ij;
The role of the decoder is, through training, to obtain a signal closer to the true speech on the basis of the speech synthesized by the dictionary coding; the decoder is jointly trained and optimized with the encoder, optimizing epoch by epoch, with a mean square error loss function constructed from the log-amplitude-spectrum features of the true speech signal; the optimal network parameters are obtained by gradient descent and stored locally, denoted S.
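The local attention computation of this claim can be sketched as follows: scores over a neighborhood N(i) of encoder outputs, softmax-normalized into weights ω_ij, then a weighted sum a_i. The neighborhood width and the bilinear score form e_j · (W_a s_{i−1}) reflect our reading of the claim, not a stated specification:

```python
import numpy as np

def local_attention(E, s_prev, Wa, i, width=2):
    """E: encoder outputs, shape (T, d); s_prev: decoder output at step i-1, shape (d,).
    Returns a_i, the attention-weighted combination of encoder outputs near i."""
    lo, hi = max(0, i - width), min(len(E), i + width + 1)   # neighborhood N(i)
    scores = E[lo:hi] @ (Wa @ s_prev)        # score_ij = e_j . (Wa s_{i-1})
    w = np.exp(scores - scores.max())        # softmax normalization ->
    w /= w.sum()                             # combination weights omega_ij
    return w @ E[lo:hi]                      # a_i = sum_j omega_ij * e_j
```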
5. The method according to claim 1, characterized in that the bone-conduction speech features to be enhanced are extracted and normalized according to the statistics of the aligned log-amplitude spectrum L_B obtained during training, including its mean μ_B and variance σ_B²:
First, for the BC speech data B_E to be enhanced, the BC features to be enhanced are extracted by transforming the time-domain waveform to the time-frequency domain with the Fourier transform, with additional phase extraction; that is, after obtaining the time-frequency spectrum Y_E(k, t), not only the amplitude spectrum but also the phase is computed. From the time-frequency spectrum Y_E(k, t), the amplitude spectrum M_E(k, t) and the phase φ_E(k, t) are calculated as:

M_E(k, t) = |Y_E(k, t)|
φ_E(k, t) = atan2(imag(Y_E(k, t)), real(Y_E(k, t)))

where atan2(·) is the four-quadrant arctangent function, and imag(·) and real(·) denote the imaginary and real parts of the time-frequency spectrum. The log-amplitude spectrum L_E(k, t) = ln M_E(k, t) is computed from the amplitude spectrum and then normalized with the mean μ_B and variance σ_B² of the BC log-amplitude spectrum obtained in the training stage:

L'_E(k, t) = (L_E(k, t) − μ_B) / σ_B
6. The method according to claim 1, characterized in that, at enhancement time, the codec neural network trained in the first stage converts the extracted bone-conduction speech features, the network output is then denormalized and inverse-transformed, and the enhanced time-domain speech is finally obtained:
First, the normalized feature L'_E(k, t) is input into the trained codec neural network model S, and the network output, i.e. the enhanced feature L̂(k, t), is computed.
Second, the enhanced feature L̂(k, t) is denormalized and inverse-transformed to finally obtain the enhanced time-domain speech, with the following steps:
(1) Using the mean μ_A and variance σ_A² of the AC-speech log-amplitude spectrum from the training stage, the output L̂(k, t) obtained by the bidirectional gated recurrent neural network is denormalized to obtain the log-amplitude spectrum L̃(k, t):
L̃(k, t) = σ_A·L̂(k, t) + μ_A
(2) The log-amplitude spectrum L̃(k, t) is exponentiated to obtain the amplitude spectrum M̃(k, t):
M̃(k, t) = exp(L̃(k, t))
(3) The time-frequency spectrum Ỹ(k, t) is computed from the amplitude spectrum M̃(k, t) and the phase information φ_E(k, t):
Ỹ(k, t) = M̃(k, t)·exp(jφ_E(k, t))
(4) Using the inverse Fourier transform followed by overlap-add of the speech frames, the spectrum Ỹ(k, t) is transformed to the time domain, finally obtaining the enhanced time-domain speech signal y(B_E).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810960512.0A CN108986834B (en) | 2018-08-22 | 2018-08-22 | Bone conduction voice blind enhancement method based on codec framework and recurrent neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108986834A true CN108986834A (en) | 2018-12-11 |
CN108986834B CN108986834B (en) | 2023-04-07 |
Family
ID=64547287
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810960512.0A Active CN108986834B (en) | 2018-08-22 | 2018-08-22 | Bone conduction voice blind enhancement method based on codec framework and recurrent neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108986834B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003264883A (en) * | 2002-03-08 | 2003-09-19 | Denso Corp | Voice processing apparatus and voice processing method |
CN102915742A (en) * | 2012-10-30 | 2013-02-06 | 中国人民解放军理工大学 | Single-channel monitor-free voice and noise separating method based on low-rank and sparse matrix decomposition |
CN103559888A (en) * | 2013-11-07 | 2014-02-05 | 航空电子系统综合技术重点实验室 | Speech enhancement method based on non-negative low-rank and sparse matrix decomposition principle |
CN106030705A (en) * | 2014-02-27 | 2016-10-12 | 高通股份有限公司 | Systems and methods for speaker dictionary based speech modeling |
CN107886967A (en) * | 2017-11-18 | 2018-04-06 | 中国人民解放军陆军工程大学 | A kind of bone conduction sound enhancement method of depth bidirectional gate recurrent neural network |
US20180166066A1 (en) * | 2016-12-14 | 2018-06-14 | International Business Machines Corporation | Using long short-term memory recurrent neural network for speaker diarization segmentation |
Non-Patent Citations (1)
Title |
---|
LIU Bin et al., "Speech dereverberation method combining long short-term memory recurrent neural networks and non-negative matrix factorization", Journal of Signal Processing * |
Cited By (37)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109793511A (en) * | 2019-01-16 | 2019-05-24 | 成都蓝景信息技术有限公司 | Electrocardiosignal noise detection algorithm based on depth learning technology |
CN109975702A (en) * | 2019-03-22 | 2019-07-05 | 华南理工大学 | A kind of DC gear decelerating motor product examine method based on recirculating network disaggregated model |
CN109975702B (en) * | 2019-03-22 | 2021-08-10 | 华南理工大学 | Direct-current gear reduction motor quality inspection method based on circulation network classification model |
CN110085249A (en) * | 2019-05-09 | 2019-08-02 | 南京工程学院 | The single-channel voice Enhancement Method of Recognition with Recurrent Neural Network based on attention gate |
CN110111803A (en) * | 2019-05-09 | 2019-08-09 | 南京工程学院 | Based on the transfer learning sound enhancement method from attention multicore Largest Mean difference |
CN110136731A (en) * | 2019-05-13 | 2019-08-16 | 天津大学 | Empty cause and effect convolution generates the confrontation blind Enhancement Method of network end-to-end bone conduction voice |
CN110164465A (en) * | 2019-05-15 | 2019-08-23 | 上海大学 | A kind of sound enhancement method and device based on deep layer Recognition with Recurrent Neural Network |
CN110164465B (en) * | 2019-05-15 | 2021-06-29 | 上海大学 | Deep-circulation neural network-based voice enhancement method and device |
CN110648684A (en) * | 2019-07-02 | 2020-01-03 | 中国人民解放军陆军工程大学 | Bone conduction voice enhancement waveform generation method based on WaveNet |
WO2021012403A1 (en) * | 2019-07-25 | 2021-01-28 | 华南理工大学 | Dual sensor speech enhancement method and implementation device |
CN110675888A (en) * | 2019-09-25 | 2020-01-10 | 电子科技大学 | Speech enhancement method based on RefineNet and evaluation loss |
KR102429152B1 (en) * | 2019-10-09 | 2022-08-03 | 엘레복 테크놀로지 컴퍼니 리미티드 | Deep learning voice extraction and noise reduction method by fusion of bone vibration sensor and microphone signal |
CN110931031A (en) * | 2019-10-09 | 2020-03-27 | 大象声科(深圳)科技有限公司 | Deep learning voice extraction and noise reduction method fusing bone vibration sensor and microphone signals |
JP2022505997A (en) * | 2019-10-09 | 2022-01-17 | 大象声科(深セン)科技有限公司 | Deep learning voice extraction and noise reduction method that fuses bone vibration sensor and microphone signal |
KR20210043485A (en) * | 2019-10-09 | 2021-04-21 | 엘레복 테크놀로지 컴퍼니 리미티드 | Deep learning speech extraction and noise reduction method that combines bone vibration sensor and microphone signal |
WO2021068120A1 (en) * | 2019-10-09 | 2021-04-15 | 大象声科(深圳)科技有限公司 | Deep learning speech extraction and noise reduction method fusing signals of bone vibration sensor and microphone |
CN110867192A (en) * | 2019-10-23 | 2020-03-06 | 北京计算机技术及应用研究所 | Speech enhancement method based on gated cyclic coding and decoding network |
CN110808063A (en) * | 2019-11-29 | 2020-02-18 | 北京搜狗科技发展有限公司 | Voice processing method and device for processing voice |
CN111242976A (en) * | 2020-01-08 | 2020-06-05 | 北京天睿空间科技股份有限公司 | Aircraft detection tracking method using attention mechanism |
US12009004B2 (en) | 2020-02-10 | 2024-06-11 | Tencent Technology (Shenzhen) Company Limited | Speech enhancement method and apparatus, electronic device, and computer-readable storage medium |
WO2021159772A1 (en) * | 2020-02-10 | 2021-08-19 | 腾讯科技(深圳)有限公司 | Speech enhancement method and apparatus, electronic device, and computer readable storage medium |
US11842722B2 (en) | 2020-07-21 | 2023-12-12 | Ai Speech Co., Ltd. | Speech synthesis method and system |
CN111833843A (en) * | 2020-07-21 | 2020-10-27 | 苏州思必驰信息科技有限公司 | Speech synthesis method and system |
CN112185405A (en) * | 2020-09-10 | 2021-01-05 | 中国科学技术大学 | Bone conduction speech enhancement method based on differential operation and joint dictionary learning |
CN112185405B (en) * | 2020-09-10 | 2024-02-09 | 中国科学技术大学 | Bone conduction voice enhancement method based on differential operation and combined dictionary learning |
CN111899757B (en) * | 2020-09-29 | 2021-01-12 | 南京蕴智科技有限公司 | Single-channel voice separation method and system for target speaker extraction |
CN111899757A (en) * | 2020-09-29 | 2020-11-06 | 南京蕴智科技有限公司 | Single-channel voice separation method and system for target speaker extraction |
CN112562704B (en) * | 2020-11-17 | 2023-08-18 | 中国人民解放军陆军工程大学 | Frequency division topological anti-noise voice conversion method based on BLSTM |
CN112562704A (en) * | 2020-11-17 | 2021-03-26 | 中国人民解放军陆军工程大学 | BLSTM-based frequency division spectrum expansion anti-noise voice conversion method |
CN112786064B (en) * | 2020-12-30 | 2023-09-08 | 西北工业大学 | End-to-end bone qi conduction voice joint enhancement method |
CN112786064A (en) * | 2020-12-30 | 2021-05-11 | 西北工业大学 | End-to-end bone-qi-conduction speech joint enhancement method |
CN113642714A (en) * | 2021-08-27 | 2021-11-12 | 国网湖南省电力有限公司 | Insulator pollution discharge state identification method and system based on small sample learning |
CN113642714B (en) * | 2021-08-27 | 2024-02-09 | 国网湖南省电力有限公司 | Insulator pollution discharge state identification method and system based on small sample learning |
WO2023102930A1 (en) * | 2021-12-10 | 2023-06-15 | 清华大学深圳国际研究生院 | Speech enhancement method, electronic device, program product, and storage medium |
CN114495909A (en) * | 2022-02-20 | 2022-05-13 | 西北工业大学 | End-to-end bone-qi-guide voice joint identification method |
CN114495909B (en) * | 2022-02-20 | 2024-04-30 | 西北工业大学 | End-to-end bone-qi guiding voice joint recognition method |
CN115033734A (en) * | 2022-08-11 | 2022-09-09 | 腾讯科技(深圳)有限公司 | Audio data processing method and device, computer equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN108986834B (en) | 2023-04-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108986834A (en) | The blind Enhancement Method of bone conduction voice based on codec framework and recurrent neural network | |
CN107886967B (en) | A kind of bone conduction sound enhancement method of depth bidirectional gate recurrent neural network | |
Dave | Feature extraction methods LPC, PLP and MFCC in speech recognition | |
CN103531205B (en) | The asymmetrical voice conversion method mapped based on deep neural network feature | |
CN108447495B (en) | Deep learning voice enhancement method based on comprehensive feature set | |
CN105070293B (en) | Audio bandwidth expansion coding-decoding method based on deep neural network and device | |
CN103065629A (en) | Speech recognition system of humanoid robot | |
CN110085245B (en) | Voice definition enhancing method based on acoustic feature conversion | |
CN110379412A (en) | Method, apparatus, electronic equipment and the computer readable storage medium of speech processes | |
CN102800316A (en) | Optimal codebook design method for voiceprint recognition system based on nerve network | |
CN105023580A (en) | Unsupervised noise estimation and speech enhancement method based on separable deep automatic encoding technology | |
CN110648684B (en) | Bone conduction voice enhancement waveform generation method based on WaveNet | |
CN110867192A (en) | Speech enhancement method based on gated cyclic coding and decoding network | |
Lavrynenko et al. | Method of voice control functions of the UAV | |
CN113658583B (en) | Ear voice conversion method, system and device based on generation countermeasure network | |
CN106024010A (en) | Speech signal dynamic characteristic extraction method based on formant curves | |
CN102237083A (en) | Portable interpretation system based on WinCE platform and language recognition method thereof | |
CN105448302A (en) | Environment adaptive type voice reverberation elimination method and system | |
Chetouani et al. | Investigation on LP-residual representations for speaker identification | |
Shah et al. | Novel MMSE DiscoGAN for cross-domain whisper-to-speech conversion | |
CN111326170A (en) | Method and device for converting ear voice into normal voice by combining time-frequency domain expansion convolution | |
CN106875944A (en) | A kind of system of Voice command home intelligent terminal | |
Brucal et al. | Female voice recognition using artificial neural networks and MATLAB voicebox toolbox | |
CN109215635B (en) | Broadband voice frequency spectrum gradient characteristic parameter reconstruction method for voice definition enhancement | |
CN103854655A (en) | Low-bit-rate voice coder and decoder |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||