CN108986834B - Bone conduction voice blind enhancement method based on codec framework and recurrent neural network - Google Patents

Bone conduction voice blind enhancement method based on codec framework and recurrent neural network

Info

Publication number
CN108986834B
CN108986834B
Authority
CN
China
Prior art keywords
voice
training
neural network
speech
bone conduction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810960512.0A
Other languages
Chinese (zh)
Other versions
CN108986834A (en)
Inventor
张雄伟
单东晶
郑昌艳
曹铁勇
李莉
杨吉斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Army Engineering University of PLA
Original Assignee
Army Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Army Engineering University of PLA filed Critical Army Engineering University of PLA
Priority to CN201810960512.0A
Publication of CN108986834A
Application granted
Publication of CN108986834B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/005 - Correction of errors induced by the transmission channel, if related to the coding algorithm
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention discloses a bone conduction speech blind enhancement method based on a codec framework and a recurrent neural network. The method first extracts air conduction and bone conduction speech features and performs alignment preprocessing on the extracted feature data; an encoder is then pre-trained with the bone conduction speech features as the training input and the air conduction speech dictionary combination coefficients as the training target, and the trained parameters are saved as the initialization parameters of the encoder for the next step. A decoder model based on a local attention mechanism is constructed; with the encoder output as the decoder input and the air conduction speech features as the training target, the codec model is trained jointly and the model parameters are saved. Finally, the features of the bone conduction speech to be enhanced are extracted, feature conversion is performed with the trained codec neural network, and inverse normalization and inverse feature transformation are applied to the network output to finally obtain the enhanced time-domain speech. The invention solves problems such as the recovery of high-frequency components, the recovery of bone conduction silence, and recovery under a strong noise background, and improves the enhancement quality of bone conduction speech.

Description

Bone conduction voice blind enhancement method based on codec framework and recurrent neural network
Technical Field
The invention belongs to the technical field of speech signal processing, and relates to a bone conduction speech blind enhancement method based on a codec framework and a deep long short-term memory recurrent neural network.
Background
Bone conduction microphones are non-acoustic sensor devices: when a person speaks, vocal cord vibrations are transmitted to the larynx and skull, and the microphone acquires speech by collecting these vibration signals and converting them into electrical signals. Unlike conventional air conduction microphone speech, non-acoustic sensors are hardly affected by background noise, so bone conduction speech shields noise at the sound source, has strong anti-noise performance, and is used in both military and civilian applications. For example, many countries include bone conduction communication systems in military equipment such as helicopters and tanks, and bone conduction headsets are an important communication tool in the "future fighter" individual soldier combat system of the United States. In civilian applications, the American company iASUS has developed throat microphones, bone conduction headsets and other devices for extreme sports such as car and motorcycle racing, and Japanese companies such as Panasonic and Sony have developed various bone conduction communication products applied in fire fighting, forestry, oil exploration and mining, mines, emergency rescue, special duty, engineering construction and other fields.
Although bone conduction speech can effectively resist interference from environmental noise, the low-pass nature of transmission through the human body and the inherent characteristics of vibration signals cause the loss of high-frequency components, muffled mid-frequency components, and missing airflow and nasal sounds, so the speech sounds dull and is not clear enough, which seriously affects auditory perception. In addition, bone conduction speech may also be mixed with non-acoustic physical noise, such as friction noise generated by close contact between the device and the skin, strong wind friction noise during extreme sports, and noise introduced when a person chews or the teeth collide, which further degrades communication quality. Therefore, research on bone conduction speech enhancement algorithms has important theoretical significance and practical value for advancing the practical adoption of bone conduction microphone products and improving speech communication quality in strong noise environments.
At present, there are three typical methods for blind enhancement of bone conduction speech: unsupervised spectrum extension, equalization, and spectral envelope conversion.
The unsupervised spectrum extension method (Bouserhal R E, Falk T H, Voix J. In-ear microphone speech quality enhancement via adaptive filtering and artificial bandwidth extension [J]. Journal of the Acoustical Society of America, 2017) assumes that bone conduction speech and air conduction speech share a consistent formant structure, or that the low-frequency and high-frequency parts of speech share a consistent harmonic structure; using these structural characteristics, the low-frequency spectrum can be extended directly to obtain the enhanced high-frequency formant or harmonic structure, i.e., blind enhancement of bone conduction speech is achieved.
The idea of the equalization method is to find the inverse transform g(t) of the transmission channel transfer function h(t) and recover the air conduction speech signal from the bone conduction speech signal. The equalization method was first proposed by Shimamura (Shimamura T, Tamiya T. A reconstruction filter for bone-conducted speech [C]. Circuits and Systems, 2005. Midwest Symposium on, 2005: 1847-1850), which models g(t) and constructs an inverse filter to achieve bone conduction speech enhancement. The equalization method can maintain the low-frequency harmonic structure of the speech and effectively compress excessive energy in bone conduction speech, but it has difficulty recovering the high-frequency components of bone conduction speech.
At present, most blind enhancement of bone conduction speech adopts methods based on spectral envelope conversion (Turan M A T, Erzin E. Source and Filter Estimation for Throat-Microphone Speech Enhancement [J], 2016; Mohammadi S H, Kain A. An overview of voice conversion systems [J], 2017). The basic idea of spectral envelope conversion is to decompose speech into excitation source features and spectral envelope features according to a source-filter model of speech. In the training stage, the bone conduction and air conduction speech data are processed with an analysis-synthesis model, excitation features and spectral envelope features are extracted, and the conversion relation between spectral envelope features is established by training a conversion model. In the enhancement stage, the bone conduction speech to be enhanced is decomposed to obtain excitation and spectral envelope features, the trained model estimates the air conduction spectral envelope features from the bone conduction spectral envelope features, and the enhanced speech is then synthesized from the estimated envelope and the original excitation features of the bone conduction speech.
Decomposition-synthesis methods based on the source-filter model have made some progress in bone conduction speech enhancement, but they generally suffer from difficult feature selection, unsatisfactory performance in high-noise environments, and inaccurate recovery of the pitch-frequency components of speech, so the enhanced speech tends to have heavy low frequencies, insufficient clarity and intelligibility, processing noise, and other defects. Some research has begun to adopt decomposition-synthesis methods based on a signal model, dividing the speech signal into a high-dimensional magnitude spectrum and a phase, and using deep learning techniques to establish the relation between the high-dimensional magnitude spectra of bone conduction speech and clean air conduction speech, achieving good results in recovering bone conduction speech.
Disclosure of Invention
The invention aims to provide a bone conduction speech blind enhancement method based on a codec framework and a recurrent neural network that is data-driven, obtains the model parameters through training, and uses the trained model to enhance bone conduction speech, solving problems such as the recovery of high-frequency components, the recovery of bone conduction silence, and recovery under a strong noise background, thereby improving the clarity and intelligibility of bone conduction speech and further improving its enhancement quality.
The technical solution for realizing the purpose of the invention is as follows: a bone conduction voice blind enhancement method based on a codec architecture and a recurrent neural network comprises the following steps:
Data preprocessing: extracting air conduction and bone conduction speech features, performing alignment preprocessing on the extracted speech feature data, and computing an air conduction speech dictionary on the air conduction speech feature data with sparse non-negative matrix factorization (Sparse NMF);
Encoder pre-training: taking the bone conduction speech features as the training input and the air conduction speech dictionary combination coefficients as the training target, training an encoder model based on a non-negative sparse long short-term memory recurrent neural network (NS-LSTM), and saving the trained deep neural network parameters as the initialization parameters of the encoder in the next step;
Codec joint training: constructing a decoder model based on a local attention mechanism, taking the encoder output as the decoder input and the air conduction speech features as the training target, jointly training the codec model, and saving the model parameters;
Speech enhancement: extracting the features of the bone conduction speech to be enhanced, performing feature conversion with the codec neural network trained in the previous steps, and then applying inverse normalization and inverse feature transformation to the network output to finally obtain the enhanced time-domain speech.
Compared with the prior art, the invention has the following notable advantages: a speech dictionary and a non-negative sparse recurrent neural network are applied to the bone conduction speech enhancement task, a codec framework based on a local attention mechanism is constructed, the network model parameters are obtained from data through training, and the trained model effectively improves the bone conduction speech enhancement quality. Specifically:
(1) The structured information provided by the speech dictionary calculated by sparse nonnegative matrix factorization is effectively utilized, and the high-frequency components of the speech are better reconstructed;
(2) The output of the encoder is the linear combination coefficients over the speech dictionary; since the speech dictionary is extracted from real clean air conduction speech with Sparse NMF, the encoder has better anti-noise capability and can automatically remove noise from the bone conduction speech during encoding;
(3) The sparse non-negative recurrent neural network is effectively utilized to model the complex nonlinear relation of the feature conversion from bone conduction speech to air conduction speech on the basis of the dictionary; compared with a traditional neural network, the non-negative sparse neural network can effectively learn the long-term dependencies of the sequence through its specially designed unit structure and establish the mapping relation with the speech dictionary;
(4) The decoder network based on the local attention mechanism determines the input content of the decoder through training, so that the decoder network has the capability of recovering bone conduction voice silence (corresponding air conduction voice is not necessarily silent), has the anti-interference capability on strong noise, and can further improve the bone conduction voice recovery quality.
The present invention is described in further detail below with reference to the attached drawing figures.
Drawings
FIG. 1 is a flow chart of a blind enhancement method for bone conduction speech based on codec architecture and recurrent neural network according to the present invention.
FIG. 2 is a schematic diagram of a codec architecture used in the present invention.
FIG. 3 is a schematic diagram of a non-negative sparse NS-LSTM cell structure.
Fig. 4-1 and 4-2 are diagrams of an embodiment of the bone conduction voice blind enhancement of the present invention.
Detailed Description
With reference to fig. 1 and fig. 2, the implementation of the bone conduction speech blind enhancement method based on codec architecture and recurrent neural network of the present invention is divided into two stages: a training phase and an enhancement phase. The training phase comprises a first step, a second step and a third step, and the enhancement phase comprises a fourth step and a fifth step. The training phase and the enhancement phase speech data are not repeated, i.e. there are no sentences with the same utterance content.
The first phase is a training phase: the neural network model is trained by the training data.
The method comprises the following steps: extract the air conduction (AC) and bone conduction (BC) speech magnitude spectrum features and perform data preprocessing on the extracted features so that they meet the input requirements of the neural network. The first two processing steps are consistent with the data preprocessing step of the patent "Single-channel unsupervised speech and noise separation method based on low-rank and sparse matrix decomposition" (CN102915742B); to reduce the dynamic range of the extracted magnitude spectrum features, logarithmic magnitude spectrum features are adopted. The steps are as follows:
(1) The speech data are AC and BC speech pairs recorded by the same person wearing AC and BC microphone devices at the same time; the AC speech is denoted A and the BC speech is denoted B. The AC and BC speech time-domain signals y(A) and y(B) are transformed to the time-frequency domain with the short-time Fourier transform (STFT), specifically:
(1) frame and window the speech time-domain signals y(A) and y(B), where the window function is a Hamming window, the frame length is N (an integer power of 2), and the inter-frame shift is H;
(2) apply a K-point discrete Fourier transform to each frame to obtain the time-frequency spectra Y_A(k,t) and Y_B(k,t) of the speech, calculated as:

Y(k,t) = Σ_{n=0}^{N-1} y(n + tH) · h(n) · e^{-j2πkn/K}

where k = 0, 1, …, K-1 denotes the discrete frequency bin, K is the number of points in the discrete Fourier transform, K = N, t = 0, 1, …, T-1 denotes the frame index, T is the total number of frames, and h(n) is the Hamming window function;
(2) Take the absolute value of the spectrum Y(k,t) to obtain the magnitude spectra M_A and M_B, calculated as:

M(k,t) = |Y(k,t)|

(3) Take the natural logarithm (ln) of the magnitude spectrum M(k,t) to obtain the log-magnitude spectra L_A and L_B, calculated as:

L(k,t) = ln M(k,t)

(4) Compute the air conduction speech dictionary D on the log-magnitude spectrum feature matrix of the clean air conduction speech with Sparse Non-negative Matrix Factorization (Sparse NMF).
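The preprocessing in this step amounts to a standard STFT analysis followed by a dictionary-learning step. The following is a minimal Python/NumPy sketch of that pipeline; it is an illustration, not the patent's reference implementation, and the frame length, hop, number of dictionary atoms and sparsity weight are assumed values.

```python
import numpy as np

def log_magnitude_spectrum(y, frame_len=256, hop=80):
    """Frame, apply a Hamming window and a K-point DFT (K = frame_len), then take
    the magnitude and log-magnitude; each frame yields K//2 + 1 frequency bins."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[t * hop: t * hop + frame_len] * window
                       for t in range(n_frames)], axis=1)   # shape (N, T)
    Y = np.fft.rfft(frames, n=frame_len, axis=0)             # time-frequency spectrum Y(k, t)
    M = np.abs(Y)                                             # magnitude spectrum M(k, t)
    L = np.log(M + 1e-8)                                      # log-magnitude spectrum L(k, t)
    return Y, M, L

def sparse_nmf_dictionary(V, n_atoms=64, sparsity=0.1, n_iter=200, seed=0):
    """Tiny sparse NMF via multiplicative updates with an L1 penalty on the
    activations: V ~= D @ C with D >= 0, C >= 0; returns the dictionary D.
    (The patent computes D on the log-magnitude features; the magnitude spectrum
    is used here simply so that every entry is non-negative.)"""
    rng = np.random.default_rng(seed)
    K, T = V.shape
    D = rng.random((K, n_atoms)) + 1e-3
    C = rng.random((n_atoms, T)) + 1e-3
    for _ in range(n_iter):
        C *= (D.T @ V) / (D.T @ (D @ C) + sparsity + 1e-9)
        D *= (V @ C.T) / ((D @ C) @ C.T + 1e-9)
        D /= np.maximum(D.sum(axis=0, keepdims=True), 1e-9)  # normalise dictionary atoms
    return D

if __name__ == "__main__":
    fs = 8000
    ac = np.random.randn(fs * 2)              # stand-in for an air-conduction recording
    _, M_A, L_A = log_magnitude_spectrum(ac)
    D = sparse_nmf_dictionary(M_A, n_atoms=64)
    print(L_A.shape, D.shape)                 # (129, T) and (129, 64)
```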
Step two: encoder pre-training. The pre-trained encoder network consists of three layers (see FIG. 2): a linear layer (Linear), a long short-term memory recurrent network layer (LSTM), and a non-negative sparse long short-term memory recurrent neural network layer (NS-LSTM). During training, the normalized log-magnitude spectrum features of the bone conduction speech are used as the training input and the log-magnitude spectrum features of the air conduction speech as the training target; the neural network model is trained with the back-propagation through time (BPTT) algorithm, and the trained neural network parameters are saved. The NS-LSTM unit structure and the pre-training process of the encoder network are as follows:
(1) The non-negative sparse long short-term memory neural network model is a variant of the long short-term memory (LSTM) model; by introducing non-negative and sparse control variables it can produce output vectors that satisfy the constraints. Its constituent unit is shown in FIG. 3 and can be represented by the following formulas (a sketch of one time step of this unit is given after them):

f_t = σ(W_fx X_t + W_fh h_{t-1} + b_f)
i_t = σ(W_ix X_t + W_ih h_{t-1} + b_i)
g_t = σ(W_gx X_t + W_gh h_{t-1} + b_g)
o_t = σ(W_ox X_t + W_oh h_{t-1} + b_o)
S_t = g_t ⊙ i_t + S_{t-1} ⊙ f_t
h_t = sh_(D,u)(ψ(φ(S_t)) ⊙ o_t)

where σ(x) = 1/(1 + e^{-x}) is the sigmoid function, φ(x) = tanh(x), ψ(x) = ReLU(x) = max(0, x) is the non-negative constraint, and sh_(D,u)(x) = D(tanh(x + u) + tanh(x - u)) is the sparse activation function;
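The following NumPy sketch runs one time step of such a unit with randomly initialised weights; the parameters D and u of the sparse activation are assumed scalar values, and the layer sizes are illustrative only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sh(x, D=1.0, u=0.5):
    """Sparse activation sh_(D,u)(x) = D*(tanh(x+u) + tanh(x-u))."""
    return D * (np.tanh(x + u) + np.tanh(x - u))

def ns_lstm_step(x_t, h_prev, s_prev, W, b, D=1.0, u=0.5):
    """One NS-LSTM time step following the unit equations above:
    f, i, g, o gates -> cell state S_t -> non-negative sparse output h_t."""
    f_t = sigmoid(W["fx"] @ x_t + W["fh"] @ h_prev + b["f"])
    i_t = sigmoid(W["ix"] @ x_t + W["ih"] @ h_prev + b["i"])
    g_t = sigmoid(W["gx"] @ x_t + W["gh"] @ h_prev + b["g"])
    o_t = sigmoid(W["ox"] @ x_t + W["oh"] @ h_prev + b["o"])
    s_t = g_t * i_t + s_prev * f_t
    # h_t = sh_(D,u)(ReLU(tanh(S_t)) ⊙ o_t): non-negative gated state, then sparse activation
    h_t = sh(np.maximum(np.tanh(s_t), 0.0) * o_t, D, u)
    return h_t, s_t

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_in, n_hid = 129, 64
    W = {k: rng.standard_normal((n_hid, n_in if k.endswith("x") else n_hid)) * 0.1
         for k in ["fx", "fh", "ix", "ih", "gx", "gh", "ox", "oh"]}
    b = {k: np.zeros(n_hid) for k in ["f", "i", "g", "o"]}
    h, s = np.zeros(n_hid), np.zeros(n_hid)
    h, s = ns_lstm_step(rng.standard_normal(n_in), h, s, W, b)
    print(h.shape, float(h.min()))   # hidden state is non-negative by construction
```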
(2) Dropout regularization technique: to improve the robustness of the model, dropout regularization is applied during neural network training; it improves generalization by randomly removing neural units. With the dropout ratio set to p (e.g., p = 0.2), the dropout formulas are:

r_j^(l) ~ Bernoulli(p)
ỹ^(l) = r^(l) ⊙ y^(l)
y_i^(l+1) = f(w_i^(l+1) ỹ^(l) + b_i^(l+1))

where r_j^(l) indicates the presence of the j-th neuron in layer l, Bernoulli(p) is a Bernoulli distribution with probability p (it takes the value 1 with probability p and 0 with probability 1 - p), y_j^(l) is the output value of the j-th neuron in layer l, ỹ_j^(l) is y_j^(l) multiplied by r_j^(l), i.e. it equals y_j^(l) or 0, w_i^(l+1) is the network weight, b_i^(l+1) is the bias, f denotes the activation unit, and y_i^(l+1) is the neuron output after the activation function. A small sketch of this masking step follows.
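The sketch below illustrates the masking formulas only; the keep probability and layer size are arbitrary example values.

```python
import numpy as np

def dropout_forward(y_l, keep_p=0.8, rng=None):
    """Dropout as in the formulas above: r ~ Bernoulli(keep_p) masks the layer
    output, so each unit survives with probability keep_p."""
    rng = rng or np.random.default_rng()
    r = rng.binomial(1, keep_p, size=y_l.shape)   # presence indicator r_j^(l)
    return r * y_l                                # ỹ^(l) = r^(l) ⊙ y^(l)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y_l = rng.standard_normal(10)
    y_tilde = dropout_forward(y_l, keep_p=0.8, rng=rng)
    # the next layer then computes y^(l+1) = f(w^(l+1) @ ỹ^(l) + b^(l+1))
    print(y_tilde)
```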
(3) Training of the encoder neural network: the training loss objective is the mean square error between the dictionary reconstruction of the network output and the corresponding AC speech log-magnitude spectrum:

Loss = (1/T) Σ_{t=1}^{T} ‖ L_A(:,t) - M c_t ‖²

where c_t is the dictionary coefficient vector output by the encoder at frame t and M = [D, I, -I] is the concatenation of the air conduction speech dictionary and the compensation dictionaries, I being the diagonal matrix with diagonal elements 1 and remaining elements 0; the compensation dictionaries improve the representation precision in the linear combination of dictionary elements. During training, b% of the training data (e.g., b between 10 and 20) is used as validation set data and the loss function is minimized; the network weights are randomly initialized in [-0.1, 0.1]. Specifically, the RMSProp (Root Mean Square Propagation) variant of stochastic gradient descent (SGD) is used with the initial learning rate set to lr (e.g., lr = 0.01); when the validation loss does not decrease, the learning rate is multiplied by a factor ratio (e.g., ratio = 0.1), with momentum momentum (e.g., momentum = 0.9); when the validation loss does not decrease for i consecutive rounds (e.g., i = 3), training is stopped, and the neural network model parameters with the minimum validation loss are saved, denoted S'. A sketch of this loss is given below.
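The compensated dictionary and the mean-square loss can be assembled as follows; this is a sketch in which the encoder producing the coefficients is stood in by a random non-negative matrix, and all dimensions are assumed.

```python
import numpy as np

def compensated_dictionary(D):
    """M = [D, I, -I]: speech dictionary plus the two compensation dictionaries."""
    K = D.shape[0]
    I = np.eye(K)
    return np.concatenate([D, I, -I], axis=1)        # shape (K, n_atoms + 2K)

def encoder_loss(L_A, C, M):
    """Mean square error between the dictionary reconstruction M @ C and the
    AC log-magnitude spectrum L_A (both of shape (K, T))."""
    return np.mean((L_A - M @ C) ** 2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, T, n_atoms = 129, 200, 64
    D = rng.random((K, n_atoms))                      # dictionary from step one
    M = compensated_dictionary(D)
    L_A = rng.standard_normal((K, T))                 # stand-in AC log-magnitude spectrum
    C = np.abs(rng.standard_normal((M.shape[1], T)))  # stand-in encoder output: non-negative coefficients
    print(encoder_loss(L_A, C, M))
```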
Step three: codec joint training. The decoder structure is shown in FIG. 2 and consists of two network layers, a recurrent network layer (LSTM) and a linear layer (Linear). a_i denotes the decoder network input based on the local attention mechanism:

a_i = Σ_{j ∈ N(i)} ω_ij e_j

where e_j is the j-th output of the encoder and N(i) denotes the neighbourhood of the i-th encoder output, typically 10 to 20 outputs. ω_ij are the weighted combination coefficients of these neighbouring inputs, computed as:

ω_ij = exp(score_ij) / Σ_{j' ∈ N(i)} exp(score_ij')

score_ij is the weighted score of the j-th encoder output e_j with respect to the i-th decoder input a_i, and the normalization above yields the weights of the linear combination. W_a is the parameter matrix of a linear layer: the decoder output at time i - 1 is passed through this linear layer and an inner product with e_j is taken to compute the weighted score score_ij. A sketch of this local attention computation is given below.
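The following sketch computes one decoder input a_i from the encoder outputs; the dimensions, the initialisation of W_a and the neighbourhood size are assumed (the window here spans 17 outputs, within the 10 to 20 range mentioned above).

```python
import numpy as np

def local_attention_input(enc_outputs, dec_state_prev, W_a, i, half_window=8):
    """a_i = sum_{j in N(i)} w_ij * e_j, with w_ij = softmax(score_ij) over N(i)
    and score_ij = (W_a @ d_{i-1}) . e_j (inner product after a linear layer)."""
    T = enc_outputs.shape[0]
    lo, hi = max(0, i - half_window), min(T, i + half_window + 1)  # neighbourhood N(i)
    query = W_a @ dec_state_prev                                   # linear layer on d_{i-1}
    scores = enc_outputs[lo:hi] @ query                            # score_ij for j in N(i)
    scores -= scores.max()                                         # numerical stability
    w = np.exp(scores) / np.exp(scores).sum()                      # softmax weights w_ij
    return w @ enc_outputs[lo:hi]                                  # weighted combination a_i

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d_enc, d_dec = 100, 64, 64
    enc = rng.standard_normal((T, d_enc))      # encoder outputs e_j
    d_prev = rng.standard_normal(d_dec)        # decoder output at time i-1
    W_a = rng.standard_normal((d_enc, d_dec)) * 0.1
    a_i = local_attention_input(enc, d_prev, W_a, i=50)
    print(a_i.shape)                           # (64,)
```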
The purpose of the decoder is to obtain, through training, a speech signal closer to the real one on the basis of the dictionary-coded synthesized speech. The decoder is optimized jointly with the encoder: a mean square error loss function is constructed with the log-magnitude spectrum features of the real speech signal, optimized period by period, and the optimal network parameters are obtained by gradient descent and saved locally, denoted S.
The second phase is the enhancement phase: the bone conduction speech is enhanced with the trained codec network model.
Step four: extract the features of the bone conduction speech to be enhanced and, according to the statistics of the aligned log-magnitude spectrum L_B obtained in step one, namely its mean μ_B and variance σ_B², perform data normalization.

First, the BC speech data B_E to be enhanced are transformed from the time-domain waveform to the time-frequency domain Y_{B_E}(k,t) by the Fourier transform. The process of extracting the features of the BC speech to be enhanced is shown in the enhancement part of FIG. 1; compared with the feature extraction in step one, the phase must also be extracted. From the time-frequency spectrum Y_{B_E}(k,t), both the magnitude spectrum M_{B_E}(k,t) and the phase θ_{B_E}(k,t) are computed:

M_{B_E}(k,t) = |Y_{B_E}(k,t)|
θ_{B_E}(k,t) = atan2(imag(Y_{B_E}(k,t)), real(Y_{B_E}(k,t)))

where atan2(·) is the four-quadrant arctangent function, and imag(·) and real(·) denote the imaginary and real parts of the time-frequency spectrum. From the magnitude spectrum M_{B_E}, the log-magnitude spectrum L_{B_E} is computed; then, according to the mean μ_B and variance σ_B² of the BC log-magnitude spectrum obtained in the training stage, L_{B_E} is normalized:

L̂_{B_E}(k,t) = (L_{B_E}(k,t) - μ_B) / σ_B

A sketch of this enhancement-stage feature extraction is given below.
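The enhancement-stage feature extraction differs from step one only in that the phase is kept and the normalisation uses the training-stage statistics. In the sketch below, per-frequency-bin statistics, the frame parameters and the input waveform are assumed values.

```python
import numpy as np

def stft(y, frame_len=256, hop=80):
    window = np.hamming(frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[t * hop: t * hop + frame_len] * window
                       for t in range(n_frames)], axis=1)
    return np.fft.rfft(frames, n=frame_len, axis=0)      # Y(k, t)

def enhancement_features(b_e, mu_B, sigma_B, frame_len=256, hop=80):
    """Step four: magnitude, phase and normalised log-magnitude of the BC speech
    to be enhanced, using the mean/variance saved in the training stage."""
    Y = stft(b_e, frame_len, hop)
    M = np.abs(Y)                                         # magnitude spectrum
    phase = np.arctan2(Y.imag, Y.real)                    # atan2(imag, real)
    L = np.log(M + 1e-8)                                  # log-magnitude spectrum
    L_norm = (L - mu_B[:, None]) / (sigma_B[:, None] + 1e-8)
    return L_norm, phase

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    b_e = rng.standard_normal(8000 * 3)                   # stand-in BC recording, 3 s at 8 kHz
    mu_B, sigma_B = np.zeros(129), np.ones(129)           # assumed training statistics
    L_norm, phase = enhancement_features(b_e, mu_B, sigma_B)
    print(L_norm.shape, phase.shape)
```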
Step five: during enhancement, convert the bone conduction speech features extracted in step four with the codec neural network trained in the first stage, and then apply inverse normalization and inverse feature transformation to the network output to finally obtain the enhanced time-domain speech.

First, the normalized log-magnitude spectrum L̂_{B_E} is input into the trained codec neural network model S, and the network output, i.e. the enhanced feature L̂_{A_E}, is obtained.

Second, the enhanced feature L̂_{A_E} is inverse-normalized and inverse-transformed to finally obtain the enhanced time-domain speech, as follows:

(1) According to the mean μ_A and variance σ_A² of the AC speech log-magnitude spectrum from the training stage, inverse-normalize the network output L̂_{A_E} to obtain the log-magnitude spectrum L_{A_E}:

L_{A_E}(k,t) = σ_A · L̂_{A_E}(k,t) + μ_A

(2) Apply the exponential operation to the log-magnitude spectrum L_{A_E} to obtain the magnitude spectrum M_{A_E}:

M_{A_E}(k,t) = exp(L_{A_E}(k,t))

(3) Use the magnitude spectrum M_{A_E} and the phase information θ_{B_E} to compute the time-frequency spectrum Y_{A_E}:

Y_{A_E}(k,t) = M_{A_E}(k,t) · e^{j θ_{B_E}(k,t)}

(4) Convert the spectrum Y_{A_E} to the time domain with the inverse Fourier transform and the overlap-add formula after speech framing, finally obtaining the enhanced time-domain speech signal y(B_E). A sketch of this resynthesis is given below.
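The sketch below reverses the feature pipeline: it denormalises a stand-in network output with assumed training-stage AC statistics, exponentiates, reattaches the BC phase, and resynthesises the waveform with inverse FFT and overlap-add; window, hop and statistics are assumed values.

```python
import numpy as np

def istft_overlap_add(Y, frame_len=256, hop=80):
    """Inverse DFT per frame followed by windowed overlap-add resynthesis."""
    window = np.hamming(frame_len)
    n_frames = Y.shape[1]
    out = np.zeros(frame_len + (n_frames - 1) * hop)
    norm = np.zeros_like(out)
    for t in range(n_frames):
        frame = np.fft.irfft(Y[:, t], n=frame_len)
        out[t * hop: t * hop + frame_len] += frame * window
        norm[t * hop: t * hop + frame_len] += window ** 2
    return out / np.maximum(norm, 1e-8)

def resynthesize(L_hat_norm, phase_B, mu_A, sigma_A, frame_len=256, hop=80):
    """Inverse normalisation -> exp -> combine with BC phase -> time domain."""
    L_A = L_hat_norm * sigma_A[:, None] + mu_A[:, None]   # inverse normalisation
    M_A = np.exp(L_A)                                      # magnitude spectrum
    Y_A = M_A * np.exp(1j * phase_B)                       # M(k,t) * e^{j*theta(k,t)}
    return istft_overlap_add(Y_A, frame_len, hop)          # enhanced waveform y(B_E)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, T = 129, 200
    L_hat = rng.standard_normal((K, T))                    # stand-in network output
    phase = rng.uniform(-np.pi, np.pi, size=(K, T))        # phase saved in step four
    mu_A, sigma_A = np.zeros(K), np.ones(K)                # assumed AC statistics
    y_e = resynthesize(L_hat, phase, mu_A, sigma_A)
    print(y_e.shape)
```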
Examples
Fig. 4-1 and 4-2 are diagrams of embodiments of the invention; the example speech lengths are about 3.5 s and 4 s, the speech sampling frequency is 8 kHz, the frame length is set to 32 ms, the frame shift to 10 ms, a discrete Fourier transform with K = 256 frequency points is applied to each frame, and the resulting log-magnitude spectrum has 129 dimensions. In Fig. 4-1 and 4-2, (a) is the spectrogram of the bone conduction speech, (b) is the spectrogram of the speech enhanced with an LSTM deep neural network, and (c) is the spectrogram of the speech enhanced by the invention. It can be seen that the missing high-frequency components and signals such as breath sounds and fricatives are effectively recovered in the enhanced speech, an improvement over the LSTM algorithm; in addition, subjective test results also show that the invention achieves a good bone conduction speech enhancement effect.
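As a quick check of the dimensions quoted in this example (a hypothetical script, not part of the patent):

```python
fs = 8000                      # sampling frequency (Hz)
frame_len = int(0.032 * fs)    # 32 ms frame -> 256 samples
hop = int(0.010 * fs)          # 10 ms shift -> 80 samples
K = frame_len                  # DFT points, K = N = 256
n_bins = K // 2 + 1            # one-sided spectrum -> 129 log-magnitude dimensions
print(frame_len, hop, n_bins)  # 256 80 129
```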

Claims (6)

1. A bone conduction voice blind enhancement method based on a codec architecture and a recurrent neural network is characterized by comprising the following steps:
data preprocessing: extracting air conduction AC and bone conduction BC voice amplitude spectrum characteristics, performing alignment pretreatment on the extracted voice characteristic data, and calculating an air conduction voice dictionary by using sparse nonnegative matrix decomposition on the air conduction voice characteristic data;
pre-training of the encoder: taking the bone conduction voice characteristics as training input, taking the air conduction voice dictionary combination coefficient as a training target, adopting a nonnegative and sparse long-term memory recurrent neural network training encoder model, and storing the trained deep neural network parameters as initialization parameters of an encoder in the next step;
joint training of codecs: constructing a decoder model based on a local attention mechanism, taking the output of a coder as the input of a decoder, taking the air conduction voice characteristic as a training target, jointly training a codec model, and storing model parameters;
and (3) speech enhancement: and extracting the features of the bone conduction speech to be enhanced, realizing feature conversion by utilizing the trained coding and decoding neural network in the steps, and then carrying out inverse normalization and inverse feature transformation on the output of the neural network to finally obtain the enhanced time domain speech.
2. The method of claim 1, characterized in that air conduction and bone conduction speech magnitude spectral features are extracted and the extracted speech features are data preprocessed to fit the input requirements of the neural network, wherein to reduce the dynamic range of the extracted magnitude spectral features, logarithmic magnitude spectral features are used:
(1) The voice data is an AC and BC voice pair recorded by the same person wearing AC and BC microphone equipment at the same time, the AC voice is represented as A, the BC voice is represented as B, and AC and BC voice time domain signals y (A) and y (B) are respectively transformed to a time frequency domain by short-time Fourier transform, and the method specifically comprises the following steps:
(1) respectively carrying out frame division and windowing on voice time domain signals y (A) and y (B), wherein a window function is a Hamming window, the frame length is N, N is an integer power of 2, and the inter-frame moving length is H;
(2) performing a K-point discrete Fourier transform on each frame to obtain the time-frequency spectra Y_A(k,t) and Y_B(k,t) of the speech, calculated as:

Y(k,t) = Σ_{n=0}^{N-1} y(n + tH) · h(n) · e^{-j2πkn/K}

where k = 0, 1, …, K-1 denotes the discrete frequency bin, K is the number of points in the discrete Fourier transform, K = N, t = 0, 1, …, T-1 denotes the frame index, T is the total number of frames, and h(n) is the Hamming window function;
(2) Taking the absolute value of the spectrum Y(k,t) to obtain the magnitude spectra M_A and M_B, calculated as:

M(k,t) = |Y(k,t)|

(3) Taking the natural logarithm (ln) of the magnitude spectrum M(k,t) to obtain the log-magnitude spectra L_A and L_B, calculated as:
L(k,t)=lnM(k,t)
(4) And calculating the air conduction voice dictionary D on the log-amplitude spectrum characteristic matrix of the pure air conduction voice by adopting sparse nonnegative matrix decomposition.
3. The method of claim 2, wherein in the pre-training of the encoder, the pre-trained encoder network consists of three layers: a linear layer (Linear), a long short-term memory recurrent network layer (LSTM), and a non-negative sparse long short-term memory recurrent neural network layer (NS-LSTM); during training, the normalized log-magnitude spectrum features of the bone conduction speech are used as the training input and the log-magnitude spectrum features of the air conduction speech as the training target, the neural network model is trained with the back-propagation through time algorithm, and the trained neural network parameters are saved, the NS-LSTM unit structure and the pre-training process of the encoder network being as follows:
(1) The non-negative sparse long short-term memory neural network model is a variant of the long short-term memory (LSTM) model; by introducing non-negative and sparse control variables it produces output vectors that satisfy the constraints, its constituent unit being represented by the following formulas:

f_t = σ(W_fx X_t + W_fh h_{t-1} + b_f)
i_t = σ(W_ix X_t + W_ih h_{t-1} + b_i)
g_t = σ(W_gx X_t + W_gh h_{t-1} + b_g)
o_t = σ(W_ox X_t + W_oh h_{t-1} + b_o)
S_t = g_t ⊙ i_t + S_{t-1} ⊙ f_t
h_t = sh_(D,u)(ψ(φ(S_t)) ⊙ o_t)

where σ(x) = 1/(1 + e^{-x}) is the sigmoid function, φ(x) = tanh(x), ψ(x) = ReLU(x) = max(0, x) is the non-negative constraint, and sh_(D,u)(x) = D(tanh(x + u) + tanh(x - u)) is the sparse activation function;
(2) Dropout regularization technique: with the dropout ratio set to p, the dropout formulas are:

r_j^(l) ~ Bernoulli(p)
ỹ^(l) = r^(l) ⊙ y^(l)
y_i^(l+1) = f(w_i^(l+1) ỹ^(l) + b_i^(l+1))

where r_j^(l) indicates the presence of the j-th neuron in layer l, Bernoulli(p) is a Bernoulli distribution with probability p (taking the value 1 with probability p and 0 with probability 1 - p), y_j^(l) is the output value of the j-th neuron in layer l, ỹ_j^(l) is y_j^(l) multiplied by r_j^(l), i.e. it equals y_j^(l) or 0, w_i^(l+1) is the network weight, b_i^(l+1) is the bias, f denotes the activation unit, and y_i^(l+1) is the neuron output after the activation function;
(3) Training of the encoder neural network: the training loss objective is the mean square error between the dictionary reconstruction of the network output and the corresponding AC speech log-magnitude spectrum:

Loss = (1/T) Σ_{t=1}^{T} ‖ L_A(:,t) - M c_t ‖²

where c is the dictionary coefficient vector, M = [D, I, -I] is the concatenation of the air conduction speech dictionary and the compensation dictionaries, D is the air conduction speech dictionary, and I is the diagonal matrix with diagonal elements 1 and remaining elements 0, compensating and improving the representation precision in the linear combination of dictionary elements; b% of the training data is used as validation set data, the loss function is minimized during training, and the network weights are randomly initialized in [-0.1, 0.1]; specifically, the RMSProp variant of the stochastic gradient descent algorithm is adopted with the initial learning rate set to lr, when the validation loss does not decrease the learning rate is multiplied by a factor ratio with momentum momentum, when the validation loss does not decrease for i consecutive training rounds the training is stopped, and the neural network model parameters with the minimum validation loss are saved, denoted S'.
4. The method of claim 3, wherein in the codec joint training, the decoder comprises two network layers, a recurrent network layer (LSTM) and a linear layer (Linear), and a_i denotes the decoder network input based on the local attention mechanism:

a_i = Σ_{j ∈ N(i)} ω_ij e_j

where e_j is the j-th output of the encoder, N(i) denotes the neighbourhood of the i-th encoder output, and ω_ij are the weighted combination coefficients of the neighbouring inputs, calculated as:

ω_ij = exp(score_ij) / Σ_{j' ∈ N(i)} exp(score_ij')

score_ij is the weighted score of the j-th encoder output e_j with respect to the i-th decoder input a_i, normalized as above to obtain the weights of the linear combination; W_a is the parameter matrix of a linear layer, and the decoder output at time i - 1 is passed through this linear layer and an inner product with e_j is taken to compute the weighted score score_ij;

the decoder is used to obtain, through training, a speech signal closer to the real one on the basis of the dictionary-coded synthesized speech; the decoder is optimized jointly with the encoder, a mean square error loss function is constructed with the log-magnitude spectrum features of the real speech signal in a period-by-period optimization manner, and the optimal network parameters are obtained by gradient descent and stored locally, denoted S.
5. The method of claim 4, wherein the bone conduction speech features to be enhanced are extracted and, according to the statistics of the aligned log-magnitude spectrum L_B, namely its mean μ_B and variance σ_B², data normalization is performed:

first, the BC speech data B_E to be enhanced are transformed from the time-domain waveform to the time-frequency domain Y_{B_E}(k,t) by the Fourier transform; compared with the feature extraction, the process of extracting the features of the BC speech to be enhanced additionally extracts the phase, i.e. after obtaining the time-frequency spectrum Y_{B_E}(k,t), both the magnitude spectrum M_{B_E}(k,t) and the phase θ_{B_E}(k,t) are calculated:

M_{B_E}(k,t) = |Y_{B_E}(k,t)|
θ_{B_E}(k,t) = atan2(imag(Y_{B_E}(k,t)), real(Y_{B_E}(k,t)))

where atan2(·) is the four-quadrant arctangent function, and imag(·) and real(·) denote the imaginary and real parts of the time-frequency spectrum; the log-magnitude spectrum L_{B_E} is calculated from the magnitude spectrum M_{B_E}, and then, according to the mean μ_B and variance σ_B² of the BC log-magnitude spectrum obtained in the training stage, L_{B_E} is normalized:

L̂_{B_E}(k,t) = (L_{B_E}(k,t) - μ_B) / σ_B
6. The method of claim 5, wherein during enhancement the extracted bone conduction speech features are converted with the codec neural network trained in the first stage, and then inverse normalization and inverse feature transformation are applied to the network output to finally obtain the enhanced time-domain speech:

first, the normalized log-magnitude spectrum L̂_{B_E} is input into the trained codec neural network model S, and the network output, i.e. the enhanced feature L̂_{A_E}, is obtained;

second, the enhanced feature L̂_{A_E} is inverse-normalized and inverse-transformed to finally obtain the enhanced time-domain speech, as follows:

(1) According to the mean μ_A and variance σ_A² of the AC speech log-magnitude spectrum from the training stage, inverse-normalize the network output L̂_{A_E} to obtain the log-magnitude spectrum L_{A_E}:

L_{A_E}(k,t) = σ_A · L̂_{A_E}(k,t) + μ_A

(2) Apply the exponential operation to the log-magnitude spectrum L_{A_E} to obtain the magnitude spectrum M_{A_E}:

M_{A_E}(k,t) = exp(L_{A_E}(k,t))

(3) Use the magnitude spectrum M_{A_E} and the phase information θ_{B_E} to calculate the time-frequency spectrum Y_{A_E}:

Y_{A_E}(k,t) = M_{A_E}(k,t) · e^{j θ_{B_E}(k,t)}

(4) Convert the spectrum Y_{A_E} to the time domain with the inverse Fourier transform and the overlap-add formula after speech framing, finally obtaining the enhanced time-domain speech signal y(B_E).
CN201810960512.0A 2018-08-22 2018-08-22 Bone conduction voice blind enhancement method based on codec framework and recurrent neural network Active CN108986834B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810960512.0A CN108986834B (en) 2018-08-22 2018-08-22 Bone conduction voice blind enhancement method based on codec framework and recurrent neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810960512.0A CN108986834B (en) 2018-08-22 2018-08-22 Bone conduction voice blind enhancement method based on codec framework and recurrent neural network

Publications (2)

Publication Number Publication Date
CN108986834A CN108986834A (en) 2018-12-11
CN108986834B true CN108986834B (en) 2023-04-07

Family

ID=64547287

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810960512.0A Active CN108986834B (en) 2018-08-22 2018-08-22 Bone conduction voice blind enhancement method based on codec framework and recurrent neural network

Country Status (1)

Country Link
CN (1) CN108986834B (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109793511A (en) * 2019-01-16 2019-05-24 成都蓝景信息技术有限公司 Electrocardiosignal noise detection algorithm based on depth learning technology
CN109975702B (en) * 2019-03-22 2021-08-10 华南理工大学 Direct-current gear reduction motor quality inspection method based on circulation network classification model
CN110111803B (en) * 2019-05-09 2021-02-19 南京工程学院 Transfer learning voice enhancement method based on self-attention multi-kernel maximum mean difference
CN110085249B (en) * 2019-05-09 2021-03-16 南京工程学院 Single-channel speech enhancement method of recurrent neural network based on attention gating
CN110136731B (en) * 2019-05-13 2021-12-24 天津大学 Cavity causal convolution generation confrontation network end-to-end bone conduction voice blind enhancement method
CN110164465B (en) * 2019-05-15 2021-06-29 上海大学 Deep-circulation neural network-based voice enhancement method and device
CN110648684B (en) * 2019-07-02 2022-02-18 中国人民解放军陆军工程大学 Bone conduction voice enhancement waveform generation method based on WaveNet
CN110390945B (en) * 2019-07-25 2021-09-21 华南理工大学 Dual-sensor voice enhancement method and implementation device
CN110675888A (en) * 2019-09-25 2020-01-10 电子科技大学 Speech enhancement method based on RefineNet and evaluation loss
KR102429152B1 (en) * 2019-10-09 2022-08-03 엘레복 테크놀로지 컴퍼니 리미티드 Deep learning voice extraction and noise reduction method by fusion of bone vibration sensor and microphone signal
CN110931031A (en) * 2019-10-09 2020-03-27 大象声科(深圳)科技有限公司 Deep learning voice extraction and noise reduction method fusing bone vibration sensor and microphone signals
CN110867192A (en) * 2019-10-23 2020-03-06 北京计算机技术及应用研究所 Speech enhancement method based on gated cyclic coding and decoding network
CN110808063A (en) * 2019-11-29 2020-02-18 北京搜狗科技发展有限公司 Voice processing method and device for processing voice
CN111242976B (en) * 2020-01-08 2021-05-11 北京天睿空间科技股份有限公司 Aircraft detection tracking method using attention mechanism
CN111312270B (en) * 2020-02-10 2022-11-22 腾讯科技(深圳)有限公司 Voice enhancement method and device, electronic equipment and computer readable storage medium
CN111833843B (en) 2020-07-21 2022-05-10 思必驰科技股份有限公司 Speech synthesis method and system
CN112185405B (en) * 2020-09-10 2024-02-09 中国科学技术大学 Bone conduction voice enhancement method based on differential operation and combined dictionary learning
CN111899757B (en) * 2020-09-29 2021-01-12 南京蕴智科技有限公司 Single-channel voice separation method and system for target speaker extraction
CN112562704B (en) * 2020-11-17 2023-08-18 中国人民解放军陆军工程大学 Frequency division topological anti-noise voice conversion method based on BLSTM
CN112786064B (en) * 2020-12-30 2023-09-08 西北工业大学 End-to-end bone qi conduction voice joint enhancement method
CN113642714B (en) * 2021-08-27 2024-02-09 国网湖南省电力有限公司 Insulator pollution discharge state identification method and system based on small sample learning
WO2023102930A1 (en) * 2021-12-10 2023-06-15 清华大学深圳国际研究生院 Speech enhancement method, electronic device, program product, and storage medium
CN114495909B (en) * 2022-02-20 2024-04-30 西北工业大学 End-to-end bone-qi guiding voice joint recognition method
CN115033734B (en) * 2022-08-11 2022-11-11 腾讯科技(深圳)有限公司 Audio data processing method and device, computer equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003264883A (en) * 2002-03-08 2003-09-19 Denso Corp Voice processing apparatus and voice processing method
CN102915742B (en) * 2012-10-30 2014-07-30 中国人民解放军理工大学 Single-channel monitor-free voice and noise separating method based on low-rank and sparse matrix decomposition
CN103559888B (en) * 2013-11-07 2016-10-05 航空电子系统综合技术重点实验室 Based on non-negative low-rank and the sound enhancement method of sparse matrix decomposition principle
US10013975B2 (en) * 2014-02-27 2018-07-03 Qualcomm Incorporated Systems and methods for speaker dictionary based speech modeling
US10249292B2 (en) * 2016-12-14 2019-04-02 International Business Machines Corporation Using long short-term memory recurrent neural network for speaker diarization segmentation
CN107886967B (en) * 2017-11-18 2018-11-13 中国人民解放军陆军工程大学 A kind of bone conduction sound enhancement method of depth bidirectional gate recurrent neural network

Also Published As

Publication number Publication date
CN108986834A (en) 2018-12-11

Similar Documents

Publication Publication Date Title
CN108986834B (en) Bone conduction voice blind enhancement method based on codec framework and recurrent neural network
US10777215B2 (en) Method and system for enhancing a speech signal of a human speaker in a video using visual information
Dave Feature extraction methods LPC, PLP and MFCC in speech recognition
CN107886967B (en) A kind of bone conduction sound enhancement method of depth bidirectional gate recurrent neural network
Akbari et al. Lip2audspec: Speech reconstruction from silent lip movements video
EP1536414B1 (en) Method and apparatus for multi-sensory speech enhancement
CN110085245B (en) Voice definition enhancing method based on acoustic feature conversion
CN111833896B (en) Voice enhancement method, system, device and storage medium for fusing feedback signals
CN109767756B (en) Sound characteristic extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficient
CN108108357B (en) Accent conversion method and device and electronic equipment
EP1250700A1 (en) Speech parameter compression
GB2560174A (en) A feature extraction system, an automatic speech recognition system, a feature extraction method, an automatic speech recognition method and a method of train
CN111192598A (en) Voice enhancement method for jump connection deep neural network
CN110867192A (en) Speech enhancement method based on gated cyclic coding and decoding network
CN111986679A (en) Speaker confirmation method, system and storage medium for responding to complex acoustic environment
Sharma et al. Study of robust feature extraction techniques for speech recognition system
Chetouani et al. Investigation on LP-residual representations for speaker identification
CN105679321A (en) Speech recognition method and device and terminal
CN112185405B (en) Bone conduction voice enhancement method based on differential operation and combined dictionary learning
CN109215635B (en) Broadband voice frequency spectrum gradient characteristic parameter reconstruction method for voice definition enhancement
Ganapathy et al. Robust spectro-temporal features based on autoregressive models of hilbert envelopes
CN111968627B (en) Bone conduction voice enhancement method based on joint dictionary learning and sparse representation
Zheng et al. Throat microphone speech enhancement via progressive learning of spectral mapping based on lstm-rnn
Pickersgill et al. Investigation of DNN prediction of power spectral envelopes for speech coding & ASR
Zheng et al. Bandwidth extension WaveNet for bone-conducted speech enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant