CN108986834A - Blind enhancement method for bone-conducted speech based on a codec framework and a recurrent neural network - Google Patents
- Publication number
- CN108986834A CN108986834A CN201810960512.0A CN201810960512A CN108986834A CN 108986834 A CN108986834 A CN 108986834A CN 201810960512 A CN201810960512 A CN 201810960512A CN 108986834 A CN108986834 A CN 108986834A
- Authority
- CN
- China
- Prior art keywords
- voice
- training
- bone conduction
- spectrum
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/005—Correction of errors induced by the transmission channel, if related to the coding algorithm
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention discloses a blind enhancement method for bone-conducted (BC) speech based on a codec (encoder-decoder) framework and a recurrent neural network. First, air-conducted (AC) and bone-conducted speech features are extracted and the extracted feature data are aligned in pre-processing. The encoder is then pre-trained with the BC speech features as training input and the AC speech-dictionary combination coefficients as the training target, and the stored parameters initialize the encoder for the next step. A decoder model based on a local attention mechanism is constructed; taking the encoder output as the decoder input and the AC speech features as the training target, the codec models are trained jointly and the model parameters are stored. Finally, the BC speech features to be enhanced are extracted, converted by the trained codec network, and the network output is de-normalized and inverse-transformed to obtain the enhanced time-domain speech. The invention addresses the recovery of high-frequency components, the restoration of BC unvoiced segments, and recovery under strong background noise, improving the enhancement quality of bone-conducted speech.
Description
Technical field
The invention belongs to the field of speech signal processing, and relates to a blind enhancement method for bone-conducted speech based on a codec framework and a deep long short-term memory recurrent neural network.
Background art
A bone-conduction microphone is a non-acoustic sensor device: when a person speaks, vocal-fold vibration is transmitted to the larynx and skull, and this kind of microphone picks up that vibration signal and converts it into an electrical signal to obtain speech. Unlike conventional air-conduction microphone speech, ambient noise can hardly affect such non-acoustic sensors, so bone-conducted speech shields noise at the source and has very strong noise immunity; it is applied both militarily and in civilian use. For example, many countries equip military hardware such as armed helicopters and tanks with bone-conduction communication systems, and in the U.S. "Future Soldier" individual-soldier system the bone-conduction headset is an important communication tool. On the civilian side, the U.S. company iASUS has developed throat microphones and bone-conduction headsets for extreme sports such as car and motorcycle racing, and Japanese companies such as Panasonic and Sony have developed a variety of bone-conduction communication products applied in fields such as fire fighting, forestry, petroleum exploration and extraction, mining, emergency rescue, special duty, and engineering construction.
Although bone-conducted speech effectively resists ambient-noise interference, the low-pass nature of signal conduction through the human body and the inherent characteristics of the vibration signal leave BC speech with thick, heavy low frequencies, missing high frequencies, attenuated mid frequencies, and missing breath and nasal sounds; the speech therefore sounds dull and insufficiently clear, seriously degrading auditory perception. In addition, BC speech can pick up non-acoustic physical noise, for example friction between the device and the skin it touches, strong wind friction during extreme sports, or the noise introduced by chewing or tooth impacts; these noises also reduce communication quality. Research on BC speech enhancement algorithms therefore has important theoretical significance and practical value for pushing bone-conduction microphone products further toward practical use and for improving voice communication quality in high-noise environments.
At present there are three comparatively typical methods for blind BC speech enhancement: unsupervised spectrum extension, equalization, and spectral-envelope conversion.
Unsupervised spectrum extension (Bouserhal R E, Falk T H, Voix J. In-ear microphone speech quality enhancement via adaptive filtering and artificial bandwidth extension [J]. Journal of the Acoustical Society of America, 2017) assumes that BC and AC speech share a consistent formant structure, or that the low- and high-frequency bands of speech share a consistent harmonic structure. Exploiting this structural property, the low-frequency spectrum can be extended directly to obtain enhanced high-frequency formants or harmonics, realizing blind BC speech enhancement.
The idea of equalization is to find the inverse g(t) of the transmission-channel transfer function h(t) and thereby recover the AC speech signal from the BC speech signal. Equalization was first proposed by Shimamura (Shimamura T, Tamiya T. A reconstruction filter for bone-conducted speech [C]. Circuits and Systems, 2005. Midwest Symposium on, 2005: 1847-1850): g(t) is modeled and an inverse filter is constructed to realize BC speech enhancement. Equalization preserves the low-frequency harmonic structure of speech and effectively compresses the excessive low-frequency energy of BC speech, but it has difficulty recovering the high-frequency components.
At present most blind BC enhancement uses methods based on spectral-envelope conversion (Turan M A T, Erzin E. Source and Filter Estimation for Throat-Microphone Speech Enhancement [J], 2016; Mohammadi S H, Kain A. An overview of voice conversion systems [J], 2017). The basic idea of the spectral-envelope approach follows the source-filter model of speech: speech is decomposed into excitation features and spectral-envelope features. In the training stage, BC and AC speech data are analyzed to extract excitation and spectral-envelope features, and the conversion relationship between the spectral envelopes is established by training a conversion model; in the enhancement stage, the BC speech to be enhanced is decomposed into excitation and spectral-envelope features, the AC spectral envelope is estimated from the BC envelope with the trained model, and the estimated envelope is combined with the original BC excitation to synthesize the enhanced speech.
The above analysis-synthesis approaches based on the source-filter model have made some progress in BC speech enhancement, but they generally suffer from difficult feature selection, poor performance in high-noise environments, and inaccurate recovery of the high-frequency components, leaving the enhanced speech with heavy low frequencies, insufficient clarity and intelligibility, and processing noise. Some research has begun to use analysis-synthesis based on a signal model: the speech signal is divided into a high-dimensional amplitude spectrum and a phase, and deep-learning techniques establish the relationship between BC speech and the clean AC high-dimensional amplitude spectrum, achieving good results in restoring BC speech. However, because no dictionary is used to introduce structural information, a series of problems remain: heavy low frequencies, incomplete recovery of high-frequency information, and insufficiently clear speech.
Summary of the invention
The purpose of the present invention is to provide a blind enhancement method for bone-conducted speech based on a codec framework and a recurrent neural network. The method is data-driven: model parameters are obtained by training, and the trained model is then used to enhance BC speech. It addresses the recovery of high-frequency components, the restoration of BC unvoiced segments, and recovery under strong background noise, thereby promoting the clarity and intelligibility of BC speech and further improving its enhancement quality.
The technical solution that realizes the aim of the invention is a blind enhancement method for bone-conducted speech based on a codec framework and a recurrent neural network, comprising the following steps:
Data pre-processing: extract the AC and BC speech features, align the extracted feature data in pre-processing, and compute the AC speech dictionary by sparse non-negative matrix factorization (Sparse NMF) on the AC speech feature data;
Pre-training of the encoder: with the BC speech features as training input and the AC speech-dictionary combination coefficients as training target, train the encoder model with the non-negative sparse long short-term memory recurrent network (NS-LSTM), and store the trained deep neural network parameters as the initialization of the encoder in the next step;
Joint training of the codec: construct the decoder model based on the local attention mechanism, take the encoder output as the decoder input and the AC speech features as the training target, train the codec models jointly, and store the model parameters;
Speech enhancement: extract the BC speech features to be enhanced, convert them with the codec network trained in the preceding steps, then de-normalize and inverse-transform the network output to obtain the enhanced time-domain speech.
Compared with the prior art, the remarkable advantages of the invention are that the speech dictionary and the non-negative sparse recurrent neural network are applied to the BC speech enhancement task; a codec framework based on a local attention mechanism is constructed; the approach is data-driven, the network model parameters are obtained by training, and the trained model effectively improves BC speech enhancement quality. Specifically:
(1) the structured information provided by the speech dictionary computed by sparse NMF is exploited to better rebuild the high-frequency components of speech;
(2) the encoder output is a vector of linear combination coefficients over the speech dictionary, and the dictionary is extracted from genuinely clean AC speech by Sparse NMF, so the encoder has good noise robustness and can automatically remove the noise in BC speech during encoding;
(3) the sparse non-negative recurrent network models, on the basis of the dictionary, the complex non-linear relationship of the BC-to-AC feature conversion; compared with conventional neural networks, its specially designed cell structure can effectively learn long-term dependencies in the sequence and establish the mapping to the speech dictionary;
(4) the decoder network based on the local attention mechanism determines the decoder's input content through training, giving it the ability to restore BC unvoiced segments (whose corresponding AC speech is not necessarily silent) and the same robustness against strong noise, further improving BC speech recovery quality.
The invention is described in further detail below with reference to the accompanying drawings.
Description of the drawings
Fig. 1 is the flow chart of the blind BC speech enhancement method of the invention based on the codec framework and recurrent neural network.
Fig. 2 is a diagram of the codec architecture used by the invention.
Fig. 3 is a schematic diagram of the non-negative sparse NS-LSTM cell structure.
Fig. 4 is an example of blind BC speech enhancement by the invention.
Detailed description
With reference to Figs. 1 and 2, the blind BC speech enhancement method of the invention is implemented in two stages: a training stage and an enhancement stage. The training stage comprises steps 1, 2, and 3; the enhancement stage comprises steps 4 and 5. The speech data of the two stages do not overlap, i.e. no sentence has identical speech content in both stages.
The first stage is the training stage: the neural network model is trained on the training data.
Step 1: extract the air-conduction (Air Conduction, AC) and bone-conduction (Bone Conduction, BC) speech amplitude-spectrum features, and pre-process the extracted features to meet the input requirements of the neural network. The first two processing stages are consistent with the data pre-processing steps of the patent "Single-channel unsupervised speech-noise separation method based on low-rank and sparse matrix decomposition" (CN 102915742B). To reduce the dynamic range of the extracted amplitude-spectrum features, log-amplitude spectral features are used. The steps are:
(1) The speech data are AC/BC pairs recorded by the same speaker wearing AC and BC microphone devices simultaneously; the AC speech is denoted A and the BC speech B. The time-domain signals y(A) and y(B) are transformed to the time-frequency domain with the short-time Fourier transform (STFT):
1. Frame and window the time-domain signals y(A), y(B); the window function is a Hamming window, the frame length N is a power of two, and the inter-frame hop is H.
2. Apply a K-point discrete Fourier transform to each frame to obtain the time-frequency spectra Y_A(k,t), Y_B(k,t):
Y(k,t) = Σ_{n=0}^{N-1} y(n + tH) h(n) e^{-j2πnk/K}
where k = 0, 1, ..., K-1 indexes the discrete frequencies, K is the number of DFT frequency points with K = N, t = 0, 1, ..., T-1 indexes the frames, T is the total number of frames, and h(n) is the Hamming window function;
(2) Take the absolute value of the spectrum Y(k,t) to obtain the amplitude spectra M_A, M_B:
M(k,t) = |Y(k,t)|
(3) Take the natural logarithm (ln) of the amplitude spectrum M(k,t) to obtain the log-amplitude spectra L_A, L_B:
L(k,t) = ln M(k,t)
(4) Compute the AC speech dictionary D by sparse non-negative matrix factorization (Sparse Non-negative Matrix Factorization, Sparse NMF) on the clean AC log-amplitude spectral feature matrix.
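As a concrete illustration, the framing, windowing, DFT, and log steps of Step 1 can be sketched in Python with NumPy. The function name and the small ε floor are our own; the frame parameters follow the embodiment's 8 kHz / 32 ms frame / 10 ms hop setting, and the Sparse NMF dictionary step is omitted:

```python
import numpy as np

def log_magnitude_spectrogram(y, frame_len=256, hop=80):
    """Frame the signal, apply a Hamming window, and take a K-point DFT
    with K = N, as in the patent. rfft keeps the K/2 + 1 non-redundant
    bins (129 for K = 256); the phase is returned for later resynthesis."""
    win = np.hamming(frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[t*hop : t*hop + frame_len] * win
                       for t in range(n_frames)])
    Y = np.fft.rfft(frames, n=frame_len, axis=1)  # time-frequency spectrum Y(k,t)
    M = np.abs(Y)                                 # amplitude spectrum M(k,t)
    L = np.log(M + 1e-10)                         # log-amplitude; eps avoids log(0)
    return L, np.angle(Y)
```

With the embodiment's settings this yields one 129-dimensional log-amplitude vector per 10 ms frame, which matches the feature dimension quoted for Fig. 4.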
Step 2: pre-training of the encoder. The encoder (Encoder) network has three layers (see Fig. 2): a linear layer (Linear), a long short-term memory recurrent layer (LSTM), and a non-negative sparse long short-term memory recurrent layer (NS-LSTM). During training, the normalized (Normalization) BC log-amplitude spectral features are the training input and the AC log-amplitude spectral features are the training target; the neural network model is trained with back-propagation through time (Back Propagation Through Time, BPTT), and the trained network parameters are stored. The NS-LSTM cell structure and the encoder pre-training procedure are as follows:
(1) The non-negative sparse long short-term memory neural network is a variant of the long short-term memory model (LSTM): by introducing non-negativity and sparsity control variables, it produces output vectors that satisfy these constraints. Its cell (shown in Fig. 3) is described by the following equations:
f_t = σ(W_fx x_t + W_fh h_{t-1} + b_f)
i_t = σ(W_ix x_t + W_ih h_{t-1} + b_i)
g_t = σ(W_gx x_t + W_gh h_{t-1} + b_g)
o_t = σ(W_ox x_t + W_oh h_{t-1} + b_o)
S_t = g_t ⊙ i_t + S_{t-1} ⊙ f_t
h_t = sh_{D,u}(ψ(φ(S_t)) ⊙ o_t)
where σ(·) is the logistic sigmoid, φ(x) = tanh(x), ψ(x) = ReLU(x) = max(0, x) is the non-negativity constraint, and sh_{D,u}(x) = D(tanh(x + u) + tanh(x - u)) is the sparse activation function;
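Under the stated update equations, a single NS-LSTM step might look like the sketch below. The parameter layout (a dict of eight weight matrices and four biases), the scalar treatment of D in sh_{D,u}, and u = 1 are our assumptions; the patent defines only the update rules:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ns_lstm_cell(x, h_prev, s_prev, W, b, D=1.0, u=1.0):
    """One step of the non-negative sparse LSTM cell (hypothetical layout)."""
    f = sigmoid(W['fx'] @ x + W['fh'] @ h_prev + b['f'])  # forget gate
    i = sigmoid(W['ix'] @ x + W['ih'] @ h_prev + b['i'])  # input gate
    g = sigmoid(W['gx'] @ x + W['gh'] @ h_prev + b['g'])  # candidate
    o = sigmoid(W['ox'] @ x + W['oh'] @ h_prev + b['o'])  # output gate
    s = g * i + s_prev * f                                # cell state S_t
    z = np.maximum(0.0, np.tanh(s)) * o                   # psi(phi(S_t)) ⊙ o_t
    h = D * (np.tanh(z + u) + np.tanh(z - u))             # sparse activation sh_{D,u}
    return h, s
```

Note that since z ≥ 0 after the ReLU, tanh(z + u) + tanh(z - u) ≥ 0 for u ≥ 0, so the cell output is non-negative by construction, which is the point of the constraint.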
(2) Dropout regularization: to improve the robustness of the model, dropout regularization is applied during network training; it improves generalization by randomly removing neural units. With drop ratio p (e.g. p set to 0.2), the dropout equations are:
r_j^l ~ Bernoulli(1 - p)
ỹ_j^l = r_j^l · y_j^l
z_j^{l+1} = w_j^{l+1} ỹ^l + b_j^{l+1}
y_j^{l+1} = f(z_j^{l+1})
where r_j^l indicates whether the j-th neuron of layer l is kept; the Bernoulli(1 - p) distribution takes the value 1 with probability 1 - p and 0 with probability p; y_j^l is the output value of the j-th neuron of layer l; ỹ_j^l is y_j^l multiplied by r_j^l, i.e. it equals y_j^l or 0; w_j^{l+1} is the network weight and b_j^{l+1} the bias; f denotes the activation unit, so y_j^{l+1} is the neuron output after the activation function.
(3) Encoder training: the training loss objective is the mean squared error between the network output and the corresponding AC log-amplitude spectrum:
Loss = (1/T) Σ_t ‖ M c_t - L_A(·, t) ‖²
where c is the dictionary-coefficient vector output by the network, M = [D, I, -I] is the set consisting of the AC speech dictionary and the compensation dictionaries, and I is the diagonal matrix with diagonal elements 1 and remaining elements 0, which compensates the linear combination of dictionary atoms and improves representation accuracy. b% of the training data (e.g. b set between 10 and 20) is held out as a validation set, and the loss function is minimized during training. The network weights are initialized at random in [-0.1, 0.1]. The optimizer is RMSProp (Root Mean Square Propagation), a variant of stochastic gradient descent (Stochastic Gradient Descent, SGD), with initial learning rate lr (e.g. lr set to 0.01); when the validation loss stops decreasing, the learning rate is multiplied by a factor ratio (e.g. ratio set to 0.1), with momentum set to momentum (e.g. 0.9). Training stops when the validation loss has not decreased for i consecutive epochs (e.g. i set to 3), and the neural network model parameters with the lowest validation loss are saved, denoted S'.
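The role of M = [D, I, -I] can be made concrete: the encoder emits a coefficient vector c per frame, and the loss compares the reconstruction M c against the AC log-amplitude frame. A sketch with hypothetical dimensions (the real dictionary would come from Sparse NMF):

```python
import numpy as np

def dictionary_synthesis(D, c):
    """Reconstruct a log-amplitude frame from coefficients c using the
    augmented dictionary M = [D, I, -I]; the identity blocks add a signed
    per-bin compensation term on top of the learned AC atoms."""
    K = D.shape[0]
    M = np.hstack([D, np.eye(K), -np.eye(K)])
    return M @ c

def encoder_loss(D, C, L_ac):
    """MSE between dictionary reconstruction and the AC log-amplitude
    spectrum; C holds one coefficient column per frame."""
    K = D.shape[0]
    M = np.hstack([D, np.eye(K), -np.eye(K)])
    return np.mean((M @ C - L_ac) ** 2)
```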
Step 3: joint training of the codec. The decoder (Decoder) structure is shown in Fig. 2 and comprises a two-layer network: a recurrent layer (LSTM) and a linear layer (Linear). a_i denotes the input of the decoder network based on the local attention mechanism:
a_i = Σ_{j∈N(i)} ω_ij e_j
where e_j is the j-th encoder output and N(i) denotes a neighbourhood of encoder outputs around the i-th position; 10-20 values may be taken. ω_ij are the weighted combination coefficients of these neighbouring outputs:
ω_ij = exp(score_ij) / Σ_{k∈N(i)} exp(score_ik)
score_ij is the weighted score of the j-th encoder output e_j for the i-th decoder input a_i, normalized to give the linear-combination weights. W_a is the parameter matrix of a linear layer: the decoder output at time i-1 is passed through this linear layer and its inner product with e_j yields the weighted score, score_ij = (W_a d_{i-1})ᵀ e_j.
The role of the decoder is to learn, by training, a signal closer to the true speech on the basis of the speech synthesized by the dictionary coding. The decoder is trained jointly with the encoder and optimized epoch by epoch; the loss function is the mean squared error against the log-amplitude spectral features of the real speech signal, and the optimal network parameters are obtained by gradient descent and stored locally, denoted S.
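The local-attention computation above can be sketched as follows; the handling of the window at sequence edges, the symmetric window shape, and the array shapes are our assumptions on top of the patent's formulas:

```python
import numpy as np

def local_attention_input(E, d_prev, i, Wa, width=10):
    """Decoder input a_i as a softmax-weighted sum of encoder outputs e_j
    in a local window N(i) around position i (patent: 10-20 values).
    Scores are inner products of the projected previous decoder output
    W_a d_{i-1} with each e_j."""
    T = E.shape[0]
    lo, hi = max(0, i - width // 2), min(T, i + width // 2 + 1)
    q = Wa @ d_prev                      # project previous decoder output
    scores = E[lo:hi] @ q                # score_ij = <e_j, W_a d_{i-1}>
    w = np.exp(scores - scores.max())
    w /= w.sum()                         # softmax over the neighbourhood
    return w @ E[lo:hi]                  # a_i = sum_j omega_ij e_j
```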
The second stage is the enhancement stage: the trained codec network model is used to enhance the BC speech.
Step 4: extract the BC speech features to be enhanced, and normalize them with the statistics of the aligned BC log-amplitude spectra obtained in step 1, namely the mean μ_B and variance σ_B².
First, the BC speech data B_E to be enhanced is transformed from the time-domain waveform to the time-frequency domain Y_{B_E} by the Fourier transform; the feature-extraction process is shown in the enhancement part of Fig. 1. Compared with the feature extraction of step 1, this step additionally extracts the phase: after obtaining the time-frequency spectrum Y_{B_E}, not only the amplitude spectrum M_{B_E} but also the phase P_{B_E} is computed:
M_{B_E}(k,t) = |Y_{B_E}(k,t)|
P_{B_E}(k,t) = atan2(imag(Y_{B_E}(k,t)), real(Y_{B_E}(k,t)))
where atan2(·) is the four-quadrant arctangent and imag(·) and real(·) are the imaginary and real parts of the time-frequency spectrum. The log-amplitude spectrum L_{B_E} = ln M_{B_E} is then normalized with the mean μ_B and standard deviation σ_B of the BC log-amplitude spectra obtained in the training stage:
L̄_{B_E} = (L_{B_E} - μ_B) / σ_B
Step 5: during enhancement, the codec neural network trained in the first stage converts the BC speech features extracted in step 4; the network output is then de-normalized and inverse-transformed to obtain the enhanced time-domain speech.
First, the normalized feature L̄_{B_E} is fed into the trained codec neural network model S, and the network output, i.e. the enhanced feature L̂, is computed.
Then the enhanced feature L̂ is de-normalized and inverse-transformed as follows:
(1) With the mean μ_A and standard deviation σ_A of the AC log-amplitude spectra from the training stage, the network output L̂ is de-normalized to obtain the log-amplitude spectrum L̃:
L̃(k,t) = L̂(k,t) · σ_A + μ_A
(2) Exponentiating the log-amplitude spectrum L̃ gives the amplitude spectrum M̂:
M̂(k,t) = e^{L̃(k,t)}
(3) The amplitude spectrum M̂ and the phase information P_{B_E} give the time-frequency spectrum Ŷ:
Ŷ(k,t) = M̂(k,t) e^{jP_{B_E}(k,t)}
(4) The spectrum Ŷ is transformed back to the time domain by the inverse Fourier transform and the de-framing overlap-add formula, finally yielding the enhanced time-domain speech signal y(B_E).
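Steps (1)-(4) of the enhancement stage map directly onto a de-normalize / exponentiate / phase-reattach / overlap-add pipeline. A sketch, where the weighted overlap-add normalization is our simplification of the de-framing formula and the frame parameters follow the embodiment:

```python
import numpy as np

def reconstruct(L_hat_norm, phase, mean, std, frame_len=256, hop=80):
    """De-normalize the enhanced log-amplitude spectrum, rebuild the
    complex spectrum with the BC phase, and overlap-add the inverse
    DFT frames back to a time-domain waveform."""
    L_hat = L_hat_norm * std + mean          # (1) de-normalize
    M_hat = np.exp(L_hat)                    # (2) amplitude spectrum
    Y_hat = M_hat * np.exp(1j * phase)       # (3) time-frequency spectrum
    frames = np.fft.irfft(Y_hat, n=frame_len, axis=1)
    win = np.hamming(frame_len)
    y = np.zeros(hop * (len(frames) - 1) + frame_len)
    norm = np.zeros_like(y)
    for t, fr in enumerate(frames):          # (4) weighted overlap-add
        y[t*hop : t*hop + frame_len] += fr * win
        norm[t*hop : t*hop + frame_len] += win ** 2
    return y / np.maximum(norm, 1e-10)
```

Paired with the forward transform of step 1, this pipeline round-trips a signal almost exactly when the spectrum is left unmodified, which is a useful sanity check before inserting the network.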
Embodiment
Fig. 4 illustrates an embodiment of the present invention. The example utterances are about 3.5 s and 4 s long, the speech sampling frequency is 8 kHz, the frame length is set to 32 ms with a 10 ms frame shift, and a discrete Fourier transform with K = 256 frequency points is applied to each frame, giving a 129-dimensional log-amplitude spectrum. In Fig. 4-1 and Fig. 4-2, (a) is the spectrogram of the bone-conduction speech, (b) is the spectrogram after enhancement with an LSTM deep neural network, and (c) is the spectrogram after enhancement by the present invention. It can be seen that the high-frequency components of the enhanced speech, as well as the missing aspirated and fricative sounds, are effectively recovered, with a clear performance gain over the LSTM algorithm; subjective listening tests likewise indicate that the present invention achieves a good bone-conduction speech enhancement effect.
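The embodiment's dimensions follow directly from the sampling rate: at 8 kHz, a 32 ms frame is 256 samples, a 10 ms shift is 80 samples, and a 256-point DFT of a real-valued frame has K/2 + 1 = 129 unique bins. A quick check:

```python
fs = 8000                      # sampling frequency, Hz
frame_len = int(0.032 * fs)    # 32 ms frame -> 256 samples
hop = int(0.010 * fs)          # 10 ms shift -> 80 samples
K = frame_len                  # DFT size K = N = 256
n_bins = K // 2 + 1            # one-sided spectrum: 129 log-amplitude dimensions
print(frame_len, hop, n_bins)
```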
Claims (6)
1. A blind bone-conduction speech enhancement method based on a codec framework and a recurrent neural network, characterized by the following steps:
Data preprocessing: extract air-conduction (AC) and bone-conduction (BC) speech amplitude-spectrum features, align and preprocess the extracted speech feature data, and compute an air-conduction speech dictionary on the air-conduction speech feature data using sparse non-negative matrix factorization;
Encoder pre-training: using the bone-conduction speech features as training input and the air-conduction dictionary combination coefficients as training target, train the encoder model with a non-negative, sparse long short-term memory recurrent neural network, and store the trained deep neural network parameters as the initialization parameters of the encoder in the next step;
Codec joint training: build a decoder model based on a local attention mechanism, take the encoder output as the decoder input and the air-conduction speech features as training target, jointly train the codec model, and store the model parameters;
Speech enhancement: extract the bone-conduction speech features to be enhanced, perform the feature conversion with the codec neural network trained in the above steps, then denormalize and inverse-transform the network output, finally obtaining the enhanced time-domain speech.
2. The method according to claim 1, characterized in that the air-conduction and bone-conduction speech amplitude-spectrum features are extracted, and the extracted speech features undergo data preprocessing to meet the input requirements of the neural network, wherein log-amplitude spectrum features are used in order to reduce the dynamic range of the extracted amplitude-spectrum features:
(1) The speech data are AC/BC speech pairs recorded by the same person simultaneously wearing AC and BC microphone devices; the AC speech is denoted A and the BC speech is denoted B. The AC and BC time-domain signals y(A) and y(B) are each transformed to the time-frequency domain by the short-time Fourier transform, with the following specific steps:
1. Frame and window the speech time-domain signals y(A), y(B); the window function is a Hamming window, the frame length is N (N is an integer power of 2), and the inter-frame shift is H;
2. Apply a K-point discrete Fourier transform to the framed speech to obtain the time-frequency spectra Y_A(k, t), Y_B(k, t):

Y(k, t) = Σ_{n=0}^{N−1} y(tH + n)·h(n)·e^{−j2πkn/N}

where k = 0, 1, …, K−1 denotes the discrete frequency points, K is the number of frequency points of the discrete Fourier transform with K = N, t = 0, 1, …, T−1 denotes the frame index, T is the total number of frames, and h(n) is the Hamming window function;
(2) Take the absolute value of the spectrum Y(k, t) to obtain the amplitude spectra M_A, M_B:
M(k, t) = |Y(k, t)|
(3) Take the logarithm with base e (ln) of the amplitude spectrum M(k, t) to obtain the log-amplitude spectra L_A, L_B:
L(k, t) = ln M(k, t)
(4) Compute the air-conduction speech dictionary D on the clean air-conduction log-amplitude-spectrum feature matrix using sparse non-negative matrix factorization.
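The claim does not spell out the sparse non-negative matrix factorization of step (4). One common formulation uses multiplicative updates with an L1 penalty on the coefficients; the sketch below is that generic algorithm under our own assumptions (update rules, penalty weight, unit-norm atoms), not the patent's exact procedure:

```python
import numpy as np

def sparse_nmf(V, n_atoms, n_iter=100, lam=0.1, seed=0):
    """Factor a non-negative matrix V (features x frames) as V ~= D @ C,
    with an L1 penalty lam encouraging sparse coefficients C."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    D = rng.random((m, n_atoms)) + 1e-3
    C = rng.random((n_atoms, n)) + 1e-3
    for _ in range(n_iter):
        C *= (D.T @ V) / (D.T @ D @ C + lam + 1e-9)   # coefficient update (+lam: sparsity)
        D *= (V @ C.T) / (D @ (C @ C.T) + 1e-9)       # dictionary update
        D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-9  # unit-norm atoms
    return D, C
```

Since log-amplitude spectra can be negative, some non-negativity shift or offset of the feature matrix would be needed in practice; the patent does not specify one.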
3. The method according to claim 1, characterized in that, in the encoder pre-training, the pre-trained encoder network consists of three layers: a linear layer (Linear), a long short-term memory recurrent layer (LSTM), and a non-negative sparse long short-term memory recurrent layer (NS-LSTM). During training, the normalized log-amplitude spectrum features of the bone-conduction speech are used as training input and the log-amplitude spectrum features of the air-conduction speech as training target; the neural network model is trained with the back-propagation-through-time algorithm, and the trained neural network parameters are stored. The NS-LSTM cell structure and the encoder network pre-training process are as follows:
(1) The non-negative sparse long short-term memory neural network model is a variant of the long short-term memory model (LSTM); by introducing non-negative and sparse control variables, it produces output vectors satisfying the constraints. Its cell is represented by the following formulas:
f_t = σ(W_fx·x_t + W_fh·h_{t−1} + b_f)
i_t = σ(W_ix·x_t + W_ih·h_{t−1} + b_i)
g_t = σ(W_gx·x_t + W_gh·h_{t−1} + b_g)
o_t = σ(W_ox·x_t + W_oh·h_{t−1} + b_o)
where σ(x) = 1/(1 + e^{−x}) is the sigmoid function, φ(x) = tanh(x), ψ(x) = ReLU(x) = max(0, x) is the non-negativity constraint, and s_{h(D,u)}(x) = D(tanh(x + u) + tanh(x − u)) is the sparse activation function;
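A minimal NumPy sketch of one NS-LSTM step follows. The four gates use the formulas above; applying the ReLU non-negativity constraint and the sparse activation s_{h(D,u)} to the hidden output is our reading of the claim, since the cell-state and output equations were not reproduced in the text:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sparse_act(x, D=1.0, u=1.0):
    # s_{h(D,u)}(x) = D*(tanh(x+u) + tanh(x-u)): near-zero plateau around x = 0
    return D * (np.tanh(x + u) + np.tanh(x - u))

def ns_lstm_step(x, h, c, W, b):
    """One step of a non-negative sparse LSTM cell (illustrative).
    W: dict of weight matrices, b: dict of bias vectors."""
    f = sigmoid(W['fx'] @ x + W['fh'] @ h + b['f'])   # forget gate
    i = sigmoid(W['ix'] @ x + W['ih'] @ h + b['i'])   # input gate
    g = sigmoid(W['gx'] @ x + W['gh'] @ h + b['g'])   # candidate (sigma, as in the claim)
    o = sigmoid(W['ox'] @ x + W['oh'] @ h + b['o'])   # output gate
    c_new = f * c + i * g                             # cell state update
    h_new = np.maximum(0.0, sparse_act(o * np.tanh(c_new)))  # non-negative, sparse output
    return h_new, c_new
```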
(2) Dropout regularization: the drop ratio is set by p, and the dropout regularization formulas are:

r_j^(l) ∼ Bernoulli(p)
ỹ^(l) = r^(l) ⊙ y^(l)
z_j^(l+1) = w_j^(l+1)·ỹ^(l) + b_j^(l+1)
y_j^(l+1) = f(z_j^(l+1))

where r_j^(l) indicates whether the j-th neuron of layer l is kept; Bernoulli(p) denotes the Bernoulli distribution with probability p, which takes the value 1 with probability p and 0 with probability 1 − p; y_j^(l) is the output value of the j-th neuron of layer l; ỹ_j^(l) is y_j^(l) multiplied by r_j^(l), i.e. equal to y_j^(l) or 0; w_j^(l+1) is the network weight, b_j^(l+1) is the bias, f denotes the activation function, and y_j^(l+1) is the neuron output after the activation function;
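Note the claim's convention: p is the probability that a neuron is kept (the mask is 1 with probability p), not the probability that it is dropped. A minimal sketch of that masking step:

```python
import numpy as np

def dropout(y, p, rng=np.random.default_rng(0)):
    """Keep each neuron output with probability p, zero it otherwise
    (Bernoulli(p) mask, following the claim's convention for p)."""
    r = (rng.random(y.shape) < p).astype(y.dtype)  # r = 1 with probability p
    return r * y                                   # y_tilde equals y or 0
```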
(3) Encoder neural network training: the training loss objective function is the mean square error between the network output and the corresponding AC-speech log-amplitude spectrum:

Loss = (1/T)·Σ_t ‖M·c_t − L_A(t)‖²

where c is the dictionary coefficient vector, L_A(t) is the AC log-amplitude spectrum of frame t, M = [D, I, −I] is the set formed by the air-conduction speech dictionary and the compensation dictionaries, and I is the diagonal matrix whose diagonal elements are 1 and remaining elements are 0; the compensation dictionaries refine the linear combination of dictionary atoms to improve the representation precision. During training, b% of the training data serve as the validation set, and the loss function is minimized; the network weights are randomly initialized in [−0.1, 0.1]. Specifically, a variant of the stochastic gradient descent algorithm, root-mean-square propagation (RMSProp), is used: the initial learning rate is set to lr, the learning rate is multiplied by the factor ratio when the validation loss does not decrease, and the momentum is momentum; training stops when the validation loss does not decrease for i consecutive training epochs, and the parameters of the neural network model with the smallest validation loss are saved and denoted S'.
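Root-mean-square propagation, the optimizer named in the claim, divides each gradient by a running root-mean-square of past gradients. The self-contained sketch below shows the core update on a toy quadratic; the hyperparameters are illustrative, and the claim's plateau-based learning-rate decay and early stopping are omitted for brevity:

```python
import numpy as np

def rmsprop(grad_fn, w0, lr=0.01, decay=0.9, eps=1e-8, steps=2000):
    """Minimize a function given its gradient with RMSProp updates."""
    w, s = np.asarray(w0, dtype=float), 0.0
    for _ in range(steps):
        g = grad_fn(w)
        s = decay * s + (1 - decay) * g * g   # running mean of squared gradients
        w = w - lr * g / (np.sqrt(s) + eps)   # scale step by the root-mean-square
    return w

# toy use: minimize (w - 3)^2, whose gradient is 2*(w - 3)
w_opt = rmsprop(lambda w: 2 * (w - 3), np.array(0.0))
```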
4. The method according to claim 1, characterized in that, in the codec joint training, the decoder comprises a two-layer network structure, namely an LSTM recurrent layer and a linear layer (Linear). a_i denotes the input of the decoder network based on the local attention mechanism:

a_i = Σ_{j∈N(i)} ω_ij·e_j

where e_j is the j-th encoder output, N(i) denotes the neighborhood of the i-th encoder output, and ω_ij is the linear combination coefficient of these neighboring outputs, calculated as:

ω_ij = exp(score_ij) / Σ_{j'∈N(i)} exp(score_ij')

score_ij is the weighted score of the j-th encoder output e_j for the i-th decoder input a_i, and the linear combination weights are obtained after normalization; W_a is the parameter matrix of a linear layer, and the decoder output at moment i − 1, passed through this linear layer, is taken in inner product with e_j to compute the weighted score score_ij;
The role of the decoder is, through training, to obtain a signal closer to the true speech on the basis of the speech synthesized by the dictionary coding; the decoder is jointly trained and optimized with the encoder, optimizing epoch by epoch, with a mean square error loss function constructed from the log-amplitude-spectrum features of the true speech signal; the optimal network parameters are obtained by gradient descent and stored locally, denoted S.
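The local attention computation of this claim can be sketched as follows: scores over a neighborhood N(i) of encoder outputs, softmax-normalized into weights ω_ij, then a weighted sum a_i. The neighborhood width and the bilinear score form e_j · (W_a s_{i−1}) reflect our reading of the claim, not a stated specification:

```python
import numpy as np

def local_attention(E, s_prev, Wa, i, width=2):
    """E: encoder outputs, shape (T, d); s_prev: decoder output at step i-1, shape (d,).
    Returns a_i, the attention-weighted combination of encoder outputs near i."""
    lo, hi = max(0, i - width), min(len(E), i + width + 1)   # neighborhood N(i)
    scores = E[lo:hi] @ (Wa @ s_prev)        # score_ij = e_j . (Wa s_{i-1})
    w = np.exp(scores - scores.max())        # softmax normalization ->
    w /= w.sum()                             # combination weights omega_ij
    return w @ E[lo:hi]                      # a_i = sum_j omega_ij * e_j
```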
5. The method according to claim 1, characterized in that the bone-conduction speech features to be enhanced are extracted and normalized according to the statistics of the aligned log-amplitude spectrum L_B obtained during training, including its mean μ_B and variance σ_B²:
First, for the BC speech data B_E to be enhanced, the BC features to be enhanced are extracted by transforming the time-domain waveform to the time-frequency domain with the Fourier transform, with additional phase extraction; that is, after obtaining the time-frequency spectrum Y_E(k, t), not only the amplitude spectrum but also the phase is computed. From the time-frequency spectrum Y_E(k, t), the amplitude spectrum M_E(k, t) and the phase φ_E(k, t) are calculated as:

M_E(k, t) = |Y_E(k, t)|
φ_E(k, t) = atan2(imag(Y_E(k, t)), real(Y_E(k, t)))

where atan2(·) is the four-quadrant arctangent function, and imag(·) and real(·) denote the imaginary and real parts of the time-frequency spectrum. The log-amplitude spectrum L_E(k, t) = ln M_E(k, t) is computed from the amplitude spectrum and then normalized with the mean μ_B and variance σ_B² of the BC log-amplitude spectrum obtained in the training stage:

L'_E(k, t) = (L_E(k, t) − μ_B) / σ_B
6. The method according to claim 1, characterized in that, at enhancement time, the codec neural network trained in the first stage converts the extracted bone-conduction speech features, the network output is then denormalized and inverse-transformed, and the enhanced time-domain speech is finally obtained:
First, the normalized feature L'_E(k, t) is input into the trained codec neural network model S, and the network output, i.e. the enhanced feature L̂(k, t), is computed.
Second, the enhanced feature L̂(k, t) is denormalized and inverse-transformed to finally obtain the enhanced time-domain speech, with the following steps:
(1) Using the mean μ_A and variance σ_A² of the AC-speech log-amplitude spectrum from the training stage, the output L̂(k, t) obtained by the bidirectional gated recurrent neural network is denormalized to obtain the log-amplitude spectrum L̃(k, t):
L̃(k, t) = σ_A·L̂(k, t) + μ_A
(2) The log-amplitude spectrum L̃(k, t) is exponentiated to obtain the amplitude spectrum M̃(k, t):
M̃(k, t) = exp(L̃(k, t))
(3) The time-frequency spectrum Ỹ(k, t) is computed from the amplitude spectrum M̃(k, t) and the phase information φ_E(k, t):
Ỹ(k, t) = M̃(k, t)·exp(jφ_E(k, t))
(4) Using the inverse Fourier transform followed by overlap-add of the speech frames, the spectrum Ỹ(k, t) is transformed to the time domain, finally obtaining the enhanced time-domain speech signal y(B_E).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810960512.0A CN108986834B (en) | 2018-08-22 | 2018-08-22 | Bone conduction voice blind enhancement method based on codec framework and recurrent neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108986834A true CN108986834A (en) | 2018-12-11 |
CN108986834B CN108986834B (en) | 2023-04-07 |
Family
ID=64547287
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810960512.0A Active CN108986834B (en) | 2018-08-22 | 2018-08-22 | Bone conduction voice blind enhancement method based on codec framework and recurrent neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108986834B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003264883A (en) * | 2002-03-08 | 2003-09-19 | Denso Corp | Voice processing apparatus and voice processing method |
CN102915742A (en) * | 2012-10-30 | 2013-02-06 | 中国人民解放军理工大学 | Single-channel monitor-free voice and noise separating method based on low-rank and sparse matrix decomposition |
CN103559888A (en) * | 2013-11-07 | 2014-02-05 | 航空电子系统综合技术重点实验室 | Speech enhancement method based on non-negative low-rank and sparse matrix decomposition principle |
CN106030705A (en) * | 2014-02-27 | 2016-10-12 | 高通股份有限公司 | Systems and methods for speaker dictionary based speech modeling |
CN107886967A (en) * | 2017-11-18 | 2018-04-06 | 中国人民解放军陆军工程大学 | A kind of bone conduction sound enhancement method of depth bidirectional gate recurrent neural network |
US20180166066A1 (en) * | 2016-12-14 | 2018-06-14 | International Business Machines Corporation | Using long short-term memory recurrent neural network for speaker diarization segmentation |
Non-Patent Citations (1)
Title |
---|
LIU Bin et al., "Speech dereverberation method combining long short-term memory recurrent neural networks and non-negative matrix factorization", Journal of Signal Processing * |
Cited By (37)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109793511A (en) * | 2019-01-16 | 2019-05-24 | 成都蓝景信息技术有限公司 | Electrocardiosignal noise detection algorithm based on depth learning technology |
CN109975702A (en) * | 2019-03-22 | 2019-07-05 | 华南理工大学 | A kind of DC gear decelerating motor product examine method based on recirculating network disaggregated model |
CN109975702B (en) * | 2019-03-22 | 2021-08-10 | 华南理工大学 | Direct-current gear reduction motor quality inspection method based on circulation network classification model |
CN110085249A (en) * | 2019-05-09 | 2019-08-02 | 南京工程学院 | The single-channel voice Enhancement Method of Recognition with Recurrent Neural Network based on attention gate |
CN110111803A (en) * | 2019-05-09 | 2019-08-09 | 南京工程学院 | Based on the transfer learning sound enhancement method from attention multicore Largest Mean difference |
CN110136731A (en) * | 2019-05-13 | 2019-08-16 | 天津大学 | Empty cause and effect convolution generates the confrontation blind Enhancement Method of network end-to-end bone conduction voice |
CN110164465A (en) * | 2019-05-15 | 2019-08-23 | 上海大学 | A kind of sound enhancement method and device based on deep layer Recognition with Recurrent Neural Network |
CN110164465B (en) * | 2019-05-15 | 2021-06-29 | 上海大学 | Deep-circulation neural network-based voice enhancement method and device |
CN110648684A (en) * | 2019-07-02 | 2020-01-03 | 中国人民解放军陆军工程大学 | Bone conduction voice enhancement waveform generation method based on WaveNet |
WO2021012403A1 (en) * | 2019-07-25 | 2021-01-28 | 华南理工大学 | Dual sensor speech enhancement method and implementation device |
CN110675888A (en) * | 2019-09-25 | 2020-01-10 | 电子科技大学 | Speech enhancement method based on RefineNet and evaluation loss |
KR102429152B1 (en) * | 2019-10-09 | 2022-08-03 | 엘레복 테크놀로지 컴퍼니 리미티드 | Deep learning voice extraction and noise reduction method by fusion of bone vibration sensor and microphone signal |
CN110931031A (en) * | 2019-10-09 | 2020-03-27 | 大象声科(深圳)科技有限公司 | Deep learning voice extraction and noise reduction method fusing bone vibration sensor and microphone signals |
JP2022505997A (en) * | 2019-10-09 | 2022-01-17 | 大象声科(深セン)科技有限公司 | Deep learning voice extraction and noise reduction method that fuses bone vibration sensor and microphone signal |
KR20210043485A (en) * | 2019-10-09 | 2021-04-21 | 엘레복 테크놀로지 컴퍼니 리미티드 | Deep learning speech extraction and noise reduction method that combines bone vibration sensor and microphone signal |
WO2021068120A1 (en) * | 2019-10-09 | 2021-04-15 | 大象声科(深圳)科技有限公司 | Deep learning speech extraction and noise reduction method fusing signals of bone vibration sensor and microphone |
CN110867192A (en) * | 2019-10-23 | 2020-03-06 | 北京计算机技术及应用研究所 | Speech enhancement method based on gated cyclic coding and decoding network |
CN110808063A (en) * | 2019-11-29 | 2020-02-18 | 北京搜狗科技发展有限公司 | Voice processing method and device for processing voice |
CN111242976A (en) * | 2020-01-08 | 2020-06-05 | 北京天睿空间科技股份有限公司 | Aircraft detection tracking method using attention mechanism |
US12009004B2 (en) | 2020-02-10 | 2024-06-11 | Tencent Technology (Shenzhen) Company Limited | Speech enhancement method and apparatus, electronic device, and computer-readable storage medium |
WO2021159772A1 (en) * | 2020-02-10 | 2021-08-19 | 腾讯科技(深圳)有限公司 | Speech enhancement method and apparatus, electronic device, and computer readable storage medium |
US11842722B2 (en) | 2020-07-21 | 2023-12-12 | Ai Speech Co., Ltd. | Speech synthesis method and system |
CN111833843A (en) * | 2020-07-21 | 2020-10-27 | 苏州思必驰信息科技有限公司 | Speech synthesis method and system |
CN112185405A (en) * | 2020-09-10 | 2021-01-05 | 中国科学技术大学 | Bone conduction speech enhancement method based on differential operation and joint dictionary learning |
CN112185405B (en) * | 2020-09-10 | 2024-02-09 | 中国科学技术大学 | Bone conduction voice enhancement method based on differential operation and combined dictionary learning |
CN111899757B (en) * | 2020-09-29 | 2021-01-12 | 南京蕴智科技有限公司 | Single-channel voice separation method and system for target speaker extraction |
CN111899757A (en) * | 2020-09-29 | 2020-11-06 | 南京蕴智科技有限公司 | Single-channel voice separation method and system for target speaker extraction |
CN112562704B (en) * | 2020-11-17 | 2023-08-18 | 中国人民解放军陆军工程大学 | Frequency division topological anti-noise voice conversion method based on BLSTM |
CN112562704A (en) * | 2020-11-17 | 2021-03-26 | 中国人民解放军陆军工程大学 | BLSTM-based frequency division spectrum expansion anti-noise voice conversion method |
CN112786064B (en) * | 2020-12-30 | 2023-09-08 | 西北工业大学 | End-to-end bone qi conduction voice joint enhancement method |
CN112786064A (en) * | 2020-12-30 | 2021-05-11 | 西北工业大学 | End-to-end bone-qi-conduction speech joint enhancement method |
CN113642714A (en) * | 2021-08-27 | 2021-11-12 | 国网湖南省电力有限公司 | Insulator pollution discharge state identification method and system based on small sample learning |
CN113642714B (en) * | 2021-08-27 | 2024-02-09 | 国网湖南省电力有限公司 | Insulator pollution discharge state identification method and system based on small sample learning |
WO2023102930A1 (en) * | 2021-12-10 | 2023-06-15 | 清华大学深圳国际研究生院 | Speech enhancement method, electronic device, program product, and storage medium |
CN114495909A (en) * | 2022-02-20 | 2022-05-13 | 西北工业大学 | End-to-end bone-qi-guide voice joint identification method |
CN114495909B (en) * | 2022-02-20 | 2024-04-30 | 西北工业大学 | End-to-end bone-qi guiding voice joint recognition method |
CN115033734A (en) * | 2022-08-11 | 2022-09-09 | 腾讯科技(深圳)有限公司 | Audio data processing method and device, computer equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN108986834B (en) | 2023-04-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108986834A (en) | The blind Enhancement Method of bone conduction voice based on codec framework and recurrent neural network | |
CN107886967B (en) | A kind of bone conduction sound enhancement method of depth bidirectional gate recurrent neural network | |
Dave | Feature extraction methods LPC, PLP and MFCC in speech recognition | |
CN103531205B (en) | The asymmetrical voice conversion method mapped based on deep neural network feature | |
CN108447495B (en) | Deep learning voice enhancement method based on comprehensive feature set | |
CN105070293B (en) | Audio bandwidth expansion coding-decoding method based on deep neural network and device | |
CN103065629A (en) | Speech recognition system of humanoid robot | |
CN110085245B (en) | Voice definition enhancing method based on acoustic feature conversion | |
CN110379412A (en) | Method, apparatus, electronic equipment and the computer readable storage medium of speech processes | |
CN102800316A (en) | Optimal codebook design method for voiceprint recognition system based on nerve network | |
CN105023580A (en) | Unsupervised noise estimation and speech enhancement method based on separable deep automatic encoding technology | |
CN110648684B (en) | Bone conduction voice enhancement waveform generation method based on WaveNet | |
CN110867192A (en) | Speech enhancement method based on gated cyclic coding and decoding network | |
Lavrynenko et al. | Method of voice control functions of the UAV | |
CN113658583B (en) | Ear voice conversion method, system and device based on generation countermeasure network | |
CN106024010A (en) | Speech signal dynamic characteristic extraction method based on formant curves | |
CN102237083A (en) | Portable interpretation system based on WinCE platform and language recognition method thereof | |
CN105448302A (en) | Environment adaptive type voice reverberation elimination method and system | |
Chetouani et al. | Investigation on LP-residual representations for speaker identification | |
Shah et al. | Novel MMSE DiscoGAN for cross-domain whisper-to-speech conversion | |
CN111326170A (en) | Method and device for converting ear voice into normal voice by combining time-frequency domain expansion convolution | |
CN106875944A (en) | A kind of system of Voice command home intelligent terminal | |
Brucal et al. | Female voice recognition using artificial neural networks and MATLAB voicebox toolbox | |
CN109215635B (en) | Broadband voice frequency spectrum gradient characteristic parameter reconstruction method for voice definition enhancement | |
CN103854655A (en) | Low-bit-rate voice coder and decoder |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||