CN110010149A - Dual sensor sound enhancement method based on statistical model - Google Patents
- Publication number: CN110010149A (application CN201910296425.4A)
- Authority: CN (China)
- Legal status: Granted (the status shown is an assumption, not a legal conclusion)
Classifications
- G10L21/038 - Speech enhancement (e.g. noise reduction or echo cancellation) using band-spreading techniques
- G10L21/0272 - Voice signal separating
- Y02T90/00 - Enabling technologies with a potential or indirect contribution to GHG emissions mitigation
Abstract
The present invention discloses a dual-sensor speech enhancement method based on statistical models, comprising: synchronously acquiring air-conducted test speech and non-air-conducted test speech; detecting the endpoints of the air-conducted test speech and building an air-conduction noise statistical model from its pure-noise segments; correcting a joint statistical model with the air-conduction noise statistical model and classifying the air-conducted test speech frames; computing an optimal air-conduction speech filter from the class-dependent air-conducted speech linear-spectral statistical model and the air-conduction noise statistical model; and filtering the air-conducted test speech with the optimal filter to obtain the filter-enhanced speech. The joint statistical model and the air-conducted speech linear-spectral statistical models are established in advance from synchronously acquired clean air-conducted and non-air-conducted training speech.
Description
Technical field
The present application is a divisional application of patent application No. 201610025390.7, entitled "Dual-sensor speech enhancement method and device based on statistical models", filed by the applicant on January 14, 2016. The present invention relates to the field of digital signal processing, and in particular to a dual-sensor speech enhancement method based on statistical models.
Background art
Communication is an important means of modern human interaction, and speech is the most common form carried by communication systems; its quality directly affects how accurately people receive information. During propagation, speech is inevitably corrupted by various kinds of environmental noise, which noticeably degrades its quality and intelligibility. In practice, speech enhancement techniques are therefore often used to process speech captured in noisy environments.
Speech enhancement extracts the useful speech signal from a noisy background and is the basic means of suppressing and reducing noise interference. Traditional speech enhancement operates on signals captured by air-conduction sensors (such as microphones). By processing approach, common techniques fall into two classes: model-based and non-model-based methods. Non-model-based methods include spectral subtraction, filtering methods, and wavelet transform methods; they usually assume the noise is relatively stationary, and their performance degrades when the noise changes quickly. Model-based methods first build statistical models of the speech and noise signals and then obtain a minimum mean-square error or maximum a posteriori estimate of the clean speech from the models. Such methods avoid musical noise and can handle non-stationary noise. However, because both families rely on air-conduction sensors such as microphones, the signal is easily corrupted by environmental acoustic noise, and system performance drops sharply in strong noise.
To overcome the influence of strong noise on speech processing systems, non-air-conduction speech sensors take a different route from traditional air-conduction sensors: vibrations at the speaker's vocal cords, jawbone, and similar sites drive a reed or carbon film inside the sensor, changing its resistance and hence the voltage across it, so the vibration signal is converted into an electrical speech signal. Since sound waves conducted through the air cannot deform the reed or carbon film, non-air-conduction sensors are immune to air-conducted sound and are highly resistant to environmental acoustic noise. However, because they pick up speech propagated as vibration through the jawbone, muscle, and skin, the captured speech sounds muffled and indistinct, the high-frequency components are severely attenuated, and intelligibility is poor, which limits the practical application of non-air-conduction techniques.
Given that air-conduction and non-air-conduction sensors each have shortcomings when used alone, speech enhancement methods combining the advantages of both have appeared in recent years. These methods exploit the complementarity of air-conducted and non-air-conducted speech through multi-sensor fusion, and usually outperform single-sensor enhancement systems. However, existing methods that combine the two kinds of sensors still have the following deficiencies: (1) the air-conducted and non-air-conducted speech signals are mostly restored independently and only fused afterwards, so the complementarity between the two is not fully exploited during restoration; (2) in changing, strongly noisy environments, the statistics of the pure-speech segments of the air-conducted signal are severely disturbed and the SNR of the enhanced speech drops, so the enhancement obtained after fusion is limited.
Summary of the invention
The present invention provides a dual-sensor speech enhancement method based on statistical models, comprising: synchronously acquiring air-conducted test speech and non-air-conducted test speech, detecting the endpoints of the air-conducted test speech, and building an air-conduction noise statistical model from its pure-noise segments; correcting a joint statistical model with the air-conduction noise statistical model and classifying the air-conducted test speech frames; computing an optimal air-conduction speech filter from the class-dependent air-conducted speech linear-spectral statistical model and the air-conduction noise statistical model; and filtering the air-conducted test speech with the optimal filter to obtain the filter-enhanced speech. The joint statistical model and the air-conducted speech linear-spectral statistical models are established in advance from synchronously acquired clean air-conducted and non-air-conducted training speech.
Compared with the prior art, the present invention has the following advantages and effects:
1. During air-conducted speech enhancement, the invention combines non-air-conducted and air-conducted speech to build the statistical model used for classification and to perform endpoint detection, and constructs the optimal air-conduction speech filter accordingly, which improves the enhancement of the air-conducted speech and significantly increases the robustness of the whole system.
2. The invention adopts a two-stage enhancement structure: when strong noise degrades the filtering of the air-conducted speech, the second stage adaptively fuses the filtered speech with the speech mapped from the non-air-conducted signal, so a good enhancement result is still obtained.
3. There is no distance restriction between the air-conduction and non-air-conduction sensors, which is convenient for the user.
Brief description of the drawings
Fig. 1 is a flow chart of the dual-sensor speech enhancement method based on statistical models disclosed by an embodiment of the present invention;
Fig. 2 is a flow chart of training the speech statistical models in an embodiment of the present invention;
Fig. 3 is a flow chart of building the mapping model from non-air-conducted speech to air-conducted speech in an embodiment of the present invention;
Fig. 4 is a flow chart of building the air-conduction noise statistical model in an embodiment of the present invention;
Fig. 5 is a flow chart of correcting the joint statistical model in an embodiment of the present invention;
Fig. 6 is a flow chart of estimating the optimal air-conduction speech filter in an embodiment of the present invention;
Fig. 7 is a flow chart of the weighted fusion of the mapped speech and the filter-enhanced speech in an embodiment of the present invention;
Fig. 8 is a structural block diagram of the dual-sensor speech enhancement device based on statistical models disclosed by an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is further described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only illustrate the invention and are not intended to limit it.
Embodiment one
This embodiment discloses a dual-sensor speech enhancement method based on statistical models. With reference to the flow chart in Fig. 1, the method comprises the following steps:
Step S1: synchronously acquire clean air-conducted and non-air-conducted training speech, establish the joint statistical model used for classification, and compute the air-conducted speech linear-spectral statistical model corresponding to each class. This step can be divided into the following sub-steps, whose flow is shown in Fig. 2:
Step S1.1: synchronously acquire clean air-conducted and non-air-conducted training speech, divide both into frames, and extract the feature parameters of every frame.
In this embodiment, clean, synchronized air-conducted and non-air-conducted training speech is acquired by the speech reception module. After framing and pre-processing, a discrete Fourier transform is applied to the input training speech, and mel filters are then used to extract the mel-frequency cepstral coefficients (MFCC) of both kinds of training speech as the training data of the joint statistical model.
In other embodiments, LPCC or LSF coefficients of the air-conducted and non-air-conducted training speech are extracted instead.
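The MFCC extraction of step S1.1 (framing, windowing, DFT, mel filtering, log, DCT) can be sketched as follows. The frame length, hop size, filter count, and coefficient count are illustrative choices, not values fixed by the patent:

```python
import numpy as np

def mel_filterbank(n_filters=26, n_fft=512, sr=16000):
    # triangular mel-spaced filters over the rfft bins
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, c):
            fb[i - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fb[i - 1, k] = (hi - k) / max(hi - c, 1)
    return fb

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_ceps=13):
    # frame + Hamming window, power spectrum, mel filtering, log, DCT-II
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    logmel = np.log(spec @ mel_filterbank(n_fft=n_fft, sr=sr).T + 1e-10)
    n = logmel.shape[1]
    basis = np.cos(np.pi / n * (np.arange(n)[None, :] + 0.5)
                   * np.arange(n_ceps)[:, None])      # DCT-II basis
    return logmel @ basis.T

x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s toy signal
C = mfcc(x)
assert C.shape == (98, 13)
```

The same routine would be run on the air-conducted and non-air-conducted training signals, producing one feature matrix per stream.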
Step S1.2: splice the feature parameters of the air-conducted and non-air-conducted training speech from step S1.1 into clean joint speech feature parameters.
In this embodiment, the cepstral-domain feature vector sequence of the air-conducted training speech is denoted S_N = {s_N1, s_N2, ..., s_Nn}, where n is the number of speech frames and s_Nl is the feature column vector of frame l; the cepstral-domain feature vector sequence of the non-air-conducted training speech is denoted S_T = {s_T1, s_T2, ..., s_Tn}, with the same frame count n and feature column vectors s_Tl. The cepstral-domain features of frame l of the air-conducted training speech and frame l of the non-air-conducted training speech are spliced, giving the cepstral-domain joint feature vector of frame l, z_l = [s_Nl; s_Tl].
Step S1.3: use the joint speech feature parameters from step S1.2 to train the cepstral-domain joint statistical model used for classification.
In this embodiment, the probability distribution of the joint training speech is fitted with a multi-data-stream Gaussian mixture model. The probability density function of the cepstral-domain joint statistical model is

p(z) = sum_{m=1..M} pi_m prod_s N(z_s; mu_{m,s}, Sigma_{m,s})^{theta_s}

where s is the index of the speech data stream, M is the number of mixture components in the GMM, theta_s is the weight of stream s, and pi_m is the prior weight of mixture component m, with sum_s theta_s = 1, theta_s >= 0, sum_m pi_m = 1, pi_m >= 0; mu_{m,s} and Sigma_{m,s} are the mean vector and covariance matrix of stream s in class m of the cepstral-domain joint statistical model, z_s is the feature vector of stream s, and N(.; mu, Sigma) is a single Gaussian probability density function. Let lambda denote the parameter set of the multi-stream Gaussian mixture model and Z = {z_1, z_2, ..., z_n} the set of training cepstral-domain joint feature vectors; the likelihood of the cepstral-domain joint statistical model is then

P(Z | lambda) = prod_{l=1..n} p(z_l)

The model parameter set lambda maximizing P(Z | lambda) can be found with the EM algorithm (expectation-maximization algorithm).
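The EM training of step S1.3 can be illustrated with a plain diagonal-covariance GMM. For brevity this sketch fits the concatenated joint vectors directly and omits the per-stream exponents theta_s of the multi-stream formulation; the toy data, component count, and dimensions are assumptions:

```python
import numpy as np

def fit_diag_gmm(Z, M=2, n_iter=50, seed=0):
    """EM for a diagonal-covariance GMM over joint feature vectors z_l.
    Each Gaussian component m plays the role of one class of the joint model."""
    rng = np.random.default_rng(seed)
    n, d = Z.shape
    mu = Z[rng.choice(n, M, replace=False)].astype(float)   # class means
    var = np.tile(Z.var(axis=0), (M, 1)) + 1e-6             # class variances
    pi = np.full(M, 1.0 / M)                                # priors pi_m
    for _ in range(n_iter):
        # E-step: log pi_m + log N(z_l; mu_m, var_m), normalized to p(m|z_l)
        logp = (np.log(pi)
                - 0.5 * (((Z[:, None, :] - mu) ** 2 / var).sum(-1)
                         + np.log(2 * np.pi * var).sum(-1)))
        logp -= logp.max(axis=1, keepdims=True)
        post = np.exp(logp)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: re-estimate priors, means, variances from the posteriors
        Nm = post.sum(axis=0) + 1e-12
        pi = Nm / n
        mu = (post.T @ Z) / Nm[:, None]
        var = (post.T @ Z ** 2) / Nm[:, None] - mu ** 2 + 1e-6
    return pi, mu, var, post

# toy joint features: two well-separated classes of 8-dim vectors
rng = np.random.default_rng(1)
Z = np.vstack([rng.normal(0.0, 1.0, (150, 8)),
               rng.normal(5.0, 1.0, (150, 8))])
pi, mu, var, post = fit_diag_gmm(Z, M=2)
labels = post.argmax(axis=1)          # hard class of each joint frame
assert np.allclose(post.sum(axis=1), 1.0) and abs(pi.sum() - 1.0) < 1e-9
```

The posteriors `post` are exactly the per-frame class probabilities p(m | z_l) used in step S1.4.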
Step S1.4: classify all cepstral-domain joint training speech frames, compute the linear-spectral statistics of the air-conducted speech over all joint frames belonging to each class, and establish the air-conducted speech linear-spectral statistical model corresponding to each class.
In this embodiment, each Gaussian component of the multi-stream Gaussian mixture model represents one class. For every cepstral-domain joint training frame, the probability that its joint feature vector z_l belongs to class m of the cepstral-domain joint statistical model is

p(m | z_l) = pi_m prod_s N(z_sl; mu_{m,s}, Sigma_{m,s})^{theta_s} / sum_{m'=1..M} pi_{m'} prod_s N(z_sl; mu_{m',s}, Sigma_{m',s})^{theta_s}

where z_sl denotes the cepstral-domain feature vector of stream s in frame l. The model mixture component (class) attaining the maximum probability max_m {p(m | z_l)} is recorded.
After all cepstral-domain joint frames have been classified, the spectral mean of the air-conducted speech over all joint frames gathered in the same class is computed and used as the parameter of the air-conducted speech linear-spectral statistical model corresponding to that class of the joint statistical model.
In other embodiments, a multi-data-stream hidden Markov model is used as the joint statistical model, and each Gaussian component of the multi-stream hidden Markov model represents one class.
Step S2: use the air-conducted and non-air-conducted training speech synchronously acquired in step S1 to establish the mapping model from non-air-conducted speech to air-conducted speech. This step can be divided into the following sub-steps, whose flow is shown in Fig. 3:
Step S2.1: divide the clean non-air-conducted and air-conducted training speech synchronously acquired in step S1 into frames, and feed an initialized feedforward neural network with the non-air-conducted training frames as input and the simultaneous air-conducted training frames as the ideal output.
In this embodiment, the air-conducted and non-air-conducted training speech is first framed, and the line spectral frequency (LSF) parameters of the air-conducted and non-air-conducted training frames are extracted. The input-output pairs of the feedforward network are (L_T, L_N), where L_T is the LSF vector of the non-air-conducted training speech, used as the network input, and L_N is the LSF vector of the air-conducted training speech, used as the ideal output; the feedforward network weights are then initialized.
Step S2.2: train the weight coefficients of the feedforward neural network with the scaled conjugate gradient algorithm under the minimum mean-square error criterion, so that the error between the actual output and the ideal output is minimized, yielding the mapping model from non-air-conducted speech to air-conducted speech.
In this embodiment, the connection weight vector from layer l of the feedforward network to neuron j of layer l+1 collects the weights w_ij^(l+1) from each neuron i of layer l (i = 1, ..., N_l, with N_l the number of neurons in layer l) together with the threshold b_j^(l+1) of neuron j of layer l+1; the overall feedforward network weight vector w stacks all these vectors, where M is the number of network layers and N is the number of output-layer neurons. Let P be the number of training speech frames; the error E(w) is the mean squared error between the actual network output L* and the ideal output L.
The feedforward network weights are updated iteratively with the scaled conjugate gradient algorithm; the (k+1)-th iteration is

w_{k+1} = w_k + alpha_k P_k          (14)

where the search direction P_k and the step size alpha_k are determined from the first derivative E'(w_k) and second derivative E''(w_k) of E(w). When E'(w_k) = 0, the error E(w) reaches its minimum and the optimal weight vector W_best is obtained.
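A minimal sketch of the step S2.2 mapping network, using toy stand-ins for the LSF vectors and plain gradient descent in place of the scaled conjugate gradient algorithm (the layer sizes, learning rate, and synthetic data are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy stand-ins for LSF vectors: L_T (non-air-conducted input) -> L_N (air-conducted target)
LT = rng.normal(size=(200, 10))
LN = np.tanh(LT @ rng.normal(size=(10, 10))) + 0.05 * rng.normal(size=(200, 10))

# one hidden layer of 32 tanh units; weights initialized small
W1 = rng.normal(scale=0.1, size=(10, 32)); b1 = np.zeros(32)
W2 = rng.normal(scale=0.1, size=(32, 10)); b2 = np.zeros(10)

def forward(X):
    H = np.tanh(X @ W1 + b1)
    return H, H @ W2 + b2

_, out0 = forward(LT)
e0 = np.mean((out0 - LN) ** 2)        # initial error E(w)
lr = 0.02
for _ in range(500):
    H, out = forward(LT)
    G = 2.0 * (out - LN) / len(LT)    # gradient of the MSE w.r.t. the output
    GH = (G @ W2.T) * (1.0 - H ** 2)  # back-propagated to the hidden layer
    W2 -= lr * H.T @ G;  b2 -= lr * G.sum(axis=0)
    W1 -= lr * LT.T @ GH; b1 -= lr * GH.sum(axis=0)

_, out1 = forward(LT)
e1 = np.mean((out1 - LN) ** 2)
assert e1 < e0                        # training reduces the error E(w)
```

In the patent the descent direction and step size come from the scaled conjugate gradient rule of equation (14); the plain gradient step above only stands in for that update.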
Step S3: synchronously acquire the air-conducted and non-air-conducted test speech, detect the endpoints of the air-conducted test speech, and then establish the spectral-domain air-conduction noise statistical model from the pure-noise segments of the air-conducted test speech, using the following sub-steps, whose flow is shown in Fig. 4:
Step S3.1: synchronously acquire the air-conducted and non-air-conducted test speech and divide both into frames.
Step S3.2: from the short-time autocorrelation function R_w(k) and short-time energy E_w of each non-air-conducted test frame, compute the frame's short-time average threshold-crossing rate C_w(n):

C_w(n) = sum_k { |sgn[R_w(k) - alpha*T] - sgn[R_w(k-1) - alpha*T]| + |sgn[R_w(k) + alpha*T] - sgn[R_w(k-1) + alpha*T]| } w(n-k)          (17)

where sgn[.] is the sign operation, alpha is a regulating factor, w(n) is a window function, and T is the initial threshold. When C_w(n) exceeds a preset threshold the frame is judged to be speech; otherwise it is judged to be noise. The endpoint locations of the non-air-conducted test speech signal are obtained from the per-frame decisions.
Step S3.3: take the instants of the non-air-conducted speech endpoints detected in step S3.2 as the endpoints of the air-conducted test speech, and extract the pure-noise segments of the air-conducted test speech.
Step S3.4: compute the linear-spectral mean of the pure-noise segments of the air-conducted test speech, save this mean parameter, and establish the spectral-domain air-conduction noise statistical model.
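The endpoint detection and noise-model construction of step S3 can be sketched as follows. For simplicity the dual-threshold crossing rate is computed on the waveform rather than on the short-time autocorrelation, and the frame length and thresholds are illustrative:

```python
import numpy as np

def crossing_rate(frame, alpha_t):
    """Short-time average crossing rate against the dual thresholds +/- alpha*T,
    a simplified stand-in for the C_w(n) measure of equation (17)."""
    s_hi = np.sign(frame - alpha_t)
    s_lo = np.sign(frame + alpha_t)
    return 0.5 * (np.abs(np.diff(s_hi)).sum()
                  + np.abs(np.diff(s_lo)).sum()) / len(frame)

def detect_endpoints(x, frame_len=160, thresh=0.1, alpha_t=0.01):
    """Frame-level decision: a frame is speech when its crossing rate is high."""
    n = len(x) // frame_len
    return np.array([crossing_rate(x[i * frame_len:(i + 1) * frame_len], alpha_t)
                     > thresh for i in range(n)])

# toy signal: 10 frames of weak noise followed by 10 frames of a 300 Hz tone
rng = np.random.default_rng(0)
noise = 0.005 * rng.normal(size=1600)
speech = 0.5 * np.sin(2 * np.pi * 300 * np.arange(1600) / 8000)
x = np.concatenate([noise, speech])
flags = detect_endpoints(x)

# pure-noise frames before the detected start -> linear-spectral noise mean
noise_frames = x[:1600].reshape(-1, 160)
mu_noise = np.abs(np.fft.rfft(noise_frames)).mean(axis=0)
assert flags[10:].all() and mu_noise.shape == (81,)
```

`mu_noise` plays the role of the saved linear-spectral mean of step S3.4, i.e. the air-conduction noise statistical model.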
Step S4: correct the joint statistical model from step S1 with the air-conduction noise statistical model, classify the air-conducted test speech frames, then compute the optimal air-conduction speech filter from the class-dependent air-conducted speech linear-spectral statistical model and the air-conduction noise statistical model, and filter-enhance the air-conducted test speech.
In this embodiment, model compensation is first used to correct the parameters of the air-conducted speech data stream in the joint statistical model, using the following sub-steps, whose flow is shown in Fig. 5:
Step S4.1a: convert the joint statistical model parameters from the mel-cepstral domain to the linear-spectral domain. In this embodiment, the inverse discrete cosine transform C^{-1} first maps the mean and variance of class m of the mel-cepstral joint statistical model to the log domain, giving the log-domain mean mu^log and variance Sigma^log. These are then mapped from the log domain to the linear-spectral domain:

mu_i^lin = exp(mu_i^log + sigma_ii^log / 2),   sigma_ij^lin = mu_i^lin mu_j^lin (exp(sigma_ij^log) - 1)

where mu_i^lin is the i-th component of the linear-spectral mean vector and sigma_ij^lin is the element in row i, column j of the linear-spectral covariance matrix.
Step S4.2a: using the fact that clean air-conducted speech and air-conduction noise are additive in the linear-spectral domain, correct the air-conducted stream parameters of the joint statistical model. In this embodiment, the linear-spectral mean and variance of the air-conduction noise are combined, scaled according to g, with the clean air-conducted stream statistics, where g is the SNR of the air-conducted test speech; the result is the corrected linear-spectral mean and variance of the air-conducted speech data stream.
Step S4.3a: convert the corrected linear-spectral joint statistical parameters from step S4.2a back to the original feature domain (the cepstral domain) with the inverses of the transforms of step S4.1a, obtaining the corrected cepstral-domain joint statistical model.
After the joint statistical model has been corrected, the probability p(m | z_l) that each joint test feature vector z_l belongs to class m of the joint statistical model can be computed.
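The domain conversions of steps S4.1a to S4.3a can be sketched with toy statistics. Diagonal covariances are kept throughout for brevity, and the additive compensation scaled by g is one plausible reading of the correction, not the patent's exact formula:

```python
import numpy as np

def dct_matrix(n):
    # orthonormal DCT-II matrix C; C @ x is the DCT, C.T its inverse
    C = np.cos(np.pi / n * np.arange(n)[:, None] * (np.arange(n)[None, :] + 0.5))
    C *= np.sqrt(2.0 / n)
    C[0] /= np.sqrt(2.0)
    return C

n = 8
C = dct_matrix(n)
mu_cep = np.zeros(n)                  # toy class-m cepstral mean
var_cep = 0.01 * np.ones(n)           # toy class-m cepstral variance (diagonal)

# cepstrum -> log spectrum via the inverse DCT
mu_log = C.T @ mu_cep
var_log = (C.T ** 2) @ var_cep

# log -> linear spectrum via log-normal moments
mu_lin = np.exp(mu_log + 0.5 * var_log)
var_lin = mu_lin ** 2 * (np.exp(var_log) - 1.0)

# sanity: with no noise the round trip returns the original cepstral mean
mu_rt = C @ (np.log(mu_lin) - 0.5 * np.log1p(var_lin / mu_lin ** 2))
assert np.allclose(mu_rt, mu_cep)

# additive noise compensation in the linear-spectral domain (g: SNR factor)
g = 1.0
mu_noise = 0.2 * np.ones(n)           # toy noise linear-spectral mean
mu_comp = mu_lin + mu_noise / g

# back to the log and cepstral domains for classification
mu_log_c = np.log(mu_comp) - 0.5 * np.log1p(var_lin / mu_comp ** 2)
mu_cep_c = C @ mu_log_c
assert mu_cep_c.shape == (n,)
```

The corrected `mu_cep_c` is the kind of parameter the corrected cepstral-domain joint statistical model would carry for class m.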
The computation of the optimal air-conduction speech filter in step S4 comprises the following sub-steps, whose flow is shown in Fig. 6:
Step S4.1b: extract the joint feature parameters of the air-conducted and non-air-conducted test speech, and for each joint test frame compute the output probability p(m | z_l) of each class of the corrected joint statistical model.
Step S4.2b: from these output probabilities, compute the weights of the non-air-conducted and air-conducted test speech data streams in the joint statistical model, using the following sub-steps:
Step S4.2.1: set the initial weight of the air-conducted test speech to w_0 and that of the non-air-conducted test speech to 1 - w_0, set the iteration count t = 0, and compute the class-separability measure Diff_t, where M is the number of model mixture components, L is the number of speech frames, p(j | z_l) and p(k | z_l) are the probabilities that joint test frame z_l belongs to classes j and k of the joint statistical model, and the measure accumulates the distances between the statistical parameters (mean vectors) of classes k and j of the joint statistical model.
Step S4.2.2: compute the air-conducted speech weight theta_1(Diff_t) and the non-air-conducted speech weight theta_2(Diff_t) = 1 - theta_1(Diff_t), recompute p(j | z_l) and p(k | z_l) with the updated weights, and then compute Diff_{t+1}.
Step S4.2.3: if |Diff_{t+1} - Diff_t| < xi, where xi is a preset threshold, stop updating the weights and go to step S4.2.4; otherwise set t = t + 1 and return to step S4.2.2.
Step S4.2.4: compute the optimal weights theta_1(Diff_T) and theta_2(Diff_T) from Diff_T, where T is the value of t when the update stopped.
Step S4.3b: classify the air-conducted test speech frames with the joint statistical model obtained in step S4.2b, then compute the optimal air-conduction speech filter from the class-dependent air-conducted speech linear-spectral statistical model and the air-conduction noise statistical model, using the following sub-steps:
Step S4.3.1: with the optimal weights theta_1(Diff_T) and theta_2(Diff_T), compute the probability p(m | z_l) that joint test frame z_l belongs to class m of the corrected joint statistical model.
Step S4.3.2: compute the frequency-domain gain function of the optimal air-conduction speech filter:

G_i(z_l) = sum_{m=1..M} p(m | z_l) mu_{m,i}^x / (mu_{m,i}^x + mu_{m,i}^n),   i = 1, ..., K

where K is the length of the class mean vectors of the joint statistical model, mu_{m,i}^x is the i-th value of the air-conducted speech linear-spectral mean vector corresponding to class m of the joint statistical model, and mu_{m,i}^n is the i-th value of the noise linear-spectral mean vector corresponding to class m of the air-conduction noise statistical model.
After the frequency-domain gain function of the optimal air-conduction speech filter has been obtained, the air-conducted test speech is transformed to the frequency domain, its phase is retained while its magnitude spectrum is scaled by G(z_l), and the result is transformed back to the time domain to obtain the filter-enhanced speech.
In other embodiments, in order to improve computational efficiency, the gain function of the optimal air-conduction speech filter is computed with a computationally cheaper variant of the above formula.
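Applying the frequency-domain gain of step S4.3.2 (scale the magnitude spectrum, keep the phase, transform back) can be sketched with a short-time overlap-add filter. The frame length, hop, and the all-pass gain used in the final check are illustrative:

```python
import numpy as np

def apply_gain(x, gain, frame_len=256, hop=128):
    """Scale each frame's magnitude spectrum by a per-bin gain, keep the
    phase, and overlap-add back to the time domain."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    out = np.zeros(len(x))
    norm = np.zeros(len(x))
    for i in range(n_frames):
        seg = x[i * hop:i * hop + frame_len] * win
        spec = np.fft.rfft(seg)
        spec = gain * np.abs(spec) * np.exp(1j * np.angle(spec))
        rec = np.fft.irfft(spec, frame_len) * win
        out[i * hop:i * hop + frame_len] += rec
        norm[i * hop:i * hop + frame_len] += win ** 2
    return out / np.maximum(norm, 1e-8)

rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 200 * np.arange(4096) / 8000) + 0.3 * rng.normal(size=4096)
G = np.ones(129)                 # all-pass gain G(z_l): output should match input
y = apply_gain(x, G)
assert np.allclose(y[256:-256], x[256:-256], atol=1e-6)
```

In the method the per-bin gain would come from the class posteriors and the class speech/noise spectral means; here a unit gain simply verifies that the analysis-synthesis chain reconstructs the signal.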
Step S5: convert the non-air-conducted test speech into mapped air-conducted speech with the mapping model from non-air-conducted to air-conducted speech obtained in step S2.
Step S6: linearly fuse the mapped speech obtained in step S5 with the filter-enhanced speech obtained in step S4 to obtain the fused, enhanced speech, using the following sub-steps, whose flow is shown in Fig. 7:
Step S6.1: compute the weight of the filter-enhanced speech x_m of frame m and the weight of the mapped speech y_m of frame m.
In this embodiment, according to the speech start instant obtained by the endpoint detection of step S3, all frames of the filter-enhanced speech x_m before the speech start are intercepted and their mean power is taken as the noise-frame power. The weight of the filter-enhanced speech x_m of frame m and the weight of the mapped speech y_m of frame m are then computed from the amplitude variances of x_m and y_m, the preset constants alpha and beta, and SNR_m, the signal-to-noise ratio of the filter-enhanced frame x_m, given by the ratio of the power of x_m to the noise-frame power.
Step S6.2: obtain the fused, enhanced speech as the weighted superposition of the filter-enhanced speech x_m and the mapped speech y_m.
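The SNR-driven fusion of step S6 can be sketched as follows. The sigmoid weight in terms of SNR_m and the constants alpha, beta is an illustrative choice, since the patent's exact weight formula (involving the amplitude variances of x_m and y_m) is not reproduced here:

```python
import numpy as np

def fuse(x_filt, y_map, noise_power, alpha=1.0, beta=1.0):
    """Frame-wise linear fusion of the filter-enhanced speech x_m and the
    mapped speech y_m, weighted by the filtered signal's per-frame SNR."""
    p_x = np.mean(x_filt ** 2, axis=1)                        # frame power
    snr = 10.0 * np.log10(np.maximum(p_x / max(noise_power, 1e-12), 1e-12))
    w_x = 1.0 / (1.0 + np.exp(-alpha * (snr - beta)))         # high SNR trusts the filter
    w_y = 1.0 - w_x
    return w_x[:, None] * x_filt + w_y[:, None] * y_map, w_x

rng = np.random.default_rng(0)
x_filt = rng.normal(size=(20, 160))   # toy filter-enhanced frames
y_map = rng.normal(size=(20, 160))    # toy mapped frames
fused, w_x = fuse(x_filt, y_map, noise_power=0.01)
assert fused.shape == (20, 160) and np.all((w_x >= 0) & (w_x <= 1))
```

The weights sum to one per frame, so when the filtered speech is badly degraded (low SNR_m) the fusion leans on the mapped speech, as the two-stage design of the method intends.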
Embodiment two
Embodiment two discloses a dual-sensor speech enhancement device based on statistical models, consisting of a speech reception module, a speech statistical model training module, an air-conduction noise statistical model estimation module, an air-conducted test speech filter-enhancement module, a speech mapping module, and a speech fusion enhancement module; its structure is shown in Fig. 8.
The speech reception module synchronously acquires the clean air-conducted and non-air-conducted training speech.
The speech statistical model training module establishes the joint statistical model and the air-conducted speech linear-spectral statistical models.
The air-conduction noise statistical model estimation module detects the endpoints of the air-conducted test speech and then establishes the air-conduction noise statistical model from its pure-noise segments.
The air-conducted test speech filter-enhancement module corrects the statistical parameters of the joint statistical model with the air-conduction noise statistical model, classifies the air-conducted test speech frames, computes the optimal air-conduction speech filter from the class-dependent air-conducted speech linear-spectral statistical model together with the air-conduction noise statistical model, and filter-enhances the air-conducted test speech to obtain the filter-enhanced speech.
The speech mapping module establishes the mapping model from non-air-conducted to air-conducted speech and converts the non-air-conducted test speech into mapped speech with air-conduction characteristics according to that model.
The speech fusion enhancement module performs the weighted fusion of the mapped speech and the filter-enhanced speech to obtain the fused, enhanced speech.
As shown in Fig. 8, the speech reception module is connected with the speech statistical model training module, the air-conduction noise statistical model estimation module, the filter-enhancement module, and the speech mapping module; the speech statistical model training module is connected with the filter-enhancement module; the air-conduction noise statistical model estimation module is connected with the filter-enhancement module; the filter-enhancement module is connected with the speech fusion enhancement module; and the speech mapping module is connected with the speech fusion enhancement module.
The speech reception module comprises two sub-modules: an air-conduction speech sensor for acquiring the air-conducted speech data and a non-air-conduction speech sensor for acquiring the non-air-conducted speech data. The speech statistical model training module comprises a joint statistical model sub-module and an air-conducted speech linear-spectral statistical model sub-module, which establish the joint statistical model and the air-conducted speech linear-spectral statistical models. The air-conduction noise statistical model estimation module estimates the current ambient noise, corrects the joint statistical model, and participates in the computation of the filter coefficients. The filter-enhancement module consists of a joint statistical model correction sub-module, a joint test speech classification sub-module, an optimal air-conduction filter coefficient generation sub-module, and an air-conducted test speech filtering sub-module: the correction sub-module corrects the statistical parameters of the joint statistical model; the classification sub-module classifies the test speech and passes the classification results to the optimal filter coefficient generation sub-module; the coefficient generation sub-module computes the filter parameters; and the filtering sub-module finally produces the filter-enhanced air-conducted speech. The speech mapping module maps the non-air-conducted test speech to air-conducted speech. The speech fusion enhancement module comprises an adaptive weight generation sub-module and a linear fusion sub-module: the former computes the weights of the mapped speech and the filter-enhanced speech, and the latter uses those weights to linearly fuse the mapped speech and the filter-enhanced speech, producing the fused, enhanced speech.
Among the above sub-modules, the air-conduction speech sensor is connected to the air-conduction noise statistical model estimation module, the joint statistical model sub-module, the joint-detection speech classification sub-module, and the air-conduction detection speech filtering sub-module; the non-air-conduction speech sensor is connected to the joint statistical model sub-module, the air-conduction noise statistical model estimation module, the speech mapping module, and the joint-detection speech classification sub-module. The joint statistical model sub-module and the air-conduction speech linear-spectrum statistical model sub-module are connected to the joint statistical model correction sub-module; the air-conduction speech linear-spectrum statistical model training module is connected to the optimal air-conduction filter coefficient generation sub-module and participates in the computation of the filter coefficients.
The air-conduction noise model estimation module is connected to the joint statistical model correction sub-module and the optimal air-conduction filter coefficient generation sub-module; the joint statistical model correction sub-module is connected to the optimal air-conduction filter coefficient generation sub-module and the air-conduction detection speech filtering sub-module; the joint-detection speech classification sub-module is connected to the optimal air-conduction filter coefficient generation sub-module; and the optimal air-conduction filter coefficient generation sub-module is connected to the air-conduction detection speech filtering sub-module. The air-conduction detection speech filtering sub-module is connected to the adaptive weight generation sub-module and the linear fusion sub-module; the speech mapping module is connected to the adaptive weight generation sub-module and the linear fusion sub-module; and the adaptive weight generation module is connected to the linear fusion module.
It should be noted that the modules in the above device embodiment are divided only according to functional logic, and the division is not limited thereto, as long as the corresponding functions can be realized; in addition, the specific names of the modules are only for convenience of distinguishing them from one another and are not intended to limit the protection scope of the present invention.
The above embodiment is a preferred embodiment of the present invention, but the embodiments of the present invention are not limited thereto; any other change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention shall be deemed an equivalent replacement and shall fall within the protection scope of the present invention.
Claims (10)
1. A dual-sensor speech enhancement method based on statistical models, comprising:
synchronously acquiring air-conduction detection speech and non-air-conduction detection speech, then detecting the endpoints of the air-conduction detection speech, and establishing an air-conduction noise statistical model from the pure-noise segment of the air-conduction detection speech;
correcting a joint statistical model with the air-conduction noise statistical model, and classifying the air-conduction detection speech frames;
computing an optimal air-conduction speech filter from the air-conduction speech linear-spectrum statistical model corresponding to the classification result and the air-conduction noise statistical model;
filtering the air-conduction detection speech with the optimal air-conduction speech filter to obtain filtering-enhanced speech;
wherein the joint statistical model and the air-conduction speech linear-spectrum statistical model are established in advance from synchronously acquired clean air-conduction training speech and non-air-conduction training speech.
2. The method according to claim 1, wherein the step of synchronously acquiring air-conduction detection speech and non-air-conduction detection speech, detecting the endpoints of the air-conduction detection speech, and establishing the air-conduction noise statistical model from the pure-noise segment of the air-conduction detection speech comprises:
synchronously acquiring the air-conduction detection speech and the non-air-conduction detection speech, and dividing them into frames;
computing the short-time average threshold-crossing rate of each non-air-conduction detection speech frame from its short-time autocorrelation function and short-time energy; when the short-time average threshold-crossing rate exceeds a preset threshold, judging the frame to be a speech signal, otherwise noise;
obtaining the endpoint locations of the non-air-conduction detection speech signal from the per-frame decisions;
taking the instants corresponding to the detected non-air-conduction speech endpoints as the endpoints of the air-conduction detection speech, and extracting the pure-noise segment of the air-conduction detection speech;
computing the linear-spectrum mean of the pure-noise segment of the air-conduction detection speech, and saving that mean as the statistical model parameter of the air-conduction noise.
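The endpoint-detection and noise-modelling steps of claim 2 can be sketched as follows. The frame length, hop size, the decision threshold, and the use of short-time energy as the decision statistic in the usage note are illustrative assumptions, not values fixed by the claim (the claim's own statistic is the threshold-crossing rate of claim 3, which is why the decision function is a parameter here):

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split a 1-D signal into overlapping frames (sizes are assumptions)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

def short_time_energy(frame):
    """Short-time energy of one frame."""
    return np.sum(frame.astype(float) ** 2)

def detect_endpoints(frames, rate_fn, threshold):
    """Mark each frame as speech when its decision statistic exceeds the
    preset threshold, then return the first and last speech-frame indices
    (the endpoint locations of claim 2)."""
    flags = np.array([rate_fn(f) > threshold for f in frames])
    idx = np.flatnonzero(flags)
    return (idx[0], idx[-1]) if idx.size else (None, None)

def noise_model_mean(frames, speech_start):
    """Linear-spectrum mean of the pure-noise segment before the detected
    speech start: the saved statistical parameter of the air-conduction noise."""
    spectra = np.abs(np.fft.rfft(frames[:speech_start], axis=1))
    return spectra.mean(axis=0)
```

For example, with silence before and after a tone, `detect_endpoints(frames, short_time_energy, 1.0)` returns the indices of the first and last high-energy frames, and `noise_model_mean` averages the magnitude spectra of the leading silent frames.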
3. The method according to claim 2, wherein the short-time average threshold-crossing rate is computed by the following formula:
C_w(n) = Σ_k { |sgn[R_w(k) − αT] − sgn[R_w(k−1) − αT]| + |sgn[R_w(k) + αT] − sgn[R_w(k−1) + αT]| } w(n−k);
where sgn[·] is the sign operation, α is a regulatory factor, w(n) is a window function, T is the initial threshold value, R_w(k) is the short-time autocorrelation function, E_w is the short-time energy, and C_w(n) is the short-time average threshold-crossing rate.
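A minimal per-frame implementation of the threshold-crossing rate above. Treating α and T as given parameters, defaulting to a rectangular window, and removing the frame mean before autocorrelation are sketch choices where the original is ambiguous:

```python
import numpy as np

def autocorr(frame):
    """Short-time autocorrelation R_w(k) of one frame
    (mean removal is a sketch choice, not part of the claim)."""
    x = frame - frame.mean()
    r = np.correlate(x, x, mode="full")
    return r[len(x) - 1:]          # keep non-negative lags only

def threshold_crossing_rate(frame, alpha, T, window=None):
    """Count the crossings of R_w(k) through the two levels +/- alpha*T,
    per the formula of claim 3; a rectangular window is assumed when
    none is supplied."""
    R = autocorr(frame)
    if window is None:
        window = np.ones(len(R) - 1)
    lo, hi = -alpha * T, alpha * T
    c = (np.abs(np.sign(R[1:] - hi) - np.sign(R[:-1] - hi))
         + np.abs(np.sign(R[1:] - lo) - np.sign(R[:-1] - lo)))
    return float(np.sum(c * window))
```

A periodic frame has an oscillating autocorrelation and hence a high crossing rate; a constant frame yields zero.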
4. The method according to claim 1, wherein the joint statistical model is corrected by the following steps:
converting the parameters of the joint statistical model to the linear-spectrum domain;
correcting the air-conduction speech data-stream parameters of the joint statistical model, using the fact that clean air-conduction speech and air-conduction noise are additive in the linear-spectrum domain;
converting the corrected linear-spectrum-domain parameters of the joint statistical model back to the original feature domain to obtain the corrected joint statistical model;
wherein the air-conduction speech data-stream parameters of the joint statistical model are the means and covariances of the Gaussian components of a Gaussian mixture model or a hidden Markov model.
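The three conversion steps of claim 4 can be sketched for the common case of log-spectral-domain Gaussian means. Compensating only the means, not the covariances, is a simplification of this sketch, not part of the claim:

```python
import numpy as np

def correct_means_log_domain(clean_means_log, noise_mean_linear):
    """Mean-only correction in the style of claim 4: convert the
    clean-speech class means from the log-spectral to the linear-spectrum
    domain, add the (additive) air-conduction noise mean, and convert
    back.  Covariance compensation is omitted in this sketch."""
    linear = np.exp(clean_means_log)           # step 1: to linear spectrum
    corrected = linear + noise_mean_linear     # step 2: additive noise model
    return np.log(corrected)                   # step 3: back to original domain
```

Each row of `clean_means_log` is one class mean; the noise mean broadcasts across classes.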
5. The method according to claim 1, wherein the step of computing the optimal air-conduction speech filter from the air-conduction speech linear-spectrum statistical model corresponding to the classification result and the air-conduction noise statistical model comprises:
extracting the joint feature parameters of the air-conduction detection speech and the non-air-conduction detection speech, and computing, for each joint-detection speech frame, the output probability of each class of the corrected joint statistical model;
computing, from the above output probabilities, the weight parameters of the non-air-conduction detection speech data stream and the air-conduction detection speech data stream in the joint statistical model;
classifying the air-conduction detection speech frames with the joint statistical model updated by the above weight parameters, then computing the optimal air-conduction speech filter from the air-conduction speech linear-spectrum statistical model corresponding to the classification result and the air-conduction noise statistical model.
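One plausible reading of the per-frame output-probability computation with stream weights is sketched below, assuming diagonal-covariance Gaussians per class and the weights applied as log-likelihood multipliers; the patent does not spell out this exact weighting form:

```python
import numpy as np

def stream_weighted_posteriors(z_air, z_nonair, models, w_air):
    """Posterior probability of each class m of the (corrected) joint model
    for one joint-detection frame, with per-stream weights applied as
    log-likelihood multipliers (an assumed weighting scheme).
    `models` is a list of (mean_air, var_air, mean_nonair, var_nonair)."""
    def log_gauss(z, mu, var):
        # diagonal-covariance Gaussian log-density
        return -0.5 * np.sum((z - mu) ** 2 / var + np.log(2 * np.pi * var))
    ll = np.array([w_air * log_gauss(z_air, ma, va)
                   + (1.0 - w_air) * log_gauss(z_nonair, mn, vn)
                   for ma, va, mn, vn in models])
    ll -= ll.max()                 # shift for numerical stability
    p = np.exp(ll)
    return p / p.sum()             # normalized class posteriors
```

A frame near a class mean receives the larger posterior, which is then used for classification.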
6. The method according to claim 5, wherein the weight parameters of the non-air-conduction detection speech data stream and the air-conduction detection speech data stream are computed by the following steps:
setting the initial weight of the air-conduction detection speech to w0 and the initial weight of the non-air-conduction detection speech to 1 − w0, setting the iteration count t = 0, and computing Diff_t,
where M is the number of mixture components of the model, L is the number of speech frames, p(j|z_l) and p(k|z_l) are respectively the probabilities that the l-th joint-detection speech frame z_l belongs to the j-th class and the k-th class of the joint statistical model, and the remaining terms are the distance between the statistical parameters of the k-th and j-th classes of the joint statistical model, and those statistical parameters themselves;
computing the air-conduction detection speech weight θ_1(Diff_t) and the non-air-conduction detection speech weight θ_2(Diff_t) = 1 − θ_1(Diff_t), recomputing p(j|z_l) and p(k|z_l) with the updated weights, and then recomputing Diff_{t+1};
if |Diff_{t+1} − Diff_t| < ξ, where ξ is a preset threshold, stopping the weight update and proceeding to the next step; otherwise setting t = t + 1 and returning to the previous step;
computing the optimal weights θ_1(Diff_T) and θ_2(Diff_T) from Diff_T, where T is the value of t when the update stops.
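The iteration of claim 6 has the shape of a fixed-point loop. Since the closed forms of Diff_t and θ_1 are given by formulas not reproduced in this text, they appear below as caller-supplied functions; `diff_fn` and `weight_fn` are hypothetical stand-ins for those formulas:

```python
def iterate_stream_weight(diff_fn, weight_fn, w0, xi=1e-4, max_iter=100):
    """Fixed-point iteration of claim 6: starting from the initial
    air-conduction weight w0, recompute the class-separation measure
    Diff_t under the current weight and update the weight from it,
    stopping when |Diff_{t+1} - Diff_t| < xi.
    diff_fn:   weight -> Diff (stand-in for the patent's formula)
    weight_fn: Diff -> air-conduction weight theta_1 (stand-in)"""
    diff = diff_fn(w0)
    for _ in range(max_iter):
        w = weight_fn(diff)
        new_diff = diff_fn(w)
        if abs(new_diff - diff) < xi:
            return weight_fn(new_diff), new_diff   # optimal theta_1, Diff_T
        diff = new_diff
    return weight_fn(diff), diff
```

The non-air-conduction weight is then 1 minus the returned value, per θ_2 = 1 − θ_1.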
7. The method according to claim 6, wherein the optimal air-conduction speech filter is computed by the following steps:
computing, with the optimal weights θ_1(Diff_T) and θ_2(Diff_T), the probability p(m|z_l) that the joint-detection speech frame z_l belongs to the m-th class of the currently corrected joint statistical model;
computing the frequency-domain gain function of the optimal air-conduction speech filter with one of the following formulas:
where K is the dimension of the mean vector of the m-th class of the joint statistical model, the first remaining term is the i-th component of the linear-spectrum mean vector of the air-conduction speech corresponding to the m-th class of the joint statistical model, and the second is the i-th component of the noise linear-spectrum mean vector corresponding to the m-th class of the air-conduction noise statistical model.
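The claim's gain formulas themselves are not reproduced in this text. As an illustration only, a Wiener-style per-bin gain built from the class speech linear-spectrum mean and the noise linear-spectrum mean, with an assumed spectral floor, would look like:

```python
import numpy as np

def spectral_gain(speech_mean, noise_mean, floor=0.05):
    """A stand-in Wiener-style per-bin gain built from the m-th class
    speech linear-spectrum mean and the noise linear-spectrum mean;
    the patent's actual gain formulas are not reproduced here, and the
    spectral floor is an assumption of this sketch."""
    g = speech_mean / (speech_mean + noise_mean)
    return np.maximum(g, floor)
```

Bins dominated by speech get a gain near one; bins with no speech energy are clamped to the floor.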
8. The method according to claim 1, further comprising:
converting the non-air-conduction detection speech into mapped air-conduction speech according to a mapping model from non-air-conduction speech to air-conduction speech;
performing weighted fusion of the mapped air-conduction speech and the filtering-enhanced speech to obtain fusion-enhanced speech;
wherein the mapping model is established in advance from the synchronously acquired clean air-conduction training speech and non-air-conduction training speech.
9. The method according to claim 8, wherein the step of performing weighted fusion of the mapped air-conduction speech and the filtering-enhanced speech comprises:
computing by the following formulas the weight of the m-th-frame filtering-enhanced speech x_m and the weight of the m-th-frame mapped speech y_m,
where the two variance terms are respectively the amplitude variances of the m-th-frame filtering-enhanced speech x_m and mapped speech y_m, SNR_m is the signal-to-noise ratio of the m-th-frame filtering-enhanced speech x_m, and α and β are preset constants;
superposing the filtering-enhanced speech x_m and the mapped speech y_m with these weights by the following formula to obtain the fusion-enhanced speech:
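The fusion step itself is a per-frame linear weighted superposition. The sketch below assumes the two weights sum to one, and shows one assumed (logistic) way a weight could depend on the frame SNR; the claim's closed-form weights, in the amplitude variances, SNR_m, α, and β, are not reproduced in this text:

```python
import numpy as np

def fuse_frames(x_filt, y_map, w_x):
    """Linear weighted fusion of claim 9: per-frame weighted superposition
    of the filtering-enhanced frame and the mapped frame, assuming the
    two weights sum to one."""
    return w_x * x_filt + (1.0 - w_x) * y_map

def snr_based_weight(snr_db, alpha=0.1, beta=0.0):
    """An assumed logistic weighting in the frame SNR: high-SNR frames
    lean on the filtering-enhanced speech, low-SNR frames on the mapped
    speech.  This is a stand-in, not the patent's closed form."""
    return 1.0 / (1.0 + np.exp(-alpha * (snr_db - beta)))
```

With equal weights the fusion is a plain average of the two frames.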
10. The method according to claim 9, further comprising, before the step of performing weighted fusion of the mapped air-conduction speech and the filtering-enhanced speech:
performing endpoint detection on the air-conduction detection speech to obtain the speech-signal start time, intercepting all data frames of the filtering-enhanced speech before the signal start, and averaging their power as the power of the noise frames;
the signal-to-noise ratio SNR_m is computed by the following formula,
where the remaining term is the power of the m-th-frame filtering-enhanced speech x_m.
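The noise-power and SNR computation of claim 10 can be sketched directly; expressing the SNR in decibels is an assumption of this sketch:

```python
import numpy as np

def frame_power(frame):
    """Average power of one frame."""
    return float(np.mean(np.asarray(frame, dtype=float) ** 2))

def estimate_snrs(frames, speech_start):
    """Claim 10: average the power of all filtering-enhanced frames before
    the detected speech start as the noise power, then form each frame's
    SNR against it (expressed in dB here, an assumption)."""
    noise_p = np.mean([frame_power(f) for f in frames[:speech_start]])
    return [10.0 * np.log10(frame_power(f) / noise_p) for f in frames]
```

A frame with the same power as the pre-speech noise gets 0 dB; a frame 100 times stronger gets 20 dB.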
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910296425.4A CN110010149B (en) | 2016-01-14 | 2016-01-14 | Dual-sensor voice enhancement method based on statistical model |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610025390.7A CN105632512B (en) | 2016-01-14 | 2016-01-14 | A kind of dual sensor sound enhancement method and device based on statistical model |
CN201910296425.4A CN110010149B (en) | 2016-01-14 | 2016-01-14 | Dual-sensor voice enhancement method based on statistical model |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610025390.7A Division CN105632512B (en) | 2016-01-14 | 2016-01-14 | A kind of dual sensor sound enhancement method and device based on statistical model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110010149A true CN110010149A (en) | 2019-07-12 |
CN110010149B CN110010149B (en) | 2023-07-28 |
Family
ID=56047353
Family Applications (5)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910296437.7A Active CN110070883B (en) | 2016-01-14 | 2016-01-14 | Speech enhancement method |
CN201610025390.7A Active CN105632512B (en) | 2016-01-14 | 2016-01-14 | A kind of dual sensor sound enhancement method and device based on statistical model |
CN201910296425.4A Active CN110010149B (en) | 2016-01-14 | 2016-01-14 | Dual-sensor voice enhancement method based on statistical model |
CN201910296436.2A Active CN110085250B (en) | 2016-01-14 | 2016-01-14 | Method for establishing air conduction noise statistical model and application method |
CN201910296427.3A Active CN110070880B (en) | 2016-01-14 | 2016-01-14 | Establishment method and application method of combined statistical model for classification |
Family Applications Before (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910296437.7A Active CN110070883B (en) | 2016-01-14 | 2016-01-14 | Speech enhancement method |
CN201610025390.7A Active CN105632512B (en) | 2016-01-14 | 2016-01-14 | A kind of dual sensor sound enhancement method and device based on statistical model |
Family Applications After (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910296436.2A Active CN110085250B (en) | 2016-01-14 | 2016-01-14 | Method for establishing air conduction noise statistical model and application method |
CN201910296427.3A Active CN110070880B (en) | 2016-01-14 | 2016-01-14 | Establishment method and application method of combined statistical model for classification |
Country Status (1)
Country | Link |
---|---|
CN (5) | CN110070883B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107808662B (en) * | 2016-09-07 | 2021-06-22 | 斑马智行网络(香港)有限公司 | Method and device for updating grammar rule base for speech recognition |
CN107886967B (en) * | 2017-11-18 | 2018-11-13 | 中国人民解放军陆军工程大学 | A kind of bone conduction sound enhancement method of depth bidirectional gate recurrent neural network |
CN107993670B (en) * | 2017-11-23 | 2021-01-19 | 华南理工大学 | Microphone array speech enhancement method based on statistical model |
CN109584894A (en) * | 2018-12-20 | 2019-04-05 | 西京学院 | A kind of sound enhancement method blended based on radar voice and microphone voice |
CN109767783B (en) * | 2019-02-15 | 2021-02-02 | 深圳市汇顶科技股份有限公司 | Voice enhancement method, device, equipment and storage medium |
CN109767781A (en) * | 2019-03-06 | 2019-05-17 | 哈尔滨工业大学(深圳) | Speech separating method, system and storage medium based on super-Gaussian priori speech model and deep learning |
CN110265056B (en) * | 2019-06-11 | 2021-09-17 | 安克创新科技股份有限公司 | Sound source control method, loudspeaker device and apparatus |
CN110390945B (en) * | 2019-07-25 | 2021-09-21 | 华南理工大学 | Dual-sensor voice enhancement method and implementation device |
CN110797039B (en) * | 2019-08-15 | 2023-10-24 | 腾讯科技(深圳)有限公司 | Voice processing method, device, terminal and medium |
CN111724796B (en) * | 2020-06-22 | 2023-01-13 | 之江实验室 | Musical instrument sound identification method and system based on deep pulse neural network |
CN113178191A (en) * | 2021-04-25 | 2021-07-27 | 平安科技(深圳)有限公司 | Federal learning-based speech characterization model training method, device, equipment and medium |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO1992015155A1 (en) * | 1991-02-19 | 1992-09-03 | Motorola, Inc. | Interference reduction system |
JP2001236089A (en) * | 1999-12-17 | 2001-08-31 | Atr Interpreting Telecommunications Res Lab | Statistical language model generating device, speech recognition device, information retrieval processor and kana/kanji converter |
CN1750123A (en) * | 2004-09-17 | 2006-03-22 | 微软公司 | Method and apparatus for multi-sensory speech enhancement |
CN101080765A (en) * | 2005-05-09 | 2007-11-28 | 株式会社东芝 | Voice activity detection apparatus and method |
JP2008176155A (en) * | 2007-01-19 | 2008-07-31 | Kddi Corp | Voice recognition device and its utterance determination method, and utterance determination program and its storage medium |
CN101320566A (en) * | 2008-06-30 | 2008-12-10 | 中国人民解放军第四军医大学 | Non-air conduction speech reinforcement method based on multi-band spectrum subtraction |
CN102027536A (en) * | 2008-05-14 | 2011-04-20 | 索尼爱立信移动通讯有限公司 | Adaptively filtering a microphone signal responsive to vibration sensed in a user's face while speaking |
CN103208291A (en) * | 2013-03-08 | 2013-07-17 | 华南理工大学 | Speech enhancement method and device applicable to strong noise environments |
CN103229238A (en) * | 2010-11-24 | 2013-07-31 | 皇家飞利浦电子股份有限公司 | System and method for producing an audio signal |
US9058820B1 (en) * | 2013-05-21 | 2015-06-16 | The Intellisis Corporation | Identifying speech portions of a sound model using various statistics thereof |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7283850B2 (en) * | 2004-10-12 | 2007-10-16 | Microsoft Corporation | Method and apparatus for multi-sensory speech enhancement on a mobile device |
CN105224844B (en) * | 2014-07-01 | 2020-01-24 | 腾讯科技(深圳)有限公司 | Verification method, system and device |
2016
- 2016-01-14 CN CN201910296437.7A patent/CN110070883B/en active Active
- 2016-01-14 CN CN201610025390.7A patent/CN105632512B/en active Active
- 2016-01-14 CN CN201910296425.4A patent/CN110010149B/en active Active
- 2016-01-14 CN CN201910296436.2A patent/CN110085250B/en active Active
- 2016-01-14 CN CN201910296427.3A patent/CN110070880B/en active Active
Non-Patent Citations (5)
Title |
---|
GRACIARENA M., "Combining Standard and Throat Microphones for Robust Speech Recognition", IEEE Signal Processing Letters *
RAHMAN M. S., "Intelligibility Enhancement of Bone Conducted Speech by an Analysis-Synthesis Method", IEEE International Midwest Symposium *
ZHANG ZHENGYOU, LIU ZICHENG, "Multi-Sensory Microphones for Robust Speech Detection, Enhancement and Recognition", ICASSP *
XU Fang, "Model-Based Multi-Data-Stream Speech Enhancement", China Masters' Theses Full-text Database, Information Science and Technology *
NIU Yingli, "Multi-Sensor-Based Speech Enhancement", China Masters' Theses Full-text Database, Information Science and Technology *
Also Published As
Publication number | Publication date |
---|---|
CN110070883B (en) | 2023-07-28 |
CN105632512B (en) | 2019-04-09 |
CN110070880B (en) | 2023-07-28 |
CN110085250B (en) | 2023-07-28 |
CN105632512A (en) | 2016-06-01 |
CN110010149B (en) | 2023-07-28 |
CN110070880A (en) | 2019-07-30 |
CN110085250A (en) | 2019-08-02 |
CN110070883A (en) | 2019-07-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105632512B (en) | A kind of dual sensor sound enhancement method and device based on statistical model | |
CN105513605B (en) | The speech-enhancement system and sound enhancement method of mobile microphone | |
CN106971740B (en) | Sound enhancement method based on voice existing probability and phase estimation | |
Gao et al. | Joint training of front-end and back-end deep neural networks for robust speech recognition | |
CN103854662B (en) | Adaptive voice detection method based on multiple domain Combined estimator | |
CN108831499A (en) | Utilize the sound enhancement method of voice existing probability | |
CN106373589B (en) | A kind of ears mixing voice separation method based on iteration structure | |
CN108172238A (en) | A kind of voice enhancement algorithm based on multiple convolutional neural networks in speech recognition system | |
CN106971741A (en) | The method and system for the voice de-noising that voice is separated in real time | |
CN103646649A (en) | High-efficiency voice detecting method | |
CN108680245A (en) | Whale globefish class Click classes are called and traditional Sonar Signal sorting technique and device | |
CN108766459A (en) | Target speaker method of estimation and system in a kind of mixing of multi-person speech | |
CN109949823A (en) | A kind of interior abnormal sound recognition methods based on DWPT-MFCC and GMM | |
CN103730126B (en) | Noise suppressing method and noise silencer | |
CN110197665A (en) | A kind of speech Separation and tracking for police criminal detection monitoring | |
CN103208291A (en) | Speech enhancement method and device applicable to strong noise environments | |
Lv et al. | A permutation algorithm based on dynamic time warping in speech frequency-domain blind source separation | |
CN106023986A (en) | Voice identification method based on sound effect mode detection | |
Liu et al. | A novel pitch extraction based on jointly trained deep BLSTM recurrent neural networks with bottleneck features | |
CN110390945A (en) | A kind of dual sensor sound enhancement method and realization device | |
CN203165457U (en) | Voice acquisition device used for noisy environment | |
Hu et al. | Robust binaural sound localisation with temporal attention | |
CN111968671B (en) | Low-altitude sound target comprehensive identification method and device based on multidimensional feature space | |
CN111785262B (en) | Speaker age and gender classification method based on residual error network and fusion characteristics | |
CN107025911A (en) | Fundamental frequency detection method based on particle group optimizing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||