CN110010149A - Dual-sensor speech enhancement method based on a statistical model - Google Patents


Publication number
CN110010149A
CN110010149A · CN201910296425A · CN110010149B
Authority
CN
China
Prior art keywords
voice
air conduction
statistical model
detection
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910296425.4A
Other languages
Chinese (zh)
Other versions
CN110010149B (en)
Inventor
张军
陈鑫源
潘伟锵
宁更新
冯义志
余华
季飞
陈芳炯
Current Assignee
Shenzhen Voxtech Co Ltd
Original Assignee
Shenzhen Voxtech Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Voxtech Co Ltd filed Critical Shenzhen Voxtech Co Ltd
Priority to CN201910296425.4A (granted as CN110010149B)
Publication of CN110010149A
Application granted
Publication of CN110010149B
Legal status: Active

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 — Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/038 — Speech enhancement using band spreading techniques
    • G10L21/0272 — Voice signal separating
    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T — CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T90/00 — Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Abstract

The present invention discloses a dual-sensor speech enhancement method based on statistical models, comprising: synchronously acquiring an air-conduction test speech signal and a non-air-conduction test speech signal, detecting the endpoints of the air-conduction test speech, and building an air-conduction noise statistical model from the pure-noise segments of the air-conduction test speech; correcting the joint statistical model with the air-conduction noise statistical model and classifying the air-conduction test speech frames; computing the optimal air-conduction speech filter from the air-conduction linear-spectrum statistical model corresponding to the classification result together with the air-conduction noise statistical model; and filtering the air-conduction test speech with the optimal filter to obtain the filter-enhanced speech. The joint statistical model and the air-conduction linear-spectrum statistical models are built in advance from synchronously acquired clean air-conduction and non-air-conduction training speech.

Description

Dual-sensor speech enhancement method based on a statistical model
Technical field
This application is a divisional of patent application No. 201610025390.7, entitled "A dual-sensor speech enhancement method and device based on statistical models", filed by the applicant on January 14, 2016. The present invention relates to the field of digital signal processing, and in particular to a dual-sensor speech enhancement method based on statistical models.
Background technique
Communication is an important means of modern human interaction, and speech is its most common form; speech quality directly affects how accurately people receive information. During transmission, speech is inevitably corrupted by various kinds of environmental noise, which markedly degrades its quality and intelligibility. Speech enhancement techniques are therefore widely used in practice to process speech captured in noisy environments.
Speech enhancement extracts the useful speech signal from a noisy background and is the basic means of suppressing and reducing noise interference. Traditional speech enhancement operates on signals captured by air-conduction sensors (e.g., microphones). By processing approach, common techniques fall into model-based and non-model-based methods. Non-model-based methods include spectral subtraction, filtering, and wavelet transforms; they usually assume the noise is relatively stationary, and their performance degrades when the noise changes quickly. Model-based methods first build statistical models of the speech and noise signals and then obtain a minimum mean-square error (MMSE) or maximum a posteriori (MAP) estimate of the clean speech; such methods avoid musical noise and can handle non-stationary noise. However, because both classes of methods rely on air-conduction sensors such as microphones, their signals are easily corrupted by acoustic environmental noise, and system performance drops sharply in strong noise.
To mitigate the effect of strong noise on speech processing systems, non-air-conduction speech sensors take a different route from traditional air-conduction sensors: vibrations at the speaker's vocal cords, jawbone, and similar sites deform a reed or carbon film inside the sensor, changing its resistance and hence the voltage across it, so that the vibration signal is converted into an electrical (speech) signal. Because sound waves propagating through the air cannot deform the reed or carbon film, non-air-conduction sensors are immune to airborne sound and are highly robust to environmental acoustic noise. However, since they pick up speech through the vibration of jawbone, muscle, and skin, the captured speech sounds muffled and indistinct, its high-frequency components are severely attenuated, and its intelligibility is poor, which limits the practical application of non-air-conduction techniques.
Since using either an air-conduction or a non-air-conduction sensor alone has drawbacks, speech enhancement methods combining the advantages of both have appeared in recent years. These methods exploit the complementarity of the air-conduction and non-air-conduction signals through multi-sensor fusion and usually outperform single-sensor enhancement systems. Existing combined methods still have two shortcomings: (1) the air-conduction and non-air-conduction signals are mostly restored independently and only fused afterwards, so the complementarity between the two is not fully exploited during restoration; and (2) in changeable, strongly noisy environments, the statistical properties of the pure-speech segments of the air-conduction signal are severely disturbed and the SNR of the enhanced speech drops, so the benefit of fusion becomes insignificant.
Summary of the invention
The present invention provides a dual-sensor speech enhancement method based on statistical models, comprising: synchronously acquiring air-conduction and non-air-conduction test speech, detecting the endpoints of the air-conduction test speech, and building an air-conduction noise statistical model from its pure-noise segments; correcting the joint statistical model with the air-conduction noise statistical model and classifying the air-conduction test speech frames; computing the optimal air-conduction speech filter from the air-conduction linear-spectrum statistical model corresponding to the classification result together with the air-conduction noise statistical model; and filtering the air-conduction test speech with the optimal filter to obtain the filter-enhanced speech, wherein the joint statistical model and the air-conduction linear-spectrum statistical models are built in advance from synchronously acquired clean air-conduction and non-air-conduction training speech.
Compared with the prior art, the present invention has the following advantages and effects:
1. During air-conduction speech enhancement, the invention combines the non-air-conduction and air-conduction signals to build the statistical model used for classification, performs endpoint detection, and constructs the optimal air-conduction speech filter accordingly, improving the enhancement of the air-conduction speech and significantly increasing the robustness of the whole system.
2. The invention adopts a two-stage enhancement structure: when strong noise degrades the first-stage filtering of the air-conduction speech, the second stage adaptively fuses the filtered speech with the speech mapped from the non-air-conduction signal, so that a good enhancement effect is still obtained.
3. There is no distance restriction between the air-conduction and non-air-conduction sensors used in the invention, which is convenient for the user.
Detailed description of the invention
Fig. 1 is a flow chart of the dual-sensor speech enhancement method based on a statistical model disclosed in an embodiment of the present invention;
Fig. 2 is a flow chart of training the speech statistical models in an embodiment of the present invention;
Fig. 3 is a flow chart of building the mapping model from non-air-conduction speech to air-conduction speech in an embodiment of the present invention;
Fig. 4 is a flow chart of building the air-conduction noise statistical model in an embodiment of the present invention;
Fig. 5 is a flow chart of correcting the joint statistical model in an embodiment of the present invention;
Fig. 6 is a flow chart of estimating the optimal air-conduction speech filter in an embodiment of the present invention;
Fig. 7 is a flow chart of the weighted fusion of the mapped speech and the filter-enhanced speech in an embodiment of the present invention;
Fig. 8 is a structural block diagram of the dual-sensor speech enhancement device based on a statistical model disclosed in an embodiment of the present invention.
Specific embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is further described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only illustrate the invention and are not intended to limit it.
Embodiment one
This embodiment discloses a dual-sensor speech enhancement method based on statistical models; with reference to the detailed flow of Fig. 1, the method comprises the following steps:
Step S1: synchronously acquire clean air-conduction and non-air-conduction training speech, build the joint statistical model used for classification, and compute the air-conduction linear-spectrum statistical model corresponding to each class. This step can be divided into the following sub-steps, as shown in Fig. 2:
Step S1.1: synchronously acquire clean air-conduction and non-air-conduction training speech, split it into frames, and extract the feature parameters of each frame;
In this embodiment, clean, synchronized air-conduction and non-air-conduction training speech is acquired by the speech reception module. The input clean training speech is framed, pre-processed, and transformed with the discrete Fourier transform; mel filter banks then yield the mel-frequency cepstral coefficients (MFCCs) of both training signals, which serve as the training data of the joint statistical model.
In other embodiments, the LPCC or LSF coefficients of the air-conduction and non-air-conduction training speech are extracted instead.
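The framing and feature-extraction step can be sketched as follows. This is an illustrative simplification (hypothetical helpers `frame_signal` and `mfcc_like`; no pre-emphasis and no mel filter bank), not the embodiment's exact MFCC pipeline:

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def mfcc_like(x, frame_len=256, hop=128, n_ceps=12):
    """Toy cepstral features: windowed log magnitude spectrum followed by a DCT."""
    frames = frame_signal(x, frame_len, hop) * np.hamming(frame_len)
    logspec = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-10)
    # type-II DCT written out directly so that only numpy is needed
    k = np.arange(logspec.shape[1])
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * k + 1) / (2 * len(k))))
    return logspec @ basis.T

x = np.sin(2 * np.pi * 440 * np.arange(4000) / 8000.0)  # 0.5 s of a 440 Hz tone
feats = mfcc_like(x)
print(feats.shape)  # one 12-dimensional feature vector per frame
```

The same routine would be applied to both sensor signals frame-synchronously so their feature rows line up for splicing.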
Step S1.2: splice the feature parameters of the air-conduction and non-air-conduction training speech from step S1.1 into clean joint speech feature parameters;
In this embodiment, the cepstral feature vector sequence of the air-conduction training speech is denoted S_N = {s_N1, s_N2, ..., s_Nn}, where n is the number of speech data frames and s_Nl is the feature column vector of frame l; the cepstral feature vector sequence of the non-air-conduction training speech is denoted S_T = {s_T1, s_T2, ..., s_Tn}, with the same frame count n and frame-l feature column vector s_Tl. Splicing the frame-l cepstral feature parameters of the air-conduction and non-air-conduction training speech gives the frame-l joint cepstral feature vector z_l = [s_Nl^T, s_Tl^T]^T.
Step S1.3: train the cepstral-domain joint statistical model used for classification on the joint speech feature parameters obtained in step S1.2;
In this embodiment, the probability distribution of the joint training speech is fitted with a multi-stream Gaussian mixture model (GMM). The probability density function of the cepstral-domain joint statistical model is

p(z | λ) = Σ_{m=1..M} π_m Π_s [N(z_s; μ_m^s, Σ_m^s)]^{θ_s}

where s is the index of the speech data stream, M is the number of mixture components of the GMM, θ_s is the weight of data stream s, and π_m is the prior weight of mixture component m, with Σ_s θ_s = 1, θ_s ≥ 0, Σ_m π_m = 1, and π_m ≥ 0. μ_m^s and Σ_m^s are the mean vector and covariance matrix of data stream s in class m of the cepstral-domain joint statistical model, z_s is the feature vector of stream s, and N(·) is the single-Gaussian probability density function. Let λ denote the parameter set of the multi-stream GMM and Z = {z_1, z_2, ..., z_n} the set of joint cepstral training feature vectors; the likelihood of the cepstral-domain joint statistical model is then

P(Z | λ) = Π_{l=1..n} p(z_l | λ)

The parameter set λ that maximizes P(Z | λ) can be found with the EM algorithm (expectation-maximization algorithm).
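The EM fit of step S1.3 can be sketched with a plain diagonal-covariance GMM (a single data stream with equal stream weights; `em_gmm` is an illustrative helper, not the patent's multi-stream model):

```python
import numpy as np

def em_gmm(Z, M=2, iters=50):
    """Fit a diagonal-covariance GMM by EM; returns (weights, means, variances)."""
    n, d = Z.shape
    mu = Z[[0, n // 2]].astype(float).copy()      # deterministic two-point init (M = 2)
    var = np.tile(Z.var(axis=0) + 1e-6, (M, 1))
    pi = np.full(M, 1.0 / M)
    for _ in range(iters):
        # E-step: responsibilities r[l, m] = p(m | z_l)
        logp = (-0.5 * (((Z[:, None, :] - mu) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(-1) + np.log(pi))
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        Nm = r.sum(axis=0) + 1e-12
        pi = Nm / n
        mu = (r.T @ Z) / Nm[:, None]
        var = (r.T @ (Z ** 2)) / Nm[:, None] - mu ** 2 + 1e-6
    return pi, mu, var

rng = np.random.default_rng(1)
Z = np.vstack([rng.normal(0.0, 1.0, (200, 2)),    # frames from one "class"
               rng.normal(5.0, 1.0, (200, 2))])   # frames from another
pi, mu, var = em_gmm(Z)
print(sorted(mu[:, 0]))
```

With two well-separated clusters the fitted component means land near 0 and 5; the multi-stream version would simply evaluate the per-stream Gaussians and raise them to the stream weights θ_s before mixing.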
Step S1.4: classify all joint cepstral training frames, compute the linear-spectrum statistical parameters of the air-conduction speech over all joint frames belonging to each class, and build the air-conduction linear-spectrum statistical model corresponding to each class.
In this embodiment, each Gaussian component of the multi-stream GMM represents one class. For every joint cepstral training frame, the probability that the frame-l joint cepstral feature vector z_l belongs to class m of the cepstral-domain joint statistical model is computed as

p(m | z_l) = π_m Π_s [N(z_l^s; μ_m^s, Σ_m^s)]^{θ_s} / Σ_{m'=1..M} π_{m'} Π_s [N(z_l^s; μ_{m'}^s, Σ_{m'}^s)]^{θ_s}

where z_l^s is the cepstral feature vector of data stream s in frame l. The mixture component (class) with the largest probability, max_m {p(m | z_l)}, is recorded.
After all joint cepstral frames have been classified, the linear-spectrum mean of the air-conduction speech over all joint frames gathered into the same class is computed and stored as the parameter of the air-conduction linear-spectrum statistical model associated with that class of the cepstral-domain joint statistical model.
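The per-frame classification of step S1.4 amounts to evaluating GMM posteriors and averaging the linear spectra of the frames assigned to each class. A minimal sketch with hypothetical fixed model parameters:

```python
import numpy as np

def gmm_posteriors(Z, pi, mu, var):
    """p(m | z_l) for a diagonal GMM, plus the winning class per frame."""
    logp = (-0.5 * (((Z[:, None, :] - mu) ** 2) / var
                    + np.log(2 * np.pi * var)).sum(-1) + np.log(pi))
    logp -= logp.max(axis=1, keepdims=True)
    post = np.exp(logp)
    post /= post.sum(axis=1, keepdims=True)
    return post, post.argmax(axis=1)

# hypothetical two-class model over a 1-D joint feature
pi = np.array([0.5, 0.5])
mu = np.array([[0.0], [5.0]])
var = np.ones((2, 1))
Z = np.array([[0.1], [4.9], [5.2], [-0.3]])           # four "joint frames"
# per-frame linear spectra of the air-conduction speech (toy, 2 bins)
spectra = np.array([[1.0, 2.0], [10.0, 20.0], [12.0, 22.0], [2.0, 1.0]])
post, cls = gmm_posteriors(Z, pi, mu, var)
class_means = np.stack([spectra[cls == m].mean(axis=0) for m in range(2)])
print(cls.tolist(), class_means.tolist())
```

The `class_means` rows play the role of the per-class air-conduction linear-spectrum model parameters saved in this step.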
In other embodiments, a multi-stream hidden Markov model is used as the joint statistical model, each Gaussian component of the model representing one class.
Step S2: using the air-conduction and non-air-conduction training speech synchronously acquired in step S1, build the mapping model from non-air-conduction speech to air-conduction speech. This step is divided into the following sub-steps, as shown in Fig. 3:
Step S2.1: frame the clean non-air-conduction and air-conduction training speech synchronously acquired in step S1, and feed the non-air-conduction training frames as input, with the simultaneous air-conduction training frames as the ideal output, into an initialized feedforward neural network;
In this embodiment, the air-conduction and non-air-conduction training speech is first framed and the line spectral frequency (LSF) parameters of the air-conduction and non-air-conduction training frames are extracted, giving the input-output pairs (L_T, L_N) of the feedforward network, where L_T is the LSF vector of the non-air-conduction training speech (network input) and L_N is the LSF vector of the air-conduction training speech (ideal network output); the feedforward network weights are then initialized.
Step S2.2: train the weight coefficients of the feedforward neural network with the scaled conjugate gradient algorithm under the minimum mean-square error criterion, so that the error between the actual and ideal outputs is minimized, yielding the mapping model from non-air-conduction speech to air-conduction speech;
In this embodiment, the weight vector connecting layer l of the feedforward network to neuron j of layer l+1 is

w_j^(l) = [w_{1j}^(l), ..., w_{N_l j}^(l), b_j^(l)]^T

where w_{ij}^(l) is the connection weight from neuron i of layer l to neuron j of layer l+1, N_l is the number of neurons in layer l, and b_j^(l) is the threshold of neuron j of layer l+1. Stacking all w_j^(l) gives the feedforward network weight vector

w = [w_1^(1), ..., w_{N_2}^(1), ..., w_1^(M-1), ..., w_N^(M-1)]^T

where M is the number of network layers and N is the number of output-layer neurons. With P training speech frames, the error between the actual network output L* and the ideal output L is

E(w) = (1/2P) Σ_{p=1..P} ||L*_p − L_p||²
The feedforward network weights are iterated with the scaled conjugate gradient algorithm; the (k+1)-th iterate is

w_{k+1} = w_k + α_k P_k   (14)

where the search direction P_k and step size α_k follow the usual conjugate-gradient relations

P_k = −E'(w_k) + β_k P_{k−1},   α_k = −P_k^T E'(w_k) / (P_k^T E''(w_k) P_k)

with E'(w_k) and E''(w_k) the first and second derivatives of E(w). When E'(w_k) = 0 the error E(w) reaches its minimum and the optimal weight coefficients W_best are obtained.
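The mapping network of step S2 can be sketched as a one-hidden-layer regressor. For brevity this sketch trains with plain full-batch gradient descent rather than the scaled conjugate gradient algorithm of the embodiment, and the target y = 2x + 0.5 is only a stand-in for the (unknown) bone-to-air-conduction LSF mapping:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, (200, 1))   # stand-in "non-air-conduction" features
Y = 2.0 * X + 0.5                      # stand-in "air-conduction" targets
W1 = rng.normal(0.0, 0.5, (1, 8)); b1 = np.zeros(8)
W2 = rng.normal(0.0, 0.5, (8, 1)); b2 = np.zeros(1)
lr = 0.1
for _ in range(3000):
    H = np.tanh(X @ W1 + b1)           # hidden layer
    P = H @ W2 + b2                    # linear output layer
    err = P - Y                        # d(MSE)/dP up to a constant
    gW2 = H.T @ err / len(X); gb2 = err.mean(axis=0)
    dH = (err @ W2.T) * (1.0 - H ** 2)
    gW1 = X.T @ dH / len(X); gb1 = dH.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2
mse = float(((np.tanh(X @ W1 + b1) @ W2 + b2 - Y) ** 2).mean())
print(mse)
```

The minimum mean-square error criterion is the same; scaled conjugate gradients only change how the weight update direction and step size are chosen.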
Step S3: synchronously acquire air-conduction and non-air-conduction test speech and detect the endpoints of the air-conduction test speech, then build the spectral-domain air-conduction noise statistical model from the pure-noise segments of the air-conduction test speech. This step uses the following sub-steps, as shown in Fig. 4:
Step S3.1: synchronously acquire the air-conduction and non-air-conduction test speech and split it into frames;
Step S3.2: from the short-time autocorrelation function R_w(k) and the short-time energy E_w of each non-air-conduction test frame, compute the frame's short-time average threshold-crossing rate C_w(n):

C_w(n) = Σ_k { |sgn[R_w(k) − αT] − sgn[R_w(k−1) − αT]| + |sgn[R_w(k) + αT] − sgn[R_w(k−1) + αT]| } w(n−k)   (17)

where sgn[·] is the sign operation, α is an adjustment factor, w(n) is a window function, and T is the initial threshold. When C_w(n) exceeds a preset threshold, the frame is judged to be speech; otherwise it is noise. The endpoint locations of the non-air-conduction test speech signal are obtained from the per-frame decisions;
Step S3.3: take the instants corresponding to the non-air-conduction speech endpoints detected in step S3.2 as the endpoints of the air-conduction test speech, and extract the pure-noise segments of the air-conduction test speech;
Step S3.4: compute the linear-spectrum mean of the pure-noise segments of the air-conduction test speech and save this mean parameter, establishing the spectral-domain air-conduction noise statistical model.
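Steps S3.2 to S3.4 reduce to an endpoint decision per frame followed by averaging the spectra of the non-speech frames. A simplified energy-only sketch (the embodiment also uses the threshold-crossing rate of equation (17)):

```python
import numpy as np

def energy_vad(frames, thresh_ratio=0.5):
    """Mark a frame as speech when its short-time energy exceeds a threshold."""
    e = (frames ** 2).sum(axis=1)
    return e > thresh_ratio * e.max()

rng = np.random.default_rng(0)
frames = np.vstack([rng.normal(0.0, 0.05, (10, 64)),   # 10 quiet noise frames
                    rng.normal(0.0, 1.00, (10, 64))])  # 10 loud "speech" frames
is_speech = energy_vad(frames)
# average magnitude spectrum of the non-speech frames = toy noise model
noise_spec_mean = np.abs(np.fft.rfft(frames[~is_speech], axis=1)).mean(axis=0)
print(is_speech.tolist(), noise_spec_mean.shape)
```

In the method proper the endpoint decision is made on the noise-immune non-air-conduction channel, and the resulting frame labels are transferred to the air-conduction channel before the noise mean is computed.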
Step S4: correct the joint statistical model of step S1 with the air-conduction noise statistical model, classify the air-conduction test speech frames, compute the optimal air-conduction speech filter from the air-conduction linear-spectrum statistical model corresponding to the classification result together with the air-conduction noise statistical model, and apply the filter to enhance the air-conduction test speech.
In this embodiment, model compensation is first used to correct the air-conduction data-stream parameters of the joint statistical model, through the following sub-steps, as shown in Fig. 5:
Step S4.1a: convert the joint statistical model parameters from the mel-cepstral domain to the linear-spectrum domain. In this embodiment, the inverse discrete cosine transform C^{-1} first maps the mean and covariance of class m of the mel-cepstral joint model to the log domain: μ^{log} = C^{-1} μ^{cep} and Σ^{log} = C^{-1} Σ^{cep} (C^{-1})^T, where μ^{log} and Σ^{log} are the log-domain mean and covariance. These are then mapped from the log domain to the linear-spectrum domain using the log-normal moment relations:

μ_i^{lin} = exp(μ_i^{log} + Σ_{ii}^{log}/2),   Σ_{ij}^{lin} = μ_i^{lin} μ_j^{lin} (exp(Σ_{ij}^{log}) − 1)

where μ_i^{lin} is the i-th component of the linear-spectrum mean vector and Σ_{ij}^{lin} is the element in row i, column j of the linear-spectrum covariance matrix.
Step S4.2a: exploiting the fact that clean air-conduction speech and air-conduction noise are additive in the linear-spectrum domain, correct the air-conduction data-stream parameters of the joint statistical model. In this embodiment the air-conduction stream parameters are corrected as

μ̂^{lin} = g μ^{lin} + μ_n^{lin},   Σ̂^{lin} = g² Σ^{lin} + Σ_n^{lin}

where g is a gain determined by the signal-to-noise ratio of the air-conduction test speech, μ_n^{lin} and Σ_n^{lin} are the linear-spectrum mean and variance of the air-conduction noise, and μ̂^{lin} and Σ̂^{lin} are the corrected linear-spectrum mean and variance of the air-conduction data stream.
Step S4.3a: convert the corrected linear-spectrum joint model statistical parameters back to the original cepstral domain by inverting the transforms of formulas (13) and (14), obtaining the corrected joint cepstral-domain statistical model.
After the joint statistical model has been corrected, the probability p(m | z_l) that each joint test feature vector z_l belongs to class m of the joint statistical model can be computed.
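The domain conversions of steps S4.1a and S4.3a rest on the standard log-normal moment relations between a Gaussian in the log domain and its linear-domain mean and variance. A sketch under that assumption (diagonal covariance only; `noise_mu` is a hypothetical noise spectral mean):

```python
import numpy as np

def log_to_linear(mu_log, var_log):
    """Gaussian in the log domain -> mean/variance in the linear domain
    (standard log-normal moments, diagonal covariance only)."""
    mu_lin = np.exp(mu_log + 0.5 * var_log)
    var_lin = (np.exp(var_log) - 1.0) * mu_lin ** 2
    return mu_lin, var_lin

def linear_to_log(mu_lin, var_lin):
    """Inverse of log_to_linear."""
    var_log = np.log(1.0 + var_lin / mu_lin ** 2)
    mu_log = np.log(mu_lin) - 0.5 * var_log
    return mu_log, var_log

mu_log = np.array([1.0, 2.0])
var_log = np.array([0.1, 0.2])
mu_lin, var_lin = log_to_linear(mu_log, var_log)
noise_mu = np.array([0.5, 0.5])        # hypothetical noise spectral mean
mu_comp = mu_lin + noise_mu            # additive combination in the linear domain
mu_back, var_back = linear_to_log(mu_lin, var_lin)
print(np.allclose(mu_back, mu_log), np.allclose(var_back, var_log))
```

The round trip is exact, which is what lets the corrected linear-spectrum parameters be carried back to the cepstral domain in step S4.3a.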
The computation of the optimal air-conduction speech filter in step S4 comprises the following sub-steps, as shown in Fig. 6:
Step S4.1b: extract the joint feature parameters of the air-conduction and non-air-conduction test speech, and compute the output probability p(m | z_l) of each joint test frame for every class of the corrected joint statistical model;
Step S4.2b: from the above output probabilities, compute the weights of the non-air-conduction and air-conduction test data streams in the joint statistical model, using the following steps:
Step S4.2.1: set the initial weight of the air-conduction test speech to w_0 and that of the non-air-conduction test speech to 1 − w_0, set the iteration count t = 0, and compute Diff_t,
where M denotes the number of mixture components, L the number of speech frames, p(j | z_l) and p(k | z_l) the probabilities that the frame-l joint test vector z_l belongs to classes j and k of the joint statistical model, d(μ_k, μ_j) the distance between the statistical parameters of classes k and j, and μ_k, μ_j the means of classes k and j of the joint statistical model.
Step S4.2.2: compute the air-conduction speech weight θ_1(Diff_t) and the non-air-conduction speech weight θ_2(Diff_t) = 1 − θ_1(Diff_t), recompute p(j | z_l) and p(k | z_l) with the updated weights, and then compute Diff_{t+1} according to formula (23);
Step S4.2.3: if |Diff_{t+1} − Diff_t| < ξ, where ξ is a preset threshold, stop updating the weights and execute step S4.2.4; otherwise set t = t + 1 and go to step S4.2.2;
Step S4.2.4: compute the optimal weights θ_1(Diff_T) and θ_2(Diff_T) from Diff_T, where T is the value of t when the update stopped.
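The weight update of steps S4.2.1 to S4.2.4 is a fixed-point iteration stopped by the threshold ξ. The sketch below keeps only that control flow; the logistic weight map and the `tanh` stand-in for the class-separability measure Diff are illustrative assumptions, since the patent's formula (23) is not reproduced here:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def iterate_weights(diff_fn, xi=1e-4, max_iter=100):
    """Fixed-point iteration: update the stream weight from Diff, stop when
    successive Diff values differ by less than the threshold xi."""
    diff = diff_fn(0.5)                 # start from equal stream weights
    w_air = sigmoid(diff)
    for _ in range(max_iter):
        new_diff = diff_fn(w_air)
        if abs(new_diff - diff) < xi:
            break
        diff = new_diff
        w_air = sigmoid(diff)
    return w_air, 1.0 - w_air

# toy stand-in separability measure: grows with the weight, then saturates
w_air, w_bone = iterate_weights(lambda w: np.tanh(2.0 * w))
print(w_air, w_bone)
```

Whatever the actual Diff measure, the two stream weights always sum to one, matching θ_2 = 1 − θ_1.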
Step S4.3b: classify the air-conduction test speech frames with the joint statistical model obtained in step S4.2b, then compute the optimal air-conduction speech filter from the air-conduction linear-spectrum statistical model corresponding to the classification result together with the air-conduction noise statistical model, using the following steps:
Step S4.3.1: with the optimal weights θ_1(Diff_T) and θ_2(Diff_T), compute the probability p(m | z_l) that the joint test frame z_l belongs to class m of the corrected joint statistical model;
Step S4.3.2: compute the frequency-domain gain function of the optimal air-conduction speech filter, a posterior-weighted Wiener-type gain consistent with the quantities defined below:

G_i(z_l) = Σ_{m=1..M} p(m | z_l) · μ_{m,i} / (μ_{m,i} + μ_{n,m,i}),   i = 1, ..., K

where K is the length of the mean vector of class m of the joint statistical model, μ_{m,i} is the i-th value of the air-conduction linear-spectrum mean vector corresponding to class m, and μ_{n,m,i} is the i-th value of the noise linear-spectrum mean vector corresponding to class m of the air-conduction noise statistical model.
After the frequency-domain gain function of the optimal air-conduction speech filter has been obtained, the air-conduction test speech is transformed to the frequency domain, its phase information is retained, its magnitude spectrum is scaled by G(z_l), and the result is transformed back to the time domain, yielding the filter-enhanced speech.
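Applying the gain while keeping the phase, as described above, can be sketched directly (a uniform toy gain stands in for the class-dependent G(z_l)):

```python
import numpy as np

def apply_gain(frame, gain):
    """Scale the magnitude spectrum by per-bin gains while keeping the phase."""
    spec = np.fft.rfft(frame)
    mag, phase = np.abs(spec), np.angle(spec)
    return np.fft.irfft(mag * gain * np.exp(1j * phase), n=len(frame))

frame = np.sin(2 * np.pi * 4 * np.arange(64) / 64.0)   # one toy frame
gain = np.full(33, 0.5)                                # uniform gain on all rfft bins
out = apply_gain(frame, gain)
print(np.allclose(out, 0.5 * frame))
```

A uniform gain of 0.5 simply halves the frame, confirming that only magnitudes are modified.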
In other embodiments, to improve computational efficiency, the gain function of the optimal air-conduction speech filter is computed with a simplified formula.
Step S5: using the mapping model from non-air-conduction speech to air-conduction speech obtained in step S2, convert the non-air-conduction test speech into mapped air-conduction speech;
Step S6: linearly fuse the mapped speech obtained in step S5 with the filter-enhanced speech obtained in step S4 to obtain the fusion-enhanced speech, using the following steps, as shown in Fig. 7:
Step S6.1: compute the weight w_{x_m} of the frame-m filter-enhanced speech x_m and the weight w_{y_m} of the frame-m mapped speech y_m.
In this embodiment, all data frames of the filter-enhanced speech x_m preceding the speech start point obtained by the endpoint detection of step S3 are intercepted, and their mean power is taken as the noise-frame power P_n. The weights of x_m and y_m are then computed,
where σ²_{x_m} and σ²_{y_m} are the amplitude variances of the frame-m filter-enhanced speech x_m and mapped speech y_m respectively, α and β are preset constants, and SNR_m is the signal-to-noise ratio of the frame-m filter-enhanced speech x_m:

SNR_m = 10 log10(P_{x_m} / P_n)

where P_{x_m} is the power of x_m.
Step S6.2: superpose the filter-enhanced speech x_m and the mapped speech y_m with these weights to obtain the fusion-enhanced speech.
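The fusion of step S6 weights the filtered and mapped speech by frame SNR. Since the patent's exact weight formulas depend on the amplitude variances and the constants α, β, this sketch substitutes a simple logistic function of the frame SNR as the weight:

```python
import numpy as np

def fuse(x, y, noise_power, alpha=1.0, beta=1.0):
    """SNR-driven linear fusion of a filtered frame x and a mapped frame y."""
    snr = 10.0 * np.log10((x ** 2).mean() / noise_power)
    w = 1.0 / (1.0 + np.exp(-alpha * (snr - beta)))  # high SNR -> trust x
    return w * x + (1.0 - w) * y, w

x = np.ones(64)          # toy filter-enhanced frame
y = np.zeros(64)         # toy mapped frame
fused_hi, w_hi = fuse(x, y, noise_power=1e-4)   # low-noise frame
fused_lo, w_lo = fuse(x, y, noise_power=1e2)    # high-noise frame
print(w_hi, w_lo)
```

The behaviour matches the design intent: in quiet frames the filtered speech dominates, while in very noisy frames the output falls back to the speech mapped from the noise-immune non-air-conduction sensor.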
Embodiment two
Embodiment two discloses a dual-sensor speech enhancement device based on the model, composed of a speech reception module, a speech statistical model training module, an air-conduction noise statistical model estimation module, an air-conduction test speech filtering enhancement module, a speech mapping module, and a speech fusion enhancement module; its structure is shown in Fig. 8.
The speech reception module synchronously acquires clean air-conduction and non-air-conduction training speech.
The speech statistical model training module builds the joint statistical model and the air-conduction linear-spectrum statistical models.
The air-conduction noise statistical model estimation module detects the endpoints of the air-conduction test speech and then builds the air-conduction noise statistical model from its pure-noise segments.
The air-conduction test speech filtering enhancement module corrects the statistical parameters of the joint statistical model with the air-conduction noise statistical model, classifies the air-conduction test speech frames, computes the optimal air-conduction speech filter from the air-conduction linear-spectrum statistical model corresponding to the classification result together with the air-conduction noise statistical model, and filters the air-conduction test speech to obtain the filter-enhanced speech.
The speech mapping module builds the mapping model from non-air-conduction speech to air-conduction speech and converts the non-air-conduction test speech into mapped speech with air-conduction characteristics according to this model.
The speech fusion enhancement module performs weighted fusion of the mapped speech and the filter-enhanced speech to obtain the fusion-enhanced speech.
As shown in Fig. 8, the speech reception module is connected to the speech statistical model training module, the air-conduction noise statistical model estimation module, the air-conduction test speech filtering enhancement module, and the speech mapping module; the speech statistical model training module is connected to the air-conduction test speech filtering enhancement module; the air-conduction noise statistical model estimation module is connected to the air-conduction test speech filtering enhancement module; the air-conduction test speech filtering enhancement module is connected to the speech fusion enhancement module; and the speech mapping module is connected to the speech fusion enhancement module.
The speech reception module comprises two sub-modules, an air-conduction speech sensor and a non-air-conduction speech sensor; the former acquires air-conduction speech data and the latter non-air-conduction speech data. The speech statistical model training module comprises a joint statistical model sub-module and an air-conduction linear-spectrum statistical model sub-module, which build the joint statistical model and the air-conduction linear-spectrum statistical models. The air-conduction noise statistical model estimation module estimates the current ambient noise of the system, corrects the joint statistical model, and participates in the computation of the filter coefficients. The air-conduction test speech filtering enhancement module is composed of a joint statistical model correction sub-module, a joint test speech classification sub-module, an optimal air-conduction filter coefficient generation sub-module, and an air-conduction test speech filtering sub-module: the correction sub-module corrects the statistical parameters of the joint statistical model; the classification sub-module classifies the test speech and passes the classification result to the optimal air-conduction filter coefficient generation sub-module, which computes the filter parameters; and the filtering sub-module finally produces the filter-enhanced air-conduction speech. The speech mapping module maps the non-air-conduction test speech to air-conduction speech. The speech fusion enhancement module comprises an adaptive weight generation sub-module and a linear fusion sub-module: the former computes the weights of the mapped speech and the filter-enhanced speech, and the latter uses these weights to linearly fuse the mapped speech and the filter-enhanced speech into the fusion-enhanced speech.
Among the above sub-modules, the air-conduction speech sensor is connected to the air-conduction noise statistical-model estimation module, the joint statistical model sub-module, the joint detection-speech classification sub-module, and the air-conducted detection-speech filtering sub-module; the non-air-conduction speech sensor is connected to the joint statistical model sub-module, the air-conduction noise statistical-model estimation module, the speech mapping module, and the joint detection-speech classification sub-module. The joint statistical model sub-module and the air-conducted speech linear-spectrum statistical model sub-module are connected to the joint statistical-model correction sub-module; the air-conducted speech linear-spectrum statistical model training module is connected to the optimal air-conduction filter-coefficient generation sub-module and participates in the computation of the filter coefficients.
The air-conduction noise model estimation module is connected to the joint statistical-model correction sub-module and the optimal air-conduction filter-coefficient generation sub-module. The joint statistical-model correction sub-module is connected to the optimal air-conduction filter-coefficient generation sub-module and the air-conducted detection-speech filtering sub-module; the joint detection-speech classification sub-module is connected to the optimal air-conduction filter-coefficient generation sub-module; and the optimal air-conduction filter-coefficient generation sub-module is connected to the air-conducted detection-speech filtering sub-module. The air-conducted detection-speech filtering sub-module is connected to the adaptive-weight generation sub-module and the linear fusion sub-module; the speech mapping module is connected to the adaptive-weight generation sub-module and the linear fusion sub-module; and the adaptive-weight generation sub-module is connected to the linear fusion sub-module.
It should be noted that the modules in the above apparatus embodiment are divided only according to functional logic; other divisions are possible as long as the corresponding functions can be realized. In addition, the specific names of the modules are only for convenience of mutual distinction and are not intended to limit the protection scope of the present invention.
The above embodiments are preferred embodiments of the present invention, but the embodiments of the present invention are not limited thereto; any change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention shall be regarded as an equivalent replacement and is included within the protection scope of the present invention.

Claims (10)

1. A dual-sensor speech enhancement method based on statistical models, characterized by comprising:
synchronously acquiring air-conducted detection speech and non-air-conducted detection speech, detecting the endpoints of the air-conducted detection speech, and establishing an air-conduction noise statistical model from the pure-noise segments of the air-conducted detection speech;
correcting a joint statistical model using the air-conduction noise statistical model, and classifying the frames of the air-conducted detection speech;
computing an optimal air-conduction speech filter from the air-conducted speech linear-spectrum statistical model corresponding to the classification result and from the air-conduction noise statistical model;
filtering the air-conducted detection speech with the optimal air-conduction speech filter to obtain filter-enhanced speech,
wherein the joint statistical model and the air-conducted speech linear-spectrum statistical model are established in advance from synchronously acquired clean air-conducted training speech and non-air-conducted training speech.
2. The method according to claim 1, wherein the step of synchronously acquiring air-conducted detection speech and non-air-conducted detection speech, detecting the endpoints of the air-conducted detection speech, and establishing the air-conduction noise statistical model from the pure-noise segments of the air-conducted detection speech comprises:
synchronously acquiring and framing the air-conducted detection speech and the non-air-conducted detection speech;
computing, from the short-time autocorrelation function and the short-time energy of each non-air-conducted detection speech frame, its short-time average threshold-crossing rate; when the short-time average threshold-crossing rate exceeds a preset threshold, judging that frame to be a speech signal, and otherwise noise;
obtaining the endpoint locations of the non-air-conducted detection speech signal from the per-frame decisions;
taking the instants corresponding to the detected endpoints of the non-air-conducted detection speech signal as the endpoints of the air-conducted detection speech, and extracting the pure-noise segments of the air-conducted detection speech;
computing the linear-spectrum mean of the pure-noise segments of the air-conducted detection speech, and saving this mean as the statistical-model parameter of the air-conduction noise.
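The framing and endpoint-extraction bookkeeping of claim 2 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the tail-discarding framing convention, and the representation of decisions as a boolean array are all assumptions.

```python
import numpy as np

def split_frames(x, frame_len, hop):
    # Frame a 1-D signal into overlapping frames, discarding the tail
    # (the patent does not specify its framing convention).
    n = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def detect_endpoints(speech_flags):
    # speech_flags: boolean per-frame decisions (True = speech frame).
    # Returns the (first, last) speech-frame indices, or None if all noise;
    # frames before the first index form the pure-noise segment.
    idx = np.flatnonzero(speech_flags)
    if idx.size == 0:
        return None
    return int(idx[0]), int(idx[-1])
```

The pure-noise segment used for the noise statistical model is then simply the frames preceding the first returned index.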
3. The method according to claim 2, wherein the short-time average threshold-crossing rate is computed by the following formula:
Cw(n) = Σk { |sgn[Rw(k) − αT] − sgn[Rw(k−1) − αT]| + |sgn[Rw(k) + αT] − sgn[Rw(k−1) + αT]| } w(n−k),
where sgn[·] is the sign operation, α is a regulatory factor, w(n) is the window function, T is the initial threshold, Rw(k) is the short-time autocorrelation function, Ew is the short-time energy, and Cw(n) is the short-time average threshold-crossing rate.
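Under the reconstruction above, the rate counts crossings of the levels ±αT in the frame's autocorrelation sequence, weighted by the window. A sketch follows; since the source elides the defining formula of the regulatory factor α (presumably a function of the short-time energy Ew), α and T are folded into a single level `alpha_T` here as an assumption.

```python
import numpy as np

def threshold_crossing_rate(r, alpha_T, window):
    # r: short-time autocorrelation sequence R_w(k) of one frame
    # alpha_T: the combined level alpha * T
    # window: window weights w(n - k), same length as r
    sgn = np.sign
    # crossings of the +alpha*T level
    up = np.abs(sgn(r[1:] - alpha_T) - sgn(r[:-1] - alpha_T))
    # crossings of the -alpha*T level
    dn = np.abs(sgn(r[1:] + alpha_T) - sgn(r[:-1] + alpha_T))
    return float(np.sum((up + dn) * window[1:]))
```

Each crossing contributes |sgn difference| = 2, so a frame oscillating through both levels yields a large rate, which claim 2 compares against a preset threshold.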
4. The method according to claim 1, wherein the joint statistical model is corrected by the following steps:
converting the parameters of the joint statistical model to the linear spectral domain;
correcting the air-conducted speech data-stream parameters of the joint statistical model using the fact that clean air-conducted speech and air-conduction noise are additive in the linear spectral domain;
converting the corrected linear-spectral-domain parameters of the joint statistical model back to the original parameter domain to obtain the corrected joint statistical model;
wherein the air-conducted speech data-stream parameters of the joint statistical model are the means and covariances of the Gaussian components of a Gaussian mixture model or hidden Markov model.
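The convert–add–convert-back correction of claim 4 can be sketched for the mean parameters. The patent does not state the original parameter domain; log-spectral means are assumed here purely for illustration, so the conversion to the linear spectral domain is `exp` and the conversion back is `log`.

```python
import numpy as np

def correct_means(mu_log_clean, mu_log_noise):
    # Assumed: model means are stored in the log-spectral domain.
    # Step 1: convert to the linear spectral domain.
    # Step 2: clean speech + noise are additive there (claim 4).
    # Step 3: convert back to the original (log) domain.
    mu_lin = np.exp(mu_log_clean) + np.exp(mu_log_noise)
    return np.log(mu_lin)
```

Covariances would be corrected analogously, component by component, under the same additivity assumption.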
5. The method according to claim 1, wherein the step of computing the optimal air-conduction speech filter from the air-conducted speech linear-spectrum statistical model corresponding to the classification result and from the air-conduction noise statistical model comprises:
extracting the joint feature parameters of the air-conducted detection speech and the non-air-conducted detection speech, and computing, for each frame of joint detection speech, the output probability of each class of the corrected joint statistical model;
computing from the above output probabilities the weight parameters of the non-air-conducted detection-speech data stream and the air-conducted detection-speech data stream in the joint statistical model;
classifying the air-conducted detection speech frames with the joint statistical model updated by the above weight parameters, and then computing the optimal air-conduction speech filter from the air-conducted speech linear-spectrum statistical model corresponding to the classification result and from the air-conduction noise statistical model.
6. The method according to claim 5, wherein the weight parameters of the non-air-conducted detection-speech data stream and the air-conducted detection-speech data stream are computed by the following steps:
setting the initial weight of the air-conducted detection speech to w0 and the initial weight of the non-air-conducted detection speech to 1 − w0, setting the iteration number t = 0, and computing Difft,
where M is the number of mixture components of the model, L is the number of speech frames, p(j|zl) and p(k|zl) are respectively the probabilities that the l-th frame of joint detection speech zl belongs to the j-th and k-th classes of the joint statistical model, the distance term is the distance between the statistical parameters of the k-th and j-th classes of the joint statistical model, and the remaining terms are the statistical parameters of the k-th and j-th classes of the joint statistical model;
computing the air-conducted detection-speech weight θ1(Difft) and the non-air-conducted detection-speech weight θ2(Difft) = 1 − θ1(Difft), recomputing p(j|zl) and p(k|zl) with the updated weights, and then recomputing Difft+1;
if |Difft+1 − Difft| < ξ, where ξ is a preset threshold, stopping the weight update and proceeding to the next step; otherwise setting t = t + 1 and returning to the previous step;
computing the optimal weights θ1(DiffT) and θ2(DiffT) from DiffT, where T is the value of t when the update stops.
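The iteration in claim 6 is a fixed-point loop on the stream weight. Since the source elides both the defining formula of Difft and the mapping θ1(·), the sketch below stands in a caller-supplied `diff_fn` for the former and a hypothetical sigmoid mapping for the latter; only the loop structure (update, recompute, test against ξ) is taken from the claim.

```python
import math

def iterate_stream_weight(diff_fn, theta0=0.5, xi=1e-4, max_iter=100):
    # diff_fn(theta) stands in for the elided Diff_t computation,
    # which in the patent depends on the class posteriors p(j|z_l).
    theta = theta0                      # air-conducted stream weight w0
    diff = diff_fn(theta)
    for _ in range(max_iter):
        theta = 1.0 / (1.0 + math.exp(-diff))  # hypothetical theta_1(Diff_t)
        new_diff = diff_fn(theta)
        if abs(new_diff - diff) < xi:   # claim-6 stopping test
            break
        diff = new_diff
    return theta, 1.0 - theta           # theta_1 and theta_2 sum to 1
```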
7. The method according to claim 6, wherein the optimal air-conduction speech filter is computed by the following steps:
computing, with the optimal weights θ1(DiffT) and θ2(DiffT), the probability p(m|zl) that the joint detection speech frame zl belongs to the m-th class of the currently corrected joint statistical model;
computing the frequency-domain gain function of the optimal air-conduction speech filter with one of the following formulas:
where K is the dimension of the mean vector of the m-th class of the joint statistical model, the speech term is the i-th component of the linear-spectrum mean vector of the air-conducted speech corresponding to the m-th class of the joint statistical model, and the noise term is the i-th component of the noise linear-spectrum mean vector corresponding to the m-th class of the air-conduction noise statistical model.
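The gain formulas themselves are elided from the text (they were figures in the original). One plausible reading, given that the gain is built from class-wise speech and noise linear-spectrum means, is a Wiener-type gain; the sketch below is that assumption made explicit, not the patented formula.

```python
import numpy as np

def wiener_type_gain(mu_speech, mu_noise):
    # Hypothetical Wiener-style frequency-domain gain: per spectral bin i,
    # H(i) = mu_speech(i) / (mu_speech(i) + mu_noise(i)), using the
    # class-m air-conducted speech mean and the noise-model mean.
    return mu_speech / (mu_speech + mu_noise)
```

The per-class gains would then be combined with the posteriors p(m|zl) from the step above.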
8. The method according to claim 1, wherein the method further comprises:
converting the non-air-conducted detection speech into mapped air-conducted speech according to a mapping model from non-air-conducted speech to air-conducted speech;
performing a weighted fusion of the mapped air-conducted speech and the filter-enhanced speech to obtain fusion-enhanced speech;
wherein the mapping model is established in advance from the synchronously acquired clean air-conducted training speech and non-air-conducted training speech.
9. The method according to claim 8, wherein the step of performing a weighted fusion of the mapped air-conducted speech and the filter-enhanced speech comprises:
computing by the following formulas the weight of the m-th frame of the filter-enhanced speech xm and the weight of the m-th frame of the mapped speech ym,
where the variance terms are respectively the amplitude variances of the m-th frames of the filter-enhanced speech xm and the mapped speech ym, SNRm is the signal-to-noise ratio of the m-th frame of the filter-enhanced speech xm, and α and β are preset constants;
obtaining the fusion-enhanced speech by the weighted superposition of the filter-enhanced speech xm and the mapped speech ym according to the following formula:
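The weight formulas of claim 9 are elided (they were figures), but the fusion step itself is a plain weighted superposition with weights summing to one, which can be sketched as:

```python
import numpy as np

def fuse(x_filt, y_map, w_filt):
    # Per-frame linear fusion: w_filt is the weight of the
    # filter-enhanced speech x_m, and 1 - w_filt the weight of the
    # mapped speech y_m (the two weights sum to 1).
    return w_filt * x_filt + (1.0 - w_filt) * y_map
```

In the patent, `w_filt` would come from the (elided) per-frame formula driven by SNRm, the amplitude variances, and the constants α and β.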
10. The method according to claim 9, further comprising, before the step of performing the weighted fusion of the mapped air-conducted speech and the filter-enhanced speech:
taking the speech-signal start instant obtained by endpoint detection on the air-conducted detection speech, intercepting all data frames of the filter-enhanced speech before the signal start, and averaging their power to obtain the noise-frame power;
computing the signal-to-noise ratio SNRm by the following formula:
where the power term is the power of the m-th frame of the filter-enhanced speech xm.
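The elided SNRm formula is assumed here to be the standard decibel power ratio against the pre-speech noise-power estimate described in claim 10; the sketch makes that assumption explicit.

```python
import numpy as np

def noise_power(frame_powers, speech_start):
    # Average power of all filter-enhanced frames before the detected
    # speech start instant (claim 10's noise-frame power estimate).
    return float(np.mean(frame_powers[:speech_start]))

def frame_snr_db(frame_power, noise_pow):
    # Assumed standard form: SNR_m = 10 * log10(P_m / P_noise).
    return 10.0 * np.log10(frame_power / noise_pow)
```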
CN201910296425.4A 2016-01-14 2016-01-14 Dual-sensor voice enhancement method based on statistical model Active CN110010149B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910296425.4A CN110010149B (en) 2016-01-14 2016-01-14 Dual-sensor voice enhancement method based on statistical model

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610025390.7A CN105632512B (en) 2016-01-14 2016-01-14 A kind of dual sensor sound enhancement method and device based on statistical model
CN201910296425.4A CN110010149B (en) 2016-01-14 2016-01-14 Dual-sensor voice enhancement method based on statistical model

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201610025390.7A Division CN105632512B (en) 2016-01-14 2016-01-14 A kind of dual sensor sound enhancement method and device based on statistical model

Publications (2)

Publication Number Publication Date
CN110010149A true CN110010149A (en) 2019-07-12
CN110010149B CN110010149B (en) 2023-07-28

Family

ID=56047353

Family Applications (5)

Application Number Title Priority Date Filing Date
CN201910296437.7A Active CN110070883B (en) 2016-01-14 2016-01-14 Speech enhancement method
CN201610025390.7A Active CN105632512B (en) 2016-01-14 2016-01-14 A kind of dual sensor sound enhancement method and device based on statistical model
CN201910296425.4A Active CN110010149B (en) 2016-01-14 2016-01-14 Dual-sensor voice enhancement method based on statistical model
CN201910296436.2A Active CN110085250B (en) 2016-01-14 2016-01-14 Method for establishing air conduction noise statistical model and application method
CN201910296427.3A Active CN110070880B (en) 2016-01-14 2016-01-14 Establishment method and application method of combined statistical model for classification

Family Applications Before (2)

Application Number Title Priority Date Filing Date
CN201910296437.7A Active CN110070883B (en) 2016-01-14 2016-01-14 Speech enhancement method
CN201610025390.7A Active CN105632512B (en) 2016-01-14 2016-01-14 A kind of dual sensor sound enhancement method and device based on statistical model

Family Applications After (2)

Application Number Title Priority Date Filing Date
CN201910296436.2A Active CN110085250B (en) 2016-01-14 2016-01-14 Method for establishing air conduction noise statistical model and application method
CN201910296427.3A Active CN110070880B (en) 2016-01-14 2016-01-14 Establishment method and application method of combined statistical model for classification

Country Status (1)

Country Link
CN (5) CN110070883B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107808662B (en) * 2016-09-07 2021-06-22 斑马智行网络(香港)有限公司 Method and device for updating grammar rule base for speech recognition
CN107886967B (en) * 2017-11-18 2018-11-13 中国人民解放军陆军工程大学 A kind of bone conduction sound enhancement method of depth bidirectional gate recurrent neural network
CN107993670B (en) * 2017-11-23 2021-01-19 华南理工大学 Microphone array speech enhancement method based on statistical model
CN109584894A (en) * 2018-12-20 2019-04-05 西京学院 A kind of sound enhancement method blended based on radar voice and microphone voice
CN109767783B (en) * 2019-02-15 2021-02-02 深圳市汇顶科技股份有限公司 Voice enhancement method, device, equipment and storage medium
CN109767781A (en) * 2019-03-06 2019-05-17 哈尔滨工业大学(深圳) Speech separating method, system and storage medium based on super-Gaussian priori speech model and deep learning
CN110265056B (en) * 2019-06-11 2021-09-17 安克创新科技股份有限公司 Sound source control method, loudspeaker device and apparatus
CN110390945B (en) * 2019-07-25 2021-09-21 华南理工大学 Dual-sensor voice enhancement method and implementation device
CN110797039B (en) * 2019-08-15 2023-10-24 腾讯科技(深圳)有限公司 Voice processing method, device, terminal and medium
CN111724796B (en) * 2020-06-22 2023-01-13 之江实验室 Musical instrument sound identification method and system based on deep pulse neural network
CN113178191A (en) * 2021-04-25 2021-07-27 平安科技(深圳)有限公司 Federal learning-based speech characterization model training method, device, equipment and medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1992015155A1 (en) * 1991-02-19 1992-09-03 Motorola, Inc. Interference reduction system
JP2001236089A (en) * 1999-12-17 2001-08-31 Atr Interpreting Telecommunications Res Lab Statistical language model generating device, speech recognition device, information retrieval processor and kana/kanji converter
CN1750123A (en) * 2004-09-17 2006-03-22 微软公司 Method and apparatus for multi-sensory speech enhancement
CN101080765A (en) * 2005-05-09 2007-11-28 株式会社东芝 Voice activity detection apparatus and method
JP2008176155A (en) * 2007-01-19 2008-07-31 Kddi Corp Voice recognition device and its utterance determination method, and utterance determination program and its storage medium
CN101320566A (en) * 2008-06-30 2008-12-10 中国人民解放军第四军医大学 Non-air conduction speech reinforcement method based on multi-band spectrum subtraction
CN102027536A (en) * 2008-05-14 2011-04-20 索尼爱立信移动通讯有限公司 Adaptively filtering a microphone signal responsive to vibration sensed in a user's face while speaking
CN103208291A (en) * 2013-03-08 2013-07-17 华南理工大学 Speech enhancement method and device applicable to strong noise environments
CN103229238A (en) * 2010-11-24 2013-07-31 皇家飞利浦电子股份有限公司 System and method for producing an audio signal
US9058820B1 (en) * 2013-05-21 2015-06-16 The Intellisis Corporation Identifying speech portions of a sound model using various statistics thereof

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7283850B2 (en) * 2004-10-12 2007-10-16 Microsoft Corporation Method and apparatus for multi-sensory speech enhancement on a mobile device
CN105224844B (en) * 2014-07-01 2020-01-24 腾讯科技(深圳)有限公司 Verification method, system and device


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
GRACIARENA M.: "Combining Standard and Throat Microphones for Robust Speech Recognition", IEEE Signal Processing Letters *
RAHMAN M. S.: "Intelligibility Enhancement of Bone Conducted Speech by an Analysis-Synthesis Method", IEEE International Midwest Symposium *
ZHANG Zhengyou, LIU Zicheng: "Multi-Sensory Microphones for Robust Speech Detection, Enhancement and Recognition", ICASSP *
XU Fang: "Model-Based Multi-Data-Stream Speech Enhancement Technology", China Masters' Theses Full-text Database, Information Science and Technology *
NIU Yingli: "Multi-Sensor-Based Speech Enhancement Technology", China Masters' Theses Full-text Database, Information Science and Technology *

Also Published As

Publication number Publication date
CN110070883B (en) 2023-07-28
CN105632512B (en) 2019-04-09
CN110070880B (en) 2023-07-28
CN110085250B (en) 2023-07-28
CN105632512A (en) 2016-06-01
CN110010149B (en) 2023-07-28
CN110070880A (en) 2019-07-30
CN110085250A (en) 2019-08-02
CN110070883A (en) 2019-07-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant