CN110010149A - Dual-sensor speech enhancement method based on a statistical model - Google Patents


Publication number
CN110010149A
CN110010149A · CN201910296425A · CN110010149B
Authority
CN
China
Prior art keywords
voice
air conduction
statistical model
detection
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910296425.4A
Other languages
Chinese (zh)
Other versions
CN110010149B (en)
Inventor
张军
陈鑫源
潘伟锵
宁更新
冯义志
余华
季飞
陈芳炯
Current Assignee
Shenzhen Voxtech Co Ltd
Original Assignee
Shenzhen Voxtech Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Voxtech Co Ltd filed Critical Shenzhen Voxtech Co Ltd
Priority to CN201910296425.4A (granted as CN110010149B)
Publication of CN110010149A
Application granted
Publication of CN110010149B
Legal status: Active

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 — Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/038 — Speech enhancement using band spreading techniques
    • G10L21/0272 — Voice signal separating
    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T — CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T90/00 — Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Abstract

The present invention discloses a dual-sensor speech enhancement method based on statistical models, comprising: synchronously acquiring an air-conduction test speech signal and a non-air-conduction test speech signal, detecting the endpoints of the air-conduction test speech, and building an air-conduction noise statistical model from the pure-noise segments of the air-conduction test speech; correcting the joint statistical model with the air-conduction noise statistical model and classifying the air-conduction test speech frames; computing the optimal air-conduction speech filter from the air-conduction linear-spectrum statistical model corresponding to the classification result together with the air-conduction noise statistical model; and filtering the air-conduction test speech with the optimal filter to obtain the filter-enhanced speech. The joint statistical model and the air-conduction linear-spectrum statistical models are built in advance from synchronously acquired clean air-conduction and non-air-conduction training speech.

Description

Dual-sensor speech enhancement method based on a statistical model
Technical field
This application is a divisional of patent application No. 201610025390.7, entitled "A dual-sensor speech enhancement method and device based on statistical models", filed by the applicant on January 14, 2016. The present invention relates to the field of digital signal processing, and in particular to a dual-sensor speech enhancement method based on statistical models.
Background technique
Communication is an important means of modern human interaction, and speech is its most common form; speech quality directly affects how accurately people receive information. During transmission, speech is inevitably corrupted by various kinds of environmental noise, which markedly degrades its quality and intelligibility. Speech enhancement techniques are therefore widely used in practice to process speech captured in noisy environments.
Speech enhancement extracts the useful speech signal from a noisy background and is the basic means of suppressing and reducing noise interference. Traditional speech enhancement operates on signals captured by air-conduction sensors (e.g., microphones). By processing approach, common techniques fall into model-based and non-model-based methods. Non-model-based methods include spectral subtraction, filtering, and wavelet transforms; they usually assume the noise is relatively stationary, and their performance degrades when the noise changes quickly. Model-based methods first build statistical models of the speech and noise signals and then obtain a minimum mean-square error (MMSE) or maximum a posteriori (MAP) estimate of the clean speech; such methods avoid musical noise and can handle non-stationary noise. However, because both classes of methods rely on air-conduction sensors such as microphones, their signals are easily corrupted by acoustic environmental noise, and system performance drops sharply in strong noise.
To mitigate the effect of strong noise on speech processing systems, non-air-conduction speech sensors take a different route from traditional air-conduction sensors: vibrations at the speaker's vocal cords, jawbone, and similar sites deform a reed or carbon film inside the sensor, changing its resistance and hence the voltage across it, so that the vibration signal is converted into an electrical (speech) signal. Because sound waves propagating through the air cannot deform the reed or carbon film, non-air-conduction sensors are immune to airborne sound and are highly robust to environmental acoustic noise. However, since they pick up speech through the vibration of jawbone, muscle, and skin, the captured speech sounds muffled and indistinct, its high-frequency components are severely attenuated, and its intelligibility is poor, which limits the practical application of non-air-conduction techniques.
Since using either an air-conduction or a non-air-conduction sensor alone has drawbacks, speech enhancement methods combining the advantages of both have appeared in recent years. These methods exploit the complementarity of the air-conduction and non-air-conduction signals through multi-sensor fusion and usually outperform single-sensor enhancement systems. Existing combined methods still have two shortcomings: (1) the air-conduction and non-air-conduction signals are mostly restored independently and only fused afterwards, so the complementarity between the two is not fully exploited during restoration; and (2) in changeable, strongly noisy environments, the statistical properties of the pure-speech segments of the air-conduction signal are severely disturbed and the SNR of the enhanced speech drops, so the benefit of fusion becomes insignificant.
Summary of the invention
The present invention provides a dual-sensor speech enhancement method based on statistical models, comprising: synchronously acquiring air-conduction and non-air-conduction test speech, detecting the endpoints of the air-conduction test speech, and building an air-conduction noise statistical model from its pure-noise segments; correcting the joint statistical model with the air-conduction noise statistical model and classifying the air-conduction test speech frames; computing the optimal air-conduction speech filter from the air-conduction linear-spectrum statistical model corresponding to the classification result together with the air-conduction noise statistical model; and filtering the air-conduction test speech with the optimal filter to obtain the filter-enhanced speech, wherein the joint statistical model and the air-conduction linear-spectrum statistical models are built in advance from synchronously acquired clean air-conduction and non-air-conduction training speech.
Compared with the prior art, the present invention has the following advantages and effects:
1. During air-conduction speech enhancement, the invention combines the non-air-conduction and air-conduction signals to build the statistical model used for classification, performs endpoint detection, and constructs the optimal air-conduction speech filter accordingly, improving the enhancement of the air-conduction speech and significantly increasing the robustness of the whole system.
2. The invention adopts a two-stage enhancement structure: when strong noise degrades the first-stage filtering of the air-conduction speech, the second stage adaptively fuses the filtered speech with the speech mapped from the non-air-conduction signal, so that a good enhancement effect is still obtained.
3. There is no distance restriction between the air-conduction and non-air-conduction sensors used in the invention, which is convenient for the user.
Detailed description of the invention
Fig. 1 is a flow chart of the dual-sensor speech enhancement method based on a statistical model disclosed in an embodiment of the present invention;
Fig. 2 is a flow chart of training the speech statistical models in an embodiment of the present invention;
Fig. 3 is a flow chart of building the mapping model from non-air-conduction speech to air-conduction speech in an embodiment of the present invention;
Fig. 4 is a flow chart of building the air-conduction noise statistical model in an embodiment of the present invention;
Fig. 5 is a flow chart of correcting the joint statistical model in an embodiment of the present invention;
Fig. 6 is a flow chart of estimating the optimal air-conduction speech filter in an embodiment of the present invention;
Fig. 7 is a flow chart of the weighted fusion of the mapped speech and the filter-enhanced speech in an embodiment of the present invention;
Fig. 8 is a structural block diagram of the dual-sensor speech enhancement device based on a statistical model disclosed in an embodiment of the present invention.
Specific embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is further described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only illustrate the invention and are not intended to limit it.
Embodiment one
This embodiment discloses a dual-sensor speech enhancement method based on statistical models; with reference to the detailed flow of Fig. 1, the method comprises the following steps:
Step S1: synchronously acquire clean air-conduction and non-air-conduction training speech, build the joint statistical model used for classification, and compute the air-conduction linear-spectrum statistical model corresponding to each class. This step can be divided into the following sub-steps, as shown in Fig. 2:
Step S1.1: synchronously acquire clean air-conduction and non-air-conduction training speech, split it into frames, and extract the feature parameters of each frame;
In this embodiment, clean, synchronized air-conduction and non-air-conduction training speech is acquired by the speech reception module. The input clean training speech is framed, pre-processed, and transformed with the discrete Fourier transform; mel filter banks then yield the mel-frequency cepstral coefficients (MFCCs) of both training signals, which serve as the training data of the joint statistical model.
In other embodiments, the LPCC or LSF coefficients of the air-conduction and non-air-conduction training speech are extracted instead.
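The framing and feature-extraction step can be sketched as follows. This is an illustrative simplification (hypothetical helpers `frame_signal` and `mfcc_like`; no pre-emphasis and no mel filter bank), not the embodiment's exact MFCC pipeline:

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def mfcc_like(x, frame_len=256, hop=128, n_ceps=12):
    """Toy cepstral features: windowed log magnitude spectrum followed by a DCT."""
    frames = frame_signal(x, frame_len, hop) * np.hamming(frame_len)
    logspec = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-10)
    # type-II DCT written out directly so that only numpy is needed
    k = np.arange(logspec.shape[1])
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * k + 1) / (2 * len(k))))
    return logspec @ basis.T

x = np.sin(2 * np.pi * 440 * np.arange(4000) / 8000.0)  # 0.5 s of a 440 Hz tone
feats = mfcc_like(x)
print(feats.shape)  # one 12-dimensional feature vector per frame
```

The same routine would be applied to both sensor signals frame-synchronously so their feature rows line up for splicing.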
Step S1.2: splice the feature parameters of the air-conduction and non-air-conduction training speech from step S1.1 into clean joint speech feature parameters;
In this embodiment, the cepstral feature vector sequence of the air-conduction training speech is denoted S_N = {s_N1, s_N2, ..., s_Nn}, where n is the number of speech data frames and s_Nl is the feature column vector of frame l; the cepstral feature vector sequence of the non-air-conduction training speech is denoted S_T = {s_T1, s_T2, ..., s_Tn}, with the same frame count n and frame-l feature column vector s_Tl. Splicing the frame-l cepstral feature parameters of the air-conduction and non-air-conduction training speech gives the frame-l joint cepstral feature vector z_l = [s_Nl^T, s_Tl^T]^T.
Step S1.3: train the cepstral-domain joint statistical model used for classification on the joint speech feature parameters obtained in step S1.2;
In this embodiment, the probability distribution of the joint training speech is fitted with a multi-stream Gaussian mixture model (GMM). The probability density function of the cepstral-domain joint statistical model is

p(z | λ) = Σ_{m=1..M} π_m Π_s [N(z_s; μ_m^s, Σ_m^s)]^{θ_s}

where s is the index of the speech data stream, M is the number of mixture components of the GMM, θ_s is the weight of data stream s, and π_m is the prior weight of mixture component m, with Σ_s θ_s = 1, θ_s ≥ 0, Σ_m π_m = 1, and π_m ≥ 0. μ_m^s and Σ_m^s are the mean vector and covariance matrix of data stream s in class m of the cepstral-domain joint statistical model, z_s is the feature vector of stream s, and N(·) is the single-Gaussian probability density function. Let λ denote the parameter set of the multi-stream GMM and Z = {z_1, z_2, ..., z_n} the set of joint cepstral training feature vectors; the likelihood of the cepstral-domain joint statistical model is then

P(Z | λ) = Π_{l=1..n} p(z_l | λ)

The parameter set λ that maximizes P(Z | λ) can be found with the EM algorithm (expectation-maximization algorithm).
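The EM fit of step S1.3 can be sketched with a plain diagonal-covariance GMM (a single data stream with equal stream weights; `em_gmm` is an illustrative helper, not the patent's multi-stream model):

```python
import numpy as np

def em_gmm(Z, M=2, iters=50):
    """Fit a diagonal-covariance GMM by EM; returns (weights, means, variances)."""
    n, d = Z.shape
    mu = Z[[0, n // 2]].astype(float).copy()      # deterministic two-point init (M = 2)
    var = np.tile(Z.var(axis=0) + 1e-6, (M, 1))
    pi = np.full(M, 1.0 / M)
    for _ in range(iters):
        # E-step: responsibilities r[l, m] = p(m | z_l)
        logp = (-0.5 * (((Z[:, None, :] - mu) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(-1) + np.log(pi))
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        Nm = r.sum(axis=0) + 1e-12
        pi = Nm / n
        mu = (r.T @ Z) / Nm[:, None]
        var = (r.T @ (Z ** 2)) / Nm[:, None] - mu ** 2 + 1e-6
    return pi, mu, var

rng = np.random.default_rng(1)
Z = np.vstack([rng.normal(0.0, 1.0, (200, 2)),    # frames from one "class"
               rng.normal(5.0, 1.0, (200, 2))])   # frames from another
pi, mu, var = em_gmm(Z)
print(sorted(mu[:, 0]))
```

With two well-separated clusters the fitted component means land near 0 and 5; the multi-stream version would simply evaluate the per-stream Gaussians and raise them to the stream weights θ_s before mixing.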
Step S1.4: classify all joint cepstral training frames, compute the linear-spectrum statistical parameters of the air-conduction speech over all joint frames belonging to each class, and build the air-conduction linear-spectrum statistical model corresponding to each class.
In this embodiment, each Gaussian component of the multi-stream GMM represents one class. For every joint cepstral training frame, the probability that the frame-l joint cepstral feature vector z_l belongs to class m of the cepstral-domain joint statistical model is computed as

p(m | z_l) = π_m Π_s [N(z_l^s; μ_m^s, Σ_m^s)]^{θ_s} / Σ_{m'=1..M} π_{m'} Π_s [N(z_l^s; μ_{m'}^s, Σ_{m'}^s)]^{θ_s}

where z_l^s is the cepstral feature vector of data stream s in frame l. The mixture component (class) with the largest probability, max_m {p(m | z_l)}, is recorded.
After all joint cepstral frames have been classified, the linear-spectrum mean of the air-conduction speech over all joint frames gathered into the same class is computed and stored as the parameter of the air-conduction linear-spectrum statistical model associated with that class of the cepstral-domain joint statistical model.
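The per-frame classification of step S1.4 amounts to evaluating GMM posteriors and averaging the linear spectra of the frames assigned to each class. A minimal sketch with hypothetical fixed model parameters:

```python
import numpy as np

def gmm_posteriors(Z, pi, mu, var):
    """p(m | z_l) for a diagonal GMM, plus the winning class per frame."""
    logp = (-0.5 * (((Z[:, None, :] - mu) ** 2) / var
                    + np.log(2 * np.pi * var)).sum(-1) + np.log(pi))
    logp -= logp.max(axis=1, keepdims=True)
    post = np.exp(logp)
    post /= post.sum(axis=1, keepdims=True)
    return post, post.argmax(axis=1)

# hypothetical two-class model over a 1-D joint feature
pi = np.array([0.5, 0.5])
mu = np.array([[0.0], [5.0]])
var = np.ones((2, 1))
Z = np.array([[0.1], [4.9], [5.2], [-0.3]])           # four "joint frames"
# per-frame linear spectra of the air-conduction speech (toy, 2 bins)
spectra = np.array([[1.0, 2.0], [10.0, 20.0], [12.0, 22.0], [2.0, 1.0]])
post, cls = gmm_posteriors(Z, pi, mu, var)
class_means = np.stack([spectra[cls == m].mean(axis=0) for m in range(2)])
print(cls.tolist(), class_means.tolist())
```

The `class_means` rows play the role of the per-class air-conduction linear-spectrum model parameters saved in this step.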
In other embodiments, a multi-stream hidden Markov model is used as the joint statistical model, each Gaussian component of the model representing one class.
Step S2: using the air-conduction and non-air-conduction training speech synchronously acquired in step S1, build the mapping model from non-air-conduction speech to air-conduction speech. This step is divided into the following sub-steps, as shown in Fig. 3:
Step S2.1: frame the clean non-air-conduction and air-conduction training speech synchronously acquired in step S1, and feed the non-air-conduction training frames as input, with the simultaneous air-conduction training frames as the ideal output, into an initialized feedforward neural network;
In this embodiment, the air-conduction and non-air-conduction training speech is first framed and the line spectral frequency (LSF) parameters of the air-conduction and non-air-conduction training frames are extracted, giving the input-output pairs (L_T, L_N) of the feedforward network, where L_T is the LSF vector of the non-air-conduction training speech (network input) and L_N is the LSF vector of the air-conduction training speech (ideal network output); the feedforward network weights are then initialized.
Step S2.2: train the weight coefficients of the feedforward neural network with the scaled conjugate gradient algorithm under the minimum mean-square error criterion, so that the error between the actual and ideal outputs is minimized, yielding the mapping model from non-air-conduction speech to air-conduction speech;
In this embodiment, the weight vector connecting layer l of the feedforward network to neuron j of layer l+1 is

w_j^(l) = [w_{1j}^(l), ..., w_{N_l j}^(l), b_j^(l)]^T

where w_{ij}^(l) is the connection weight from neuron i of layer l to neuron j of layer l+1, N_l is the number of neurons in layer l, and b_j^(l) is the threshold of neuron j of layer l+1. Stacking all w_j^(l) gives the feedforward network weight vector

w = [w_1^(1), ..., w_{N_2}^(1), ..., w_1^(M-1), ..., w_N^(M-1)]^T

where M is the number of network layers and N is the number of output-layer neurons. With P training speech frames, the error between the actual network output L* and the ideal output L is

E(w) = (1/2P) Σ_{p=1..P} ||L*_p − L_p||²
The feedforward network weights are iterated with the scaled conjugate gradient algorithm; the (k+1)-th iterate is

w_{k+1} = w_k + α_k P_k   (14)

where the search direction P_k and step size α_k follow the usual conjugate-gradient relations

P_k = −E'(w_k) + β_k P_{k−1},   α_k = −P_k^T E'(w_k) / (P_k^T E''(w_k) P_k)

with E'(w_k) and E''(w_k) the first and second derivatives of E(w). When E'(w_k) = 0 the error E(w) reaches its minimum and the optimal weight coefficients W_best are obtained.
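The mapping network of step S2 can be sketched as a one-hidden-layer regressor. For brevity this sketch trains with plain full-batch gradient descent rather than the scaled conjugate gradient algorithm of the embodiment, and the target y = 2x + 0.5 is only a stand-in for the (unknown) bone-to-air-conduction LSF mapping:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, (200, 1))   # stand-in "non-air-conduction" features
Y = 2.0 * X + 0.5                      # stand-in "air-conduction" targets
W1 = rng.normal(0.0, 0.5, (1, 8)); b1 = np.zeros(8)
W2 = rng.normal(0.0, 0.5, (8, 1)); b2 = np.zeros(1)
lr = 0.1
for _ in range(3000):
    H = np.tanh(X @ W1 + b1)           # hidden layer
    P = H @ W2 + b2                    # linear output layer
    err = P - Y                        # d(MSE)/dP up to a constant
    gW2 = H.T @ err / len(X); gb2 = err.mean(axis=0)
    dH = (err @ W2.T) * (1.0 - H ** 2)
    gW1 = X.T @ dH / len(X); gb1 = dH.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2
mse = float(((np.tanh(X @ W1 + b1) @ W2 + b2 - Y) ** 2).mean())
print(mse)
```

The minimum mean-square error criterion is the same; scaled conjugate gradients only change how the weight update direction and step size are chosen.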
Step S3: synchronously acquire air-conduction and non-air-conduction test speech and detect the endpoints of the air-conduction test speech, then build the spectral-domain air-conduction noise statistical model from the pure-noise segments of the air-conduction test speech. This step uses the following sub-steps, as shown in Fig. 4:
Step S3.1: synchronously acquire the air-conduction and non-air-conduction test speech and split it into frames;
Step S3.2: from the short-time autocorrelation function R_w(k) and the short-time energy E_w of each non-air-conduction test frame, compute the frame's short-time average threshold-crossing rate C_w(n):

C_w(n) = Σ_k { |sgn[R_w(k) − αT] − sgn[R_w(k−1) − αT]| + |sgn[R_w(k) + αT] − sgn[R_w(k−1) + αT]| } w(n−k)   (17)

where sgn[·] is the sign operation, α is an adjustment factor, w(n) is a window function, and T is the initial threshold. When C_w(n) exceeds a preset threshold, the frame is judged to be speech; otherwise it is noise. The endpoint locations of the non-air-conduction test speech signal are obtained from the per-frame decisions;
Step S3.3: take the instants corresponding to the non-air-conduction speech endpoints detected in step S3.2 as the endpoints of the air-conduction test speech, and extract the pure-noise segments of the air-conduction test speech;
Step S3.4: compute the linear-spectrum mean of the pure-noise segments of the air-conduction test speech and save this mean parameter, establishing the spectral-domain air-conduction noise statistical model.
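Steps S3.2 to S3.4 reduce to an endpoint decision per frame followed by averaging the spectra of the non-speech frames. A simplified energy-only sketch (the embodiment also uses the threshold-crossing rate of equation (17)):

```python
import numpy as np

def energy_vad(frames, thresh_ratio=0.5):
    """Mark a frame as speech when its short-time energy exceeds a threshold."""
    e = (frames ** 2).sum(axis=1)
    return e > thresh_ratio * e.max()

rng = np.random.default_rng(0)
frames = np.vstack([rng.normal(0.0, 0.05, (10, 64)),   # 10 quiet noise frames
                    rng.normal(0.0, 1.00, (10, 64))])  # 10 loud "speech" frames
is_speech = energy_vad(frames)
# average magnitude spectrum of the non-speech frames = toy noise model
noise_spec_mean = np.abs(np.fft.rfft(frames[~is_speech], axis=1)).mean(axis=0)
print(is_speech.tolist(), noise_spec_mean.shape)
```

In the method proper the endpoint decision is made on the noise-immune non-air-conduction channel, and the resulting frame labels are transferred to the air-conduction channel before the noise mean is computed.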
Step S4: correct the joint statistical model of step S1 with the air-conduction noise statistical model, classify the air-conduction test speech frames, compute the optimal air-conduction speech filter from the air-conduction linear-spectrum statistical model corresponding to the classification result together with the air-conduction noise statistical model, and apply the filter to enhance the air-conduction test speech.
In this embodiment, model compensation is first used to correct the air-conduction data-stream parameters of the joint statistical model, through the following sub-steps, as shown in Fig. 5:
Step S4.1a: convert the joint statistical model parameters from the mel-cepstral domain to the linear-spectrum domain. In this embodiment, the inverse discrete cosine transform C^{-1} first maps the mean and covariance of class m of the mel-cepstral joint model to the log domain: μ^{log} = C^{-1} μ^{cep} and Σ^{log} = C^{-1} Σ^{cep} (C^{-1})^T, where μ^{log} and Σ^{log} are the log-domain mean and covariance. These are then mapped from the log domain to the linear-spectrum domain using the log-normal moment relations:

μ_i^{lin} = exp(μ_i^{log} + Σ_{ii}^{log}/2),   Σ_{ij}^{lin} = μ_i^{lin} μ_j^{lin} (exp(Σ_{ij}^{log}) − 1)

where μ_i^{lin} is the i-th component of the linear-spectrum mean vector and Σ_{ij}^{lin} is the element in row i, column j of the linear-spectrum covariance matrix.
Step S4.2a: exploiting the fact that clean air-conduction speech and air-conduction noise are additive in the linear-spectrum domain, correct the air-conduction data-stream parameters of the joint statistical model. In this embodiment the air-conduction stream parameters are corrected as

μ̂^{lin} = g μ^{lin} + μ_n^{lin},   Σ̂^{lin} = g² Σ^{lin} + Σ_n^{lin}

where g is a gain determined by the signal-to-noise ratio of the air-conduction test speech, μ_n^{lin} and Σ_n^{lin} are the linear-spectrum mean and variance of the air-conduction noise, and μ̂^{lin} and Σ̂^{lin} are the corrected linear-spectrum mean and variance of the air-conduction data stream.
Step S4.3a: convert the corrected linear-spectrum joint model statistical parameters back to the original cepstral domain by inverting the transforms of formulas (13) and (14), obtaining the corrected joint cepstral-domain statistical model.
After the joint statistical model has been corrected, the probability p(m | z_l) that each joint test feature vector z_l belongs to class m of the joint statistical model can be computed.
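The domain conversions of steps S4.1a and S4.3a rest on the standard log-normal moment relations between a Gaussian in the log domain and its linear-domain mean and variance. A sketch under that assumption (diagonal covariance only; `noise_mu` is a hypothetical noise spectral mean):

```python
import numpy as np

def log_to_linear(mu_log, var_log):
    """Gaussian in the log domain -> mean/variance in the linear domain
    (standard log-normal moments, diagonal covariance only)."""
    mu_lin = np.exp(mu_log + 0.5 * var_log)
    var_lin = (np.exp(var_log) - 1.0) * mu_lin ** 2
    return mu_lin, var_lin

def linear_to_log(mu_lin, var_lin):
    """Inverse of log_to_linear."""
    var_log = np.log(1.0 + var_lin / mu_lin ** 2)
    mu_log = np.log(mu_lin) - 0.5 * var_log
    return mu_log, var_log

mu_log = np.array([1.0, 2.0])
var_log = np.array([0.1, 0.2])
mu_lin, var_lin = log_to_linear(mu_log, var_log)
noise_mu = np.array([0.5, 0.5])        # hypothetical noise spectral mean
mu_comp = mu_lin + noise_mu            # additive combination in the linear domain
mu_back, var_back = linear_to_log(mu_lin, var_lin)
print(np.allclose(mu_back, mu_log), np.allclose(var_back, var_log))
```

The round trip is exact, which is what lets the corrected linear-spectrum parameters be carried back to the cepstral domain in step S4.3a.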
The computation of the optimal air-conduction speech filter in step S4 comprises the following sub-steps, as shown in Fig. 6:
Step S4.1b: extract the joint feature parameters of the air-conduction and non-air-conduction test speech, and compute the output probability p(m | z_l) of each joint test frame for every class of the corrected joint statistical model;
Step S4.2b: from the above output probabilities, compute the weights of the non-air-conduction and air-conduction test data streams in the joint statistical model, using the following steps:
Step S4.2.1: set the initial weight of the air-conduction test speech to w_0 and that of the non-air-conduction test speech to 1 − w_0, set the iteration count t = 0, and compute Diff_t,
where M denotes the number of mixture components, L the number of speech frames, p(j | z_l) and p(k | z_l) the probabilities that the frame-l joint test vector z_l belongs to classes j and k of the joint statistical model, d(μ_k, μ_j) the distance between the statistical parameters of classes k and j, and μ_k, μ_j the means of classes k and j of the joint statistical model.
Step S4.2.2: compute the air-conduction speech weight θ_1(Diff_t) and the non-air-conduction speech weight θ_2(Diff_t) = 1 − θ_1(Diff_t), recompute p(j | z_l) and p(k | z_l) with the updated weights, and then compute Diff_{t+1} according to formula (23);
Step S4.2.3: if |Diff_{t+1} − Diff_t| < ξ, where ξ is a preset threshold, stop updating the weights and execute step S4.2.4; otherwise set t = t + 1 and go to step S4.2.2;
Step S4.2.4: compute the optimal weights θ_1(Diff_T) and θ_2(Diff_T) from Diff_T, where T is the value of t when the update stopped.
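The weight update of steps S4.2.1 to S4.2.4 is a fixed-point iteration stopped by the threshold ξ. The sketch below keeps only that control flow; the logistic weight map and the `tanh` stand-in for the class-separability measure Diff are illustrative assumptions, since the patent's formula (23) is not reproduced here:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def iterate_weights(diff_fn, xi=1e-4, max_iter=100):
    """Fixed-point iteration: update the stream weight from Diff, stop when
    successive Diff values differ by less than the threshold xi."""
    diff = diff_fn(0.5)                 # start from equal stream weights
    w_air = sigmoid(diff)
    for _ in range(max_iter):
        new_diff = diff_fn(w_air)
        if abs(new_diff - diff) < xi:
            break
        diff = new_diff
        w_air = sigmoid(diff)
    return w_air, 1.0 - w_air

# toy stand-in separability measure: grows with the weight, then saturates
w_air, w_bone = iterate_weights(lambda w: np.tanh(2.0 * w))
print(w_air, w_bone)
```

Whatever the actual Diff measure, the two stream weights always sum to one, matching θ_2 = 1 − θ_1.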
Step S4.3b: classify the air-conduction test speech frames with the joint statistical model obtained in step S4.2b, then compute the optimal air-conduction speech filter from the air-conduction linear-spectrum statistical model corresponding to the classification result together with the air-conduction noise statistical model, using the following steps:
Step S4.3.1: with the optimal weights θ_1(Diff_T) and θ_2(Diff_T), compute the probability p(m | z_l) that the joint test frame z_l belongs to class m of the corrected joint statistical model;
Step S4.3.2: compute the frequency-domain gain function of the optimal air-conduction speech filter, a posterior-weighted Wiener-type gain consistent with the quantities defined below:

G_i(z_l) = Σ_{m=1..M} p(m | z_l) · μ_{m,i} / (μ_{m,i} + μ_{n,m,i}),   i = 1, ..., K

where K is the length of the mean vector of class m of the joint statistical model, μ_{m,i} is the i-th value of the air-conduction linear-spectrum mean vector corresponding to class m, and μ_{n,m,i} is the i-th value of the noise linear-spectrum mean vector corresponding to class m of the air-conduction noise statistical model.
After the frequency-domain gain function of the optimal air-conduction speech filter has been obtained, the air-conduction test speech is transformed to the frequency domain, its phase information is retained, its magnitude spectrum is scaled by G(z_l), and the result is transformed back to the time domain, yielding the filter-enhanced speech.
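Applying the gain while keeping the phase, as described above, can be sketched directly (a uniform toy gain stands in for the class-dependent G(z_l)):

```python
import numpy as np

def apply_gain(frame, gain):
    """Scale the magnitude spectrum by per-bin gains while keeping the phase."""
    spec = np.fft.rfft(frame)
    mag, phase = np.abs(spec), np.angle(spec)
    return np.fft.irfft(mag * gain * np.exp(1j * phase), n=len(frame))

frame = np.sin(2 * np.pi * 4 * np.arange(64) / 64.0)   # one toy frame
gain = np.full(33, 0.5)                                # uniform gain on all rfft bins
out = apply_gain(frame, gain)
print(np.allclose(out, 0.5 * frame))
```

A uniform gain of 0.5 simply halves the frame, confirming that only magnitudes are modified.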
In other embodiments, to improve computational efficiency, the gain function of the optimal air-conduction speech filter is computed with a simplified formula.
Step S5: using the mapping model from non-air-conduction speech to air-conduction speech obtained in step S2, convert the non-air-conduction test speech into mapped air-conduction speech;
Step S6: linearly fuse the mapped speech obtained in step S5 with the filter-enhanced speech obtained in step S4 to obtain the fusion-enhanced speech, using the following steps, as shown in Fig. 7:
Step S6.1: compute the weight w_{x_m} of the frame-m filter-enhanced speech x_m and the weight w_{y_m} of the frame-m mapped speech y_m.
In this embodiment, all data frames of the filter-enhanced speech x_m preceding the speech start point obtained by the endpoint detection of step S3 are intercepted, and their mean power is taken as the noise-frame power P_n. The weights of x_m and y_m are then computed,
where σ²_{x_m} and σ²_{y_m} are the amplitude variances of the frame-m filter-enhanced speech x_m and mapped speech y_m respectively, α and β are preset constants, and SNR_m is the signal-to-noise ratio of the frame-m filter-enhanced speech x_m:

SNR_m = 10 log10(P_{x_m} / P_n)

where P_{x_m} is the power of x_m.
Step S6.2: superpose the filter-enhanced speech x_m and the mapped speech y_m with these weights to obtain the fusion-enhanced speech.
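The fusion of step S6 weights the filtered and mapped speech by frame SNR. Since the patent's exact weight formulas depend on the amplitude variances and the constants α, β, this sketch substitutes a simple logistic function of the frame SNR as the weight:

```python
import numpy as np

def fuse(x, y, noise_power, alpha=1.0, beta=1.0):
    """SNR-driven linear fusion of a filtered frame x and a mapped frame y."""
    snr = 10.0 * np.log10((x ** 2).mean() / noise_power)
    w = 1.0 / (1.0 + np.exp(-alpha * (snr - beta)))  # high SNR -> trust x
    return w * x + (1.0 - w) * y, w

x = np.ones(64)          # toy filter-enhanced frame
y = np.zeros(64)         # toy mapped frame
fused_hi, w_hi = fuse(x, y, noise_power=1e-4)   # low-noise frame
fused_lo, w_lo = fuse(x, y, noise_power=1e2)    # high-noise frame
print(w_hi, w_lo)
```

The behaviour matches the design intent: in quiet frames the filtered speech dominates, while in very noisy frames the output falls back to the speech mapped from the noise-immune non-air-conduction sensor.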
Embodiment two
Embodiment two discloses a dual-sensor speech enhancement device based on the model, composed of a speech reception module, a speech statistical model training module, an air-conduction noise statistical model estimation module, an air-conduction test speech filtering enhancement module, a speech mapping module, and a speech fusion enhancement module; its structure is shown in Fig. 8.
The speech reception module synchronously acquires clean air-conduction and non-air-conduction training speech.
The speech statistical model training module builds the joint statistical model and the air-conduction linear-spectrum statistical models.
The air-conduction noise statistical model estimation module detects the endpoints of the air-conduction test speech and then builds the air-conduction noise statistical model from its pure-noise segments.
The air-conduction test speech filtering enhancement module corrects the statistical parameters of the joint statistical model with the air-conduction noise statistical model, classifies the air-conduction test speech frames, computes the optimal air-conduction speech filter from the air-conduction linear-spectrum statistical model corresponding to the classification result together with the air-conduction noise statistical model, and filters the air-conduction test speech to obtain the filter-enhanced speech.
The speech mapping module builds the mapping model from non-air-conduction speech to air-conduction speech and converts the non-air-conduction test speech into mapped speech with air-conduction characteristics according to this model.
The speech fusion enhancement module performs weighted fusion of the mapped speech and the filter-enhanced speech to obtain the fusion-enhanced speech.
As shown in Fig. 8, the speech reception module is connected to the speech statistical model training module, the air-conduction noise statistical model estimation module, the air-conduction test speech filtering enhancement module, and the speech mapping module; the speech statistical model training module is connected to the air-conduction test speech filtering enhancement module; the air-conduction noise statistical model estimation module is connected to the air-conduction test speech filtering enhancement module; the air-conduction test speech filtering enhancement module is connected to the speech fusion enhancement module; and the speech mapping module is connected to the speech fusion enhancement module.
The speech reception module comprises two sub-modules, an air-conduction speech sensor and a non-air-conduction speech sensor; the former acquires air-conduction speech data and the latter non-air-conduction speech data. The speech statistical model training module comprises a joint statistical model sub-module and an air-conduction linear-spectrum statistical model sub-module, which build the joint statistical model and the air-conduction linear-spectrum statistical models. The air-conduction noise statistical model estimation module estimates the current ambient noise of the system, corrects the joint statistical model, and participates in the computation of the filter coefficients. The air-conduction test speech filtering enhancement module is composed of a joint statistical model correction sub-module, a joint test speech classification sub-module, an optimal air-conduction filter coefficient generation sub-module, and an air-conduction test speech filtering sub-module: the correction sub-module corrects the statistical parameters of the joint statistical model; the classification sub-module classifies the test speech and passes the classification result to the optimal air-conduction filter coefficient generation sub-module, which computes the filter parameters; and the filtering sub-module finally produces the filter-enhanced air-conduction speech. The speech mapping module maps the non-air-conduction test speech to air-conduction speech. The speech fusion enhancement module comprises an adaptive weight generation sub-module and a linear fusion sub-module: the former computes the weights of the mapped speech and the filter-enhanced speech, and the latter uses these weights to linearly fuse the mapped speech and the filter-enhanced speech into the fusion-enhanced speech.
Among the above sub-modules, the air-conduction speech sensor is connected to the air-conduction noise statistical-model estimation module, the joint statistical model sub-module, the joint detection-speech classification sub-module, and the air-conducted detection-speech filtering sub-module; the non-air-conduction speech sensor is connected to the joint statistical model sub-module, the air-conduction noise statistical-model estimation module, the speech mapping module, and the joint detection-speech classification sub-module. The joint statistical model sub-module and the air-conducted speech linear-spectrum statistical model sub-module are connected to the joint statistical-model correction sub-module; the air-conducted speech linear-spectrum statistical model training module is connected to the optimal air-conduction filter-coefficient generation sub-module and participates in the computation of the filter coefficients.
The air-conduction noise model estimation module is connected to the joint statistical-model correction sub-module and the optimal air-conduction filter-coefficient generation sub-module. The joint statistical-model correction sub-module is connected to the optimal air-conduction filter-coefficient generation sub-module and the air-conducted detection-speech filtering sub-module; the joint detection-speech classification sub-module is connected to the optimal air-conduction filter-coefficient generation sub-module; and the optimal air-conduction filter-coefficient generation sub-module is connected to the air-conducted detection-speech filtering sub-module. The air-conducted detection-speech filtering sub-module is connected to the adaptive-weight generation sub-module and the linear fusion sub-module; the speech mapping module is connected to the adaptive-weight generation sub-module and the linear fusion sub-module; and the adaptive-weight generation sub-module is connected to the linear fusion sub-module.
It should be noted that the modules in the above apparatus embodiment are divided only according to functional logic; other divisions are possible as long as the corresponding functions can be realized. In addition, the specific names of the modules are only for convenience of mutual distinction and are not intended to limit the protection scope of the present invention.
The above embodiments are preferred embodiments of the present invention, but the embodiments of the present invention are not limited thereto; any change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention shall be regarded as an equivalent replacement and is included within the protection scope of the present invention.

Claims (10)

1. A dual-sensor speech enhancement method based on statistical models, characterized by comprising:
synchronously acquiring air-conducted detection speech and non-air-conducted detection speech, detecting the endpoints of the air-conducted detection speech, and establishing an air-conduction noise statistical model from the pure-noise segments of the air-conducted detection speech;
correcting a joint statistical model using the air-conduction noise statistical model, and classifying the frames of the air-conducted detection speech;
computing an optimal air-conduction speech filter from the air-conducted speech linear-spectrum statistical model corresponding to the classification result and from the air-conduction noise statistical model;
filtering the air-conducted detection speech with the optimal air-conduction speech filter to obtain filter-enhanced speech,
wherein the joint statistical model and the air-conducted speech linear-spectrum statistical model are established in advance from synchronously acquired clean air-conducted training speech and non-air-conducted training speech.
2. The method according to claim 1, wherein the step of synchronously acquiring air-conducted detection speech and non-air-conducted detection speech, detecting the endpoints of the air-conducted detection speech, and establishing the air-conduction noise statistical model from the pure-noise segments of the air-conducted detection speech comprises:
synchronously acquiring and framing the air-conducted detection speech and the non-air-conducted detection speech;
computing, from the short-time autocorrelation function and the short-time energy of each non-air-conducted detection speech frame, its short-time average threshold-crossing rate; when the short-time average threshold-crossing rate exceeds a preset threshold, judging that frame to be a speech signal, and otherwise noise;
obtaining the endpoint locations of the non-air-conducted detection speech signal from the per-frame decisions;
taking the instants corresponding to the detected endpoints of the non-air-conducted detection speech signal as the endpoints of the air-conducted detection speech, and extracting the pure-noise segments of the air-conducted detection speech;
computing the linear-spectrum mean of the pure-noise segments of the air-conducted detection speech, and saving this mean as the statistical-model parameter of the air-conduction noise.
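The framing and endpoint-extraction bookkeeping of claim 2 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the tail-discarding framing convention, and the representation of decisions as a boolean array are all assumptions.

```python
import numpy as np

def split_frames(x, frame_len, hop):
    # Frame a 1-D signal into overlapping frames, discarding the tail
    # (the patent does not specify its framing convention).
    n = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def detect_endpoints(speech_flags):
    # speech_flags: boolean per-frame decisions (True = speech frame).
    # Returns the (first, last) speech-frame indices, or None if all noise;
    # frames before the first index form the pure-noise segment.
    idx = np.flatnonzero(speech_flags)
    if idx.size == 0:
        return None
    return int(idx[0]), int(idx[-1])
```

The pure-noise segment used for the noise statistical model is then simply the frames preceding the first returned index.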
3. The method according to claim 2, wherein the short-time average threshold-crossing rate is computed by the following formula:
Cw(n) = Σk { |sgn[Rw(k) − αT] − sgn[Rw(k−1) − αT]| + |sgn[Rw(k) + αT] − sgn[Rw(k−1) + αT]| } w(n−k),
where sgn[·] is the sign operation, α is a regulatory factor, w(n) is the window function, T is the initial threshold, Rw(k) is the short-time autocorrelation function, Ew is the short-time energy, and Cw(n) is the short-time average threshold-crossing rate.
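Under the reconstruction above, the rate counts crossings of the levels ±αT in the frame's autocorrelation sequence, weighted by the window. A sketch follows; since the source elides the defining formula of the regulatory factor α (presumably a function of the short-time energy Ew), α and T are folded into a single level `alpha_T` here as an assumption.

```python
import numpy as np

def threshold_crossing_rate(r, alpha_T, window):
    # r: short-time autocorrelation sequence R_w(k) of one frame
    # alpha_T: the combined level alpha * T
    # window: window weights w(n - k), same length as r
    sgn = np.sign
    # crossings of the +alpha*T level
    up = np.abs(sgn(r[1:] - alpha_T) - sgn(r[:-1] - alpha_T))
    # crossings of the -alpha*T level
    dn = np.abs(sgn(r[1:] + alpha_T) - sgn(r[:-1] + alpha_T))
    return float(np.sum((up + dn) * window[1:]))
```

Each crossing contributes |sgn difference| = 2, so a frame oscillating through both levels yields a large rate, which claim 2 compares against a preset threshold.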
4. The method according to claim 1, wherein the joint statistical model is corrected by the following steps:
converting the parameters of the joint statistical model to the linear spectral domain;
correcting the air-conducted speech data-stream parameters of the joint statistical model using the fact that clean air-conducted speech and air-conduction noise are additive in the linear spectral domain;
converting the corrected linear-spectral-domain parameters of the joint statistical model back to the original parameter domain to obtain the corrected joint statistical model;
wherein the air-conducted speech data-stream parameters of the joint statistical model are the means and covariances of the Gaussian components of a Gaussian mixture model or hidden Markov model.
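The convert–add–convert-back correction of claim 4 can be sketched for the mean parameters. The patent does not state the original parameter domain; log-spectral means are assumed here purely for illustration, so the conversion to the linear spectral domain is `exp` and the conversion back is `log`.

```python
import numpy as np

def correct_means(mu_log_clean, mu_log_noise):
    # Assumed: model means are stored in the log-spectral domain.
    # Step 1: convert to the linear spectral domain.
    # Step 2: clean speech + noise are additive there (claim 4).
    # Step 3: convert back to the original (log) domain.
    mu_lin = np.exp(mu_log_clean) + np.exp(mu_log_noise)
    return np.log(mu_lin)
```

Covariances would be corrected analogously, component by component, under the same additivity assumption.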
5. The method according to claim 1, wherein the step of computing the optimal air-conduction speech filter from the air-conducted speech linear-spectrum statistical model corresponding to the classification result and from the air-conduction noise statistical model comprises:
extracting the joint feature parameters of the air-conducted detection speech and the non-air-conducted detection speech, and computing, for each frame of joint detection speech, the output probability of each class of the corrected joint statistical model;
computing from the above output probabilities the weight parameters of the non-air-conducted detection-speech data stream and the air-conducted detection-speech data stream in the joint statistical model;
classifying the air-conducted detection speech frames with the joint statistical model updated by the above weight parameters, and then computing the optimal air-conduction speech filter from the air-conducted speech linear-spectrum statistical model corresponding to the classification result and from the air-conduction noise statistical model.
6. The method according to claim 5, wherein the weight parameters of the non-air-conducted detection-speech data stream and the air-conducted detection-speech data stream are computed by the following steps:
setting the initial weight of the air-conducted detection speech to w0 and the initial weight of the non-air-conducted detection speech to 1 − w0, setting the iteration number t = 0, and computing Difft,
where M is the number of mixture components of the model, L is the number of speech frames, p(j|zl) and p(k|zl) are respectively the probabilities that the l-th frame of joint detection speech zl belongs to the j-th and k-th classes of the joint statistical model, the distance term is the distance between the statistical parameters of the k-th and j-th classes of the joint statistical model, and the remaining terms are the statistical parameters of the k-th and j-th classes of the joint statistical model;
computing the air-conducted detection-speech weight θ1(Difft) and the non-air-conducted detection-speech weight θ2(Difft) = 1 − θ1(Difft), recomputing p(j|zl) and p(k|zl) with the updated weights, and then recomputing Difft+1;
if |Difft+1 − Difft| < ξ, where ξ is a preset threshold, stopping the weight update and proceeding to the next step; otherwise setting t = t + 1 and returning to the previous step;
computing the optimal weights θ1(DiffT) and θ2(DiffT) from DiffT, where T is the value of t when the update stops.
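The iteration in claim 6 is a fixed-point loop on the stream weight. Since the source elides both the defining formula of Difft and the mapping θ1(·), the sketch below stands in a caller-supplied `diff_fn` for the former and a hypothetical sigmoid mapping for the latter; only the loop structure (update, recompute, test against ξ) is taken from the claim.

```python
import math

def iterate_stream_weight(diff_fn, theta0=0.5, xi=1e-4, max_iter=100):
    # diff_fn(theta) stands in for the elided Diff_t computation,
    # which in the patent depends on the class posteriors p(j|z_l).
    theta = theta0                      # air-conducted stream weight w0
    diff = diff_fn(theta)
    for _ in range(max_iter):
        theta = 1.0 / (1.0 + math.exp(-diff))  # hypothetical theta_1(Diff_t)
        new_diff = diff_fn(theta)
        if abs(new_diff - diff) < xi:   # claim-6 stopping test
            break
        diff = new_diff
    return theta, 1.0 - theta           # theta_1 and theta_2 sum to 1
```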
7. The method according to claim 6, wherein the optimal air-conduction speech filter is computed by the following steps:
computing, with the optimal weights θ1(DiffT) and θ2(DiffT), the probability p(m|zl) that the joint detection speech frame zl belongs to the m-th class of the currently corrected joint statistical model;
computing the frequency-domain gain function of the optimal air-conduction speech filter with one of the following formulas:
where K is the dimension of the mean vector of the m-th class of the joint statistical model, the speech term is the i-th component of the linear-spectrum mean vector of the air-conducted speech corresponding to the m-th class of the joint statistical model, and the noise term is the i-th component of the noise linear-spectrum mean vector corresponding to the m-th class of the air-conduction noise statistical model.
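The gain formulas themselves are elided from the text (they were figures in the original). One plausible reading, given that the gain is built from class-wise speech and noise linear-spectrum means, is a Wiener-type gain; the sketch below is that assumption made explicit, not the patented formula.

```python
import numpy as np

def wiener_type_gain(mu_speech, mu_noise):
    # Hypothetical Wiener-style frequency-domain gain: per spectral bin i,
    # H(i) = mu_speech(i) / (mu_speech(i) + mu_noise(i)), using the
    # class-m air-conducted speech mean and the noise-model mean.
    return mu_speech / (mu_speech + mu_noise)
```

The per-class gains would then be combined with the posteriors p(m|zl) from the step above.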
8. The method according to claim 1, wherein the method further comprises:
converting the non-air-conducted detection speech into mapped air-conducted speech according to a mapping model from non-air-conducted speech to air-conducted speech;
performing a weighted fusion of the mapped air-conducted speech and the filter-enhanced speech to obtain fusion-enhanced speech;
wherein the mapping model is established in advance from the synchronously acquired clean air-conducted training speech and non-air-conducted training speech.
9. The method according to claim 8, wherein the step of performing a weighted fusion of the mapped air-conducted speech and the filter-enhanced speech comprises:
computing by the following formulas the weight of the m-th frame of the filter-enhanced speech xm and the weight of the m-th frame of the mapped speech ym,
where the variance terms are respectively the amplitude variances of the m-th frames of the filter-enhanced speech xm and the mapped speech ym, SNRm is the signal-to-noise ratio of the m-th frame of the filter-enhanced speech xm, and α and β are preset constants;
obtaining the fusion-enhanced speech by the weighted superposition of the filter-enhanced speech xm and the mapped speech ym according to the following formula:
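The weight formulas of claim 9 are elided (they were figures), but the fusion step itself is a plain weighted superposition with weights summing to one, which can be sketched as:

```python
import numpy as np

def fuse(x_filt, y_map, w_filt):
    # Per-frame linear fusion: w_filt is the weight of the
    # filter-enhanced speech x_m, and 1 - w_filt the weight of the
    # mapped speech y_m (the two weights sum to 1).
    return w_filt * x_filt + (1.0 - w_filt) * y_map
```

In the patent, `w_filt` would come from the (elided) per-frame formula driven by SNRm, the amplitude variances, and the constants α and β.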
10. The method according to claim 9, further comprising, before the step of performing the weighted fusion of the mapped air-conducted speech and the filter-enhanced speech:
taking the speech-signal start instant obtained by endpoint detection on the air-conducted detection speech, intercepting all data frames of the filter-enhanced speech before the signal start, and averaging their power to obtain the noise-frame power;
computing the signal-to-noise ratio SNRm by the following formula:
where the power term is the power of the m-th frame of the filter-enhanced speech xm.
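The elided SNRm formula is assumed here to be the standard decibel power ratio against the pre-speech noise-power estimate described in claim 10; the sketch makes that assumption explicit.

```python
import numpy as np

def noise_power(frame_powers, speech_start):
    # Average power of all filter-enhanced frames before the detected
    # speech start instant (claim 10's noise-frame power estimate).
    return float(np.mean(frame_powers[:speech_start]))

def frame_snr_db(frame_power, noise_pow):
    # Assumed standard form: SNR_m = 10 * log10(P_m / P_noise).
    return 10.0 * np.log10(frame_power / noise_pow)
```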
CN201910296425.4A 2016-01-14 2016-01-14 Dual-sensor voice enhancement method based on statistical model Active CN110010149B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910296425.4A CN110010149B (en) 2016-01-14 2016-01-14 Dual-sensor voice enhancement method based on statistical model

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610025390.7A CN105632512B (en) 2016-01-14 2016-01-14 A kind of dual sensor sound enhancement method and device based on statistical model
CN201910296425.4A CN110010149B (en) 2016-01-14 2016-01-14 Dual-sensor voice enhancement method based on statistical model

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201610025390.7A Division CN105632512B (en) 2016-01-14 2016-01-14 A kind of dual sensor sound enhancement method and device based on statistical model

Publications (2)

Publication Number Publication Date
CN110010149A true CN110010149A (en) 2019-07-12
CN110010149B CN110010149B (en) 2023-07-28

Family

ID=56047353

Family Applications (5)

Application Number Title Priority Date Filing Date
CN201910296437.7A Active CN110070883B (en) 2016-01-14 2016-01-14 Speech enhancement method
CN201610025390.7A Active CN105632512B (en) 2016-01-14 2016-01-14 A kind of dual sensor sound enhancement method and device based on statistical model
CN201910296425.4A Active CN110010149B (en) 2016-01-14 2016-01-14 Dual-sensor voice enhancement method based on statistical model
CN201910296436.2A Active CN110085250B (en) 2016-01-14 2016-01-14 Method for establishing air conduction noise statistical model and application method
CN201910296427.3A Active CN110070880B (en) 2016-01-14 2016-01-14 Establishment method and application method of combined statistical model for classification

Family Applications Before (2)

Application Number Title Priority Date Filing Date
CN201910296437.7A Active CN110070883B (en) 2016-01-14 2016-01-14 Speech enhancement method
CN201610025390.7A Active CN105632512B (en) 2016-01-14 2016-01-14 A kind of dual sensor sound enhancement method and device based on statistical model

Family Applications After (2)

Application Number Title Priority Date Filing Date
CN201910296436.2A Active CN110085250B (en) 2016-01-14 2016-01-14 Method for establishing air conduction noise statistical model and application method
CN201910296427.3A Active CN110070880B (en) 2016-01-14 2016-01-14 Establishment method and application method of combined statistical model for classification

Country Status (1)

Country Link
CN (5) CN110070883B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107808662B (en) * 2016-09-07 2021-06-22 斑马智行网络(香港)有限公司 Method and device for updating grammar rule base for speech recognition
CN107886967B (en) * 2017-11-18 2018-11-13 中国人民解放军陆军工程大学 A kind of bone conduction sound enhancement method of depth bidirectional gate recurrent neural network
CN107993670B (en) * 2017-11-23 2021-01-19 华南理工大学 Microphone array speech enhancement method based on statistical model
CN109584894A (en) * 2018-12-20 2019-04-05 西京学院 A kind of sound enhancement method blended based on radar voice and microphone voice
CN109767783B (en) * 2019-02-15 2021-02-02 深圳市汇顶科技股份有限公司 Voice enhancement method, device, equipment and storage medium
CN109767781A (en) * 2019-03-06 2019-05-17 哈尔滨工业大学(深圳) Speech separating method, system and storage medium based on super-Gaussian priori speech model and deep learning
CN110265056B (en) * 2019-06-11 2021-09-17 安克创新科技股份有限公司 Sound source control method, loudspeaker device and apparatus
CN110390945B (en) * 2019-07-25 2021-09-21 华南理工大学 Dual-sensor voice enhancement method and implementation device
CN110797039B (en) * 2019-08-15 2023-10-24 腾讯科技(深圳)有限公司 Voice processing method, device, terminal and medium
CN111724796B (en) * 2020-06-22 2023-01-13 之江实验室 Musical instrument sound identification method and system based on deep pulse neural network
CN113178191A (en) * 2021-04-25 2021-07-27 平安科技(深圳)有限公司 Federal learning-based speech characterization model training method, device, equipment and medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1992015155A1 (en) * 1991-02-19 1992-09-03 Motorola, Inc. Interference reduction system
JP2001236089A (en) * 1999-12-17 2001-08-31 Atr Interpreting Telecommunications Res Lab Statistical language model generating device, speech recognition device, information retrieval processor and kana/kanji converter
CN1750123A (en) * 2004-09-17 2006-03-22 微软公司 Method and apparatus for multi-sensory speech enhancement
CN101080765A (en) * 2005-05-09 2007-11-28 株式会社东芝 Voice activity detection apparatus and method
JP2008176155A (en) * 2007-01-19 2008-07-31 Kddi Corp Voice recognition device and its utterance determination method, and utterance determination program and its storage medium
CN101320566A (en) * 2008-06-30 2008-12-10 中国人民解放军第四军医大学 Non-air conduction speech reinforcement method based on multi-band spectrum subtraction
CN102027536A (en) * 2008-05-14 2011-04-20 索尼爱立信移动通讯有限公司 Adaptively filtering a microphone signal responsive to vibration sensed in a user's face while speaking
CN103208291A (en) * 2013-03-08 2013-07-17 华南理工大学 Speech enhancement method and device applicable to strong noise environments
CN103229238A (en) * 2010-11-24 2013-07-31 皇家飞利浦电子股份有限公司 System and method for producing an audio signal
US9058820B1 (en) * 2013-05-21 2015-06-16 The Intellisis Corporation Identifying speech portions of a sound model using various statistics thereof

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7283850B2 (en) * 2004-10-12 2007-10-16 Microsoft Corporation Method and apparatus for multi-sensory speech enhancement on a mobile device
CN105224844B (en) * 2014-07-01 2020-01-24 腾讯科技(深圳)有限公司 Verification method, system and device


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
GRACIARENA M.: "Combining Standard and Throat Microphones for Robust Speech Recognition", IEEE Signal Processing Letters *
RAHMAN M. S.: "Intelligibility Enhancement of Bone Conducted Speech by an Analysis-Synthesis Method", IEEE International Midwest Symposium *
ZHANG Zhengyou, LIU Zicheng: "Multi-Sensory Microphones for Robust Speech Detection, Enhancement and Recognition", ICASSP *
XU Fang: "Model-Based Multi-Data-Stream Speech Enhancement Technology", China Masters' Theses Full-text Database, Information Science and Technology *
NIU Yingli: "Multi-Sensor-Based Speech Enhancement Technology", China Masters' Theses Full-text Database, Information Science and Technology *

Also Published As

Publication number Publication date
CN110070883B (en) 2023-07-28
CN105632512B (en) 2019-04-09
CN110070880B (en) 2023-07-28
CN110085250B (en) 2023-07-28
CN105632512A (en) 2016-06-01
CN110010149B (en) 2023-07-28
CN110070880A (en) 2019-07-30
CN110085250A (en) 2019-08-02
CN110070883A (en) 2019-07-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant