CN106601229A - Voice awakening method based on soc chip - Google Patents

Voice awakening method based on soc chip

Info

Publication number
CN106601229A
CN106601229A (application CN201611003861.0A)
Authority
CN
China
Prior art keywords
mfcc
model
frame
likelihood value
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611003861.0A
Other languages
Chinese (zh)
Inventor
陈晓鹏
殷瑞祥
徐向民
张伟彬
邢晓芬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201611003861.0A priority Critical patent/CN106601229A/en
Publication of CN106601229A publication Critical patent/CN106601229A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142 Hidden Markov Models [HMMs]
    • G10L 15/148 Duration modelling in HMMs, e.g. semi HMM, segmental models or transition probabilities
    • G10L 19/26 Pre-filtering or post-filtering
    • G10L 25/18 Speech or voice analysis techniques characterised by the extracted parameters being spectral information of each sub-band
    • G10L 25/24 Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
    • G10L 25/78 Detection of presence or absence of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a voice wake-up method based on an SoC chip, comprising the steps of: S1, acquiring voice data with the chip and sampling it, converting the analog signal into a digital signal; S2, performing MFCC feature extraction on the digitized voice data; S3, performing voice activity detection on the MFCC feature values, judging whether the new frame of MFCC data is a speech frame, and if not, returning to step S2 and discarding the data, or if so, passing the MFCC feature values to the next step; S4, recognizing the MFCC feature values with an HMM-based speech recognition algorithm, and waking up the controlled device if the recognition result is a valid command, otherwise returning to step S2. With this method, a real-time system implemented with the highly robust algorithm achieves a high recognition rate and meets the requirements of low power consumption and high performance.

Description

Voice wake-up method based on an SoC chip
Technical field
The present invention relates to the technical field of speech recognition, and more particularly to a voice wake-up method based on an SoC chip.
Background art
As technology develops, more and more electronic devices enter daily life. While enjoying the convenience these devices bring, people increasingly expect them to be intelligent enough to support touch-free interaction.
Voice wake-up means that the user speaks a preset voice command, so that a device in a sleep state directly enters a command-awaiting state. With this technology, anyone can activate the device simply by speaking the preset wake-up word, in any environment and at any time, thereby achieving low power consumption and touch-free interaction.
However, most existing voice wake-up technologies are implemented on computers or mobile phones and require powerful processors, which makes them unsuitable for industrial applications. Voice wake-up based on an MCU is inexpensive, but the limited processor performance prevents it from achieving satisfactory results.
Summary of the invention
The technical problem to be solved by the present invention is to provide a voice wake-up method based on an SoC chip, in which a real-time system implemented with a highly robust algorithm achieves a high recognition rate and meets the requirements of low power consumption and high performance.
To solve the above technical problem, the present invention provides the following technical scheme: a voice wake-up method based on an SoC chip, comprising the following steps:
S1: the chip acquires voice data and samples it, converting the analog signal into a digital signal;
S2: MFCC feature extraction is performed on the digitized voice data;
S3: voice activity detection is performed on the MFCC feature values to judge whether the new frame of MFCC data is a speech frame; if not, the method returns to step S2 and discards the data; if so, the MFCC feature values are passed to the next step;
S4: the MFCC feature values are recognized by an HMM-based speech recognition algorithm; if the recognition result is a valid command, the controlled device is woken up; otherwise the method returns to step S2.
Further, the MFCC feature extraction in step S2 specifically comprises:
1) preprocessing the digital signal, including pre-emphasis, framing, and windowing;
2) performing an FFT on each frame to obtain its spectrum, and from it the magnitude spectrum $|X_n(k)|$;
3) applying the Mel filter bank $W_l(k)$ to the magnitude spectrum $|X_n(k)|$ according to:

$$m(l)=\sum_{k=o(l)}^{h(l)} W_l(k)\,|X_n(k)|,\qquad l=1,2,\ldots,26$$

$$W_l(k)=\begin{cases}\frac{k-o(l)}{c(l)-o(l)}, & o(l)\le k\le c(l)\\ \frac{h(l)-k}{h(l)-c(l)}, & c(l)\le k\le h(l)\end{cases}$$

where k is the index of the k-th FFT point, and o(l), c(l), h(l) are respectively the lower, center, and upper frequencies of the l-th triangular filter;
4) taking the logarithm of each filter output and then applying a discrete cosine transform to obtain the MFCC feature values:

$$c(i)=\sqrt{\frac{2}{N}}\sum_{l=1}^{L}\log m(l)\,\cos\!\left[\left(l-\frac{1}{2}\right)\frac{i\pi}{L}\right]$$

where N and L are both 26, the number of filters, and i is the MFCC coefficient order; taking i up to 12 yields 12 cepstral features. In addition, the log energy of the frame is appended as the 13th feature parameter, defined as:

$$c(13)=10\lg\sum_{k=1}^{256}\big(X_n(k)\big)^2$$

where $X_n(k)$ is the magnitude; this gives 13 feature parameters in total (12 cepstral features plus 1 log energy);
5) the 13 standard MFCC cepstral parameters reflect only the static characteristics of the speech; its dynamic characteristics are described by difference spectra of these static features. The first-order difference dtm(i) and second-order difference dtmm(i) of the 13 MFCC features are computed as:

$$dtm(i)=\frac{-2c(i-2)-c(i-1)+c(i+1)+2c(i+2)}{3}$$

$$dtmm(i)=\frac{-2\,dtm(i-2)-dtm(i-1)+dtm(i+1)+2\,dtm(i+2)}{3}$$

The 13 standard MFCC features, their 13 first-order differences, and their 13 second-order differences form a 39-dimensional MFCC feature vector, which completes the MFCC feature extraction.
Further, the voice activity detection performed on the feature values in step S3 uses a GMM-based voice activity detection method. The method assumes that speech and background noise each follow a Gaussian mixture distribution in a suitable feature space, and a silence model and a non-silence model are built in that space. For each new frame of MFCC data, the likelihood value P1 under the silence model and the likelihood value P2 under the non-silence model are computed and compared: if P2 exceeds P1 the current MFCC frame is a speech frame, otherwise it is a silence frame.
Further, once the current MFCC frame has been judged a speech frame, when judging the next MFCC frame the likelihood values P1 and P2 are each multiplied by the corresponding transition probability and the two products are compared: if the product for P2 exceeds the product for P1, the current frame is a speech frame, otherwise a silence frame.
Likewise, once the current MFCC frame has been judged a silence frame, when judging the next MFCC frame the likelihood values P1 and P2 are each multiplied by the corresponding transition probability and the two products are compared: if the product for P2 exceeds the product for P1, the current frame is a speech frame, otherwise a silence frame.
The corresponding transition probabilities are model data set in advance.
Further, the likelihood values P1 (silence model) and P2 (non-silence model) are computed as follows:
The silence model and the non-silence model are each composed of 13 39-dimensional Gaussian models. The probability density function of an M-order Gaussian mixture model is the weighted sum of M Gaussian probability density functions, as in formula 3.1:

$$P(X\mid\lambda)=\sum_{i=1}^{M}\omega_i\,b_i(X)\qquad(3.1)$$

where M is the number of multidimensional Gaussian models, M = 13; X is a D-dimensional random vector, namely the 39-dimensional MFCC feature vector; $b_i(X)$ is a component distribution and $\omega_i$ is its mixture weight. Each component is a D-dimensional joint Gaussian distribution, as in formula 3.2:

$$b_i(X)=\frac{1}{(2\pi)^{D/2}|\Sigma_i|^{1/2}}\exp\!\left\{-\frac{1}{2}(X-\mu_i)^{T}\Sigma_i^{-1}(X-\mu_i)\right\}\qquad(3.2)$$

where $\mu_i$ is the mean of the i-th dimension, $\sigma_i^2$ its variance, $x_i$ the input MFCC feature value of the i-th dimension, and D the total dimensionality, D = 39.
Since formula 3.2 is too expensive to compute directly, it is simplified. With a diagonal covariance it becomes

$$b_i(X)=\frac{1}{(2\pi)^{D/2}\prod_{i=1}^{39}\sigma_i}\,e^{-\sum_{i=1}^{39}\frac{(x_i-\mu_i)^2}{2\sigma_i^2}}$$

Taking the logarithm of both sides gives:

$$\ln b_i(X)=-\frac{1}{2}\left\{2\ln\!\left[(2\pi)^{D/2}\Big|\prod_{i=1}^{39}\sigma_i^2\Big|^{1/2}\right]+\sum_{i=1}^{39}\frac{(x_i-\mu_i)^2}{\sigma_i^2}\right\}$$

The first term inside the braces consists entirely of parameters that are known once the model is trained, so it can be computed in advance and stored as the model parameter gconst:

$$gconst=2\ln\!\left[(2\pi)^{D/2}\Big|\prod_{i=1}^{39}\sigma_i^2\Big|^{1/2}\right]$$

Formula 3.2 is thus transformed into formula 3.3:

$$\ln b_i(X)=-\frac{1}{2}\left[gconst+\sum_{i=1}^{39}\frac{(x_i-\mu_i)^2}{\sigma_i^2}\right]\qquad(3.3)$$

and formula 3.1 reduces to formula 3.4:

$$P(X\mid\lambda)=\sum_{i=1}^{M}\omega_i\,e^{\ln b_i(X)}\qquad(3.4)$$

Substituting the MFCC frame and the model parameters into the above yields the likelihood values of the frame under the silence model and under the non-silence model.
Further, substituting the MFCC frame and the model parameters proceeds in the following concrete steps:
1) the MFCC feature values of each speech frame are matched against the silence model and the non-silence model: first $(x_i-\mu_i)^2/\sigma_i^2$ is computed and accumulated over the 39 dimensions, giving the exponential parts fa0 and fa1 of the multidimensional Gaussian distributions of the two models:

$$fa0=\sum_{i=1}^{39}\frac{(x_i-\mu_i)^2}{\sigma_i^2}$$

where the mean $\mu_i$ and variance $\sigma_i^2$ are read directly from the model data;
2) from this result the log-likelihood of the multidimensional Gaussian distribution is obtained:

$$\ln b_i(X)=-\frac{1}{2}\,(gconst+fa0)$$

where gconst is pre-trained data read directly from the model, which completes the computation of the multidimensional Gaussian log-likelihood $\ln b_i(X)$ of formula 3.3;
3) as noted above, the silence model and the non-silence model each contain 13 multidimensional Gaussian distributions, so repeating steps 1) and 2) thirteen times yields the 13 log-likelihoods $\ln b_i(X)$; substituting these together with the corresponding weights $\omega_i$ into formula 3.4 gives the likelihood value $P_1$ of the current frame under the silence model and the likelihood value $P_2$ under the non-silence model.
Further, the HMM-based speech recognition algorithm of step S4 specifically comprises:
S41: loading the HMM models and constructing the recognition network of HMM chains;
S42: matching the MFCC feature values against the recognition network of the HMM models and computing the initial likelihood values;
S43: based on the initial likelihood values, using the Token Passing algorithm to find the optimal path through the HMM chain network, completing the decoding;
S44: judging whether the voice command matches an HMM chain; if so, it is valid speech, otherwise it is invalid speech.
Compared with the prior art, the present invention has at least the following beneficial effects:
(1) by moving part of the original algorithm's computation into the log domain, a large number of multiplications are converted into additions, which reduces latency on the microprocessor; the algorithm's most complex computations are further accelerated by dedicated hardware, lowering latency and finally achieving real-time recognition;
(2) the real-time system implemented with the highly robust algorithm achieves a high recognition rate;
(3) the algorithm is easy to upgrade: it is divided into three independent modules (feature extraction, voice activity detection, and speech recognition), so the system can later be optimized by replacing individual submodules with better-performing algorithms.
Description of the drawings
Fig. 1 is the overall flow chart of the voice wake-up method based on an SoC chip of the present invention;
Fig. 2 is a schematic diagram of a triangular filter in the method;
Fig. 3 is a schematic diagram of the triangular filter bank;
Fig. 4 is the voice activity detection flow chart;
Fig. 5 is a schematic diagram of the parameter layout of a 39-dimensional Gaussian model;
Fig. 6 is the step-by-step voice activity detection flow chart;
Fig. 7 is a schematic diagram of the pre-trained model data used in the voice activity detection;
Fig. 8 is the overall step flow chart of the speech recognition algorithm;
Fig. 9 is a schematic diagram of an example HMM chain in the speech recognition algorithm.
Detailed description of the embodiments
It should be noted that, provided there is no conflict, the embodiments in this application and the features in those embodiments may be combined with one another. The application is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is the overall algorithm flow chart of the present invention; the processing flow of each module is as follows:
1. Speech front-end processing:
Speech front-end processing converts the analog speech signal into a digital signal by sampling; in this scheme the sample rate is 16 kHz. The digitized speech is in PCM (Pulse Code Modulation) format, i.e. the speech data obtained by sampling and quantizing the analog speech signal, which is the most basic and primitive speech format. In the present invention the ADC is integrated in the SoC chip; speech detection processing is performed every 10 ms, the sampling frequency is 16 k samples per second, and the data width is 16 bits.
2. MFCC feature extraction:
1) signal preprocessing, including pre-emphasis (Preemphasis), framing (Frame Blocking), and windowing (Windowing). The sampling frequency of the speech signal is fs = 16 kHz; since a speech signal can be regarded as stationary within 10-30 ms, each frame is set to 10 ms, so the frame length is 160 points and the frame shift is half a frame length, i.e. 80 points;
2) a 256-point FFT is performed on each frame to obtain its spectrum, and from it the magnitude spectrum $|X_n(k)|$;
3) the Mel filter bank $W_l(k)$ is applied to the magnitude spectrum $|X_n(k)|$ according to:

$$m(l)=\sum_{k=o(l)}^{h(l)} W_l(k)\,|X_n(k)|,\qquad l=1,2,\ldots,26$$

$$W_l(k)=\begin{cases}\frac{k-o(l)}{c(l)-o(l)}, & o(l)\le k\le c(l)\\ \frac{h(l)-k}{h(l)-c(l)}, & c(l)\le k\le h(l)\end{cases}$$

where k is the index of the k-th FFT point, and o(l), c(l), h(l) are the lower, center, and upper frequencies of the l-th triangular filter, as shown in Fig. 2.
In the present invention the Mel filter bank consists of 26 triangular filters whose parameters are calculated in advance. The triangular filter bank is shown in Fig. 3: the abscissa corresponds to the FFT points and the ordinate is $W_l(k)$; since the FFT is symmetric, only the first half of the points are used to compute the spectrum, which is then fed into the triangular filters;
4) the logarithm (Logarithm) of each filter output is taken, and a discrete cosine transform then yields the MFCC values:

$$c(i)=\sqrt{\frac{2}{N}}\sum_{l=1}^{L}\log m(l)\,\cos\!\left[\left(l-\frac{1}{2}\right)\frac{i\pi}{L}\right]$$

where N and L are both 26, the number of filters, and i is the MFCC coefficient order; the present invention takes i up to 12, giving 12 cepstral features. In addition, the log energy of the frame is appended as the 13th feature parameter, defined as:

$$c(13)=10\lg\sum_{k=1}^{256}\big(X_n(k)\big)^2$$

This gives 13 feature parameters (12 cepstral features plus 1 log energy);
5) these 13 standard MFCC cepstral parameters reflect only the static characteristics of the speech; its dynamic characteristics can be described by difference spectra of the static features. The first-order difference dtm(i) and second-order difference dtmm(i) of the 13 MFCC features are computed as:

$$dtm(i)=\frac{-2c(i-2)-c(i-1)+c(i+1)+2c(i+2)}{3}$$

$$dtmm(i)=\frac{-2\,dtm(i-2)-dtm(i-1)+dtm(i+1)+2\,dtm(i+2)}{3}$$

The 13 standard MFCC features, their 13 first-order differences, and their 13 second-order differences form a 39-dimensional MFCC feature vector, which completes the MFCC feature extraction. For concreteness, a minimal C sketch of this pipeline is given below.
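The following sketch covers steps 1)-5) under the parameters stated above (16 kHz sampling, 160-point frames, 256-point FFT, 26 filters, 12 cepstra). The FFT itself and the precomputed filter boundary bins o[], c[], h[] are assumed to be supplied elsewhere, the pre-emphasis coefficient 0.97 is a typical value that the patent does not state, and all function names are illustrative rather than taken from the patent:

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define FRAME_LEN 160   /* 10 ms at 16 kHz            */
#define NFFT      256   /* FFT size per frame         */
#define NFILT     26    /* triangular Mel filters     */
#define NCEP      12    /* cepstral coefficients kept */

/* Step 1): pre-emphasis y[n] = x[n] - 0.97*x[n-1] (applied within the
   frame for simplicity) followed by a Hamming window. */
static void preprocess(const short *x, float *frame)
{
    for (int i = 0; i < FRAME_LEN; i++) {
        float s = (i > 0) ? x[i] - 0.97f * x[i - 1] : (float)x[0];
        frame[i] = s * (0.54f - 0.46f *
                        cosf(2.0f * (float)M_PI * i / (FRAME_LEN - 1)));
    }
}

/* Step 3): triangular Mel filter bank m(l) applied to the magnitude
   spectrum mag[] (NFFT bins; the filters only use the lower half) per
   the W_l(k) formula, with precomputed bins o[], c[], h[]. */
static void filter_bank(const float *mag, const int *o, const int *c,
                        const int *h, float *m)
{
    for (int l = 0; l < NFILT; l++) {
        m[l] = 0.0f;
        for (int k = o[l]; k <= h[l]; k++) {
            float w = (k <= c[l])
                ? (float)(k - o[l]) / (float)(c[l] - o[l])   /* rising  */
                : (float)(h[l] - k) / (float)(h[l] - c[l]);  /* falling */
            m[l] += w * mag[k];
        }
    }
}

/* Step 4): log of the filter outputs followed by a DCT gives 12
   cepstra; the frame log-energy over the spectrum points is appended
   as the 13th parameter. */
static void cepstra(const float *m, const float *mag, float *c13)
{
    for (int i = 1; i <= NCEP; i++) {
        float acc = 0.0f;
        for (int l = 1; l <= NFILT; l++)
            acc += logf(m[l - 1]) *
                   cosf((l - 0.5f) * i * (float)M_PI / NFILT);
        c13[i - 1] = sqrtf(2.0f / NFILT) * acc;
    }
    float e = 0.0f;
    for (int k = 0; k < NFFT; k++)
        e += mag[k] * mag[k];
    c13[NCEP] = 10.0f * log10f(e);
}

/* Step 5): first-order difference over a +/-2 frame window; running
   the same routine over the deltas yields the second-order difference,
   completing the 39-dimensional feature vector. */
static void delta(const float cep[][NCEP + 1], int t, float *d)
{
    for (int i = 0; i <= NCEP; i++)
        d[i] = (-2.0f * cep[t - 2][i] - cep[t - 1][i]
                + cep[t + 1][i] + 2.0f * cep[t + 2][i]) / 3.0f;
}
```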
3. Voice activity detection (VAD):
The present invention uses a GMM-based voice activity detection method. The method assumes that speech and background noise each follow a Gaussian mixture distribution in a suitable feature space; their GMM models are built in that space, and model matching is then used to detect the valid speech segments in the measured signal. The algorithm flow is shown in Fig. 4.
The models are trained in advance with the HTK toolkit. One 39-dimensional Gaussian model consists of 1 weight (MIXTURE), 39 means (MEAN), 39 variances (VARIANCE), and 1 gconst, as shown in Fig. 5.
The silence model and the non-silence model each consist of 13 multidimensional Gaussian models of the kind shown in Fig. 5. When a new frame of speech data enters the system, its 39-dimensional MFCC feature vector is scored against the silence and non-silence models; the two likelihood values are compared, and the model with the larger likelihood is the matching model of the current frame, which decides whether the current frame is a speech frame. The detailed VAD process is shown in Fig. 6.
The transition probabilities a11, a12, a21, a22 are pre-trained model data, as shown in Fig. 7: a11 is the probability that the previous frame is a silence frame and the current frame is also a silence frame; a12 that the previous frame is a silence frame but the current frame is a speech frame; a21 that the previous frame is a speech frame but the current frame is a silence frame; and a22 that the previous frame is a speech frame and the current frame is also a speech frame. The structures below sketch this model layout in C.
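Assuming the layout of Fig. 5 and Fig. 7, the model data could be represented as the following C structures; the type and field names are illustrative, not taken from the patent:

```c
#define DIM   39   /* feature dimensions per Gaussian      */
#define N_MIX 13   /* Gaussians per model (silence/speech) */

typedef struct {
    float weight;      /* MIXTURE: mixture weight (omega_i)  */
    float mean[DIM];   /* MEAN: 39 means (mu_i)              */
    float var[DIM];    /* VARIANCE: 39 variances (sigma_i^2) */
    float gconst;      /* precomputed normalization term     */
} Gaussian;

typedef struct {
    Gaussian mix[N_MIX];          /* one GMM of 13 Gaussians */
} GmmModel;

typedef struct {
    GmmModel silence;             /* silence model           */
    GmmModel speech;              /* non-silence model       */
    float a11, a12, a21, a22;     /* transition probabilities */
} VadModel;
```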
The most complex part of the whole process is the likelihood computation, which is introduced below.
The probability density function of the 13-order multidimensional Gaussian mixture model is the weighted sum of 13 multidimensional Gaussian probability density functions, as in formula 3.1:

$$P(X\mid\lambda)=\sum_{i=1}^{M}\omega_i\,b_i(X)\qquad(3.1)$$

where M is the number of multidimensional Gaussian models (13 in the present invention), X is a D-dimensional random vector (the 39-dimensional MFCC feature vector mentioned above), $b_i(X)$ is a component distribution, and $\omega_i$ is its mixture weight. Each component is a D-dimensional joint Gaussian distribution:

$$b_i(X)=\frac{1}{(2\pi)^{D/2}|\Sigma_i|^{1/2}}\exp\!\left\{-\frac{1}{2}(X-\mu_i)^{T}\Sigma_i^{-1}(X-\mu_i)\right\}\qquad(3.2)$$

In one dimension, $\mu$ is the mean and $\sigma^2$ the variance; in D dimensions, D is the dimensionality of X and $\Sigma$ is the D×D covariance matrix, defined as $\Sigma=E[(x-\mu)(x-\mu)^T]$, with $|\Sigma|$ its determinant.
The concrete computation steps of the VAD algorithm are therefore:
1) the 39-dimensional MFCC feature values of each frame are matched against the silence and non-silence models: first $(x_i-\mu_i)^2/\sigma_i^2$ is computed and the 39 results are accumulated, giving the exponential parts fa0 and fa1 of the multidimensional Gaussian distributions of the two models (this computation is completed by a hardware-accelerated IP block):

$$fa0=\sum_{i=1}^{39}\frac{(x_i-\mu_i)^2}{\sigma_i^2}$$

with fa1 computed in the same way for the other model, where the means $\mu_i$ and variances $\sigma_i^2$ are read directly from the model data;
2) from the previous result, the log-likelihood of the multidimensional Gaussian distribution is computed:

$$\ln b_i(X)=-\frac{1}{2}\,(gconst+fa0)$$

where gconst is pre-trained data read directly from the model; this completes the multidimensional Gaussian log-likelihood of formula 3.3;
3) as noted above, the silence and non-silence models each contain 13 multidimensional Gaussian distributions, so steps 1) and 2) are repeated 13 times to obtain the 13 log-likelihoods $\ln b_i(X)$; substituting these together with the corresponding weights $\omega_i$ into the reduced form of formula 3.1 (formula 3.4)

$$P(X\mid\lambda)=\sum_{i=1}^{M}\omega_i\,e^{\ln b_i(X)}\qquad(3.4)$$

gives the likelihood value $P_1$ of the current frame under the silence model and $P_2$ under the non-silence model;
4) finally the transition probabilities are applied:
if the previous frame is a speech frame, the probability that the current frame is a speech frame is $a_{22}P_2$, and the probability that it is a silence frame is $a_{21}P_1$;
if the previous frame is a silence frame, the probability that the current frame is a speech frame is $a_{12}P_2$, and the probability that it is a silence frame is $a_{11}P_1$.
The speech-frame probability is compared with the silence-frame probability; if the speech-frame probability is larger, the current frame is taken as a speech frame, otherwise as a silence frame. This completes the VAD algorithm; a C sketch of this per-frame decision follows below.
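A minimal sketch of the per-frame decision in steps 1)-4), reusing the VadModel/Gaussian structures sketched earlier; the fa accumulation that the patent offloads to a hardware-accelerated IP block is done in plain software here, and all names are illustrative:

```c
#include <math.h>

/* Likelihood of one frame under one GMM, following formulas 3.3/3.4:
   the exponent part fa is accumulated per Gaussian, combined with
   gconst in the log domain, then weighted and summed over the 13
   mixtures. */
static float gmm_likelihood(const GmmModel *gmm, const float *x)
{
    float p = 0.0f;
    for (int i = 0; i < N_MIX; i++) {
        const Gaussian *g = &gmm->mix[i];
        float fa = 0.0f;                        /* fa0 or fa1 */
        for (int d = 0; d < DIM; d++) {
            float diff = x[d] - g->mean[d];
            fa += diff * diff / g->var[d];
        }
        p += g->weight * expf(-0.5f * (g->gconst + fa));
    }
    return p;
}

/* Step 4): weight P1/P2 by the transition probabilities selected from
   the previous frame's decision, then compare. Returns 1 for speech. */
static int vad_decide(const VadModel *m, const float *x, int prev_speech)
{
    float p1 = gmm_likelihood(&m->silence, x);
    float p2 = gmm_likelihood(&m->speech, x);
    float p_speech  = (prev_speech ? m->a22 : m->a12) * p2;
    float p_silence = (prev_speech ? m->a21 : m->a11) * p1;
    return p_speech > p_silence;
}
```

The log-domain form with gconst avoids the per-dimension products of formula 3.2, which is the conversion of multiplications into additions mentioned in the beneficial effects.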
4. Speech recognition algorithm:
The flow of this module is shown in Fig. 8. Loading the models and building the HMM chains are done once, when the program initializes, and need not be repeated afterwards; the module is entered for computation only when the upstream VAD module detects valid speech. Each state of the HMM models used by this module is composed of 24 GMM components. The flow is as follows:
(1) load the HMM models and construct the recognition network of HMM chains;
(2) match the MFCC feature values against the recognition network of the HMM models and compute the initial likelihood values;
(3) based on the initial likelihood values, use the Token Passing algorithm to find the optimal path through the HMM chain network, completing the decoding;
(4) judge whether the voice command matches an HMM chain; if so, it is valid speech, otherwise it is invalid speech.
The whole flow is described below, taking "shutdown" (Chinese 关机, pinyin "guan ji") as an example; the corresponding HMM chain is given below (an actual HMM chain is longer, with each syllable made up of several states; it is simplified here for ease of explanation). "Shutdown" is split into the units "g", "uan", "j", "i"; these 4 units are modeled as 4 HMM states, which are connected to form the HMM chain shown in Fig. 9.
A. The token value at the starting point of the network (state "g") is initialized: Pg = 0.
B. When the first frame of MFCC data arrives, token passing starts. In the first frame only the token value Pg exists, and it is passed to states "g" and "uan", concretely:
Pg = Pg + a11 + log(GMMg)
Puan = Pg + a12 + log(GMMuan)
where log(GMMg) is the likelihood of the MFCC data for state "g" and log(GMMuan) the likelihood for state "uan"; the likelihoods are computed in the same way as in the VAD, see formulas 3.3 and 3.4.
C. When the second frame of data arrives, both state "g" and state "uan" hold token values, so each token is passed to the states connected to its state.
The token value of state "g" is updated:
Pg = Pg + a11 + log(GMMg)
The token value of state "uan" is updated:
Pg→uan = Pg + a12
Puan→uan = Puan + a22
After the update: Puan = max(Pg→uan, Puan→uan) + log(GMMuan)
Since state "uan" is connected to state "g" on its left and also to itself, two candidate token values are obtained; they are compared and the larger one is kept.
The token of state "j" is updated:
Pj = Puan + a23 + log(GMMj)
D. When the third frame arrives, the token value of state "g" is updated:
Pg = Pg + a11 + log(GMMg)
The token value of state "uan" is updated:
Pg→uan = Pg + a12
Puan→uan = Puan + a22
After the update: Puan = max(Pg→uan, Puan→uan) + log(GMMuan)
The token of state "j" is updated:
Puan→j = Puan + a23
Pj→j = Pj + a33
After the update: Pj = max(Puan→j, Pj→j) + log(GMMj)
The token of state "i" is updated:
Pi = Pj + a34 + log(GMMi)
E. When the fourth frame arrives, the token value of state "g" is updated:
Pg = Pg + a11 + log(GMMg)
The token value of state "uan" is updated:
Pg→uan = Pg + a12
Puan→uan = Puan + a22
After the update: Puan = max(Pg→uan, Puan→uan) + log(GMMuan)
The token of state "j" is updated:
Puan→j = Puan + a23
Pj→j = Pj + a33
After the update: Pj = max(Puan→j, Pj→j) + log(GMMj)
The token of state "i" is updated:
Pj→i = Pj + a34
Pi→i = Pi + a44
After the update: Pi = max(Pj→i, Pi→i) + log(GMMi)
At this point all frames of the voice command have been input and the token comparison starts: the token values of the four states are sorted by size; if the token value of the last state of the HMM chain (state "i") is the largest, the input voice command matches this "shutdown" HMM chain and the decoding result is "shutdown"; otherwise the input is regarded as invalid speech.
The whole decoding process shows that, as the number of frames grows, tokens diffuse steadily from the left end to the right end; during this process each state holds one token, and tokens are passed to adjacent states and updated. When the specified number of frames has arrived (the frame count is determined by the length of the preset voice command: a short command such as "shutdown" needs fewer frames, while a longer one such as "open sesame" needs more), the tokens of all states are sorted; if the token value of the end state of the HMM chain is the largest, the currently input speech matches this HMM chain. In practical applications the number of recognizable commands can be increased, giving multiple HMM chains; at the last frame, all states of all HMM chains are sorted together to determine which command was spoken. A C sketch of this decoding loop follows below.
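A minimal sketch of the token-passing loop for a single left-to-right chain, under the same assumptions as the earlier sketches; state_loglike() stands in for the per-state log(GMM) term, the transition constants are taken in the log domain, and all names are illustrative:

```c
#include <float.h>

#define N_STATES 4   /* e.g. the "g", "uan", "j", "i" chain above */

/* log-likelihood log(GMM_s) of one MFCC frame under state s; assumed
   to be provided elsewhere (same math as the VAD scoring). */
extern float state_loglike(int s, const float *mfcc_frame);

/* a_self[s]: log self-loop probability of state s (a11, a22, ...);
   a_next[s]: log probability of moving from state s to s+1.
   Returns 1 if the last state ends up holding the largest token,
   i.e. the input matches this chain. */
static int token_passing(const float (*frames)[39], int n_frames,
                         const float *a_self, const float *a_next)
{
    float tok[N_STATES];
    tok[0] = 0.0f;                      /* step A: start token */
    for (int s = 1; s < N_STATES; s++)
        tok[s] = -FLT_MAX;              /* not yet reachable   */

    for (int t = 0; t < n_frames; t++) {
        /* sweep right to left so the entry value uses the previous
           frame's token of the left neighbour (steps B-E) */
        for (int s = N_STATES - 1; s >= 0; s--) {
            float stay  = (tok[s] > -FLT_MAX)
                          ? tok[s] + a_self[s] : -FLT_MAX;
            float enter = (s > 0 && tok[s - 1] > -FLT_MAX)
                          ? tok[s - 1] + a_next[s - 1] : -FLT_MAX;
            float best  = (stay > enter) ? stay : enter;
            if (best > -FLT_MAX)        /* keep the larger token */
                tok[s] = best + state_loglike(s, frames[t]);
        }
    }

    int best = 0;                       /* final token comparison */
    for (int s = 1; s < N_STATES; s++)
        if (tok[s] > tok[best]) best = s;
    return best == N_STATES - 1;
}
```

For multiple commands, one such chain would be kept per command and all final tokens sorted together at the last frame, as described above.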
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various equivalent changes, modifications, substitutions, and variations may be made to these embodiments without departing from the principles and spirit of the present invention; the scope of the present invention is defined by the appended claims and their equivalents.

Claims (7)

1. A voice wake-up method based on an SoC chip, characterized by comprising the following steps:
S1: the chip acquires voice data and samples it, converting the analog signal into a digital signal;
S2: MFCC feature extraction is performed on the digitized voice data;
S3: voice activity detection is performed on the MFCC feature values to judge whether the new frame of MFCC data is a speech frame; if not, the method returns to step S2 and discards the data; if so, the MFCC feature values are passed to the next step;
S4: the MFCC feature values are recognized by an HMM-based speech recognition algorithm; if the recognition result is a valid command, the controlled device is woken up; otherwise the method returns to step S2.
2. The voice wake-up method based on an SoC chip according to claim 1, characterized in that the MFCC feature extraction in step S2 specifically comprises:
1) preprocessing the digital signal, including pre-emphasis, framing, and windowing;
2) performing an FFT on each frame to obtain its spectrum, and from it the magnitude spectrum $|X_n(k)|$;
3) applying the Mel filter bank $W_l(k)$ to the magnitude spectrum $|X_n(k)|$ according to:

$$m(l)=\sum_{k=o(l)}^{h(l)} W_l(k)\,|X_n(k)|,\qquad l=1,2,\ldots,26$$

$$W_l(k)=\begin{cases}\frac{k-o(l)}{c(l)-o(l)}, & o(l)\le k\le c(l)\\ \frac{h(l)-k}{h(l)-c(l)}, & c(l)\le k\le h(l)\end{cases}$$

where k is the index of the k-th FFT point, and o(l), c(l), h(l) are respectively the lower, center, and upper frequencies of the l-th triangular filter;
4) taking the logarithm of each filter output and then applying a discrete cosine transform to obtain the MFCC feature values:

$$c(i)=\sqrt{\frac{2}{N}}\sum_{l=1}^{L}\log m(l)\,\cos\!\left[\left(l-\frac{1}{2}\right)\frac{i\pi}{L}\right]$$

where N and L are both 26, the number of filters, and i is the MFCC coefficient order; taking i up to 12 yields 12 cepstral features; in addition, the log energy of the frame is appended as the 13th feature parameter, defined as:

$$c(13)=10\lg\sum_{k=1}^{256}\big(X_n(k)\big)^2$$

where $X_n(k)$ is the magnitude, giving 13 feature parameters (12 cepstral features plus 1 log energy);
5) the 13 standard MFCC cepstral parameters reflect only the static characteristics of the speech; its dynamic characteristics are described by difference spectra of these static features; the first-order difference dtm(i) and second-order difference dtmm(i) of the 13 MFCC features are computed as:

$$dtm(i)=\frac{-2c(i-2)-c(i-1)+c(i+1)+2c(i+2)}{3}$$

$$dtmm(i)=\frac{-2\,dtm(i-2)-dtm(i-1)+dtm(i+1)+2\,dtm(i+2)}{3}$$

The 13 standard MFCC features, their 13 first-order differences, and their 13 second-order differences form a 39-dimensional MFCC feature vector, which completes the MFCC feature extraction.
3. The voice wake-up method based on an SoC chip according to claim 1, characterized in that the voice activity detection on the feature values in step S3 uses a GMM-based voice activity detection method, which assumes that speech and background noise each follow a Gaussian mixture distribution in a suitable feature space, and builds a silence model and a non-silence model in that feature space; for each new frame of MFCC data, the likelihood value P1 under the silence model and the likelihood value P2 under the non-silence model are computed and compared: if P2 exceeds P1, the current MFCC frame is a speech frame, otherwise it is a silence frame.
4. The voice wake-up method based on an SoC chip according to claim 3, characterized in that, once the current MFCC frame has been judged a speech frame, when judging the next MFCC frame the likelihood values P1 and P2 are each multiplied by the corresponding transition probability and the two products are compared: if the product for P2 exceeds the product for P1, the current frame is a speech frame, otherwise a silence frame;
once the current MFCC frame has been judged a silence frame, when judging the next MFCC frame the likelihood values P1 and P2 are likewise each multiplied by the corresponding transition probability and the two products are compared: if the product for P2 exceeds the product for P1, the current frame is a speech frame, otherwise a silence frame;
the corresponding transition probabilities are model data set in advance.
5. The voice wake-up method based on an SoC chip according to claim 3, characterized in that the likelihood value P1 of the silence model and the likelihood value P2 of the non-silence model are computed as follows:
the silence model and the non-silence model are each composed of 13 39-dimensional Gaussian models; the probability density function of an M-order Gaussian mixture model is the weighted sum of M Gaussian probability density functions, as in formula 3.1:

$$P(X\mid\lambda)=\sum_{i=1}^{M}\omega_i\,b_i(X)\qquad(3.1)$$

where M is the number of multidimensional Gaussian models, M = 13; X is a D-dimensional random vector, namely the 39-dimensional MFCC feature vector; $b_i(X)$ is a component distribution and $\omega_i$ its mixture weight; each component is a D-dimensional joint Gaussian distribution, as in formula 3.2:

$$b_i(X)=\frac{1}{(2\pi)^{D/2}|\Sigma_i|^{1/2}}\exp\!\left\{-\frac{1}{2}(X-\mu_i)^{T}\Sigma_i^{-1}(X-\mu_i)\right\}\qquad(3.2)$$

where $\mu_i$ is the mean of the i-th dimension, $\sigma_i^2$ its variance, $x_i$ the input MFCC feature value of the i-th dimension, and D the total dimensionality, D = 39;
since formula 3.2 is too expensive to compute directly, it is simplified; with a diagonal covariance,

$$b_i(X)=\frac{1}{(2\pi)^{D/2}\prod_{i=1}^{39}\sigma_i}\,e^{-\sum_{i=1}^{39}\frac{(x_i-\mu_i)^2}{2\sigma_i^2}}$$

taking the logarithm of both sides gives

$$\ln b_i(X)=-\frac{1}{2}\left\{2\ln\!\left[(2\pi)^{D/2}\Big|\prod_{i=1}^{39}\sigma_i^2\Big|^{1/2}\right]+\sum_{i=1}^{39}\frac{(x_i-\mu_i)^2}{\sigma_i^2}\right\}$$

the first term in the braces consists entirely of parameters known once the model is trained, so it is computed in advance and stored as the model parameter gconst:

$$gconst=2\ln\!\left[(2\pi)^{D/2}\Big|\prod_{i=1}^{39}\sigma_i^2\Big|^{1/2}\right]$$

so that formula 3.2 is transformed into formula 3.3:

$$\ln b_i(X)=-\frac{1}{2}\left[gconst+\sum_{i=1}^{39}\frac{(x_i-\mu_i)^2}{\sigma_i^2}\right]\qquad(3.3)$$

and formula 3.1 reduces to formula 3.4:

$$P(X\mid\lambda)=\sum_{i=1}^{M}\omega_i\,e^{\ln b_i(X)}\qquad(3.4)$$

substituting the MFCC frame and the model parameters into the above yields the likelihood value of the frame under the silence model and under the non-silence model.
6. The voice wake-up method based on an SoC chip according to claim 5, characterized in that substituting the MFCC frame and the model parameters into the above to obtain the likelihood values of the frame under the silence model and the non-silence model comprises the concrete steps of:
1) matching the MFCC feature values of each speech frame against the silence model and the non-silence model: first computing $(x_i-\mu_i)^2/\sigma_i^2$ and accumulating the results, giving the exponential parts fa0 and fa1 of the multidimensional Gaussian distributions of the two models:

$$fa0=\sum_{i=1}^{39}\frac{(x_i-\mu_i)^2}{\sigma_i^2}$$

where the mean $\mu_i$ and variance $\sigma_i^2$ are read directly from the model data;
2) computing from the previous result the log-likelihood of the multidimensional Gaussian distribution:

$$\ln b_i(X)=-\frac{1}{2}\,(gconst+fa0)$$

where gconst is pre-trained data read directly from the model, completing the multidimensional Gaussian log-likelihood $\ln b_i(X)$ of formula 3.3;
3) since, as above, the silence model and the non-silence model each contain 13 multidimensional Gaussian distributions, repeating steps 1) and 2) 13 times yields the 13 log-likelihoods $\ln b_i(X)$; substituting these together with the corresponding weights $\omega_i$ into formula 3.4 gives the likelihood value $P_1$ of the current frame under the silence model and the likelihood value $P_2$ under the non-silence model.
7. The voice wake-up method based on an SoC chip according to claim 1, characterized in that the HMM-based speech recognition algorithm of step S4 specifically comprises:
S41: loading the HMM models and constructing the recognition network of HMM chains;
S42: matching the MFCC feature values against the recognition network of the HMM models and computing the initial likelihood values;
S43: based on the initial likelihood values, using the Token Passing algorithm to find the optimal path through the HMM chain network, completing the decoding;
S44: judging whether the voice command matches an HMM chain; if so, it is valid speech, otherwise it is invalid speech.
CN201611003861.0A 2016-11-15 2016-11-15 Voice awakening method based on soc chip Pending CN106601229A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611003861.0A CN106601229A (en) 2016-11-15 2016-11-15 Voice awakening method based on soc chip

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611003861.0A CN106601229A (en) 2016-11-15 2016-11-15 Voice awakening method based on soc chip

Publications (1)

Publication Number Publication Date
CN106601229A true CN106601229A (en) 2017-04-26

Family

ID=58590197

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611003861.0A Pending CN106601229A (en) 2016-11-15 2016-11-15 Voice awakening method based on soc chip

Country Status (1)

Country Link
CN (1) CN106601229A (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1455387A (en) * 2002-11-15 2003-11-12 中国科学院声学研究所 Rapid decoding method for voice identifying system
CN101051462A (en) * 2006-04-07 2007-10-10 株式会社东芝 Feature-vector compensating apparatus and feature-vector compensating method
CN203253172U (en) * 2013-03-18 2013-10-30 北京承芯卓越科技有限公司 Intelligent voice communication toy
CN105096939A (en) * 2015-07-08 2015-11-25 百度在线网络技术(北京)有限公司 Voice wake-up method and device
CN105206271A (en) * 2015-08-25 2015-12-30 北京宇音天下科技有限公司 Intelligent equipment voice wake-up method and system for realizing method
CN105869628A (en) * 2016-03-30 2016-08-17 乐视控股(北京)有限公司 Voice endpoint detection method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jiang Nan (姜楠), "Research and Implementation of Voice Activity Detection Algorithms in a Mobile-Phone Speech Recognition System", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107886957A (en) * 2017-11-17 2018-04-06 广州势必可赢网络科技有限公司 The voice awakening method and device of a kind of combination Application on Voiceprint Recognition
CN111868825A (en) * 2018-03-12 2020-10-30 赛普拉斯半导体公司 Dual pipeline architecture for wake phrase detection with voice onset detection
CN108615535A (en) * 2018-05-07 2018-10-02 腾讯科技(深圳)有限公司 Sound enhancement method, device, intelligent sound equipment and computer equipment
CN108986822A (en) * 2018-08-31 2018-12-11 出门问问信息科技有限公司 Audio recognition method, device, electronic equipment and non-transient computer storage medium
CN109088611A (en) * 2018-09-28 2018-12-25 咪付(广西)网络技术有限公司 A kind of auto gain control method and device of acoustic communication system
CN112102848B (en) * 2019-06-17 2024-04-26 华为技术有限公司 Method, chip and terminal for identifying music
CN112102848A (en) * 2019-06-17 2020-12-18 华为技术有限公司 Method, chip and terminal for identifying music
CN110580919B (en) * 2019-08-19 2021-09-28 东南大学 Voice feature extraction method and reconfigurable voice feature extraction device under multi-noise scene
CN110580919A (en) * 2019-08-19 2019-12-17 东南大学 voice feature extraction method and reconfigurable voice feature extraction device under multi-noise scene
CN111028831A (en) * 2019-11-11 2020-04-17 云知声智能科技股份有限公司 Voice awakening method and device
CN111028831B (en) * 2019-11-11 2022-02-18 云知声智能科技股份有限公司 Voice awakening method and device
CN111124511A (en) * 2019-12-09 2020-05-08 浙江省北大信息技术高等研究院 Wake-up chip and wake-up system
CN115132231A (en) * 2022-08-31 2022-09-30 安徽讯飞寰语科技有限公司 Voice activity detection method, device, equipment and readable storage medium
CN115132231B (en) * 2022-08-31 2022-12-13 安徽讯飞寰语科技有限公司 Voice activity detection method, device, equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN106601229A (en) Voice awakening method based on soc chip
Peng et al. Efficient speech emotion recognition using multi-scale cnn and attention
Nakkiran et al. Compressing deep neural networks using a rank-constrained topology
CN105976812B (en) A kind of audio recognition method and its equipment
CN102800316B (en) Optimal codebook design method for voiceprint recognition system based on nerve network
CN107767861B (en) Voice awakening method and system and intelligent terminal
US20220215853A1 (en) Audio signal processing method, model training method, and related apparatus
CN108597496A (en) A kind of speech production method and device for fighting network based on production
CN103117059B (en) Voice signal characteristics extracting method based on tensor decomposition
US20170154640A1 (en) Method and electronic device for voice recognition based on dynamic voice model selection
CN109754790B (en) Speech recognition system and method based on hybrid acoustic model
CN110675859B (en) Multi-emotion recognition method, system, medium, and apparatus combining speech and text
CN110310647A (en) A kind of speech identity feature extractor, classifier training method and relevant device
CN110517664A (en) Multi-party speech recognition methods, device, equipment and readable storage medium storing program for executing
CN105139864A (en) Voice recognition method and voice recognition device
CN110246489B (en) Voice recognition method and system for children
CN111210807A (en) Speech recognition model training method, system, mobile terminal and storage medium
CN105895082A (en) Acoustic model training method and device as well as speech recognition method and device
CN112382301B (en) Noise-containing voice gender identification method and system based on lightweight neural network
CN112786004A (en) Speech synthesis method, electronic device, and storage device
CN106782502A (en) A kind of speech recognition equipment of children robot
CN106297769B (en) A kind of distinctive feature extracting method applied to languages identification
Sagi et al. A biologically motivated solution to the cocktail party problem
CN110580897A (en) audio verification method and device, storage medium and electronic equipment
CN113823323A (en) Audio processing method and device based on convolutional neural network and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20170426)