CN106601229A - Voice awakening method based on soc chip - Google Patents

Voice awakening method based on soc chip

Info

Publication number
CN106601229A
CN106601229A (application CN201611003861.0A)
Authority
CN
China
Prior art keywords
mfcc
model
frame
likelihood value
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611003861.0A
Other languages
Chinese (zh)
Inventor
陈晓鹏
殷瑞祥
徐向民
张伟彬
邢晓芬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201611003861.0A priority Critical patent/CN106601229A/en
Publication of CN106601229A publication Critical patent/CN106601229A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142 Hidden Markov Models [HMMs]
    • G10L 15/148 Duration modelling in HMMs, e.g. semi HMM, segmental models or transition probabilities
    • G10L 19/26 Pre-filtering or post-filtering
    • G10L 25/18 Speech or voice analysis techniques characterised by the extracted parameters being spectral information of each sub-band
    • G10L 25/24 Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
    • G10L 25/78 Detection of presence or absence of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a voice wake-up method based on an SoC chip, comprising the steps of: S1, acquiring voice data with the chip and sampling it, converting the analog signal into a digital signal; S2, performing MFCC feature extraction on the digitized voice data; S3, performing voice activity detection on the MFCC feature values, judging whether the new frame of MFCC data is a speech frame, and if not, returning to step S2 and discarding the data, or if so, passing the MFCC feature values to the next step; S4, recognizing the MFCC feature values with an HMM-based speech recognition algorithm, and waking up the controlled device if the recognition result is a valid command, otherwise returning to step S2. With this method, a real-time system implemented with the highly robust algorithm achieves a high recognition rate and meets the requirements of low power consumption and high performance.

Description

Voice wake-up method based on an SoC chip
Technical field
The present invention relates to the technical field of speech recognition, and more particularly to a voice wake-up method based on an SoC chip.
Background art
As technology develops, more and more electronic devices enter daily life. While enjoying the convenience these devices bring, people increasingly expect them to be intelligent enough to support touch-free interaction.
Voice wake-up means that the user speaks a preset voice command, so that a device in a sleep state directly enters a command-awaiting state. With this technology, anyone can activate the device simply by speaking the preset wake-up word, in any environment and at any time, thereby achieving low power consumption and touch-free interaction.
However, most existing voice wake-up technologies are implemented on computers or mobile phones and require powerful processors, which makes them unsuitable for industrial applications. Voice wake-up based on an MCU is inexpensive, but the limited processor performance prevents it from achieving satisfactory results.
Summary of the invention
The technical problem to be solved by the present invention is to provide a voice wake-up method based on an SoC chip, in which a real-time system implemented with a highly robust algorithm achieves a high recognition rate and meets the requirements of low power consumption and high performance.
To solve the above technical problem, the present invention provides the following technical scheme: a voice wake-up method based on an SoC chip, comprising the following steps:
S1: the chip acquires voice data and samples it, converting the analog signal into a digital signal;
S2: MFCC feature extraction is performed on the digitized voice data;
S3: voice activity detection is performed on the MFCC feature values to judge whether the new frame of MFCC data is a speech frame; if not, the method returns to step S2 and discards the data; if so, the MFCC feature values are passed to the next step;
S4: the MFCC feature values are recognized by an HMM-based speech recognition algorithm; if the recognition result is a valid command, the controlled device is woken up; otherwise the method returns to step S2.
Further, the MFCC feature extraction in step S2 specifically comprises:
1) preprocessing the digital signal, including pre-emphasis, framing, and windowing;
2) performing an FFT on each frame to obtain its spectrum, and from it the magnitude spectrum $|X_n(k)|$;
3) applying the Mel filter bank $W_l(k)$ to the magnitude spectrum $|X_n(k)|$ according to:

$$m(l)=\sum_{k=o(l)}^{h(l)} W_l(k)\,|X_n(k)|,\qquad l=1,2,\ldots,26$$

$$W_l(k)=\begin{cases}\frac{k-o(l)}{c(l)-o(l)}, & o(l)\le k\le c(l)\\ \frac{h(l)-k}{h(l)-c(l)}, & c(l)\le k\le h(l)\end{cases}$$

where k is the index of the k-th FFT point, and o(l), c(l), h(l) are respectively the lower, center, and upper frequencies of the l-th triangular filter;
4) taking the logarithm of each filter output and then applying a discrete cosine transform to obtain the MFCC feature values:

$$c(i)=\sqrt{\frac{2}{N}}\sum_{l=1}^{L}\log m(l)\,\cos\!\left[\left(l-\frac{1}{2}\right)\frac{i\pi}{L}\right]$$

where N and L are both 26, the number of filters, and i is the MFCC coefficient order; taking i up to 12 yields 12 cepstral features. In addition, the log energy of the frame is appended as the 13th feature parameter, defined as:

$$c(13)=10\lg\sum_{k=1}^{256}\big(X_n(k)\big)^2$$

where $X_n(k)$ is the magnitude; this gives 13 feature parameters in total (12 cepstral features plus 1 log energy);
5) the 13 standard MFCC cepstral parameters reflect only the static characteristics of the speech; its dynamic characteristics are described by difference spectra of these static features. The first-order difference dtm(i) and second-order difference dtmm(i) of the 13 MFCC features are computed as:

$$dtm(i)=\frac{-2c(i-2)-c(i-1)+c(i+1)+2c(i+2)}{3}$$

$$dtmm(i)=\frac{-2\,dtm(i-2)-dtm(i-1)+dtm(i+1)+2\,dtm(i+2)}{3}$$

The 13 standard MFCC features, their 13 first-order differences, and their 13 second-order differences form a 39-dimensional MFCC feature vector, which completes the MFCC feature extraction.
Further, the voice activity detection performed on the feature values in step S3 uses a GMM-based voice activity detection method. The method assumes that speech and background noise each follow a Gaussian mixture distribution in a suitable feature space, and a silence model and a non-silence model are built in that space. For each new frame of MFCC data, the likelihood value P1 under the silence model and the likelihood value P2 under the non-silence model are computed and compared: if P2 exceeds P1 the current MFCC frame is a speech frame, otherwise it is a silence frame.
Further, once the current MFCC frame has been judged a speech frame, when judging the next MFCC frame the likelihood values P1 and P2 are each multiplied by the corresponding transition probability and the two products are compared: if the product for P2 exceeds the product for P1, the current frame is a speech frame, otherwise a silence frame.
Likewise, once the current MFCC frame has been judged a silence frame, when judging the next MFCC frame the likelihood values P1 and P2 are each multiplied by the corresponding transition probability and the two products are compared: if the product for P2 exceeds the product for P1, the current frame is a speech frame, otherwise a silence frame.
The corresponding transition probabilities are model data set in advance.
Further, the likelihood values P1 (silence model) and P2 (non-silence model) are computed as follows:
The silence model and the non-silence model are each composed of 13 39-dimensional Gaussian models. The probability density function of an M-order Gaussian mixture model is the weighted sum of M Gaussian probability density functions, as in formula 3.1:

$$P(X\mid\lambda)=\sum_{i=1}^{M}\omega_i\,b_i(X)\qquad(3.1)$$

where M is the number of multidimensional Gaussian models, M = 13; X is a D-dimensional random vector, namely the 39-dimensional MFCC feature vector; $b_i(X)$ is a component distribution and $\omega_i$ is its mixture weight. Each component is a D-dimensional joint Gaussian distribution, as in formula 3.2:

$$b_i(X)=\frac{1}{(2\pi)^{D/2}|\Sigma_i|^{1/2}}\exp\!\left\{-\frac{1}{2}(X-\mu_i)^{T}\Sigma_i^{-1}(X-\mu_i)\right\}\qquad(3.2)$$

where $\mu_i$ is the mean of the i-th dimension, $\sigma_i^2$ its variance, $x_i$ the input MFCC feature value of the i-th dimension, and D the total dimensionality, D = 39.
Since formula 3.2 is too expensive to compute directly, it is simplified. With a diagonal covariance it becomes

$$b_i(X)=\frac{1}{(2\pi)^{D/2}\prod_{i=1}^{39}\sigma_i}\,e^{-\sum_{i=1}^{39}\frac{(x_i-\mu_i)^2}{2\sigma_i^2}}$$

Taking the logarithm of both sides gives:

$$\ln b_i(X)=-\frac{1}{2}\left\{2\ln\!\left[(2\pi)^{D/2}\Big|\prod_{i=1}^{39}\sigma_i^2\Big|^{1/2}\right]+\sum_{i=1}^{39}\frac{(x_i-\mu_i)^2}{\sigma_i^2}\right\}$$

The first term inside the braces consists entirely of parameters that are known once the model is trained, so it can be computed in advance and stored as the model parameter gconst:

$$gconst=2\ln\!\left[(2\pi)^{D/2}\Big|\prod_{i=1}^{39}\sigma_i^2\Big|^{1/2}\right]$$

Formula 3.2 is thus transformed into formula 3.3:

$$\ln b_i(X)=-\frac{1}{2}\left[gconst+\sum_{i=1}^{39}\frac{(x_i-\mu_i)^2}{\sigma_i^2}\right]\qquad(3.3)$$

and formula 3.1 reduces to formula 3.4:

$$P(X\mid\lambda)=\sum_{i=1}^{M}\omega_i\,e^{\ln b_i(X)}\qquad(3.4)$$

Substituting the MFCC frame and the model parameters into the above yields the likelihood values of the frame under the silence model and under the non-silence model.
Further, substituting the MFCC frame and the model parameters proceeds in the following concrete steps:
1) the MFCC feature values of each speech frame are matched against the silence model and the non-silence model: first $(x_i-\mu_i)^2/\sigma_i^2$ is computed and accumulated over the 39 dimensions, giving the exponential parts fa0 and fa1 of the multidimensional Gaussian distributions of the two models:

$$fa0=\sum_{i=1}^{39}\frac{(x_i-\mu_i)^2}{\sigma_i^2}$$

where the mean $\mu_i$ and variance $\sigma_i^2$ are read directly from the model data;
2) from this result the log-likelihood of the multidimensional Gaussian distribution is obtained:

$$\ln b_i(X)=-\frac{1}{2}\,(gconst+fa0)$$

where gconst is pre-trained data read directly from the model, which completes the computation of the multidimensional Gaussian log-likelihood $\ln b_i(X)$ of formula 3.3;
3) as noted above, the silence model and the non-silence model each contain 13 multidimensional Gaussian distributions, so repeating steps 1) and 2) thirteen times yields the 13 log-likelihoods $\ln b_i(X)$; substituting these together with the corresponding weights $\omega_i$ into formula 3.4 gives the likelihood value $P_1$ of the current frame under the silence model and the likelihood value $P_2$ under the non-silence model.
Further, the HMM-based speech recognition algorithm of step S4 specifically comprises:
S41: loading the HMM models and constructing the recognition network of HMM chains;
S42: matching the MFCC feature values against the recognition network of the HMM models and computing the initial likelihood values;
S43: based on the initial likelihood values, using the Token Passing algorithm to find the optimal path through the HMM chain network, completing the decoding;
S44: judging whether the voice command matches an HMM chain; if so, it is valid speech, otherwise it is invalid speech.
Compared with the prior art, the present invention has at least the following beneficial effects:
(1) by moving part of the original algorithm's computation into the log domain, a large number of multiplications are converted into additions, which reduces latency on the microprocessor; the algorithm's most complex computations are further accelerated by dedicated hardware, lowering latency and finally achieving real-time recognition;
(2) the real-time system implemented with the highly robust algorithm achieves a high recognition rate;
(3) the algorithm is easy to upgrade: it is divided into three independent modules (feature extraction, voice activity detection, and speech recognition), so the system can later be optimized by replacing individual submodules with better-performing algorithms.
Description of the drawings
Fig. 1 is the overall flow chart of the voice wake-up method based on an SoC chip of the present invention;
Fig. 2 is a schematic diagram of a triangular filter in the method;
Fig. 3 is a schematic diagram of the triangular filter bank;
Fig. 4 is the voice activity detection flow chart;
Fig. 5 is a schematic diagram of the parameter layout of a 39-dimensional Gaussian model;
Fig. 6 is the step-by-step voice activity detection flow chart;
Fig. 7 is a schematic diagram of the pre-trained model data used in the voice activity detection;
Fig. 8 is the overall step flow chart of the speech recognition algorithm;
Fig. 9 is a schematic diagram of an example HMM chain in the speech recognition algorithm.
Detailed description of the embodiments
It should be noted that, provided there is no conflict, the embodiments in this application and the features in those embodiments may be combined with one another. The application is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is the overall algorithm flow chart of the present invention; the processing flow of each module is as follows:
1. Speech front-end processing:
Speech front-end processing converts the analog speech signal into a digital signal by sampling; in this scheme the sample rate is 16 kHz. The digitized speech is in PCM (Pulse Code Modulation) format, i.e. the speech data obtained by sampling and quantizing the analog speech signal, which is the most basic and primitive speech format. In the present invention the ADC is integrated in the SoC chip; speech detection processing is performed every 10 ms, the sampling frequency is 16 k samples per second, and the data width is 16 bits.
2. MFCC feature extraction:
1) signal preprocessing, including pre-emphasis (Preemphasis), framing (Frame Blocking), and windowing (Windowing). The sampling frequency of the speech signal is fs = 16 kHz; since a speech signal can be regarded as stationary within 10-30 ms, each frame is set to 10 ms, so the frame length is 160 points and the frame shift is half a frame length, i.e. 80 points;
2) a 256-point FFT is performed on each frame to obtain its spectrum, and from it the magnitude spectrum $|X_n(k)|$;
3) the Mel filter bank $W_l(k)$ is applied to the magnitude spectrum $|X_n(k)|$ according to:

$$m(l)=\sum_{k=o(l)}^{h(l)} W_l(k)\,|X_n(k)|,\qquad l=1,2,\ldots,26$$

$$W_l(k)=\begin{cases}\frac{k-o(l)}{c(l)-o(l)}, & o(l)\le k\le c(l)\\ \frac{h(l)-k}{h(l)-c(l)}, & c(l)\le k\le h(l)\end{cases}$$

where k is the index of the k-th FFT point, and o(l), c(l), h(l) are the lower, center, and upper frequencies of the l-th triangular filter, as shown in Fig. 2.
In the present invention the Mel filter bank consists of 26 triangular filters whose parameters are calculated in advance. The triangular filter bank is shown in Fig. 3: the abscissa corresponds to the FFT points and the ordinate is $W_l(k)$; since the FFT is symmetric, only the first half of the points are used to compute the spectrum, which is then fed into the triangular filters;
4) the logarithm (Logarithm) of each filter output is taken, and a discrete cosine transform then yields the MFCC values:

$$c(i)=\sqrt{\frac{2}{N}}\sum_{l=1}^{L}\log m(l)\,\cos\!\left[\left(l-\frac{1}{2}\right)\frac{i\pi}{L}\right]$$

where N and L are both 26, the number of filters, and i is the MFCC coefficient order; the present invention takes i up to 12, giving 12 cepstral features. In addition, the log energy of the frame is appended as the 13th feature parameter, defined as:

$$c(13)=10\lg\sum_{k=1}^{256}\big(X_n(k)\big)^2$$

This gives 13 feature parameters (12 cepstral features plus 1 log energy);
5) these 13 standard MFCC cepstral parameters reflect only the static characteristics of the speech; its dynamic characteristics can be described by difference spectra of the static features. The first-order difference dtm(i) and second-order difference dtmm(i) of the 13 MFCC features are computed as:

$$dtm(i)=\frac{-2c(i-2)-c(i-1)+c(i+1)+2c(i+2)}{3}$$

$$dtmm(i)=\frac{-2\,dtm(i-2)-dtm(i-1)+dtm(i+1)+2\,dtm(i+2)}{3}$$

The 13 standard MFCC features, their 13 first-order differences, and their 13 second-order differences form a 39-dimensional MFCC feature vector, which completes the MFCC feature extraction. For concreteness, a minimal C sketch of this pipeline is given below.
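The following sketch covers steps 1)-5) under the parameters stated above (16 kHz sampling, 160-point frames, 256-point FFT, 26 filters, 12 cepstra). The FFT itself and the precomputed filter boundary bins o[], c[], h[] are assumed to be supplied elsewhere, the pre-emphasis coefficient 0.97 is a typical value that the patent does not state, and all function names are illustrative rather than taken from the patent:

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define FRAME_LEN 160   /* 10 ms at 16 kHz            */
#define NFFT      256   /* FFT size per frame         */
#define NFILT     26    /* triangular Mel filters     */
#define NCEP      12    /* cepstral coefficients kept */

/* Step 1): pre-emphasis y[n] = x[n] - 0.97*x[n-1] (applied within the
   frame for simplicity) followed by a Hamming window. */
static void preprocess(const short *x, float *frame)
{
    for (int i = 0; i < FRAME_LEN; i++) {
        float s = (i > 0) ? x[i] - 0.97f * x[i - 1] : (float)x[0];
        frame[i] = s * (0.54f - 0.46f *
                        cosf(2.0f * (float)M_PI * i / (FRAME_LEN - 1)));
    }
}

/* Step 3): triangular Mel filter bank m(l) applied to the magnitude
   spectrum mag[] (NFFT bins; the filters only use the lower half) per
   the W_l(k) formula, with precomputed bins o[], c[], h[]. */
static void filter_bank(const float *mag, const int *o, const int *c,
                        const int *h, float *m)
{
    for (int l = 0; l < NFILT; l++) {
        m[l] = 0.0f;
        for (int k = o[l]; k <= h[l]; k++) {
            float w = (k <= c[l])
                ? (float)(k - o[l]) / (float)(c[l] - o[l])   /* rising  */
                : (float)(h[l] - k) / (float)(h[l] - c[l]);  /* falling */
            m[l] += w * mag[k];
        }
    }
}

/* Step 4): log of the filter outputs followed by a DCT gives 12
   cepstra; the frame log-energy over the spectrum points is appended
   as the 13th parameter. */
static void cepstra(const float *m, const float *mag, float *c13)
{
    for (int i = 1; i <= NCEP; i++) {
        float acc = 0.0f;
        for (int l = 1; l <= NFILT; l++)
            acc += logf(m[l - 1]) *
                   cosf((l - 0.5f) * i * (float)M_PI / NFILT);
        c13[i - 1] = sqrtf(2.0f / NFILT) * acc;
    }
    float e = 0.0f;
    for (int k = 0; k < NFFT; k++)
        e += mag[k] * mag[k];
    c13[NCEP] = 10.0f * log10f(e);
}

/* Step 5): first-order difference over a +/-2 frame window; running
   the same routine over the deltas yields the second-order difference,
   completing the 39-dimensional feature vector. */
static void delta(const float cep[][NCEP + 1], int t, float *d)
{
    for (int i = 0; i <= NCEP; i++)
        d[i] = (-2.0f * cep[t - 2][i] - cep[t - 1][i]
                + cep[t + 1][i] + 2.0f * cep[t + 2][i]) / 3.0f;
}
```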
3. Voice activity detection (VAD):
The present invention uses a GMM-based voice activity detection method. The method assumes that speech and background noise each follow a Gaussian mixture distribution in a suitable feature space; their GMM models are built in that space, and model matching is then used to detect the valid speech segments in the measured signal. The algorithm flow is shown in Fig. 4.
The models are trained in advance with the HTK toolkit. One 39-dimensional Gaussian model consists of 1 weight (MIXTURE), 39 means (MEAN), 39 variances (VARIANCE), and 1 gconst, as shown in Fig. 5.
The silence model and the non-silence model each consist of 13 multidimensional Gaussian models of the kind shown in Fig. 5. When a new frame of speech data enters the system, its 39-dimensional MFCC feature vector is scored against the silence and non-silence models; the two likelihood values are compared, and the model with the larger likelihood is the matching model of the current frame, which decides whether the current frame is a speech frame. The detailed VAD process is shown in Fig. 6.
The transition probabilities a11, a12, a21, a22 are pre-trained model data, as shown in Fig. 7: a11 is the probability that the previous frame is a silence frame and the current frame is also a silence frame; a12 that the previous frame is a silence frame but the current frame is a speech frame; a21 that the previous frame is a speech frame but the current frame is a silence frame; and a22 that the previous frame is a speech frame and the current frame is also a speech frame. The structures below sketch this model layout in C.
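Assuming the layout of Fig. 5 and Fig. 7, the model data could be represented as the following C structures; the type and field names are illustrative, not taken from the patent:

```c
#define DIM   39   /* feature dimensions per Gaussian      */
#define N_MIX 13   /* Gaussians per model (silence/speech) */

typedef struct {
    float weight;      /* MIXTURE: mixture weight (omega_i)  */
    float mean[DIM];   /* MEAN: 39 means (mu_i)              */
    float var[DIM];    /* VARIANCE: 39 variances (sigma_i^2) */
    float gconst;      /* precomputed normalization term     */
} Gaussian;

typedef struct {
    Gaussian mix[N_MIX];          /* one GMM of 13 Gaussians */
} GmmModel;

typedef struct {
    GmmModel silence;             /* silence model           */
    GmmModel speech;              /* non-silence model       */
    float a11, a12, a21, a22;     /* transition probabilities */
} VadModel;
```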
The most complex part of the whole process is the likelihood computation, which is introduced below.
The probability density function of the 13-order multidimensional Gaussian mixture model is the weighted sum of 13 multidimensional Gaussian probability density functions, as in formula 3.1:

$$P(X\mid\lambda)=\sum_{i=1}^{M}\omega_i\,b_i(X)\qquad(3.1)$$

where M is the number of multidimensional Gaussian models (13 in the present invention), X is a D-dimensional random vector (the 39-dimensional MFCC feature vector mentioned above), $b_i(X)$ is a component distribution, and $\omega_i$ is its mixture weight. Each component is a D-dimensional joint Gaussian distribution:

$$b_i(X)=\frac{1}{(2\pi)^{D/2}|\Sigma_i|^{1/2}}\exp\!\left\{-\frac{1}{2}(X-\mu_i)^{T}\Sigma_i^{-1}(X-\mu_i)\right\}\qquad(3.2)$$

In one dimension, $\mu$ is the mean and $\sigma^2$ the variance; in D dimensions, D is the dimensionality of X and $\Sigma$ is the D×D covariance matrix, defined as $\Sigma=E[(x-\mu)(x-\mu)^T]$, with $|\Sigma|$ its determinant.
The concrete computation steps of the VAD algorithm are therefore:
1) the 39-dimensional MFCC feature values of each frame are matched against the silence and non-silence models: first $(x_i-\mu_i)^2/\sigma_i^2$ is computed and the 39 results are accumulated, giving the exponential parts fa0 and fa1 of the multidimensional Gaussian distributions of the two models (this computation is completed by a hardware-accelerated IP block):

$$fa0=\sum_{i=1}^{39}\frac{(x_i-\mu_i)^2}{\sigma_i^2}$$

with fa1 computed in the same way for the other model, where the means $\mu_i$ and variances $\sigma_i^2$ are read directly from the model data;
2) from the previous result, the log-likelihood of the multidimensional Gaussian distribution is computed:

$$\ln b_i(X)=-\frac{1}{2}\,(gconst+fa0)$$

where gconst is pre-trained data read directly from the model; this completes the multidimensional Gaussian log-likelihood of formula 3.3;
3) as noted above, the silence and non-silence models each contain 13 multidimensional Gaussian distributions, so steps 1) and 2) are repeated 13 times to obtain the 13 log-likelihoods $\ln b_i(X)$; substituting these together with the corresponding weights $\omega_i$ into the reduced form of formula 3.1 (formula 3.4)

$$P(X\mid\lambda)=\sum_{i=1}^{M}\omega_i\,e^{\ln b_i(X)}\qquad(3.4)$$

gives the likelihood value $P_1$ of the current frame under the silence model and $P_2$ under the non-silence model;
4) finally the transition probabilities are applied:
if the previous frame is a speech frame, the probability that the current frame is a speech frame is $a_{22}P_2$, and the probability that it is a silence frame is $a_{21}P_1$;
if the previous frame is a silence frame, the probability that the current frame is a speech frame is $a_{12}P_2$, and the probability that it is a silence frame is $a_{11}P_1$.
The speech-frame probability is compared with the silence-frame probability; if the speech-frame probability is larger, the current frame is taken as a speech frame, otherwise as a silence frame. This completes the VAD algorithm; a C sketch of this per-frame decision follows below.
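A minimal sketch of the per-frame decision in steps 1)-4), reusing the VadModel/Gaussian structures sketched earlier; the fa accumulation that the patent offloads to a hardware-accelerated IP block is done in plain software here, and all names are illustrative:

```c
#include <math.h>

/* Likelihood of one frame under one GMM, following formulas 3.3/3.4:
   the exponent part fa is accumulated per Gaussian, combined with
   gconst in the log domain, then weighted and summed over the 13
   mixtures. */
static float gmm_likelihood(const GmmModel *gmm, const float *x)
{
    float p = 0.0f;
    for (int i = 0; i < N_MIX; i++) {
        const Gaussian *g = &gmm->mix[i];
        float fa = 0.0f;                        /* fa0 or fa1 */
        for (int d = 0; d < DIM; d++) {
            float diff = x[d] - g->mean[d];
            fa += diff * diff / g->var[d];
        }
        p += g->weight * expf(-0.5f * (g->gconst + fa));
    }
    return p;
}

/* Step 4): weight P1/P2 by the transition probabilities selected from
   the previous frame's decision, then compare. Returns 1 for speech. */
static int vad_decide(const VadModel *m, const float *x, int prev_speech)
{
    float p1 = gmm_likelihood(&m->silence, x);
    float p2 = gmm_likelihood(&m->speech, x);
    float p_speech  = (prev_speech ? m->a22 : m->a12) * p2;
    float p_silence = (prev_speech ? m->a21 : m->a11) * p1;
    return p_speech > p_silence;
}
```

The log-domain form with gconst avoids the per-dimension products of formula 3.2, which is the conversion of multiplications into additions mentioned in the beneficial effects.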
4. Speech recognition algorithm:
The flow of this module is shown in Fig. 8. Loading the models and building the HMM chains are done once, when the program initializes, and need not be repeated afterwards; the module is entered for computation only when the upstream VAD module detects valid speech. Each state of the HMM models used by this module is composed of 24 GMM components. The flow is as follows:
(1) load the HMM models and construct the recognition network of HMM chains;
(2) match the MFCC feature values against the recognition network of the HMM models and compute the initial likelihood values;
(3) based on the initial likelihood values, use the Token Passing algorithm to find the optimal path through the HMM chain network, completing the decoding;
(4) judge whether the voice command matches an HMM chain; if so, it is valid speech, otherwise it is invalid speech.
The whole flow is described below, taking "shutdown" (Chinese 关机, pinyin "guan ji") as an example; the corresponding HMM chain is given below (an actual HMM chain is longer, with each syllable made up of several states; it is simplified here for ease of explanation). "Shutdown" is split into the units "g", "uan", "j", "i"; these 4 units are modeled as 4 HMM states, which are connected to form the HMM chain shown in Fig. 9.
A. The token value at the starting point of the network (state "g") is initialized: Pg = 0.
B. When the first frame of MFCC data arrives, token passing starts. In the first frame only the token value Pg exists, and it is passed to states "g" and "uan", concretely:
Pg = Pg + a11 + log(GMMg)
Puan = Pg + a12 + log(GMMuan)
where log(GMMg) is the likelihood of the MFCC data for state "g" and log(GMMuan) the likelihood for state "uan"; the likelihoods are computed in the same way as in the VAD, see formulas 3.3 and 3.4.
C. When the second frame of data arrives, both state "g" and state "uan" hold token values, so each token is passed to the states connected to its state.
The token value of state "g" is updated:
Pg = Pg + a11 + log(GMMg)
The token value of state "uan" is updated:
Pg→uan = Pg + a12
Puan→uan = Puan + a22
After the update: Puan = max(Pg→uan, Puan→uan) + log(GMMuan)
Since state "uan" is connected to state "g" on its left and also to itself, two candidate token values are obtained; they are compared and the larger one is kept.
The token of state "j" is updated:
Pj = Puan + a23 + log(GMMj)
D. When the third frame arrives, the token value of state "g" is updated:
Pg = Pg + a11 + log(GMMg)
The token value of state "uan" is updated:
Pg→uan = Pg + a12
Puan→uan = Puan + a22
After the update: Puan = max(Pg→uan, Puan→uan) + log(GMMuan)
The token of state "j" is updated:
Puan→j = Puan + a23
Pj→j = Pj + a33
After the update: Pj = max(Puan→j, Pj→j) + log(GMMj)
The token of state "i" is updated:
Pi = Pj + a34 + log(GMMi)
E. When the fourth frame arrives, the token value of state "g" is updated:
Pg = Pg + a11 + log(GMMg)
The token value of state "uan" is updated:
Pg→uan = Pg + a12
Puan→uan = Puan + a22
After the update: Puan = max(Pg→uan, Puan→uan) + log(GMMuan)
The token of state "j" is updated:
Puan→j = Puan + a23
Pj→j = Pj + a33
After the update: Pj = max(Puan→j, Pj→j) + log(GMMj)
The token of state "i" is updated:
Pj→i = Pj + a34
Pi→i = Pi + a44
After the update: Pi = max(Pj→i, Pi→i) + log(GMMi)
At this point all frames of the voice command have been input and the token comparison starts: the token values of the four states are sorted by size; if the token value of the last state of the HMM chain (state "i") is the largest, the input voice command matches this "shutdown" HMM chain and the decoding result is "shutdown"; otherwise the input is regarded as invalid speech.
The whole decoding process shows that, as the number of frames grows, tokens diffuse steadily from the left end to the right end; during this process each state holds one token, and tokens are passed to adjacent states and updated. When the specified number of frames has arrived (the frame count is determined by the length of the preset voice command: a short command such as "shutdown" needs fewer frames, while a longer one such as "open sesame" needs more), the tokens of all states are sorted; if the token value of the end state of the HMM chain is the largest, the currently input speech matches this HMM chain. In practical applications the number of recognizable commands can be increased, giving multiple HMM chains; at the last frame, all states of all HMM chains are sorted together to determine which command was spoken. A C sketch of this decoding loop follows below.
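A minimal sketch of the token-passing loop for a single left-to-right chain, under the same assumptions as the earlier sketches; state_loglike() stands in for the per-state log(GMM) term, the transition constants are taken in the log domain, and all names are illustrative:

```c
#include <float.h>

#define N_STATES 4   /* e.g. the "g", "uan", "j", "i" chain above */

/* log-likelihood log(GMM_s) of one MFCC frame under state s; assumed
   to be provided elsewhere (same math as the VAD scoring). */
extern float state_loglike(int s, const float *mfcc_frame);

/* a_self[s]: log self-loop probability of state s (a11, a22, ...);
   a_next[s]: log probability of moving from state s to s+1.
   Returns 1 if the last state ends up holding the largest token,
   i.e. the input matches this chain. */
static int token_passing(const float (*frames)[39], int n_frames,
                         const float *a_self, const float *a_next)
{
    float tok[N_STATES];
    tok[0] = 0.0f;                      /* step A: start token */
    for (int s = 1; s < N_STATES; s++)
        tok[s] = -FLT_MAX;              /* not yet reachable   */

    for (int t = 0; t < n_frames; t++) {
        /* sweep right to left so the entry value uses the previous
           frame's token of the left neighbour (steps B-E) */
        for (int s = N_STATES - 1; s >= 0; s--) {
            float stay  = (tok[s] > -FLT_MAX)
                          ? tok[s] + a_self[s] : -FLT_MAX;
            float enter = (s > 0 && tok[s - 1] > -FLT_MAX)
                          ? tok[s - 1] + a_next[s - 1] : -FLT_MAX;
            float best  = (stay > enter) ? stay : enter;
            if (best > -FLT_MAX)        /* keep the larger token */
                tok[s] = best + state_loglike(s, frames[t]);
        }
    }

    int best = 0;                       /* final token comparison */
    for (int s = 1; s < N_STATES; s++)
        if (tok[s] > tok[best]) best = s;
    return best == N_STATES - 1;
}
```

For multiple commands, one such chain would be kept per command and all final tokens sorted together at the last frame, as described above.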
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various equivalent changes, modifications, substitutions, and variations may be made to these embodiments without departing from the principles and spirit of the present invention; the scope of the present invention is defined by the appended claims and their equivalents.

Claims (7)

1. A voice wake-up method based on an SoC chip, characterized by comprising the following steps:
S1: the chip acquires voice data and samples it, converting the analog signal into a digital signal;
S2: MFCC feature extraction is performed on the digitized voice data;
S3: voice activity detection is performed on the MFCC feature values to judge whether the new frame of MFCC data is a speech frame; if not, the method returns to step S2 and discards the data; if so, the MFCC feature values are passed to the next step;
S4: the MFCC feature values are recognized by an HMM-based speech recognition algorithm; if the recognition result is a valid command, the controlled device is woken up; otherwise the method returns to step S2.
2. The voice wake-up method based on an SoC chip according to claim 1, characterized in that the MFCC feature extraction in step S2 specifically comprises:
1) preprocessing the digital signal, including pre-emphasis, framing, and windowing;
2) performing an FFT on each frame to obtain its spectrum, and from it the magnitude spectrum $|X_n(k)|$;
3) applying the Mel filter bank $W_l(k)$ to the magnitude spectrum $|X_n(k)|$ according to:

$$m(l)=\sum_{k=o(l)}^{h(l)} W_l(k)\,|X_n(k)|,\qquad l=1,2,\ldots,26$$

$$W_l(k)=\begin{cases}\frac{k-o(l)}{c(l)-o(l)}, & o(l)\le k\le c(l)\\ \frac{h(l)-k}{h(l)-c(l)}, & c(l)\le k\le h(l)\end{cases}$$

where k is the index of the k-th FFT point, and o(l), c(l), h(l) are respectively the lower, center, and upper frequencies of the l-th triangular filter;
4) taking the logarithm of each filter output and then applying a discrete cosine transform to obtain the MFCC feature values:

$$c(i)=\sqrt{\frac{2}{N}}\sum_{l=1}^{L}\log m(l)\,\cos\!\left[\left(l-\frac{1}{2}\right)\frac{i\pi}{L}\right]$$

where N and L are both 26, the number of filters, and i is the MFCC coefficient order; taking i up to 12 yields 12 cepstral features; in addition, the log energy of the frame is appended as the 13th feature parameter, defined as:

$$c(13)=10\lg\sum_{k=1}^{256}\big(X_n(k)\big)^2$$

where $X_n(k)$ is the magnitude, giving 13 feature parameters (12 cepstral features plus 1 log energy);
5) the 13 standard MFCC cepstral parameters reflect only the static characteristics of the speech; its dynamic characteristics are described by difference spectra of these static features; the first-order difference dtm(i) and second-order difference dtmm(i) of the 13 MFCC features are computed as:

$$dtm(i)=\frac{-2c(i-2)-c(i-1)+c(i+1)+2c(i+2)}{3}$$

$$dtmm(i)=\frac{-2\,dtm(i-2)-dtm(i-1)+dtm(i+1)+2\,dtm(i+2)}{3}$$

The 13 standard MFCC features, their 13 first-order differences, and their 13 second-order differences form a 39-dimensional MFCC feature vector, which completes the MFCC feature extraction.
3. The voice wake-up method based on an SoC chip according to claim 1, characterized in that the voice activity detection on the feature values in step S3 uses a GMM-based voice activity detection method, which assumes that speech and background noise each follow a Gaussian mixture distribution in a suitable feature space, and builds a silence model and a non-silence model in that feature space; for each new frame of MFCC data, the likelihood value P1 under the silence model and the likelihood value P2 under the non-silence model are computed and compared: if P2 exceeds P1, the current MFCC frame is a speech frame, otherwise it is a silence frame.
4. The voice wake-up method based on an SoC chip according to claim 3, characterized in that, once the current MFCC frame has been judged a speech frame, when judging the next MFCC frame the likelihood values P1 and P2 are each multiplied by the corresponding transition probability and the two products are compared: if the product for P2 exceeds the product for P1, the current frame is a speech frame, otherwise a silence frame;
once the current MFCC frame has been judged a silence frame, when judging the next MFCC frame the likelihood values P1 and P2 are likewise each multiplied by the corresponding transition probability and the two products are compared: if the product for P2 exceeds the product for P1, the current frame is a speech frame, otherwise a silence frame;
the corresponding transition probabilities are model data set in advance.
5. The voice wake-up method based on an SoC chip according to claim 3, characterized in that the likelihood value P1 of the silence model and the likelihood value P2 of the non-silence model are computed as follows:
the silence model and the non-silence model are each composed of 13 39-dimensional Gaussian models; the probability density function of an M-order Gaussian mixture model is the weighted sum of M Gaussian probability density functions, as in formula 3.1:

$$P(X\mid\lambda)=\sum_{i=1}^{M}\omega_i\,b_i(X)\qquad(3.1)$$

where M is the number of multidimensional Gaussian models, M = 13; X is a D-dimensional random vector, namely the 39-dimensional MFCC feature vector; $b_i(X)$ is a component distribution and $\omega_i$ its mixture weight; each component is a D-dimensional joint Gaussian distribution, as in formula 3.2:

$$b_i(X)=\frac{1}{(2\pi)^{D/2}|\Sigma_i|^{1/2}}\exp\!\left\{-\frac{1}{2}(X-\mu_i)^{T}\Sigma_i^{-1}(X-\mu_i)\right\}\qquad(3.2)$$

where $\mu_i$ is the mean of the i-th dimension, $\sigma_i^2$ its variance, $x_i$ the input MFCC feature value of the i-th dimension, and D the total dimensionality, D = 39;
since formula 3.2 is too expensive to compute directly, it is simplified; with a diagonal covariance,

$$b_i(X)=\frac{1}{(2\pi)^{D/2}\prod_{i=1}^{39}\sigma_i}\,e^{-\sum_{i=1}^{39}\frac{(x_i-\mu_i)^2}{2\sigma_i^2}}$$

taking the logarithm of both sides gives

$$\ln b_i(X)=-\frac{1}{2}\left\{2\ln\!\left[(2\pi)^{D/2}\Big|\prod_{i=1}^{39}\sigma_i^2\Big|^{1/2}\right]+\sum_{i=1}^{39}\frac{(x_i-\mu_i)^2}{\sigma_i^2}\right\}$$

the first term in the braces consists entirely of parameters known once the model is trained, so it is computed in advance and stored as the model parameter gconst:

$$gconst=2\ln\!\left[(2\pi)^{D/2}\Big|\prod_{i=1}^{39}\sigma_i^2\Big|^{1/2}\right]$$

so that formula 3.2 is transformed into formula 3.3:

$$\ln b_i(X)=-\frac{1}{2}\left[gconst+\sum_{i=1}^{39}\frac{(x_i-\mu_i)^2}{\sigma_i^2}\right]\qquad(3.3)$$

and formula 3.1 reduces to formula 3.4:

$$P(X\mid\lambda)=\sum_{i=1}^{M}\omega_i\,e^{\ln b_i(X)}\qquad(3.4)$$

substituting the MFCC frame and the model parameters into the above yields the likelihood value of the frame under the silence model and under the non-silence model.
6. The voice wake-up method based on an SoC chip according to claim 5, characterized in that substituting the MFCC frame and the model parameters into the above to obtain the likelihood values of the frame under the silence model and the non-silence model comprises the concrete steps of:
1) matching the MFCC feature values of each speech frame against the silence model and the non-silence model: first computing $(x_i-\mu_i)^2/\sigma_i^2$ and accumulating the results, giving the exponential parts fa0 and fa1 of the multidimensional Gaussian distributions of the two models:

$$fa0=\sum_{i=1}^{39}\frac{(x_i-\mu_i)^2}{\sigma_i^2}$$

where the mean $\mu_i$ and variance $\sigma_i^2$ are read directly from the model data;
2) computing from the previous result the log-likelihood of the multidimensional Gaussian distribution:

$$\ln b_i(X)=-\frac{1}{2}\,(gconst+fa0)$$

where gconst is pre-trained data read directly from the model, completing the multidimensional Gaussian log-likelihood $\ln b_i(X)$ of formula 3.3;
3) since, as above, the silence model and the non-silence model each contain 13 multidimensional Gaussian distributions, repeating steps 1) and 2) 13 times yields the 13 log-likelihoods $\ln b_i(X)$; substituting these together with the corresponding weights $\omega_i$ into formula 3.4 gives the likelihood value $P_1$ of the current frame under the silence model and the likelihood value $P_2$ under the non-silence model.
7. The voice wake-up method based on an SoC chip according to claim 1, characterized in that the HMM-based speech recognition algorithm of step S4 specifically comprises:
S41: loading the HMM models and constructing the recognition network of HMM chains;
S42: matching the MFCC feature values against the recognition network of the HMM models and computing the initial likelihood values;
S43: based on the initial likelihood values, using the Token Passing algorithm to find the optimal path through the HMM chain network, completing the decoding;
S44: judging whether the voice command matches an HMM chain; if so, it is valid speech, otherwise it is invalid speech.
CN201611003861.0A 2016-11-15 2016-11-15 Voice awakening method based on soc chip Pending CN106601229A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611003861.0A CN106601229A (en) 2016-11-15 2016-11-15 Voice awakening method based on soc chip

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611003861.0A CN106601229A (en) 2016-11-15 2016-11-15 Voice awakening method based on soc chip

Publications (1)

Publication Number Publication Date
CN106601229A true CN106601229A (en) 2017-04-26

Family

ID=58590197

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611003861.0A Pending CN106601229A (en) 2016-11-15 2016-11-15 Voice awakening method based on soc chip

Country Status (1)

Country Link
CN (1) CN106601229A (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1455387A (en) * 2002-11-15 2003-11-12 中国科学院声学研究所 Rapid decoding method for voice identifying system
CN101051462A (en) * 2006-04-07 2007-10-10 株式会社东芝 Feature-vector compensating apparatus and feature-vector compensating method
CN203253172U (en) * 2013-03-18 2013-10-30 北京承芯卓越科技有限公司 Intelligent voice communication toy
CN105096939A (en) * 2015-07-08 2015-11-25 百度在线网络技术(北京)有限公司 Voice wake-up method and device
CN105206271A (en) * 2015-08-25 2015-12-30 北京宇音天下科技有限公司 Intelligent equipment voice wake-up method and system for realizing method
CN105869628A (en) * 2016-03-30 2016-08-17 乐视控股(北京)有限公司 Voice endpoint detection method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jiang Nan (姜楠), "Research and Implementation of Voice Activity Detection Algorithms in a Mobile-Phone Speech Recognition System", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107886957A (en) * 2017-11-17 2018-04-06 广州势必可赢网络科技有限公司 The voice awakening method and device of a kind of combination Application on Voiceprint Recognition
CN111868825A (en) * 2018-03-12 2020-10-30 赛普拉斯半导体公司 Dual pipeline architecture for wake phrase detection with voice onset detection
CN108615535A (en) * 2018-05-07 2018-10-02 腾讯科技(深圳)有限公司 Sound enhancement method, device, intelligent sound equipment and computer equipment
CN108986822A (en) * 2018-08-31 2018-12-11 出门问问信息科技有限公司 Audio recognition method, device, electronic equipment and non-transient computer storage medium
CN109088611A (en) * 2018-09-28 2018-12-25 咪付(广西)网络技术有限公司 A kind of auto gain control method and device of acoustic communication system
CN112102848B (en) * 2019-06-17 2024-04-26 华为技术有限公司 Method, chip and terminal for identifying music
CN112102848A (en) * 2019-06-17 2020-12-18 华为技术有限公司 Method, chip and terminal for identifying music
CN110580919B (en) * 2019-08-19 2021-09-28 东南大学 Voice feature extraction method and reconfigurable voice feature extraction device under multi-noise scene
CN110580919A (en) * 2019-08-19 2019-12-17 东南大学 voice feature extraction method and reconfigurable voice feature extraction device under multi-noise scene
CN111028831A (en) * 2019-11-11 2020-04-17 云知声智能科技股份有限公司 Voice awakening method and device
CN111028831B (en) * 2019-11-11 2022-02-18 云知声智能科技股份有限公司 Voice awakening method and device
CN111124511A (en) * 2019-12-09 2020-05-08 浙江省北大信息技术高等研究院 Wake-up chip and wake-up system
CN115132231A (en) * 2022-08-31 2022-09-30 安徽讯飞寰语科技有限公司 Voice activity detection method, device, equipment and readable storage medium
CN115132231B (en) * 2022-08-31 2022-12-13 安徽讯飞寰语科技有限公司 Voice activity detection method, device, equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN106601229A (en) Voice awakening method based on soc chip
Peng et al. Efficient speech emotion recognition using multi-scale cnn and attention
Nakkiran et al. Compressing deep neural networks using a rank-constrained topology
CN105976812B (en) A kind of audio recognition method and its equipment
CN102800316B (en) Optimal codebook design method for voiceprint recognition system based on nerve network
CN107767861B (en) Voice awakening method and system and intelligent terminal
US20220215853A1 (en) Audio signal processing method, model training method, and related apparatus
CN108597496A (en) A kind of speech production method and device for fighting network based on production
CN103117059B (en) Voice signal characteristics extracting method based on tensor decomposition
US20170154640A1 (en) Method and electronic device for voice recognition based on dynamic voice model selection
CN109754790B (en) Speech recognition system and method based on hybrid acoustic model
CN110675859B (en) Multi-emotion recognition method, system, medium, and apparatus combining speech and text
CN110310647A (en) A kind of speech identity feature extractor, classifier training method and relevant device
CN110517664A (en) Multi-party speech recognition methods, device, equipment and readable storage medium storing program for executing
CN105139864A (en) Voice recognition method and voice recognition device
CN110246489B (en) Voice recognition method and system for children
CN111210807A (en) Speech recognition model training method, system, mobile terminal and storage medium
CN105895082A (en) Acoustic model training method and device as well as speech recognition method and device
CN112382301B (en) Noise-containing voice gender identification method and system based on lightweight neural network
CN112786004A (en) Speech synthesis method, electronic device, and storage device
CN106782502A (en) A kind of speech recognition equipment of children robot
CN106297769B (en) A kind of distinctive feature extracting method applied to languages identification
Sagi et al. A biologically motivated solution to the cocktail party problem
CN110580897A (en) audio verification method and device, storage medium and electronic equipment
CN113823323A (en) Audio processing method and device based on convolutional neural network and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20170426)