CN102789779A - Speech recognition system and recognition method thereof - Google Patents


Info

Publication number
CN102789779A
Authority
CN
China
Prior art keywords
voice
grouping
module
vector
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201210242311XA
Other languages
Chinese (zh)
Inventor
Zhang Jing (张晶)
Qin Benzhuo (覃本灼)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Foreign Studies
Original Assignee
Guangdong University of Foreign Studies
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Foreign Studies filed Critical Guangdong University of Foreign Studies
Priority to CN201210242311XA priority Critical patent/CN102789779A/en
Publication of CN102789779A publication Critical patent/CN102789779A/en
Pending legal-status Critical Current

Abstract

The invention discloses a speech recognition system and a recognition method thereof. The system comprises a speech acquisition module, a speech preprocessing module, a speech feature extraction module, a grouping judgment module and a speech recognition module; the grouping judgment module groups speech by clustering. The speech acquisition module is connected to the speech preprocessing module, the speech preprocessing module to the speech feature extraction module, the speech feature extraction module to the grouping judgment module, and the grouping judgment module to the speech recognition module. The grouping judgment module comprises a grouping judgment unit and at least two grouping models; the speech feature extraction module is connected to the grouping judgment unit, the grouping judgment unit is connected to each of the grouping models, and the grouping models are connected to the speech recognition module.

Description

Speech recognition system and recognition method thereof
Technical field
The present invention relates to the field of speech recognition technology, and in particular to a system that implements a local speech recognition function on the Android operating system. The invention further relates to a speech recognition method for this system.
Background technology
To implement speech recognition in an embedded operating system, the input speech is usually preprocessed, characteristic parameters are extracted, pattern matching is performed, and the result is output. Pattern matching conventionally uses the discrete hidden Markov model (DHMM); Zhang Weiqing's "Research on Speech Recognition Algorithms" gives a detailed treatment of the HMM. A hidden Markov model (HMM) is usually described by five elements, comprising two state sets and three probability matrices, and can be written compactly as the triple λ = (A, B, π). The HMM extends the standard Markov model by adding a set of observable states and the probabilistic relation between the observable and hidden states. In the conventional DHMM matching process every template is matched in turn, so the time consumed grows with the number of templates; when the number of words to recognize is large, real-time performance is poor.
Summary of the invention
The object of the invention is to design a speech recognition system that remains real-time when the vocabulary is large while keeping a high recognition rate. A further object of the invention is to provide the recognition method of this system.
To achieve the above objects, the invention includes the following technical features: a speech recognition system comprising a speech acquisition module, a speech preprocessing module, a speech feature extraction module, a grouping judgment module and a speech recognition module, the grouping judgment module being used to group speech by clustering. The speech acquisition module is connected to the speech preprocessing module, the speech preprocessing module to the speech feature extraction module, the speech feature extraction module to the grouping judgment module, and the grouping judgment module to the speech recognition module. The grouping judgment module comprises a grouping judgment unit and no fewer than two grouping models; the speech feature extraction module is connected to the grouping judgment unit, the grouping judgment unit is connected to each of the grouping models, and the grouping models are connected to the speech recognition module.
Said speech preprocessing module comprises a pre-emphasis unit, a frame-division unit, a windowing unit and an endpoint detection unit connected in sequence; the pre-emphasis unit is connected to the speech acquisition module and the endpoint detection unit to the speech feature extraction module.
The invention also provides a recognition method for the speech recognition system, comprising the following steps:
(1) preprocess the input speech: pre-emphasis, frame division, windowing and endpoint detection;
(2) extract MFCC features as the recognition features and generate the speech characteristic parameters;
(3) compute the Co vector of the input speech and judge its class from the Euclidean distance between the Co vector and each group's characteristic parameter;
(4) pattern-match the input speech against all the speech in its class with a conventional DHMM. The DHMM produces a matching result directly from the characteristic parameters of the input speech; the decision rule is: among all templates fed the input, the one whose model outputs the largest probability is the matching result.
Said step (3) comprises:
A. letting the MFCC characteristic parameter of speech Wn be an Nn × Mm matrix and splicing every row onto the end of the first row, so that the MFCC parameter of Wn is characterized by a row vector Co of dimension Nn × Mm;
B. running the K-means algorithm repeatedly on the Co vectors of all the speech, recording the class number of each speech under each clustering as a row vector Vn;
C. computing the mean En and standard deviation σn of each speech's row vector Vn and characterizing each speech by the product Pn of En and σn;
D. running the K-means algorithm on the vector formed by the Pn values to obtain the speech contained in each class;
E. averaging the Co vectors of all the speech in each class, the mean Fe being the group characteristic parameter.
The MFCC characteristic parameter, the Co vector and the product Pn all characterize a speech, but Pn is a single value: compared with the first two it has the lowest dimension and the smallest data volume.
The vector Vn and its mean En and standard deviation σn are intermediate parameters whose purpose is to obtain Pn; the clustering result is then drawn from Pn alone.
The group characteristic parameter Fe characterizes a class; it is used at the recognition stage to judge which class an input belongs to.
The K-means algorithm appears twice in this process: the first time for cluster analysis, to obtain the Pn that characterizes each speech; the second time only to produce the final clustering result.
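The recognition-stage flow of steps (3) and (4) can be sketched in Python. This is an illustrative sketch, not the patent's code: `match_score` is a hypothetical stand-in for the DHMM output probability of step (4), and `group_features` (the Fe vectors) and `group_templates` are assumed to have been precomputed.

```python
import numpy as np

def recognize(input_co, group_features, group_templates, match_score):
    """Judge the input's group as the one whose characteristic parameter Fe
    is at the smallest Euclidean distance from the input's Co vector, then
    pattern-match only inside that group; the template with the largest
    score (output probability) is the matching result."""
    co = np.asarray(input_co, dtype=float)
    dists = [np.linalg.norm(co - np.asarray(fe, dtype=float))
             for fe in group_features]
    group = int(np.argmin(dists))
    best = max(group_templates[group], key=lambda t: match_score(co, t))
    return group, best
```

Because only one group's templates are scored, the matching cost no longer grows with the full template count, which is the point of the grouping step.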
The invention implements a local speech recognition function on the Android operating system. By preprocessing the collected speech signal, the system is more efficient at the later recognition stage and its recognition accuracy is higher. Speech with similar acoustic features is gathered into the same group by a clustering algorithm; before an input is recognized, its group is judged first, and the pattern-matching computation is performed only within that group. Recognition accuracy is improved by adding redundant speech; after the redundancy is added the grouping model library remains small and the recognition overhead low, so both real-time performance and recognition accuracy improve greatly.
Description of drawings
Fig. 1 is the module schematic of the invention;
Fig. 2 is the flow chart of the invention;
Fig. 3 shows the generation of the Co vector;
Fig. 4 shows the redundancy-adding procedure of the invention.
Embodiment
The invention implements a local speech recognition function on the Android operating system; by preprocessing the collected speech signal, the system is more efficient and more accurate at the later recognition stage. In the conventional DHMM matching process all HMM templates are traversed and matched in turn. A hidden Markov model (HMM) is usually described by five elements, comprising two state sets and three probability matrices:
1. Hidden states S
These states satisfy the Markov property and are the actual states implied by the Markov model; they usually cannot be obtained by direct observation (for example S1, S2, S3, and so on).
2. Observable states O
Associated with the hidden states in the model and obtainable by direct observation (for example O1, O2, O3, and so on; the number of observable states need not equal the number of hidden states).
3. Initial state probability matrix π
The probabilities of the hidden states at the initial time t = 1. For example, if at t = 1 we have P(S1) = p1, P(S2) = p2 and P(S3) = p3, then π = [p1 p2 p3].
4. Hidden state transition probability matrix A
Describes the transition probabilities between the states of the HMM: Aij = P(Sj | Si), 1 ≤ i, j ≤ N, the probability that the state at time t + 1 is Sj given that the state at time t is Si.
5. Observation probability matrix B
With N hidden states and M observable states: Bij = P(Oi | Sj), 1 ≤ i ≤ M, 1 ≤ j ≤ N, the probability of observing Oi at time t given that the hidden state is Sj.
In general the model is written compactly as the triple λ = (A, B, π). The hidden Markov model extends the standard Markov model by adding the set of observable states and the probabilistic relation between the observable and hidden states. When the number of words to recognize is large, matching every template makes real-time performance poor.
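The output probability P(O | λ) that the matching step maximizes can be evaluated with the scaled forward algorithm. The following is a minimal illustrative sketch, not the patent's implementation:

```python
import numpy as np

def forward_log_prob(A, B, pi, obs):
    """log P(O | lambda) for a discrete HMM lambda = (A, B, pi) via the
    scaled forward algorithm. A[i, j] = P(S_j at t+1 | S_i at t),
    B[j, k] = P(O_k | S_j), obs is a sequence of observation-symbol indices."""
    alpha = pi * B[:, obs[0]]            # initialisation: alpha_1(i)
    log_p = 0.0
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # induction step
        s = alpha.sum()                  # scaling keeps alpha from underflowing
        log_p += np.log(s)
        alpha = alpha / s
    return log_p + np.log(alpha.sum())
```

During matching, this value is computed for every candidate model and the template whose model gives the largest value is the recognition result.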
The module schematic of the invention is shown in Fig. 1. The speech acquisition module 1 collects the input speech signal; the speech preprocessing module 2 performs pre-emphasis, frame division, windowing and endpoint detection, implemented respectively by the pre-emphasis unit 21, frame-division unit 22, windowing unit 23 and endpoint detection unit 24. The speech feature extraction module 3 then extracts features from the speech, the grouping judgment module clusters and groups the speech, and the result is output.
Each module and unit involved is described below:
1. Preprocessing
Preprocessing mainly comprises pre-emphasis, frame division, windowing and endpoint detection.
1.1 Pre-emphasis
In the pre-emphasis step the input signal is passed through a filter that shifts its energy to the appropriate frequency range.
The transfer function is: H(z) = 1 − 0.9375 z⁻¹
The resulting signal is: s̃(n) = s(n) − 0.9375 s(n − 1)
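A minimal sketch of this filter in Python, assuming a NumPy array input and s(−1) = 0:

```python
import numpy as np

def pre_emphasis(signal, alpha=0.9375):
    """Pre-emphasis filter from the description: s~(n) = s(n) - alpha*s(n-1).
    Boosts the high frequencies to compensate for the spectral tilt of speech."""
    signal = np.asarray(signal, dtype=float)
    out = np.empty_like(signal)
    out[0] = signal[0]                       # s(-1) taken as 0
    out[1:] = signal[1:] - alpha * signal[:-1]
    return out
```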
1.2 Frame division
The speech signal varies rapidly, but it is quasi-stationary over 10-20 ms, so the signal within such a relatively stable interval can be treated as a basic unit: a frame.
1.3 Windowing
To avoid the truncation error that a rectangular window introduces into the LPC coefficients, a Hamming window function is applied to each frame:
s_w(n) = s(n) · w(n), 0 ≤ n ≤ N − 1,
where: w(n) = 0.54 − 0.46 cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1.
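Frame division and Hamming windowing together can be sketched as follows; the 256-sample frame length and 128-sample shift are illustrative assumptions, not values from the disclosure:

```python
import numpy as np

def frame_and_window(signal, frame_len=256, frame_shift=128):
    """Split the signal into short frames (speech is quasi-stationary over
    10-20 ms) and apply the Hamming window
    w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)), 0 <= n <= N-1, to each frame."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    window = 0.54 - 0.46 * np.cos(
        2 * np.pi * np.arange(frame_len) / (frame_len - 1))
    frames = np.empty((n_frames, frame_len))
    for i in range(n_frames):
        start = i * frame_shift
        frames[i] = signal[start:start + frame_len] * window
    return frames
```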
1.4 Endpoint detection
Endpoint detection determines whether a speech signal is present, i.e. finds the start and end points of the speech within a segment that contains it. Effective endpoint detection not only minimizes the processing time but also rejects the noise of the silent segments, giving the recognition system good performance. A common method detects the endpoints from two coefficients, the short-time energy and the short-time zero-crossing rate of the signal, computed per frame as:
Short-time energy: e(i) = Σ_{n=1..N} |x_i(n)|
Short-time zero-crossing rate: ZCR(i) = Σ_{n=1..N−1} |x_i(n) − x_i(n+1)|
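These two per-frame coefficients, and a simple threshold-based endpoint decision, can be sketched as follows. The thresholding rule is an illustrative assumption (the disclosure does not specify one), and the ZCR formula is reproduced as written above; a conventional zero-crossing rate counts sign changes instead.

```python
import numpy as np

def short_time_energy(frame):
    """e(i) = sum_n |x_i(n)|  (magnitude sum, as in the description)."""
    return np.abs(np.asarray(frame, dtype=float)).sum()

def short_time_zcr(frame):
    """ZCR(i) = sum_n |x_i(n) - x_i(n+1)|, the first-difference form used
    in the description."""
    frame = np.asarray(frame, dtype=float)
    return np.abs(frame[:-1] - frame[1:]).sum()

def detect_endpoints(frames, energy_thr, zcr_thr):
    """Mark a frame as speech when either feature exceeds its threshold;
    the first and last speech frames give the start and end points.
    Returns None when no speech frame is found."""
    speech = [short_time_energy(f) > energy_thr or short_time_zcr(f) > zcr_thr
              for f in frames]
    if not any(speech):
        return None
    return speech.index(True), len(speech) - 1 - speech[::-1].index(True)
```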
2. Characteristic parameter extraction
The MFCC characteristic parameter is adopted. Its calculation proceeds roughly as follows:
1. Apply the fast Fourier transform (FFT) to the signal to obtain its energy spectrum.
2. Multiply the energy spectrum by a bank of n triangular band-pass filters and take the logarithm of each filter's output, giving n log energies (Log Energy). The n filters are evenly spaced on the mel-frequency scale (Mel Frequency); the mel frequency relates to the ordinary frequency f by mel(f) = 2595 · log10(1 + f/700).
3. Discrete cosine transform (DCT). Apply the DCT to the n log energies E_k to obtain the mel-scale cepstrum parameters of order L, where L is usually taken as 12:
C_m = Σ_{k=1..N} cos[m · (k − 0.5) · π/N] · E_k,  m = 1, 2, ..., L
where E_k is the inner product of the k-th triangular filter with the spectral energy computed in the previous step, and N is the number of triangular filters.
4. Log energy (Log energy). The energy of a frame is itself a key speech feature, so the log energy of the frame (defined as the sum of squares of the signal in the frame, taken as a base-10 logarithm and multiplied by 10) is usually added, so that the basic feature of each frame has 13 dimensions: 1 log energy plus 12 cepstrum parameters.
5. Delta cepstrum (Delta cepstrum). Although 13 parameters have been obtained, for speech recognition delta cepstrum parameters are usually added as well, to show how the cepstrum parameters change over time. Each delta is the slope of a cepstrum parameter with respect to time, i.e. its temporal dynamics:
ΔC_m(t) = [Σ_{τ=−M..M} τ · C_m(t + τ)] / [Σ_{τ=−M..M} τ²]
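Steps 3 and 5 can be sketched directly from the two formulas above; `log_energies` is assumed to be the filter-bank output E_k of step 2, and frames beyond the ends of the sequence are clamped (an illustrative boundary choice the disclosure does not specify):

```python
import numpy as np

def mel_cepstrum(log_energies, L=12):
    """DCT of the N log filter-bank energies E_k:
    C_m = sum_{k=1..N} cos(m*(k-0.5)*pi/N) * E_k, m = 1..L."""
    E = np.asarray(log_energies, dtype=float)
    N = len(E)
    k = np.arange(1, N + 1)
    return np.array([np.sum(np.cos(m * (k - 0.5) * np.pi / N) * E)
                     for m in range(1, L + 1)])

def delta(cepstra, M=2):
    """Delta cepstrum: slope of each coefficient over time,
    dC_m(t) = sum_{tau=-M..M} tau*C_m(t+tau) / sum_{tau} tau^2."""
    C = np.asarray(cepstra, dtype=float)      # shape (T, L)
    T = len(C)
    denom = sum(tau * tau for tau in range(-M, M + 1))
    out = np.zeros_like(C)
    for t in range(T):
        for tau in range(-M, M + 1):
            out[t] += tau * C[min(max(t + tau, 0), T - 1)]
    return out / denom
```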
3. Generation of the speech groups and the group characteristic parameters
(1) Clustering the speech into groups
Let the MFCC characteristic parameter of speech Wn be an Nn × Mm matrix. If every row is spliced onto the end of the first row, Wn can be characterized by a row vector of dimension Nn × Mm, which we call the Co vector, as shown in Fig. 3.
The n Co vectors corresponding to the n speech templates are then clustered with the K-means algorithm, which yields the class of each speech and the center of each class. However, the result of K-means depends strongly on the initial cluster centers, so the centers obtained from a single run cannot serve as the group characteristic parameters. Clustering should instead be run from as many different initial centers as possible and the large set of results analyzed to draw the final clustering. Number the classes: speech belonging to the same class will then have similar average class numbers, and their class numbers will vary consistently across runs. Exploiting this, the invention analyzes the results as follows:
After m clustering runs, record the class number of each speech under each run as a row vector Vn, which characterizes the classes that speech fell into. The average class number is represented by the mean En of Vn and its variation by the standard deviation σn of Vn; the grouping behaviour of the speech can then be characterized by the product Pn of En and σn. Each speech is thus characterized by a single value Pn: speech Wn goes from being represented by a row vector of dimension Nn × Mm to being represented by one number, and the problem of clustering n speech templates reduces to clustering one n-dimensional row vector.
For data this small almost any clustering method will do, however simple. For convenience the implementation still uses the K-means algorithm to cluster this row vector.
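The Co-vector flattening and the repeated K-means analysis that yields Pn can be sketched as follows. The tiny K-means implementation, the run count and the random initialisation are illustrative assumptions, not the patent's code:

```python
import numpy as np

def co_vector(mfcc):
    """Flatten the Nn x Mm MFCC matrix row by row into a single Co row vector."""
    return np.asarray(mfcc, dtype=float).ravel()

def kmeans_labels(X, k, rng, iters=20):
    """Minimal K-means: cluster label of each row of X for one random
    initialisation of the centres."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # keep old centre if a cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def pn_values(co_vectors, k, runs=20, seed=0):
    """Cluster the Co vectors from 'runs' different initial centres; record
    each speech's label sequence Vn across runs, then characterise each
    speech by Pn = mean(Vn) * std(Vn) as in the description."""
    X = np.stack([np.asarray(v, dtype=float) for v in co_vectors])
    rng = np.random.default_rng(seed)
    V = np.stack([kmeans_labels(X, k, rng) for _ in range(runs)], axis=1)
    return V.mean(axis=1) * V.std(axis=1)
```

The resulting one-dimensional Pn values are what the second, final K-means run clusters.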
(2) Generating the group characteristic parameters
According to the clustering result of step (1), the Co vectors of all the speech in each class are averaged; this mean is the group characteristic parameter Fe.
(3) Compute the Euclidean distance between the input's Co vector and each group's characteristic parameter Fe and judge the input's group from the resulting distances: the group at the smallest distance is the judged class.
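Step (2) above can be sketched as:

```python
import numpy as np

def group_feature_params(co_vectors, classes):
    """For each class from the clustering result, average the Co vectors of
    all its member speech; each mean is that group's characteristic
    parameter Fe."""
    X = np.stack([np.asarray(v, dtype=float) for v in co_vectors])
    classes = np.asarray(classes)
    return {int(c): X[classes == c].mean(axis=0) for c in np.unique(classes)}
```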
Because the grouping accuracy directly affects the recognition accuracy, it must be sufficiently high. Redundant speech is added to the appropriate classes by the following method: each collected sample is input to the system and its class is judged from the group characteristic parameters. If the judgment is correct the next sample is input; otherwise the speech corresponding to the current sample is added to its correct class, as shown in Fig. 4.
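The Fig. 4 procedure can be sketched as follows, assuming (as an illustrative choice) that the grouping judgement is the nearest-Fe Euclidean rule of step (3):

```python
import numpy as np

def add_redundancy(samples, true_groups, Fe):
    """Feed each collected sample through the grouping judgement (nearest
    group feature Fe by Euclidean distance); when the judged group is
    wrong, add that sample's voice to its correct group as a redundant
    entry, so later inputs like it are grouped correctly."""
    Fe = np.asarray(Fe, dtype=float)
    redundant = {g: [] for g in range(len(Fe))}
    for sample, g_true in zip(samples, true_groups):
        judged = int(np.linalg.norm(
            Fe - np.asarray(sample, dtype=float), axis=1).argmin())
        if judged != g_true:
            redundant[g_true].append(list(sample))
    return redundant
```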
After the redundancy is added, the number of speech entries each class contains increases. Let the number of speech entries in a class be K_P, the total number of speech entries be n, and the number of groups be m; a ratio α is then formed from these quantities (the defining formula appears only as an inline image in the original). The smaller α is, the smaller the overhead of the recognition process. The prior art, by contrast, defines each coefficient as an object with inheritance relations to other objects, so the file produced when the template sequence is saved is very large, about five times the size of the template file saved by the invention.

Claims (4)

1. A speech recognition system, characterized in that it comprises a speech acquisition module, a speech preprocessing module, a speech feature extraction module, a grouping judgment module and a speech recognition module, the grouping judgment module being used to group speech by clustering; the speech acquisition module is connected to the speech preprocessing module, the speech preprocessing module to the speech feature extraction module, the speech feature extraction module to the grouping judgment module, and the grouping judgment module to the speech recognition module; the grouping judgment module comprises a grouping judgment unit and no fewer than two grouping models; the speech feature extraction module is connected to the grouping judgment unit, the grouping judgment unit is connected to each of the grouping models, and the grouping models are connected to the speech recognition module.
2. The speech recognition system according to claim 1, characterized in that said speech preprocessing module comprises a pre-emphasis unit, a frame-division unit, a windowing unit and an endpoint detection unit connected in sequence; the pre-emphasis unit is connected to the speech acquisition module and the endpoint detection unit to the speech feature extraction module.
3. A recognition method for the speech recognition system according to claim 2, characterized by comprising the following steps:
(1) preprocessing the input speech by pre-emphasis, frame division, windowing and endpoint detection;
(2) extracting MFCC features as the recognition features and generating the speech characteristic parameters;
(3) computing the Co vector of the input speech and judging its class from the Euclidean distance between the Co vector and each group's characteristic parameter;
(4) pattern-matching the input speech against all the speech in its class with a conventional DHMM.
4. The recognition method according to claim 3, characterized in that said step (3) comprises:
A. letting the MFCC characteristic parameter of speech Wn be an Nn × Mm matrix and splicing every row onto the end of the first row, so that the MFCC parameter of Wn is characterized by a row vector Co of dimension Nn × Mm;
B. running the K-means algorithm repeatedly on the Co vectors of all the speech, recording the class number of each speech under each clustering as a row vector Vn;
C. computing the mean En and standard deviation σn of each speech's row vector Vn and characterizing each speech by the product Pn of En and σn;
D. running the K-means algorithm on the vector formed by the Pn values to obtain the speech contained in each class;
E. averaging the Co vectors of all the speech in each class, the mean Fe being the group characteristic parameter.
CN201210242311XA 2012-07-12 2012-07-12 Speech recognition system and recognition method thereof Pending CN102789779A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210242311XA CN102789779A (en) 2012-07-12 2012-07-12 Speech recognition system and recognition method thereof


Publications (1)

Publication Number Publication Date
CN102789779A 2012-11-21

Family

ID=47155166

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210242311XA Pending CN102789779A (en) 2012-07-12 2012-07-12 Speech recognition system and recognition method thereof

Country Status (1)

Country Link
CN (1) CN102789779A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104732972A (en) * 2015-03-12 2015-06-24 广东外语外贸大学 HMM voiceprint recognition signing-in method and system based on grouping statistics
CN105913840A (en) * 2016-06-20 2016-08-31 西可通信技术设备(河源)有限公司 Speech recognition device and mobile terminal
CN105931637A (en) * 2016-04-01 2016-09-07 金陵科技学院 User-defined instruction recognition speech photographing system
CN106448657A (en) * 2016-10-26 2017-02-22 安徽省云逸智能科技有限公司 Continuous speech recognition system for restaurant robot servant
CN106531158A (en) * 2016-11-30 2017-03-22 北京理工大学 Method and device for recognizing answer voice
CN106782550A (en) * 2016-11-28 2017-05-31 黑龙江八农垦大学 A kind of automatic speech recognition system based on dsp chip
CN107773982A (en) * 2017-10-20 2018-03-09 科大讯飞股份有限公司 Game voice interactive method and device
CN107910020A (en) * 2017-10-24 2018-04-13 深圳和而泰智能控制股份有限公司 Sound of snoring detection method, device, equipment and storage medium
CN108536304A (en) * 2018-06-25 2018-09-14 广州市锐尚展柜制作有限公司 A kind of multi-modal interactive device of smart home
CN108922543A (en) * 2018-06-11 2018-11-30 平安科技(深圳)有限公司 Model library method for building up, audio recognition method, device, equipment and medium
CN109255106A (en) * 2017-07-13 2019-01-22 Tcl集团股份有限公司 A kind of text handling method and terminal
CN116189671A (en) * 2023-04-27 2023-05-30 凌语国际文化艺术传播股份有限公司 Data mining method and system for language teaching

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007138875A1 (en) * 2006-05-31 2007-12-06 Nec Corporation Speech recognition word dictionary/language model making system, method, and program, and speech recognition system
CN102237083A (en) * 2010-04-23 2011-11-09 广东外语外贸大学 Portable interpretation system based on WinCE platform and language recognition method thereof
CN102324232A (en) * 2011-09-12 2012-01-18 辽宁工业大学 Method for recognizing sound-groove and system based on gauss hybrid models
CN102436809A (en) * 2011-10-21 2012-05-02 东南大学 Network speech recognition method in English oral language machine examination system


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Chen Xiaolin: "Research on Speech Recognition Methods Based on Hidden Markov Models", China Master's Theses Full-text Database *
Gao Qinglun et al.: "Speech Recognition Technology Based on Discrete Hidden Markov Models", Journal of Hebei Academy of Sciences *


Similar Documents

Publication Publication Date Title
CN102789779A (en) Speech recognition system and recognition method thereof
CN107680582B (en) Acoustic model training method, voice recognition method, device, equipment and medium
CN102800316B (en) Optimal codebook design method for voiceprint recognition system based on nerve network
CN101980336B (en) Hidden Markov model-based vehicle sound identification method
CN101136199B (en) Voice data processing method and equipment
CN102968990B (en) Speaker identifying method and system
CN103310789B (en) A kind of sound event recognition method of the parallel model combination based on improving
Chavan et al. An overview of speech recognition using HMM
CN101226743A (en) Method for recognizing speaker based on conversion of neutral and affection sound-groove model
CN105206270A (en) Isolated digit speech recognition classification system and method combining principal component analysis (PCA) with restricted Boltzmann machine (RBM)
CN103794207A (en) Dual-mode voice identity recognition method
CN102024455A (en) Speaker recognition system and method
CN104078039A (en) Voice recognition system of domestic service robot on basis of hidden Markov model
CN103456302B (en) A kind of emotional speaker recognition method based on the synthesis of emotion GMM Model Weight
Das et al. Bangladeshi dialect recognition using Mel frequency cepstral coefficient, delta, delta-delta and Gaussian mixture model
Deshmukh et al. Speech based emotion recognition using machine learning
CN109192200A (en) A kind of audio recognition method
CN104240706A (en) Speaker recognition method based on GMM Token matching similarity correction scores
CN104732972A (en) HMM voiceprint recognition signing-in method and system based on grouping statistics
CN110827844A (en) Noise classification method based on BP network
CN112562725A (en) Mixed voice emotion classification method based on spectrogram and capsule network
Shivakumar et al. Simplified and supervised i-vector modeling for speaker age regression
Ye et al. Phoneme classification using naive bayes classifier in reconstructed phase space
Prazak et al. Speaker diarization using PLDA-based speaker clustering
Nyodu et al. Automatic identification of Arunachal language using K-nearest neighbor algorithm

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20121121