CN1190772C - Voice identifying system and compression method of characteristic vector set for voice identifying system - Google Patents


Publication number
CN1190772C
CN1190772C · CNB021486832A · CN02148683A
Authority
CN
China
Prior art keywords
code book
feature
vector
subclass
subspace
Prior art date
Legal status: Expired - Lifetime
Application number
CNB021486832A
Other languages
Chinese (zh)
Other versions
CN1455388A (en)
Inventor
潘接林
韩疆
刘建
颜永红
庹凌云
张建平
Current Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Original Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Priority date
Filing date: 2002-11-15
Publication date: 2005-02-23
Application filed by Institute of Acoustics CAS and Beijing Kexin Technology Co Ltd
Priority to CNB021486832A, granted as CN1190772C
Publication of application CN1455388A
Application granted
Publication of granted patent CN1190772C
Anticipated expiration
Legal status: Expired - Lifetime (current)

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present invention discloses a method for compressing the feature vector set of a speech recognition system. During clustering of the feature vector set into a codebook, a step is added that dynamically merges and splits subsets according to the number of vectors in each subset and the total distance measure of those vectors. This reduces the sum of the distance measures between the vectors in each cluster and their corresponding codewords and so improves the precision of the clustering algorithm. When a codebook compressed with this method is used for speech recognition, the storage required by the recognizer is greatly reduced while its recognition performance is preserved. The invention also discloses a speech recognition system in which the acoustic model is replaced by such a codebook together with a probability table; during decoding, the required probability values are looked up in the pre-stored probability table instead of being computed as Gaussian probabilities, which greatly reduces the amount of decoding computation and greatly increases the recognition speed of the system.

Description

Speech recognition system and method for compressing a feature vector set for a speech recognition system
Technical field
The present invention relates to a speech recognition system and to a method for compressing a feature vector set used in a speech recognition system.
Background technology
Almost all current speech recognition systems use methods based on statistical pattern recognition. Every such system must convert the time-domain waveform of the input speech into a digitized vector feature that describes and distinguishes different pronunciations; these are called speech features. From these features a sound model is built for all pronunciations, which in the speech recognition field is usually called the acoustic model. Every speech recognition system must have an acoustic model; a large-vocabulary continuous speech recognition system additionally needs a language model. Given an input sequence of speech features as the initial condition, the purpose of speech recognition is to use the acoustic model and the language model, together with a search algorithm, to output a recognition result such as a character, word, or sentence. In other words, the speech recognition system finds, in a huge space of characters, words, or sentences, the one that matches the given input feature sequence with maximum probability. A speech feature set is formed by collecting the feature parameters of a large amount of speech; it can be used to quantize and encode the feature vector sequence of the input speech into a corresponding sequence of feature codewords.
Figure 1 is a block diagram of a known speech recognition system. The analog speech is converted by an analog-to-digital conversion unit 11 into a digital signal that a computer can process. A feature extraction unit 12 then divides this digital signal into frames, typically with a frame length of 20 ms and a frame shift of 10 ms, and extracts the MFCC parameters of each frame of speech to obtain an MFCC vector sequence. A decoding unit 14 takes the feature vector sequence of the input speech, the acoustic model 13, and the language model 15, applies a search strategy such as depth-first search (the Viterbi algorithm) or breadth-first search, and produces the recognition result. The language model applies knowledge at the linguistic level during large-vocabulary continuous speech recognition and improves the recognition accuracy of the system.
With the rapid development of microelectronics and communication technology, embedded communication devices such as mobile phones have become almost indispensable in work and daily life, and users demand ever more functionality from them, which has made applying speech technology to such devices a research focus. The voice functions of existing mobile phones require specific speech to be enrolled in advance and can only recognize a limited set of voice commands; they cannot truly perform large-vocabulary speech recognition, for example recognition and entry of the full set of Mandarin syllables. The main reason is the limited resources of such devices: they lack the storage space and computing power needed to run a speaker-independent Chinese syllable speech recognition system. How to reduce the storage space and the amount of computation required by existing speech recognition systems while preserving recognition performance is therefore an urgent technical problem.
In current mainstream speech recognition systems the acoustic model is described with hidden Markov models (HMMs). Because hidden Markov models based on continuous probability densities (CDHMMs) describe human pronunciation more accurately than HMMs based on discrete probability densities, most systems use CDHMMs for the acoustic model. A CDHMM acoustic model, however, requires a large amount of storage: taking an existing speaker-independent Chinese syllable speech recognition system as an example, its acoustic model occupies 4 MB of storage, which is hardly feasible on resource-constrained embedded hardware platforms such as mobile phones and PDAs.
One way to reduce the storage is to reduce the number of CDHMM states or the number of Gaussian distributions per state, but this greatly degrades the recognition performance of the system.
Another approach is to vector-quantize the acoustic model into a codebook so as to compress the data. The most common algorithm for this is K-means clustering: the feature vector set is first divided into several subspaces, and all vectors of each subspace are then clustered to obtain a codebook. The codebook for each subspace is generated as shown in Figure 4: set k = 0, place all vectors of the subspace in a single subset, and compute the center vector of this subset to obtain the initial codebook (step 200); if k = K, the K-bit codebook has been obtained and clustering ends, otherwise go to step 220 (step 210); set k = k + 1, split every subset into two, generate new center vectors, and assemble the new codebook of this subspace (step 220); assign each vector of this subspace to the subset whose center vector has the smallest distance measure from it (step 230); compute the rate of change of the total distance measure of all vectors of this subspace (step 240); compare this rate of change with a preset threshold, and if it is less than or equal to the threshold return to step 210, otherwise go to step 250; recompute the center vector of each subset from the vectors now assigned to it, form the new codebook, and return to step 230 (step 260).
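For reference, the following Python sketch (not part of the original disclosure; function and variable names are illustrative, and the squared Euclidean distance and the c ± 0.5·std split are assumptions chosen for concreteness) implements this conventional split-and-refine codebook training for a single subspace:

```python
import numpy as np

def lbg_codebook(vectors, K, threshold=1e-3):
    """Conventional K-means (LBG-style) codebook training for one subspace,
    following steps 200-260 of the background description.

    vectors: (N, M) array of M-dimensional vectors of this subspace.
    K: number of codebook bits; the final codebook has 2**K codewords.
    """
    codebook = vectors.mean(axis=0, keepdims=True)        # step 200: one subset, one center
    nearest = np.zeros(len(vectors), dtype=int)           # every vector starts in subset 0
    for _ in range(K):                                     # step 210: repeat until K bits
        # step 220: split every subset in two around its center (c +/- 0.5 * std)
        stds = np.array([vectors[nearest == j].std(axis=0) if np.any(nearest == j)
                         else np.zeros(vectors.shape[1]) for j in range(len(codebook))])
        codebook = np.vstack([codebook + 0.5 * stds, codebook - 0.5 * stds])
        prev_total = 1e-20
        while True:
            # step 230: assign each vector to the nearest center
            d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            nearest = d.argmin(axis=1)
            total = d[np.arange(len(vectors)), nearest].sum()
            # steps 240-250: stop refining once the total distance stabilises
            if abs(prev_total - total) / prev_total <= threshold:
                break
            prev_total = total
            # step 260: recompute the center of every non-empty subset
            for j in range(len(codebook)):
                if np.any(nearest == j):
                    codebook[j] = vectors[nearest == j].mean(axis=0)
    return codebook
```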
During the clustering described above, some subsets often end up containing very few vectors, so that the sum of the distance measures between the vectors in each cluster and their corresponding codewords remains larger than it should be, which degrades the clustering result for a given codebook size. When an acoustic model compressed with this method is used in a speech recognition system, the recognition accuracy drops. Likewise, when this algorithm is applied to compress a speech feature set into a feature codebook, if some codewords cover too few vectors then, for a given codebook size, the precision of the feature codebook decreases; quantizing and encoding the input speech feature sequence then becomes less precise, and the recognition performance of the speech recognition system degrades.
In addition, in a CDHMM the probability distribution function of a feature vector in a given state is described by a weighted sum of several Gaussian distribution functions, which describes the distribution of the feature vector space more accurately. If CDHMMs are used in a large-vocabulary speech recognition system, however, the decoding unit must evaluate Gaussian probabilities many times during decoding; most of the required computation is concentrated in these Gaussian probability evaluations, which demand a large amount of arithmetic. When large-vocabulary speech recognition is run on resource-constrained embedded hardware such as a mobile phone, the system responds too slowly to satisfy the needs of practical use.
Summary of the invention
In view of this, the technical problem to be solved by the present invention is to provide a method for compressing the feature vector set of a speech recognition system that reduces the storage required by the system while preserving its recognition performance.
To achieve this object, the invention provides a method for compressing a feature vector set for a speech recognition system: the feature vector set is first divided into several subspaces, and all vectors of each subspace are then clustered to obtain a codebook. The codebook of each subspace is generated by the following steps:
(a) place all vectors of this subspace in one subset and compute the center vector of this subset to obtain the initial codebook;
(b) split every subset into two and generate new center vectors to assemble the new codebook of this subspace;
(c) for each vector of this subspace, find the center vector with the smallest distance measure from it, and assign the vector to the subset corresponding to that center vector;
(d) compute the rate of change of the total distance measure of all vectors of this subspace;
(e) compare this rate of change with a preset threshold on the rate of change; if it is less than or equal to the threshold, check whether a codebook of the predetermined number of bits has been obtained: if so, finish, otherwise execute step (b); if the rate of change is greater than the threshold, execute step (f);
(f) merge and split subsets based on the number of vectors in each subset and its average distance measure; and
(g) take the center vectors of the subsets produced by the split as the codewords representing those subsets to obtain the new codebook of this subspace, and return to step (c).
In the above scheme, step (b) can be subdivided as follows: compute the mean square deviation of all vectors of each subset with respect to its center vector; add half of the corresponding mean square deviation to the center vector to obtain one new center vector, and subtract half of the corresponding mean square deviation from the center vector to obtain another new center vector; and collect all the newly generated center vectors of this subspace to form the new codebook.
In the above scheme, the rate of change of the total distance measure of all vectors of the subspace is computed as follows: compute the sum, over all vectors of the subspace, of the distance between each vector and the center vector with the smallest distance measure from it, giving the new total distance measure; subtract the new total distance measure from the old total distance measure to obtain a difference; and divide the absolute value of this difference by the old total distance measure to obtain the rate of change of the total distance measure.
In the above scheme, step (e) further includes, before checking whether a codebook of the predetermined number of bits has been obtained, a step of assigning the new total distance measure to the old total distance measure; and step (g) further includes, before returning to step (c), a step of assigning the new total distance measure to the old total distance measure.
In the above scheme, merging means deleting from the codebook the center vector of every subset whose number of vectors is less than a preset value.
In the above scheme, splitting means that, after a subset has been merged, the sum of the distances between all vectors in each subset and the center vector of that subset is computed, this sum is divided by the number of vectors of the subset, and the subset with the largest resulting ratio is split into two subsets, generating two new center vectors.
In the above scheme, the feature vectors are LPC coefficients, cepstral coefficients, filter-bank coefficients, or MFCC coefficients.
In the above scheme, the feature vector set is an acoustic model or a speech feature set; compressing the acoustic model yields a Gaussian codebook, and compressing the speech feature set yields a feature codebook.
As can be seen from the above, when clustering the speech feature vectors into a codebook, the method of the invention adds a step that dynamically merges and splits subsets according to the number of vectors in each subset and the total distance measure of its vectors. This reduces the sum of the distance measures between the vectors in each cluster and their corresponding codewords and improves the precision of the clustering algorithm. Applying a codebook compressed with this method in a speech recognition system greatly reduces the storage required by the system while preserving its recognition performance.
Another technical problem the invention solves is to provide a speech recognition system that reduces the storage required by the system while preserving its recognition performance.
To achieve this object, the invention provides a speech recognition system comprising at least an analog-to-digital conversion unit, a feature extraction unit, a decoding unit, and an acoustic model, which receives a speech input signal and obtains a matching recognition result, wherein: the analog-to-digital conversion unit converts the speech input signal into a digital signal; the feature extraction unit divides this digital signal into frames and extracts speech feature parameters to obtain the feature vector sequence of the input speech; the decoding unit decodes this feature vector sequence to obtain the recognition result; and the acoustic model is the Gaussian codebook obtained with the compression method of the invention. The system may further comprise a language model.
Because the above recognition system uses the compressed model obtained with the compression method of the invention, it greatly reduces the storage required by the system while preserving its recognition performance.
A further technical problem the invention solves is to provide a speech recognition system that improves the recognition speed of the system while preserving its recognition performance.
To achieve this object, the invention provides a speech recognition system comprising at least: an analog-to-digital conversion unit that converts the input analog speech signal into a digital signal; a feature extraction unit that divides this digital signal into frames, extracts the feature parameters of each frame of speech, and obtains its feature vector sequence; a feature codebook composed of a number of codewords; a quantization encoding unit that converts the feature vector sequence of the input speech into a feature codeword sequence according to the feature codebook; a probability table that stores, for each codeword in the feature codebook, the probability values of each codeword in the corresponding Gaussian codebook; and a decoding unit that decodes this feature codeword sequence to obtain the recognition result and, for each codeword in the feature codeword sequence, looks up directly in the probability table the Gaussian codeword that matches it with maximum probability.
The above system may further comprise a language model.
In the above scheme, the quantization encoding unit converts the feature vector sequence of the input speech into a feature codeword sequence by the following steps: divide the feature vector sequence into the same number of subspaces as the feature codebook, each subspace corresponding to one codebook; compute the distance measure between every feature vector in each subspace and every codeword in the corresponding codebook, and take the codeword with the smallest distance measure from the feature vector as the codeword corresponding to that feature vector in the feature codeword sequence; and combine the codewords corresponding to all vectors of all subspaces of the feature vector sequence in the original vector order to obtain the corresponding feature codeword sequence.
In the above scheme, the probability table is generated by the following steps: compute the mean vector and variance vector corresponding to each codeword in the Gaussian codebook; using these mean and variance vectors, compute the log probability with which each codeword in the feature codebook matches each codeword in the Gaussian codebook; and store the probability values with which all codewords in the feature codebook match all codewords in the Gaussian codebook to obtain the probability table.
Therefore, with the above recognition system and method of the invention, no Gaussian probabilities need to be computed during decoding: the required probability values are simply looked up in the pre-stored probability table, which significantly reduces the amount of decoding computation and greatly increases the recognition speed of the system.
A further technical problem the invention solves is to provide a speech recognition system that reduces the storage required by the system and improves its recognition speed while preserving its recognition performance.
To achieve this object, the invention provides a speech recognition system comprising at least: an analog-to-digital conversion unit that converts the input analog speech signal into a digital signal; a feature extraction unit that divides this digital signal into frames, extracts the feature parameters of each frame of speech, and obtains its feature vector sequence; a feature codebook obtained with the compression method of the invention; a quantization encoding unit that converts the feature vector sequence of the input speech into a feature codeword sequence according to the feature codebook; a probability table that stores, for each codeword in the feature codebook, the probability values of each codeword in the corresponding Gaussian codebook, the Gaussian codebook being one obtained with the compression method of the invention; and a decoding unit that decodes this feature codeword sequence to obtain the recognition result and, for each codeword in the feature codeword sequence, looks up directly in the probability table the Gaussian codeword that matches it with maximum probability.
In this system the acoustic model is replaced by the feature codebook and the probability table, which greatly reduces the storage space required by the system while ensuring that its recognition accuracy drops only slightly. At the same time, the use of the probability table significantly reduces the amount of computation; according to experimental results, the recognition speed of the recognition system of the invention is more than 50% higher than that of existing speech recognition systems.
A further technical problem the invention solves is to provide a speech recognition method that improves recognition speed.
To achieve this object, the invention provides a speech recognition method comprising the following steps: convert the input analog speech signal into a digital signal; divide this digital signal into frames, extract the feature parameters of each frame of speech, and obtain the feature vector sequence of the input speech; quantize and encode the feature vector sequence using the feature codebook to obtain the corresponding feature codeword sequence; and decode to obtain the recognition result, looking up directly in the probability table, for each codeword in the feature codeword sequence, the Gaussian codeword that matches it with maximum probability.
In the above method, the feature codebook is obtained with the compression method of the invention described above, and the probability table is generated by the following steps:
computing the mean vector and variance vector corresponding to each codeword in the Gaussian codebook;
using these mean and variance vectors, computing the log probability with which each codeword in the feature codebook matches each codeword in the Gaussian codebook;
storing the probability values with which all codewords in the feature codebook match all codewords in the Gaussian codebook to obtain the probability table.
The Gaussian codebook here is likewise obtained with the compression method of the invention described above.
With the above recognition method, the storage required by the system is reduced and the recognition speed is increased while the recognition performance of the system is preserved.
Description of drawings
Fig. 1 is a block diagram of a known speech recognition system.
Fig. 2 is a schematic diagram of subspace division and clustering.
Fig. 3 is a flowchart of the improved K-means clustering algorithm of the present invention.
Fig. 4 is a flowchart of the known K-means clustering algorithm.
Fig. 5 is a recognition flowchart of the speech recognition system of an embodiment of the invention.
Fig. 6 is a block diagram of the speech recognition system of an embodiment of the invention.
Embodiment
The compression method for the feature vector set of a speech recognition system is explained first.
There are many kinds of speech features, such as LPC coefficients, cepstral coefficients, filter-bank coefficients, and Mel-frequency cepstral coefficients (MFCC). The most commonly used feature parameters are MFCCs. Which parameter is used does not matter here; the invention applies to any feature parameter. For ease of understanding, the MFCC coefficients are used below as an example to explain the compression method of the invention for the feature vector set of a speech recognition system.
Suppose the L MFCC parameters, L first-order difference MFCC parameters, and L second-order difference MFCC parameters of each frame of speech are concatenated into a 3·L = X dimensional vector used as the feature parameters, forming an X-dimensional speech feature set; correspondingly, the dimension of the Gaussian normal distributions in the acoustic model is also X. As shown in Figure 2, the X-dimensional vectors in the speech feature set or acoustic model 21 are first divided into Y subspaces 22, each subspace 22 having dimension X/Y = M. All vectors of the Y M-dimensional feature subspaces and of the Y M-dimensional Gaussian subspaces are then clustered with the improved K-means clustering algorithm 23, so that each feature subspace and each Gaussian subspace obtains a codebook 24 containing a set number of codewords. The combination of all feature codebooks then represents the feature space, and the combination of all Gaussian codebooks represents the Gaussian space of the acoustic model.
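As an illustration of this subspace division (a sketch, not from the patent; it assumes the X dimensions are cut into consecutive groups of M components, which the text does not specify, and uses the X = 36, Y = 12, M = 3 values of the later embodiment):

```python
import numpy as np

X, Y = 36, 12            # full feature dimension and number of subspaces
M = X // Y               # dimension of each subspace (3 here)

# feature_set: (num_vectors, X) array of MFCC + delta + delta-delta vectors
feature_set = np.random.randn(10000, X)   # placeholder data for the sketch

# subspaces[i] holds the (num_vectors, M) slice belonging to the i-th subspace;
# each slice is clustered independently into its own codebook.
subspaces = [feature_set[:, i * M:(i + 1) * M] for i in range(Y)]
```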
The following describes in detail how the improved K-means clustering method is used to obtain the codebook representing such an M-dimensional subspace. Suppose the M-dimensional subspace is M = {m_1, ..., m_N}, i.e. the space contains N vectors, and these N vectors are to be clustered into a K-bit codebook. The value of K can be set in advance, and the resulting codebook contains 2^K codewords, each codeword being an M-dimensional center vector. The method is now described with reference to the flow in Figure 3:
Step 100: set k = 0 bits, place all vectors of this subspace in a single subset, and compute the center vector c_j, j = 1, ..., 2^k, of the subset to obtain the initial codebook. The center vector of a subset is computed as
c_j = (1 / N_j) · Σ_{i=1}^{N_j} m_{ji},
where N_j is the number of vectors in subset j = {m_{j1}, ..., m_{jN_j}};
Step 110: if k = K, the K-bit codebook has been obtained and clustering ends; otherwise go to step 120;
Step 120: set k = k + 1 and split every subset into two. The splitting is done as follows:
First sub-step: for all vectors in each subset, compute their mean square deviation with respect to the center vector c_j of that subset, denoted δ_j.
Second sub-step: generate two new center vectors:
c_{j1} = c_j + 0.5 · δ_j,
c_{j2} = c_j − 0.5 · δ_j.
Third sub-step: collect all center vectors to form the k-bit codebook {c_1, ..., c_{2^k}}.
Step 130: find, for each vector of this subspace, the center vector with the smallest distance measure from it and assign the vector to the subset corresponding to that center vector, as follows:
First sub-step: for each vector m_l ∈ M, l = 1, ..., N, find the center vector c_{n(l)} with the smallest distance measure from it, i.e.
n(l) = argmin_{j=1,...,2^k} d(m_l, c_j),
where d(m_l, c_j) is the distance between m_l and c_j.
Second sub-step: assign the vector to the subset corresponding to this center vector.
Step 140: compute the rate of change of the total distance measure of all vectors of this subspace; the initial value of the total distance measure is D_1 = 1e−20.
The total distance over all vectors of this subspace is
D_2 = Σ_{l=1}^{N} d(m_l, c_{n(l)}),
and the rate of change of the total distance measure is η = |D_1 − D_2| / D_1.
Step 150: compare this rate of change η with a preset threshold θ on the rate of change of the total distance measure; if η ≤ θ, set D_1 = D_2 and return to step 110; if η > θ, go to step 160.
Step 160: merging and splitting of subsets.
Based on the number of vectors N_j in each subset and the total distance measure DT_j of each subset, subsets are merged and split by the following steps:
First sub-step, merging: if N_j < φ, subset CL_j is merged and its center vector is deleted from the codebook, where j = 1, ..., 2^k and φ is a preset threshold on the number of vectors.
Second sub-step, splitting: if a subset has been merged, a subset must be selected to be split. The selection criterion is
m = argmax_{j=1,...,2^k} (DT_j / N_j), where DT_j = Σ_{i=1}^{N_j} d(m_{ji}, c_j) and the m_{ji} are the vectors in subset j.
That is, the total distance measure between all vectors of a subset and its center vector is divided by the number of vectors the subset contains, giving the average distance measure of each subset, and the subset CL_m with the largest average distance measure is split.
Step 170: compute the center vectors of the new subsets produced by the split according to the first and second sub-steps of step 120, form a new codebook together with the original center vectors, set D_1 = D_2, and return to step 130.
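The merge-and-split step 160 is what distinguishes the improved algorithm from the conventional loop sketched in the background section. The following Python sketch shows one possible implementation, called between the assignment of step 130/150 and the recomputation of step 170; the squared Euclidean distance, the immediate re-assignment after deleting a center, and one split per merged subset are assumptions, and all names are illustrative:

```python
import numpy as np

def merge_and_split(vectors, codebook, nearest, phi):
    """Step 160 sketch: merge subsets with fewer than phi vectors, then split the
    subset with the largest average distance measure once per merged subset.

    vectors: (N, M) vectors of the subspace; codebook: (C, M) current centers;
    nearest: (N,) index of the center assigned to each vector; phi: merge threshold.
    """
    counts = np.bincount(nearest, minlength=len(codebook))
    keep = counts >= phi
    n_merged = int((~keep).sum())
    codebook = codebook[keep]                   # merging: drop underpopulated centers
    # re-assign so the split criterion DT_j / N_j is computed on valid subsets
    d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = d.argmin(axis=1)
    for _ in range(n_merged):                   # splitting: one split per merged subset
        totals = np.bincount(nearest, weights=d[np.arange(len(vectors)), nearest],
                             minlength=len(codebook))
        counts = np.bincount(nearest, minlength=len(codebook))
        avg = totals / np.maximum(counts, 1)    # DT_j / N_j, the average distance measure
        m = int(avg.argmax())                   # subset with the largest average distance
        delta = 0.5 * vectors[nearest == m].std(axis=0)
        codebook = np.vstack([codebook, codebook[m] + delta])   # c_m replaced by c_m +/- delta
        codebook[m] = codebook[m] - delta
        d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        nearest = d.argmin(axis=1)
    return codebook, nearest
```

In this sketch the codebook size stays constant: every center removed by a merge is compensated by one split of the subset with the largest average distance measure, which is the behaviour the text describes for a given codebook size.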
The codebook of this subspace is obtained by the above clustering algorithm; in our experimental results, the precision of this clustering algorithm is 18.5% higher than that of the original algorithm. Combining the codebooks of all feature subspaces gives the feature codebook, and combining the codebooks of all Gaussian subspaces gives the Gaussian codebook. The acoustic model and speech feature set compressed with the above method of the invention can be applied in various speech recognition systems; they greatly reduce the storage space occupied by the original acoustic model while maintaining good recognition performance.
The application of this compression method to the acoustic model of a known speech recognition system is illustrated below.
Take the acoustic model of a speaker-independent Chinese syllable speech recognition system as an example. The speech features of this system are 12th-order MFCCs, 12 first-order difference MFCCs, and 12 second-order difference MFCCs, i.e. 36 parameters in total; the acoustic model uses CDHMMs and occupies 4 MB of storage. The Gaussian space of the acoustic model is first divided into 12 subspaces of 3 dimensions each; each subspace of the Gaussian space is then clustered with the improved K-means algorithm described above to generate a 7-bit Gaussian codebook of 128 codewords. Because each Gaussian codeword consists of one mean vector and one variance vector, the number of bytes occupied by each Gaussian codeword is: Gaussian subspace dimension × 2 × 4 bytes.
In order to recover the original acoustic model from the Gaussian codebook, an index table is also needed, so the size of the Gaussian codebook is computed as:
number of subspaces × number of codewords in the Gaussian codebook × bytes per codeword + index table
= 12 × 128 × 6 × 4 + 216000 = 252864 bytes
The storage of the system is thus reduced from 4 MB to about 252 KB. Because the clustering precision of the compression method of the invention is very high, when the Gaussian codebook obtained with this algorithm is used as the acoustic model of a speech recognition system, experimental results show that, apart from the greatly reduced storage requirement, the recognition accuracy of the system drops only slightly, so the system can be applied in embedded devices such as mobile phones.
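As a quick check of the storage arithmetic above (a sketch; the 4-byte values per dimension and the 216000-byte index table are taken from the text as given):

```python
subspaces = 12                   # Gaussian subspaces of 3 dimensions each
codewords = 128                  # 7-bit codebook per subspace
dims = 3
bytes_per_value = 4              # 4-byte values assumed
vectors_per_codeword = 2         # one mean vector and one variance vector

codeword_bytes = dims * vectors_per_codeword * bytes_per_value   # 24 bytes
codebook_bytes = subspaces * codewords * codeword_bytes          # 36864 bytes
index_table_bytes = 216000                                       # from the text
total = codebook_bytes + index_table_bytes                       # 252864 bytes, ~252 KB
print(codebook_bytes, total)
```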
Figure 5 shows the recognition flow of the speech recognition system of an embodiment of the invention; the recognition steps are as follows.
Step 300: convert the input analog speech signal into a digital signal;
Step 310: divide this digital signal into frames, extract the feature parameters of each frame of speech, and obtain the feature vector sequence of the input speech;
Step 320: quantize and encode the feature vector sequence using the feature codebook to obtain the corresponding feature codeword sequence;
Step 330: decode to obtain the recognition result; for each codeword in the feature codeword sequence, the Gaussian codeword that matches it with the maximum probability is found by direct lookup in the probability table.
The corresponding system block diagram is shown in Figure 6. This speech recognition system consists of an analog-to-digital conversion unit 61, a feature extraction unit 62, a feature codebook 63, a quantization encoding unit 64, a probability table 65, a decoding unit 66, and a language model 67; of course, if the system is only applied to recognition of Chinese syllables, the language model 67 is not needed. In the system:
The analog-to-digital conversion unit 61 converts the input analog speech signal into a digital signal;
The feature extraction unit 62 divides this digital signal into frames, extracts the feature parameters of each frame of speech, and obtains its feature vector sequence;
The feature codebook 63 is obtained by compressing the speech feature set, either with the above compression method of the invention or with another known compression method;
The quantization encoding unit 64 quantizes and encodes the feature vector sequence of the input speech according to the feature codebook 63, converting it into a feature codeword sequence. Suppose the feature vector sequence of the input speech is {A_1, A_2, ..., A_T} with dimension X, and the number of subspaces in the feature codebook 63 is Y, each of dimension X/Y = M. The feature vector sequence is first also divided into Y subspaces, each subspace corresponding to one codebook; its vector sequence in the i-th feature subspace is {O_{i1}, O_{i2}, ..., O_{iT}}, 1 ≤ i ≤ Y. Quantizing and encoding the sequence {O_{i1}, O_{i2}, ..., O_{iT}} means finding, in the corresponding feature codebook {F_{i1}, F_{i2}, ..., F_{iL}}, where L is the number of codewords in the codebook, the codeword sequence with the smallest distance measure. The steps are as follows:
First, compute the distance measure between the feature vector O_{it} and the codeword F_{ij} in the corresponding codebook:
D_{itj} = Σ_{m=1}^{M} (O_{it}(m) − F_{ij}(m))^2, 1 ≤ t ≤ T, 1 ≤ i ≤ Y, 1 ≤ j ≤ L,
where O_{it}(m) and F_{ij}(m) are the m-th components of O_{it} and F_{ij} respectively.
Then find the index of the codeword with the smallest distance from O_{it}:
n_{it} = argmin_{1 ≤ j ≤ L} D_{itj}, 1 ≤ t ≤ T, 1 ≤ i ≤ Y.
The codeword corresponding to O_{it} after quantization encoding is then F_{i,n_{it}}. Quantizing and encoding each subspace of the feature vector sequence in this way yields the feature codeword sequence of this feature vector sequence.
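A minimal sketch of this quantization encoding, assuming the squared Euclidean distance D_{itj} defined above and numpy arrays; all names are illustrative:

```python
import numpy as np

def quantize(features, feature_codebooks):
    """Convert a feature vector sequence into a feature codeword sequence.

    features: (T, X) input feature vectors.
    feature_codebooks: list of Y arrays, each (L, M), one codebook per subspace.
    Returns a (T, Y) array of codeword indices n_it.
    """
    T, X = features.shape
    Y = len(feature_codebooks)
    M = X // Y
    codes = np.empty((T, Y), dtype=np.int32)
    for i, cb in enumerate(feature_codebooks):       # i-th subspace
        O = features[:, i * M:(i + 1) * M]           # O_i1 ... O_iT
        # D_itj = sum_m (O_it(m) - F_ij(m))^2 for every frame t and codeword j
        D = ((O[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2)
        codes[:, i] = D.argmin(axis=1)               # n_it = argmin_j D_itj
    return codes
```

Applied frame by frame, this turns the (T, X) feature vector sequence into a (T, Y) array of codeword indices, which is the feature codeword sequence used by the decoding unit.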
The probability table 65 stores, for each codeword in the feature codebook 63, the probability values of each codeword in the corresponding Gaussian codebook (the codebook set of the Gaussian space is obtained by compressing the acoustic model, either with the above compression method of the invention or with another compression method). The probability table is generated as follows.
Suppose the codebook of the i-th feature subspace is {F_{i1}, F_{i2}, ..., F_{iL}}, where L is the number of codewords of this codebook; then the codebook set of the Y M-dimensional subspaces of the X-dimensional feature space is
{F_{11}, F_{12}, ..., F_{1L}, ..., F_{Y1}, F_{Y2}, ..., F_{YL}}.
Suppose the codebook of the l-th Gaussian subspace is {G_{l1}, G_{l2}, ..., G_{lL}}, where L is the number of codewords of this codebook; then the codebook set of the Y M-dimensional subspaces of the X-dimensional Gaussian space is
{G_{11}, G_{12}, ..., G_{1L}, ..., G_{Y1}, G_{Y2}, ..., G_{YL}}.
Suppose the mean and variance vectors corresponding to Gaussian codeword G_{lk} are m_{lk} and σ_{lk} respectively. The probability values used in the decoding algorithm are usually log probabilities, so the log probability with which a codeword F_{ij} in the feature codebook matches a codeword G_{lk} in the Gaussian codebook can be computed as
ln( (1 / √(2π σ_{lk}²)) · exp( −(F_{ij} − m_{lk})·(F_{ij} − m_{lk}) / σ_{lk}² ) ).
Computing the probability values with which all codewords in the feature codebooks match all codewords in the Gaussian codebooks and storing them yields the probability table.
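A minimal sketch of the probability-table construction, with the matching formula implemented as printed above (note that it lacks the usual factor of 1/2 in the exponent); diagonal variance vectors and numpy arrays are assumed, and all names are illustrative:

```python
import numpy as np

def build_probability_table(feature_codebooks, gauss_means, gauss_vars):
    """Pre-compute ln p(F_ij | G_ik) for every feature codeword and Gaussian codeword.

    feature_codebooks: list of Y arrays, each (L, M).
    gauss_means, gauss_vars: lists of Y arrays, each (L, M) (mean and variance vectors).
    Returns a (Y, L, L) table: table[i, j, k] is the log probability that feature
    codeword F_ij matches Gaussian codeword G_ik.
    """
    Y = len(feature_codebooks)
    L = feature_codebooks[0].shape[0]
    table = np.empty((Y, L, L))
    for i in range(Y):
        F = feature_codebooks[i]                 # (L, M)
        m, v = gauss_means[i], gauss_vars[i]     # (L, M) each
        for k in range(L):
            # (F_ij - m_ik).(F_ij - m_ik) / sigma^2, summed over dimensions,
            # following the exponent exactly as printed in the text
            diff2 = ((F - m[k]) ** 2 / v[k]).sum(axis=1)
            table[i, :, k] = -0.5 * np.log(2 * np.pi * v[k]).sum() - diff2
    return table

# During decoding, the acoustic log score of frame t against Gaussian codeword k of
# subspace i becomes a table lookup instead of a Gaussian evaluation:
#     score = table[i, codes[t, i], k]
```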
The decoding unit 66 decodes to obtain the recognition result; for each codeword in the feature codeword sequence, it finds the Gaussian codeword that matches it with the maximum probability by direct lookup in the probability table.
The language model 67 applies knowledge at the linguistic level to improve the recognition accuracy of the system when continuous speech input is used.
The analog-to-digital conversion unit 61 can be implemented with an analog-to-digital conversion chip; the functions of the feature extraction unit 62, the quantization encoding unit 64, and the decoding unit 66 can be performed by a CPU; and the probability table 65, the feature codebook 63, and the language model 67 are all stored in memory.
Therefore, with the above recognition system and method of the invention, no Gaussian probabilities need to be computed during decoding; the required probability values are simply looked up in the pre-stored probability table, which significantly reduces the amount of decoding computation and greatly increases the recognition speed of the system.
A specific embodiment is used below to illustrate the outstanding reduction in storage space and increase in computation speed obtained when the speech recognition system disclosed by the invention uses the feature codebook and probability table obtained with the compression method of the invention.
The speaker-independent Chinese syllable speech recognition system is again used as the example. In this system the speech features are 12th-order MFCCs, 12 first-order difference MFCCs, and 12 second-order difference MFCCs, 36 parameters in total; the acoustic model uses CDHMMs and occupies 4 MB of storage. The 36-dimensional vectors of the speech feature set are divided into 12 subspaces of 3 dimensions each, and correspondingly the Gaussian space of the acoustic model is also divided into 12 subspaces of 3 dimensions each. The improved K-means clustering algorithm described above is then applied to each subspace of the feature space to generate a 7-bit feature codebook of 128 codewords, and likewise to each subspace of the Gaussian space to generate a 7-bit Gaussian codebook of 128 codewords. The probability value of each codeword in the Gaussian codebook for each codeword in the feature codebook is then computed and stored in the probability table. The sizes of the feature codebook and the probability table are computed as follows:
Feature codebook size:
number of subspaces × number of codewords per codebook × subspace dimension × bytes per dimension = 12 × 128 × 3 × 4 = 18432 bytes.
Probability table size:
number of subspaces × number of feature codebook codewords × number of Gaussian codebook codewords × bytes per probability value = 12 × 128 × 128 × 2 = 393216 bytes.
The acoustic model of the system is replaced by the feature codebook and the probability table, so the storage of the system is reduced from 4 MB to about 412 KB; this greatly reduces the storage space required by the system while ensuring that its recognition accuracy drops only slightly. At the same time, the use of the probability table significantly reduces the amount of computation; according to experimental results, the recognition speed of the recognition system of the invention is more than 50% higher than that of existing speech recognition systems.
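As a quick check of the storage arithmetic for this embodiment (a sketch; the 4 bytes per codebook value and 2 bytes per probability entry are as given in the text):

```python
subspaces, codewords, dims = 12, 128, 3

feature_codebook_bytes = subspaces * codewords * dims * 4        # 18432 bytes
probability_table_bytes = subspaces * codewords * codewords * 2  # 393216 bytes
total_bytes = feature_codebook_bytes + probability_table_bytes   # 411648 bytes, about 412 KB
print(feature_codebook_bytes, probability_table_bytes, total_bytes)
```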
The significant improvements of the recognition system of the invention in storage space and computation speed make it possible to apply recognition of the full set of Mandarin syllables on embedded devices such as mobile phones; applied on other devices, it also optimizes system performance and improves the response speed of the system.

Claims (14)

1. A method for compressing a feature vector set for a speech recognition system, in which the feature vector set is first divided into several subspaces and all vectors of each subspace are then clustered to obtain a codebook, the generation of the codebook of each subspace comprising the steps of:
(a) dividing all vectors of this subspace into one subset and computing the center vector of this subset to obtain an initial codebook;
(b) splitting every subset into two and generating new center vectors to assemble a new codebook of this subspace;
(c) finding, for each vector of this subspace, the center vector with the smallest distance measure from it, and assigning each vector to the subset corresponding to the center vector with the smallest distance measure from it;
(d) computing the rate of change of the total distance measure of all vectors of this subspace;
(e) comparing this rate of change with a preset threshold on the rate of change; if the rate of change is less than or equal to the threshold, further judging whether a codebook of a predetermined number of bits has been obtained, and if so ending, otherwise executing step (b); if the rate of change is greater than the threshold, executing step (f);
(f) merging and splitting subsets based on the number of vectors in each subset and its average distance measure; and
(g) taking the center vectors of the subsets obtained by the split as the codewords representing those subsets to obtain the new codebook of this subspace, and returning to step (c).
2. The method for compressing a feature vector set for a speech recognition system according to claim 1, wherein step (b) is subdivided into the following steps:
computing the mean square deviation of all vectors of each subset with respect to its center vector;
adding half of the corresponding mean square deviation to the center vector to obtain one new center vector, and subtracting half of the corresponding mean square deviation from the center vector to obtain another new center vector; and
collecting the newly generated center vectors of this subspace to obtain the new codebook.
3. The method for compressing a feature vector set for a speech recognition system according to claim 1, wherein the rate of change of the total distance measure of all vectors of the subspace is computed as follows: computing the sum, over the vectors of the subspace, of the distance between each vector and the center vector with the smallest distance measure from it, to obtain a new total distance measure; subtracting the new total distance measure from the old total distance measure to obtain a difference; and dividing the absolute value of this difference by the old total distance measure to obtain the rate of change of the total distance measure.
4. The method for compressing a feature vector set for a speech recognition system according to claim 1, wherein step (e) further comprises, before judging whether a codebook of a predetermined number of bits has been obtained, a step of assigning the new total distance measure to the old total distance measure; and step (g) further comprises, before returning to step (c), a step of assigning the new total distance measure to the old total distance measure.
5. The method for compressing a feature vector set for a speech recognition system according to claim 1, wherein the merging means deleting from the codebook the center vector of every subset whose number of vectors is less than a preset value.
6. The method for compressing a feature vector set for a speech recognition system according to claim 5, wherein the splitting means that, after a subset has been merged, the sum of the distances between all vectors in each subset and the center vector of that subset is computed, this sum is divided by the number of vectors of the subset, and the subset with the largest resulting ratio is split into two subsets while two new center vectors are generated.
7. The method for compressing a feature vector set for a speech recognition system according to claim 1, wherein the feature vectors are LPC coefficients, cepstral coefficients, filter-bank coefficients, or MFCC coefficients.
8. The method for compressing a feature vector set for a speech recognition system according to any one of claims 1 to 7, wherein the feature vector set is an acoustic model or a speech feature set, a Gaussian codebook being obtained after compressing the acoustic model and a feature codebook being obtained after compressing the speech feature set.
9. A speech recognition system comprising at least an analog-to-digital conversion unit, a feature extraction unit, a decoding unit, and an acoustic model, for receiving a speech input signal and obtaining a matching recognition result, wherein:
the analog-to-digital conversion unit converts the speech input signal into a digital signal;
the feature extraction unit divides this digital signal into frames and extracts speech feature parameters to obtain the feature vector sequence of the input speech;
the decoding unit decodes this feature vector sequence to obtain the recognition result; and
the acoustic model is a Gaussian codebook obtained after compressing the acoustic model.
10. The speech recognition system according to claim 9, further comprising a language model.
11. A speech recognition system for receiving a speech input signal and obtaining a matching recognition result, comprising at least:
an analog-to-digital conversion unit, which converts the input analog speech signal into a digital signal;
a feature extraction unit, which divides this digital signal into frames, extracts the feature parameters of each frame of speech, and obtains its feature vector sequence;
a feature codebook, obtained after compressing a speech feature set;
a quantization encoding unit, which converts the feature vector sequence of the input speech into a feature codeword sequence according to this feature codebook;
a probability table, which stores the probability value of each codeword of the Gaussian codebook corresponding to each codeword in this feature codebook, the Gaussian codebook being the Gaussian codebook described in claim 8; and
a decoding unit, which decodes this feature codeword sequence to obtain the recognition result and, for each codeword in the feature codeword sequence, looks up directly in the probability table the Gaussian codeword that matches it with the maximum probability.
12. The speech recognition system according to claim 11, further comprising a language model.
13. The speech recognition system according to claim 11, wherein the quantization encoding unit converts the feature vector sequence of the input speech into a feature codeword sequence by the following steps:
dividing the feature vector sequence into the same number of subspaces as the feature codebook, each subspace corresponding to one codebook;
computing the distance measure between every feature vector in each subspace and every codeword in the corresponding codebook, and taking the codeword with the smallest distance measure from the feature vector as the codeword corresponding to that feature vector in the feature codeword sequence; and
combining the codewords corresponding to all vectors of each subspace of the feature vector sequence in the original vector order to obtain the corresponding feature codeword sequence.
14. The speech recognition system according to claim 11, wherein the probability table is generated by the following steps:
computing the mean vector and variance vector corresponding to each codeword in the Gaussian codebook;
using the mean and variance vectors, computing the log probability with which each codeword in the feature codebook matches each codeword in the Gaussian codebook; and
storing the probability values with which all codewords in the feature codebook match all codewords in the Gaussian codebook to obtain the probability table.
CNB021486832A 2002-09-30 2002-11-15 Voice identifying system and compression method of characteristic vector set for voice identifying system Expired - Lifetime CN1190772C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB021486832A CN1190772C (en) 2002-09-30 2002-11-15 Voice identifying system and compression method of characteristic vector set for voice identifying system

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN02131087.4 2002-09-30
CN02131087 2002-09-30
CNB021486832A CN1190772C (en) 2002-09-30 2002-11-15 Voice identifying system and compression method of characteristic vector set for voice identifying system

Related Child Applications (2)

Application Number Title Priority Date Filing Date
CNB200410070139XA Division CN1284134C (en) 2002-11-15 2002-11-15 A speech recognition system
CNB2004100701402A Division CN1259648C (en) 2002-11-15 2002-11-15 Phonetic recognition system

Publications (2)

Publication Number Publication Date
CN1455388A CN1455388A (en) 2003-11-12
CN1190772C true CN1190772C (en) 2005-02-23

Family

ID=29271370

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB021486832A Expired - Lifetime CN1190772C (en) 2002-09-30 2002-11-15 Voice identifying system and compression method of characteristic vector set for voice identifying system

Country Status (1)

Country Link
CN (1) CN1190772C (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8214204B2 (en) 2004-07-23 2012-07-03 Telecom Italia S.P.A. Method for generating a vector codebook, method and device for compressing data, and distributed speech recognition system

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101452704B (en) * 2007-11-29 2011-05-11 中国科学院声学研究所 Speaker clustering method based on information transfer
CN102623008A (en) * 2011-06-21 2012-08-01 中国科学院苏州纳米技术与纳米仿生研究所 Voiceprint identification method
CN104766608A (en) * 2014-01-07 2015-07-08 深圳市中兴微电子技术有限公司 Voice control method and voice control device
CN105374359B (en) * 2014-08-29 2019-05-17 中国电信股份有限公司 The coding method and system of voice data
CN105893389A (en) * 2015-01-26 2016-08-24 阿里巴巴集团控股有限公司 Voice message search method, device and server
EP3414842A1 (en) * 2016-02-08 2018-12-19 Koninklijke Philips N.V. Device for and method of determining clusters
CN106409285A (en) * 2016-11-16 2017-02-15 杭州联络互动信息科技股份有限公司 Method and apparatus for intelligent terminal device to identify language type according to voice data
CN108389576B (en) * 2018-01-10 2020-09-01 苏州思必驰信息科技有限公司 Method and system for optimizing compressed speech recognition model


Also Published As

Publication number Publication date
CN1455388A (en) 2003-11-12

Similar Documents

Publication Publication Date Title
CN1296886C (en) Speech recognition system and method
CN1269102C (en) Method for compressing dictionary data
CN1252675C (en) Sound identification method and sound identification apparatus
CN1112669C (en) Method and system for speech recognition using continuous density hidden Markov models
CN1277248C (en) System and method for recognizing a tonal language
CN1150515C (en) Speech recognition device
JP3696231B2 (en) Language model generation and storage device, speech recognition device, language model generation method and speech recognition method
CN1551101A (en) Adaptation of compressed acoustic models
US10176809B1 (en) Customized compression and decompression of audio data
CN1103971C (en) Speech recognition computer module and digit and speech signal transformation method based on phoneme
CN1750120A (en) Indexing apparatus and indexing method
CN1573926A (en) Discriminative training of language models for text and speech classification
CN1338095A (en) Apparatus and method for pitch tracking
US5950158A (en) Methods and apparatus for decreasing the size of pattern recognition models by pruning low-scoring models from generated sets of models
CN1781102A (en) Low memory decision tree
CN1331467A (en) Method and device for producing acoustics model
US5963902A (en) Methods and apparatus for decreasing the size of generated models trained for automatic pattern recognition
CN1924994A (en) Embedded language synthetic method and system
CN1190772C (en) Voice identifying system and compression method of characteristic vector set for voice identifying system
Ren et al. Discovering time-constrained sequential patterns for music genre classification
CN1841496A (en) Method and apparatus for measuring speech speed and recording apparatus therefor
CN1201284C (en) Rapid decoding method for voice identifying system
CN1190773C (en) Voice identifying system and compression method of characteristic vector set for voice identifying system
CN1284134C (en) A speech recognition system
CN1499484A (en) Recognition system of Chinese continuous speech

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C14 Grant of patent or utility model
GR01 Patent grant
CX01 Expiry of patent term
CX01 Expiry of patent term

Granted publication date: 20050223