CN1190772C - Speech recognition system and method for compressing the feature vector set of a speech recognition system - Google Patents

Speech recognition system and method for compressing the feature vector set of a speech recognition system

Info

Publication number
CN1190772C
CN1190772C CNB021486832A CN02148683A
Authority
CN
China
Prior art keywords
code book
feature
vector
subset
subspace
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
CNB021486832A
Other languages
Chinese (zh)
Other versions
CN1455388A (en)
Inventor
潘接林 (Pan Jielin)
韩疆 (Han Jiang)
刘建 (Liu Jian)
颜永红 (Yan Yonghong)
庹凌云 (Tuo Lingyun)
张建平 (Zhang Jianping)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Original Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS, Beijing Kexin Technology Co Ltd filed Critical Institute of Acoustics CAS
Priority to CNB021486832A priority Critical patent/CN1190772C/en
Publication of CN1455388A publication Critical patent/CN1455388A/en
Application granted granted Critical
Publication of CN1190772C publication Critical patent/CN1190772C/en
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present invention discloses a method for compressing the feature vector set of a speech recognition system. In the process of clustering the feature vector set into a codebook, a step is added that dynamically merges and splits subsets according to each subset's vector count and total distance metric. This reduces the sum of the distance metrics between the clustered vectors and their corresponding codewords and improves the precision of the clustering algorithm. When a codebook compressed with this method is used for speech recognition, the storage required can be greatly reduced while the recognition performance is preserved. The invention also discloses a speech recognition system in which a codebook and a probability table substitute for the acoustic model; during decoding, the required probability values are looked up in a prestored probability table instead of being computed as Gaussian probabilities, which greatly reduces the decoding workload and greatly increases the recognition speed of the system.

Description

Speech recognition system and method for compressing the feature vector set of a speech recognition system
Technical field
The present invention relates to a speech recognition system and to a method for compressing the feature vector set used by a speech recognition system.
Background technology
Nearly all current speech recognition systems use methods based on statistical pattern recognition. Every such system must convert the time-domain sound wave of the speech input into a digitized vector representation that describes and distinguishes different pronunciations; these vectors are called speech features. From these features a sound model of all pronunciations is built, which in the field of speech recognition is usually called the acoustic model. Every speech recognition system must have an acoustic model; a large-vocabulary continuous speech recognition system additionally needs a language model. The goal of speech recognition is, given a sequence of acoustic features as input, to use the acoustic model and the language model together with a search algorithm to output a recognition result such as a character, word, or sentence. In other words, the system searches a huge space of characters, words, and sentences for the one that matches the given input feature sequence with maximum probability. A speech feature set is formed by collecting the feature parameters of a large amount of speech; it can be used to quantize and encode the vector sequence of the input speech into a corresponding sequence of feature codewords.
Fig. 1 is a block diagram of a known speech recognition system. The analog speech signal is converted by an analog-to-digital conversion unit 11 into a digital signal that a computer can process. A feature extraction unit 12 then divides the signal into frames, typically 20 ms long with a 10 ms frame shift, and extracts the MFCC parameters of each frame to obtain an MFCC vector sequence. A decoding unit 14 applies a search strategy such as depth-first search (the Viterbi algorithm) or breadth-first search to the feature vector sequence of the input speech, the acoustic model 13, and the language model 15 to obtain the recognition result. The language model applies knowledge of the linguistic level to large-vocabulary continuous speech recognition and improves the recognition accuracy of the system.
With the rapid development of microelectronics and communication technology, embedded communication devices such as mobile phones have become almost indispensable in work and daily life, and the functions expected of them keep growing. Applying speech technology to such devices has therefore become a research focus. The voice functions of existing mobile phones require specific voices to be enrolled in advance and can recognize only a limited set of spoken commands; they cannot truly perform large-vocabulary speech recognition, for example recognition and entry of the full set of Chinese syllables. The main reason is that such devices are resource-constrained: they lack the storage space and computing power needed to run a speaker-independent Chinese syllable recognition system. How to reduce the storage and computation required by existing speech recognition systems while preserving recognition performance is therefore an urgent technical problem.
In currently popular speech recognition systems the acoustic model is usually described with hidden Markov models (HMMs). Because a hidden Markov model based on continuous probability densities (CDHMM) describes human pronunciation more accurately than an HMM based on discrete probability densities, most systems adopt CDHMMs for the acoustic model. A CDHMM acoustic model, however, occupies a large amount of storage: in an existing speaker-independent Chinese syllable speech recognition system, for example, the acoustic model occupies 4 MB, which is hardly feasible on resource-constrained embedded hardware platforms such as mobile phones and PDAs.
One way to reduce storage is to reduce the number of CDHMM states or the number of Gaussian distributions per state, but this greatly degrades the recognition performance of the system.
Another kind method is acoustic model to be carried out vector quantization generate code book with packed data, its the most frequently used algorithm is the K-means clustering algorithm, earlier this feature is vowed that collection is divided into the plurality of sub space, again all vectors of each subspace are carried out cluster and obtain a code book, the step that each subspace code book generates is as shown in Figure 4: make k=0, the subspace is divided into a subclass, calculate the center vector of this subclass, obtain the initialization code book, step 200; If k=K has so just obtained the code book of K-bits, cluster finishes, otherwise execution in step 220, step 210; Make k=k+1, all subclass are divided into two, generate new center vector, the new code book of synthetic this subspace, step 220; With each vector assignment of this subspace in the subclass corresponding with the center vector of its distance metric minimum, step 230; Calculate total distance metric rate of change of these all vectors of subspace, step 240; The threshold value of this rate of change and a default rate of change is compared, if this rate of change is less than or equal to this threshold value, get back to step 210: greater than this threshold value, then execution in step 250 as if this rate of change; All vectors that distribute according to each subclass recomputate the center vector of this subclass, form new code book and get back to step 230, step 260.
During the above clustering, subsets often appear that contain very few vectors, so that the sum of the distance metrics between the clustered vectors and their corresponding codewords remains too large; for a given codebook size this degrades the clustering result. When an acoustic model compressed with this method is used in a speech recognition system, the recognition accuracy drops. Likewise, when the algorithm is applied to compress the speech feature set into a feature codebook, if some codewords cover too few vectors, the precision of the feature codebook decreases for a given codebook size; this reduces the precision of quantization encoding of the input feature sequences and thus degrades the recognition performance of the speech recognition system.
In addition, in a CDHMM the probability distribution function of a feature vector in a given state is described by a weighted sum of several Gaussian distribution functions, which describes the distribution of the feature vector space more accurately. If a large-vocabulary speech recognition system adopts CDHMMs, however, the decoding unit must evaluate Gaussian probabilities many times during decoding; most of the computation needed in decoding is concentrated in these Gaussian probability evaluations, which require a large amount of arithmetic. Performing large-vocabulary recognition on a resource-constrained embedded platform such as a mobile phone therefore makes the speech recognition system react very slowly, which cannot meet the needs of practical use.
Summary of the invention
In view of this, the technical problem to be solved by the present invention is to provide a method for compressing the feature vector set of a speech recognition system that reduces the storage required by the system while preserving its recognition performance.
To achieve this goal, the invention provides a method for compressing the feature vector set of a speech recognition system, in which the feature vector set is first divided into several subspaces and all vectors of each subspace are then clustered to obtain a codebook. Codebook generation for each subspace comprises the steps of:
(a) placing all vectors of the subspace in one subset and computing the center vector of this subset to obtain the initial codebook;
(b) splitting every subset in two and generating new center vectors to assemble the new codebook of the subspace;
(c) finding, for each vector of the subspace, the center vector with the minimum distance metric from it, and assigning each vector to the subset corresponding to that center vector;
(d) computing the rate of change of the total distance metric of all vectors of the subspace;
(e) comparing this rate of change with a preset threshold: if the rate of change is less than or equal to the threshold, checking whether the codebook of the predetermined number of bits has been obtained, ending if so and executing step (b) if not; if the rate of change is greater than the threshold, executing step (f);
(f) merging and splitting subsets based on the vector count and average distance metric of each subset; and
(g) taking the center vectors of the subsets obtained by splitting as the codewords representing those subsets to obtain the new codebook of the subspace, and returning to step (c).
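The distinguishing steps (f) and (g) can be sketched as a single merge-and-split pass over the clustered subsets. This is an illustrative sketch only: splitting a subset into two halves is a simplification of the patent's center-vector split, it assumes the subset chosen for splitting holds at least two vectors, and all names are our own.

```python
def merge_and_split(codebook, cells, min_count):
    """One merge-and-split pass, a sketch of steps (f)-(g)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def centroid(vecs):
        n = len(vecs)
        return [sum(v[d] for v in vecs) / n for d in range(len(vecs[0]))]

    # (f) merge: delete codewords whose subset holds too few vectors
    keep = [i for i, c in enumerate(cells) if len(c) >= min_count]
    merged = len(cells) - len(keep)
    codebook = [codebook[i] for i in keep]
    cells = [cells[i] for i in keep]

    # (f) split: for each deletion, split the subset with the largest
    # average distance metric (total distance / vector count) in two
    for _ in range(merged):
        avg = [sum(dist(v, cw) for v in c) / len(c)
               for cw, c in zip(codebook, cells)]
        m = max(range(len(avg)), key=avg.__getitem__)
        big, _ = cells.pop(m), codebook.pop(m)
        half = len(big) // 2
        cells += [big[:half], big[half:]]
        # (g) the split subsets' center vectors become the new codewords
        codebook += [centroid(big[:half]), centroid(big[half:])]

    return codebook, cells
```

The pass preserves the codebook size: one codeword is added back for every one deleted.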
In the above scheme, step (b) may be divided into the following sub-steps: computing the mean square deviation of all vectors of each subset with respect to its center vector; adding half of the corresponding mean square deviation to each center vector to obtain one new center vector, and subtracting half of it to obtain another new center vector; and collecting the newly generated center vectors of the subspace to form the new codebook.
In the above scheme, the rate of change of the total distance metric of all vectors of the subspace is computed as follows: compute the sum, over all vectors of the subspace, of the distance between each vector and the center vector with the minimum distance metric from it, obtaining the new total distance metric; subtract the new total distance metric from the previous total distance metric to obtain a difference; and divide the absolute value of this difference by the previous total distance metric to obtain the rate of change of the total distance metric.
In the above scheme, step (e) further includes, before checking whether the codebook of the predetermined number of bits has been obtained, a step of assigning the new total distance metric value to the previous total distance metric value; and step (g) further includes, before returning to step (c), a step of assigning the new total distance metric value to the previous total distance metric value.
In the above scheme, merging means deleting from the codebook the center vector of every subset whose vector count is below a preset threshold.
In the above scheme, splitting means that after a subset has been merged, the sum of the distances between all vectors of each subset and its center vector is computed first, the ratio of this sum to the subset's vector count is then computed, the subset with the largest ratio is split into two subsets, and two new center vectors are generated at the same time.
In the above scheme, the feature vectors may be LPC coefficients, cepstrum coefficients, filter-bank coefficients, or MFCC coefficients.
In the above scheme, the feature vector set may be an acoustic model or a speech feature set; compressing the acoustic model yields a Gaussian codebook, and compressing the speech feature set yields a feature codebook.
As can be seen from the above, in the process of clustering speech feature vectors into a codebook, the method of the invention adds a step that dynamically merges and splits subsets according to each subset's vector count and total distance metric. This reduces the sum of the distance metrics between the clustered vectors and their corresponding codewords and improves the precision of the clustering algorithm. Applying a codebook compressed with the method of the invention in a speech recognition system greatly reduces the storage required while preserving the recognition performance of the system.
Another technical problem to be solved by the invention is to provide a speech recognition system that reduces storage requirements while preserving recognition performance.
To achieve this goal, the invention provides a speech recognition system comprising at least an analog-to-digital conversion unit, a feature extraction unit, a decoding unit, and an acoustic model, which receives a speech input signal and obtains the matching recognition result, wherein: the analog-to-digital conversion unit converts the speech input signal into a digital signal; the feature extraction unit divides the digital signal into frames and extracts speech feature parameters to obtain the feature vector sequence of the input speech; the decoding unit decodes the feature vector sequence to obtain the recognition result; and the acoustic model is the Gaussian codebook obtained with the compression method of the invention. The system may further comprise a language model.
Because the above recognition system adopts the compressed model obtained with the compression method of the invention, it can greatly reduce the storage required by the system while preserving its recognition performance.
A further technical problem to be solved by the invention is to provide a speech recognition system that improves recognition speed while preserving recognition performance.
To achieve this goal, the invention provides a speech recognition system comprising at least: an analog-to-digital conversion unit, which converts the input analog speech signal into a digital signal; a feature extraction unit, which divides the digital signal into frames, extracts the feature parameters of each frame, and obtains the feature vector sequence; a feature codebook composed of a number of codewords; a quantization encoding unit, which converts the feature vector sequence of the input speech into a feature codeword sequence according to the feature codebook; a probability table, which stores the probability value of each codeword of the feature codebook against each codeword of a Gaussian codebook; and a decoding unit, which decodes the feature codeword sequence to obtain the recognition result, looking up directly in the probability table, for each codeword of the sequence, the Gaussian codeword with the maximum matching probability.
The system may further comprise a language model.
In the above scheme, the quantization encoding unit converts the feature vector sequence of the input speech into a feature codeword sequence by the following steps: divide the feature vector sequence into as many subspaces as there are feature codebooks, each subspace corresponding to one codebook; compute the distance metric between every feature vector of each subspace and each codeword of the corresponding codebook, and take the codeword with the minimum distance metric from the feature vector as the codeword corresponding to that vector in the feature codeword sequence; and combine the codewords corresponding to all vectors of all subspaces in the original vector order to obtain the corresponding feature codeword sequence.
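The quantization-encoding steps just described can be sketched as follows. A minimal sketch under assumed conventions: frames are flat lists, each subspace codebook is a list of equal-length codewords, squared Euclidean distance stands in for the distance metric, and all names are ours.

```python
def quantize(frames, codebooks):
    """Replace each frame's sub-vectors by the indices of their
    nearest codewords in the corresponding subspace codebooks."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    M = len(codebooks[0][0])  # dimension of one subspace
    encoded = []
    for frame in frames:
        codes = []
        for s, book in enumerate(codebooks):
            sub = frame[s * M:(s + 1) * M]          # sub-vector for subspace s
            codes.append(min(range(len(book)),       # nearest codeword index
                             key=lambda j: dist(sub, book[j])))
        encoded.append(codes)
    return encoded
```

Each frame of X-dimensional features thus becomes a short tuple of codeword indices, one per subspace.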
In the above scheme, the probability table is generated by the following steps: compute the mean vector and the variance vector corresponding to each codeword of the Gaussian codebook; using these mean and variance vectors, compute the logarithmic probability that each codeword of the feature codebook matches each codeword of the Gaussian codebook; and store the matching probabilities of all codewords of the feature codebook against all codewords of the Gaussian codebook to obtain the probability table.
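A sketch of that table-building step follows, assuming diagonal covariances so each Gaussian codeword is a (mean vector, variance vector) pair; the function names are ours, not the patent's.

```python
import math

def log_gauss(x, mean, var):
    # log-density of a diagonal-covariance Gaussian evaluated at x
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - mu) ** 2 / v)
               for xi, mu, v in zip(x, mean, var))

def build_prob_table(feature_codebook, gauss_codebook):
    """Precompute the log probability of every feature codeword under
    every Gaussian codeword; row = feature codeword, column = Gaussian."""
    return [[log_gauss(f, mean, var) for mean, var in gauss_codebook]
            for f in feature_codebook]
```

The table is computed once offline; its size is (number of feature codewords) x (number of Gaussian codewords).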
With the above recognition system and method of the invention, no Gaussian probability needs to be computed during decoding; the required probability values need only be looked up in the prestored probability table. This significantly reduces the decoding workload and thus greatly improves the recognition speed of the system.
Yet another technical problem to be solved by the invention is to provide a speech recognition system that reduces storage requirements and improves recognition speed while preserving recognition performance.
To achieve this goal, the invention provides a speech recognition system comprising at least: an analog-to-digital conversion unit, which converts the input analog speech signal into a digital signal; a feature extraction unit, which divides the digital signal into frames, extracts the feature parameters of each frame, and obtains the feature vector sequence; a feature codebook obtained with the compression method of the invention; a quantization encoding unit, which converts the feature vector sequence of the input speech into a feature codeword sequence according to the feature codebook; a probability table, which stores the probability value of each codeword of the feature codebook against each codeword of a Gaussian codebook, the Gaussian codebook being obtained with the compression method of the invention; and a decoding unit, which decodes the feature codeword sequence to obtain the recognition result, looking up directly in the probability table, for each codeword of the sequence, the Gaussian codeword with the maximum matching probability.
In the above system the acoustic model is replaced by the feature codebook and the probability table, greatly reducing the storage required while ensuring that the recognition accuracy of the system declines only slightly. At the same time, the use of the probability table significantly reduces the amount of computation; according to experimental results, the recognition speed of the system of the invention improves by more than 50% compared with existing speech recognition systems.
Yet another technical problem to be solved by the invention is to provide a speech recognition method that improves recognition speed.
To achieve this goal, the invention provides a speech recognition method comprising the following steps: convert the input analog speech signal into a digital signal; divide the digital signal into frames, extract the feature parameters of each frame, and obtain the feature vector sequence of the input speech; quantize and encode the feature vector sequence with a feature codebook to obtain the corresponding feature codeword sequence; and decode to obtain the recognition result, looking up directly in the probability table, for each codeword of the feature codeword sequence, the Gaussian codeword with the maximum matching probability.
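The decoding step above replaces every Gaussian evaluation with a table lookup, which can be sketched as follows; the table layout (row per feature codeword, column per Gaussian codeword) and the names are our own assumptions.

```python
def best_gauss_codeword(prob_table, feature_code):
    """For a given feature codeword, return the index of the Gaussian
    codeword with the highest stored matching probability --
    a lookup instead of a Gaussian density computation."""
    row = prob_table[feature_code]
    return max(range(len(row)), key=row.__getitem__)
```

Per frame and per subspace, decoding then costs one indexed read and an argmax over stored values rather than repeated exponential-function evaluations.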
In the above method, the feature codebook is obtained with the compression method of the invention described above, and the probability table is generated by the following steps:
Compute the mean vector and the variance vector corresponding to each codeword of the Gaussian codebook;
Using these mean and variance vectors, compute the logarithmic probability that each codeword of the feature codebook matches each codeword of the Gaussian codebook;
Store the matching probabilities of all codewords of the feature codebook against all codewords of the Gaussian codebook to obtain the probability table.
The Gaussian codebook is likewise obtained with the compression method of the invention described above.
With the above recognition method, the storage required by the system can be reduced and its recognition speed improved while its recognition performance is preserved.
Description of drawings
Fig. 1 is a block diagram of a known speech recognition system.
Fig. 2 is a schematic diagram of subspace division and clustering.
Fig. 3 is a flowchart of the improved K-means clustering algorithm of the present invention.
Fig. 4 is a flowchart of the known K-means clustering algorithm.
Fig. 5 is a recognition flowchart of the speech recognition system of an embodiment of the invention.
Fig. 6 is a block diagram of the speech recognition system of an embodiment of the invention.
Embodiment
The compression method for the feature vector set of a speech recognition system is explained first.
There are many kinds of speech features, such as LPC coefficients, cepstrum coefficients, filter-bank coefficients, and Mel-frequency cepstral coefficients (MFCC). The most commonly used feature parameters are MFCCs, but the choice of parameter does not matter here: the invention applies to any feature parameter. For ease of understanding, the compression method of the invention for the feature vector set of a speech recognition system is explained below with MFCC coefficients as an example.
Suppose that for each speech frame the L MFCC parameters, L first-order difference MFCC parameters, and L second-order difference MFCC parameters are merged into a 3L = X dimensional vector used as the feature parameters, forming an X-dimensional speech feature set; correspondingly, the Gaussian normal distributions in the acoustic model are also X-dimensional. As shown in Fig. 2, the X-dimensional vectors of the speech feature set or acoustic model 21 are first divided into Y subspaces 22, each of dimension X/Y = M. The Y M-dimensional feature subspaces and the Y M-dimensional Gaussian subspaces are then each clustered with the improved K-means clustering algorithm 23, and each feature subspace and each Gaussian subspace yields a codebook 24 containing a set number of codewords. Combining all feature codebooks then represents the feature space, and combining all Gaussian codebooks represents the Gaussian space of the acoustic model.
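The subspace division of Fig. 2 amounts to slicing each X-dimensional vector into Y contiguous M-dimensional pieces, which can be sketched as follows (our own names; contiguous slicing is an assumption, since the text does not specify how dimensions are grouped):

```python
def split_into_subspaces(vectors, Y):
    """Divide X-dimensional feature vectors into Y subspaces of
    M = X/Y dimensions each; returns one vector list per subspace."""
    X = len(vectors[0])
    assert X % Y == 0, "X must be divisible by Y"
    M = X // Y
    return [[v[s * M:(s + 1) * M] for v in vectors] for s in range(Y)]
```

Each returned subspace is then clustered independently into its own codebook.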
The following explains in detail how the improved K-means clustering method obtains the codebook describing an M-dimensional subspace. Suppose the M-dimensional subspace is M = {m_1, …, m_N}, i.e. the space contains N vectors, and these N vectors are to be clustered into a K-bit codebook; the value of K can be preset, and the resulting codebook contains 2^K codewords, each consisting of an M-dimensional center vector. The method is now described with reference to the flow of Fig. 3:
Step 100: set k = 0 bits and place all vectors of the subspace in one subset; compute the center vector c_j, j = 1, …, 2^k, of the subset to obtain the initial codebook. The center vector of a subset is computed as:
c_j = (1/N_j) Σ_{i = 1, …, N_j} m_ji, where N_j is the number of vectors in subset j = {m_j1, …, m_jN_j};
Step 110: if k = K, the K-bit codebook has been obtained and clustering ends; otherwise execute step 120;
Step 120: set k = k + 1 and split every subset in two; the split proceeds as follows:
First sub-step: for all vectors of each subset, compute their mean square deviation with respect to the subset's center vector c_j, denoted δ_j.
Second sub-step: generate two new center vectors as follows:
c_j1 = c_j + 0.5 · δ_j
c_j2 = c_j − 0.5 · δ_j
Third sub-step: collect all center vectors to form the k-bit codebook {c_1, …, c_2^k}.
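Step 120 can be sketched in code as follows. One assumption is made loudly here: δ_j is read as the per-dimension root-mean-square deviation of the subset's vectors from c_j, which is one plausible reading of "mean square deviation"; the function name is ours.

```python
def split_step(codebook, cells):
    """Split every codeword c_j into c_j + 0.5*delta_j and c_j - 0.5*delta_j,
    where delta_j is taken as the per-dimension RMS deviation of the
    subset's vectors from c_j (an assumed reading of the text)."""
    new = []
    for cw, cell in zip(codebook, cells):
        n = len(cell)
        delta = [(sum((v[d] - cw[d]) ** 2 for v in cell) / n) ** 0.5
                 for d in range(len(cw))]
        new.append([c + 0.5 * dl for c, dl in zip(cw, delta)])
        new.append([c - 0.5 * dl for c, dl in zip(cw, delta)])
    return new
```

Splitting along the direction of largest spread in each dimension gives the two children a better starting separation than a fixed tiny perturbation would.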
Step 130: find the center vector with the minimum distance metric from each vector of the subspace and assign each vector to the subset corresponding to that center vector; the concrete method is as follows:
First sub-step: for each vector m_l ∈ M, l = 1, …, N, find the center vector c_n(l) with the minimum distance metric from it, i.e.
n(l) = argmin_{j = 1, …, 2^k} d(m_l, c_j),
where d(m_l, c_j) is the distance between m_l and c_j.
Second sub-step: assign the vector to the subset corresponding to this center vector;
Step 140: compute the rate of change of the total distance metric of all vectors of the subspace; the total distance metric is initialized to D_1 = 1e−20.
Compute the total distance over all vectors of the subspace:
D_2 = Σ_{l = 1, …, N} d(m_l, c_n(l))
The rate of change of the total distance metric is η = |D_1 − D_2| / D_1.
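Step 140 can be written out directly; squared Euclidean distance again stands in for d(·,·), and the names are ours.

```python
def total_distance(vectors, codebook):
    """D = sum over l of d(m_l, c_n(l)): each vector contributes its
    distance to the nearest codeword (step 140's total distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return sum(min(dist(v, c) for c in codebook) for v in vectors)

def change_rate(d1, d2):
    # eta = |D1 - D2| / D1 from step 140
    return abs(d1 - d2) / d1
```

The tiny initial value D_1 = 1e−20 guarantees the first computed η is enormous, so the loop never terminates on the very first pass.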
Step 150: compare the rate of change η with a preset threshold θ on the rate of change of the total distance metric: if η ≤ θ, set D_1 = D_2 and return to step 110; if η > θ, execute step 160.
Step 160: merging and splitting of subsets.
Based on the vector count N_j and the total distance metric DT_j of each subset, subsets are merged and split by the following steps:
First sub-step: merging.
If N_j < φ, subset CL_j is merged, i.e. its center vector is deleted from the codebook; here j = 1, …, 2^k and φ is a preset vector-count threshold.
Second sub-step: splitting.
If a subset has been merged, a subset must be selected to split; the selection criterion is:
m = argmax_{j = 1, …, 2^k} (DT_j / N_j), where
DT_j = Σ_{i = 1, …, N_j} d(m_ji, c_j), the m_ji being the vectors of subset j.
That is, the total distance metric between all vectors of a subset and its center vector is divided by the number of vectors the subset contains, giving the average distance metric of each subset, and the subset CL_m with the largest average distance metric is split.
Step 170: compute the center vectors of the new subsets produced by the split according to the first and second sub-steps of step 120, form the new codebook together with the original center vectors, set D_1 = D_2, and return to step 130.
The codebook of the subspace is obtained by the above clustering algorithm; in our experimental results, the precision of this clustering algorithm is 18.5% higher than that of the original algorithm. Combining the codebooks of all feature subspaces yields the feature codebook, and combining the codebooks of all Gaussian subspaces yields the Gaussian codebook. The acoustic model and speech feature set obtained with the above compression method of the invention can be applied in various speech recognition systems, significantly reducing the storage occupied by the original acoustic model while retaining good recognition performance.
The application of this compression method to the acoustic model of a known speech recognition system is illustrated below.
Take the acoustic model adopted by a speaker-independent Chinese monosyllable speech recognition system as an example. The speech features of this system are 12th-order MFCCs, 12 first-order difference MFCCs and 12 second-order difference MFCCs, 36 parameters in total; the acoustic model uses CDHMM and occupies 4 Mbytes of storage. First the Gaussian space of the acoustic model is divided into 12 subspaces of 3 dimensions each; then each subspace of the Gaussian space is clustered with the improved K-Means algorithm described above, generating a 7-bit Gaussian code book of 128 code words. Since each Gaussian code word consists of a mean vector and a variance vector, the number of bytes occupied by each Gaussian code word is: Gaussian subspace dimension × 2 × 4 bytes.
To recover the original acoustic model from the Gaussian code book, an index table is also needed, so the size of the Gaussian code book can be calculated as follows:
number of subspaces × number of Gaussian code book code words × bytes per code word + index table
= 12 × 128 × 6 × 4 + 216000 = 252864 bytes
The storage space of system has been reduced to about 252K byte from the 4M byte, because the clustering precision of compression method of the present invention is very high, therefore adopt Gauss's code book that this algorithm obtains speech recognition system as acoustic model, except that the required storage space of the system of greatly reducing, the accuracy of identification of experimental result proof system only has small decline, thereby can be applicable in the embedded equipments such as mobile phone.
Figure 5 shows the recognition flow chart of the speech recognition system of an embodiment of the present invention. The recognition steps are as follows.
Step 300: transform the input voice analog signal into a digital signal;
Step 310: divide this digital signal into frames, extract the characteristic parameters of each frame of speech, and obtain the feature vector sequence of the input speech;
Step 320: quantization-encode the feature vector sequence using the feature code book to obtain the corresponding feature codeword sequence;
Step 330: perform the decoding operation to obtain the recognition result; for each code word in the feature codeword sequence, the Gaussian code word with the maximum matching probability is found by a direct lookup in the probability table during the operation.
The corresponding system block diagram is shown in Figure 6. This speech recognition system consists of an analog-to-digital conversion unit 61, a feature extraction unit 62, a feature code book 63, a quantization encoding unit 64, a probability table 65, a decoding operation unit 66 and a language model 67. Of course, if the system is applied only to Chinese monosyllable speech recognition, language model 67 is not needed. Wherein:
analog-to-digital conversion unit 61 transforms the input analog speech signal into a digital signal;
feature extraction unit 62 divides this digital signal into frames, extracts the characteristic parameters of each frame of speech, and obtains its feature vector sequence;
feature code book 63 is obtained by compressing the speech feature set, either with the above compression method of the present invention or with another known compression method;
quantization encoding unit 64 quantization-encodes the feature vector sequence of the input speech according to feature code book 63, converting it into a feature codeword sequence. Suppose the feature vector sequence of the input speech is {A_1, A_2, ..., A_T} with dimension X, and the number of subspaces in feature code book 63 is Y, each of dimension X/Y = M. The feature vector sequence is first likewise divided into Y subspaces, each corresponding to one code book. Its vector sequence in the i-th feature subspace is {O_i1, O_i2, ..., O_iT}, 1 ≤ i ≤ Y. Quantization-encoding the sequence {O_i1, O_i2, ..., O_iT} means finding the codeword sequence with the minimum distance metric from the corresponding feature code book {F_i1, F_i2, ..., F_iL}, where L is the number of code words in the code book. The steps are as follows:
First, calculate the distance metric between feature vector O_it and code word F_ij in the corresponding code book:

$$ D_{itj} = \sum_{m=1}^{M} \bigl(O_{it}(m) - F_{ij}(m)\bigr)^2, \quad 1 \le t \le T,\; 1 \le i \le Y,\; 1 \le j \le L $$

where O_it(m) and F_ij(m) are the m-th components of O_it and F_ij respectively.
Then find the index of the code word with the minimum distance metric to O_it:

$$ n_{it} = \arg\min_{1 \le j \le L} D_{itj}, \quad 1 \le t \le T,\; 1 \le i \le Y $$

Thus the code word corresponding to O_it after quantization encoding is F_{i,n_it}. Quantization-encoding each subspace of the feature vector sequence in this way yields the feature codeword sequence of that feature vector sequence.
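The quantization-encoding steps above can be sketched as follows — an illustrative vectorized version, not the patent's implementation; the array shapes and function name are assumptions:

```python
import numpy as np

def quantize(features, codebooks):
    """Sketch of quantization encoding unit 64: `features` is a (T, X)
    feature-vector sequence, `codebooks` a (Y, L, M) array of per-subspace
    feature code books with X = Y * M. Returns the (Y, T) array of nearest
    codeword indices n_it."""
    T, X = features.shape
    Y, L, M = codebooks.shape
    assert X == Y * M
    # split each feature vector into Y subvectors of dimension M
    sub = features.reshape(T, Y, M).transpose(1, 0, 2)          # (Y, T, M)
    # squared-distance metric D_itj between each subvector and each code word
    d = ((sub[:, :, None, :] - codebooks[:, None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=2)                                     # n_it, (Y, T)
```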
Probability table 65 stores, for each code word in feature code book 63, the probability value of each code word in the corresponding Gaussian code book (the code book set of the Gaussian space of the acoustic model may be obtained by compression with the above method of the present invention, or with another compression method). The probability table is generated as follows:
Suppose the code book of the i-th feature subspace is {F_i1, F_i2, ..., F_iL}, where L is the number of code words of this code book. The code book set of the Y M-dimensional subspaces of the X-dimensional feature space is then:
{F_11, F_12, ..., F_1L, ..., F_Y1, F_Y2, ..., F_YL}
Suppose the code book of the l-th Gaussian subspace is {G_l1, G_l2, ..., G_lL}, where L is the number of code words of this code book. The code book set of the Y M-dimensional subspaces of the X-dimensional Gaussian space is then:
{G_11, G_12, ..., G_1L, ..., G_Y1, G_Y2, ..., G_YL}
Suppose the mean and variance vectors corresponding to Gaussian code word G_lk are m_lk and σ_lk respectively. Since the probability values used in the decoding algorithm are usually log probabilities, the log probability that a code word F_ij in the feature code book matches a code word G_lk in the Gaussian code book can be calculated with the following formula:
$$ \ln\!\left( \frac{1}{\sqrt{2\pi\sigma_{lk}^2}} \exp\!\Bigl( -\,(F_{ij}-m_{lk})\cdot(F_{ij}-m_{lk}) \,/\, \sigma_{lk}^2 \Bigr) \right) $$
Calculating the matching probability of every code word in the feature code books against every code word in the Gaussian code books, and storing the results, yields the probability table.
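The probability-table generation can be sketched as follows for a single subspace pair. This is an assumption-laden illustration: it follows the formula as printed above (which omits the conventional factor 2 in the exponent) and applies it per dimension with a diagonal variance vector; the function name is hypothetical:

```python
import numpy as np

def build_prob_table(feat_cb, means, variances):
    """Sketch of generating probability table 65 for one subspace pair.
    feat_cb holds the L feature code words, shape (L, M); means/variances
    hold the L Gaussian code words, shape (L, M). Entry [j, k] is the log
    probability that feature code word F_j matches Gaussian code word G_k."""
    L, M = feat_cb.shape
    table = np.empty((L, L))
    for j in range(L):
        for k in range(L):
            diff = feat_cb[j] - means[k]
            # ln( 1/sqrt(2*pi*sigma^2) * exp(-diff.diff / sigma^2) ),
            # evaluated per dimension and summed (diagonal covariance)
            table[j, k] = np.sum(-0.5 * np.log(2 * np.pi * variances[k])
                                 - diff ** 2 / variances[k])
    return table
```

With 128 feature code words and 128 Gaussian code words per subspace, each such table has 128 × 128 entries, matching the probability-table size computed later in the text.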
Decoding operation unit 66 performs the decoding operation to obtain the recognition result; for each code word in the feature codeword sequence, the Gaussian code word with the maximum matching probability is found by a direct lookup in the probability table during the operation.
Language model 67: during continuous speech input, knowledge at the linguistic level can be used to improve the recognition accuracy of the system.
The above analog-to-digital conversion unit 61 can be implemented with an analog-to-digital conversion chip; the functions of feature extraction unit 62, quantization encoding unit 64 and decoding operation unit 66 can be performed by a CPU; and probability table 65, feature code book 63 and language model 67 are all stored in memory.
Therefore, with the above recognition system and method of the present invention, Gaussian probabilities need not be computed during decoding: the required probability values are simply looked up in the pre-stored probability table. This significantly reduces the decoding computation and thus greatly improves the recognition speed of the system.
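The lookup that replaces the Gaussian computation can be sketched as follows (a hypothetical helper; the table layout is the per-subspace one assumed above):

```python
import numpy as np

def best_gaussian(prob_table, feature_codeword):
    """Sketch of the lookup in decoding unit 66: the scores of one feature
    code word against every Gaussian code word form a precomputed table row,
    so no Gaussian density needs to be evaluated at decode time."""
    row = prob_table[feature_codeword]      # log probs vs all Gaussian code words
    return int(row.argmax()), float(row.max())
```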
A specific embodiment below illustrates the outstanding effect in reducing storage space and improving computation speed when the speech recognition system disclosed by the present invention adopts the feature code book and probability table obtained with the compression method of the present invention.
Again take a speaker-independent Chinese monosyllable speech recognition system as the example. In this system the speech features are 12th-order MFCCs, 12 first-order difference MFCCs and 12 second-order difference MFCCs, 36 parameters in total; the acoustic model uses CDHMM and occupies 4 Mbytes of storage. We divide the 36-dimensional vectors of the feature space of the speech feature set into 12 subspaces of 3 dimensions each, and correspondingly divide the Gaussian space of the acoustic model into 12 subspaces, also of 3 dimensions each. The improved K-Means clustering algorithm described above is then applied to each subspace of the feature space, generating a 7-bit feature code book of 128 code words, and likewise to each subspace of the Gaussian space, generating a 7-bit Gaussian code book of 128 code words. The probability value of each code word in the feature code book against each code word in the Gaussian code book is then calculated and stored in the probability table. The sizes of the feature code book and the probability table are calculated as follows:
Feature code book size:
number of subspaces × number of code book code words × subspace dimension × bytes per dimension = 12 × 128 × 3 × 4 = 18432 bytes.
Probability table size:
number of subspaces × number of feature-subspace code book code words × number of Gaussian-subspace code book code words × bytes per probability = 12 × 128 × 128 × 2 = 393216 bytes.
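These two size formulas can be checked directly:

```python
# Feature code book: 12 subspaces x 128 code words x 3 dims x 4 bytes
feature_cb = 12 * 128 * 3 * 4
# Probability table: 12 subspaces x 128 feature code words
#                    x 128 Gaussian code words x 2 bytes per probability
prob_table = 12 * 128 * 128 * 2
print(feature_cb, prob_table, feature_cb + prob_table)
```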
The acoustic model of the system is replaced by the feature code book and the probability table, so the storage of the system is reduced from 4 Mbytes to about 412 Kbytes; the required storage space is thus greatly reduced while the recognition accuracy of the system declines only slightly. At the same time, because the probability table significantly reduces the computation of the system, experimental results show that the recognition speed of the recognition system of the present invention can be improved by more than 50% compared with existing speech recognition systems.
The significant improvement of the recognition system of the present invention in storage space, computation speed and other aspects makes the application of full-syllable Chinese speech recognition on embedded devices such as mobile phones possible; applied to other devices, it can also optimize system performance and improve response speed.

Claims (14)

1. A compression method for a feature vector set used in a speech recognition system, the method first dividing the feature vector set into several subspaces and then clustering all vectors of each subspace to obtain a code book, wherein generating the code book of each subspace comprises the steps of:
(a) placing all vectors of the subspace in one subclass, calculating the center vector of this subclass, and obtaining the initialization code book;
(b) dividing every subclass in two and generating new center vectors to form a new code book of the subspace;
(c) for each vector of the subspace, finding the center vector with the minimum distance metric to that vector, and assigning each vector to the subclass corresponding to the center vector with the minimum distance metric;
(d) calculating the rate of change of the total distance metric of all vectors of the subspace;
(e) comparing this rate of change with a preset rate-of-change threshold: if the rate of change is less than or equal to the threshold, judging whether a code book of the predetermined number of bits has been obtained, and if so, finishing, otherwise executing step (b); if the rate of change is greater than the threshold, executing step (f);
(f) merging and dividing subclasses based on the number of vectors and the average distance metric of each subclass; and
(g) taking the center vectors of the subclasses obtained by the division as the code words representing those subclasses to obtain the new code book of the subspace, and returning to step (c).
2. The compression method for a feature vector set in a speech recognition system as claimed in claim 1, characterized in that said step (b) can be further divided into the following steps:
calculating the mean square deviation of all vectors of each subclass with respect to its center vector;
adding to each center vector half of its corresponding mean square deviation to obtain one new center vector, and subtracting from each center vector half of its corresponding mean square deviation to obtain another new center vector; and
combining the newly generated center vectors of the subspace to obtain the new code book.
3. The compression method for a feature vector set in a speech recognition system as claimed in claim 1, characterized in that the rate of change of the total distance metric of all vectors of said subspace is calculated as follows: calculating the sum of the distances between the vectors of said subspace and the center vectors with the minimum distance metric to them, obtaining the new total distance metric; subtracting the new total distance metric from the former total distance metric to obtain a difference; and dividing the absolute value of this difference by the former total distance metric to obtain the rate of change of the total distance metric.
4. The compression method for a feature vector set in a speech recognition system as claimed in claim 1, characterized in that in said step (e), before judging whether a code book of the predetermined number of bits has been obtained, there is a further step of assigning the new total distance metric value to the former total distance metric value; and in said step (g), before returning to step (c), there is a further step of assigning the new total distance metric value to the former total distance metric value.
5. The compression method for a feature vector set in a speech recognition system as claimed in claim 1, characterized in that said merging means deleting from the code book the center vector of each subclass whose number of vectors is less than a certain preset value.
6. The compression method for a feature vector set in a speech recognition system as claimed in claim 5, characterized in that said division means that, after a subclass has been merged, the sum of the distances between all vectors in each subclass and the center vector of that subclass is first calculated, the ratio of this sum to the number of vectors of that subclass is then calculated, and the subclass with the largest ratio is divided into two subclasses while two new center vectors are generated.
7. The compression method for a feature vector set in a speech recognition system as claimed in claim 1, characterized in that the extracted features are LPC coefficients, cepstrum coefficients, filter-bank coefficients or MFCC coefficients.
8. The compression method for a feature vector set in a speech recognition system as claimed in any one of claims 1 to 7, characterized in that the feature vector set is an acoustic model or a speech feature set; a Gaussian code book is obtained after the acoustic model is compressed, and a feature code book is obtained after the speech feature set is compressed.
9. A speech recognition system comprising at least an analog-to-digital conversion unit, a feature extraction unit, a decoding operation unit and an acoustic model, for receiving a voice input signal and obtaining a matching recognition result, wherein:
the analog-to-digital conversion unit converts the voice input signal into a digital signal;
the feature extraction unit divides the digital signal into frames and extracts speech characteristic parameters to obtain the feature vector sequence of the input speech;
the decoding operation unit performs the decoding operation on the feature vector sequence to obtain the recognition result; and
a Gaussian code book is obtained after the acoustic model is compressed.
10. The speech recognition system as claimed in claim 9, characterized by further comprising a language model.
11. A speech recognition system for receiving a voice input signal and obtaining a matching recognition result, comprising at least:
an analog-to-digital conversion unit, which transforms the input analog speech signal into a digital signal;
a feature extraction unit, which divides the digital signal into frames, extracts the characteristic parameters of each frame of speech, and obtains its feature vector sequence;
a feature code book, obtained by compressing a speech feature set;
a quantization encoding unit, which converts the feature vector sequence of the input speech into a feature codeword sequence according to the feature code book;
a probability table, which stores the probability value of each code word in the Gaussian code book corresponding to each code word in the feature code book, the Gaussian code book being the Gaussian code book described in claim 8; and
a decoding operation unit, which performs the decoding operation on the feature codeword sequence to obtain the recognition result, finding for each code word in the feature codeword sequence the Gaussian code word with the maximum matching probability by a direct lookup in the probability table.
12. The speech recognition system as claimed in claim 11, characterized by further comprising a language model.
13. The speech recognition system as claimed in claim 11, characterized in that the quantization encoding unit converts the feature vector sequence of the input speech into the feature codeword sequence according to the following steps:
dividing said feature vector sequence into the same number of subspaces as said feature code book, each subspace corresponding to one code book;
calculating the distance metric between every feature vector in each subspace and each code word in the corresponding code book, and taking the code word with the minimum distance metric to a feature vector as the code word corresponding to that feature vector in said feature codeword sequence; and
combining the code words corresponding to all vectors of each subspace of said feature vector sequence in the original vector order to obtain the corresponding feature-code-book codeword sequence.
14. The speech recognition system as claimed in claim 11, characterized in that said probability table is generated by the following steps:
calculating the mean and variance vectors corresponding to each code word in the Gaussian code book;
using the above mean and variance vectors, calculating the log probability value that each code word in said feature code book matches each code word in the Gaussian code book; and
storing the matching probability values of all code words in the feature code book against all code words in the Gaussian code book to obtain the probability table.
CNB021486832A 2002-09-30 2002-11-15 Voice identifying system and compression method of characteristic vector set for voice identifying system Expired - Lifetime CN1190772C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB021486832A CN1190772C (en) 2002-09-30 2002-11-15 Voice identifying system and compression method of characteristic vector set for voice identifying system

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN02131087 2002-09-30
CN02131087.4 2002-09-30
CNB021486832A CN1190772C (en) 2002-09-30 2002-11-15 Voice identifying system and compression method of characteristic vector set for voice identifying system

Related Child Applications (2)

Application Number Title Priority Date Filing Date
CNB2004100701402A Division CN1259648C (en) 2002-11-15 2002-11-15 Phonetic recognition system
CNB200410070139XA Division CN1284134C (en) 2002-11-15 2002-11-15 A speech recognition system

Publications (2)

Publication Number Publication Date
CN1455388A CN1455388A (en) 2003-11-12
CN1190772C true CN1190772C (en) 2005-02-23

Family

ID=29271370

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB021486832A Expired - Lifetime CN1190772C (en) 2002-09-30 2002-11-15 Voice identifying system and compression method of characteristic vector set for voice identifying system

Country Status (1)

Country Link
CN (1) CN1190772C (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8214204B2 (en) 2004-07-23 2012-07-03 Telecom Italia S.P.A. Method for generating a vector codebook, method and device for compressing data, and distributed speech recognition system

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101452704B (en) * 2007-11-29 2011-05-11 中国科学院声学研究所 Speaker clustering method based on information transfer
CN102623008A (en) * 2011-06-21 2012-08-01 中国科学院苏州纳米技术与纳米仿生研究所 Voiceprint identification method
CN104766608A (en) * 2014-01-07 2015-07-08 深圳市中兴微电子技术有限公司 Voice control method and voice control device
CN105374359B (en) * 2014-08-29 2019-05-17 中国电信股份有限公司 The coding method and system of voice data
CN105893389A (en) * 2015-01-26 2016-08-24 阿里巴巴集团控股有限公司 Voice message search method, device and server
US10411728B2 (en) * 2016-02-08 2019-09-10 Koninklijke Philips N.V. Device for and method of determining clusters
CN106409285A (en) * 2016-11-16 2017-02-15 杭州联络互动信息科技股份有限公司 Method and apparatus for intelligent terminal device to identify language type according to voice data
CN108389576B (en) * 2018-01-10 2020-09-01 苏州思必驰信息科技有限公司 Method and system for optimizing compressed speech recognition model


Also Published As

Publication number Publication date
CN1455388A (en) 2003-11-12


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C14 Grant of patent or utility model
GR01 Patent grant
CX01 Expiry of patent term

Granted publication date: 20050223