CN1282069A

CN1282069A - On-palm computer speech identification core software package

Info

Publication number: CN1282069A
Application number: CN99111131A
Authority: CN
Inventors: 邓勇刚; 徐波; 黄泰翼
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 1999-07-27
Filing date: 1999-07-27
Publication date: 2001-01-31

Abstract

The palmtop computer speech recognition kernel software package is a speech recognition application programming interface (API) for speaker-dependent, limited-vocabulary, isolated-word recognition in a palmtop computer environment. Speech recognition belongs to the field of pattern recognition. The invention builds a speaker-dependent continuous-density hidden Markov model for each isolated word. The package interface covers starting and ending recognition, training/recognition, model management, menu management and recognition parameter configuration. It uses an endpoint detection algorithm based on time-domain energy, extracts LPC cepstrum feature parameters, recognizes with the Viterbi search algorithm, and uses a neural network to make the recognition/rejection decision.

Description

The on-palm computer speech identification core software package

The present invention is a complete solution for a speaker-dependent, isolated-word speech recognition kernel software package in a palmtop computer environment. It comprises the design of the overall framework, the classification of the interfaces, and the implementation algorithms. Speech recognition is a typical pattern recognition problem.

Speech recognition has developed rapidly in recent years. Notable progress has been made on problems such as modeling, training, search, robustness and adaptation, and considerable practical experience and theory have accumulated. The modeling method used in mainstream algorithms is the continuous-density hidden Markov model, which is based on statistical regularities. It processes large volumes of data through many complex computations, so both training and recognition consume a great deal of storage and computing resources. Compared with a personal computer, a palmtop computer is slow (roughly equivalent to a 386-class machine), has little memory (typically only 8 MB), and has a built-in recording device with a very low signal-to-noise ratio. In addition, practical use requires that the number of training samples be small. Developing a practical speech recognition software package in such an environment is therefore difficult.

The object of the invention is to provide, under resource constraints and taking into account that a palmtop computer is normally used by a single owner, a complete solution for a speaker-dependent, isolated-word speech recognition kernel software package for palmtop computers, so that speech recognition can be added to palmtop applications very easily.

The technical essentials of the invention are shown in Figure 1. The package comprises a bottom layer, a transition layer and an interface layer. The bottom layer preprocesses the speech signal produced by the recording device, detects endpoints, extracts features, trains models, and makes recognition/rejection decisions. The interface layer provides the interface functions that applications need. The transition layer connects the core implementation in the bottom layer to the interface layer.

The invention uses an object-oriented approach. All APIs are encapsulated in an interface class; an application only needs to create an instance of this interface and can then call any interface function through the interface pointer. To make applications more flexible, the invention predefines a notification class, IVCmdNotifySink, all of whose functions are pure virtual. The application must implement a class derived from this notification class, which tells the recognition core what to do when it detects a speech event (such as the start or end of speech, or the recognition of a command). It is defined as follows:

class IVCmdNotifySink {
public:
    virtual void UtteranceBegin(void) = 0;              /* Start of sound detected */
    virtual void UtteranceEnd(void) = 0;                /* End of sound */
    virtual void VUMeter(int Volume) = 0;               /* Volume (from 0 to 100) */
    virtual void CmdStart(void) = 0;                    /* Speech recognition begins */
    virtual void CmdNotUnderstand(void) = 0;            /* Not understood: rejected */
    virtual void CmdRecognize(SRecogResult result) = 0; /* Recognition result */
};
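As a minimal sketch of how an application might derive from the notification class: the IVCmdNotifySink declaration below repeats the one given in the description, while the fields of SRecogResult (commandId, score), the LoggingSink class and its event log are hypothetical, since the text does not say what SRecogResult contains or how an application should react to each event.

```cpp
#include <string>
#include <vector>

// Hypothetical result type: the description names SRecogResult but does not list its fields.
struct SRecogResult {
    int commandId;  // assumed: index of the recognized command
    int score;      // assumed: fixed-point model score
};

// Notification class as declared in the description.
class IVCmdNotifySink {
public:
    virtual ~IVCmdNotifySink() {}  // added here for safe deletion; not in the original
    virtual void UtteranceBegin(void) = 0;
    virtual void UtteranceEnd(void) = 0;
    virtual void VUMeter(int Volume) = 0;
    virtual void CmdStart(void) = 0;
    virtual void CmdNotUnderstand(void) = 0;
    virtual void CmdRecognize(SRecogResult result) = 0;
};

// Hypothetical application-side sink that records each event as text.
class LoggingSink : public IVCmdNotifySink {
public:
    std::vector<std::string> events;
    int lastVolume = 0;

    void UtteranceBegin(void) override { events.push_back("begin"); }
    void UtteranceEnd(void) override { events.push_back("end"); }
    void VUMeter(int Volume) override { lastVolume = Volume; }
    void CmdStart(void) override { events.push_back("start"); }
    void CmdNotUnderstand(void) override { events.push_back("rejected"); }
    void CmdRecognize(SRecogResult result) override {
        events.push_back("recognized:" + std::to_string(result.commandId));
    }
};
```

In a real application the address of such an object would be handed to the Register interface function; a user interface would typically update a volume meter in VUMeter and act on the command in CmdRecognize.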

The relationship between the application and the recognition core of the software package can be seen in Figure 1.

The invention uses the concept of a "menu", which is simply a set of commands. Commands can be added and trained at any time. Figure 2 shows the interface for training a command. Each command is trained with three samples, and to help ensure correct recognition results the invention provides playback of the recorded training samples. By function, the APIs provided by the invention fall into the following five classes:

1. Start/end recognition: this class of interface registers the notification class and shuts down the whole recognition core. It comprises two functions. Register should be called when the software package starts, and performs initialization. End is called before the application exits; it frees memory and saves settings such as the models and parameters;

2. Training/recognition: controls the start and pause of recognition, and the training process;

3. Model management: includes adding and deleting a model, querying whether a model exists, retrieving a model by index, and so on;

4. Menu management: includes creating a voice menu, selecting which menu is currently in use, adding command items to or deleting them from the current menu, and querying and setting the current state of a command item;

5. Recognition parameter configuration: accuracy can be improved if the recognizer is adjusted to the application environment, for example the background noise or the speaking rate. The voice-control core provides this flexibility to suit different application requirements, including whether to enable rejection and how to configure the endpoint detection algorithm. The parameter settings are intuitive, and the application developer does not need much specialist knowledge of speech recognition.

As shown in Figure 1, the bottom-layer core processing of the invention comprises recording, preprocessing, endpoint detection, feature extraction, the training algorithm, and the recognition/rejection decision. These are described in turn below:

1. Recording: the built-in recording device of the palmtop is used. The speech signal is sampled at 8 kHz with 16 bits per sample.

2. Preprocessing: each frame has 392 samples, and adjacent frames overlap by half a frame. The pre-emphasis formula is: x'[n] = x[n] - 0.97*x[n-1].
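The preprocessing step can be sketched as follows. This is an illustration only: the function names preEmphasize and splitFrames are hypothetical, double-precision values stand in for the 16-bit samples, and two details the text leaves open are filled with assumptions (the first sample is passed through unchanged, and trailing samples that do not fill a whole frame are dropped).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pre-emphasis x'[n] = x[n] - 0.97 * x[n-1].
// Assumption: the first sample is copied unchanged because x[-1] is undefined.
std::vector<double> preEmphasize(const std::vector<double>& x, double coeff = 0.97) {
    std::vector<double> y(x.size());
    for (std::size_t n = 0; n < x.size(); ++n) {
        y[n] = (n == 0) ? x[0] : x[n] - coeff * x[n - 1];
    }
    return y;
}

// Split a signal into frames of frameLen samples, advancing by hop samples.
// With frameLen = 392 and hop = 196, adjacent frames overlap by half a frame.
// Assumption: an incomplete final frame is dropped.
std::vector<std::vector<double>> splitFrames(const std::vector<double>& x,
                                             std::size_t frameLen = 392,
                                             std::size_t hop = 196) {
    assert(frameLen > 0 && hop > 0);
    std::vector<std::vector<double>> frames;
    for (std::size_t start = 0; start + frameLen <= x.size(); start += hop) {
        frames.emplace_back(x.begin() + start, x.begin() + start + frameLen);
    }
    return frames;
}
```

At the 8 kHz sampling rate given above, a 392-sample frame covers 49 ms and a 196-sample hop yields one frame every 24.5 ms.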

3. Endpoint detection: based on time-domain energy; the energy of each frame is the sum of the squares of its samples after preprocessing. The procedure, shown in Figure 3, comprises the following five steps: estimate the background noise and compute the upper and lower thresholds, mark the start, confirm the start, mark the end point, and confirm the end.
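One way the five steps could be arranged is as a small state machine over frame energies. The sketch below is not the flow of Figure 3, which is not reproduced in this text. Every numeric choice is an assumption made for illustration: the noise level is taken as the mean energy of the first noiseFrames frames, the thresholds are fixed multiples of it, and a start or end is confirmed after confirmFrames consecutive frames. The names Endpoints, frameEnergy and detectEndpoints are hypothetical, and the dynamic noise tracking mentioned among the advantages is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Endpoints {
    bool found;         // true if both a start and an end were confirmed
    std::size_t begin;  // index of the first speech frame
    std::size_t end;    // index one past the last speech frame
};

// Frame energy: sum of squared samples of one preprocessed frame.
double frameEnergy(const std::vector<double>& frame) {
    double e = 0.0;
    for (double s : frame) e += s * s;
    return e;
}

Endpoints detectEndpoints(const std::vector<double>& energy,
                          std::size_t noiseFrames = 5,
                          double lowFactor = 2.0, double highFactor = 4.0,
                          std::size_t confirmFrames = 3) {
    assert(noiseFrames > 0 && confirmFrames > 0);
    Endpoints result{false, 0, 0};
    if (energy.size() <= noiseFrames) return result;

    // Step 1: estimate the background noise and derive both thresholds.
    double noise = 0.0;
    for (std::size_t i = 0; i < noiseFrames; ++i) noise += energy[i];
    noise /= static_cast<double>(noiseFrames);
    const double low = noise * lowFactor;
    const double high = noise * highFactor;

    enum State { Silence, MaybeSpeech, Speech, MaybeSilence };
    State state = Silence;
    std::size_t mark = 0, run = 0;
    for (std::size_t i = noiseFrames; i < energy.size(); ++i) {
        const double e = energy[i];
        switch (state) {
        case Silence:       // Step 2: mark a tentative start.
            if (e > high) { state = MaybeSpeech; mark = i; run = 1; }
            break;
        case MaybeSpeech:   // Step 3: confirm the start, or fall back to silence.
            if (e > high) {
                if (++run >= confirmFrames) { state = Speech; result.begin = mark; }
            } else {
                state = Silence;
            }
            break;
        case Speech:        // Step 4: mark a tentative end point.
            if (e < low) { state = MaybeSilence; mark = i; run = 1; }
            break;
        case MaybeSilence:  // Step 5: confirm the end, or resume speech.
            if (e < low) {
                if (++run >= confirmFrames) {
                    result.end = mark;
                    result.found = true;
                    return result;
                }
            } else {
                state = Speech;
            }
            break;
        }
    }
    return result;
}
```

The confirmation states are what keep a single loud click from being reported as speech and a short pause inside a word from ending the utterance.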

4. Feature extraction: the feature parameters are LPC cepstra. The pre-emphasized samples are first passed through a Hamming window, then the autocorrelation coefficients are computed, then the standard Levinson-Durbin recursion yields the 12th-order LPC parameters, from which the 12th-order LPC cepstrum parameters are derived recursively, and finally first- and second-order differences are taken. The feature vector consists of the normalized time-domain energy with its first- and second-order differences, and the 12 LPC cepstrum parameters with their first- and second-order differences, 39 dimensions in total. The feature parameters are scaled up by a fixed factor and represented as 16-bit integers, so that the subsequent training and recognition can be implemented in fixed-point arithmetic. The formulas for computing the LPC cepstrum are as follows:

1) Hamming window function:

2) Autocorrelation coefficients: let the speech signal after preprocessing and windowing be s'[n]; then the autocorrelation coefficients are:

The normalized autocorrelation coefficients are:

r[l] = R[l] / R[0]

where l = 0, 1, 2, ..., P
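The formula images for the window and for R[l] did not survive in this text. The sketch below therefore assumes the textbook definitions, w[n] = 0.54 - 0.46 cos(2πn/(N-1)) for the Hamming window and R[l] = Σ s'[n] s'[n+l] for the autocorrelation; the patent's own expressions may differ in detail. The function names are hypothetical, and floating point is used instead of the fixed-point arithmetic described above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Multiply a frame by the standard Hamming window
// w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1)), n = 0..N-1 (assumed definition).
std::vector<double> applyHamming(const std::vector<double>& frame) {
    const std::size_t N = frame.size();
    assert(N > 1);
    const double pi = std::acos(-1.0);
    std::vector<double> out(N);
    for (std::size_t n = 0; n < N; ++n) {
        const double w = 0.54 - 0.46 * std::cos(2.0 * pi * static_cast<double>(n) /
                                                static_cast<double>(N - 1));
        out[n] = frame[n] * w;
    }
    return out;
}

// R[l] = sum_{n=0}^{N-1-l} s[n] * s[n+l] (assumed definition),
// followed by the normalization r[l] = R[l] / R[0] for l = 0..P.
std::vector<double> normalizedAutocorrelation(const std::vector<double>& s,
                                              std::size_t P = 12) {
    assert(s.size() > P);
    std::vector<double> r(P + 1, 0.0);
    for (std::size_t l = 0; l <= P; ++l) {
        for (std::size_t n = 0; n + l < s.size(); ++n) r[l] += s[n] * s[n + l];
    }
    const double r0 = r[0];
    assert(r0 > 0.0);  // an all-zero frame cannot be normalized
    for (double& v : r) v /= r0;
    return r;
}
```

Normalizing by R[0] keeps every r[l] within [-1, 1], which suits the 16-bit fixed-point representation the description calls for.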

3) Levinson-Durbin recursion for the 12th-order LPC parameters:

E_{0} = r[0]

k_{i} = \frac{r[i] - \sum_{j=1}^{i-1} \alpha_{j}^{(i-1)} \cdot r[i-j]}{E_{i-1}}, 1 \leq i \leq P

\alpha_{i}^{(i)} = k_{i}

\alpha_{j}^{(i)} = \alpha_{j}^{(i-1)} - k_{i} \cdot \alpha_{i-j}^{(i-1)}, 1 \leq j \leq i-1

E_{i} = (1 - k_{i}^{2}) E_{i-1}

where P = 12 is the prediction order, and the final \alpha_{j}^{(P)}, 1 \leq j \leq P, are the LPC prediction coefficients.
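A compact floating-point sketch of the Levinson-Durbin recursion follows. The function name levinsonDurbin and the 1-based storage of the coefficients (index 0 unused) are choices made for this illustration; a fixed-point version as described in the text would need additional scaling that is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Levinson-Durbin recursion on autocorrelation values r[0..P].
// Returns a[1..P] (a[0] is unused) such that sample s[n] is predicted by
// sum_{j=1}^{P} a[j] * s[n-j].
std::vector<double> levinsonDurbin(const std::vector<double>& r, std::size_t P = 12) {
    assert(r.size() > P);
    std::vector<double> a(P + 1, 0.0);     // coefficients of the current order i
    std::vector<double> prev(P + 1, 0.0);  // coefficients of order i - 1
    double E = r[0];                       // E_0 = r[0]
    for (std::size_t i = 1; i <= P; ++i) {
        assert(E > 0.0);
        double acc = r[i];
        for (std::size_t j = 1; j < i; ++j) acc -= prev[j] * r[i - j];
        const double k = acc / E;          // reflection coefficient k_i
        a[i] = k;                          // alpha_i^(i) = k_i
        for (std::size_t j = 1; j < i; ++j) {
            a[j] = prev[j] - k * prev[i - j];
        }
        E *= (1.0 - k * k);                // E_i = (1 - k_i^2) * E_{i-1}
        prev = a;
    }
    return a;
}
```

For the autocorrelation sequence of a first-order process, r[l] = 0.5^l, the recursion returns a[1] = 0.5 and zero for all higher coefficients, which is a convenient sanity check.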

4) Cepstrum parameters:

h_{1} = a_{1}

h_{n} = a_{n} + \sum_{k=1}^{n-1} (1 - \frac{k}{n}) a_{k} h_{n-k}, 1 < n \leq P

where h_{n} are the cepstrum coefficients and a_{n} are the LPC coefficients.
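The cepstrum recursion translates almost line for line into code. In this sketch the function name lpcToCepstrum is hypothetical, the input uses 1-based indexing (a[0] unused) to match the formulas, and floating point replaces the fixed-point arithmetic of the actual package.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Convert LPC coefficients a[1..P] (a[0] unused) into cepstrum coefficients
// h[1..P] (h[0] unused) with
//   h_1 = a_1,  h_n = a_n + sum_{k=1}^{n-1} (1 - k/n) * a_k * h_{n-k}.
std::vector<double> lpcToCepstrum(const std::vector<double>& a) {
    assert(a.size() >= 2);
    const std::size_t P = a.size() - 1;
    std::vector<double> h(P + 1, 0.0);
    h[1] = a[1];
    for (std::size_t n = 2; n <= P; ++n) {
        double sum = 0.0;
        for (std::size_t k = 1; k < n; ++k) {
            const double weight = 1.0 - static_cast<double>(k) / static_cast<double>(n);
            sum += weight * a[k] * h[n - k];
        }
        h[n] = a[n] + sum;
    }
    return h;
}
```

For a single-pole model with a_1 = 0.5 and all other coefficients zero, the recursion gives h_n = 0.5^n / n, the known cepstrum of a first-order all-pole filter.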

5. Training: three training samples are recorded for each command, and an 8-state, left-to-right continuous-density hidden Markov model is built as shown in Figure 4. The model is initialized from a uniform segmentation of the samples. The training samples are then segmented with the Viterbi algorithm using the current model parameters, and the model parameters are re-estimated. This is iterated several times, for example 5 times, to obtain the final model parameters.
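The two segmentation steps of training can be sketched as below. Several things are assumed rather than taken from the text: the topology allows only self-loops and moves to the next state (Figure 4 is not reproduced here, so skips may exist in the original), the per-frame log output densities are supplied precomputed in logEmit, and the names uniformSegmentation, Alignment and viterbiAlign are hypothetical. Re-estimating the Gaussian parameters from the alignment is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Initial alignment: spread T frames over numStates states in equal-length runs.
std::vector<std::size_t> uniformSegmentation(std::size_t T, std::size_t numStates = 8) {
    assert(numStates > 0 && T >= numStates);
    std::vector<std::size_t> state(T);
    for (std::size_t t = 0; t < T; ++t) state[t] = t * numStates / T;
    return state;
}

struct Alignment {
    double logScore;                 // log-likelihood of the best state path
    std::vector<std::size_t> state;  // best state for each frame
};

// Viterbi alignment for a left-to-right model without skips (assumed topology).
// logEmit[t][s]: log output density of frame t in state s.
// logStay[s] / logNext[s]: log probability of staying in s / moving to s + 1.
Alignment viterbiAlign(const std::vector<std::vector<double>>& logEmit,
                       const std::vector<double>& logStay,
                       const std::vector<double>& logNext) {
    const std::size_t T = logEmit.size();
    assert(T > 0);
    const std::size_t S = logEmit[0].size();
    assert(S > 0 && T >= S && logStay.size() == S && logNext.size() == S);
    const double NEG = -std::numeric_limits<double>::infinity();
    std::vector<std::vector<double>> delta(T, std::vector<double>(S, NEG));
    std::vector<std::vector<bool>> cameFromPrev(T, std::vector<bool>(S, false));
    delta[0][0] = logEmit[0][0];  // every path starts in the first state
    for (std::size_t t = 1; t < T; ++t) {
        for (std::size_t s = 0; s < S; ++s) {
            double best = delta[t - 1][s] + logStay[s];
            bool fromPrev = false;
            if (s > 0) {
                const double move = delta[t - 1][s - 1] + logNext[s - 1];
                if (move > best) { best = move; fromPrev = true; }
            }
            delta[t][s] = best + logEmit[t][s];
            cameFromPrev[t][s] = fromPrev;
        }
    }
    // Every path must end in the last state; trace it back to the first frame.
    Alignment result{delta[T - 1][S - 1], std::vector<std::size_t>(T)};
    std::size_t s = S - 1;
    for (std::size_t t = T; t-- > 0;) {
        result.state[t] = s;
        if (t > 0 && cameFromPrev[t][s]) --s;
    }
    return result;
}
```

A training loop would start from uniformSegmentation, estimate state parameters from the frames assigned to each state, and then alternate viterbiAlign with re-estimation for the chosen number of passes. The same logScore, divided by the number of frames, can serve as the frame-normalized score used during recognition.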

6. Recognition/rejection decision: the standard Viterbi search algorithm computes the log-likelihood score of the unknown sample against each model. Three index parameters are then extracted for the best-scoring model and fed into the neural network shown in Figure 5, which makes the recognition/rejection decision. The detailed procedure is as follows, where N is the total number of models in the application:

(1) For the unknown pattern X, compute the log-likelihood scores of the N models with the standard Viterbi search algorithm, and normalize them by frame length;

(2) Sort the N scores from high to low; let the scores be S_{1}, ..., S_{N};

(3) Compute three indices x_{1}, x_{2}, x_{3} for the top-ranked model:

x_{1} = S_{1} / mean_{1}

x_{2} = S_{2} / S_{1}

x_{3} = (S_{1} - \frac{1}{M} \sum_{k \neq 1, S_{k} \geq \alpha S_{1}} S_{k}) / S_{1}

where mean_{1} is the mean of the frame-length-normalized scores of the training samples of model 1, and \alpha is a constant between 0.5 and 0.9. Index 3 is a confidence measure for model 1: the sum divided by M is the mean score of those models whose scores are close to that of model 1 (M being the number of such models), and the degree of closeness is determined by the constant \alpha;
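The three indices can be computed in a few lines. This sketch makes assumptions the text does not settle: scores are treated as positive numbers with higher meaning better (the threshold α·S_1 is only meaningful under some such convention), M is taken to be the number of models passing the closeness test, and when no model passes, the mean is set to 0 so that x_3 becomes 1. The names RejectionIndices and computeIndices, and the parameter meanTop (the training mean of whichever model ranks first), are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

struct RejectionIndices {
    double x1;  // top score relative to that model's training mean
    double x2;  // second-best score relative to the top score
    double x3;  // margin between the top score and the mean of its close competitors
};

// scores: frame-normalized scores of all N models (assumed positive, higher is better).
// meanTop: training-time mean score of the top-ranked model.
// alpha: closeness constant, between 0.5 and 0.9 according to the description.
RejectionIndices computeIndices(std::vector<double> scores, double meanTop,
                                double alpha = 0.8) {
    assert(scores.size() >= 2 && meanTop != 0.0);
    std::sort(scores.begin(), scores.end(), std::greater<double>());
    const double s1 = scores[0];
    assert(s1 != 0.0);
    double sum = 0.0;
    std::size_t M = 0;
    for (std::size_t k = 1; k < scores.size(); ++k) {
        if (scores[k] >= alpha * s1) { sum += scores[k]; ++M; }
    }
    // Assumption: with no close competitor the mean is taken as 0, giving x3 = 1.
    const double closeMean = (M > 0) ? sum / static_cast<double>(M) : 0.0;
    return RejectionIndices{s1 / meanTop, scores[1] / s1, (s1 - closeMean) / s1};
}
```

Because all three values are ratios, they stay on comparable scales regardless of utterance length, which is the property the description later cites as an advantage.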

(4) Make the recognition/rejection decision from the three indices:

y_{0} = x_{1} W_{01} + x_{2} W_{02} + x_{3} W_{03}

y_{1} = x_{1} W_{11} + x_{2} W_{12} + x_{3} W_{13}

where the W coefficients are the trained network weights.

If y_{0} > y_{1}, the sample is recognized as model 1, and the several top-scoring candidates are also returned. If y_{0} \leq y_{1}, the sample is rejected.
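The decision itself is two weighted sums and a comparison. In the sketch below the function name acceptTopCandidate and the layout of the weight matrix (row 0 for y_0, row 1 for y_1) are choices made for illustration; the weights are placeholders, not trained values.

```cpp
#include <array>
#include <cstddef>

// Returns true to accept the top-ranked model and false to reject.
// x holds the indices x1..x3; W[0] holds W01..W03 and W[1] holds W11..W13.
bool acceptTopCandidate(const std::array<double, 3>& x,
                        const std::array<std::array<double, 3>, 2>& W) {
    double y[2] = {0.0, 0.0};
    for (std::size_t out = 0; out < 2; ++out) {
        for (std::size_t i = 0; i < 3; ++i) y[out] += x[i] * W[out][i];
    }
    return y[0] > y[1];  // y0 <= y1 means rejection
}
```

Ties go to rejection, matching the rule that the sample is rejected whenever y_0 does not strictly exceed y_1.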

7. Learning algorithm for the network weights: the initial network weights are chosen at random and then adjusted by a supervised learning algorithm: \Delta W_{i} = a(t) [x_{i}(t) - W_{i}(t)], where x_{i}(t) is the input sample at time t, the adjustment coefficient is a(t) = 0.1*(1.0 - t/M), and M is the number of training iterations.
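The update rule and its decaying coefficient can be sketched as follows. The text does not say how the supervision selects which output unit's weights are updated for a given training sample, so this sketch simply updates whichever weight vector the caller passes in. The function names learningRate and updateWeights are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Adjustment coefficient a(t) = 0.1 * (1.0 - t / M), decaying linearly to 0 at t = M.
double learningRate(std::size_t t, std::size_t M) {
    assert(M > 0 && t <= M);
    return 0.1 * (1.0 - static_cast<double>(t) / static_cast<double>(M));
}

// One update step: W_i <- W_i + a(t) * (x_i - W_i) for every component i.
// Assumption: the caller has already chosen which unit's weight vector W to move.
void updateWeights(std::array<double, 3>& W, const std::array<double, 3>& x,
                   std::size_t t, std::size_t M) {
    const double a = learningRate(t, M);
    for (std::size_t i = 0; i < W.size(); ++i) W[i] += a * (x[i] - W[i]);
}
```

Each step moves the weight vector a fraction a(t) of the way toward the input, so early samples shift the weights by up to 10% of the gap and later samples by progressively less.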

The advantages of the invention are:

1. The maximum vocabulary can reach 200 words, few system resources are used, the recognition accuracy for ordinary personal names exceeds 95%, and the method is independent of accent, dialect and language. The fixed-point implementation greatly increases the speed of feature extraction and recognition, so that processing runs in real time on the palmtop platform and is practical;

2. The software package framework is well designed and the interface functions are complete, meeting the needs of typical palmtop applications such as voice navigation, business-card management and voice dialing. The interface functions are also easy to use: the developer does not need much specialist knowledge of speech recognition;

3. The background noise is estimated dynamically, and both the detected start point and the detected end point go through a confirmation step, so the endpoint detection algorithm adapts to changing environments and is accurate and robust;

4. When the neural network makes the recognition/rejection decision, there is no need to decide empirically how much weight each factor should carry: the network adjusts automatically and weights the factors jointly. A simple threshold strategy is avoided. The indices used for the 0/1 decision are all ratios, so each index is on the same scale, which narrows the dynamic range and gives the approach some generality. The indices are sensibly defined, take similar-sounding words into account, and use as much of the available information as possible.

Description of the drawings:

Fig. 1 shows the overall framework of the software package and the relationship between the application and the kernel software package.

Fig. 2 shows the command training interface.

Fig. 3 shows the flow of the endpoint detection algorithm.

Fig. 4 shows the hidden Markov model topology of each command.

Fig. 5 shows the topology of the recognition/rejection decision neural network.

Fig. 6 shows the interface of a palmtop application, developed on the basis of the software package, that recognizes personal names and place names.

Embodiment:

Adding speech recognition to a palmtop application with the invention is very simple: only a dynamic library, VcmdPpcApi.dll, and a header file, SpeechApi.h, are needed.

The invention uses an object-oriented approach, and all interfaces are collected in the dynamic library VcmdPpcApi.dll. To use this dynamic library, it must first be registered with the system.

Any application source file that calls the interface functions, or uses the structures and constants predefined in SpeechApi.h, should include this header file.

A practical program is generally developed with the invention in the following steps:

1. Implement a class derived from the notification class IVCmdNotifySink and define an instance of it, so that its address can be passed in when the interface instance is created and the Register interface function is called to initialize the whole voice-control core.

2. Create an interface instance to obtain the interface pointer.

3. Call the Register interface function to register the notification class; the recognition core starts up.

4. Call the various APIs as needed. Typically a menu is created first, the current menu is selected, and command items are added; training or recognition can then proceed.

5. Before the program exits, call the End interface to save the software package's parameters and settings.

Fig. 6 shows a palmtop application, written with the invention, that recognizes place names and personal names; it exercises almost all of the interface functions. The user can freely add, delete, train and recognize menu commands. The recognition result provides up to 5 candidates, displayed in descending order of model score. Fig. 2 shows the command training dialog box. All commands are displayed in a list box, and double-clicking a command trains it.

Claims

1. A palmtop computer speech recognition core software package composed of a bottom layer, a transition layer and an API interface layer, characterized in that: in the bottom layer, the computer's built-in recording device is linked to a preprocessing module, which is linked to endpoint detection, which is linked to a feature extraction module, which is in turn connected to the recognition/rejection module and to the training module respectively; the results are linked to the application interface through a class; the application interface calls the following five classes of interface provided by the API interface layer: start/end recognition, training/recognition, model management, menu management and recognition parameter configuration; and the transition layer, formed by the recognition kernel, connects the bottom-layer implementation to the API interface layer.

2. The palmtop computer speech recognition core software package according to claim 1, characterized in that the endpoint detection algorithm uses time-domain energy, and the detection process comprises the following five steps: estimating the background noise and computing the upper and lower thresholds, marking the start, confirming the start, marking the end point, and confirming the end.

3. The palmtop computer speech recognition core software package according to claim 1, characterized in that the feature parameters are the normalized time-domain energy with its first- and second-order differences and the 12th-order LPC cepstrum parameters with their first- and second-order differences, 39 dimensions in total.

4. The palmtop computer speech recognition core software package according to claim 1, characterized in that the training module trains a left-to-right continuous-density hidden Markov model for each isolated word; the model is initialized from a uniform segmentation, and then, repeatedly, the training samples are segmented with the Viterbi algorithm using the current model parameters and the model parameters are re-estimated.

5. The palmtop computer speech recognition core software package according to claim 1, characterized in that, during recognition, the Viterbi search algorithm computes the log-likelihood score of the sample to be recognized against each model, and the recognition/rejection decision is made on the top-scoring model; in the recognition case, the software package also provides multiple candidates.

6. The palmtop computer speech recognition core software package according to claim 5, characterized in that, for the recognition/rejection decision, several index parameters are computed for the top-scoring model and then evaluated jointly by a neural network; in the recognition case, the unknown sample is recognized as the highest-scoring model; in the rejection case, the unknown sample is considered to be an out-of-set model.

7. The palmtop computer speech recognition core software package according to claim 6, characterized in that the neural network is a Kohonen self-organizing network, the initial network weights are chosen at random, and the weights are adjusted by a supervised learning algorithm.