CN102945673A

CN102945673A - Continuous speech recognition method with speech command range changed dynamically

Info

Publication number: CN102945673A
Application number: CN 201210483176
Authority: CN
Inventors: 赵乾; 朱群; 吴玲; 潘颂声; 何春江; 王兵
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2012-11-24
Filing date: 2012-11-24
Publication date: 2013-02-27

Abstract

The invention provides a continuous speech recognition method in which the range of speech commands changes dynamically. The method comprises: (1) inputting speech command sets, grouping them according to rules, and constructing a decoding network for each group; (2) inputting speech, extracting acoustic features, and decoding on the decoding networks while dynamically adding and removing decoding networks according to the current operating conditions; (3) judging whether the received speech is valid and whether the feedback given is valid; (4) performing the operation corresponding to the recognized command; and (5) determining whether speech is still being input: if so, returning to step (2); otherwise, finishing. The method allows a user to input speech continuously and allows the system to add and remove speech commands dynamically according to its operating conditions during recognition. Decoding networks are adjusted in real time and take part in decoding, which improves recognition efficiency and greatly improves recognition accuracy.

Description

A continuous speech recognition method with a dynamically changing speech command range

Technical field

The present invention relates to a speech command recognition method, and in particular to a continuous speech recognition method for a variable range of speech commands.

Background art

Communicating with a machine and having it understand what you say is something people have long dreamed of. Speech recognition is the technology by which a machine converts a speech signal into the corresponding text or command through a process of recognition and understanding. As an important means of human-machine interaction, speech recognition has found increasingly wide application in recent years. Examples include large-vocabulary continuous speech recognition systems on computer platforms, used mainly for voice information query services combined with the telephone network or the internet, and applications in miniaturized, portable voice products such as intelligent toys and household remote controls.

Speech command recognition has two kinds of application scenarios. In the first, the command content to be recognized is fixed. In the second, the command content to be recognized changes over time: the content to be recognized at the next moment may be unknown, and the user's speech is input continuously during recognition. Here a speech command may be a simple command word or a sentence, that is, any of the various ways of phrasing a given command word or application scenario.

An example of the second scenario is a game such as Cruel Beans, in which the command words the user may read aloud, that is, the currently recognizable command words, are displayed on screen dynamically in real time. For both the recognition system and the user, the command words available for recognition at the next moment are completely unknown, and the user's speech is input continuously throughout the changes in command words. The recognition system must nevertheless recognize the user's speech accurately and in real time. Traditional speech command recognition methods often cannot meet this requirement.

Traditional speech command recognition methods mainly handle the case of a fixed command set. Before evaluation begins, a fixed decoding network is built from the contents of the command set. This technique is therefore inflexible and cannot cope with command sets that must change at any time. The procedure is shown in Fig. 1. Step 1: define the command set according to requirements. Step 2: build the decoding network from the command set contents. Step 3: receive the user's speech input. Step 4: judge whether the received speech is valid and whether the feedback given is valid; if so, go to step 5, otherwise go to step 3. Step 5: the system performs the operation corresponding to the command. Step 6: check whether there is still speech input; if so, go to step 3, otherwise finish.

Existing speech command recognition methods have the following main shortcomings: (1) they can only handle a fixed, known speech command set; when the command set must change in real time and the content to be recognized at the next moment is completely unknown, they fail; (2) they usually build a single complex, fixed decoding network from all command words. When the number of speech commands is large, this network becomes very large, so the memory and time overheads are both high. Similar-sounding commands are also more likely to coexist in such a network, and the more similar commands the decoding network contains, the worse the system's recognition performance.

Summary of the invention

Problem solved by the invention: to overcome the deficiencies of the prior art by providing a continuous speech recognition method with a dynamically changing speech command range. The method allows the user to input speech continuously and allows the system to add and delete speech commands dynamically according to its running state during recognition, adjusting the decoding networks in real time so that they take part in decoding. This improves recognition efficiency and also greatly improves recognition accuracy.

Technical solution of the invention: a continuous speech recognition method with a dynamically changing speech command range, implemented in the following steps:

(1) Input the initial speech command set text and perform text processing. The speech command set text may be divided into one or more groups according to application needs, and different groups may have different life cycles;

(2) Based on the text output by step (1), build a decoding network for each group of speech commands, and pass each decoding network, together with the acoustic model, to its own decoder. The acoustic model is the underlying mathematical model of speech recognition, and its model unit is the phoneme, syllable or word;

(3) Receive speech data fragments in real time, extract the acoustic feature sequence and pass it to every decoder for decoding. The acoustic features are a set of values describing the basic short-time characteristics of speech;

(4) During decoding, allow the external application system to add and delete speech command sets dynamically according to the needs of its operating logic, and update the decoding networks in real time as the speech command sets change. The process of updating the decoding networks is as follows:

(41) Accept a speech command set adjustment request from the external application system;

(42) If a new speech command set is to be added, perform text processing on it, build the corresponding decoding network from the text, and start decoding. If a speech command set is to be deleted, stop all computation in the decoder corresponding to that set and delete the corresponding decoding network;

(5) When some decoder is the first to reach the end position of its network, obtain the best result of every decoder, sort these results, take the one with the highest probability as the optimal result, and judge whether it is credible. If it is credible, stop all decoders and go to step (6); otherwise go to step (3) and continue decoding;

(6) The external application system performs the corresponding operation according to the judgment in step (5).

The decoding network in step (2) is a command-word decoding network or an LVCSR (large-vocabulary continuous speech recognition) decoding network.

The acoustic features in step (3) are Mel-frequency cepstral coefficients (MFCC), cepstral coefficients (CEP), linear prediction coefficients (LPC) or perceptual linear prediction coefficients (PLP).

The process of judging whether the result is credible in step (5) is as follows:

(51) when some decoder is the first to reach the end position of its network, obtain the best result of every decoder;

(52) sort all decoding results by probability;

(53) take the top-ranked result, that is, the one with the highest posterior probability, as the optimal result;

(54) compute the confidence score of this result and compare it with a threshold;

(55) if the score is greater than the threshold, the result is considered credible; otherwise it is considered not credible.
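
Steps (52) to (55) can be illustrated with a minimal Python sketch. The `DecodeResult` record, the function name and the default threshold of 0.6 are hypothetical, and the confidence score is assumed to be computed elsewhere, since the text does not say how it is obtained.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DecodeResult:
    group_id: str      # hypothetical: the command group whose decoder produced this
    text: str          # recognized command
    log_prob: float    # accumulated path log-probability
    confidence: float  # confidence score, assumed to be precomputed


def pick_trusted_result(results: List[DecodeResult],
                        threshold: float = 0.6) -> Optional[DecodeResult]:
    """Return the top-ranked result if it clears the threshold, else None."""
    if not results:
        return None
    ranked = sorted(results, key=lambda r: r.log_prob, reverse=True)  # (52)
    best = ranked[0]                                                  # (53)
    return best if best.confidence > threshold else None              # (54)-(55)
```

A return value of `None` corresponds to the "not credible" branch, in which decoding continues.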

When judging the credibility of the decoding result in step (5), the result of VAD (voice activity detection) may also be consulted to make the judgment more reliable: check whether the decoding end position lies in a silence segment of the VAD result. If it does, the recognition result is considered credible; otherwise it is considered not credible.
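
The VAD check amounts to an interval test. The sketch below assumes that the VAD output is available as a list of half-open `(start, stop)` frame ranges marking silence; that representation and the function name are assumptions, not something the text specifies.

```python
from typing import List, Tuple


def ends_in_silence(end_frame: int,
                    silence_segments: List[Tuple[int, int]]) -> bool:
    """True if end_frame lies inside any [start, stop) silence segment."""
    return any(start <= end_frame < stop for start, stop in silence_segments)
```

For instance, with silence segments `[(0, 10), (100, 150)]`, a decoding end position at frame 120 would be accepted and one at frame 50 would not.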

Compared with the prior art, the invention has the following advantages:

(1) The invention allows the user to input speech continuously and allows the external application system to add and delete speech command sets dynamically according to the needs of its operating logic, adjusting the decoding networks in real time so that they take part in decoding. This effectively solves the problem of continuous speech recognition when the speech command range changes dynamically.

(2) The invention builds one decoding network per speech command set, so each network has a simpler structure. When a large number of speech command sets must be recognized, the method achieves a higher recognition rate, lower computation and lower memory use than traditional recognition methods.

Description of drawings

Fig. 1 is a flow chart of the prior art;

Fig. 2 is a flow chart of the present invention;

Fig. 3 is a diagram of the process of dynamically adjusting speech command sets in the present invention;

Fig. 4 is a flow chart of the decoding process of the present invention;

Fig. 5 is an example decoding network in which each word of the command set forms its own group;

Fig. 6 is a flow chart of acoustic feature extraction in the present invention.

Embodiment

As shown in Fig. 2, the present invention is implemented as follows:

(1) Input the initial speech command set text and perform text processing.

The input speech command set consists of the speech commands that the external application system has predefined as recognizable, and it is one of the bases for building the decoding network. This step performs three main tasks:

First, the speech command set is grouped according to rules into one or more groups. Different groups may have different life cycles, while commands within the same group share the same life cycle. The rules can be set according to the needs of the application, for example grouping by the number or type of commands. In Fig. 5, each word forms its own group.

Second, the text encoding of the grouped command sets is unified, for example by converting everything to UTF-8, so that only one set of text-parsing code needs to be implemented;

Finally, the text is parsed at the granularity of the model unit used by the acoustic model (such as word, syllable or phoneme; phonemes work better as the modeling unit), producing a parse tree. This structure contains the complete information at five levels: sentence, word, character, syllable and phoneme. The first three levels can be parsed with the word segmentation algorithm of the text front end, and the last two can be parsed with a pronunciation dictionary.
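
The three text-processing tasks can be sketched in Python as follows. Everything named here is hypothetical: the toy `LEXICON` stands in for a real pronunciation dictionary, whitespace splitting stands in for a real word segmentation algorithm, and the tree keeps only three of the five levels (sentence, word, phoneme). For convenience the demo normalizes the encoding before grouping.

```python
from typing import Callable, Dict, List

# Hypothetical toy pronunciation dictionary; a real system would load a full lexicon.
LEXICON: Dict[str, List[str]] = {
    "open": ["ow", "p", "ah", "n"],
    "door": ["d", "ao", "r"],
    "stop": ["s", "t", "aa", "p"],
}


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Task 2: normalize any input encoding to UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def group_commands(commands: List[str],
                   rule: Callable[[str], str]) -> Dict[str, List[str]]:
    """Task 1: split commands into groups; `rule` maps a command to its group key."""
    groups: Dict[str, List[str]] = {}
    for command in commands:
        groups.setdefault(rule(command), []).append(command)
    return groups


def parse_command(command: str) -> dict:
    """Task 3: build a small parse tree (sentence -> words -> phonemes)."""
    words = command.split()  # assumption: whitespace segmentation
    return {
        "sentence": command,
        "words": [{"word": w, "phonemes": LEXICON[w]} for w in words],
    }


def unit_sequence(tree: dict) -> List[str]:
    """Flatten the tree into the phoneme sequence used to build a network."""
    return [p for w in tree["words"] for p in w["phonemes"]]


raw = "open door\nstop".encode("gbk")  # pretend the input arrived as GBK
commands = to_utf8(raw, "gbk").decode("utf-8").splitlines()
groups = group_commands(commands, rule=lambda c: c)  # one command per group
sequences = {g: [unit_sequence(parse_command(c)) for c in cmds]
             for g, cmds in groups.items()}
```

Passing a different `rule`, such as one keyed on command type, would yield coarser groups whose members share a life cycle.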

(2) Build a decoding network for each group of speech commands.

Based on the grouping from step (1), a decoding network is built for each group of speech commands, as shown in Fig. 5. The procedure is as follows:

A) obtain the model unit sequences produced by the text processing step;

B) for each group of unit sequences, compute the number of arcs in the network according to the permitted reading rules, such as re-reading and skipping, and allocate memory for the arcs;

C) build the arcs according to the reading rules so that the nodes are connected;

D) output the decoding network corresponding to each group of speech commands.
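
A minimal sketch of steps A) to D) for a single unit sequence is given below. The text does not define the reading rules precisely, so the sketch assumes that "re-reading" is a back arc from the end node to the start node and that "skipping" is an empty arc that bypasses one unit. The `Arc` and `Network` types and the log-domain penalty value are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Arc:
    src: int
    dst: int
    unit: str             # model unit consumed on this arc; "" means no unit
    penalty: float = 0.0  # log-domain path penalty (assumed convention)


@dataclass
class Network:
    num_nodes: int
    arcs: List[Arc] = field(default_factory=list)


def build_network(units: List[str], allow_repeat: bool = True,
                  allow_skip: bool = False, penalty: float = -2.0) -> Network:
    """Build a linear network; node i means 'i units read', node n is the end."""
    n = len(units)
    net = Network(num_nodes=n + 1)
    for i, unit in enumerate(units):
        net.arcs.append(Arc(i, i + 1, unit))             # normal reading arc
        if allow_skip:
            net.arcs.append(Arc(i, i + 1, "", penalty))  # skip this unit
    if allow_repeat and n > 0:
        net.arcs.append(Arc(n, 0, "", penalty))          # re-read from the start
    return net
```

Under these assumptions the arc count in step B) is known before any arc is built: `n` reading arcs, plus `n` skip arcs if skipping is allowed, plus one back arc if re-reading is allowed.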

(3) Receive speech data fragments in real time, extract the acoustic feature sequence and pass it to all decoders for parallel decoding.

There are many types of acoustic features; MFCC is used as the example below. The MFCC extraction flow is shown in Fig. 6, and the steps are as follows:

A) A/D conversion: convert the analog signal into a digital signal;

B) Pre-emphasis: pass the signal through a first-order finite impulse response (FIR) high-pass filter so that its spectrum becomes flatter and less susceptible to finite word length effects;

C) Framing: because speech is short-time stationary, it can be processed frame by frame; a frame is typically 25 milliseconds (ms);

D) Windowing: apply a Hamming window to each frame of speech to reduce the Gibbs effect;

E) Fast Fourier transform (FFT): transform the time-domain signal into its power spectrum;

F) Triangular window filtering: filter the power spectrum of the signal with a bank of triangular filters (24 in total) distributed linearly on the Mel frequency scale. Each triangular filter covers a range approximating one critical bandwidth of the human ear, which simulates the ear's masking effect;

G) Logarithm: take the logarithm of the triangular filter bank outputs, which gives a result similar to a homomorphic transform;

H) Discrete cosine transform (DCT): remove the correlation between the dimensions of the signal and map it to a lower-dimensional space;

I) Spectral weighting: the low-order cepstral parameters are easily affected by speaker and channel characteristics, while the high-order parameters have low discriminative power, so spectral weighting is applied to suppress the low-order and high-order parameters;

J) Cepstral mean subtraction (CMS): CMS effectively reduces the influence of the speech input channel on the feature parameters;

K) Differential parameters: extensive experiments show that adding differential parameters, which characterize the dynamic properties of speech, to the speech features improves recognition performance. The first-order and second-order differences of the MFCC parameters are also used.
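
The core of this pipeline, steps B) to H) and J), can be sketched in dependency-free Python. This is an illustration only: it uses a naive DFT in place of an FFT, omits A/D conversion, spectral weighting and the differential parameters, and assumes an 8 kHz sample rate, a 10 ms frame shift, a pre-emphasis coefficient of 0.97 and 13 cepstral coefficients, none of which are stated in the text. Only the 25 ms frame length and the 24 filters come from the description.

```python
import math
from typing import List


def pre_emphasis(x: List[float], alpha: float = 0.97) -> List[float]:
    """B) first-order FIR high-pass filter: y[n] = x[n] - alpha * x[n-1]."""
    return [x[0]] + [x[i] - alpha * x[i - 1] for i in range(1, len(x))]


def split_frames(x: List[float], size: int, step: int) -> List[List[float]]:
    """C) cut the signal into overlapping frames."""
    return [x[s:s + size] for s in range(0, len(x) - size + 1, step)]


def hamming(frame: List[float]) -> List[float]:
    """D) apply a Hamming window."""
    n = len(frame)
    return [v * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, v in enumerate(frame)]


def power_spectrum(frame: List[float]) -> List[float]:
    """E) power spectrum via a naive DFT (an FFT would be used in practice)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        re = sum(v * math.cos(2 * math.pi * k * i / n) for i, v in enumerate(frame))
        im = sum(v * math.sin(2 * math.pi * k * i / n) for i, v in enumerate(frame))
        spec.append((re * re + im * im) / n)
    return spec


def hz_to_mel(f: float) -> float:
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m: float) -> float:
    return 700.0 * (10 ** (m / 2595.0) - 1.0)


def mel_filterbank(num_filters: int, n_fft: int, rate: int) -> List[List[float]]:
    """F) triangular filters whose centers are equally spaced on the Mel scale."""
    top = hz_to_mel(rate / 2)
    pts = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * f / rate)) for f in pts]
    bank = []
    for j in range(1, num_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):
            filt[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):
            filt[k] = (hi - k) / (hi - mid)
        bank.append(filt)
    return bank


def dct(values: List[float], num_ceps: int) -> List[float]:
    """H) DCT-II, keeping the first num_ceps coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values)) for k in range(num_ceps)]


def mfcc(signal: List[float], rate: int = 8000, frame_ms: int = 25,
         step_ms: int = 10, num_filters: int = 24,
         num_ceps: int = 13) -> List[List[float]]:
    size, step = rate * frame_ms // 1000, rate * step_ms // 1000
    bank = mel_filterbank(num_filters, size, rate)
    feats = []
    for frame in split_frames(pre_emphasis(signal), size, step):
        spec = power_spectrum(hamming(frame))
        energies = [sum(w * p for w, p in zip(f, spec)) for f in bank]
        log_e = [math.log(max(e, 1e-10)) for e in energies]  # G) floored log
        feats.append(dct(log_e, num_ceps))
    means = [sum(col) / len(feats) for col in zip(*feats)]   # J) CMS
    return [[c - m for c, m in zip(f, means)] for f in feats]
```

Calling `mfcc` on 400 samples at 8 kHz yields three frames of 13 coefficients each; after cepstral mean subtraction every coefficient averages to zero across the frames.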

(4) During decoding, accept speech command set adjustment requests from the external application system and respond in real time.

During decoding, the external application system may add and delete speech command sets dynamically according to the needs of its operating logic, and the decoding networks are updated in real time as the command sets change. The overall flow is shown in Fig. 3; Fig. 4 shows the process of adding a speech command set during decoding.

The process of adding a speech command set is as follows:

A) accept the external application system's request to add a speech command set;

B) perform text processing on the new speech command set;

C) build the corresponding decoding network from the result of text processing;

D) start decoding.

The process of deleting a speech command set is as follows:

A) accept the external application system's request to delete a speech command set;

B) stop all computation in the decoder corresponding to that speech command set;

C) delete the corresponding decoding network.
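
The add and delete procedures suggest a pool that holds one decoder per command group. The sketch below uses a stand-in `Decoder` that only counts frames; the class names and methods are hypothetical, and text processing and network construction are reduced to a comment.

```python
from typing import Dict, List


class Decoder:
    """Stand-in for a real decoder bound to one decoding network."""

    def __init__(self, commands: List[str]):
        self.commands = list(commands)
        self.active = True
        self.frames_seen = 0

    def feed(self, frame: object) -> None:
        if self.active:
            self.frames_seen += 1

    def stop(self) -> None:
        self.active = False


class DecoderPool:
    """Holds one decoder per command group and allows changes between frames."""

    def __init__(self) -> None:
        self._decoders: Dict[str, Decoder] = {}

    def add_group(self, group_id: str, commands: List[str]) -> None:
        # Text processing and network building would happen here.
        self._decoders[group_id] = Decoder(commands)

    def remove_group(self, group_id: str) -> None:
        decoder = self._decoders.pop(group_id, None)  # drop its network
        if decoder is not None:
            decoder.stop()                            # stop its computation

    def feed(self, frame: object) -> None:
        for decoder in list(self._decoders.values()):
            decoder.feed(frame)

    def active_groups(self) -> List[str]:
        return sorted(self._decoders)

    def frames_seen(self, group_id: str) -> int:
        return self._decoders[group_id].frames_seen
```

A group added partway through an utterance only sees the frames that arrive after it joins, which matches the "start decoding" step of the add procedure.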

(5) Decode and obtain the recognition result.

Speech decoding is an important step of the invention (Viterbi decoding is used as the example). It is carried out in the following steps:

A) For every frame of input acoustic features, each decoder computes the output probability and the intra-node state transition probability of the node corresponding to each currently feasible path in its decoding network, and updates the accumulated probability of that path. The output probability can be computed from the hidden Markov model (HMM) of the node's phoneme and the acoustic features; the intra-node state transition probability is read directly from the model.

B) When decoding in step A) reaches the last state within a node, the current decoding path can be extended. The decoding network governs the extension: when the node is connected to several nodes, several paths must be extended for decoding to continue, and if an arc of the decoding network carries a path penalty, the penalty must be added to the accumulated probability of the path;
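
A per-frame Viterbi update in the log domain might look like the sketch below. It assumes a left-to-right chain of states with a single self-loop and a single advance probability shared by all states, which is a simplification of a real HMM; the function names are hypothetical.

```python
import math
from typing import Dict, List

NEG_INF = float("-inf")


def viterbi_step(scores: List[float], emit: List[float],
                 self_loop: float, advance: float) -> List[float]:
    """One frame update for a left-to-right chain of states.

    scores[i] is the best accumulated log-probability of a path in state i
    before this frame; emit[i] is the log output probability of the frame
    in state i; self_loop and advance are log transition probabilities.
    """
    updated = []
    for i in range(len(scores)):
        stay = scores[i] + self_loop
        move = scores[i - 1] + advance if i > 0 else NEG_INF
        updated.append(max(stay, move) + emit[i])
    return updated


def expand_paths(exit_score: float,
                 out_arcs: Dict[str, float]) -> Dict[str, float]:
    """Fan a finished node's score out along its arcs, adding each arc's penalty."""
    return {nxt: exit_score + penalty for nxt, penalty in out_arcs.items()}
```

`viterbi_step` corresponds to step A) and `expand_paths` to step B): once the last state of a node holds a finite score, that score seeds the first state of every successor node, reduced by the arc penalty where one exists.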

The final recognition result is obtained as follows:

A) when some decoder is the first to output a decoding result, obtain the best result of every decoder;

B) sort all decoding results by probability;

C) take the top-ranked result, that is, the one with the highest posterior probability, as the optimal result;

D) compute the confidence score of this result and compare it with a threshold;

E) if the score is greater than the threshold, the result is considered credible; otherwise it is considered not credible and decoding continues.

(6) The external application system performs the corresponding operation according to the judgment in step (5). For example, in a game for practicing reading words aloud, when a word is recognized, the corresponding word can be removed from the display.

Parts of the invention not described in detail in this specification belong to techniques well known in the art.
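
The word-reading game can be mocked up to show how steps (3), (5) and (6) fit together. In this sketch `ToyDecoder` simply "finishes" after a fixed number of frames and returns a canned result; the class, the `recognize` function and the threshold value are all hypothetical.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Result = Tuple[str, float, float]  # (text, log_prob, confidence)


class ToyDecoder:
    """Stand-in decoder that finishes after a fixed number of frames."""

    def __init__(self, text: str, frames_needed: int,
                 log_prob: float, confidence: float):
        self.result: Result = (text, log_prob, confidence)
        self.frames_needed = frames_needed
        self.seen = 0

    def feed(self, frame: object) -> None:
        self.seen += 1

    def finished(self) -> bool:
        return self.seen >= self.frames_needed

    def best(self) -> Result:
        return self.result


def recognize(frames: Iterable[object], decoders: Dict[str, ToyDecoder],
              threshold: float,
              on_command: Callable[[str], None]) -> Optional[str]:
    for frame in frames:
        for decoder in decoders.values():      # step (3): feed every decoder
            decoder.feed(frame)
        if any(d.finished() for d in decoders.values()):
            text, _, confidence = max((d.best() for d in decoders.values()),
                                      key=lambda r: r[1])
            if confidence > threshold:         # step (5): credible result
                on_command(text)               # step (6): application reacts
                return text                    # returning ends all decoding
    return None


on_screen = {"cat", "dog"}
decoders = {"cat": ToyDecoder("cat", 3, -10.0, 0.9),
            "dog": ToyDecoder("dog", 5, -25.0, 0.4)}
recognized = recognize(range(10), decoders, 0.6, on_screen.discard)
```

Here the application callback is `on_screen.discard`, so recognizing "cat" removes that word from the set of words shown to the player.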

Claims

1. A continuous speech recognition method with a dynamically changing speech command range, characterized in that it is implemented in the following steps:

(3) receive speech data fragments in real time, extract the acoustic feature sequence and pass it to every decoder for decoding, the acoustic features being a set of values describing the basic short-time characteristics of speech;

2. The continuous speech recognition method with a dynamically changing speech command range according to claim 1, characterized in that the decoding network in step (2) is a command-word decoding network or an LVCSR decoding network.

3. The continuous speech recognition method with a dynamically changing speech command range according to claim 1, characterized in that the acoustic features in step (3) are Mel-frequency cepstral coefficients (MFCC), cepstral coefficients (CEP), linear prediction coefficients (LPC) or perceptual linear prediction coefficients (PLP).

4. The continuous speech recognition method with a dynamically changing speech command range according to claim 1, characterized in that the process of judging whether the result is credible in step (5) is as follows:

(52) sort all decoding results by probability;

5. The continuous speech recognition method with a dynamically changing speech command range according to claim 1, characterized in that, when judging the credibility of the decoding result in step (5), the result of VAD (voice activity detection) may be consulted to make the judgment more reliable: check whether the decoding end position lies in a silence segment of the VAD result; if it does, the recognition result is considered credible; otherwise it is considered not credible.