CN103810998B - Offline speech recognition method based on mobile terminal device, and implementation method

Info

Publication number: CN103810998B (application CN201310652535.2A)
Authority: CN (China)
Prior art keywords: field, character string, feature vector, mobile terminal, terminal device
Legal status: Expired - Fee Related (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN103810998A
Inventors: 李林, 徐礼奎, 呼延正勇, 方帅, 张晓东, 叶思菁, 姚晓闯, 刘哲
Current and original assignee: China Agricultural University (the listed assignee may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Application filed by China Agricultural University; priority and filing date: 2013-12-05
Published as CN103810998A on 2014-05-21; granted as CN103810998B on 2016-07-06
Landscapes: Machine Translation (AREA)

Abstract

The present invention provides an offline speech recognition method based on a mobile terminal device, comprising: acquiring a speech signal and extracting the speech feature vector corresponding to the speech signal; matching the speech feature vector against an acoustic model preset in the mobile terminal device to obtain the language character string corresponding to the speech feature vector; matching the language character string against a language model and dictionary preset in the mobile terminal device to obtain the matched text data corresponding to the speech feature vector; and computing the output probabilities of the speech feature vector in the acoustic model and, based on the maximum of those output probabilities, taking the corresponding matched text data as the final recognition result of the speech signal. The present invention uses the acoustic model, language model, and dictionary preset in the mobile terminal device to match speech signals of a specific domain, converting the speech signal into text data to obtain the final recognition result and thereby achieving offline speech recognition.

Description

Offline speech recognition method based on mobile terminal device, and implementation method
Technical field
The present invention relates to the field of speech recognition on mobile terminals, and in particular provides an offline speech recognition method based on a mobile terminal device and a method for implementing offline speech recognition on a mobile terminal device.
Background art
Field data collection programs based on mobile terminals are built-in applications running on smart mobile devices (tablets, smartphones, portable computers, etc.) that provide computing support for field investigation work. To simplify field data acquisition, shorten the collection cycle, and improve the standardization and the efficiency of data entry and management, many field data collection programs now exist and are widely applied in sectors such as agriculture, forestry, meteorology, geology, entomology, and ecology.
Construction of and applied research on field data collection programs began in the 1990s. Current field acquisition systems generally enter data via keyboard, but the keyboard of a smartphone is small and human fingers are comparatively large, so wrong keys are often pressed during input and both hands are occupied while entering data. Data entry is therefore inefficient, which has hindered the further wide adoption of field data collection systems. Applying speech recognition technology will become a powerful means of breaking the efficiency bottleneck of conventional keyboard input.
Speech recognition is an interdisciplinary technology involving signal processing, pattern recognition, probability and information theory, the mechanisms of speech production and hearing, artificial intelligence, and more; its goal is to convert the lexical content of human speech into computer-readable input and thereby achieve more natural human-computer interaction. Mainstream speech recognition software is currently based on cloud/Internet processing: the client inputs speech, the server performs recognition, and the result is returned to the client. The advantages of this approach are that it exploits the powerful speech processing capability of the server, saves client storage space for the language model, acoustic model, and dictionary, and can recognize a large general vocabulary. However, it cannot recognize the uncommon vocabulary of a specific industry, requires network access, and cannot guarantee processing speed when the network is poor, so it is unsuitable for field acquisition systems used under poor environmental conditions. A speech recognition technology based on offline operation to support field acquisition systems is therefore particularly important and urgent.
Summary of the invention
(1) Technical problem to be solved
An object of the present invention is to provide an offline speech recognition method based on a mobile terminal device, thereby enabling speech recognition while offline.
(2) Technical solution
To solve the above technical problem, the present invention provides an offline speech recognition method based on a mobile terminal device, comprising:
acquiring a speech signal and extracting the speech feature vector corresponding to the speech signal;
matching the speech feature vector against an acoustic model preset in the mobile terminal device to obtain the language character string corresponding to the speech feature vector; and matching the language character string against a language model and dictionary preset in the mobile terminal device to obtain the matched text data corresponding to the speech feature vector;
computing the output probabilities of the speech feature vector in the acoustic model and, based on the maximum of those output probabilities, taking the corresponding matched text data as the final recognition result of the speech signal.
The speech recognition method based on a mobile terminal device further includes: performing word segmentation on the final recognition result.
Specifically, performing word segmentation on the final recognition result includes:
S501, setting n to the number of Chinese characters in the longest entry of a word segmentation dictionary; the matched text data corresponding to the final recognition result is a Chinese character string;
S502, taking the first n characters of the Chinese character string as the matching field and looking it up in the word segmentation dictionary;
if the word segmentation dictionary contains a word corresponding to the matching field, the match succeeds: the matching field is split out as a word, stored into another character string newString, and separated from other words by a blank character;
if no word corresponding to the matching field is found in the word segmentation dictionary, the match fails and step S503 is entered;
S503, reducing n to n-1, then removing the last Chinese character from the matching field taken in step S502 to form a new matching field and looking it up in the word segmentation dictionary; if the dictionary contains a word corresponding to the new matching field, the match succeeds and the new matching field is split out as a word and stored into character string newString;
if the match fails, step S503 is repeated until the new matching field is matched successfully;
S504, repeating steps S502-S503 until all characters of the Chinese character string have been matched as matching fields, completing the segmentation of the Chinese character string.
The speech recognition method based on a mobile terminal device further includes: displaying the final recognition result in an interface two-dimensional table.
Specifically, displaying the final recognition result in the interface two-dimensional table includes:
S601, determining the fields to be collected by the interface two-dimensional table and storing these fields in the character string array KeyWordString;
S602, splitting the segmented character string into multiple fields with the split function, using the blank character as the delimiter, and storing them in the character string array InputString;
S603, taking a field out of the character string array InputString and comparing it item by item with the fields in KeyWordString; if there is a match, the subscript i of this field in the array InputString is stored in the array PointKeyWord; if there is no match, no operation is performed; here 1 ≤ i ≤ n, n is the number of fields in the character string array InputString, and i and n are positive integers;
S604, taking the next field out of InputString and comparing it item by item with the fields in KeyWordString; if the match succeeds, the subscript i+1 of this field in InputString is stored in PointKeyWord and the array element ValueString[i] is set to empty; if it does not match, ValueString[i] is set to this field;
S605, repeating steps S603 and S604 until all fields in InputString have been matched;
S606, storing the matching results in a HashMap as key-value pairs, comparing the Key of each pair with the header of the two-dimensional table, and storing the value of each pair into the interface's two-dimensional table.
Specifically, the speech recognition method based on a mobile terminal device matches the speech feature vector by the Viterbi algorithm.
Specifically, the speech recognition method based on a mobile terminal device matches the language character string by the NGram algorithm.
To solve the above technical problem, the present invention also provides a method for implementing offline speech recognition on a mobile terminal device, comprising:
collecting a project vocabulary;
training acoustic model data and language model data from the project vocabulary using an HMM model;
building an acoustic model from the trained acoustic model data, building a language model from the trained language model data, and creating a dictionary with a text editor;
storing the acoustic model, language model, and dictionary in the mobile terminal device.
The acoustic model data is trained with an HMM parameter optimization algorithm based on the segmental K-means algorithm.
The language model data is trained with the NGram algorithm.
(3) Beneficial effects
Unlike the background art, the core principle of the present invention is to convert the speech signal into text data to obtain the final recognition result. The main implementation uses the acoustic model, language model, and dictionary preset in the mobile terminal device to match speech signals of a specific domain, finally achieving offline speech recognition. Furthermore, when collecting information data of a specific domain, the present invention completes the collection without manual input, greatly improving collection efficiency and reducing the cost of data acquisition.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the method for implementing offline speech recognition on a mobile terminal device in embodiment one;
Fig. 2 is a flowchart of the HMM parameter training based on the segmental K-means algorithm in the embodiment of Fig. 1;
Fig. 3 is a schematic overall flowchart of the offline speech recognition method based on a mobile terminal device of the present invention;
Fig. 4 is a schematic flowchart of the offline speech recognition method based on a mobile terminal device in embodiment two;
Fig. 5 is a schematic flowchart of the Chinese word segmentation in the embodiment of Fig. 4;
Fig. 6 is a schematic flowchart of echoing the Chinese word segmentation result to the system interface two-dimensional table in the embodiment of Fig. 4;
Fig. 7 is a recorded speech waveform of the system of embodiment three of the present invention implementing offline speech recognition on a mobile terminal device;
Fig. 8 is an acoustic analysis chart of the system of embodiment three;
Fig. 9 shows the analysis.conf configuration file of the system of embodiment three;
Fig. 10 shows an HMM proto file of the system of embodiment three;
Fig. 11 shows the HMM training procedure of the system of embodiment three;
Fig. 12 shows the HTK tool framework of the system of embodiment three;
Fig. 13 is the HTK speech processing flowchart of the system of embodiment three.
Detailed description of the invention
To make the purpose, content, and advantages of the present invention clearer, specific embodiments of the present invention are described in further detail below in conjunction with the drawings and examples. The following examples serve to illustrate the present invention but do not limit its scope.
Embodiment one
This embodiment provides a method for implementing offline speech recognition on a mobile terminal device. The method starts at step 101: collect a project vocabulary, the project vocabulary here being the dialect of a specific area and the everyday expressions and phrases of a specific domain.
In step 102, a hidden Markov model (HMM) based on probability statistics is used to train acoustic model data and language model data from the project vocabulary. The acoustic model data is trained with an HMM parameter optimization algorithm based on the segmental K-means algorithm; the language model data is trained with the NGram algorithm.
The NGram algorithm is based on the assumption that the appearance of the n-th word depends only on the preceding N-1 words and is unrelated to any other word, so the probability of a whole sentence is the product of the occurrence probabilities of its words. These probabilities can be obtained by directly counting how often N words occur together in the corpus. The binary Bi-Gram and the ternary Tri-Gram are the most commonly used. The computation formula is: P(wn | w1, w2, …, wn-1) = C(w1, w2, …, wn) / C(w1, w2, …, wn-1); that is, by the law of large numbers, the probability that the n-th word appears given the preceding N-1 words is the frequency with which the N words occur together divided by the frequency with which the N-1 words occur together. For example, for the phrase 推广前景 ("promotion prospects"), the probability that 景 appears after 推广前 is P(景 | 推, 广, 前) = C(推, 广, 前, 景) / C(推, 广, 前). Training speech data with the NGram algorithm means training, for each word in the dictionary, the probability of the other words occurring with it in context, and storing the training result in the language model.
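As an illustration of this counting scheme, the following Java sketch estimates a Bi-Gram probability from a token array. It is a minimal example of the formula above, not code from the patent, and all names in it are illustrative.

    import java.util.HashMap;
    import java.util.Map;

    final class BigramSketch {
        // Minimal sketch of the NGram estimate for the binary (Bi-Gram) case:
        // P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})
        static double bigramProbability(String[] corpus, String prev, String word) {
            Map<String, Integer> unigrams = new HashMap<>();
            Map<String, Integer> bigrams = new HashMap<>();
            for (int i = 0; i < corpus.length; i++) {
                unigrams.merge(corpus[i], 1, Integer::sum);   // C(w)
                if (i + 1 < corpus.length)                    // C(w, w')
                    bigrams.merge(corpus[i] + " " + corpus[i + 1], 1, Integer::sum);
            }
            Integer prevCount = unigrams.get(prev);
            if (prevCount == null) return 0.0;                // unseen history
            return bigrams.getOrDefault(prev + " " + word, 0) / (double) prevCount;
        }
    }

In a trained language model these ratios are computed once over the whole corpus and stored, rather than recounted per query as in this sketch.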
The quality of the HMM parameters directly affects the effect of speech recognition. Of the three parameters in the HMM M = {A, B, π}, the state transition probability matrix A and the initial state probability set π have little influence on the recognition rate and are usually set to uniformly distributed values or non-zero random values. The initial value of parameter B is more difficult to set than A and π, and at the same time the most important. In the present invention, to address the shortcomings of the classic Baum-Welch algorithm, three improved algorithms are summarized; any of the three can be used in HMM model training to train the acoustic model data and language model data. The three improved algorithms are introduced separately below.
The first: an HMM parameter optimization algorithm based on the segmental K-means algorithm
Referring to Fig. 2, in step 201, preset the initial HMM parameter values; the initial values can be obtained by equal-division state assignment or by experience;
preset the maximum iteration count I and the convergence threshold ζ;
In step 202, perform state segmentation of the input training speech data with the Viterbi algorithm;
In step 203, re-estimate parameter B of the model by the segmental K-means algorithm. Two cases arise:
Discrete system:
bji = (number of occurrences of speech frames with label i in state Sj) / (total number of speech frames in state Sj);
Continuous system:
the probability density function of each state is a superposition of M normal distribution functions;
mixture coefficient ωji = (number of speech frames of class i in state j) / (number of speech frames in state j);
sample mean μji: the sample mean of class i in state j;
sample covariance matrix: υji = the sample covariance matrix of class i in state j;
The model parameters M* are computed from the above parameters.
In step 204, with M* as the initial value, re-estimate the HMM parameters with the Baum-Welch algorithm;
In step 205, return to step 202, until the iteration count exceeds I or the convergence condition is met.
As shown in Fig. 2, the segmental K-means algorithm is based on the maximum likelihood criterion over the optimal state sequence; it can greatly accelerate model convergence and also provide additional information during training.
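The discrete-case re-estimation of B in step 203 can be illustrated with the following Java sketch. It assumes the Viterbi segmentation of step 202 has already assigned each frame to a state; the input names stateOfFrame and labelOfFrame are illustrative, not from the patent.

    final class SegmentalKMeansSketch {
        // bji = (speech frames with codebook label i in state Sj) / (all frames in state Sj)
        static double[][] reestimateB(int[] stateOfFrame, int[] labelOfFrame,
                                      int numStates, int numLabels) {
            double[][] b = new double[numStates][numLabels];
            int[] framesInState = new int[numStates];
            for (int t = 0; t < stateOfFrame.length; t++) {
                b[stateOfFrame[t]][labelOfFrame[t]] += 1.0;   // count label i in state Sj
                framesInState[stateOfFrame[t]]++;
            }
            for (int j = 0; j < numStates; j++)
                for (int i = 0; i < numLabels; i++)
                    if (framesInState[j] > 0)
                        b[j][i] /= framesInState[j];          // normalize per state
            return b;
        }
    }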
The second: an improved HMM parameter training algorithm based on a genetic algorithm
(1) preset the initial HMM parameter values; the initial values can be set uniformly or obtained by experience;
(2) preset the maximum number of generations I and the convergence threshold ζ;
(3) select the best genes: in each generation, genes are selected in order of fitness from high to low according to a certain ratio, with higher-fitness genes selected at a correspondingly higher rate; the fitness is computed as f(λ) = Σ_{k=1}^{N} ln P(O_k | λ), summing the log-likelihood over the N training observation sequences;
(4) crossover and mutation: crossover exchanges corresponding parts of two parents to produce offspring, which amounts to a search of a local region; mutation adds, deletes, or changes a sub-region of a parent to produce a variant, letting the offspring jump out of the current local search area and avoiding premature convergence to a local optimum;
(5) update the model parameters;
(6) if the number of generations exceeds I or the convergence condition is met, training is complete; otherwise return to (3).
The third: an improved HMM parameter training algorithm based on a relaxation algorithm.
(1) preset the initial HMM parameter values; the initial values can be set uniformly or obtained by experience;
(2) preset the maximum iteration count I and the convergence threshold ζ;
(3) let Tm = T0 × f(m), where f(m) = K^m, m takes the values 0, 1, 2, …, I, and K < 1;
(4) generate an instance of N × M mutually independent normal random variables X with mean EX = 0 and variance DX = Tm;
(5) use the classical Baum-Welch algorithm to obtain the parameters bij, then let bij* = bij + x (x being the corresponding value of X above), 1 ≤ i ≤ N, 1 ≤ j ≤ M; if a bij* is negative, set it to zero and renormalize;
(6) if m > I or the convergence condition is met, training is complete; otherwise return to (3).
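Steps (3)-(5) amount to perturbing the Baum-Welch estimate of B with Gaussian noise whose variance Tm shrinks geometrically. A minimal Java sketch follows, under the assumption that b holds the current Baum-Welch estimate and that T0 and K (K < 1) are the annealing constants; it is illustrative, not the patent's code.

    import java.util.Random;

    final class RelaxationSketch {
        static void perturbB(double[][] b, double t0, double k, int m, Random rng) {
            double tm = t0 * Math.pow(k, m);                  // Tm = T0 * K^m
            for (int i = 0; i < b.length; i++) {
                double rowSum = 0.0;
                for (int j = 0; j < b[i].length; j++) {
                    // x drawn from N(0, Tm): EX = 0, DX = Tm
                    b[i][j] += rng.nextGaussian() * Math.sqrt(tm);
                    if (b[i][j] < 0) b[i][j] = 0;             // negative values are set to zero
                    rowSum += b[i][j];
                }
                for (int j = 0; j < b[i].length; j++)         // renormalize the row
                    if (rowSum > 0) b[i][j] /= rowSum;
            }
        }
    }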
In this embodiment, however, only the first of the above three optimization algorithms is used, namely the HMM parameter optimization algorithm based on the segmental K-means algorithm. This algorithm mainly accelerates model training, clusters similar syllables, and shortens the decoding space during recognition. The latter two improved algorithms mainly improve precision; since our offline speech recognition trains only on certain vocabulary of a specific industry application, the vocabulary in such a model is small and the similarity between different words is very low, so those two algorithms bring little improvement. Only the first optimization algorithm is therefore used.
In step 103, the acoustic model is built from the trained acoustic model data, the language model is built from the trained language model data, and the dictionary is created with a text editor. The dictionary and language model are created as follows: a text file is created manually with a text editor according to the needs of the application system, the Chinese characters of the field information to be collected and their corresponding pinyin are written into the text file, and the lmtool tool is then used to automatically generate the word segmentation dictionary and the language model.
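For illustration only, such a manually created text file might pair the Chinese characters of each field to be collected with their pinyin, one entry per line; the entries below are hypothetical examples, not taken from the patent:

    小麦 xiao mai
    播种期 bo zhong qi

lmtool then generates the word segmentation dictionary and language model from a file of this kind.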
In step 104, the acoustic model, language model, and dictionary are stored in the mobile terminal device.
Through the above process, an acoustic model, language model, and dictionary are established for the dialect of a specific area and the everyday expressions and phrases of a specific domain. The acoustic model matches the speech feature vector, while the language model and dictionary match the character string produced by the acoustic model. With this acoustic model, language model, and dictionary, offline speech recognition for the specific domain is achieved: when collecting information data of the specific domain, the collection is completed without manual input, which greatly improves collection efficiency, reduces the cost of data acquisition, and solves the technical problem mentioned in the background art.
Embodiment two
Referring to Fig. 3 and Fig. 4, this embodiment provides an offline speech recognition method based on a mobile terminal device. The method builds on the acoustic model, language model, and dictionary established in embodiment one. It starts at step 401: acquire a speech signal and extract the speech feature vector corresponding to the speech signal.
In step 402, the speech feature vector is matched against the acoustic model preset in the mobile terminal device to obtain the language character string corresponding to the speech feature vector; the language character string is then matched against the language model and dictionary preset in the mobile terminal device to obtain the matched text data corresponding to the speech feature vector. Specifically, the method matches the speech feature vector with the Viterbi algorithm of the HMM model, and matches the language character string with the NGram algorithm.
In step 403, the output probabilities of the speech feature vector in the acoustic model are computed, and based on the maximum of those output probabilities, the corresponding matched text data is taken as the final recognition result of the speech signal.
As the foregoing shows, the core principle of this embodiment is to convert the speech signal into text data to obtain the final recognition result, mainly by using the acoustic model, language model, and dictionary preset in the mobile terminal device to match domain-specific speech signals, finally achieving offline speech recognition. Furthermore, this embodiment completes the collection of domain-specific information data without manual input, greatly improving collection efficiency and reducing the cost of data acquisition.
Referring to Fig. 5 and Fig. 6, so that the user can better understand the final recognition result produced by this embodiment, the speech recognition method based on a mobile terminal device of this embodiment further includes, after obtaining the final recognition result:
performing word segmentation on the final recognition result;
and displaying the segmented result in the interface two-dimensional table.
After these steps, the segmented final result is displayed accurately in the system interface two-dimensional table at the position of the corresponding collection field, and the user can directly view the collected speech signal in written form. In this way, one data entry completes the collection of a whole record rather than a single field, which greatly improves data acquisition efficiency and saves the time and labor costs of field information collection.
Fig. 5 is a schematic flowchart of segmenting the final recognition result. In this embodiment, the explanation takes the matched text data corresponding to the final recognition result to be a Chinese character string. Specifically, segmenting the final recognition result includes:
S501, setting n to the number of Chinese characters in the longest entry of a word segmentation dictionary, where the matched text data corresponding to the final recognition result is a Chinese character string. The word segmentation dictionary is created as follows: a text file is created manually with a text editor according to the needs of the application system, the Chinese characters of the field information to be collected and the corresponding pinyin are written into the file, and the lmtool tool is then used to automatically generate the word segmentation dictionary.
S502, taking the first n characters of the Chinese character string as the matching field and looking it up in the word segmentation dictionary.
If the word segmentation dictionary contains a word corresponding to the matching field, the match succeeds: the matching field is split out as a word, stored into another character string newString, and separated from other words by a blank character.
If no word corresponding to the matching field is found in the word segmentation dictionary, the match fails and step S503 is entered.
S503, reducing n to n-1, then removing the last Chinese character from the matching field taken in step S502 to form a new matching field and looking it up in the word segmentation dictionary; if the dictionary contains a word corresponding to the new matching field, the match succeeds and the new matching field is split out as a word and stored into character string newString.
If the match fails, step S503 is repeated until the new matching field is matched successfully.
S504, repeating steps S502-S503 until all characters of the Chinese character string have been matched as matching fields, completing the segmentation of the Chinese character string.
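A minimal Java sketch of steps S501-S504 follows; dict stands in for the word segmentation dictionary and maxLen for the Chinese character count n of its longest entry, and both names are illustrative rather than from the patent.

    import java.util.Set;

    final class MaxMatchSketch {
        static String segment(String text, Set<String> dict, int maxLen) {
            StringBuilder newString = new StringBuilder();  // holds the segmented result
            int pos = 0;
            while (pos < text.length()) {
                int len = Math.min(maxLen, text.length() - pos);
                // S502/S503: shrink the matching field one character at a time
                while (len > 1 && !dict.contains(text.substring(pos, pos + len)))
                    len--;
                // split the matched word out, separated by a blank character;
                // an unmatched single character is emitted as-is
                newString.append(text, pos, pos + len).append(' ');
                pos += len;  // S504: continue until the whole string is segmented
            }
            return newString.toString().trim();
        }
    }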
Referring to Fig. 6, Fig. 6 is a schematic flowchart of echoing the result of segmenting the Chinese character string to the system interface two-dimensional table. Specifically, displaying the final recognition result (i.e. the segmented result) in the interface two-dimensional table includes:
S601, determining the fields to be collected by the interface two-dimensional table and storing these fields in the character string array KeyWordString. The collected fields are those the interface two-dimensional table needs to display, including but not limited to all fields produced by the segmentation above.
S602, splitting the segmented character string into multiple fields with the split function, using the blank character as the delimiter, and storing them in the character string array InputString.
S603, taking a field out of the character string array InputString and comparing it item by item with the fields in KeyWordString; if there is a match, the subscript i of this field in the array InputString is stored in the array PointKeyWord; if there is no match, no operation is performed. Here 1 ≤ i ≤ n, n is the number of fields in the character string array InputString, and i and n are positive integers.
S604, taking the next field out of InputString and comparing it item by item with the fields in KeyWordString; if the match succeeds, the subscript i+1 of this field in InputString is stored in PointKeyWord and the array element ValueString[i] is set to empty; if it does not match, ValueString[i] is set to this field.
S605, repeating steps S603 and S604 until all fields in InputString have been matched.
S606, storing the matching results in a HashMap as key-value pairs, comparing the Key of each pair with the header of the two-dimensional table, and storing the value of each pair into the interface's two-dimensional table.
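Under one reading of steps S603-S606, in which a recognized field name is followed by its value, the echo can be sketched in Java as follows; the method and variable names are illustrative, not from the patent.

    import java.util.HashMap;
    import java.util.Map;

    final class EchoSketch {
        static Map<String, String> echoToTable(String segmented, String[] keyWordString) {
            String[] inputString = segmented.split(" ");    // S602: split on the blank character
            Map<String, String> pairs = new HashMap<>();    // S606: key-value store
            for (int i = 0; i < inputString.length - 1; i++) {
                for (String key : keyWordString) {
                    if (key.equals(inputString[i]))         // S603: field matches a header
                        pairs.put(key, inputString[i + 1]); // S604: the next token is its value
                }
            }
            return pairs;  // keys are compared with the table header; values fill the table
        }
    }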
Embodiment three
Referring to Figs. 7-13, the system of this embodiment implementing offline speech recognition on a mobile terminal device is based on the Android platform, with SQLite, a currently free small mobile client database, as the database. The system is divided into four modules: an offline speech recognition module; a word segmentation module; an echo module that displays the word segmentation result in the page's two-dimensional table; and a data processing module. The offline speech recognition module completes the training of the speech model data (including the acoustic model data and language model data) and the creation of the language model, acoustic model, and dictionary; the core principle of offline speech recognition is converting the speech signal into a text signal. The word segmentation module converts the character string recognized by the offline speech recognition module into individual phrases by string matching and joins them into one character string separated by delimiters such as commas or spaces. The echo module accurately displays the character string processed by the word segmentation module in the two-dimensional table by matching against the table header. The data processing module handles the operations on the data in the two-dimensional table and synchronizes the database data in the mobile terminal device with the data on the server. The implementation of each module is described below.
1. Offline speech recognition module
In the speech recognition module of this system, only the first of the three optimization algorithms described in embodiment one is used to train the acoustic model data, namely the HMM parameter optimization algorithm based on the segmental K-means algorithm. This algorithm mainly accelerates model training, clusters similar syllables, and shortens the decoding space during recognition. The latter two improved algorithms mainly improve precision; since our offline speech recognition trains only on certain vocabulary of a specific industry application, the vocabulary in such a model is small and the similarity between different words is very low, so those two algorithms bring little improvement, and only the first optimization algorithm is used.
Data preparation, data training, speech recognition, and result analysis in offline speech recognition are all completed with the HTK toolkit. The HTK software architecture is shown in Fig. 12 and the HTK speech processing flow in Fig. 13:
1.1. Set up the folders that store the material required for speech recognition.
Create a folder data for storing training and test data, and under it set up two subdirectories, data/train and data/test. Under train, set up two further subdirectories: data/train/sig (to store the recorded training speech data) and data/train/mfcc (to store the MFCC parameters converted from the training data); data/test stores the test data. Create a model folder for the files associated with the recognition system's models. Create a def folder for the language model and dictionary.
1.2. Create the training set.
At this stage, we use the HTK tool HSLab to record the speech signals of the project vocabulary and then use the same tool to write a label for each speech signal, i.e. to associate a text describing the content of the speech. The DOS command with which this tool completes the work is: HSLab name.sig, where name is the pinyin of the specific vocabulary item.
1.2.1. Recording
Press the Rec button to start recording the speech signal and press Stop to stop recording; the default sampling frequency is 16 kHz. Fig. 7 shows a recorded waveform of the digit 5.
1.2.2. Labeling the signal
First press the Mark button, then select the region you want to label. After the region is marked, press Labelas, input the label name, then press the Enter key. For each signal, we need to mark three consecutive regions: the starting pause, labeled sil; the recorded word, labeled with the project vocabulary name; and the ending pause, labeled sil. These three regions must not overlap, even if the gap between them is tiny. Once all three are labeled, press the Save button and the label file name.lab is created. Below is the label file for the digit 5:
4171250 9229375 sil
9229375 15043750 name
15043750 20430625 sil
where the numbers at the beginning of each label denote its starting and ending sample points. Such a file can be modified manually, for example to adjust the start or end point of a label.
1.3. Acoustic analysis
Speech recognition tools cannot process waveform speech directly; the waveform must be represented by a more compact and effective method, which requires acoustic analysis with the HCopy tool, as shown in Fig. 8. Its DOS command is HCopy -A -D -C analysis.conf -S targetlist.txt, where -A displays the command line, -D displays the configuration settings, -C specifies the configuration file, and -S specifies the script file of source and target files. analysis.conf is the parameter-extraction configuration file (text after # is a comment) that sets the acoustic coefficient extraction parameters. We use MFCC as the feature extraction parameters, comprising in all: 12 MFCC coefficients [c1, ..., c12] (because NUMCEPS = 12); 1 MFCC coefficient c0, directly proportional to the total energy of the frame (suffix _0 in TARGETKIND); 13 delta coefficients, derived from [c0, c1, ..., c12] (suffix _D in TARGETKIND); and 13 acceleration coefficients (suffix _A in TARGETKIND). targetlist.txt specifies the name and location of each wave file to be processed and the name and location of the target coefficient file. The analysis.conf configuration is shown in Fig. 9.
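For reference, an analysis.conf consistent with the parameters named above might look like the following; apart from NUMCEPS = 12 and the TARGETKIND suffixes, the entries are illustrative assumptions rather than the actual file of Fig. 9:

    # illustrative HTK acoustic-analysis configuration (values are assumptions)
    SOURCEFORMAT = HTK         # format of the source waveform files
    TARGETKIND   = MFCC_0_D_A  # MFCC plus c0 (_0), deltas (_D), accelerations (_A)
    NUMCEPS      = 12          # 12 MFCC coefficients c1, ..., c12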
1.4. HMM prototype definition
Create the file hmm_yi.hmm and save it under htk/model/proto. The content of hmm_yi.hmm is shown in Fig. 10; the HMM files of the other vocabulary items have identical content, with yi in ~h "yi" changed to the corresponding vocabulary pinyin. The section ~h "yi" <BeginHMM> ... <EndHMM> encapsulates the description of the HMM.
1.5. HMM training
The complete HMM training process is shown in Fig. 11 and comprises two parts: initialization and training.
1.5.1. Initialization
Use the following command to apply the Viterbi algorithm to initialize the HMM: HInit -A -D -S trainlist.txt -M model/hmm1 -H model/proto/hmm0 -l label -L label_dir nameofhmm, where nameofhmm is the name of the HMM to initialize, trainlist.txt gives the list of .mfcc files, label indicates which labeled segment of the training set to use, label_dir is the directory holding the label files, and model/hmm1 is the directory (which must be created in advance) to which the initialized model descriptions are output. This process is repeated for each model.
1.5.2. Training
Use the HTK tool HRest to perform re-estimation iterations that estimate the optimal HMM parameters. The command is: HRest -A -D -S trainlist.txt -M model/hmmi -H model/hmmi-1/hmmfile -l label -L label_dir nameofhmm, where nameofhmm is the name of the HMM to train, hmmfile is the description file of the HMM named nameofhmm, trainlist.txt gives the complete list of the .mfcc files making up the training set (stored in data/train/mfcc/), label indicates the label used in the training data, and model/hmmi is the output directory, with i the current iteration number. This process must be repeated several times for each HMM to be trained. Stopping condition: a convergence measure is displayed on screen after each HRest iteration; once this measure no longer decreases (in absolute value), the process should be stopped, and the corresponding training results are then gathered under the hmm_result folder. The other vocabulary items are trained similarly.
1.6. Task definition
Each file related to the task should be stored in the dedicated def/ directory.
1.6.1. Establishing the grammar rules
Create the grammar rule file gram.txt (under the def folder) with the following content:
/* Task grammar */
$WORD = YI | ER | ... | SIL;
( {SIL} [$WORD] {SIL} )
Braces { } around SIL indicate that it may be absent or occur multiple times (allowing a long pause before or after the word, or no pause at all). Brackets [ ] around $WORD indicate zero or one occurrence (with no word, it is possible to recognize just a pause).
1.6.2. Establishing the dictionary file
Create dict.txt (under the def folder) with content:
YI [1] yi
ER [2] er
SIL [sil] sil
1.7. Building the task network
The task grammar (described in gram.txt) is compiled with the HParse tool to generate the task network; the DOS command is:
HParse def/gram.txt def/net.slf
To ensure the grammar contains no mistakes, it can be tested with the HSGen tool; the DOS command is:
HSGen -s def/net.slf def/dict.txt
1.8. Speech recognition
Speech recognition is carried out with the HVite tool in HTK. Using the previously prepared dictionary, the grammar structure file, and the trained acoustic models, the speech data is decoded into the corresponding statement according to the recognition probabilities, and the HResults tool of HTK is used to analyze the output results.
1.9. Result analysis
Result analysis mainly evaluates the accuracy and speed of speech recognition and is an important link in speech recognition. The result analysis tool in HTK is HResults, which performs a pairwise comparison of the recognition test results against the completed HMM label files and outputs the corresponding correctness and accuracy (the accuracy additionally accounts for insertion errors on top of the correctness).
2. Chinese word segmentation
The Chinese word segmentation of this system uses the maximum-matching approach to string matching to segment the character string produced by speech recognition. According to the vocabulary in the data dictionary, the recognized character string is matched against the dictionary entries; once matching succeeds, the recognized result is turned into a character string of words separated by spaces. This is set forth in detail in embodiment two and is not repeated here.
3. Echoing the word segmentation result to the interface
In the character string echo process, this system first splits the segmented character string on spaces with the split function and stores the pieces in a character string array; it then compares them with the header fields of the two-dimensional table, stores the data of each header that matches successfully as the key of a key-value pair and the corresponding value data as its value, and finally, by key lookup, inserts the values of the key-value pairs into the two-dimensional table one by one. This is set forth in detail in embodiment two and is not repeated here.
4. Data processing
For data processing, this embodiment's system first stores the data collected in the system interface two-dimensional table into the mobile terminal's local SQLite database; when the mobile terminal is connected to the network, all unsynchronized data in the local database can be synchronized to the server database. During synchronization, the system can either synchronize each record immediately after it is collected or synchronize all collected data together in one batch. When synchronizing, the system packages the data to be synchronized, converts it into an XML file, and transfers the XML file to the Web server over the TCP/IP network protocol using an HTTP connection; after parsing the XML file, the server updates the data into the server-side database.
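A minimal Java sketch of the upload step, assuming the unsynchronized rows have already been serialized into an xml string; the endpoint URL is a placeholder, not an address named by the patent.

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    final class SyncSketch {
        static int uploadXml(String xml) throws Exception {
            URL url = new URL("http://example.com/sync");          // hypothetical endpoint
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");                         // HTTP over TCP/IP
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(xml.getBytes(StandardCharsets.UTF_8));   // send the packaged XML
            }
            return conn.getResponseCode();  // the server parses the XML and updates its database
        }
    }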
The foregoing is only an embodiment of the present invention and does not thereby limit the scope of the present invention's claims; any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, whether used directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (4)

1. An offline speech recognition method based on a mobile terminal device, characterized by comprising:
acquiring a speech signal and extracting the speech feature vector corresponding to the speech signal;
matching the speech feature vector against an acoustic model preset in the mobile terminal device to obtain the language character string corresponding to the speech feature vector; and matching the language character string against a language model and dictionary preset in the mobile terminal device to obtain the matched text data corresponding to the speech feature vector;
computing the output probabilities of the speech feature vector in the acoustic model and, based on the maximum of those output probabilities, taking the corresponding matched text data as the final recognition result of the speech signal;
further comprising: performing Chinese word segmentation on the final recognition result;
further comprising: displaying the final recognition result in an interface two-dimensional table, wherein displaying the final recognition result in the interface two-dimensional table comprises:
S601, determining the fields to be collected by the interface two-dimensional table and storing these fields in the character string array KeyWordString;
S602, splitting the segmented character string into multiple fields with the split function, using the blank character as the delimiter, and storing them in the character string array InputString;
S603, taking a field out of the character string array InputString and comparing it item by item with the fields in KeyWordString; if there is a match, the subscript i of this field in the array InputString is stored in the array PointKeyWord; if there is no match, no operation is performed; wherein 0 ≤ i ≤ n-1, n is the number of fields in the character string array InputString, and i is an integer;
S604, taking the next field out of InputString and comparing it item by item with the fields in KeyWordString; if the match succeeds, the subscript i+1 of this field in InputString is stored in PointKeyWord and the array element ValueString[i] is set to empty; if it does not match, ValueString[i] is set to this field;
S605, repeating steps S603 and S604 until all fields in InputString have been matched;
S606, storing the matching results in a HashMap as key-value pairs, comparing the Key of each pair with the header of the two-dimensional table, and storing the value of each pair into the interface's two-dimensional table.
2. The offline speech recognition method according to claim 1, characterized in that performing word segmentation on the final recognition result comprises:
S501, setting n to the number of Chinese characters in the longest entry of a word segmentation dictionary, wherein the matched text data corresponding to the final recognition result is a Chinese character string;
S502, taking the first n characters of the Chinese character string as the matching field and looking it up in the word segmentation dictionary;
if the word segmentation dictionary contains a word corresponding to the matching field, the match succeeds: the matching field is split out as a word, stored into another character string newString, and separated from other words by a blank character;
if no word corresponding to the matching field is found in the word segmentation dictionary, the match fails and step S503 is entered;
S503, reducing n to n-1, then removing the last Chinese character from the matching field taken in step S502 to form a new matching field and looking it up in the word segmentation dictionary; if the dictionary contains a word corresponding to the new matching field, the match succeeds and the new matching field is split out as a word and stored into character string newString;
if the match fails, step S503 is repeated until the new matching field is matched successfully;
S504, repeating steps S502-S503 until all characters of the Chinese character string have been matched as matching fields, completing the segmentation of the Chinese character string.
3. The offline speech recognition method according to claim 1, characterized in that the speech feature vector is matched by the Viterbi algorithm.
4. The offline speech recognition method according to claim 1, characterized in that the language character string is matched by the NGram algorithm.

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
CF01: Termination of patent right due to non-payment of annual fee (granted publication date: 2016-07-06; termination date: 2016-12-05)