CN103810998A - Method for off-line speech recognition based on mobile terminal device and achieving method - Google Patents


Info

Publication number
CN103810998A
CN103810998A
Authority
CN
China
Prior art keywords
character string
field
mobile terminal
terminal device
dictionary
Prior art date
Legal status
Granted
Application number
CN201310652535.2A
Other languages
Chinese (zh)
Other versions
CN103810998B (en)
Inventor
李林
徐礼奎
呼延正勇
方帅
张晓东
叶思菁
姚晓闯
刘哲
Current Assignee
China Agricultural University
Original Assignee
China Agricultural University
Priority date
Filing date
Publication date
Application filed by China Agricultural University
Priority to CN201310652535.2A
Publication of CN103810998A
Application granted
Publication of CN103810998B
Expired - Fee Related
Anticipated expiration


Classifications

  • Machine Translation (AREA)

Abstract

The invention provides a method for offline speech recognition based on a mobile terminal device. The method comprises: obtaining a speech signal and extracting the speech feature vectors corresponding to the speech signal; matching the speech feature vectors against an acoustic model preset in the mobile terminal device to obtain the language character strings corresponding to the speech feature vectors; matching the language character strings against a language model and a dictionary preset in the mobile terminal device to obtain the matched text data corresponding to the speech feature vectors; and calculating the output probabilities of the speech feature vectors in the acoustic model and, based on the maximum of these output probabilities, obtaining the matched text data of the corresponding speech feature vector as the final recognition result of the speech signal. By using the acoustic model, language model, and dictionary preset in the mobile terminal device, speech signals of a specific field are matched and converted into text data to obtain the final recognition result, finally achieving offline speech recognition.

Description

Offline speech recognition method based on mobile terminal device and implementation method
Technical field
The present invention relates to the field of speech recognition on mobile terminals, and in particular provides an offline speech recognition method based on a mobile terminal device and an implementation method for offline speech recognition based on a mobile terminal device.
Background technology
A field data collection program based on a mobile terminal is a built-in application that runs on intelligent mobile equipment (tablets, smartphones, portable computers, etc.) and provides computer-technology support for field survey work. To simplify the acquisition of field data, shorten the data collection cycle, and improve the standardization and efficiency of data entry and management, many field data collection programs now exist and are widely applied in sectors such as agriculture, forestry, meteorology, geology, entomology, and ecology.
Research on the construction and application of field data collection programs began in the 1990s. Current field acquisition systems generally record data through keyboard input, but the keyboard of a smartphone is small and a person's fingers are comparatively large, so wrong keys are often pressed during input; moreover, both hands are occupied while entering data. Data entry is therefore inefficient, which has hindered the wider adoption of field data collection systems. Applying speech recognition technology is a powerful means of breaking the efficiency limits of conventional keyboard input.
Speech recognition is an interdisciplinary technology involving signal processing, pattern recognition, probability and information theory, the mechanisms of speech production and hearing, and artificial intelligence. Its goal is to convert the lexical content of human speech into computer-readable input, thereby achieving more natural human-machine interaction. At present, mainstream speech recognition software is based on cloud processing over the Internet: the client captures the voice, the server performs the recognition, and the result is returned to the client. The advantages of this approach are that it exploits the powerful speech processing capability of the server, saves client storage for the language model, acoustic model, and dictionary, and can recognize a large general-purpose vocabulary. However, it cannot recognize the uncommon vocabulary of a specific industry, and it requires network access; when network conditions are poor, processing speed cannot be guaranteed. It is therefore unsuitable for field acquisition systems used where environmental conditions are poor, and an offline speech recognition technology to support field acquisition systems is both important and urgent.
Summary of the invention
(1) Technical problem to be solved
The object of the invention is to provide an offline speech recognition method based on a mobile terminal device, thereby achieving speech recognition while offline.
(2) Technical scheme
To solve the above technical problem, the invention provides an offline speech recognition method based on a mobile terminal device, comprising:
obtaining a speech signal and extracting the speech feature vectors corresponding to said speech signal;
matching said speech feature vectors based on an acoustic model preset in said mobile terminal device to obtain the language character strings corresponding to said speech feature vectors; and matching said language character strings based on a language model and a dictionary preset in said mobile terminal device to obtain the matched text data corresponding to said speech feature vectors;
calculating the output probabilities of said speech feature vectors in said acoustic model and, based on the maximum of said output probabilities, obtaining the matched text data of the corresponding speech feature vector as the final recognition result of said speech signal.
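As an illustration of the above flow, a minimal sketch in Java follows. All type and method names (FeatureExtractor, AcousticModel, LanguageModel, Hypothesis, and their methods) are hypothetical placeholders invented for illustration; the patent does not publish an API.

import java.util.List;

// Hypothetical interfaces standing in for the preset models; not a published API.
interface Hypothesis { String phoneticString(); double outputProbability(); }
interface AcousticModel { List<Hypothesis> match(double[][] features); }
interface Dictionary { }
interface LanguageModel { String match(String phonetic, Dictionary dict); }
interface FeatureExtractor { double[][] extract(double[] signal); }

public class OfflineRecognizer {
    private final FeatureExtractor extractor;
    private final AcousticModel acousticModel;   // preset in the mobile terminal device
    private final LanguageModel languageModel;   // preset in the mobile terminal device
    private final Dictionary dictionary;         // preset in the mobile terminal device

    public OfflineRecognizer(FeatureExtractor fe, AcousticModel am,
                             LanguageModel lm, Dictionary d) {
        this.extractor = fe; this.acousticModel = am;
        this.languageModel = lm; this.dictionary = d;
    }

    /** Converts a speech signal into text, keeping the candidate with the
     *  maximum output probability under the acoustic model. */
    public String recognize(double[] signal) {
        double[][] features = extractor.extract(signal);      // speech feature vectors
        String best = null;
        double bestProb = Double.NEGATIVE_INFINITY;
        for (Hypothesis h : acousticModel.match(features)) {  // candidate language character strings
            String text = languageModel.match(h.phoneticString(), dictionary);
            if (text != null && h.outputProbability() > bestProb) {
                bestProb = h.outputProbability();
                best = text;                                  // matched text data
            }
        }
        return best;                                          // final recognition result
    }
}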
The speech recognition method based on the mobile terminal device further comprises: performing word segmentation on said final recognition result.
Specifically, performing word segmentation on said final recognition result comprises:
S501, setting n to the number of Chinese characters in the longest entry of a segmentation dictionary; wherein the matched text data corresponding to said final recognition result is a Chinese character string;
S502, taking the first n characters of said Chinese character string as the matching field and searching said segmentation dictionary;
if said segmentation dictionary contains a word corresponding to said matching field, the match succeeds; said matching field is split out as a word, stored into another character string newString, and separated from other words by a blank character;
if no word corresponding to said matching field can be found in said segmentation dictionary, the match fails and step S503 is entered;
S503, decrementing n to n-1, removing the last Chinese character from the matching field used in step S502 to form a new matching field, and searching said segmentation dictionary; if said segmentation dictionary contains a word corresponding to the new matching field, the match succeeds and said new matching field is split out as a word and stored in character string newString;
if the match fails, step S503 is repeated until said new matching field is matched successfully;
S504, repeating steps S502-S503 until all characters of said Chinese character string have been matched, completing the segmentation of said Chinese character string.
The speech recognition method based on the mobile terminal device further comprises: displaying said final recognition result in an interface two-dimensional table.
Specifically, displaying said final recognition result in the interface two-dimensional table comprises:
S601, determining the fields that said interface two-dimensional table needs to collect, and storing these collection fields in the character string array KeyWordString;
S602, dividing the segmented character string into multiple fields with the split function, using the blank character as delimiter, and storing them in the character string array InputString;
S603, taking a field out of the character string array InputString and comparing it item by item with the fields in KeyWordString; if there is a match, storing the subscript i of this field in array InputString into array PointKeyWord; if there is no match, performing no operation; wherein 1 ≤ i ≤ n, n is the number of fields in character string array InputString, and i and n are positive integers;
S604, taking the next field out of InputString and comparing it item by item with the fields in KeyWordString; if the match succeeds, storing the subscript i+1 of this field in InputString into PointKeyWord and setting array element ValueString[i] to empty; if there is no match, setting the value of ValueString[i] to this field;
S605, repeating steps S603 and S604 until all fields in InputString have been matched;
S606, storing the matching results in a HashMap as key-value pairs, comparing the keys of the key-value pairs with the headers of the two-dimensional table, and storing the values of the key-value pairs into the interface two-dimensional table.
Specifically, the speech recognition method based on the mobile terminal device matches said speech feature vectors with the Viterbi algorithm.
Specifically, the speech recognition method based on the mobile terminal device matches said language character strings with the N-gram algorithm.
To solve the above technical problem, the invention also provides an implementation method for offline speech recognition based on a mobile terminal device, comprising:
collecting a project vocabulary;
training acoustic model data and language model data with an HMM model based on said project vocabulary;
establishing an acoustic model based on the trained acoustic model data and a language model based on the trained language model data, and creating a dictionary with a text editor;
storing said acoustic model, language model, and dictionary in said mobile terminal device.
Wherein said acoustic model data is trained with an HMM parameter optimization algorithm based on the segmental K-means algorithm.
Wherein said language model data is trained with the N-gram algorithm.
(3) Beneficial effects
Unlike the background art, the basic principle of the invention is to convert the speech signal into text data to obtain the final recognition result. The main implementation uses the acoustic model, language model, and dictionary preset in the mobile terminal device to match speech signals of a specific field, finally achieving offline speech recognition. Further, when collecting the information data of a specific field, the invention completes the collection without manual input, greatly improving collection efficiency and reducing the cost of data acquisition.
Description of the drawings
Fig. 1 is a schematic flowchart of the implementation method for offline speech recognition based on a mobile terminal device in Embodiment 1;
Fig. 2 is a flowchart of HMM parameter training based on the segmental K-means algorithm in the embodiment of Fig. 1;
Fig. 3 is an overall flow diagram of the offline speech recognition method based on a mobile terminal device of the invention;
Fig. 4 is a schematic flowchart of the offline speech recognition method based on a mobile terminal device in Embodiment 2;
Fig. 5 is a schematic flowchart of Chinese word segmentation in the embodiment of Fig. 4;
Fig. 6 is a schematic flowchart of echoing the segmented result to the system interface two-dimensional table in the embodiment of Fig. 4;
Fig. 7 is a recorded voice waveform of the realization system for offline speech recognition based on a mobile terminal device of Embodiment 3;
Fig. 8 is an acoustic analysis diagram of the realization system of Embodiment 3;
Fig. 9 shows the analysis.conf configuration file of the realization system of Embodiment 3;
Fig. 10 shows the HMM prototype file of the realization system of Embodiment 3;
Fig. 11 is the HMM training process diagram of the realization system of Embodiment 3;
Fig. 12 is the HTK tool architecture diagram of the realization system of Embodiment 3;
Fig. 13 is the HTK speech processing flowchart of the realization system of Embodiment 3.
Detailed description of the embodiments
To make the purpose, content, and advantages of the invention clearer, the specific embodiments of the invention are described in further detail below in conjunction with the drawings and examples. The following examples illustrate the invention but do not limit its scope.
Embodiment 1
This embodiment provides an implementation method for offline speech recognition based on a mobile terminal device. The method begins at step 101: collecting a project vocabulary, where the project vocabulary consists of the professional terms, everyday expressions, and phrases of a specific field.
In step 102, acoustic model data and language model data are trained based on said project vocabulary using the hidden Markov model (HMM), which is based on probability statistics. The acoustic model data is trained with an HMM parameter optimization algorithm based on the segmental K-means algorithm; the language model data is trained with the N-gram algorithm.
The N-gram algorithm is based on the assumption that the appearance of the N-th word depends only on the preceding N-1 words and is unrelated to any other word, so the probability of a whole sentence is the product of the probabilities of its words. These probabilities can be obtained by directly counting how often the N words occur together in the corpus. The binary Bi-gram and the ternary Tri-gram are the most commonly used. The formula is: P(w_n | w_1, w_2, …, w_{n-1}) = C(w_1, w_2, …, w_n) / C(w_1, w_2, …, w_{n-1}). By the law of large numbers, given that the preceding N-1 words have occurred, the probability of the N-th word is the frequency with which the N words occur together divided by the frequency with which the N-1 words occur together. Taking the word 推广前景 ("promotion prospect") as an example: given that 推, 广, 前 have occurred, the probability that 景 occurs is P(景 | 推, 广, 前) = C(推, 广, 前, 景) / C(推, 广, 前). Training speech data with the N-gram algorithm means training, from context, the probability of each word in the dictionary occurring given the other words, and storing the training results in the language model.
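A count-based sketch of this estimation for the bigram case is shown below, in plain Java and assuming a pre-tokenized corpus; it illustrates the counting formula above, not the lmtool implementation.

import java.util.HashMap;
import java.util.Map;

public class BigramModel {
    private final Map<String, Integer> unigram = new HashMap<>(); // C(w)
    private final Map<String, Integer> bigram = new HashMap<>();  // C(w_prev, w_next)

    public void train(String[] tokens) {
        for (int i = 0; i < tokens.length; i++) {
            unigram.merge(tokens[i], 1, Integer::sum);
            if (i + 1 < tokens.length) {
                bigram.merge(tokens[i] + " " + tokens[i + 1], 1, Integer::sum);
            }
        }
    }

    /** P(next | prev) = C(prev, next) / C(prev). */
    public double probability(String prev, String next) {
        int cPrev = unigram.getOrDefault(prev, 0);
        if (cPrev == 0) return 0.0;
        return bigram.getOrDefault(prev + " " + next, 0) / (double) cPrev;
    }

    /** The probability of a whole sentence is the product of the word probabilities. */
    public double sentenceProbability(String[] tokens) {
        double p = 1.0;
        for (int i = 1; i < tokens.length; i++) {
            p *= probability(tokens[i - 1], tokens[i]);
        }
        return p;
    }
}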
The quality of the HMM model parameters directly affects the recognition result. Of the three parameters of the HMM model M = {A, B, π}, the state transition probability matrix A and the initial state probability set π have little influence on the recognition rate and are usually set to uniformly distributed or non-zero random values. The initial value of parameter B is harder to set than A and π, and is also the most important. For the shortcomings of the classic Baum-Welch algorithm, the present invention summarizes three improved algorithms; the acoustic model data and language model data can be trained with any of these three improved HMM training algorithms. The three improved algorithms are introduced below.
First: the HMM parameter optimization algorithm based on the segmental K-means algorithm
Referring to Fig. 2, in step 201, the initial values of the HMM model parameters are preset; the initial values can be obtained by the equal-division state method or by experience.
The maximum iteration count I and the convergence threshold ζ are preset.
In step 202, the input training speech data is segmented into states with the Viterbi algorithm.
In step 203, parameter B of the model is re-estimated with the segmental K-means algorithm, in two cases:
Discrete system:
b_ji = (number of occurrences of speech frames with label i in state S_j) / (total number of speech frames in state S_j).
Continuous system:
The probability density function of each state is the superposition of M normal distribution functions, with
mixture coefficient ω_ji = (number of class-i speech frames in state j) / (number of speech frames in state j);
sample mean μ_ji = the sample mean of class i in state j;
sample covariance matrix υ_ji = the sample covariance matrix of class i in state j.
The model parameters M* are computed from the above parameters.
In step 204, with M* as the initial value, the HMM parameters are re-estimated with the Baum-Welch algorithm.
In step 205, return to step 202 until the iteration count exceeds I or the convergence condition is met.
As shown in Fig. 2, the segmental K-means algorithm is based on the maximum likelihood criterion over the optimal state sequence; it greatly accelerates the convergence of the model and can also provide some additional information during training.
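For the discrete case, the re-estimation of step 203 reduces to frequency counting over the Viterbi state segmentation. A minimal Java sketch, assuming the per-frame state assignments and codebook labels have already been computed (the input arrays are assumed, not HTK's data structures):

public class SegmentalKMeansStep {
    /**
     * Re-estimates the emission matrix b[j][i] by counting.
     * stateOfFrame[t]  = state assigned to frame t by Viterbi segmentation
     * symbolOfFrame[t] = observation symbol (codebook label) of frame t
     */
    public static double[][] reestimateB(int[] stateOfFrame, int[] symbolOfFrame,
                                         int numStates, int numSymbols) {
        double[][] b = new double[numStates][numSymbols];
        int[] framesInState = new int[numStates];
        for (int t = 0; t < stateOfFrame.length; t++) {
            b[stateOfFrame[t]][symbolOfFrame[t]] += 1.0; // count frames with label i in state j
            framesInState[stateOfFrame[t]]++;            // count all frames in state j
        }
        for (int j = 0; j < numStates; j++) {
            for (int i = 0; i < numSymbols; i++) {
                if (framesInState[j] > 0) b[j][i] /= framesInState[j]; // b_ji = ratio of counts
            }
        }
        return b;
    }
}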
Second: the improved HMM parameter training algorithm based on the genetic algorithm
(1) Preset the initial values of the HMM model parameters; the initial values can be set uniformly or obtained by experience;
(2) Preset the maximum number of generations I and the convergence threshold ζ;
(3) Select the best genes: in each generation, choose the best genes in a certain proportion in order of fitness from high to low, where a higher fitness corresponds to a higher selection ratio; the fitness is computed as f(λ) = Σ_{k=1}^{N} ln(P(O_k | λ)), summed over the N training observation sequences O_k;
(4) Crossover and mutation: crossover exchanges corresponding sections of two parents to produce offspring, equivalent to searching a local region; mutation adds, deletes, or changes parts of a parent to produce variants, letting the offspring jump out of the current local search region and avoiding premature convergence to a local optimum;
(5) Update the model parameters;
(6) If the number of generations exceeds I or the convergence condition is met, training is complete; otherwise return to (3).
Third: the improved HMM parameter training algorithm based on the relaxation algorithm.
(1) Preset the initial values of the HMM model parameters; the initial values can be set uniformly or obtained by experience;
(2) Preset the maximum iteration count I and the convergence threshold ζ;
(3) Let T_m = T_0 × f(m) with f(m) = K^m, where m takes the values 0, 1, 2, …, I and K < 1;
(4) Generate an instance of N × M independent normal random variables X with mean EX = 0 and variance DX = T_m;
(5) Obtain the parameters b_ij with the classic Baum-Welch algorithm and let b*_ij = b_ij + x (x being the corresponding value of X above), 1 ≤ i ≤ N, 1 ≤ j ≤ M; if b*_ij is negative, set it to zero and normalize;
(6) If m > I or the convergence condition is met, training is complete; otherwise return to (3).
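Steps (3)-(5) amount to perturbing the Baum-Welch estimate of B with Gaussian noise whose variance follows the temperature schedule T_m = T_0 × K^m, clamping negatives and renormalizing. A toy Java sketch under exactly those assumptions (not the authors' training code):

import java.util.Random;

public class AnnealedPerturbation {
    public static double[][] perturb(double[][] b, double t0, double k, int m, Random rnd) {
        double tm = t0 * Math.pow(k, m);      // temperature schedule, K < 1
        double sigma = Math.sqrt(tm);         // standard deviation for variance Tm
        int rows = b.length, cols = b[0].length;
        double[][] out = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            double rowSum = 0.0;
            for (int j = 0; j < cols; j++) {
                double v = b[i][j] + rnd.nextGaussian() * sigma; // b*_ij = b_ij + x
                out[i][j] = Math.max(v, 0.0);                    // negative values set to zero
                rowSum += out[i][j];
            }
            for (int j = 0; j < cols; j++) {                     // renormalize the row
                if (rowSum > 0) out[i][j] /= rowSum;
            }
        }
        return out;
    }
}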
In this embodiment, however, only the first of the three optimization algorithms is used, namely the HMM parameter optimization algorithm based on the segmental K-means algorithm. This algorithm mainly accelerates model training and clusters similar syllables, shortening the decoding search space during recognition. The latter two improved algorithms mainly improve precision; since our offline speech recognition trains mainly on the vocabulary of a specific industry application, the vocabulary in the model is small and the likelihood of confusion between different words is very low, so these two algorithms have little effect. The first optimization algorithm is therefore chosen.
In step 103, the acoustic model is established from the trained acoustic model data, the language model is established from the trained language model data, and the dictionary is created with a text editor. The dictionary and language model are constructed as follows: a text file is created manually with a text editor according to the needs of the application system, the Chinese characters of the field information to be collected and the corresponding pinyin are written into the file, and the lmtool utility then generates the segmentation dictionary and language model automatically.
In step 104, said acoustic model, language model, and dictionary are stored in said mobile terminal device.
Through the above process, the acoustic model, language model, and dictionary are established for the professional terms, everyday expressions, and phrases of a specific field. The acoustic model matches the speech feature vectors, and the language model and dictionary match the character strings produced by acoustic model matching. Offline speech recognition for the specific field can thus be achieved through the above acoustic model, language model, and dictionary: when collecting the information data of the specific field, the collection is completed without manual input, greatly improving collection efficiency, reducing the cost of data acquisition, and solving the technical problem mentioned in the background art.
Embodiment 2
Referring to Fig. 3 and Fig. 4, this embodiment provides an offline speech recognition method based on a mobile terminal device. The method performs speech recognition with the acoustic model, language model, and dictionary established in Embodiment 1. It begins at step 401: obtaining a speech signal and extracting the speech feature vectors corresponding to said speech signal.
In step 402, said speech feature vectors are matched based on the acoustic model preset in said mobile terminal device to obtain the language character strings corresponding to said speech feature vectors; and said language character strings are matched based on the language model and dictionary preset in said mobile terminal device to obtain the matched text data corresponding to said speech feature vectors. Specifically, said speech feature vectors are matched with the Viterbi algorithm of the HMM model, and said language character strings are matched with the N-gram algorithm.
In step 403, the output probabilities of said speech feature vectors in said acoustic model are calculated and, based on the maximum of said output probabilities, the matched text data of the corresponding speech feature vector is obtained as the final recognition result of said speech signal.
As the foregoing description shows, the basic principle of this embodiment is to convert the speech signal into text data to obtain the final recognition result. The main implementation uses the acoustic model, language model, and dictionary preset in the mobile terminal device to match speech signals of a specific field, finally achieving offline speech recognition. Further, when collecting the information data of a specific field, this embodiment completes the collection without manual input, greatly improving collection efficiency and reducing the cost of data acquisition.
Referring to Fig. 5 and Fig. 6, to help the user further understand the processed final recognition result, the speech recognition method of this embodiment further comprises, after obtaining said final recognition result:
performing word segmentation on said final recognition result;
and displaying the segmented result in the interface two-dimensional table.
After the above steps, the segmented final result is displayed accurately at the position of the corresponding collection field in the system interface two-dimensional table, and the user can directly check the collected speech signal in written form. In this way, the user completes a whole record of collected data with a single input, instead of each input completing the acquisition of only one field, which greatly improves data acquisition efficiency and saves the time and labor costs of field information acquisition.
Fig. 5 is a schematic flowchart of performing word segmentation on the final recognition result. In this embodiment, the explanation takes as an example matched text data that is a Chinese character string. Specifically, performing word segmentation on said final recognition result comprises (see the sketch after these steps):
S501, setting n to the number of Chinese characters in the longest entry of a segmentation dictionary; wherein the matched text data corresponding to said final recognition result is a Chinese character string. The segmentation dictionary is constructed as follows: a text file is created manually with a text editor according to the needs of the application system, the Chinese characters of the field information to be collected and the corresponding pinyin are written into the file, and the lmtool utility then generates the segmentation dictionary automatically.
S502, taking the first n characters of said Chinese character string as the matching field and searching said segmentation dictionary.
If said segmentation dictionary contains a word corresponding to said matching field, the match succeeds; said matching field is split out as a word, stored into another character string newString, and separated from other words by a blank character.
If no word corresponding to said matching field can be found in said segmentation dictionary, the match fails and step S503 is entered.
S503, decrementing n to n-1, removing the last Chinese character from the matching field used in step S502 to form a new matching field, and searching said segmentation dictionary; if said segmentation dictionary contains a word corresponding to the new matching field, the match succeeds and said new matching field is split out as a word and stored in character string newString.
If the match fails, step S503 is repeated until said new matching field is matched successfully.
S504, repeating steps S502-S503 until all characters of said Chinese character string have been matched, completing the segmentation of said Chinese character string.
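Steps S501-S504 describe forward maximum matching. A compact Java sketch follows, with an in-memory set standing in for the generated segmentation dictionary; the dictionary contents in the example are illustrative only.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class MaxMatchSegmenter {
    /** Forward maximum matching: take up to maxLen characters, shrink until
     *  the dictionary contains the field, emit it, and continue. */
    public static String segment(String text, Set<String> dict, int maxLen) {
        StringBuilder newString = new StringBuilder();
        int pos = 0;
        while (pos < text.length()) {
            int n = Math.min(maxLen, text.length() - pos);
            // Remove the last character and retry until a dictionary word is found (step S503);
            // an unmatched single character falls through as a one-character word.
            while (n > 1 && !dict.contains(text.substring(pos, pos + n))) {
                n--;
            }
            newString.append(text, pos, pos + n).append(' '); // blank character as separator
            pos += n;
        }
        return newString.toString().trim();
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("推广", "前景", "推广前景"));
        System.out.println(segment("推广前景", dict, 4)); // prints: 推广前景
    }
}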
Referring to Fig. 6, which is a schematic flowchart of echoing the result to the system interface two-dimensional table after segmenting the Chinese character string, displaying said final recognition result (i.e., the segmented result) in the interface two-dimensional table specifically comprises (a sketch follows these steps):
S601, determining the fields that said interface two-dimensional table needs to collect, and storing these collection fields in the character string array KeyWordString; the collection fields are the fields the interface two-dimensional table needs to display, including but not limited to all fields produced by the above segmentation.
S602, dividing the segmented character string into multiple fields with the split function, using the blank character as delimiter, and storing them in the character string array InputString.
S603, taking a field out of the character string array InputString and comparing it item by item with the fields in KeyWordString; if there is a match, storing the subscript i of this field in array InputString into array PointKeyWord; if there is no match, performing no operation; wherein 1 ≤ i ≤ n, n is the number of fields in character string array InputString, and i and n are positive integers.
S604, taking the next field out of InputString and comparing it item by item with the fields in KeyWordString; if the match succeeds, storing the subscript i+1 of this field in InputString into PointKeyWord and setting array element ValueString[i] to empty; if there is no match, setting the value of ValueString[i] to this field.
S605, repeating steps S603 and S604 until all fields in InputString have been matched.
S606, storing the matching results in a HashMap as key-value pairs, comparing the keys of the key-value pairs with the headers of the two-dimensional table, and storing the values of the key-value pairs into the interface two-dimensional table.
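A Java sketch of the keyword-value pairing of steps S601-S606 is given below; the PointKeyWord and ValueString bookkeeping is folded into a single HashMap, the example field names are invented for illustration, and the actual table rendering (Android view code) is omitted.

import java.util.HashMap;
import java.util.Map;

public class EchoToTable {
    /** Pairs each table-header keyword in the segmented string with the field
     *  that follows it (empty if the next field is itself a keyword). */
    public static Map<String, String> match(String segmented, String[] keyWordString) {
        String[] inputString = segmented.split(" "); // blank character as delimiter
        Map<String, String> result = new HashMap<>();
        for (int i = 0; i < inputString.length; i++) {
            for (String key : keyWordString) {
                if (key.equals(inputString[i])) {
                    String value = "";
                    if (i + 1 < inputString.length && !isKeyword(inputString[i + 1], keyWordString)) {
                        value = inputString[i + 1]; // non-keyword field becomes the value
                    }
                    result.put(key, value); // key-value pair compared against table headers
                }
            }
        }
        return result;
    }

    private static boolean isKeyword(String s, String[] keys) {
        for (String k : keys) if (k.equals(s)) return true;
        return false;
    }

    public static void main(String[] args) {
        String[] keys = {"地块名称", "面积"};               // hypothetical table headers
        System.out.println(match("地块名称 北坡 面积 五亩", keys));
    }
}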
Embodiment 3
Referring to Figs. 7-13, the realization system for offline speech recognition on a mobile terminal device of this embodiment is based on the Android platform, and the database used is SQLite, a free small mobile client database currently in wide use. The system is divided into four modules: an offline speech recognition module, a word segmentation module, an echo module that displays the segmentation result in the page two-dimensional table, and a data processing module. The offline speech recognition module completes the training of the speech model data (comprising the acoustic model data and language model data) and the creation of the language model, acoustic model, and dictionary; the basic principle of offline speech recognition is to convert the speech signal into a text signal. The word segmentation module converts the character string recognized by the offline speech recognition module into individual phrases by string matching and joins them into one character string separated by uniform delimiters such as commas or spaces. The echo module displays the segmented character string accurately in the two-dimensional table by matching against the table headers. The data processing module operates on the data in the two-dimensional table and synchronizes the database data on the mobile terminal device with the data on the server. The specific implementation of each module is introduced below.
1. Offline speech recognition module
In the speech recognition module of this system, only the first of the three optimization algorithms described in Embodiment 1, the HMM parameter optimization algorithm based on the segmental K-means algorithm, is used to train the acoustic model data. This algorithm mainly accelerates model training and clusters similar syllables, shortening the decoding search space during recognition. The latter two improved algorithms mainly improve precision; since our offline speech recognition trains mainly on the vocabulary of a specific industry application, the vocabulary in the model is small and the likelihood of confusion between different words is very low, so these two algorithms have little effect. The first optimization algorithm is therefore chosen.
Data preparation, data training, speech recognition, and result analysis for offline speech recognition are all completed with the HTK toolkit. The software architecture of HTK is shown in Fig. 12 and the HTK speech processing flow in Fig. 13:
1.1. Set up the folders for storing the material required for speech recognition.
Create a data folder for storing the training and test data. Under data, create two subdirectories, data/train and data/test; under train, create two further subdirectories, data/train/sig (for storing the recorded training speech data) and data/train/mfcc (for storing the MFCC parameters transformed from the training data); data/test stores the test data. Create a model folder for storing the model files of the recognition system. Create a def folder for storing the language model and dictionary.
1.2. Create the training set.
In this stage, the HTK tool HSLab is used to record the speech signals of the project vocabulary and then to write a label for each speech signal, an associated text describing the content of the speech. The DOS command that completes this work with this tool is: HSLab name.sig, where name is the pinyin of the specific vocabulary item.
1.2.1. Recording
Press the Rec button to start recording the speech signal and press Stop to stop; the default sampling frequency is 16 kHz. Fig. 7 shows a recorded waveform of the digit 5.
1.2.2. Marking the signal
First press the Mark button, then select the region to be labelled. After marking the region, press Labelas, input the label name, and press the Enter key. For each signal, three consecutive regions need to be marked: the starting pause, labelled sil; the recorded word, labelled with the project vocabulary name; and the ending pause, labelled sil. These three regions may not overlap, even if the gap between them is very small. After the three marks are completed, press the Save button and the label file name.lab is created. The label file of the digit 5 reads:
4171250 9229375 sil
9229375 15043750 name
15043750 20430625 sil
where the numbers represent the start and end sampling points of each label. Such a file can be modified manually, for example to adjust the start or end point of a label.
1.3. Acoustic analysis
Speech recognition tools cannot process waveform speech directly; the waveform must be represented in a more compact and effective form. This requires acoustic analysis with the HCopy tool, as shown in Fig. 8. The DOS command is HCopy -A -D -C analysis.conf -S targetlist.txt, where -A displays the command line, -D displays the configuration settings, -C specifies the configuration file, and -S specifies the script file of source and target paths. analysis.conf is the parameter configuration file for feature extraction (comments follow #) and sets the acoustic coefficient extraction parameters. We use MFCCs as the feature extraction parameters; the full parameter set comprises: 12 MFCC coefficients [c1, ..., c12] (because NUMCEPS = 12); 1 MFCC coefficient c0, proportional to the total energy of the frame (suffix _0 in TARGETKIND); 13 Delta coefficients derived from [c0, c1, ..., c12] (suffix _D in TARGETKIND); and 13 Acceleration coefficients (suffix _A in TARGETKIND). targetlist.txt specifies the name and location of each wave file to be processed and of the corresponding target coefficient file. The analysis.conf configuration is shown in Fig. 9.
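Since the file itself appears only as Fig. 9, the following is a plausible reconstruction of analysis.conf from the parameters described above; NUMCEPS and the TARGETKIND suffixes come from the text, while the remaining values are typical HTK settings assumed for illustration:

# analysis.conf - acoustic coefficient extraction parameters (reconstruction)
TARGETKIND = MFCC_0_D_A   # 12 MFCCs plus c0, with Delta and Acceleration coefficients
NUMCEPS = 12              # cepstral coefficients c1..c12
WINDOWSIZE = 250000.0     # 25 ms analysis window (HTK time units of 100 ns)
TARGETRATE = 100000.0     # 10 ms frame shift
USEHAMMING = T            # apply a Hamming window
PREEMCOEF = 0.97          # pre-emphasis coefficient
NUMCHANS = 26             # number of filterbank channels
CEPLIFTER = 22            # cepstral liftering parameter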
1.4. HMM prototype definition
Create the file hmm_yi.hmm and save it under htk/model/proto. The content of hmm_yi.hmm is shown in Fig. 10; the HMM models of the other vocabulary items have identical content, except that yi in ~h "yi" is changed to the corresponding vocabulary pinyin. The section ~h "yi" <BeginHMM> ... <EndHMM> encapsulates the description of the HMM model.
1.5. HMM training
The complete HMM training, shown in Fig. 11, comprises two parts: initialization and training.
1.5.1. Initialization
Use the command line below to initialize the HMM model with the Viterbi algorithm: HInit -A -D -S trainlist.txt -M model/hmm1 -H model/proto/hmmfile -l label nameofhmm, where nameofhmm is the name of the HMM model to be initialized; trainlist.txt gives the list of .mfcc files; -l label indicates which labelled segment of the training set to use; and model/hmm1 is the output directory (which must be created in advance) for the initialized HMM model description. This process is repeated for each model.
1.5.2. Training
The HTK tool HRest performs one re-estimation iteration at a time to estimate the optimum values of the HMM model parameters. The command is: HRest -A -D -S trainlist.txt -M model/hmmi -H model/hmmi-1/hmmfile -l label nameofhmm, where nameofhmm is the name of the HMM model to be trained; hmmfile is the description file of the HMM model named nameofhmm; trainlist.txt gives the complete list of the .mfcc files of the training set (stored in data/train/mfcc/); -l label indicates the label used in the training data; and model/hmmi is the output directory, with i the current iteration number. For each HMM model to be trained, this process is repeated many times. Stopping condition: after each HRest iteration, a change measure indicating convergence is displayed on screen; once this measure no longer decreases (in absolute value), the process should stop, and the corresponding training results are then gathered under the hmm_result folder. The other vocabulary items are trained similarly.
1.6. Task definition
Each file relevant to the task should be stored in the dedicated def/ directory.
1.6.1. Set up the grammar rules
Create the grammar rule file gram.txt (under the def folder) with the content:
/* Task grammar */
$WORD=YI|ER|...|SIL;
({SIL}[$WORD]{SIL})
Enclosing SIL in braces {} means it can occur zero or more times before or after the word (allowing a long pause, or no pause at all). Enclosing $WORD in brackets [] means zero or one occurrence (if no word is spoken, only a pause may be recognized).
1.6.2. Set up the dictionary file
Create dict.txt (under the def folder) with the content:
YI [1] yi
ER [2] er
SIL [sil] sil
1.7. Build the task network
The task grammar (described in gram.txt) is compiled with the HParse tool to generate the task network. The DOS command is:
HParse def/gram.txt def/net.slf
To make sure the grammar contains no mistakes, it can be tested with the HSGen tool. The DOS command is:
HSGen -s def/net.slf def/dict.txt
1.8. Speech recognition
Speech recognition is carried out with the HVite tool in HTK. Using the previously prepared dictionary, the grammar structure file, and the trained acoustic models, HVite outputs the corresponding statement for the speech data according to the recognition probability, and the output results are analyzed with the HResults tool of HTK.
1.9. Result analysis
Result analysis mainly assesses the accuracy and speed of the speech recognition and is an important part of speech recognition. The result analysis tool in HTK is HResults; it performs a pairwise comparison between the recognition test results and the completed HMM label files and outputs the corresponding correctness and accuracy (accuracy additionally takes insertion errors into account).
2. Chinese word segmentation
The Chinese word segmentation of the system adopts the maximum matching approach of string matching to segment the character string produced by speech recognition. According to the vocabulary in the data dictionary, the recognized character string is matched against the terms in the data dictionary; after successful matching, the recognition result becomes a character string composed of a group of words separated by spaces. This is set forth in detail in Embodiment 2 and is not repeated here.
3. Echoing the segmentation result to the interface
In the character string echo process, the system first divides the segmented character string with the split function, using the space as delimiter, and stores the parts in a character string array. These are then compared with the header fields of the two-dimensional table: the header data that matches successfully is stored as the key of a key-value pair and the corresponding value data as its value. Finally, the values of the key-value pairs are inserted into the two-dimensional table one by one by key lookup. This is set forth in detail in Embodiment 2 and is not repeated here.
4. Data processing
The data processing of this embodiment first stores the data collected in the system interface two-dimensional table into the local SQLite database of the mobile terminal; when the mobile terminal is networked, all data not yet synchronized from the local database can be synchronized to the server database. The system can either synchronize each record immediately as it is collected or synchronize all collected data together. At synchronization time, the system packages the data to be synchronized, converts it into an XML file, and transfers the XML file to the Web server through an HTTP connection over the TCP/IP protocol; after the server parses the XML file, the data is updated in the server database.
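A Java sketch of this synchronization step follows; the XML schema and server URL are illustrative assumptions, not values from the patent.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class DataSync {
    /** Wraps locally stored records in a simple XML document and posts it
     *  to the server over HTTP; returns the HTTP response code. */
    public static int upload(Iterable<Map<String, String>> records, String serverUrl)
            throws Exception {
        StringBuilder xml = new StringBuilder("<?xml version=\"1.0\" encoding=\"UTF-8\"?><records>");
        for (Map<String, String> row : records) {
            xml.append("<record>");
            for (Map.Entry<String, String> e : row.entrySet()) {
                xml.append('<').append(e.getKey()).append('>')
                   .append(e.getValue())
                   .append("</").append(e.getKey()).append('>');
            }
            xml.append("</record>");
        }
        xml.append("</records>");

        HttpURLConnection conn = (HttpURLConnection) new URL(serverUrl).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        try (OutputStream os = conn.getOutputStream()) {
            os.write(xml.toString().getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode(); // server parses the XML and updates its database
    }
}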
The foregoing is only embodiments of the invention and does not thereby limit the scope of its claims; every equivalent structure or equivalent process transformation made using the contents of the description and drawings of the invention, or any direct or indirect use in other related technical fields, is likewise included within the patent protection scope of the invention.

Claims (10)

1. An offline speech recognition method based on a mobile terminal device, characterized by comprising:
obtaining a speech signal and extracting the speech feature vectors corresponding to said speech signal;
matching said speech feature vectors based on an acoustic model preset in said mobile terminal device to obtain the language character strings corresponding to said speech feature vectors; and matching said language character strings based on a language model and a dictionary preset in said mobile terminal device to obtain the matched text data corresponding to said speech feature vectors;
calculating the output probabilities of said speech feature vectors in said acoustic model and, based on the maximum of said output probabilities, obtaining the matched text data of the corresponding speech feature vector as the final recognition result of said speech signal.
2. The offline speech recognition method according to claim 1, characterized by further comprising: performing Chinese word segmentation on said final recognition result.
3. The offline speech recognition method according to claim 2, characterized in that performing word segmentation on said final recognition result comprises:
S501, setting n to the number of Chinese characters in the longest entry of a segmentation dictionary; wherein the matched text data corresponding to said final recognition result is a Chinese character string;
S502, taking the first n characters of said Chinese character string as the matching field and searching said segmentation dictionary;
if said segmentation dictionary contains a word corresponding to said matching field, the match succeeds; said matching field is split out as a word, stored into another character string newString, and separated from other words by a blank character;
if no word corresponding to said matching field can be found in said segmentation dictionary, the match fails and step S503 is entered;
S503, decrementing n to n-1, removing the last Chinese character from the matching field used in step S502 to form a new matching field, and searching said segmentation dictionary; if said segmentation dictionary contains a word corresponding to the new matching field, the match succeeds and said new matching field is split out as a word and stored in character string newString;
if the match fails, repeating step S503 until said new matching field is matched successfully;
S504, repeating steps S502-S503 until all characters of said Chinese character string have been matched, completing the segmentation of said Chinese character string.
4. The offline speech recognition method according to claim 2, characterized by further comprising: displaying said final recognition result in an interface two-dimensional table.
5. The offline speech recognition method according to claim 4, characterized in that displaying said final recognition result in the interface two-dimensional table comprises:
S601, determining the fields that said interface two-dimensional table needs to collect, and storing these collection fields in the character string array KeyWordString;
S602, dividing the segmented character string into multiple fields with the split function, using the blank character as delimiter, and storing them in the character string array InputString;
S603, taking a field out of the character string array InputString and comparing it item by item with the fields in KeyWordString; if there is a match, storing the subscript i of this field in array InputString into array PointKeyWord; if there is no match, performing no operation; wherein 0 ≤ i ≤ n-1, n is the number of fields in character string array InputString, and i is an integer;
S604, taking the next field out of InputString and comparing it item by item with the fields in KeyWordString; if the match succeeds, storing the subscript i+1 of this field in InputString into PointKeyWord and setting array element ValueString[i] to empty; if there is no match, setting the value of ValueString[i] to this field;
S605, repeating steps S603 and S604 until all fields in InputString have been matched;
S606, storing the matching results in a HashMap as key-value pairs, comparing the keys of the key-value pairs with the headers of the two-dimensional table, and storing the values of the key-value pairs into the interface two-dimensional table.
6. The offline speech recognition method according to claim 1, characterized in that said speech feature vectors are matched with the Viterbi algorithm.
7. The offline speech recognition method according to claim 1, characterized in that said language character strings are matched with the N-gram algorithm.
8. An implementation method for offline speech recognition based on a mobile terminal device, characterized by comprising:
collecting a project vocabulary;
training acoustic model data and language model data with an HMM model based on said project vocabulary;
establishing an acoustic model based on the trained acoustic model data and a language model based on the trained language model data, and creating a dictionary with a text editor;
storing said acoustic model, language model, and dictionary in said mobile terminal device.
9. The implementation method according to claim 8, characterized in that said acoustic model data is trained with an HMM parameter optimization algorithm based on the segmental K-means algorithm.
10. The implementation method according to claim 8, characterized in that said language model data is trained with the N-gram algorithm.
CN201310652535.2A 2013-12-05 2013-12-05 Offline speech recognition method based on mobile terminal device and implementation method Expired - Fee Related CN103810998B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310652535.2A CN103810998B (en) 2013-12-05 2013-12-05 Offline speech recognition method based on mobile terminal device and implementation method


Publications (2)

Publication Number Publication Date
CN103810998A true CN103810998A (en) 2014-05-21
CN103810998B CN103810998B (en) 2016-07-06

Family

ID=50707677

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310652535.2A Expired - Fee Related CN103810998B (en) 2013-12-05 2013-12-05 Offline speech recognition method based on mobile terminal device and implementation method

Country Status (1)

Country Link
CN (1) CN103810998B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1293428A (en) * 2000-11-10 2001-05-02 清华大学 Information check method based on speed recognition
US20070033044A1 (en) * 2005-08-03 2007-02-08 Texas Instruments, Incorporated System and method for creating generalized tied-mixture hidden Markov models for automatic speech recognition
CN102298927A (en) * 2010-06-25 2011-12-28 财团法人工业技术研究院 voice identifying system and method capable of adjusting use space of internal memory
CN102446428A (en) * 2010-09-27 2012-05-09 北京紫光优蓝机器人技术有限公司 Robot-based interactive learning system and interaction method thereof
CN102063900A (en) * 2010-11-26 2011-05-18 北京交通大学 Speech recognition method and system for overcoming confusing pronunciation

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
He Guobin et al., "Research on a probabilistic Chinese word segmentation algorithm based on maximum matching", Computer Engineering (《计算机工程》), vol. 36, no. 5, 31 March 2010 (2010-03-31), pages 173-175 *
Ni Chongjia et al., "Research progress of large-vocabulary continuous speech recognition systems for Chinese", Journal of Chinese Information Processing (《中文信息学报》), vol. 23, no. 1, 31 January 2009 (2009-01-31) *
Liu Mingkuan et al., "Syllable confusion dictionaries and their application in Chinese accent adaptation", Acta Acustica (《声学学报》), vol. 27, no. 1, 31 January 2002 (2002-01-31), pages 53-58 *
Xu Likui et al., "Comparison of improved HMM training algorithms for speech recognition", Computer CD Software and Applications (《计算机光盘软件与应用》), no. 23, 31 December 2012 (2012-12-31), pages 30-32 *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104754364A (en) * 2015-03-30 2015-07-01 合一信息技术(北京)有限公司 Video advertisement voice interaction system and method
US10170122B2 (en) 2016-02-22 2019-01-01 Asustek Computer Inc. Speech recognition method, electronic device and speech recognition system
CN106057196B (en) * 2016-07-08 2019-06-11 成都之达科技有限公司 Vehicle voice data parses recognition methods
CN106057196A (en) * 2016-07-08 2016-10-26 成都之达科技有限公司 Vehicular voice data analysis identification method
CN106356054A (en) * 2016-11-23 2017-01-25 广西大学 Method and system for collecting information of agricultural products based on voice recognition
CN107145509A (en) * 2017-03-28 2017-09-08 深圳市元征科技股份有限公司 A kind of information search method and its equipment
CN107145509B (en) * 2017-03-28 2020-11-13 深圳市元征科技股份有限公司 Information searching method and equipment thereof
WO2019079962A1 (en) * 2017-10-24 2019-05-02 Beijing Didi Infinity Technology And Development Co., Ltd. System and method for speech recognition with decoupling awakening phrase
CN110809796A (en) * 2017-10-24 2020-02-18 北京嘀嘀无限科技发展有限公司 Speech recognition system and method with decoupled wake phrases
US10789946B2 (en) 2017-10-24 2020-09-29 Beijing Didi Infinity Technology And Development Co., Ltd. System and method for speech recognition with decoupling awakening phrase
CN109726554A (en) * 2017-10-30 2019-05-07 武汉安天信息技术有限责任公司 A kind of detection method of rogue program, device and related application
CN109726554B (en) * 2017-10-30 2021-05-18 武汉安天信息技术有限责任公司 Malicious program detection method and device
CN111369966A (en) * 2018-12-06 2020-07-03 阿里巴巴集团控股有限公司 Method and device for personalized speech synthesis
CN109817226A (en) * 2019-03-29 2019-05-28 四川虹美智能科技有限公司 A kind of offline audio recognition method and device
CN110111774A (en) * 2019-05-13 2019-08-09 广西电网有限责任公司南宁供电局 Robot voice recognition methods and device
CN110930985A (en) * 2019-12-05 2020-03-27 携程计算机技术(上海)有限公司 Telephone speech recognition model, method, system, device and medium
CN110930985B (en) * 2019-12-05 2024-02-06 携程计算机技术(上海)有限公司 Telephone voice recognition model, method, system, equipment and medium
CN111785275A (en) * 2020-06-30 2020-10-16 北京捷通华声科技股份有限公司 Voice recognition method and device
CN112581954A (en) * 2020-12-01 2021-03-30 杭州九阳小家电有限公司 High-matching voice interaction method and intelligent equipment
CN112581954B (en) * 2020-12-01 2023-08-04 杭州九阳小家电有限公司 High-matching voice interaction method and intelligent device
WO2022134025A1 (en) * 2020-12-25 2022-06-30 京东方科技集团股份有限公司 Offline speech recognition method and apparatus, electronic device and readable storage medium
CN113539268A (en) * 2021-01-29 2021-10-22 南京迪港科技有限责任公司 End-to-end voice-to-text rare word optimization method

Also Published As

Publication number Publication date
CN103810998B (en) 2016-07-06

Similar Documents

Publication Publication Date Title
CN103810998B (en) Offline speech recognition method based on mobile terminal device and implementation method
CN110717031B (en) Intelligent conference summary generation method and system
CN110364171B (en) Voice recognition method, voice recognition system and storage medium
CN107818164A (en) A kind of intelligent answer method and its system
CN109063159B (en) Entity relation extraction method based on neural network
CN110717018A (en) Industrial equipment fault maintenance question-answering system based on knowledge graph
CN109460459B (en) Log learning-based dialogue system automatic optimization method
CN114116994A (en) Welcome robot dialogue method
CN103677729A (en) Voice input method and system
CN110019741B (en) Question-answering system answer matching method, device, equipment and readable storage medium
CN109920415A (en) Man-machine interrogation method, apparatus, equipment and storage medium based on speech recognition
CN111292751B (en) Semantic analysis method and device, voice interaction method and device, and electronic equipment
CN109377981B (en) Phoneme alignment method and device
CN103309926A (en) Chinese and English-named entity identification method and system based on conditional random field (CRF)
CN102176310A (en) Speech recognition system with huge vocabulary
CN109949799B (en) Semantic parsing method and system
CN110910283A (en) Method, device, equipment and storage medium for generating legal document
CN112925945A (en) Conference summary generation method, device, equipment and storage medium
CN102236639A (en) System and method for updating language model
CN111695358B (en) Method and device for generating word vector, computer storage medium and electronic equipment
CN110119510A (en) A kind of Relation extraction method and device based on transmitting dependence and structural auxiliary word
CN102999533A (en) Textspeak identification method and system
CN110196963A (en) Model generation, the method for semantics recognition, system, equipment and storage medium
CN111192572A (en) Semantic recognition method, device and system
CN113312922A (en) Improved chapter-level triple information extraction method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160706

Termination date: 20161205