CN103811000A - Voice recognition system and voice recognition method - Google Patents

Voice recognition system and voice recognition method

Info

Publication number
CN103811000A
Authority
CN
China
Prior art keywords
audio
grammar
database
information
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410062780.2A
Other languages
Chinese (zh)
Inventor
蔡中军
贾春晖
周京蕙
王翀
郑潜
余代员
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Shenzhen Co Ltd
Original Assignee
China Mobile Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Shenzhen Co Ltd filed Critical China Mobile Shenzhen Co Ltd
Priority to CN201410062780.2A priority Critical patent/CN103811000A/en
Publication of CN103811000A publication Critical patent/CN103811000A/en
Pending legal-status Critical Current

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The invention relates to a voice recognition system and a voice recognition method. The voice recognition method includes the following steps: S1, acquiring audio information to be recognized and region information corresponding to the audio information; S2, calling a speech database and a grammar database corresponding to the region information, and calling the corresponding grammar file in the grammar database; and S3, recognizing the audio information according to the grammar file and the speech database. The voice recognition system and the voice recognition method improve the recognition rate for multiple timbres (regional accents) and large vocabularies.

Description

Speech recognition system and method
Technical field
The present invention relates to the field of speech recognition, and in particular to a speech recognition system and method.
Background technology
Existing speech recognition methods mainly include dynamic time warping (DTW), vector quantization (VQ), hidden Markov models (HMM) and artificial neural networks (ANN).
Dynamic time warping (DTW) is an early pattern matching and model training technique. By applying dynamic programming, it successfully solves the difficult problem of comparing speech feature parameter sequences of unequal duration, and it achieves good performance in isolated-word speech recognition.
Vector quantization (VQ) extracts feature vectors from the training speech to obtain a feature vector set and generates a codebook with the LBG algorithm. During recognition, a feature vector sequence is extracted from the test speech and matched against each codebook; the average quantization error is computed for each codebook, and the codebook with the minimum average quantization error is selected as the recognized speech.
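The following minimal sketch illustrates the codebook-matching step just described; it is added here only for illustration, is not part of the claimed invention, and assumes array shapes and function names of its own.

```python
import numpy as np

def average_quantization_error(features: np.ndarray, codebook: np.ndarray) -> float:
    """Mean distance from each feature vector (features: (T, D)) to its nearest
    codeword in the codebook (codebook: (K, D))."""
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    return float(dists.min(axis=1).mean())

def classify_by_codebook(features: np.ndarray, codebooks: dict) -> str:
    """Select the label of the codebook with the minimum average quantization error."""
    return min(codebooks, key=lambda label: average_quantization_error(features, codebooks[label]))
```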
A hidden Markov model (HMM) is a parametric representation of the time-varying characteristics of a speech signal. It describes the statistical properties of the signal through two interrelated stochastic processes: one is a hidden (unobservable) Markov chain with a finite number of states, and the other is the observable stochastic process of the observation vectors associated with each state of the Markov chain. The characteristics of the hidden Markov chain are revealed through the observable signal characteristics. In this way, the features of each segment of the time-varying speech signal are described by the stochastic process of the observation symbols of the corresponding state, while the variation of the signal over time is described by the transition probabilities of the hidden Markov chain. The model parameters comprise the HMM topology, the state transition probabilities and a set of random functions describing the statistical properties of the observation symbols. According to the characteristics of the random functions, HMMs can be divided into discrete, continuous and semi-continuous hidden Markov models.
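For concreteness, the likelihood computation of a discrete HMM with the parameters named above (transition probabilities, observation probabilities, initial distribution) can be sketched with the standard forward algorithm; this is a textbook routine given only as background, not the patent's implementation.

```python
import numpy as np

def forward_likelihood(A: np.ndarray, B: np.ndarray, pi: np.ndarray, obs: list) -> float:
    """Forward-algorithm likelihood P(obs | model) for a discrete HMM.
    A: (N, N) state transition probabilities; B: (N, M) observation symbol
    probabilities per state; pi: (N,) initial state distribution;
    obs: sequence of observation symbol indices."""
    alpha = pi * B[:, obs[0]]            # initialize with the first observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # propagate through the hidden Markov chain
    return float(alpha.sum())
```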
The application of artificial neural networks to speech recognition is another current research focus. An ANN is essentially an adaptive nonlinear dynamical system that simulates the principles of human neural activity and has capabilities of self-learning, association, comparison, reasoning and abstraction.
The mainstream speech recognition methods above all have shortcomings. The main one is that when the same vocabulary is spoken with the accents of different regions, the timbre changes to some extent, which can greatly reduce the speech recognition rate.
Summary of the invention
To overcome the defect in existing speech recognition technology that the recognition rate decreases when the timbre changes, a speech recognition system and method are provided.
The technical solution adopted by the present invention is to provide a speech recognition method comprising the following steps (an illustrative sketch of the flow follows the list):
S1: collecting audio information to be recognized and region information corresponding to the audio information;
S2: calling, according to the region information, the speech database and the grammar database corresponding to the region information, and calling the corresponding grammar file in the grammar database;
S3: recognizing the audio information according to the grammar file and the speech database.
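The sketch below shows one hypothetical end-to-end arrangement of steps S1-S3; the callables are placeholders supplied by the caller and are not an API defined by the invention.

```python
def recognize_audio(server_number: str, collect_audio, lookup_region,
                    load_speech_database, load_grammar_file, recognize):
    """Hypothetical S1-S3 flow; every helper is a caller-supplied placeholder."""
    audio = collect_audio(server_number)          # S1: audio information to be recognized
    region = lookup_region(server_number)         # S1: region information for that audio
    speech_db = load_speech_database(region)      # S2: speech database for the region
    grammar = load_grammar_file(region)           # S2: grammar file from the region's grammar database
    return recognize(audio, speech_db, grammar)   # S3: recognition according to grammar file and speech database
```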
In the speech recognition method provided by the invention, step S1 further comprises: collecting business category information corresponding to the audio information;
and calling the corresponding grammar file in the grammar database in step S2 further comprises: calling, in the grammar database and according to the business category information, the grammar file corresponding to the business category information.
In the speech recognition method provided by the invention, a step S0 is further included before step S1: establishing multiple speech databases and multiple grammar databases according to different regions, and generating and saving, in each grammar database, a grammar file for each corresponding business category.
In the speech recognition method provided by the invention, collecting the region information corresponding to the audio information in step S1 comprises: querying the number of the server that sends the audio information, and querying and extracting, according to the number, the region information corresponding to the audio information.
In the speech recognition method provided by the invention, step S3 further comprises: setting a keyword or word; starting timing when recognition of the audio information begins; stopping timing when the keyword or word is recognized; and outputting the recognition time.
The present invention also provides a speech recognition system, comprising:
an acquisition module, comprising a first collecting unit for collecting audio information to be recognized and a second collecting unit for collecting region information corresponding to the audio information;
a scheduler module, for selecting and calling, according to the region information, the speech database and the grammar database corresponding to the region, and calling the corresponding grammar file in the grammar database;
an identification module, for recognizing the audio information according to the grammar file and the speech database.
In the speech recognition system provided by the invention, the acquisition module further comprises a third collecting unit for collecting business category information corresponding to the audio information, and the scheduler module is also used for calling, in the grammar database and according to the business category information, the grammar file corresponding to the business category information.
In the speech recognition system provided by the invention, the speech recognition system further comprises a memory module for storing the multiple speech databases and the multiple grammar databases established according to different regions, a grammar file being generated in each grammar database for each corresponding business category.
In the speech recognition system provided by the invention, the second collecting unit comprises a first inquiry subunit for querying the number of the server that sends the audio information and a first extraction subunit for querying and extracting, according to the number, the region information corresponding to the audio information.
In the speech recognition system provided by the invention, a timing module and a setting module are further included. The setting module is used for setting a keyword or word; the timing module starts timing when the identification module begins to recognize the audio information, and stops timing and outputs the recognition time when the identification module recognizes the keyword or word.
Compared with the prior art, the speech recognition system and method provided by the invention have the following beneficial effects. When recognizing the audio information to be recognized, the speech database and the grammar database corresponding to the region information of the audio information are called, which prevents the recognition rate for the same vocabulary from dropping because of different regional accents. In addition, because the business categories of different regions differ, calling the grammar database corresponding to the business situation of each region improves the efficiency of speech recognition. The invention therefore has the beneficial effect of improving the recognition rate across timbres.
Accompanying drawing explanation
The invention is further described below with reference to the accompanying drawings and embodiments, in which:
Fig. 1 is a schematic block diagram of the speech recognition system in the first embodiment of the invention;
Fig. 2 is a schematic block diagram of the second collecting unit in the embodiment shown in Fig. 1;
Fig. 3 is a schematic block diagram of the speech recognition system in the second embodiment of the invention;
Fig. 4 is a schematic block diagram of the speech recognition system in the third embodiment of the invention;
Fig. 5 is a schematic block diagram of the speech recognition system in the fifth embodiment of the invention;
Fig. 6 is a flow chart of the speech recognition method in the first embodiment of the invention;
Fig. 7 is a flow chart of the speech recognition method in the second embodiment of the invention;
Fig. 8 is a flow chart of the speech recognition method in the third embodiment of the invention.
Embodiment
To overcome the defect in the prior art that a change of timbre greatly reduces the speech recognition rate, the innovation of the present invention is to provide a dedicated speech database and grammar database for the different timbres of different regions, so as to improve the speech recognition rate.
To make the objectives, technical solutions and advantages of the present invention clearer, the invention is further elaborated below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the invention, not to limit it.
Fig. 1 shows the speech recognition system in the first embodiment of the invention. It is mainly used for service monitoring: a monitoring server automatically imitates a real user sending instructions to an operator's server and recognizes the audio information with which the operator's server replies, in order to judge whether the service level meets expectations. The speech recognition system comprises an acquisition module 1, a scheduler module 2 and an identification module 3. The acquisition module 1 comprises a first collecting unit 11 for collecting the audio information to be recognized and a second collecting unit 12 for collecting the region information corresponding to this audio information. The scheduler module 2 selects and calls, according to the region information, the speech database and the grammar database corresponding to the region, and calls the corresponding grammar file in the grammar database. The identification module 3 recognizes the audio information according to the called grammar file and speech database and outputs the recognition result. The acquisition module 1, the scheduler module 2 and the identification module 3 are connected in sequence.
The speech recognition system is realized by installing a program on the monitoring server. The CPU of the monitoring server executes the software program implementing these functions and thereby realizes the functions of the acquisition module 1, the scheduler module 2 and the identification module 3.
The monitoring server simulates a real user sending an instruction to the operator's server, and the operator's server automatically replies with audio information. The first collecting unit 11 of the acquisition module 1 installed on the monitoring server collects this audio information to be recognized over the communication link. The second collecting unit 12 of the acquisition module 1 obtains the region information corresponding to this audio information and sends the collected region information to the scheduler module 2. The scheduler module 2 calls, according to this region information, the speech database and the grammar database corresponding to the region information, and calls the corresponding grammar file in the grammar database. The identification module 3 recognizes the audio information to be recognized according to the called speech database and grammar file, and outputs the recognition result.
As shown in Fig. 2, the second collecting unit 12 may comprise a first inquiry subunit for querying the number of the operator's server and a first extraction subunit for querying and extracting, according to this number, the region information of the operator's server. The scheduler module 2 then schedules the speech database and the grammar database corresponding to this region information.
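One hypothetical form of such a number-to-region lookup is sketched below; the prefix table is invented purely for illustration and is not taken from the patent.

```python
# Hypothetical lookup: map the prefix of the operator server's number to a region.
REGION_BY_PREFIX = {"0755": "Guangdong", "010": "Beijing", "021": "Shanghai"}

def lookup_region(server_number: str, default: str = "unknown") -> str:
    """Return the region whose prefix matches the queried server number."""
    for prefix, region in REGION_BY_PREFIX.items():
        if server_number.startswith(prefix):
            return region
    return default
```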
Understandably, the second collecting unit 12 may also obtain the region information corresponding to the audio information to be recognized by other means.
In this embodiment, the identification module 3 uses a template-matching HMM method for speech recognition. The identification module 3 comprises a feature extraction unit, an acoustic recognition unit and a speech recognition unit. The feature extraction unit performs feature extraction on the speech waveform corresponding to the collected audio information to obtain acoustic speech features. Traditional speech feature extraction algorithms may be used, for example extraction of MFCCs (Mel-frequency cepstral coefficients), LPC (linear predictive coding coefficients), speech energy, and so on. The acoustic recognition unit compares, one by one, the feature quantities of the acoustic models in the speech database called by the scheduler module 2 with the feature quantities extracted by the feature extraction unit, to obtain the phone string corresponding to this audio information.
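As an illustration of the feature-extraction step, a minimal sketch using the librosa toolkit is given below; the patent does not prescribe a specific toolkit, so the choice of library and parameters here is an assumption.

```python
import librosa
import numpy as np

def extract_features(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Extract an MFCC feature vector sequence from a speech waveform file."""
    y, sr = librosa.load(wav_path, sr=None)                   # speech waveform and its sampling rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # Mel-frequency cepstral coefficients
    return mfcc.T                                             # shape (frames, n_mfcc)
```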
An acoustic model is data obtained by modeling what kind of feature quantities speech can produce. Because each region pronounces the same word differently, the acoustic models also differ. Multiple speech databases are therefore established according to the differences between regions, and each speech database contains acoustic models established according to the pronunciation habits of its region, which improves speech recognition accuracy and recognition speed. The speech recognition unit recognizes the phone string according to the grammar file called by the scheduler module 2 and outputs the resulting word string. Because the business situation of each region differs, establishing the corresponding grammar database according to the business situation of each region can improve the speech recognition rate.
For the same business in the same region, the speech vocabulary is essentially fixed. Therefore, a small-scope grammar file can be written for each business. A grammar file is a data model that combines certain words and phrases according to specific rules and constrains the output of the speech recognition unit. Therefore, as shown in Fig. 3, in the second embodiment, on the basis of the first embodiment, the acquisition module 1 may further comprise a third collecting unit 13 for collecting the business category information corresponding to the audio information. The scheduler module 2 first calls, according to the region information, the grammar database corresponding to the region information, and then calls, in this grammar database and according to the business category information, the grammar file corresponding to this business category. The grammar file generated for a business category mainly contains the key vocabulary and phrases related to this business, which improves recognition efficiency.
For example, region B has four kinds of business: b1, b2, b3 and b4. A grammar database is established for region B according to these four kinds of business and contains four corresponding grammar files. Each grammar file contains only the common grammar vocabulary related to its kind of business and the phone strings corresponding to this common grammar vocabulary. When the recognition unit recognizes such a phone string, it outputs the word string corresponding to this phone string as the recognition result.
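The sketch below shows one hypothetical in-memory layout of such a grammar database for region B; the business labels, vocabulary entries and phone strings are invented placeholders, not data from the patent.

```python
# Hypothetical grammar database for region B: one small grammar file per business
# type (b1..b4), each mapping common vocabulary to an illustrative phone string.
GRAMMAR_DB_REGION_B = {
    "b1": {"check balance": "ch eh k . b ae l ax n s"},
    "b2": {"data plan": "d ey t ax . p l ae n"},
    "b3": {"broadband repair": "b r ao d b ae n d . r ix p eh r"},
    "b4": {"roaming service": "r ow m ih ng . s er v ih s"},
}

def load_grammar_file(grammar_db: dict, business_type: str) -> dict:
    """Return the grammar file (vocabulary -> phone string) for one business type."""
    return grammar_db[business_type]

def match_phone_string(grammar_file: dict, phone_string: str):
    """Output the word string whose phone string matches the recognized phone string."""
    for word_string, phones in grammar_file.items():
        if phones == phone_string:
            return word_string
    return None
```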
In this embodiment, the detailed process by which the scheduler module 2 schedules the speech database and the grammar database is as follows. Obtain the province information corresponding to the audio information. Obtain the current speech database information: the region information corresponding to the current speech database, the name of the current speech database and the path of the current speech database. Obtain the information of the speech database to be used for recognition: its province information, its name and its path. Compare the current speech database information with the information of the speech database to be used. If they are inconsistent, switch the backup speech database of the corresponding province with the current speech database and return the grammar file of the province corresponding to the current speech database. If they are consistent, directly return the grammar file of the province corresponding to the current speech database.
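A minimal sketch of this comparison-and-switch logic follows; the dictionary keys, the rename-based switching and the returned path are assumptions used only to illustrate the described flow, not the patented scheduler itself.

```python
import os

def schedule_database(current_db: dict, target_db: dict, current_path: str) -> str:
    """Compare the province of the currently loaded speech database with the
    province required by the audio to be recognized; switch databases by renaming
    when they differ, then return the grammar database path for that province.
    current_db/target_db carry 'province', 'backup_path' and 'grammar_db_path'."""
    if current_db["province"] != target_db["province"]:
        # rename the current database to the backup database of its own province
        os.rename(current_path, current_db["backup_path"])
        # rename the target province's backup database to become the current database
        os.rename(target_db["backup_path"], current_path)
    # return the grammar database of the province now corresponding to the current database
    return target_db["grammar_db_path"]
```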
As shown in Fig. 4, in the third embodiment, on the basis of the second embodiment, the speech recognition system provided by the invention further comprises a memory module 4. The memory module 4 stores the multiple speech databases and the multiple grammar databases generated according to different regions. The scheduler module 2 calls the speech database and the grammar database from this memory module 4.
In the fourth embodiment, on the basis of the third embodiment, the speech recognition system provided by the invention further comprises a repair module. When the scheduler module 2 fails to schedule a speech database, the speech database is repaired automatically. In practice, there are many reasons why the scheduler module 2 may fail to schedule a speech database. Speech database switching can be carried out by renaming. If a rename is wrong, the automatic repair corrects the wrongly renamed speech database; if the speech database has been deleted manually, the automatic repair procedure finds the backup speech database under the backup path and copies it to the existing speech database path. The speech database may also be in an occupied state, in which case the scheduler module 2 waits until the occupied speech database is released before calling it again. It is also possible that the memory module 4 contains no speech database corresponding to the region, in which case a notification that this speech database is missing is sent. The repair module thus has the function of automatically repairing speech databases. Each speech database has an identical backup speech database; when calling the speech database of a region fails, the system switches to calling the corresponding backup speech database.
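The sketch below illustrates, under assumed paths and an assumed caller-supplied occupancy check, the automatic-repair behaviour described above (restore from the backup speech database, wait while the database is occupied, report a missing database); it is not the patented repair module itself.

```python
import os
import shutil
import time

def repair_speech_database(db_path: str, backup_path: str, is_occupied, retries: int = 3) -> bool:
    """Hypothetical automatic repair: restore a missing (or wrongly renamed)
    speech database from its backup copy, or wait while it is still occupied.
    `is_occupied` is a caller-supplied check; all names are illustrative."""
    if not os.path.exists(db_path):
        if os.path.exists(backup_path):
            shutil.copytree(backup_path, db_path)   # copy the backup speech database into place
            return True
        print("no speech database or backup available for:", db_path)  # report the missing database
        return False
    for _ in range(retries):                        # database exists but may be occupied
        if not is_occupied(db_path):
            return True
        time.sleep(1.0)                             # wait for the occupied database to be released
    return False
```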
Understandably, on the basis of the above embodiments, in the fifth embodiment, as shown in Fig. 5, the speech recognition system provided by the invention further comprises a timing module 5 and a setting module 6. The setting module 6 can set the recognition sensitivity, the language to be recognized, and so on. The setting module 6 can also set a keyword or word. The timing module 5 starts timing when the identification module 3 begins recognition, stops timing when the identification module 3 recognizes the keyword or word preset by the setting module 6, and outputs the time the identification module 3 used to recognize the keyword or word. This time can be used to judge whether the recognition efficiency of the identification module 3 meets expectations.
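A minimal sketch of this keyword-timing behaviour is given below; it assumes, purely for illustration, a recognizer that yields recognized words one at a time.

```python
import time

def timed_recognition(recognized_words, keywords) -> float | None:
    """Start the clock when recognition begins and stop it when a preset keyword
    or word is recognized, returning the elapsed recognition time.
    `recognized_words` is assumed to be an iterable of recognized words."""
    start = time.monotonic()
    for word in recognized_words:
        if word in keywords:
            return time.monotonic() - start   # recognition time used to judge efficiency
    return None                               # keyword never recognized
```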
As shown in Fig. 6, the present invention also provides a speech recognition method. In the first embodiment, this method comprises the following steps:
S1: collect the audio information to be recognized and the region information corresponding to the audio information. In this step, the monitoring server simulates a real person entering an instruction and sends it to the operator's server, and the operator's server automatically replies with audio information according to this instruction. The first collecting unit 11 of the acquisition module 1 installed in the monitoring server collects this audio information over the communication link. The second collecting unit 12 of the acquisition module 1 can query the number of the operator's server that sends the audio information, and then query and extract, according to this number, the region information corresponding to the operator's server, that is, the region information corresponding to the audio information.
S2: select, according to the region information, the speech database and the grammar database corresponding to the region, and call the grammar file in the grammar database. In this step, the scheduler module 2 calls, according to the region information collected by the second collecting unit 12, the speech database and the grammar database corresponding to the region information in the memory module 4, and further schedules the grammar file in this grammar database.
S3: the identification module 3 recognizes the audio information according to the scheduled grammar file and speech database, and outputs the recognition result.
On the basis of the first embodiment of this speech recognition method, as shown in Fig. 7, in the second embodiment step S1 may further include the following step: the third collecting unit 13 of the acquisition module 1 collects the business category information corresponding to the audio information. Correspondingly, in step S2, calling the grammar file in the grammar database specifically means that the scheduler module 2 calls, in the grammar database and according to the business category information, the grammar file corresponding to this business category information. In this step, the recognition sensitivity of the identification module may be set in advance according to the business category to be recognized. A keyword or word may also be set; the timing module 5 starts timing when the identification module 3 begins recognition, stops timing when the identification module 3 recognizes the keyword or word preset by the setting module 6, and outputs the time the identification module 3 used to recognize the keyword or word. This time can be used to judge whether the recognition efficiency of the identification module 3 meets expectations.
As shown in Fig. 8, on the basis of the second embodiment, in the third embodiment this speech recognition method may further comprise step S0: establish multiple speech databases and multiple grammar databases according to different regions and save them in the memory module 4, multiple grammar files being generated in each grammar database according to the different business categories.
In the fourth embodiment, the region information is province information, and the speech recognition method comprises the following steps:
S1: the acquisition module 1 collects the path and the title of the audio information to be recognized, the province information corresponding to this audio information, the business information corresponding to this audio information and the keywords related to this business, and transmits this collected information to the scheduler module 2.
S2: the scheduler module 2 reads the province, the title and the path of the current speech database, and compares the region information of the current speech database with the province information corresponding to the audio information to be recognized.
If the scheduler module 2 judges that the province information corresponding to the current speech database is the same as the province information corresponding to the audio information to be recognized, no speech database switching is needed, and the path and the title of the grammar database corresponding to the province information of the audio information to be recognized are returned directly.
If the scheduler module 2 judges that the province information of the current speech database differs from the incoming province, the scheduler module 2 performs a speech database switch: it renames the current speech database to the backup speech database of its corresponding province, renames the speech database corresponding to the audio information to be recognized to the current speech database, and finally returns the grammar database of the province corresponding to the current speech database. The grammar file in this grammar database is then called according to the business category information.
S3: if the scheduler module 2 fails to schedule the speech database and the grammar database, the recognition process exits directly, and automatic retrieval and automatic repair of the speech database are carried out. If the scheduling succeeds, the identification module 3 is configured with parameters such as the required recognition sensitivity, accuracy and recognition language.
A recognition context is created for the recognition process of the identification module 3; it records the information produced by the recognition process, including success information, error information, exception information and so on.
A recognition audio stream is created in the identification module 3, the format parameters of the audio stream, such as the sampling frequency in hertz, are set, and the audio information to be recognized is bound to the recognition audio stream. The identification module 3 is activated, loads the speech database and the grammar file, imports the recognition audio stream and starts the recognition process. After the identification module 3 is activated, it waits for the dedicated Windows user-defined recognition message to be triggered. When this message is triggered, if its parameter indicates the end of recognition, the identification module exits the recognition process; if the parameter carries recognized content, the recognized content is extracted and saved in the recognition result; if the parameter indicates a recognition error, it is handled according to the concrete error type.
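The branching on the parameter of the user-defined recognition message can be sketched as follows; this plain Python model only illustrates the dispatch logic and does not use the actual Windows message API.

```python
# Illustrative message kinds; the constants and payload shapes are assumptions.
END_OF_RECOGNITION, RECOGNIZED_CONTENT, RECOGNITION_ERROR = range(3)

def handle_recognition_message(kind: int, payload, results: list) -> bool:
    """Return False when the recognition process should exit."""
    if kind == END_OF_RECOGNITION:
        return False                             # end of recognition: exit the recognition process
    if kind == RECOGNIZED_CONTENT:
        results.append(payload)                  # extract and save the recognized content
    elif kind == RECOGNITION_ERROR:
        print("recognition error:", payload)     # handle according to the concrete error type
    return True
```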
In this embodiment, other user-defined prompt messages can also be created, ensuring that they do not conflict with existing prompt messages, and the message-handling function for specific recognition messages and the recognition message type of the recognizer can be set.
In summary, because the speech recognition system and method provided by the invention provide a speech database and a grammar database for each region, calling the corresponding speech database and grammar database during recognition improves the speech recognition rate and recognition speed. In addition, a dedicated grammar file is provided in the grammar database for each business; calling the corresponding grammar file according to the business category being recognized can further improve the speech recognition rate and recognition speed.
It should be understood that embodiments of the invention have been described above with reference to the accompanying drawings, but the invention is not limited to the above embodiments. The above embodiments are only illustrative rather than restrictive. Under the inspiration of the present invention, and without departing from the spirit of the invention and the scope protected by the claims, those of ordinary skill in the art can also make many other forms, all of which fall within the protection of the invention.

Claims (10)

1. A speech recognition method, characterized by comprising the following steps:
S1: collecting audio information to be recognized and region information corresponding to the audio information;
S2: calling, according to the region information, the speech database and the grammar database corresponding to the region information, and calling the corresponding grammar file in the grammar database;
S3: recognizing the audio information according to the grammar file and the speech database.
2. The speech recognition method according to claim 1, characterized in that step S1 further comprises: collecting business category information corresponding to the audio information;
and calling the corresponding grammar file in the grammar database in step S2 further comprises: calling, in the grammar database and according to the business category information, the grammar file corresponding to the business category information.
3. The speech recognition method according to claim 2, characterized in that a step S0 is further included before step S1: establishing multiple speech databases and multiple grammar databases according to different regions, and generating and saving, in each grammar database, a grammar file for each corresponding business category.
4. The speech recognition method according to any one of claims 1 to 3, characterized in that collecting the region information corresponding to the audio information in step S1 comprises: querying the number of the server that sends the audio information, and querying and extracting, according to the number, the region information corresponding to the audio information.
5. The speech recognition method according to any one of claims 1 to 3, characterized in that step S3 further comprises: setting a keyword or word; starting timing when recognition of the audio information begins; stopping timing when the keyword or word is recognized; and outputting the recognition time.
6. A speech recognition system, characterized by comprising:
an acquisition module (1), comprising a first collecting unit (11) for collecting audio information to be recognized and a second collecting unit (12) for collecting region information corresponding to the audio information;
a scheduler module (2), for selecting and calling, according to the region information, the speech database and the grammar database corresponding to the region, and calling the corresponding grammar file in the grammar database;
an identification module (3), for recognizing the audio information according to the grammar file and the speech database.
7. The speech recognition system according to claim 6, characterized in that the acquisition module (1) further comprises a third collecting unit (13) for collecting business category information corresponding to the audio information, and the scheduler module (2) is also used for calling, in the grammar database and according to the business category information, the grammar file corresponding to the business category information.
8. The speech recognition system according to claim 7, characterized by further comprising a memory module (4) for storing the multiple speech databases and the multiple grammar databases established according to different regions, a grammar file being generated in each grammar database for each corresponding business category.
9. The speech recognition system according to any one of claims 6 to 8, characterized in that the second collecting unit (12) comprises a first inquiry subunit for querying the number of the server that sends the audio information and a first extraction subunit for querying and extracting, according to the number, the region information corresponding to the audio information.
10. The speech recognition system according to any one of claims 6 to 8, characterized by further comprising a timing module (5) and a setting module (6), the setting module (6) being used for setting a keyword or word, and the timing module (5) being used for starting timing when the identification module (3) begins to recognize the audio information, stopping timing when the keyword or word is recognized, and outputting the recognition time.
CN201410062780.2A 2014-02-24 2014-02-24 Voice recognition system and voice recognition method Pending CN103811000A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410062780.2A CN103811000A (en) 2014-02-24 2014-02-24 Voice recognition system and voice recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410062780.2A CN103811000A (en) 2014-02-24 2014-02-24 Voice recognition system and voice recognition method

Publications (1)

Publication Number Publication Date
CN103811000A true CN103811000A (en) 2014-05-21

Family

ID=50707679

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410062780.2A Pending CN103811000A (en) 2014-02-24 2014-02-24 Voice recognition system and voice recognition method

Country Status (1)

Country Link
CN (1) CN103811000A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1412741A (en) * 2002-12-13 2003-04-23 郑方 Chinese speech identification method with dialect background
CN101026644A (en) * 2006-02-22 2007-08-29 华为技术有限公司 Communication terminal and method for displaying mobile phone calling initiation place information
US8504370B2 (en) * 2006-10-23 2013-08-06 Sungkyunkwan University Foundation For Corporate Collaboration User-initiative voice service system and method
CN101329868A (en) * 2008-07-31 2008-12-24 林超 Speech recognition optimizing system aiming at locale language use preference and method thereof
CN102968987A (en) * 2012-11-19 2013-03-13 百度在线网络技术(北京)有限公司 Speech recognition method and system
CN103578467A (en) * 2013-10-18 2014-02-12 威盛电子股份有限公司 Acoustic model building method, voice recognition method and electronic device

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105206263A (en) * 2015-08-11 2015-12-30 东莞市凡豆信息科技有限公司 Speech and meaning recognition method based on dynamic dictionary
CN105225665A (en) * 2015-10-15 2016-01-06 桂林电子科技大学 A kind of audio recognition method and speech recognition equipment
CN106683662A (en) * 2015-11-10 2017-05-17 中国电信股份有限公司 Speech recognition method and device
CN106128462A (en) * 2016-06-21 2016-11-16 东莞酷派软件技术有限公司 Audio recognition method and system
CN106328146A (en) * 2016-08-22 2017-01-11 广东小天才科技有限公司 Video subtitle generating method and device
CN107958666A (en) * 2017-05-11 2018-04-24 小蚁科技(香港)有限公司 Method for the constant speech recognition of accent
CN108648749A (en) * 2018-05-08 2018-10-12 上海嘉奥信息科技发展有限公司 Medical speech recognition construction method and system based on voice activated control and VR
CN112530440A (en) * 2021-02-08 2021-03-19 浙江浙达能源科技有限公司 Intelligent voice recognition system for power distribution network scheduling tasks based on end-to-end model
CN112530440B (en) * 2021-02-08 2021-05-07 浙江浙达能源科技有限公司 Intelligent voice recognition system for power distribution network scheduling tasks based on end-to-end model

Similar Documents

Publication Publication Date Title
CN103811000A (en) Voice recognition system and voice recognition method
US8914294B2 (en) System and method of providing an automated data-collection in spoken dialog systems
CN104903954B (en) The speaker verification distinguished using the sub- phonetic unit based on artificial neural network and identification
CN102270450B (en) System and method of multi model adaptation and voice recognition
CN106504768B (en) Phone testing audio frequency classification method and device based on artificial intelligence
CN108364662B (en) Voice emotion recognition method and system based on paired identification tasks
CN107808659A (en) Intelligent sound signal type recognition system device
CN108447471A (en) Audio recognition method and speech recognition equipment
CN106683677A (en) Method and device for recognizing voice
CN107731233A (en) A kind of method for recognizing sound-groove based on RNN
CN109036412A (en) voice awakening method and system
CN102982811A (en) Voice endpoint detection method based on real-time decoding
CN112581963B (en) Voice intention recognition method and system
CN110600014B (en) Model training method and device, storage medium and electronic equipment
CN108877769B (en) Method and device for identifying dialect type
CN111933108A (en) Automatic testing method for intelligent voice interaction system of intelligent network terminal
CN110853616A (en) Speech synthesis method, system and storage medium based on neural network
CN106710591A (en) Voice customer service system for power terminal
CN110992959A (en) Voice recognition method and system
CN114783424A (en) Text corpus screening method, device, equipment and storage medium
CN113744727A (en) Model training method, system, terminal device and storage medium
CN114420169B (en) Emotion recognition method and device and robot
CN116189657A (en) Multi-mode voice recognition error correction method and system
CN113990288B (en) Method for automatically generating and deploying voice synthesis model by voice customer service
JP6594273B2 (en) Questioning utterance determination device, method and program thereof

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140521

RJ01 Rejection of invention patent application after publication