CN102074231A - Voice recognition method and system - Google Patents

Voice recognition method and system Download PDF

Info

Publication number
CN102074231A
CN102074231A (application CN2010106143655A / CN201010614365A)
Authority
CN
China
Prior art keywords
information
user
voice
model
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2010106143655A
Other languages
Chinese (zh)
Inventor
冯雁
杨永胜
黄石磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
VANYINDA CO Ltd
Original Assignee
VANYINDA CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by VANYINDA CO Ltd filed Critical VANYINDA CO Ltd
Priority to CN2010106143655A priority Critical patent/CN102074231A/en
Publication of CN102074231A publication Critical patent/CN102074231A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention relates to a voice recognition method comprising the following steps: acquiring voice; performing feature extraction on the acquired voice; acquiring scene information of a user and matching a grammar model or a language model according to the scene information; and performing a pattern matching algorithm according to the matched grammar model or language model to obtain a voice recognition result. The voice recognition method can improve the accuracy of voice recognition. In addition, the invention also provides a voice recognition system.

Description

Speech recognition method and speech recognition system
[Technical Field]
The present invention relates to speech recognition technology, and in particular to a speech recognition method and a speech recognition system.
[Background Art]
Speech recognition converts the vocabulary content of human speech into computer-readable input such as key presses, binary codes or character strings. A traditional speech recognition method collects speech, performs feature extraction on the collected speech (feature extraction means applying linear or nonlinear operations to the speech waveform to obtain a group of vectors), and then applies a pattern matching algorithm that converts the vectors into the pronunciation unit sequence closest to the model, which is in turn converted into a speech recognition result. However, this traditional method performs pattern matching only on the collected speech against fixed acoustic models and language models (or grammar models), so its recognition accuracy is not high.
[Summary of the Invention]
In view of this, it is necessary to provide a speech recognition method capable of improving speech recognition accuracy.
A speech recognition method comprises the following steps:
collecting speech;
performing feature extraction on the collected speech;
obtaining scene information of the user, and matching a grammar model or a language model according to the scene information;
performing a pattern matching algorithm according to the matched grammar model or language model to obtain a speech recognition result.
Preferably, the method further comprises the step of obtaining position information of the user and matching a grammar model or a language model according to the position information.
Preferably, the method further comprises the step of matching a pronunciation dictionary according to the position information and the scene information;
the step of performing a pattern matching algorithm according to the matched grammar model or language model to obtain a speech recognition result is then:
performing a pattern matching algorithm according to the matched grammar model, language model and pronunciation dictionary to obtain a speech recognition result.
Preferably, the position information is a geographic position or GPS positioning information automatically detected and provided by the user's terminal device, and the scene information is scene change data generated during the user's interaction.
Preferably, the position information is a geographic position or GPS positioning information actively provided or modified by the user, and the scene information is scene change data actively set or changed by the user.
In addition, it is also necessary to provide a speech recognition system capable of improving speech recognition accuracy.
A speech recognition system comprises a client and a server interacting with the client, wherein the client comprises:
a voice acquisition module, configured to collect speech;
a first communication module, configured to send the collected speech to the server;
and the server comprises:
a second communication module, configured to receive the speech sent by the first communication module;
a feature extraction module, configured to perform feature extraction on the speech;
a speech recognition module, configured to obtain scene information of the user, match a grammar model or a language model according to the scene information, and perform a pattern matching algorithm according to the matched grammar model or language model to obtain a speech recognition result.
Preferably, the client further comprises:
an information acquisition module, configured to obtain the user's scene information and position information;
and the first communication module is further configured to send the scene information and the position information to the server.
Preferably, the speech recognition module is further configured to obtain position information of the user and match a grammar model or a language model according to the position information; the server further comprises a database configured to store the user's position information and scene information.
Preferably, the speech recognition module is further configured to match a pronunciation dictionary according to the position information and the scene information, and to perform a pattern matching algorithm according to the matched grammar model, language model and pronunciation dictionary to obtain a speech recognition result.
Preferably, the position information is position information or GPS positioning information automatically detected and provided by the user's terminal device, and the scene information is scene change data generated during the user's interaction.
Preferably, the position information is a geographic position or GPS positioning information actively provided or modified by the user, and the scene information is scene change data actively set or changed by the user.
With the above speech recognition method and speech recognition system, a grammar model or language model is matched according to the user's scene information, so the parameters of the grammar model or language model can be changed according to the user's scene information when the pattern matching algorithm is performed. The grammar model or language model used by the pattern matching algorithm thereby adapts to the user's interaction scene, which improves the accuracy of speech recognition.
[Brief Description of the Drawings]
Fig. 1 is a flowchart of a speech recognition method in one embodiment;
Fig. 2 is a schematic structural diagram of a speech recognition system in one embodiment;
Fig. 3 is a schematic structural diagram of a speech recognition system in another embodiment.
[Detailed Description]
Fig. 1 shows the flow of a speech recognition method in one embodiment, which comprises the following steps:
Step S102: collect speech. In one embodiment, speech is input through client software installed on a mobile terminal. For example, the user clicks a key to enter the voice collection mode and then speaks, and clicks the key again to end the voice input; the client software collects the speech and can send the collected speech to a background server for processing.
Step S104: perform feature extraction on the collected speech. The collected data is a speech waveform; after feature extraction is performed on the waveform, the acoustic features of the speech are obtained. Traditional speech feature extraction algorithms can be used, for example extracting MFCC (Mel-frequency cepstral coefficients), LPC (linear predictive coding coefficients), speech energy and so on.
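For illustration only, the following minimal sketch shows what this feature extraction step could look like, assuming the open-source librosa library and a single-channel 16 kHz recording; the function name, file name and parameter values are examples and not part of the patent.

```python
# Illustrative feature extraction (MFCC + frame energy), assuming librosa;
# the file name and parameter values are examples only.
import librosa
import numpy as np

def extract_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Return per-frame acoustic features: 13 MFCCs plus log frame energy."""
    waveform, sr = librosa.load(wav_path, sr=sr)                # speech waveform
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13)   # MFCC features
    energy = librosa.feature.rms(y=waveform)                    # frame energy (RMS)
    return np.vstack([mfcc, np.log(energy + 1e-10)])            # shape (14, n_frames)
```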
Step S106: obtain the user's scene information and match a grammar model or a language model according to the scene information. The user's scene information is the scene change data generated during the user's interaction. The user runs various applications through the client software installed on the mobile terminal, and scene change data is produced while interacting with the application system, for example the query context and query results produced when querying shopping information or flight information.
An appropriate grammar model or language model is matched according to the user's scene information. For example, when the user queries company names, a grammar model in which company names have a high occurrence probability is used; when the user queries clothing-shop information, a grammar model or language model in which clothing-shop names have a high occurrence probability is used.
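As an illustration of this scene-based matching, the sketch below maps scene labels to model identifiers with a simple lookup; the scene labels, model names and fallback model are hypothetical and only show the idea of selecting a model whose vocabulary fits the current scene.

```python
# Illustrative scene-to-model lookup; scene labels and model names are hypothetical.
SCENE_TO_MODEL = {
    "company_query":  "lm_company_names",    # company names weighted up
    "clothing_query": "lm_clothing_shops",   # clothing-shop names weighted up
    "flight_query":   "lm_flight_info",
}

def match_language_model(scene_info: str, default: str = "lm_general") -> str:
    """Pick the grammar/language model whose vocabulary fits the current scene."""
    return SCENE_TO_MODEL.get(scene_info, default)
```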
Step S108: perform a pattern matching algorithm according to the matched grammar model or language model to obtain a speech recognition result. The resources needed for speech recognition include speech models, grammar models, pronunciation dictionaries and so on. According to the acoustic features obtained above, the best-matching result is found in the speech recognition resources; a traditional Viterbi algorithm can be used to perform the recognition and obtain the speech recognition result.
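The patent does not specify an implementation, but the standard Viterbi decoding it refers to can be sketched as follows for a toy discrete HMM; in a real recognizer the states, transition probabilities and emission scores would come from the matched acoustic, grammar and language models.

```python
# Toy Viterbi decoding sketch; states/probabilities are illustrative placeholders
# for what a real recognizer would derive from the matched models.
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observation sequence."""
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]  # initialization
    path = {s: [s] for s in states}
    for t in range(1, len(obs)):
        V.append({})
        new_path = {}
        for s in states:
            # pick the best predecessor state for s at time t
            prob, prev = max((V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                             for p in states)
            V[t][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(states, key=lambda s: V[-1][s])  # best final state
    return path[best]
```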
By changing the parameters of the grammar model or language model according to the user's scene information, the grammar model or language model used by the pattern matching algorithm adapts to the user's interaction scene, which improves the accuracy of speech recognition.
In one embodiment, the method further comprises: obtaining position information of the user and matching a grammar model or a language model according to the position information. The user's position information is a geographic position or GPS positioning information automatically detected and provided by the user's terminal device. Alternatively, the position information can be a geographic position or GPS positioning information actively provided or modified by the user, and the scene information can be scene change data actively set or changed by the user. For example, the user fills in a geographic position through the client software; this position is stored on the server as part of the user's personal information and is updated when the user modifies it. GPS positioning information can be obtained in real time: when the user's location changes, the user's GPS positioning information is obtained, giving the user's current position. A geographic position set by the user can also be obtained, and the grammar model or language model is matched according to the position set by the user. For example, if the user's terminal device detects that the user is currently in Beijing but the user has set Shanghai as his geographic position, the grammar model or language model is matched according to Shanghai.
Relation data between position information and grammar models or language models can be maintained on the server. After the user's position information is obtained, an appropriate grammar model or language model can be matched according to the position information. For example, if the user's position information indicates the Beijing area, the grammar model and language model dominated by Beijing-area names are matched; when the user moves from Beijing to Shanghai, the user's current position information is obtained and the grammar model and language model dominated by Shanghai-area names are matched.
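The relation data between positions and models could, for example, be as simple as the mapping sketched below; the region names, model identifiers and override rule (a user-set position taking precedence over the detected one, as in the Beijing/Shanghai example above) are illustrative assumptions.

```python
# Illustrative relation between position information and models;
# region names and model identifiers are hypothetical.
from typing import Optional

REGION_TO_MODEL = {
    "Beijing":  "lm_beijing_places",   # Beijing-area names weighted up
    "Shanghai": "lm_shanghai_places",  # Shanghai-area names weighted up
}

def match_model_by_position(detected_region: str,
                            user_set_region: Optional[str] = None) -> str:
    """A region the user set manually overrides the automatically detected one."""
    region = user_set_region or detected_region
    return REGION_TO_MODEL.get(region, "lm_general")

# Example: detected Beijing but user set Shanghai -> the Shanghai model is matched.
# match_model_by_position("Beijing", user_set_region="Shanghai") -> "lm_shanghai_places"
```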
In another embodiment, the method further comprises: matching a pronunciation dictionary according to the user's position information and scene information. In this embodiment, an appropriate grammar model, language model and pronunciation dictionary are matched according to the user's position information and scene information, and the pattern matching algorithm is then performed according to the matched grammar model, language model and pronunciation dictionary to obtain the speech recognition result.
After the pattern matching algorithm is performed, a word sequence of one or more words is obtained; among the candidate words, the words with the highest occurrence probability are chosen to form the word sequence, which is the speech recognition result. The result can be symbols, numbers or text. For example, if the collected speech is "今天" (today), the recognition result can be "今天", "jintian", "today" and so on, and the result can be further processed by the application.
Fig. 2 shows a speech recognition system in one embodiment. The system comprises a client 100 and a server 200 interacting with the client 100, wherein:
The client 100 comprises a voice acquisition module 102 and a first communication module 104. The voice acquisition module 102 is configured to collect speech; the first communication module 104 is configured to send the collected speech to the server 200. In one embodiment, the user inputs speech through application software installed on the mobile terminal: input begins after a key is clicked and stops when the key is clicked again; the voice acquisition module 102 collects the speech and sends it to the server 200 for processing through the first communication module 104.
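A minimal sketch of the client side (voice acquisition module plus first communication module) might look as follows, assuming an HTTP upload; the endpoint URL, payload fields and response format are hypothetical, since the patent does not define the transport protocol.

```python
# Hypothetical client-side upload: the URL, fields and response shape are
# assumptions; the patent only says the collected speech is sent to the server.
import requests

SERVER_URL = "http://example.com/asr"  # placeholder endpoint

def send_speech(wav_bytes: bytes, scene_info: str, position: str) -> str:
    """Send collected speech plus scene/position context; return recognized text."""
    resp = requests.post(
        SERVER_URL,
        files={"audio": ("speech.wav", wav_bytes, "audio/wav")},
        data={"scene": scene_info, "position": position},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["result"]
```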
The server 200 comprises a second communication module 202, a feature extraction module 204 and a speech recognition module 206. The second communication module 202 is configured to receive the speech sent by the first communication module 104; the feature extraction module 204 is configured to perform feature extraction on the speech; the speech recognition module 206 is configured to obtain the user's scene information, match a grammar model or a language model according to the scene information, perform a pattern matching algorithm according to the matched grammar model or language model, and obtain a speech recognition result. The user's scene information is the scene change data generated during the user's interaction.
In this embodiment, the data received by the second communication module 202 is a speech waveform, and the feature extraction module 204 performs feature extraction on the waveform to obtain the acoustic features of the speech. Traditional feature extraction algorithms can be used to extract the MFCC (Mel-frequency cepstral coefficients), LPC (linear predictive coding coefficients), speech energy and so on.
The user produces scene change data through the various application software installed on the mobile terminal, for example the query context and query results produced when querying shopping information or flight information. The speech recognition module 206 matches an appropriate grammar model or language model according to the user's scene information; for example, when the user queries company names, a grammar model or language model in which company names have a high occurrence probability is used, and when the user queries clothing-shop information, a grammar model or language model in which clothing-shop names have a high occurrence probability is used. The resources needed for speech recognition include speech models, grammar models and pronunciation dictionaries. According to the obtained acoustic features, the speech recognition module 206 finds the best-matching result in the speech recognition resources; a traditional Viterbi algorithm can be used to perform the recognition and obtain the speech recognition result.
By changing the parameters of the grammar model or language model according to the user's scene information, the grammar model or language model used by the pattern matching algorithm adapts to the user's interaction scene, which improves the accuracy of speech recognition.
Fig. 3 shows a speech recognition system in another embodiment. On the basis of the embodiment shown in Fig. 2, the client 100 further comprises an information acquisition module 106, and the server 200 further comprises a database 208, wherein:
The information acquisition module 106 is configured to obtain the user's scene information and position information. In this embodiment, the first communication module 104 sends the user's scene information and position information to the server 200. The user's scene information is the scene change data generated during the user's interaction, and the user's position information can be a geographic position or GPS positioning information automatically detected and provided by the user's terminal device. It can also be a geographic position actively provided or modified by the user; for example, the user fills in a geographic position through the client software, this position is stored in the database 208 of the server 200 as part of the user's personal information, and the database 208 is updated when the user modifies it. GPS positioning information can be obtained in real time: when the user's location changes, the user's GPS positioning information is obtained, giving the user's current position. A geographic position set by the user can also be obtained, and the grammar model and language model are matched according to the position set by the user. For example, if the user's terminal device detects that the user is currently in Beijing but the user has set Shanghai as his geographic position, the grammar model and language model are matched according to Shanghai.
The database 208 is configured to store the user's position information and scene information. In addition, the database 208 can also store the speech recognition resources, i.e. the speech models, grammar models and pronunciation dictionaries used for speech recognition.
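For illustration, the database 208 could be organized roughly as sketched below; the table and column names are assumptions intended only to show the two kinds of data it stores (per-user position/scene context and the speech recognition resources).

```python
# Illustrative sketch of the database 208 described above; table and column
# names are assumptions, not defined by the patent.
import sqlite3

conn = sqlite3.connect("asr_server.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS user_context (
    user_id    TEXT PRIMARY KEY,
    position   TEXT,          -- detected or user-set geographic position
    scene_info TEXT           -- latest scene change data from the client
);
CREATE TABLE IF NOT EXISTS recognition_resources (
    resource_id TEXT PRIMARY KEY,
    kind        TEXT,         -- 'speech', 'grammar', 'language', 'pron_dict'
    region      TEXT,         -- e.g. Beijing, Shanghai
    scene       TEXT,         -- e.g. company_query, clothing_query
    path        TEXT          -- location of the model file
);
""")
conn.commit()
```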
The speech recognition module 206 is also configured to obtain the user's position information and match a grammar model or language model according to the position information. Relation data between position information and grammar models or language models can be maintained in the database 208; after the speech recognition module 206 obtains the user's position information, it can match an appropriate grammar model or language model according to the position information. For example, if the user's position information indicates the Beijing area, the grammar model or language model dominated by Beijing-area names is matched; when the user moves from Beijing to Shanghai, the user's current position information is obtained and the grammar model or language model dominated by Shanghai-area names is matched.
In another embodiment, the speech recognition module 206 is also configured to obtain position information and scene information, match a pronunciation dictionary according to the position information and scene information, perform a pattern matching algorithm according to the matched grammar model, language model and pronunciation dictionary, and obtain a speech recognition result.
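A minimal sketch of this pronunciation-dictionary matching is given below; the (position, scene) keys and dictionary identifiers are hypothetical and simply illustrate selecting a dictionary conditioned on both kinds of context.

```python
# Illustrative only: matching a pronunciation dictionary by (position, scene);
# the keys and dictionary identifiers are hypothetical.
PRON_DICTS = {
    ("Beijing", "company_query"):   "pron_dict_beijing_companies",
    ("Shanghai", "clothing_query"): "pron_dict_shanghai_shops",
}

def match_pron_dict(position: str, scene_info: str,
                    default: str = "pron_dict_general") -> str:
    """Return the pronunciation dictionary best suited to the user's context."""
    return PRON_DICTS.get((position, scene_info), default)
```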
After the speech recognition module 206 performs the pattern matching algorithm, a word sequence of one or more words is obtained; among the candidate words, the words with the highest occurrence probability are chosen to form the word sequence, which is the speech recognition result. The result can be symbols, numbers or text; for example, if the collected speech is "今天" (today), the recognition result can be "今天", "jintian", "today" and so on, and the result can be further processed by the application.
The above embodiments express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the claims of the present invention. It should be pointed out that a person of ordinary skill in the art can make several modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be determined by the appended claims.

Claims (11)

1. A speech recognition method, comprising the following steps:
collecting speech;
performing feature extraction on the collected speech;
obtaining scene information of a user, and matching a grammar model or a language model according to the scene information;
performing a pattern matching algorithm according to the matched grammar model or language model to obtain a speech recognition result.
2. The speech recognition method according to claim 1, characterized in that the method further comprises the step of obtaining position information of the user and matching a grammar model or a language model according to the position information.
3. The speech recognition method according to claim 2, characterized in that the method further comprises the step of matching a pronunciation dictionary according to the position information and the scene information; and
the step of performing a pattern matching algorithm according to the matched grammar model or language model to obtain a speech recognition result is:
performing a pattern matching algorithm according to the matched grammar model, language model and pronunciation dictionary to obtain a speech recognition result.
4. The speech recognition method according to claim 2 or 3, characterized in that the position information is a geographic position or GPS positioning information automatically detected and provided by the user's terminal device, and the scene information is scene change data generated during the user's interaction.
5. The speech recognition method according to claim 2 or 3, characterized in that the position information is a geographic position or GPS positioning information actively provided or modified by the user, and the scene information is scene change data actively set or changed by the user.
6. A speech recognition system, comprising a client and a server interacting with the client, characterized in that the client comprises:
a voice acquisition module, configured to collect speech;
a first communication module, configured to send the collected speech to the server;
and the server comprises:
a second communication module, configured to receive the speech sent by the first communication module;
a feature extraction module, configured to perform feature extraction on the speech;
a speech recognition module, configured to obtain scene information of a user, match a grammar model or a language model according to the scene information, and perform a pattern matching algorithm according to the matched grammar model or language model to obtain a speech recognition result.
7. The speech recognition system according to claim 6, characterized in that the client further comprises:
an information acquisition module, configured to obtain the user's scene information and position information;
and the first communication module is further configured to send the scene information and the position information to the server.
8. The speech recognition system according to claim 6, characterized in that the speech recognition module is further configured to obtain position information of the user and match a grammar model or a language model according to the position information; and the server further comprises a database configured to store the user's position information and scene information.
9. The speech recognition system according to claim 6, characterized in that the speech recognition module is further configured to match a pronunciation dictionary according to the position information and the scene information, and to perform a pattern matching algorithm according to the matched grammar model, language model and pronunciation dictionary to obtain a speech recognition result.
10. The speech recognition system according to any one of claims 6 to 9, characterized in that the position information is position information or GPS positioning information automatically detected and provided by the user's terminal device, and the scene information is scene change data generated during the user's interaction.
11. The speech recognition system according to any one of claims 6 to 9, characterized in that the position information is a geographic position or GPS positioning information actively provided or modified by the user, and the scene information is scene change data actively set or changed by the user.
CN2010106143655A 2010-12-30 2010-12-30 Voice recognition method and system Pending CN102074231A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010106143655A CN102074231A (en) 2010-12-30 2010-12-30 Voice recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010106143655A CN102074231A (en) 2010-12-30 2010-12-30 Voice recognition method and system

Publications (1)

Publication Number Publication Date
CN102074231A true CN102074231A (en) 2011-05-25

Family

ID=44032749

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010106143655A Pending CN102074231A (en) 2010-12-30 2010-12-30 Voice recognition method and system

Country Status (1)

Country Link
CN (1) CN102074231A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1293428A (en) * 2000-11-10 2001-05-02 清华大学 Information check method based on speed recognition
CN1674091A (en) * 2005-04-18 2005-09-28 南京师范大学 Sound identifying method for geographic information and its application in navigation system
US20090228281A1 (en) * 2008-03-07 2009-09-10 Google Inc. Voice Recognition Grammar Selection Based on Context
CN101593518A (en) * 2008-05-28 2009-12-02 中国科学院自动化研究所 The balance method of actual scene language material and finite state network language material
CN101345051A (en) * 2008-08-19 2009-01-14 南京师范大学 Speech control method of geographic information system with quantitative parameter

Cited By (62)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105336326A (en) * 2011-09-28 2016-02-17 苹果公司 Speech recognition repair using contextual information
CN107845382A (en) * 2012-06-21 2018-03-27 谷歌有限责任公司 Dynamic language model
CN103514875A (en) * 2012-06-29 2014-01-15 联想(北京)有限公司 Voice data matching method and electronic equipment
CN103594085A (en) * 2012-08-16 2014-02-19 百度在线网络技术(北京)有限公司 Method and system providing speech recognition result
CN103594085B (en) * 2012-08-16 2019-04-26 百度在线网络技术(北京)有限公司 It is a kind of that the method and system of speech recognition result are provided
CN103903611A (en) * 2012-12-24 2014-07-02 联想(北京)有限公司 Speech information identifying method and equipment
CN103903611B (en) * 2012-12-24 2018-07-03 联想(北京)有限公司 A kind of recognition methods of voice messaging and equipment
CN103474063B (en) * 2013-08-06 2015-12-23 福建华映显示科技有限公司 Voice identification system and method
CN103474063A (en) * 2013-08-06 2013-12-25 福建华映显示科技有限公司 Voice recognition system and method
CN103956169B (en) * 2014-04-17 2017-07-21 北京搜狗科技发展有限公司 A kind of pronunciation inputting method, device and system
CN105448292B (en) * 2014-08-19 2019-03-12 北京羽扇智信息科技有限公司 A kind of time Speech Recognition System and method based on scene
CN105448292A (en) * 2014-08-19 2016-03-30 北京羽扇智信息科技有限公司 Scene-based real-time voice recognition system and method
CN105488044A (en) * 2014-09-16 2016-04-13 华为技术有限公司 Data processing method and device
CN105788598A (en) * 2014-12-19 2016-07-20 联想(北京)有限公司 Speech processing method and electronic device
WO2016101577A1 (en) * 2014-12-24 2016-06-30 中兴通讯股份有限公司 Voice recognition method, client and terminal device
CN105989836A (en) * 2015-03-06 2016-10-05 腾讯科技(深圳)有限公司 Voice acquisition method, device and terminal equipment
CN104954532A (en) * 2015-06-19 2015-09-30 深圳天珑无线科技有限公司 Voice recognition method, voice recognition device and mobile terminal
CN105161110A (en) * 2015-08-19 2015-12-16 百度在线网络技术(北京)有限公司 Bluetooth connection-based speech recognition method, device and system
CN105161110B (en) * 2015-08-19 2017-11-17 百度在线网络技术(北京)有限公司 Audio recognition method, device and system based on bluetooth connection
CN106558306A (en) * 2015-09-28 2017-04-05 广东新信通信息系统服务有限公司 Method for voice recognition, device and equipment
CN105225665A (en) * 2015-10-15 2016-01-06 桂林电子科技大学 A kind of audio recognition method and speech recognition equipment
CN106683662A (en) * 2015-11-10 2017-05-17 中国电信股份有限公司 Speech recognition method and device
CN105516289A (en) * 2015-12-02 2016-04-20 广东小天才科技有限公司 Method and system for assisting voice interaction based on position and action
CN105869635B (en) * 2016-03-14 2020-01-24 江苏时间环三维科技有限公司 Voice recognition method and system
CN105869635A (en) * 2016-03-14 2016-08-17 江苏时间环三维科技有限公司 Speech recognition method and system
WO2017166631A1 (en) * 2016-03-30 2017-10-05 乐视控股(北京)有限公司 Voice signal processing method, apparatus and electronic device
CN105845133A (en) * 2016-03-30 2016-08-10 乐视控股(北京)有限公司 Voice signal processing method and apparatus
CN106128462A (en) * 2016-06-21 2016-11-16 东莞酷派软件技术有限公司 Audio recognition method and system
CN106205606A (en) * 2016-08-15 2016-12-07 南京邮电大学 A kind of dynamic positioning and monitoring method based on speech recognition and system
CN106228983A (en) * 2016-08-23 2016-12-14 北京谛听机器人科技有限公司 Scene process method and system during a kind of man-machine natural language is mutual
CN106228983B (en) * 2016-08-23 2018-08-24 北京谛听机器人科技有限公司 A kind of scene process method and system in man-machine natural language interaction
CN108242237A (en) * 2016-12-26 2018-07-03 现代自动车株式会社 Speech processing device, the vehicle and method of speech processing with the equipment
CN107316635A (en) * 2017-05-19 2017-11-03 科大讯飞股份有限公司 Audio recognition method and device, storage medium, electronic equipment
CN107483714A (en) * 2017-06-28 2017-12-15 努比亚技术有限公司 A kind of audio communication method, mobile terminal and computer-readable recording medium
CN107785014A (en) * 2017-10-23 2018-03-09 上海百芝龙网络科技有限公司 A kind of home scenarios semantic understanding method
CN107945792A (en) * 2017-11-06 2018-04-20 百度在线网络技术(北京)有限公司 Method of speech processing and device
CN107945792B (en) * 2017-11-06 2021-05-28 百度在线网络技术(北京)有限公司 Voice processing method and device
CN109920429A (en) * 2017-12-13 2019-06-21 上海擎感智能科技有限公司 It is a kind of for vehicle-mounted voice recognition data processing method and system
CN110299136A (en) * 2018-03-22 2019-10-01 上海擎感智能科技有限公司 A kind of processing method and its system for speech recognition
CN110459203A (en) * 2018-05-03 2019-11-15 百度在线网络技术(北京)有限公司 A kind of intelligent sound guidance method, device, equipment and storage medium
CN108831505A (en) * 2018-05-30 2018-11-16 百度在线网络技术(北京)有限公司 The method and apparatus for the usage scenario applied for identification
CN108924370B (en) * 2018-07-23 2020-12-15 携程旅游信息技术(上海)有限公司 Call center outbound voice waveform analysis method, system, equipment and storage medium
CN108924370A (en) * 2018-07-23 2018-11-30 携程旅游信息技术(上海)有限公司 Call center's outgoing call speech waveform analysis method, system, equipment and storage medium
CN110827824A (en) * 2018-08-08 2020-02-21 Oppo广东移动通信有限公司 Voice processing method, device, storage medium and electronic equipment
CN109065045A (en) * 2018-08-30 2018-12-21 出门问问信息科技有限公司 Audio recognition method, device, electronic equipment and computer readable storage medium
CN109509466A (en) * 2018-10-29 2019-03-22 Oppo广东移动通信有限公司 Data processing method, terminal and computer storage medium
CN111312233A (en) * 2018-12-11 2020-06-19 阿里巴巴集团控股有限公司 Voice data identification method, device and system
WO2020119541A1 (en) * 2018-12-11 2020-06-18 阿里巴巴集团控股有限公司 Voice data identification method, apparatus and system
CN109509473B (en) * 2019-01-28 2022-10-04 维沃移动通信有限公司 Voice control method and terminal equipment
CN109509473A (en) * 2019-01-28 2019-03-22 维沃移动通信有限公司 Sound control method and terminal device
CN109801619A (en) * 2019-02-13 2019-05-24 安徽大尺度网络传媒有限公司 A kind of across language voice identification method for transformation of intelligence
CN110085228A (en) * 2019-04-28 2019-08-02 广西盖德科技有限公司 Phonetic code application method, applications client and system
CN110364165A (en) * 2019-07-18 2019-10-22 青岛民航凯亚系统集成有限公司 Flight dynamic information voice inquiry method
CN111048091A (en) * 2019-12-30 2020-04-21 苏州思必驰信息科技有限公司 Voice recognition method, voice recognition equipment and computer readable storage medium
CN111986651A (en) * 2020-09-02 2020-11-24 上海优扬新媒信息技术有限公司 Man-machine interaction method and device and intelligent interaction terminal
CN111986651B (en) * 2020-09-02 2023-09-29 度小满科技(北京)有限公司 Man-machine interaction method and device and intelligent interaction terminal
CN112102833A (en) * 2020-09-22 2020-12-18 北京百度网讯科技有限公司 Voice recognition method, device, equipment and storage medium
CN112102833B (en) * 2020-09-22 2023-12-12 阿波罗智联(北京)科技有限公司 Speech recognition method, device, equipment and storage medium
CN112102815A (en) * 2020-11-13 2020-12-18 深圳追一科技有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium
CN112102815B (en) * 2020-11-13 2021-07-13 深圳追一科技有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium
WO2023029442A1 (en) * 2021-08-30 2023-03-09 佛山市顺德区美的电子科技有限公司 Smart device control method and apparatus, smart device, and readable storage medium
CN114120979A (en) * 2022-01-25 2022-03-01 荣耀终端有限公司 Optimization method, training method, device and medium of voice recognition model

Similar Documents

Publication Publication Date Title
CN102074231A (en) Voice recognition method and system
CN103187053B (en) Input method and electronic equipment
CN104715752B (en) Audio recognition method, apparatus and system
CN103164403B (en) The generation method and system of video index data
CN102592591B (en) Dual-band speech encoding
CN103594085B (en) It is a kind of that the method and system of speech recognition result are provided
CN105448292A (en) Scene-based real-time voice recognition system and method
WO2014101717A1 (en) Voice recognizing method and system for personalized user information
CN102510426A (en) Personal assistant application access method and system
CN106328124A (en) Voice recognition method based on user behavior characteristics
CN204134197U (en) Intelligent toy system
KR20150134993A (en) Method and Apparatus of Speech Recognition Using Device Information
CN103685520A (en) Method and device for pushing songs on basis of voice recognition
CN110019741B (en) Question-answering system answer matching method, device, equipment and readable storage medium
CN103794211A (en) Voice recognition method and system
CN104123930A (en) Guttural identification method and device
CN108536680B (en) Method and device for acquiring house property information
WO2019075829A1 (en) Voice translation method and apparatus, and translation device
CN104240698A (en) Voice recognition method
JP2010032865A (en) Speech recognizer, speech recognition system, and program
CN113409774A (en) Voice recognition method and device and electronic equipment
CN101222703A (en) Identity verification method for mobile terminal based on voice identification
CN112735394B (en) Semantic parsing method and device for voice
CN103811008A (en) Audio frequency content identification method and device
CN104484426A (en) Multi-mode music searching method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20110525