CN103514882A - Voice identification method and system - Google Patents

Voice identification method and system

Info

Publication number
CN103514882A
CN103514882A
Authority
CN
China
Prior art keywords
voice instruction
recognition result
unknown variable
speaker
named entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201210227158.3A
Other languages
Chinese (zh)
Other versions
CN103514882B (en)
Inventor
贾磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201210227158.3A priority Critical patent/CN103514882B/en
Publication of CN103514882A publication Critical patent/CN103514882A/en
Application granted granted Critical
Publication of CN103514882B publication Critical patent/CN103514882B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The invention provides a voice recognition method and system. The method comprises the steps that: A, a client module sends an acquired user voice instruction to a server module; B, the server module performs preliminary recognition of the voice instruction using an instruction template set and a named entity set, obtains a preliminary recognition result, and sends the preliminary recognition result to the client module, the preliminary recognition result being a recognition result that contains unknown variable information; C, the client module recognizes the unknown variable using named entity information stored on the client, thereby obtaining the complete recognition result of the voice instruction. In this way, the computing resources of the server are fully utilized and voice recognition accuracy is improved.

Description

Voice recognition method and system
[Technical field]
The present invention relates to speech recognition technology, and in particular to a speech recognition method and system.
[Background art]
With the development of software and hardware technologies for mobile terminals, mobile terminals are becoming increasingly intelligent. Operating a mobile terminal by voice command is the direction in which mobile terminal technology is developing. The core of voice-command control of a mobile terminal is correctly recognizing the user's voice command; only when the command is correctly recognized can the terminal be triggered to perform the corresponding action. In the prior art, there are typically two methods of speech recognition on a mobile terminal:
In the first method, a speech recognition system is built into the mobile terminal, and when the user issues a voice instruction, this built-in system recognizes it. This method can make full use of the personal information stored on the terminal (for example, the contact list), and is fairly effective for voice operations such as voice dialing. However, the computing power of a mobile terminal is limited, so a built-in speech recognition system has difficulty recognizing complex voice commands, such as those involved in web login, map operations, song queries, or information search on the terminal. Moreover, because of the limited computing power, the built-in system cannot apply sophisticated recognition algorithms, so even when applied to voice dialing this prior-art method suffers from low recognition accuracy.
In the second method, the mobile terminal captures the user's voice instruction and sends it to a server, where a pre-built speech recognition system recognizes it and finally returns the result to the terminal. This method can exploit the server's powerful computing capability to recognize complex voice instructions. Its drawback is that it cannot make full use of the personal information stored on the terminal, which degrades recognition accuracy for the speech segments of the instruction that relate to that personal information.
[Summary of the invention]
The technical problem to be solved by the present invention is to provide a speech recognition method and system that make full use of the computing resources of a server while improving recognition accuracy.
The technical solution adopted by the present invention is a speech recognition system comprising a client module and a server module. The client module comprises: a voice acquisition unit for obtaining the user's voice instruction; and a client communication unit for sending the voice instruction to the server module. The server module comprises: a first recognition unit for performing preliminary recognition of the voice instruction using an instruction template set and a named entity set to obtain a preliminary recognition result, the preliminary recognition result being a recognition result that contains unknown variable information, where the unknown variable is the speech segment of the voice instruction that relates to named entity information stored on the client; and a server communication unit for sending the preliminary recognition result to the client module. The client module further comprises a second recognition unit for recognizing the unknown variable using the named entity information stored on the client, so as to obtain the complete recognition result of the voice instruction.
According to a preferred embodiment of the present invention, the first recognition unit comprises: a first decoding space generation unit for compiling the instruction template set and the named entity set in advance into two independent WFST networks that form a first decoding space; and a first decoding unit for, upon receiving the voice instruction, decoding it in the first decoding space to determine the instruction template to which the voice instruction belongs and the start and end times of the unknown variable within the instruction, and taking these as the preliminary recognition result.
According to a preferred embodiment of the present invention, the second recognition unit comprises: a second decoding space generation unit for compiling the named entity information stored on the client in advance into a WFST network that forms a second decoding space; and a second decoding unit for, upon receiving the preliminary recognition result, locating the speech segment to be recognized in the voice instruction according to the start and end times of the unknown variable, and decoding that segment in the second decoding space to obtain the recognition result of the unknown variable.
According to a preferred embodiment of the present invention, the server module further comprises a feature extraction unit for extracting speaker-related acoustic features from the voice instruction; and the server communication unit is further configured to send the speaker-related acoustic features to the client module.
According to a preferred embodiment of the present invention, the client module further comprises an acoustic model training unit for training a speaker-related acoustic model in advance using the speaker's speech samples; and when the second decoding unit decodes the speech segment to be recognized, it uses the speaker-related acoustic features, the second decoding space, and the speaker-related acoustic model.
The present invention also provides a speech recognition method, comprising: A. a client module sends an acquired user voice instruction to a server module; B. the server module performs preliminary recognition of the voice instruction using an instruction template set and a named entity set, obtains a preliminary recognition result, and sends it to the client module, the preliminary recognition result being a recognition result that contains unknown variable information, where the unknown variable is the speech segment of the voice instruction that relates to named entity information stored on the client; C. the client module recognizes the unknown variable using the named entity information stored on the client, so as to obtain the complete recognition result of the voice instruction.
According to a preferred embodiment of the present invention, the step in which the server module performs preliminary recognition of the voice instruction using the instruction template set and the named entity set comprises: upon receiving the voice instruction, the server module decodes it in a first decoding space to determine the instruction template to which the voice instruction belongs and the start and end times of the unknown variable within the instruction, and takes these as the preliminary recognition result, the first decoding space being formed by compiling the instruction template set and the named entity set in advance into two independent WFST networks.
According to a preferred embodiment of the present invention, the step in which the client module recognizes the unknown variable using the named entity information stored on the client comprises: upon receiving the preliminary recognition result, the client module locates the speech segment to be recognized in the voice instruction according to the start and end times of the unknown variable, and decodes that segment in a second decoding space to obtain the recognition result of the unknown variable, the second decoding space being formed by compiling the named entity information stored on the client in advance into a WFST network.
According to a preferred embodiment of the present invention, step B further comprises: the server module extracts speaker-related acoustic features from the voice instruction and sends them to the client module.
According to a preferred embodiment of the present invention, when the client module decodes the speech segment to be recognized, it uses the speaker-related acoustic features, the second decoding space, and a speaker-related acoustic model, the speaker-related acoustic model having been trained in advance using the speaker's speech samples.
As can be seen from the above technical solutions, the present invention divides the recognition of a voice instruction into two stages: in the server recognition stage, a preliminary recognition result containing unknown variable information is obtained; in the client recognition stage, the unknown variable is recognized, yielding the complete recognition result of the voice instruction. This makes full use of the computing resources of the server while also exploiting the information stored on the client to improve recognition accuracy.
[Description of the drawings]
Fig. 1 is a schematic block diagram of an embodiment of the speech recognition system of the present invention;
Fig. 2 is a schematic block diagram of an embodiment of the first recognition unit of the present invention;
Fig. 3 is a schematic block diagram of an embodiment of the second recognition unit of the present invention;
Fig. 4 is a schematic block diagram of another embodiment of the server module in the speech recognition system of the present invention;
Fig. 5 is a schematic block diagram of another embodiment of the client module in the speech recognition system of the present invention;
Fig. 6 is a schematic flowchart of an embodiment of the speech recognition method of the present invention.
[Detailed description]
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described below in conjunction with the drawings and specific embodiments.
Referring to Fig. 1, a schematic block diagram of an embodiment of the speech recognition system of the present invention. As shown in Fig. 1, the system comprises a client module 101 and a server module 201. The client module 101 comprises: a voice acquisition unit 1011, a client communication unit 1012, and a second recognition unit 1013. The server module 201 comprises: a first recognition unit 2011 and a server communication unit 2012.
The voice acquisition unit 1011 obtains the user's voice instruction. The client communication unit 1012 sends the obtained voice instruction to the server module 201. The first recognition unit 2011 performs preliminary recognition of the voice instruction using a collected instruction template set and named entity set to obtain a preliminary recognition result, the preliminary recognition result being a recognition result that contains unknown variable information, where the unknown variable is the speech segment of the voice instruction that relates to named entity information stored on the client. The server communication unit 2012 sends the preliminary recognition result to the client module 101. The second recognition unit 1013 recognizes the unknown variable using the named entity information stored on the client, so as to obtain the complete recognition result of the voice instruction.
The system is described below through specific embodiments.
Referring to Fig. 2, a schematic block diagram of an embodiment of the first recognition unit of the present invention. As shown in Fig. 2, the first recognition unit 2011 comprises: a first decoding space generation unit 2011_1 and a first decoding unit 2011_2.
The first decoding space generation unit 2011_1 compiles the collected instruction template set and named entity set in advance into two independent WFST (weighted finite state transducer) networks that form the first decoding space. The first decoding unit 2011_2, upon receiving the voice instruction sent by the client communication unit 1012, decodes it in the first decoding space to determine the instruction template to which the voice instruction belongs and the start and end times of the unknown variable within the instruction, and takes these as the preliminary recognition result.
An instruction template expresses the action indicated by an instruction. For example, "make a phone call to ***" is an instruction template, where "***" is a template slot that can be filled by a named entity. Combining an instruction template with a named entity of the type defined by its slot forms a complete instruction. For example, the slot "***" in the template "make a phone call to ***" is defined as a person's name, and combining the named entity "Zhang San" with this template yields the complete instruction "make a phone call to Zhang San".
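For illustration only (this sketch is not part of the original disclosure), the template/slot mechanism can be expressed in a few lines of Python; the names INSTRUCTION_TEMPLATES, NAMED_ENTITIES, and fill_template are hypothetical:

INSTRUCTION_TEMPLATES = {
    # template key -> (pattern with a typed slot, slot type)
    "call": ("make a phone call to <name>", "name"),
}

NAMED_ENTITIES = {"name": {"Zhang San", "Li Si"}}

def fill_template(template_key: str, entity: str) -> str:
    """Combine an instruction template with a named entity of the slot's type
    to form a complete instruction."""
    pattern, slot_type = INSTRUCTION_TEMPLATES[template_key]
    if entity not in NAMED_ENTITIES[slot_type]:
        raise ValueError("entity does not match the slot type")
    return pattern.replace("<" + slot_type + ">", entity)

print(fill_template("call", "Zhang San"))  # -> make a phone call to Zhang San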
In the present invention, instruction templates and named entities can be obtained in advance through data mining. This can be achieved with existing techniques and, as it is not the focus of the present invention, is not described in detail here.
The named entities in the named entity set of the present invention may include person names, place names, song titles, application names, and other entities relevant to applications on a mobile terminal that those skilled in the art may conceive of.
In the present invention, the first decoding space generation unit 2011_1 compiles the instruction template set and the named entity set into two independent WFST networks, which together form the first decoding space. A WFST network is the network formed by all possible paths available during decoding. Using the WFST networks, the first decoding unit 2011_2 dynamically expands paths for each frame of the voice instruction, and an acoustic model provides a probability score for each path expanded for each frame; during this process the first decoding unit 2011_2 prunes expansion paths according to their scores. When the last frame of the voice instruction has been decoded, the highest-scoring path among all expansion paths is the recognition result obtained at the server. In the present invention, because the first decoding space is composed of two independent WFST networks, one containing instruction template information and one containing named entity information, backtracking from the last word of the server-side recognition result determines the instruction template to which the result belongs and the portion of the result that matches a named entity; the speech segment corresponding to that matching portion is the unknown variable in the voice instruction. It will be appreciated that once the unknown variable is determined, its start and end times within the voice instruction are also determined. The first decoding unit 2011_2 takes the instruction template to which the voice instruction belongs and the start and end times of the unknown variable as the preliminary recognition result, and the server communication unit 2012 in Fig. 1 sends it to the client module 101.
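A much-simplified sketch (added for illustration, not from the patent) of the backtracking step: a real decoder expands WFST paths frame by frame under acoustic scores with pruning; here that stage is assumed done, and the sketch only shows how a best path whose words are tagged with their source network yields the template and the slot's start and end frames (all names hypothetical):

from dataclasses import dataclass

@dataclass
class PathWord:
    word: str
    source: str       # which WFST the word came from: "template" or "entity"
    start_frame: int
    end_frame: int

def preliminary_result(best_path):
    """Backtrack over the best decoding path: recover the instruction template
    and the start/end frames of the unknown variable (the entity portion)."""
    template_words, slot = [], None
    for w in best_path:
        if w.source == "entity":
            if slot is None:
                slot = [w.start_frame, w.end_frame]
                template_words.append("***")   # the template slot position
            else:
                slot[1] = w.end_frame          # extend over later entity words
        else:
            template_words.append(w.word)
    return " ".join(template_words), tuple(slot)

path = [PathWord("make", "template", 0, 20), PathWord("a", "template", 20, 28),
        PathWord("phone", "template", 28, 55), PathWord("call", "template", 55, 80),
        PathWord("to", "template", 80, 90),
        PathWord("Zhang", "entity", 90, 130), PathWord("San", "entity", 130, 170)]
print(preliminary_result(path))   # -> ('make a phone call to ***', (90, 170))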
For methods of compiling WFST networks and decoding with them, see Mehryar Mohri, Fernando Pereira, and Michael Riley, "Weighted Finite-State Transducers in Speech Recognition", Computer Speech & Language, Volume 16, Issue 1, January 2002, Pages 69-88 (hereinafter Document 1); the details are not repeated here.
It will be appreciated that the role of the acoustic model in decoding is to estimate the probability of the occurrence of acoustic signals. Therefore, the acoustic model in the present invention may be of any type, for example an existing acoustic model or one provided by a third party; the present invention places no limitation on this.
Referring to Fig. 3, a schematic block diagram of an embodiment of the second recognition unit of the present invention. As shown in Fig. 3, the second recognition unit 1013 comprises: a second decoding space generation unit 1013_1 and a second decoding unit 1013_2.
The second decoding space generation unit 1013_1 compiles the named entity information stored on the client in advance into a WFST network that forms the second decoding space. The second decoding unit 1013_2, upon receiving the preliminary recognition result, locates the speech segment to be recognized in the voice instruction according to the start and end times of the unknown variable, and decodes that segment in the second decoding space to obtain the recognition result of the unknown variable.
The named entity information stored on the client includes information in the client's contact list, song titles stored on the client, application names stored on the client, and the named entities involved in the various applications that those skilled in the art may conceive of. The second decoding space generation unit 1013_1 builds the second decoding space in a manner similar to that of the first decoding space introduced above; see Document 1 for details.
In the present invention, the second decoding unit 1013_2 decodes the unknown variable in the second decoding space. This is a decoding process of limited length (the length of the unknown variable's speech segment), so compared with the prior art the decoding time is greatly shortened; and because the amount of computation is reduced, the limited computing resources of the client can bear it well.
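The limited-length property can be seen in a hedged sketch of the client-side second pass (added for illustration; not from the patent): only the frames between the reported start and end times are decoded, and only against the locally stored entities. The scoring function below is a stand-in; a real client would score each entity against the segment through the second decoding space with an acoustic model:

def acoustic_score(segment, entity):
    """Placeholder score; a real system decodes the segment through the entity
    WFST with an acoustic model. This toy version merely prefers entities
    whose length roughly matches the segment duration."""
    return -abs(len(segment) / 10.0 - len(entity))

def recognize_unknown_variable(frames, start, end, local_entities):
    segment = frames[start:end]   # limited-length decode, as described above
    return max(local_entities, key=lambda e: acoustic_score(segment, e))

frames = [0.0] * 200              # stand-in for acoustic feature frames
print(recognize_unknown_variable(frames, 90, 170, ["Zhang San", "Li Si"]))
# -> Zhang San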
In this embodiment, the acoustic model used by the second decoding unit 1013_2 during decoding, like that used by the first decoding unit 2011_2, may be any acoustic model.
Referring to Fig. 4, a schematic block diagram of another embodiment of the server module in the speech recognition system of the present invention. As shown in Fig. 4, in this embodiment the server module 201 further comprises a feature extraction unit 2013 for extracting speaker-related acoustic features from the voice instruction, and the server communication unit 2012 is further configured to send the speaker-related acoustic features to the client module 101.
Referring to Fig. 5, a schematic block diagram of another embodiment of the client module in the speech recognition system of the present invention. As shown in Fig. 5, in this embodiment the client module 101 further comprises an acoustic model training unit 1014 for training a speaker-related acoustic model in advance using the speaker's speech samples. When the second decoding unit 1013_2 decodes the speech segment to be recognized (i.e., the speech segment corresponding to the unknown variable), it uses the speaker-related acoustic features sent by the server module 201, the second decoding space, and the speaker-related acoustic model.
In the embodiments of the speech recognition system shown in Figs. 4 and 5, the server not only performs preliminary recognition of the voice instruction but also extracts speaker-related acoustic features from it, and the client builds a speaker-related acoustic model in advance. With the system of this embodiment, the computing resources of the server are fully used to perform the computation of extracting speaker-related acoustic features, and the client, having a speaker-related acoustic model, can decode more effectively after receiving those features. This considerably accelerates decoding on the client, and because the client's acoustic model is speaker-adaptive, the accuracy of decoding the speech segment corresponding to the unknown variable also improves greatly once the speaker-related acoustic features are obtained. Methods for training a speaker-related acoustic model and decoding with speaker-related acoustic features exist in the prior art; see in particular Tasos Anastasakos, John McDonough, Richard Schwartz, and John Makhoul, "A Compact Model for Speaker-Adaptive Training", Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP 1996), Volume 2, Pages 1137-1140.
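The division of labor in Figs. 4 and 5 can be sketched as follows (illustrative only; the class and field names are assumptions, and the adapted model is a stub rather than an implementation):

from dataclasses import dataclass

@dataclass
class PreliminaryResult:
    template: str
    slot_start: int
    slot_end: int
    speaker_features: list   # e.g., adaptation statistics computed server-side

class SpeakerAdaptedModel:
    """Stub for the client's speaker-related acoustic model, trained in
    advance from the user's own speech samples."""
    def decode(self, segment, speaker_features):
        # A real model would condition on the speaker features before
        # scoring local entities against the segment.
        return "Zhang San"

def client_second_pass(frames, result, model):
    segment = frames[result.slot_start:result.slot_end]
    return model.decode(segment, result.speaker_features)

result = PreliminaryResult("make a phone call to ***", 90, 170, [0.1, -0.3])
print(client_second_pass([0.0] * 200, result, SpeakerAdaptedModel()))
# -> Zhang San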
Referring to Fig. 6, a schematic flowchart of an embodiment of the speech recognition method of the present invention. As shown in Fig. 6, the method comprises:
Step S301: the client module sends the acquired user voice instruction to the server module.
Step S302: the server module performs preliminary recognition of the voice instruction using an instruction template set and a named entity set, obtains a preliminary recognition result, and sends it to the client module, the preliminary recognition result being a recognition result that contains unknown variable information, where the unknown variable is the speech segment of the voice instruction that relates to named entity information stored on the client.
Step S303: the client module recognizes the unknown variable using the named entity information stored on the client, and obtains the complete recognition result of the voice instruction.
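End to end, steps S301-S303 amount to the pipeline below; this is a schematic sketch added for illustration (transport, decoding, and error handling omitted; all names hypothetical, with the server and the second pass stubbed):

def server_recognize(frames):
    # S302: first-pass WFST decode against the template and entity networks,
    # stubbed here to a fixed preliminary recognition result
    return {"template": "make a phone call to ***", "slot": (90, 170)}

def second_pass_on_client(frames, prelim, local_entities):
    start, end = prelim["slot"]
    segment = frames[start:end]
    # S303: score local entities against the segment; stubbed to the first
    return local_entities[0]

def recognize(frames, local_entities):
    prelim = server_recognize(frames)                            # S301 + S302
    entity = second_pass_on_client(frames, prelim, local_entities)  # S303
    return prelim["template"].replace("***", entity)

print(recognize([0.0] * 200, ["Zhang San"]))
# -> make a phone call to Zhang San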
According to one embodiment, the step in step S302 in which the server module performs preliminary recognition of the voice instruction using the instruction template set and the named entity set comprises:
Upon receiving the voice instruction, the server module decodes it in a first decoding space to determine the instruction template to which the voice instruction belongs and the start and end times of the unknown variable within the instruction, and takes these as the preliminary recognition result, the first decoding space being formed by compiling the instruction template set and the named entity set in advance into two independent WFST networks.
Correspondingly, in step S303, the step in which the client module recognizes the unknown variable using the named entity information stored on the client comprises:
Upon receiving the preliminary recognition result, the client module locates the speech segment to be recognized in the voice instruction according to the start and end times of the unknown variable, and decodes that segment in a second decoding space to obtain the recognition result of the unknown variable, the second decoding space being formed by compiling the named entity information stored on the client in advance into a WFST network.
According to another embodiment, step S302 further comprises:
The server module extracts speaker-related acoustic features from the voice instruction and sends them to the client module.
Correspondingly, in step S303, when the client module decodes the speech segment to be recognized (i.e., the speech segment corresponding to the unknown variable), it uses the speaker-related acoustic features, the second decoding space, and a speaker-related acoustic model, the speaker-related acoustic model having been trained in advance using the speaker's speech samples.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A speech recognition system, comprising:
a client module and a server module, wherein
the client module comprises:
a voice acquisition unit for obtaining a user's voice instruction;
a client communication unit for sending the voice instruction to the server module;
the server module comprises:
a first recognition unit for performing preliminary recognition of the voice instruction using an instruction template set and a named entity set to obtain a preliminary recognition result, wherein the preliminary recognition result is a recognition result containing unknown variable information, and the unknown variable is the speech segment of the voice instruction that relates to named entity information stored on the client;
a server communication unit for sending the preliminary recognition result to the client module;
and the client module further comprises:
a second recognition unit for recognizing the unknown variable using the named entity information stored on the client, so as to obtain the complete recognition result of the voice instruction.
2. The system according to claim 1, wherein the first recognition unit comprises:
a first decoding space generation unit for compiling the instruction template set and the named entity set in advance into two independent WFST networks to form a first decoding space;
a first decoding unit for, upon receiving the voice instruction, decoding it in the first decoding space to determine the instruction template to which the voice instruction belongs and the start and end times of the unknown variable within the voice instruction, and taking these as the preliminary recognition result.
3. The system according to claim 2, wherein the second recognition unit comprises:
a second decoding space generation unit for compiling the named entity information stored on the client in advance into a WFST network to form a second decoding space;
a second decoding unit for, upon receiving the preliminary recognition result, locating the speech segment to be recognized in the voice instruction according to the start and end times of the unknown variable, and decoding that segment in the second decoding space to obtain the recognition result of the unknown variable.
4. The system according to claim 3, wherein the server module further comprises:
a feature extraction unit for extracting speaker-related acoustic features from the voice instruction;
and the server communication unit is further configured to send the speaker-related acoustic features to the client module.
5. The system according to claim 4, wherein the client module further comprises:
an acoustic model training unit for training a speaker-related acoustic model in advance using the speaker's speech samples;
and, when the second decoding unit decodes the speech segment to be recognized, it uses the speaker-related acoustic features, the second decoding space, and the speaker-related acoustic model.
6. A speech recognition method, comprising:
A. a client module sends an acquired user voice instruction to a server module;
B. the server module performs preliminary recognition of the voice instruction using an instruction template set and a named entity set, obtains a preliminary recognition result, and sends the preliminary recognition result to the client module, wherein the preliminary recognition result is a recognition result containing unknown variable information, and the unknown variable is the speech segment of the voice instruction that relates to named entity information stored on the client;
C. the client module recognizes the unknown variable using the named entity information stored on the client, so as to obtain the complete recognition result of the voice instruction.
7. The method according to claim 6, wherein the step in which the server module performs preliminary recognition of the voice instruction using the instruction template set and the named entity set comprises:
upon receiving the voice instruction, the server module decodes it in a first decoding space to determine the instruction template to which the voice instruction belongs and the start and end times of the unknown variable within the voice instruction, and takes these as the preliminary recognition result, the first decoding space being formed by compiling the instruction template set and the named entity set in advance into two independent WFST networks.
8. The method according to claim 7, wherein the step in which the client module recognizes the unknown variable using the named entity information stored on the client comprises:
upon receiving the preliminary recognition result, the client module locates the speech segment to be recognized in the voice instruction according to the start and end times of the unknown variable, and decodes that segment in a second decoding space to obtain the recognition result of the unknown variable, the second decoding space being formed by compiling the named entity information stored on the client in advance into a WFST network.
9. The method according to claim 8, wherein step B further comprises:
the server module extracts speaker-related acoustic features from the voice instruction and sends them to the client module.
10. The method according to claim 9, wherein, when the client module decodes the speech segment to be recognized, it uses the speaker-related acoustic features, the second decoding space, and a speaker-related acoustic model, the speaker-related acoustic model having been trained in advance using the speaker's speech samples.
CN201210227158.3A 2012-06-30 2012-06-30 Voice recognition method and system Active CN103514882B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210227158.3A CN103514882B (en) 2012-06-30 2012-06-30 Voice recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210227158.3A CN103514882B (en) 2012-06-30 2012-06-30 Voice recognition method and system

Publications (2)

Publication Number Publication Date
CN103514882A true CN103514882A (en) 2014-01-15
CN103514882B CN103514882B (en) 2017-11-10

Family

ID=49897508

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210227158.3A Active CN103514882B (en) 2012-06-30 2012-06-30 Voice recognition method and system

Country Status (1)

Country Link
CN (1) CN103514882B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008132A (en) * 2014-05-04 2014-08-27 深圳市北科瑞声科技有限公司 Voice map searching method and system
CN105632487A (en) * 2015-12-31 2016-06-01 北京奇艺世纪科技有限公司 Voice recognition method and device
CN106057201A (en) * 2016-04-25 2016-10-26 北京市动感生活科技有限公司 Household electrical appliance intelligent voice interaction control method and apparatus
CN106373566A (en) * 2016-08-25 2017-02-01 深圳市元征科技股份有限公司 Data transmission control method and device
CN106529384A (en) * 2015-09-11 2017-03-22 英特尔公司 Technologies for object recognition for internet-of-things edge devices
CN106710592A (en) * 2016-12-29 2017-05-24 北京奇虎科技有限公司 Speech recognition error correction method and speech recognition error correction device used for intelligent hardware equipment
CN108694939A (en) * 2018-05-23 2018-10-23 广州视源电子科技股份有限公司 Phonetic search optimization method, device and system
CN110634472A (en) * 2018-06-21 2019-12-31 中兴通讯股份有限公司 Voice recognition method, server and computer readable storage medium
CN112269556A (en) * 2020-09-21 2021-01-26 北京达佳互联信息技术有限公司 Information display method, device, system, equipment, server and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004114277A2 (en) * 2003-06-12 2004-12-29 Motorola, Inc. System and method for distributed speech recognition with a cache feature
US20090253463A1 (en) * 2008-04-08 2009-10-08 Jong-Ho Shin Mobile terminal and menu control method thereof
CN101557432A (en) * 2008-04-08 2009-10-14 Lg电子株式会社 Mobile terminal and menu control method thereof
CN101971250A (en) * 2008-03-13 2011-02-09 索尼爱立信移动通讯有限公司 Mobile electronic device with active speech recognition

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004114277A2 (en) * 2003-06-12 2004-12-29 Motorola, Inc. System and method for distributed speech recognition with a cache feature
CN101971250A (en) * 2008-03-13 2011-02-09 索尼爱立信移动通讯有限公司 Mobile electronic device with active speech recognition
US20090253463A1 (en) * 2008-04-08 2009-10-08 Jong-Ho Shin Mobile terminal and menu control method thereof
CN101557432A (en) * 2008-04-08 2009-10-14 Lg电子株式会社 Mobile terminal and menu control method thereof

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008132A (en) * 2014-05-04 2014-08-27 深圳市北科瑞声科技有限公司 Voice map searching method and system
CN104008132B (en) * 2014-05-04 2018-09-25 深圳市北科瑞声科技股份有限公司 Voice map searching method and system
CN106529384A (en) * 2015-09-11 2017-03-22 英特尔公司 Technologies for object recognition for internet-of-things edge devices
CN105632487A (en) * 2015-12-31 2016-06-01 北京奇艺世纪科技有限公司 Voice recognition method and device
CN105632487B (en) * 2015-12-31 2020-04-21 北京奇艺世纪科技有限公司 Voice recognition method and device
CN106057201A (en) * 2016-04-25 2016-10-26 北京市动感生活科技有限公司 Household electrical appliance intelligent voice interaction control method and apparatus
CN106373566A (en) * 2016-08-25 2017-02-01 深圳市元征科技股份有限公司 Data transmission control method and device
CN106710592A (en) * 2016-12-29 2017-05-24 北京奇虎科技有限公司 Speech recognition error correction method and speech recognition error correction device used for intelligent hardware equipment
CN108694939A (en) * 2018-05-23 2018-10-23 广州视源电子科技股份有限公司 Phonetic search optimization method, device and system
CN110634472A (en) * 2018-06-21 2019-12-31 中兴通讯股份有限公司 Voice recognition method, server and computer readable storage medium
CN110634472B (en) * 2018-06-21 2024-06-04 中兴通讯股份有限公司 Speech recognition method, server and computer readable storage medium
CN112269556A (en) * 2020-09-21 2021-01-26 北京达佳互联信息技术有限公司 Information display method, device, system, equipment, server and storage medium

Also Published As

Publication number Publication date
CN103514882B (en) 2017-11-10

Similar Documents

Publication Publication Date Title
CN103514882A (en) Voice identification method and system
US9564127B2 (en) Speech recognition method and system based on user personalized information
CN113327609B (en) Method and apparatus for speech recognition
CN112183120A (en) Speech translation method, device, equipment and storage medium
CN102543071A (en) Voice recognition system and method used for mobile equipment
CN109840052B (en) Audio processing method and device, electronic equipment and storage medium
JP7365985B2 (en) Methods, devices, electronic devices, computer-readable storage media and computer programs for recognizing speech
US11393458B2 (en) Method and apparatus for speech recognition
CN110995943B (en) Multi-user streaming voice recognition method, system, device and medium
JP7375089B2 (en) Method, device, computer readable storage medium and computer program for determining voice response speed
JP2023162265A (en) Text echo cancellation
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
CN112863496B (en) Voice endpoint detection method and device
CN113724698B (en) Training method, device, equipment and storage medium of voice recognition model
CN112306560B (en) Method and apparatus for waking up an electronic device
CN114743540A (en) Speech recognition method, system, electronic device and storage medium
CN112712793A (en) ASR (error correction) method based on pre-training model under voice interaction and related equipment
CN112002325A (en) Multi-language voice interaction method and device
CN111414748A (en) Traffic data processing method and device
CN111899738A (en) Dialogue generating method, device and storage medium
CN112542157A (en) Voice processing method and device, electronic equipment and computer readable storage medium
CN112151073B (en) Voice processing method, system, equipment and medium
CN113763921B (en) Method and device for correcting text
CN113066507B (en) End-to-end speaker separation method, system and equipment
CN116264078A (en) Speech recognition processing method and device, electronic equipment and readable medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant