CN1828723B - Distributed language processing system and its method for outputting intermediary messages - Google Patents

Distributed language processing system and its method for outputting intermediary messages

Info

Publication number
CN1828723B
CN1828723B CN2005100512710A CN200510051271A
Authority
CN
China
Prior art keywords
language processing
unit
speech
speech recognition
dispersion type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2005100512710A
Other languages
Chinese (zh)
Other versions
CN1828723A (en)
Inventor
王瑞璋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Delta Electronics Inc
Delta Optoelectronics Inc
Original Assignee
Delta Optoelectronics Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Delta Optoelectronics Inc filed Critical Delta Optoelectronics Inc
Priority to CN2005100512710A priority Critical patent/CN1828723B/en
Publication of CN1828723A publication Critical patent/CN1828723A/en
Application granted granted Critical
Publication of CN1828723B publication Critical patent/CN1828723B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Machine Translation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The system comprises: a speech input interface to receive a speech signal; a speech recognition interface to recognize the received signal; a language processing unit to receive the recognition result and analyze it to obtain a semantic signal; and a dialogue management unit to receive the semantic signal, make a judgment, and produce the corresponding semantic information. The output method applies a common protocol for the intermediary message. The invention improves speech recognition performance.

Description

Distributed language processing system and the method of outputting intermediary messages used thereby
Technical field
The present invention relates to a distributed language processing system and the method of outputting intermediary messages used thereby, and particularly to a distributed language processing system and intermediary-message output method that use a single speech input interface, so that the user faces one simple interface while the user's speech recognition accuracy is improved.
Background technology
Speech input as a human-machine interface technology is increasingly mature, but a user may face several speech interfaces at the same time, which causes confusion. A dialogue element with a single speech interface that can link to different application systems simultaneously is, for the user, a very convenient and necessary design.
Speech input technology can serve as a voice-command control interface for application devices, or support automatic telephone information inquiry and reservation booking through speech recognition and machine dialogue. Voice-command control offers the convenience of hands-free remote operation together with the naturalness of human speech, and automatic spoken dialogue systems can assist human operators while providing non-stop service twenty-four hours a day, seven days a week, without closing even in the dead of night. Such automatic speech machines replace tedious routine work and raise the quality of human service.
Many speech-technology products are still in the development stage and not yet mature, so the convenience of using several speech-technology products at the same time has not yet been considered. For example, these interfaces each have different usage patterns, and each separately occupies considerable computation and memory (storage media, storage, RAM — hereafter all called "memory") resources, forcing the user to provide expensive, high-powered computing.
In general, speech input systems can be divided by vocabulary size into small-vocabulary voice-command control functions and large-vocabulary spoken dialogue systems; and by distance into client (Client) software used locally (Local) and server-level (Server) systems used remotely (Remote). The various applications each have their own user speech interface and do not communicate with one another, and each spoken dialogue system serves only a single application element. When a user works with several different application systems, a different user speech interface must be opened for each — as unwieldy as holding several remote controls — which is very inconvenient. The traditional architecture is shown in Fig. 1.
Referring to Fig. 1, the architecture includes a microphone and loudspeaker 110 to receive the speech signal input by the user. After conversion to a digital audio signal, it is sent to the server-level systems with their applications — server-level systems 112, 114 and 116 as shown. Each server-level system contains an application user interface, speech recognition, speech interpretation and dialogue management. If the user instead uses the telephone as the input medium, the analog speech signal is transmitted via telephone set 120 and sent through telephone interface cards 130, 140 and 150 to server-level systems 132, 142 and 152, each of which again contains an application user interface, speech recognition, speech interpretation and dialogue management. The various applications have separate user speech interfaces that do not communicate, and each spoken dialogue system serves only a single application element, so using several application systems requires opening several different user speech interfaces, which is very inconvenient.
For example, spoken dialogue systems used over telephone lines are mostly remote server-level systems — restaurant or airline natural-speech booking systems, medical natural-speech reservation systems, and so on. The speech signal or speech feature parameters are captured at the near end and sent over the telephone line to the far end, where remote speech recognition and language understanding processing units translate the speech signal into semantic information; the dialogue control element and application processing element of the application system then complete the task of communicating with and answering the user. In general, the speech recognition and language understanding processing units are placed at the far end and use speaker-independent (Speaker-independent) element processing, as shown in Fig. 2.
Referring to Fig. 2, the user uses the telephone as the input medium: the analog speech signal is transmitted via telephone set 210 through the telephone network and telephone interface card 220 to server-level system 230. This server-level system 230 contains a speech recognition unit 232, a speech interpretation unit 234, a dialogue management unit 236 and a connected database server 240; it produces speech output 238, which is returned to the user via the original telephone interface card 220.
Such a design has obvious shortcomings, and overcoming these problems is no easy task. First, as noted above, using several different user speech interfaces at once easily causes confusion. Second, when application software is added or removed, there is no unified interface to integrate it with the existing application environment, which complicates installation; at the same time, avoiding mutual contention for resources between the speech signal paths and the model-comparison computations is a practical difficulty in operation. Third, acoustic comparison engines and model parameters that operate independently of one another cannot enjoy the advantage of shared resources — for example, accumulating the user's speech signals and usage habits to adapt the acoustic model parameters, language model parameters and application preference parameters with speaker-dependent techniques. In general, the recognition accuracy after such adaptation is much better than the speaker-independent recognition rate.
In short, a single user speech interface not only provides a more convenient environment of use but also raises the overall effectiveness of speech recognition.
This shows that existing spoken dialogue systems obviously still have inconveniences and defects in both method and use, and urgently demand further improvement. To solve the problems of language processing systems, the relevant manufacturers have all painstakingly sought solutions, but for a long time no suitable design has been completed by development, and common products have no appropriate structure to address the above problems — clearly an issue that those in the trade are anxious to solve.
In view of the defects of existing spoken dialogue systems, the inventor — drawing on many years of practical experience and professional knowledge in designing and manufacturing such products, together with the application of scientific principles — has actively researched and innovated in the hope of founding a new distributed language processing system and method of outputting intermediary messages, improving on existing spoken dialogue systems and making them more practical. After continual research and design, and after repeated prototyping and improvement, the present invention of real practical value was finally created.
Summary of the invention
The objective of the present invention is to overcome the defects of existing spoken dialogue systems by providing a distributed language processing system of new structure. The technical problem to be solved is to use a single speech input interface, so that the user faces one simple interface while the user's speech recognition accuracy is improved, personalized dialogue patterns can be learned, and convenience of use is enhanced.
Another objective of the present invention is to overcome the defects of the information-output methods used by existing spoken dialogue systems by providing a new method of outputting intermediary messages for a distributed language processing system. The technical problem to be solved is to make the front-end speech recognition result acceptable to the back-end processing units while maintaining reliable semantic understanding accuracy.
The objectives of the invention are achieved by the following technical solutions. To reach the foregoing objectives, the distributed language processing system and its method of outputting intermediary messages according to the present invention propose a single speech-input dialogue interface having a single speech recognition function, a single dialogue interface, and Distributed Multiple Application-dependent Language Processing Units. This system not only provides a more convenient environment of use but also raises the overall effectiveness of speech recognition.
The present invention proposes a distributed language processing system comprising a speech input interface, a speech recognition interface, a language processing unit and a dialogue management unit. The speech input interface receives a speech signal. The speech recognition interface produces a speech recognition result from the received speech signal after recognition. The language processing unit receives the speech recognition result and obtains a semantic signal after analysis. The dialogue management unit receives the semantic signal and, after judging it, produces semantic information corresponding to the speech signal. The speech recognition interface has a model adaptation function: an acoustic model recognizes the received speech signal, and the adaptation function adapts the parameters of a speaker-dependent and device-dependent acoustic model — taking a speaker-independent and device-independent common model as the initial model parameters — to obtain the best recognition performance.
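The four components above form a pipeline. A minimal sketch follows; the function names, the stand-in recognition result, and the intent table are illustrative assumptions, not the patent's implementation.

```python
# Hypothetical sketch of the claimed pipeline: input -> recognition ->
# language processing -> dialogue management. All internals are stand-ins.

def speech_input_interface(raw_audio):
    # Receives a speech signal (here, just passed through unchanged).
    return raw_audio

def speech_recognition_interface(speech_signal):
    # Produces a speech recognition result; a real system runs an ASR engine.
    return "play music"  # stand-in recognition result

def language_processing_unit(recognition_result):
    # Analyzes the recognition result into a semantic signal.
    intents = {"play music": {"intent": "media.play", "object": "music"}}
    return intents.get(recognition_result, {"intent": "unknown"})

def dialogue_management_unit(semantic_signal):
    # Judges the semantic signal and produces the corresponding semantic information.
    if semantic_signal["intent"] == "media.play":
        return "Starting playback of " + semantic_signal["object"]
    return "Sorry, I did not understand."

reply = dialogue_management_unit(
    language_processing_unit(
        speech_recognition_interface(
            speech_input_interface(b"<pcm samples>"))))
print(reply)  # -> Starting playback of music
```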
In the above distributed language processing system, the model adaptation function may use a lexicon as the basis of adaptation, or use a language model (N-gram) as the basis of adaptation.
In one embodiment, the above distributed language processing system further comprises a mapping unit between the speech recognition interface and the language processing unit. It receives the speech recognition result, converts it by mapping according to an output intermediary message protocol, and passes the mapped signal to the language processing unit as the speech recognition result. The mapped signal may be passed to the language processing unit by broadcast, over a wired communication network, or over a wireless communication network. Under the output intermediary message protocol, the mapped signal is composed of a plurality of word units and a plurality of sub-word units, where a sub-word unit is a syllable (Syllable) of Chinese, or one or more English phonemes, or an English syllable.
According to this output intermediary message protocol, the mapped signal is a sequence composed of the plurality of word units and sub-word units, or a lattice (Lattice) composed of the plurality of word units and sub-word units.
In the above distributed language processing system, the dialogue management unit produces semantic information corresponding to the speech signal; if it is a voice command, the action corresponding to the voice command is carried out. In one embodiment, the voice command is judged against a confidence index, and the corresponding action is carried out only if the confidence exceeds it.
In the above distributed language processing system, the language processing unit comprises a language interpretation unit and a database; the language interpretation unit receives the speech recognition result, then analyzes it and compares it against the database to obtain the semantic signal corresponding to the speech recognition result.
In one embodiment, the above distributed language processing system is combined in a distributed architecture, in which the speech input interface, the speech recognition interface and the dialogue management unit are at the client end, and the language processing units are at application system server ends. Each application system server end has a corresponding language processing unit; these language processing units receive the speech recognition result, obtain semantic signals after analysis and return them to the dialogue management unit, which judges the semantic signals and produces the semantic information corresponding to the speech signals.
In another embodiment of the above distributed language processing system, the speech input interface, speech recognition interface, language processing unit and dialogue management unit may all be located at the same client end.
In one embodiment of the above distributed language processing system, the speech recognition interface can learn from a user's dialogue habits to improve recognition performance. Further, a greeting control mechanism may be included, so that the greeting of the speech input interface can be adjusted according to a given user.
The objectives of the invention are also achieved by the following technical scheme. To reach the foregoing objectives, the invention further proposes a method of outputting intermediary messages and the protocol it uses, applicable to a distributed language processing system combined in a distributed architecture. In this distributed architecture, a speech recognition interface and a dialogue management unit are at the client end, and a language processing unit is at an application system server end. When the speech recognition interface receives a speech signal and, after recognition, produces a speech recognition result, the result is converted via an output intermediary message protocol into a signal composed of a plurality of word units and a plurality of sub-word units, sent to the language processing unit, and analyzed to obtain a semantic signal, which is returned to the dialogue management unit to produce semantic information corresponding to the speech signal.
In the above method of outputting intermediary messages and its protocol, a sub-word unit is a syllable (Syllable) of Chinese, or one or more English phonemes, or an English syllable; and the signal composed of the plurality of word units and sub-word units after conversion under this protocol is a sequence or a lattice (Lattice) composed of those word units and sub-word units.
Compared with the prior art, the present invention has clear advantages and beneficial effects. It provides a single speech-input dialogue interface having a single speech recognition function, a single dialogue interface, and distributed application-dependent language processing units. The system not only provides a more convenient environment of use but also raises the overall effectiveness of speech recognition: with this distributed application-dependent language processing architecture and a single speech input interface, the user faces one simple interface, the user's speech recognition accuracy improves, personalized dialogue patterns can be learned, and convenience of use is enhanced.
Through the above technical scheme, the distributed language processing system of the present invention and its method of outputting intermediary messages have at least the following advantages: speech recognition is handled well at each individual near end, and only the intermediary message (comprising some common word units and sub-word units) is transmitted and processed, so any data transmission line — even a channel that may introduce delay — can carry it, saving communication cost. The server end does not need to process speech, saving the server end's computation cost. A unified interface also spares the user confusion when application elements are newly added or removed, providing broader space for developing speech-technology applications.
It has the above many advantages and practical value, and among like products and methods no similar structural design or method has been published or used, so it is a genuine innovation. Whether in product, method or function it represents a large improvement, achieving a sizable technical advance that produces handy and practical effects; it enhances existing spoken dialogue systems in multiple respects, is therefore more suitable for practical use, and has extensive industrial value — truly a novel, progressive and practical new design.
The above description is only an overview of the technical solution of the present invention. So that the technical means of the invention may be understood more clearly and implemented according to the content of the specification, and so that the above and other objectives, features and advantages may become more apparent, preferred embodiments are set out below and described in detail with reference to the drawings.
Description of drawings
Fig. 1 shows a traditional speech input system.
Fig. 2 is a block diagram of the speech recognition and language interpretation processing circuit in a traditional speech input system.
Fig. 3 shows the system architecture of a preferred embodiment of the present invention, having a single speech recognition function, a single dialogue interface, and distributed application-dependent language processing units.
110: microphone and loudspeaker
112, 114 and 116: server-level systems
120: telephone set
130, 140 and 150: telephone interface cards
132, 142 and 152: server-level systems
210: telephone set
220: telephone network and telephone interface card
230: server-level system
232: speech recognition unit
234: speech interpretation unit
236: dialogue management unit
240: database server
310 and 320: speech processing interfaces
312, 322: voice receiving units
314, 324: speech recognition units
316, 326: shortcut word mapping units
318, 328: dialogue management units
330, 340: application servers
332, 342: databases
334, 344: language interpretation units
Embodiment
To further explain the technical means and effects adopted by the present invention to achieve the intended objectives, the embodiments, structures, features and effects of the distributed language processing system proposed by the invention and its method of outputting intermediary messages are described in detail below with reference to the accompanying drawings and preferred embodiments.
The present invention proposes a single speech-input dialogue interface having a single speech recognition function, a single dialogue interface, and Distributed Multiple Application-dependent Language Processing Units. This system not only provides a more convenient environment of use but also raises the overall effectiveness of speech recognition.
Speech input as a human-machine interface technology is increasingly mature; at the same time, to control different application devices, query different information, or make reservations and bookings, a user may face many different speech input interfaces. If these interfaces each have different usage patterns and each separately occupies considerable computation and memory resources, the user will be considerably confused. An easy-to-operate simple interface that can nonetheless link to different application systems simultaneously and provide a unified environment of use is therefore quite important to the development and popularization of advanced speech technology.
The present invention is designed precisely to solve the above problems: a single speech input interface lets the user face one simple interface, improves the user's speech recognition accuracy, learns personalized dialogue patterns, and enhances convenience of use.
Speaker-dependent (Speaker-dependent) and device-dependent (Device-dependent) acoustic models are placed in the near-end elements first; this design improves the quality of acoustic matching for the user. In one alternative embodiment, the speaker-independent and device-independent common acoustic model can be used as the initial model parameters, and a Model Adaptation technique gradually improves the speaker-dependent and device-dependent model parameters, which can greatly improve recognition quality. In another alternative embodiment, the lexicon (Lexicon) and the language model (N-gram), both closely related to speech recognition, can also be adapted with this model adaptation technique to improve recognition quality.
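The adaptation idea can be sketched numerically. Below is a minimal, single-Gaussian MAP-style update of one model mean: the speaker-independent prior is pulled toward the user's accumulated samples. The values, the prior strength `tau`, and the single-mean simplification are assumptions for illustration; real acoustic-model adaptation operates on full HMM parameter sets.

```python
# Illustrative Model Adaptation sketch: interpolate a speaker-independent
# prior mean with the user's sample statistics (MAP-style update).

def adapt_mean(prior_mean, samples, tau=10.0):
    """Return the adapted mean; tau weights the prior against the data."""
    n = len(samples)
    return (tau * prior_mean + sum(samples)) / (tau + n)

# Speaker-independent prior for one model parameter (assumed value).
prior = 0.0
# Observations accumulated from this user on this device (assumed values).
user_frames = [1.0] * 40

adapted = adapt_mean(prior, user_frames)
print(round(adapted, 2))  # -> 0.8  (shifted most of the way toward the user data)
```

With no user data the prior is returned unchanged, and as samples accumulate the estimate approaches the user's own mean — matching the "gradually improves the speaker-dependent parameters" behavior described above.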
The above lexicon (Lexicon) provides the vocabulary that the speech recognition engine can recognize and the information of its corresponding sound units. For example, the word for "recognize" corresponds in the lexicon to the syllable sound units /bian4//ren4/, or to the phoneme sound units /b//i4//e4//M//r//e4//M/. From this information the speech recognition engine composes the acoustic comparison model of the word, for example a Hidden Markov Model (HMM), and so on.
The above language model (N-gram) records words and the probabilities of their connections — for example, the probability that "China" connects to "Republic of China", the probability that "China" connects to "nationality", and the probabilities of "China" connecting to other words. That is, it is a way of recording the likelihood of word-to-word connections; its function resembles that of a grammar, hence the "gram" in its English name. The rigorous definition is: a probability model of N connected words. Just as a foreigner learning Chinese must, besides learning the pronunciation of words, also read articles to acquire the ways in which characters are connected, the N-gram model estimates the probability values of N connected words from large-scale sampled article data.
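The estimation step described above — counting connected words in sample text — can be shown with a toy bigram (N=2) model. The corpus and words are invented for illustration only.

```python
from collections import Counter

# Minimal bigram estimate: P(next | word) = count(word, next) / count(word),
# with counts gathered from a (tiny, invented) sample corpus.

def bigram_probs(sentences):
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        for w in words:
            unigrams[w] += 1
        for w1, w2 in zip(words, words[1:]):
            bigrams[(w1, w2)] += 1
    return lambda w1, w2: bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

corpus = [
    ["china", "republic"],
    ["china", "nationality"],
    ["china", "republic"],
]
p = bigram_probs(corpus)
print(round(p("china", "republic"), 3))     # -> 0.667
print(round(p("china", "nationality"), 3))  # -> 0.333
```

A production N-gram model adds smoothing for unseen connections; this sketch only shows the relative-frequency estimate the text describes.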
Second, the output intermediary message protocol of the speech recognition element is designed so that the front-end speech recognition result can be accepted by the back-end processing units while maintaining reliable semantic understanding accuracy. Different application elements usually use different phrases; if the unit were the word, new recognition phrases would keep being added as applications increase. With few application systems this causes no trouble, but with many, the phrase set becomes too large for the front-end speech recognition unit to run. The shared intermediary message therefore adopts common words plus shared sub-word units. The common words can include frequently used voice commands, and adding them increases recognition accuracy and reduces recognition ambiguity to a certain degree. The sub-word units are "fragments" (Fragment) smaller than words — for example, the syllables (Syllable) of Chinese, or the phonemes, multi-phoneme units or syllables of English.
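The word-plus-fragment coverage described above can be sketched as follows. The common-word list, the syllable inventory, and the toneless-pinyin unit strings are all invented assumptions; the point is only that known command words stay whole while everything else falls back to shared sub-word units.

```python
# Sketch of the shared intermediary message: common command words are kept
# as word units; other material is carried as sub-word (syllable) units.

COMMON_WORDS = {"play", "stop"}                          # frequent voice commands
SYLLABLES = {"guo", "jia", "yin", "yue", "play", "stop"}  # shared sub-word units

def to_intermediary(units):
    """Map a recognized unit string to tagged word and sub-word units."""
    message = []
    for u in units:
        if u in COMMON_WORDS:
            message.append(("word", u))       # common command word
        elif u in SYLLABLES:
            message.append(("subword", u))    # shared syllable fragment
        else:
            message.append(("unknown", u))    # left for back-end filtering
    return message

print(to_intermediary(["play", "yin", "yue"]))
# -> [('word', 'play'), ('subword', 'yin'), ('subword', 'yue')]
```

Because the sub-word inventory is closed (about 408 toneless syllables for Chinese, per the text below), the front end never grows as applications are added — only the back-end vocabularies do.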
The above syllable (Syllable) is the pronunciation unit of Chinese characters. There are more than 1,300 tonal syllables in all, or about 408 syllables when tone is not counted. The pronunciation of every Chinese character is a single syllable; in other words, each syllable represents the pronunciation of one character, so an article of a given number of characters has that number of syllables. Examples of tonal syllables (written in Hanyu Pinyin) are /guo2/ ("country") or /jia1/ ("home"); written as toneless syllables they become /guo/ /jia/.
As for the phonemes, multi-phoneme units or syllables of English: when English is used, word pronunciations are mostly multisyllabic. When an automatic speech recognizer recognizes English, it is appropriate to extract in advance shared sound units smaller than the multisyllabic word as the units for model comparison. The choices include single-syllable units or sub-syllable units; the units most often used in English language teaching are phonemes, for example /a/, /i/, /u/, /e/ or /o/.
The output of the front-end speech recognition can be a sequence of the best several (N-Best) common words and sub-word units; in another alternative embodiment, it can also be a lattice (Lattice) of shared units. When the user speaks a passage, the speech recognizer compares it against the sound models and produces the recognition results with the highest comparison scores. Because recognition accuracy is not 100%, the recognition output can contain several possible results. Using N text strings as the output format is called the N-Best recognition result; each text string is an independent word-string sentence.
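The N-Best selection above is simply a ranking by comparison score. A minimal sketch, with invented hypotheses and scores:

```python
# N-Best output: keep the n hypotheses with the highest comparison scores.

def n_best(hypotheses, n):
    """hypotheses: list of (sentence, score); return top-n sentences."""
    return [h for h, _ in sorted(hypotheses, key=lambda p: p[1], reverse=True)[:n]]

scored = [
    ("seems to be like this", 0.91),
    ("seems to try this", 0.84),
    ("thinks well it is like this", 0.62),
]
print(n_best(scored, 2))  # -> ['seems to be like this', 'seems to try this']
```

Each returned string is one independent candidate sentence, exactly as the N-Best format described above.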
Another possible output format is the lattice (Lattice). In the word lattice (Word Lattice) form, the words shared by different word strings are connected to common nodes (Node); different sentences all connect through these shared words, so that all possible sentences appear as one trellis structure. An example of the trellis structure follows:
The word lattice (Word Lattice) is described as:
Node 1 is the start node (Start_Node)
Node 5 is the end node (End_Node)
Edge 1→2 'seems': Score(1, 2, 'seems')
Edge 1→2 'thinks well': Score(1, 2, 'thinks well')
Edge 2→3 'is': Score(2, 3, 'is')
Edge 2→3 'have a try': Score(2, 3, 'have a try')
Edge 2→4 'have a try': Score(2, 4, 'have a try')
Edge 3→5 'like this': Score(3, 5, 'like this')
Edge 4→5 '': Score(4, 5, '')
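A lattice in this form is a small directed acyclic graph; enumerating the start-to-end paths recovers every candidate sentence with its total score. The sketch below follows the edge words of the example in the text (the last edge's word is left blank there, so it is blank here too); the numeric scores are invented for illustration.

```python
# Word-lattice sketch: edges Score(i, j, word) between numbered nodes.

EDGES = {  # node -> list of (next_node, word, score); scores are assumed
    1: [(2, "seems", 0.9), (2, "thinks well", 0.4)],
    2: [(3, "is", 0.8), (3, "have a try", 0.5), (4, "have a try", 0.6)],
    3: [(5, "like this", 0.7)],
    4: [(5, "", 0.7)],  # edge word is blank in the text's example
}

def paths(node, end):
    """All (word list, total score) pairs from node to end."""
    if node == end:
        return [([], 0.0)]
    result = []
    for nxt, word, score in EDGES.get(node, []):
        for words, total in paths(nxt, end):
            result.append(([word] + words, score + total))
    return result

best_words, best_score = max(paths(1, 5), key=lambda p: p[1])
print(" ".join(best_words))  # -> seems is like this
```

Compared with an N-Best list, the lattice stores each shared word once, which is why it is the more compact of the two intermediary-message formats.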
Then, the above sequence or lattice is broadcast, or sent via a wired communication network or a wireless communication network, and received separately by the different application analysis elements — language processing analysis elements that may even be on different devices reached over the network — in order to interpret its semantic content. Each language processing analysis element performs its own language interpretation and obtains its corresponding semantic content. The different language interpretation processing units correspond to different application systems and therefore have different vocabularies and sentence grammars. The language comprehension analysis can filter out the unrecognizable parts of the intermediary message (comprising some common words and sub-word units), keep the intelligible information, further parse the sentence to carry out grammar comparison, and choose the best and most reliable semantic information as output, which is returned to the speech input interface device at the user's near end.
Finally, the dialogue management unit on the speech input interface device collects all the returned semantic information, adds contextual semantic information, comprehensively judges the current best result, and responds to the user in a multi-modal way, completing one exchange of the conversation. Alternatively, if the result is judged to be a voice command and the confidence index is sufficient, the assigned follow-up action is carried out and the task is completed.
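The dialogue management step just described — collect the returned semantic information, add context, then either respond or execute a sufficiently confident voice command — can be sketched as follows (the threshold value and the dictionary field names are illustrative assumptions, not the patent's specification):

```python
CONFIDENCE_INDEX = 0.8  # illustrative threshold; the patent does not fix a value

def manage_dialogue(returned_semantics, context):
    """Pick the best semantic result from all returns plus context, then decide."""
    # Each result: {"meaning": ..., "confidence": ..., "is_command": ...}
    candidates = returned_semantics + context   # contextual semantics join the pool
    best = max(candidates, key=lambda r: r["confidence"])
    if best["is_command"] and best["confidence"] >= CONFIDENCE_INDEX:
        return ("execute", best["meaning"])     # carry out the follow-up action
    return ("respond", best["meaning"])         # otherwise, multi-modal response

action, meaning = manage_dialogue(
    [{"meaning": "turn on the light", "confidence": 0.9, "is_command": True}],
    [{"meaning": "previous topic: lighting", "confidence": 0.2, "is_command": False}],
)
```

Here a confident command result outscores the contextual information, so the unit chooses to execute rather than merely respond.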
Please refer to FIG. 3, which shows the system architecture of a preferred embodiment of the present invention having a single speech recognition function, a single dialogue interface, and distributed application-oriented language processing units, for example a speech input and dialogue processing interface device. As shown, for convenience of description this system is illustrated with two speech processing interfaces 310 and 320 and two application servers 330 and 340; however, the embodiment is not limited to the two speech processing interfaces and application servers illustrated.
The speech processing interface 310 comprises a speech recognition unit (Speech Recognition Unit) 314, a shortcut words mapping unit (Shortcut Words Mapping Unit) 316, and a dialogue management unit (Dialogue Management Unit) 318. The speech processing interface 310 places the speaker-dependent (Speaker-dependent) and device-dependent (Device-dependent) acoustic model in the near-end component; this design yields better acoustic matching quality. The speech processing interface 310 can receive a voice signal from the user; of course, as shown, embodiments of the speech processing interface 310 may further comprise a voice receiving unit 312, for example a microphone (Microphone), to receive the user's voice signal.
The other speech processing interface 320 comprises a speech recognition unit 324, a shortcut words mapping unit 326, and a dialogue management unit 328. The speech processing interface 320 can receive a voice signal from the user; of course, as shown, embodiments of the speech processing interface 320 may further comprise a voice receiving unit 322, for example a microphone (Microphone), to receive the user's voice signal. In this embodiment, it receives the voice signal transmitted by user (A).
In the above speech processing interface 310, the speaker-dependent and device-dependent acoustic model can be placed in the speech recognition unit 314; this design improves acoustic matching quality. To build the speaker-dependent and device-dependent acoustic model, in an alternative embodiment the acoustic model can take a speaker-independent (Speaker-Independent) and device-independent (Device-Independent) common model as its initial model parameters and apply a model adaptation (Model Adaptation) technique to gradually improve the speaker-dependent and device-dependent model parameters; this can substantially improve recognition quality.
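The adaptation scheme described here — start from speaker-independent, device-independent parameters and move them toward the user's own data — can be illustrated with a simple MAP-style interpolation of a Gaussian mean, one common form of model adaptation. The single-Gaussian simplification and the prior weight `tau` are illustrative assumptions, not the patent's specification:

```python
def adapt_mean(si_mean, frames, tau=10.0):
    """Shift a speaker-independent Gaussian mean toward observed adaptation data.

    si_mean : initial (speaker/device-independent) mean for one model state
    frames  : feature values observed from this speaker on this device
    tau     : prior weight; larger tau trusts the common model longer
    """
    n = len(frames)
    if n == 0:
        return si_mean                      # no adaptation data: keep the common model
    sample_mean = sum(frames) / n
    # MAP interpolation between the common model and the sample statistics.
    return (tau * si_mean + n * sample_mean) / (tau + n)

# With more speaker data, the adapted mean moves gradually from the
# common model (0.0) toward the speaker's own statistics (1.0).
m_few = adapt_mean(0.0, [1.0] * 5)     # little data: stays near the common model
m_many = adapt_mean(0.0, [1.0] * 90)   # much data: approaches the speaker mean
```

This mirrors the "gradual improvement" the text describes: early on the common model dominates, and as user data accumulates the parameters become speaker- and device-specific.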
In an alternative embodiment, this model adaptation technique can also be applied to the lexicon (Lexicon) and the language model (N-gram), which are closely tied to speech recognition, to improve recognition quality.
In a preferred embodiment of the present invention, the output of the speech processing interface 310 follows an output intermediary message protocol: the speech recognition result output by the speech recognition unit 314 undergoes a mapping comparison in the shortcut words mapping unit 316 and is then output. Because the back-end processing units also understand signals conforming to this output intermediary message protocol, they can accept such speech recognition results while maintaining reliable semantic interpretation accuracy. Under the output intermediary message protocol of the preferred embodiment, the signal transmitted by the sender is a combination of common words and shared sub-word units.
In a traditional framework, different application elements usually use different phrase sets. If the word is the recognition unit, new recognition phrases must continually be added as applications increase. With few application systems this causes no trouble, but with many, the phrase set grows too large for the front-end speech recognition unit to run. Therefore, in an embodiment of the present invention, the speech recognition result output by the speech recognition unit 314 is passed through the mapping comparison of the shortcut words mapping unit 316 to produce a signal of common words and shared sub-word units, and both the signal sender and the signal receiver can understand and process signals defined by the output intermediary message protocol.
The above sub-word unit is a "fragment" (Fragment) smaller than a word — for example a syllable (Syllable) in Chinese, or a phoneme, multi-phoneme cluster, or syllable in English. The common words can include frequently used voice commands; adding common words increases recognition accuracy and reduces recognition errors to a certain degree. The front-end speech recognition output can be the aforementioned N-best sequences of common words and sub-word units, or a lattice (Lattice) of shared units.
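The intermediary-message idea — recognize a small set of common words directly and fall back to sub-word units for everything else — can be sketched like this (the common-word list and the trivial syllabifier are invented for illustration; a real system would use, e.g., Mandarin syllables):

```python
# Common words the front end recognizes directly (illustrative set).
COMMON_WORDS = {"turn on", "turn off", "weather"}

def to_intermediary(tokens, syllabify):
    """Map recognized tokens to common words, else emit sub-word units."""
    out = []
    for tok in tokens:
        if tok in COMMON_WORDS:
            out.append(("word", tok))         # common word: pass through whole
        else:
            # unknown token: decompose into sub-word "fragments"
            out.extend(("subword", s) for s in syllabify(tok))
    return out

# Trivial illustrative syllabifier: split an unknown token into characters.
seq = to_intermediary(["turn on", "taipei"], syllabify=list)
# seq mixes one common word with sub-word units for the unknown token
```

Because the shared sub-word inventory is fixed, the front end never needs to grow as new applications are added; only the back-end vocabularies change.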
Then, according to the above output intermediary message protocol, the speech processing interface 310 takes the output speech recognition result, after the mapping comparison by the shortcut words mapping unit 316 as shown in FIG. 3, and sends it via signal 311 to a language processing unit so that its semantic content can be interpreted. For example, the signal 311 is sent to application server (A) 330 or application server (B) 340. The signal 311 is a sequence signal or lattice signal conforming to the output intermediary message protocol. It can be sent to application server (A) 330 and application server (B) 340 by broadcast, over a wired communication network, or over a wireless communication network, received separately by the different application analysis elements, and even passed over the network to analysis elements on different devices.
Referring again to FIG. 3, application server (A) 330 comprises a database 332 and a language understanding unit 334, and application server (B) 340 comprises a database 342 and a language understanding unit 344. When application server (A) 330 and application server (B) 340 receive the signal 311, their language understanding units 334 and 344 analyze and process the language, consulting databases 332 and 342 respectively to obtain the semantic content.
Likewise, for the other speech processing interface 320, the output speech recognition result, after the mapping comparison by the shortcut words mapping unit 326 under the above output intermediary message protocol, is sent via signal 321 to application server (A) 330 or application server (B) 340. The signal 321 is a sequence signal or lattice signal conforming to the output intermediary message protocol. When application server (A) 330 and application server (B) 340 receive the signal 321, their language understanding units 334 and 344 analyze and process the language, consulting databases 332 and 342 respectively to obtain the semantic content.
The different language understanding units correspond to different application systems and therefore have different vocabularies and sentence grammars. The language interpretation analysis filters out unrecognizable intermediary information (including some common words and sub-word units), keeps the interpretable information, further parses the sentence for grammar comparison, and selects the best and most reliable semantic information. After the language understanding units 334 and 344 complete their language analysis and processing, the resulting semantic information is returned to speech processing interface 310 via semantic signals 331 and 341 respectively, or to speech processing interface 320 via semantic signals 333 and 343 respectively.
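The server-side step described here — each language understanding unit drops intermediary units outside its own vocabulary, matches the rest against its database, and returns its best semantic information with a reliability measure — can be sketched as follows (the vocabulary, database contents, and scoring rule are illustrative assumptions):

```python
def understand(intermediary, vocabulary, database):
    """Filter out unrecognizable units, then look up the remaining words."""
    known = [unit for kind, unit in intermediary
             if kind == "word" and unit in vocabulary]
    if not known:
        return None                         # nothing this application understands
    meaning = database.get(tuple(known))    # grammar comparison, reduced to a lookup
    score = len(known) / len(intermediary)  # crude reliability measure
    return {"meaning": meaning, "confidence": score}

# A lighting-control server understands "turn on"; a weather server would not.
result = understand(
    [("word", "turn on"), ("subword", "x")],
    vocabulary={"turn on", "turn off"},
    database={("turn on",): "LIGHT_ON"},
)
```

Each application server runs the same filtering against its own vocabulary, so the one signal broadcast from the front end yields different (or no) semantic returns from different servers.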
Then, the dialogue management unit on the speech input and dialogue processing interface device — the dialogue management unit 318 in speech processing interface 310, or the dialogue management unit 328 in speech processing interface 320 — collects all the returned semantic signals, adds contextual semantic information, comprehensively judges the current best result, and responds to the user in a multi-modal way, completing one exchange of the conversation. Alternatively, if the result is judged to be a voice command and the confidence index is sufficient, the assigned follow-up action is carried out and the task is completed.
In the above preferred embodiment's architecture — single speech recognition function, single dialogue interface, and distributed application-oriented language processing units — all the elements carrying out the dialogue are located at different positions and communicate with each other through different transmission media, for example by broadcast, over a wired communication network, or over a wireless communication network, each received separately by the different application analysis elements, and even passed over the network to analysis elements on different devices.
Basically, the system architecture of the present embodiment follows a distributed architecture: the user's near end — for example the above speech processing interfaces 310 and 320 — handles speech recognition and dialogue management, while the language understanding units performing the language interpretation analysis can be placed at the back end on the application system servers, for example the language understanding unit 334 of application server (A) 330 or the language understanding unit 344 of application server (B) 340.
In a further embodiment of the invention, the language understanding unit performing the language interpretation analysis can instead be placed at the user's near end; the choice depends on the needs of the design and on the computing power of the device at the user's near end. For example, an application system requiring heavy computation — say, a weather information query system — usually needs a large amount of computing and stores a large amount of information, so it requires considerable processing power to compute the required data quickly, and the grammars it must compare are also complex; such semantic-analysis application systems should therefore be located at the far end, that is, at the application server. Moreover, if an application system contains many special words differing from other application systems, it is more natural to let the application server handle it directly; the server end can then further collect different speakers' vocabularies and syntactic structures for further system learning. By contrast, for a personal phone book, for example, the information resides right at the user's near end and can usually be processed directly by the near-end language understanding unit.
Another example is lighting control in a large conference room. Since the lamp socket usually cannot host a processor with real computing power, a wireless command can be sent after processing by the near-end language understanding unit. Alternatively, a very small chip can handle a very limited vocabulary — just "turn on the light", "turn off the light", "switch the lamp on", and "switch the lamp off". The application system side and the user interface side naturally form many-to-many dialogue channels: different people can turn on the light by voice control, and different people can also use the weather query.
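A minimal handler for the four-phrase lamp vocabulary just mentioned might look like this (the command strings are translations, and the wireless-send callback is a placeholder assumption standing in for the radio transmission):

```python
# The entire command set a tiny chip needs for the lamp example.
LAMP_COMMANDS = {
    "turn on the light": "ON",
    "turn off the light": "OFF",
    "switch the lamp on": "ON",
    "switch the lamp off": "OFF",
}

def handle_utterance(text, send_wireless):
    """Match a recognized phrase and emit a wireless lamp instruction."""
    action = LAMP_COMMANDS.get(text)
    if action is None:
        return False            # outside the limited vocabulary: ignore
    send_wireless(action)       # placeholder for the actual radio transmission
    return True

sent = []
handled = handle_utterance("turn off the light", sent.append)
```

Anything outside the four phrases is simply ignored, which is exactly why such a vocabulary fits on a very small chip.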
In an alternative embodiment, the dialogue behavior of the present language processing unit system — with its single speech recognition function, single dialogue interface, and distributed applications — can improve its performance through learning from usage habits. For example, the greeting each user speaks when starting the speech input interface varies from person to person, yet all are recognized accurately. The switching commands with which each user changes the controlled application system or dialogue can likewise be personalized and adjusted so that applications are switched accurately. In another alternative embodiment, applications an individual uses frequently can be given nickname commands, adding convenience and enjoyment of operation, and application names that are hard to remember can be given personalized names. All of these functions can be provided at this unified speech input interface.
A traditional telephone voice dialogue application system comprises a speaker-independent (Speaker-independent) speech recognizer and a language understanding analyzer. Speech recognition is usually computation-heavy, so one system can handle only a limited number of telephone channels; handling more channels raises costs considerably. The channels carrying voice also consume more resources, causing service bottlenecks at peak hours and increasing the communication fees borne by users. If speech recognition is instead handled at each individual's near end, and only the intermediary information (including some common words and sub-word units) is transmitted and processed, then any data line can be used — even delay-tolerant channels — saving communication cost. The server end no longer needs to process voice, saving server-side computing resource cost as well.
Such an architecture design both satisfies the accuracy requirement of speech recognition and saves considerable cost, while the unified interface spares the user the trouble of newly added or removed application elements and provides broader space for developing voice technology applications. Central processor research and development advances rapidly, and handheld devices are gradually gaining high-computation processors; we expect easier human-machine interfaces to arrive in due course.
The above is only a preferred embodiment of the present invention and does not limit the present invention in any form. Although the present invention is disclosed above by way of a preferred embodiment, that embodiment is not intended to limit it: any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the methods and technical content disclosed above to make slight changes or modifications into equivalent embodiments of equivalent variation. Any simple modification, equivalent variation, or alteration made to the above embodiments according to the technical spirit of the present invention, without departing from the content of the technical solution, still falls within the scope of the technical solution of the present invention.

Claims (26)

1. A dispersion type language processing system, characterized in that it comprises:
a speech input interface, for receiving a voice signal;
a speech recognition interface, for producing a speech recognition result after recognition according to the received voice signal;
a language processing unit, for receiving the speech recognition result and obtaining a semantic signal after analysis; and
a dialogue management unit, for receiving the semantic signal and, after judging according to the semantic signal, producing semantic information corresponding to the voice signal,
wherein the speech recognition interface has a model adaptation function, the received voice signal being recognized via an acoustic model adapted by the model adaptation function, and the model adaptation function, for this speaker-dependent and device-dependent acoustic model, takes a speaker-independent and device-independent common model as initial model parameters and adjusts the parameters of the acoustic model.
2. The dispersion type language processing system according to claim 1, characterized in that the dispersion type language processing system further comprises a mapping unit, between the speech recognition interface and the language processing unit, for receiving the speech recognition result, mapping and converting it into a mapping signal according to an output intermediary message protocol, and passing it to the language processing unit as the speech recognition result.
3. The dispersion type language processing system according to claim 2, characterized in that under the output intermediary message protocol, the mapping signal is composed of a plurality of word units and a plurality of sub-word units.
4. The dispersion type language processing system according to claim 2, characterized in that the mapping signal is a sequence composed of word units and sub-word units.
5. The dispersion type language processing system according to claim 2, characterized in that the mapping signal is a lattice composed of a plurality of word units and a plurality of sub-word units.
6. The dispersion type language processing system according to claim 1, characterized in that when the semantic information corresponding to the voice signal produced by the dialogue management unit is a voice command, the action corresponding to the voice command is carried out.
7. The dispersion type language processing system according to claim 6, characterized in that when the semantic information corresponding to the voice signal produced by the dialogue management unit is a voice command, it is then judged whether the voice command exceeds a confidence index, and if so, the action corresponding to the voice command is carried out.
8. The dispersion type language processing system according to claim 1, characterized in that the language processing unit comprises a language understanding unit and a database, wherein after the language understanding unit receives the speech recognition result, it analyzes it against the database to obtain the semantic signal corresponding to the speech recognition result.
9. The dispersion type language processing system according to claim 1, characterized in that the dispersion type language processing system is combined according to a distributed architecture, wherein in the distributed architecture combination, the speech input interface, the speech recognition interface, and the dialogue management unit are at a user end, and the language processing unit is at an application system server end.
10. The dispersion type language processing system according to claim 9, characterized in that each application system server end has a corresponding language processing unit, and those language processing units receive the speech recognition result, obtain the semantic signals after analysis, and then return them to the dialogue management unit of the speech input and dialogue processing interface device for comprehensive judgment according to the semantic signals returned by each application system server end.
11. The dispersion type language processing system according to claim 1, characterized in that the dispersion type language processing system is combined according to a distributed architecture, wherein in the distributed architecture combination, the speech input interface, the speech recognition interface, the language processing unit, and the dialogue management unit are all located at a user end.
12. The dispersion type language processing system according to claim 1, characterized in that the speech recognition interface improves recognition performance through learning, according to the habits with which the user carries out dialogue.
13. The dispersion type language processing system according to claim 1, characterized in that the speech input interface comprises a greeting control mechanism that adjusts a greeting of the speech input interface according to a user.
14. A dispersion type language processing system, characterized in that it comprises:
a speech input interface, for receiving a voice signal;
a speech recognition interface, for producing a speech recognition result after recognition according to the received voice signal;
a plurality of language processing units, for receiving the speech recognition result and producing a plurality of semantic signals after analysis; and
a dialogue management unit, for receiving those semantic signals and, after judging according to those semantic signals, producing semantic information corresponding to the voice signal,
wherein the speech recognition interface has a model adaptation function, the received voice signal being recognized via an acoustic model adapted by the model adaptation function, and the model adaptation function, for this speaker-dependent and device-dependent acoustic model, takes a speaker-independent and device-independent common model as initial model parameters and adjusts the parameters of the acoustic model.
15. The dispersion type language processing system according to claim 14, characterized in that the dispersion type language processing system further comprises a mapping unit, between the speech recognition interface and those language processing units, for receiving the speech recognition result, mapping and converting it into a mapping signal according to an output intermediary message protocol, and passing it to those language processing units as the speech recognition result.
16. The dispersion type language processing system according to claim 15, characterized in that under the output intermediary message protocol, the mapping signal is composed of a plurality of word units and a plurality of sub-word units.
17. The dispersion type language processing system according to claim 14, characterized in that when the semantic information corresponding to the voice signal produced by the dialogue management unit is a voice command, the action corresponding to the voice command is carried out.
18. The dispersion type language processing system according to claim 17, characterized in that when the semantic information corresponding to the voice signal produced by the dialogue management unit is a voice command, it is then judged whether the voice command exceeds a confidence index, and if so, the action corresponding to the voice command is carried out.
19. The dispersion type language processing system according to claim 14, characterized in that each language processing unit comprises a language understanding unit and a database, wherein after the language understanding unit receives the speech recognition result, it analyzes it against the database to obtain the semantic signal corresponding to the speech recognition result.
20. The dispersion type language processing system according to claim 14, characterized in that the dispersion type language processing system is combined according to a distributed architecture, wherein in the distributed architecture combination, the speech input interface, the speech recognition interface, and the dialogue management unit are at a user end, while those language processing units correspond to application system server ends.
21. The dispersion type language processing system according to claim 20, characterized in that each application system server end has a corresponding language processing unit, and those language processing units receive the speech recognition result, obtain the semantic signals after analysis, and then return them to the dialogue management unit for comprehensive judgment according to the semantic signals returned by each application system server end.
22. The dispersion type language processing system according to claim 14, characterized in that the speech recognition interface improves recognition performance through learning, according to the habits with which the user carries out dialogue.
23. The dispersion type language processing system according to claim 14, characterized in that the speech input interface comprises a greeting control mechanism that adjusts a greeting of the speech input interface according to a user.
24. A method of outputting intermediary messages, utilizing an output intermediary message protocol and applicable to a dispersion type language processing system, wherein the dispersion type language processing system is combined in a distributed architecture comprising at least a user end and an application system server end, the user end comprising a speech recognition interface and a dialogue management unit, and the application system server end comprising a language processing unit, characterized in that the method of outputting intermediary messages comprises:
when the speech recognition interface receives a voice signal, producing a speech recognition result after recognition according to the received voice signal;
converting the result via the output intermediary message protocol into a signal composed of a plurality of word units and a plurality of sub-word units, sending it to the language processing unit, and obtaining a semantic signal after analysis; and
after returning the semantic signal to the dialogue management unit, producing semantic information corresponding to the voice signal.
25. the method for output intermediary message according to claim 24 is characterized in that wherein via this output intermediary message protocol conversion being to have the sequence that the signal those speech unit and those times speech unit formed is made up of those speech unit and time speech unit.
26. the method for output intermediary message according to claim 25 is characterized in that wherein via this output intermediary message protocol conversion being to have the netted lattice that the signal those speech unit and those times speech unit formed is made up of those speech unit and time speech unit.
CN2005100512710A 2005-03-03 2005-03-03 Dispersion type language processing system and its method for outputting agency information Expired - Fee Related CN1828723B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2005100512710A CN1828723B (en) 2005-03-03 2005-03-03 Dispersion type language processing system and its method for outputting agency information


Publications (2)

Publication Number Publication Date
CN1828723A CN1828723A (en) 2006-09-06
CN1828723B true CN1828723B (en) 2011-07-06


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010020273A (en) * 2007-12-07 2010-01-28 Sony Corp Information processing device, information processing method and computer program
US10229687B2 (en) * 2016-03-10 2019-03-12 Microsoft Technology Licensing, Llc Scalable endpoint-dependent natural language understanding



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110706

Termination date: 20190303
