CN101551998A - A group of voice interaction devices and method of voice interaction with human - Google Patents

A group of voice interaction devices and method of voice interaction with human

Info

Publication number
CN101551998A
CN101551998A CNA2009100510319A CN200910051031A
Authority
CN
China
Prior art keywords
data
group
speech recognition
voice
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2009100510319A
Other languages
Chinese (zh)
Other versions
CN101551998B (en)
Inventor
潘竞
程青云
马果
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jinxin Electronic Technology Co Ltd
Original Assignee
Shanghai Jinxin Electronic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jinxin Electronic Technology Co Ltd
Priority to CN2009100510319A
Publication of CN101551998A
Application granted
Publication of CN101551998B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Telephonic Communication Services (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a group of voice interaction devices and a method of voice interaction between the devices and a human. The group comprises two or more devices, each provided with a speech recognition system that includes a voice input module, a database, a speech recognition and control module, and a voice output module for outputting speech. The data stored in the databases of the speech recognition systems of the individual devices are logically correlated with one another, so that voice interaction between the devices in the group can be realized. The interaction between the devices is achieved through this logically correlated data, and by dividing the data in each database into groups and comparing the input speech only with the corresponding data group during recognition, the recognition speed is increased and the recognizable content can be enriched.

Description

A group of devices capable of voice interaction and a method of voice interaction between the devices and a person
Technical field
The present invention relates to the field of speech recognition and to devices having speech recognition, and in particular to a group of devices capable of voice interaction with one another and to a method of voice interaction between a person and this group of devices.
Background technology
Communicating with a machine by speech, so that the machine understands what you say, is something people have dreamed of for a long time. Speech recognition technology is the high technology that lets a machine turn speech signals into corresponding text or commands through a process of recognition and understanding. Speech recognition is an interdisciplinary field; over the past two decades it has made marked progress and has begun to move from the laboratory to the market. It is expected that within the next ten years speech recognition technology will enter fields such as industry, household appliances, communications, automotive electronics, medical care, home services and consumer electronics, and it is regarded as one of the ten major scientific and technological application achievements in the electronics and information field for the period 2000 to 2010. This achievement will drive considerable product upgrading in household appliances, communications and industrial control both nationally and globally. At present many companies around the world have applied speech recognition technology in telecommunications, the service sector and industrial production lines, and have created a batch of novel voice products (such as voice notepads, voice-controlled toys, voice remote controls and home servers). However, in the field of speech recognition today, communication is one-to-one between the user and the device, the scenes of such communication are very limited, and the entries recognizable by a speech recognition device are also very limited. In view of these shortcomings, it is necessary to propose a group of devices that can interact by speech among themselves and enrich the dialogue scenarios.
Summary of the invention
The technical problem to be solved by the present invention is to provide a group of devices capable of voice interaction and a method of voice interaction between the devices and a person. By means of logically correlated data stored in the databases of the speech recognition systems of the devices in the group, voice interaction between the devices is realized. By dividing the data in the databases into groups, the input speech is compared only with the corresponding data group during recognition, which increases the recognition speed and greatly reduces the demand on system memory. At the same time, when data are added to the database, the recognition speed is not reduced and the capacity of the random access memory does not need to be changed, so the recognizable content can be enriched conveniently and freely.
To solve the above technical problem, the invention provides a group of devices capable of voice interaction, wherein the group comprises two or more devices. Each device is provided with a speech recognition system comprising: a voice input module for inputting speech into the speech recognition system; a database storing the content to be recognized and speech data such as the content of the responses to be made according to the recognized content; a speech recognition control module for comparing the speech data input through the voice input module with the statements stored in the database; and a voice output module included in the speech recognition system for outputting speech. The group is characterized in that the data stored in the databases of the speech recognition systems of the respective devices are logically correlated with one another, so that voice interaction between the devices in the group can be realized.
A further improvement of the present invention is that the data stored in the database are divided into several groups according to the dialogue scenes in which they are used, each scene being one group of data, and each data group having a head node that contains the scene information of that data group; at least one data group in the database of the speech recognition system of each device is logically correlated with at least one data group in the databases of the speech recognition systems of the other devices.
A further improvement of the present invention is that each data group can be divided into two or more sub-groups, and the content of each sub-group can also be combined with the content of sub-groups of other groups to form a new group, in other words a new scene.
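The grouping just described lends itself to a simple nested data structure. The following is a minimal Python sketch under that reading; the names SceneGroup, SubGroup and combine are illustrative assumptions, not terms taken from the patent.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SubGroup:
    """A sub-group of a scene: a few utterances that belong together."""
    name: str
    entries: List[str]          # recognizable utterances in this sub-group

@dataclass
class SceneGroup:
    """One dialogue scene; the head node carries the scene name and
    references (here simply the texts) of all recognizable entries."""
    scene_name: str             # head-node scene information
    sub_groups: List[SubGroup] = field(default_factory=list)

    def head_node_entries(self) -> List[str]:
        # the head node lists every entry reachable in this scene
        return [e for sg in self.sub_groups for e in sg.entries]

def combine(name: str, *parts: SubGroup) -> SceneGroup:
    """Combine sub-groups taken from different scenes into a new scene."""
    return SceneGroup(scene_name=name, sub_groups=list(parts))

# Example: two scenes whose sub-groups are recombined into a third scene.
greeting = SceneGroup("greeting", [SubGroup("hello", ["hello", "hi"]),
                                   SubGroup("goodbye", ["bye", "see you"])])
weather = SceneGroup("weather", [SubGroup("ask", ["how is the weather"]),
                                 SubGroup("answer", ["it is sunny"])])
small_talk = combine("small talk", greeting.sub_groups[0], weather.sub_groups[0])
print(small_talk.head_node_entries())
```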
A further improvement of the present invention is that the speech recognition system includes a data input interface for inputting new data into the database.
On the other hand, the invention provides a method of voice interaction between a person and a group of devices, wherein the group comprises two or more devices. Each device is provided with a speech recognition system comprising: a voice input module for inputting speech into the speech recognition system; a database storing the content to be recognized and speech data such as the content of the responses to be made according to the recognized content; a speech recognition control module for comparing the speech data input through the voice input module with the statements stored in the database; and a voice output module included in the speech recognition system for outputting speech. The data stored in the databases of the speech recognition systems of the respective devices are logically correlated with one another, so that voice interaction between the devices in the group can be realized. The method comprises: a) first, a person speaks an instruction;
b) after each device in the group hears the instruction, it recognizes the instruction through the speech recognition control module of its speech recognition system and finds, through the speech recognition control module, the data group of the scene corresponding to the instruction in its database;
c) after the relevant devices find the data group of the corresponding scene, a first device outputs speech according to the instruction through its speech output;
the method being characterized in that: d) after the first device among the devices related to the scene has output speech, the other devices receive this speech data through their speech recognition systems and compare it with the data stored in their databases; a second device related to the scene outputs, through its speech output, speech matching the speech of the first device according to the result of the comparison;
the above steps are repeated until a complete scene dialogue is finished.
A further improvement of this aspect of the present invention is that the data stored in the database are divided into several groups according to the scenes in which they are used, each scene being one group of data, and each data group having a head node that contains the scene information of that data group; at least one data group in the database of the speech recognition system of each device is logically correlated with at least one data group in the databases of the speech recognition systems of the other devices.
A further improvement of this aspect of the present invention is that step c) further comprises: step c1) after the user speaks the instruction, the speech recognition system in each device compares, through its speech recognition control module, the instruction with the scene information data of the head node of each data group and then finds the corresponding data group; step c2) the first device outputs, through the speech output of its speech recognition system, speech correlated with the user's instruction;
and step d) further comprises: step d1) after the first device outputs speech, the other devices load the speech data sent by the first device into the speech recognition control modules of their speech recognition systems through the voice input modules of those systems, and compare the speech data sent by the first device with the data in the data group of the corresponding scene;
step d2) when the second device finds speech data matching the speech sent by the first device, it outputs the corresponding speech through its voice output module.
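Steps c1/c2 and d1/d2 can be summarized as two handlers on each device: one that matches a spoken instruction against the head nodes, and one that matches a peer's speech only within the current scene's data group. The Python sketch below is a loose illustration under those assumptions; the Device class and its data layout are hypothetical and not part of the patent.

```python
from typing import Dict, Optional

# Hypothetical per-device scene data: for each scene (identified by its head
# node / scene name), a mapping from an utterance this device may hear to the
# line it answers with.
SceneData = Dict[str, Dict[str, str]]

class Device:
    def __init__(self, name: str, scenes: SceneData):
        self.name = name
        self.scenes = scenes
        self.current_scene: Optional[str] = None

    def hear_instruction(self, instruction: str) -> Optional[str]:
        """Steps c1/c2: compare the instruction against the head node (scene
        name) of every data group; if a scene matches and this device holds a
        reply to the instruction itself, it acts as the first device."""
        for scene_name, replies in self.scenes.items():
            if scene_name in instruction:
                self.current_scene = scene_name
                return replies.get(instruction)   # opening line, if this device has one
        return None

    def hear_peer(self, heard: str) -> Optional[str]:
        """Steps d1/d2: compare the heard speech only against the data group of
        the current scene and output the logically correlated reply, if any."""
        if self.current_scene is None:
            return None
        return self.scenes[self.current_scene].get(heard)

# Example: the device that holds a reply to the instruction speaks first (c2),
# the other answers from the same scene's data group (d2).
a = Device("a", {"weather": {"talk about the weather": "how is the weather today?"}})
b = Device("b", {"weather": {"how is the weather today?": "it is sunny today."}})
first_line = a.hear_instruction("talk about the weather")
b.hear_instruction("talk about the weather")        # b also locates the scene (step b)
print(first_line)                                   # -> how is the weather today?
print(b.hear_peer(first_line))                      # -> it is sunny today.
```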
By the above-described technical solution, the group of devices capable of voice interaction provided by the invention, and the method of voice interaction between the devices and a person, realize voice interaction between the devices by means of logically correlated data stored in the databases of the speech recognition systems of the devices in the group. By dividing the data in the databases into groups, the input speech is compared only with the corresponding data group during recognition, which increases the recognition speed and greatly reduces the demand on system memory; at the same time, when data are added to the database, the capacity of the random access memory does not need to be changed and the recognition speed is not reduced, so the recognizable content can be enriched conveniently and freely.
Description of drawings
Fig. 1 is a block diagram of the speech recognition system provided in each device of a group of devices according to a preferred embodiment of the present invention;
Fig. 2 is a flow chart of recognition by the speech recognition system of each device in a group of devices capable of voice interaction according to a preferred embodiment of the present invention;
Fig. 3 is a diagram of the data grouping in the database of the speech recognition system in each device of a group of devices capable of voice interaction according to a preferred embodiment of the present invention; and
Fig. 4 is a flow chart of voice interaction between a person and a group of devices according to a preferred embodiment of the present invention.
Embodiment
The present invention involves a plurality of devices, but the hardware configuration and the workflow of each device are the same. Implementing the invention mainly involves three technical aspects: first, speech recognition; second, a well-designed data structure that facilitates switching between scenes; and third, effective methods for improving the correctness of recognition between the devices and the correctness with which a device judges the user's speech. In this embodiment, two devices are taken as an example to describe in detail a group of devices capable of voice interaction. The present invention is described in detail below with reference to the accompanying drawings.
With reference to Fig. 1, which is a block diagram of the speech recognition system provided in each device of a group of devices according to a preferred embodiment of the present invention: the speech recognition system comprises a speech recognition control module 10 and, each communicatively connected with this speech recognition control module 10, a voice input module 20, a database 30, a data input interface 40, a voice output module 50 and an action output module 60. The speech recognition control module 10 comprises a processor running a speech recognition algorithm; alternatively, the speech recognition control module 10 may be a processor plus a separate speech recognition module. The voice input module 20 comprises a microphone, used to amplify the input speech, and an analog-to-digital (A/D) conversion circuit, which converts the input speech from an analog signal into a digital signal and then inputs this digital signal into the speech recognition control module 10. The database 30 stores the content to be recognized and speech data such as the content of the responses to be made according to the recognized content. The data input interface 40 is used to input new data into the database 30, so that the device can change its functions and content according to the user's needs. The voice output module 50 comprises a digital-to-analog (D/A) conversion circuit and a loudspeaker, and converts the digital speech data to be output into analog speech, which is amplified by the loudspeaker and output. The output content is not limited to speech; it may also be another mechanical or electronic action performed after the speech is recognized.
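As a rough software analogue of this block diagram, the sketch below wires hypothetical Python stand-ins for modules 10, 20, 30, 40 and 50 together; the class names and the string-based "signals" are illustrative assumptions only, not the patent's hardware.

```python
from typing import Dict, List, Optional

class VoiceInputModule:                      # module 20: microphone + A/D converter
    def capture(self, analog_speech: str) -> str:
        # in hardware this amplifies and digitises the signal; here it is a pass-through
        return analog_speech.strip().lower()

class VoiceOutputModule:                     # module 50: D/A converter + loudspeaker
    def play(self, digital_speech: str) -> None:
        print(f"[speaker] {digital_speech}")

class Database:                              # module 30: open, user-editable speech data
    def __init__(self, scenes: Dict[str, List[str]]):
        self.scenes = scenes
    def add_scene(self, name: str, entries: List[str]) -> None:
        self.scenes[name] = entries          # data arriving via the data input interface 40

class SpeechRecognitionControlModule:        # module 10: processor + recognition algorithm
    def __init__(self, db: Database):
        self.db = db
    def recognise(self, digital_speech: str) -> Optional[str]:
        for scene, entries in self.db.scenes.items():
            if digital_speech in entries:
                return scene
        return None

# Wiring the modules together as in the block diagram.
db = Database({"greeting": ["hello", "hi"]})
control = SpeechRecognitionControlModule(db)
mic, speaker = VoiceInputModule(), VoiceOutputModule()
scene = control.recognise(mic.capture("Hello "))
speaker.play(f"recognised scene: {scene}")
```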
The above describes the speech recognition system provided in each device used in the present invention. In this speech recognition system the data stored in the database 30 are open data, that is to say the user can change the content according to his or her own needs, and the recognition entries can be added, reduced or changed before each use, so that the user's own needs can be satisfied. Through the data input interface 40, the user can input data that has been burned in advance into the speech recognition control module 10, and the speech recognition control module 10 places the data that has come in through the data interface 40 into the database 30.
In addition, with reference to Fig. 2, the data stored in the database 30 are divided, according to the various scenes, into a plurality of data groups 31, 32, 33..., each group representing a different scene; and each data group 31, 32, 33... can in turn be divided into a plurality of sub-groups 311, 312, 313..., 321, 322, 323..., and the content of each sub-group can also be combined with the content of sub-groups of other groups into a new group, in other words a new scene. When the data are grouped, each data group has a head node, and this head node contains the scene information of the data group, including the scene name, the addresses of all possible recognition entries and so on; depending on the concrete scene, each data group further has several sub-nodes according to its sub-groups, and these sub-nodes likewise contain the information of the sub-groups, including name information, the addresses of all possible recognition entries and so on. When the speech recognition control module 10 compares the speech data input through the input module 20 with the data stored in the database 30, it does not, as in traditional speech recognition methods, compare the input speech data with all the data stored in the database 30; instead it compares the input speech data with the scene name of each data group, that is, with the head node, thereby selecting the corresponding data group, and then compares the data group of the corresponding scene with the input speech data. This way of comparing data speeds up speech recognition, and the data stored in the database 30 can be increased without slowing recognition down. In addition, by means of the grouping, the present invention can also add auxiliary nodes to recognition entries that are easily confused or that have several synonymous expressions, thereby effectively improving the recognition rate and the recognition effect. For example, for the recognition entry "how are you", auxiliary nodes such as "hello" and "you" are added; when a device performs scene recognition on the speech content, the auxiliary nodes are compared at the same time as the node itself, which improves recognition efficiency and recognition effect and allows the device to better accommodate the user's speaking habits.
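The two-stage comparison (head nodes first, then only the selected group) and the auxiliary nodes for confusable or synonymous entries might be sketched as follows; the keyword tables and matching by substring are simplifying assumptions, not the patent's actual recognition algorithm.

```python
from typing import Dict, List, Optional

# Hypothetical structures: each scene's head node lists its name plus
# auxiliary (synonym / easily-confused) forms; the body holds the entries.
HEAD_NODES: Dict[str, List[str]] = {
    "greeting": ["greeting", "hello", "hi"],     # scene name + auxiliary nodes
    "weather":  ["weather", "forecast"],
}
SCENE_ENTRIES: Dict[str, List[str]] = {
    "greeting": ["hello", "hi", "how are you", "fine, thank you"],
    "weather":  ["how is the weather", "it is sunny today"],
}

def select_scene(utterance: str) -> Optional[str]:
    """Stage 1: compare the input only with each head node (and its
    auxiliary nodes), not with every entry in the database."""
    for scene, keywords in HEAD_NODES.items():
        if any(k in utterance for k in keywords):
            return scene
    return None

def recognise(utterance: str) -> Optional[str]:
    """Stage 2: compare the input only within the selected scene's group."""
    scene = select_scene(utterance)
    if scene is None:
        return None
    for entry in SCENE_ENTRIES[scene]:
        if entry in utterance:
            return entry
    return None

print(recognise("hi there, how are you"))   # matched inside the 'greeting' group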
With reference to Fig. 3, which is the speech recognition flow chart of the speech recognition system in each device according to a preferred embodiment of the present invention: 201: the user, or another device, speaks an instruction or other speech; the voice input module 20 amplifies the analog speech signal, converts it into a digital speech signal and inputs it into the speech recognition control module 10. 202: the scene content to be recognized is determined according to the speech content. 203: the speech recognition control module 10 adds the corresponding content to the recognition list. 204: the speech recognition control module 10 compares the content of the recognition list with the speech data input by the user or by the other device. 205: if recognition succeeds, the recognition result is output and the new scene is determined according to the result; if recognition fails, the flow returns to step 204 and the comparison is carried out again.
With reference to Fig. 4, the flow chart of voice interaction between a user and two devices according to a preferred embodiment of the present invention. When two devices capable of voice interaction interact by speech, the flow comprises: step 401: the user says a sentence, issuing an instruction that starts the two voice interaction devices; steps 402, 402': the first device and the second device receive the sentence said by the user through their voice input modules 20, carry out speech recognition on it with their speech recognition control modules 10, and compare the sentence with the head nodes of the data groups stored in their databases 30; steps 403, 403': through the recognition of step 402, the first device finds the data group N of the scene corresponding to the user's sentence, and the second device finds the data group N' of the corresponding scene; step 404: after the first device has found the data group of the corresponding scene, it says the first sentence of the scene, outputting it through its voice output module 50; step 404': after the second device has found the data group of the corresponding scene, it sets the first sentence said by the first device as the content to be recognized, writes the first sentences of the other scenes into the recognition list as well, and recognizes the sentence with its speech recognition control module 10; step 405': if the scene corresponds, the second device says the second sentence; if it does not correspond, the second device switches scene according to the first sentence of another scene in the recognition list and, after finding the corresponding scene, says the second sentence; step 405: the first device loads the second sentence said by the second device into the recognition list of its speech recognition control module 10 through its voice input module 20, recognizes this second sentence and then says the third sentence. The above steps are repeated until the scene dialogue is finished.
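The two-device flow of steps 401-405, including the recognition list and the scene switch of step 405', could be simulated roughly as below; the example scenes and sentences are invented for illustration and are not taken from the patent.

```python
from typing import Dict, List

# Hypothetical script data: device 1 stores the sentences it speaks (1st, 3rd, ...),
# device 2 maps each sentence it may hear to its reply (2nd, 4th, ...), grouped per
# scene (head node).
FIRST_DEVICE: Dict[str, List[str]] = {
    "ordering food": ["what would you like to eat?", "one noodle soup, coming up."],
    "greeting":      ["hello, nice to meet you.", "have a good day."],
}
SECOND_DEVICE: Dict[str, Dict[str, str]] = {
    "ordering food": {"what would you like to eat?": "a bowl of noodle soup, please."},
    "greeting":      {"hello, nice to meet you.": "nice to meet you too."},
}

def run_dialogue(user_instruction: str) -> None:
    # steps 401-403: both devices compare the instruction with their head nodes
    scene = next((s for s in FIRST_DEVICE if s in user_instruction), None)
    guess = next((s for s in SECOND_DEVICE if s in user_instruction), None)
    if scene is None:
        return
    for sentence in FIRST_DEVICE[scene]:                       # steps 404 and 405
        print(f"device 1: {sentence}")
        # step 404': device 2's recognition list holds the expected sentence of
        # its guessed scene plus the first sentences of the other scenes
        recognition_list = list(SECOND_DEVICE.get(guess, {})) + [
            next(iter(d)) for s, d in SECOND_DEVICE.items() if s != guess
        ]
        if sentence not in SECOND_DEVICE.get(guess, {}) and sentence in recognition_list:
            # step 405': the heard sentence belongs to another scene, so switch
            guess = next(s for s, d in SECOND_DEVICE.items() if sentence in d)
        reply = SECOND_DEVICE.get(guess, {}).get(sentence)
        if reply:
            print(f"device 2: {reply}")

run_dialogue("let's practice ordering food")
```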
The voice interaction described above is between two devices and between the devices and a person. In the present invention, when more than two devices take part in the voice interaction, the working mode is the same as between two devices: first the user speaks an instruction, each device finds the corresponding scene, then each device takes the speech content of the other devices as content to be recognized and, according to the recognition result, says speech content that matches what the other devices have said.
It should be understood that the detailed description of the above embodiments is intended to illustrate and explain the principle of the present invention rather than to limit its scope of protection. Without departing from the gist of the present invention, those of ordinary skill in the art may, through their understanding of the principles taught by the above technical solutions, make modifications, changes and variations on the basis of these embodiments. The scope of protection of the present invention is therefore defined by the appended claims and their equivalents.

Claims (7)

1. A group of devices capable of voice interaction, wherein the group comprises two or more devices; each device is provided with a speech recognition system comprising: a voice input module for inputting speech into the speech recognition system; a database storing the content to be recognized and speech data such as the content of the responses to be made according to the recognized content; a speech recognition control module for comparing the speech data input through the voice input module with the statements stored in the database; and a voice output module included in the speech recognition system for outputting speech; characterized in that the data stored in the databases of the speech recognition systems of the respective devices are logically correlated with one another, so that voice interaction between the devices of the group can be realized.
2. The group of devices capable of voice interaction according to claim 1, characterized in that the data stored in said database are divided into several groups according to the scenes in which they are used, each scene being one group of data, and each data group having a head node that contains the scene information of that data group; wherein at least one data group in the database of the speech recognition system of each device is logically correlated with at least one data group in the databases of the speech recognition systems of the other devices.
3. The group of devices capable of voice interaction according to claim 2, characterized in that each data group can be divided into two or more sub-groups, and the content of each sub-group can also be combined with the content of sub-groups of other groups to form a new group, in other words a new scene.
4. The group of devices capable of voice interaction according to any one of claims 1 to 3, characterized in that the speech recognition system includes a data input interface for inputting new data into the database.
5. A method of voice interaction between a person and a group of devices, wherein the group comprises two or more devices; each device is provided with a speech recognition system comprising a voice input module for inputting speech into the speech recognition system, a database storing the content to be recognized and speech data such as the content of the responses to be made according to the recognized content, a speech recognition control module for comparing the speech data input through the voice input module with the statements stored in the database, and a voice output module included in the speech recognition system for outputting speech; the data stored in the databases of the speech recognition systems of the respective devices being logically correlated with one another, so that voice interaction between the devices of the group can be realized; the method comprising: a) first, a person speaks an instruction;
b) after each device in the group hears the instruction, it recognizes the instruction through the recognition module of its speech recognition system and finds, through the recognition module, the data group of the scene corresponding to the instruction in its database;
c) after the relevant device finds the data group of the corresponding scene, it outputs speech according to the instruction through its speech output;
characterized in that: d) after the first device among the devices related to the scene has output speech, the other devices receive this speech data through their speech recognition systems and compare it with the data stored in their databases; a second device related to the scene outputs, through its speech output, speech matching the speech of the first device according to the result of the comparison;
the above steps being repeated until a complete scene dialogue is finished.
6. The method of voice interaction between a person and a group of devices according to claim 5, characterized in that the data stored in said database are divided into several groups according to the scenes in which they are used, each scene being one group of data, and each data group having a head node that contains the scene information of that data group; wherein at least one data group in the database of the speech recognition system of each device is logically correlated with at least one data group in the databases of the speech recognition systems of the other devices.
7. The method of voice interaction between a person and a group of devices according to claim 8, characterized in that step c) further comprises: step c1) after the user speaks the instruction, the speech recognition system in each device compares, through its speech recognition control module, the instruction with the scene information data of the head node of each data group and then finds the corresponding data group; step c2) the first device outputs, through the speech output of its speech recognition system, speech correlated with the user's instruction;
and step d) further comprises: step d1) after the first device outputs speech, the other devices load the speech data sent by the first device into the speech recognition control modules of their speech recognition systems through the voice input modules of those systems, and compare the speech data sent by the first device with the data in the data group of the corresponding scene;
step d2) when the second device finds speech data matching the speech sent by the first device, it outputs the corresponding speech through its voice output module.
CN2009100510319A 2009-05-12 2009-05-12 A group of voice interaction devices and method of voice interaction with human Expired - Fee Related CN101551998B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009100510319A CN101551998B (en) 2009-05-12 2009-05-12 A group of voice interaction devices and method of voice interaction with human

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009100510319A CN101551998B (en) 2009-05-12 2009-05-12 A group of voice interaction devices and method of voice interaction with human

Publications (2)

Publication Number Publication Date
CN101551998A (en) 2009-10-07
CN101551998B CN101551998B (en) 2011-07-27

Family

ID=41156202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009100510319A Expired - Fee Related CN101551998B (en) 2009-05-12 2009-05-12 A group of voice interaction devices and method of voice interaction with human

Country Status (1)

Country Link
CN (1) CN101551998B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102723080A (en) * 2012-06-25 2012-10-10 惠州市德赛西威汽车电子有限公司 Voice recognition test system and voice recognition test method
CN102855873A (en) * 2012-08-03 2013-01-02 海信集团有限公司 Electronic equipment and method used for controlling same
CN103135751A (en) * 2011-11-30 2013-06-05 北京德信互动网络技术有限公司 Intelligent electronic device and voice control method based on voice control
CN103632664A (en) * 2012-08-20 2014-03-12 联想(北京)有限公司 A method for speech recognition and an electronic device
CN107644641A (en) * 2017-07-28 2018-01-30 深圳前海微众银行股份有限公司 Session operational scenarios recognition methods, terminal and computer-readable recording medium
CN108648749A (en) * 2018-05-08 2018-10-12 上海嘉奥信息科技发展有限公司 Medical speech recognition construction method and system based on voice activated control and VR
CN110021299A (en) * 2018-01-08 2019-07-16 佛山市顺德区美的电热电器制造有限公司 Voice interactive method, device, system and storage medium
CN110086945A (en) * 2019-04-24 2019-08-02 北京百度网讯科技有限公司 Communication means, server, smart machine, server, storage medium
CN110231927A (en) * 2013-10-15 2019-09-13 三星电子株式会社 Image processing apparatus and its control method
WO2020181407A1 (en) * 2019-03-08 2020-09-17 发条橘子云端行销股份有限公司 Voice recognition control method and device
CN113494798A (en) * 2020-04-02 2021-10-12 青岛海尔电冰箱有限公司 Refrigerator, method of controlling sound transmission unit, and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4048492B2 (en) * 2003-07-03 2008-02-20 ソニー株式会社 Spoken dialogue apparatus and method, and robot apparatus
JP2006039120A (en) * 2004-07-26 2006-02-09 Sony Corp Interactive device and interactive method, program and recording medium
CN101017428A (en) * 2006-12-22 2007-08-15 广东电子工业研究院有限公司 Embedded voice interaction device and interaction method thereof
CN101075435B (en) * 2007-04-19 2011-05-18 深圳先进技术研究院 Intelligent chatting system and its realizing method

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103135751A (en) * 2011-11-30 2013-06-05 北京德信互动网络技术有限公司 Intelligent electronic device and voice control method based on voice control
CN102723080A (en) * 2012-06-25 2012-10-10 惠州市德赛西威汽车电子有限公司 Voice recognition test system and voice recognition test method
CN102855873A (en) * 2012-08-03 2013-01-02 海信集团有限公司 Electronic equipment and method used for controlling same
CN103632664A (en) * 2012-08-20 2014-03-12 联想(北京)有限公司 A method for speech recognition and an electronic device
CN110231927A (en) * 2013-10-15 2019-09-13 三星电子株式会社 Image processing apparatus and its control method
CN110231927B (en) * 2013-10-15 2022-10-04 三星电子株式会社 Image processing apparatus and control method thereof
CN107644641A (en) * 2017-07-28 2018-01-30 深圳前海微众银行股份有限公司 Session operational scenarios recognition methods, terminal and computer-readable recording medium
CN107644641B (en) * 2017-07-28 2021-04-13 深圳前海微众银行股份有限公司 Dialog scene recognition method, terminal and computer-readable storage medium
CN110021299A (en) * 2018-01-08 2019-07-16 佛山市顺德区美的电热电器制造有限公司 Voice interactive method, device, system and storage medium
CN108648749A (en) * 2018-05-08 2018-10-12 上海嘉奥信息科技发展有限公司 Medical speech recognition construction method and system based on voice activated control and VR
WO2020181407A1 (en) * 2019-03-08 2020-09-17 发条橘子云端行销股份有限公司 Voice recognition control method and device
CN110086945A (en) * 2019-04-24 2019-08-02 北京百度网讯科技有限公司 Communication means, server, smart machine, server, storage medium
CN110086945B (en) * 2019-04-24 2021-07-20 北京百度网讯科技有限公司 Communication method, server, intelligent device, server, and storage medium
CN113494798A (en) * 2020-04-02 2021-10-12 青岛海尔电冰箱有限公司 Refrigerator, method of controlling sound transmission unit, and storage medium
CN113494798B (en) * 2020-04-02 2023-11-03 青岛海尔电冰箱有限公司 Refrigerator, method of controlling sound transmission unit, and storage medium

Also Published As

Publication number Publication date
CN101551998B (en) 2011-07-27

Similar Documents

Publication Publication Date Title
CN101551998B (en) A group of voice interaction devices and method of voice interaction with human
US8909525B2 (en) Interactive voice recognition electronic device and method
CN102237087B (en) Voice control method and voice control device
CN102292766B (en) Method and apparatus for providing compound models for speech recognition adaptation
CN102842306A (en) Voice control method and device as well as voice response method and device
CN103078995A (en) Customizable individualized response method and system used in mobile terminal
CN107018228B (en) Voice control system, voice processing method and terminal equipment
CN104168353A (en) Bluetooth earphone and voice interaction control method thereof
US11244686B2 (en) Method and apparatus for processing speech
JPWO2008114708A1 (en) Speech recognition system, speech recognition method, and speech recognition processing program
CN105448289A (en) Speech synthesis method, speech synthesis device, speech deletion method, speech deletion device and speech deletion and synthesis method
CN104766608A (en) Voice control method and voice control device
CN113436609B (en) Voice conversion model, training method thereof, voice conversion method and system
CN108806688A (en) Sound control method, smart television, system and the storage medium of smart television
CN103514882A (en) Voice identification method and system
CN203386472U (en) Character voice changer
JP2011253389A (en) Terminal and reply information creation program for pseudo conversation
CN201532764U (en) Vehicle-mounted sound-control wireless broadband network audio player
CN105427856B (en) Appointment data processing method and system for intelligent robot
KR102536944B1 (en) Method and apparatus for speech signal processing
CN111739506A (en) Response method, terminal and storage medium
JP6448950B2 (en) Spoken dialogue apparatus and electronic device
CN105227765A (en) Interactive approach in communication process and system
CN105727572B (en) A kind of self-learning method and self study device based on speech recognition of toy
CN1416560A (en) Method for voice-controlled iniation of actions by means of limited circle of users, whereby said actions can be carried out in appliance

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110727

Termination date: 20140512