CN1193343C

CN1193343C - Voice commands depending on semantics of content information

Info

Publication number: CN1193343C
Application number: CNB018011926A
Authority: CN
Inventors: P. J. L. A. Swillens; J. Middeljans; O. Alberda; V. Steinbiss
Original assignee: Koninklijke Philips Electronics NV
Current assignee: Koninklijke Philips NV
Priority date: 2000-05-03
Filing date: 2001-04-26
Publication date: 2005-03-16
Anticipated expiration: 2021-04-26
Also published as: EP1281173A1; CN1381039A; WO2001084539A1; JP2003532164A; KR20020027382A

Abstract

Voice control of the play-out or other processing of video or audio content information uses voice commands that semantically relate to the content information.

Description

Method and apparatus for enabling an end user to control the processing of content information

The present invention relates to voice control, in particular to voice control of the play-out of content information by consumer electronics (CE) equipment.

Voice-controlled equipment is known from, e.g., U.S. patents 4,506,377; 4,558,459; 4,856,072; 5,255,326; and 5,950,166, all herein incorporated by reference. U.S. patent 5,255,326 in particular addresses an interactive audio control system that uses an audio signal processor coupled to a microprocessor. A pair of transceivers, operating both as stereo loudspeakers and as receiving microphones, is also coupled to the signal processor for receiving voice commands from a primary user. The voice commands are processed to operate a variety of devices, such as a television, tape deck, radio or CD player, that supply signals to the processor; the signals from the processor are then supplied to the loudspeakers of the transceivers to produce the desired sound. Additional infrared sensors may be used to continuously triangulate the position of the primary listener and to return signals via the transceiver system to the processor, so that the balance of the sound is continuously adjusted to keep the "sweet spot" of the sound focused on the primary listener. Additional devices may likewise be controlled by the signal processor: in response to voice commands that match stored commands, the signal processor produces an output that operates these other devices in accordance with the spoken voice commands. The system is capable of responding to voice commands while simultaneously reproducing stereophonic sound from any of the sound sources operated by the system.

Speech recognition is a technology of which various aspects are discussed in, e.g., U.S. patents 5,987,409; 5,946,655; 5,613,034; 5,228,110; and 5,995,930, all herein incorporated by reference.

Voice control or speech control in known devices or applications is restricted to a fixed set of commands tied to the apparatus. The inventors have realized that the user-friendliness of voice-controllable apparatus, and the ergonomics of operating it, improve if the voice commands, or some of them, are tied to the content information to be played out rather than to the device or platform. That is, the inventors believe that control of CE equipment should be content-centric rather than device-centric.

Accordingly, one aspect of the invention proposes to combine voice commands with content information in or on a data carrier such as a CD, DVD or solid-state memory. The commands preferably fit the semantics of the content information. For example, if the content information comprises audio, e.g., a collection of songs, a specific one or more of these songs can be selected by speaking the title of the song or part of its lyrics. Special metadata added to the content of the CD makes this feature possible. This metadata is typically, but not necessarily, a representation of the voice vocabulary that the device or controller needs in order to enable voice control of the particular CD and the tracks on it. Alternatively or in addition, the user can hum or (try to) sing part of the desired melody to select it for play-out. In this respect, see U.S. patent 5,963,957 (attorney docket PHA 23,241), issued October 5, 1999 to Mark Hoffberg for BIBLIOGRAPHIC MUSIC DATA BASE WITH NORMALIZED MUSICAL THEMES, herein incorporated by reference. This latter patent relates to an information processing system that comprises a music database. The music database stores homophonic reference sequences of music notes. The reference sequences are all normalized to the same scale degree so that they can be stored lexicographically. When an N-ary query finds a match between a string of input music notes and a particular reference sequence, the system provides the bibliographic information associated with the matching reference sequence. Such a system can also be used to convert a melody hummed by the user into a play command via the N-ary query.
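The melody look-up summarized above can be illustrated with a minimal sketch. It assumes, purely for illustration, that melodies arrive as MIDI note numbers, that normalization means transposing each sequence so that it starts at zero, and that a binary search (an N-ary search with N = 2) stands in for the N-ary query of the cited patent. The names `normalize` and `MelodyIndex` and the sample melodies are hypothetical.

```python
import bisect

def normalize(notes):
    """Transpose a note sequence so that it starts at 0 (a common reference pitch)."""
    if not notes:
        return ()
    base = notes[0]
    return tuple(n - base for n in notes)

class MelodyIndex:
    """Lexicographically sorted store of normalized reference sequences."""

    def __init__(self):
        self._keys = []   # sorted normalized sequences
        self._info = []   # bibliographic information, parallel to _keys

    def add(self, notes, info):
        key = normalize(notes)
        pos = bisect.bisect_left(self._keys, key)
        self._keys.insert(pos, key)
        self._info.insert(pos, info)

    def find(self, hummed):
        """Return the info of the first reference that starts with the hummed notes."""
        key = normalize(hummed)
        pos = bisect.bisect_left(self._keys, key)   # binary search in sorted keys
        if pos < len(self._keys) and self._keys[pos][:len(key)] == key:
            return self._info[pos]
        return None

index = MelodyIndex()
index.add([60, 62, 64, 65, 67], "Song A")   # hypothetical reference melodies
index.add([60, 60, 67, 67, 69], "Song B")
print(index.find([65, 65, 72, 72]))         # Song B, hummed in a higher key
```

Because all sequences are transposed before storage and before the query, a melody hummed in a different key still lands on the same entry. A real system would additionally have to tolerate wrong or missing notes, which this sketch does not attempt.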

Without further measures, the audio output of the system, e.g., while a song is being played, may cause undesired activation of the voice-control processing. This can be prevented, e.g., by pressing an activation button on a remote control, such as the Pronto (trademark) universal programmable remote control of Philips Electronics, to activate voice-command reception; by echo cancellation; or by having the equipment register a specific gesture made by the user; etc. If the content information comprises video, key scenes are labeled with keywords so that speaking those words sets the play-out to start from the associated scene. The keyword assignment of the video content can be used to identify a scene either via a one-to-one mapping of the user's voice input onto the keywords, or via a semantic mapping of the user's voice input onto an indexed directory of the keyword labels and their synonyms. Preferably, undesired activation is prevented by using a fixed command or a fixed part of a command, e.g., a prefix. Similarly, combining voice input with controllable features of graphical objects that are displayed or are to be displayed makes interactive graphics software applications, e.g., virtual reality or video games, voice-controllable. For example, the action to be performed by a graphical object such as an avatar becomes voice-controllable or voice-selectable by letting the user speak the proper words that meet the semantic conditions. This is suitable for video games that allow multiple control modes (e.g., two-handed input via a joystick together with voice input), and for educational programs that teach another language or that teach children the proper words and expressions for a concept such as a tangible object or an action. The speech is converted into data that is processed to identify the intended action. This can be done, e.g., by semantically matching the speech data against the entries of a predetermined look-up table and finding the closest candidate. The association between voice input and intended action can be made trainable by taking the user's history into account.
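The two ways of identifying a scene mentioned above, a one-to-one keyword mapping and a semantic mapping through a synonym directory, can be sketched as follows. The sketch assumes that a speech recognizer (not shown) has already turned the voice input into text; the keyword table, the synonym table, the time stamps and the function name `find_scene` are all hypothetical.

```python
# Hypothetical keyword annotations for one video: keyword -> start time in seconds.
SCENE_KEYWORDS = {"car chase": 1250, "wedding": 310, "finale": 5400}

# Hypothetical synonym directory mapping alternative wordings onto the keywords.
SYNONYMS = {"pursuit": "car chase", "marriage": "wedding", "ending": "finale"}

def find_scene(utterance):
    """Return the start time of the scene matching the spoken words, or None."""
    text = utterance.lower().strip()
    if text in SCENE_KEYWORDS:        # one-to-one mapping onto a keyword
        return SCENE_KEYWORDS[text]
    keyword = SYNONYMS.get(text)      # semantic mapping via the synonym directory
    if keyword is not None:
        return SCENE_KEYWORDS[keyword]
    return None

print(find_scene("pursuit"))          # start time of the "car chase" scene
```

Returning `None` for unknown input leaves it to the caller to decide whether to ignore the utterance or to give negative feedback.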

In another aspect of the invention, when content is downloaded from the Internet and/or broadcast and then stored locally, the voice commands are derived from the content itself. For example, keywords in the lyrics are identified and stored in association with the piece of audio in which they occur. This can be done by a dedicated software application. During the first play-out of the audio content, either the digital data is analyzed or the audible lyrics are analyzed, e.g., by separating the vocal part from the instrumental part and analyzing the former. The voice commands created in this way can be used in addition to, or as an alternative to, a basic set supplied with the specific content.
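Deriving commands from the content can be sketched as building a keyword index over the lyrics. The sketch assumes that the lyrics are already available as text; separating the vocal part and transcribing it is outside its scope. The stop-word list and the function name `derive_commands` are hypothetical, and the sample line is the lyric quoted elsewhere in this description.

```python
import re
from collections import defaultdict

# Hypothetical list of words too common to be useful as commands.
STOP_WORDS = {"i", "the", "in", "as", "was", "a", "my", "and", "of", "to"}

def derive_commands(lyrics_by_track):
    """Map each non-trivial lyric word to the set of tracks in which it occurs."""
    index = defaultdict(set)
    for track, lyrics in lyrics_by_track.items():
        for word in re.findall(r"[a-z']+", lyrics.lower()):
            if word not in STOP_WORDS:
                index[word].add(track)
    return dict(index)

lyrics = {
    "track 2": "I wept gently in the rain as the gearbox was still leaking oil",
}
commands = derive_commands(lyrics)
print(sorted(commands))   # candidate keywords for voice commands
```

Each keyword points back to the tracks that contain it, so a spoken keyword can be resolved to the piece of audio with which it was stored.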

In yet another aspect of the invention, the user can download from the Internet pre-existing or customized commands that match, and are semantically related to, specific content information stored on the user's equipment, so that this content becomes voice-controllable. The user can thus make his/her home library of electronic content information, regarded as a resource on the home network, fully voice-driven. For example, the user has a collection of CDs and DVDs in his/her jukebox and/or on a hard disk. If the content is publicly available audio or video, a service provider can create a library of annotations for each piece of content in advance, and the user can download those entries that are relevant to his/her collection. The annotations of a CD or DVD can be tied to the disc identifier and to the segments on the disc. For example, the title of an album as spoken by the user is linked to an identifier that in turn is used to retrieve and select the CD or DVD in the jukebox. The title of a song or scene can be linked to the identifier of the CD or DVD and to the relevant key structure. The user then says the items "movie" and "car chase" and is presented in turn with the available movies that contain a car-chase scene.

In yet another aspect of the invention, voice commands are linked to content as represented in an electronic program guide (EPG), e.g., for broadcasts supplied by a service provider. The voice interface can then select a specific program matching the words spoken by the user, or a program category matching those words.

In yet another aspect of the invention, the commands spoken by the user are processed by a server, e.g., a client server or an online server, and are sent back as instructions to the network-enabled playing device. The server has a catalog of the available content and a dictionary of the words that are semantically meaningful for that content. The network-enabled device identifies the content to the server, e.g., via the identifier code of the CD or DVD or via the header of a file, so that the voice commands for this content are readily matched with the instructions used for control, e.g., via a look-up table.

Voice control can, for example, select a piece of content information for play-out, for storage, or for fast-forwarding until a stop, etc. Likewise, content that has been marked in advance with keyword bookmarks can be browsed under voice control in order to retrieve an excerpt that matches the voice input at the keyword level.

Another aspect of the invention addresses copying content information from one medium, e.g., a CD or DVD, to another medium. The first medium comprises the content information and the control information that enables voice control as explained above. Preferably, the information for voice control is copy-protected, so that the copy lacks the control commands. This is regarded as a feature that supports the content industry. If the consumer wants a complete copy of the voice-controlled version, he or she can download the voice-control information, at a certain price, from a server on the Internet that is identified through a link to the CD number or DVD number. An advantage is that the author's rights are acknowledged even if the price is merely symbolic. This feature thus contributes to the awareness that content information is the protected intellectual property of the author or of his/her assignee.

Incorporated herein by reference is U.S. serial no. 09/345,339 (attorney docket PHA 23,700), filed July 1, 1999 for Mark Hoffberg and Eugene Shteyn for CONTENT-DRIVEN SPEECH- OR AUDIO-BROWSER. This patent document relates to searching the Internet for resources that provide, e.g., streamable audio such as live Internet radio. The resources are identified on the basis of their file extensions and are classified according to, e.g., natural language or music style. The user can browse the collection on the basis of text or melody input.

The expression "voice command" as used herein denotes a voice-control input that may consist of one or more keywords, but that may equally comprise a more verbose linguistic expression.

The invention is further described, by way of example and with reference to the accompanying drawings, wherein:

Figs. 1 and 2 are block diagrams of systems in the invention.

The invention considers voice control of devices or software applications, in particular of those that use content pre-recorded on a medium. The voice commands are semantically related to, associated with, or based on the content stored on the medium. The commands therefore differ for each instance of media content. For example, the commands available for a CD with music by composer or lyricist X differ from those available for a CD with music by composer or lyricist Y.

For a CD player, operation is as follows. The user inserts a CD by the artist Daan van Schooneveld into the player. The CD stores the music as well as software that lets the user interact with the CD via voice control. When the user says "Mustang Danny", the player starts playing the rock song of that title from among Schooneveld's tracks on the CD. When the user says "leaking oil", the player starts playing the blues song whose lyrics contain the line "I wept gently in the rain as the gearbox was still leaking oil". And so on. A similar control scheme applies to voice control of a set-top box or another device with a CD drive. A user-programmable delay between voice commands may be needed to separate the commands for the individual songs. Alternatively, specific expressions can serve as delimiters between the commands for the individual songs. For example, the user can say "Mustang Danny play twice, leaking oil play once". This is interpreted to mean that the song "Mustang Danny" is to be played twice in succession and that the song relating to "leaking oil" is then to be played once. The expressions "play twice" and "play once" act as delimiters that separate the individual songs and that tell the system how to operate before it is ready to receive another voice command.
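The use of fixed expressions as delimiters between song commands can be sketched as follows. The sketch assumes that the recognizer delivers the utterance as text, that each delimiter follows the song phrase it belongs to, and that only the two delimiters named in the text exist; the names `DELIMITERS`, `PATTERN` and `parse_commands` are hypothetical.

```python
import re

# Delimiter expressions and the repeat count each one stands for.
DELIMITERS = {"play once": 1, "play twice": 2}

# A song phrase (shortest possible) followed by one of the delimiter expressions.
PATTERN = re.compile(r"(.+?)\s*\b(play once|play twice)\b[\s,;.]*", re.IGNORECASE)

def parse_commands(utterance):
    """Split an utterance into (song phrase, repeat count) pairs."""
    return [
        (phrase.strip(" ,;."), DELIMITERS[delimiter.lower()])
        for phrase, delimiter in PATTERN.findall(utterance)
    ]

print(parse_commands("Mustang Danny play twice, leaking oil play once"))
# [('Mustang Danny', 2), ('leaking oil', 1)]
```

Each resulting song phrase would then be matched against the commands available for the inserted CD, so that the delimiter only decides where one command ends and how often the track is played.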

Voice control of a jukebox application on a PC is described as follows. A jukebox application is a software application that allows the content of CDs to be archived on the hard-disk drive (HDD) of the PC. The user has archived the CD "Greatest Hits" by Jos Swillens on the HDD. When the user says "Swil, Beemer", the jukebox starts playing the track "My Beemer fits my crewcut" from the Swillens CD archived on the PC. A voice command need not consist of keywords only and may comprise a more verbose linguistic expression. For example, the user can say "start playing the song with 'crewcut' in its title from the Greatest Hits of Swillens"; the system processes this voice input and matches it with one of the available options, e.g., by applying a suitable search algorithm to a look-up table. When the user says "Swil, always be nice to your patent attorney", the jukebox starts playing the symphonic masterpiece "Always be nice etc.".

The user has also archived the CD "Greatest Hits" by Koos Middeljans on the PC. When the user says "Koos, Sweet Dommel Valley", the jukebox starts playing the folk song of that title from the archived CD. When the user says "Koos, Nat the Lab", the jukebox starts playing "Nat the Lab", another track of Middeljans' "Greatest Hits" CD archived on the PC. When the user says "Middeljans, Greatest Hits, random", the jukebox plays the tracks of this CD in random order.
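Matching a keyword command or a more verbose utterance against the archived tracks can be sketched with a simple word-overlap score. This is only one possible "suitable search algorithm", chosen here for illustration; the text does not prescribe one. The catalog repeats the artists and titles from the jukebox examples, while the stop-word list, the prefix rule for abbreviations such as "Swil", and the names `words` and `best_match` are hypothetical.

```python
import re

CATALOG = [  # (artist, title) pairs taken from the jukebox examples
    ("Jos Swillens", "My Beemer fits my crewcut"),
    ("Jos Swillens", "Always be nice etc."),
    ("Koos Middeljans", "Sweet Dommel Valley"),
    ("Koos Middeljans", "Nat the Lab"),
]
STOP_WORDS = {"the", "of", "in", "from", "with", "my", "to", "your"}

def words(text):
    """Lower-case word set of a text, without stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}

def best_match(utterance):
    """Return the catalog entry sharing the most words with the utterance."""
    spoken = words(utterance)

    def score(entry):
        entry_words = words(" ".join(entry))
        # A spoken word counts if a catalog word starts with it ("Swil" -> "Swillens").
        return sum(any(e.startswith(s) for e in entry_words) for s in spoken)

    best = max(CATALOG, key=score)
    return best if score(best) > 0 else None

print(best_match("Swil, Beemer"))
print(best_match("start playing the song with crewcut in its title "
                 "from the Greatest Hits of Swillens"))
```

Both calls select the track "My Beemer fits my crewcut". A production system would more likely rely on the recognizer's own grammar or on a proper ranking function than on raw word counts.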

Content protection in relation to copyright is a sensitive issue. Copy-protection measures, e.g., DRM (Digital Rights Management), are feasible and effective. As a contribution to this, the voice commands that are supplied together with, and are semantically related to, the content information on a CD or DVD are implemented in such a way that they cannot be copied to any location other than the on-board memory of the player. Any copy made to another location loses this feature and thereby becomes less attractive.

In another example, the user downloads content via the Internet together with semantically related control data, which enables voice-controlled selection for play-out in a manner similar to that discussed for the jukebox. In this example the control data is preferably an integral part of the downloaded data.

For background on jukebox technology, see U.S. serial no. 09/326,506 (attorney docket PHA 23,417), filed June 4, 1999 for Pieter van der Meulen for VIRTUAL JUKEBOX, herein incorporated by reference.

To account for, e.g., differences in language and pronunciation between geographic regions, the same content information can be bundled with different sets of voice commands in order to facilitate speech recognition. In this respect, the user preferably has a choice of the language that he or she wants to use for voice control of the system. The storage capacity of the medium may be too small to hold the commands in all languages that might be used. If the voice commands in one of the most likely languages cannot be obtained from the medium, the playing device is preferably capable of downloading the equivalent voice commands in the desired language, so that the system translates the commands into the corresponding interpretation at run time. Dedicated services for this may be available on the Internet. In this respect, reference is made to U.S. serial no. 09/160,490 (attorney docket PHA 23,500), filed September 25, 1998 for Adrian Turner et al. for CUSTOMIZED UPGRADING OF INTERNET-ENABLED DEVICES BASED ON USER-PROFILE (SmartConnect trademark), and to U.S. serial no. 09/519,546 (attorney docket PHA US000014), filed March 6, 2000 for Erik Ekkel et al. for PERSONALIZING CE EQUIPMENT CONFIGURATION AT SERVER VIA WEB-ENABLED DEVICE, both herein incorporated by reference. These documents discuss services offered to CE end users via the Internet.

It is expected that audio and video content will increasingly be supplied to end users via the Internet. Recording can then be done in the secure environment of the home. Local recording preferably allows the consumer to create his or her own set of commands that are semantically related to specific segments of the content information. This requires some editing and, preferably, a dedicated graphical user interface (GUI) that helps the user establish the relationships between content segments, voice-input commands, and the actions or desired processing. For example, if the content information has no annotations, the user has to determine which segments he or she wants to control as individual items, which commands he or she wants to use to control them, and which action is to be taken on which segment in response to which command. Once created, the command set can be stored in the same file as the specific content, or can be linked to the specific content via a unique identifier.
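A user-created command set that relates content segments, voice-input commands and actions, and that is linked to the content via a unique identifier, can be sketched as a small data structure. The JSON layout, the field names, the identifier `home-recording-0001`, the segment boundaries and the function names are all hypothetical; the text does not specify a storage format.

```python
import json

def make_command_set(content_id, entries):
    """Bundle user-defined commands with the identifier of the content they control."""
    return {
        "content_id": content_id,
        "commands": [
            {"phrase": phrase.lower(), "segment": segment, "action": action}
            for phrase, segment, action in entries
        ],
    }

def lookup(command_set, phrase):
    """Return (segment, action) for a spoken phrase, or None if it is unknown."""
    for entry in command_set["commands"]:
        if entry["phrase"] == phrase.lower():
            return entry["segment"], entry["action"]
    return None

command_set = make_command_set(
    "home-recording-0001",   # hypothetical unique identifier of the recording
    [("goal highlights", {"start": 754, "end": 812}, "play"),
     ("interview", {"start": 3010, "end": 3300}, "skip")],
)
stored = json.dumps(command_set)   # could be kept inside or next to the content file
restored = json.loads(stored)
print(lookup(restored, "Goal Highlights"))
```

Keeping the identifier inside the command set corresponds to the second storage option in the text, in which the commands live apart from the content and are linked to it.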

In a more sophisticated system, phonetic transcriptions cover whatever is not covered by the phonetic inventory, e.g., restricted to a subset of the vocabulary, or only to pronunciation variants that deviate from the standard pronunciation. Mutatis mutandis, the same applies to optional acoustic models (acoustic references). A language model can optionally be used. It describes how people typically interact with the system and which utterances they use, and is given by example sentences, patterns or phrases, by a (stochastic) finite-state grammar, by a (stochastic) context-free grammar, or by another kind of grammar. The language model may comprise only modifications to some standard manner of speaking. As to speech understanding, the system optionally comprises a description, typically supplied through a grammar, of which action is to be started by which word, command, phrase or expression. The system may comprise a dialogue model, which describes how the system should react to user input and how the system conducts a dialogue. For example, under specific circumstances the system may ask for clarification, or ask for a command to be confirmed, etc. The system can make use of relationships between the data that configures the speech recognizer and other data. For example, the system has a display that shows the user what he or she can say in order to play the current track.

Preferably, the medium, e.g., a CD, DVD or solid-state memory (e.g., flash memory), carries a bit pattern that is recognized during start-up and that confirms the availability of the voice-command feature. The confirmation can be conveyed to the user via, e.g., a pop-up screen on a display or a pre-recorded spoken text delivered through a loudspeaker.

As to the format of the voice-control software on the medium, CD-DA has spare capacity in the R-W channels that can be used to add the voice-control feature without losing backward compatibility of the CD. The lead-in track may not have enough storage space for versions in multiple languages, but the data can be downloaded from the disc into local memory. In that case, each language needs to be present on the disc only once. CD-ROM, on the other hand, has a file structure that easily accommodates the required voice-control files on the disc. DVD likewise has a file structure and allows the same approach as CD-ROM. Flash memory, HDD, etc., can be handled in the same manner.

Fig. 1 is a block diagram of a system 100 in the invention. System 100 comprises a playing device 102 for play-out of content information 104 stored on a carrier 106. Carrier 106 comprises, e.g., a CD, a DVD or a solid-state memory. Alternatively, carrier 106 comprises an HDD onto which content information 104 has been downloaded via the Internet or another data network. In these examples content information 104 is stored in a digital format. As is clear to a person skilled in the art, content information 104 can equally be stored in an analog format. Device 102 has a rendering subsystem 108 that makes content information 104 available to the end user. For example, if content information 104 comprises audio, subsystem 108 comprises one or more loudspeakers; if content information 104 comprises video information, subsystem 108 comprises a display monitor.

According to the invention, carrier 106 comprises control information 110 that is semantically related to content information 104. Control information 110 enables a data processing subsystem 112 to determine whether the user's voice input 114, received via a microphone (not shown), matches an item in the control information. If there is a match, the associated play-out mode is selected, examples of which are given above. As explained in the audio play-out examples above, the semantic relationship between control information 110 on the one hand and content information 104 on the other facilitates the user's interaction with device 102, owing to the highly intuitive correspondence. Preferably, visual feedback about the available content and/or the selected mode is provided via a display, e.g., a small LCD 116.

Carrier 106 can be an element of which one at a time is inserted into device 102. Alternatively, device 102 comprises jukebox functionality 118 for selecting content from among multiple carriers (not shown) such as carrier 106, or even from among physically different carriers, e.g., CDs and solid-state memories.

Control information 110 is shown here as being stored or recorded on carrier 106 together with content information 104. A CD, DVD or flash memory can thus be supplied with a pre-recorded voice-control application and commands. Alternatively, control information 110 cooperates with a dedicated software application that runs on data processing subsystem 112 and that matches voice input 114 against one or more items of control information 110. In this latter configuration, the software application is supplied via a channel other than that of the control information, e.g., via the Internet or via a set-up diskette for installation on device 102.

Voice control per se is known, as is the selection by a user of the operating mode of a device. The invention relates here to a control interface of which a part is semantically related to the content information available for play-out.

Options that are preferably incorporated in the system of the invention include the following. System 100 provides auditory or visual feedback in response to the user's spoken command. For example, if there is a match, system 100 confirms receipt of the command by repeating the command word or words in a pre-recorded voice, or by uttering the word "confirmed" in a pre-recorded voice. This feature can be implemented with a relatively small number of predetermined commands per content item. The confirmation data can be incorporated in control data 110. If the voice command given by the user is not understood, i.e., system 100 does not recognize it and finds no match in control data 110, system 100 provides audio feedback that indicates this negative state. For example, system 100 utters in a pre-recorded voice "cannot process this command", "cannot find this artist", "cannot find this song", or the equivalent. System 100 can provide visual feedback instead of, or in addition to, the audio feedback, e.g., a blinking green light if system 100 can process the voice input and a red light if it cannot. In the same vein, system 100 preferably announces, in a pre-recorded or synthesized voice, the name of the artist and the title of the song or of the album of the content selected for play-out. Synthesized speech uses a text-to-speech engine; this enables the system to use information obtained from a download or from the medium. A text-to-speech (TTS) system converts words from computer data (e.g., word-processor data, web pages) into audible speech via a loudspeaker. In a TTS system, the vocabulary is preferably stored together with a phonetic inventory that includes, among other things, the intonation of carrier sentences. Also as an option, control data 110 comprises pre-recorded or synthesized speech data that explains to the user which commands are available, e.g., which keywords apply to which song. The pre-recorded or synthesized voice can again be part of control data 110. The user should be able to switch the audio feedback on or off, for when he or she does not want the system to provide it.
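The feedback options described above can be sketched as a single function that reports a match, a light colour and an optional spoken message. The sketch assumes that the voice input is available as text and that the control data is a plain phrase-to-track table; the table contents, the exact wording of the messages and the function name `respond` are hypothetical, and playing the message through a pre-recorded voice or a TTS engine is not shown.

```python
def respond(utterance, control_data, audio_feedback=True):
    """Return (matched track or None, light colour, spoken message or None)."""
    track = control_data.get(utterance.lower().strip())
    if track is not None:
        light, message = "green", f"confirmed, playing {track}"
    else:
        light, message = "red", "cannot process this command"
    # The user can switch the audio feedback off; the light is always reported.
    return track, light, (message if audio_feedback else None)

control_data = {"mustang danny": "track 1", "leaking oil": "track 2"}
print(respond("Mustang Danny", control_data))
print(respond("unknown song", control_data, audio_feedback=False))
```

Separating the decision from the rendering keeps the same logic usable whether the device has a loudspeaker, a light, a display, or all three.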

Fig. 2 is a diagram of a system 200 with an EPG, wherein the available content information is identified and arranged in rows 202 and columns 204 on a display monitor 206. For example, each row represents a respective TV channel and each column represents a specific time slot. At each specific row-column pair, e.g., the intersection of row 208 and column 210, a label or title 212 is shown that represents the content available from the specific channel in the specific time slot. Other types of arrangement can be used instead, e.g., by topic category and time, or by user priority according to a profile, per channel or per resource (e.g., on the Internet), etc. The user can browse the EPG, e.g., by moving a window 214 across the grid of the EPG via a suitable user interface (e.g., the arrow keys on a wireless keyboard or another pointing device, not shown), so as to display the part of the EPG that falls within the borders of window 214. The user can then select particular content information by clicking or highlighting the relevant label in the displayed part.

Typically, the EPG is supplied by a service provider via the Internet. In the invention, the EPG is enhanced with control software 216 that enables a mode of user interaction with the EPG other than the conventional mode of clicking or highlighting the desired label. Control software 216 is preferably downloaded, updated or refreshed together with the EPG. Control software 216 comprises control information 218 that is semantically related to the labels that identify the programs in the EPG for user selection. For example, when the user enters the expression "movies" into the data processing subsystem via a user input device 220, e.g., as voice input via a microphone, the grid of the EPG is reorganized so that window 214 shows only the programs available in the category "movies", or so that movie programs are graphically represented differently from programs in other categories. The user then preferably browses the category "movies", again under voice command. The user sees a movie that he or she likes and enters, by voice input, the expression "The Magnificent Six and Okke", whose subject is indicated in the EPG as a classic movie about an aviation event. In another example, the user enters "tonight" and "from eight", whereupon window 214 is positioned so as to show, at least in part, the set of programs available on the current day from eight o'clock (8:00 pm) onward. In yet another example, the user has spotted an interesting program in the part of the EPG shown in window 214 and speaks the words that represent the subject of the program into microphone 220. The user then says "watch" or "record". The words representing the subject are converted into a suitable format for comparison with control information 218. When a match is found, control software 216 enables microprocessor 222 to control tuner 224 and display monitor 206, or recording device 226. In this way, the user can interact with the EPG using voice control.
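The EPG interactions described above (filtering by category, showing programs from a given hour, and selecting a program by its spoken label) can be sketched as follows. The sketch assumes that the voice input has already been converted to text and, where needed, to an hour of the day; the `Program` record, the sample guide entries other than the movie title quoted in the text, and the function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Program:
    title: str
    channel: str
    category: str
    start_hour: int   # 24-hour clock, current day

EPG = [  # hypothetical guide entries
    Program("The Magnificent Six and Okke", "Channel 1", "movies", 20),
    Program("Evening News", "Channel 2", "news", 20),
    Program("Morning Show", "Channel 1", "talk", 8),
    Program("Late Movie", "Channel 3", "movies", 23),
]

def by_category(epg, category):
    """Programs in the spoken category, e.g. "movies"."""
    return [p for p in epg if p.category == category.lower()]

def from_hour(epg, hour):
    """Programs starting at or after the given hour, e.g. 20 for "from eight" tonight."""
    return [p for p in epg if p.start_hour >= hour]

def by_title(epg, spoken_title):
    """The program whose label matches the spoken words, or None."""
    wanted = spoken_title.lower()
    return next((p for p in epg if p.title.lower() == wanted), None)

print([p.title for p in by_category(EPG, "movies")])
print(by_title(EPG, "the magnificent six and okke").channel)
```

A follow-up command such as "watch" or "record" would then use the channel and start time of the selected entry to drive the tuner or the recording device.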

Claims

1. A method of enabling an end user to control processing of content information, the method comprising adding special metadata to the content information to be processed, thereby combining voice commands with said content information, wherein the voice commands are semantically related to the content information.

2. The method of claim 1, comprising providing voice-control software together with the content information.

3. The method of claim 1, wherein a command identifies the content information for processing.

4. The method of claim 1, wherein the content information comprises audio, and a command comprises a word that occurs in the audio.

5. The method of claim 1, wherein the content information comprises video information, and a command identifies an event or object in the video.

6. The method of claim 1, wherein the content information is stored on a medium, and the commands for controlling the processing are stored on the medium.

7. The method of claim 1, comprising providing the end user with feedback about the processing status of a voice command.

8. An electronic device for processing content information, the device comprising:

a voice input for receiving voice commands;

an input for receiving a medium that comprises the content information and control software semantically related to the content information; and

a data processor for processing the content information via the software under control of the voice commands.

9. The device of claim 8, wherein the data processor processes the content information in response to a voice command that is semantically related to the content information.

10. The device of claim 8, wherein the medium comprises at least one of: a CD; a disk; a solid-state memory.

11. The device of claim 8, comprising an output for indicating to the end user the processing status of a voice command.

12. A method of providing control data semantically related to specific content information, wherein voice commands are combined with said content information by adding special metadata to the content information to be processed, so that an end user is enabled to control processing of the specific content information through voice control supported by the control data, and wherein said voice commands are semantically related to the content information.

13. The method of claim 12, comprising enabling the user to download the control data via a data network.

14. The method of claim 12, wherein the downloaded control data is for use with a copy of the specific content information.

15. The method of claim 12, comprising enabling the user to download the content information via a data network.

16. The method of claim 12, wherein the content information comprises an EPG, and wherein the processing comprises interaction with the EPG.