CN105009205A - Method and system for voice recognition input on network-enabled devices - Google Patents

Method and system for voice recognition input on network-enabled devices

Info

Publication number
CN105009205A
Authority
CN
China
Prior art keywords
data
equipment
network
language
received
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201480012543.3A
Other languages
Chinese (zh)
Other versions
CN105009205B (en)
Inventor
T. Xiong
C. McCoy
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sony Corp
Sony Interactive Entertainment LLC
Original Assignee
Sony Corp
Sony Network Entertainment International LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 13/790,426 (US9495961B2)
Application filed by Sony Corp and Sony Network Entertainment International LLC
Publication of CN105009205A
Application granted
Publication of CN105009205B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/005: Language recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/221: Announcement of recognition results
    • G10L 2015/223: Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

Apparatus and methods to implement a technique for using voice input to control a network-enabled device. In one implementation, this feature allows the user to conveniently register and manage an IPTV device using voice input rather than employing a bulky remote control or a separate registration website.

Description

Method and system for voice recognition input on network-enabled devices
Cross-reference to related applications
This application claims priority to U.S. Patent Application Serial No. 13/790,426, entitled "METHOD AND SYSTEM FOR VOICE RECOGNITION INPUT ON NETWORK-ENABLED DEVICES", filed March 8, 2013, the entire contents of which are incorporated herein by reference. This application also incorporates by reference in their entireties U.S. Patent Application Serial No. 12/982,463, entitled "DEVICE REGISTRATION PROCESS FROM SECOND DISPLAY", filed December 30, 2010, which claims priority to U.S. Provisional Patent Application No. 61/412,312, entitled "DEVICE REGISTRATION PROCESS FROM 2ND DISPLAY", filed November 10, 2010; and U.S. Patent Application Serial No. 12/844,205, entitled "CONTROL OF IPTV USING SECOND DEVICE", filed July 27, 2010. All of these applications are owned by the assignee of the present application.
Background
Delivery of digital content over the Internet to IPTVs continues to grow, as does the popularity of IPTVs themselves. As with many digital devices, particularly networked ones, registering an IPTV can bring many benefits to the user. A key benefit of IPTV registration is the association with a user account through which various services can be accessed. Registering an IPTV device, however, is inconvenient. The user must either leave the living room to use a PC, which is inconvenient, or perform the registration directly on the IPTV, which generally has a poor input interface. For example, in some systems a remote control is used to enter a registration code into a web browser on the device. Although the user need not leave the device, most remote controls are not designed for entering large amounts of data.
One attempt to remedy this situation has been to allow the user to assist the registration by entering data on a more user-friendly device, such as a second display, a cell phone, or a tablet computer. Although convenient in many cases, less technically savvy users may still have difficulty performing the functions needed to register the IPTV (e.g., downloading an application, configuring the second display they are using, and so on).
Research shows that a very high percentage of users of IPTVs and other network-enabled devices (e.g., Blu-ray playback devices) do not register their devices. Not only do these users miss the benefits of registration, but the network provider also misses receiving business information about such users, which could be used to improve services and to advertise to consumers. Accordingly, there is a need to make the registration process for devices such as IPTVs more convenient, thereby enabling users to obtain the benefits of registration more easily. There is also a need to improve the overall user experience of entering data into such devices.
Summary of the invention
In implementations of the systems and methods, a user can easily register and manage a content playback device, such as an IPTV, using voice commands rather than using a cumbersome device or having to navigate to a separate registration website. Registration can thus be accomplished more easily. After registration, additional follow-on features can be provided, such as direct selection of the device for browsing, or inheritance of registration information or configuration from other devices associated with the user account.
Implementations of the systems and methods can use web technologies that are safe for the device and its browser, such as web forms and input and server-side scripting languages, to accept user input. A speech engine can convert voice input into text or numeric data at various locations to register an IPTV, or indeed virtually any network-enabled device. The speech engine can receive voice input in various ways, e.g., from a USB or other designated hardware port, or from a microphone coupled to or embedded in a remote control, the IPTV or other device, a second display, and so on. The recognized text can be displayed to the user on the content playback device to ensure accurate transcription. The recognized text, such as registration information, can then be submitted to the network provider automatically or manually.
In one example of operation, when the content playback device is turned on and able to communicate with the network (e.g., wired or wirelessly), the user is prompted to enter a network password if necessary and is then automatically directed to a registration portal. If the user does not have a user account at the registration portal, the user can be prompted to create one. After the user signs in to the registration portal, the user is prompted to add a registration code or other identifying code, such as the MAC address of the content playback device. The user then speaks the code to an audio input device, either character by character or all at once. Interaction with the registration (or other management) portal can be entirely by voice, or by a combination of voice and manual input using a remote control. Once registration succeeds, the device is ready for browsing and content selection. The user can also populate the registration information of a new content playback device from a previous configuration, e.g., inheriting information from the previous configuration so that only the registration code of the new device needs to be added.
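For illustration only, a minimal sketch of the spoken-registration-code flow described above. The `recognize()` speech engine and the registration-portal endpoint are placeholders, not part of the disclosure:

```python
import json
import urllib.request


def recognize(audio_bytes: bytes) -> str:
    """Placeholder speech engine: would return the spoken characters as text."""
    raise NotImplementedError  # supplied by a local or server-side speech engine


def register_by_voice(audio_bytes: bytes, portal_url: str, account_token: str) -> bool:
    """Convert a spoken registration code to text, show it for confirmation,
    then submit it to the registration portal."""
    code = recognize(audio_bytes).upper().replace(" ", "")
    print(f"Heard registration code: {code}")  # shown on the device so the user can confirm
    body = json.dumps({"token": account_token, "registration_code": code}).encode()
    req = urllib.request.Request(portal_url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("registered", False)  # portal signals success
```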
Network-enabled content playback devices can take many forms, and multiple content playback devices can be coupled to, and selected within, a given local area network. Example content playback devices include IPTVs, digital TVs, digital audio systems, players, or more traditional audio and video systems suitably configured for network connection. In a video system, the content playback device includes a processor that controls a video display to render content thereon.
In one aspect, the invention is directed to a method of entering data into a network-enabled device, including: configuring the network-enabled device to be in a state to receive audio data, the data being associated with a service of the network-enabled device, with a server associated with the network-enabled device, or with an operation in a user interface of the network-enabled device; receiving audio data; converting the received audio data to text data; and causing the network-enabled device to perform an action based on the text data, the text data representing a function on the service or on the server, or representing an operation in the user interface of the network-enabled device.
Implementations of the invention may include one or more of the following. The received audio data may be registration data, and the method may further include associating the text data with a user account, such that the network-enabled device is registered to the user account. The method may further include creating the user account based on the registration data. The received audio data may be a username or a password or both, and the function on the service may be logging in to the user account on the service. The received audio data may be a navigation command, and performing the operation may include executing the navigation command in the user interface. The method may further include transmitting a signal that causes the network-enabled device to display the text data. After audio data is received and converted to text data corresponding to characters, a text version of the characters may be displayed on the network-enabled device. The method may further include prompting the user to confirm the text data. The method may further include storing the received audio data and, if the user modifies the text data following a displayed prompt, associating the modified text data with the received audio data. The method may further include detecting a language type from the received audio data and, if the detected language type does not correspond to a supported language of the network-enabled device: performing the conversion step such that the text data is in a form corresponding to the detected language type; creating an image file of the text data; and transmitting the image file to the network-enabled device for display. The method may instead include detecting a language type from the received audio data and, if the detected language type does not correspond to a supported language of the network-enabled device: performing the conversion step such that the text data is in a form corresponding to the detected language type; and transmitting the text data to the network-enabled device for display. The method may further include detecting a language type from the received audio data and, if the detected language type does not correspond to a supported language of the network-enabled device, downloading a language module corresponding to the detected language type to the network-enabled device. The method may further include prompting the user to enter a language type and, once the language type is entered, downloading a language module corresponding to the entered language type to the network-enabled device.
In another aspect, the invention is directed to a non-transitory computer-readable medium including instructions for causing a computing device to implement the above method.
In another aspect, the invention is directed to a method of entering data into a network-enabled device, including: configuring the network-enabled device to be in a state to receive audio data; receiving audio data; converting the received audio data to text data; and causing the network-enabled device to perform an action based on a request using the text data.
Implementations of the invention may include one or more of the following. Requesting data entry may include displaying a form and prompting for data entry, and the method may further include populating the form with the text data and displaying the populated form. The form may prompt for entry of a registration code, and the method may further include transmitting the text data to a server to perform registration and, upon receiving a signal from the server indicating successful registration, displaying an indication of successful registration. Requesting data entry may include accepting entry of a navigation command. Receiving audio data may include receiving audio data using an input port on the network-enabled device. Converting the received audio data to text data may be performed on the network-enabled device. The method may further include, prior to the converting, determining that the received audio data is in a language that is not supported, and downloading a language module corresponding to the language of the received audio data. The input port may be configured to accept audio data from a mobile phone, a tablet computer, a laptop computer, a microphone, or an audio stream, or may be a USB port. A dongle may be coupled to the USB port, and receiving audio data may be performed by a microphone coupled to the dongle. Converting the received audio data to text data may be performed on the dongle. Receiving audio data may include receiving audio data from a remote control. Converting the received audio data to text data may be performed on the remote control or on the network-enabled device. Receiving audio data may include receiving audio data from a second display, e.g., where the second display is a smartphone, a tablet computer, or a laptop computer. Converting the received audio data to text data may be performed on the second display or on the network-enabled device. Receiving audio data may include receiving audio data using a radio-frequency audio input device paired with the network-enabled device, e.g., where the radio-frequency audio input device is a smartphone. Converting the received audio data to text data may be performed on the radio-frequency audio input device.
In another aspect, the invention is directed to a non-transitory computer-readable medium including instructions for causing a computing device to implement the above method.
In another aspect, the invention is directed to a method of entering data into a network-enabled device, including: configuring the network-enabled device to be in a state to receive audio data; receiving audio data; receiving an indication of a language type; determining that the language type is not supported; transmitting the received audio data to a first server; receiving converted data from the first server, the converted data computed from the received audio data; and displaying an indication of the received converted data.
Implementations of the invention may include one or more of the following. The received audio data may correspond to a navigation command, and displaying an indication of the received converted data may include executing the navigation command. The received audio data may correspond to data to be entered into a form, and displaying an indication of the received converted data may include entering the data into the form. Receiving an indication of a language type may include: receiving a selection of a language type; determining the language type from a settings file; detecting the language type based on the received audio data; or transmitting the audio data to a second server and receiving an indication of the language type from the second server. The received converted data may be text data, or may be an image file indicating the text data.
In another aspect, the invention is directed to a method of entering data into a network-enabled device, including: configuring the network-enabled device to be in a state to receive audio data; receiving audio data; receiving an indication of a language type; determining that the language type is not supported; transmitting a request for a language module corresponding to the language type to a server; receiving the requested language module from the server; converting the audio data to text data using the received language module; and displaying an indication of the text data.
Implementations of the invention may include one or more of the following. The language module may be stored on the network-enabled device, on a dongle connected to the network-enabled device, or on an external device in communication with the network-enabled device. Receiving an indication of a language type may include: receiving a selection of a language type; determining the language type from a settings file; detecting the language type based on the received audio data; or transmitting the audio data to a second server and receiving an indication of the language type from the second server.
In another aspect, the invention is directed to a dongle device adapted to be placed in signal communication with a network-enabled device, including: means for receiving an audio file; means for converting the audio file to text; and output means for transmitting the text to the network-enabled device.
Implementations of the invention may include one or more of the following. The receiving means may be selected from the group consisting of an RF signal receiver, a microphone, and a hardware port. The output means may be selected from the group consisting of a USB port, an RF signal transmitter, and a hardware port. The device may further include memory for storing a user profile, the user profile indicating acoustic characteristics of the user's voice.
Advantages of certain embodiments of the invention may include one or more of the following. With the systems and methods, content playback devices and other network-enabled devices can be conveniently registered to a user account and managed. Registration can use a speech recognition system at little or no additional hardware cost. The speech recognition system can be used to enter many types of data into the network-enabled device. The speech recognition system can allow speakers of languages that are not supported to control their devices in their native language. The operating software of the network-enabled device can be simplified by eliminating the need to ship multiple language packs with each device: only one is needed, and in some cases none need be provided. Other advantages will become apparent from the description that follows, including the figures and the claims.
Brief description of the drawings
Fig. 1 is a block diagram of an exemplary system according to present principles.
Fig. 2 is a diagram illustrating various types of audio data and their uses.
Fig. 3 is a flowchart of an exemplary method according to present principles.
Fig. 4 is a flowchart of another exemplary method according to present principles, in which voice input in an unsupported language is handled.
Fig. 5 is a flowchart of another exemplary method according to present principles, in which a language module is downloaded.
Fig. 6 is a diagram illustrating ways and devices for converting audio data to text data.
Fig. 7 is a flowchart of another exemplary method according to present principles, including two ways of handling the case of an unsupported language.
Fig. 8 is a diagram illustrating ways of receiving an indication of a language type.
Fig. 9 is a diagram of an exemplary implementation of a dongle according to present principles.
Figs. 10-13 are sequential flowcharts of a specific but exemplary method according to present principles.
Fig. 14 illustrates an exemplary computing environment, such as that of the disclosed second display, server, smartphone, mobile device, tablet computer, dongle, and so on.
Like reference numerals refer to like elements throughout.
Detailed description
Fig. 1 is a block diagram of an exemplary system 10 according to present principles. In the system 10, a network-enabled device 12 is connected to a server 18 through the Internet 16. The network-enabled device 12 is typically an IPTV, a media player, or the like, and has a user interface 14 in which data can be entered and displayed, e.g., in a form 46. The user interface 14 may support navigation commands, illustrated schematically by arrows 48, that allow the user to move among different forms or to select different entries.
The system 10 can allow a degree of interaction with the user through the user's voice. To do so, the network-enabled device 12 can include a speech engine 34, implemented, e.g., in hardware or software, to which audio data is provided through a hardware port 32 or an RF port 44, e.g., over a wireless protocol or by other means. One such means is a video camera with a microphone 52, embedded in or coupled to the network-enabled device 12.
The network-enabled device can include a user memory 42 that stores the user's common commands as audio files or equivalents, so that over time the network-enabled device can "learn" how the user gives commands. For example, if the user has a pronounced accent or dialect, the user memory can store a record of commands and the actions subsequently performed, and the actions can be learned to be associated with the accented or dialect voice commands. In some cases, if the system cannot understand the user, it can prompt the user to read a short passage so that some degree of learning can occur. The user memory 42 can also store not only a record of voice commands, but also a record of what the user determined to be the correct conversion of the audio data after editing the text.
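A rough sketch of the "learning" behavior of user memory 42, assuming a simple audio fingerprint; the hashing scheme here is purely illustrative and not part of the disclosure:

```python
import hashlib


class UserMemory:
    """Stores corrected transcriptions keyed by a fingerprint of the spoken audio,
    so repeated utterances from the same user resolve to the confirmed text."""

    def __init__(self) -> None:
        self._corrections: dict[str, str] = {}

    @staticmethod
    def _fingerprint(audio_bytes: bytes) -> str:
        # Illustrative only: a real system would use acoustic features,
        # not a byte hash, so that similar utterances match.
        return hashlib.sha1(audio_bytes).hexdigest()

    def remember_correction(self, audio_bytes: bytes, corrected_text: str) -> None:
        self._corrections[self._fingerprint(audio_bytes)] = corrected_text

    def lookup(self, audio_bytes: bytes) -> str | None:
        return self._corrections.get(self._fingerprint(audio_bytes))
```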
Other means can also be used to give the network-enabled device access to audio data. For example, a dongle 36 can be coupled through a connection 38 to a port on the device 12. The dongle 36 can include a microphone, a user memory for storing data such as information about the user's dialect, accent, or speech patterns, and even a speech engine; for clarity these are not shown in the figure. The dongle 36 can attach to, e.g., a USB or other port on the TV, or can connect wirelessly. In such a system, the dongle can be carried between devices, giving the system 10 a degree of flexibility.
An external device 24 can also be used to provide voice input. The external device 24 can include a speech engine 54 that performs the functions of the speech engine 34, or the two can work together to convert audio data to text data. A user memory 56 can be provided for the same or similar reasons as the user memory 42. Typical external devices 24 include those shown as devices 28, e.g., smartphones, tablet computers, laptops, and the like. Such devices can communicate with the network-enabled device by RF, infrared, wired link, etc. Other external devices 24 can include a second display, which interacts with the network-enabled device through a proxy server as described in the applications cited herein.
It will be appreciated that when the speech engine 34 or 54 is used, the conversion of audio data to text data is performed on the client side. In some cases the audio data can be transmitted to the server 18 for conversion, in which case a speech engine 19 can perform the conversion. Server-side processing offers certain benefits, including scalability of computing power. It should also be noted that, although connecting to the server requires steps such as setting up the necessary connectivity, when the purpose of the voice input is, e.g., registration of the device 12, the inconvenience to the user is minimal, because such a process is generally performed only once and need not be repeated. The server 18 can also include an image generator 21, which can be used to create an image of the text data when the text data is in a language type not supported by the device 12. That is, an image can be formed from the text data and sent for display in the user interface 14. Further, to enable a range of languages to be handled, a language server 22 can use several language libraries 23a-23d, each directed to a different language, e.g., Chinese, Korean, Japanese, and so on.
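A sketch, under assumed interfaces, of how a server-side speech engine 19 might select a language library 23a-23d and fall back to the image generator 21 when the device cannot render the text; the function names below are hypothetical:

```python
LANGUAGE_LIBRARIES = {"zh": "chinese.lib", "ko": "korean.lib",
                      "ja": "japanese.lib", "en": "english.lib"}


def convert_on_server(audio_bytes: bytes, language: str, device_supported: set[str]) -> dict:
    """Return either text data or an image of the text, depending on device support."""
    library = LANGUAGE_LIBRARIES.get(language)
    if library is None:
        raise ValueError(f"no language library for {language!r}")
    text = run_speech_engine(audio_bytes, library)       # hypothetical speech engine 19
    if language in device_supported:
        return {"type": "text", "payload": text}         # device can render the text itself
    return {"type": "image", "payload": render_text_image(text)}  # image generator 21


def run_speech_engine(audio_bytes: bytes, library: str) -> str:
    raise NotImplementedError  # placeholder for the actual recognizer


def render_text_image(text: str) -> bytes:
    raise NotImplementedError  # placeholder for the image generator
```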
Wherever the speech engine is located, a minimal system can simply store audio data for letters and numbers rather than complex commands or username/password combinations. In this way, the network-enabled device need only have enough memory to store the audio files for letters and numbers and a few simple commands, such as "up", "down", "next", "page down", the names of services or social networking sites, and the like. Any other input can simply be entered by the user as a string of characters. In some cases the user can speak the characters one at a time, each being converted to text and displayed on screen as text data, e.g., in a form, character by character. In other cases the user speaks a string of characters, which is then sent for conversion as a single audio file.
Fig. 2 illustrates various types of audio data and their uses. For example, audio data 58 can correspond to registration data 62, with which the step of registering the device to a user account can be performed (step 64'). If necessary, the user account can first be created (step 64). Those of ordinary skill in the art will recognize that the same systems and methods can be applied to other types of device management, or to inserting other data into web forms, native applications, or other such means of controlling a network-enabled device. The audio data 58 can also correspond to a username or password 66, which in turn can allow the user to log in to a service or server (step 68). For example, the user can speak "username" and "password" to log in to their account. The audio data 58 can further correspond to a playback command 72 or a navigation command 76; in either case, the result can be execution of the command (step 74). Playback commands 72 can relate to trick play of stored audiovisual files (such as movies or television programs), and navigation commands 76 can correspond to moving a cursor or highlight on a menu or other interactive screen in the user interface of the network-enabled device.
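The Fig. 2 mapping of audio data types to actions could be dispatched roughly as follows. This is only a sketch; the `device` methods and the `kind` labels are invented for illustration:

```python
def handle_recognized_text(kind: str, text: str, device) -> None:
    """Route converted text to the action classes of Fig. 2."""
    if kind == "registration":         # registration data 62 -> register to account (step 64')
        device.register_to_account(text)
    elif kind == "credentials":        # username/password 66 -> log in (step 68)
        user, _, password = text.partition(" ")
        device.log_in(user, password)
    elif kind == "playback":           # playback command 72 -> execute (step 74)
        device.execute_playback(text)  # e.g. "pause", "fast forward"
    elif kind == "navigation":         # navigation command 76 -> move cursor/highlight
        device.navigate(text)          # e.g. "up", "down", "next"
    else:
        device.display_text(text)      # default: just show the transcription
```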
Fig. 3 is a flowchart 20 of an exemplary method according to present principles. The first step in Fig. 3 is configuring the device to receive audio data (step 78). This step simply requires making the network-enabled device ready to receive voice commands, e.g., through an embedded microphone or by receiving audio data from an external device, where the audio data is, e.g., recorded on the external device or the external device serves as a conduit for the audio data.
After the configuration step, the next step is receiving audio data (step 82). Such audio data can generally be stored in a buffer or memory of the device or of the external device. The audio data is then converted to text data (step 84). As noted with reference to Fig. 1, the conversion can occur on the client side or on the server side, or an algorithm can allow the processing to be shared between the two. In some cases, the network-enabled device is caused to perform an action based on the text data (step 86). The action performed can be minimal, e.g., simply displaying the text data computed from the audio data. The action can also be substantial, e.g., causing the network-enabled device to bring up a favorites list in a service, play back a video, and so on.
Where the text data is data for a form or the like, the text data may be displayed on the device (step 92). The user can be prompted, e.g., by a visual or audible cue, to confirm that the text data is an accurate conversion (step 94). In some cases the text data can be modified (step 96). An indication of successful conversion of the text data and of its use can be displayed, e.g., an indication that registration succeeded. The text data, together with any modifications, can then be stored (step 88). The text data can be stored in the user memory, and this storage can be used, e.g., to apply the learning described above so that the system becomes more accustomed to the user's voice.
Fig. 4 is a flowchart 30 of another exemplary method according to present principles, specifically including a step of language translation that is effective when the user's native language is not supported by the network-enabled device. The first step of Fig. 4 can be similar to that of Fig. 3, in which the device is configured to receive audio data. Audio data is then received by a server (step 102). A language type can be detected from the received audio data (step 104); such detection can be performed in several ways, as described below. If the detected language type is not supported by the network-enabled device, the audio data can be converted and translated so that it is rendered into text format and translated into a form appropriate to the detected language type (step 108). In some cases the language type is not supported but text in the language can nevertheless be displayed by the network-enabled device; in that case the text data can be transmitted to the network-enabled device for display (step 112). In the other case, where the text data cannot be displayed, an image file of the text data can be created (step 116) and transmitted to the network-enabled device to be displayed as an image file (step 118). In yet another alternative implementation, a language module can be downloaded to the network-enabled device so that it can convert and translate the audio data locally (step 114).
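A sketch of the Fig. 4 branching (steps 108-118), with one possible ordering of the three alternatives; the device capabilities and transport details are assumptions, not taken from the patent:

```python
def handle_unsupported_language(audio_bytes: bytes, detected_lang: str, device) -> None:
    """Convert/translate on the server, then choose how to deliver the result to the device."""
    text = server_convert_and_translate(audio_bytes, detected_lang)    # step 108
    if device.can_render_text(detected_lang):
        device.display_text(text)                                      # step 112
    elif device.can_display_images():
        device.display_image(make_text_image(text))                    # steps 116-118
    else:
        module = download_language_module(detected_lang)               # step 114
        device.install_language_module(module)
        device.display_text(device.convert_locally(audio_bytes))


# Placeholder stubs; real implementations would live on the server/device.
def server_convert_and_translate(audio_bytes: bytes, lang: str) -> str: ...
def make_text_image(text: str) -> bytes: ...
def download_language_module(lang: str) -> bytes: ...
```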
Fig. 5 is a flowchart 40 of another exemplary method according to present principles, in which, however, downloading a language module allows speech recognition to be performed locally rather than on a server. As before, the first steps are configuring the device to receive audio data (step 122) and receiving audio data (step 124); in this case, however, a language type selected by the user is received (step 126). For example, a new TV may be powered on and, having no previous input from the user, may display a menu prompting the user to select a language. The user can navigate to the selected language using a remote control or using voice commands, as long as a "basic" set of commands is initially loaded on the TV, e.g., at least "down", "up", "next", "select", and so on.
Once a language is selected, if the TV does not natively support that language type, a language module for that language type can be downloaded so that speech recognition can be performed in that language (step 128).
Fig. 6 illustrates ways and devices for converting audio data to text data. For example, the step of converting audio data to text data (step 154) can be performed on an external device (step 156), e.g., on a phone, a tablet computer, a remote control, a second display, a dongle, and so on (step 158). The conversion can also be performed on the network-enabled device (step 162), either alone or in combination with an external device (step 164). Alternatively, the conversion can be performed on the network-enabled device but based on language-specific data from a server: in this implementation, the network-enabled device may not initially have the data needed to convert the language spoken by the user, but it can download language-specific data from the server to enable the conversion. Finally, the conversion can also be performed on the server (step 166). It will be appreciated that in some cases multiple speech engines can each convert some portion of the received audio data, in which case the responsibility for converting audio data to text data is shared among different modules.
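One way to think of the Fig. 6 alternatives is as a pluggable conversion backend. A sketch under assumed interfaces (none of these class names appear in the patent):

```python
from typing import Protocol


class SpeechBackend(Protocol):
    def to_text(self, audio_bytes: bytes) -> str: ...


class OnDeviceBackend:
    """Conversion on the network-enabled device itself (step 162)."""
    def to_text(self, audio_bytes: bytes) -> str:
        raise NotImplementedError


class ExternalDeviceBackend:
    """Conversion on a phone, tablet, remote, second display, or dongle (steps 156-158)."""
    def to_text(self, audio_bytes: bytes) -> str:
        raise NotImplementedError


class ServerBackend:
    """Conversion on the server (step 166)."""
    def to_text(self, audio_bytes: bytes) -> str:
        raise NotImplementedError


def convert(audio_bytes: bytes, backends: list[SpeechBackend]) -> str:
    """Try each available engine in turn; responsibility can be shared among modules."""
    for backend in backends:
        try:
            return backend.to_text(audio_bytes)
        except NotImplementedError:
            continue
    raise RuntimeError("no speech backend available")
```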
Fig. 7 is a flowchart 60 of a more detailed method according to another implementation of present principles. As before, the device is configured to receive audio data (step 168), and audio data is received (step 172). An indication of a language type is received (step 174); such an indication can be obtained by the various methods described herein and in Fig. 8. Once it is determined that the language type is not supported by the network-enabled device (step 176), various groups of steps can be taken. In the first group, the received audio data can be transmitted to a server (step 178). The server can convert the received audio data to text format and transmit it back to the network-enabled device (step 182) for display (step 184).
In an alternative implementation, following the determination that the language type is not supported, a request for a language module corresponding to the language type can be transmitted (step 186). The language module can be received and stored on the network device or on an external device, such as a second display, dongle, smartphone, tablet computer, and so on (step 188). The language module can then be used to convert the audio data to text data (step 192), and an indication of the converted data can be displayed (step 194). For example, the text data itself can be displayed so that the user can confirm accurate entry and conversion.
Fig. 8 illustrates ways of receiving an indication of a language type (step 196). For example, the system can receive a selection of the language type from the user (step 198), e.g., from a selection menu. In another implementation, the language type can be determined from a settings file (step 202). The language type can also be detected based on the audio data itself (step 204); in such a system, the audio file is analyzed on the client side to determine the language type, or likely language type, spoken by the user. In yet another implementation, the audio data can be transmitted to a server for such analysis (step 206).
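A sketch of the four sources Fig. 8 lists for the language-type indication; the settings-file format and the detection calls are assumptions:

```python
import json
from pathlib import Path
from typing import Optional


def language_from_selection(menu_choice: Optional[str]) -> Optional[str]:
    return menu_choice                                          # step 198: user picked from a menu


def language_from_settings(path: Path) -> Optional[str]:
    try:
        return json.loads(path.read_text()).get("language")     # step 202: settings file
    except FileNotFoundError:
        return None


def language_from_audio(audio_bytes: bytes) -> Optional[str]:
    raise NotImplementedError                                   # step 204: client-side detection


def language_from_server(audio_bytes: bytes) -> Optional[str]:
    raise NotImplementedError                                   # step 206: server-side detection


def resolve_language(menu_choice, settings_path, audio_bytes) -> Optional[str]:
    """Try each source in turn (steps 198, 202, 204, 206)."""
    for getter in (lambda: language_from_selection(menu_choice),
                   lambda: language_from_settings(settings_path),
                   lambda: language_from_audio(audio_bytes),
                   lambda: language_from_server(audio_bytes)):
        try:
            lang = getter()
        except NotImplementedError:
            lang = None
        if lang:
            return lang
    return None
```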
Fig. 9 illustrates an exemplary implementation of a dongle 80 according to present principles. The dongle 80 includes means 208 for receiving an audio file. Such means 208 can include a microphone embedded in or coupled to the dongle, a hardware port for receiving an audio file (e.g., from the network-enabled device), and so on. Where the means 208 includes a hardware port for receiving an audio file from the network-enabled device, it can allow a dongle without a microphone to convert audio data that the network-enabled device received from its own microphone or from some other source. The dongle 80 also includes means 212 for converting the audio file to text; the means 212 generally provides the dongle 80 with speech-engine functionality. In some implementations, the means 212 resides on the network-enabled device rather than on the dongle, or the network-enabled device and the dongle can share the performance of, and responsibility for, this function. The dongle 80 also includes output means 214 for transmitting the text to the network-enabled device. The means 214 can be used to provide the data the network-enabled device needs in order to display the converted audio file, and can include a hardware port or, e.g., an RF communication port. The dongle 80 also includes memory 216 for storing user profiles or other user data. The memory 216 can store information about the user's accent, dialect, and so on, and can support customized language translation. It is noted that the means 208-214 and the memory 216 (or portions thereof) can generally be embodied on a non-transitory computer-readable medium.
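The dongle of Fig. 9 could be modeled roughly as below; none of the class or method names come from the patent, and the bodies are stubs:

```python
class VoiceDongle:
    """Rough model of dongle 80: receive audio (means 208), convert it to text (means 212),
    output the text to the network-enabled device (means 214), with a profile store (memory 216)."""

    def __init__(self) -> None:
        self.profiles: dict[str, dict] = {}   # memory 216: per-user acoustic characteristics

    def receive_audio(self) -> bytes:
        # means 208: microphone, RF receiver, or hardware port
        raise NotImplementedError

    def convert_to_text(self, audio_bytes: bytes, user_id: str | None = None) -> str:
        # means 212: speech-engine function, optionally customized by the stored profile
        raise NotImplementedError

    def send_text(self, text: str) -> None:
        # means 214: USB, RF transmitter, or hardware port back to the device
        raise NotImplementedError
```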
Figs. 10-13 are flowcharts of one or more specific methods according to implementations of present principles. Referring to flowchart 90 of Fig. 10, after the method starts (step 218), a step of determining whether a supported device is detected can be performed (step 222), e.g., determining whether a device ready to communicate with the network-enabled device (e.g., over an RF communication scheme) is present. If no supported device is detected, the form can be filled in using text entry (step 224), e.g., using a keyboard or remote control.
If a supported device is detected, a voice-input session can be requested (step 226), e.g., by the methods described above, such as clicking an icon or speaking a keyword. If necessary, a speech engine is instantiated (step 228) and a voice session is started (step 232). The speech engine may vary, and it is noted that it can be one obtained through open-source software or the like.
The form can then be filled in using voice input (or other web or browser actions can be performed) (step 234). If speech is detected (step 236), flow moves to flowchart 110 of Fig. 11. If no speech is detected, the system can wait until a timeout occurs. Once speech is detected, a step of capturing the speech can begin (step 238). Capturing speech can continue until a pause, a timeout, or a special keyword is detected (step 242), or until the user otherwise indicates that speech capture should end (step 252). It is noted that the captured speech can be a phrase, a word, a single letter or digit, and so on.
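The capture loop of steps 238-252 might look roughly like this, assuming a hypothetical `read_chunk()` audio source and a caller-supplied silence test:

```python
import time


def capture_speech(read_chunk, is_silence, max_seconds: float = 10.0,
                   pause_seconds: float = 1.5) -> bytes:
    """Accumulate audio until a pause or a timeout is detected (steps 238-252)."""
    captured = bytearray()
    start = time.monotonic()
    last_voice = start
    while True:
        chunk = read_chunk()                   # hypothetical: one short frame of audio
        now = time.monotonic()
        if chunk and not is_silence(chunk):
            captured.extend(chunk)
            last_voice = now
        if now - last_voice >= pause_seconds:  # pause detected (step 242)
            break
        if now - start >= max_seconds:         # timeout (step 258)
            break
    return bytes(captured)
```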
Other events can also cause speech capture to end, e.g., if a timeout is detected (step 258), if an error is detected (step 262), or if the user stops the speech capture (step 264). In any of these cases, an error can be displayed to the user (step 266).
Assuming some speech has been captured, the speech engine can be used to convert the speech (step 254). The user can be prompted to confirm the converted text (step 255). Assuming the conversion completes correctly, the system can report success (step 256); if not, an error can be displayed to the user (step 266).
Assuming the conversion proceeds correctly and the audio file is successfully converted to text, the text can be displayed and automatically submitted (step 268). It will be appreciated that non-automatic submission, e.g., submission requiring user confirmation, is also contemplated in the systems and methods. Where speech recognition is used to perform registration, the registration process can then continue (step 272); it will be appreciated that other management functions proceed similarly. If a registration error is detected (step 274), the method can end (step 276). If no error is detected, registration can complete (step 278). If a language type was detected (step 282), a step of playing a "congratulations" or other audio message to the user in the detected language type can be performed. It will be appreciated that other such audio prompts can be provided to the user, for other purposes, in the user's native or selected language.
Flowchart 130 of Fig. 13 gives more detail of a method for handling an unsupported language. Specifically, following the step of ending the capture of speech (step 252), a language type can be detected (step 288). This step can be performed in the many ways described above. A step of checking the languages supported by the device can also be performed (step 292). If the languages are determined to differ (step 294), a speech engine can be invoked to perform speech synthesis (step 296), and a translation engine can be invoked to enable translation from the detected language to a language supported by the device, e.g., from Chinese to English, thereby allowing input and control in the user's native or selected language (step 298). In this way, the user can enter data, text, and commands by voice in their native or selected language, and the equivalent text or commands are entered on the device.
Systems and methods have been disclosed that allow the user experience of an IPTV to be improved without increasing the hardware cost of the unit. As disclosed, the user can use the systems and methods to control and manage (e.g., register or perform other functions on) a content playback device such as an IPTV using voice commands. In some implementations the systems and methods allow the network-enabled device to overcome inherent limitations, e.g., handling languages that are otherwise not supported.
Implementations include one or more programmable processors and corresponding computing system components for storing and executing computer instructions, e.g., code that provides the speech engine, user interface, or network functions. Referring to Fig. 14, a representation of an exemplary computing environment that may be used is shown.
The computing environment includes a controller 302, memory 306, storage 312, a media device 316, a user interface 324, an input/output (I/O) interface 326, and a network interface 328. The components are interconnected by a common bus 332. Alternatively, a different connection configuration can be used, such as a star pattern with the controller at the center.
The controller 302 includes a programmable processor and controls the operation of the speech recognition system 304 and its components. The controller 302 loads instructions from the memory 306 or an embedded controller memory (not shown) and executes these instructions to control the system. In its execution, the controller 302 can provide the speech recognition system in part as a software system; alternatively, this service can be implemented as separate modular components in the controller 302 or the second display.
The memory 306, which may include non-transitory computer-readable memory 308, stores data temporarily for use by the other components of the system. In one implementation, the memory 306 is implemented as RAM. In other implementations, the memory 306 also includes long-term or permanent memory, such as flash memory and/or ROM.
The storage 312, which may include non-transitory computer-readable memory 314, stores data temporarily or long-term for use by other components of the system and method, e.g., storing data used by the system. In one implementation, the storage 312 is a hard disk drive or a solid-state drive.
The media device 316, which may include non-transitory computer-readable memory 322, receives removable media and reads and/or writes data to the inserted media. In one implementation, the media device 316 is an optical disc drive or disc burner, e.g., a writable disc drive 318.
The user interface 324 includes components for accepting user input (e.g., content playback device registration information) from the user of the second display and presenting information to the user. In one implementation, the user interface 324 includes a keyboard, a mouse, audio speakers, and a display. The controller 302 uses input from the user to adjust the operation of the system.
The I/O interface 326 includes one or more I/O ports for connecting to corresponding I/O devices, such as external storage or peripheral devices, e.g., a printer or a PDA. In one implementation, the ports of the I/O interface include ports such as USB ports, PCMCIA ports, serial ports, and/or parallel ports. In another implementation, the I/O interface 326 includes a wireless interface for wireless communication with external devices. These I/O interfaces can be used to connect to one or more content playback devices.
The network interface 328 allows connection to a local area network and, optionally, to external devices, and includes a wired and/or wireless network connection, such as an RJ-45 or Ethernet connection or a "WiFi" interface (802.11). It will be appreciated that many other types of network connection are possible, including WiMax, 3G or 4G, 802.15 protocols, 802.16 protocols, satellite, and so on.
The computing environment can include additional hardware and software typical of such devices, such as power supplies and operating systems, although these components are not specifically shown in the figure for simplicity. In other implementations, different configurations of the device can be used, e.g., different bus or storage configurations or a multiprocessor configuration.
Various illustrative implementations of the present invention have been described. Those of ordinary skill in the art will recognize, however, that additional implementations are also possible and are within the scope of the present invention. For example, voice input can be received by an application running on the second display; in that case, the operation of the second display and its interaction with the content playback device and the network provider can be as described in the patent applications incorporated by reference above.
The user can also use voice input to perform various functions, such as browser functions, e.g., browsing or searching for services and assets, and transactions such as video rental or home shopping. The user can also use voice input to perform various auxiliary functions of an identified service. The user can also request and control content items for playback using speech recognition. Voice registration can be extended to register not only devices that display or render content items, but also devices that store and play back content items, such as DVRs, players, media players, game consoles, or indeed virtually any network-enabled device. Although a complete registration website may be developed primarily for a PC, a subset of the complete registration website, or its more frequently used functions, can be implemented for voice response in a registration menu of the user interface 14. In implementations in which the language type is detected, once detection has occurred, the language type information can be passed to other websites visited by the user, e.g., so that language-specific versions of those websites can be presented immediately. Similarly, following language type detection, if a form is presented on the network-enabled device, the language of the form can automatically be set to the detected language type.
Voice detection mode can start automatically once an attached dongle or external device (i.e., an external device that can serve as a conduit for voice input) is detected. Alternatively, voice input can be started using an icon on a smartphone or a button on a remote control.
The speech detection step can be used to detect the identity of the speaker and automatically load the speaker's profile on the device, thereby allowing, e.g., parental control based on the speaker's permissions. For example, if a child's voice is detected, the IPTV can automatically be restricted to children's programming.
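A sketch of the speaker-based parental-control idea; the speaker-identification call, profile names, and rating levels are placeholders:

```python
PROFILES = {
    "parent": {"max_rating": "R"},
    "child":  {"max_rating": "PG"},
}


def identify_speaker(audio_bytes: bytes) -> str:
    """Placeholder speaker identification; a real system would compare acoustic features."""
    raise NotImplementedError


def apply_speaker_profile(audio_bytes: bytes, device) -> None:
    """Load the detected speaker's profile and restrict programming accordingly."""
    speaker = identify_speaker(audio_bytes)
    profile = PROFILES.get(speaker, PROFILES["child"])  # default to the most restrictive
    device.set_content_rating_limit(profile["max_rating"])
```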
Accordingly, the present invention is not limited to the implementations described above.

Claims (47)

1. A method of entering data into a network-enabled device, comprising:
a. configuring the network-enabled device to be in a state to receive audio data, the data being associated with a service of the network-enabled device, with a server associated with the network-enabled device, or with an operation in a user interface of the network-enabled device;
b. receiving audio data;
c. converting the received audio data to text data; and
d. causing the network-enabled device to perform an action based on the text data, the text data representing a function on the service or on the server, or representing an operation in the user interface of the network-enabled device.
2. The method of claim 1, wherein the received audio data is registration data, and the method further comprises associating the text data with a user account, such that the network-enabled device is registered to the user account.
3. The method of claim 2, further comprising creating the user account based on the registration data.
4. The method of claim 1, wherein the received audio data is a username or a password or both, and wherein the function on the service is logging in to a user account on the service.
5. The method of claim 1, wherein the received audio data is a navigation command, and wherein performing the operation in the user interface comprises executing the navigation command.
6. The method of claim 1, further comprising transmitting a signal that causes the network-enabled device to display the text data.
7. The method of claim 1, wherein, after the audio data is received and converted to text data comprising characters, a text version of the characters is displayed on the network-enabled device.
8. The method of claim 1, further comprising prompting the user to confirm the text data.
9. The method of claim 8, further comprising storing the received audio data and, if the user modifies the text data following a displayed prompt, associating the modified text data with the received audio data.
10. The method of claim 1, further comprising:
a. detecting a language type from the received audio data;
b. if the detected language type does not correspond to a supported language of the network-enabled device:
i. performing the converting step such that the text data is in a form corresponding to the detected language type;
ii. creating an image file of the text data; and
iii. transmitting the image file to the network-enabled device for display.
11. The method of claim 1, further comprising:
a. detecting a language type from the received audio data;
b. if the detected language type does not correspond to a supported language of the network-enabled device:
i. performing the converting step such that the text data is in a form corresponding to the detected language type; and
ii. transmitting the text data to the network-enabled device for display.
12. The method of claim 1, further comprising:
a. detecting a language type from the received audio data;
b. if the detected language type does not correspond to a supported language of the network-enabled device, downloading a language module corresponding to the detected language type to the network-enabled device.
13. The method of claim 1, further comprising prompting the user to enter a language type and, once the language type is entered, downloading a language module corresponding to the entered language type to the network-enabled device.
14. A non-transitory computer-readable medium comprising instructions for causing a computing device to implement the method of claim 1.
15. A method of entering data into a network-enabled device, comprising:
a. configuring the network-enabled device to be in a state to receive audio data;
b. receiving audio data;
c. converting the received audio data to text data; and
d. causing the network-enabled device to perform an action based on a request using the text data.
16. The method of claim 15, wherein requesting data entry comprises displaying a form and prompting for data entry, and the method further comprises populating the form with the text data and displaying the populated form.
17. The method of claim 16, wherein the form prompts for entry of a registration code, and the method further comprises transmitting the text data to a server to perform registration and, upon receiving a signal from the server indicating successful registration, displaying an indication of successful registration.
18. The method of claim 15, wherein requesting data entry comprises accepting entry of a navigation command.
19. The method of claim 15, wherein receiving audio data comprises receiving audio data using an input port on the network-enabled device.
20. The method of claim 15, wherein converting the received audio data to text data is performed on the network-enabled device.
21. The method of claim 20, further comprising:
a. prior to the converting, determining that the received audio data is in a language that is not supported; and
b. downloading a language module corresponding to the language of the received audio data.
22. The method of claim 19, wherein the input port is configured to accept audio data from a mobile phone, a tablet computer, a laptop computer, a microphone, or an audio stream.
23. The method of claim 19, wherein the input port is a USB port.
24. The method of claim 23, wherein a dongle is coupled to the USB port, and wherein receiving audio data is performed by a microphone coupled to the dongle.
25. The method of claim 24, wherein converting the received audio data to text data is performed on the dongle.
26. The method of claim 15, wherein receiving audio data comprises receiving audio data from a remote control.
27. The method of claim 26, wherein converting the received audio data to text data is performed on the remote control or on the network-enabled device.
28. The method of claim 15, wherein receiving audio data comprises receiving audio data from a second display.
29. The method of claim 28, wherein the second display is a smartphone, a tablet computer, or a laptop computer.
30. The method of claim 29, wherein converting the received audio data to text data is performed on the second display or on the network-enabled device.
31. The method of claim 15, wherein receiving audio data comprises receiving audio data using a radio-frequency audio input device paired with the network-enabled device.
32. The method of claim 31, wherein the radio-frequency audio input device is a smartphone.
33. The method of claim 31, wherein converting the received audio data to text data is performed on the radio-frequency audio input device.
34. A non-transitory computer-readable medium comprising instructions for causing a computing device to implement the method of claim 15.
35. A method for inputting data to a network-enabled device, comprising:
A. configuring the network-enabled device to be in a state to receive audio data;
B. receiving audio data;
C. receiving an indication of a language type;
D. determining that the language type is not supported;
E. transmitting the received audio data to a first server;
F. receiving converted data from the first server, the converted data computed from the received audio data; and
G. displaying an indication of the received converted data.
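A hedged sketch of the server-assisted path of claim 35: when the indicated language type is not supported locally, the received audio is transmitted to a first server, and the converted data it returns is displayed. The endpoint URL, request headers, and JSON field name are assumptions for illustration only.

import json
import urllib.request

LOCALLY_SUPPORTED = {"en-US"}
CONVERSION_SERVER = "https://example.com/convert"  # hypothetical first server

def convert_remotely(audio_bytes, lang):
    # claim 35 E-F: send the audio to the first server and receive converted data
    request = urllib.request.Request(
        CONVERSION_SERVER,
        data=audio_bytes,
        headers={"Content-Type": "application/octet-stream", "X-Language": lang},
    )
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read())["text"]

def handle_utterance(audio_bytes, lang):
    if lang in LOCALLY_SUPPORTED:
        text = "<locally recognized text>"
    else:
        # claim 35 D: the language type is not supported, so delegate to the server
        text = convert_remotely(audio_bytes, lang)
    print("display:", text)  # claim 35 G: display an indication of the converted data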
36. The method of claim 35, wherein the received audio data corresponds to a navigation command, and wherein displaying an indication of the received converted data comprises executing the navigation command.
37. The method of claim 35, wherein the received audio data corresponds to data to be entered in a form, and wherein displaying an indication of the received converted data comprises entering the data in the form.
38. The method of claim 35, wherein receiving an indication of a language type comprises:
A. receiving a selection of a language type;
B. determining the language type from a settings file;
C. detecting the language type based on the received audio data; or
D. transmitting the audio data to a second server and receiving an indication of the language type from the second server.
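The four alternatives of claim 38 can be pictured as four small helpers. The settings-file path and the second-server step are assumptions, and the detection function is a placeholder rather than a real classifier.

import configparser

def language_from_selection(user_choice):
    # A. an explicit language selection made by the user
    return user_choice

def language_from_settings(path="/etc/device/settings.ini"):  # hypothetical path
    # B. determine the language type from a settings file
    cfg = configparser.ConfigParser()
    cfg.read(path)
    return cfg.get("locale", "language", fallback="en-US")

def language_from_audio(audio_bytes):
    # C. detect the language type from the received audio data (placeholder)
    return "en-US"

def language_from_server(audio_bytes):
    # D. transmit the audio to a second server and receive the language type;
    # left unimplemented because the claim does not specify the interface
    raise NotImplementedError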
39. The method of claim 35, wherein the received converted data is text data.
40. The method of claim 35, wherein the received converted data is an image file indicating text data.
41. A method for inputting data to a network-enabled device, comprising:
A. configuring the network-enabled device to be in a state to receive audio data;
B. receiving audio data;
C. receiving an indication of a language type;
D. determining that the language type is not supported;
E. transmitting a request for a language module corresponding to the language type to a server;
F. receiving the requested language module from the server;
G. converting the audio data to text data using the received language module; and
H. displaying an indication of the text data.
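A hedged sketch of claim 41: the device requests the missing language module from a server, receives it, converts the audio locally with that module, and displays the text. The module-server URL and the LanguageModule interface are illustrative assumptions, not details from the patent.

import urllib.request

MODULE_SERVER = "https://example.com/modules/{lang}"  # hypothetical server

class LanguageModule:
    def __init__(self, blob):
        self.blob = blob  # the downloaded module contents

    def transcribe(self, audio_bytes):
        # a real module would run speech recognition for its language
        return "<text data>"

def fetch_module(lang):
    # claim 41 E-F: transmit a request for the module and receive it from the server
    with urllib.request.urlopen(MODULE_SERVER.format(lang=lang)) as resp:
        return LanguageModule(resp.read())

def convert_and_display(audio_bytes, lang):
    module = fetch_module(lang)            # obtained because lang was not supported
    text = module.transcribe(audio_bytes)  # claim 41 G: convert using the module
    print("display:", text)                # claim 41 H: display an indication of the text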
42. The method of claim 41, wherein the language module is stored on the network-enabled device, stored on a dongle connected to the network-enabled device, or stored on an external device in communication with the network-enabled device.
43. The method of claim 41, wherein receiving an indication of a language type comprises:
A. receiving a selection of a language type;
B. determining the language type from a settings file;
C. detecting the language type based on the received audio data; or
D. transmitting the audio data to a second server and receiving an indication of the language type from the second server.
44. A dongle device adapted to be placed in signal communication with a network-enabled device, comprising:
A. means for receiving an audio file;
B. means for converting the audio file to text; and
C. output means for transmitting the text to the network-enabled device.
45. The device of claim 44, wherein the receiving means is selected from the group consisting of an RF signal receiver, a microphone, and a hardware port.
46. The device of claim 44, wherein the output means is selected from the group consisting of a USB port, an RF signal transmitter, and a hardware port.
47. The device of claim 44, further comprising a memory for storing a user profile, the user profile indicating acoustic characteristics of the user's voice.
CN201480012543.3A 2013-03-08 2014-03-07 Method and system for voice recognition input on network-enabled devices Active CN105009205B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US13/790,426 2013-03-08
US13/790,426 US9495961B2 (en) 2010-07-27 2013-03-08 Method and system for controlling network-enabled devices with voice commands
PCT/US2014/022099 WO2014138685A2 (en) 2013-03-08 2014-03-07 Method and system for voice recognition input on network-enabled devices

Publications (2)

Publication Number Publication Date
CN105009205A true CN105009205A (en) 2015-10-28
CN105009205B CN105009205B (en) 2019-11-05

Family

ID=51488917

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201480012543.3A Active CN105009205B (en) 2013-03-08 2014-03-07 Method and system for voice recognition input on network-enabled devices

Country Status (2)

Country Link
CN (1) CN105009205B (en)
WO (1) WO2014138685A2 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105430208A (en) * 2015-10-23 2016-03-23 小米科技有限责任公司 Voice conversation method and apparatus, and terminal equipment


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5267323A (en) * 1989-12-29 1993-11-30 Pioneer Electronic Corporation Voice-operated remote control system
US20100250231A1 (en) * 2009-03-07 2010-09-30 Voice Muffler Corporation Mouthpiece with sound reducer to enhance language translation
US20130122982A1 (en) * 2010-04-19 2013-05-16 Toy Toy Toy Ltd. Method, circuit, device, system, and corresponding computer readable code for facilitating communication with and among interactive devices
US20130036442A1 (en) * 2011-08-05 2013-02-07 Qualcomm Incorporated System and method for visual selection of elements in video content

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101433050A (en) * 2006-05-02 2009-05-13 艾利森电话股份有限公司 Method for multi-interface registration
US20090177477A1 (en) * 2007-10-08 2009-07-09 Nenov Valeriy I Voice-Controlled Clinical Information Dashboard
US20090313007A1 (en) * 2008-06-13 2009-12-17 Ajay Bajaj Systems and methods for automated voice translation
US20130117590A1 (en) * 2011-11-03 2013-05-09 International Business Machines Corporation Minimizing Aggregate Cooling and Leakage Power with Fast Convergence
CN102831486A (en) * 2012-07-13 2012-12-19 深圳市龙视传媒有限公司 Method, system and terminal for intelligent voice ticket booking

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108886665A (zh) * 2016-03-31 2018-11-23 伯斯有限公司 Audio system equalizing
US11258418B2 (en) 2016-03-31 2022-02-22 Bose Corporation Audio system equalizing
CN110800044A (en) * 2017-09-08 2020-02-14 亚马逊技术股份有限公司 Speech rights management for voice assistant systems
CN110800044B (en) * 2017-09-08 2024-04-05 亚马逊技术股份有限公司 Utterance rights management for voice assistant systems
CN108062950A (zh) * 2017-12-20 2018-05-22 歌尔科技有限公司 Voice-wakeup dongle device and voice wake-up method

Also Published As

Publication number Publication date
WO2014138685A2 (en) 2014-09-12
CN105009205B (en) 2019-11-05
WO2014138685A3 (en) 2014-11-06

Similar Documents

Publication Publication Date Title
US10785522B2 (en) Method and system for controlling network-enabled devices with voice commands
JP6616473B2 (en) Method and apparatus for controlling pages
KR102451437B1 (en) Techniques for updating language understanding classifier models for digital personal assistants based on crowdsourcing
US9412368B2 (en) Display apparatus, interactive system, and response information providing method
KR101922782B1 (en) Method and apparatus for assigning keyword model to voice operated function
US10311877B2 (en) Performing tasks and returning audio and visual answers based on voice command
EP2784666B1 (en) Method and device for displaying service pages for executing applications
JP6375521B2 (en) Voice search device, voice search method, and display device
JP6918181B2 (en) Machine translation model training methods, equipment and systems
CN111095892B (en) Electronic device and control method thereof
US10860289B2 (en) Flexible voice-based information retrieval system for virtual assistant
AU2012261531A1 (en) Electronic device, server and control method thereof
CN103493009A (en) Establishing connections among electronic devices
CN105554588B (en) Closed caption-supporting content receiving apparatus and display apparatus
CN103914234A (en) Interactive server, control method thereof, and interactive system
US20140123185A1 (en) Broadcast receiving apparatus, server and control methods thereof
KR20150067090A (en) Multi-screen interaction method, devices, and system
WO2015062511A1 (en) Smart television cursor control method and device therefor
CN105009205A (en) Method and system for voice recognition input on network-enabled devices
WO2015085953A1 (en) Method and device for starting browser in smart tv
US20170195129A1 (en) Method and device to control secondary devices
CN104965907A (en) Structured object generation method and apparatus
KR102468214B1 (en) The system and an appratus for providig contents based on a user utterance
KR102220253B1 (en) Messenger service system, method and apparatus for messenger service using common word in the system
US12010386B2 (en) System and method for providing digital graphics and associated audiobooks

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant