KR100727548B1 - Method and device for providing speech-enabled input in an electronic device having a user interface - Google Patents


Info

Publication number
KR100727548B1
Authority
KR
South Korea
Prior art keywords
display
cpu
input
voice
voice input
Prior art date
Application number
KR1020057019000A
Other languages
Korean (ko)
Other versions
KR20050111633A (en)
Inventor
Henri Salminen
Katriina Halonen
Original Assignee
Nokia Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Corporation
Priority to KR1020057019000A
Publication of KR20050111633A
Application granted
Publication of KR100727548B1

Abstract

The present invention provides a method, apparatus and system for multimodal interaction. The method according to the invention comprises the steps of activating a multimodal user interaction that provides at least one key input option and at least one voice input option, displaying the at least one key input option, checking whether at least one condition affecting the voice input is met, and, in accordance with that condition, providing the voice input options and displaying indications of the provided voice input options on the display.
Voice, recognition, user, interface, multi-modal, interaction

Description

Method and device for providing speech-enabled input in an electronic device having a user interface

The present invention relates to multimodal interactive browsing in electronic devices, portable terminals and communication networks. In particular, the present invention relates to a simple multimodal user interface concept that provides intuitive guidance for voice data input and voice browsing as an alternative to manual input. Moreover, the present invention relates to checking preliminary conditions that must be met for valid voice input.

In multimodal applications, users can interact through input methods other than just a keypad. For example, commands traditionally given by scrolling and clicking can be voice-activated in the application, so that the user can speak the commands to be recognized by an automatic speech recognition engine. Adding voice interaction to visual applications is of increasing interest as the enabling technologies mature, because in many mobile scenarios it is difficult to use the keypad, for example while driving or walking.

So far, different multimodal browsing architectures have already been proposed. For example, document US 6101473 describes how voice browsing is realized by simultaneous operation of telephone network services and internet services. This is clearly prohibitive in terms of network resources, as it requires two different communication links. Moreover, the service requires an interconnection between the telephone service and the Internet service. Another obstacle to user satisfaction is that the over-the-air co-browser synchronization required in a distributed browser architecture can cause latency in browser behavior that degrades the user experience.

Document US 6188985 describes a wireless control unit that implements voice browsing capability for a host computer. Along these lines, a number of multimodal browser architectures have been proposed in which the operations are placed on a network server.

Patent US 6374226 describes a system that can dynamically change the speech recognition grammar. For example, when an email program enters compose mode, a new grammar setting is dynamically activated. On the one hand this involves an improved use of device resources, but it also includes the serious disadvantage that the device changes its "passive vocabulary". This can frustrate the user, because a user who has learned that the device understands a certain expression when running other applications may be confronted with a device that is deaf to that input.

Known systems suffer from the fact that users are not very keen on using voice-activated features. Another problem with the state of the art is that users are not always aware of the operational status of voice-activated browsing systems.

Although many standards are being developed on how to write multimodal applications, there are no standards on how the application interface should be designed so that the user is made aware, as easily as possible, that voice input can be used.

In particular, it would be desirable for a user to know which particular voice input is allowed at different times or under certain conditions in devices and applications.

However, once a user has successfully used the speech recognition system, the user is also likely to continue using it. In other words, the main obstacle lies in starting to use voice control.

The above problems have been addressed before, for example by audio prompts, but such prompts quickly become annoying and degrade the usability experience.

Moreover, due to system load or the behavior of applications, all voice control options may not always be available, which is very difficult to convey to a user using the prior art.

All of the above approaches to a multimodal browsing architecture share the drawback that they are not suitable for use in terminal devices such as mobile phones or handheld computers because of their limited computational power, limited resources and low battery capacity.

It would therefore be desirable to have a multimodal browsing system that is voice activated and provides good user-friendliness.

According to a first aspect of the present invention, there is provided a method for multimodal interactive browsing, comprising: activating a multimodal user interaction provided with at least one key input option and at least one voice input option, displaying the at least one key input option, checking whether at least one condition affecting the voice input option is met, and, in accordance with that condition, providing the voice input options and displaying indications of the provided voice input options.

Activation of a multimodal user interaction provided with the at least one key input option and conditionally at least one voice input option may be provided by at least switching on the device or activating respective menus or respective settings.

In the multimodal browsing, key input options are provided unconditionally and the at least one voice input option is provided conditionally. The at least one voice input option is not provided if at least one condition is met that may interfere with the voice input. The condition may be, for example, noise or too low a signal-to-noise ratio at the audio input. The condition may be, for example, insufficient processing power or a low battery condition. The condition may be too low a voice transmission capability, for example for a distributed voice input / recognition system. The condition may be limited device resources. It should be noted that conditions affecting the speech recognition feature may also be caused by a combination of the above conditions.
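
Purely as an illustration of how several such conditions might be combined into a single availability decision, the following sketch uses hypothetical names and thresholds that are not taken from the patent:

```python
# Illustrative sketch only (names and thresholds are hypothetical, not from the patent):
# combine several measured conditions into a single "voice input available" decision.

from dataclasses import dataclass

@dataclass
class InputConditions:
    snr_db: float            # signal-to-noise ratio at the audio input
    cpu_load: float          # fraction of processing power in use, 0.0-1.0
    battery_level: float     # remaining battery, 0.0-1.0
    uplink_kbps: float       # transmission capability for a distributed recognizer

def voice_input_available(c: InputConditions) -> bool:
    """Return True only if no condition interferes with voice input."""
    return (
        c.snr_db >= 10.0 and         # too much noise blocks recognition
        c.cpu_load <= 0.9 and        # too little free processing power blocks it
        c.battery_level >= 0.05 and  # a nearly empty battery blocks it
        c.uplink_kbps >= 9.6         # too low a link rate blocks distributed recognition
    )

if __name__ == "__main__":
    print(voice_input_available(InputConditions(15.0, 0.4, 0.8, 64.0)))  # True
    print(voice_input_available(InputConditions(4.0, 0.4, 0.8, 64.0)))   # False: too noisy
```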

The at least one key input option is displayed on the display of the electronic device or mobile terminal device, as in the case of conventional devices and conventional browsing.

The method includes checking whether at least one condition affecting the speech input is met and, if none of the conditions are met, providing the at least one speech input option and displaying indications of the speech input options on the display. The check can be carried out, for example, every second, at short intervals or continuously. The check can also be performed in an event-controlled manner, in which case the check is only performed when an event is detected that indicates that voice input may not be possible.

If no such condition is met, the method may provide at least one voice input option and display indications of the at least one available voice input option on the display. It is also possible to display a description, representation or indication that a voice input option is present and that the voice input can actually be performed if no such conditions are met. The first part describes the principle that speech input can be performed, or is in the passive vocabulary of the speech recognition engine, and the second part indicates that the speech recognition engine is active.

It is also possible to display an indication of the tested condition that is actually met and which interferes with the voice input option. This can be implemented, for example, as any type of icon or text indicating what type of condition interferes with the voice input and how it can be eliminated.

In multimodal applications in which voice input can be provided in addition to visual input (using a keypad), the user must be aware of when voice inputs are possible and what the allowed inputs are. The method provides a clear way for the user to know exactly when speech recognition is activated and which speech commands are voice-activated at what time.

Event mechanisms may also be used by the system to determine situations when speech recognition is not available for an unexpected reason or when the application designer has specified that a command or set of commands is speech-enabled. All commands that are voice-activated at any moment will be marked in a suitable visual way such as, for example, coloring, to indicate to the user both the moment of speaking and the allowed words.

The present invention proposes to dynamically represent elements that can be voice controlled in terms of visual keywords or visual cues, depending on the availability of voice control for each item. For example, if the speech recognition engine is temporarily unavailable or only certain options are available at any point in the application, then only those options are highlighted on the screen.

It may also be marked when voice input is temporarily unavailable. It is also possible to mark only the entries that cannot be voice activated. This is a kind of reverse approach, and any transition between marking the voice-operable input options directly and marking the non-voice-operable input options can be chosen, depending on the number of markings required. In direct notation this can be implemented as, for example, green: voice operable and black: not voice operable; in reverse notation as red: not voice-activated input options and black: voice-activated input options.

The present invention proposes visual keywords or cues to indicate to the user what can be said and when the voice-activation is on or off. If a visual command is voice-activated, the command itself is marked, for example with a different color or a respective icon, compared to commands that are not voice-activated. When the voice-activation is off, the color or the icon of the command is dynamically changed to black, and when the voice-activation is turned on again, the color or icon changes back. The marking immediately indicates to the user what can be said and when. The method can be combined with an input prediction method to sort frequently used input options at the top of the list.
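
A minimal sketch of this dynamic marking, under the assumption of a hypothetical rendering step that pairs each command label with a cue color (the function and menu names below are illustrative only, not from the patent):

```python
# Minimal sketch (hypothetical API): render menu commands with a visual cue that
# reflects whether each one is currently voice-activated. Following the direct
# notation above, green marks commands the recognizer accepts right now and black
# marks key-only commands.

def render_menu(commands, voice_enabled, recognizer_active):
    """Return (label, color) pairs for display."""
    rows = []
    for label in commands:
        if recognizer_active and label in voice_enabled:
            rows.append((label, "green"))   # can be spoken now
        else:
            rows.append((label, "black"))   # key input only at the moment
    return rows

menu = ["Menu option 1", "Menu option 2", "Menu option 3", "Menu option 4"]
speakable = {"Menu option 1", "Menu option 2", "Menu option 4"}

for label, color in render_menu(menu, speakable, recognizer_active=True):
    print(f"{color:>5}  {label}")
```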

Reasons why the voice-activation of a command may change while the user stays on the same screen may be, for example:

-System error: access to the speech recognizer is unexpectedly blocked,

-Change of environment: the device detects too much background noise for the recognition to work properly,

-The system is currently performing some action during which it cannot listen at the same time, due to system or application limitations or exhausted system resources, e.g. fetching data for the user, and

-The choice of the application designer, as described in more detail in the paragraphs below.

Different applications may choose different recognition grammars and vocabularies to enable speech in different ways, and the use may even vary within one application. For example, if the user can perform several different actions on a screen (each involving two or three menu selections) and the order does not matter, it is reasonable to allow the user to say any of the options. On the next screen, there may again be several actions, but this time the order is not entirely free. It is then best to guide the user's voice input by highlighting the actions at their appropriate time, clarifying the order of actions with the visual voice-activation cue.

However, in an entirely eyes-free situation, where voice is the only available modality, the present invention cannot be used as the only cue for the user. Some audible keywords will be required to indicate to the user when the user can speak (and/or what can be said). One way to indicate that speech recognition is actually available may be implemented with a vibrating alarm prompt. The vibration alarm prompt may include a single vibration as a start signal and a short double vibration as a stop signal.
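
A small sketch of such a vibration prompt, in which the vibrate call merely stands in for whatever vibration-alarm interface the terminal actually offers (the durations are assumptions):

```python
# Sketch under stated assumptions: 'vibrate' is a placeholder for the terminal's
# vibration-alarm API (not defined by the patent). A single vibration signals that
# the recognizer has started listening, a short double vibration that it has stopped.

import time

def vibrate(duration_s: float) -> None:
    print(f"[vibrate {duration_s:.1f}s]")   # stand-in for the real actuator call

def signal_listening_started() -> None:
    vibrate(0.3)                                   # single vibration: you may speak now

def signal_listening_stopped() -> None:
    vibrate(0.1); time.sleep(0.1); vibrate(0.1)    # short double vibration: stopped

signal_listening_started()
signal_listening_stopped()
```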

In an exemplary embodiment, the displayed representations of the voice input options include keywords. The keywords may represent the available voice input or control options. The keywords may include any type of cues or hints for actual voice inputs that cannot be written out (such as a whistle, humming or a similar sound).

In another exemplary embodiment, displaying the indications of the speech input options on the display further includes indicating whether speech recognition is actually possible. As already mentioned above, this is the recording or recognition state of the speech recognition engine. It can be depicted as a 'recording' or 'recognition' sign.

In another exemplary embodiment of the present invention, displaying the indications of the voice input options comprises displaying the voice input options themselves. That is, the input options are depicted as the words, or abbreviations of the words, to be spoken for the voice input. The phrase "input option" has been carefully chosen so as not to limit the indication or the input option to any particular form.

In another exemplary embodiment of the present invention, the displaying of the voice input options on the display is provided with hysteresis. The use of hysteresis helps to avoid rapid changes in the indication of the availability of the voice input options when one of the checked conditions is near a threshold between interference and non-interference with the voice input feature. The hysteresis may be implemented in the test, in a program that performs the test, or in an application that performs the display.
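
As an illustration of this hysteresis behaviour, the following sketch uses an assumed noise measure and two hypothetical thresholds; the indication only changes when a threshold is clearly crossed:

```python
# Illustrative hysteresis sketch (thresholds are hypothetical): the indication only
# switches off when the noise condition clearly exceeds the upper threshold and only
# switches back on when it has clearly fallen below the lower one, so small
# fluctuations around a single threshold do not make the indication flicker.

class HysteresisIndicator:
    def __init__(self, off_above: float = 12.0, on_below: float = 8.0):
        self.off_above = off_above
        self.on_below = on_below
        self.voice_shown = True

    def update(self, noise_level: float) -> bool:
        if self.voice_shown and noise_level > self.off_above:
            self.voice_shown = False
        elif not self.voice_shown and noise_level < self.on_below:
            self.voice_shown = True
        return self.voice_shown

h = HysteresisIndicator()
for level in [5, 11, 13, 11, 9, 7]:
    print(level, h.update(level))   # stays off between 8 and 12 once switched off
```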

In another exemplary embodiment of the present invention, displaying the indications of the voice input options on the display is provided with a backlog function. As in the case of the hysteresis, the backlog function can be used (e.g. even prior to the hysteresis) to detect and suppress fast-changing conditions that repeatedly cross a threshold associated with a condition, so as to prevent the user from being confused by rapidly changing voice input capability or voice input options. The backlog function may be implemented by a store for the last 'n' seconds of test results, the voice input option being deactivated as long as a single "over threshold" entry exists in the backlog file. As in the case of the hysteresis, the backlog function may be implemented in the display application or in the checking application. In both cases, the information conveyed to the user is kept independent of small and fast changes near the threshold.
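
A minimal sketch of the backlog idea, assuming one test result per second and a hypothetical threshold; the voice input option stays deactivated as long as any over-threshold entry remains in the window:

```python
# Minimal backlog sketch (window length and threshold are assumptions): the last
# n seconds of test results are kept, and the voice input option stays deactivated
# as long as any single "over threshold" entry remains in the backlog.

from collections import deque

class Backlog:
    def __init__(self, n_seconds: int = 5, threshold: float = 12.0):
        self.results = deque(maxlen=n_seconds)   # one test result per second
        self.threshold = threshold

    def add_result(self, noise_level: float) -> None:
        self.results.append(noise_level)

    def voice_input_enabled(self) -> bool:
        return all(level <= self.threshold for level in self.results)

b = Backlog()
for level in [5, 6, 14, 6, 5, 5, 5, 5]:
    b.add_result(level)
    print(level, b.voice_input_enabled())   # stays False until the spike ages out
```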

According to another aspect of the present invention, there is provided a software tool comprising program code means for performing the method of the preceding description when the program product is run on a computer, a network device or a mobile terminal device.

According to another aspect of the present invention, there is provided a computer program product downloadable from a server for performing the method of the preceding description, the computer program product comprising program code means for performing all the steps of the method of the preceding description when the program is run on a computer, a network device or a mobile terminal device.

According to another aspect of the invention, there is provided a computer program product comprising program code means stored on a computer readable medium for performing the methods of the preceding description when the program product is executed on a computer, a network device or a mobile terminal device.

According to another aspect of the present invention, a computer data signal is provided. When the computer data signal is embodied as a carrier wave and the computer program is executed on a computer, a network device or a mobile terminal device, it represents a program that causes the computer to perform the steps of the method included in the previous description.

The computer program and the computer program product may be distributed to different parts and devices of the network. The computer program and the computer program product may run on different devices, for example a terminal device and a remote speech recognition engine of the network. The computer program and the computer program product may therefore differ in capability and source code.

According to another aspect of the present invention, a mobile terminal device for performing multimodal interaction is provided. The terminal device comprises a central processing unit, a display, a key based input system, a microphone and data access means.

The central processing unit (CPU) is provided for running and executing applications on the mobile terminal. The display is coupled to the CPU to display visual content received from the CPU. The key based input system is coupled to the CPU to provide a key input feature offering the key input options displayed on the display. The microphone is coupled to the CPU to provide a conditional voice input feature. The data access means is coupled to the CPU to process data and exchange data necessary for the operation of the CPU. In the simplest case the data access means is a storage device; in more complex embodiments the data access means may comprise, for example, a modem for network access.

The CPU is configured to perform multimodal browsing through the display, the key based input system and the microphone. The CPU is configured to continuously monitor conditions interfering with the voice input, to provide the voice input feature, and, if no such condition is met, to display an indication of the voice input options of the voice input feature on the display.

According to another aspect of the present invention, there is provided a speech recognition system capable of multimodal user interaction. The speech recognition system includes at least one central processing unit, a display, a key-based input system, a microphone and a data bus. The display is coupled to the central processing unit to be controlled by the central processing unit (CPU). The key-based input system is operably connected to the central processing unit to provide a key input feature that provides key input options that can be displayed on the display. The microphone is operably connected to the at least one CPU to provide an audio-to-electric converter to form a voice input accessible to the CPU. The data bus is operably connected to the at least one CPU to process data and exchange data required for operation of the at least one CPU.

The at least one CPU includes a first central processing unit and a second central processing unit. The first central processing unit of the at least one CPU is configured to control multimodal interactions through the display, the key based input system and the microphone. The first central processing unit is further configured to monitor conditions affecting the voice input and to control the display of the voice input options of the voice input feature on the display according to the condition. The second central processing unit of the at least one CPU is configured to provide the voice input feature.

In another exemplary embodiment of the present invention, the first central processing unit and the second central processing unit of the at least one CPU are included in the same device.

In another exemplary embodiment of the system the first central processing unit and the second central processing unit of the at least one CPU are included in different interconnected devices. The interconnect may be provided by an audio telephone connection. The interconnection may be provided by a data connection such as a General Packet Radio Service (GPRS), the Internet, a Local Area Network (LAN), or the like.

In another exemplary embodiment the mobile electronic device further comprises a mobile phone.

In the following, the invention will be described in detail with reference to the accompanying drawings.

Figure 1 is a flowchart of a method for dynamically presenting a voice-enabled state to a user in multimodal mobile applications in accordance with an aspect of the present invention.

Figure 2 is an example of an electronic device capable of dynamically indicating a voice-enabled state to a user for multimodal browsing.

Figure 3 is an example of a display that includes different indications of visual input options and their actual possible input states.

Figures 4A and 4B are examples of distributed speech recognition systems that can dynamically indicate a voice-enabled state to a user for multimodal browsing.

Figure 1 is a flowchart of a method for dynamically presenting a voice-enabled state to a user in multimodal mobile applications in accordance with an aspect of the present invention. The method begins with the activation of multimodal browsing (4). The expression 'multimodal browsing' is used to describe the possibility of interacting with the device in different modes, i.e. the device may offer different modes, for example a visual mode or an audible mode. Multimodal browsing may also include different input modes, such as cursor or menu keys and/or alphanumeric keyboards, speech recognition or eye tracking. In the figures, a system with key and voice input capabilities has been selected as an example for illustrating the features of the present invention. After, or concurrently with, the activation of the multimodal browsing, monitoring or investigation of the available input capabilities is initiated. The monitoring may be implemented by directly and repeatedly examining the conditions affecting the speech recognition. The monitoring may also be implemented by indirect investigation of the parameters that affect the speech recognition, by implementing sub-algorithms in each application that inform the speech input application, with a signal or message, that speech input is (possibly) not possible. This approach can be described as an event-based approach.

Possible conditions are, for example, the processing power actually available. In the case of a distributed voice input system, the condition may be connection properties such as bandwidth or signal-to-noise ratio. Other conditions include noise or background noise that affects the speech recognition capabilities.

From the above exemplary conditions, the extent to which voice or speech input can be recognized can be derived, and therefore whether or not the voice input feature is actually available can be determined. It should be noted that the ability to recognize certain voice inputs may vary depending on the condition. Background noise comprising, for example, a sound event that is detected every second does not necessarily interfere with very short speech inputs, whereas speech inputs longer than one second cannot be recognized because of the noise events.
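
To illustrate this point, the following sketch (with invented durations and noise intervals, not taken from the patent) keeps only those vocabulary entries whose expected spoken duration fits between periodic noise events:

```python
# Sketch of the idea above (durations and intervals are hypothetical): a command whose
# expected spoken duration fits between periodic noise events can stay voice-enabled,
# while longer commands are marked unavailable while that noise pattern persists.

def enabled_commands(commands, noise_event_interval_s: float):
    """commands: mapping of spoken phrase -> expected duration in seconds."""
    return [
        phrase for phrase, duration in commands.items()
        if duration < noise_event_interval_s
    ]

vocabulary = {"ok": 0.4, "click": 0.5, "menu option four": 1.6}
print(enabled_commands(vocabulary, noise_event_interval_s=1.0))  # ['ok', 'click']
```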

In a next step, visual content is depicted (12) according to the input capabilities monitored and evaluated above. That means that input options are depicted on the display of the electronic device or the mobile telephone device. Due to the usually limited information content of a small mobile display, it should be clear that usually not all possible input options can be depicted on the display at the same time. It should be noted that the unavailability of voice input can also be depicted.

The user can simply recognize the available and possible voice inputs and can use voice input or key input (16) to browse the elements depicted on the display. When performing multimodal browsing, new display content may be called up and depicted, and for the new content, voice input keywords or cues and the like are likewise provided dynamically by examining and evaluating the multimodal browsing conditions (i.e. voice input / eye tracking / recognition conditions).

The method ends with the deactivation (18) of multimodal browsing. With the end of the multimodal browsing, the monitoring of the multimodal input conditions may also be stopped. A direct connection between boxes 8 or 12 and 18 can be omitted when the termination of the multimodal browsing is performed by user input. In the case of an automatic stop (e.g. a low battery power stop), the device can jump directly from 8 or 12 to 18.

As usability tests have indicated, the learning curve of users in using speech is such that users tend to choose voice interaction rather quickly and smoothly after the first successful attempt. However, there is a high threshold to overcome before the learning can begin. That is, users usually do not realize that voice input is available unless explicitly told so. Moreover, if they are not sure what they can say, it takes time and courage to try it. After a successful trial, many even begin to favor the voice input method when making everyday choices. After a failed trial, it may happen that users simply ignore any voice input capability.

Tasks in which voice can be used in visual applications can be divided into two categories:

1) Voice-Activation of Existing Visual Commands (Link Selection, Radio Buttons, etc.)

2) Enabling actions for which no visual equivalent exists (e.g. hotkeys: commands that allow the user to bypass hierarchical selections, or allowing the user to enter text by combining words, such as in dictation).

The present invention focuses primarily on category 1, showing the user which commands are voice-activated and when they are voice-activated at different points in time in the application. For category 2 tasks, the present invention can, with an appropriately selected implementation, indicate to the user when speech input is possible; but indicating exactly what may be said in such tasks is outside the scope of the present invention, except in the case of a combination with a speech input prediction system, where the boundaries between the two categories become blurred.

To lower the threshold for using voice input and multimodal browsing, a small demo version implemented in the electronic device or terminal can act as a kind of language lab, the telephone demonstrating a typical input scenario in a replayed dialogue of pre-recorded voice inputs and input operations. For example: "Say 'Fuelstate' to select the actual battery state: ... and the required information, 25% battery power, is read aloud", or "say 'Fuelstate' to select the actual battery state and the required information is depicted on the display", and both operations may involve their respective outputs.

A basic cursor-based voice navigation system with voice-recognizable words such as 'right', 'left', 'up', 'down', 'click', 'double click', 'click click', 'hold', 'delete' and 'select' may even provide voice access to menu structures that are not themselves speech-enabled. The indication of the voice-activated voice navigation system may be provided by mouth icons associated with the respective action icons or by mouth-shaped cursors. When selecting a game application by browsing through the menu (saying "up-up-up-click" or "game"), the possible voice inputs for selecting the game "Snake" ("down-click" or "snake") may be highlighted with a teeth/mouth icon or a snake icon.
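
As a sketch of such a cursor-based voice navigation vocabulary, the mapping below uses placeholder action names that a menu framework might expose; it is illustrative only:

```python
# Illustrative sketch only: mapping the basic navigation vocabulary to cursor actions
# so that a menu that is not itself speech-enabled can still be driven by voice. The
# action names are placeholders for whatever the terminal's menu framework exposes.

CURSOR_COMMANDS = {
    "right": "cursor_right",
    "left": "cursor_left",
    "up": "cursor_up",
    "down": "cursor_down",
    "click": "select_item",
    "double click": "open_item",
    "hold": "long_press",
    "delete": "delete_item",
    "select": "select_item",
}

def handle_utterance(utterance: str) -> str:
    action = CURSOR_COMMANDS.get(utterance.lower())
    return action if action else "not_recognized"

for word in ["up", "up", "up", "click"]:   # e.g. navigating to an application and opening it
    print(word, "->", handle_utterance(word))
```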

Figure 2 is an example of a terminal or electronic device capable of dynamically indicating a voice-enabled state to a user for multimodal browsing. The device is shown with a user interface as known from a mobile phone. The mobile device is capable of performing multimodal interactive browsing and includes a user interface having input and output means such as a display 82, keys 84 and 84', a microphone 86 and a loudspeaker 88. The user interface can be used for multimodal browsing including audio and key inputs and audio and display outputs. All elements of the user interface are connected to a central processing unit (CPU) 80 to control the user's interaction with the device.

The central processing unit is also connected to data access means 90 to process data and to exchange data required for the operation of the CPU 80 or for applications running on the CPU 80. The CPU 80 is configured to perform multimodal browsing via the display 82, the key based input systems 84, 84', the microphone and possibly the loudspeaker 88. The availability or operability of the multimodal browsing depends on the parameters or on the determined conditions. The CPU 80 may provide the multimodal browsing capability, for example, by running speech recognition applications on the device.

The CPU 80 is further connected to data access means, for example, to access data stored in an embedded storage device (not shown) or to access data via a network connection 92, in order to provide the multimodal browsing feature.

The CPU 80 is further configured to monitor the conditions to continuously determine the availability of the voice input feature.

The monitoring may be applied at short intervals or continuously, for example every second, depending on the type of conditions or parameters to be monitored or investigated.

The determined availability of the voice input feature is then indicated visually on the display.

If the multimodal browsing is constant, independently of any external or internal limitations, the invention cannot be applied in a meaningful way: if there are no changing parameters affecting the multimodal browsing, and no change in the vocabulary or the voice input capability can occur, it is useless to monitor these parameters.

Figure 3 is an example of a display that includes different indications of visual input options and their actual possible input states. A display 58 of a mobile device capable of multimodal browsing operation is shown. A light emitting diode (LED) 60 is disposed on the right side of the display 58. The LED can be used to indicate that the speech recognition engine or module is actually active or in a receive mode. A glowing, blinking or flashing LED 60 may indicate that a user can speak to perform user input or user selection.

A typical list of selectable menu points "Menu option 1-4" 62 is shown on the display. Icons 64 and 68 are displayed to indicate the possible input modes associated with each of the menu options 62. The "menu options 1, 2 and 4" are provided with a mouth icon indicating that these input options can be selected by voice input. The "menu option 3" is provided with a finger icon indicating that the only available input option for this menu option is pressing a key.

To indicate that the entry at the cursor can actually be selected by a voice input such as 'OK', 'click', 'double click', 'click click' or 'select', or by pressing the 'OK' button, the "Menu option 2" is underlined.

The "Menu option 2" is displayed in bold letters to indicate that the "menu option 2" is selectable by voice inputting the words "menu option 2". The word 'option' of the 'Menu option 1' is displayed in bold letters to indicate that the 'menu option 1' is selectable by voice input of the words 'option'. The number '4' and syllable 'men' of the 'Menu option 4' means that the 'menu option 4' is a phrase based on the words 'Men four' or the abbreviation. It is indicated by bold letters to indicate that it is selectable by input.

Icons 66 and 70 at the bottom of the display 58 may be used to indicate whether the speech recognition engine or module is actually activated and in receive mode, or not in receive mode. The icon 66, an open mouth, may indicate that the user can speak to perform user input or user selection. The icon 70, closed lips sealed with a fingertip, may indicate that the voice input option is not actually available.

The icons 66 and 70 and the icons 64 and 68 may be used to complement each other, or one set may be omitted, since they provide redundant information.

In addition to the icons, the following means can be used to indicate when a user can speak:

-Spoken prompts can be played back to the user, asking them to speak ("Please select / speak category")

-Playing an earcon (an audible icon, e.g. a beep) alone or at the end of the prompt to indicate that the user can start speaking

-The user can be allowed to control the speaking moment by clicking a specific button (a so-called push-to-talk or "PTT" button) to activate recognition

In order to indicate what the user can say, the following means may additionally be used:

-Command lists are dictated to the user in the prompt ("speak 'Next', 'Previous', 'Back', 'Exit', or 'Help'")

-The prompt is designed to provide implicit guidance to the user ("Do you want to go next or previous?")

-The prompt provides an example of what can be dictated ("Please select the date and time, for example 'Monday 3 o'clock'")

The dictated prompt is particularly useful at the start of a session to remind the user about voice interaction. However, since humans can visually grasp the contents of a small mobile screen faster than it takes to listen to a sentence, prompts easily tend to sound long and boring. Although barge-in (the user interrupting the system prompt by speaking) is usually allowed in well-developed voice applications, it is considered rude in human-to-human conversation, so it can feel uncomfortable to speak before the system stops. A serious problem with dictated prompts is that if the user is not focused, the information in them is usually lost beyond recovery. In addition, since almost any computer-generated monologue lasting longer than seven words or three seconds is easily perceived as boring or annoying, long instruction lists are not useful, as they increase the user's memory load and boredom.

In summary, while prompts are useful for making the situation more conversational, they tend to be too long and are only available for a short time. Audible icons are short, but they are also temporary signals. Visual cues on the screen indicating when voice is allowed and when it is not, and exactly what can be spoken, would be an easy and obvious way to indicate voice-activation to the user. Indicating when voice is allowed is also an easy way for users to learn about the barge-in feature, encouraging them to interrupt or "override with voice" possible prompts.

While allowing the user more control over the interaction, push-to-talk buttons are also not entirely without problems. The device must have a separate button for voice activation, or the user must learn separately that a button is used as a push-to-talk button in certain situations. In some mobile situations, even one button can be cumbersome to use, for example while riding in the back seat of a motorcycle.

Figures 4A and 4B are examples of distributed speech recognition systems that can dynamically indicate a voice-enabled state for multimodal browsing to a user.

Figure 4A is an example of a distributed speech recognition system capable of dynamically presenting a voice-enabled state for multimodal browsing to a user, the distributed speech recognition system being integrated into a single device 77. The term "distributed speech recognition" is used here to indicate that the multimodal browsing and the speech recognition are performed at least in different processing units of the single device 77.

The mobile device 77 includes a speech recognition system capable of performing multimodal interactive browsing and a user interface with input and output means such as a display 82, keys 84 and 84', a microphone 86 and a loudspeaker 88. The user interface can be used for multimodal browsing including audio and key inputs and audio and display outputs. All elements of the user interface are connected to the central processing unit (CPU) 80 to control the user's interaction with the device.

The speech recognition system includes at least one central processing unit 80, a display 82, key-based input systems 84, 84', a microphone 86 and a data bus 91. The display is connected to the central processing unit to be operated by the CPU 80. The key-based input system 84, 84' is operably connected to the central processing unit 80 to provide a key input feature that provides key input options that can be displayed on the display 82.

The microphone 86 is operatively coupled to the at least one CPU 80 to provide an audio-to-electric converter that forms a voice input accessible to the CPU 80. The data bus 91 is operably connected to the at least one CPU 80 to process data and exchange data necessary for the operation of the at least one CPU 80. The data bus 91 also operably connects the at least one CPU 80 to an internal memory 83 to provide data access to stored data required to provide the key input feature and/or the voice input feature. The internal memory 83 may store, for example, the different states and combinations of states of the device in which the voice input feature is not accessible or possible.

The at least one CPU 80 includes a first central processing unit 81 and a second central processing unit 81'. The first central processing unit 81 of the at least one CPU 80 is configured to control multimodal interactions through the display 82, the key based input systems 84, 84' and the microphone 86. The first central processing unit 81 is further configured to monitor conditions affecting the voice input and to control the display of the voice input options of the voice input feature on the display 82 according to the monitored condition.

Figure 4B is an example of a distributed speech recognition system, distributed between at least two devices, capable of dynamically presenting to a user a voice-enabled state for multimodal browsing. Distributed speech recognition has the advantage that the resources required for speech recognition can be used economically, for example in a small portable device 78.

In order to provide a distributed system, the CPU 80 is distributed between the two devices. The first central processing unit 81 and the second central processing unit 81' of the at least one CPU 80 are included in different interconnected devices 78 and 79. The interconnection 97 between the two devices (and thereby between the first central processing unit 81 and the second central processing unit 81') may be provided, for example, by a telephone connection. The interconnection may also be provided by a data connection such as a General Packet Radio Service (GPRS), the Internet, a Local Area Network (LAN), or the like.

The first central processing unit 81 may be further configured to monitor the conditions alone to continuously determine the availability of the voice input feature. The monitoring can be applied at short intervals or continuously, for example every second, depending on the type of conditions or parameters to be monitored or investigated.

The main advantage of the present invention is that it can be applied to any type of mobile electronic device regardless of the features used. A user who always uses an electronic device under the best voice control or multimodal browsing conditions will not be aware of the existence of the present invention. The present invention can be applied to any type of voice control or voice input used in technical applications. There is also the possibility of applying the present invention to non-mobile systems with no restrictions on resources. In non-mobile systems, the present invention can be used, for example, to distinguish words that can be recognized with a near 100% probability from words that can only be recognized with low recognition rates and are therefore not indicated as available (or are indicated as requiring more training).

The visible keyword or cue selected for marking the voice-activation may also be implemented by other means, such as a color scheme or underscoring. However, underscores can easily be confused with hyperlinks. Color would be a good choice, as color displays are becoming more and more common. Red is typically used to mark active recording in audio applications, which may make it a suitable choice to indicate that voice-activation is on. Some traffic light scheme may also be adopted. Animated icons, such as ant columns, animated sound spectrum monitors and talking mouths, can help to visualize that longer voice inputs are possible for the depicted elements.

The color scheme should also be learned, although only two colors are used, one representing speech-on and the other representing speech-off indications. A small legend describing the color usage can be visualized on the initial screen of the application.

Instead of color, the voice-activated commands can be marked in some other way, for example by drawing a small speech bubble around the command. However, in order to make the operation method as clear to the user as possible, the visible cue should be directly coupled to the command.

Dynamically changing the visible cue on the same page can be done with a suitable event mechanism. In the same way that the browser can highlight visible symbols in an XHTML application when a suitable 'onclick' or 'onfocus' event is detected, new events can be defined for cases requiring a change of the visible speech-enabled cue. If a multimodal mobile browser detects such events, it changes the color or other selected visible cue of the corresponding GUI elements as required.
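
A hedged sketch of such an event mechanism, with invented event and element names (nothing here is prescribed by the patent): a small dispatcher registers GUI elements and recolors their visual cue when a speech-enabled or speech-disabled event fires.

```python
# Illustrative sketch only: a minimal dispatcher in the spirit of 'onclick'/'onfocus'
# handling. The 'speechenabled' / 'speechdisabled' event names and element ids are
# invented for illustration; a real browser would use its own event framework.

class MultimodalBrowser:
    def __init__(self):
        self.elements = {}                         # element id -> current cue color

    def register(self, element_id: str) -> None:
        self.elements[element_id] = "black"        # black: not voice-activated

    def dispatch(self, event: str, element_id: str) -> None:
        if event == "speechenabled":
            self.elements[element_id] = "green"    # green: can be spoken now
        elif event == "speechdisabled":
            self.elements[element_id] = "black"

browser = MultimodalBrowser()
browser.register("menu_option_2")
browser.dispatch("speechenabled", "menu_option_2")
print(browser.elements)   # {'menu_option_2': 'green'}
```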

A traffic light scheme can also be used to indicate whether speech recognition is enabled or disabled for voice-enabled tasks that have no visible equivalent. This is relatively easy to implement with events that affect the entire screen at the same time. Other such methods may be to flash the display illumination, to invert the depiction mode, to selectively animate the voice-operable menu points, or to let a small ball jump from syllable to syllable, as known from 'karaoke' videos.

Additional features that may be combined with the present invention include, for example, input prediction, training conversations, voice input suggestions via text or voice output, icon based menu structures for illiterate persons, trainable voice input, and read-out user manuals that use "read out" and "read in" keys.

This application contains the description of implementations and embodiments of the present invention with the aid of examples. It will be understood by those skilled in the art that the present invention is not limited to the details of the embodiments presented above, and that the present invention may be embodied in other forms without departing from the characteristics of the present invention. The embodiments presented above should be considered illustrative and not restrictive. Accordingly, the possibilities of implementing and using the invention are limited only by the appended claims. Therefore, the various options for implementing the invention as determined by the claims, including equivalent embodiments, also fall within the scope of the invention.

Claims (14)

  1. A method of representing voice-activated input for multimodal interaction in an electronic device having a user interface, the method comprising:
    activating a multimodal user interaction of said user interface provided with at least one key input option and at least one voice input option;
    displaying the at least one key input option on a display of the electronic device;
    checking whether at least one condition affecting voice input is met; and
    providing, in accordance with the condition, the at least one voice input option comprising keywords, and displaying indications of the voice input options on the display.
  2. delete
  3. The method of claim 1, wherein displaying the indications of speech input options on the display further comprises indicating whether speech recognition is actually possible.
  4. The method of claim 1 or 3, wherein displaying the indications of speech input options comprises displaying the speech input options.
  5. A method according to claim 1 or 3, wherein the displaying of the voice input options on the display is provided with hysteresis.
  6. A method according to claim 1 or 3, wherein the displaying of the voice input options on the display is provided with a backlog function.
  7. A computer-readable medium having recorded thereon a computer program for executing the method of claim 1 or 3 on a computer or a network device.
  8. delete
  9. A computer-readable medium on which is recorded a program, downloadable from a server, for executing the method of claim 1 or 3 on a computer or a network device.
  10. An electronic device capable of performing multimodal interactive browsing,
    A central processing unit (CPU) 80;
    A display 82 connected to the CPU 80 to display visual content received from the CPU 80;
    A key-based input system (84, 84 ') operably connected to the CPU (80) to provide a key input feature that provides key input options displayed on the display;
    A microphone 86 operatively connected to the CPU 80 to provide a voice input feature; And
    A data bus 91 operably connected to the CPU 80 for processing data and exchanging data necessary for the operation of the CPU 80,
    The CPU 80 is configured to control multimodal interactions through the display 82, the key based input systems 84, 84 ′ and the microphone 86,
    The CPU 80 is configured to monitor conditions affecting the voice input, to provide the voice input feature according to the condition, and to display on the display 82 an indication of the voice input options comprising the keywords of the voice input feature.
  11. The electronic device of claim 10, further comprising a mobile communication device.
  12. In a speech recognition system capable of multimodal interaction and having a user interface,
    At least one central processing unit (CPU) 80;
    A display 82 connected to the CPU 80;
    A key-based input system (84, 84 ') operably connected to the CPU (80) to provide a key input feature that provides key input options displayed on the display;
    A microphone (86) operatively connected to the at least one central processing unit (80); And
    A data bus 91 operably connected to the at least one CPU 80 for processing data and exchanging data necessary for the operation of the at least one CPU 80,
    A first central processing unit 81 of the at least one CPU 80 is configured to control multimodal interactions through the display 82, the key-based input systems 84, 84' and the microphone 86, to monitor conditions affecting the voice input, and to control the display of voice input options including keywords of the voice input feature on the display 82 according to the conditions, and
    And a second central processing unit (81 ') of said at least one CPU (80) is configured to provide said speech input feature.
  13. 13. A speech recognition system according to claim 12, wherein the first central processing unit (81) and the second central processing unit (81 ') are included in the same device (77).
  14. 13. A speech recognition system according to claim 12, wherein the first central processing unit (81) and the second central processing unit (81 ') are included in different interconnected devices (78, 79).
KR1020057019000A 2005-10-06 2003-04-07 Method and device for providing speech-enabled input in an electronic device having a user interface KR100727548B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020057019000A KR100727548B1 (en) 2005-10-06 2003-04-07 Method and device for providing speech-enabled input in an electronic device having a user interface

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020057019000A KR100727548B1 (en) 2005-10-06 2003-04-07 Method and device for providing speech-enabled input in an electronic device having a user interface

Publications (2)

Publication Number Publication Date
KR20050111633A KR20050111633A (en) 2005-11-25
KR100727548B1 true KR100727548B1 (en) 2007-06-14

Family

ID=37286763

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020057019000A KR100727548B1 (en) 2005-10-06 2003-04-07 Method and device for providing speech-enabled input in an electronic device having a user interface

Country Status (1)

Country Link
KR (1) KR100727548B1 (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5892813A (en) 1996-09-30 1999-04-06 Matsushita Electric Industrial Co., Ltd. Multimodal voice dialing digital key telephone with dialog manager
US20010047263A1 (en) * 1997-12-18 2001-11-29 Colin Donald Smith Multimodal user interface
US6212408B1 (en) 1999-05-03 2001-04-03 Innovative Global Solution, Inc. Voice command system and method
US6532447B1 (en) 1999-06-07 2003-03-11 Telefonaktiebolaget Lm Ericsson (Publ) Apparatus and method of controlling a voice controlled operation

Also Published As

Publication number Publication date
KR20050111633A (en) 2005-11-25


Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
AMND Amendment
E601 Decision to refuse application
J201 Request for trial against refusal decision
AMND Amendment
B701 Decision to grant
GRNT Written decision to grant
FPAY Annual fee payment; Payment date: 20130522; Year of fee payment: 7
FPAY Annual fee payment; Payment date: 20140521; Year of fee payment: 8
FPAY Annual fee payment; Payment date: 20150518; Year of fee payment: 9
FPAY Annual fee payment; Payment date: 20160517; Year of fee payment: 10
FPAY Annual fee payment; Payment date: 20170522; Year of fee payment: 11
FPAY Annual fee payment; Payment date: 20180516; Year of fee payment: 12