US20210311701A1 - Technique for generating a command for a voice-controlled electronic device - Google Patents

Technique for generating a command for a voice-controlled electronic device

Info

Publication number
US20210311701A1
Authority
US
United States
Prior art keywords
command
electronic device
voice input
content
text
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US17/311,279
Inventor
Baran Cubukcu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vestel Elektronik Sanayi ve Ticaret AS
Original Assignee
Vestel Elektronik Sanayi ve Ticaret AS
Application filed by Vestel Elektronik Sanayi ve Ticaret AS filed Critical Vestel Elektronik Sanayi ve Ticaret AS
Assigned to VESTEL ELEKTRONIK SANAYI VE TICARET A.S. (assignment of assignors interest; see document for details). Assignors: CUBUKCU, Baran
Publication of US20210311701A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048 - Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0487 - Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser
    • G06F3/0488 - Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, using a touch-screen or digitiser, e.g. input of commands through traced gestures
    • G06F3/04883 - Interaction techniques based on graphical user interfaces [GUI] using a touch-screen or digitiser for inputting data by handwriting, e.g. gesture or text
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 - Sound input; Sound output
    • G06F3/167 - Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/221 - Announcement of recognition results

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A technique for generating a command to be processed by a voice-controlled electronic device is disclosed. A method implementation of the technique comprises receiving a voice input representative of a first portion of a command to be processed by the electronic device; receiving a selection of content displayed on a screen of the electronic device, the selected content being representative of a second portion of the command to be processed by the electronic device; and generating the command based on a combination of the voice input and the selected content.

Description

    TECHNICAL FIELD
  • The present disclosure generally relates to the field of electronic devices. In particular, a technique for generating a command to be processed by a voice-controlled electronic device is presented. The technique may be embodied in a method, a computer program and an electronic device.
  • BACKGROUND
  • Speech recognition techniques—also known as “speech-to-text” techniques—have been developed over recent decades to provide computer-implemented assistance for transcribing spoken language into text and have meanwhile been adopted in various fields of application. In particular, in recent years, speech recognition techniques have increasingly been employed for voice control of electronic devices, such as for voice control of household appliances or for the implementation of virtual assistants, i.e., software agents capable of performing tasks or providing services upon verbal request of a user. Known virtual assistants include Apple Siri, Google Assistant, Amazon Alexa and Microsoft Cortana, for example.
  • Voice control of electronic devices may generally reach its limits when keywords included in a voice command cannot be recognized unambiguously, so that the inputted command potentially contains unintended elements that may result in unwanted outcomes of the control being performed. Such situations may especially occur when the voice command contains terms of a different language than the default language of speech recognition, when the voice command contains terms that are not included in a vocabulary used for speech recognition, or when the voice command contains terms pronounced in an ambiguous manner by the user.
  • As an example, when the default language of speech recognition is English and the user attempts to input a Japanese expression as an element of a voice command (e.g., asking “what is [Japanese text]”), the recognition of the Japanese expression may fail, either due to a wrong pronunciation by the user or because recognition of the different language (which may even be based on a different character set) is not supported by the recognition engine. As another example, when a user attempts to input an uncommon name as an element of the voice command (e.g., asking “who is Vladimir Beschastnykh”), the recognition of the name may fail, either due to—again—wrong pronunciation by the user or because the name is not part of the vocabulary used for speech recognition. In a still further example, when a user attempts to input a term that—although included in the vocabulary—may result in an ambiguous transcription if pronounced unclearly, the recognition of the term may fail due to the unclear pronunciation by the user (e.g., asking “where is Vestel”, but recognizing “where is vessel”).
  • In view of these examples, it is evident that purely verbal input of a command into an electronic device may not always produce satisfactory results for voice control. It is thus an object of the present disclosure to provide a technique for generating a command to be processed by a voice-controlled electronic device that avoids one or more of these, or other, problems.
  • SUMMARY
  • According to a first aspect, a method for generating a command to be processed by a voice-controlled electronic device is provided. The method comprises receiving a voice input representative of a first portion of a command to be processed by the electronic device, receiving a selection of content displayed on a screen of the electronic device, the selected content being representative of a second portion of the command to be processed by the electronic device, and generating the command based on a combination of the voice input and the selected content.
  • The electronic device may be any kind of electronic device that is capable of being voice controlled. This may include consumer electronics devices, such as smartphones, tablet computers, laptops and personal computers, as well as household appliances, such as refrigerators, cookers, dishwashers, washing machines and air conditioners, for example, but is not limited thereto. The electronic device may comprise a microphone for receiving voice commands (or, more generally, voice input) and may execute an agent (e.g., a software agent) that may be configured to process the received voice commands and to take action in accordance therewith. In one implementation, the agent may be provided in the form of a virtual assistant that is capable of providing services in response to voice commands received from a user, i.e., upon verbal request of the user.
  • Instead of using entirely voice-based commands, according to the technique presented herein, the command to be processed may correspond to a command that is generated from a combination of voice input and content selected from a screen of the electronic device. The command may therefore be created from two types of inputs, namely a voice input representative of the first portion of the command as well as visual input selected from a display (corresponding to the selection of displayed content on the screen of the electronic device) representative of the second portion of the command to be generated. The full command may then be generated by combining the first portion and the second portion of the command. Once the full command is generated, the command may be processed by the electronic device. It will be understood that, when reference is made herein to the first portion and the second portion of the command to be generated, the terms “first” and “second” may merely differentiate the respective portions of the command to be generated but may not necessarily imply an order of (or a temporal relationship between) the respective portions of the command to be generated. It may thus be conceivable that the second portion is input before the first portion of the command and represents an initial portion of the command which is followed by the first portion of the command, or vice versa.
  • While performing speech recognition on voice input may suffer from ambiguous or incorrect recognition in case of unclear pronunciation or words unknown to the speech recognition engine, as described above, the selection of content on the display of the electronic device may generally provide a more accurate input method and may therefore be preferable as input method for portions of the command that are otherwise hardly recognizable from voice input. In particular, the visual selection of the content may be used for input of portions of the command that comprise a term that is of a different language than a default language of the speech recognition engine, a term that is not included in a vocabulary of the speech recognition engine and/or a term that likely results in an ambiguous transcription (e.g., a term whose average transcription ambiguity is above a predetermined threshold, as pronounced by the user, for example). By using the visual selection, the command may be created more precisely and the generation of improper command elements may generally be avoided. Unwanted outcomes of the voice control being performed may thus be prevented.
  • The command may correspond to any type of command that is interpretable by the electronic device. In particular, the command may correspond to a control command for controlling a function of the electronic device, such as a command for controlling the behavior of a household appliance or controlling a virtual assistant executed on the electronic device, for example. The command may correspond to a command that is input in response to activation of a voice control function of the electronic device and the command may thus reflect a command to be processed by the voice control function of the electronic device. The command may be input upon input of a hotword that activates a voice control function of the electronic device, for example. As an example, the command may correspond to a query to a virtual assistant executed on the electronic device, e.g., a query to request a service from the virtual assistant. Known hotwords for virtual assistants are “Hey Siri” in case of Apple Siri or “Ok Google” in case of Google Assistant, for example.
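  • Purely as an illustrative sketch, a hotword-gated command intake might be expressed as follows; the identifiers (HOTWORDS, listen, handle_command) are hypothetical stand-ins and are not part of the disclosure:

        # Illustrative sketch only: treat the voice input following a known
        # hotword as (the first portion of) a command to be processed.
        HOTWORDS = ("hey siri", "ok google")  # example hotwords named above

        def intake_loop(listen, handle_command):
            while True:
                utterance = listen()          # next transcribed utterance
                lowered = utterance.lower()
                for hotword in HOTWORDS:
                    if lowered.startswith(hotword):
                        # the remainder after the hotword is the command input
                        handle_command(utterance[len(hotword):].strip())
                        break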
  • It will be understood that the selection of the content on the screen of the electronic device may be made using any kind of input means, such as a mouse or keyboard in the case of a personal computer, for example. In one implementation, however, the screen may be a touch screen and the selection of the content may be made by a touch input on the touch screen. The touch input may correspond to a touch gesture specifying a display region on the screen where the content is to be selected. As an example, the touch input may correspond to a sliding gesture covering the content to be selected. This may involve sliding over the content (e.g., a text portion) to be selected or encircling/framing the content to be selected, for example.
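  • As a minimal sketch of how such a sliding or encircling gesture might be mapped to a display region, the region can be taken as the bounding box of the touched points; the data structures and the bounding-box heuristic below are assumptions, not taken from the disclosure:

        # Illustrative sketch: derive the selected display region from the
        # points touched during a sliding or encircling gesture.
        from dataclasses import dataclass

        @dataclass
        class Region:
            left: float
            top: float
            right: float
            bottom: float

        def region_from_gesture(points):
            """points: non-empty list of (x, y) touch coordinates."""
            xs = [x for x, _ in points]
            ys = [y for _, y in points]
            return Region(min(xs), min(ys), max(xs), max(ys))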
  • The content to be selected may correspond to a portion of text currently being displayed on the screen of the electronic device. The text portion may comprise selectable text (e.g., text that is markable/selectable using common user interface functions known for common copy/paste operations) or, otherwise, the text portion may comprise non-selectable text. In the latter case, the selected content may correspond to a selected display region on the screen that contains a non-selectable text portion, wherein the text portion may form part of a non-textual display element, such as an image displayed on the screen, for example. The content to be selected may not correspond to input from a keyboard displayed on the screen of the electronic device.
  • Prior to combining the voice input and the selected content (again, representing the first portion and the second portion of the command to be processed, respectively), both the voice input and the selected content may be converted into a same format, such as into (but not limited to) text, for example. To this end, the voice input may be transcribed into text using speech recognition. When the selected content corresponds to selectable text, the selected text may not need to be further converted. When the selected content corresponds to a display region containing non-selectable text (e.g., text contained in an image displayed on the screen), on the other hand, the selected display region may be subjected to text recognition in order to obtain a text representation of the selected content.
  • Therefore, in one variant, when the selection of the content comprises a selection of text (i.e., selectable text), combining the voice input with the selected content may comprise combining a transcription of the voice input with the selected text (e.g., concatenating the transcription of the voice input and the selected text). In another variant, when the selection of the content comprises a selection of a display region on the screen (e.g., corresponding to an image displayed on the screen containing text to be used as the second portion of the command), combining the voice input with the selected content may comprise performing text recognition on the selected display region to obtain text included therein as selected text, and combining a transcription of the voice input with the selected text (e.g., concatenating the transcription of the voice input and the selected text). In other words, when the selection of the content is made by a touch input specifying the display region, the electronic device may be configured to recognize what is written in the display region and may use the recognized text as the second portion of the command to be generated. In this way, any text portion displayed on the screen may generally be selected as second portion for the command to be generated. This may include text portions displayed in web browsers or messaging applications executed on a smartphone, for example, and a word or phrase to be used as second portion of the command may simply be selected by a touch on the word or phrase on the screen, for example.
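  • A minimal sketch of these two combination variants might read as follows; transcribe and ocr stand in for a speech recognition engine and a text recognition engine, respectively, and all names are hypothetical:

        # Illustrative sketch of command generation. A string selection is
        # treated as selectable text (first variant); anything else is treated
        # as a display region that is passed to text recognition (second variant).
        def generate_command(voice_audio, selection, transcribe, ocr):
            first_portion = transcribe(voice_audio)   # e.g. "what is"
            if isinstance(selection, str):
                second_portion = selection            # selectable text as-is
            else:
                second_portion = ocr(selection)       # display region -> text
            # combine the two portions, here by simple concatenation
            return f"{first_portion} {second_portion}"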
  • In one implementation, the language of the transcription of the voice input and the language of the selected text may be different. Also, a character set of the transcription of the voice input and the character set of the selected text may be different. Therefore, as an example, although both the language and the character set of the transcription of the voice input may be based on the English language, the user may select text that is displayed in the Japanese language as the second portion for the command to be generated. As a mere example, the user may say “what is” as voice input representing the first portion of the command and then select “[Japanese text]” on the screen representing the second portion of the command, so that the full command “what is [Japanese text]” is generated. In a similar use case, a user may use a camera application of the electronic device to capture an image of content of interest and select a region in the captured image to be used as the second portion of the command to be generated. For example, a user may capture a Japanese signboard, say “what is” and slide his finger over the Japanese text of the signboard on the captured image to generate a corresponding command to be processed by the electronic device.
  • In some implementations, the voice input may include an instruction to be processed by the electronic device, wherein the selected content may correspond to a parameter associated with the instruction. As an example, the instruction may correspond to a copying operation and the parameter associated with the instruction may correspond to an item to be copied. For example, if a user reads a webpage and would like to share a text portion of the webpage with friends, the user may say “copy the words” and select the desired text portion on the screen to generate a corresponding command. When processing the command, the electronic device may copy the selected text portion into the clipboard of the electronic device, ready to be pasted somewhere else in order to be shared with friends.
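  • A sketch of how such an instruction-plus-parameter command might be dispatched, under the assumption of a simple instruction match and a generic clipboard interface (both hypothetical):

        # Illustrative sketch: the voice portion carries the instruction and
        # the selected content carries the associated parameter.
        def dispatch(instruction, parameter, clipboard):
            if instruction.strip().lower() == "copy the words":
                clipboard.copy(parameter)   # ready to be pasted elsewhere
            else:
                raise ValueError(f"unsupported instruction: {instruction!r}")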
  • While it will be understood that receiving the voice input representative of the first portion of the command and receiving the selection of the content representative of the second portion of the command may be performed in the form of a standalone two-step input procedure, it may also be conceivable that the two-step input procedure is performed as a fallback procedure to a failed attempt to transcribe the command as a full voice command. In one variant, the selection of the content may thus be received upon failure to correctly transcribe voice input representing the content. Failure to correctly transcribe the voice input may be determined by the user upon review of the transcription of the voice input on the screen, for example.
  • If the first portion of the command represents an initial portion of the command that is to be input before the second portion of the command, the electronic device may also recognize that the voice input received in the first step may not yet represent a full command (e.g., saying “what is” without any further specification) and the electronic device may therefore be configured to wait for additional input from the user. Upon recognizing that the voice input representative of the first portion of the command does not represent a full command, the electronic device may thus wait for the selection of the content. In one such variant, the electronic device may actively prompt the user to perform the selection of the content on the screen when detecting that the full command is not yet available.
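  • A sketch of this waiting behavior might look as follows; the incompleteness heuristic (a transcription ending in a fragment such as “what is”) and the prompt_user/await_selection callables are assumptions:

        # Illustrative sketch: if the transcribed voice input does not yet form
        # a full command, prompt for and await a content selection on the screen.
        INCOMPLETE_ENDINGS = ("what is", "who is", "where is")

        def complete_command(transcription, prompt_user, await_selection):
            if transcription.strip().lower().endswith(INCOMPLETE_ENDINGS):
                prompt_user("Please select the rest of the command on the screen.")
                selected_text = await_selection()   # blocks until a selection is made
                return f"{transcription} {selected_text}"
            return transcription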
  • According to a second aspect, a computer program product is provided. The computer program product comprises program code portions for performing the method of the first aspect when the computer program product is executed on one or more computing devices. The computer program product may be stored on a computer readable recording medium, such as a semiconductor memory, DVD, CD-ROM, or the like.
  • According to a third aspect, a voice-controlled electronic device for generating a command to be processed by the electronic device is provided. The electronic device comprises at least one processor and at least one memory, wherein the at least one memory contains instructions executable by the at least one processor such that the electronic device is operable to perform the method steps presented herein with respect to the first aspect.
  • All of the aspects described herein may be implemented by hardware circuitry and/or by software. Even if some of the aspects are described herein with respect to the electronic device, these aspects may also be implemented as a method or as a computer program for performing or executing the method. Likewise, aspects described as or with reference to a method may be realized by components or processing means of the electronic device, or by means of the computer program.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • In the following, the present disclosure will further be described with reference to exemplary implementations illustrated in the figures, in which:
  • FIG. 1 schematically illustrates an exemplary hardware composition of a voice-controlled electronic device according to the present disclosure;
  • FIG. 2 illustrates a flowchart of a method which may be performed by the electronic device of FIG. 1; and
  • FIG. 3 illustrates an exemplary selection of content displayed on a screen of an electronic device according to the present disclosure.
  • DETAILED DESCRIPTION
  • In the following description, for purposes of explanation and not limitation, specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be apparent to one skilled in the art that the present disclosure may be practiced in other implementations that depart from these specific details.
  • FIG. 1 illustrates an exemplary hardware composition of the electronic device 100. The electronic device 100 comprises at least one processor 102 and at least one memory 104, wherein the at least one memory 104 contains instructions executable by the at least one processor such that the electronic device is operable to carry out the functions, services or steps described herein below. The electronic device 100 may be any kind of electronic device that is capable of being voice controlled. This may include consumer electronic devices, such as smartphones, tablet computers, laptops and personal computers, as well as household appliances, such as refrigerators, cookers, dishwashers, washing machines and air conditioners, for example, but is not limited thereto. The electronic device 100 comprises a microphone 106 for receiving voice commands (or, more generally, a voice input) and may execute an agent (e.g., a software agent) that may be configured to process the received voice commands and to take action in accordance therewith. In one implementation, the agent may be provided in the form of a virtual assistant that is capable of providing services in response to voice commands from a user, i.e., upon verbal request of the user. The electronic device 100 further comprises a screen 108 for displaying content that may be selectable for the user.
  • FIG. 2 illustrates a method which may be performed by the electronic device 100 according to the present disclosure. The method is dedicated to generating a command to be processed by the electronic device 100 and comprises receiving, in step S202, a voice input representative of a first portion of a command to be processed by the electronic device 100, receiving, in step S204, a selection of content displayed on a screen of the electronic device 100, the selected content being representative of a second portion of the command to be processed by the electronic device 100, and generating, in step S206, the command based on a combination of the voice input and the selected content. Finally, in step S208, the generated command may be processed by the electronic device 100.
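  • Expressed as a minimal sketch, the flow of FIG. 2 might read as follows; the device methods are hypothetical stand-ins for the machinery described herein:

        # Illustrative sketch of the method of FIG. 2; the step numbers in the
        # comments mirror the flowchart.
        def run(device):
            voice_input = device.receive_voice_input()        # step S202
            selection = device.receive_content_selection()    # step S204
            command = device.combine(voice_input, selection)  # step S206
            device.process(command)                           # step S208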
  • Instead of using entirely voice-based commands, according to the technique presented herein, the command to be processed by the electronic device 100 may correspond to a command that is generated from a combination of voice input and content selected from the screen 108 of the electronic device 100. The command may therefore be created from two types of inputs, namely a voice input representative of a first portion of the command as well as a visual input selected from a display (corresponding to the selection of displayed content on the screen 108 of the electronic device 100) representative of a second portion of the command to be generated. The full command may then be generated by combining the first portion and the second portion of the command. It will be understood that, when reference is made herein to the first portion and the second portion of the command to be generated, the terms “first” and “second” may merely differentiate the respective portions of the command to be generated but may not necessarily imply an order of (or a temporal relationship between) the respective portions of the command to be generated. It may thus be conceivable that the second portion is input before the first portion of the command and represents an initial portion of the command which is followed by the first portion of the command, or vice versa.
  • While performing speech recognition on voice input may suffer from ambiguous or incorrect recognition in case of unclear pronunciation or words unknown to the speech recognition engine, as described above, the selection of content on the display of the electronic device 100 may generally provide a more accurate input method and may therefore be preferable as input method for portions of the command that are otherwise hardly recognizable from voice input. In particular, the visual selection of the content may be used for input of portions of the command that comprise a term that is of a different language than a default language of the speech recognition engine, a term that is not included in a vocabulary of the speech recognition engine and/or a term that likely results in an ambiguous transcription (e.g., a term whose average transcription ambiguity is above a predetermined threshold, as pronounced by the user, for example). By using the visual selection, the command may be created more precisely and the generation of improper command elements may generally be avoided. Unwanted outcomes of the voice control being performed may thus be prevented.
  • The command may correspond to any type of command that is interpretable by the electronic device 100. In particular, the command may correspond to a control command for controlling a function of the electronic device 100, such as a command for controlling the behavior of a household appliance or controlling a virtual assistant executed on the electronic device 100, for example. The command may correspond to a command that is input in response to activation of a voice control function of the electronic device 100 and the command may thus reflect a command to be processed by the voice control function of the electronic device 100. The command may be input upon input of a hotword that activates a voice control function of the electronic device 100, for example. As an example, the command may correspond to a query to a virtual assistant executed on the electronic device 100, e.g., a query to request a service from the virtual assistant. Known hotwords for virtual assistants are “Hey Siri” in case of Apple Siri or “Ok Google” in case of Google Assistant, for example.
  • It will be understood that the selection of the content on the screen 108 of the electronic device 100 may be made using any kind of input means, such as a mouse or keyboard in the case of a personal computer, for example. In one implementation, however, the screen 108 may be a touch screen and the selection of the content may be made by a touch input on the touch screen. The touch input may correspond to a touch gesture specifying a display region on the screen 108 where the content is to be selected. As an example, the touch input may correspond to a sliding gesture covering the content to be selected. This may involve sliding over the content (e.g., a text portion) to be selected or encircling/framing the content to be selected, for example.
  • The content to be selected may correspond to a portion of text currently being displayed on the screen 108 of the electronic device 100. The text portion may comprise selectable text (e.g., text that is markable/selectable using common user interface functions known for common copy/paste operations) or, otherwise, the text portion may comprise non-selectable text. In the latter case, the selected content may correspond to a selected display region on the screen 108 that contains a non-selectable text portion, wherein the text portion may form part of a non-textual display element, such as an image displayed on the screen, for example. The content to be selected may not correspond to input from a keyboard displayed on the screen of the electronic device 100.
  • Prior to combining the voice input and the selected content (again, representing the first portion and the second portion of the command to be processed, respectively), both the voice input and the selected content may be converted into the same format, such as, but not limited to, text. To this end, the voice input may be transcribed into text using speech recognition. When the selected content corresponds to selectable text, the selected text may not require further conversion. When the selected content corresponds to a display region containing non-selectable text (e.g., text contained in an image displayed on the screen), on the other hand, the selected display region may be subjected to text recognition in order to obtain a text representation of the selected content.
  • Therefore, in one variant, when the selection of the content comprises a selection of text (i.e., selectable text), combining the voice input with the selected content may comprise combining a transcription of the voice input with the selected text (e.g., concatenating the transcription of the voice input and the selected text). In another variant, when the selection of the content comprises a selection of a display region on the screen 108 (e.g., corresponding to an image displayed on the screen 108 containing text to be used as the second portion of the command), combining the voice input with the selected content may comprise performing text recognition on the selected display region to obtain the text included therein as selected text, and combining a transcription of the voice input with the selected text (e.g., concatenating the transcription of the voice input and the selected text). In other words, when the selection of the content is made by a touch input specifying the display region, the electronic device 100 may be configured to recognize what is written in the display region and may use the recognized text as the second portion of the command to be generated. In this way, any text portion displayed on the screen 108 may generally be selected as the second portion of the command to be generated. This may include text portions displayed in web browsers or messaging applications executed on a smartphone, for example, and a word or phrase to be used as the second portion of the command may simply be selected by a touch on the word or phrase on the screen.
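The two combination variants may be summarized in a short sketch. The recognize_text() stand-in for the text-recognition step and the simple concatenation are illustrative assumptions rather than a definitive implementation; the description above does not prescribe a specific recognition engine.

```python
# Sketch of the two combination variants. recognize_text() is a stand-in
# for an arbitrary text-recognition engine operating on a selected display
# region.

def recognize_text(display_region_pixels) -> str:
    """Stand-in for text recognition (OCR) on a selected display region."""
    raise NotImplementedError("plug in a real text-recognition engine here")


def generate_command(voice_transcription: str, selection) -> str:
    """Combine the spoken first portion with the selected second portion.
    `selection` is either selectable text (str, variant 1) or the raw pixels
    of a selected display region (variant 2)."""
    if isinstance(selection, str):
        selected_text = selection                  # variant 1: selectable text
    else:
        selected_text = recognize_text(selection)  # variant 2: OCR first
    # Both portions are now in the same format (text) and can be concatenated.
    return f"{voice_transcription} {selected_text}".strip()


print(generate_command("where is", "VESTEL"))  # "where is VESTEL"
```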
  • In one implementation, the language of the transcription of the voice input and the language of the selected text may be different. Also, a character set of the transcription of the voice input and a character set of the selected text may be different. Therefore, as an example, although both the language and the character set of the transcription of the voice input may be based on the English language, the user may select text displayed in the Japanese language as the second portion of the command to be generated. As a mere example, the user may say “what is” as voice input representing the first portion of the command and then select a Japanese text portion on the screen representing the second portion of the command, so that a full command consisting of “what is” followed by the selected Japanese text is generated. In a similar use case, a user may use a camera application of the electronic device 100 to capture an image of content of interest and select a region in the captured image to be used as the second portion of the command to be generated. For example, a user may capture a Japanese signboard, say “what is”, and slide a finger over the Japanese text of the signboard on the captured image to generate a corresponding command to be processed by the electronic device.
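Because both portions end up as Unicode text once converted, the concatenation works regardless of the language or character set of the selected portion, as the following minimal fragment illustrates. The Japanese string is an arbitrary example chosen for this sketch, not the text shown in the original figures.

```python
# Minimal illustration: once both portions are Unicode text, the language
# and character set of the selected portion need not match the voice
# transcription. The Japanese string is an arbitrary example.

transcription = "what is"    # English voice input (first portion)
selected_text = "渋谷駅"      # Japanese text selected on screen (second portion)
print(f"{transcription} {selected_text}")  # "what is 渋谷駅"
```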
  • In some implementations, the voice input may include an instruction to be processed by the electronic device 100, wherein the selected content may correspond to a parameter associated with the instruction. As an example, the instruction may correspond to a copying operation and the parameter associated with the instruction may correspond to an item to be copied. For example, if a user reads a webpage and would like to share a text portion of the webpage with friends, the user may say “copy the words” and select the desired text portion on the screen to generate a corresponding command. When processing the command, the electronic device may copy the selected text portion into the clipboard of the electronic device 100, ready to be pasted somewhere else in order to be shared with friends.
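A hedged sketch of this instruction/parameter split follows, using a hypothetical instruction table and a stubbed clipboard; a real device would dispatch to its platform clipboard service instead.

```python
# Hedged sketch: the voice input carries an instruction, the selected
# content its parameter. The instruction table and clipboard stub are
# illustrative assumptions.

CLIPBOARD: list[str] = []


def copy_to_clipboard(text: str) -> None:
    CLIPBOARD.append(text)


INSTRUCTIONS = {
    "copy the words": copy_to_clipboard,  # instruction -> handler
}


def process_command(voice_instruction: str, selected_parameter: str) -> None:
    handler = INSTRUCTIONS.get(voice_instruction.lower().strip())
    if handler is None:
        raise ValueError(f"unknown instruction: {voice_instruction!r}")
    handler(selected_parameter)


process_command("copy the words", "Hi, I'm now in VESTEL")
print(CLIPBOARD)  # ["Hi, I'm now in VESTEL"]
```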
  • While it will be understood that receiving the voice input representative of the first portion of the command and receiving the selection of the content representative of the second portion of the command may be performed as a standalone two-step input procedure, the two-step input procedure may also be performed as a fallback to a failed attempt to transcribe the command as a full voice command. In one variant, the selection of the content may thus be received upon failure to correctly transcribe voice input representing the content. Failure to correctly transcribe the voice input may be determined by the user upon review of the transcription of the voice input on the screen 108, for example.
  • If the first portion of the command represents an initial portion that is to be input before the second portion, the electronic device 100 may also recognize that the voice input received in the first step does not yet represent a full command (e.g., saying “what is” without any further specification) and may therefore be configured to wait for additional input from the user. Upon recognizing that the voice input representative of the first portion of the command does not represent a full command, the electronic device 100 may thus wait for the selection of the content. In one such variant, the electronic device 100 may actively prompt the user to perform the selection of the content on the screen 108 when detecting that the full command is not yet available.
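A minimal sketch of this wait-and-prompt behavior follows, assuming a placeholder completeness heuristic and a callback that blocks until the user has selected content on the screen; a real device would use its speech engine's language model rather than a fixed stem list.

```python
# Minimal sketch of the wait-and-prompt behavior: if the transcription does
# not yet form a full command, the device prompts for and waits on a content
# selection. INCOMPLETE_STEMS and is_full_command() are placeholders.

INCOMPLETE_STEMS = ("what is", "where is")  # command stems lacking an object


def is_full_command(transcription: str) -> bool:
    return transcription.lower().strip() not in INCOMPLETE_STEMS


def handle_voice_input(transcription: str, wait_for_selection):
    """wait_for_selection: callback blocking until content is selected."""
    if is_full_command(transcription):
        return transcription
    print("Please select the missing part of the command on the screen.")
    return f"{transcription} {wait_for_selection()}"


print(handle_voice_input("where is", lambda: "VESTEL"))  # "where is VESTEL"
```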
  • FIG. 3 illustrates an exemplary selection of content displayed on the screen 108 of an electronic device 100, shown in the figure as a smartphone having a touch screen. In the example shown, the user of the smartphone 100 communicates with person “A” via a messaging application and has received a message from person A saying “Hi, I'm now in VESTEL”. Not knowing where VESTEL is, the user may ask the virtual assistant of the smartphone 100 “where is VESTEL”. Due to the user's unclear pronunciation, the virtual assistant may incorrectly recognize “where is vessel” as the voice command input by the user (not shown in the figure). To correct this improper recognition, the user may repeat the question, this time using the technique presented herein. The user may therefore say “where is”, and the virtual assistant may recognize that “where is” does not yet represent a full command and may thus wait for additional input from the user. As shown in the figure, the additional input is then provided by sliding the user's finger on the screen 108 over the word “VESTEL” in order to select it as the subsequent input for the command to be generated. The virtual assistant may then combine the voice input “where is” with the content selection “VESTEL” to obtain the full command “where is VESTEL”, process the command, and provide a corresponding answer to the user's question. In this way, it is ensured that the user obtains an answer to the correct question, rather than to the initially misrecognized question “where is vessel”.
  • It is believed that the advantages of the technique presented herein will be fully understood from the foregoing description, and it will be apparent that various changes may be made in the form, construction and arrangement of the exemplary aspects thereof without departing from the scope of the disclosure or sacrificing all of its advantageous effects. Because the technique presented herein can be varied in many ways, it will be recognized that the disclosure should be limited only by the scope of the claims that follow.

Claims (16)

1. A method for generating a command to be processed by a voice-controlled electronic device, the method comprising:
receiving a voice input representative of a first portion of a command to be processed by the electronic device;
receiving a selection of content displayed on a screen of the electronic device, the selected content being representative of a second portion of the command to be processed by the electronic device; and
generating the command based on a combination of the voice input and the selected content.
2. The method of claim 1, wherein the command corresponds to a query to a virtual assistant executed on the electronic device.
3. The method of claim 1, wherein the screen is a touch screen and wherein the selection of the content is made by a touch input on the touch screen.
4. The method of claim 3, wherein the touch input corresponds to a sliding gesture covering the content to be selected.
5. The method of claim 1, wherein, when the selection of the content comprises a selection of text, combining the voice input with the selected content comprises:
combining a transcription of the voice input with the selected text.
6. The method of claim 1, wherein, when the selection of the content comprises a selection of a display region on the screen, combining the voice input with the selected content comprises:
performing text recognition on the selected display region to obtain text included therein as selected text; and
combining a transcription of the voice input with the selected text.
7. The method of claim 5, wherein a language of the transcription of the voice input and a language of the selected text are different.
8. The method of claim 5, wherein a character set of the transcription of the voice input and a character set of the selected text are different.
9. The method of claim 1, wherein the voice input includes an instruction to be processed by the electronic device and wherein the selected content corresponds to a parameter associated with the instruction.
10. The method of claim 1, wherein the selection of the content is received upon failure to correctly transcribe voice input representing the content.
11. The method of claim 1, wherein, upon recognizing that the voice input representative of the first portion of the command does not represent a full command, the electronic device waits for the selection of the content.
12. A computer program product comprising program code portions that, when the computer program product is executed on one or more computing devices, enable an electronic device to
receive a voice input representative of a first portion of a command to be processed by the electronic device;
receive a selection of content displayed on a screen of the electronic device, the selected content being representative of a second portion of the command to be processed by the electronic device; and
generate the command based on a combination of the voice input and the selected content.
13. A non-transitory computer readable recording medium comprising the computer program product of claim 12.
14. A voice-controlled electronic device for generating a command to be processed by the electronic device, the electronic device comprising at least one processor and at least one memory, the at least one memory containing instructions executable by the at least one processor such that the electronic device is operable to
receive a voice input representative of a first portion of a command to be processed by the electronic device;
receive a selection of content displayed on a screen of the electronic device, the selected content being representative of a second portion of the command to be processed by the electronic device; and
generate the command based on a combination of the voice input and the selected content.
15. The method of claim 6, wherein a language of the transcription of the voice input and a language of the selected text are different.
16. The method of claim 6, wherein a character set of the transcription of the voice input and a character set of the selected text are different.

Applications Claiming Priority (1)

PCT/EP2018/083802 (WO2020114599A1); priority date: 2018-12-06; filing date: 2018-12-06; title: Technique for generating a command for a voice-controlled electronic device

Publications (1)

US20210311701A1 (en); publication date: 2021-10-07

Family

ID=64664278

Family Applications (1)

US17/311,279 (US20210311701A1, en); priority date: 2018-12-06; filing date: 2018-12-06; status: Abandoned; title: Technique for generating a command for a voice-controlled electronic device

Country Status (6)

Country Link
US (1) US20210311701A1 (en)
EP (1) EP3891730B1 (en)
JP (1) JP2022518339A (en)
KR (1) KR20210099629A (en)
CN (1) CN113196383A (en)
WO (1) WO2020114599A1 (en)


Also Published As

Publication number Publication date
EP3891730A1 (en) 2021-10-13
CN113196383A (en) 2021-07-30
KR20210099629A (en) 2021-08-12
WO2020114599A1 (en) 2020-06-11
EP3891730B1 (en) 2023-07-05
JP2022518339A (en) 2022-03-15


Legal Events

AS: Assignment (effective date 2021-06-28). Owner: VESTEL ELEKTRONIK SANAYI VE TICARET A.S., TURKEY. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: CUBUKCU, BARAN; REEL/FRAME: 056785/0597
STPP: Patent application and granting procedure in general. DOCKETED NEW CASE - READY FOR EXAMINATION
STPP: Patent application and granting procedure in general. NON FINAL ACTION MAILED
STPP: Patent application and granting procedure in general. FINAL REJECTION MAILED
STPP: Patent application and granting procedure in general. DOCKETED NEW CASE - READY FOR EXAMINATION
STPP: Patent application and granting procedure in general. NON FINAL ACTION MAILED
STCB: Application discontinuation. ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION