WO2019175896A1 - System and method for interacting with digital screens using voice input and image processing technique - Google Patents

System and method for interacting with digital screens using voice input and image processing technique

Info

Publication number
WO2019175896A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
user
digital screen
input
voice input
Prior art date
Application number
PCT/IN2019/050200
Other languages
French (fr)
Inventor
Renuka Bodla
Original Assignee
Renuka Bodla
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Renuka Bodla filed Critical Renuka Bodla
Publication of WO2019175896A1 publication Critical patent/WO2019175896A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/167Audio in a user interface, e.g. using voice commands for navigating, audio feedback

Definitions

  • the embodiments herein generally relate to a system for interacting with digital screens and natural language processing, and more particularly, to a system and method that assists a user in interacting with digital screens using voice input.
  • an embodiment herein provides a method for interacting with a digital screen of a voice based digital screen interaction device using a voice input of a user.
  • the method includes (a) generating a database that stores information comprising a natural language of a user; (b) identifying a type of a program that runs on a digital screen of the voice based digital screen interaction device; (c) parsing the digital screen as an image in real time using an image processing technique to identify an overall layout of the digital screen, wherein the image processing technique removes unnecessary images and graphics that are embedded in the digital screen; (d) determining attributes by analysing the overall layout of the digital screen; (e) transforming the attributes of the digital screen from a machine-readable language to the natural language of the user using a natural language processing technique, wherein the natural language of the user is a human spoken language; (f) providing the attributes in the natural language of the user on the voice based digital screen interaction device for enabling the user to input a voice input, wherein each attribute comprises a label, an input field and their semantics.
  • the method includes accessing functions selected from a group comprising Home, Tools, View, Message, Submit, Save, Backspace, Pagedown, Pageup based on the voice input to operate the voice based digital screen interaction device.
  • the method includes (a) analysing a behaviour of the user, using a machine learning algorithm, based on the user’s interactions with different digital screens of the voice based digital screen interaction device; (b) generating data analytics on which digital screens or attributes the user spends the maximum amount of time interacting with; and (c) automatically prepopulating input fields associated with the attributes with standard values based on the user behaviour analysis and data analytics.
  • the method includes implementing a voice authentication process to authenticate authorized users to access digital screens that are secured.
  • the voice input of the user is captured using a microphone associated with the voice based interaction device.
  • the method includes eliminating grammatical errors from the voice input using the natural language processing technique to determine input data corresponding to different input fields.
  • the method includes implementing a machine learning algorithm to interpret abbreviations, synonyms, and other forms of pronouncing the labels and to analyse standard values provided by the user for prepopulating the respective input fields.
  • the attributes include a static text, a menu bar option, a tool bar option, an audio, a video, a label, an input field and/or a semantics on the digital screen and the input fields include a text box, a radio button, a drop-down list box, a check box, or a multiple choice to select input data.
  • a system for interacting with a digital screen of a voice based digital screen interaction device using a voice input of a user comprises a memory and a processor.
  • the memory stores a set of instructions.
  • the memory comprises a database that stores information associated with a natural language of a user.
  • the processor executes the set of instructions to perform: (a) identifying a type of a program that runs on a digital screen of the voice based digital screen interaction device; (b) parsing the digital screen as an image in real time using an image processing technique to identify an overall layout of the digital screen, wherein the image processing technique removes unnecessary images and graphics that are embedded in the digital screen; (c) determining attributes by analysing the overall layout of the digital screen; (d) transforming the attributes of the digital screen from a machine-readable language to the natural language of the user using a natural language processing technique, wherein the natural language of the user is a human spoken language; (e) providing the attributes in the natural language of the user on the voice based digital screen interaction device for enabling the user to input a voice input, wherein each attribute comprises a label, an input field and their semantics; (f) receiving the voice input of the user in the natural language of the user from the voice based digital screen interaction device, wherein the voice input of the user is processed to eliminate unnecessary background noises from the voice input.
  • the system enhances the performance and productivity of the user through voice-input-based interaction.
  • the system does not require any external hardware to allow the user to interact with the digital screen or to provide the input data for various input fields.
  • the system is quick and accurate enough to parse, populate and identify the input fields in a newly opened digital screen, so that the user can interact with the newly opened digital screen in real time.
  • the system allows users to interact with the digital screens in data-entry jobs using voice input.
  • the system helps users who are not proficient in (i) the machine-readable language and (ii) operating the voice based digital screen interaction device to interact effectively with the digital screen.
  • FIG. 1 illustrates a system view of a user interacting with a digital screen using a voice based digital screen interaction device that communicates with a voice interaction enabling server through a network according to an embodiment herein;
  • FIG. 2 illustrates an exploded view of the voice based digital screen interaction device of FIG. 1 according to an embodiment herein;
  • FIG. 3 illustrates an exploded view of the voice interaction enabling server of FIG. 1 according to an embodiment herein;
  • FIG. 4 illustrates an exemplary view of a digital screen with a menu bar having sub-options for messaging that the user interacts with to select a sub-option in real time by providing voice input according to an embodiment herein;
  • FIG. 5 illustrates an exemplary view of a user interacting with digital screens in real time according to an embodiment herein;
  • FIGS. 6A-6B are flow diagrams illustrating a method for interacting with the digital screen in the voice based digital screen interaction device using voice input according to an embodiment herein;
  • FIGS. 7A and 7B are flow diagrams illustrating a method for generating commands using the voice interaction enabling server based on voice input according to an embodiment herein; and
  • FIG. 8 shows a diagrammatic representation of a computer system within which a set of instructions for causing the machine to perform any one or more of the methodologies discussed herein may be executed in accordance with an embodiment herein.
  • FIGS. 1 through 8 where similar reference characters denote corresponding features consistently throughout the figures, preferred embodiments are shown.
  • FIG. 1 illustrates a system view of a user 102 interacting with a digital screen using a voice based digital screen interaction device 104 that communicates with a voice interaction enabling server 108 through a network 106 according to an embodiment herein.
  • the voice based digital screen interaction device 104 identifies a type of a program that runs on its digital screen.
  • the voice based digital screen interaction device 104 may be any of a mobile phone, a tablet, a laptop, a desktop computer etc.
  • the program may be an ERP or CRM application, an operating system, a mobile application, a website etc.
  • the voice based digital screen interaction device 104 parses the digital screen as an image in real time using image processing techniques (e.g. image preprocessing, image restoration, image compression, image filtering and/or image analysis) to identify an overall layout of the digital screen.
  • the digital screen may be (a) a web portal page, (b) an e-filing form, (c) a mobile application, (d) a user interface of an operating system, etc.
  • the digital screen is internally processed using the image processing techniques, such as image preprocessing, image filtering and image compression, for the removal of noise, for accurate image visibility and to identify an overall layout of the digital screen.
  • the voice based digital screen interaction device 104 determines attributes of the digital screen by analyzing it.
  • the attributes may include static text, menu bar options, tool bar options, audio, video, labels, input fields and/or semantics on the digital screen.
  • the input fields may include text boxes, radio buttons, drop down list boxes, check boxes, etc.
  • the menu bar options may include FILE, VIEW, EDIT, TOOLS, MESSAGE, etc.
  • the voice based digital screen interaction device 104 uses the image processing technique to remove unnecessary images and graphics that are embedded in the digital screen.
  • the voice based digital screen interaction device 104 identifies a proximity type between the labels and the input fields.
  • the proximity type may include (a) adjacent proximity, (b) up and down proximity and/or (c) near-by location proximity.
  • the up and down proximity may typically occur when the digital screen is displayed on a computing device that has a smaller screen.
  • the voice based digital screen interaction device 104 analyzes the semantics of the labels and associates the labels with corresponding input fields based on the identified proximity type.
  • the input fields may include multiple choices to select input data.
  • the voice based digital screen interaction device 104 associates the multiple choices with corresponding input fields in real time.
  • For example, if an input field associated with the label “HOBBIES” includes multiple choices like “PLAYING”, “DRAWING”, “READING”, etc., the voice based digital screen interaction device 104 associates the above-mentioned multiple choices with the label “HOBBIES” in real time.
  • the voice based digital screen interaction device 104 receives voice input from the user 102 using a microphone or any other mechanism to receive the voice input/commands.
  • the voice based digital screen interaction device 104 may receive the voice input in any order based on a preferred communication style or format of the user 102.
  • the voice based digital screen interaction device 104 communicates the voice input and attributes of the digital screen to a voice interaction enabling server 108.
  • the voice interaction enabling server 108 may eliminate grammatical errors from the voice input using a natural language processing technique to determine data corresponding to different fields.
  • Natural language processing is a technique used in artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to fruitfully process large amounts of natural language data.
  • the natural language processing technique employs machine learning algorithms, implemented in languages such as Python, to eliminate grammatical errors from the voice input in order to determine data corresponding to different fields.
  • there are multiple open source libraries (e.g. Apache OpenNLP, Natural Language Toolkit (NLTK), Stanford NLP, MALLET) that are available to provide algorithmic building blocks in real time.
  • the embodiment herein can implement the natural language processing technique using an open source library such as the Natural Language Toolkit (NLTK) for Python.
  • the voice interaction enabling server 108 may receive the voice input as “MY FIRST NAME is JOHN” or “FIRST NAME” and “JOHN” to determine that the data corresponding to the field “FIRST NAME” is “JOHN”.
  • the voice interaction enabling server 108 may also receive the voice input as exactly the input data or a spelled format of the input data. For example, the voice interaction enabling server 108 receives the voice input in the format of “JOHN” or “J-O-H-N”. The voice interaction enabling server 108 identifies the attributes irrespective of the order of the attributes in the digital screen and generates commands to populate the input data in the associated input fields.
  • For example, for the labels “FIRST NAME”, “MIDDLE NAME” and “LAST NAME”, the voice interaction enabling server 108 may receive the input data first for the label “LAST NAME”, second for the label “FIRST NAME” and last for the label “MIDDLE NAME”, and generate commands to populate the associated input fields with the respective input data irrespective of the order in which the inputs were received.
  • the voice interaction enabling server 108 may learn how to understand and interpret abbreviations, synonyms, and other forms of pronouncing the labels. For example, inside an Oracle ERP system, there could be a label “PO” and the user 102 may provide input corresponding to the label either as Purchase Order or PO.
  • the voice interaction enabling server 108 further analyzes the voice input and the attributes of the digital screen.
  • the voice input may be in a natural language (e.g. native tongue) of the user 102.
  • the voice interaction enabling server 108 processes the voice input of the user 102 to eliminate unnecessary background noises from the voice input.
  • the attributes of the digital screen may be in a machine readable language.
  • the machine-readable language may be any language, for example Extensible Markup Language (XML), in which the digital screen is intended to be filled according to legal requirements, jurisdictional requirements, etc.
  • the voice interaction enabling server 108 translates (i) the voice input from a natural language to a machine readable language and (ii) the attributes of the digital screen from the machine readable language to the natural language of the user 102 (e.g. native tongue of the user 102) using the natural language processing technique.
  • the voice interaction enabling server 108 communicates the translated attributes to the voice based digital screen interaction device 104.
  • the voice based digital screen interaction device 104 may display or play the attributes using a display or a speaker respectively, when the user 102 requests the voice interaction enabling server 108.
  • the voice interaction enabling server 108 may assist the user 102 using the natural language processing technique (a) to understand the semantics of the attributes and (b) to provide the voice input in the natural language.
  • the natural language may be any human spoken language or the native tongue of the user 102.
  • the natural language is not a programming language.
  • the user 102 may interact with the digital screens several times.
  • the voice interaction enabling server 108 may learn and analyze behavior (e.g. which may include an accent or an individual style or pattern of providing the voice inputs) of the user 102 from the digital screens that are interacted by the user 102, using a machine learning algorithm.
  • the voice interaction enabling server 108, using the machine learning algorithm, may (i) learn and analyze standard values (e.g. JOHN, SMITH, INDIA etc.) provided by the user 102 for the respective input fields (e.g. first name, last name, country etc.) and (ii) generate commands to automatically pre-populate those standard values in the respective input fields to reduce input time for the user 102 during subsequent interactions with the digital screens.
  • the voice based digital screen interaction device 104 includes a voice based interaction database that stores a repository of the digital screens, the standard values and the automatically pre-populated input fields, etc. For example, if the user 102, who may be from the country “INDIA”, repeatedly fills the label “COUNTRY” with the standard value of “INDIA”, then the voice interaction enabling server 108, using the machine learning algorithm, learns the standard value “INDIA” for the label “COUNTRY” for that user 102 and generates commands to automatically pre-populate the standard value of “INDIA” for an input field of the label “COUNTRY” in the digital screen during subsequent interactions of that user 102 with that digital screen.
  • the voice interaction enabling server 108 may analyze the semantics of the input data in the label “FIRST NAME” and generate commands to automatically pre-populate an input field of the label “GENDER” with appropriate data.
  • For example, the voice interaction enabling server 108 analyses the semantics of the input data “JOHN” associated with the label “FIRST NAME” and generates commands to automatically pre-populate an input field of the label “GENDER” with the data “MALE”.
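To make the learning behaviour described in the preceding bullets concrete, here is a minimal sketch, assuming a simple frequency counter per (user, label) pair and a toy first-name-to-gender lookup; the repeat threshold, the class name and the lookup table are illustrative assumptions, not the patent's algorithm.

```python
from collections import Counter, defaultdict

FIRST_NAME_GENDER = {"JOHN": "MALE", "MARY": "FEMALE"}  # toy semantic lookup

class StandardValueLearner:
    """Learns per-user default values for labels from repeated entries."""

    def __init__(self, min_repeats=3):
        self.history = defaultdict(Counter)   # (user, label) -> value counts
        self.min_repeats = min_repeats

    def record(self, user, label, value):
        self.history[(user, label)][value.upper()] += 1

    def prepopulate(self, user, label, form):
        counts = self.history[(user, label)].most_common(1)
        if counts and counts[0][1] >= self.min_repeats:
            form[label] = counts[0][0]        # e.g. COUNTRY -> INDIA
        # Crude semantic inference, as in the GENDER example above.
        if label == "FIRST NAME" and form.get(label) in FIRST_NAME_GENDER:
            form.setdefault("GENDER", FIRST_NAME_GENDER[form[label]])

# learner = StandardValueLearner()
# record "INDIA" for COUNTRY three times, then prepopulate("u1", "COUNTRY", form)
# sets form["COUNTRY"] = "INDIA" on the next interaction.
```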
  • the voice based digital screen interaction device 104 may delete, edit or change the automatically pre-populated standard values based on voice input of the user 102.
  • the voice interaction enabling server 108 automatically populates the input field of the label “COUNTRY”.
  • For example, the voice interaction enabling server 108 automatically populates the input field of the label “COUNTRY” as “INDIA” in the digital screen.
  • the voice interaction enabling server 108 determines how a word has to be spelled on the display/speaker of the voice based digital screen interaction device 104. For example, for the word “COLOUR”, the spelling is “COLOR” in the US whereas the spelling is “COLOUR” in India. If the location of the voice based digital screen interaction device 104 is identified as the US, then the voice interaction enabling server 108 populates the input field with the spelling used in the US.
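As a toy illustration of the locale-aware spelling choice just described, assuming a small hand-maintained mapping table (an editorial assumption, not part of the patent):

```python
# (canonical word, region) -> regional spelling; falls back to the input word.
SPELLINGS = {("COLOUR", "US"): "COLOR"}

def localize(word: str, region: str) -> str:
    return SPELLINGS.get((word.upper(), region), word.upper())

# localize("colour", "US") -> "COLOR"
# localize("colour", "IN") -> "COLOUR"
```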
  • the voice interaction enabling server 108 may provide data analytics on which digital screens the user 102 spends the maximum amount of time interacting with, or which input fields are not being populated by the user 102.
  • the voice interaction enabling server 108 may provide options to the user 102 to stop or start interacting with the digital screen using voice input.
  • the voice interaction enabling server 108 may allow the user 102 to start or stop interacting with the digital screen using the voice input “START” or “STOP” respectively.
  • the voice based digital screen interaction device 104 may allow the user 102 to enter the inputs through a keyboard or a mouse.
  • the voice interaction enabling server 108 may analyze an accent of the user 102 and an individual style or pattern of the user 102 in providing the voice input.
  • the voice interaction enabling server 108 may employ a voice authentication process which authenticates authorized users to fill in or interact with secured digital screens.
  • the voice interaction enabling server 108 may use the user’s 102 voice as an authentication feature to allow the authorized users to fill the secured digital screens.
  • the authentication feature may be (i) a voice pitch, (ii) a voice tone or (iii) a voice accent.
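The following is a hedged sketch of such voice authentication, using MFCC features as a crude stand-in for the pitch/tone/accent features named above; librosa is an assumed library choice, and a fixed cosine-similarity threshold stands in for a real speaker-verification model.

```python
import numpy as np
import librosa

def voiceprint(wav_path: str) -> np.ndarray:
    """Reduce a recording to one fixed-length feature vector per speaker."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return mfcc.mean(axis=1)

def authenticate(enrolled_wav: str, attempt_wav: str, threshold=0.9) -> bool:
    """Accept the attempt if its voiceprint is close to the enrolled one."""
    a, b = voiceprint(enrolled_wav), voiceprint(attempt_wav)
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cos >= threshold
```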
  • the voice interaction enabling server 108 may generate commands based on the analysis of the voice input of the user 102 and the attributes of the digital screen.
  • the voice interaction enabling server 108 may communicate the generated commands to the voice based digital screen interaction device 104.
  • the voice based digital screen interaction device 104 receives and executes the commands using a command execution technique.
  • the command execution technique includes well-known techniques such as command injection.
  • command injection is the insertion of code, e.g. UNIX commands, into dynamically generated output, and it helps with managing the data.
  • the command execution may include (i) populating the input fields of the labels with the respective input data based on the voice input of the user 102, (ii) automatically pre-populating the input fields with the standard values based on the behavior analysis and data analytics associated with the user 102, (iii) verifying the voice input or the input data with the user 102, (iv) accessing the functions like HOME, TOOLS, VIEW, MESSAGE, SUBMIT, SAVE, BACKSPACE, PAGEDOWN, PAGEUP etc. to effectively operate the voice based digital screen interaction device 104, (v) providing an authentication to the user 102 to access the secured digital screen or (vi) displaying or playing the attributes of the digital screen or input data to the user 102 in the natural language of the user 102.
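A minimal sketch of a command object and its executor is given below; a dispatch table is used here rather than injecting raw code strings, and the Command fields and action names are illustrative assumptions rather than the patent's specified structure.

```python
from dataclasses import dataclass

@dataclass
class Command:
    action: str          # "POPULATE", "FUNCTION", "VERIFY", ...
    target: str = ""     # a label ("FIRST NAME") or function name ("SAVE")
    value: str = ""      # input data, e.g. "JOHN"

def execute(cmd: Command, form: dict, functions: dict) -> None:
    """Dispatch on the command type instead of executing raw code strings."""
    if cmd.action == "POPULATE":
        form[cmd.target] = cmd.value
    elif cmd.action == "FUNCTION":
        functions[cmd.target]()            # e.g. SAVE, SUBMIT, PAGEDOWN
    elif cmd.action == "VERIFY":
        print(f"Please confirm {cmd.target} = {cmd.value}")
    else:
        raise ValueError(f"unknown command: {cmd.action}")

# execute(Command("POPULATE", "FIRST NAME", "JOHN"), form={}, functions={})
```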
  • the voice based digital screen interaction device 104 may play or display the attributes of the digital screen or input data to the user 102 in the natural language, using the speaker/display, when the user 102 requests to play/display the input data entered in the input field.
  • the voice based digital screen interaction device 104 may play or display the attribute of the digital screen irrespective of the font type, color, or size of the attribute, etc.
  • the voice based digital screen interaction device 104 may verify the voice input or the input data with the user 102 when a format of received input data for an input field is different from an expected input data format. For example, for a given label, the input data format expected by the voice interaction enabling server 108 may be a number format.
  • the voice interaction enabling server 108 verifies and notifies the user 102 through the voice based digital screen interaction device 104 to provide the input data in the number format.
  • the digital screen may include mandatory fields to be filled in by the user 102. If the user 102 clicks the “SUBMIT” option without filling the mandatory fields, then the voice interaction enabling server 108 verifies and notifies the user 102 through the voice based digital screen interaction device 104 to fill the mandatory fields.
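The two verification rules above (format checking, and mandatory fields on SUBMIT) could be sketched as follows; the label names and the pattern registry are hypothetical, introduced only for illustration.

```python
import re

EXPECTED = {"AGE": r"^\d+$", "FIRST NAME": r"^[A-Za-z]+$"}  # label -> pattern

def verify(label: str, value: str):
    """Return an error message if the value does not match the expected format."""
    pattern = EXPECTED.get(label)
    if pattern and not re.match(pattern, value):
        return f"'{value}' does not match the expected format for {label}"
    return None

def missing_mandatory(form: dict, mandatory: list) -> list:
    """Labels that must be filled before SUBMIT but are still empty."""
    return [label for label in mandatory if not form.get(label)]

# verify("AGE", "JOHN") -> "'JOHN' does not match the expected format for AGE"
# missing_mandatory({"FIRST NAME": "JOHN"}, ["FIRST NAME", "AGE"]) -> ["AGE"]
```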
  • one or more voice based digital screen interaction devices are communicatively coupled with the voice interaction enabling server 108.
  • the one or more voice based digital screen interaction devices receive the voice input from the one or more users and communicate the voice inputs of the one or more users to the voice interaction enabling server 108 for further processing as described above.
  • the one or more voice based digital screen interaction devices may communicate with a voice based digital screen interactive cloud model instead of the voice interaction enabling server 108.
  • the voice based digital screen interaction device 104 may perform the functions of the voice interaction enabling server 108 as described above, including processing the voice input obtained from the user 102 and the attributes of the digital screen, generating commands based on analysis of the voice input and the attributes of the digital screen, etc.
  • the voice interaction enabling server 108 may include a default setting if the user 102 uses the voice interaction enabling server 108 for the first time. The voice interaction enabling server 108 then starts learning the attributes of the digital screens that the user 102 is interacting with, using the machine learning algorithm, and saves those attributes with his/her profile. If the voice interaction enabling server 108 is being used by multiple users, then the voice interaction enabling server 108 creates a profile for each of the users using an identification of each user.
  • FIG. 2 illustrates an exploded view of the voice based digital screen interaction device 104 of FIG. 1 according to an embodiment herein.
  • the voice based digital screen interaction device 104 includes a voice based interaction database 202, a program identification module 204, a screen parsing module 206, a label and input field association module 208, a voice input obtaining module 210, a voice input verification module 212, an attributes and voice input communication module 214, a command receiving module 216 and a command execution module 218.
  • the voice based digital screen interaction device 104 may include a voice input swapping module.
  • the program identification module 204 identifies a type of the program that runs on a digital screen of the voice based digital screen interaction device 104.
  • the screen parsing module 206 parses the digital screen as an image in real time using an image processing technique to identify the overall layout of the digital screen.
  • the screen parsing module 206 may determine the attributes of the digital screen by analyzing the layout of the digital screen.
  • the screen parsing module 206 may remove unnecessary images and graphics that are embedded in the digital screen for better interaction.
  • the label and input field association module 208 identifies a proximity type between the labels and the input fields.
  • the label and input field association module 208 associates the labels with the respective input fields based on (i) the identified proximity type and (ii) the analysis of the semantics of the labels and corresponding input fields.
  • the voice input obtaining module 210 obtains the voice input from the user 102 in natural language (e.g. native tongue) using the microphone or any other mechanism to receive the voice input.
  • the voice input may include any one of (i) utterance of the name of the label followed by the input data to be entered in the corresponding input field and (ii) utterance of the phrase “CLICK ON” followed by at least one of (a) a menu bar option, (b) a tool bar option or (c) functions like EDIT, BACKSPACE, DELETE, SUBMIT, SAVE, etc.
  • the voice input may also include utterance of the phrase “TRANSLATE” followed by at least one of (a) a screen name, (b) a label name and (c) an input field, followed by a name of the natural language, as sketched below.
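These utterance shapes suggest a small command grammar. The following is a minimal, illustrative sketch of how such utterances could be parsed; the regular expressions, the assumption that the language name is the last word of a TRANSLATE utterance, and the longest-label-first matching are editorial assumptions rather than the patent's specified method.

```python
import re

def parse_utterance(text: str, known_labels: list):
    """Classify a voice utterance into one of the command shapes above."""
    text = text.strip().upper()
    m = re.match(r"^CLICK ON (.+)$", text)
    if m:
        return ("CLICK", m.group(1))               # menu/tool bar option or function
    m = re.match(r"^TRANSLATE (.+) (\S+)$", text)  # assume last word = language name
    if m:
        return ("TRANSLATE", m.group(1), m.group(2))
    for label in sorted(known_labels, key=len, reverse=True):
        if text.startswith(label):                 # "<label> <input data>"
            return ("POPULATE", label, text[len(label):].strip())
    return ("UNKNOWN", text)

# parse_utterance("CLICK ON SUBMIT", [])             -> ("CLICK", "SUBMIT")
# parse_utterance("TRANSLATE SCREEN 1 HINDI", [])    -> ("TRANSLATE", "SCREEN 1", "HINDI")
# parse_utterance("FIRST NAME JOHN", ["FIRST NAME"]) -> ("POPULATE", "FIRST NAME", "JOHN")
```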
  • the voice input obtaining module 210 eliminates the unnecessary background noises from the voice input of the user 102.
  • the voice input verification module 212 verifies the voice input of the user 102 when a format of received input data for an input field is different from an expected input data format.
  • the voice input verification module 212 further verifies (i) a voice pitch, (ii) a voice tone and (iii) a voice accent of the user 102 (e.g. authorized user) with the received voice input for authentication.
  • the attributes and voice input communication module 214 communicates the voice input and the attributes of the digital screen to the voice interaction enabling server 108 to generate commands based on the analysis of the voice input of the user 102 and the attributes of the digital screen.
  • the command receiving module 216 receives a command that is generated based on the voice input and the attributes of the digital screen from the voice interaction enabling server 108.
  • the command execution module 218 executes the command that is generated based on the voice input using a command execution technique.
  • the command execution module 218 may delete, edit or change the input data in an input field that is automatically pre-populated with standard values based on voice input of the user 102.
  • the voice based interaction database 202 stores a repository of the digital screens, the standard values filled in different digital screens by the user 102 and the automatically pre-populated input fields, etc.
  • the voice input swapping module provides options to the user 102 to swap between the voice input based interaction and a traditional way of interaction (e.g. through keyboard or mouse) with the digital screen.
  • FIG. 3 illustrates an exploded view of the voice interaction enabling server 108 of FIG. 1 according to an embodiment herein.
  • the voice interaction enabling server 108 includes a voice based command execution database 302, an attribute and voice input receiving module 304, a natural language module 306, an attribute identification module 308, a behavior analysis module 310, a data analytics generation module 312, a command generation module 314 and a command communication module 316.
  • the attribute and voice input receiving module 304 receives the voice input and the attributes of the digital screen from the voice based digital screen interaction device 104.
  • the natural language module 306 analyzes the received voice input and translates the voice input to a machine readable language from a natural language (e.g. native tongue of the user 102).
  • the attribute identification module 308 identifies, from the received attributes of the digital screen and based on the analyzed voice input, the attribute into which the user 102 intends to enter input data or execute a command.
  • the attribute identification module 308 may identify the attributes into which the user 102 intends to enter input data or execute the command irrespective of the position of the attributes in the digital screen.
  • the behavior analysis module 310 analyzes the behavior of the user 102 (e.g. which may include an accent or an individual style or pattern of providing the voice input) from the digital screens that are filled in or interacted with by the user 102, using a machine learning algorithm (e.g. implemented in MATLAB, C++, etc.).
  • the data analytics generation module 312 generates data analytics on which digital screens or attributes the user 102 spends the maximum amount of time interacting with or entering input data into.
  • the command generation module 314 generates commands based on the voice input, the data analytics of user’s interaction with the different digital screens, the voice input verification and the behavior analysis of the user 102.
  • the command communication module 316 communicates the generated commands to the voice based digital screen interaction device 104 through a wireless or any other network.
  • the voice based command execution database 302 stores (i) a voice pitch of the user 102, (ii) a voice tone of the user 102, (iii) a voice accent of the user 102, (iv) standard values, (v) data analytics of the user’s interaction with the different digital screens, etc.
  • the wireless network may be (a) a WIFI network, (b) a Bluetooth network or (c) any other network.
  • FIG. 4 illustrates an exemplary view of a digital screen with a menu bar having sub-options for messaging that the user interacts with to select a sub-option in real time by providing voice input according to an embodiment herein.
  • the menu bar comprises an option “MESSAGE” 402, which in turn provides/displays sub-options such as (i) NEW MESSAGE, (ii) NEW MESSAGE USING, (iii) REPLY TO SENDER, (iv) REPLY TO ALL, (v) REPLY TO GROUP, and (vi) FORWARD when the user 102 accesses/clicks the option “MESSAGE” 402.
  • the voice interaction enabling server 108 analyzes, parses and associates the sub-options with the option “MESSAGE” 402 in real time and allows the user 102 to select a sub-option in real time using his voice input.
  • FIG. 5 illustrates an exemplary view of a user interacting with one or more digital screens in real time according to an embodiment herein.
  • the voice based digital screen interaction device 104 may provide options to the user 102 to interact simultaneously with more than one digital screen and switch between a first digital screen and a second digital screen.
  • the voice based digital screen interaction device 104 provides options to the user 102 to give each digital screen a name.
  • the voice based digital screen interaction device 104 receives a voice input that includes a name of the digital screen as input data followed by a label from the user 102 and names that digital screen based on the voice input.
  • For example, when the user 102 is interacting simultaneously with two digital screens, say “SCREEN 1” and “SCREEN 2”, and wants to direct input to “SCREEN 1”, the user 102 can provide voice input as “SCREEN 1” followed by a name of the attribute/label and then the input data to be entered/executed in the associated input field, as sketched below.
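A minimal sketch of routing such a multi-screen utterance follows, under the assumption that the device keeps a registry of screen names and their labels; the function and registry names are hypothetical.

```python
def route(text: str, screens: dict):
    """Resolve '<screen name> <label> <input data>' against a screen registry."""
    text = text.strip().upper()
    for screen_name, labels in screens.items():        # e.g. "SCREEN 1"
        if text.startswith(screen_name):
            rest = text[len(screen_name):].strip()
            for label in sorted(labels, key=len, reverse=True):
                if rest.startswith(label):
                    return screen_name, label, rest[len(label):].strip()
    return None

# route("SCREEN 1 FIRST NAME JOHN", {"SCREEN 1": ["FIRST NAME"]})
# -> ("SCREEN 1", "FIRST NAME", "JOHN")
```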
  • FIGS. 6A-6B are flow diagrams illustrating a method for interacting with a digital screen in the voice based digital screen interaction device 104 using voice input according to an embodiment herein.
  • attributes of a digital screen are received from a voice based digital screen interaction device 104.
  • the attributes of the digital screen are transformed from a machine-readable language to the natural language of the user 102 using a natural language processing technique.
  • the natural language of the user 102 is a human spoken language.
  • the attributes in the natural language of the user 102 are provided on the voice based digital screen interaction device 104 for enabling the user 102 to input a voice input.
  • Each attribute comprises a label, an input field and their semantics.
  • the voice input of the user 102 is received from the voice based digital screen interaction device 104.
  • the voice input is transformed from the natural language of the user 102 to the machine-readable language using the natural language processing technique.
  • a first attribute to be populated is determined based on the voice input, irrespective of the position of the first attribute in the digital screen.
  • a proximity type between a label and an input field associated with the first attribute is identified to associate the label with the input field based on the identified proximity type.
  • semantics of the label and its corresponding input field are analyzed based on the identified proximity type.
  • a command is generated based on the voice input and the first attribute, to populate the voice input in the input field associated with the first attribute.
  • the voice input is populated in the input field associated with the first attribute by executing the generated command on the voice based digital screen interaction device 104 using a command execution technique.
  • FIGS. 7A and 7B are flow diagrams illustrating a method for generating commands using the voice interaction enabling server 108 based on voice input according to an embodiment herein.
  • a type of a program that runs on a digital screen of the voice based digital screen interaction device 104 is identified.
  • the digital screen is parsed as an image in real time using an image processing technique to identify an overall layout of the digital screen.
  • the image processing technique removes unnecessary images and graphics that are embedded in the digital screen.
  • attributes are determined by analysing the overall layout of the digital screen.
  • the attributes of the digital screen are transformed from a machine-readable language to the natural language of the user using a natural language processing technique.
  • the attributes comprise a static text, a menu bar option, a tool bar option, an audio, a video, a label, an input field and/or a semantics on the digital screen and the input fields comprise a text box, a radio button, a drop down list box, a check box, or a multiple choice to select input data.
  • the natural language of the user is a human spoken language.
  • the attributes are provided in the natural language of the user 102 on the voice based digital screen interaction device 104 for enabling the user 102 to input a voice input.
  • Each attribute comprises a label, an input field and their semantics.
  • the voice input of the user 102 is received in the natural language of the user 102 from the voice based digital screen interaction device 104.
  • the voice input of the user 102 is processed to eliminate unnecessary background noises from the voice input.
  • the voice input is transformed from the natural language of the user 102 to the machine-readable language using the natural language processing technique.
  • a first attribute to be populated is determined based on the voice input, irrespective of the position of the first attribute in the digital screen.
  • a proximity type between a label and an input field associated with the first attribute is identified to associate the label with the input field based on the identified proximity type.
  • semantics of the label and its corresponding input field are analysed based on the identified proximity type.
  • a command is generated based on the voice input and the first attribute, to populate the voice input in the input field associated with the first attribute.
  • the voice input is populated in the input field associated with the first attribute by executing the generated command on the voice based digital screen interaction device 104 using a command execution technique.
  • FIG. 8 shows a diagrammatic representation of a computer system within which a set of instructions for causing the machine to perform any one or more of the methodologies discussed herein may be executed in accordance with an embodiment herein.
  • the machine operates as a standalone device or may be connected (e.g., networked) to other machines.
  • the machine may operate in the capacity of a server or a client machine in server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment.
  • the machine may be a server computer, a client computer, a personal computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, a web appliance, a network router, switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine.
  • the example computer system includes a processor 802 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), or both), a main memory 804 and a static memory 806, which communicate with each other via a bus 808.
  • the computer system may further include a video display unit 810 (e.g., a liquid crystal display (LCD), a light emitting diode (LED) or a cathode ray tube (CRT)).
  • the computer system also includes an alphanumeric input device 812 (e.g., a keyboard or touch screen), a disk drive unit 814 and a network interface device 816.
  • the disk drive unit 814 includes a machine-readable medium 818 on which is stored one or more sets of instructions 820 (e.g., software) embodying any one or more of the methodologies or functions described herein.
  • the instructions 820 may also reside, completely or at least partially, within the main memory 804 and/or within the processor 802 during execution thereof by the computer system, the main memory 804 and the processor 802 also constituting machine-readable media.
  • the instructions 820 may further be transmitted or received over a network 822 via the network interface device 816.
  • while the machine-readable medium 818 is shown in an example embodiment to be a single medium, the term “machine-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions.
  • the term “machine-readable medium” shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present disclosure.
  • the term “machine-readable medium” shall accordingly be taken to include, but not be limited to, solid-state memories, optical and magnetic media, and carrier wave signals.

Abstract

A system for interacting with a digital screen using voice input is provided. The voice based digital screen interaction device (104) identifies a type of program that runs on its digital screen and parses the digital screen as an image in real time. The voice based digital screen interaction device (104) identifies a layout of the digital screen to determine attributes and to associate labels with corresponding input fields. The voice based digital screen interaction device (104) receives voice input from the user (102) in a human spoken language and communicates the attributes and the voice input to a voice interaction enabling server (108). The voice interaction enabling server (108) translates the attributes and the voice input and verifies the voice input to generate commands based on the voice input. The voice based digital screen interaction device (104) receives and executes the commands to perform functions as defined in the commands.

Description

SYSTEM AND METHOD FOR INTERACTING WITH DIGITAL SCREENS USING VOICE INPUT AND IMAGE PROCESSING TECHNIQUE
BACKGROUND
Technical Field
[0001] The embodiments herein generally relate to a system for interacting with digital screens and natural language processing, and more particularly, to a system and method that assists a user in interacting with digital screens using voice input.
Description of the Related Art
[0002] Users of software products, websites, mobile applications, etc. typically interact with digital screens to provide inputs using input devices like a keyboard or a mouse. Excessive usage of the above-mentioned input devices may result in ergonomic injuries such as neck pain, back pain, and carpal tunnel syndrome, as well as reduced productivity. Further, users who suffer from physical disabilities and are unable to operate these input devices cannot use their personal digital devices without seeking help from others. Since digital screens are typically displayed in popular languages like English, users who are not proficient in English also face hurdles in operating their personal digital devices.
[0003] With the advent of speech recognition, virtual assistants have become available for use with personal devices such as smartphones, tablets, etc., which use natural language processing (NLP) to match user text or voice input to executable commands. They are used for simple applications such as providing information on the weather, playing music and videos, etc. However, they are limited in their capabilities to interact with digital screens to implement processes, require substantial manual input using input devices, and are slow and inaccurate. Existing systems also do not support multilingual capability to assist users in interacting with the digital screens in their natural language or desired language.
[0004] Accordingly, there remains a need for a system and a method for enabling users to effectively interact with the digital screens using their voice inputs in real time.
SUMMARY
[0005] In view of the foregoing, an embodiment herein provides a method for interacting with a digital screen of a voice based digital screen interaction device using a voice input of a user. The method includes (a) generating a database that stores information comprising a natural language of a user; (b) identifying a type of a program that runs on a digital screen of the voice based digital screen interaction device; (c) parsing the digital screen as an image in real time using an image processing technique to identify an overall layout of the digital screen, wherein the image processing technique removes unnecessary images and graphics that are embedded in the digital screen; (d) determining attributes by analysing the overall layout of the digital screen; (e) transforming the attributes of the digital screen from a machine-readable language to the natural language of the user using a natural language processing technique, wherein the natural language of the user is a human spoken language; (f) providing the attributes in the natural language of the user on the voice based digital screen interaction device for enabling the user to input a voice input, wherein each attribute comprises a label, an input field and their semantics; (g) receiving the voice input of the user in the natural language of the user from the voice based digital screen interaction device, wherein the voice input of the user is processed to eliminate unnecessary background noises from the voice input; (h) transforming the voice input from the natural language of the user to the machine-readable language using the natural language processing technique; (i) determining a first attribute to be populated based on the voice input irrespective of the position of the first attribute in the digital screen; (j) identifying a proximity type between a label and an input field associated with the first attribute to associate the label with the input field based on the identified proximity type; (k) analysing semantics of the label and their corresponding input field based on the identified proximity type; (l) generating a command based on the voice input and the first attribute to populate the voice input in the input field associated with the first attribute; and (m) populating the voice input in the input field associated with the first attribute by executing the generated command on the voice based digital screen interaction device using a command execution technique.
[0006] In one embodiment, the method includes accessing functions selected from a group comprising Home, Tools, View, Message, Submit, Save, Backspace, Pagedown, Pageup based on the voice input to operate the voice based digital screen interaction device.
[0007] In another embodiment, the method includes (a) analysing a behaviour of the user, using a machine learning algorithm, based on the user’s interactions with different digital screens of the voice based digital screen interaction device; (b) generating data analytics on which digital screens or attributes the user spends the maximum amount of time interacting with; and (c) automatically prepopulating input fields associated with the attributes with standard values based on the user behaviour analysis and data analytics.
[0008] In yet another embodiment, the method includes implementing a voice authentication process to authenticate authorized users to access digital screens that are secured.
[0009] In yet another embodiment, the method includes (a) verifying the voice input of the user when a format of received input data for an input field is different from an expected input data format; and (b) displaying or playing the attributes of the digital screen to the user in the natural language of the user.
[0010] In yet another embodiment, the voice input of the user is captured using a microphone associated with the voice based interaction device.
[0011] In yet another embodiment, the method includes eliminating grammatical errors from the voice input using the natural language processing technique to determine input data corresponding to different input fields.
[0012] In yet another embodiment, the method includes implementing a machine learning algorithm to interpret abbreviations, synonyms, and other forms of pronouncing the labels and to analyse standard values provided by the user for prepopulating the respective input fields.
[0013] In yet another embodiment, the attributes include a static text, a menu bar option, a tool bar option, an audio, a video, a label, an input field and/or a semantics on the digital screen and the input fields include a text box, a radio button, a drop-down list box, a check box, or a multiple choice to select input data.
[0014] In one aspect, a system for interacting with a digital screen of a voice based digital screen interaction device using a voice input of a user is provided. The system comprises a memory and a processor. The memory stores a set of instructions. The memory comprises a database that stores information associated with a natural language of a user. The processor executes the set of instructions to perform: (a) identifying a type of a program that runs on a digital screen of the voice based digital screen interaction device; (b) parsing the digital screen as an image in real time using an image processing technique to identify an overall layout of the digital screen, wherein the image processing technique removes unnecessary images and graphics that are embedded in the digital screen; (c) determining attributes by analysing the overall layout of the digital screen; (d) transforming the attributes of the digital screen from a machine-readable language to the natural language of the user using a natural language processing technique, wherein the natural language of the user is a human spoken language; (e) providing the attributes in the natural language of the user on the voice based digital screen interaction device for enabling the user to input a voice input, wherein each attribute comprises a label, an input field and their semantics; (f) receiving the voice input of the user in the natural language of the user from the voice based digital screen interaction device, wherein the voice input of the user is processed to eliminate unnecessary background noises from the voice input; (g) transforming the voice input from the natural language of the user to the machine-readable language using the natural language processing technique; (h) determining a first attribute to be populated based on the voice input irrespective of the position of the first attribute in the digital screen; (i) identifying a proximity type between a label and an input field associated with the first attribute to associate the label with the input field based on the identified proximity type; (j) analysing semantics of the label and their corresponding input field based on the identified proximity type; (k) generating a command based on the voice input and the first attribute to populate the voice input in the input field associated with the first attribute; and (l) populating the voice input in the input field associated with the first attribute by executing the generated command on the voice based digital screen interaction device using a command execution technique.
[0015] The system enhances the performance and productivity of the user through voice-input-based interaction. The system does not require any external hardware to allow the user to interact with the digital screen or to provide the input data for various input fields. The system is quick and accurate enough to parse, populate and identify the input fields in a newly opened digital screen, so that the user can interact with the newly opened digital screen in real time. The system allows users to interact with the digital screens in data-entry jobs using voice input. The system helps users who are not proficient in (i) the machine-readable language and (ii) operating the voice based digital screen interaction device to interact effectively with the digital screen.
[0016] These and other aspects of the embodiments herein will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following descriptions, while indicating preferred embodiments and numerous specific details thereof, are given by way of illustration and not of limitation. Many changes and modifications may be made within the scope of the embodiments herein without departing from the spirit thereof, and the embodiments herein include all such modifications.
BRIEF DESCRIPTION OF THE DRAWINGS
[0017] The embodiments herein will be better understood from the following detailed description with reference to the drawings, in which:
[0018] FIG. 1 illustrates a system view of a user interacting with a digital screen using a voice based digital screen interaction device that communicates with a voice interaction enabling server through a network according to an embodiment herein;
[0019] FIG. 2 illustrates an exploded view of the voice based digital screen interaction device of FIG. 1 according to an embodiment herein;
[0020] FIG. 3 illustrates an exploded view of the voice interaction enabling server of FIG. 1 according to an embodiment herein;
[0021] FIG. 4 illustrates an exemplary view of a digital screen with a menu bar having sub-options for messaging that the user interacts with to select a sub-option in real time by providing voice input according to an embodiment herein;
[0022] FIG. 5 illustrates an exemplary view of a user interacting with digital screens in real time according to an embodiment herein;
[0023] FIGS. 6A-6B are flow diagrams illustrating a method for interacting with the digital screen in the voice based digital screen interaction device using voice input according to an embodiment herein;
[0024] FIGS. 7A and 7B are flow diagrams illustrating a method for generating commands using the voice interaction enabling server based on voice input according to an embodiment herein; and
[0025] FIG. 8 shows a diagrammatic representation of a computer system within which a set of instructions for causing the machine to perform any one or more of the methodologies discussed herein may be executed in accordance with an embodiment herein.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
[0026] The embodiments herein and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments herein. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein may be practiced and to further enable those of skill in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.
[0027] As mentioned, there remains a need for a system for enhancing the performance/ productivity of the user. The embodiments herein achieve this by providing a voice based digital screen interaction device and a voice interaction enabling server that enables a user to effectively interact with a digital screen using voice input. Referring now to the drawings, and more particularly to FIGS. 1 through 8, where similar reference characters denote corresponding features consistently throughout the figures, preferred embodiments are shown.
[0028] FIG. 1 illustrates a system view of a user 102 interacting with a digital screen using a voice based digital screen interaction device 104 that communicates with a voice interaction enabling server 108 through a network 106 according to an embodiment herein. The voice based digital screen interaction device 104 identifies a type of a program that runs on its digital screen. The voice based digital screen interaction device 104 may be any of a mobile phone, a tablet, a laptop, a desktop computer etc. The program may be an ERP or CRM application, an operating system, a mobile application, a website etc. The voice based digital screen interaction device 104 parses the digital screen as an image in real time using image processing techniques (e.g. image preprocessing, image restoration, image compression, image filtering and/or image analysis) to identify an overall layout of the digital screen. The digital screen may be (a) a web portal page, (b) an e-filing form, (c) a mobile application, (d) a user interface of an operating system, etc. The digital screen is internally processed using the image processing techniques, such as image preprocessing, image filtering and image compression, for the removal of noise, for accurate image visibility and to identify the overall layout of the digital screen. From the overall layout of the digital screen, the voice based digital screen interaction device 104 determines attributes of the digital screen by analyzing it. The attributes may include static text, menu bar options, tool bar options, audio, video, labels, input fields and/or semantics on the digital screen. The input fields may include text boxes, radio buttons, drop down list boxes, check boxes, etc. The menu bar options may include FILE, VIEW, EDIT, TOOLS, MESSAGE, etc. In one embodiment, the voice based digital screen interaction device 104 uses the image processing technique to remove unnecessary images and graphics that are embedded in the digital screen.
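As an illustration of the screen-parsing step in [0028], the following sketch captures a rendered screen image, recovers label text with OCR, and treats wide, shallow rectangles as candidate input fields. OpenCV and Tesseract stand in for the patent's unspecified image processing techniques, and the shape heuristic is an editorial assumption.

```python
import cv2
import pytesseract

def parse_screen(path: str):
    """Return (labels, fields) recovered from a screenshot of a digital screen."""
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # OCR pass: recover label text together with its position on the screen.
    ocr = pytesseract.image_to_data(gray, output_type=pytesseract.Output.DICT)
    labels = [
        {"text": t, "box": (x, y, w, h)}
        for t, x, y, w, h in zip(ocr["text"], ocr["left"], ocr["top"],
                                 ocr["width"], ocr["height"])
        if t.strip()
    ]

    # Contour pass: wide, shallow rectangles are treated as candidate input
    # fields; other shapes (decorative images, graphics) are simply dropped.
    _, thresh = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    fields = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w > 3 * h and h > 10:          # heuristic: text-box-like shape
            fields.append({"box": (x, y, w, h)})
    return labels, fields
```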
[0029] The voice based digital screen interaction device 104 identifies a proximity type between the labels and the input fields. The proximity type may include (a) adjacent proximity, (b) up and down proximity and/or (c) near-by location proximity. The up and down proximity may typically occur when the digital screen is displayed on a computing device that has a smaller screen. The voice based digital screen interaction device 104 analyzes the semantics of the labels and associates the labels with corresponding input fields based on the identified proximity type. The input fields may include multiple choices to select input data. The voice based digital screen interaction device 104 associates the multiple choices with corresponding input fields in real time. For example, if an input field associated with the label "HOBBIES" includes multiple choices like "PLAYING", "DRAWING", "READING", etc., the voice based digital screen interaction device 104 associates the above-mentioned multiple choices with the label "HOBBIES" in real time.
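A minimal sketch of such proximity-based association, assuming each label and input field carries the bounding box recovered during screen parsing (the distance thresholds are illustrative assumptions), may read:

    def associate_labels(labels, fields, max_gap=50):
        """Pair each label with the nearest field by adjacent, up-and-down,
        or near-by location proximity."""
        associations = []
        for label in labels:
            lx, ly, lw, lh = label["box"]
            best, best_dist, best_type = None, float("inf"), None
            for field in fields:
                fx, fy, fw, fh = field["box"]
                if abs(fy - ly) < lh and 0 <= fx - (lx + lw) <= max_gap:
                    dist, ptype = fx - (lx + lw), "adjacent"
                elif abs(fx - lx) < lw and 0 <= fy - (ly + lh) <= max_gap:
                    dist, ptype = fy - (ly + lh), "up-and-down"
                else:
                    dist = abs(fx - lx) + abs(fy - ly)  # near-by location
                    ptype = "near-by"
                if dist < best_dist:
                    best, best_dist, best_type = field, dist, ptype
            associations.append((label["text"], best, best_type))
        return associations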
[0030] The voice based digital screen interaction device 104 receives voice input from the user 102 using a microphone or any other mechanism to receive the voice input/commands. The voice based digital screen interaction device 104 may receive the voice input in any order based on a preferred communication style or format of the user 102. The voice based digital screen interaction device 104 communicates the voice input and attributes of the digital screen to a voice interaction enabling server 108. The voice interaction enabling server 108 may eliminate grammatical errors from the voice input using a natural language processing technique to determine data corresponding to different fields. Natural language processing is a field of artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to fruitfully process large amounts of natural language data. The natural language processing technique employs machine learning algorithms (implemented, for example, in Python) to eliminate grammatical errors from the voice input in order to determine data corresponding to different fields. There are multiple open source libraries (e.g. Apache OpenNLP, Natural Language Toolkit (NLTK), Stanford NLP, MALLET) available to provide algorithmic building blocks in real time. The embodiments herein may implement the natural language processing technique using, for example, the open source Natural Language Toolkit (NLTK) for Python. For example, the voice interaction enabling server 108 may receive the voice input as "MY FIRST NAME IS JOHN" or "FIRST NAME" and "JOHN" to determine that the data corresponding to the field "FIRST NAME" is "JOHN". The voice interaction enabling server 108 may also receive the voice input exactly as the input data or in a spelled format of the input data. For example, the voice interaction enabling server 108 receives the voice input in the format of "JOHN" or "J-O-H-N". The voice interaction enabling server 108 identifies the attributes irrespective of the order of the attributes in the digital screen and generates commands to populate the input data in the associated input fields. For example, for the labels "FIRST NAME", "MIDDLE NAME" and "LAST NAME", the voice interaction enabling server 108 may receive the input data first for the label "LAST NAME", second for the label "FIRST NAME" and last for the label "MIDDLE NAME", and generate commands to populate the associated input fields with the respective input data irrespective of the order in which the inputs were received. In an embodiment, using machine learning algorithms, the voice interaction enabling server 108 may learn how to understand and interpret abbreviations, synonyms, and other forms of pronouncing the labels. For example, inside an Oracle ERP system, there could be a label "PO" and the user 102 may provide input corresponding to the label either as "PURCHASE ORDER" or "PO".
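As a minimal sketch of how a transcribed utterance might be mapped to a (label, value) pair, assuming the label names have already been recovered from the screen and that NLTK's tokenizer is available; the matching rule is an illustrative assumption, not the claimed natural language processing technique:

    import nltk  # nltk.download("punkt") is needed once for word_tokenize

    def extract_field_and_value(utterance, known_labels):
        tokens = [t.upper() for t in nltk.word_tokenize(utterance)]
        for label in known_labels:              # e.g. "FIRST NAME"
            label_tokens = label.upper().split()
            for i in range(len(tokens) - len(label_tokens) + 1):
                if tokens[i:i + len(label_tokens)] == label_tokens:
                    rest = tokens[i + len(label_tokens):]
                    if rest and rest[0] == "IS":  # "MY FIRST NAME IS JOHN"
                        rest = rest[1:]
                    # spelled input such as "J-O-H-N" collapses to "JOHN"
                    value = " ".join(rest).replace("-", "")
                    return label, value
        return None, None

    print(extract_field_and_value("MY FIRST NAME is JOHN",
                                  ["FIRST NAME", "LAST NAME"]))
    # -> ('FIRST NAME', 'JOHN')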
[0031] The voice interaction enabling server 108 further analyzes the voice input and the attributes of the digital screen. The voice input may be in a natural language (e.g. native tongue) of the user 102. The voice interaction enabling server 108 processes the voice input of the user 102 to eliminate unnecessary background noises from the voice input. The attributes of the digital screen may be in a machine readable language. The machine-readable language may be any language, for example Extensible Markup Language (XML), in which the digital screen is intended to be filled according to legal requirements, jurisdictional requirements, etc. The voice interaction enabling server 108 translates (i) the voice input from a natural language to a machine readable language and (ii) the attributes of the digital screen from the machine readable language to the natural language of the user 102 (e.g. native tongue of the user 102) using the natural language processing technique. The voice interaction enabling server 108 communicates the translated attributes to the voice based digital screen interaction device 104. The voice based digital screen interaction device 104 may display or play the attributes using a display or a speaker respectively, when the user 102 requests the voice interaction enabling server 108. The voice interaction enabling server 108 may assist the user 102 using the natural language processing technique (a) to understand the semantics of the attributes and (b) to provide the voice input in the natural language. The natural language may be any human spoken language or native tongue of the user 102. The natural language is not a programming language.
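A minimal sketch of surfacing machine-readable attributes as natural language prompts, assuming the attributes of the digital screen arrive as XML (the tag names and prompt wording are illustrative assumptions):

    import xml.etree.ElementTree as ET

    FORM_XML = """
    <screen name="registration">
      <field label="FIRST NAME" type="text"/>
      <field label="COUNTRY" type="dropdown"/>
    </screen>
    """

    def attributes_to_prompts(xml_text):
        root = ET.fromstring(xml_text)
        prompts = []
        for field in root.iter("field"):
            # Rendered as text or synthesized speech in the user's natural language.
            prompts.append(f'Please provide your {field.get("label").title()}.')
        return prompts

    print(attributes_to_prompts(FORM_XML))
    # -> ['Please provide your First Name.', 'Please provide your Country.']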
[0032] The user 102 may interact with the digital screens several times. The voice interaction enabling server 108 may learn and analyze behavior (e.g. which may include an accent or an individual style or pattern of providing the voice inputs) of the user 102 from the digital screens that are interacted with by the user 102, using a machine learning algorithm. The voice interaction enabling server 108, using the machine learning algorithm, may (i) learn and analyze standard values (e.g. JOHN, SMITH, INDIA, etc.) provided by the user 102 for the respective input fields (e.g. first name, last name, country, etc.) and (ii) generate commands to automatically pre-populate those standard values in the respective input fields to reduce input time for the user 102 during subsequent interactions with the digital screens. The voice based digital screen interaction device 104 includes a voice based interaction database that stores a repository of the digital screens, the standard values and the automatically pre-populated input fields, etc. For example, if the user 102, who may be from the country "INDIA", repeatedly fills the label "COUNTRY" with the standard value "INDIA", then the voice interaction enabling server 108, using the machine learning algorithm, learns the standard value "INDIA" for the label "COUNTRY" for that user 102 and generates commands to automatically pre-populate the standard value "INDIA" in the input field of the label "COUNTRY" in the digital screen during subsequent interactions of that user 102 with that digital screen. In another example, if the user 102 of a "FEMALE" gender repeatedly fills the label "GENDER" with the standard value "FEMALE", then the voice interaction enabling server 108 learns the standard value "FEMALE" for the label "GENDER" and generates commands to automatically pre-populate an input field of the label "GENDER" with the standard value "FEMALE". Similarly, the voice interaction enabling server 108 may analyze the semantics of the input data in the label "FIRST NAME" and generate commands to automatically pre-populate an input field of the label "GENDER" with appropriate data. For example, the voice interaction enabling server 108 analyses the semantics of the input data "JOHN" associated with the label "FIRST NAME" and generates commands to automatically pre-populate an input field of the label "GENDER" with the data "MALE". The voice based digital screen interaction device 104 may delete, edit or change the automatically pre-populated standard values based on voice input of the user 102. In an embodiment, based on a location of the voice based digital screen interaction device 104, the voice interaction enabling server 108 automatically populates the input field of the label "COUNTRY". For example, if a location of the voice based digital screen interaction device 104 is identified as "INDIA" using the Global Positioning System, the voice interaction enabling server 108 automatically populates the input field of the label "COUNTRY" as "INDIA" in the digital screen. Similarly, based on the location of the voice based digital screen interaction device 104, the voice interaction enabling server 108 determines how a word has to be spelled in the display/speaker of the voice based digital screen interaction device 104. For example, the spelling is "COLOR" in the US whereas the spelling is "COLOUR" in India. If the location of the voice based digital screen interaction device 104 is identified as the US, then the voice interaction enabling server 108 populates the input field with the spelling used in the US.
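A minimal sketch of learning such per-user standard values, assuming a history of (label, value) pairs from earlier interactions; the majority-vote rule below stands in for the machine learning algorithm and is an assumption:

    from collections import Counter, defaultdict

    class StandardValueLearner:
        def __init__(self):
            self.history = defaultdict(Counter)  # label -> value frequencies

        def observe(self, label, value):
            self.history[label][value] += 1

        def prepopulate(self, labels, min_count=3):
            """Return values to pre-populate for fields the user fills the
            same way repeatedly; the user can still edit them by voice."""
            filled = {}
            for label in labels:
                if self.history[label]:
                    value, count = self.history[label].most_common(1)[0]
                    if count >= min_count:
                        filled[label] = value
            return filled

    learner = StandardValueLearner()
    for _ in range(3):
        learner.observe("COUNTRY", "INDIA")
    print(learner.prepopulate(["COUNTRY", "FIRST NAME"]))  # -> {'COUNTRY': 'INDIA'}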
[0033] The voice interaction enabling server 108 may provide data analytics on which digital screens the user 102 is spending the maximum amount of time interacting with, or on which input fields are not being populated by the user 102. The voice interaction enabling server 108 may provide options to the user 102 to stop or start interacting with the digital screen using voice input. In an example embodiment, the voice interaction enabling server 108 may allow the user 102 to start or stop interacting with the digital screen using the voice input "START" or "STOP" respectively. The voice based digital screen interaction device 104 may also allow the user 102 to enter the inputs through a keyboard or a mouse.
[0034] The voice interaction enabling server 108 may analyze an accent of the user 102 and an individual style or pattern of the user 102 in providing the voice input. The voice interaction enabling server 108 may employ a voice authentication process which authenticates authorized users to fill/interact with secured digital screens. The voice interaction enabling server 108 may use the voice of the user 102 as an authentication feature to allow the authorized users to fill the secured digital screens. The authentication feature may be (i) a voice pitch, (ii) a voice tone or (iii) a voice accent.
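A minimal sketch of such voice authentication, assuming an enrolled reference recording per authorized user and the open source librosa library; comparing averaged MFCC vectors is an illustrative stand-in for pitch/tone/accent verification, not a production speaker-verification system:

    import numpy as np
    import librosa

    def voiceprint(path, sr=16000):
        audio, _ = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20)
        return mfcc.mean(axis=1)  # one averaged feature vector per speaker

    def is_authorized(enrolled_path, attempt_path, threshold=0.9):
        a, b = voiceprint(enrolled_path), voiceprint(attempt_path)
        # Cosine similarity between the enrolled and attempted voiceprints.
        similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        return similarity >= threshold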
[0035] The voice interaction enabling server 108 may generate commands based on the analysis of the voice input of the user 102 and the attributes of the digital screen. The voice interaction enabling server 108 may communicate the generated commands to the voice based digital screen interaction device 104. The voice based digital screen interaction device 104 receives and executes the commands using a command execution technique. The command execution technique includes well known techniques such as command injection, i.e. the insertion of code (e.g. UNIX commands) into dynamically generated output, which helps in managing the data. The command execution may include (i) populating the input fields of the labels with the respective input data based on the voice input of the user 102, (ii) automatically pre-populating the input fields with the standard values based on the behavior analysis and data analytics associated with the user 102, (iii) verifying the voice input or the input data with the user 102, (iv) accessing functions like HOME, TOOLS, VIEW, MESSAGE, SUBMIT, SAVE, BACKSPACE, PAGEDOWN, PAGEUP, etc. to effectively operate the voice based digital screen interaction device 104, (v) providing an authentication to the user 102 to access the secured digital screen or (vi) displaying or playing the attributes of the digital screen or input data to the user 102 in the natural language of the user 102.
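A minimal sketch of the device-side command execution, assuming the server sends structured commands and that the screen object exposes the handlers shown (both assumptions for illustration):

    def execute_command(command, screen):
        action = command["action"]
        if action == "POPULATE":
            screen.set_field(command["label"], command["value"])
        elif action == "PREPOPULATE":
            for label, value in command["values"].items():
                screen.set_field(label, value)
        elif action in ("HOME", "TOOLS", "VIEW", "MESSAGE", "SUBMIT",
                        "SAVE", "BACKSPACE", "PAGEDOWN", "PAGEUP"):
            screen.invoke(action)        # built-in device/application function
        elif action == "SPEAK":
            screen.play(command["text"])  # read an attribute back to the user
        else:
            raise ValueError(f"unknown command: {action}")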
[0036] The voice based digital screen interaction device 104 may play or display the attributes of the digital screen or input data to the user 102 in the natural language, using the speaker/display, when the user 102 requests to play/display the input data entered in the input field. The voice based digital screen interaction device 104 may play or display the attribute of the digital screen irrespective of a type of font, a color of the attribute, a size of the attribute, etc. The voice based digital screen interaction device 104 may verify the voice input or the input data with the user 102 when a format of received input data for an input field is different from an expected input data format. For example, for the label "MOBILE", the input data format expected by the voice interaction enabling server 108 is a number format. If the user 102 provides the input data in a text format for the label "MOBILE", the voice interaction enabling server 108 verifies and notifies the user 102 through the voice based digital screen interaction device 104 to provide the input data in the number format. In another example, the digital screen may include mandatory fields to be filled in by the user 102. If the user 102 clicks the "SUBMIT" option without filling the mandatory fields, then the voice interaction enabling server 108 verifies and notifies the user 102 through the voice based digital screen interaction device 104 to fill the mandatory fields.
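A minimal sketch of such verification, assuming each field declares an expected format; the regular expressions and mandatory-field set are illustrative assumptions:

    import re

    EXPECTED_FORMATS = {
        "MOBILE": r"^\d{10}$",        # number format expected
        "FIRST NAME": r"^[A-Za-z]+$",
    }
    MANDATORY = {"FIRST NAME", "MOBILE"}

    def verify(fields):
        """Return spoken notifications for format errors and unfilled
        mandatory fields, in the style described above."""
        notices = []
        for label, value in fields.items():
            pattern = EXPECTED_FORMATS.get(label)
            if pattern and value and not re.match(pattern, value):
                notices.append(f"Please provide {label} in the expected format.")
        for label in MANDATORY:
            if not fields.get(label):
                notices.append(f"The mandatory field {label} is not filled.")
        return notices

    print(verify({"FIRST NAME": "JOHN", "MOBILE": "ninety eight"}))
    # -> ['Please provide MOBILE in the expected format.']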
[0037] In an embodiment, one or more voice based digital screen interaction devices are communicatively coupled with the voice interaction enabling server 108. The one or more voice based digital screen interaction devices receive the voice input from one or more users and communicate the voice inputs of the one or more users to the voice interaction enabling server 108 for further processing as described above. In another embodiment, the one or more voice based digital screen interaction devices communicate with a voice based digital screen interactive cloud model instead of the voice interaction enabling server 108. In yet another embodiment, the voice based digital screen interaction device 104 performs the functions of the voice interaction enabling server 108 as described above, including processing the voice input obtained from the user 102 and the attributes of the digital screen, generating commands based on analysis of the voice input and the attributes of the digital screen, etc. In an embodiment, the voice interaction enabling server 108 may include a default setting if the user 102 uses the voice interaction enabling server 108 for the first time. The voice interaction enabling server 108 then starts learning the attributes of the digital screens that the user 102 is interacting with, using the machine learning algorithm, and saves those attributes with his/her profile. If the voice interaction enabling server 108 is being used by multiple users, then the voice interaction enabling server 108 creates a profile for each of the users using an identification of each user.
[0038] FIG. 2 illustrates an exploded view of the voice based digital screen interaction device 104 of FIG. 1 according to an embodiment herein. The voice based digital screen interaction device 104 includes a voice based interaction database 202, a program identification module 204, a screen parsing module 206, a label and input field association module 208, a voice input obtaining module 210, a voice input verification module 212, an attributes and voice input communication module 214, a command receiving module 216 and a command execution module 218. The voice based digital screen interaction device 104 may include a voice input swapping module.
[0039] The program identification module 204 identifies a type of the program that runs on a digital screen of the voice based digital screen interaction device 104. The screen parsing module 206 parses the digital screen as an image in real time using an image processing technique to identify the overall layout of the digital screen. The screen parsing module 206 may determine the attributes of the digital screen by analyzing the layout of the digital screen. The screen parsing module 206 may remove unnecessary images and graphics that are embedded in the digital screen for better interaction. The label and input field association module 208 identifies a proximity type between the labels and the input fields. The label and input field association module 208 associates the labels with the respective input fields based on (i) the identified proximity type and (ii) the analysis of the semantics of the labels and corresponding input fields. The voice input obtaining module 210 obtains the voice input from the user 102 in natural language (e.g. native tongue) using the microphone or any other mechanism to receive the voice input. The voice input may include any one of (i) utterance of the name of the label followed by the input data to be entered in the corresponding input field and (ii) utterance of the phrase "CLICK ON" followed by at least one of (a) a menu bar option, (b) a tool bar option or (c) functions like EDIT, BACKSPACE, DELETE, SUBMIT, SAVE, etc. The voice input may further include utterance of the phrase "TRANSLATE" followed by at least one of (a) a screen name, (b) a label name and (c) an input field name, followed by a name of the natural language. In an embodiment, the voice input obtaining module 210 eliminates the unnecessary background noises from the voice input of the user 102. The voice input verification module 212 verifies the voice input of the user 102 when a format of received input data for an input field is different from an expected input data format. The voice input verification module 212 further verifies (i) a voice pitch, (ii) a voice tone and (iii) a voice accent of the user 102 (e.g. authorized user) against the received voice input for authentication. The attributes and voice input communication module 214 communicates the voice input and the attributes of the digital screen to the voice interaction enabling server 108 to generate commands based on the analysis of the voice input of the user 102 and the attributes of the digital screen.
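A minimal sketch of recognizing the utterance patterns listed above; the three-way split and the "TO <language>" convention for TRANSLATE are illustrative assumptions about the grammar:

    def classify_utterance(utterance, labels):
        words = utterance.strip().upper()
        if words.startswith("CLICK ON "):
            return ("CLICK", words[len("CLICK ON "):])     # menu/tool bar option or function
        if words.startswith("TRANSLATE "):
            rest = words[len("TRANSLATE "):].rsplit(" TO ", 1)
            target = rest[1] if len(rest) == 2 else None   # e.g. "TRANSLATE SCREEN 1 TO HINDI"
            return ("TRANSLATE", rest[0], target)
        for label in sorted(labels, key=len, reverse=True):
            if words.startswith(label.upper() + " "):      # label followed by input data
                return ("FILL", label, words[len(label) + 1:])
        return ("UNKNOWN", words)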
[0040] The command receiving module 216 receives a command that is generated based on the voice input and the attributes of the digital screen from the voice interaction enabling server 108. The command execution module 218 executes the command that is generated based on the voice input using a command execution technique.
[0041] The command execution module 218 may delete, edit or change the input data in an input field that is automatically pre-populated with standard values, based on voice input of the user 102. The voice based interaction database 202 stores a repository of the digital screens, the standard values filled in different digital screens by the user 102, the automatically pre-populated input fields, etc. The voice input swapping module provides options to the user 102 to swap between the voice input based interaction and a traditional way of interaction (e.g. through keyboard or mouse) with the digital screen.
[0042] FIG. 3 illustrates an exploded view of the voice interaction enabling server 108 of FIG. 1 according to an embodiment herein. The voice interaction enabling server 108 includes a voice based command execution database 302, an attribute and voice input receiving module 304, a natural language module 306, an attribute identification module 308, a behavior analysis module 310, a data analytics generation module 312, a command generation module 314 and a command communication module 316. The attribute and voice input receiving module 304 receives the voice input and the attributes of the digital screen from the voice based digital screen interaction device 104. The natural language module 306 analyzes the received voice input and translates the voice input from a natural language (e.g. native tongue of the user 102) to a machine readable language. The attribute identification module 308 identifies, from the received attributes of the digital screen and based on the analyzed voice input, an attribute in which the user 102 intends to enter input data or execute a command. The attribute identification module 308 may identify such attributes irrespective of the position of the attributes in the digital screen. The behavior analysis module 310 analyzes the behavior of the user 102 (e.g. which may include an accent or an individual style or pattern of providing the voice input) from the digital screens that are filled in/interacted with by the user 102, using a machine learning algorithm (implemented, for example, in MATLAB or C++). The data analytics generation module 312 generates data analytics on which digital screens or attributes the user 102 is spending the maximum amount of time interacting with/entering the input data. The command generation module 314 generates commands based on the voice input, the data analytics of the user's interaction with the different digital screens, the voice input verification and the behavior analysis of the user 102. The command communication module 316 communicates the generated commands to the voice based digital screen interaction device 104 through a wireless or any other network. The voice based command execution database 302 stores (i) a voice pitch of the user 102, (ii) a voice tone of the user 102, (iii) a voice accent of the user 102, (iv) standard values, (v) data analytics of the user's interaction with the different digital screens, etc. In an embodiment, the wireless network may be (a) a WIFI network, (b) a Bluetooth network or (c) any other network.
[0043] FIG. 4 illustrates an exemplary view of a digital screen with a menu bar having sub-options for messaging that the user interacts with to select a sub-option in real time by providing voice input according to an embodiment herein. The menu bar comprises an option "MESSAGE" 402, which in turn provides/displays sub-options such as (i) NEW MESSAGE, (ii) NEW MESSAGE USING, (iii) REPLY TO SENDER, (iv) REPLY TO ALL, (v) REPLY TO GROUP, and (vi) FORWARD when the user 102 accesses/clicks the option "MESSAGE" 402. The voice interaction enabling server 108 analyzes, parses and associates the sub-options with the option "MESSAGE" 402 in real time and allows the user 102 to select a sub-option in real time using his voice input.
[0044] FIG. 5 illustrates an exemplary view of a user interacting with one or more digital screens in real time according to an embodiment herein. The voice based digital screen interaction device 104 may provide options to the user 102 to interact simultaneously with more than one digital screen and switch between a first digital screen and a second digital screen. The voice based digital screen interaction device 104 provides options to the user 102 to give each digital screen a name. The voice based digital screen interaction device 104 receives a voice input that includes a name of the digital screen followed by a label from the user 102 and addresses that digital screen based on the voice input. For example, when the user 102 is interacting simultaneously with two digital screens, for example "SCREEN 1" and "SCREEN 2", and wants to fill a field on "SCREEN 1", the user 102 can provide his voice input as "SCREEN 1" followed by a name of the attribute/label and followed by the input data to be entered/executed in the input field associated with the attribute/label.
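A minimal sketch of naming and routing between simultaneously open digital screens, assuming a simple registry keyed by the spoken screen name (an illustrative assumption):

    class ScreenRegistry:
        def __init__(self):
            self.screens = {}      # spoken name -> screen object
            self.active = None

        def name_screen(self, spoken_name, screen):
            self.screens[spoken_name.upper()] = screen

        def route(self, utterance):
            """'SCREEN 1 FIRST NAME JOHN' switches to SCREEN 1 and forwards
            the rest of the utterance for label/value extraction."""
            for name, screen in self.screens.items():
                if utterance.upper().startswith(name + " "):
                    self.active = screen
                    return screen, utterance[len(name) + 1:]
            return self.active, utterance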
[0045] FIGS. 6A-6B are flow diagrams illustrating a method for interacting with a digital screen in the voice based digital screen interaction device 104 using voice input according to an embodiment herein. At step 602, attributes of a digital screen are received from a voice based digital screen interaction device 104. At step 604, the attributes of the digital screen are transformed from a machine-readable language to the natural language of the user 102 using a natural language processing technique. The natural language of the user 102 is a human spoken language. At step 606, the attributes in the natural language of the user 102 are provided on the voice based digital screen interaction device 104 for enabling the user 102 to input a voice input. Each attribute comprises a label, an input field and their semantics. At step 608, the voice input of the user 102 is received from the voice based digital screen interaction device 104. At step 610, the voice input is transformed from the natural language of the user 102 to the machine-readable language using the natural language processing technique. At step 612, a first attribute to be populated is determined based on the voice input irrespective of an order of a position of the first attribute in the digital screen. At step 614, a proximity type between a label and an input field associated with the first attribute is identified to associate the label with the input field based on the identified proximity type. At step 616, the semantics of the label and its corresponding input field are analyzed based on the identified proximity type. At step 618, a command is generated based on the voice input and the first attribute, to populate the voice input in the input field associated with the first attribute. At step 620, the voice input is populated in the input field associated with the first attribute by executing the generated command on the voice based digital screen interaction device 104 using a command execution technique.
[0046] FIGS. 7A and 7B are flow diagrams illustrating a method for generating commands using the voice interaction enabling server 108 based on voice input according to an embodiment herein. At step 702, a type of a program that runs on a digital screen of the voice based digital screen interaction device 104 is identified. At step 704, the digital screen is parsed as an image in real time using an image processing technique to identify an overall layout of the digital screen. The image processing technique removes unnecessary images and graphics that are embedded in the digital screen. At step 706, attributes are determined by analysing the overall layout of the digital screen. At step 708, the attributes of the digital screen are transformed from a machine-readable language to the natural language of the user using a natural language processing technique. The attributes comprise a static text, a menu bar option, a tool bar option, an audio, a video, a label, an input field and/or a semantics on the digital screen, and the input fields comprise a text box, a radio button, a drop down list box, a check box, or a multiple choice to select input data. The natural language of the user is a human spoken language. At step 710, the attributes are provided in the natural language of the user 102 on the voice based digital screen interaction device 104 for enabling the user 102 to input a voice input. Each attribute comprises a label, an input field and their semantics. At step 712, the voice input of the user 102 is received in the natural language of the user 102 from the voice based digital screen interaction device 104. The voice input of the user 102 is processed to eliminate unnecessary background noises from the voice input. At step 714, the voice input is transformed from the natural language of the user 102 to the machine-readable language using the natural language processing technique. At step 716, a first attribute to be populated is determined based on the voice input irrespective of an order of a position of the first attribute in the digital screen. At step 718, a proximity type between a label and an input field associated with the first attribute is identified to associate the label with the input field based on the identified proximity type. At step 720, the semantics of the label and its corresponding input field are analysed based on the identified proximity type. At step 722, a command is generated based on the voice input and the first attribute, to populate the voice input in the input field associated with the first attribute. At step 724, the voice input is populated in the input field associated with the first attribute by executing the generated command on the voice based digital screen interaction device 104 using a command execution technique.
[0047] FIG. 8 shows a diagrammatic representation of a computer system within which a set of instructions for causing the machine to perform any one or more of the methodologies discussed herein may be executed in accordance with an embodiment herein. In alternative embodiments, the machine operates as a standalone device or may be connected (e.g., networked) to other machines. In a networked deployment, the machine may operate in the capacity of a server or a client machine in server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. The machine may be a server computer, a client computer, a personal computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, a web appliance, a network router, switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term "machine" shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.
[0048] The example computer system includes a processor 802 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), or both), a main memory 804 and a static memory 806, which communicate with each other via a bus 808. The computer system may further include a video display unit 810 (e.g., a liquid crystal display (LCD), a light emitting diode (LED) or a cathode ray tube (CRT)). The computer system also includes an alphanumeric input device 812 (e.g., a keyboard or touch screen), a disk drive unit 814 and a network interface device 816.
[0049] The disk drive unit 814 includes a machine-readable medium 818 on which is stored one or more sets of instructions 820 (e.g. software) embodying any one or more of the methodologies or functions described herein. The instructions 820 may also reside, completely or at least partially, within the main memory 804 and/or within the processor 802 during execution thereof by the computer system, the main memory 804 and the processor 802 also constituting machine-readable media. The instructions 820 may further be transmitted or received over a network 822 via the network interface device 816.
[0050] While the machine-readable medium 818 is shown in an example embodiment to be a single medium, the term "machine-readable medium" should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term "machine-readable medium" shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present disclosure. The term "machine-readable medium" shall accordingly be taken to include, but not be limited to, solid-state memories, optical and magnetic media, and carrier wave signals.
[0051] The foregoing description of the specific embodiments will so fully reveal the general nature of the embodiments herein that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments herein have been described in terms of preferred embodiments, those skilled in the art will recognize that the embodiments herein can be practiced with modification within the spirit and scope.

Claims

We claim:
1. A method for interacting with a digital screen of a voice based digital screen interaction device (104) using a voice input of a user (102), comprising:
generating a database that stores information comprising a natural language of a user
(102);
identifying a type of a program that runs on a digital screen of the voice based digital screen interaction device (104);
characterized in that,
parsing the digital screen as an image in real time using an image processing technique to identify an overall layout of the digital screen, wherein the image processing technique removes unnecessary images and graphics that are embedded in the digital screen; and
determining attributes by analysing the overall layout of the digital screen;
transforming the attributes of the digital screen from a machine-readable language to the natural language of the user (102) using a natural language processing technique, wherein the natural language of the user (102) is a human spoken language;
providing the attributes in the natural language of the user (102) on the voice based digital screen interaction device (104) for enabling the user (102) to input a voice input, wherein each attribute comprises a label, an input field and their semantics;
receiving the voice input of the user (102) in the natural language of the user (102) from the voice based digital screen interaction device (104), wherein the voice input of the user (102) is processed to eliminate unnecessary background noises from the voice input;
transforming the voice input from the natural language of the user (102) to the machine- readable language using the natural language processing technique;
determining a first attribute to be populated based on the voice input irrespective of an order of a position of the first attribute in the digital screen;
identifying a proximity type between a label and an input field associated with the first attribute to associate the label with the input field based on the identified proximity type;
analysing semantics of the label and their corresponding input field based on the identified proximity type;
generating a command based on the voice input and the first attribute, to populate the voice input in the input field associated with the first attribute; and
populating the voice input in the input field associated with the first attribute by executing the generated command on the voice based digital screen interaction device (104) using a command execution technique.
2. The method as claimed in claim 1, wherein the method comprises
accessing functions selected from a group comprising Home, Tools, View, Message, Submit, Save, Backspace, Pagedown, Pageup based on the voice input to operate the voice based digital screen interaction device (104).
3. The method as claimed in claim 1, wherein the method comprises
analysing a behaviour of the user (102), using a machine learning algorithm, based on the user’s interactions with different digital screens of the voice based digital screen interaction device (104);
generating data analytics on which digital screens or attributes that the user (102) is spending maximum amount of time in interacting with; and
automatically prepopulating input fields associated with the attributes with the standard values based on the user behaviour analysis and data analytics.
4. The method as claimed in claim 1, wherein the method comprises implementing a voice authentication process to authenticate authorized users to access the digital screens that are secured.
5. The method as claimed in claim 1, wherein the method comprises
verifying the voice input of the user (102) when a format of received input data for an input field is different from an expected input data format; and
displaying or playing the attributes of the digital screen to the user (102) in the natural language of the user (102).
6. The method as claimed in claim 1, wherein the voice input of the user (102) is captured using a microphone associated with the voice based digital screen interaction device (104).
7. The method as claimed in claim 1, wherein the method comprises eliminating grammatical errors from the voice input using the natural language processing technique to determine input data corresponding to different input fields.
8. The method as claimed in claim 1, wherein the method comprises implementing a machine learning algorithm to interpret abbreviations, synonyms, and other forms of pronouncing the labels and to analyse standard values provided by the user (102) for prepopulating in the respective input fields.
9. The method as claimed in claim 1, wherein the attributes comprise at least one of a static text, a menu bar option, a tool bar option, an audio, a video, a label, an input field and/or a semantics on the digital screen and the input fields comprise a text box, a radio button, a drop down list box, a check box, or a multiple choice to select input data.
10. A system for interacting with a digital screen of a voice based digital screen interaction device (104) using a voice input of a user (102), comprising:
a memory that stores a set of instructions, wherein said memory comprises a database that stores information associated with a natural language of a user (102); and
a processor that executes said set of instructions to perform:
identifying a type of a program that runs on a digital screen of the voice based digital screen interaction device (104);
characterized in that,
parsing the digital screen as an image in real time using an image processing technique to identify an overall layout of the digital screen, wherein the image processing technique removes unnecessary images and graphics that are embedded in the digital screen; and
determining attributes by analysing the overall layout of the digital screen;
transforming the attributes of the digital screen from a machine-readable language to the natural language of the user (102) using a natural language processing technique, wherein the natural language of the user (102) is a human spoken language;
providing the attributes in the natural language of the user (102) on the voice based digital screen interaction device (104) for enabling the user (102) to input a voice input, wherein each attribute comprises a label, an input field and their semantics;
receiving the voice input of the user (102) in the natural language of the user (102) from the voice based digital screen interaction device (104), wherein the voice input of the user (102) is processed to eliminate unnecessary background noises from the voice input;
transforming the voice input from the natural language of the user (102) to the machine-readable language using the natural language processing technique;
determining a first attribute to be populated based on the voice input irrespective of an order of a position of the first attribute in the digital screen;
identifying a proximity type between a label and an input field associated with the first attribute to associate the label with the input field based on the identified proximity type; and
analysing semantics of the label and their corresponding input field based on the identified proximity type;
generating a command based on the voice input and the first attribute, to populate the voice input in the input field associated with the first attribute; and
populating the voice input in the input field associated with the first attribute by executing the generated command on the voice based digital screen interaction device (104) using a command execution technique.
PCT/IN2019/050200 2018-03-13 2019-03-12 System and method for interacting with digitalscreensusing voice input and image processing technique WO2019175896A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN201841009230 2018-03-13
IN201841009230 2018-03-13

Publications (1)

Publication Number Publication Date
WO2019175896A1 true WO2019175896A1 (en) 2019-09-19

Family

ID=67906540

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IN2019/050200 WO2019175896A1 (en) 2018-03-13 2019-03-12 System and method for interacting with digitalscreensusing voice input and image processing technique

Country Status (1)

Country Link
WO (1) WO2019175896A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104516709A (en) * 2014-11-12 2015-04-15 科大讯飞股份有限公司 Software operation scene and voice assistant based voice aiding method and system
US20170178626A1 (en) * 2010-01-18 2017-06-22 Apple Inc. Intelligent automated assistant


Similar Documents

Publication Publication Date Title
US10417347B2 (en) Computer messaging bot creation
KR102401942B1 (en) Method and apparatus for evaluating translation quality
US11217239B2 (en) Computer proxy messaging bot
US10733197B2 (en) Method and apparatus for providing information based on artificial intelligence
US10255265B2 (en) Process flow diagramming based on natural language processing
JP7296419B2 (en) Method and device, electronic device, storage medium and computer program for building quality evaluation model
US20140258892A1 (en) Resource locator suggestions from input character sequence
US11775254B2 (en) Analyzing graphical user interfaces to facilitate automatic interaction
US20190087780A1 (en) System and method to extract and enrich slide presentations from multimodal content through cognitive computing
US11126794B2 (en) Targeted rewrites
US10891430B2 (en) Semi-automated methods for translating structured document content to chat-based interaction
US11763074B2 (en) Systems and methods for tool integration using cross channel digital forms
KR20180136116A (en) A digital signage system in IoT environment using personality analysis
CN111860000A (en) Text translation editing method and device, electronic equipment and storage medium
EP3149729A1 (en) Method and system for processing a voice-based user-input
WO2019175896A1 (en) System and method for interacting with digitalscreensusing voice input and image processing technique
Bisson et al. Azure AI Services at Scale for Cloud, Mobile, and Edge
US11741302B1 (en) Automated artificial intelligence driven readability scoring techniques
US11941345B2 (en) Voice instructed machine authoring of electronic documents
US20240126412A1 (en) Cross channel digital data structures integration and controls
KR20170057074A (en) Intelligent auto-completion method and apparatus sentence
CN113419711A (en) Page guiding method and device, electronic equipment and storage medium
Rodrigues Human-Powered Smartphone Assistance for Blind People
CN117873619A (en) Method and device for generating product description document, storage medium and terminal equipment
CN116910399A (en) Page construction method, page construction device, electronic equipment and computer storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19768527

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19768527

Country of ref document: EP

Kind code of ref document: A1