WO2023087934A1 - Voice control method, apparatus, device, and computer storage medium


Publication number
WO2023087934A1
Authority
WO
WIPO (PCT)
Prior art keywords
control
image control
image
voice
graphical interface
Application number
PCT/CN2022/122020
Other languages
French (fr)
Chinese (zh)
Inventor
陈明
冉茂松
张晓帆
Original Assignee
杭州逗酷软件科技有限公司
Application filed by 杭州逗酷软件科技有限公司
Publication of WO2023087934A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/445 Program loading or initiating
    • G06F9/44521 Dynamic linking or loading; Link editing at or after load time, e.g. Java class loading
    • G06F9/44526 Plug-ins; Add-ons
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/451 Execution arrangements for user interfaces
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Definitions

  • the present application relates to the technical field of voice interaction, and in particular to a voice control method, device, equipment and computer storage medium.
  • GUI: Graphical User Interface
  • the embodiment of the present application provides a voice control method, the method includes:
  • Image control identification is performed according to the image description text information corresponding to the voice data and at least one image control, and a target image control is determined in at least one image control;
  • the embodiment of the present application provides a voice control device, the voice control device includes a receiving unit, a determining unit, an analyzing unit, and a sending unit; wherein,
  • a receiving unit configured to receive voice data input by a user
  • a determining unit configured to determine at least one image control according to the current graphical interface
  • the analysis unit is configured to understand the image content of at least one image control, and obtain the image description text information corresponding to the at least one image control; and is also configured to perform image control identification according to the voice data and the image description text information corresponding to the at least one image control, determining a target image control among at least one image control;
  • the determination unit is further configured to determine the operation instruction according to the voice data
  • the sending unit is configured to send an operation instruction to the target image control, so as to implement voice control on the target image control.
  • an embodiment of the present application provides an electronic device, where the electronic device includes a memory and a processor; wherein,
  • said memory for storing a computer program capable of running on said processor
  • the processor is configured to execute the method as described in the first aspect when running the computer program.
  • an embodiment of the present application provides a computer storage medium, where a computer program is stored in the computer storage medium, and when the computer program is executed by at least one processor, the method described in the first aspect is implemented.
  • FIG. 1 is a schematic diagram of a graphical interface element grid index
  • FIG. 2 is a schematic flowchart of a voice control method provided in an embodiment of the present application
  • FIG. 3 is a schematic structural diagram of a hierarchical tree provided by an embodiment of the present application.
  • FIG. 4 is a detailed flowchart of a voice control method provided in an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of another voice control method provided in the embodiment of the present application.
  • FIG. 6 is a schematic flowchart of another voice control method provided in the embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a voice control device provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of the composition and structure of an electronic device provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of the composition and structure of another electronic device provided by the embodiment of the present application.
  • the embodiment of the present application provides a voice control method, the method includes:
  • An operation instruction is determined according to the voice data, and the operation instruction is sent to the target image control, so as to implement voice control on the target image control.
  • the determining at least one image control according to the current graphical interface includes:
  • the at least one image control is determined according to the graphical interface element information.
  • the determining the graphical interface element information corresponding to the current graphical interface includes:
  • the determining the at least one image control according to the graphical interface element information includes:
  • the preset type includes at least one of the following: ImageView, FrameLayout, LinearLayout, RelativeLayout and View.
  • the size screening of the controls in the set of candidate controls to obtain the at least one image control includes:
  • the determining at least one image control according to the current graphical interface includes:
  • the identifying the image control according to the image description text information corresponding to the voice data and the at least one image control, and determining the target image control in the at least one image control includes:
  • the target image control is determined according to the semantic similarity value.
  • the determining the target image control according to the semantic similarity value includes:
  • the image control corresponding to the maximum similarity value among the semantic similarity values is the target image control.
  • the method after receiving the voice data input by the user, the method further includes:
  • An operation instruction is determined according to the voice data, and the operation instruction is sent to the target image control, so as to implement voice control on the target image control.
  • the embodiment of the present application provides a voice control device, the voice control device includes a receiving unit, a determining unit, an analyzing unit, and a sending unit; wherein,
  • the receiving unit is configured to receive voice data input by a user
  • the determining unit is configured to determine at least one image control according to the current graphical interface
  • the analysis unit is configured to understand the image content of the at least one image control to obtain image description text information corresponding to the at least one image control; and is also configured to perform image control identification according to the voice data and the image description text information corresponding to the at least one image control, and determine the target image control in the at least one image control;
  • the determining unit is further configured to determine an operation instruction according to the voice data
  • the sending unit is configured to send the operation instruction to the target image control, so as to implement voice control on the target image control.
  • the determining unit is further configured to determine graphical interface element information corresponding to the current graphical interface; and determine the at least one image control according to the graphical interface element information.
  • the voice control device further includes a calling unit configured to call the underlying system code information to obtain the graphical interface element information; or call the system auxiliary service function interface to obtain the graphical interface element information.
  • the analysis unit is further configured to query, from the graphical interface element information, controls whose class attribute suffix is a preset type to form a set of candidate controls; and perform size screening on the controls in the set of candidate controls to obtain the at least one image control.
  • the determining unit is further configured to, in the set of candidate controls, determine whether the length and width of the controls meet a preset size condition, and determine the controls whose length and width meet the preset size condition as the image controls.
  • the voice control device further includes a detection unit configured to take a screenshot of the current graphical interface to obtain an image to be recognized; perform control detection on the image to be recognized, and combine the detected controls into a set of candidate controls; and perform size screening on the controls in the set of candidate controls to obtain the at least one image control.
  • the analysis unit is further configured to perform text conversion on the voice data to obtain voice-to-text information; semantically match the voice-to-text information against the image description text information corresponding to the at least one image control, and determine a semantic similarity value corresponding to the at least one image control; and determine the target image control according to the semantic similarity value.
  • the detection unit is configured to take a screenshot of the current graphical interface to obtain an image to be recognized; and perform text conversion on the voice data to obtain voice-to-text information; performing target detection on the image to be recognized, and determining a target image control; and determining an operation instruction according to the voice data, and sending the operation instruction to the target image control, so as to implement voice control on the target image control.
  • an embodiment of the present application provides an electronic device, where the electronic device includes a memory and a processor; wherein,
  • said memory for storing a computer program capable of running on said processor
  • the processor is configured to execute the method according to any one of the first aspect when running the computer program.
  • an embodiment of the present application provides a computer storage medium, where a computer program is stored in the computer storage medium, and when the computer program is executed by at least one processor, the method according to any one of the first aspect is implemented.
  • references to "some embodiments" describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or a different subset of all possible embodiments, and can be combined with each other without conflict.
  • the terms "first/second/third" involved in the embodiment of the present application are only used to distinguish similar objects, and do not represent a specific ordering of objects. Understandably, the specific order or sequence of "first/second/third" may be interchanged where permitted, so that the embodiments of the application described herein can be implemented in an order other than that illustrated or described herein.
  • VGUI: Voice and Graphical User Interface
  • the vast majority of applications do not consider the use of voice interaction when designing and developing.
  • current mobile phone applications are designed mainly for touch-screen interaction, so the vast majority have not been adapted to voice interaction. As a result, many problems arise when voice is used to interact with and control an application's graphical interface on a mobile phone. For example, interface elements may lack text descriptions, or may have text descriptions that users cannot conveniently use to refer to them (the text description is too long, contains symbols or pictures, is unclear, or is the same as or similar to that of other interface elements). In these cases, users cannot directly refer to the control they want to interact with through its text description.
  • the current solutions for this situation mainly include the following:
  • Icon recognition: a model is used to detect and identify commonly used, unambiguous icon controls, so as to obtain description text (common terms/titles). Users can then describe icon controls with common-sense or customary names, such as "play", "pause", "previous song" and "next song" for audio and video playback control buttons, so that the target control is described and the purpose of interaction is achieved;
  • Text recognition: a model is used to recognize the text that may be contained in image controls, such as pictures and icons, that have no text description, and the recognized text is used as the text description information of the control to match the user's interaction instructions, so that the target control is located and the purpose of interaction is achieved;
  • Digital number reference: all controls are numbered and a control is then referred to by its number, for example "the first button", so that the target control is described and the purpose of interaction is achieved; the control number is generally not displayed visually, so the user needs to work out the number of the control by himself;
  • Superimposed display of text instructions: the text description of each interactive control is superimposed and displayed on the GUI, and the user refers to a control through the text description superimposed on it, so as to achieve the purpose of interaction;
  • the digital reference method needs to number the control through the program, and then use the number to refer to the control.
  • the control number itself is not displayed in the interface.
  • the user's numbering method is not necessarily consistent with the program's numbering method; and since there may usually be dozens of interactive objects on an interface, it is very difficult for users to number the controls one by one.
  • the method of superimposed display of text instructions needs to generate the text instructions first; generating text instructions depends on the text description of the control, so the text instructions may have the same problems as the text description. If the superimposed content is too large, the original content will be covered, and if it is too small, the user will not be able to see it clearly; and since there may usually be dozens of interactive objects on an interface, dense prompt content ends up superimposed on the interface, which has a great impact on the user experience and sensory experience.
  • the method of superimposed display of digital numbers is simple to implement, but it is not conducive to the user's memory of correct interactive instructions.
  • if the superimposed content is too large, the original content will be covered, and if it is too small, the user will not be able to see it clearly; and since there may usually be dozens of interactive objects on an interface, dense prompt content ends up superimposed on the interface, which has a great impact on the user experience and sensory experience.
  • in the method of superimposed display of grids and numbers, the grid size may be too large or too small; a target interactive control may fall across several grid cells, and several interactive objects may appear in one grid cell. In these situations, the user needs to perform multiple operations before finally determining the interaction target.
  • the superimposed content will cover the original content, which will greatly affect the user experience and sensory experience.
  • the icon class can be seen in the bold dashed box in Figure 1.
  • the control size is generally small, and the appearance and meaning are usually relatively fixed, so this class can be handled through icon recognition; non-standard or ambiguous icons can be handled through multiple labels, such as shape, color, appearance style, visual semantics and other labels, or through a combination of spatial orientation and numbered index.
  • the other type is pictures, which can be seen in the bold solid line box in Figure 1.
  • the control size is generally large, and it mainly appears in the list of images, videos, files, messages, etc.; the control itself may be arranged according to the rules (such as a grid arrangement), may also be irregular.
  • the visual content and meaning of the picture itself varies greatly, and it may or may not have a text description; and the text description may be repeated, and there may be symbols, pictures, etc. that are not convenient for users to describe directly.
  • the existing solutions all have certain limitations, which makes it difficult to implement voice control, and cannot provide a better interaction method and interaction experience.
  • the embodiment of the present application provides a voice control method, which receives the voice data input by the user; determines at least one image control according to the current graphical interface; performs image content understanding on the at least one image control to obtain the corresponding image description text information; performs image control identification according to the voice data and the image description text information corresponding to the at least one image control, and determines the target image control in the at least one image control; and determines the operation instruction according to the voice data and sends the operation instruction to the target image control, so as to implement voice control of the target image control.
  • the voice interaction method based on image content understanding does not require the application to be adapted to voice interaction, which saves development costs and makes it easier for users to describe controls. It can effectively improve the convenience of voice control for users, so as to better achieve the purposes of voice interaction and control.
  • FIG. 2 shows a schematic flowchart of a voice control method provided in an embodiment of the present application.
  • the method may include:
  • S201 Receive voice data input by a user.
  • S202 Determine at least one image control according to the current graphical interface.
  • the embodiment of the present application is applied to a voice control device, or an electronic device integrated with the device.
  • the electronic device can be implemented in various forms; for example, it may include a smart phone, tablet computer, notebook computer, palmtop computer, personal digital assistant (PDA), portable media player (PMP), navigation device, wearable device, voice assistant, etc., and the embodiment of this application does not make any limitation.
  • the operation instruction can be determined according to the voice data.
  • the electronic device can perform corresponding operations on the current graphical interface according to the operation instruction determined by the voice data.
  • for the image controls in Figure 1, since the user cannot describe them directly through a text description, or finds it inconvenient to do so, the image controls in the current graphical interface need to be found first. The subsequent steps can then rely on image content understanding: by understanding the voice data input by the user and the graphical interface content, the user's interaction intention is matched with the interaction target, and the purpose of voice interaction is achieved.
  • the determining at least one image control according to the current graphical interface may include:
  • the graphical interface element information may be represented by a hierarchical tree, which may also be called a view tree (View Tree).
  • the GUI element information in the electronic device may be a hierarchical structure tree as shown in FIG. 3 .
  • each node in the View Tree represents an element or control (Control/Widget/Element) in the GUI, and the related attributes of the element can include a text description, interactive attributes (whether it can be clicked, whether it can enter text, whether it can slide, etc.), the position of the control, and so on.
  • Table 1 shows some related attribute information of a node, which can include the index number, text description, interactive attributes (whether it can be clicked, whether it can input text, whether it can slide, etc.), control location, etc., as shown below.
  • the elements directly visible to the user are mainly the leaf nodes (Leaf Elements) in the View Tree.
  • Other non-leaf nodes are generally invisible to users, and they are mainly used as interface element containers (Containers), which are mainly used to constrain and control the position, size, arrangement, etc. of elements.
  • some Containers also carry the function of interacting with users (the clickable attribute is true at this time).
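  • The View Tree structure described above can be sketched as follows. This is only an illustration: the `ViewNode` class, its field names, and the sample tree are hypothetical and are not taken from the application or from any platform API. The sketch shows how leaf elements and clickable containers could be separated.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ViewNode:
    """Hypothetical, simplified stand-in for one node of the View Tree."""
    index: int
    class_name: str
    text: str = ""
    clickable: bool = False
    bounds: Tuple[int, int, int, int] = (0, 0, 0, 0)  # left, top, right, bottom
    children: List["ViewNode"] = field(default_factory=list)


def leaf_nodes(root: ViewNode) -> List[ViewNode]:
    """Collect, depth first, the nodes that have no children."""
    if not root.children:
        return [root]
    leaves: List[ViewNode] = []
    for child in root.children:
        leaves.extend(leaf_nodes(child))
    return leaves


def interactive_containers(root: ViewNode) -> List[ViewNode]:
    """Collect non-leaf nodes whose clickable attribute is true."""
    found: List[ViewNode] = []
    if root.children and root.clickable:
        found.append(root)
    for child in root.children:
        found.extend(interactive_containers(child))
    return found


# Made-up example tree: a root container, one clickable container, three leaves.
tree = ViewNode(0, "android.widget.FrameLayout", children=[
    ViewNode(1, "android.widget.LinearLayout", clickable=True, children=[
        ViewNode(2, "android.widget.ImageView", bounds=(0, 0, 300, 200)),
        ViewNode(3, "android.widget.TextView", text="Title"),
    ]),
    ViewNode(4, "android.widget.Button", text="OK", clickable=True),
])

print([n.index for n in leaf_nodes(tree)])
print([n.index for n in interactive_containers(tree)])
```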
  • the determining the graphical interface element information corresponding to the current graphical interface may include:
  • the acquisition of graphical interface element information can be achieved by providing, directly in the underlying code of the system, an interface for obtaining the structure and information of the interface elements; alternatively, it can be achieved through the system accessibility service (Accessibility Service) function interface.
  • the former method is complicated to implement, requires a large amount of development work, and needs to modify the underlying code of the system, which carries certain security risks, but it can obtain more comprehensive and accurate interface element information; the latter method is simple to implement and requires little development work, but the obtained graphical interface element information may be incomplete and may contain some errors.
  • the embodiments of the present application may be specifically selected according to actual conditions, and no limitation is made here.
  • the determining at least one image control according to the graphical interface element information may include:
  • Size screening is performed on the controls in the candidate control set to obtain at least one image control.
  • the preset type may include at least one of the following: image view (ImageView), single frame layout (FrameLayout), linear layout (LinearLayout), relative layout (RelativeLayout) and view (View).
  • the size screening of the controls in the candidate control set to obtain at least one image control may include: in the candidate control set, judging whether the length and width of each control meet the preset size conditions, and determining a control whose length and width meet the preset size conditions as an image control. That is to say, when the length and width of a control meet the preset size conditions, the control is an image control rather than an icon, button, decorative strip or other control, and the image controls are thereby filtered out of the candidate control set.
  • the preset size condition may be: the length and width of the control are respectively greater than a first preset value, and the ratio of the length and width of the control is smaller than a second preset value.
  • the second preset value may be 3, that is, the aspect ratio of the image control needs to be less than 3.
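  • The preset size condition can be sketched as a small predicate. The aspect-ratio limit of 3 comes from the text; the minimum side of 100 follows the 100dp example given elsewhere in the description, and computing the ratio as longer side over shorter side is an assumption made here so that tall and wide controls are treated alike.

```python
def is_image_control(width: float, height: float,
                     min_side: float = 100.0,
                     max_aspect_ratio: float = 3.0) -> bool:
    """Return True when a control passes the size screening.

    min_side plays the role of the first preset value (assumed to be 100,
    in the same unit as width and height) and max_aspect_ratio the role of
    the second preset value (3 in the text).
    """
    # Both sides must be strictly greater than the first preset value.
    if width <= min_side or height <= min_side:
        return False
    # Assumption: the ratio is taken as longer side / shorter side.
    aspect = max(width, height) / min(width, height)
    return aspect < max_aspect_ratio


# Made-up control sizes as (width, height) pairs.
candidates = [(300, 200), (400, 100), (500, 120), (120, 300)]
image_controls = [size for size in candidates if is_image_control(*size)]
print(image_controls)
```

  • Under these assumptions, a thumbnail-sized control passes, while a control that is too small on one side or that is a long thin strip is rejected.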
  • the type (class attribute suffix) of some target controls may be FrameLayout, LinearLayout, RelativeLayout, View, etc., rather than ImageView.
  • among the leaf nodes of the View Tree, controls of the type ImageView or of unconventional types (FrameLayout, LinearLayout, RelativeLayout, View, etc., which do not normally appear in leaf nodes) can be selected as the set of candidate controls.
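  • Selecting candidate controls by class attribute suffix might look like the sketch below. The dictionary layout of a leaf node and the sample class names are hypothetical. The suffix is interpreted here as the last dot-separated part of the class name, which is an assumption: a plain string-suffix test for "View" would also match TextView.

```python
PRESET_TYPES = ("ImageView", "FrameLayout", "LinearLayout", "RelativeLayout", "View")


def class_suffix(class_name: str) -> str:
    """Return the last dot-separated component of a class name."""
    return class_name.rsplit(".", 1)[-1]


def candidate_controls(leaves):
    """Keep the leaf nodes whose class suffix is one of the preset types."""
    return [node for node in leaves if class_suffix(node["class"]) in PRESET_TYPES]


# Made-up leaf nodes; only the "class" attribute matters for this step.
leaves = [
    {"class": "android.widget.ImageView"},
    {"class": "android.widget.TextView"},
    {"class": "android.view.View"},
    {"class": "android.widget.FrameLayout"},
]

print([class_suffix(node["class"]) for node in candidate_controls(leaves)])
```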
  • the embodiment of the present application first proposes obtaining the current graphical interface element information (the View Tree) from the system and then filtering out the image controls from the View Tree. Considering that it may be difficult to obtain the View Tree on some platforms or systems, a method of taking a screenshot of the current graphical interface, detecting controls in it, and then filtering out the image controls is also proposed. Therefore, in some embodiments, for S202, the determining at least one image control according to the current graphical interface may include:
  • Size screening is performed on the controls in the candidate control set to obtain at least one image control.
  • the size screening of the controls in the candidate control set to obtain at least one image control may include: in the candidate control set, judging whether the length and width of each control meet the preset size conditions, and determining the controls whose length and width meet the preset size conditions as image controls. That is to say, among these controls, when the length and width of a control meet the preset size conditions, the control is an image control rather than an icon, button, decorative strip or other control, so that the image controls can be filtered out.
  • the positions of the controls (such as text, image, etc.) contained in the image to be recognized can be detected to obtain several controls;
  • the length and width of the image control need to be larger than a certain size (for example, 100dp), and the aspect ratio of the image control needs to be less than 3; then, the image control is selected from these several controls.
  • S203 Perform image content understanding on at least one image control to obtain image description text information corresponding to at least one image control.
  • S204 Perform image control identification according to the voice data and image description text information corresponding to the at least one image control, and determine a target image control in the at least one image control.
  • the image content of these image controls can be understood, so as to obtain the image description text information about the image content of each image control.
  • the method for image content understanding may include: image classification, image detection, image description generation, image-based text reference target detection, etc., and these methods will be described in detail below:
  • Image classification: the image content is classified and matched to text description tags, such as "car", "food", "person", etc. The text labels obtained by image classification are generally not detailed enough, and their understanding of images is limited. For example, for the second video in the first row of Figure 1, if the label is "car", more details cannot be obtained, such as whether it is a motorcycle, a car, a truck, or a bus; multi-level tags or multiple tags can improve this to a certain extent, such as "car/sedan" (first-level tag/second-level tag).
  • image classification also cannot obtain several labels at the same level at the same time. For example, the first video in the third row of Figure 1 contains people and food, but the classification model can only output one of the labels "people" or "food", not both; this can be improved to a certain extent by means of confidence values, such as "food: 0.5", "people: 0.4".
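  • One way to read the confidence-based improvement is to keep every label whose confidence clears a threshold instead of only the top one. The sketch below assumes the classifier already returns a label-to-confidence mapping; the threshold of 0.3 and the extra "car" score are made up for illustration.

```python
def labels_above(scores, threshold=0.3):
    """Return (label, confidence) pairs at or above the threshold, best first."""
    kept = [(label, conf) for label, conf in scores.items() if conf >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Scores in the style of the example in the text, plus one made-up low score.
scores = {"food": 0.5, "people": 0.4, "car": 0.1}
print(labels_above(scores))
```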
  • Image detection: the objects contained in the image are detected through a detection model.
  • for example, the second video in the first row of Figure 1 contains "person" and "car", and the second video in the second row contains "person" and "food"; at the same time, the detection model can be cascaded or combined with a classification model to achieve finer-grained object recognition.
  • the type, manufacturer, model, color, etc. of the car can be further identified.
  • Image detection can provide more, and more detailed, information than image classification, but it cannot provide the relationships between the detected objects; the information provided is also relatively fragmented and quite different from the user's natural language description, so the later matching is more difficult. At the same time, if more detailed information is needed, a more complex model or a cascade of multiple models is required, so the system complexity and the cost of use are high.
  • Image description generation: the image content is understood, and a description of the image content is then generated in natural language. For the second video in the first row of Figure 1, a generated description might be "a person appears next to a car; this person is carrying a schoolbag; the car is white". The quality and level of detail of the generated description depend on the model accuracy and related settings. This method is closest to the user's natural language description; the later matching is less difficult, the system complexity is low, and the cost of use is relatively small.
  • Image-based text reference target detection: the model receives the user's instruction text and the image at the same time, performs the semantic extraction and matching of the text and the image inside the model, and finally directly outputs the position in the image of the object referred to by the user's instruction text.
  • the matching and positioning of the user interaction instruction and the target interaction object can be realized.
  • the two steps of S203 and S204 can be combined into one step to complete. This way minimizes the loss of information in the whole process and can obtain better results.
  • the specific effect also depends on the quality and complexity of the model. At the same time, the system complexity of this method is the smallest, and the use cost is also the smallest.
  • performing image control recognition according to the voice data and the image description text information corresponding to the at least one image control, and determining the target image control among the at least one image control, may include:
  • the target image control is determined.
  • the determining the target image control according to the semantic similarity value may include: determining the image control corresponding to the maximum similarity value among the semantic similarity values as the target image control.
  • After the semantic similarity value corresponding to the at least one image control is obtained, the maximum similarity value is selected from the semantic similarity values, and the image control corresponding to the maximum similarity value is determined as the target image control.
  • the voice text information can be semantically matched with the image description text information corresponding to each image control. For example, a traditional text matching method (such as the TF-IDF algorithm, the BM25 algorithm, the simhash algorithm, or the Jaccard algorithm) can be used, or a semantic matching model trained on a neural network can be used, to determine the semantic similarity value corresponding to each image control; the image control with the most similar semantics (that is, the image control corresponding to the maximum similarity value) is then selected as the target image control.
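The matching step above can be sketched with the Jaccard algorithm, one of the traditional methods the text names. This is an illustrative sketch only: the tokenizer, the control identifiers, and the description strings are assumptions, and a real system might instead use TF-IDF, BM25, or a trained semantic matching model.

```python
from __future__ import annotations

import re


def tokens(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity: size of the token intersection over size of the union."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def pick_target(voice_text: str, descriptions: dict[str, str]) -> tuple[str, float]:
    """Return (control_id, score) for the description most similar to the voice text.

    `descriptions` must hold at least one image control, as the method requires.
    """
    scores = {cid: jaccard(voice_text, desc) for cid, desc in descriptions.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical image description texts for two image controls.
descriptions = {
    "video_1": "a person appears next to a white car",
    "video_2": "two people are eating food at a table",
}
print(pick_target("open the video containing the car", descriptions))
```

Token-overlap measures only reward shared words, so a paraphrase such as "automobile" would score zero here; that is the gap a neural semantic matching model is meant to close.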
  • S205 Determine an operation instruction according to the voice data, and send the operation instruction to the target image control, so as to implement voice control on the target image control.
  • the electronic device can determine the operation instruction according to the user's voice data, and then send the operation instruction to the target image control so that the corresponding operation (click, long press, etc.) is performed, thereby completing the voice interaction.
  • the operation instruction is determined from the voice data input by the user.
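A minimal sketch of deriving the operation instruction from the recognized text is shown below. The keyword phrases, the returned operation names, and the default of a plain click are assumptions for illustration; the application only states that the instruction (click, long press, etc.) is determined from the voice data.

```python
def parse_operation(voice_text: str) -> str:
    """Map recognized voice text to an operation name; defaults to a plain click."""
    text = voice_text.lower()
    # Hypothetical trigger phrases; a real system would use intent recognition.
    if "long press" in text or "press and hold" in text:
        return "long_press"
    return "click"


print(parse_operation("long press the car video"))
print(parse_operation("open the video containing the car"))
```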
  • the technical solution of the embodiment of the present application provides a method for understanding the image content of graphical interface elements, and further provides a voice interaction and control method based on that understanding. As a result, the controlled application does not need to be adapted for voice control, which saves development and promotion costs and is convenient for users.
  • the voice interaction method based on image content understanding proposed by the technical solution of the embodiment of the present application can, by understanding the voice data input by the user and the image content, match the user's interaction intention with the user's interaction target, so as to achieve the purpose of voice interaction.
  • This embodiment provides a voice control method: voice data input by the user is received; at least one image control is determined according to the current graphical interface; image content understanding is performed on the at least one image control to obtain the image description text information corresponding to the at least one image control; image control recognition is performed according to the voice data and the image description text information corresponding to the at least one image control, and a target image control is determined among the at least one image control; an operation instruction is determined according to the voice data and sent to the target image control, so as to implement voice control of the target image control.
  • the voice interaction method based on image content understanding requires no adaptation between the application and voice interaction, which not only saves development costs but also makes it easier for users to describe targets; it can effectively improve the convenience of voice control and thus better achieve the purpose of voice interaction and control.
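The five steps can be wired together as in the following sketch. Every component is passed in as a callable because the application does not prescribe a particular speech recognizer, captioning model, or matcher; all parameter names here are hypothetical.

```python
from typing import Callable, Sequence


def voice_control(
    voice_data: bytes,                         # S201: voice data already received
    find_controls: Callable[[], Sequence[str]],
    describe: Callable[[str], str],
    to_text: Callable[[bytes], str],
    similarity: Callable[[str, str], float],
    parse_operation: Callable[[str], str],
    send: Callable[[str, str], None],
) -> str:
    """Run S202 to S205 and return the identifier of the target image control."""
    controls = find_controls()                           # S202: image controls
    descriptions = {c: describe(c) for c in controls}    # S203: description text
    text = to_text(voice_data)                           # voice text information
    # S204: the control with the maximum similarity value is the target.
    target = max(controls, key=lambda c: similarity(text, descriptions[c]))
    send(target, parse_operation(text))                  # S205: send the operation
    return target


# Stub components, for illustration only.
sent = []
captions = {"img_a": "a plate of food", "img_b": "a white car"}
target = voice_control(
    b"audio",
    find_controls=lambda: list(captions),
    describe=captions.get,
    to_text=lambda _audio: "open the car video",
    similarity=lambda t, d: len(set(t.split()) & set(d.split())),
    parse_operation=lambda _text: "click",
    send=lambda control, op: sent.append((control, op)),
)
print(target, sent)
```

Keeping the stages as separate callables mirrors the way S203 and S204 can be swapped for different understanding and matching methods.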
  • FIG. 4 shows a detailed flowchart of a voice control method provided by the embodiment of the present application. As shown in Figure 4, the detailed process may include:
  • S303 Query all controls meeting the requirements from the View Tree to form a set of candidate controls.
  • S304 Perform size screening on the controls in the candidate control set to obtain at least one image control.
  • S305 Perform image content understanding on at least one image control to obtain image description text information corresponding to at least one image control.
  • S306 Semantically match the voice text information with the image description text information corresponding to at least one image control, and determine the target image control corresponding to the maximum semantic similarity value.
  • S307 Send an operation instruction to the target image control, so as to realize voice interaction between the user and the target image control.
  • the operation instruction is determined according to the voice data of the user.
  • the main process of the voice interaction control method based on image content understanding proposed in the embodiment of the present application includes the following. During the user's voice interaction: first, acquire the instruction text of the user's voice interaction; second, acquire the current graphical interface element information, the View Tree; third, query the View Tree for all controls that meet the requirements, that is, all image controls whose class attribute suffix is ImageView or an unconventional type (such as FrameLayout, LinearLayout, RelativeLayout, View, etc.); fourth, perform size screening on the image controls, where the length and width of a retained image control must be larger than a certain size (such as 100dp) and its aspect ratio must be less than 3; fifth, understand the image content of the retained image controls and generate image description text for them; sixth, semantically match the instruction text against the image description text and take the image control with the most similar semantics as the target image control; finally, the target image control executes
  • the embodiment of the present application provides a voice interaction and control method based on image content understanding for the situation where the interface element description text is missing or the description text is not convenient for the user to directly describe when the user uses voice to operate the graphical interface content.
  • the user can directly say "I want to watch Ronaldo's video" or "open the video containing the car", so as to match and locate the video containing "C Ronaldo" or "car", and achieve the purpose of interface element manipulation and interaction.
  • this interaction method conforms to the user's interaction habits and methods, and can effectively improve the user's experience when using voice control.
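The candidate query and size screening of S303 and S304 can be sketched as follows. The `Node` structure is a hypothetical stand-in for a View Tree entry. Two interpretations are assumptions of this sketch rather than statements of the application: the "class attribute suffix" is read as the simple class name after the last dot (so that `android.widget.TextView` is not matched by `View`), and the aspect ratio is taken as the longer side divided by the shorter side.

```python
from __future__ import annotations

from dataclasses import dataclass

# Preset types named in the text.
PRESET_TYPES = {"ImageView", "FrameLayout", "LinearLayout", "RelativeLayout", "View"}
MIN_SIDE_DP = 100      # example threshold from the text
MAX_ASPECT_RATIO = 3   # example threshold from the text


@dataclass
class Node:
    """Hypothetical View Tree entry: class name plus size in dp."""
    class_name: str
    width_dp: float
    height_dp: float


def is_candidate(node: Node) -> bool:
    """S303: keep controls whose simple class name is one of the preset types."""
    return node.class_name.rsplit(".", 1)[-1] in PRESET_TYPES


def passes_size_screening(node: Node) -> bool:
    """S304: both sides above the minimum and aspect ratio below the maximum."""
    short, long_ = sorted((node.width_dp, node.height_dp))
    return short > MIN_SIDE_DP and long_ / short < MAX_ASPECT_RATIO


def find_image_controls(nodes: list[Node]) -> list[Node]:
    return [n for n in nodes if is_candidate(n) and passes_size_screening(n)]


tree = [
    Node("android.widget.ImageView", 160, 120),
    Node("android.widget.TextView", 300, 150),
    Node("android.widget.FrameLayout", 360, 48),
    Node("android.view.View", 400, 110),
    Node("android.widget.LinearLayout", 200, 200),
]
print([n.class_name for n in find_image_controls(tree)])
```

The size screening removes thin dividers, small icons, and banner-like strips that are unlikely to be pictures worth captioning.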
  • the above-mentioned technical solution needs to first obtain the current graphical interface element information View Tree from the system, and then filter out image controls from the View Tree to understand the image content.
  • the screenshot of the current graphical interface can also be used to understand the content of the image, so as to achieve the purpose of voice interaction and control.
  • FIG. 5 shows a schematic flowchart of another voice control method provided by an embodiment of the present application. As shown in Figure 5, the detailed process may include:
  • S402 Take a screenshot of the current graphical interface to obtain an image to be recognized.
  • S403 Perform control detection on the image to be recognized to obtain several candidate controls.
  • S404 Perform size screening on several candidate controls to obtain at least one image control.
  • S405 Perform image content understanding on at least one image control to obtain image description text information corresponding to at least one image control.
  • S406 Semantically match the voice text information with the image description text information corresponding to at least one image control, and determine the target image control corresponding to the maximum semantic similarity value.
  • S407 Send an operation instruction to the target image control, so as to realize voice interaction between the user and the target image control.
  • the operation instruction is determined according to the voice data of the user.
  • the main process of the voice interaction control method based on image content understanding proposed in the embodiment of the present application includes the following. During the user's voice interaction: first, acquire the instruction text of the user's voice interaction; second, acquire a screenshot of the current graphical interface; third, detect the controls (text, image, etc.) contained in the screenshot and their positions; fourth, use the control positions to perform size screening on the detected controls, where the length and width of a retained image control must be larger than a certain size (such as 100dp) and its aspect ratio must be less than 3; fifth, understand the image content of the retained image controls and generate image description text for them; sixth, semantically match the instruction text against the image description text and take the image control with the most similar semantics as the target image control; finally, the target image control executes the user's operation instruction (click, long press, etc.) to complete the user's voice interaction.
  • the text description information can be extracted by commonly used methods such as icon recognition and text recognition.
  • the user's voice control and matching can be realized only by images, thereby achieving the purpose of user interaction and control; the solution is simple to implement and is conducive to popularization.
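In the screenshot-based flow, detected controls are pixel boxes, while the example threshold is given in dp. The sketch below converts the threshold with the standard Android relation px = dp × (dpi / 160) before screening. The box format `(left, top, right, bottom)`, the function names, and the longer-side-over-shorter-side definition of aspect ratio are assumptions for illustration.

```python
from __future__ import annotations

Box = tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def dp_to_px(dp: float, dpi: float) -> float:
    """Android density conversion: 1 dp equals 1 px on a 160 dpi screen."""
    return dp * dpi / 160.0


def screen_boxes(
    boxes: list[Box], dpi: float, min_side_dp: float = 100, max_ratio: float = 3
) -> list[Box]:
    """Keep detected boxes that are large enough and not too elongated."""
    min_side_px = dp_to_px(min_side_dp, dpi)
    kept = []
    for left, top, right, bottom in boxes:
        short, long_ = sorted((right - left, bottom - top))
        # The size check runs first, so a zero-sized box never reaches the division.
        if short > min_side_px and long_ / short < max_ratio:
            kept.append((left, top, right, bottom))
    return kept


# Hypothetical detections on a 320 dpi screen, where 100dp corresponds to 200 px.
detections = [(0, 0, 400, 300), (0, 0, 1000, 150), (0, 0, 900, 250), (10, 10, 260, 260)]
print(screen_boxes(detections, dpi=320))
```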
  • the embodiment of the present application can also combine images and text instructions, use image-based text references to perform target detection, and directly match user interaction targets, thereby achieving the purpose of user interaction and control.
  • FIG. 6 shows a detailed flow chart of another voice control method provided by an embodiment of the present application. As shown in Figure 6, the detailed process may include:
  • S501 Obtain voice text information corresponding to user voice data.
  • S502 Take a screenshot of the current graphical interface to obtain an image to be recognized.
  • S503 Perform target detection on the image to be recognized according to the voice text information, and determine the target image control.
  • S504 Send an operation instruction to the target image control, so as to realize voice interaction between the user and the target image control.
  • after the voice data input by the user is received, the method may specifically include: taking a screenshot of the current graphical interface to obtain an image to be recognized; performing text conversion on the voice data to obtain voice text information; performing target detection on the image to be recognized according to the voice text information, and determining the target image control; and determining the operation instruction according to the voice data, and sending the operation instruction to the target image control, so as to implement voice control of the target image control.
  • the operation instruction is determined according to the voice data of the user.
  • the main process of the voice interaction control method based on image content understanding proposed in the embodiment of the present application includes the following. During the user's voice interaction: first, acquire the instruction text of the user's voice interaction; second, acquire a screenshot of the current graphical interface; third, use image-based text reference target detection to match and locate the user's interaction target and thereby determine the target image control; finally, the target image control executes the user's operation instruction (click, long press, etc.) to complete the interaction process.
  • the user's voice control and matching can be realized only by images, so as to realize the purpose of user interaction and control.
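The flow S501 to S504 can be sketched as below, with the text-referring detection model left as a pluggable callable. The model interface (`ground` returning one pixel box), the `tap` callback, and the choice to act on the center of the box are all assumptions made for illustration; the application does not define them.

```python
from __future__ import annotations

from typing import Callable

Box = tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def box_center(box: Box) -> tuple[int, int]:
    """Integer center point of a pixel box."""
    left, top, right, bottom = box
    return (left + right) // 2, (top + bottom) // 2


def control_by_grounding(
    voice_text: str,                        # S501: voice text information
    screenshot: bytes,                      # S502: image to be recognized
    ground: Callable[[bytes, str], Box],    # S503: hypothetical detection model
    tap: Callable[[tuple[int, int]], None],
) -> tuple[int, int]:
    """Locate the referred object in the screenshot and act on its center."""
    box = ground(screenshot, voice_text)    # S503: target detection from the text
    point = box_center(box)
    tap(point)                              # S504: deliver the operation
    return point


# Stub model that always returns the same box, for illustration only.
taps = []
point = control_by_grounding(
    "the blue button in the upper right corner",
    b"png-bytes",
    ground=lambda _image, _text: (900, 40, 1000, 120),
    tap=taps.append,
)
print(point)
```

Because matching happens inside the model, this variant needs no separate description-generation or text-matching stage.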
  • the user's interaction instructions are not restricted: the user can describe the interactive object (not limited to pictures; icons and the like are also possible) relatively freely, for example "the blue button in the upper right corner", "I want to watch a car video", "select the plus button below", or "click the second button at the bottom", instead of describing the interactive object according to the interface text description or predefined instructions.
  • the technical solution can provide users with a more natural and intelligent voice interaction and control mode; and the technical solution is simple to implement, which is conducive to popularization; at the same time, the system complexity is small, which is beneficial to the realization and deployment of the device side.
  • This embodiment provides a voice control method.
  • the implementation of the foregoing embodiments is described in detail through the foregoing embodiments. It can be seen that the technical solutions of the foregoing embodiments can not only save development costs but also improve the convenience of voice control for users, so as to better achieve the purpose of voice interaction and control.
  • FIG. 7 shows a schematic structural diagram of a voice control device 60 provided in the embodiment of the present application.
  • the voice control device 60 may include: a receiving unit 601, a determining unit 602, an analyzing unit 603, and a sending unit 604; wherein,
  • the receiving unit 601 is configured to receive voice data input by the user;
  • the determining unit 602 is configured to determine at least one image control according to the current graphical interface
  • the analysis unit 603 is configured to perform image content understanding on at least one image control to obtain the image description text information corresponding to the at least one image control; and is also configured to perform image control recognition according to the voice data and the image description text information corresponding to the at least one image control, and determine a target image control among the at least one image control;
  • the determining unit 602 is further configured to determine an operation instruction according to the voice data
  • the sending unit 604 is configured to send an operation instruction to the target image control, so as to implement voice control on the target image control.
  • the determining unit 602 is further configured to determine graphical interface element information corresponding to the current graphical interface; and determine at least one image control according to the graphical interface element information.
  • the voice control device 60 may further include a calling unit 605 configured to call the system underlying code information to obtain graphical interface element information; or call the system auxiliary service function interface to obtain graphical interface element information.
  • the determining unit 602 is further configured to query, from the graphical interface element information, the controls whose class attribute suffix is a preset type to form a set of candidate controls; and perform size screening on the controls in the set of candidate controls to obtain at least one image control.
  • the preset type includes at least one of the following: ImageView, FrameLayout, LinearLayout, RelativeLayout and View.
  • the determining unit 602 is further configured to, in the set of candidate controls, determine whether the length and width of the controls meet the preset size conditions, and determine the controls whose length and width meet the preset size conditions as image controls.
  • the voice control device 60 may further include a detection unit 606 configured to take a screenshot of the current graphical interface to obtain an image to be recognized;
  • the controls form a candidate control set; and size screening is performed on the controls in the candidate control set to obtain at least one image control.
  • the analysis unit 603 is specifically configured to perform text conversion on the voice data to obtain voice text information; semantically match the voice text information with the image description text information corresponding to the at least one image control to determine the semantic similarity value corresponding to the at least one image control; and determine the target image control according to the semantic similarity value.
  • the determining unit 602 is further configured to determine that the image control corresponding to the maximum similarity value among the semantic similarity values is the target image control.
  • the detection unit 606 is further configured to take a screenshot of the current graphical interface to obtain the image to be recognized; perform text conversion on the voice data to obtain voice text information; perform target detection on the image to be recognized according to the voice text information to determine the target image control; and determine the operation instruction according to the voice data and send the operation instruction to the target image control, so as to implement voice control of the target image control.
  • a "unit” may be a part of a circuit, a part of a processor, a part of a program or software, etc., of course it may also be a module, or it may be non-modular.
  • each component in this embodiment may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software function modules.
  • if the integrated unit is implemented in the form of a software function module and is not sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of this embodiment, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the method described in this embodiment.
  • the aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disc.
  • this embodiment provides a computer storage medium, the computer storage medium stores a computer program, and when the computer program is executed by at least one processor, the steps of the method described in any one of the preceding embodiments are implemented.
  • an electronic device 70 may include: a communication interface 701 , a memory 702 , and a processor 703 ; each component is coupled together through a bus system 704 .
  • the bus system 704 is used to realize connection and communication between these components.
  • the bus system 704 also includes a power bus, a control bus and a status signal bus.
  • the various buses are labeled as bus system 704 in FIG. 8 .
  • the communication interface 701 is used for receiving and sending signals during the process of sending and receiving information with other external network elements;
  • memory 702 used to store computer programs that can run on the processor 703;
  • the processor 703 is configured to, when running the computer program, execute:
  • the memory 702 in this embodiment of the present application may be a volatile memory or a nonvolatile memory, or may include both volatile and nonvolatile memories.
  • the non-volatile memory can be read-only memory (Read-Only Memory, ROM), programmable read-only memory (Programmable ROM, PROM), erasable programmable read-only memory (Erasable PROM, EPROM), electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), or flash memory.
  • the volatile memory can be Random Access Memory (RAM), which acts as external cache memory.
  • many forms of RAM are available, such as static random access memory (Static RAM, SRAM), dynamic random access memory (Dynamic RAM, DRAM), synchronous dynamic random access memory (Synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDRSDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct Rambus random access memory (Direct Rambus RAM).
  • the processor 703 may be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the above method may be completed by an integrated logic circuit of hardware in the processor 703 or instructions in the form of software.
  • the above-mentioned processor 703 may be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a mature storage medium in the field such as random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, register.
  • the storage medium is located in the memory 702, and the processor 703 reads the information in the memory 702, and completes the steps of the above method in combination with its hardware.
  • the processing unit can be implemented in one or more application specific integrated circuits (Application Specific Integrated Circuits, ASIC), digital signal processors (Digital Signal Processing, DSP), digital signal processing devices (DSP Device, DSPD), programmable logic devices (Programmable Logic Device, PLD), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), general-purpose processors, controllers, microcontrollers, microprocessors, other electronic units used to perform the functions described in this application, or a combination thereof.
  • the techniques described herein can be implemented through modules (eg, procedures, functions, and so on) that perform the functions described herein.
  • Software codes can be stored in memory and executed by a processor.
  • Memory can be implemented within the processor or external to the processor.
  • the processor 703 is further configured to execute the steps of the method described in any one of the foregoing embodiments when running the computer program.
  • FIG. 9 shows a schematic diagram of the composition and structure of another electronic device provided by the embodiment of the present application.
  • an electronic device 70 may include the voice control device 60 described in any one of the foregoing embodiments.
  • the voice interaction method based on image content understanding requires no adaptation between the application and voice interaction, which not only saves development costs but also makes it easier for users to describe targets, and can effectively improve the convenience of voice control for users, so as to better achieve the purpose of voice interaction and control.
  • the voice data input by the user is received; at least one image control is determined according to the current graphical interface; image content understanding is performed on the at least one image control to obtain the image description text information corresponding to the at least one image control; image control recognition is performed according to the voice data and the image description text information corresponding to the at least one image control, and the target image control is determined among the at least one image control; the operation instruction is determined according to the voice data and sent to the target image control, so as to implement voice control of the target image control.
  • the voice interaction method based on image content understanding requires no adaptation between the application and voice interaction, which not only saves development costs but also makes it easier for users to describe targets; it can effectively improve the convenience of voice control and thus better achieve the purpose of voice interaction and control.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A voice control method, an apparatus (60), an electronic device (70), and a computer storage medium. The method comprises: receiving voice data input by a user (S201); determining at least one image control according to a current graphical interface (S202); performing image content understanding on the at least one image control, to obtain image description text information corresponding to the at least one image control (S203); performing image control recognition according to the voice data and the image description text information corresponding to the at least one image control, and determining a target image control of the at least one image control (S204); determining an operation instruction according to the voice data, and sending the operation instruction to the target image control so as to realize voice control of the target image control (S205). In this way, a voice interaction mode based on image content understanding can not only reduce development costs, but also improve convenience for a user during voice control, so that the purpose of voice interaction and control is better achieved.

Description

A voice control method, apparatus, device, and computer storage medium
Cross-Reference to Related Applications
This application claims priority to Chinese patent application No. 202111398660.6, filed with the China Patent Office on November 19, 2021 and entitled "A voice control method, apparatus, device, and computer storage medium", the entire content of which is incorporated in this application by reference.
Technical Field
The present application relates to the technical field of voice interaction, and in particular to a voice control method, apparatus, device, and computer storage medium.
Background
In recent years, with the rapid development of hardware devices and electronic products, voice-based human-computer interaction has become increasingly mature and widespread, and is increasingly accepted and used. Voice interaction has thus gradually penetrated into people's daily lives, and the demand for controlling a graphical user interface (Graphical User Interface, GUI) by voice is becoming ever stronger.
In the related art, for cases where an element in the graphical interface lacks a text description, or where its text description is inconvenient for the user to speak directly, some solutions already exist, such as icon recognition, text recognition, spatial orientation reference, and numeric index reference. However, these solutions all have certain limitations, especially for picture-type controls: because the user cannot, or cannot conveniently, describe such controls through a text description, voice control is difficult to implement and usability is poor.
发明内容Contents of the invention
本申请的技术方案是这样实现的:The technical scheme of the present application is realized like this:
第一方面,本申请实施例提供了一种语音控制方法,该方法包括:In the first aspect, the embodiment of the present application provides a voice control method, the method includes:
接收用户输入的语音数据;Receive voice data input by the user;
根据当前的图形界面,确定至少一个图像控件;Determine at least one image control according to the current graphical interface;
对至少一个图像控件进行图像内容理解,得到至少一个图像控件对应的图像描述文本信息;Perform image content understanding on at least one image control, and obtain image description text information corresponding to at least one image control;
根据语音数据与至少一个图像控件对应的图像描述文本信息进行图像控件识别,在至少一个图像控件中确定目标图像控件;Image control identification is performed according to the image description text information corresponding to the voice data and at least one image control, and a target image control is determined in at least one image control;
根据语音数据确定操作指令,向目标图像控件发送操作指令,以实现对目标图像控件的语音控制。Determine the operation instruction according to the voice data, and send the operation instruction to the target image control, so as to realize the voice control of the target image control.
In a second aspect, an embodiment of the present application provides a voice control apparatus, the voice control apparatus including a receiving unit, a determining unit, an analysis unit, and a sending unit, wherein:
the receiving unit is configured to receive voice data input by a user;
the determining unit is configured to determine at least one image control according to the current graphical interface;
the analysis unit is configured to perform image content understanding on the at least one image control to obtain image description text information corresponding to the at least one image control, and is further configured to perform image control identification according to the voice data and the image description text information corresponding to the at least one image control, and to determine a target image control among the at least one image control;
the determining unit is further configured to determine an operation instruction according to the voice data;
the sending unit is configured to send the operation instruction to the target image control, so as to implement voice control of the target image control.
In a third aspect, an embodiment of the present application provides an electronic device, the electronic device including a memory and a processor, wherein:
the memory is configured to store a computer program capable of running on the processor;
the processor is configured to execute the method according to the first aspect when running the computer program.
In a fourth aspect, an embodiment of the present application provides a computer storage medium storing a computer program which, when executed by at least one processor, implements the method according to the first aspect.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a grid index of graphical interface elements;
FIG. 2 is a schematic flowchart of a voice control method provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a hierarchical structure tree provided by an embodiment of the present application;
FIG. 4 is a detailed schematic flowchart of a voice control method provided by an embodiment of the present application;
FIG. 5 is a detailed schematic flowchart of another voice control method provided by an embodiment of the present application;
FIG. 6 is a detailed schematic flowchart of yet another voice control method provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a voice control apparatus provided by an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present application;
FIG. 9 is a schematic structural diagram of another electronic device provided by an embodiment of the present application.
Detailed Description
In a first aspect, an embodiment of the present application provides a voice control method, the method including:
receiving voice data input by a user;
determining at least one image control according to the current graphical interface;
performing image content understanding on the at least one image control to obtain image description text information corresponding to the at least one image control;
performing image control identification according to the voice data and the image description text information corresponding to the at least one image control, and determining a target image control among the at least one image control;
determining an operation instruction according to the voice data, and sending the operation instruction to the target image control, so as to implement voice control of the target image control.
In some embodiments, determining at least one image control according to the current graphical interface includes:
determining graphical interface element information corresponding to the current graphical interface;
determining the at least one image control according to the graphical interface element information.
In some embodiments, determining the graphical interface element information corresponding to the current graphical interface includes:
calling underlying system code information to obtain the graphical interface element information; or
calling a system accessibility service function interface to obtain the graphical interface element information.
In some embodiments, determining the at least one image control according to the graphical interface element information includes:
querying the graphical interface element information for controls whose class attribute suffix is a preset type, to form a candidate control set;
performing size screening on the controls in the candidate control set to obtain the at least one image control.
In some embodiments, the preset type includes at least one of the following: ImageView, FrameLayout, LinearLayout, RelativeLayout, and View.
In some embodiments, performing size screening on the controls in the candidate control set to obtain the at least one image control includes:
for each control in the candidate control set, judging whether the length and width of the control satisfy a preset size condition, and determining a control whose length and width satisfy the preset size condition to be an image control.
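The two-stage selection above (class suffix query, then size screening) can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the `Control` record, the reading of "class attribute suffix" as the last dot-separated component of the class name, and the `min_side` and `max_aspect` thresholds are all assumptions, since the text does not specify the preset size condition.

```python
from dataclasses import dataclass
from typing import List

# Preset types listed in the text.
PRESET_TYPES = {"ImageView", "FrameLayout", "LinearLayout", "RelativeLayout", "View"}


@dataclass
class Control:
    """Hypothetical record for one GUI element (not a real Android class)."""
    class_name: str  # e.g. "android.widget.ImageView"
    width: int       # pixels
    height: int      # pixels


def select_image_controls(controls: List[Control],
                          min_side: int = 200,
                          max_aspect: float = 3.0) -> List[Control]:
    """Keep controls whose class suffix is a preset type and whose size passes.

    min_side and max_aspect are assumed thresholds, not values from the source.
    """
    # Stage 1: candidate set by class suffix (assumed: text after the last dot).
    candidates = [c for c in controls
                  if c.class_name.rsplit(".", 1)[-1] in PRESET_TYPES]
    # Stage 2: size screening on length and width.
    selected = []
    for c in candidates:
        shorter, longer = min(c.width, c.height), max(c.width, c.height)
        if shorter <= 0 or shorter < min_side:
            continue  # too small: more likely an icon than a picture
        if longer / shorter <= max_aspect:
            selected.append(c)
    return selected
```

With these assumed thresholds, a 540x540 `ImageView` is kept, while a 48x48 `ImageView` (icon-sized) and a 1080x240 `FrameLayout` (banner-shaped) are dropped.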
In some embodiments, determining at least one image control according to the current graphical interface includes:
taking a screenshot of the current graphical interface to obtain an image to be recognized;
performing control detection on the image to be recognized, and forming a candidate control set from the detected controls;
performing size screening on the controls in the candidate control set to obtain the at least one image control.
In some embodiments, performing image control identification according to the voice data and the image description text information corresponding to the at least one image control, and determining the target image control among the at least one image control, includes:
performing text conversion on the voice data to obtain voice text information;
performing semantic matching between the voice text information and the image description text information corresponding to the at least one image control, and determining a semantic similarity value corresponding to each of the at least one image control;
determining the target image control according to the semantic similarity values.
In some embodiments, determining the target image control according to the semantic similarity values includes:
determining the image control corresponding to the maximum of the semantic similarity values to be the target image control.
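The matching step above, scoring each image control against the voice text and taking the maximum, can be sketched as follows. The source does not specify a semantic matching model, so this sketch substitutes a plain token-overlap (Jaccard) score as a stand-in; `token_similarity`, `pick_target`, and the control identifiers are hypothetical names used only for illustration.

```python
from typing import Dict, Optional, Tuple


def token_similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (a stand-in for a semantic model)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def pick_target(voice_text: str,
                descriptions: Dict[str, str]) -> Tuple[Optional[str], float]:
    """Return (control_id, score) for the description most similar to voice_text.

    descriptions maps a control id to its image description text.
    """
    if not descriptions:
        return None, 0.0
    scores = {cid: token_similarity(voice_text, text)
              for cid, text in descriptions.items()}
    best = max(scores, key=scores.get)  # maximum similarity value wins
    return best, scores[best]
```

A practical system would likely replace `token_similarity` with an embedding-based or learned matcher; the argmax selection stays the same.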
In some embodiments, after receiving the voice data input by the user, the method further includes:
taking a screenshot of the current graphical interface to obtain an image to be recognized;
performing text conversion on the voice data to obtain voice text information;
performing target detection on the image to be recognized according to the voice text information, and determining a target image control;
determining an operation instruction according to the voice data, and sending the operation instruction to the target image control, so as to implement voice control of the target image control.
In a second aspect, an embodiment of the present application provides a voice control apparatus, the voice control apparatus including a receiving unit, a determining unit, an analysis unit, and a sending unit, wherein:
the receiving unit is configured to receive voice data input by a user;
the determining unit is configured to determine at least one image control according to the current graphical interface;
the analysis unit is configured to perform image content understanding on the at least one image control to obtain image description text information corresponding to the at least one image control, and is further configured to perform image control identification according to the voice data and the image description text information corresponding to the at least one image control, and to determine a target image control among the at least one image control;
the determining unit is further configured to determine an operation instruction according to the voice data;
the sending unit is configured to send the operation instruction to the target image control, so as to implement voice control of the target image control.
In some embodiments, the determining unit is further configured to determine graphical interface element information corresponding to the current graphical interface, and to determine the at least one image control according to the graphical interface element information.
In some embodiments, the voice control apparatus further includes a calling unit configured to call underlying system code information to obtain the graphical interface element information, or to call a system accessibility service function interface to obtain the graphical interface element information.
In some embodiments, the analysis unit is further configured to query the graphical interface element information for controls whose class attribute suffix is a preset type, to form a candidate control set, and to perform size screening on the controls in the candidate control set to obtain the at least one image control.
In some embodiments, the determining unit is further configured to judge, for each control in the candidate control set, whether the length and width of the control satisfy a preset size condition, and to determine a control whose length and width satisfy the preset size condition to be an image control.
In some embodiments, the voice control apparatus further includes a detection unit configured to take a screenshot of the current graphical interface to obtain an image to be recognized, to perform control detection on the image to be recognized and form a candidate control set from the detected controls, and to perform size screening on the controls in the candidate control set to obtain the at least one image control.
In some embodiments, the analysis unit is further configured to perform text conversion on the voice data to obtain voice text information, to perform semantic matching between the voice text information and the image description text information corresponding to the at least one image control so as to determine a semantic similarity value corresponding to each of the at least one image control, and to determine the target image control according to the semantic similarity values.
In some embodiments, the detection unit is configured to take a screenshot of the current graphical interface to obtain an image to be recognized, to perform text conversion on the voice data to obtain voice text information, to perform target detection on the image to be recognized according to the voice text information so as to determine a target image control, and to determine an operation instruction according to the voice data and send the operation instruction to the target image control, so as to implement voice control of the target image control.
In a third aspect, an embodiment of the present application provides an electronic device, the electronic device including a memory and a processor, wherein:
the memory is configured to store a computer program capable of running on the processor;
the processor is configured to execute the method according to any one of the implementations of the first aspect when running the computer program.
In a fourth aspect, an embodiment of the present application provides a computer storage medium storing a computer program which, when executed by at least one processor, implements the method according to any one of the implementations of the first aspect.
In order to provide a more detailed understanding of the features and technical content of the embodiments of the present application, the implementation of the embodiments is described in detail below with reference to the accompanying drawings. The drawings are for reference and illustration only and are not intended to limit the embodiments of the present application.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field to which the present application belongs. The terms used herein are only for the purpose of describing the embodiments of the present application and are not intended to limit the present application.
In the following description, "some embodiments" describes a subset of all possible embodiments. It should be understood that "some embodiments" may refer to the same subset or to different subsets of all possible embodiments, and that these may be combined with each other where no conflict arises. It should also be noted that the terms "first/second/third" in the embodiments of the present application are used only to distinguish similar objects and do not represent a specific ordering of the objects. Where permitted, the specific order or sequence of "first/second/third" may be interchanged, so that the embodiments described herein can be implemented in an order other than that illustrated or described herein.
Thanks to the rapid development of hardware and electronic products in recent years, voice-based human-computer interaction has become increasingly mature, increasingly common, and increasingly accepted and used. As voice interaction gradually reaches into every aspect of daily life, the demand for controlling a Graphical User Interface (GUI) by voice is also growing.
At present, the main problem with a Voice and Graphical User Interface (VGUI) is application adaptation, because the vast majority of applications were designed and developed without voice interaction in mind. Taking smartphones as an example, current mobile applications are designed primarily for touchscreen interaction, so most of them have never been adapted for voice interaction. As a result, many problems arise when voice is used to interact with and control an application's graphical interface on a phone. For example, an interface element may lack a text description, or it may have a text description that the user cannot conveniently use to refer to it (the text description is too long, contains symbols or pictures, is hard to read, or is identical or similar to the text descriptions of other interface elements). In these cases, the user cannot refer to the control they want to interact with directly by its text description.
In the related art, the current solutions for this situation mainly include the following:
(1) Icon recognition: a model detects and recognizes common, unambiguous icon controls to obtain a description text (the common name or term for the icon). The user can then describe an icon control using common knowledge or common names, such as "play", "pause", "previous", and "next" for audio and video playback buttons, thereby describing the target control and achieving interaction;
(2) Text recognition: a model recognizes any text that may be contained in image controls, such as pictures and icons, that have no text description, and the recognized text is used as the control's text description to match the user's interaction instruction, thereby locating the target control and achieving interaction;
(3) Spatial orientation reference: for example, "the button to the right of the download button" or "the icon below the like button". The target control is referred to through its spatial relationship to another control that can be described, thereby describing the target control and achieving interaction;
(4) Numeric index reference: for example, "the first button". All controls are numbered, and a control is then referred to by its number, thereby describing the target control and achieving interaction. The control numbers are generally not displayed, so the user has to work out the number of a control on their own;
(5) Overlaid text instructions: a text description of each interactive control is overlaid on the GUI, and the user refers to a control by the text description overlaid on it, thereby achieving interaction;
(6) Overlaid numeric indexes: a numeric index for each interactive control is overlaid on the GUI, and the user refers to a control by the number overlaid on it, thereby achieving interaction;
(7) Overlaid grid with numbering: a grid is overlaid on the GUI in full screen, and each grid cell is numbered. The user refers to a control by the number of the grid cell in which the control is located, thereby describing the target control and achieving interaction.
However, for cases where elements in the graphical interface lack a text description, or have a text description that is inconvenient for the user to speak directly, each of the above solutions has limitations and none applies to all cases. (1) Icon recognition applies only to common, unambiguous control icons; it cannot handle other types of icons or non-icon content, so its scope of application is limited. (2) Text recognition applies only when the image contains text, so its applicable cases are limited; it also requires considerable computing resources, so processing latency is generally high, the cost of use is high, and accuracy is limited. (3) Spatial orientation reference requires first finding a control that can be located by its text description to serve as a reference, but in many cases no such control can be found, so its scope of application is relatively limited. (4) Numeric index reference requires the program to number the controls and the user to refer to a control by its number, while the numbers themselves are not displayed in the interface. In actual use, the user's way of numbering does not necessarily agree with the program's; moreover, an interface may contain dozens of interactive objects, and it is very difficult for the user to number the controls one by one. (5) Overlaid text instructions require the text instructions to be generated first, and generating them depends on the controls' text descriptions, so the text instructions may suffer from the same problems as the text descriptions. In addition, overlaid content that is too large covers the original content, while overlaid content that is too small is hard for the user to read; and since an interface may contain dozens of interactive objects, the interface ends up covered in densely packed prompts, which greatly degrades the user's experience and the visual appearance. (6) Overlaid numeric indexes are simple to implement but do not help the user remember the correct interaction instruction. They also share the drawbacks of overlaid content described in (5): content that is too large covers the original content, content that is too small is hard to read, and dozens of densely packed prompts greatly degrade the user's experience and the visual appearance. (7) With an overlaid numbered grid, the grid cells may be too large or too small: a target control may span several grid cells, and a single grid cell may contain several interactive objects. In these cases the user needs to perform several operations before the interaction target is finally determined. In addition, the overlaid content covers the original content, which significantly degrades the user's experience and the visual appearance.
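The grid limitation in item (7) can be made concrete with a small calculation. The sketch below is an illustration only: it assumes a uniform grid with row-major numbering starting at 1 and control bounds given as (left, top, right, bottom) in pixels with exclusive right and bottom edges. None of these conventions come from the source.

```python
from typing import List, Tuple


def cells_covered(bounds: Tuple[int, int, int, int],
                  screen_w: int, screen_h: int,
                  rows: int, cols: int) -> List[int]:
    """Return the 1-based, row-major grid cell numbers that a control overlaps."""
    left, top, right, bottom = bounds  # right and bottom are exclusive
    cell_w = screen_w / cols
    cell_h = screen_h / rows
    first_col = int(left // cell_w)
    last_col = min(cols - 1, int((right - 1) // cell_w))
    first_row = int(top // cell_h)
    last_row = min(rows - 1, int((bottom - 1) // cell_h))
    return [r * cols + c + 1
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]
```

On an assumed 1080x1920 screen with a 4x3 grid, a picture at (300, 400, 800, 900) overlaps six cells, so no single cell number identifies it, whereas a small icon at (10, 10, 50, 50) falls entirely within cell 1 and may share that cell with other controls.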
In short, by analyzing the cases in which the user cannot, or cannot conveniently, describe the target control by a text description, these cases can be roughly divided into two types: icons and pictures. Taking FIG. 1 as an example, the icon type is shown by the bold dashed boxes in FIG. 1. Such controls are generally small and usually have a relatively fixed appearance and meaning, so they can be handled by icon recognition. Non-standard, uncommon, or ambiguous icons can be handled with multiple labels, such as shape, color, appearance style, and visual semantics, or by combining spatial orientation with a numeric index. The other type is the picture type, shown by the bold solid boxes in FIG. 1. Such controls are generally large and mainly appear in lists of images, videos, files, messages, and the like; the controls themselves may be arranged regularly (for example, in a grid) or irregularly. However, the visual content and meaning of the pictures vary widely. A picture may or may not have a text description, and the text description may be duplicated or may contain symbols, pictures, or other content that is inconvenient for the user to describe directly. For this case, the existing solutions all have limitations, which makes voice control difficult to implement and prevents them from offering a good interaction method and experience.
Based on this, an embodiment of the present application provides a voice control method that: receives voice data input by a user; determines at least one image control according to the current graphical interface; performs image content understanding on the at least one image control to obtain image description text information corresponding to the at least one image control; performs image control identification according to the voice data and the image description text information corresponding to the at least one image control, and determines a target image control among the at least one image control; and determines an operation instruction according to the voice data and sends the operation instruction to the target image control, so as to implement voice control of the target image control. In this way, a voice interaction method based on image content understanding does not require applications to be adapted for voice interaction. This saves development cost and makes it easier for the user to describe controls, which effectively improves the convenience of voice control and better achieves the purpose of voice interaction and control.
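The overall flow just described can be sketched as a single function that wires the steps together. This is a minimal sketch under stated assumptions: speech-to-text, image captioning, similarity scoring, and instruction parsing are passed in as callables because the source does not name concrete models, and the function and parameter names (`voice_control`, `transcribe`, `describe`, `similarity`, `parse_action`) are hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple


def voice_control(voice_data: bytes,
                  image_controls: Dict[str, bytes],
                  transcribe: Callable[[bytes], str],
                  describe: Callable[[bytes], str],
                  similarity: Callable[[str, str], float],
                  parse_action: Callable[[str], str],
                  ) -> Optional[Tuple[str, str]]:
    """Return (target_control_id, operation) or None if there are no image controls.

    image_controls maps a control id to the pixels of that control.
    """
    # Convert the received voice data to text.
    voice_text = transcribe(voice_data)
    if not image_controls:
        return None
    # Image content understanding: one description text per image control.
    descriptions = {cid: describe(img) for cid, img in image_controls.items()}
    # Image control identification: pick the best-matching description.
    target = max(descriptions,
                 key=lambda cid: similarity(voice_text, descriptions[cid]))
    # Determine the operation instruction from the same utterance.
    return target, parse_action(voice_text)
```

The caller would then dispatch the returned operation to the returned control, for example by performing a click on it through the platform's accessibility interface.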
The embodiments of the present application are described in detail below with reference to the accompanying drawings.
In an embodiment of the present application, refer to FIG. 2, which shows a schematic flowchart of a voice control method provided by an embodiment of the present application. As shown in FIG. 2, the method may include:
S201: Receive voice data input by a user.
S202: Determine at least one image control according to the current graphical interface.
It should be noted that the embodiments of the present application are applied to a voice control apparatus, or to an electronic device in which the apparatus is integrated. The electronic device may be implemented in various forms. For example, the electronic device may include a smartphone, a tablet computer, a notebook computer, a palmtop computer, a Personal Digital Assistant (PDA), a Portable Media Player (PMP), a navigation device, a wearable device, a voice assistant, and so on, which is not limited in the embodiments of the present application.
It should also be noted that an operation instruction can be determined from the voice data. Thus, after receiving the voice data input by the user, the electronic device can perform the corresponding operation on the current graphical interface according to the operation instruction determined from the voice data. However, taking the picture-type controls in FIG. 1 as an example, the user cannot, or cannot conveniently, describe them directly by a text description. In this case, the image controls in the current graphical interface need to be found, so that image content understanding can subsequently be used to interpret both the voice data input by the user and the content of the graphical interface, thereby matching the user's interaction intention to the interaction target and achieving voice interaction.
在一些实施例中,对于S202来说,所述根据当前的图形界面,确定至少一个图像控件,可以包括:In some embodiments, for S202, the determining at least one image control according to the current graphical interface may include:
确定当前的图形界面对应的图形界面元素信息;Determine the graphical interface element information corresponding to the current graphical interface;
根据图形界面元素信息,确定至少一个图像控件。Determine at least one image control according to the graphical interface element information.
It should be noted that, in the embodiments of the present application, the graphical interface element information may be represented by a hierarchical structure tree, which may also be called a view tree (View Tree). Exemplarily, the GUI element information in the electronic device may be the hierarchical structure tree shown in FIG. 3. Each node in the View Tree represents an element or control (Control/Widget/Element) in the GUI, and the related attributes of the element may include a text description, interaction attributes (whether it is clickable, whether it accepts text input, whether it is scrollable, etc.), the control position, and so on.
In addition, for a node in the View Tree, Table 1 shows some related attribute information of the node, which may include an index number, a text description, interaction attributes (whether it is clickable, whether it accepts text input, whether it is scrollable, etc.), the control position, and so on, as follows.
Table 1
Figure PCTCN2022122020-appb-000001
In the View Tree, the elements directly visible to the user are mainly the leaf nodes (Leaf Elements). The non-leaf nodes are generally invisible to the user; they mainly serve as interface element containers (Containers) that constrain and control the position, size, arrangement, etc. of elements. Some Containers also handle user interaction (in which case their clickable attribute is true).
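The View Tree described above can be pictured with a small data structure. The following Python sketch is illustrative only: the class `ViewNode`, its field names, and the sample tree are assumptions made for this example and are not taken from Table 1 or from any platform API.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class ViewNode:
    """One node of a View Tree (field names are hypothetical)."""
    index: int
    class_name: str
    text: str = ""
    clickable: bool = False
    # Control position as (left, top, right, bottom) in pixels.
    bounds: Tuple[int, int, int, int] = (0, 0, 0, 0)
    children: List["ViewNode"] = field(default_factory=list)


def iter_leaves(node: ViewNode) -> Iterator[ViewNode]:
    """Yield leaf nodes in document order; these are the elements a user normally sees."""
    if not node.children:
        yield node
        return
    for child in node.children:
        yield from iter_leaves(child)


# A tiny sample tree: containers hold the visible leaf elements.
root = ViewNode(0, "android.widget.FrameLayout", children=[
    ViewNode(0, "android.widget.LinearLayout", clickable=True, children=[
        ViewNode(0, "android.widget.ImageView", bounds=(0, 0, 540, 400)),
        ViewNode(1, "android.widget.TextView", text="Title"),
    ]),
    ViewNode(1, "android.widget.Button", text="OK", clickable=True),
])

leaf_classes = [leaf.class_name for leaf in iter_leaves(root)]
```

Here the clickable `LinearLayout` plays the role of a container that also handles user interaction, while only its children are yielded as leaves.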
In a specific embodiment, determining the graphical interface element information corresponding to the current graphical interface may include:
calling underlying system code to obtain the graphical interface element information; or
calling the system accessibility service function interface to obtain the graphical interface element information.
That is to say, taking an electronic device based on the Android system as an example, there are mainly two ways to obtain graphical interface element information from the Android system: an interface for obtaining the structure and information of interface elements may be provided directly in the underlying system code; or the information may be obtained through the Accessibility Service function interface of the system. The former approach is complex to implement, requires considerable development effort, and requires modifying the underlying system code, which carries certain security risks, but it can obtain more comprehensive and more accurate interface element information. The latter approach is simple to implement and requires less development effort, but the graphical interface element information it obtains may be incomplete and may be partly incorrect. It should be noted that either approach may be selected according to the actual situation, which is not limited here.
Further, after the graphical interface element information is obtained, all possible controls can be found, and the required at least one image control can then be filtered out. Specifically, in some embodiments, determining at least one image control according to the graphical interface element information may include:
querying, from the graphical interface element information, the controls whose class attribute suffix is a preset type, to form a candidate control set; and
performing size screening on the controls in the candidate control set to obtain at least one image control.
It should be noted that, in the embodiments of the present application, the preset type may include at least one of the following: image view (ImageView), frame layout (FrameLayout), linear layout (LinearLayout), relative layout (RelativeLayout), and view (View).
Further, in some embodiments, performing size screening on the controls in the candidate control set to obtain at least one image control may include: judging, for each control in the candidate control set, whether its length and width meet a preset size condition, and determining the controls whose length and width meet the preset size condition as image controls. That is to say, when the length and width of a control meet the preset size condition, the control is regarded as an image control rather than another kind of control such as an icon, a button, or a decorative strip, and it is thus filtered out of the candidate control set.
It should also be noted that, in the embodiments of the present application, the preset size condition may be: the length and the width of the control are each greater than a first preset value, and the ratio of the length to the width of the control is less than a second preset value.
Exemplarily, the first preset value may be 100 dp, where dp = pixel/density, pixel is the absolute pixel count, density is the pixel density per unit size, and dp is the standard size representation. The second preset value may be 3, that is, the length-to-width ratio of an image control needs to be less than 3.
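The preset size condition can be worked through as a short Python sketch. The function names and default parameters below are assumptions for illustration. The sketch also assumes that `density` is a scale factor such as 3.0 and that the length-to-width ratio is taken as the longer side over the shorter side; the source text does not fix either detail.

```python
def px_to_dp(pixels: float, density: float) -> float:
    """Convert absolute pixels to dp using dp = pixel / density."""
    return pixels / density


def meets_size_condition(width_px: float, height_px: float, density: float,
                         min_side_dp: float = 100.0, max_ratio: float = 3.0) -> bool:
    """Return True when both sides exceed min_side_dp and the side ratio is below max_ratio."""
    width_dp = px_to_dp(width_px, density)
    height_dp = px_to_dp(height_px, density)
    # Both sides must be strictly greater than the first preset value.
    if width_dp <= min_side_dp or height_dp <= min_side_dp:
        return False
    # Assumption: the ratio is longer side over shorter side.
    return max(width_dp, height_dp) / min(width_dp, height_dp) < max_ratio


# A 540 x 400 px thumbnail at density 3.0 is 180 x 133.3 dp: kept.
thumbnail_ok = meets_size_condition(540, 400, 3.0)
# A 1080 x 120 px strip at density 3.0 is 360 x 40 dp: rejected as too short.
strip_ok = meets_size_condition(1080, 120, 3.0)
# A 1080 x 330 px banner at density 3.0 is 360 x 110 dp: rejected, ratio is about 3.27.
banner_ok = meets_size_condition(1080, 330, 3.0)
```

With these example values, only the thumbnail passes; the strip fails the minimum-size check and the banner fails the ratio check.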
That is to say, when the AccessibilityService function interface is used in the Android system to obtain graphical interface element information, the content of some controls may not be obtained correctly, so the type (class attribute suffix) of some target controls may be FrameLayout, LinearLayout, RelativeLayout, View, etc., rather than ImageView. For this situation, the controls among the leaf nodes of the View Tree whose type is ImageView or an unconventional type (FrameLayout, LinearLayout, RelativeLayout, View, etc., which normally do not appear among the leaf nodes) can be filtered out as the candidate control set.
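The candidate selection described above can be sketched in Python as follows. The `Node` class and helper names are hypothetical. The sketch also assumes that "class attribute suffix" means the last dot-separated component of the class name, matched exactly, so that a `TextView` leaf is not mistaken for the preset type `View`.

```python
from dataclasses import dataclass, field
from typing import List

# Preset types named in the text.
PRESET_SUFFIXES = ("ImageView", "FrameLayout", "LinearLayout", "RelativeLayout", "View")


@dataclass
class Node:
    """Minimal View Tree node for this example (hypothetical)."""
    class_name: str
    children: List["Node"] = field(default_factory=list)


def class_suffix(class_name: str) -> str:
    """Return the part after the last dot, e.g. 'ImageView'."""
    return class_name.rsplit(".", 1)[-1]


def candidate_controls(root: Node) -> List[Node]:
    """Collect leaf nodes whose class suffix is a preset type, in document order."""
    candidates: List[Node] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.children:
            # Push children reversed so they are popped in their original order.
            stack.extend(reversed(node.children))
        elif class_suffix(node.class_name) in PRESET_SUFFIXES:
            candidates.append(node)
    return candidates


tree = Node("android.widget.FrameLayout", children=[
    Node("android.widget.ImageView"),
    Node("android.widget.TextView"),
    Node("android.view.View"),
    Node("android.widget.LinearLayout"),  # a layout appearing as a leaf
])
candidate_names = [class_suffix(n.class_name) for n in candidate_controls(tree)]
```

The resulting candidates would still need the size screening step before being treated as image controls.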
It can be understood that obtaining graphical interface element information through the AccessibilityService function interface in the Android system requires the application itself to be adapted to that interface. Some applications or some custom controls are poorly adapted to the AccessibilityService function interface, so that when graphical interface element information is obtained through it, the content of some controls cannot be obtained correctly, resulting in omissions and errors in the user interface (UI) information. In addition, the visual appearance of a control, that is, its image information, cannot be obtained through the AccessibilityService function interface of the Android system, so the embodiments of the present application also need an interface screenshot to assist in obtaining the image information of the controls.
The embodiments of the present application propose first obtaining the current graphical interface element information (View Tree) from the system and then filtering the image controls out of the View Tree. However, considering that obtaining the View Tree may be difficult on some platforms or systems, a way of performing control detection on a screenshot of the current graphical interface and then filtering out the image controls is also proposed here. Therefore, in some embodiments, for S202, determining at least one image control according to the current graphical interface may include:
taking a screenshot of the current graphical interface to obtain an image to be recognized;
performing control detection on the image to be recognized, and forming a candidate control set from the several detected controls; and
performing size screening on the controls in the candidate control set to obtain at least one image control.
It should be noted that performing size screening on the controls in the candidate control set to obtain at least one image control may include: judging, for each control in the candidate control set, whether its length and width meet the preset size condition, and determining the controls whose length and width meet the preset size condition as image controls. That is to say, among the detected controls, a control whose length and width meet the preset size condition is regarded as an image control rather than another kind of control such as an icon, a button, or a decorative strip, so that the image controls can be filtered out.
That is to say, after the image to be recognized is obtained, the controls contained in it (for example, text and images) and their positions can be detected to obtain several controls. Size screening is then performed on the detected controls; exemplarily, both the length and the width of an image control need to be greater than a certain size (for example, 100 dp), and its length-to-width ratio needs to be less than 3. The image controls are thereby filtered out of the detected controls.
S203: Perform image content understanding on the at least one image control to obtain image description text information corresponding to the at least one image control.
S204: Perform image control recognition according to the voice data and the image description text information corresponding to the at least one image control, and determine a target image control among the at least one image control.
It should be noted that, after at least one image control is obtained by screening, image content understanding can be performed on these image controls to obtain, for each image control, image description text information about its image content. In the embodiments of the present application, the methods of image content understanding may include: image classification, image detection, image description generation, image-based text referring target detection, and so on. These methods are described in detail below:
(1) Image classification: the image content is classified so as to match it with text description labels such as "car", "food", and "person". The text labels obtained by image classification are generally not detailed enough, and the degree of image understanding they provide is limited. For example, for the second video in the first row of FIG. 1, if the label is "car", no further detail is available, such as whether it is a motorcycle, a sedan, a truck, or a bus; this can be improved to some extent by multi-level labels or multiple labels, for example "car/sedan" (first-level label/second-level label/...). In addition, image classification cannot obtain several labels of the same level at once. For example, the first video in the third row of FIG. 1 contains both a person and food, but the classification model can only output one of the labels "person" or "food", not both; this can be improved to some extent by means of confidence values, for example "food: 0.5", "person: 0.4".
(2) Image detection: the objects contained in the image are detected by a detection model. For example, the second video in the first row of FIG. 1 contains "person" and "car", and the second video in the second row contains "person" and "food". The detection model can also be cascaded or combined with classification models to achieve fine-grained object recognition. For example, after "car" is detected, the type, manufacturer, model, color, etc. of the car can be further identified; after "person" is detected, the person's gender, age, identity (face recognition, that is, who the person is), emotion, etc. can be further identified. Image detection can provide more, and more detailed, information than image classification, but it cannot provide the relationships among the detected objects. Moreover, the information it provides is fragmented and differs considerably from the user's natural language description, which makes subsequent matching more difficult. If more detailed information is required, a more complex model or a cascade of several models is needed, so the system complexity and the cost of use are higher.
(3) Image description generation: the image content is understood, and a description of it is then generated in natural language. For the second video in the first row of FIG. 1, one generated description is "A person appears next to a sedan; the person is carrying a schoolbag; the sedan is white; ...". The quality and level of detail of the generated description depend on the model accuracy and the related settings. This approach is the closest to the user's natural language description, so subsequent matching is less difficult, the system complexity is lower, and the cost of use is relatively low.
(4) Image-based text referring target detection: the model receives the user's instruction text and the image at the same time, performs semantic extraction and matching of the text and the image internally, and finally outputs directly the position in the image of the object referred to by the user's instruction text. The user's interaction instruction can thus be matched with, and located on, the target interaction object. With this approach, steps S203 and S204 can be merged into a single step. This approach loses the least information over the whole process and can achieve good results, although the actual effect also depends on the quality and complexity of the model. It also has the lowest system complexity and the lowest cost of use.
In practical applications, image content understanding is usually not implemented with only one of the above methods; several of them may be selected and combined according to the actual situation, which is not limited in the embodiments of the present application.
In some embodiments, for S204, performing image control recognition according to the voice data and the image description text information corresponding to the at least one image control, and determining the target image control among the at least one image control, may include:
performing text conversion on the voice data to obtain voice text information;
performing semantic matching between the voice text information and the image description text information corresponding to the at least one image control, to determine a semantic similarity value corresponding to the at least one image control; and
determining the target image control according to the semantic similarity values.
In a specific embodiment, determining the target image control according to the semantic similarity values may include: determining the image control corresponding to the maximum of the semantic similarity values as the target image control.
Specifically, after the semantic similarity value corresponding to the at least one image control is obtained, the maximum similarity value is selected from the semantic similarity values, and the image control corresponding to the maximum similarity value is determined as the target image control.
That is to say, after the voice text information and the image description text information corresponding to the at least one image control are obtained, the voice text information can be semantically matched with the image description text information of each image control to determine the semantic similarity value of each image control. For example, a traditional text matching method (such as the TF-IDF, BM25, simhash, or Jaccard algorithm) or a semantic matching model trained with a neural network may be used. The semantically most similar image control (that is, the image control corresponding to the maximum similarity value) is then selected as the target image control.
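As one concrete instance of the traditional text matching methods mentioned above, the following Python sketch uses Jaccard similarity over whitespace-separated tokens and picks the control with the maximum value. The control identifiers, the sample descriptions, and the simple tokenizer are assumptions for illustration; a real system would likely use a proper tokenizer or a trained semantic matching model.

```python
from typing import Dict, Set


def tokens(text: str) -> Set[str]:
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity: size of the intersection over size of the union of token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def pick_target(speech_text: str, descriptions: Dict[str, str]) -> str:
    """Return the id of the control whose description is most similar to the speech text."""
    return max(descriptions, key=lambda cid: jaccard(speech_text, descriptions[cid]))


# Hypothetical image description text for three image controls.
descriptions = {
    "control_1": "a person standing next to a white car",
    "control_2": "a person eating food at a table",
    "control_3": "a treasure chest on a beach",
}
target = pick_target("open the video containing a car", descriptions)
```

Token overlap is crude (shared words such as "a" contribute to the score), which is one reason a trained semantic matching model may perform better in practice.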
S205: Determine an operation instruction according to the voice data, and send the operation instruction to the target image control, so as to implement voice control of the target image control.
It should be noted that, after the target image control is determined, the electronic device can determine the operation instruction according to the user's voice data and then send the operation instruction to the target image control so that the corresponding operation (click, long press, etc.) is performed, thereby completing the voice interaction. Here, the operation instruction is determined from the voice data input by the user.
In short, the technical solution of the embodiments of the present application provides a method for understanding the image content of graphical interface elements, and further provides a voice interaction control method based on image content understanding. The controlled application therefore does not need to be adapted for voice control, which saves development and promotion costs and is convenient for users. For the picture-type controls shown in FIG. 1, when the user cannot, or cannot conveniently, describe the target control through a text description, the voice interaction approach based on image content understanding can match the user's interaction intention with the interaction target by understanding the voice data input by the user and the image content, thereby achieving voice interaction. Still taking FIG. 1 as an example, when the voice data input by the user is "open the video containing a car", the second video in the first row, which contains a car, can be opened; when the voice data input by the user is "open the video containing a treasure chest", the second video in the third row, which contains a treasure chest, can be opened. This technical solution offers the user a more natural, more intelligent, and more intuitive way of interacting, and thus a better experience.
This embodiment provides a voice control method: receiving voice data input by a user; determining at least one image control according to the current graphical interface; performing image content understanding on the at least one image control to obtain image description text information corresponding to the at least one image control; performing image control recognition according to the voice data and the image description text information corresponding to the at least one image control, and determining a target image control among the at least one image control; and determining an operation instruction according to the voice data and sending the operation instruction to the target image control, so as to implement voice control of the target image control. In this way, voice interaction based on image content understanding does not require the application to be adapted for voice interaction, which saves development costs, makes it easier for the user to describe the target, and effectively improves the convenience of voice control, thereby better achieving voice interaction and control.
In another embodiment of the present application, based on the same inventive concept as the foregoing embodiments, refer to FIG. 4, which shows a detailed schematic flowchart of a voice control method provided by an embodiment of the present application. As shown in FIG. 4, the detailed process may include:
S301: Obtain the voice text information corresponding to the user's voice data.
S302: Obtain the graphical interface element information (View Tree).
S303: Query all controls that meet the requirements from the View Tree to form a candidate control set.
S304: Perform size screening on the controls in the candidate control set to obtain at least one image control.
S305: Perform image content understanding on the at least one image control to obtain image description text information corresponding to the at least one image control.
S306: Perform semantic matching between the voice text information and the image description text information corresponding to the at least one image control, and determine the target image control corresponding to the maximum semantic similarity value.
S307: Send an operation instruction to the target image control, so as to implement voice interaction between the user and the target image control.
It should be noted that the operation instruction is determined according to the user's voice data. The main flow of the voice interaction control method based on image content understanding proposed in the embodiments of the present application is as follows. During the user's voice interaction, first, the instruction text of the user's voice interaction is obtained. Second, the current graphical interface element information (View Tree) is obtained. Third, all controls that meet the requirements are queried from the View Tree, that is, all controls whose class attribute suffix is ImageView or an unconventional type (such as FrameLayout, LinearLayout, RelativeLayout, or View). Fourth, size screening is performed on the controls found, where the length and width of a selected image control need to be greater than a certain size (such as 100 dp) and its length-to-width ratio needs to be less than 3. Fifth, image content understanding is performed on the selected image controls, and image description text is generated for their image content. Sixth, semantic matching is performed between the instruction text and the image description text to find the semantically most similar image control as the target image control. Finally, the target image control executes the user's operation instruction (click, long press, etc.), completing the user's voice interaction.
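The flow S301 to S307 can be summarized as a short Python sketch in which every platform-specific step is passed in as a function. All names here (`voice_control`, `find_image_controls`, `describe`, `similarity`, `parse_operation`, `dispatch`) are hypothetical placeholders, and the stub implementations in the usage example are stand-ins, not the models or interfaces used by the application.

```python
from typing import Callable, List, Optional, Tuple


def voice_control(
    speech_text: str,                              # S301: text already converted from speech
    find_image_controls: Callable[[], List[str]],  # S302-S304: View Tree query and size screening
    describe: Callable[[str], str],                # S305: image content understanding
    similarity: Callable[[str, str], float],       # S306: semantic matching score
    parse_operation: Callable[[str], str],         # operation instruction from the speech text
    dispatch: Callable[[str, str], None],          # S307: send the instruction to the control
) -> Optional[Tuple[str, str]]:
    """Run the steps in order and return (target control, operation), or None if no control is found."""
    controls = find_image_controls()
    if not controls:
        return None
    descriptions = {control: describe(control) for control in controls}
    # The target is the control with the maximum similarity value.
    target = max(controls, key=lambda c: similarity(speech_text, descriptions[c]))
    operation = parse_operation(speech_text)
    dispatch(target, operation)
    return target, operation


# Usage with stub components (all values are made up for the example).
sent = []
result = voice_control(
    "open the video containing a car",
    find_image_controls=lambda: ["video_1", "video_2"],
    describe={"video_1": "a person eating food", "video_2": "a white car on a road"}.get,
    similarity=lambda a, b: len(set(a.split()) & set(b.split())),
    parse_operation=lambda text: "long_press" if "long press" in text else "click",
    dispatch=lambda control, op: sent.append((control, op)),
)
```

Passing the steps in as functions keeps the sketch independent of how the View Tree is obtained or which image understanding model is used.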
In this way, for the situation in which the description text of an interface element is missing, or is not convenient for the user to say directly, when the user operates graphical interface content by voice, the embodiments of the present application provide a voice interaction and control method based on image content understanding. With this technical solution, the user can directly say "I want to watch Cristiano Ronaldo's video" or "open the video containing a car", so that the video containing "Cristiano Ronaldo" or "car" is matched and located, achieving control of and interaction with the interface element. Because the controlled application does not need to be adapted for voice control, development costs are saved and the solution is easy to promote and use. This way of interacting also matches users' habits, makes voice control more convenient, makes the target easier to describe, and saves the user's time, thereby improving the user experience.
Further, the above technical solution needs to first obtain the current graphical interface element information (View Tree) from the system, and then filter the image controls out of the View Tree for image content understanding. However, obtaining the View Tree may be difficult on some platforms or systems. Therefore, the embodiments of the present application can also understand the image content from a screenshot of the current graphical interface, so as to achieve voice interaction and control.
Refer to FIG. 5, which shows a detailed schematic flowchart of another voice control method provided by an embodiment of the present application. As shown in FIG. 5, the detailed process may include:
S401: Obtain the voice text information corresponding to the user's voice data.
S402: Take a screenshot of the current graphical interface to obtain an image to be recognized.
S403: Perform control detection on the image to be recognized to obtain several candidate controls.
S404: Perform size screening on the several candidate controls to obtain at least one image control.
S405: Perform image content understanding on the at least one image control to obtain image description text information corresponding to the at least one image control.
S406: Perform semantic matching between the voice text information and the image description text information corresponding to the at least one image control, and determine the target image control corresponding to the maximum semantic similarity value.
S407: Send an operation instruction to the target image control, so as to implement voice interaction between the user and the target image control.
It should be noted that the operation instruction is determined according to the user's voice data. The main flow of the voice interaction control method based on image content understanding proposed in the embodiments of the present application is as follows. During the user's voice interaction, first, the instruction text of the user's voice interaction is obtained. Second, a screenshot of the current graphical interface is obtained. Third, the controls contained in the screenshot (text, images, etc.) and their positions are detected. Fourth, size screening is performed on the detected controls based on their positions, where the length and width of a selected image control need to be greater than a certain size (such as 100 dp) and its length-to-width ratio needs to be less than 3. Fifth, image content understanding is performed on the selected image controls, and image description text is generated for their image content. Sixth, semantic matching is performed between the instruction text and the image description text to find the semantically most similar image control as the target image control. Finally, the target image control executes the user's operation instruction (click, long press, etc.), completing the user's voice interaction.
It should also be noted that, for the non-target controls in this technical solution, text description information can be extracted by common means such as icon recognition and text recognition. With this technical solution, the user's voice control and matching can be implemented relying on images alone, thereby achieving user interaction and control; the solution is simple to implement and easy to promote.
进一步地,通过上述技术方案可以明白,通过图像文本描述+文本语义匹配的方式,该过程中会伴随一定的信息损失,可能无法获得较好的效果和性能。因此,本申请实施例还可以结合图像以及文本指令的方式,利用基于图像的文本指代进行目标检测,直接实现对用户交互目标的匹配,从而实现用户交互和控制的目的。Furthermore, it can be understood from the above technical solutions that, through image text description + text semantic matching, the process will be accompanied by a certain amount of information loss, and better results and performance may not be obtained. Therefore, the embodiment of the present application can also combine images and text instructions, use image-based text references to perform target detection, and directly match user interaction targets, thereby achieving the purpose of user interaction and control.
参见图6,其示出了本申请实施例提供的又一种语音控制方法的详细流程示意图。如图6所示,该详细流程可以包括:Referring to FIG. 6 , it shows a detailed flow chart of another voice control method provided by an embodiment of the present application. As shown in Figure 6, the detailed process may include:
S501:获取用户语音数据对应的语音文本信息。S501: Obtain voice text information corresponding to user voice data.
S502:对当前的图形界面进行截图,得到待识别图像。S502: Take a screenshot of the current graphical interface to obtain an image to be recognized.
S503:根据语音文本信息对待识别图像进行目标检测,确定目标图像控件。S503: Perform target detection on the image to be recognized according to the voice text information, and determine the target image control.
S504:向目标图像控件发送操作指令,以实现用户与目标图像控件的语音交互。S504: Send an operation instruction to the target image control, so as to realize voice interaction between the user and the target image control.
需要说明的是,在本申请实施例中,在接收用户输入的语音数据之后,具体可以包括:对当前的图形界面进行截图,得到待识别图像;对语音数据进行文本转换,得到语音文本信息;根据语音文本信息对待识别图像进行目标检测,确定目标图像控件;根据语音数据确定操作指令,向目标图像控件发送操作指令,以实现对目标图像控件的语音控制。It should be noted that, in the embodiment of the present application, after receiving the voice data input by the user, it may specifically include: taking a screenshot of the current graphical interface to obtain an image to be recognized; performing text conversion on the voice data to obtain voice text information; Perform target detection on the image to be recognized according to the voice text information, and determine the target image control; determine the operation instruction according to the voice data, and send the operation instruction to the target image control, so as to realize the voice control of the target image control.
也就是说,操作指令是根据用户语音数据来确定的。在这里,本申请实施例所提出的基于图像内容理解的语音交互控制方法,其主要流程包括:在用户进行语音交互的过程中,首先,获取用户语音交互的指令文本;其次,获取当前图形界面的截图;再次,利用基于图像的文本指代目标检测,实现对用户交互目标的匹配和定位,以确定出目标图像控件;最后,由目标图像控件执行用户的操作指令(点击、长按等),完成交互过程。That is to say, the operation instruction is determined according to the user's voice data. Here, the main flow of the voice interaction control method based on image content understanding proposed in this embodiment is as follows. During the user's voice interaction: first, acquire the instruction text of the user's voice interaction; second, capture a screenshot of the current graphical interface; third, use image-based text-referred target detection to match and locate the user's interaction target, thereby determining the target image control; finally, the target image control executes the user's operation instruction (click, long press, etc.), completing the interaction.
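As a rough illustration of the S501 to S504 flow, the following Python sketch treats the text-referred target detector as a pluggable callable. The `ground` parameter, the box layout, and the stub detector are all assumptions made for illustration; the source does not name a specific detection model or its interface.

```python
from typing import Callable, Tuple

# (left, top, right, bottom) in screen pixels; an assumed box layout
Box = Tuple[int, int, int, int]


def locate_target(instruction_text: str,
                  screenshot: object,
                  ground: Callable[[str, object], Box]) -> Tuple[int, int]:
    """Ask the detector for the region the instruction refers to and
    return the center of that region as the point to operate on."""
    left, top, right, bottom = ground(instruction_text, screenshot)
    if right <= left or bottom <= top:
        raise ValueError("detector returned an empty region")
    return ((left + right) // 2, (top + bottom) // 2)


def stub_detector(text: str, image: object) -> Box:
    """Stand-in for a real text-referred detection model (hypothetical)."""
    return (100, 40, 300, 140)


tap_point = locate_target("the blue button in the upper right corner",
                          screenshot=None, ground=stub_detector)
```

Passing the detector in as a parameter keeps the flow independent of any particular model, which matches the source's description of this variant as having low system complexity.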
这样,通过本技术方案可以仅依靠图像实现用户的语音操控和匹配,从而实现用户交互和控制的目的。通过该技术方案,可以不对用户的交互指令进行约束,用户可以按照相对自由的方式来进行交互对象(可以不局限于图片,也可以是图标等)的描述,例如“右上角的蓝色按钮”、“我要看汽车视频”、“选择下方加号按钮”、“点击底部第二个按钮”等,而无需按照界面文本描述或者预定义指令来进行交互对象的描述。该技术方案能够给用户提供更加自然、更加智能的语音交互和控制方式;而且该技术方案实现简单,利于推广;同时系统复杂度小,有利于设备侧实现和部署。In this way, through the technical solution, the user's voice control and matching can be realized only by images, so as to realize the purpose of user interaction and control. Through this technical solution, the user's interaction instructions can not be restricted, and the user can describe the interactive object (not limited to pictures, but also icons, etc.) in a relatively free manner, such as "the blue button in the upper right corner" , "I want to watch a car video", "select the plus button below", "click the second button at the bottom", etc., instead of describing the interactive object according to the interface text description or predefined instructions. The technical solution can provide users with a more natural and intelligent voice interaction and control mode; and the technical solution is simple to implement, which is conducive to popularization; at the same time, the system complexity is small, which is beneficial to the realization and deployment of the device side.
本实施例提供了一种语音控制方法,通过上述实施例对前述实施例的具体实现进行了详细阐述,从中可以看出,通过前述实施例的技术方案,不仅能够节省开发成本,而且还能够提升用户使用语音操控时的便捷性,从而更好地实现语音交互和控制的目的。This embodiment provides a voice control method and elaborates on the specific implementation of the foregoing embodiments. It can be seen that the technical solutions of the foregoing embodiments not only save development costs but also make voice control more convenient for users, thereby better achieving the purpose of voice interaction and control.
本申请的又一实施例中,基于前述实施例相同的发明构思,参见图7,其示出了本申请实施例提供的一种语音控制装置60的组成结构示意图。如图7所示,语音控制装置60可以包括:接收单元601、确定单元602、分析单元603和发送单元604;其中,In yet another embodiment of the present application, based on the same inventive concept as the foregoing embodiments, refer to FIG. 7 , which shows a schematic structural diagram of a voice control device 60 provided in the embodiment of the present application. As shown in FIG. 7, the voice control device 60 may include: a receiving unit 601, a determining unit 602, an analyzing unit 603, and a sending unit 604; wherein,
接收单元601,配置为接收用户输入的语音数据;The receiving unit 601 is configured to receive voice data input by the user;
确定单元602,配置为根据当前的图形界面,确定至少一个图像控件;The determining unit 602 is configured to determine at least one image control according to the current graphical interface;
分析单元603,配置为对至少一个图像控件进行图像内容理解,得到至少一个图像控件对应的图像描述文本信息;以及还配置为根据语音数据与至少一个图像控件对应的图像描述文本信息进行图像控件识别,在至少一个图像控件中确定目标图像控件;The analysis unit 603 is configured to understand the image content of at least one image control, and obtain the image description text information corresponding to the at least one image control; and is also configured to perform image control recognition according to the voice data and the image description text information corresponding to the at least one image control , determining a target image control in at least one image control;
确定单元602,还配置为根据语音数据确定操作指令;The determining unit 602 is further configured to determine an operation instruction according to the voice data;
发送单元604,配置为向目标图像控件发送操作指令,以实现对目标图像控件的语音控制。The sending unit 604 is configured to send an operation instruction to the target image control, so as to implement voice control on the target image control.
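The embodiments state that the operation instruction (click, long press, etc.) is determined from the voice data, but they do not say how. Purely as a hypothetical illustration, a keyword lookup over the recognized text could look like the Python sketch below; the keyword table, the operation names, and the default of a click are assumptions, not part of the source.

```python
# Hypothetical keyword table. The source only names "click" and "long press"
# as example operations and does not specify how they are recognized.
OPERATION_KEYWORDS = (
    ("long press", "long_press"),
    ("press and hold", "long_press"),
    ("click", "click"),
    ("tap", "click"),
)


def determine_operation(voice_text: str, default: str = "click") -> str:
    """Return the first operation whose keyword occurs in the recognized text.
    Long-press phrases are listed first so they are checked before 'click'."""
    text = voice_text.lower()
    for keyword, operation in OPERATION_KEYWORDS:
        if keyword in text:
            return operation
    return default  # free-form requests fall back to a click (assumption)
```

With this table, a request such as "I want to watch a car video" contains no operation keyword and falls back to the default click.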
在一些实施例中,确定单元602,还配置为确定当前的图形界面对应的图形界面元素信息;以及根据图形界面元素信息,确定至少一个图像控件。In some embodiments, the determining unit 602 is further configured to determine graphical interface element information corresponding to the current graphical interface; and determine at least one image control according to the graphical interface element information.
在一些实施例中,参见图7,语音控制装置60还可以包括调用单元605,配置为调用系统底层代码信息,获取图形界面元素信息;或者,调用系统辅助服务功能接口,获取图形界面元素信息。In some embodiments, referring to FIG. 7 , the voice control device 60 may further include a calling unit 605 configured to call the system underlying code information to obtain graphical interface element information; or call the system auxiliary service function interface to obtain graphical interface element information.
在一些实施例中,确定单元602,还配置为从图形界面元素信息中查询类属性后缀为预设类型的控件,组成候选控件集合;以及对候选控件集合中的控件进行尺寸筛选,得到至少一个图像控件。In some embodiments, the determining unit 602 is further configured to query, from the graphical interface element information, the controls whose class attribute suffix is a preset type to form a candidate control set; and to perform size screening on the controls in the candidate control set to obtain the at least one image control.
在一些实施例中,预设类型包括下述至少之一:ImageView、FrameLayout、LinearLayout、 RelativeLayout和View。In some embodiments, the preset type includes at least one of the following: ImageView, FrameLayout, LinearLayout, RelativeLayout and View.
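The class-suffix query can be sketched in Python as follows. The element dictionaries and their `class` key are a hypothetical representation of the graphical interface element information. The sketch assumes that "suffix" means the final dot-separated segment of the class name; a plain string suffix test on `View` would also match names such as `TextView`.

```python
PRESET_TYPES = ("ImageView", "FrameLayout", "LinearLayout", "RelativeLayout", "View")


def simple_class_name(class_name: str) -> str:
    """Return the last dot-separated segment, e.g. 'ImageView'."""
    return class_name.rsplit(".", 1)[-1]


def query_candidates(elements, preset_types=PRESET_TYPES):
    """Collect the elements whose simple class name is one of the preset types."""
    return [e for e in elements
            if simple_class_name(e.get("class", "")) in preset_types]


elements = [
    {"class": "android.widget.ImageView"},
    {"class": "android.widget.TextView"},
    {"class": "android.view.View"},
    {"class": "android.widget.Button"},
]
candidates = query_candidates(elements)
```

The resulting candidate set would then go through size screening before any control is treated as an image control.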
在一些实施例中,确定单元602,还配置为在候选控件集合中,判断控件的长度和宽度是否满足预设尺寸条件,将长度和宽度满足预设尺寸条件的控件确定为图像控件。In some embodiments, the determining unit 602 is further configured to, in the set of candidate controls, determine whether the length and width of the controls meet the preset size conditions, and determine the controls whose length and width meet the preset size conditions as image controls.
在一些实施例中,参见图7,语音控制装置60还可以包括检测单元606,配置为对当前的图形界面进行截图,得到待识别图像;以及对待识别图像进行控件检测,将检测得到的若干个控件组成候选控件集合;以及对候选控件集合中的控件进行尺寸筛选,得到至少一个图像控件。In some embodiments, referring to FIG. 7 , the voice control device 60 may further include a detection unit 606 configured to take a screenshot of the current graphical interface to obtain an image to be recognized; The controls form a candidate control set; and size screening is performed on the controls in the candidate control set to obtain at least one image control.
在一些实施例中,分析单元603,具体配置为对语音数据进行文本转换,得到语音文本信息;以及将语音文本信息与至少一个图像控件对应的图像描述文本信息进行语义匹配,确定至少一个图像控件对应的语义相似度值;以及根据语义相似度值,确定目标图像控件。In some embodiments, the analysis unit 603 is specifically configured to perform text conversion on the voice data to obtain voice-to-text information; and semantically match the voice-to-text information with the image description text information corresponding to at least one image control to determine at least one image control The corresponding semantic similarity value; and according to the semantic similarity value, determine the target image control.
在一些实施例中,确定单元602,还配置为确定语义相似度值中的最大相似度值对应的图像控件为目标图像控件。In some embodiments, the determining unit 602 is further configured to determine that the image control corresponding to the maximum similarity value among the semantic similarity values is the target image control.
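Selecting the control with the maximum semantic similarity value can be sketched as below. The bag-of-words cosine similarity is only a simple stand-in for whatever semantic matching model is actually used, which the source does not specify; the control identifiers and description texts are made up for the example.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity; a toy substitute for a semantic model."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[token] * vb[token] for token in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


def pick_target(voice_text, descriptions, similarity=cosine_similarity):
    """descriptions maps a control id to its image description text.
    Returns the id with the highest similarity and that similarity value."""
    if not descriptions:
        raise ValueError("no image controls to match against")
    scores = {cid: similarity(voice_text, text) for cid, text in descriptions.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


descriptions = {
    "img_1": "a red sports car on a road",
    "img_2": "a bowl of noodle soup",
    "img_3": "a cat sleeping on a sofa",
}
target_id, score = pick_target("i want to watch the car video", descriptions)
```

A practical system might also require the best score to exceed some threshold before acting, although the source only describes taking the maximum.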
在一些实施例中,检测单元606,还配置为对当前的图形界面进行截图,得到待识别图像;以及对语音数据进行文本转换,得到语音文本信息;以及根据语音文本信息对待识别图像进行目标检测,确定目标图像控件;以及根据语音数据确定操作指令,向目标图像控件发送操作指令,以实现对目标图像控件的语音控制。In some embodiments, the detection unit 606 is also configured to take a screenshot of the current graphical interface to obtain the image to be recognized; and perform text conversion on the voice data to obtain voice-to-text information; and perform target detection on the image to be recognized based on the voice-to-text information , determine the target image control; and determine the operation instruction according to the voice data, and send the operation instruction to the target image control, so as to realize the voice control of the target image control.
可以理解地,在本实施例中,“单元”可以是部分电路、部分处理器、部分程序或软件等等,当然也可以是模块,还可以是非模块化的。而且在本实施例中的各组成部分可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。It can be understood that, in this embodiment, a "unit" may be a part of a circuit, a part of a processor, a part of a program or software, etc., of course it may also be a module, or it may be non-modular. Moreover, each component in this embodiment may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware or in the form of software function modules.
所述集成的单元如果以软件功能模块的形式实现并非作为独立的产品进行销售或使用时,可以存储在一个计算机可读取存储介质中,基于这样的理解,本实施例的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)或processor(处理器)执行本实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。If the integrated unit is implemented in the form of a software function module and is not sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this embodiment in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the method described in this embodiment. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disc.
因此,本实施例提供了一种计算机存储介质,该计算机存储介质存储有计算机程序,所述计算机程序被至少一个处理器执行时实现前述实施例中任一项所述的方法的步骤。Therefore, this embodiment provides a computer storage medium, the computer storage medium stores a computer program, and when the computer program is executed by at least one processor, the steps of the method described in any one of the preceding embodiments are implemented.
基于上述语音控制装置60的组成以及计算机存储介质,参见图8,其示出了本申请实施例提供的一种电子设备的组成结构示意图。如图8所示,电子设备70可以包括:通信接口701、存储器702和处理器703;各个组件通过总线系统704耦合在一起。可理解,总线系统704用于实现这些组件之间的连接通信。总线系统704除包括数据总线之外,还包括电源总线、控制总线和状态信号总线。但是为了清楚说明起见,在图8中将各种总线都标为总线系统704。其中,通信接口701,用于在与其他外部网元之间进行收发信息过程中,信号的接收和发送;Based on the above composition of the voice control device 60 and the computer storage medium, refer to FIG. 8 , which shows a schematic composition structure diagram of an electronic device provided by an embodiment of the present application. As shown in FIG. 8 , an electronic device 70 may include: a communication interface 701 , a memory 702 , and a processor 703 ; each component is coupled together through a bus system 704 . It can be understood that the bus system 704 is used to realize connection and communication between these components. In addition to the data bus, the bus system 704 also includes a power bus, a control bus and a status signal bus. However, for clarity of illustration, the various buses are labeled as bus system 704 in FIG. 8 . Among them, the communication interface 701 is used for receiving and sending signals during the process of sending and receiving information with other external network elements;
存储器702,用于存储能够在处理器703上运行的计算机程序;The memory 702 is configured to store a computer program that can run on the processor 703;
处理器703,用于在运行所述计算机程序时,执行:The processor 703 is configured to, when running the computer program, execute:
接收用户输入的语音数据;Receive voice data input by the user;
根据当前的图形界面,确定至少一个图像控件;Determine at least one image control according to the current graphical interface;
对至少一个图像控件进行图像内容理解,得到至少一个图像控件对应的图像描述文本信息;Perform image content understanding on at least one image control, and obtain image description text information corresponding to at least one image control;
根据语音数据与至少一个图像控件对应的图像描述文本信息进行图像控件识别,在至少一个图像控件中确定目标图像控件;Perform image control recognition according to the voice data and the image description text information corresponding to the at least one image control, and determine a target image control among the at least one image control;
根据语音数据确定操作指令,向目标图像控件发送操作指令,以实现对目标图像控件的语音控制。Determine the operation instruction according to the voice data, and send the operation instruction to the target image control, so as to realize the voice control of the target image control.
可以理解,本申请实施例中的存储器702可以是易失性存储器或非易失性存储器,或可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(Read-Only Memory,ROM)、可编程只读存储器(Programmable ROM,PROM)、可擦除可编程只读存储器(Erasable PROM,EPROM)、电可擦除可编程只读存储器(Electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(Random Access Memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(Static RAM,SRAM)、动态随机存取存储器(Dynamic RAM,DRAM)、同步动态随机存取存储器(Synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(Double Data Rate SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(Enhanced SDRAM,ESDRAM)、同步链动态随机存取存储器(Synchronous link DRAM,SLDRAM)和直接内存总线随机存取存储器(Direct Rambus RAM,DR RAM)。本文描述的系统和方法的存储器702旨在包括但不限于这些和任意其它适合类型的存储器。It can be understood that the memory 702 in this embodiment of the present application may be a volatile memory or a non-volatile memory, or may include both. The non-volatile memory may be a read-only memory (Read-Only Memory, ROM), a programmable read-only memory (Programmable ROM, PROM), an erasable programmable read-only memory (Erasable PROM, EPROM), an electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), or a flash memory. The volatile memory may be a random access memory (Random Access Memory, RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (Static RAM, SRAM), dynamic random access memory (Dynamic RAM, DRAM), synchronous dynamic random access memory (Synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (Enhanced SDRAM, ESDRAM), synchronous link dynamic random access memory (Synchronous link DRAM, SLDRAM), and direct Rambus random access memory (Direct Rambus RAM, DR RAM). The memory 702 of the systems and methods described herein is intended to include, but is not limited to, these and any other suitable types of memory.
而处理器703可能是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法的各步骤可以通过处理器703中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器703可以是通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器702,处理器703读取存储器702中的信息,结合其硬件完成上述方法的步骤。The processor 703 may be an integrated circuit chip with signal processing capability. During implementation, each step of the above method may be completed by an integrated logic circuit of hardware in the processor 703 or by instructions in the form of software. The processor 703 may be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of the present application may be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium that is mature in the field, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 702; the processor 703 reads the information in the memory 702 and completes the steps of the above method in combination with its hardware.
可以理解的是,本文描述的这些实施例可以用硬件、软件、固件、中间件、微码或其组合来实现。对于硬件实现,处理单元可以实现在一个或多个专用集成电路(Application Specific Integrated Circuits,ASIC)、数字信号处理器(Digital Signal Processor,DSP)、数字信号处理设备(DSP Device,DSPD)、可编程逻辑设备(Programmable Logic Device,PLD)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)、通用处理器、控制器、微控制器、微处理器、用于执行本申请所述功能的其它电子单元或其组合中。It can be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing unit may be implemented in one or more application specific integrated circuits (Application Specific Integrated Circuits, ASIC), digital signal processors (Digital Signal Processor, DSP), digital signal processing devices (DSP Device, DSPD), programmable logic devices (Programmable Logic Device, PLD), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), general-purpose processors, controllers, microcontrollers, microprocessors, other electronic units for performing the functions described in this application, or a combination thereof.
对于软件实现,可通过执行本文所述功能的模块(例如过程、函数等)来实现本文所述的技术。软件代码可存储在存储器中并通过处理器执行。存储器可以在处理器中或在处理器外部实现。For a software implementation, the techniques described herein can be implemented through modules (eg, procedures, functions, and so on) that perform the functions described herein. Software codes can be stored in memory and executed by a processor. Memory can be implemented within the processor or external to the processor.
可选地,作为另一个实施例,处理器703还配置为在运行所述计算机程序时,执行前述实施例中任一项所述的方法的步骤。Optionally, as another embodiment, the processor 703 is further configured to execute the steps of the method described in any one of the foregoing embodiments when running the computer program.
基于上述语音控制装置60的组成以及计算机存储介质,参见图9,其示出了本申请实施例提供的另一种电子设备的组成结构示意图。如图9所示,电子设备70可以包括前述实施例中任一项所述的语音控制装置60。Based on the above composition of the voice control apparatus 60 and the computer storage medium, refer to FIG. 9 , which shows a schematic diagram of the composition and structure of another electronic device provided by the embodiment of the present application. As shown in FIG. 9 , an electronic device 70 may include the voice control device 60 described in any one of the foregoing embodiments.
在本申请实施例中,对于电子设备70而言,基于图像内容理解的语音交互方式,无需进行应用与语音交互的适配,不仅可以节省开发成本,而且方便用户描述,能够有效提升用户使用语音操控时的便捷性,从而更好地实现语音交互和控制的目的。In the embodiments of the present application, for the electronic device 70, the voice interaction method based on image content understanding requires no adaptation between applications and voice interaction. This not only saves development costs but also makes it easy for users to describe what they want, effectively improving the convenience of voice control and thereby better achieving the purpose of voice interaction and control.
需要说明的是,在本申请中,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者装置不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者装置所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括该要素的过程、方法、物品或者装置中还存在另外的相同要素。It should be noted that, in this application, the terms "comprise", "include", or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus comprising a set of elements includes not only those elements but also other elements not expressly listed, or elements inherent in such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not preclude the presence of additional identical elements in the process, method, article, or apparatus comprising that element.
上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。The serial numbers of the above embodiments of the present application are for description only, and do not represent the advantages and disadvantages of the embodiments.
本申请所提供的几个方法实施例中所揭露的方法,在不冲突的情况下可以任意组合,得到新的方法实施例。The methods disclosed in several method embodiments provided in this application can be combined arbitrarily to obtain new method embodiments under the condition of no conflict.
本申请所提供的几个产品实施例中所揭露的特征,在不冲突的情况下可以任意组合,得到新的产品实施例。The features disclosed in several product embodiments provided in this application can be combined arbitrarily without conflict to obtain new product embodiments.
本申请所提供的几个方法或设备实施例中所揭露的特征,在不冲突的情况下可以任意组合,得到新的方法实施例或设备实施例。The features disclosed in several method or device embodiments provided in this application can be combined arbitrarily without conflict to obtain new method embodiments or device embodiments.
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。The above is only a specific implementation of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
工业实用性Industrial Applicability
本申请实施例中,接收用户输入的语音数据;根据当前的图形界面,确定至少一个图像控件;对至少一个图像控件进行图像内容理解,得到至少一个图像控件对应的图像描述文本信息;根据语音数据与至少一个图像控件对应的图像描述文本信息进行图像控件识别,在至少一个图像控件中确定目标图像控件;根据语音数据确定操作指令,向目标图像控件发送操作指令,以实现对目标图像控件的语音控制。这样,基于图像内容理解的语音交互方式,无需进行应用与语音交互的适配,不仅可以节省开发成本,而且方便用户描述,能够有效提升用户使用语音操控时的便捷性,从而更好地实现语音交互和控制的目的。In the embodiments of the present application, voice data input by the user is received; at least one image control is determined according to the current graphical interface; image content understanding is performed on the at least one image control to obtain image description text information corresponding to the at least one image control; image control recognition is performed according to the voice data and the image description text information corresponding to the at least one image control, and a target image control is determined among the at least one image control; an operation instruction is determined according to the voice data and sent to the target image control, so as to implement voice control of the target image control. In this way, the voice interaction method based on image content understanding requires no adaptation between applications and voice interaction, which not only saves development costs but also makes it easy for users to describe what they want, effectively improving the convenience of voice control and thereby better achieving the purpose of voice interaction and control.

Claims (20)

  1. 一种语音控制方法,所述方法包括:A voice control method, the method comprising:
    接收用户输入的语音数据;Receive voice data input by the user;
    根据当前的图形界面,确定至少一个图像控件;Determine at least one image control according to the current graphical interface;
    对所述至少一个图像控件进行图像内容理解,得到所述至少一个图像控件对应的图像描述文本信息;Perform image content understanding on the at least one image control to obtain image description text information corresponding to the at least one image control;
    根据所述语音数据与所述至少一个图像控件对应的图像描述文本信息进行图像控件识别,在所述至少一个图像控件中确定目标图像控件;Perform image control recognition according to the voice data and the image description text information corresponding to the at least one image control, and determine a target image control among the at least one image control;
    根据所述语音数据确定操作指令,向所述目标图像控件发送所述操作指令,以实现对所述目标图像控件的语音控制。An operation instruction is determined according to the voice data, and the operation instruction is sent to the target image control, so as to implement voice control on the target image control.
  2. 根据权利要求1所述的方法,其中,所述根据当前的图形界面,确定至少一个图像控件,包括:The method according to claim 1, wherein said determining at least one image control according to the current graphical interface comprises:
    确定当前的图形界面对应的图形界面元素信息;Determine the graphical interface element information corresponding to the current graphical interface;
    根据所述图形界面元素信息,确定所述至少一个图像控件。The at least one image control is determined according to the graphical interface element information.
  3. 根据权利要求2所述的方法,其中,所述确定当前的图形界面对应的图形界面元素信息,包括:The method according to claim 2, wherein said determining the graphical interface element information corresponding to the current graphical interface comprises:
    调用系统底层代码信息,获取所述图形界面元素信息;或者,Invoking the underlying code information of the system to obtain the graphical interface element information; or,
    调用系统辅助服务功能接口,获取所述图形界面元素信息。Call the system auxiliary service function interface to obtain the graphic interface element information.
  4. 根据权利要求2所述的方法,其中,所述根据所述图形界面元素信息,确定所述至少一个图像控件,包括:The method according to claim 2, wherein said determining said at least one image control according to said graphical interface element information comprises:
    从所述图形界面元素信息中查询类属性后缀为预设类型的控件,组成候选控件集合;Querying controls whose class attribute suffix is a preset type from the graphical interface element information to form a set of candidate controls;
    对所述候选控件集合中的控件进行尺寸筛选,得到所述至少一个图像控件。Perform size screening on the controls in the set of candidate controls to obtain the at least one image control.
  5. 根据权利要求4所述的方法,其中,所述预设类型包括下述至少之一:ImageView、FrameLayout、LinearLayout、RelativeLayout和View。The method according to claim 4, wherein the preset type includes at least one of the following: ImageView, FrameLayout, LinearLayout, RelativeLayout and View.
  6. 根据权利要求4所述的方法,其中,所述对所述候选控件集合中的控件进行尺寸筛选,得到所述至少一个图像控件,包括:The method according to claim 4, wherein the size screening of the controls in the set of candidate controls to obtain the at least one image control comprises:
    在所述候选控件集合中,判断所述控件的长度和宽度是否满足预设尺寸条件,将长度和宽度满足所述预设尺寸条件的所述控件确定为所述图像控件。In the set of candidate controls, it is judged whether the length and width of the controls meet a preset size condition, and the control whose length and width meet the preset size condition is determined as the image control.
  7. 根据权利要求1所述的方法,其中,所述根据当前的图形界面,确定至少一个图像控件,包括:The method according to claim 1, wherein said determining at least one image control according to the current graphical interface comprises:
    对当前的图形界面进行截图,得到待识别图像;Take a screenshot of the current graphical interface to obtain the image to be recognized;
    对所述待识别图像进行控件检测,将检测得到的若干个控件组成候选控件集合;Performing control detection on the image to be recognized, and forming a candidate control set from several detected controls;
    对所述候选控件集合中的控件进行尺寸筛选,得到所述至少一个图像控件。Perform size screening on the controls in the set of candidate controls to obtain the at least one image control.
  8. 根据权利要求1至7任一项所述的方法,其中,所述根据所述语音数据与所述至少一个图像控件对应的图像描述文本信息进行图像控件识别,在所述至少一个图像控件中确定目标图像控件,包括:The method according to any one of claims 1 to 7, wherein performing image control recognition according to the voice data and the image description text information corresponding to the at least one image control, and determining the target image control among the at least one image control, comprises:
    对所述语音数据进行文本转换,得到语音文本信息;Perform text conversion on the voice data to obtain voice text information;
    将所述语音文本信息与所述至少一个图像控件对应的图像描述文本信息进行语义匹配,确定所述至少一个图像控件对应的语义相似度值;Semantically matching the voice text information with the image description text information corresponding to the at least one image control, and determining the semantic similarity value corresponding to the at least one image control;
    根据所述语义相似度值,确定所述目标图像控件。The target image control is determined according to the semantic similarity value.
  9. 根据权利要求8所述的方法,其中,所述根据所述语义相似度值,确定所述目标图像控件,包括:The method according to claim 8, wherein said determining said target image control according to said semantic similarity value comprises:
    确定所述语义相似度值中的最大相似度值对应的图像控件为所述目标图像控件。It is determined that the image control corresponding to the maximum similarity value among the semantic similarity values is the target image control.
  10. 根据权利要求1所述的方法,其中,在所述接收用户输入的语音数据之后,所述方法还包括:The method according to claim 1, wherein, after receiving the voice data input by the user, the method further comprises:
    对当前的图形界面进行截图,得到待识别图像;Take a screenshot of the current graphical interface to obtain the image to be recognized;
    对所述语音数据进行文本转换,得到语音文本信息;Perform text conversion on the voice data to obtain voice text information;
    根据所述语音文本信息对所述待识别图像进行目标检测,确定目标图像控件;performing target detection on the image to be recognized according to the voice text information, and determining a target image control;
    根据所述语音数据确定操作指令,向所述目标图像控件发送所述操作指令,以实现对所述目标图像控件的语音控制。An operation instruction is determined according to the voice data, and the operation instruction is sent to the target image control, so as to implement voice control on the target image control.
  11. A voice control apparatus, comprising a receiving unit, a determining unit, an analysis unit, and a sending unit; wherein
    the receiving unit is configured to receive voice data input by a user;
    the determining unit is configured to determine at least one image control according to a current graphical interface;
    the analysis unit is configured to perform image content understanding on the at least one image control to obtain image description text information corresponding to the at least one image control; and is further configured to perform image control recognition according to the voice data and the image description text information corresponding to the at least one image control, and determine a target image control among the at least one image control;
    the determining unit is further configured to determine an operation instruction according to the voice data; and
    the sending unit is configured to send the operation instruction to the target image control, so as to implement voice control of the target image control.
  12. The voice control apparatus according to claim 11, wherein the determining unit is further configured to determine graphical interface element information corresponding to the current graphical interface; and determine the at least one image control according to the graphical interface element information.
  13. The voice control apparatus according to claim 12, wherein the voice control apparatus further comprises a calling unit configured to call underlying system code information to obtain the graphical interface element information; or to call a system accessibility service function interface to obtain the graphical interface element information.
  14. The voice control apparatus according to claim 12, wherein the analysis unit is further configured to query, from the graphical interface element information, controls whose class attribute suffix is of a preset type, to form a candidate control set; and perform size screening on the controls in the candidate control set to obtain the at least one image control.
  15. The voice control apparatus according to claim 14, wherein the determining unit is further configured to judge, in the candidate control set, whether the length and width of each control satisfy a preset size condition, and determine a control whose length and width satisfy the preset size condition as the image control.
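The candidate selection and size screening of claims 14 and 15 can be sketched as two filters. The concrete values are assumptions: the claims only say "preset type" and "preset size condition", so the suffixes `ImageView` and `ImageButton`, the pixel bounds, and the names `UiElement`, `candidate_controls`, and `size_filter` are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class UiElement:
    class_name: str  # class attribute taken from the graphical interface element information
    width: int       # pixels
    height: int      # pixels


# Assumed preset types and size bounds; the claims leave both unspecified.
PRESET_SUFFIXES: Tuple[str, ...] = ("ImageView", "ImageButton")
MIN_SIDE, MAX_SIDE = 24, 600


def candidate_controls(elements: List[UiElement]) -> List[UiElement]:
    """Keep elements whose class attribute ends with a preset type (claim 14)."""
    return [e for e in elements if e.class_name.endswith(PRESET_SUFFIXES)]


def size_filter(candidates: List[UiElement]) -> List[UiElement]:
    """Keep candidates whose width and height both satisfy the size condition (claim 15)."""
    return [
        c for c in candidates
        if MIN_SIDE <= c.width <= MAX_SIDE and MIN_SIDE <= c.height <= MAX_SIDE
    ]


elements = [
    UiElement("android.widget.ImageView", 96, 96),
    UiElement("android.widget.TextView", 200, 40),
    UiElement("android.widget.ImageView", 1080, 1920),  # likely a full-screen background
    UiElement("android.widget.ImageButton", 48, 48),
]
image_controls = size_filter(candidate_controls(elements))
print([e.class_name for e in image_controls])
```

One plausible reason for the size screening is to exclude very small decorative images and full-screen backgrounds, which are unlikely to be controls a user would address by voice.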
  16. The voice control apparatus according to claim 11, wherein the voice control apparatus further comprises a detection unit configured to take a screenshot of the current graphical interface to obtain an image to be recognized; perform control detection on the image to be recognized, and form a candidate control set from the detected controls; and perform size screening on the controls in the candidate control set to obtain the at least one image control.
  17. The voice control apparatus according to any one of claims 11 to 16, wherein the analysis unit is further configured to perform text conversion on the voice data to obtain voice text information; perform semantic matching between the voice text information and the image description text information corresponding to the at least one image control, to determine a semantic similarity value corresponding to the at least one image control; and determine the target image control according to the semantic similarity value.
  18. The voice control apparatus according to claim 16, wherein the detection unit is configured to take a screenshot of the current graphical interface to obtain an image to be recognized; perform text conversion on the voice data to obtain voice text information; perform target detection on the image to be recognized according to the voice text information, to determine a target image control; and determine an operation instruction according to the voice data and send the operation instruction to the target image control, so as to implement voice control of the target image control.
  19. An electronic device, comprising a memory and a processor; wherein
    the memory is configured to store a computer program capable of running on the processor; and
    the processor is configured to execute the method according to any one of claims 1 to 10 when running the computer program.
  20. A computer storage medium, wherein the computer storage medium stores a computer program which, when executed by at least one processor, implements the method according to any one of claims 1 to 10.
PCT/CN2022/122020 2021-11-19 2022-09-28 Voice control method, apparatus, device, and computer storage medium WO2023087934A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111398660.6A CN114067797A (en) 2021-11-19 2021-11-19 Voice control method, device, equipment and computer storage medium
CN202111398660.6 2021-11-19

Publications (1)

Publication Number Publication Date
WO2023087934A1 (en) 2023-05-25

Family

ID=80275779

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/122020 WO2023087934A1 (en) 2021-11-19 2022-09-28 Voice control method, apparatus, device, and computer storage medium

Country Status (2)

Country Link
CN (1) CN114067797A (en)
WO (1) WO2023087934A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117078358A (en) * 2023-10-13 2023-11-17 北京未来链技术有限公司 Intelligent construction method and system for meta-space electronic commerce platform system based on voice recognition

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114067797A (en) * 2021-11-19 2022-02-18 杭州逗酷软件科技有限公司 Voice control method, device, equipment and computer storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109471678A (en) * 2018-11-07 2019-03-15 苏州思必驰信息科技有限公司 Voice midpoint controlling method and device based on image recognition
US20190179608A1 (en) * 2017-12-08 2019-06-13 Google Llc Graphical user interace rendering management by voice-driven computing infrastructure
US20200042286A1 (en) * 2018-08-01 2020-02-06 Adobe Inc. Collecting Multimodal Image Editing Requests
CN111309283A (en) * 2020-03-25 2020-06-19 北京百度网讯科技有限公司 Voice control method and device for user interface, electronic equipment and storage medium
CN114005445A (en) * 2020-06-28 2022-02-01 广州小鹏汽车科技有限公司 Information processing method, server, and computer-readable storage medium
CN114049892A (en) * 2021-11-12 2022-02-15 杭州逗酷软件科技有限公司 Voice control method and device and electronic equipment
CN114067797A (en) * 2021-11-19 2022-02-18 杭州逗酷软件科技有限公司 Voice control method, device, equipment and computer storage medium

Also Published As

Publication number Publication date
CN114067797A (en) 2022-02-18

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22894472

Country of ref document: EP

Kind code of ref document: A1