WO2023087934A1 - Voice control method, apparatus, device, and computer storage medium

Info

Publication number: WO2023087934A1
Application number: PCT/CN2022/122020
Authority: WO
Inventors: 陈明; 冉茂松; 张晓帆
Original assignee: 杭州逗酷软件科技有限公司
Priority date: 2021-11-19
Filing date: 2022-09-28
Publication date: 2023-05-25
Also published as: CN114067797A

Abstract

A voice control method, an apparatus (60), an electronic device (70), and a computer storage medium. The method comprises: receiving voice data input by a user (S201); determining at least one image control according to a current graphical interface (S202); performing image content understanding on the at least one image control, to obtain image description text information corresponding to the at least one image control (S203); performing image control recognition according to the voice data and the image description text information corresponding to the at least one image control, and determining a target image control of the at least one image control (S204); determining an operation instruction according to the voice data, and sending the operation instruction to the target image control so as to realize voice control of the target image control (S205). In this way, a voice interaction mode based on image content understanding can not only reduce development costs, but also improve convenience for a user during voice control, so that the purpose of voice interaction and control is better achieved.

Description

A voice control method, device, equipment and computer storage medium

Cross References to Related Applications

This application claims priority to Chinese patent application No. 202111398660.6, filed with the China Patent Office on November 19, 2021 and entitled "A Voice Control Method, Device, Equipment, and Computer Storage Medium", the entire content of which is incorporated herein by reference.

Technical Field

The present application relates to the technical field of voice interaction, and in particular to a voice control method, device, equipment and computer storage medium.

Background

In recent years, with the rapid development of hardware and electronic products, voice-based human-computer interaction has become more mature and more common, and is increasingly accepted and used. As voice interaction gradually penetrates people's daily lives, the demand for controlling a Graphical User Interface (GUI) by voice keeps growing.

In related technologies, several solutions already address graphical interface elements that lack text descriptions or whose text descriptions are inconvenient for users to say directly, such as icon recognition, text recognition, spatial orientation reference, and numeric index reference. However, these solutions all have limitations, especially for picture-type controls: because users cannot, or cannot conveniently, describe such controls through text descriptions, voice control is difficult to implement and usability is poor.

Summary of the Invention

The technical solution of the present application is implemented as follows:

In a first aspect, an embodiment of the present application provides a voice control method, the method including:

Receive voice data input by a user;

Determine at least one image control according to the current graphical interface;

Perform image content understanding on the at least one image control to obtain image description text information corresponding to the at least one image control;

Perform image control recognition according to the voice data and the image description text information corresponding to the at least one image control, and determine a target image control among the at least one image control;

Determine an operation instruction according to the voice data, and send the operation instruction to the target image control, so as to implement voice control of the target image control.

In a second aspect, an embodiment of the present application provides a voice control device, the voice control device including a receiving unit, a determining unit, an analysis unit, and a sending unit; wherein,

The receiving unit is configured to receive voice data input by a user;

The determining unit is configured to determine at least one image control according to the current graphical interface;

The analysis unit is configured to perform image content understanding on the at least one image control to obtain image description text information corresponding to the at least one image control; and is also configured to perform image control recognition according to the voice data and the image description text information corresponding to the at least one image control, and determine a target image control among the at least one image control;

The determining unit is further configured to determine an operation instruction according to the voice data;

The sending unit is configured to send the operation instruction to the target image control, so as to implement voice control of the target image control.

In a third aspect, an embodiment of the present application provides an electronic device, the electronic device including a memory and a processor; wherein,

The memory is configured to store a computer program capable of running on the processor;

The processor is configured to execute the method described in the first aspect when running the computer program.

In a fourth aspect, an embodiment of the present application provides a computer storage medium, where a computer program is stored in the computer storage medium, and when the computer program is executed by at least one processor, the method described in the first aspect is implemented.

Description of the Drawings

FIG. 1 is a schematic diagram of a graphical interface element grid index;

FIG. 2 is a schematic flowchart of a voice control method provided in an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a hierarchical tree provided by an embodiment of the present application;

FIG. 4 is a detailed flowchart of a voice control method provided in an embodiment of the present application;

FIG. 5 is a schematic flowchart of another voice control method provided in the embodiment of the present application;

FIG. 6 is a schematic flowchart of another voice control method provided in the embodiment of the present application;

FIG. 7 is a schematic structural diagram of a voice control device provided by an embodiment of the present application;

FIG. 8 is a schematic diagram of the composition and structure of an electronic device provided by an embodiment of the present application;

FIG. 9 is a schematic diagram of the composition and structure of another electronic device provided by the embodiment of the present application.

Detailed Description

In a first aspect, an embodiment of the present application provides a voice control method, the method including:

Receive voice data input by a user;

Determine at least one image control according to the current graphical interface;

Perform image content understanding on the at least one image control to obtain image description text information corresponding to the at least one image control;

Perform image control recognition according to the voice data and the image description text information corresponding to the at least one image control, and determine a target image control among the at least one image control;

Determine an operation instruction according to the voice data, and send the operation instruction to the target image control, so as to implement voice control of the target image control.

In some embodiments, the determining at least one image control according to the current graphical interface includes:

Determine the graphical interface element information corresponding to the current graphical interface;

The at least one image control is determined according to the graphical interface element information.

In some embodiments, the determining the graphical interface element information corresponding to the current graphical interface includes:

Call the underlying code information of the system to obtain the graphical interface element information; or,

Call the system auxiliary service function interface to obtain the graphical interface element information.

In some embodiments, the determining the at least one image control according to the graphical interface element information includes:

Querying controls whose class attribute suffix is a preset type from the graphical interface element information to form a set of candidate controls;

Perform size screening on the controls in the set of candidate controls to obtain the at least one image control.

In some embodiments, the preset type includes at least one of the following: ImageView, FrameLayout, LinearLayout, RelativeLayout and View.

In some embodiments, the size screening of the controls in the set of candidate controls to obtain the at least one image control includes:

In the set of candidate controls, judge whether the length and width of each control meet the preset size condition, and determine the controls whose length and width meet the preset size condition as image controls.

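In some embodiments, the determining at least one image control according to the current graphical interface includes:

Take a screenshot of the current graphical interface to obtain an image to be recognized;

Perform control detection on the image to be recognized, and form a set of candidate controls from the detected controls;

Perform size screening on the controls in the set of candidate controls to obtain the at least one image control.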

In some embodiments, the performing image control recognition according to the voice data and the image description text information corresponding to the at least one image control, and determining the target image control among the at least one image control includes:

Perform text conversion on the voice data to obtain voice text information;

Semantically match the voice text information with the image description text information corresponding to the at least one image control, and determine the semantic similarity value corresponding to the at least one image control;

Determine the target image control according to the semantic similarity value.

In some embodiments, the determining the target image control according to the semantic similarity value includes:

Determine the image control with the largest semantic similarity value as the target image control.
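Exemplarily, the semantic matching and maximum-similarity selection described above can be sketched as follows. This is only a minimal sketch under stated assumptions: the text does not specify how the semantic similarity value is computed, so a simple token-overlap (Jaccard) score is used here as a stand-in for a real semantic model, and the names `tokenize`, `jaccard`, and `pick_target` as well as the control identifiers are hypothetical.

```python
import re
from typing import Dict, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lower-case the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-overlap score in [0, 1]; a stand-in for a real semantic similarity model."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def pick_target(voice_text: str, descriptions: Dict[str, str]) -> Tuple[str, Dict[str, float]]:
    """Score every image description against the voice text and return the best control id.

    `descriptions` maps a (hypothetical) control id to its image description text.
    """
    if not descriptions:
        raise ValueError("no image controls to match against")
    scores = {cid: jaccard(voice_text, desc) for cid, desc in descriptions.items()}
    best = max(scores, key=scores.get)  # control with the maximum similarity value
    return best, scores
```

With two descriptions such as "a dog running on the beach" and "a plate of noodles on a table", the utterance "open the picture of the dog on the beach" shares more tokens with the first description, so the first control would be selected. A real system would likely replace `jaccard` with a sentence-level semantic model, which is outside the scope of this sketch.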

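In some embodiments, after receiving the voice data input by the user, the method further includes:

Take a screenshot of the current graphical interface to obtain an image to be recognized;

Perform text conversion on the voice data to obtain voice text information;

Perform target detection on the image to be recognized according to the voice text information, and determine a target image control;

Determine an operation instruction according to the voice data, and send the operation instruction to the target image control, so as to implement voice control of the target image control.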

In a second aspect, an embodiment of the present application provides a voice control device, the voice control device including a receiving unit, a determining unit, an analysis unit, and a sending unit; wherein,

The receiving unit is configured to receive voice data input by a user;

The determining unit is configured to determine at least one image control according to the current graphical interface;

The analysis unit is configured to perform image content understanding on the at least one image control to obtain image description text information corresponding to the at least one image control; and is also configured to perform image control recognition according to the voice data and the image description text information corresponding to the at least one image control, and determine a target image control among the at least one image control;

The determining unit is further configured to determine an operation instruction according to the voice data;

The sending unit is configured to send the operation instruction to the target image control, so as to implement voice control of the target image control.

In some embodiments, the determining unit is further configured to determine graphical interface element information corresponding to the current graphical interface; and determine the at least one image control according to the graphical interface element information.

In some embodiments, the voice control device further includes a calling unit configured to call the underlying code information of the system to obtain the graphical interface element information, or to call the system auxiliary service function interface to obtain the graphical interface element information.

In some embodiments, the analysis unit is further configured to query, from the graphical interface element information, controls whose class attribute suffix is a preset type to form a set of candidate controls; and to perform size screening on the controls in the set of candidate controls to obtain the at least one image control.

In some embodiments, the determining unit is further configured to judge, in the set of candidate controls, whether the length and width of each control meet a preset size condition, and to determine the controls whose length and width meet the preset size condition as image controls.

In some embodiments, the voice control device further includes a detection unit configured to take a screenshot of the current graphical interface to obtain an image to be recognized; perform control detection on the image to be recognized and form a set of candidate controls from the detected controls; and perform size screening on the controls in the set of candidate controls to obtain the at least one image control.

In some embodiments, the analysis unit is further configured to perform text conversion on the voice data to obtain voice text information; semantically match the voice text information with the image description text information corresponding to the at least one image control to determine a semantic similarity value corresponding to the at least one image control; and determine the target image control according to the semantic similarity value.

In some embodiments, the detection unit is configured to take a screenshot of the current graphical interface to obtain an image to be recognized; perform text conversion on the voice data to obtain voice text information; perform target detection on the image to be recognized according to the voice text information to determine a target image control; and determine an operation instruction according to the voice data and send the operation instruction to the target image control, so as to implement voice control of the target image control.

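In a third aspect, an embodiment of the present application provides an electronic device, the electronic device including a memory and a processor; wherein,

The memory is configured to store a computer program capable of running on the processor;

The processor is configured to execute the method according to any implementation of the first aspect when running the computer program.

In a fourth aspect, an embodiment of the present application provides a computer storage medium, where a computer program is stored in the computer storage medium, and when the computer program is executed by at least one processor, the method according to any implementation of the first aspect is implemented.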

In order to understand the characteristics and technical contents of the embodiments of the present application in more detail, the implementation of the embodiments of the present application will be described in detail below in conjunction with the accompanying drawings. The attached drawings are only for reference and description, and are not intended to limit the embodiments of the present application.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the technical field to which this application belongs. The terms used herein are only for the purpose of describing the embodiments of the present application, and are not intended to limit the present application.

In the following description, references to "some embodiments" describe a subset of all possible embodiments; it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other where no conflict arises. It should also be pointed out that the terms "first/second/third" used in the embodiments of the present application only distinguish similar objects and do not represent a specific ordering of the objects. Understandably, the specific order or sequence of "first/second/third" may be interchanged where permitted, so that the embodiments of the present application described herein can be implemented in an order other than that illustrated or described herein.

Thanks to the rapid development of hardware equipment and electronic products in recent years, voice-based human-computer interaction methods are becoming more and more mature, more common, and more and more accepted and used by people. As voice interaction gradually penetrates into all aspects of people's lives, the demand for controlling Graphical User Interface (GUI) through voice is becoming more and more intense.

At present, the main problem of the Voice and Graphical User Interface (VGUI) is application adaptation, because the vast majority of applications were not designed and developed with voice interaction in mind. Taking smartphones as an example, current mobile applications are mainly designed for touch-screen interaction, so the vast majority of them have not been adapted to voice interaction. As a result, many problems are encountered when using voice to interact with and control an application's graphical interface on a mobile phone. For example, interface elements may lack text descriptions, or they may have text descriptions that users cannot conveniently use to refer to them (the text description is too long, contains symbols or pictures, is unclear, or is the same as or similar to that of several other interface elements). In these cases, users cannot directly refer to the control they want to interact with through its text description.

In related technologies, the current solutions for this situation mainly include the following:

(1) Icon recognition: a model is used to detect and identify commonly used, unambiguous icon controls so as to obtain description text (common terms or names). Users can then describe icon controls with common-sense terms or customary names, such as "play", "pause", "previous song" and "next song" for audio and video playback control buttons, so as to describe the target control and achieve the purpose of interaction;

(2) Text recognition: a model is used to recognize text that may be contained in image controls, such as pictures and icons, that have no text description, and the recognized text is used as the text description of the control and matched against the user's interaction instruction, so as to locate the target control and achieve the purpose of interaction;

(3) Spatial orientation reference: for example, "the button to the right of the download button" or "the icon below the like button". The target control is referred to through its spatial relationship to another control that can be described, so as to describe the target control and achieve the purpose of interaction;

(4) Numeric index reference: for example, "the first button". All controls are numbered and a control is then referred to by its number, so as to describe the target control and achieve the purpose of interaction. The control numbers are generally not displayed, so the user needs to work out the number of the control by himself;

(5) Superimposed text prompts: the text description of each interactive control is superimposed on the GUI, and the user refers to a control through the text description superimposed on it, so as to achieve the purpose of interaction;

(6) Superimposed numeric indexes: the numeric index of each interactive control is superimposed on the GUI, and the user refers to a control through the number displayed on it, so as to achieve the purpose of interaction;

(7) Superimposed grid and numbering: a grid is superimposed on the GUI in full screen, and each grid cell is numbered. The user refers to a control by the number of the grid cell in which the control is located, so as to describe the target control and achieve the purpose of interaction.

However, for graphical interface elements that lack text descriptions or whose text descriptions are inconvenient for users to say directly, each of the above solutions has limitations and none applies to all situations:

(1) Icon recognition applies only to commonly used, unambiguous control icons. It cannot handle other types of icons or non-icon content, so its scope of application is limited.

(2) Text recognition applies only when the image contains text, so its applicable situations are limited. It also requires considerable computing resources, so processing latency is generally high, the cost of use is high, and accuracy is limited.

(3) Spatial orientation reference requires a control that can be located through its text description to serve as the reference. In many cases no such control can be found, so the scope of application is relatively limited.

(4) Numeric index reference requires the program to number the controls, and the user then refers to a control by its number. The numbers themselves are not displayed in the interface, so in actual use the user's way of counting does not necessarily match the program's. In addition, an interface may contain dozens of interactive objects, which makes it very difficult for users to number the controls one by one.

(5) Superimposed text prompts must first be generated, and generating them depends on the controls' text descriptions, so the prompts can have the same problems as those descriptions. If the superimposed content is too large, it covers the original content; if it is too small, the user cannot see it clearly. Because an interface may contain dozens of interactive objects, the interface ends up covered with dense prompts, which greatly affects the user experience and the visual experience.

(6) Superimposed numeric indexes are simple to implement, but they do not help the user remember the correct interaction instructions. As with text prompts, superimposed content that is too large covers the original content, content that is too small cannot be seen clearly, and with dozens of interactive objects the interface ends up covered with dense prompts, which greatly affects the user experience and the visual experience.

(7) With a superimposed grid and numbering, the grid cells may be too large or too small: a target interactive control may span several cells, and several interactive objects may fall within one cell. In these situations the user needs to perform multiple operations before the interaction target is finally determined. In addition, the superimposed content covers the original content, which greatly affects the user experience and the visual experience.
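Limitation (7) can be made concrete with a small calculation. The following is only an illustrative sketch, not part of any described solution: the function name `grid_cells_for_control`, the row-major numbering starting from 1, and the screen and grid sizes in the usage note are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def grid_cells_for_control(
    bounds: Tuple[int, int, int, int],
    screen_w: int,
    screen_h: int,
    rows: int,
    cols: int,
) -> List[int]:
    """Return the numbers of all grid cells that a control's bounding box overlaps.

    `bounds` is (left, top, right, bottom) in pixels, with right/bottom exclusive.
    Cells are numbered row by row starting from 1 (an assumed numbering scheme).
    """
    left, top, right, bottom = bounds
    cell_w = screen_w / cols
    cell_h = screen_h / rows
    first_col = int(left // cell_w)
    last_col = min(cols - 1, int((right - 1) // cell_w))
    first_row = int(top // cell_h)
    last_row = min(rows - 1, int((bottom - 1) // cell_h))
    return [
        r * cols + c + 1
        for r in range(first_row, last_row + 1)
        for c in range(first_col, last_col + 1)
    ]
```

For instance, on an assumed 1080x1920 screen divided into 8 rows and 4 columns, a picture occupying the region (200, 200, 600, 500) overlaps nine cells, so no single grid number identifies it, whereas a small icon at (10, 10, 100, 100) lies entirely within one cell.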

Put simply, the situations in which the user cannot, or cannot conveniently, describe the target control through its text description can be roughly divided into two types: icons and pictures. Taking FIG. 1 as an example, icon-type controls are shown in the bold dashed boxes. Such controls are generally small, and their appearance and meaning are usually relatively fixed, so they can be handled through icon recognition; non-standard, uncommon, or ambiguous icons can be handled through multiple labels (such as shape, color, appearance style, and visual semantics) or through a combination of spatial orientation and numbered index. Picture-type controls are shown in the bold solid-line boxes in FIG. 1. Such controls are generally large and mainly appear in lists of images, videos, files, messages, and the like; the controls may be arranged regularly (for example, in a grid) or irregularly. However, the visual content and meaning of the pictures themselves vary greatly. A picture may or may not have a text description, and the text description may be duplicated or may contain symbols, pictures, and other content that users cannot conveniently describe directly. For this situation, the existing solutions all have limitations, which makes voice control difficult to implement and prevents them from providing a good interaction method and interaction experience.

Based on this, the embodiment of the present application provides a voice control method that: receives voice data input by a user; determines at least one image control according to the current graphical interface; performs image content understanding on the at least one image control to obtain the corresponding image description text information; performs image control recognition according to the voice data and the image description text information corresponding to the at least one image control, and determines a target image control among the at least one image control; and determines an operation instruction according to the voice data and sends the operation instruction to the target image control, so as to implement voice control of the target image control. In this way, a voice interaction method based on image content understanding does not require applications to be adapted for voice interaction. This not only saves development costs but also makes controls easier for users to describe, which effectively improves convenience during voice control and better achieves the purpose of voice interaction and control.
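Exemplarily, the overall flow of the method can be sketched as follows. This is only a minimal sketch under stated assumptions, not the implementation of the embodiments: speech-to-text conversion is assumed to have already produced `voice_text`, and the image content understanding model, the similarity measure, and the instruction parser are passed in as hypothetical callables (`describe`, `similarity`, `parse_instruction`) because the text does not fix how they are realized. The class `ImageControl` and its `control_id` field are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class ImageControl:
    control_id: str  # hypothetical identifier of a control on the current interface


def voice_control(
    voice_text: str,
    controls: Sequence[ImageControl],
    describe: Callable[[ImageControl], str],
    similarity: Callable[[str, str], float],
    parse_instruction: Callable[[str], str],
) -> Optional[Tuple[str, str]]:
    """Return (target control id, operation instruction), or None if there are no controls."""
    if not controls:
        return None
    # Image content understanding: one description text per image control.
    descriptions = {c.control_id: describe(c) for c in controls}
    # Image control recognition: pick the control whose description best matches the utterance.
    target = max(controls, key=lambda c: similarity(voice_text, descriptions[c.control_id]))
    # Operation instruction: derived from the utterance and addressed to the target control.
    return target.control_id, parse_instruction(voice_text)
```

Passing the three stages in as callables keeps the sketch independent of any particular captioning model or matching algorithm; a caller would supply its own implementations and then dispatch the returned instruction to the returned control.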

Various embodiments of the present application will be described in detail below with reference to the accompanying drawings.

In an embodiment of the present application, refer to FIG. 2, which shows a schematic flowchart of a voice control method provided in an embodiment of the present application. As shown in FIG. 2, the method may include:

S201: Receive voice data input by a user.

S202: Determine at least one image control according to the current graphical interface.

It should be noted that the embodiments of the present application are applied to a voice control device, or to an electronic device integrated with such a device. The electronic device may be implemented in various forms, for example a smartphone, tablet computer, notebook computer, palmtop computer, personal digital assistant (PDA), portable media player (PMP), navigation device, wearable device, or voice assistant; the embodiments of the present application impose no limitation in this respect.

It should also be noted that an operation instruction can be determined according to the voice data. After receiving the voice data input by the user, the electronic device can therefore perform the corresponding operation on the current graphical interface according to the operation instruction determined from the voice data. However, taking the image controls in FIG. 1 as an example, the user cannot, or cannot conveniently, refer to them directly through a text description. The image controls in the current graphical interface therefore need to be found first, so that, based on image content understanding, the voice data input by the user and the content of the graphical interface can both be understood, the user's interaction intention can be matched with the interaction target, and the purpose of voice interaction can be achieved.

In some embodiments, for S202, the determining at least one image control according to the current graphical interface may include:

Determine the graphical interface element information corresponding to the current graphical interface;

Determine at least one image control according to the graphical interface element information.

It should be noted that, in the embodiment of the present application, the graphical interface element information may be represented by a hierarchical tree, which may also be called a view tree (View Tree). Exemplarily, the GUI element information in the electronic device may be a hierarchical structure tree as shown in FIG. 3. Each node in the View Tree represents an element or control (Control/Widget/Element) in the GUI, and the related attributes of the element can include its text description, interactive attributes (whether it can be clicked, whether text can be entered, whether it can be scrolled, and so on), the position of the control, and so on.

In addition, for a node in the View Tree, Table 1 shows some related attribute information of the node, which can include the index number, text description, interactive attributes (whether it can be clicked, whether text can be entered, whether it can be scrolled, and so on), control position, and so on, as shown below.

Table 1

In the View Tree, the elements directly visible to the user are mainly the leaf nodes (Leaf Elements) in the View Tree. Other non-leaf nodes are generally invisible to users, and they are mainly used as interface element containers (Containers), which are mainly used to constrain and control the position, size, arrangement, etc. of elements. At the same time, some Containers also carry the function of interacting with users (the clickable attribute is true at this time).
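Exemplarily, the structure of such a View Tree and the extraction of its leaf nodes can be sketched as follows. This is only an illustrative data structure: the class `ViewNode`, its field names, and the helper `iter_leaves` are hypothetical and do not correspond to any actual system interface.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class ViewNode:
    class_name: str                                   # e.g. "android.widget.ImageView"
    text: str = ""                                    # text description, may be empty
    clickable: bool = False                           # one of the interactive attributes
    bounds: Tuple[int, int, int, int] = (0, 0, 0, 0)  # left, top, right, bottom in pixels
    children: List["ViewNode"] = field(default_factory=list)


def iter_leaves(node: ViewNode) -> Iterator[ViewNode]:
    """Yield the leaf nodes of the tree in depth-first order.

    Non-leaf nodes act as containers and are skipped.
    """
    if not node.children:
        yield node
        return
    for child in node.children:
        yield from iter_leaves(child)
```

For a root container holding a layout with an image and a title, plus a button, `iter_leaves` would yield the image, the title, and the button, while the containers themselves are not returned.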

In a specific embodiment, the determining the graphical interface element information corresponding to the current graphical interface may include:

Call the underlying code information of the system to obtain the graphical interface element information; or,

Call the system auxiliary service function interface to obtain the graphical interface element information.

That is to say, taking an electronic device based on the Android system as an example, there are mainly two ways to obtain the graphical interface element information from the Android system. In the first, an interface for obtaining the structure and information of interface elements is provided directly through the underlying code of the system. In the second, the graphical interface element information is obtained through the system accessibility service (Accessibility Service) function interface. The former is complicated to implement, requires a large amount of development work, and requires modifying the underlying code of the system, which carries certain security risks, but it can obtain more comprehensive and accurate interface element information. The latter is simple to implement and requires little development work, but the obtained graphical interface element information may be incomplete and may contain some errors. It should be noted that the embodiments of the present application may select either way according to actual conditions, and no limitation is made here.

Further, after obtaining the graphical interface element information, all possible controls can be searched out, and then at least one required image control can be screened out. Specifically, in some embodiments, the determining at least one image control according to the graphical interface element information may include:

Query the controls whose class attribute suffix is the preset type from the graphical interface element information to form a set of candidate controls;

Size screening is performed on the controls in the candidate control set to obtain at least one image control.

It should be noted that, in this embodiment of the application, the preset type may include at least one of the following: image view (ImageView), frame layout (FrameLayout), linear layout (LinearLayout), relative layout (RelativeLayout) and view (View).

Further, in some embodiments, the size screening of the controls in the candidate control set to obtain at least one image control may include: in the candidate control set, judging whether the length and width of each control meet the preset size conditions, and determining a control whose length and width meet the preset size conditions as an image control. That is to say, when the length and width of a control meet the preset size conditions, this indicates that the control is an image control rather than an icon, button, decorative strip or other control, and the image control is then filtered out of the candidate control set.

It should also be noted that, in the embodiment of the present application, the preset size condition may be: the length and width of the control are respectively greater than a first preset value, and the ratio of the length and width of the control is smaller than a second preset value.

Exemplarily, the first preset value may be 100dp; wherein, dp=pixel/density, pixel is an absolute pixel representation, density is a pixel density per unit size, and dp is a standard size representation. The second preset value may be 3, that is, the aspect ratio of the image control needs to be less than 3.
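The size condition above can be illustrated with a minimal sketch. It assumes the example values from the text (100dp and a ratio of 3); the `Control` type, the function names, and the choice to compute the ratio as longer side over shorter side are illustrative assumptions, not part of the embodiment.

```python
from dataclasses import dataclass
from typing import List

MIN_SIDE_DP = 100.0      # first preset value (example value from the text)
MAX_ASPECT_RATIO = 3.0   # second preset value (example value from the text)


@dataclass
class Control:
    """Hypothetical container for one control's class name and pixel size."""
    class_name: str
    width_px: int
    height_px: int


def px_to_dp(pixels: float, density: float) -> float:
    """dp = pixel / density, as defined in the text."""
    return pixels / density


def is_image_control(control: Control, density: float) -> bool:
    """Apply the preset size condition to one control."""
    width_dp = px_to_dp(control.width_px, density)
    height_dp = px_to_dp(control.height_px, density)
    # Both sides must be strictly greater than the first preset value.
    if width_dp <= MIN_SIDE_DP or height_dp <= MIN_SIDE_DP:
        return False
    # Assumption: the ratio is taken as longer side over shorter side.
    ratio = max(width_dp, height_dp) / min(width_dp, height_dp)
    return ratio < MAX_ASPECT_RATIO


def screen_by_size(candidates: List[Control], density: float) -> List[Control]:
    """Keep only the candidates that satisfy the size condition."""
    return [c for c in candidates if is_image_control(c, density)]
```

With a density of 2.0, a 400x300 pixel control (200dp x 150dp) would pass, while a 150x150 pixel icon (75dp per side) or a 1000x250 pixel banner (ratio 4) would be rejected.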

That is to say, when the AccessibilityService function interface is used to obtain graphical interface element information in the Android system, correct content may not be obtainable for some controls, so the type (class attribute suffix) of some target controls may be FrameLayout, LinearLayout, RelativeLayout, View, etc., rather than ImageView. In view of this situation, controls in the leaf nodes of the View Tree whose type is ImageView or an unconventional type (FrameLayout, LinearLayout, RelativeLayout, View, etc., which do not appear in leaf nodes under normal circumstances) can be filtered out as the set of candidate controls.
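The leaf-node query described above can be sketched as follows. The `ViewNode` type is a hypothetical stand-in for a node of the View Tree, and treating the "class attribute suffix" as the last dot-separated component of the class name is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List

# Preset types listed in the text.
PRESET_SUFFIXES = ("ImageView", "FrameLayout", "LinearLayout", "RelativeLayout", "View")


@dataclass
class ViewNode:
    """Hypothetical View Tree node: a class name plus child nodes."""
    class_name: str
    children: List["ViewNode"] = field(default_factory=list)


def iter_leaves(node: ViewNode) -> Iterator[ViewNode]:
    """Yield every leaf node of the tree in depth-first order."""
    if not node.children:
        yield node
        return
    for child in node.children:
        yield from iter_leaves(child)


def class_suffix(class_name: str) -> str:
    """Assumption: the suffix is the last dot-separated component."""
    return class_name.rsplit(".", 1)[-1]


def collect_candidates(root: ViewNode) -> List[ViewNode]:
    """Leaf nodes whose class suffix is one of the preset types."""
    return [
        leaf for leaf in iter_leaves(root)
        if class_suffix(leaf.class_name) in PRESET_SUFFIXES
    ]
```

Matching the whole last component keeps a class such as `android.widget.TextView` out of the candidate set even though its name ends in "View"; whether the embodiment intends this stricter reading is not stated in the text.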

It can be understood that obtaining graphical interface element information through the AccessibilityService function interface in the Android system requires the application itself to be adapted to that interface. Some applications or custom controls are poorly adapted to the AccessibilityService function interface. As a result, when graphical interface element information is obtained through this interface, correct content may not be obtainable for some controls, leading to omissions and errors in the user interface (UI) information. In addition, the AccessibilityService function interface of the Android system cannot provide the visual appearance (image) of a control, so the embodiment of the present application also needs an interface screenshot to assist in obtaining the image information of the control.

The embodiment of the present application proposes first obtaining the current graphical interface element information (the View Tree) from the system and then filtering out the image controls from the View Tree. Considering that it may be difficult to obtain the View Tree on some platforms or systems, a method of taking a screenshot of the current graphical interface, detecting controls in it, and then filtering out image controls is also proposed here. Therefore, in some embodiments, for S202, the determining at least one image control according to the current graphical interface may include:

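Take a screenshot of the current graphical interface to obtain an image to be recognized;

Perform control detection on the image to be recognized, and form several detected controls into a candidate control set;

Perform size screening on the controls in the candidate control set to obtain at least one image control.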

It should be noted that the size screening of the controls in the candidate control set to obtain at least one image control may include: in the candidate control set, judging whether the length and width of each control meet the preset size conditions, and determining the controls whose length and width meet the preset size conditions as image controls. That is to say, among these several controls, when the length and width of a certain control meet the preset size conditions, this indicates that the control is an image control rather than an icon, button, decorative strip or other control, so that the image controls can be filtered out.

That is to say, after the image to be recognized is obtained, the controls (such as text, images, etc.) contained in the image to be recognized and the positions of those controls can be detected to obtain several controls; the image controls are then selected from these several controls by size. Exemplarily, the length and width of an image control need to be larger than a certain size (for example, 100dp), and the aspect ratio of the image control needs to be less than 3.

S203: Perform image content understanding on at least one image control to obtain image description text information corresponding to at least one image control.

S204: Perform image control identification according to the voice data and image description text information corresponding to the at least one image control, and determine a target image control in the at least one image control.

It should be noted that after screening at least one image control, the image content of these image controls can be understood, so as to obtain the image description text information about the image content of each image control. In the embodiment of the present application, the method for image content understanding may include: image classification, image detection, image description generation, image-based text reference target detection, etc., and these methods will be described in detail below:

(1) Image classification: the image content is classified and matched to text description tags such as "car", "food", or "person". The text labels obtained by image classification are generally not detailed enough, so the understanding of the image is limited. For example, for the second video in the first row of Figure 1, if the label is "car", no further detail can be obtained, such as whether it is a motorcycle, a car, a truck, or a bus; multi-level tags or multiple tags can improve this to a certain extent, such as "car/sedan" (first-level tag/second-level tag/...). In addition, image classification cannot produce several labels at the same level at the same time. For example, the first video in the third row of Figure 1 contains both people and food, but the classification model can only output one of the labels "people" or "food", not both; this can be improved to a certain extent by means of confidence values, such as "food: 0.5", "people: 0.4".

(2) Image detection: the objects contained in the image are detected by a detection model. For example, the second video in the first row of Figure 1 contains "person" and "car", and the second video in the second row contains "person" and "food". The detection model can also be cascaded or combined with a classification model to achieve finer-grained recognition of the detected objects. For example, after a "car" is detected, the type, manufacturer, model, color, etc. of the car can be further identified; after a "person" is detected, the person's gender, age, identity (face recognition, that is, who it is), emotion, etc. can be further identified. Image detection can provide more, and more detailed, information than image classification, but it cannot provide the relationships between the detected objects. The information it provides is also relatively fragmented and quite different from the user's natural language description, which makes the later matching more difficult. In addition, providing more detailed information requires a more complex model or a cascade of multiple models, so the system complexity and the cost of use are high.

(3) Image description generation: the image content is understood, and a description of the image content is then generated in natural language. For the second video in the first row of Figure 1, a generated description might be "a person appears next to a car; this person is carrying a schoolbag; the car is white;...". The quality and level of detail of the generated description depend on the model accuracy and related settings. This method is closest to the user's natural language description, so the later matching is less difficult, the system complexity is low, and the cost of use is relatively small.

(4) Image-based text reference target detection: the model receives the user's instruction text and the image at the same time, performs semantic extraction and matching of the text and the image inside the model, and finally directly outputs the position in the image of the object referred to by the user's instruction text. In this way, the matching and positioning of the user interaction instruction and the target interaction object can be realized, and the two steps S203 and S204 can be combined into one. This approach minimizes the loss of information in the whole process and can obtain better results. Of course, the specific effect also depends on the quality and complexity of the model. At the same time, the system complexity of this method is the smallest, and the cost of use is also the smallest.

In practical applications, the method of image content understanding is not limited to a single one of the above-mentioned methods; a combination of several methods can be selected according to the actual situation. The embodiment of this application does not impose any limitation on this.
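For the image classification option (1) above, the two mitigations mentioned in the text (confidence values and multi-level tags) can be sketched roughly as follows. The threshold of 0.3 and the function names are illustrative assumptions; the text gives only the example scores "food: 0.5" and "people: 0.4" and the example tag "car/sedan".

```python
from typing import Dict, List


def labels_above_threshold(confidences: Dict[str, float],
                           threshold: float = 0.3) -> List[str]:
    """Keep every label whose confidence reaches the threshold.

    This lets one image carry several same-level labels instead of only
    the single top label. The threshold value is an assumption.
    """
    kept = [(score, label) for label, score in confidences.items()
            if score >= threshold]
    kept.sort(reverse=True)  # highest confidence first
    return [label for _, label in kept]


def split_levels(tag: str) -> List[str]:
    """Split a multi-level tag such as "car/sedan" into its levels."""
    return tag.split("/")
```

The kept labels and tag levels could then be joined into the image description text that is later matched against the user's instruction.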

In some embodiments, for S204, the performing image control identification according to the voice data and the image description text information corresponding to the at least one image control, and determining the target image control in the at least one image control, may include:

Perform text conversion on the voice data to obtain voice text information;

Semantically matching the voice text information with the image description text information corresponding to at least one image control, and determining the semantic similarity value corresponding to at least one image control;

According to the semantic similarity value, the target image control is determined.

In a specific embodiment, the determining the target image control according to the semantic similarity value may include: determining the image control corresponding to the maximum similarity value among the semantic similarity values as the target image control.

Specifically, after obtaining the semantic similarity value corresponding to at least one image control, select the maximum similarity value from the semantic similarity values, and determine the image control corresponding to the maximum similarity value as the target image control.

That is to say, after obtaining the voice text information and the image description text information corresponding to at least one image control, the voice text information can be semantically matched with the image description text information corresponding to each image control. For example, a traditional text matching method (such as the TF-IDF algorithm, BM25 algorithm, simhash algorithm, Jaccard algorithm, etc.) or a semantic matching model trained with a neural network can be used to determine the semantic similarity value corresponding to each image control; the image control with the most similar semantics (that is, the image control corresponding to the maximum similarity value) is then selected as the target image control.
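As a minimal sketch of this matching step, the example below uses Jaccard similarity, one of the traditional methods named above, and selects the control with the maximum similarity value. Whitespace tokenization and the control identifiers are assumptions made for illustration; text in languages without word spacing would need a proper word segmenter, and a neural matching model could replace `jaccard` without changing the selection logic.

```python
from typing import Dict, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Assumption: lowercase whitespace tokenization is good enough here."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def pick_target(voice_text: str,
                descriptions: Dict[str, str]) -> Tuple[str, float]:
    """Return the control id with the maximum similarity and that value.

    `descriptions` maps a (hypothetical) control id to its image
    description text and must contain at least one image control.
    """
    query = tokenize(voice_text)
    scores = {
        control_id: jaccard(query, tokenize(description))
        for control_id, description in descriptions.items()
    }
    best_id = max(scores, key=scores.get)
    return best_id, scores[best_id]
```

For instance, matching "open the video containing the car" against one description mentioning a white car and another mentioning a plate of food would select the first control, because only it shares a token with the instruction.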

S205: Determine an operation instruction according to the voice data, and send the operation instruction to the target image control, so as to implement voice control on the target image control.

It should be noted that after determining the target image control, the electronic device can determine the operation instruction according to the user's voice data, and then send the operation instruction to the target image control to perform the corresponding operation (click, long press, etc.), thereby completing the voice interaction. Here, the operation instruction is determined from the voice data input by the user.

In short, the technical solution of the embodiment of the present application provides a method for understanding the image content of graphical interface elements, and further provides a voice interaction control method based on that image content understanding. The technical solution therefore does not require the controlled application to be adapted to voice control, which saves development and promotion costs and is convenient for users. In this way, for the picture controls shown in Figure 1, when it is impossible or inconvenient for the user to describe the target control through a text description, the voice interaction method based on image content understanding proposed by the embodiment of the application can understand the voice data input by the user and the image content, and match the user's interaction intention with the interaction target, so as to achieve the purpose of voice interaction. Still taking Figure 1 as an example, when the voice data input by the user is "Open the video containing the car", the second video in the first row, which contains the car, can be opened; or when the voice data input by the user is "Open the video containing the treasure chest", the second video in the third row, which contains the treasure chest, can be opened. The technical solution can provide users with a more natural, smarter, and more intuitive interaction mode, thereby giving users a better experience.

This embodiment provides a voice control method: voice data input by the user is received; at least one image control is determined according to the current graphical interface; image content understanding is performed on the at least one image control to obtain the corresponding image description text information; image control identification is performed according to the voice data and the image description text information corresponding to the at least one image control, and a target image control is determined in the at least one image control; an operation instruction is determined according to the voice data and sent to the target image control to achieve voice control over the target image control. In this way, the voice interaction method based on image content understanding does not require the application to be adapted for voice interaction, which not only saves development costs but also makes description easier for the user, effectively improving convenience when using voice control and better achieving the purpose of voice interaction and control.

In another embodiment of the present application, based on the same inventive concept as the foregoing embodiments, refer to FIG. 4 , which shows a detailed flowchart of a voice control method provided by the embodiment of the present application. As shown in Figure 4, the detailed process may include:

S301: Obtain voice text information corresponding to user voice data.

S302: Obtain a View Tree of graphical interface element information.

S303: Query all controls meeting the requirements from the View Tree to form a set of candidate controls.

S304: Perform size screening on the controls in the candidate control set to obtain at least one image control.

S305: Perform image content understanding on at least one image control to obtain image description text information corresponding to at least one image control.

S306: Semantically matching the voice text information with the image description text information corresponding to at least one image control, and determining the target image control corresponding to the maximum semantic similarity value.

S307: Send an operation instruction to the target image control, so as to realize voice interaction between the user and the target image control.

It should be noted that the operation instruction is determined according to the voice data of the user. Here, the main process of the voice interaction control method based on image content understanding proposed in the embodiment of the present application is as follows. During voice interaction by the user: first, the instruction text of the user's voice interaction is acquired; second, the current graphical interface element information (View Tree) is acquired; third, all controls that meet the requirements are queried from the View Tree, that is, all controls whose class attribute suffix is ImageView or an unconventional type (such as FrameLayout, LinearLayout, RelativeLayout, View, etc.); fourth, size screening is performed on these controls, where the length and width of a screened image control need to be larger than a certain size (such as 100dp) and its aspect ratio needs to be less than 3; fifth, image content understanding is performed on the screened image controls, and image description text is generated for the image content of these controls; sixth, semantic matching is performed between the instruction text and the image description text, and the image control with the most similar semantics is found as the target image control; finally, the target image control executes the user's operation instruction (click, long press, etc.) to complete the user's voice interaction.
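The flow S303 to S307 can be sketched end to end as below. Every stage is passed in as a callable because the text leaves the concrete models open; the function name, the parameter names, and the returned pair are all hypothetical choices for illustration, and the voice text (S301) and the flattened list of View Tree controls (S302) are assumed to be available already.

```python
from typing import Any, Callable, List, Optional, Tuple


def run_voice_control(
    voice_text: str,
    controls: List[Any],
    is_candidate: Callable[[Any], bool],
    passes_size: Callable[[Any], bool],
    describe: Callable[[Any], str],
    similarity: Callable[[str, str], float],
    parse_operation: Callable[[str], str],
) -> Optional[Tuple[Any, str]]:
    """Return (target control, operation), or None if nothing qualifies."""
    # S303: query the controls that meet the class-type requirement.
    candidates = [c for c in controls if is_candidate(c)]
    # S304: size screening to obtain the image controls.
    image_controls = [c for c in candidates if passes_size(c)]
    if not image_controls:
        return None
    # S305 + S306: describe each image control, score it against the
    # instruction text, and keep the one with the maximum similarity.
    scored = [(similarity(voice_text, describe(c)), c) for c in image_controls]
    _, target = max(scored, key=lambda pair: pair[0])
    # S307: the operation is determined from the user's voice data.
    return target, parse_operation(voice_text)
```

Injecting the stages keeps the sketch independent of any particular captioning or matching model; the screenshot-based variant would differ only in where the list of controls comes from.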

In this way, the embodiment of the present application provides a voice interaction and control method based on image content understanding for the situation where, when the user operates graphical interface content by voice, the description text of an interface element is missing or is inconvenient for the user to state directly. Through the technical solution of the embodiment of the application, the user can directly say "I want to watch Ronaldo's video" or "open the video containing the car", so that the video containing "Ronaldo" or "car" is matched and located, achieving the purpose of manipulating and interacting with interface elements. Since the technical solution does not require the controlled application to be adapted to voice control, it saves development costs and is easy to promote and use; at the same time, this interaction method conforms to the user's interaction habits, makes description easier for the user, saves the user's time, and effectively improves convenience when using voice control, thereby improving the user experience.

Further, the above-mentioned technical solution needs to first obtain the current graphical interface element information (View Tree) from the system, and then filter out image controls from the View Tree for image content understanding. However, it may be difficult to obtain the View Tree on some platforms or systems. Therefore, in the embodiment of the present application, a screenshot of the current graphical interface can also be used for image content understanding, so as to achieve the purpose of voice interaction and control.

Referring to FIG. 5 , it shows a schematic flowchart of another voice control method provided by an embodiment of the present application. As shown in Figure 5, the detailed process may include:

S401: Obtain voice text information corresponding to user voice data.

S402: Take a screenshot of the current graphical interface to obtain an image to be recognized.

S403: Perform control detection on the image to be recognized to obtain several candidate controls.

S404: Perform size screening on several candidate controls to obtain at least one image control.

S405: Perform image content understanding on at least one image control to obtain image description text information corresponding to at least one image control.

S406: Semantically matching the speech text information with the image description text information corresponding to at least one image control, and determining the target image control corresponding to the maximum semantic similarity value.

S407: Send an operation instruction to the target image control, so as to realize voice interaction between the user and the target image control.

It should be noted that the operation instruction is determined according to the voice data of the user. Here, the main process of the voice interaction control method based on image content understanding proposed in the embodiment of the present application is as follows. During voice interaction by the user: first, the instruction text of the user's voice interaction is acquired; second, a screenshot of the current graphical interface is acquired; third, the controls (text, images, etc.) contained in the screenshot and their positions are detected; fourth, size screening is performed on the detected controls in combination with their positions, where the length and width of a screened image control need to be larger than a certain size (such as 100dp) and its aspect ratio needs to be less than 3; fifth, image content understanding is performed on the screened image controls, and image description text is generated for the image content of these controls; sixth, semantic matching is performed between the instruction text and the image description text, and the image control with the most similar semantics is found as the target image control; finally, the target image control executes the user's operation instruction (click, long press, etc.) to complete the user's voice interaction.

It should also be noted that, for the non-target controls in this technical solution, the text description information can be extracted by commonly used methods such as icon recognition and text recognition. Through the technical solution, the user's voice control and matching can be realized only by images, thereby achieving the purpose of user interaction and control; the solution is simple to implement and is conducive to popularization.

Furthermore, it can be understood from the above technical solutions that, through image text description + text semantic matching, the process will be accompanied by a certain amount of information loss, and better results and performance may not be obtained. Therefore, the embodiment of the present application can also combine images and text instructions, use image-based text references to perform target detection, and directly match user interaction targets, thereby achieving the purpose of user interaction and control.

Referring to FIG. 6 , it shows a detailed flow chart of another voice control method provided by an embodiment of the present application. As shown in Figure 6, the detailed process may include:

S501: Obtain voice text information corresponding to user voice data.

S502: Take a screenshot of the current graphical interface to obtain an image to be recognized.

S503: Perform target detection on the image to be recognized according to the voice text information, and determine the target image control.

S504: Send an operation instruction to the target image control, so as to realize voice interaction between the user and the target image control.

It should be noted that, in the embodiment of the present application, after the voice data input by the user is received, the method may specifically include: taking a screenshot of the current graphical interface to obtain an image to be recognized; performing text conversion on the voice data to obtain voice text information; performing target detection on the image to be recognized according to the voice text information, and determining the target image control; determining the operation instruction according to the voice data, and sending the operation instruction to the target image control, so as to realize voice control of the target image control.

That is to say, the operation instruction is determined according to the voice data of the user. Here, the main process of the voice interaction control method based on image content understanding proposed in the embodiment of the present application is as follows. During voice interaction by the user: first, the instruction text of the user's voice interaction is acquired; second, a screenshot of the current graphical interface is acquired; third, image-based text reference target detection is used to match and locate the user's interaction target, thereby determining the target image control; finally, the target image control executes the user's operation instruction (click, long press, etc.) to complete the interaction process.
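The shape of this combined flow (S501 to S504) can be sketched as follows. The `locate` callable stands in for an image-based text reference detection model, which the text does not specify; the `Box` type and the idea of delivering the operation at the center of the located region are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass
class Box:
    """Hypothetical bounding box in screenshot pixel coordinates."""
    left: int
    top: int
    right: int
    bottom: int

    def center(self) -> Tuple[int, int]:
        return (self.left + self.right) // 2, (self.top + self.bottom) // 2


def run_referring_control(
    voice_text: str,
    screenshot: Any,
    locate: Callable[[str, Any], Optional[Box]],
    parse_operation: Callable[[str], str],
) -> Optional[Tuple[str, Tuple[int, int]]]:
    """Return (operation, point) for the referred object, or None."""
    # S503: the model matches instruction text and image in one step,
    # replacing the separate description and matching stages.
    box = locate(voice_text, screenshot)
    if box is None:
        return None
    # S504: the operation is determined from the voice data; assumption:
    # it is applied at the center of the located region.
    return parse_operation(voice_text), box.center()
```

Compared with the earlier flows, no candidate set, size screening, or description text is needed, which reflects the lower system complexity noted for this variant.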

In this way, through the technical solution, the user's voice control and matching can be realized from images alone, so as to achieve the purpose of user interaction and control. With this technical solution, the user's interaction instructions are not restricted: the user can describe the interactive object (not limited to pictures, but also icons, etc.) in a relatively free manner, such as "the blue button in the upper right corner", "I want to watch a car video", "select the plus button below", "click the second button at the bottom", etc., instead of describing the interactive object according to the interface text or predefined instructions. The technical solution can provide users with a more natural and intelligent voice interaction and control mode; it is also simple to implement, which is conducive to popularization, and its system complexity is low, which is beneficial to implementation and deployment on the device side.

This embodiment provides a voice control method and, through the above description, elaborates on the specific implementation of the foregoing embodiments. It can be seen that the technical solutions of the foregoing embodiments not only save development costs but also improve the convenience of users when using voice control, so as to better achieve the purpose of voice interaction and control.

In yet another embodiment of the present application, based on the same inventive concept as the foregoing embodiments, refer to FIG. 7 , which shows a schematic structural diagram of a voice control device 60 provided in the embodiment of the present application. As shown in FIG. 7, the voice control device 60 may include: a receiving unit 601, a determining unit 602, an analyzing unit 603, and a sending unit 604; wherein,

The receiving unit 601 is configured to receive voice data input by the user;

The determining unit 602 is configured to determine at least one image control according to the current graphical interface;

The analysis unit 603 is configured to perform image content understanding on at least one image control to obtain the image description text information corresponding to the at least one image control; and is also configured to perform image control recognition according to the voice data and the image description text information corresponding to the at least one image control, and determine a target image control in the at least one image control;

The determining unit 602 is further configured to determine an operation instruction according to the voice data;

The sending unit 604 is configured to send an operation instruction to the target image control, so as to implement voice control on the target image control.

In some embodiments, the determining unit 602 is further configured to determine graphical interface element information corresponding to the current graphical interface; and determine at least one image control according to the graphical interface element information.

In some embodiments, referring to FIG. 7 , the voice control device 60 may further include a calling unit 605 configured to call the system underlying code information to obtain graphical interface element information; or call the system auxiliary service function interface to obtain graphical interface element information.

In some embodiments, the determining unit 602 is further configured to query the controls whose class attribute suffix is a preset type from the graphical interface element information to form a set of candidate controls; and perform size screening on the controls in the set of candidate controls to obtain at least one image control.

In some embodiments, the determining unit 602 is further configured to, in the set of candidate controls, determine whether the length and width of the controls meet the preset size conditions, and determine the controls whose length and width meet the preset size conditions as image controls.

In some embodiments, referring to FIG. 7, the voice control device 60 may further include a detection unit 606 configured to take a screenshot of the current graphical interface to obtain an image to be recognized; perform control detection on the image to be recognized, and form several detected controls into a candidate control set; and perform size screening on the controls in the candidate control set to obtain at least one image control.

In some embodiments, the analysis unit 603 is specifically configured to perform text conversion on the voice data to obtain voice text information; semantically match the voice text information with the image description text information corresponding to at least one image control to determine the semantic similarity value corresponding to the at least one image control; and determine the target image control according to the semantic similarity value.

In some embodiments, the determining unit 602 is further configured to determine that the image control corresponding to the maximum similarity value among the semantic similarity values is the target image control.

In some embodiments, the detection unit 606 is also configured to take a screenshot of the current graphical interface to obtain the image to be recognized; perform text conversion on the voice data to obtain voice text information; perform target detection on the image to be recognized based on the voice text information, and determine the target image control; and determine the operation instruction according to the voice data, and send the operation instruction to the target image control, so as to realize the voice control of the target image control.

It can be understood that, in this embodiment, a "unit" may be a part of a circuit, a part of a processor, a part of a program or software, etc., of course it may also be a module, or it may be non-modular. Moreover, each component in this embodiment may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware or in the form of software function modules.

If the integrated unit is implemented in the form of a software function module and is not sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this embodiment, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to make a computer device (which may be a personal computer, a server, a network device, etc.) or a processor execute all or part of the steps of the method described in this embodiment. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disk.

Therefore, this embodiment provides a computer storage medium, the computer storage medium stores a computer program, and when the computer program is executed by at least one processor, the steps of the method described in any one of the preceding embodiments are implemented.

Based on the above composition of the voice control device 60 and the computer storage medium, refer to FIG. 8, which shows a schematic structural diagram of an electronic device provided by an embodiment of the present application. As shown in FIG. 8, the electronic device 70 may include a communication interface 701, a memory 702, and a processor 703, and the components are coupled together through a bus system 704. It can be understood that the bus system 704 is used to realize connection and communication between these components. In addition to a data bus, the bus system 704 also includes a power bus, a control bus, and a status signal bus. However, for clarity of illustration, the various buses are all labeled as the bus system 704 in FIG. 8. The communication interface 701 is used for receiving and sending signals in the process of exchanging information with other external network elements;

The memory 702 is used to store a computer program that can run on the processor 703;

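The processor 703 is configured to execute the following when running the computer program:

receiving voice data input by the user;

determining at least one image control according to the current graphical interface;

performing image content understanding on the at least one image control to obtain image description text information corresponding to the at least one image control;

performing image control recognition according to the voice data and the image description text information corresponding to the at least one image control, and determining the target image control among the at least one image control; and

determining an operation instruction according to the voice data, and sending the operation instruction to the target image control, so as to implement voice control of the target image control.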
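For illustration only, the overall flow of the method (steps S201 to S205 in the abstract) can be sketched in Python as follows. This is a minimal sketch, not the implementation of the present application: the names `Command` and `run_voice_control`, and all of the callables passed in (speech recognition, control enumeration, image captioning, similarity scoring, instruction parsing), are hypothetical placeholders for components that the application leaves unspecified.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Command:
    """Hypothetical result type: which control to act on and how."""
    target_id: str  # identifier of the selected target image control
    action: str     # operation instruction, e.g. "click"


def run_voice_control(
    voice_data: bytes,
    list_image_controls: Callable[[], Sequence[str]],
    describe_image: Callable[[str], str],
    speech_to_text: Callable[[bytes], str],
    similarity: Callable[[str, str], float],
    parse_action: Callable[[str], str],
) -> Command:
    # S201: voice_data has already been received by the caller.
    # S202: determine the image controls on the current graphical interface.
    controls = list_image_controls()
    if not controls:
        raise LookupError("no image control found on the current interface")

    # S203: image content understanding, one description text per control.
    descriptions = {c: describe_image(c) for c in controls}

    # S204: match the spoken text against each description and keep the
    # control with the highest similarity score.
    text = speech_to_text(voice_data)
    target = max(controls, key=lambda c: similarity(text, descriptions[c]))

    # S205: derive the operation instruction from the same spoken text.
    return Command(target_id=target, action=parse_action(text))
```

Passing the components in as callables keeps the sketch independent of any particular speech recognition or image captioning model; in a real device the returned `Command` would then be dispatched to the target image control.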

It can be understood that the memory 702 in this embodiment of the present application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memories. The non-volatile memory may be a read-only memory (Read-Only Memory, ROM), a programmable read-only memory (Programmable ROM, PROM), an erasable programmable read-only memory (Erasable PROM, EPROM), an electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), or a flash memory. The volatile memory may be a random access memory (Random Access Memory, RAM), which is used as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as static random access memory (Static RAM, SRAM), dynamic random access memory (Dynamic RAM, DRAM), synchronous dynamic random access memory (Synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (Enhanced SDRAM, ESDRAM), synchronous link dynamic random access memory (Synchlink DRAM, SLDRAM), and direct Rambus random access memory (Direct Rambus RAM, DR RAM). The memory 702 of the systems and methods described herein is intended to include, but is not limited to, these and any other suitable types of memory.

The processor 703 may be an integrated circuit chip with signal processing capability. During implementation, each step of the above method may be completed by an integrated logic circuit of hardware in the processor 703 or by instructions in the form of software. The processor 703 may be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium that is well established in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 702; the processor 703 reads the information in the memory 702 and completes the steps of the above method in combination with its hardware.

It should be understood that the embodiments described herein may be implemented by hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing unit may be implemented in one or more application-specific integrated circuits (Application Specific Integrated Circuits, ASIC), digital signal processors (Digital Signal Processor, DSP), digital signal processing devices (DSP Device, DSPD), programmable logic devices (Programmable Logic Device, PLD), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), general-purpose processors, controllers, microcontrollers, microprocessors, other electronic units used to perform the functions described in this application, or a combination thereof.

For a software implementation, the techniques described herein can be implemented through modules (e.g., procedures, functions, and so on) that perform the functions described herein. Software code can be stored in a memory and executed by a processor. The memory can be implemented within the processor or external to the processor.

Optionally, as another embodiment, the processor 703 is further configured to execute the steps of the method described in any one of the foregoing embodiments when running the computer program.

Based on the above composition of the voice control device 60 and the computer storage medium, refer to FIG. 9, which shows a schematic structural diagram of another electronic device provided by an embodiment of the present application. As shown in FIG. 9, the electronic device 70 may include the voice control device 60 described in any one of the foregoing embodiments.

In the embodiment of the present application, for the electronic device 70, the voice interaction mode based on image content understanding does not require applications to be adapted for voice interaction. This not only saves development costs but also makes it easier for the user to describe the intended target, which can effectively improve the user's convenience during voice control and thus better achieve the purpose of voice interaction and control.

It should be noted that in this application, the terms "comprise", "include", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus comprising a set of elements includes not only those elements but also other elements not expressly listed, or elements inherent in such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not preclude the presence of additional identical elements in the process, method, article, or apparatus comprising that element.

The serial numbers of the above embodiments of the present application are for description only and do not indicate the relative merits of the embodiments.

The methods disclosed in several method embodiments provided in this application can be combined arbitrarily to obtain new method embodiments under the condition of no conflict.

The features disclosed in several product embodiments provided in this application can be combined arbitrarily without conflict to obtain new product embodiments.

The features disclosed in several method or device embodiments provided in this application can be combined arbitrarily without conflict to obtain new method embodiments or device embodiments.

The above is only a specific implementation of the present application, but the scope of protection of the present application is not limited thereto. Any changes or substitutions that a person familiar with the technical field can readily conceive of within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be determined by the protection scope of the claims.

Industrial Applicability

In the embodiment of the present application, voice data input by a user is received; at least one image control is determined according to the current graphical interface; image content understanding is performed on the at least one image control to obtain image description text information corresponding to the at least one image control; image control recognition is performed according to the voice data and the image description text information corresponding to the at least one image control, and a target image control is determined among the at least one image control; and an operation instruction is determined according to the voice data and sent to the target image control, so as to implement voice control of the target image control. In this way, the voice interaction mode based on image content understanding does not require applications to be adapted for voice interaction. This not only saves development costs but also makes it easier for the user to describe the intended target, which can effectively improve the user's convenience during voice control and thus better achieve the purpose of voice interaction and control.

Claims

A voice control method, the method comprising:

receiving voice data input by a user;

determining at least one image control according to a current graphical interface;

performing image content understanding on the at least one image control to obtain image description text information corresponding to the at least one image control;

performing image control recognition according to the voice data and the image description text information corresponding to the at least one image control, and determining a target image control among the at least one image control; and

determining an operation instruction according to the voice data, and sending the operation instruction to the target image control, so as to implement voice control of the target image control.
The method according to claim 1, wherein said determining at least one image control according to the current graphical interface comprises:

Determine the graphical interface element information corresponding to the current graphical interface;

The at least one image control is determined according to the graphical interface element information.
The method according to claim 2, wherein said determining the graphical interface element information corresponding to the current graphical interface comprises:

Invoking the underlying code information of the system to obtain the graphical interface element information; or,

Call the system auxiliary service function interface to obtain the graphic interface element information.
The method according to claim 2, wherein said determining said at least one image control according to said graphical interface element information comprises:

Querying controls whose class attribute suffix is a preset type from the graphical interface element information to form a set of candidate controls;

Perform size screening on the controls in the set of candidate controls to obtain the at least one image control.
The method according to claim 4, wherein the preset type includes at least one of the following: ImageView, FrameLayout, LinearLayout, RelativeLayout and View.
The method according to claim 4, wherein the size screening of the controls in the set of candidate controls to obtain the at least one image control comprises:

In the set of candidate controls, it is judged whether the length and width of the controls meet a preset size condition, and the control whose length and width meet the preset size condition is determined as the image control.
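As a non-limiting illustration, the candidate selection and size screening described above could be sketched in Python as follows. The sketch makes several assumptions that are not stated in the application: the `UiNode` structure is hypothetical, the "class attribute suffix" is interpreted as the simple class name after the last dot, and the `min_side` and `max_side` thresholds are made-up stand-ins for the unspecified preset size condition.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Preset types listed in the claims.
PRESET_TYPES = ("ImageView", "FrameLayout", "LinearLayout", "RelativeLayout", "View")


@dataclass
class UiNode:
    """Hypothetical element of the graphical interface element information."""
    class_name: str  # fully qualified class, e.g. "android.widget.ImageView"
    width: int       # pixels
    height: int      # pixels


def candidate_controls(nodes: Iterable[UiNode]) -> List[UiNode]:
    """Keep nodes whose simple class name is one of the preset types.

    Assumption: "suffix" means the part after the last dot, so that
    "android.widget.TextView" does not match the preset type "View".
    """
    return [n for n in nodes if n.class_name.rsplit(".", 1)[-1] in PRESET_TYPES]


def size_screen(nodes: Iterable[UiNode], min_side: int = 48, max_side: int = 1000) -> List[UiNode]:
    """Keep nodes whose width and height both fall within the size range.

    The thresholds are illustrative only; the application does not give values.
    """
    return [
        n for n in nodes
        if min_side <= n.width <= max_side and min_side <= n.height <= max_side
    ]
```

Screening by size after the class filter removes layout containers that are too small to be icons or so large that they are more likely to be page backgrounds than image controls.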
The method according to claim 1, wherein said determining at least one image control according to the current graphical interface comprises:

Take a screenshot of the current graphical interface to obtain the image to be recognized;

Performing control detection on the image to be recognized, and forming a candidate control set from several detected controls;

Perform size screening on the controls in the set of candidate controls to obtain the at least one image control.
The method according to any one of claims 1 to 7, wherein said performing image control recognition according to the voice data and the image description text information corresponding to the at least one image control, and determining a target image control among the at least one image control, comprises:

performing text conversion on the voice data to obtain voice text information;

performing semantic matching between the voice text information and the image description text information corresponding to the at least one image control, and determining a semantic similarity value corresponding to the at least one image control; and

determining the target image control according to the semantic similarity value.
The method according to claim 8, wherein said determining said target image control according to said semantic similarity value comprises:

It is determined that the image control corresponding to the maximum similarity value among the semantic similarity values is the target image control.
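By way of example only, selecting the image control with the maximum semantic similarity value could be sketched in Python as follows. The application does not specify how semantic similarity is computed; the sketch substitutes a simple Jaccard overlap of word tokens as a stand-in, whereas a practical system would more likely compare learned text embeddings. The function names `jaccard` and `pick_target` are hypothetical.

```python
from typing import Dict, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; a stand-in for semantic matching."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def pick_target(voice_text: str, descriptions: Dict[str, str]) -> Tuple[str, float]:
    """Return the control id with the maximum similarity value, and that value.

    `descriptions` maps a control id to its image description text.
    """
    if not descriptions:
        raise ValueError("no image description text available")
    scores = {cid: jaccard(voice_text, text) for cid, text in descriptions.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

For instance, with the descriptions "a red shopping cart icon" and "a photo of a cat on a sofa", the utterance "open the photo of the cat" shares tokens only with the second description, so that control is selected.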
The method according to claim 1, wherein, after receiving the voice data input by the user, the method further comprises:

taking a screenshot of the current graphical interface to obtain an image to be recognized;

performing text conversion on the voice data to obtain voice text information;

performing target detection on the image to be recognized according to the voice text information, and determining a target image control; and

determining an operation instruction according to the voice data, and sending the operation instruction to the target image control, so as to implement voice control of the target image control.
A voice control device, the voice control device includes a receiving unit, a determining unit, an analyzing unit and a sending unit; wherein,

The receiving unit is configured to receive voice data input by a user;

The determining unit is configured to determine at least one image control according to the current graphical interface;

The analysis unit is configured to perform image content understanding on the at least one image control to obtain image description text information corresponding to the at least one image control; and is further configured to perform image control recognition according to the voice data and the image description text information corresponding to the at least one image control, and determine the target image control among the at least one image control;

The determining unit is further configured to determine an operation instruction according to the voice data;

The sending unit is configured to send the operation instruction to the target image control, so as to implement voice control on the target image control.
The voice control device according to claim 11, wherein the determining unit is further configured to determine graphical interface element information corresponding to the current graphical interface; and determine the at least one image control according to the graphical interface element information.
The voice control device according to claim 12, wherein the voice control device further comprises a calling unit configured to call the underlying code information of the system to obtain the graphical interface element information; or call the system auxiliary service function interface to obtain the graphical interface element information.
The voice control device according to claim 12, wherein the analysis unit is further configured to query controls whose class attribute suffix is a preset type from the graphical interface element information to form a set of candidate controls; and perform size screening on the controls in the set of candidate controls to obtain the at least one image control.
The voice control device according to claim 14, wherein the determining unit is further configured to, in the set of candidate controls, determine whether the length and width of each control meet a preset size condition, and determine a control whose length and width meet the preset size condition as the image control.
The voice control device according to claim 11, wherein the voice control device further includes a detection unit configured to take a screenshot of the current graphical interface to obtain an image to be recognized; perform control detection on the image to be recognized, and form a set of candidate controls from the detected controls; and perform size screening on the controls in the set of candidate controls to obtain the at least one image control.
The voice control device according to any one of claims 11 to 16, wherein the analysis unit is further configured to perform text conversion on the voice data to obtain voice text information; perform semantic matching between the voice text information and the image description text information corresponding to the at least one image control to determine a semantic similarity value corresponding to the at least one image control; and determine the target image control according to the semantic similarity value.
The voice control device according to claim 16, wherein the detection unit is configured to take a screenshot of the current graphical interface to obtain an image to be recognized; perform text conversion on the voice data to obtain voice text information; perform target detection on the image to be recognized according to the voice text information to determine a target image control; and determine an operation instruction according to the voice data and send the operation instruction to the target image control, so as to implement voice control of the target image control.
An electronic device comprising a memory and a processor; wherein,

The memory is configured to store a computer program capable of running on the processor;

The processor is configured to execute the method according to any one of claims 1 to 10 when running the computer program.
A computer storage medium, wherein the computer storage medium stores a computer program, and when the computer program is executed by at least one processor, the method according to any one of claims 1 to 10 is realized.