CN114067797A - Voice control method, apparatus, device and computer storage medium

Info

Publication number: CN114067797A
Application number: CN202111398660.6A
Authority: CN (China)
Prior art keywords: control, image control, image, voice, determining
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 陈明, 冉茂松, 张晓帆
Current assignee: Hangzhou Douku Software Technology Co Ltd
Original assignee: Hangzhou Douku Software Technology Co Ltd
Filed: CN202111398660.6A, by Hangzhou Douku Software Technology Co Ltd
Related application: PCT/CN2022/122020 (published as WO2023087934A1)

Classifications

    • G10L 15/22 Speech recognition: procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26 Speech recognition: speech to text systems
    • G10L 2015/223 Execution procedure of a spoken command
    • G06F 9/44526 Program loading or initiating: plug-ins; add-ons
    • G06F 9/451 Arrangements for executing specific programs: execution arrangements for user interfaces

Abstract

The embodiments of the present application disclose a voice control method, apparatus, device, and computer storage medium. The method includes: receiving voice data input by a user; determining at least one image control according to the current graphical interface; performing image content understanding on the at least one image control to obtain image description text information corresponding to the at least one image control; performing image control identification according to the voice data and the image description text information corresponding to the at least one image control, and determining a target image control among the at least one image control; and determining an operation instruction according to the voice data and sending the operation instruction to the target image control, so as to implement voice control of the target image control. In this way, a voice interaction mode based on image content understanding not only saves development cost but also improves convenience when a user controls a device by voice, thereby better achieving the purposes of voice interaction and control.

Description

Voice control method, apparatus, device and computer storage medium
Technical Field
The present application relates to the field of voice interaction technologies, and in particular, to a voice control method, apparatus, device, and computer storage medium.
Background
In recent years, with the rapid development of hardware devices and electronic products, voice-based human-machine interaction has become increasingly mature, common, and widely accepted. As voice interaction gradually permeates people's daily life, the demand for manipulating a Graphical User Interface (GUI) by voice is growing stronger.
In the related art, for situations where elements in a graphical interface lack text descriptions, or where the text descriptions are inconvenient for the user to state directly, some solutions exist, such as icon recognition, character recognition, spatial orientation reference, and number reference. However, these solutions all have certain limitations. In particular, for picture-type controls, since the user cannot, or finds it inconvenient to, refer to the control through a text description, voice control is difficult to implement and usability is poor.
Disclosure of Invention
The present application provides a voice control method, apparatus, device, and computer storage medium, which not only save development cost but also improve convenience when a user uses voice control, thereby better achieving the purposes of voice interaction and control.
To achieve the above purpose, the technical solutions of the present application are implemented as follows:
in a first aspect, an embodiment of the present application provides a voice control method, where the method includes:
receiving voice data input by a user;
determining at least one image control according to the current graphical interface;
understanding the image content of at least one image control to obtain image description text information corresponding to the at least one image control;
performing image control identification according to the voice data and image description text information corresponding to at least one image control, and determining a target image control in the at least one image control;
and determining an operation instruction according to the voice data, and sending the operation instruction to the target image control so as to realize voice control of the target image control.
In a second aspect, an embodiment of the present application provides a voice control apparatus, which includes a receiving unit, a determining unit, an analysis unit, and a sending unit, wherein:
a receiving unit configured to receive voice data input by a user;
the determining unit is configured to determine at least one image control according to the current graphical interface;
the analysis unit is configured to perform image content understanding on the at least one image control to obtain image description text information corresponding to the at least one image control; and is further configured to perform image control recognition according to the voice data and the image description text information corresponding to the at least one image control, and determine a target image control among the at least one image control;
the determining unit is further configured to determine an operation instruction according to the voice data;
and the sending unit is configured to send an operation instruction to the target image control so as to realize voice control on the target image control.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory and a processor, wherein:
the memory is configured to store a computer program operable on the processor;
the processor is configured to perform the method according to the first aspect when executing the computer program.
In a fourth aspect, embodiments of the present application provide a computer storage medium storing a computer program, which when executed by at least one processor implements the method according to the first aspect.
The voice control method, apparatus, device, and computer storage medium provided by the embodiments of the present application receive voice data input by a user; determine at least one image control according to the current graphical interface; perform image content understanding on the at least one image control to obtain image description text information corresponding to the at least one image control; perform image control identification according to the voice data and the image description text information corresponding to the at least one image control, and determine a target image control among the at least one image control; and determine an operation instruction according to the voice data and send the operation instruction to the target image control, so as to implement voice control of the target image control. In this way, the voice interaction mode based on image content understanding requires no adaptation between the application and voice interaction, which saves development cost, makes description easier for the user, effectively improves convenience when the user controls by voice, and thus better achieves the purposes of voice interaction and control.
Drawings
FIG. 1 is a schematic diagram of a graphical interface element grid index;
fig. 2 is a schematic flowchart of a voice control method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a hierarchical tree according to an embodiment of the present application;
fig. 4 is a detailed flowchart of a voice control method according to an embodiment of the present application;
fig. 5 is a detailed flowchart of another speech control method according to an embodiment of the present application;
fig. 6 is a detailed flowchart of another voice control method according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a voice control apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of another electronic device according to an embodiment of the present application.
Detailed Description
So that the manner in which the features and elements of the present embodiments can be understood in detail, a more particular description of the embodiments, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
In the following description, reference is made to "some embodiments", which describe a subset of all possible embodiments; it should be understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other where there is no conflict. It should also be noted that the terms "first/second/third" in the embodiments of the present application are only used to distinguish similar objects and do not imply a specific ordering; it should be understood that "first/second/third" may be interchanged in a specific order or sequence where permitted, so that the embodiments of the present application described herein can be implemented in an order other than that shown or described.
Owing to the rapid development of hardware devices and electronic products in recent years, voice-based human-machine interaction has become increasingly mature, common, and widely accepted. As voice interaction gradually permeates all aspects of daily life, the need to manipulate a Graphical User Interface (GUI) by voice is becoming increasingly strong.
Currently, the most important issue for a Voice and Graphical User Interface (VGUI) is application adaptation, as most applications do not consider voice interaction during design and development. Taking a smartphone as an example, applications on current mobile terminals are mainly designed for interaction through a touch screen, so most applications are not adapted to voice interaction. Therefore, when voice is used to interact with and control an application's graphical interface on a mobile phone, many problems may be encountered: an interface element may lack a textual description, or may have a textual description that is inconvenient for the user to state directly (e.g., the textual description is too long, includes symbols or pictures, is unclear, or is the same as or similar to that of several other interface elements). In these cases, the user cannot refer directly to the control they want to interact with through its textual description.
In the related art, the current solutions for this situation mainly include the following:
(1) Icon recognition: common, unambiguous icon controls are detected and recognized by a model to obtain a description text (common name/title). The user can then describe an icon control through common sense or its common name/title, such as "play", "pause", "previous", or "next" among audio/video playback control buttons, thereby describing the target control and achieving the purpose of interaction;
(2) Character recognition: text information possibly contained in image controls such as pictures and icons without text descriptions is recognized by a model, and the recognized text is used as the control's text description information to be matched against the user's interaction instruction, thereby locating the target control and achieving the purpose of interaction;
(3) Spatial orientation reference: for example, "the button to the right of the download button" or "the icon below the thumbs-up button"; the target control is referred to by its spatial orientation relationship to other describable controls, thereby describing the target control and achieving the purpose of interaction;
(4) Number reference: for example, "the first button"; the controls are numbered and then referred to by number, thereby describing the target control and achieving the purpose of interaction. The control numbers are generally not displayed visually, so the user has to work out each control's number on their own;
(5) Overlaid display of text instructions: a text description of each interactive control is displayed overlaid on the GUI, and the user can refer to the corresponding control through the overlaid text description, thereby achieving the purpose of interaction;
(6) Overlaid display of numbers: a number for each interactive control is displayed overlaid on the GUI, and the user can refer to the corresponding control through the number overlaid on it, thereby achieving the purpose of interaction;
(7) Overlaid display of a grid and numbers: a grid is displayed overlaid full-screen on the GUI, and each grid cell is numbered. The user can refer to a control by the number of the grid cell in which the control is located, thereby describing the target control and achieving the purpose of interaction.
However, for situations where elements in the graphical interface lack text descriptions or where the text descriptions are inconvenient for the user to state directly, the above solutions all have certain limitations and cannot cover all cases. (1) Icon recognition only applies to common, unambiguous control icons; it cannot handle other types of icons or non-icon content, so its scope is limited. (2) Character recognition only applies when the image contains text, so its applicable situations are limited; it also requires more computing resources, so processing latency is generally longer, usage cost is higher, and accuracy is limited. (3) Spatial orientation reference needs a control that can be located through a text description to serve as the anchor, but in many cases no such control can be found, so its scope is relatively limited. (4) Number reference requires the program to number the controls, which are then referred to by number; the numbers themselves are not displayed in the interface. In actual use, however, the user's numbering and the program's numbering are not necessarily consistent, and an interface often holds dozens of interactable objects, making it difficult for the user to number the controls one by one. (5) Overlaid text instructions must be generated, and their generation depends on the controls' text descriptions, so they can suffer the same problems as those descriptions; if the overlaid content is too large it covers the original content, and if too small the user cannot read it. With dozens of interactive objects on one interface, the result is a dense mass of overlaid prompts that greatly harms the user's usage and sensory experience. (6) Overlaid numbers are simple to implement but do not help the user remember the correct interaction instruction; they share the same too-large/too-small and clutter problems as overlaid text. (7) With an overlaid grid and numbers, the grid cells may be too large or too small; the target control may span several cells, or several interactive objects may fall in the same cell. In these situations the user needs multiple operations to finally pin down the interaction target, and the overlaid content covers the original content, greatly harming the user's usage and sensory experience.
Briefly, by analyzing the situations in which a user cannot, or cannot conveniently, describe a target control through a text description, we can roughly divide them into two categories: the icon class and the picture class. Taking fig. 1 as an example, the icon class is shown in the bold dashed boxes: such controls are generally small, and their appearance and meaning are generally relatively fixed, so they can be handled by icon recognition; for non-standard, uncommon, or ambiguous icons, multiple labels may be used, such as shape, color, appearance style, and visual semantics, or spatial orientation combined with numbering. The other category is the picture class, shown in the bold solid boxes in fig. 1: such controls are generally large and mainly appear in lists of images, videos, files, messages, and the like; their arrangement may be regular (such as a grid) or irregular. The visual content and meaning of a picture vary greatly, and a picture may or may not carry a text description; moreover, text descriptions may repeat, or may contain symbols, pictures, and other elements that are inconvenient for the user to state directly. Existing solutions all have certain limitations here, so voice control is difficult to implement and a better interaction mode and experience cannot be provided.
Based on this, the embodiment of the application provides a voice control method, which receives voice data input by a user; determining at least one image control according to the current graphical interface; understanding the image content of at least one image control to obtain image description text information corresponding to the at least one image control; performing image control identification according to the voice data and image description text information corresponding to at least one image control, and determining a target image control in the at least one image control; and determining an operation instruction according to the voice data, and sending the operation instruction to the target image control so as to realize voice control of the target image control. Therefore, the voice interaction mode based on the understanding of the image content does not need to be adapted to the application and the voice interaction, so that the development cost can be saved, the description of a user is facilitated, the convenience of the user in using the voice for control can be effectively improved, and the purposes of voice interaction and control are better achieved.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
In an embodiment of the present application, referring to fig. 2, a flowchart of a voice control method provided in an embodiment of the present application is shown. As shown in fig. 2, the method may include:
s201: voice data input by a user is received.
S202: and determining at least one image control according to the current graphical interface.
It should be noted that the embodiments of the present application apply to a voice control apparatus, or to an electronic device integrating such an apparatus. The electronic device may be implemented in various forms; for example, it may be a smartphone, tablet computer, notebook computer, palmtop computer, Personal Digital Assistant (PDA), Portable Media Player (PMP), navigation device, wearable device, voice assistant, and the like, which is not limited in the embodiments of the present application.
It should also be noted that the operation instruction may be determined based on the voice data. In this way, after receiving voice data input by the user, the electronic device may perform a corresponding operation on the current graphical interface according to the operation instruction determined from the voice data. However, taking the picture controls in fig. 1 as an example, since the user cannot, or finds it inconvenient to, describe a picture control directly through a text description, the image controls in the current graphical interface need to be identified first, so that the subsequent image-content-based understanding of the user's voice data and the graphical interface content can match the user's interaction intention to the interaction target, achieving the purpose of voice interaction.
In some embodiments, for S202, the determining at least one image control according to the current graphical interface may include:
determining graphical interface element information corresponding to the current graphical interface;
and determining at least one image control according to the graphical interface element information.
In the embodiments of the present application, the graphical interface element information may be represented by a hierarchical tree, also called a View Tree. Illustratively, the graphical interface element information in the electronic device may take the form of the hierarchical tree shown in fig. 3. Each node in the View Tree represents an element or control (Control/Widget/Element) in the GUI, and the relevant attributes of an element may include its text description, interaction attributes (whether it is clickable, accepts text input, is scrollable, etc.), control position, and so on.
For a node in the View Tree, Table 1 illustrates some of its attribute information, which may include an index number, text description, interaction attributes (whether it is clickable, accepts text input, is scrollable, etc.), control position, and so on, as shown below.
TABLE 1
[Table 1 appears only as an image in the original (Figure BDA0003364824100000081); per the surrounding text, it lists a node's index number, text description, interaction attributes, and control position.]
In the View Tree, the Elements that are directly visible to the user are mainly Leaf nodes (Leaf Elements) in the View Tree. Other non-leaf nodes, which are generally not visible to the user, are primarily used as interface element containers (containers), which are primarily used to constrain and control the location, size, arrangement, etc. of the elements. Meanwhile, part of the Container also carries the interaction with the user (at this time, the clickable attribute is true).
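To make the node structure concrete, the following is a minimal sketch in Kotlin of how such a node could be modeled; every field name here is an assumption mirroring the attributes just listed, not part of any system API:

```kotlin
import android.graphics.Rect

// Illustrative in-memory mirror of one View Tree node; all field names are
// assumptions chosen to match the attributes described above (Table 1).
data class ViewNode(
    val index: Int,                     // index number within the tree
    val text: String?,                  // text description, if any
    val className: String?,             // e.g. "android.widget.ImageView"
    val clickable: Boolean,             // interaction attribute: clickable?
    val editable: Boolean,              // interaction attribute: accepts text input?
    val scrollable: Boolean,            // interaction attribute: scrollable?
    val bounds: Rect,                   // control position on screen
    val children: List<ViewNode> = emptyList()
) {
    // Leaf nodes are the elements directly visible to the user; non-leaf
    // nodes mostly serve as containers that constrain position and layout.
    val isLeaf: Boolean get() = children.isEmpty()
}
```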
In a specific embodiment, the determining of the graphical interface element information corresponding to the current graphical interface may include:
calling system underlying code to obtain the graphical interface element information; or,
calling the system accessibility service function interface to obtain the graphical interface element information.
That is to say, taking an electronic device based on the Android system as an example, there are two main ways of obtaining graphical interface element information from the Android system: the system's underlying code can directly provide an interface for obtaining the interface element structure and information; alternatively, the information can be obtained through the system's accessibility service (AccessibilityService) function interface. The former is complex to implement, carries a larger development workload, and requires modifying the system's underlying code, which brings a certain security risk, but it can obtain more comprehensive and accurate interface element information. The latter is simple to implement with a small development workload, but the obtained graphical interface element information may have omissions and partial errors. The embodiments of the present application may choose between them according to the actual situation, which is not limited here.
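As an illustration of the AccessibilityService route, the minimal Kotlin sketch below walks the active window's node tree and collects its leaf nodes; it assumes the code runs inside an enabled accessibility service, and node recycling, manifest registration, and error handling are omitted:

```kotlin
import android.accessibilityservice.AccessibilityService
import android.view.accessibility.AccessibilityEvent
import android.view.accessibility.AccessibilityNodeInfo

// Sketch: collect the leaf nodes of the current graphical interface via the
// AccessibilityService route described above. Leaf nodes correspond to the
// elements directly visible to the user.
class GuiInspectionService : AccessibilityService() {

    fun collectLeafNodes(): List<AccessibilityNodeInfo> {
        val root = rootInActiveWindow ?: return emptyList()
        val leaves = mutableListOf<AccessibilityNodeInfo>()
        fun walk(node: AccessibilityNodeInfo) {
            if (node.childCount == 0) {
                leaves += node          // directly visible element
            } else {
                for (i in 0 until node.childCount) {
                    node.getChild(i)?.let { walk(it) }
                }
            }
        }
        walk(root)
        return leaves
    }

    override fun onAccessibilityEvent(event: AccessibilityEvent?) {}
    override fun onInterrupt() {}
}
```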
Further, after obtaining the information of the graphical interface elements, all possible controls can be found out, and then at least one image control required is screened out. Specifically, in some embodiments, the determining at least one image control according to the graphical interface element information may include:
querying, from the graphical interface element information, the controls whose class attribute suffix is a preset type, to form a candidate control set;
and screening the sizes of the controls in the candidate control set to obtain at least one image control.
It should be noted that, in the embodiment of the present application, the preset type may include at least one of the following: image View (ImageView), single frame layout (FrameLayout), linear layout (linear layout), relative layout (relative layout), and View (View).
Further, in some embodiments, the performing of size screening on the controls in the candidate control set to obtain at least one image control may include: in the candidate control set, judging whether each control's length and width satisfy a preset size condition, and determining the controls whose length and width satisfy the preset size condition as image controls. That is, when a control's length and width satisfy the preset size condition, the control is an image control rather than an icon, a button, a decorative strip, or another kind of control, and it is screened out of the candidate control set as an image control.
It should be further noted that, in the embodiments of the present application, the preset size condition may be: the length and the width of the control are each greater than a first preset value, and the ratio of the control's length to its width is less than a second preset value.
Illustratively, the first preset value may be 100 dp, where dp (density-independent pixel) is a standard size unit: px = dp × density, with px the size in absolute pixels and density the screen's pixel density scale factor. The second preset value may be 3; that is, the aspect ratio of an image control needs to be less than 3.
That is, when the AccessibilityService function interface is used to obtain graphical interface element information on the Android system, some controls may not return correct content, so some target controls may have types (class attribute suffixes) of FrameLayout, LinearLayout, RelativeLayout, View, and the like, instead of ImageView. For this situation, the leaf-node controls of the View Tree whose type is ImageView, or one of the unconventional types (FrameLayout, LinearLayout, RelativeLayout, View, etc., which would not normally appear at a leaf node), may be screened out as the candidate control set.
It can be understood that obtaining graphical interface element information through the AccessibilityService function interface on the Android system depends on each application's own adaptation to that interface. For some applications or some custom controls, the adaptation to the accessibility service function interface is not good enough, so when graphical interface element information is obtained through this interface, some controls may not return correct content, leading to omitted or erroneous User Interface (UI) information. In addition, the AccessibilityService interface of the Android system cannot provide a control's visual appearance, so the embodiments of the present application also need an interface screenshot to assist in obtaining the controls' image information.
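A minimal Kotlin sketch of this screening step follows; the preset types and thresholds (100 dp per side, aspect ratio below 3) come from the text above, while the function and constant names are assumptions:

```kotlin
import android.graphics.Rect
import android.view.accessibility.AccessibilityNodeInfo

// Keep leaf nodes whose class attribute suffix is one of the preset types,
// then apply the size conditions described above. Thresholds follow the
// example values in the text; names are illustrative.
private val PRESET_SUFFIXES = setOf(
    "ImageView", "FrameLayout", "LinearLayout", "RelativeLayout", "View"
)
private const val MIN_SIDE_DP = 100f
private const val MAX_ASPECT = 3f

fun screenImageControls(
    leaves: List<AccessibilityNodeInfo>,
    density: Float        // screen density scale factor, e.g. from displayMetrics
): List<AccessibilityNodeInfo> = leaves.filter { node ->
    val suffix = node.className?.toString()?.substringAfterLast('.')
        ?: return@filter false
    if (suffix !in PRESET_SUFFIXES) return@filter false

    val bounds = Rect().also { node.getBoundsInScreen(it) }
    val wDp = bounds.width() / density    // px -> dp, since px = dp * density
    val hDp = bounds.height() / density
    if (wDp <= 0f || hDp <= 0f) return@filter false
    // Aspect ratio taken as longer side over shorter side.
    val aspect = maxOf(wDp, hDp) / minOf(wDp, hDp)
    wDp > MIN_SIDE_DP && hDp > MIN_SIDE_DP && aspect < MAX_ASPECT
}
```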
Although the embodiments of the present application provide a way of obtaining the current graphical interface element information (the View Tree) from the system and then screening the image controls from the View Tree, considering that some platforms or systems make the View Tree difficult to obtain, a way of detecting controls from a screenshot of the current graphical interface and then screening the image controls is also provided. Thus, in some embodiments, for S202, the determining of at least one image control according to the current graphical interface may include:
screenshot is carried out on the current graphical interface to obtain an image to be recognized;
carrying out control detection on an image to be identified, and forming a candidate control set by a plurality of controls obtained by detection;
and screening the sizes of the controls in the candidate control set to obtain at least one image control.
It should be noted that the performing of size screening on the controls in the candidate control set to obtain at least one image control may include: in the candidate control set, judging whether each control's length and width satisfy a preset size condition, and determining the controls whose length and width satisfy the preset size condition as image controls. That is, among the detected controls, when a control's length and width satisfy the preset size condition, the control is an image control rather than an icon, a button, a decorative strip, or another kind of control, and it is thus screened out.
That is to say, after the image to be recognized is obtained, the controls it contains (such as text and images) and their positions can be detected, yielding several controls; size screening is then performed on the detected controls: illustratively, each image control's length and width need to be greater than a certain size (for example, 100 dp) and its aspect ratio needs to be less than 3. The image controls are thereby screened out from the detected controls.
S203: and understanding the image content of the at least one image control to obtain image description text information corresponding to the at least one image control.
S204: and identifying the image control according to the image description text information corresponding to the voice data and the at least one image control, and determining a target image control in the at least one image control.
It should be noted that after the at least one image control is obtained through screening, image content understanding may be performed on these image controls to obtain each image control's image description text information. In the embodiments of the present application, the methods for image content understanding may include: image classification, image detection, image description generation, image-based text-referred target detection, and so on, which are described in detail below.
(1) Image classification: by classifying the image content, text description labels of the image content are matched, such as "car", "food", or "person". Generally, the text labels obtained by image classification are not fine-grained enough, so their degree of image understanding is limited. For example, for the second video in the first row of fig. 1, if the label is "car", no further details can be obtained, such as whether it is a motorcycle, a car, a truck, or a bus; this can be improved to some extent by multi-level labels or multiple labels, such as "vehicle/car" (first-level label/second-level label/...). In addition, image classification cannot obtain several same-level labels at once: for example, the first video in the third row of fig. 1 contains both people and food, but a classification model can only produce one label, "person" or "food", not both at the same time; this can be improved to some extent by confidence scores, for example "food: 0.5", "person: 0.4".
(2) Image detection: the objects contained in an image are detected by a detection model. For example, the second video in the first row of fig. 1 contains "person" and "car"; the second video in the second row contains "person" and "food". The detection model can also be cascaded or combined with a classification model to achieve fine-grained object identification: after a "car" is detected, its type, manufacturer, model, color, and so on can be further identified; after a "person" is detected, their gender, age, identity (face recognition: who), emotion, and so on can be further identified. Image detection can provide more, and more detailed, information than image classification, but it cannot provide the relationships among the detected objects; the information it provides is fragmented and rather different from the user's natural-language description, so later matching is relatively difficult. Meanwhile, providing more detailed information requires a more complex model or a cascade of several models, raising system complexity and usage cost.
(3) Image description generation: the image content is understood and a description of it is then generated in natural language. For the second video in the first row of fig. 1, a description such as "a person stands next to a car; the person carries a bag; the car is white; ..." is generated. The quality and level of detail of the generated description depend on the model's accuracy and related settings. This method is the closest to the user's natural-language description, so later matching is easy, system complexity is low, and usage cost is relatively low.
(4) Image-based text-referred target detection: the model receives the user's instruction text and the image at the same time, performs semantic extraction and matching of text and image inside the model, and finally directly gives the position, within the image, of the object indicated by the user's instruction text. In this way, the user's interaction instruction is matched and located against the target interaction object directly, and the two steps S203 and S204 can be merged into one. This approach minimizes information loss across the whole process and can obtain a better effect, although the actual effect still depends on the model's quality and complexity. Meanwhile, this approach has the lowest system complexity and usage cost.
In practical applications, image content understanding is usually not implemented by choosing just one of the above methods; several of them may be combined according to the actual situation, and the embodiments of the present application place no restriction on this.
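For illustration only, the four options above can be abstracted as the following Kotlin interfaces; all names are assumptions for this sketch, each interface would be backed by its own model, and, as noted, several of them may be combined:

```kotlin
import android.graphics.Bitmap
import android.graphics.Rect

// Hypothetical abstraction over the four image-understanding options above.
interface ImageClassifier {
    fun labels(image: Bitmap): List<Pair<String, Float>>   // e.g. ("food", 0.5f)
}
interface ObjectDetector {
    fun objects(image: Bitmap): List<String>               // e.g. "person", "car"
}
interface ImageCaptioner {
    fun describe(image: Bitmap): String                    // natural-language caption
}
interface ReferringLocator {
    // Returns where the object named by the user's instruction sits, if found.
    fun locate(image: Bitmap, instruction: String): Rect?
}
```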
In some embodiments, for S204, the performing image control recognition according to image description text information corresponding to at least one image control according to the voice data, and determining a target image control in the at least one image control may include:
performing text conversion on the voice data to obtain voice text information;
semantic matching is carried out on the voice text information and the image description text information corresponding to at least one image control, and a semantic similarity value corresponding to at least one image control is determined;
and determining a target image control according to the semantic similarity value.
In a specific embodiment, the determining the target image control according to the semantic similarity value may include: and determining the image control corresponding to the maximum similarity value in the semantic similarity values as a target image control.
Specifically, after the semantic similarity value corresponding to at least one image control is obtained, the maximum similarity value is selected from the semantic similarity values, and the image control corresponding to the maximum similarity value is determined as the target image control.
That is, after obtaining the voice text information and the image description text information corresponding to the at least one image control, semantic matching may be performed between the voice text information and each image control's image description text information. For example, a semantic similarity value for each image control may be computed by a traditional text matching method (such as TF-IDF, BM25, SimHash, or Jaccard similarity) or by a semantic matching model trained with a neural network; then the image control with the most similar semantics (that is, the one corresponding to the largest similarity value) is selected as the target image control.
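As a minimal sketch of this matching step, the Kotlin code below uses Jaccard similarity over word sets, one of the traditional methods named above; the helper names are assumptions, and TF-IDF, BM25, SimHash, or a trained semantic matching model would be drop-in replacements:

```kotlin
// Jaccard similarity over word sets. Note: whitespace/word tokenization is a
// simplification; Chinese text would need character- or segmenter-based
// tokenization instead.
fun jaccard(a: String, b: String): Double {
    val wa = a.lowercase().split(Regex("\\W+")).filter { it.isNotBlank() }.toSet()
    val wb = b.lowercase().split(Regex("\\W+")).filter { it.isNotBlank() }.toSet()
    if (wa.isEmpty() || wb.isEmpty()) return 0.0
    return (wa intersect wb).size.toDouble() / (wa union wb).size
}

// Pick the control whose image description text is semantically closest to
// the user's voice text; returns null if there are no candidates.
fun <T> pickTarget(
    voiceText: String,
    candidates: List<T>,
    descriptionOf: (T) -> String
): T? = candidates.maxByOrNull { jaccard(voiceText, descriptionOf(it)) }
```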
S205: and determining an operation instruction according to the voice data, and sending the operation instruction to the target image control so as to realize voice control of the target image control.
It should be noted that after the target image control is determined, the electronic device may determine an operation instruction according to the voice data of the user, and then send the operation instruction to the target image control, so as to perform a corresponding operation (click, long press, and the like), thereby completing the voice interaction. Here, the operation instruction is determined from voice data input by the user.
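A minimal sketch of this dispatch step via the Android accessibility API follows; ACTION_CLICK and ACTION_LONG_CLICK are real accessibility actions, while the operation-name strings mapping voice intents to actions are assumptions:

```kotlin
import android.view.accessibility.AccessibilityNodeInfo

// Dispatch the operation determined from the voice data to the target image
// control. Returns whether the action was successfully performed.
fun perform(target: AccessibilityNodeInfo, operation: String): Boolean =
    when (operation) {
        "click" -> target.performAction(AccessibilityNodeInfo.ACTION_CLICK)
        "long_press" -> target.performAction(AccessibilityNodeInfo.ACTION_LONG_CLICK)
        else -> false   // other operations (scroll, text input, ...) omitted here
    }
```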
In brief, the technical solutions of the embodiments of the present application provide an image content understanding method for graphical interface elements, and further a voice interaction control method based on image content understanding. These solutions therefore require no adaptation of controls or applications for voice control, which saves development and rollout cost and makes them convenient for users. Thus, for the picture-type controls shown in fig. 1, when the user cannot, or finds it inconvenient to, describe the target control through a text description, the voice interaction mode based on image content understanding provided here can match the user's interaction intention to the interaction target by understanding the user's voice data and the image content, achieving the purpose of voice interaction. Still taking fig. 1 as an example, when the voice data input by the user is "open the video containing a car", the corresponding second video in the first row, which contains a car, may be opened; or when the voice data input by the user is "open the video containing a treasure box", the second video in the third row, which contains a treasure box, may be opened. These technical solutions provide a more natural, more intelligent, and more intuitive interaction mode for the user, thereby bringing a better experience.
This embodiment provides a voice control method: receiving voice data input by a user; determining at least one image control according to the current graphical interface; performing image content understanding on the at least one image control to obtain image description text information corresponding to the at least one image control; performing image control identification according to the voice data and the image description text information corresponding to the at least one image control, and determining a target image control among the at least one image control; and determining an operation instruction according to the voice data and sending the operation instruction to the target image control, so as to implement voice control of the target image control. In this way, the voice interaction mode based on image content understanding requires no adaptation between the application and voice interaction, which saves development cost, makes description easier for the user, effectively improves convenience when the user controls by voice, and thus better achieves the purposes of voice interaction and control.
In another embodiment of the present application, based on the same inventive concept as the foregoing embodiment, referring to fig. 4, a detailed flowchart of a voice control method provided in an embodiment of the present application is shown. As shown in fig. 4, the detailed flow may include:
s301: and acquiring voice text information corresponding to the user voice data.
S302: and acquiring the information View Tree of the graphical interface elements.
S303: and querying all the controls meeting the requirements from the View Tree to form a candidate control set.
S304: and screening the sizes of the controls in the candidate control set to obtain at least one image control.
S305: and understanding the image content of the at least one image control to obtain image description text information corresponding to the at least one image control.
S306: and performing semantic matching on the voice text information and the image description text information corresponding to at least one image control, and determining a target image control corresponding to the maximum semantic similarity value.
S307: and sending an operation instruction to the target image control to realize the voice interaction between the user and the target image control.
It should be noted that the operation instruction is determined based on the user's voice data. The voice interaction control method based on image content understanding provided by this embodiment mainly includes: during the user's voice interaction, first obtain the instruction text of the user's voice interaction; second, obtain the current graphical interface element information, the View Tree; next, query the View Tree for all controls meeting the requirements, that is, all image controls whose class attribute suffix is ImageView or one of the unconventional types (such as FrameLayout, LinearLayout, RelativeLayout, and View); then screen the found image controls by size, where a retained image control's length and width must each be greater than a certain size (such as 100 dp) and its aspect ratio must be less than 3; then perform image content understanding on the screened image controls and generate each image control's image description text; then perform semantic matching between the instruction text and the image description texts, taking the image control with the most similar semantics as the target image control; finally, the target image control executes the user's operation instruction (click, long press, and the like) to complete the user's voice interaction.
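The flow S301 to S307 can be summarized as the following Kotlin sketch. Every dependency is passed in as a function so the sketch stays self-contained; all parameter names are assumptions for illustration, not a platform API:

```kotlin
// End-to-end sketch of the flow above, parameterized over its building blocks.
fun <C> handleVoiceCommand(
    audio: ByteArray,
    speechToText: (ByteArray) -> String,      // S301: voice data -> instruction text
    imageControls: () -> List<C>,             // S302-S304: query View Tree + screen
    describe: (C) -> String,                  // S305: image content -> description
    similarity: (String, String) -> Double,   // S306: semantic matching
    perform: (C) -> Boolean                   // S307: send the operation instruction
): Boolean {
    val instructionText = speechToText(audio)
    val target = imageControls()
        .maxByOrNull { similarity(instructionText, describe(it)) }
        ?: return false                       // no candidate image control found
    return perform(target)
}
```

Wired up with the earlier illustrative helpers (leaf collection, size screening, Jaccard matching, accessibility actions), this one function covers the whole interaction round trip.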
In this way, the embodiments of the present application provide a voice interaction and control method based on image content understanding, aimed at situations where, when the user operates graphical interface content by voice, an interface element's description text is missing or is inconvenient for the user to state directly. With the technical solutions of the embodiments of the present application, a user can directly say "I want to see the video of Rou C" or "open the video containing a car", so that the video containing "Rou C" or a "car" is matched and located, achieving the purposes of interface element control and interaction. These solutions require no adaptation of controls or applications for voice control, which saves development cost and makes them easy to popularize and use; meanwhile, the interaction mode conforms to the user's interaction habits, effectively improves convenience when the user controls by voice, makes description easier for the user, saves the user's time, and improves the user's experience.
Further, according to the technical scheme, the current graphical interface element information View Tree needs to be acquired from the system, and then the image control is screened from the View Tree for understanding the image content. But some platforms or systems may have difficulty acquiring the View Tree. Therefore, the embodiment of the application can also realize the understanding of the image content through the screenshot of the current graphical interface, thereby realizing the purposes of voice interaction and control.
Referring to fig. 5, a detailed flowchart of another speech control method provided in the embodiment of the present application is shown. As shown in fig. 5, the detailed flow may include:
s401: and acquiring voice text information corresponding to the user voice data.
S402: and (4) screenshot is carried out on the current graphical interface to obtain an image to be identified.
S403: and carrying out control detection on the image to be identified to obtain a plurality of candidate controls.
S404: and screening the sizes of the plurality of candidate controls to obtain at least one image control.
S405: and understanding the image content of the at least one image control to obtain image description text information corresponding to the at least one image control.
S406: and performing semantic matching on the voice text information and the image description text information corresponding to at least one image control, and determining a target image control corresponding to the maximum semantic similarity value.
S407: and sending an operation instruction to the target image control to realize the voice interaction between the user and the target image control.
It should be noted that the operation instruction is determined based on the user's voice data. The voice interaction control method based on image content understanding provided by this embodiment mainly includes: during the user's voice interaction, first obtain the instruction text of the user's voice interaction; second, obtain a screenshot of the current graphical interface; next, detect the controls (text, images, and the like) contained in the screenshot and their positions; then screen the detected controls by their positions and sizes, where a retained image control's length and width must each be greater than a certain size (such as 100 dp) and its aspect ratio must be less than 3; then perform image content understanding on the screened image controls and generate each image control's image description text; then perform semantic matching between the instruction text and the image description texts, taking the image control with the most similar semantics as the target image control; finally, the target image control executes the user's operation instruction (click, long press, and the like) to complete the user's voice interaction.
It should be further noted that, for the non-target controls in this technical solution, text description information may still be extracted through common icon recognition, character recognition, and the like. With this technical solution, the user's voice control and matching can be achieved relying only on the image, realizing the purposes of user interaction and control; the solution is simple to implement and easy to popularize.
Further, in the above technical solution, the image-description-plus-text-semantic-matching pipeline incurs a certain information loss along the way and may not achieve the best effect and performance. Therefore, the embodiments of the present application can also combine the image and the text instruction, using image-based text-referred target detection to directly match the user's interaction target, thereby realizing the purposes of user interaction and control.
Referring to fig. 6, a detailed flowchart of another voice control method provided in the embodiment of the present application is shown. As shown in fig. 6, the detailed flow may include:
s501: and acquiring voice text information corresponding to the user voice data.
S502: and (4) screenshot is carried out on the current graphical interface to obtain an image to be identified.
S503: and performing target detection on the image to be recognized according to the voice text information, and determining a target image control.
S504: and sending an operation instruction to the target image control to realize the voice interaction between the user and the target image control.
It should be noted that, in this embodiment of the present application, after receiving voice data input by a user, specifically, the method may include: screenshot is carried out on the current graphical interface to obtain an image to be recognized; performing text conversion on the voice data to obtain voice text information; performing target detection on the image to be recognized according to the voice text information, and determining a target image control; and determining an operation instruction according to the voice data, and sending the operation instruction to the target image control so as to realize voice control of the target image control.
That is, the operation instruction is determined based on the user's voice data. The voice interaction control method based on image content understanding provided by this embodiment mainly includes: during the user's voice interaction, first obtain the instruction text of the user's voice interaction; second, obtain a screenshot of the current graphical interface; next, use image-based text-referred target detection to match and locate the user's interaction target and determine the target image control; finally, the target image control executes the user's operation instruction (click, long press, and the like) to complete the interaction process.
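A compact Kotlin sketch of this flow (S501 to S504) might look as follows; the locator stands for the image-based text-referred detection model, and both it and the tap dispatcher are assumptions passed in as functions:

```kotlin
import android.graphics.Bitmap
import android.graphics.Rect

// Sketch of the screenshot + text-referred detection flow. The model call and
// the gesture dispatch are left abstract; names are illustrative only.
fun handleByReferringDetection(
    instructionText: String,               // S501: text of the voice command
    screenshot: Bitmap,                    // S502: capture of the current GUI
    locator: (Bitmap, String) -> Rect?,    // S503: text-referred detection model
    tapAt: (x: Int, y: Int) -> Boolean     // S504: dispatch, e.g. via a gesture
): Boolean {
    val box = locator(screenshot, instructionText) ?: return false
    return tapAt(box.centerX(), box.centerY())   // operate the located control
}
```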
Therefore, this technical solution can achieve the user's voice control and matching relying only on the image, realizing the purposes of user interaction and control. With this solution, the user's interaction instruction need not be constrained, and the user can describe the interaction object (not limited to pictures; icons and the like also work) in a relatively free way, such as "the blue button in the upper right corner", "I want to watch a car video", "select the plus button below", or "click the second button at the bottom", without having to follow the interface text descriptions or predefined instructions. This solution provides a more natural and more intelligent voice interaction and control mode for the user; it is simple to implement and easy to popularize; meanwhile, its system complexity is low, which facilitates on-device implementation and deployment.
The embodiment provides a voice control method, and the specific implementation of the foregoing embodiment is elaborated through the foregoing embodiment, and it can be seen that through the technical scheme of the foregoing embodiment, not only can the development cost be saved, but also the convenience when the user uses voice control can be improved, thereby better achieving the purpose of voice interaction and control.
In another embodiment of the present application, based on the same inventive concept as the previous embodiments, referring to fig. 7, a schematic structural diagram of a voice control apparatus 60 provided in an embodiment of the present application is shown. As shown in fig. 7, the voice control apparatus 60 may include: a receiving unit 601, a determining unit 602, an analysis unit 603, and a sending unit 604, wherein:
a receiving unit 601 configured to receive voice data input by a user;
a determining unit 602, configured to determine at least one image control according to a current graphical interface;
the analyzing unit 603 is configured to perform image content understanding on the at least one image control to obtain image description text information corresponding to the at least one image control; and is further configured to perform image control recognition according to the voice data and the image description text information corresponding to the at least one image control, and to determine a target image control among the at least one image control;
a determining unit 602, further configured to determine an operation instruction according to the voice data;
the sending unit 604 is configured to send an operation instruction to the target image control to implement voice control on the target image control.
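As one hedged illustration of how the sending unit 604 might execute a click or long press, the sketch below synthesizes a tap gesture at a screen coordinate (for example, the centre of the target control's bounding box) from inside an Android AccessibilityService. GestureDescription and dispatchGesture are standard Android APIs (API level 24+); treating the operation instruction as a synthesized gesture is an assumption, since the patent does not specify the dispatch mechanism.

```kotlin
// Sketch (assumption): executing the operation instruction as a gesture
// dispatched from an AccessibilityService.
import android.accessibilityservice.AccessibilityService
import android.accessibilityservice.GestureDescription
import android.graphics.Path

fun AccessibilityService.tapAt(x: Float, y: Float, longPress: Boolean = false) {
    val path = Path().apply { moveTo(x, y) }
    val durationMs = if (longPress) 600L else 50L   // long press vs. short tap
    val stroke = GestureDescription.StrokeDescription(path, 0L, durationMs)
    dispatchGesture(GestureDescription.Builder().addStroke(stroke).build(), null, null)
}
```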
In some embodiments, the determining unit 602 is further configured to determine graphical interface element information corresponding to the current graphical interface, and to determine the at least one image control according to the graphical interface element information.
In some embodiments, referring to fig. 7, the voice control apparatus 60 may further include a calling unit 605 configured to call system underlying code information to acquire the graphical interface element information, or to call a system auxiliary service function interface to acquire the graphical interface element information.
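Assuming the system auxiliary service function interface corresponds to an accessibility service such as Android's AccessibilityService (a plausible reading, though the patent does not name a platform), the graphical interface element information could be gathered roughly as follows. The ElementInfo type is illustrative, not part of any API.

```kotlin
// Sketch: collecting graphical interface element information by walking the
// accessibility node tree; assumes execution inside an AccessibilityService.
import android.accessibilityservice.AccessibilityService
import android.graphics.Rect
import android.view.accessibility.AccessibilityNodeInfo

data class ElementInfo(val className: String, val bounds: Rect)

fun AccessibilityService.collectElements(): List<ElementInfo> {
    val elements = mutableListOf<ElementInfo>()
    fun walk(node: AccessibilityNodeInfo?) {
        if (node == null) return
        val bounds = Rect().also { node.getBoundsInScreen(it) }  // on-screen position and size
        elements += ElementInfo(node.className?.toString() ?: "", bounds)
        for (i in 0 until node.childCount) walk(node.getChild(i))
    }
    walk(rootInActiveWindow)  // root node of the current graphical interface
    return elements
}
```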
In some embodiments, the determining unit 602 is further configured to query, from the graphical interface element information, the controls whose class attribute suffix is a preset type to form a candidate control set, and to perform size screening on the controls in the candidate control set to obtain the at least one image control.
In some embodiments, the preset type includes at least one of: ImageView, FrameLayout, LinearLayout, RelativeLayout, and View.
In some embodiments, the determining unit 602 is further configured to determine, in the candidate control set, whether the length and width of each control satisfy a preset size condition, and to determine the controls whose length and width satisfy the preset size condition as the image controls.
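Putting the class-suffix query and the size screening together, a minimal sketch could look like the following, reusing the illustrative ElementInfo type from the sketch above. The pixel thresholds are assumptions for illustration; the patent only requires that the length and width satisfy a preset size condition.

```kotlin
// Sketch: filtering candidate controls by class-name suffix, then by size.
// The 48..800 pixel range is an assumed preset size condition.
val presetSuffixes = listOf(
    "ImageView", "FrameLayout", "LinearLayout", "RelativeLayout", "View"
)

fun filterImageControls(elements: List<ElementInfo>): List<ElementInfo> =
    elements
        .filter { e -> presetSuffixes.any { e.className.endsWith(it) } }              // class-suffix query
        .filter { e -> e.bounds.width() in 48..800 && e.bounds.height() in 48..800 }  // size screening
```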
In some embodiments, referring to fig. 7, the voice control apparatus 60 may further include a detecting unit 606 configured to take a screenshot of the current graphical interface to obtain an image to be recognized; to perform control detection on the image to be recognized and form a candidate control set from the plurality of detected controls; and to perform size screening on the controls in the candidate control set to obtain the at least one image control.
In some embodiments, the analyzing unit 603 is specifically configured to perform text conversion on the voice data to obtain voice text information; to perform semantic matching between the voice text information and the image description text information corresponding to the at least one image control, and determine a semantic similarity value corresponding to the at least one image control; and to determine the target image control according to the semantic similarity value.
In some embodiments, the determining unit 602 is further configured to determine the image control corresponding to the largest value among the semantic similarity values as the target image control.
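A hedged sketch of this matching step follows. Here describeImage (an image-captioning model that produces the image description text) and embed (a text encoder) are hypothetical placeholders; cosine similarity with argmax selection mirrors the embodiment's description, but the patent does not fix a particular similarity measure.

```kotlin
// Sketch: choosing the target control as the argmax of semantic similarity
// between the voice text and each control's image description text.
import kotlin.math.sqrt

fun cosine(a: FloatArray, b: FloatArray): Float {
    var dot = 0f; var na = 0f; var nb = 0f
    for (i in a.indices) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i] }
    return dot / (sqrt(na) * sqrt(nb))
}

fun selectTarget(
    voiceText: String,
    controls: List<ElementInfo>,
    describeImage: (ElementInfo) -> String,  // hypothetical captioning model
    embed: (String) -> FloatArray            // hypothetical text encoder
): ElementInfo? {
    val query = embed(voiceText)
    return controls.maxByOrNull { cosine(query, embed(describeImage(it))) }
}
```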
In some embodiments, the detecting unit 606 is further configured to take a screenshot of the current graphical interface to obtain an image to be recognized; to perform text conversion on the voice data to obtain voice text information; to perform target detection on the image to be recognized according to the voice text information and determine the target image control; and to determine an operation instruction according to the voice data and send the operation instruction to the target image control, so as to implement voice control of the target image control.
It is understood that, in this embodiment, a "unit" may be part of a circuit, part of a processor, part of a program or software, and so on; it may also be a module, or it may be non-modular. Moreover, the components in this embodiment may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional module.
Based on such understanding, the technical solution of this embodiment, in essence, or the part thereof contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the method of this embodiment. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Accordingly, the present embodiments provide a computer storage medium storing a computer program which, when executed by at least one processor, performs the steps of the method of any of the preceding embodiments.
Based on the above components of the voice control apparatus 60 and the computer storage medium, refer to fig. 8, which shows a schematic structural diagram of an electronic device provided in an embodiment of the present application. As shown in fig. 8, the electronic device 70 may include: a communication interface 701, a memory 702, and a processor 703, with the components coupled together by a bus system 704. It is understood that the bus system 704 is used to enable connection and communication among these components. In addition to a data bus, the bus system 704 includes a power bus, a control bus, and a status signal bus; for clarity of illustration, however, the various buses are all labeled as bus system 704 in fig. 8. The communication interface 701 is used for receiving and sending signals when transceiving information with other external network elements;
a memory 702 for storing a computer program capable of running on the processor 703;
a processor 703, configured to perform the following steps when running the computer program:
receiving voice data input by a user;
determining at least one image control according to the current graphical interface;
performing image content understanding on the at least one image control to obtain image description text information corresponding to the at least one image control;
performing image control recognition according to the voice data and the image description text information corresponding to the at least one image control, and determining a target image control among the at least one image control;
and determining an operation instruction according to the voice data, and sending the operation instruction to the target image control so as to realize voice control of the target image control.
It will be appreciated that the memory 702 in this embodiment of the present application may be a volatile memory or a nonvolatile memory, or may include both volatile and nonvolatile memories. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory 702 of the systems and methods described herein is intended to include, without being limited to, these and any other suitable types of memory.
The processor 703 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be completed by hardware integrated logic circuits in the processor 703 or by instructions in the form of software. The processor 703 may be a general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, or it may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 702, and the processor 703 reads the information in the memory 702 and completes the steps of the above method in combination with its hardware.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the Processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Optionally, as another embodiment, the processor 703 is further configured to, when running the computer program, perform the steps of the method of any of the preceding embodiments.
Based on the above components of the voice control apparatus 60 and the computer storage medium, refer to fig. 9, which shows a schematic structural diagram of another electronic device provided in an embodiment of the present application. As shown in fig. 9, the electronic device 70 may include the voice control apparatus 60 described in any of the previous embodiments.
In this embodiment of the present application, for the electronic device 70, the voice interaction mode based on image content understanding requires no adaptation between applications and voice interaction, so that not only can development costs be saved, but the user can also describe interaction targets conveniently, which effectively improves the convenience of voice operation, thereby better achieving the purposes of voice interaction and control.
It should be noted that, in the present application, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a/an …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.
Features disclosed in several of the product embodiments provided in the present application may be combined in any combination to yield new product embodiments without conflict.
The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
The above description covers only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any changes or substitutions that a person skilled in the art can readily conceive of within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (13)

1. A method for voice control, the method comprising:
receiving voice data input by a user;
determining at least one image control according to the current graphical interface;
understanding the image content of the at least one image control to obtain image description text information corresponding to the at least one image control;
performing image control identification according to the voice data and the image description text information corresponding to the at least one image control, and determining a target image control in the at least one image control;
and determining an operation instruction according to the voice data, and sending the operation instruction to the target image control so as to realize voice control on the target image control.
2. The method of claim 1, wherein the determining at least one image control according to the current graphical interface comprises:
determining graphical interface element information corresponding to the current graphical interface;
and determining the at least one image control according to the graphical interface element information.
3. The method of claim 2, wherein the determining graphical interface element information corresponding to the current graphical interface comprises:
calling system underlying code information to acquire the graphical interface element information; or
calling a system auxiliary service function interface to acquire the graphical interface element information.
4. The method of claim 2, wherein the determining the at least one image control according to the graphical interface element information comprises:
querying, from the graphical interface element information, controls whose class attribute suffix is a preset type to form a candidate control set;
and performing size screening on the controls in the candidate control set to obtain the at least one image control.
5. The method of claim 4, wherein the preset type comprises at least one of: ImageView, FrameLayout, LinearLayout, RelativeLayout, and View.
6. The method of claim 4, wherein the performing size screening on the controls in the candidate control set to obtain the at least one image control comprises:
determining, in the candidate control set, whether the length and width of each control satisfy a preset size condition, and determining the controls whose length and width satisfy the preset size condition as the image controls.
7. The method of claim 1, wherein the determining at least one image control according to the current graphical interface comprises:
taking a screenshot of the current graphical interface to obtain an image to be recognized;
performing control detection on the image to be recognized, and forming a candidate control set from a plurality of detected controls;
and performing size screening on the controls in the candidate control set to obtain the at least one image control.
8. The method according to any one of claims 1 to 7, wherein the performing image control recognition according to the voice data and the image description text information corresponding to the at least one image control, and determining a target image control in the at least one image control, comprises:
performing text conversion on the voice data to obtain voice text information;
performing semantic matching between the voice text information and the image description text information corresponding to the at least one image control, and determining a semantic similarity value corresponding to the at least one image control;
and determining the target image control according to the semantic similarity value.
9. The method of claim 8, wherein determining the target image control according to the semantic similarity value comprises:
and determining the image control corresponding to the largest value among the semantic similarity values as the target image control.
10. The method of claim 1, wherein after the receiving voice data input by a user, the method further comprises:
taking a screenshot of the current graphical interface to obtain an image to be recognized;
performing text conversion on the voice data to obtain voice text information;
performing target detection on the image to be recognized according to the voice text information, and determining a target image control;
and determining an operation instruction according to the voice data, and sending the operation instruction to the target image control so as to realize voice control on the target image control.
11. A voice control apparatus, characterized by comprising a receiving unit, a determining unit, an analyzing unit, and a sending unit; wherein:
the receiving unit is configured to receive voice data input by a user;
the determining unit is configured to determine at least one image control according to the current graphical interface;
the analyzing unit is configured to perform image content understanding on the at least one image control to obtain image description text information corresponding to the at least one image control; and is further configured to perform image control recognition according to the voice data and the image description text information corresponding to the at least one image control, and to determine a target image control among the at least one image control;
the determining unit is further configured to determine an operation instruction according to the voice data;
the sending unit is configured to send the operation instruction to the target image control so as to implement voice control on the target image control.
12. An electronic device, characterized by comprising a memory and a processor; wherein:
the memory is configured to store a computer program capable of running on the processor;
and the processor is configured to perform the method of any one of claims 1 to 10 when running the computer program.
13. A computer storage medium, characterized in that the computer storage medium stores a computer program which, when executed by at least one processor, implements the method of any one of claims 1 to 10.