WO2024066067A1 - Method, medium and electronic device for locating a target element on an interface - Google Patents

Method, medium and electronic device for locating a target element on an interface

Info

Publication number
WO2024066067A1
Authority
WO
WIPO (PCT)
Prior art keywords
interface
target
node
elements
structure tree
Application number
PCT/CN2022/138765
Other languages
English (en)
French (fr)
Inventor
杭天欣
康佳慧
高煜光
张泉
Original Assignee
北京弘玑信息技术有限公司
上海弘玑信息技术有限公司
Application filed by 北京弘玑信息技术有限公司 and 上海弘玑信息技术有限公司
Publication of WO2024066067A1

Classifications

    • G06F 3/0481 Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • G06F 3/04842 Selection of displayed objects or displayed text elements
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/422 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation, for representing the structure of the pattern or shape of an object
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/82 Image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 2201/07 Target detection

Definitions

  • The present application relates to the field of robotic process automation. Specifically, embodiments of the present application relate to a method, medium, and electronic device for locating a target element on an interface.
  • RPA: Robotic Process Automation.
  • For example, the software robot needs to accurately identify the position and semantics of a button (as an example of a target element) before clicking it.
  • The accuracy of related technologies depends on the combined accuracy of multiple models such as target detection, template matching, and OCR (Optical Character Recognition).
  • Since each model depends on the accuracy of the upstream model, the error rates multiply, which leads to a low success rate for the software robot.
  • Because the related technology chains too many modules in series (each corresponding to its own neural network model), the execution speed of the software robot also decreases.
  • In addition, the software robot's search for certain elements in the interface depends on the semantic information given by OCR, so it is poorly robust to changes in language version or in color and shape.
  • The purpose of the embodiments of the present application is to provide a method, medium, and electronic device for locating a target element on an interface.
  • Some embodiments of the present application approach the problem from the perspective of interface structuring: the detected elements are structurally parsed to obtain an element structure tree (i.e., a multi-branch tree of the element structure), so that the software robot does not have to decide which button to select based on cumbersome OCR results or image semantic information; instead, it uses the structural relationships between elements, maps them onto the actual image (i.e., the image corresponding to the interface to be operated), finds the position of the corresponding target element (e.g., a button), and completes a click or another type of operation.
  • In a first aspect, an embodiment of the present application provides a method for locating a target element on an interface, the method comprising: acquiring the structural relationship between at least some elements on the interface to be operated to obtain an element structure tree to be matched; and determining the position of the target element from the interface to be operated based at least on a reference element structure tree and the element structure tree to be matched, so as to complete the operation on the target element; wherein the reference element structure tree is used to characterize the structural relationship between at least some elements on a reference interface, the structural relationship is obtained by performing structured analysis on the elements of the corresponding interface, and the corresponding interface includes the reference interface and the interface to be operated.
  • Some embodiments of the present application enable the software robot to select target elements without relying on cumbersome OCR results or image semantic information; instead, the robot uses the structural relationships between elements, maps them onto the actual image (i.e., the image corresponding to the interface to be operated), finds the position of the corresponding target element, and completes a click or another type of operation, thereby improving the accuracy of the results.
  • In some embodiments, before determining the position of the target element from the interface to be operated based at least on the reference element structure tree and the element structure tree to be matched, the method further includes: acquiring the structural relationship between at least some of the elements on the reference interface to obtain the reference element structure tree.
  • That is, before the robot operates the interface to be operated, it is necessary to first obtain the element structure tree of the standard (reference) interface, so that the position of the target element on the interface to be operated can be found based on the reference element structure tree and the element structure tree to be matched.
  • In some embodiments, the structured analysis is a classification of at least some of the elements based on element logical relationships and element spatial distance relationships; the reference element structure tree and the element structure tree to be matched are used to represent the position of the target common ancestor node of any two nodes.
  • Some embodiments of the present application obtain an element structure tree by constructing a common ancestor between nodes that are spatially close and have the same logical relationship and marking the position of the ancestor (for example, using a rectangular box to mark the position of the ancestor).
  • the element structure tree can fully characterize the structural relationship between elements on the interface, thereby improving the accuracy of locating the target element based on the structural relationship.
  • The target common ancestor node is the nearest common ancestor node encountered when searching upward for ancestor nodes from each of the two nodes.
  • obtaining the structural relationship between at least some elements on the interface to be operated includes: inputting the image of the interface to be operated into a target element detection model, obtaining element attribute information and target semantic features of all elements detected from the image of the interface to be operated, wherein the element attribute information includes: at least one of element position and element category, and the target semantic features are semantic features of the areas where all elements are located; constructing an initial structure graph according to a distance composition algorithm and the attribute information of all elements, wherein the initial structure graph includes multiple nodes, each node is used to represent an element, and the feature of each node is represented by the element attribute information; inputting the initial structure graph into a target graph neural network model, and obtaining the element structure tree to be matched at least according to the target graph neural network model, wherein the element structure tree to be matched includes the multiple nodes and ancestor nodes corresponding to at least some of the nodes.
  • Some embodiments of the present application detect the element attribute information and local semantic features of all elements existing on the image of the interface to be operated through a target element detection model, and then construct an element structure tree through a target graph neural network model to obtain the structural relationship of each element, so that the technical solution for finding the position of the target element on the interface to be operated relies on the structural relationship to find, thereby reducing the complexity of the technical solution while improving the accuracy of the search results.
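As a rough illustration of how these pieces chain together, the sketch below wires the pipeline in Python. The `Element` record and the three callables (`detector`, `compose_graph`, `gnn`) are hypothetical stand-ins invented for this sketch; the patent does not prescribe concrete interfaces for the two models.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical record for one detected element, mirroring the three outputs
# the target element detection model is described as producing.
@dataclass
class Element:
    box: tuple          # element position: (x1, y1, x2, y2) on the interface image
    category: str       # element category, e.g. "button", "textbox", "link"
    feature: Sequence   # local semantic feature of the area where the element sits

def analyze_interface(image,
                      detector: Callable,       # stand-in for the target element detection model
                      compose_graph: Callable,  # stand-in for the distance composition algorithm
                      gnn: Callable):           # stand-in for the target graph neural network model
    """Sketch of S101: image -> detected elements -> initial structure graph
    -> element structure tree (to be matched against the reference tree)."""
    elements = detector(image)          # step 1: element attributes + semantic features
    graph = compose_graph(elements)     # step 2: nodes + distance-based edges
    return gnn(graph)                   # step 3: predicted element structure tree
```

The same function can be run on both the reference interface image and the image of the interface to be operated, yielding the reference element structure tree and the element structure tree to be matched, respectively.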
  • the image of the interface to be operated is input into a target element detection model to obtain element attribute information and target semantic features of all elements detected from the image of the interface to be operated, including: obtaining overall image semantic features through a backbone network included in the target element detection model, wherein the backbone network is a feature extraction network; extracting local semantic features corresponding to each element included in all the elements from the overall image semantic features, and using all the obtained local semantic features as the target semantic features.
  • Some embodiments of the present application obtain the semantic features of the entire image through the target element detection network and extract from them the local semantic features of each element.
  • Using these features to characterize each node not only improves the accuracy of the node features, but also reduces the amount of data processing and improves the data processing speed.
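The patent does not name the operator used to cut per-element local features out of the backbone's overall feature map. The sketch below uses torchvision's RoI Align purely as one plausible way to do it; the stride, output size, and toy tensors are illustrative assumptions.

```python
import torch
from torchvision.ops import roi_align

def local_semantic_features(feature_map: torch.Tensor,
                            boxes_px: torch.Tensor,
                            stride: int = 16) -> torch.Tensor:
    """feature_map: (1, C, H, W) backbone output for the whole interface image.
    boxes_px:    (K, 4) detected element boxes in pixels, as (x1, y1, x2, y2).
    Returns one local semantic feature vector per element, shape (K, C)."""
    pooled = roi_align(feature_map, [boxes_px], output_size=(7, 7),
                       spatial_scale=1.0 / stride, aligned=True)
    return pooled.mean(dim=(2, 3))  # average-pool each 7x7 region to a vector

# Toy usage: a 1024-channel feature map for a 640x960 image at stride 16.
fm = torch.randn(1, 1024, 40, 60)
boxes = torch.tensor([[32., 48., 256., 96.], [300., 400., 380., 440.]])
print(local_semantic_features(fm, boxes).shape)  # torch.Size([2, 1024])
```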
  • In some embodiments, before inputting the image of the interface to be operated into the target element detection model, the method further includes: acquiring N original interface images; marking the area where each element is located and the category of each element on each of the N original interface images to obtain N element-annotated images, wherein the area where each element is located is marked with a rectangular frame, and the categories include at least one of a scroll bar, an editable input box, text, a hyperlink, a bordered image, a button, a mark, a window, and a pop-up window; and training the element detection model according to the N original interface images and the N element-annotated images to obtain the target element detection model.
  • Some embodiments of the present application mark the location and category of elements on each training image so that the target element detection network obtained after training has the function of predicting this information on the input image.
  • In some embodiments, before inputting the initial structure graph into the target graph neural network model, the method further includes: marking at least one aggregation area on each of the N element-annotated images and marking the level of each aggregation area in the element structure tree, to obtain N images annotated with ancestor-node positions and layer numbers, wherein an aggregation area includes the area where one or more elements are located, one aggregation area corresponds to one common ancestor node, and the aggregation area is used to characterize the position of that common ancestor node; and training the graph neural network at least based on the N images annotated with ancestor-node positions and layer numbers to obtain the target graph neural network model.
  • Some embodiments of the present application further annotate the location information of the common ancestor nodes of adjacent elements on N element annotated images, so that the trained target graph neural network model has the function of predicting the location of the common ancestor nodes between nodes on the input image.
  • In some embodiments, marking at least one aggregation region on each of the N element-annotated images and marking the level of each aggregation region in the element structure tree includes: aggregating one or more elements on each element-annotated image according to a preset element logical relationship and a preset element spatial distance relationship, marking an initial aggregation region around the area where all the aggregated elements are located and tagging it with a first identifier; then aggregating at least one of the initial aggregation regions according to the same preset relationships to obtain a second aggregation region, marking it and tagging it with a second identifier; and so on, until an Nth aggregation region including all the elements on the element-annotated image is obtained, marking it and tagging it with an Nth identifier, wherein the Nth aggregation region corresponds to the root node of the element structure tree and each identifier marks the level of its aggregation region in the tree.
  • Some embodiments of the present application further annotate multiple levels of aggregation areas on each element annotation map as annotation data for training the graph neural network model.
  • the annotated aggregation areas can reflect the subordinate relationships of the elements on the original interface image.
  • Such annotation data enables the trained target graph neural network model to have the ability to mine the subordinate relationships, i.e., structural relationships, of the elements on the interface image.
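To make the annotation scheme concrete, here is one hypothetical record for a single element-annotated image. The field names and the two-level nesting are invented for illustration; they follow the description above, with rectangular element boxes plus leveled aggregation areas, the topmost of which covers all elements and corresponds to the root node.

```python
# Hypothetical annotation for one interface image (all field names invented).
annotation = {
    "elements": [
        {"id": 0, "box": [40, 120, 300, 160], "category": "textbox"},  # username
        {"id": 1, "box": [40, 180, 300, 220], "category": "textbox"},  # password
        {"id": 2, "box": [40, 240, 300, 280], "category": "button"},   # login
    ],
    "aggregations": [
        # Level 1: the two input boxes are close together and share a function,
        # so they aggregate under one common ancestor node.
        {"box": [35, 110, 305, 230], "level": 1, "members": [0, 1]},
        # Level 2: the whole login form; here this is the topmost aggregation
        # area, corresponding to the root node of the element structure tree.
        {"box": [30, 100, 310, 290], "level": 2, "members": [0, 1, 2]},
    ],
}
```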
  • In some embodiments, before inputting the initial structure graph into the target graph neural network model, the method further includes: obtaining, through the target element detection model, a prediction result corresponding to each of the N original interface images, wherein the prediction result includes the predicted element attribute information and a second semantic feature of all elements detected on any original interface image, the predicted element attribute information includes at least one of the element position and the element category, and the second semantic feature is the local semantic feature of each element among all elements detected on that original interface image; obtaining a predicted initial structure graph corresponding to the original interface image according to the predicted element attribute information and a distance composition algorithm, wherein the predicted initial structure graph includes a plurality of second nodes; obtaining the features of each second node on the predicted initial structure graph according to the prediction result, and obtaining an input feature vector according to those features; and obtaining the target graph neural network model by training the graph neural network at least according to the input feature vector and the N images annotated with ancestor-node positions and layer numbers.
  • Some embodiments of the present application also need to obtain the input vectors as training data for the graph neural network. These data, together with the N images annotated with ancestor-node positions, are input into the graph neural network to complete the training of the network and obtain a target graph neural network model capable of constructing an element structure tree.
  • the characteristics of each second node on the predicted initial structure diagram are obtained based on the prediction results, including: taking the element position, element category and local semantic features corresponding to any second node as the characteristics of any second node, wherein the local semantic features corresponding to any second node are the semantic features of the area where any second node is located.
  • Some embodiments of the present application use element position (i.e., the coordinates of the element on the corresponding interface image), element category (for example, at least one of a scroll bar, an editable input box, text, a hyperlink, a bordered image, a button, a mark, a window, and a pop-up window) and local semantic features as the features of each node on the initial structure diagram.
  • the characteristics of each second node on the predicted initial structure diagram are obtained based on the prediction results, including: performing dimensionality reduction processing on the local semantic features corresponding to any second node to obtain reduced-dimensional local semantic features, wherein the local semantic features corresponding to any second node are the semantic features of the area where any second node is located; and using the element position, element category and the reduced-dimensional local area semantic features corresponding to any second node as the characteristics of any second node.
  • Using the reduced-dimension local semantic features as the features of each node on the initial structure graph can reduce the amount of data processing during training and improve the training speed.
  • In some embodiments, the dimensionality reduction processing is performed by a principal component analysis (PCA) dimensionality reduction algorithm.
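A minimal sketch of that step with scikit-learn's PCA follows. The 1024-dimensional input and 64-dimensional output are illustrative assumptions rather than values from the patent; in practice the PCA basis would be fit once on training features and reused at inference time.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# K detected elements, each with a C-dimensional local semantic feature.
local_feats = rng.standard_normal((500, 1024)).astype(np.float32)

pca = PCA(n_components=64)               # reduce each feature to 64 dimensions
reduced = pca.fit_transform(local_feats)
print(reduced.shape)                     # (500, 64)
```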
  • In some embodiments, the method further includes: marking the semantics of each element on the reference element structure tree to obtain a reference element semantic tree; and determining the position of the target element from the interface to be operated based at least on the reference element structure tree and the element structure tree to be matched includes: confirming that the structures of the reference element structure tree and the element structure tree to be matched are consistent; searching for the target node corresponding to the target element in the reference element semantic tree; searching for the element-position feature value of the node corresponding to the target node in the element structure tree to be matched; and obtaining the position of the target element on the interface to be operated based on that element-position feature value.
  • Some embodiments of the present application use the structural relationship of the interface and the semantic information of the elements on the interface to locate the target element (for example, a target button or a target edit box, etc.) on the interface to be operated, thereby further improving the accuracy of the positioning result.
  • In a second aspect, some embodiments of the present application provide a device for locating a target element on an interface, the device comprising: an element structure tree acquisition module, configured to acquire the structural relationship between at least some elements on the interface to be operated and obtain the element structure tree to be matched; and a positioning module, configured to determine the position of the target element on the interface to be operated based at least on a reference element structure tree and the element structure tree to be matched, so as to complete the operation on the target element; wherein the reference element structure tree is used to characterize the structural relationship between at least some elements on the reference interface, the structural relationship is obtained by performing structured analysis on the elements of the corresponding interface, and the corresponding interface includes the reference interface and the interface to be operated.
  • In a third aspect, some embodiments of the present application provide a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, can implement the method described in any embodiment of the first aspect.
  • In a fourth aspect, some embodiments of the present application provide an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, can implement the method described in any embodiment of the first aspect.
  • In a fifth aspect, some embodiments of the present application provide a computer program product comprising a computer program, wherein the computer program, when executed by a processor, can implement the method described in any embodiment of the first aspect.
  • In a sixth aspect, some embodiments of the present application provide a robot configured to execute the method described in any embodiment of the first aspect.
  • FIG. 1 is an image of an interface to be operated provided in an embodiment of the present application.
  • FIG. 2 is a flow chart of a method for locating a target element on an interface provided by an embodiment of the present application.
  • FIG. 3 shows a result of classifying some elements according to the element logical relationship, provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of the process of obtaining the element structure tree to be matched according to the target element detection model and the target graph neural network model, provided in an embodiment of the present application.
  • FIG. 5 is a diagram of an implementation model architecture of a robotic process automation process provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of training the element detection model to obtain the target element detection model, provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of the target element detection model processing an image of the interface to be operated, provided in an embodiment of the present application.
  • FIG. 8 is an architecture diagram for training the graph neural network model provided in an embodiment of the present application.
  • FIG. 9 is a block diagram of a device for locating a target element on an interface provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of the composition of an electronic device provided in an embodiment of the present application.
  • Robotic process automation technology can simulate an employee's daily computer operations performed with keyboard and mouse, and can replace humans in logging into systems, operating software, reading and writing data, downloading files, reading emails, and so on.
  • Automated robots, as a company's virtual labor force, can free employees from repetitive, low-value work and let them devote their energy to high-value-added work, allowing companies to reduce costs and increase benefits while transforming toward digital intelligence.
  • RPA is a software robot that replaces manual tasks in business processes and interacts with the front-end system of a computer in the same way a human would. RPA can therefore be regarded as a software robot program running on a personal computer or server: it imitates the operations a user performs on the computer and automatically repeats them in place of a human, quickly, accurately, and reliably, for tasks such as retrieving emails, downloading attachments, logging into systems, and processing and analyzing data.
  • RPA is a way of using "digital employees" to replace people in business operations and its related technologies.
  • RPA uses software automation technology to simulate people to achieve unmanned operation of objects such as computer systems, software, web pages and documents, obtain business information, perform business actions, and ultimately achieve process automation, labor cost savings and improved processing efficiency.
  • One of the core technologies of RPA is locating and picking up the elements to be operated on the interface (i.e., the target elements). For example, to simulate a person clicking a button, the prerequisite is to locate the position of the button element.
  • FIG. 1 is an image of a web page interface (specifically, a Baidu search interface). The process of robotic process automation is exemplarily described below in conjunction with FIG. 1.
  • The interface includes a plurality of elements, namely a first element 101, a second element 102, a third element 103, a fourth element 104, a fifth element 105, a sixth element 106, a seventh element 107, an eighth element 108, a ninth element 109, and a tenth element 190, wherein the first to seventh elements are all hyperlink-type elements, the eighth element is an editable-input-box-type element, the ninth element is a button-type element, and the tenth element 190 is a bordered image.
  • Robotic process automation means that the robot simulates manual operations on the elements shown in FIG. 1.
  • To realize robotic process automation, the related technology relies on multiple modules working in series, such as an element detection module, an image-feature-based template matching module, and an OCR module.
  • Some embodiments of the present application instead obtain the element structure tree of the web page interface of FIG. 1 in the design stage, and the robot then obtains the element structure tree of the interface to be operated (that is, the same interface as FIG. 1) in the execution stage. The two element structure trees are then used to help the robot locate the position of the button and enable it to smoothly perform the click operation on the button.
  • FIG. 1 is only used to exemplify the working scenario and working process of the present application and should not be understood as limiting the application scenario of the technical solution of the present application.
  • In practice, a certain interface of an application (app) is selected as the baseline interface (or standard interface); this is the interface operated by humans, that is, the interface recorded as an image in the designer stage of the RPA process. The robot will then access this interface repeatedly, countless times; each instance of the interface accessed by the robot in order to operate it is called the interface to be operated (or interface N).
  • an embodiment of the present application provides a method for locating a target element on an interface, and the method exemplarily includes: S101, obtaining the structural relationship between at least some elements on the interface to be operated, and obtaining a structure tree of elements to be matched. And S102, determining the position of the target element from the interface to be operated at least based on the reference element structure tree and the structure tree of elements to be matched, so as to complete the operation of the target element; wherein the reference element structure tree is used to characterize the structural relationship between at least some elements on the reference interface, and the structural relationship is obtained by structurally parsing the elements of the corresponding interface, and the corresponding interface includes the reference interface and the interface to be operated.
  • Some embodiments of the present application enable the software robot to select target elements without relying on cumbersome OCR results or image semantic information; instead, the robot uses the structural relationships between elements, maps them onto the actual image (i.e., the image corresponding to the interface to be operated), finds the position of the corresponding target element, and completes a click or another type of operation, thereby improving the accuracy of the results.
  • In some embodiments, before executing S101, the method further includes: obtaining the structural relationship between at least some of the elements on the reference interface to obtain the reference element structure tree.
  • That is, before the robot operates the interface to be operated, it is also necessary to first obtain the element structure tree of the standard (reference) interface, so that the position of the target element on the interface to be operated can be found based on the reference element structure tree and the element structure tree to be matched.
  • the structured analysis is the classification result of at least some of the elements according to the element logical relationship and the element spatial distance relationship; the reference element structure tree and the element structure tree to be matched are used to characterize the position of the target common ancestor node of any two nodes.
  • Some embodiments of the present application obtain an element structure tree by constructing a common ancestor between nodes that are close in space and have the same logical relationship and marking the position of the ancestor (for example, using a rectangular box to mark the position of the ancestor).
  • the element structure tree can fully characterize the structural relationship between elements on the interface, thereby improving the accuracy of locating the target element according to the structural relationship.
  • the element logical relationship refers to distinguishing different types of elements from the functional perspective of the elements, and multiple elements with similar or identical functions belong to elements that satisfy a logical relationship.
  • FIG. 3 provides an image of a login interface that offers verification-code login and password login, input boxes for obtaining an SMS verification code or a voice verification code under the verification-code login method, a login/registration selection box, and other login methods.
  • The three elements in annotation box 301 of FIG. 3 have the same function and all belong to third-party login methods, so the three are considered to belong to the same category according to the element logical relationship.
  • The meaning of ancestor node is: a node's parent node is one of its ancestor nodes, the parent node of that parent node is also an ancestor node, and so on.
  • A common ancestor node of two different nodes is a node that is an ancestor of both; that is, the nodes at which their upward paths overlap are their common ancestor nodes.
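That "search upward" definition translates directly into code. The sketch below works on a hypothetical parent map (child to parent, with the root mapping to None): collect one node's ancestors while walking up, then walk up from the other node until the paths meet.

```python
def nearest_common_ancestor(parent: dict, a, b):
    seen = set()
    while a is not None:          # collect a and all of a's ancestors
        seen.add(a)
        a = parent.get(a)
    while b is not None:          # first node on b's upward path already seen
        if b in seen:
            return b
        b = parent.get(b)
    return None

# Toy tree: root R with children P1 and P2; n1, n2 under P1; n3 under P2.
parent = {"n1": "P1", "n2": "P1", "n3": "P2", "P1": "R", "P2": "R", "R": None}
assert nearest_common_ancestor(parent, "n1", "n2") == "P1"
assert nearest_common_ancestor(parent, "n1", "n3") == "R"
```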
  • the process of obtaining the structural relationship between at least some elements on the interface to be operated in S101 exemplarily includes the following three steps:
  • In the first step, the image of the interface to be operated is input into the target element detection model to obtain the element attribute information and target semantic features of all elements detected from the image, wherein the element attribute information includes at least one of element position (for example, represented by coordinates) and element category, and the target semantic features are the semantic features of the area where each of the detected elements is located.
  • In the second step, an initial structure graph is constructed according to the distance composition algorithm and the attribute information of all the elements, wherein the initial structure graph includes a plurality of nodes, each node represents an element, and the feature of each node is represented by the element attribute information.
  • The same second-step process can also be used to construct the initial structure graph corresponding to the reference interface.
  • In the third step, the initial structure graph is input into the target graph neural network model, and the element structure tree to be matched is obtained at least according to the target graph neural network model, wherein the element structure tree to be matched includes the plurality of nodes and the position information of the nearest common ancestor nodes corresponding to at least some of the nodes.
  • As shown in FIG. 4, the image of the interface to be operated is input into the target element detection model 110 (which executes the first step above) to obtain the element attribute information and target semantic features; the element attribute information is provided to the distance composition module 112 (which executes the second step above) to obtain the initial structure graph; finally, the initial structure graph is input into the target graph neural network model 120 to obtain the element structure tree to be matched.
  • the reference element structure tree can be obtained by inputting the initial structure diagram corresponding to the reference interface into the target graph neural network model through the third step.
  • some embodiments of the present application detect the element attribute information and local semantic features (or target semantic features) of all elements existing on the image of the interface to be operated through a target element detection model, and then construct an element structure tree through a target graph neural network model to obtain the structural relationship of each element, so that the technical solution for finding the position of the target element on the interface to be operated relies on the structural relationship to search, thereby reducing the complexity of the technical solution while improving the accuracy of the search results.
  • some embodiments of the present application also need to identify the position of the target element based on the semantics of each element marked on the reference element structure tree.
  • the target element is a target button; wherein, after obtaining the reference element structure tree, the method further comprises: marking the semantics of each element on the reference element structure tree to obtain a reference element semantic tree.
  • S102 exemplarily comprises: confirming that the structures of the reference element structure tree and the element structure tree to be matched are consistent; searching for a target node corresponding to the target button from the reference element semantic tree; searching for an element position feature value of a node corresponding to the target node from the element structure tree to be matched, and obtaining the position of the target button from the interface to be operated according to the element position feature value.
  • Some embodiments of the present application use the structural relationship of the interface and the semantic information of the elements on the interface to locate the target button on the interface to be operated, thereby further improving the accuracy of the positioning result.
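Putting S102 together, the sketch below shows one way the structure check, the semantic lookup on the reference tree, and the position read-out from the matched tree could compose. The `Node` type and the child-index paths are assumptions made for this sketch; the patent only requires that the two trees be structurally consistent so that positions can be mapped across them.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Node:
    semantics: str = ""              # configured on the reference tree only
    box: Optional[Tuple] = None      # element position feature value
    children: List["Node"] = field(default_factory=list)

def same_structure(a: Node, b: Node) -> bool:
    """Trees match if they branch identically (semantics/boxes may differ)."""
    return len(a.children) == len(b.children) and all(
        same_structure(x, y) for x, y in zip(a.children, b.children))

def path_to(root: Node, semantics: str, path=()) -> Optional[tuple]:
    """Child-index path from the root to the node with the given semantics."""
    if root.semantics == semantics:
        return path
    for i, child in enumerate(root.children):
        p = path_to(child, semantics, path + (i,))
        if p is not None:
            return p
    return None

def locate(reference: Node, to_match: Node, target: str) -> Tuple:
    """S102: check consistency, find the target on the reference semantic
    tree, then read the corresponding node's position on the matched tree."""
    if not same_structure(reference, to_match):
        raise LookupError("element structure trees are not consistent")
    path = path_to(reference, target)
    if path is None:
        raise LookupError(f"no node with semantics {target!r} on reference tree")
    node = to_match
    for i in path:
        node = node.children[i]
    return node.box   # position of the target element on the interface to be operated
```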
  • the target element detection model and the target graph neural network model are both neural network models obtained after training, wherein the target element detection model has the ability to detect element positions, element categories, and local semantic features corresponding to elements on an input image.
  • the above-mentioned target graph neural network model has the function of obtaining the position of the ancestor control of each element (belonging to a control on the interface) according to the output data of the target element detection model.
  • the reference interface image is input into the two-stage cascade neural network model system 100 to obtain the reference element structure tree
  • the interface image to be operated is input into the two-stage cascade neural network model system 100 to obtain the element structure tree to be matched
  • the neural network model system 100 at least includes a target element detection model 110 and a target graph neural network model 120.
  • The semantics of each node are configured on the reference element structure tree by the configuration module to obtain the reference element semantic structure tree. It is then determined whether the structures of the reference element structure tree and the element structure tree to be matched are consistent.
  • If they are consistent, the target element semantic search module searches the reference element semantic structure tree to obtain the position of the target element within it. Finally, the position of the target element is located in the element structure tree to be matched according to this position information (specifically implemented by the code of the target element search module of FIG. 5), and the position is mapped onto the interface to be operated so that the robot completes the click or other operation on the target element.
  • the two-stage cascaded neural network model system 100 in FIG. 5 may also include other functional units in addition to the two models, as shown in FIG. 4 .
  • the above-mentioned input of the image of the interface to be operated into the target element detection model to obtain the element attribute information and target semantic features of all elements detected from the image of the interface to be operated exemplarily includes: obtaining the overall picture semantic features through the backbone network included in the target element detection model, wherein the backbone network is a feature extraction network; extracting the local semantic features corresponding to each element included in all the elements from the overall picture semantic features, and using all the obtained local semantic features as the target semantic features.
  • Some embodiments of the present application obtain the local semantic features of each element through the semantic features of the overall picture obtained by the target element detection network, and using the features to characterize the features of each node not only improves the accuracy of the node features, but also reduces the amount of data processing and improves the data processing speed.
  • the two-stage cascaded neural network model system 100 in Figure 4 of some embodiments of the present application also adopts a series connection method, first performing element detection on the image, and then using the graph neural network to build an element structure tree based on the element detection results.
  • some embodiments of the present application only use two models, so compared with the technical solution of realizing element positioning in a multi-module manner in the prior art, it can: reduce the cumulative effect of the error rate caused by the series connection of multiple models; and improve the overall working speed.
  • Since some embodiments of the present application do not rely on the semantic information given by OCR when searching for certain elements in the interface, but instead rely on the structural relationships of the elements, they are more robust to appearance information such as language version or color and shape changes, while also reducing the training cost of the model.
  • In order for the software robot to perform the relevant operations, the overall process includes a designer stage and an executor stage, in each of which the process of obtaining the corresponding button (element) is carried out.
  • the designer will infer the reference interface through the target element detection model 110 and the target graph neural network model 120 as shown in FIG5 , thereby generating a reference element structure tree.
  • Each node in the reference element structure tree is configured through manual configuration and other configuration methods, and the configuration information includes semantics, functions, coordinates and other information, and the reference element semantic structure tree is obtained after configuration.
  • the execution process of the executor includes:
  • The first step is to receive a search request for "click button X".
  • The second step is to use the target element detection model and the target graph neural network model shown in FIG. 5 to obtain the element structure tree to be matched, and at the same time to read the reference element structure tree obtained in advance from the reference interface through the target element detection model and the target graph neural network model.
  • The third step is to compare whether the structures of the element structure tree to be matched and the reference element structure tree are consistent. If they are not consistent, the search for button X on interface N fails. Otherwise, a semantic search is performed on the reference element semantic structure tree obtained through configuration, the target node corresponding to button X is located in the reference element structure tree, and the node corresponding to the target node is found in the element structure tree to be matched. The coordinate information of that node is returned to the software robot and subsequent RPA work is performed; that is, the node position information is used as the position of button X in order to locate button X on interface N.
  • the fourth step is to complete the click operation on button X.
  • the architecture of the element detection model and the target element detection model in some embodiments of the present application is the same, except that the weight value of the element detection model is a randomly initialized value, while the weight value of the target element detection model is obtained after the training is completed.
  • the element detection model of some embodiments of the present application can adopt any neural network model with the function of extracting interface image elements.
  • the element detection model can be a yolov5 neural network model, which adopts a convolutional neural network CNN.
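For concreteness, a stock yolov5 checkpoint can be pulled through torch.hub as below. This only illustrates the model family the text names: to serve as the element detection model it would still be retrained on the annotated interface images with the annotated element categories as its classes, and the screenshot path here is a hypothetical placeholder.

```python
import torch

# Load a small pretrained yolov5 model via torch.hub (ultralytics/yolov5 repo).
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

results = model("interface_screenshot.png")  # hypothetical interface image path
boxes = results.xyxy[0]                      # (K, 6): x1, y1, x2, y2, conf, class
print(boxes[:5])
```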
  • In some embodiments, before executing S101, the method further includes a process of training the element detection model to obtain a weight file and obtaining the target element detection model according to the weight file.
  • the process includes:
  • The first step is to obtain N original interface images.
  • The second step is to mark the area where each element is located and the category of each element on each of the N original interface images to obtain N element-annotated images, wherein the area where each element is located is marked with a rectangular frame, and the categories include at least one of a scroll bar, an editable input box, text, a hyperlink, an image with a border, a button, a mark, a window, and a pop-up window.
  • the web page interface images or software interface images collected in the first step are annotated (for example, manually annotated) to form a corresponding annotation set.
  • the categories include: scrollbar: scroll bar; textbox: editable input box; text: text; link: hyperlink (underlined); image: image with borders; button: button; icon: mark, symbol; window: window, pop-up window; icon_button: both icon and button; icon_button_text: both icon, button and text.
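Written out as a small enum for reference, those ten categories look like this; the member names simply mirror the labels given above.

```python
from enum import Enum

class ElementCategory(str, Enum):
    SCROLLBAR = "scrollbar"                # scroll bar
    TEXTBOX = "textbox"                    # editable input box
    TEXT = "text"                          # text
    LINK = "link"                          # hyperlink (underlined)
    IMAGE = "image"                        # image with borders
    BUTTON = "button"                      # button
    ICON = "icon"                          # mark, symbol
    WINDOW = "window"                      # window, pop-up window
    ICON_BUTTON = "icon_button"            # both icon and button
    ICON_BUTTON_TEXT = "icon_button_text"  # icon, button and text
```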
  • The third step is to train the element detection model according to the N original interface images and the N element-annotated images to obtain the target element detection model.
  • That is, the N original interface images obtained in the first step and the N element-annotated images obtained in the second step are sent as input to the element detection model for supervised training, with the corresponding annotation sets used as supervision labels, to obtain a trained first model weight file; the coefficients in the weight file are used as the coefficients of the element detection model, yielding the target element detection model.
  • some embodiments of the present application mark the location of the element and the element category on each training image so that the target element detection network obtained after training has the function of predicting this information on the input image.
  • The functions of the target element detection model, that is, the outputs of the model, will be described below by taking the image of the interface to be operated as an example, in conjunction with FIG. 7.
  • As shown in FIG. 7, the image of the interface to be operated is input into the target element detection model, through which the element coordinates (representing the position of each element on the interface), the element categories, and the overall picture semantic features of all elements detected on the interface can be obtained (for example, the overall picture semantic features are produced by the backbone network of the model); the target area semantic feature acquisition module then extracts the semantic features of the area where each element is located from the overall picture semantic features to obtain the target semantic features.
  • the following is an illustrative description of the process of training the graph neural network model to obtain the target graph neural network model.
  • the training process of the graph neural network requires training data and the training of the graph neural network belongs to supervised training based on labeled data.
  • The process of obtaining training data in some embodiments of the present application exemplarily includes: obtaining the input x and the labeled data y (i.e., the N images annotated with ancestor-node positions and layer numbers).
  • the following is an illustrative description of the implementation process of obtaining input x and labeled data y.
  • Specifically, the method further includes: marking at least one aggregation area on each of the N element-annotated images and marking the level of each aggregation area in the element structure tree (that is, on the basis of the target-detection annotations, marking aggregation boxes (clusters) and marking the level of each aggregation box in the element structure tree, where one aggregation box corresponds to one aggregation area), thereby obtaining N images annotated with ancestor-node positions and layer numbers, wherein an aggregation area includes the area where one or more elements are located, one aggregation area corresponds to one common ancestor node, and the aggregation area is used to characterize the position of that common ancestor node; the graph neural network is then trained at least according to the N images annotated with ancestor-node positions and layer numbers to obtain the target graph neural network model.
  • Some embodiments of the present application further mark the location information of the common ancestor node of adjacent elements and the level of the common ancestor node in the element structure tree on the N element annotation images, so that the trained target graph neural network model has the function of predicting the location of the common ancestor node between nodes on the input image.
  • An example of the process of marking at least one aggregation area on each of the N element-annotated images and marking the level of each aggregation area in the element structure tree includes: aggregating one or more elements on each element-annotated image according to a preset element logical relationship (for example, having the same function) and a preset element spatial distance relationship, marking an initial aggregation area around the area where all the aggregated elements are located and tagging it with a first identifier; then aggregating at least one of the initial aggregation areas according to the same preset logical and spatial distance relationships to obtain a second aggregation area, marking it and tagging it with a second identifier; and so on, until an Nth aggregation area including all the elements on the element-annotated image is obtained, marking it and tagging it with an Nth identifier.
  • Some embodiments of the present application use multi-level aggregation areas on each element annotation diagram and the layer number of each aggregation area as annotation data for training the graph neural network model.
  • the labeled aggregation areas can reflect the subordinate relationships of the elements on the original interface image.
  • Such annotation data enables the trained target graph neural network model to have the ability to mine the subordinate relationships, that is, structural relationships, of the elements on the interface image.
  • The above process prepares, for the input x, the corresponding label set y, i.e., the supervision labels for supervised training of the model.
  • The expression form of the label is: the layer number (in the element structure tree to be constructed) of the nearest common ancestor node of two elements; the label also represents the edge between the two nodes.
  • For example, if the pair of the first node node1 and the second node node2 is labeled 3, this means that, in the final element structure tree, the nearest common ancestor node of the two elements represented by node1 and node2 sits at the third layer of the tree.
  • Accordingly, the model should finally predict that the value of the first edge edge12 (that is, the edge connecting the first node and the second node) between node1 and node2 is 3.
  • the predicted initial structure diagram includes nodes corresponding to each element, but the edges connecting the nodes are not set with any numerical values.
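Deriving that supervision value from the annotated tree is a small variation on the upward search shown earlier. The `parent` and `level` maps below are hypothetical structures assumed to be built from the aggregation annotations (node to parent aggregation, aggregation to its layer number), and the toy example assumes the root sits at layer 1.

```python
def edge_label(parent: dict, level: dict, n1, n2) -> int:
    """Layer number of the nearest common ancestor of n1 and n2; this is the
    value the model should predict for the edge connecting the two nodes."""
    seen = set()
    while n1 is not None:
        seen.add(n1)
        n1 = parent.get(n1)
    while n2 is not None and n2 not in seen:
        n2 = parent.get(n2)
    return level[n2]

# node1 and node2 share aggregation "A" at layer 3 (root assumed at layer 1),
# so edge12 is labeled 3, matching the example in the text.
parent = {"node1": "A", "node2": "A", "A": "B", "B": "root", "root": None}
level = {"A": 3, "B": 2, "root": 1}
assert edge_label(parent, level, "node1", "node2") == 3
```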
  • the following is an example of the implementation process of obtaining the input x.
  • the method before inputting the initial structure graph into the target graph neural network model, the method further includes:
  • In the first step, a prediction result corresponding to each of the N original interface images is obtained through the target element detection model.
  • That is, the N original interface images are input into the target element detection model shown in FIG. 7 to obtain the prediction results, where a prediction result includes the predicted element attribute information and the second semantic features of all elements detected on any original interface image; the predicted element attribute information includes at least one of the element position and the element category, and a second semantic feature is the local semantic feature of an element among all the elements detected on that original interface image (for example, the local semantic feature is obtained by extracting, from the overall picture semantic features, the semantic feature corresponding to each element).
  • In the second step, a predicted initial structure graph corresponding to any one of the original interface images is obtained according to the predicted element attribute information (i.e., the element category and the element coordinates) and a distance composition algorithm (executed by a distance composition module), wherein the predicted initial structure graph includes a plurality of second nodes.
  • That is, some embodiments of the present application use the distance composition algorithm to build a graph from the element coordinates and element categories included in the prediction results obtained in the first step, obtaining the predicted initial structure graph graph1.
  • The distance composition algorithm is defined as follows: every detected element is defined as a node in the predicted initial structure graph or the initial structure graph (one node corresponds to one detected element); for any node N, a circle is drawn with N as the center (for example, the element coordinates in the target detection result describe a rectangle, and the center here refers to the center point of that rectangle) and a certain distance d as the radius; the set S of all other nodes falling within that circle is regarded as related to N, and every node in S is connected to node N by an edge, yielding the predicted initial structure graph or the initial structure graph.
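That definition maps almost line for line onto code. The sketch below builds the undirected, valueless edge set from rectangle centers and a radius d; the concrete boxes and the radius are toy values.

```python
import math

def distance_graph(boxes, d):
    """boxes: list of (x1, y1, x2, y2) element rectangles. Connect node N to
    every other node whose rectangle center lies within distance d of N's."""
    centers = [((x1 + x2) / 2, (y1 + y2) / 2) for x1, y1, x2, y2 in boxes]
    edges = {(i, j)
             for i, (xi, yi) in enumerate(centers)
             for j, (xj, yj) in enumerate(centers)
             if i < j and math.hypot(xi - xj, yi - yj) <= d}
    return list(range(len(boxes))), sorted(edges)   # nodes, valueless edges

nodes, edges = distance_graph(
    [(0, 0, 10, 10), (12, 0, 22, 10), (200, 200, 210, 210)], d=30)
print(edges)   # [(0, 1)] -- only the two nearby elements get connected
```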
  • It should be noted that no numerical values are set on the edges of the predicted initial structure graph or the initial structure graph; these values are obtained through the trained target graph neural network model, and each value characterizes the layer number, on the constructed element structure tree, of the nearest common ancestor node of the two nodes connected by the edge.
  • The purpose of using the element category information as a feature of the nodes on the predicted initial structure graph or the initial structure graph is to enrich the features of each node, so that the construction result of the element structure tree takes each element's category into account.
  • It should be noted that the initial structure graph and the predicted initial structure graph can also be composed considering only the element position (i.e., element coordinate) information; in that case, the output of the corresponding target element detection model need not include the element category, and the corresponding annotation data need not be labeled with element categories.
  • in a third step, the features of each second node on the predicted initial structure graph are obtained from the prediction results, and an input feature vector is derived from these features.
  • one exemplary implementation of this third step takes the element position, the element category and the local semantic feature corresponding to any second node as that node's features, the local semantic feature corresponding to a second node being the semantic feature of the region in which that node is located.
  • that is, some embodiments of the present application use the element position (i.e., the coordinates of the element on the corresponding interface image), the element category (e.g., at least one of a scroll bar, an editable input box, text, a hyperlink, a bounded image, a button, a mark, a window and a pop-up window) and the local semantic feature as the features of each node on the initial structure graph.
  • another exemplary implementation of the third step performs dimensionality reduction on the local semantic feature corresponding to any second node to obtain a reduced-dimension local semantic feature, and then takes the element position, the element category and the reduced-dimension local semantic feature corresponding to that node as its features.
  • using reduced-dimension local semantic features as the node features of the initial structure graph reduces the amount of data processed during training and increases training speed.
  • in some embodiments of the present application, the dimensionality reduction is performed with the principal component analysis (PCA) algorithm, as sketched below.
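The PCA step can be illustrated with a short sketch. This assumes each element's local semantic feature has first been flattened into a vector; scikit-learn's PCA is one possible implementation (the patent does not prescribe a particular library), and the 32-dimensional target size is an assumption.

```python
# Minimal sketch of PCA dimensionality reduction of local semantic features,
# assuming each m*n feature matrix V has been flattened into a row vector.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
V = rng.normal(size=(500, 256))     # 500 local semantic features, 256-dim each
pca = PCA(n_components=32)          # 32 is an assumed target dimension
K = pca.fit_transform(V)            # reduced-dimension feature expression K
print(K.shape)                      # (500, 32)
```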
  • as shown in Figure 8, the overall picture semantic feature corresponding to each original interface image is input into the target region semantic feature acquisition module, which extracts the local semantic feature corresponding to each element from the overall picture semantic feature. Each local semantic feature is then input into the dimension reduction module (used to execute the dimensionality reduction algorithm) to obtain the reduced-dimension local semantic feature for each element. Finally, the reduced-dimension local semantic features are input into the node feature construction module to obtain the features of each second node.
  • since a second semantic feature is typically expressed as an m×n matrix V whose dimensions m and n are large, PCA is adopted to obtain a feature expression K in a smaller space (that is, the reduced-dimension local semantic feature is represented by K).
  • next, feature construction is performed on the element coordinates, the element category and the feature expression K obtained from the target element detection model, yielding the element features of each second node, i.e. the features of the second nodes in each graph1.
  • the feature vector x contains the position coordinates, the category and the img feature, and the feature of each second node is represented by the following expression:
  • x = [node.class + node.location + node.img_feature]
  • where node.class, node.location and node.img_feature respectively denote the category, the position coordinates and the img feature (the feature expression K) of the corresponding element, and the combination method is concatenation.
  • the feature of each second node of the predicted initial structure graph graph1 is thus the feature vector x generated in the previous step for the corresponding position; the set of these feature vectors, the feature matrix X, is constructed according to the structure of graph1; the adjacency matrix A and the degree matrix D are likewise generated from the structure of graph1; and A, D and X are taken together as the input x0 sent to the graph neural network for training (a runnable sketch of this assembly is given below).
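A short sketch of assembling the GNN inputs from graph1 follows. This is a hypothetical illustration under the stated conventions: node feature vectors x are stacked into X, and A and D are derived from the edge set produced by the distance composition step; function and variable names are assumptions.

```python
# Hypothetical sketch: assemble the feature matrix X, adjacency matrix A and
# degree matrix D (together the input x0) from the structure of graph1.
import numpy as np

def build_inputs(node_features, edges):
    """node_features: list of 1-D vectors x; edges: set of (i, j) index pairs."""
    X = np.stack(node_features)                    # feature matrix X, one row per node
    n = len(node_features)
    A = np.zeros((n, n))
    for i, j in edges:                             # undirected graph
        A[i, j] = A[j, i] = 1.0
    D = np.diag(A.sum(axis=1))                     # degree matrix D
    return X, A, D                                 # together: the input x0
```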
  • the core formula of the graph neural network is of the standard two-layer graph convolutional form:
  • Z = softmax( Â · ReLU( Â · X · W^(0) ) · W^(1) ), where Â is the adjacency matrix A normalized by the degree matrix D (e.g. Â = D^(-1/2) A D^(-1/2))
  • X is the feature matrix of the nodes
  • A is the adjacency matrix of the graph
  • W is a trainable weight; the number in the upper right corner of W denotes the layer, e.g. W^(0) is the trainable weight of the 0th layer
  • ReLU is the internal activation function
  • softmax is the output activation function.
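The formula above can be exercised with a short numpy sketch. This is a minimal illustration, not the patent's implementation: the training loop, the loss on the edge labels y, and the exact normalization are omitted or assumed (self-loops are added before normalizing, as in the standard GCN).

```python
# Minimal numpy sketch of the two-layer GCN forward pass described above.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gcn_forward(X, A, W0, W1):
    A_hat = A + np.eye(A.shape[0])                     # add self-loops (assumption)
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]   # D^-1/2 A D^-1/2
    H = np.maximum(A_norm @ X @ W0, 0)                 # ReLU(A_norm X W0), internal layer
    return softmax(A_norm @ H @ W1)                    # softmax(A_norm H W1), output layer

X = np.random.rand(5, 16)                              # 5 nodes, 16-dim features
A = np.zeros((5, 5)); A[0, 1] = A[1, 0] = A[1, 2] = A[2, 1] = 1
Z = gcn_forward(X, A, np.random.rand(16, 8), np.random.rand(8, 4))
print(Z.shape)                                         # (5, 4): per-node output scores
```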
  • as shown in Figure 8, the corresponding matrices are obtained from the features of each second node through the adjacency matrix and degree matrix construction module.
  • finally, the input x and the input y are fed into the graph neural network model for training to obtain a second model weight file; the weight coefficient values in this file are used as the parameter values of the graph neural network model, yielding the target graph neural network model.
  • that is, the N ancestor node position annotated images (corresponding to the input y), together with the feature matrix, adjacency matrix and degree matrix of the second nodes (corresponding to the input x), are input into the graph neural network 121 for training, producing the target graph neural network model 120.
  • it can be understood that the above-described process of training the graph neural network at least according to the N ancestor node position annotated images to obtain the target graph neural network model exemplarily includes: training the graph neural network according to the input feature vector and the N ancestor node position annotated images.
  • in other words, some embodiments of the present application also need to obtain the input vectors that serve as training data for the graph neural network.
  • only when these data are input into the graph neural network model together with the N ancestor node position annotated images can the training of the network be completed, yielding a target graph neural network model capable of constructing an element structure tree.
  • with the output of the target element detection model fed into the target graph neural network model, the values corresponding to the edges between the nodes of the graph obtained by the distance composition algorithm can be predicted; each value characterizes the position of the nearest common ancestor node of the corresponding pair of elements. It is then straightforward to construct an element structure tree from the output of the target graph neural network model; a minimal sketch of one such construction follows.
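The patent treats tree construction from the predicted layer values as conventional and does not fix an algorithm, so the following is only one plausible reading: edges are processed from the deepest predicted layer upward, and a union-find groups the nodes whose nearest common ancestor sits at that layer, creating a new ancestor node per merge. All names are assumptions.

```python
# Hypothetical sketch: build an element structure tree (as child -> parent links)
# from predicted edge values, i.e. the layer of each pair's nearest common ancestor.

def build_tree(num_nodes, edge_layers):
    """edge_layers: {(i, j): predicted layer of the pair's nearest common ancestor}."""
    parent = {}                                    # child -> ancestor-node id
    group = {i: i for i in range(num_nodes)}       # union-find over node indices
    cluster = {i: i for i in range(num_nodes)}     # root index -> current cluster id
    next_id = num_nodes

    def find(i):
        while group[i] != i:
            group[i] = group[group[i]]             # path halving
            i = group[i]
        return i

    # deeper layers mean closer relationships, so merge those pairs first
    for layer in sorted(set(edge_layers.values()), reverse=True):
        for (i, j), l in edge_layers.items():
            if l != layer:
                continue
            ri, rj = find(i), find(j)
            if ri == rj:
                continue
            anc = next_id; next_id += 1            # new ancestor node at this layer
            parent[cluster[ri]] = parent[cluster[rj]] = anc
            group[rj] = ri
            cluster[ri] = anc
    return parent

# Three elements: 0 and 1 meet at layer 3; both meet element 2 only at layer 2.
print(build_tree(3, {(0, 1): 3, (0, 2): 2, (1, 2): 2}))
# {0: 3, 1: 3, 3: 4, 2: 4}: ancestor 3 groups {0, 1}; ancestor 4 is the root-side node
```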
  • Figure 9 shows a device for locating a target element on an interface provided by an embodiment of the present application.
  • the device corresponds to the method embodiment of Figure 2 above and can execute each step involved in that method embodiment.
  • for the specific functions of the device, reference may be made to the description above; to avoid repetition, the detailed description is omitted here as appropriate.
  • the device includes at least one software function module that can be stored in a memory in the form of software or firmware, or solidified in the operating system of the device.
  • the device for locating a target element on an interface includes: an element structure tree acquisition module 801 and a positioning module 802.
  • the element structure tree acquisition module 801 is configured to acquire the structural relationship between at least some elements on the interface to be operated, and obtain the element structure tree to be matched.
  • the positioning module 802 is configured to determine the position of the target element on the interface to be operated at least according to the reference element structure tree and the element structure tree to be matched, so as to complete the operation on the target element; the reference element structure tree characterizes the structural relationship between at least some elements on the reference interface, the structural relationship is obtained by structured parsing of the elements of the corresponding interface, and the corresponding interface includes the reference interface and the interface to be operated.
  • Some embodiments of the present application provide a computer-readable storage medium having a computer program stored thereon.
  • when the program is executed by a processor, any embodiment of the method for locating a target element on an interface as described in the above embodiments can be implemented.
  • some embodiments of the present application provide an electronic device 900, including a memory 910, a processor 920, and a computer program stored in the memory 910 and executable on the processor 920, wherein the processor 920 can implement any embodiment of the method for locating a target element on an interface as described above when reading the program from the memory 910 through a bus 930 and executing the program.
  • the processor 920 can process digital signals and can include various computing architectures, such as a complex instruction set computer (CISC) architecture, a reduced instruction set computer (RISC) architecture, or an architecture implementing a combination of multiple instruction sets.
  • in some examples, the processor 920 can be a microprocessor.
  • the memory 910 may be used to store instructions executed by the processor 920, or data related to the execution of the instructions. These instructions and/or data may include code for implementing some or all of the functions of one or more modules described in the embodiments of the present application.
  • the processor 920 of the disclosed embodiments may be used to execute the instructions in the memory 910 to implement the method shown in Figure 2.
  • the memory 910 includes dynamic random access memory, static random access memory, flash memory, optical memory, or other memory known to those skilled in the art.
  • Some embodiments of the present application provide a computer program product, which includes a computer program.
  • when the computer program is executed by a processor, it can implement any embodiment of the method for locating a target element on an interface as described in the above embodiments.
  • Some embodiments of the present application provide a robot configured to execute any embodiment included in the method for locating a target element on an interface as described in the above embodiments.
  • each box in the flowchart or block diagram can represent a module, a program segment or a part of code, which contains one or more executable instructions for implementing the specified logical function.
  • in some alternative implementations, the functions noted in the boxes can also occur in an order different from that noted in the accompanying drawings; for example, two consecutive boxes may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved.
  • each box in the block diagram and/or flowchart, and each combination of boxes in the block diagram and/or flowchart, can be implemented with a dedicated hardware-based system that performs the specified function or action, or with a combination of dedicated hardware and computer instructions.
  • the functional modules in the various embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
  • if the functions are implemented in the form of software function modules and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which can be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application.
  • the aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present application provide a method, medium and electronic device for locating a target element on an interface. The method includes: acquiring the structural relationship between at least some elements on an interface to be operated, to obtain an element structure tree to be matched (S101); and determining the position of a target element on the interface to be operated at least according to a reference element structure tree and the element structure tree to be matched, so as to complete an operation on the target element (S102); the reference element structure tree characterizes the structural relationship between at least some elements on a reference interface, the structural relationship is obtained by structured parsing of the elements of the corresponding interface, and the corresponding interface includes the reference interface and the interface to be operated. Starting from the perspective of interface structuring, some embodiments of the present application allow a software robot to select the target element without relying on cumbersome OCR results or image semantic information, improving the accuracy of the positioning result.

Claims (18)

  1. A method for locating a target element on an interface, characterized in that the method comprises:
    acquiring a structural relationship between at least some elements on an interface to be operated, to obtain an element structure tree to be matched;
    determining a position of a target element on the interface to be operated at least according to a reference element structure tree and the element structure tree to be matched, so as to complete an operation on the target element;
    wherein the reference element structure tree is used to characterize a structural relationship between at least some elements on a reference interface, the structural relationship is obtained by performing structured parsing on the elements of the corresponding interface, and the corresponding interface comprises the reference interface and the interface to be operated.
  2. The method according to claim 1, characterized in that
    the structured parsing is a result of classifying the at least some elements according to element logical relationships and element spatial distance relationships;
    the reference element structure tree and the element structure tree to be matched are used to characterize a target common ancestor node of any two nodes.
  3. The method according to claim 2, characterized in that, if the number of common ancestor nodes of the any two nodes is more than one, the target common ancestor node is the common ancestor node, among the plurality of common ancestor nodes, that is closest to the two nodes.
  4. The method according to claim 1, characterized in that acquiring the structural relationship between at least some elements on the interface to be operated comprises:
    inputting an image of the interface to be operated into a target element detection model to obtain element attribute information and target semantic features of all elements detected in the image of the interface to be operated, wherein the element attribute information comprises at least one of an element position and an element category, and the target semantic features are the semantic features of the regions in which the respective elements among all the elements are located;
    constructing an initial structure graph according to a distance composition algorithm and the attribute information of all the elements, wherein the initial structure graph comprises a plurality of nodes, each node represents one element, and the feature of each node is characterized by the element attribute information;
    inputting the initial structure graph into a target graph neural network model, and obtaining the element structure tree to be matched according to the target graph neural network model, wherein the element structure tree to be matched comprises the plurality of nodes and ancestor nodes corresponding to at least some of the nodes.
  5. The method according to claim 4, characterized in that inputting the image of the interface to be operated into the target element detection model to obtain the element attribute information and the target semantic features of all elements detected in the image comprises:
    obtaining an overall picture semantic feature through a backbone network included in the target element detection model, wherein the backbone network is a feature extraction network;
    cropping, from the overall picture semantic feature, the local semantic features respectively corresponding to the elements among all the elements, and taking all the local semantic features thus obtained as the target semantic features.
  6. The method according to claim 4, characterized in that, before inputting the image of the interface to be operated into the target element detection model, the method further comprises:
    acquiring N original interface images;
    annotating, on each of the N original interface images, the region in which each element is located and the category of each element, to obtain N element-annotated images, wherein the region in which each element is located is marked with a rectangular box, and the category comprises at least one of: scroll bar, editable input box, text, hyperlink, bounded image, button, mark, window and pop-up window;
    training an element detection model according to the N original interface images and the N element-annotated images to obtain the target element detection model.
  7. The method according to claim 6, characterized in that, before inputting the initial structure graph into the target graph neural network model, the method further comprises:
    annotating at least one aggregation region on each of the N element-annotated images and annotating the layer of the aggregation region in the element structure tree, to obtain N ancestor node position and layer annotated images, wherein one aggregation region comprises the region(s) in which one or more elements are located, one aggregation region corresponds to one common ancestor node, and the aggregation region is used to characterize the position of that common ancestor node;
    training a graph neural network at least according to the N ancestor node position and layer annotated images to obtain the target graph neural network model.
  8. The method according to claim 7, characterized in that annotating at least one aggregation region on each of the N element-annotated images and annotating the layer of the aggregation region in the element structure tree comprises:
    aggregating one or more elements on each element-annotated image according to preset element logical relationships and preset element spatial distance relationships, annotating an initial aggregation region over the regions of all the aggregated elements and labeling the initial aggregation region with a first identifier; aggregating at least one initial aggregation region into a second aggregation region according to the preset element logical relationships and the preset element spatial distance relationships, annotating the second aggregation region and labeling it with a second identifier; and so on, until an Nth aggregation region containing all the elements on the element-annotated image is obtained, annotated and labeled with an Nth identifier, wherein the Nth aggregation region corresponds to the root node of the tree, the Nth aggregation region comprises one or more (N-1)th aggregation regions, N is an integer greater than 1, and the different identifiers are used to record the layer, in the element structure tree, of the corresponding aggregation region.
  9. The method according to any one of claims 7-8, characterized in that, before inputting the initial structure graph into the target graph neural network model, the method further comprises:
    obtaining, through the target element detection model, a prediction result corresponding to each of the N original interface images, wherein the prediction result comprises predicted element attribute information and second semantic features of all elements detected on any original interface image, the predicted element attribute information comprises at least one of the element position and the element category, and the second semantic features are the local semantic features of the respective elements detected on the original interface image;
    obtaining, according to the predicted element attribute information and the distance composition algorithm, a predicted initial structure graph corresponding to the original interface image, wherein the predicted initial structure graph comprises a plurality of second nodes;
    obtaining the feature of each second node on the predicted initial structure graph according to the prediction result, and obtaining an input feature vector according to the features;
    and training the graph neural network at least according to the N ancestor node position and layer annotated images to obtain the target graph neural network model comprises:
    training the graph neural network according to the input feature vector and the N ancestor node position and layer annotated images to obtain the target graph neural network model.
  10. The method according to claim 9, characterized in that obtaining the feature of each second node on the predicted initial structure graph according to the prediction result comprises:
    taking the element position, the element category and the local semantic feature corresponding to any second node as the feature of that second node, wherein the local semantic feature corresponding to the second node is the semantic feature of the region in which that second node is located.
  11. The method according to claim 9, characterized in that obtaining the feature of each second node on the predicted initial structure graph according to the prediction result comprises:
    performing dimensionality reduction on the local semantic feature corresponding to any second node to obtain a reduced-dimension local semantic feature, wherein the local semantic feature corresponding to the second node is the semantic feature of the region in which that second node is located;
    taking the element position, the element category and the reduced-dimension local semantic feature corresponding to the second node as the feature of that second node.
  12. The method according to claim 9, characterized in that the local semantic features are reduced in dimension by the principal component analysis (PCA) dimensionality reduction algorithm.
  13. The method according to claim 2, characterized in that
    determining the position of the target element on the interface to be operated at least according to the reference element structure tree and the element structure tree to be matched comprises:
    annotating the semantics of each element on the reference element structure tree to obtain a reference element semantic tree;
    confirming that the structures of the reference element structure tree and the element structure tree to be matched are consistent;
    searching the reference element semantic tree for a target node corresponding to the target element;
    searching the element structure tree to be matched for the element position feature value of the node corresponding to the target node, and obtaining the position of the target element on the interface to be operated according to the element position feature value.
  14. A device for locating a target element on an interface, characterized in that the device comprises:
    an element structure tree acquisition module, configured to acquire a structural relationship between at least some elements on an interface to be operated, to obtain an element structure tree to be matched;
    a positioning module, configured to determine a position of a target element on the interface to be operated at least according to a reference element structure tree and the element structure tree to be matched, so as to complete an operation on the target element; wherein the reference element structure tree is used to characterize a structural relationship between at least some elements on a reference interface, the structural relationship is obtained by performing structured parsing on the elements of the corresponding interface, and the corresponding interface comprises the reference interface and the interface to be operated.
  15. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, can implement the method according to any one of claims 1-13.
  16. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, can implement the method according to any one of claims 1-13.
  17. A computer program product, characterized in that it comprises a computer program which, when executed by a processor, can implement the method according to any one of claims 1-13.
  18. A robot, characterized in that the robot is configured to execute the method according to any one of claims 1-13.
PCT/CN2022/138765 2022-09-30 2022-12-13 一种定位界面上目标元素的方法、介质及电子设备 WO2024066067A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211205671.2A CN115268719B (zh) 2022-09-30 2022-09-30 一种定位界面上目标元素的方法、介质及电子设备
CN202211205671.2 2022-09-30

Publications (1)

Publication Number Publication Date
WO2024066067A1 (zh)

Family

ID=83758128

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/138765 WO2024066067A1 (zh) 2022-09-30 2022-12-13 一种定位界面上目标元素的方法、介质及电子设备

Country Status (2)

Country Link
CN (1) CN115268719B (zh)
WO (1) WO2024066067A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115268719B (zh) * 2022-09-30 2022-12-20 北京弘玑信息技术有限公司 一种定位界面上目标元素的方法、介质及电子设备
CN116051868B (zh) * 2023-03-31 2023-06-13 山东大学 一种面向windows系统的界面元素识别方法

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050102636A1 (en) * 2003-11-07 2005-05-12 Microsoft Corporation Method and system for presenting user interface (UI) information
CN109324796A (zh) * 2018-08-01 2019-02-12 浙江口碑网络技术有限公司 界面布局方法及装置
CN112015405A (zh) * 2019-05-29 2020-12-01 腾讯数码(天津)有限公司 界面布局文件的生成方法、界面生成方法、装置及设备
CN112052005A (zh) * 2019-06-06 2020-12-08 阿里巴巴集团控股有限公司 界面处理方法、装置、设备及存储介质
CN112231034A (zh) * 2019-12-23 2021-01-15 北京来也网络科技有限公司 结合rpa和ai的软件界面元素的识别方法与装置
WO2021076205A1 (en) * 2019-10-14 2021-04-22 UiPath Inc. Systems and methods of activity target selection for robotic process automation
US20210349430A1 (en) * 2020-05-11 2021-11-11 UiPath, Inc. Graphical element search technique selection, fuzzy logic selection of anchors and targets, and/or hierarchical graphical element identification for robotic process automation
EP3964946A1 (en) * 2020-09-08 2022-03-09 UiPath, Inc. Application-specific graphical element detection
CN114995816A (zh) * 2022-06-24 2022-09-02 中电金信软件有限公司 业务流程配置方法、装置、电子设备及可读存储介质
CN115268719A (zh) * 2022-09-30 2022-11-01 北京弘玑信息技术有限公司 一种定位界面上目标元素的方法、介质及电子设备

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7607110B2 (en) * 2003-10-23 2009-10-20 Microsoft Corporation Element persistent identification
US8655913B1 (en) * 2012-03-26 2014-02-18 Google Inc. Method for locating web elements comprising of fuzzy matching on attributes and relative location/position of element
KR101282975B1 (ko) * 2012-10-26 2013-07-08 (주)밸류팩토리 문서 요소를 분리 구조화하여 표준화한 후 웹페이지를 재구성하는 웹화면 크롭 서버 장치
CN111552627A (zh) * 2020-03-16 2020-08-18 平安科技(深圳)有限公司 用户界面测试方法、装置、存储介质及计算机设备
CN112308069A (zh) * 2020-10-29 2021-02-02 恒安嘉新(北京)科技股份公司 一种软件界面的点击测试方法、装置、设备及存储介质
CN113934487B (zh) * 2021-09-18 2024-01-23 达而观数据(成都)有限公司 一种用户界面元素定位方法、系统、计算机设备和存储介质
CN114219934A (zh) * 2021-12-22 2022-03-22 国网浙江省电力有限公司双创中心 机器人流程自动系统元素定位方法、装置、设备及介质

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050102636A1 (en) * 2003-11-07 2005-05-12 Microsoft Corporation Method and system for presenting user interface (UI) information
CN109324796A (zh) * 2018-08-01 2019-02-12 浙江口碑网络技术有限公司 界面布局方法及装置
CN112015405A (zh) * 2019-05-29 2020-12-01 腾讯数码(天津)有限公司 界面布局文件的生成方法、界面生成方法、装置及设备
CN112052005A (zh) * 2019-06-06 2020-12-08 阿里巴巴集团控股有限公司 界面处理方法、装置、设备及存储介质
WO2021076205A1 (en) * 2019-10-14 2021-04-22 UiPath Inc. Systems and methods of activity target selection for robotic process automation
CN112231034A (zh) * 2019-12-23 2021-01-15 北京来也网络科技有限公司 结合rpa和ai的软件界面元素的识别方法与装置
US20210349430A1 (en) * 2020-05-11 2021-11-11 UiPath, Inc. Graphical element search technique selection, fuzzy logic selection of anchors and targets, and/or hierarchical graphical element identification for robotic process automation
EP3964946A1 (en) * 2020-09-08 2022-03-09 UiPath, Inc. Application-specific graphical element detection
CN114995816A (zh) * 2022-06-24 2022-09-02 中电金信软件有限公司 业务流程配置方法、装置、电子设备及可读存储介质
CN115268719A (zh) * 2022-09-30 2022-11-01 北京弘玑信息技术有限公司 一种定位界面上目标元素的方法、介质及电子设备

Also Published As

Publication number Publication date
CN115268719A (zh) 2022-11-01
CN115268719B (zh) 2022-12-20

Similar Documents

Publication Publication Date Title
WO2024066067A1 (zh) 一种定位界面上目标元素的方法、介质及电子设备
US11361526B2 (en) Content-aware selection
US7774290B2 (en) Pattern abstraction engine
JP2022514155A (ja) ソフトウェアテスト
CN113391871B (zh) 一种rpa元素智能融合拾取的方法与系统
US20210397546A1 (en) Software test case maintenance
CN113255614A (zh) 一种基于视频分析的rpa流程自动生成方法与系统
Salvador et al. Cultural event recognition with visual convnets and temporal models
CN112631586B (zh) 一种应用开发方法、装置、电子设备和存储介质
US11854285B2 (en) Neural network architecture for extracting information from documents
Schäfer et al. Sketch2BPMN: Automatic recognition of hand-drawn BPMN models
CN110347382A (zh) 一种代码信息统计方法及装置
CN115546465A (zh) 一种用于定位界面上元素位置的方法、介质及电子设备
JP2001325104A (ja) 言語事例推論方法、言語事例推論装置及び言語事例推論プログラムが記録された記録媒体
CN113204333A (zh) 软件界面设计稿前端元素识别方法
CN115269107B (zh) 一种处理界面图像的方法、介质及电子设备
Carme et al. The lixto project: Exploring new frontiers of web data extraction
Patnaik et al. Trends in web data extraction using machine learning
US20230359659A1 (en) Systems and methods for advanced text template discovery for automation
US20240177007A1 (en) Software test case maintenance
US20240004620A1 (en) Automated generation of web applications based on wireframe metadata generated from user requirements
US20210064862A1 (en) System and a method for developing a tool for automated data capture
Koenig et al. NEURAL-UML: Intelligent Recognition System of Structural Elements in UML Class Diagram
De Rosa New methods, techniques and applications for sketch recognition
Cho et al. Utilizing Machine Learning for the Identification of Visually Similar Web Elements

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22960662

Country of ref document: EP

Kind code of ref document: A1