CN115982402A - Graphical interface element feature generation method and electronic equipment - Google Patents

Graphical interface element feature generation method and electronic equipment

Info

Publication number
CN115982402A
Authority
CN
China
Prior art keywords
feature extraction
feature
graphical interface
extraction network
target element
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211667281.7A
Other languages
Chinese (zh)
Inventor
黄博
张泉
高磊
赵素馨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Hongji Information Technology Co Ltd
Original Assignee
Shanghai Hongji Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Hongji Information Technology Co Ltd filed Critical Shanghai Hongji Information Technology Co Ltd
Priority to CN202211667281.7A priority Critical patent/CN115982402A/en
Publication of CN115982402A publication Critical patent/CN115982402A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a graphical interface element feature generation method and electronic equipment, wherein the method comprises the following steps: determining a target element to be retrieved from a sample graphical interface image; generating, for the target element, a feature vector corresponding to each feature extraction network according to a plurality of feature extraction networks in a feature extraction network combination, the feature extraction network combination comprising a plurality of selectable feature extraction networks; and fusing the plurality of feature vectors of the target element to obtain a context retrieval feature. With this scheme, a context retrieval feature containing context information can be generated for the target element, facilitating accurate retrieval of the target element later.

Description

Graphical interface element feature generation method and electronic equipment
Technical Field
The present application relates to the field of computer vision technologies, and in particular, to a method for generating element features of a graphical interface and an electronic device.
Background
RPA (Robotic Process Automation) technology can simulate a human performing repetitive graphical interface operations, thereby reducing the manual operation cost for a user. In the process of simulating graphical interface operations, the element operated by the user on the graphical interface needs to be identified so that it can be operated on again automatically to trigger the corresponding function. A common technique relies on parsing the underlying layers of the operating system, but because different software is designed differently, such system-level parsing schemes cannot be unified and sometimes cannot be completed at all.
Disclosure of Invention
An object of the embodiments of the present application is to provide a method for generating element features of a graphical interface and an electronic device, which are used for performing accurate retrieval on an element by using context information of the element, so as to implement an RPA technique.
In one aspect, the present application provides a method for generating an element feature of a graphical interface, including:
determining a target element to be retrieved from the sample graphical interface image;
generating a feature vector corresponding to each feature extraction network for the target element respectively according to a plurality of feature extraction networks in the feature extraction network combination; wherein the feature extraction network combination comprises a plurality of selectable feature extraction networks;
and carrying out fusion processing on the plurality of feature vectors of the target element to obtain context retrieval features.
In an embodiment, the method further comprises:
respectively determining the similarity between the context retrieval feature of the target element and the context retrieval feature of each element in the designated graphical interface image;
and determining an element retrieval result of the target element in the specified graphical interface image according to the similarity.
In an embodiment, the determining, from the sample graphical interface image, a target element to be searched includes:
carrying out target detection on the sample graphical interface image to obtain a plurality of elements;
and displaying the retrieved multiple elements on the sample graphical interface image, and responding to a selection instruction to select at least one element from the multiple elements as a target element.
In an embodiment, before generating a feature vector corresponding to each feature extraction network for the target element according to a plurality of feature extraction networks in the feature extraction network combination, the method further includes:
in response to a model selection instruction, a number of feature extraction networks for extracting feature vectors are selected from the plurality of selectable feature extraction networks.
In an embodiment, the generating, for the target element, a feature vector corresponding to each feature extraction network according to a plurality of feature extraction networks in the feature extraction network combination includes:
according to the screenshot rule corresponding to each feature extraction network, intercepting a local image where the target element is located from the sample graphical interface image; wherein the screenshot rules corresponding to different feature extraction networks indicate different ranges of the intercepted image area;
and processing the corresponding local image according to each feature extraction network to obtain a feature vector corresponding to each feature extraction network.
In an embodiment, the intercepting, according to the screenshot rule corresponding to each feature extraction network, a local image where the target element is located from the sample graphical interface image includes:
intercepting, according to the screenshot rule corresponding to each feature extraction network and with the central point of the target element as the center, a local image corresponding to the screenshot rule from the sample graphical interface image.
In an embodiment, before the processing the local image corresponding to each feature extraction network according to each feature extraction network to obtain the feature vector corresponding to each feature extraction network, the method further includes:
and reducing the local image to a specified size.
In an embodiment, the fusing the feature vectors of the target element to obtain the context retrieval feature includes:
respectively carrying out L2 normalization processing on each feature vector of the target element to obtain a plurality of normalized feature vectors;
splicing the plurality of normalized feature vectors in the channel direction to obtain spliced feature vectors;
and carrying out nonlinear transformation on the spliced feature vectors through a nonlinear full-connection layer to obtain the context retrieval features.
In another aspect, the present application provides a training method for a feature extraction network combination, including:
respectively generating, for each element in each sample graphical interface image of a sample data set, a feature vector corresponding to each feature extraction network according to each feature extraction network of the feature extraction network combination; wherein elements in the sample graphical interface image are marked with corresponding specific category information;
performing fusion processing on the plurality of feature vectors of each element to obtain context retrieval features;
inputting the context retrieval characteristics of each element into a preset classifier to obtain a corresponding prediction category;
and adjusting the network parameters of the classifier and of each feature extraction network in the feature extraction network combination according to the differences between the prediction categories and the specific category information of the plurality of elements, to obtain the trained feature extraction network combination.
In one embodiment, before the fusion processing is performed on the plurality of feature vectors of each element to obtain the context retrieval feature, the method further includes:
randomly selecting a plurality of feature vectors of a plurality of elements, and setting each value of the selected feature vectors to be zero.
In another aspect, the present application provides an electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the above method for generating element features of a graphical interface or the above training method for a feature extraction network combination.
Furthermore, the present application provides a computer-readable storage medium, which stores a computer program executable by a processor to perform the above method for generating element features of a graphical interface or the above training method for a feature extraction network combination.
According to the above technical solution, a context retrieval feature can be generated for the target element by means of the plurality of feature extraction networks in the feature extraction network combination. The context retrieval feature includes the context information of the target element, which facilitates subsequent retrieval of the target element and thus improves retrieval accuracy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required to be used in the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic view of an application scenario of a method for generating element features of a graphical interface according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a method for generating element features of a graphical interface according to an embodiment of the present application;
fig. 4 is a flowchart illustrating a method for determining a target element according to an embodiment of the present application;
fig. 5 is a schematic flowchart of a method for extracting feature vectors according to an embodiment of the present application;
fig. 6 is a flowchart illustrating a feature vector fusion method according to an embodiment of the present application;
FIG. 7 is a schematic diagram illustrating the generation of context retrieval features according to an embodiment of the present application;
FIG. 8 is a flowchart illustrating an element retrieval method according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a sample graphical interface image provided in an embodiment of the present application;
FIG. 10 is a schematic diagram of a designated graphical interface image provided in accordance with an embodiment of the present application;
FIG. 11 is a schematic diagram of a designated graphical interface image provided in accordance with another embodiment of the present application;
fig. 12 is a schematic flowchart of a training method for a feature extraction network combination according to an embodiment of the present application;
fig. 13 is a block diagram of an element retrieving apparatus of a graphical interface according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
Like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
Robotic process automation technology can simulate the keyboard and mouse operations that staff perform on a computer in their daily work, and can replace humans in executing operations such as logging into a system, operating software, reading and writing data, downloading files and reading mails. Used as the virtual workforce of an enterprise, such automation robots free employees from repetitive, low-value work so that their energy can go into high-value-added work, helping the enterprise achieve digital and intelligent transformation while reducing cost and increasing efficiency.
An RPA robot is a software robot that takes over manual tasks in business processes and interacts with the front-end systems of a computer the way a human does. It can therefore be regarded as a software program robot running on a personal PC or a server that, by imitating the operations a user performs on the computer, automatically repeats activities such as retrieving mails, downloading attachments, logging into systems, and processing and analysing data, quickly, accurately and reliably. Like a traditional physical robot, it solves the speed and accuracy problems of human work through specifically set rules; but a traditional physical robot combines software and hardware and can only work with supporting software on specific hardware, whereas an RPA robot is purely software and can be deployed to any PC or server to complete the specified work as long as the corresponding software is installed.
That is, RPA is a way of performing business operations with "digital staff" instead of people, together with its related technology. In essence, RPA uses software automation technology to simulate a human operating objects such as systems, software, web pages and documents on a computer without human intervention, acquiring business information and executing business actions, and finally achieves automated process handling, saving labor cost and improving processing efficiency. In scenarios such as business function testing, a previously selected target element can be searched for on a designated graphical interface image through the RPA technology (the target element can be used to trigger a corresponding business function), so that the corresponding business is operated and the business function test is completed.
Fig. 1 is a schematic application scenario diagram of a method for generating element features of a graphical interface according to an embodiment of the present application. As shown in fig. 1, the application scenario includes a client 20 and a server 30; the client 20 may be a user terminal such as a host, a mobile phone, a tablet computer, and the like, and is configured to send a selection instruction for selecting a target element from the sample graphical interface image to the server 30; the server 30, which may be a server, a cluster of servers, or a cloud computing center, may determine a target element for which to generate the contextual search feature in response to the selection instruction.
As shown in fig. 2, the present embodiment provides an electronic apparatus 1 including: at least one processor 11 and a memory 12, one processor 11 being exemplified in fig. 2. The processor 11 and the memory 12 are connected by a bus 10, and the memory 12 stores instructions executable by the processor 11, and the instructions are executed by the processor 11, so that the electronic device 1 can execute all or part of the flow of the method in the embodiments described below. In an embodiment, the electronic device 1 may be the server 30 described above, and is configured to execute an element feature generation method of a graphical interface. In an embodiment, the electronic device 1 may be the server 30 described above, and is configured to perform a training method for feature extraction network combination.
The Memory 12 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, a magnetic disk or an optical disk.
The present application also provides a computer-readable storage medium, which stores a computer program executable by the processor 11 to perform the method for generating element features of a graphical interface or the method for training a combination of feature extraction networks provided in the present application.
Referring to fig. 3, a flowchart of a method for generating element features of a graphical interface according to an embodiment of the present disclosure is shown in fig. 3, where the method may include the following steps 310 to 350.
Step 310: and determining a target element to be retrieved from the sample graphical interface image.
Here, the sample graphical interface image is a graphical interface image for determining a target element to be subsequently retrieved. The server side can obtain the sample graphical interface image from the client side, or read the sample graphical interface image from a preset storage position.
For any target element, the target element may be a text element, for example, text such as the font name "SimSun (宋体)" or the font size "No. 5 (五号)" in the sample graphical interface; alternatively, the target element may be a graphic element, for example, a triangular drop-down menu icon, a maximize rectangular box, etc. in the sample graphical interface; or, the target element may be a combined element of a text element and a graphic element; furthermore, the target element may be any other element that may appear in a graphical interface.
In an embodiment, referring to fig. 4, a flowchart of a method for determining a target element provided in an embodiment of the present application is shown in fig. 4, where the method may include the following steps 311 to 312.
Step 311: and carrying out target detection on the sample graphical interface image to obtain a plurality of elements.
The server can input the sample graphical interface image into a trained target detection network. The target detection network can be obtained by training network models such as SSD (Single Shot MultiBox Detector), YOLO (You Only Look Once) or RCNN (Regions with CNN features), and is used for detecting text elements and graphic elements in an image. Target detection is performed on the sample graphical interface image through the target detection network to obtain a plurality of target detection results. Each target detection result corresponds to one element, and may include the category information of the element (text element or graphic element) and the position information of the element in the sample graphical interface image.
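As an illustration only, the detection results of step 311 might be organized on the server side as in the following Python sketch. The Element container, the detector callable and its result fields ("label", "box") are assumptions of this sketch, not details from the present application; the description above only requires that each result carry category information (text element or graphic element) and position information.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Element:
    category: str                           # "text" or "graphic"
    box: Tuple[int, int, int, int]          # (x_min, y_min, x_max, y_max) in the sample image

    @property
    def center(self) -> Tuple[int, int]:
        x0, y0, x1, y1 = self.box
        return (x0 + x1) // 2, (y0 + y1) // 2   # used later as the screenshot center

def detect_elements(detector, sample_image) -> List[Element]:
    """Run target detection (e.g. an SSD/YOLO/RCNN-style model) on the sample image."""
    results = detector(sample_image)        # assumed to return dicts with "label" and "box"
    return [Element(category=r["label"], box=r["box"]) for r in results]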
Step 312: and displaying the retrieved multiple elements on the sample graphical interface image, and responding to a selection instruction to select at least one element from the multiple elements as a target element.
After detecting the plurality of elements, the server can output and display the detected elements through the client. The server can output and display the sample graphical interface image, and display, according to the position information of each element, a rectangular frame bounding the position of that element in the sample graphical interface image. In one embodiment, the rectangular borders of text elements and graphic elements may be displayed in different colors or different line types (for example, one as a dashed box and the other as a solid box), so that the detected elements are displayed more intuitively.
After the plurality of elements are displayed, a user can select at least one element from the sample graphical interface image in a mouse click mode, a touch click mode and the like, and the element is a target element needing to be searched subsequently.
Step 320: generating a feature vector corresponding to each feature extraction network for the target element according to a plurality of feature extraction networks in the feature extraction network combination; wherein the feature extraction network combination comprises a plurality of selectable feature extraction networks.
The plurality of selectable feature extraction networks in the feature extraction network combination can be network models with the same structure or network models with different structures.
In an embodiment, the server may output, to the client, each feature extraction network in the feature extraction network combination, so that the user may select a plurality of feature extraction networks from the client in a manner of mouse click, keyboard manipulation, or the like. In order to guide the user to select the feature extraction network, the server may output and display a screenshot size corresponding to each feature extraction network, where the screenshot size represents a size of a local image captured with an element center point as a center. Illustratively, the size of the screenshot is expressed in terms of the number of pixels in the abscissa direction x the number of pixels in the ordinate direction, and may be 32 × 32, 64 × 64, 128 × 128, 512 × 512, 1024 × 1024, or the like. In this case, the user may select the feature extraction network as needed, so as to extract feature vectors from the local images under multiple coverage areas subsequently.
The client can respond to the selection operation of the user, generate a model selection instruction and send the model selection instruction to the server. The server can respond to the model selection instruction and select a plurality of feature extraction networks for extracting the feature vectors from a plurality of selectable feature extraction networks. Illustratively, there are 10 optional feature extraction networks in the feature extraction network combination, and after 4 feature extraction networks are selected in response to the model selection instruction, feature vectors of target elements are extracted by the 4 feature extraction networks in the subsequent element retrieval process.
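The bookkeeping for the model selection instruction could look like the Python sketch below. The registry layout (network name mapped to a backbone and its screenshot size) and the tiny placeholder backbone are assumptions for illustration; the application only requires that a number of networks be selected from the selectable combination in response to the instruction.

import torch.nn as nn

def make_backbone(out_dim: int) -> nn.Module:
    # Placeholder convolutional backbone standing in for a trained feature extraction network.
    return nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.Linear(16, out_dim))

SELECTABLE_NETWORKS = {                    # the feature extraction network combination
    "ctx_32":  (make_backbone(64),  32),   # (network, screenshot size in pixels per side)
    "ctx_64":  (make_backbone(128), 64),
    "ctx_128": (make_backbone(128), 128),
    "ctx_512": (make_backbone(256), 512),
}

def select_networks(model_selection_instruction):
    """Keep only the networks named in the client's model selection instruction."""
    return [SELECTABLE_NETWORKS[name] for name in model_selection_instruction]

# e.g. selected = select_networks(["ctx_64", "ctx_128", "ctx_512"])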
In an embodiment, referring to fig. 5, a flowchart of a method for extracting a feature vector provided in an embodiment of the present application is shown in fig. 5, where the method may include the following steps 321 to 322.
Step 321: according to the screenshot rule corresponding to each feature extraction network, intercepting a local image where the target element is located from the sample graphical interface image; wherein the screenshot rules corresponding to different feature extraction networks indicate different ranges of the intercepted image area.
Different capture rules indicate that the captured image regions differ in extent, so that the captured partial images contain different contextual information. Here, the context information refers to information of the surroundings (context) of the element in addition to information of the element itself. Such as: the target element in the sample graphical interface image is a search box, and after a local image containing the search box is intercepted by a screenshot rule, the local image comprises a text element 'start query' around the search box, so that the text element can be used as the context information of the search box.
For the screenshot rule indicating the smallest range of the intercepted image area, the intercepted local image may contain only the target element and no context information.
According to the screenshot rule corresponding to each selected feature extraction network, the server can intercept, with the element center point as the center, a local image of the screenshot size indicated by that rule. Here, the element center point may be determined according to the position information in the target detection result corresponding to the element; in other words, the center point may be the center point of the rectangular frame indicated by the position information.
Step 322: and processing the corresponding local image according to each feature extraction network to obtain a feature vector corresponding to each feature extraction network.
After the local images of different sizes are intercepted, the server can process, with each feature extraction network, the local image corresponding to that network. The feature extraction network can be obtained by training a convolutional neural network, and the feature vector can be obtained by processing the local image through convolution calculations. Here, the feature vector may be a floating-point feature vector; in other words, the values in the feature vector may be decimals.
Since the network structures of different feature extraction networks may differ, the dimensions of the plurality of feature vectors of the target element may also differ. For example, if the number of selected feature extraction networks is k, the dimensions of the corresponding output feature vectors are c1, c2, …, ck-1, ck.
In one embodiment, since the calculation amount of the feature extraction network is related to the size of the local image input thereto, before the local image is processed by the feature extraction network, the server may reduce the local image to a specified size in order to reduce the calculation amount. Here, the specified size may be configured as needed. For example, the specified size may be the same as the minimum screenshot size indicated by all of the screenshot rules.
After the server reduces the local image to the designated size, the reduced local image can be input into a feature extraction network, so that a feature vector is obtained. In this case, since the calculation amount is reduced, the generation efficiency of the feature vector can be improved.
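Steps 321 to 322, together with the size reduction just described, could be sketched as follows. The helper names, the use of Pillow for cropping and the specified size of 32 x 32 are assumptions of this sketch; the list of (network, screenshot size) pairs is, for example, the output of the selection sketch above.

import torch
from PIL import Image
from torchvision import transforms

SPECIFIED_SIZE = 32                         # assumed specified size (e.g. the smallest screenshot size)
to_tensor = transforms.ToTensor()

def crop_around_center(image: Image.Image, center, size: int) -> Image.Image:
    """Intercept a size x size local image centered on the target element."""
    cx, cy = center
    half = size // 2
    return image.crop((cx - half, cy - half, cx + half, cy + half))

def extract_vectors(image, center, selected_networks):
    """One feature vector per selected feature extraction network."""
    vectors = []
    for net, crop_size in selected_networks:
        local = crop_around_center(image, center, crop_size)
        local = local.resize((SPECIFIED_SIZE, SPECIFIED_SIZE))   # reduce to the specified size
        x = to_tensor(local).unsqueeze(0)                        # 1 x 3 x H x W
        with torch.no_grad():
            vectors.append(net(x).squeeze(0))                    # dimension c_i depends on the network
    return vectors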
Step 330: and carrying out fusion processing on a plurality of feature vectors of the target element to obtain the context retrieval features.
After obtaining a plurality of feature vectors generated by a plurality of feature extraction networks for target elements, the server can perform fusion processing on the plurality of feature vectors to obtain a unique feature vector as a context retrieval feature.
In an embodiment, referring to fig. 6, a flowchart of a feature vector fusion method provided in an embodiment of the present application is shown in fig. 6, and the method may include the following steps 331 to 333.
Step 331: and respectively carrying out L2 normalization processing on each feature vector of the target element to obtain a plurality of normalized feature vectors.
Since the magnitude of the feature vectors processed by different feature extraction networks may be different, this may affect subsequent retrieval performance. To solve this problem, the server may perform L2 normalization on each feature vector to obtain a normalized feature vector.
Step 332: and splicing the plurality of normalized feature vectors in the channel direction to obtain spliced feature vectors.
After each feature vector is normalized to obtain a plurality of normalized feature vectors, the normalized feature vectors can be spliced in the channel direction, thereby obtaining the spliced feature vector. Illustratively, if the dimensions of the feature vectors generated by the k feature extraction networks for the target element are c1, c2, …, ck-1, ck respectively, splicing yields a spliced feature vector of dimension c_all = c1 + c2 + … + ck-1 + ck.
Step 333: and carrying out nonlinear transformation on the spliced feature vectors through a nonlinear full-connection layer to obtain the context retrieval features.
The server side can perform nonlinear transformation on the spliced feature vectors through a nonlinear full connection layer, the dimension of the spliced feature vectors can be changed through the nonlinear transformation, and new feature vectors are obtained through the nonlinear transformation and serve as the context retrieval features of the target elements. The context retrieval features include feature information of the target element itself, and context information of the target element.
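Steps 331 to 333 could be realized, for example, by the following PyTorch sketch. The output dimension and the choice of ReLU as the nonlinearity are assumptions; the application only specifies L2 normalization, splicing in the channel direction and a nonlinear fully connected layer.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextFusion(nn.Module):
    def __init__(self, concat_dim: int, out_dim: int = 256):
        super().__init__()
        self.fc = nn.Linear(concat_dim, out_dim)     # the nonlinear fully connected layer

    def forward(self, vectors):
        normed = [F.normalize(v, p=2, dim=-1) for v in vectors]   # L2-normalize each feature vector
        spliced = torch.cat(normed, dim=-1)                       # splice in the channel direction
        return torch.relu(self.fc(spliced))                       # context retrieval feature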
Referring to fig. 7, which is a schematic diagram of generating a context retrieval feature according to an embodiment of the present application. As shown in fig. 7, after the target element in the graphical interface image is determined (the target element is indicated by an element border), local images are intercepted from the graphical interface image according to the screenshot rules corresponding to the different feature extraction networks, so that a plurality of local images are obtained. Each feature extraction network then performs feature extraction on its corresponding local image to obtain the corresponding feature vector. Each feature vector is subjected to L2 normalization, and the results are spliced in the channel dimension to obtain the spliced feature vector. Furthermore, the spliced feature vector is fused through a nonlinear fully connected layer to obtain the context retrieval feature.
By the above measures, a context retrieval feature can be generated for the target element by means of the plurality of feature extraction networks in the feature extraction network combination. The context retrieval feature includes the context information of the target element, which facilitates subsequent retrieval of the target element and improves retrieval accuracy.
In one embodiment, after obtaining the context retrieval characteristics of the target element, the target element may be retrieved. Referring to fig. 8, a flowchart of an element retrieval method according to an embodiment of the present application is shown, and as shown in fig. 8, the method may include the following steps 340 to 350.
Step 340: the similarity between the context retrieval characteristics of the target element and the context retrieval characteristics of each element in the designated graphical interface image is respectively determined.
Wherein the designated graphical interface image is a graphical interface image in which the target element is to be searched. The server can obtain at least one graphical interface image from the client again, or read at least one graphical interface image from the memory, as the designated graphical interface image. Alternatively, the server may use the sample graphical interface image itself as the designated graphical interface image. Illustratively, a user uploads an interface image of a document editing web page through the client as the sample graphical interface image, and selects the icon of the font-setting pull-down menu in it as the target element. The server can subsequently acquire the designated graphical interface image by taking a screenshot of the document editing web page in a specific web page, so as to retrieve the icon of the font-setting pull-down menu in that document editing web page.
After determining the context retrieval feature of the target element, the server may calculate the similarity between that context retrieval feature and the context retrieval feature of each element in the designated graphical interface image. Here, the elements in the designated graphical interface image may be obtained by performing target detection on the designated graphical interface image, and each detected element can be indicated according to the position information output by the target detection. The context retrieval feature of each element is obtained by extracting feature vectors for that element through the plurality of feature extraction networks selected by the user and fusing the feature vectors of the element, as specifically described in the related description above, which is not repeated here. When calculating the similarity, a cosine distance, a Euclidean distance or a Manhattan distance may be used; preferably, the cosine distance is selected.
Step 350: and determining an element retrieval result of the target element in the designated graphical interface image according to the similarity.
After calculating the similarity between the contextual retrieval characteristics of the target element and the contextual retrieval characteristics of the plurality of elements in the designated graphical interface image, the server can determine an element retrieval result according to the plurality of similarities.
In one case, the server may select the N elements whose context retrieval features have the top-N similarity with that of the target element as the element retrieval result of the target element in the designated graphical interface image. Here, N is an integer greater than 0 and can be set as required. The server may sort the similarity between each element and the target element from large to small, so as to select the elements whose similarity ranks in the top N (for example, the top three) as the element retrieval result. In other words, these N elements are the target element retrieved in the designated graphical interface image.
In another case, the server may select a plurality of elements, as the element search result of the target element in the designated graphical interface image, for which the similarity between the context search feature and the target element is greater than the preset similarity threshold. Here, the similarity threshold may be pre-configured empirically, illustratively 80%.
In another case, the server may select, as an element search result of the target element in the designated graphical interface image, a number of elements whose contextual search features have a similarity with the target element ranked in top N and whose similarity is greater than a similarity threshold. In this case, the server may sort the similarity between each element and the target element from large to small, select the element with the similarity ranked at the top N, check whether the similarity between the N elements and the target element is greater than a similarity threshold, and use a plurality of elements with the similarity greater than the similarity threshold as the element retrieval result. In this case, the selection of the correct element search result can be ensured by performing the selection under the dual conditions.
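The "dual condition" case above could be sketched as follows; the values of N and the similarity threshold are configurable assumptions, and cosine similarity is used as suggested above.

import torch
import torch.nn.functional as F

def retrieve(target_feature, element_features, top_n: int = 3, threshold: float = 0.8):
    """Return indices of candidate elements in the designated graphical interface image."""
    sims = torch.stack([F.cosine_similarity(target_feature, f, dim=0)
                        for f in element_features])
    order = torch.argsort(sims, descending=True)[:top_n]    # top-N by similarity
    return [int(i) for i in order if sims[i] > threshold]   # and above the similarity threshold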
By the measures, a plurality of feature vectors can be obtained after the feature extraction is carried out on the target element according to a plurality of selected feature extraction networks in the feature extraction network combination, and the plurality of feature vectors comprise the context information, so that the context retrieval feature comprising the context information and the element self information is obtained through the fusion of the plurality of feature vectors, and the target element can be accurately retrieved in the appointed graphical interface image by comparing with each element in the appointed graphical interface image according to the context retrieval feature.
Referring to fig. 9, which is a schematic diagram of a sample graphical interface image provided in an embodiment of the present application, as shown in fig. 9, the sample graphical interface image is an image of a document editing menu interface, and a target element is a triangular icon corresponding to an underlined selectable item defined by a rectangular frame.
Referring to fig. 10, a schematic diagram of a designated graphical interface image according to an embodiment of the present application is provided, where the designated graphical interface image in fig. 10 is an image of a document editing menu interface. Referring to fig. 11, a schematic diagram of a designated graphical interface image according to another embodiment of the present application is provided, where the designated graphical interface image in fig. 11 is also an image of a document editing menu interface.
After the context retrieval features are generated for the target element and each element in fig. 10 and fig. 11, comparison is performed, and it is determined that the similarity between the context retrieval feature of the target element and the context retrieval feature in the rectangular frame in fig. 10 is 50, and the similarity between the context retrieval feature of the target element and the context retrieval feature in the rectangular frame in fig. 11 is 90. The similarity threshold is 85, and therefore, the triangular icon inside the rectangular frame in fig. 11 is the element search result of the target element in the designated graphical interface image.
In an embodiment, referring to fig. 12, a flowchart of a training method for a feature extraction network combination provided in an embodiment of the present application is shown in fig. 12, where the method may include the following steps 1210 to 1240.
Step 1210: respectively generating, for each element in each sample graphical interface image of the sample data set, a feature vector corresponding to each feature extraction network according to each feature extraction network of the feature extraction network combination; wherein the elements in the sample graphical interface image are marked with corresponding specific category information.
The sample data set includes a large number of sample graphical interface images, which are graphical interface images used as training samples. For example, the sample graphical interface image may be the same or similar style as the sample graphical interface image of the subsequent application stage, the designated graphical interface image, or a graphical interface image belonging to the same business system.
Each element within each sample graphical interface image is labeled with corresponding specific category information. Here, the specific category information is not the aforementioned text element or graphic element, but more specific category information. Illustratively, if the sample graphical interface image contains software icons of application software such as a Taobao icon, a Baidu icon and a WeChat icon, the specific category information of these software icons is "Taobao icon", "Baidu icon" and "WeChat icon" respectively.
For each element of each sample graphical interface image, after target detection, a plurality of local images can be intercepted for any element by using the screenshot rule corresponding to each feature extraction network, and then the corresponding local images are processed by each feature extraction network respectively, so that feature vectors are output. Thus, multiple feature vectors may be generated for each element. Here, the number of feature vectors corresponding to each element is the same as the total number of feature extraction networks in the feature extraction network combination.
Step 1220: and performing fusion processing on the plurality of feature vectors of each element to obtain the context retrieval features.
For a plurality of feature vectors of each element, the server may perform L2 normalization processing on each feature vector, respectively, to obtain a plurality of normalized feature vectors. And splicing the plurality of normalized feature vectors on the channel dimension to obtain spliced feature vectors. Furthermore, the spliced feature vectors are subjected to nonlinear transformation through a nonlinear full-connection layer, so that the context retrieval features of the elements are obtained.
In an embodiment, in an actual execution flow of the element retrieval, a user may not select all feature extraction networks in the feature extraction network combination, and in order to simulate an actual execution scene, before splicing a plurality of feature vectors of an element, the server may randomly select a plurality of feature vectors of the elements, and set each value of the selected feature vectors to zero. The server may randomly select a feature vector from all feature vectors of all elements according to a preset ratio (e.g., 5%), and set a value of the selected feature vector to zero.
For any element, if the element corresponds to n feature vectors, and one of the feature vectors is selected, the value of the feature vector is set to zero, and then the processing flow of splicing and nonlinear transformation can be continuously performed on the feature vector and other n-1 feature vectors subjected to L2 normalization, so as to obtain the context retrieval feature of the element.
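The random zeroing step might be implemented as in the sketch below, where the 5% ratio mentioned above is kept as the default; the data layout (one list of per-network feature vectors per element) is an assumption of this sketch.

import random

def randomly_zero(per_element_vectors, ratio: float = 0.05):
    """per_element_vectors: one list of feature-vector tensors per element."""
    flat = [(i, j) for i, vecs in enumerate(per_element_vectors)
                   for j in range(len(vecs))]
    for i, j in random.sample(flat, k=max(1, int(len(flat) * ratio))):
        per_element_vectors[i][j] = per_element_vectors[i][j] * 0.0   # set every value to zero
    return per_element_vectors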
Step 1230: and inputting the context retrieval characteristics of each element into a preset classifier to obtain a corresponding prediction class.
The predicted class is the specific class predicted by the classifier in the training process. The classifier may be preconfigured with a total number of categories output, which may be equivalent to a total number of categories of elements in the sample graphical interface image. The classifier may be implemented by a fully connected layer, a Softmax function, etc.
After obtaining the context retrieval features of each element, the server may input the context retrieval features of the element to the classifier, so as to obtain the prediction category output by the classifier.
Step 1240: adjusting the network parameters of the classifier and of each feature extraction network in the feature extraction network combination according to the differences between the prediction categories and the specific category information of the plurality of elements, to obtain the trained feature extraction network combination.
After the prediction category of each element is obtained, the difference between the prediction category and the specific category information corresponding to that element can be evaluated with a cross-entropy loss function, so that the network parameters of the classifier, the nonlinear fully connected layer and each feature extraction network in the feature extraction network combination are adjusted in an end-to-end training manner. After the network parameters are adjusted, the process may return to step 1210 for the next round of iterative training. After repeated iterations, when the cross-entropy loss function becomes stable, or the number of training iterations reaches a preset threshold, each feature extraction network in the feature extraction network combination can be considered to have converged, and the trained feature extraction network combination is obtained.
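An end-to-end training loop along the lines of steps 1210 to 1240 is sketched below. The optimizer, learning rate and batch format are assumptions, the classifier is a single linear layer (cross-entropy applies Softmax internally), the fusion module is assumed to be like the ContextFusion sketch above, and the random zeroing step is omitted for brevity.

import torch
import torch.nn as nn

def train(backbones, fusion, num_classes, batches, epochs: int = 10):
    """batches yields (crops, labels): crops holds, per element, one local-image tensor
    per backbone; labels holds the specific-category index of each element."""
    classifier = nn.Linear(fusion.fc.out_features, num_classes)
    params = (list(classifier.parameters()) + list(fusion.parameters())
              + [p for net in backbones for p in net.parameters()])
    optimizer = torch.optim.Adam(params, lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for crops, labels in batches:
            features = []
            for element_crops in crops:
                vecs = [net(c.unsqueeze(0)).squeeze(0)
                        for net, c in zip(backbones, element_crops)]
                features.append(fusion(vecs))                  # context retrieval feature per element
            loss = loss_fn(classifier(torch.stack(features)), labels)
            optimizer.zero_grad()
            loss.backward()       # end-to-end: classifier, fusion layer and every feature extraction network
            optimizer.step()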
By the above measures, after the feature extraction network combination is obtained by training, a plurality of feature extraction networks can be selected from the feature extraction networks for executing the element retrieval method.
Fig. 13 is a block diagram of an apparatus for retrieving elements of a graphical interface according to an embodiment of the present invention, as shown in fig. 13, the apparatus may include:
a determining module 1310, configured to determine a target element to be retrieved from the sample graphical interface image;
a generating module 1320, configured to generate, according to a plurality of feature extraction networks in the feature extraction network combination, a feature vector corresponding to each feature extraction network for the target element; wherein the feature extraction network combination comprises a plurality of selectable feature extraction networks;
a fusion module 1330 configured to perform fusion processing on the feature vectors of the target element to obtain a context retrieval feature.
The implementation processes of the functions and actions of the modules in the device are specifically described in the implementation processes of the corresponding steps in the element feature generation method of the graphical interface, and are not described herein again.
In the embodiments provided in the present application, the disclosed apparatus and method can also be implemented in other ways. The apparatus embodiments described above are merely illustrative and, for example, the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist alone, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

Claims (11)

1. A method for generating element characteristics of a graphical interface is characterized by comprising the following steps:
determining a target element to be retrieved from the sample graphical interface image;
generating a feature vector corresponding to each feature extraction network for the target element according to a plurality of feature extraction networks in the feature extraction network combination; wherein the feature extraction network combination comprises a plurality of selectable feature extraction networks;
and carrying out fusion processing on the plurality of feature vectors of the target element to obtain context retrieval features.
2. The method of claim 1, further comprising:
respectively determining the similarity between the context retrieval characteristics of the target element and the context retrieval characteristics of each element in the designated graphical interface image;
and determining an element retrieval result of the target element in the specified graphical interface image according to the similarity.
3. The method of claim 1, wherein determining the target element to be searched from the sample graphical interface image comprises:
carrying out target detection on the sample graphical interface image to obtain a plurality of elements;
and displaying the retrieved multiple elements on the sample graphical interface image, and responding to a selection instruction to select at least one element from the multiple elements as a target element.
4. The method of claim 1, wherein before generating the feature vector corresponding to each feature extraction network for the target element according to a plurality of feature extraction networks in the feature extraction network combination, the method further comprises:
in response to a model selection instruction, selecting a number of feature extraction networks for extracting feature vectors from the plurality of selectable feature extraction networks.
5. The method of claim 1, wherein the generating a feature vector corresponding to each feature extraction network for the target element according to a plurality of feature extraction networks in the feature extraction network combination comprises:
according to the screenshot rule corresponding to each feature extraction network, intercepting a local image where the target element is located from the sample graphical interface image; wherein the screenshot rules corresponding to different feature extraction networks indicate different ranges of the intercepted image area;
and processing the corresponding local image according to each feature extraction network to obtain a feature vector corresponding to each feature extraction network.
6. The method of claim 5, wherein the extracting a partial image of the target element from the sample graphical interface image according to the screenshot rule corresponding to each feature extraction network comprises:
intercepting, according to the screenshot rule corresponding to each feature extraction network and with the central point of the target element as the center, a local image corresponding to the screenshot rule from the sample graphical interface image.
7. The method of claim 5, wherein before the processing the local image corresponding to each feature extraction network to obtain the feature vector corresponding to the local image according to each feature extraction network, the method further comprises:
and reducing the local image to a specified size.
8. The method according to claim 1, wherein the fusing the feature vectors of the target element to obtain the context retrieval feature comprises:
respectively carrying out L2 normalization processing on each feature vector of the target element to obtain a plurality of normalized feature vectors;
splicing the plurality of normalized feature vectors in the channel direction to obtain spliced feature vectors;
and carrying out nonlinear transformation on the spliced feature vectors through a nonlinear full-connection layer to obtain the context retrieval features.
9. A training method for a feature extraction network combination is characterized by comprising the following steps:
respectively generating, for each element in each sample graphical interface image of a sample data set, a feature vector corresponding to each feature extraction network according to each feature extraction network of the feature extraction network combination; wherein elements in the sample graphical interface image are marked with corresponding specific category information;
performing fusion processing on a plurality of feature vectors of each element to obtain context retrieval features;
inputting the context retrieval characteristics of each element into a preset classifier to obtain a corresponding prediction category;
and adjusting the network parameters of the classifier and of each feature extraction network in the feature extraction network combination according to the differences between the prediction categories and the specific category information of the plurality of elements, to obtain the trained feature extraction network combination.
10. The method of claim 9, wherein before the fusing the plurality of feature vectors for each element to obtain the context-search feature, the method further comprises:
randomly selecting a plurality of feature vectors of a plurality of elements, and setting each value of the selected feature vectors to zero.
11. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method of generating element features of a graphical interface of any one of claims 1-8 or the method of training a combination of feature extraction networks of any one of claims 9-10.
CN202211667281.7A 2022-12-23 2022-12-23 Graphical interface element feature generation method and electronic equipment Pending CN115982402A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211667281.7A CN115982402A (en) 2022-12-23 2022-12-23 Graphical interface element feature generation method and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211667281.7A CN115982402A (en) 2022-12-23 2022-12-23 Graphical interface element feature generation method and electronic equipment

Publications (1)

Publication Number Publication Date
CN115982402A true CN115982402A (en) 2023-04-18

Family

ID=85957459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211667281.7A Pending CN115982402A (en) 2022-12-23 2022-12-23 Graphical interface element feature generation method and electronic equipment

Country Status (1)

Country Link
CN (1) CN115982402A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination