CN117032875B - Associated fragmented layer merging method and device based on multi-modal graph neural network - Google Patents

Associated fragmented layer merging method and device based on multi-modal graph neural network

Info

Publication number
CN117032875B
Authority
CN
China
Prior art keywords
layer
layers
graph
gui
merging
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311298208.1A
Other languages
Chinese (zh)
Other versions
CN117032875A (en)
Inventor
陈柳青
李佳智
周婷婷
孙凌云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN202311298208.1A
Publication of CN117032875A
Application granted
Publication of CN117032875B
Legal status: Active


Classifications

    • G06F 9/451 — Execution arrangements for user interfaces
    • G06N 3/042 — Knowledge-based neural networks; logical representations of neural networks
    • G06N 3/0455 — Auto-encoder networks; encoder-decoder networks
    • G06N 3/0464 — Convolutional networks [CNN, ConvNet]
    • G06N 3/08 — Learning methods
    • G06V 10/762 — Image or video recognition or understanding using pattern recognition or machine learning, using clustering, e.g. of similar faces in social networks
    • G06V 10/764 — Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/82 — Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for merging associated fragmented layers based on a multi-modal graph neural network, comprising the following steps: acquiring an original GUI design draft file; traversing all layers in descending order of layer size to construct the corresponding GUI layer view tree, and pairwise connecting layers at the same tree level to obtain a GUI graph structure expression; constructing node embedding vectors and edge embedding vectors for the layer at each node of the GUI graph structure; combining the graph structures, the layers, and the feature vectors into a dataset; constructing a graph neural network; training the graph neural network on the dataset to obtain a classification and recognition model; and inputting the original GUI design draft file to be optimized into the classification and recognition model to obtain the corresponding prediction result, from which a high-quality GUI design draft is generated. The invention also provides a device for merging associated fragmented layers. The proposed method improves the accuracy of fragmented layer merging, thereby yielding high-quality design drafts that meet industrial standards.

Description

Associated fragmented layer merging method and device based on multi-modal graph neural network
Technical Field
The invention belongs to the field of interface (UI) design, and particularly relates to a method and device for merging associated fragmented layers based on a multi-modal graph neural network.
Background
A graphical user interface (GUI) is the bridge through which a software application communicates with its users. Excellent GUI design makes software more efficient and convenient to use, and plays an important role in promoting software and attracting customers. However, GUI development requires many front-end developers, and complex, varied UI layouts together with repetitive UI view development greatly reduce development speed and increase development cost. To assist front-end developers, prior research has applied machine learning to generate front-end code from UI images; however, the usability and maintainability of code generated by such models from UI images are poor and often fall short of industrial standards.
At present, methods that combine design-draft meta-information with UI design-draft images have been proposed to ensure the reusability of the generated code. In the actual UI design process, however, designers focus on the aesthetics of the UI and, in pursuit of visual effect, often ignore design specifications, which degrades the quality of front-end code generated from design-draft meta-information; requiring designers to follow specifications strictly would, in turn, greatly increase their workload.
Patent document CN116450074A discloses an image display method and apparatus, the method comprising: receiving an image display request from a user; sending a first synchronization instruction to a server; and, upon receiving the first synchronization response returned by the server in response to the first synchronization instruction, sending an image display instruction to the server, wherein the image display instruction carries image data of an image to be displayed, the image to be displayed being one of a plurality of images generated by an image rendering model during rendering; the image display instruction instructs the server to invoke a graphical user interface (GUI) to display the image to the user.
Patent document CN115291864A discloses a fragmented layer detection method based on a graph neural network, comprising the following steps: step 1, generating, from the layer information of a UI design draft, a tree-shaped undirected graph reflecting layer containment relations together with initial layer feature vectors; step 2, inputting the tree-shaped undirected graph and the initial feature vectors obtained in step 1 into a pre-built graph neural network model to obtain fused layer feature vectors; step 3, inputting the fused feature vectors obtained in step 2 and the corresponding layers into a multi-layer perceptron classification model and outputting a layer classification result comprising a fragmented layer set and a non-fragmented layer set; and step 4, clustering the fragmented layer set obtained in step 3, then grouping and merging the clustering results to obtain a high-quality UI design draft.
Disclosure of Invention
The main purpose of the invention is to provide a method and device for merging associated fragmented layers based on a multi-modal graph neural network; the method improves the accuracy of fragmented layer merging and thereby yields high-quality design drafts that meet industrial standards.
To achieve the first object of the invention, an associated fragmented layer merging method based on a multi-modal graph neural network is provided, comprising the following steps:
acquiring an original GUI design draft file, which comprises the coordinate position and size, image information, and category of each layer;
traversing all layers in descending order of layer size to construct the corresponding GUI layer view tree, and pairwise connecting layers at the same tree level to obtain a GUI graph structure expression;
for the layer at each node of the GUI graph structure, converting the layer's coordinate position and size into a high-dimensional vector with a high-frequency function, and embedding this high-dimensional vector, the visual feature vector corresponding to the image information, and the category embedding vector corresponding to the category into the same space through parameter matrices and adding them to obtain the corresponding node embedding vector, while generating the corresponding edge embedding vector from the high-dimensional encoding of the coordinate difference between the two layers of each edge;
labeling each acquired layer with whether it is a fragmented layer and with its layer merging region, and forming a dataset from the GUI graph structures, the layers, the corresponding node and edge embedding vectors, and the labels;
constructing a graph neural network comprising a feature extraction module, an analysis module, and a prediction module, wherein the feature extraction module extracts the node and edge embedding vectors of each layer in the original GUI design draft file, the analysis module updates layer relations according to the extracted node and edge embedding vectors to obtain the corresponding fused node embedding vectors, and the prediction module outputs, from the fused node embedding vectors, a prediction result comprising a classification of whether each layer is fragmented, the layer merging region in which the layer lies, and a confidence score between the layer's position and the layer merging region;
training the graph neural network on the dataset to obtain a classification and recognition model for classifying layers and locating layer merging regions;
and inputting the original GUI design draft file to be optimized into the classification and recognition model to obtain the corresponding prediction result, clustering the fragmented layers of similar layer merging regions in the prediction result with a bounding-box merging method, and grouping and merging the clustering results to generate a high-quality GUI design draft.
The invention uses the graph neural network model to refine and update the representation vector of each layer, so that a classification branch predicts whether a layer is fragmented and a localization branch regresses a bounding box for each fragmented layer; an improved target-box merging algorithm then produces the bounding boxes representing layer merging and grouping boundaries, thereby yielding high-quality GUI design drafts that meet industrial standards.
Specifically, the GUI graph structure expression is acquired as follows:
sorting layers in descending order of the rectangle area corresponding to the layer size to obtain the corresponding layer list;
traversing the layer list in order and assigning the current layer, as a child node, to the last traversed layer that contains it and has the smallest area, so as to obtain the corresponding view tree;
and removing the edges between each node of the view tree and its child nodes while pairwise connecting nodes at the same tree level, the connections serving as edges between the layers represented by the nodes, so as to reconstruct the GUI graph structure expression.
Specifically, the visual feature vectors are obtained by extracting the image information with a ResNet-50 model as the backbone network.
Specifically, the update performed by the analysis module at the l-th graph neural layer can be written as:

$$\hat{X}^{l}_{M},\; E^{l+1} = \mathrm{MPNN}^{l}\!\left(X^{l}, E^{l}\right), \qquad \hat{X}^{l}_{T} = \mathrm{Self\text{-}attn}^{l}\!\left(X^{l}\right),$$
$$X^{l+1}_{M} = \mathrm{LayerNorm}\!\left(X^{l} + \mathrm{Dropout}\!\left(\hat{X}^{l}_{M}\right)\right), \qquad X^{l+1}_{T} = \mathrm{LayerNorm}\!\left(X^{l} + \mathrm{Dropout}\!\left(\hat{X}^{l}_{T}\right)\right),$$
$$X^{l+1} = \mathrm{MLP}\!\left(X^{l+1}_{M} + X^{l+1}_{T}\right),$$

where X denotes the node feature matrix, E the edge feature matrix, l the l-th graph neural layer, MPNN a graph message-passing function, Self-attn a self-attention function, LayerNorm a regularization function, Dropout a neuron-deactivation function, and MLP a multi-layer perceptron network.
Specifically, during training the parameters of the graph neural network are updated with a multi-part loss function, which can be written as:

$$\mathcal{L} = \lambda_{1}\,\mathcal{L}_{\mathrm{cls}}(p, y) + \lambda_{2}\,\mathcal{L}_{\mathrm{CIoU}}\!\left(b, b^{gt}\right) + \lambda_{3}\left(\hat{c} - \mathrm{IoU}\!\left(b, b^{gt}\right)\right)^{2}$$

where p denotes the classification probability, y the layer classification label, b the merging region in the prediction result, b^{gt} the merging-region label of the dataset, ĉ the confidence score in the prediction result, and IoU the intersection-over-union function of two merging regions, i.e. the area of their intersection divided by the area of their union.
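As a concrete illustration of how such a multi-part loss can be assembled, the following Python/PyTorch sketch uses cross-entropy for the classification term, a simplified 1 − IoU stand-in for the CIoU term, and restricts the box and confidence terms to fragmented layers; these choices, the (x, y, w, h) box format, and all function names are assumptions for illustration, not taken from the patent.

```python
import torch
import torch.nn.functional as F

def iou(b1: torch.Tensor, b2: torch.Tensor) -> torch.Tensor:
    """IoU of two (N, 4) box tensors in (x, y, w, h) format."""
    x1 = torch.max(b1[:, 0], b2[:, 0])
    y1 = torch.max(b1[:, 1], b2[:, 1])
    x2 = torch.min(b1[:, 0] + b1[:, 2], b2[:, 0] + b2[:, 2])
    y2 = torch.min(b1[:, 1] + b1[:, 3], b2[:, 1] + b2[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    union = b1[:, 2] * b1[:, 3] + b2[:, 2] * b2[:, 3] - inter
    return inter / union.clamp(min=1e-6)

def multitask_loss(logits, labels, boxes, boxes_gt, conf):
    """logits: (N, 2); labels: (N,) with 1 = fragmented (assumed);
    boxes/boxes_gt: (N, 4); conf: (N,) confidence scores."""
    frag = labels == 1                      # box terms only on fragmented layers
    l_cls = F.cross_entropy(logits, labels)
    l_box = (1.0 - iou(boxes[frag], boxes_gt[frag])).mean()
    # the confidence head regresses the IoU between prediction and ground truth
    l_conf = F.mse_loss(conf[frag], iou(boxes[frag], boxes_gt[frag]).detach())
    # weights 1 / 10 / 5 follow the values reported in the detailed description
    return 1.0 * l_cls + 10.0 * l_box + 5.0 * l_conf
```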
Specifically, the bounding-box merging method comprises optimizing similar layer merging regions and clustering fragmented layers.
Specifically, the process of optimizing similar layer merging regions is as follows:
selecting the layer merging region with the highest confidence, computing the IoU values between the other layer merging regions and it, selecting the layer merging regions whose IoU exceeds a threshold, and averaging these layer merging regions to obtain an optimized layer merging region;
and, after removing the optimized layer merging regions, repeating the above operation until all layer merging regions have been optimized.
Specifically, the process of clustering fragmented layers is as follows:
selecting, among the optimized merging regions, the one with the smallest area, computing the proportion of each remaining fragmented layer's area that falls inside this region, and selecting the fragmented layers whose proportion exceeds a threshold as a clustering result to be merged and grouped;
and, after removing the merged and grouped fragmented layers, repeating the above operation until all fragmented layers have been processed.
To achieve the second object of the invention, an associated fragmented layer merging device is provided, comprising a memory and one or more processors, wherein the memory stores executable code and the one or more processors, when executing the executable code, implement the above associated fragmented layer merging method based on the multi-modal graph neural network, with the specific step of:
inputting the original GUI design draft file to be optimized into the classification and recognition model for analysis to obtain an optimized, high-quality GUI design draft.
Compared with the prior art, the invention has the following beneficial effects:
The layers are expressed as a graph structure from which the corresponding representation vectors are constructed; these vectors are updated and used for prediction by the graph neural network model, and the fragmented layers are grouped and merged based on the prediction result, thereby obtaining a high-quality GUI design draft that meets industrial standards.
Drawings
FIG. 1 is a flowchart of the associated fragmented layer merging method provided in this embodiment;
FIG. 2 is a flowchart of acquiring the GUI graph structure expression provided in this embodiment;
FIG. 3 is a schematic structural diagram of the graph neural network provided in this embodiment;
FIG. 4 is a flowchart of the prediction module provided in this embodiment;
FIG. 5 is a flowchart of optimizing similar layer merging regions provided in this embodiment;
FIG. 6 is a flowchart of the fragmented layer clustering provided in this embodiment.
Detailed Description
Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
As shown in FIG. 1, the associated fragmented layer merging method provided in this embodiment comprises the following steps:
First, an original GUI design draft file is acquired, including the coordinate position and size, image information, and category of each layer. More specifically, a GUI prototype design draft is composed of layers that draw the various GUI components; this embodiment parses the GUI design draft into a layer list containing the multi-modal information $\{((x, y, w, h), \text{img-tensor}, \text{category})\}_{i=1}^{n}$, where (x, y, w, h) is the position and size information of the layer, img-tensor is the layer's image information at a resolution of 64×64, and category denotes the layer's category (e.g. text, vector path, etc.).
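For illustration, the parsed layer list could be modeled as below; the class and field names (Layer, bbox, image, category) and the example category strings are hypothetical, not taken from the patent.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Layer:
    bbox: tuple          # (x, y, w, h) position and size of the layer
    image: np.ndarray    # rendered layer image, 64x64 RGB
    category: str        # e.g. "text", "rectangle", "oval", "vector-path"

def parse_design_draft(raw_layers) -> list:
    """Turn raw design-draft records into the multi-modal layer list."""
    return [
        Layer(bbox=(r["x"], r["y"], r["w"], r["h"]),
              image=r["img_tensor"],
              category=r["category"])
        for r in raw_layers
    ]
```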
Next, all layers are traversed in descending order of layer size to construct the corresponding GUI layer view tree, and layers at the same tree level are pairwise connected to obtain the GUI graph structure expression.
More specifically, as shown in FIG. 2, given a layer list, the layers are first sorted in descending order of their rectangle areas; the sorted layer list is traversed in order, and the current layer is assigned as a child node to the last traversed layer that contains it and has the smallest area. Based on the resulting view tree, the edges between each node and its child nodes are removed, while layers at the same tree level are pairwise connected, reconstructing the GUI graph structure expression.
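A sketch of this construction follows, assuming "same tree level" means equal depth in the view tree and that containment is plain rectangle containment; the helper names are illustrative.

```python
def contains(outer, inner):
    """Rectangle containment test on (x, y, w, h) bounding boxes."""
    ox, oy, ow, oh = outer.bbox
    ix, iy, iw, ih = inner.bbox
    return ox <= ix and oy <= iy and ix + iw <= ox + ow and iy + ih <= oy + oh

def build_gui_graph(layers):
    # sort indices by rectangle area, descending
    order = sorted(range(len(layers)),
                   key=lambda i: layers[i].bbox[2] * layers[i].bbox[3],
                   reverse=True)
    parent, seen = {i: None for i in order}, []
    for i in order:
        # smallest already-traversed layer that contains the current one
        cands = [j for j in seen if contains(layers[j], layers[i])]
        if cands:
            parent[i] = min(cands, key=lambda j: layers[j].bbox[2] * layers[j].bbox[3])
        seen.append(i)
    depth = {}
    for i in order:                          # parents precede children in `order`
        depth[i] = 0 if parent[i] is None else depth[parent[i]] + 1
    # drop parent-child edges; pairwise-connect layers at the same level
    edges = [(i, j) for i in order for j in order
             if i < j and depth[i] == depth[j]]
    return edges, parent
```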
For the layer at each node of the GUI graph structure expression, the layer's coordinate position and size are converted into a high-dimensional vector with a high-frequency function; this high-dimensional vector, the visual feature vector corresponding to the image information, and the category embedding vector corresponding to the category are embedded into the same space through parameter matrices and added to obtain the corresponding node embedding vector, while the corresponding edge embedding vector is generated from the high-dimensional encoding of the coordinate difference between the two layers of each edge.
More specifically, visual information is critical for fragmented layer grouping; taking time and space complexity into account, the 64×64 layer images are encoded into visual feature vectors using a pre-trained ResNet-50 model as the backbone network.
In GUI prototyping software, each GUI layer has its own category to facilitate the designer's creation of geometric shapes (e.g. ellipses, rectangles) or other GUI elements (e.g. text, images); there are 13 layer categories in total, so an embedding matrix M^{13×d} is provided, each row of which is the embedding vector of the corresponding category. For the size and position of a layer, the 4-dimensional coordinates (x, y, w, h) are first converted into a high-dimensional vector by the following high-frequency function, applied to each coordinate t:

$$\gamma(t) = \left(\sin\!\left(2^{0}\pi t\right), \cos\!\left(2^{0}\pi t\right), \ldots, \sin\!\left(2^{L-1}\pi t\right), \cos\!\left(2^{L-1}\pi t\right)\right)$$

A parameter matrix M^{8L×d} then embeds the resulting 8L-dimensional vector into the d-dimensional space. Although many studies have designed elaborate multi-modal feature fusion strategies, a simple but experimentally effective strategy is adopted here: the feature expression vectors are directly added.
A feature vector is also encoded for each edge: for the edge connecting nodes v_i and v_j, the coordinate differences (Δx, Δy, Δw, Δh) are encoded into an edge feature vector using the high-frequency function above. Since layers within a merging group tend to be close to each other, the distance between layers helps solve the task; it is effectively prior information for grouping fragmented layers.
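A minimal sketch of the node and edge embedding described above, under the assumption that the high-frequency function is the sinusoidal encoding γ given earlier and that the visual feature is ResNet-50's 2048-dimensional pooled output; module names and exact shapes are illustrative.

```python
import math
import torch
import torch.nn as nn

def high_freq(coords: torch.Tensor, L: int = 9) -> torch.Tensor:
    """(N, 4) coords -> (N, 8*L) high-frequency features."""
    freqs = 2.0 ** torch.arange(L) * math.pi          # (L,)
    angles = coords.unsqueeze(-1) * freqs             # (N, 4, L)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return enc.flatten(1)                             # (N, 8*L)

class NodeEmbedding(nn.Module):
    def __init__(self, d=128, n_categories=13, L=9):
        super().__init__()
        self.pos_proj = nn.Linear(8 * L, d)             # parameter matrix M^{8L x d}
        self.cat_embed = nn.Embedding(n_categories, d)  # embedding matrix M^{13 x d}
        self.vis_proj = nn.Linear(2048, d)              # ResNet-50 pooled feature -> d
        self.L = L

    def forward(self, coords, categories, visual_feats):
        # fuse the three modalities by simple vector addition
        return (self.pos_proj(high_freq(coords, self.L))
                + self.cat_embed(categories)
                + self.vis_proj(visual_feats))

def edge_embedding(coords, edges, proj, L=9):
    """Edge feature: high-frequency encoding of coordinate differences.
    proj: a shared nn.Linear(8 * L, d) projection."""
    src, dst = edges[:, 0], edges[:, 1]
    return proj(high_freq(coords[src] - coords[dst], L))
```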
Each acquired layer is then labeled with whether it is a fragmented layer and with its layer merging region, and the GUI graph structures, the layers, the corresponding node and edge embedding vectors, and the labels form the dataset.
As shown in FIG. 3 and FIG. 4, a graph neural network comprising a feature extraction module, an analysis module, and a prediction module is constructed. The feature extraction module extracts the node and edge embedding vectors of each layer in the original GUI design draft file; the analysis module updates layer relations according to the extracted node and edge embedding vectors to obtain the corresponding fused node embedding vectors; and the prediction module outputs, from the fused node embedding vectors, a prediction result comprising a classification of whether each layer is fragmented, the layer merging region in which the layer lies, and a confidence score between the layer's position and the layer merging region.
More specifically, the graph neural network can be expressed within a message-passing framework:

$$h_{v}^{(k)} = \mathrm{COMBINE}^{(k)}\!\left(h_{v}^{(k-1)},\ \mathrm{AGGREGATE}^{(k)}\!\left(\left\{\left(h_{u}^{(k-1)}, e_{uv}\right) : u \in \mathcal{N}(v)\right\}\right)\right)$$

where $h_{v}^{(k)}$ denotes the state vector of node v, $e_{uv}$ the feature vector of the edge connecting nodes u and v, $\mathcal{N}(v)$ the neighbors of node v, k the k-th message-passing iteration, COMBINE some combination function, and AGGREGATE some message aggregation function.
In the k-th iteration, the GNN first aggregates the features of neighboring nodes and edges, then combines them through a function that updates each node's representation. After k iterations of aggregating information, the representation of node v captures the local structural information within its k-hop neighborhood. The AGGREGATE function should be permutation-invariant to eliminate the effect of message input order, and the COMBINE function should be expressive. Here the AGGREGATE function is implemented as a sum and the COMBINE function as a two-layer MLP, which makes the expressive power of the graph neural network model comparable to the WL test.
However, graph neural networks are often limited by over-smoothing, over-squashing, and expressive power bounded by the WL test. Recently, some works have attempted to alleviate these fundamental problems by letting every node participate in the computation of all other nodes in the graph, i.e. by employing a global attention mechanism. Multi-head self-attention is therefore introduced into the model, with each head computing the standard scaled dot-product attention

$$\mathrm{SelfAttn}(X) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_{k}}}\right)V, \qquad Q = XW_{Q},\quad K = XW_{K},\quad V = XW_{V},$$

and the node representations are updated by combining this global branch with the message-passing branch through the residual, LayerNorm, Dropout, and MLP operations of the analysis-module equations given above.
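A sketch of one such graph learning block, consistent with the update equations above: a message-passing branch and a multi-head self-attention branch run in parallel, each with Dropout, a residual connection, and LayerNorm, fused by an MLP. The simple sum-aggregation message function stands in for the GINE layer mentioned later; shapes and names are assumptions.

```python
import torch
import torch.nn as nn

class GraphLearningBlock(nn.Module):
    def __init__(self, d=128, heads=4, dropout=0.1):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))
        self.attn = nn.MultiheadAttention(d, heads, dropout, batch_first=True)
        self.norm_m, self.norm_t = nn.LayerNorm(d), nn.LayerNorm(d)
        self.drop = nn.Dropout(dropout)
        self.mlp = nn.Sequential(nn.Linear(d, 2 * d), nn.ReLU(), nn.Linear(2 * d, d))

    def forward(self, x, e, edges):
        """x: (N, d) node features; e: (E, d) edge features;
        edges: (E, 2) long tensor, listed in both directions for undirected graphs."""
        # --- local branch: sum-aggregate neighbor/edge messages ---
        src, dst = edges[:, 0], edges[:, 1]
        m = torch.zeros_like(x)
        m.index_add_(0, dst, self.msg(torch.cat([x[src], e], dim=-1)))
        x_m = self.norm_m(x + self.drop(m))
        # --- global branch: multi-head self-attention over all layers ---
        t, _ = self.attn(x.unsqueeze(0), x.unsqueeze(0), x.unsqueeze(0))
        x_t = self.norm_t(x + self.drop(t.squeeze(0)))
        # --- fuse both branches ---
        return self.mlp(x_m + x_t)
```

Stacking several such blocks lets each layer's representation mix local containment structure with page-global context.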
after K iterations of updating the node representation, the final node embedding vector is input into both MLP branches. Whether the first branch prediction layer is fragmented or not, and the second branch locates the boundary of the merge group. The sort branch sorts all layers in the GUI design script into two categories that filter out most irrelevant layers for further layer grouping work. The filtered GUI layers may already contain rich semantics as components. To group the fragmented layers, we employ a locating branch to first regress and merge the grouped bounding boxes, and then can naturally group the fragmented layers together.
One of the key challenges of target detection is that the size of the target object is not fixed and may vary widely. In our fragmented layer grouping task, some UI merge groups draw large background patterns, and some draw small icons. A set of anchor rectangular boxes with different aspect ratios and sizes are predefined based on the fast RCNN model to solve the above-described problem. Unlike the object detection algorithm, which sets the center of the anchor frame as the pixel center, the center of the GUI layer is directly regarded as the center of the anchor frame. We select three aspect ratios, 2:1, 1:1 and 1:2, respectively. The height of each aspect ratio is 16, 128 and 256, respectively. Positioning branch predictions deforms these anchor boxes to the offsets of the true rectangular boxes. And outputting confidence scores of each prediction boundary box, and obtaining a final boundary box regression result by using a similar merging region optimization algorithm. Prior to optimization, bounding boxes for non-maximum confidence scores for each of the fragmented layers are first filtered out.
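A minimal sketch of the anchor generation described above; reading "three aspect ratios with heights 16, 128, 256" as the full 3 × 3 grid of nine anchors per layer is an assumption, and the function name is illustrative.

```python
def layer_anchors(bbox):
    """bbox = (x, y, w, h); returns anchor boxes as (x, y, w, h) tuples,
    all centered on the GUI layer's center."""
    cx, cy = bbox[0] + bbox[2] / 2, bbox[1] + bbox[3] / 2
    anchors = []
    for w_over_h in (2.0, 1.0, 0.5):          # aspect ratios 2:1, 1:1, 1:2
        for h in (16.0, 128.0, 256.0):        # anchor heights
            w = h * w_over_h
            anchors.append((cx - w / 2, cy - h / 2, w, h))
    return anchors
```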
The graph neural network is trained on the dataset to obtain a classification and recognition model for classifying layers and locating layer merging regions.
More specifically, the multi-modal information of each layer is encoded into 128-dimensional feature vectors. The visual image features of the layers are extracted with a ResNet-50 pre-trained on ImageNet, and only a linear layer is trained to convert the convolutional feature map into a 128-dimensional vector. For the high-frequency encoding, L in the high-frequency function above is set to 9, and a linear layer encodes the output of the high-frequency function into a 128-dimensional feature vector. The attention module of each graph learning block has 4 attention heads with a hidden dimension of 128. A GINE graph neural layer serves as the message-passing module for learning the neighborhood information around each node, also with a hidden dimension of 128. The graph neural network consists of 9 graph learning blocks. Both the classification branch and the localization branch consist of 3-layer MLPs with the hidden dimension set to 256. For the loss function, the weight of the classification loss is set to 1, the weight of the CIoU loss to 10.0, and the weight of the confidence loss to 5.0. The IoU threshold in the similar-layer-merging-region algorithm is set to 0.45, and the threshold in the fragmented-layer clustering algorithm to 0.7.
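For reference, these hyper-parameters can be collected into a single configuration object; the following dictionary is purely illustrative and its key names are assumptions.

```python
CONFIG = {
    "feature_dim": 128,          # node / edge / visual feature dimension
    "high_freq_L": 9,            # L in the high-frequency encoding
    "attention_heads": 4,
    "num_graph_blocks": 9,       # stacked graph learning blocks
    "branch_mlp_layers": 3,      # classification / localization heads
    "branch_hidden_dim": 256,
    "loss_weights": {"cls": 1.0, "ciou": 10.0, "conf": 5.0},
    "merge_region_iou_threshold": 0.45,
    "cluster_area_threshold": 0.7,
}
```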
The original GUI design draft file to be optimized is then input into the classification and recognition model to obtain the corresponding prediction result; the fragmented layers of similar layer merging regions in the prediction result are clustered with the bounding-box merging method, and the clustering results are grouped and merged to generate a high-quality GUI design draft.
The bounding-box merging method comprises optimizing similar layer merging regions and clustering fragmented layers.
As shown in FIG. 5, the specific process of optimizing similar layer merging regions is as follows:
The bounding boxes are first sorted in descending order of confidence score. For the bounding box with the highest confidence score, the IoU between it and each of the other rectangles is computed, and the rectangles whose IoU value exceeds a predefined threshold are collected. For the resulting bounding-box cluster, the rectangle of its average size is used as the bounding box of the corresponding layer merging group.
As shown in FIG. 6, the specific process of fragmented layer clustering is as follows:
among the optimized merging regions, the one with the smallest area is selected; the proportion of each remaining fragmented layer's area falling inside this region is computed, and the fragmented layers whose proportion exceeds a threshold are selected as a clustering result to be merged and grouped;
after the merged and grouped fragmented layers are removed, the above operation is repeated until all fragmented layers have been processed.
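A sketch of this fragmented-layer clustering under the stated area-proportion threshold of 0.7; names are illustrative.

```python
def cluster_fragmented(regions, frag_boxes, thr=0.7):
    """regions: optimized merging regions as (x, y, w, h);
    frag_boxes: bounding boxes of the fragmented layers."""
    def inter_area(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2 = min(a[0] + a[2], b[0] + b[2])
        y2 = min(a[1] + a[3], b[1] + b[3])
        return max(0.0, x2 - x1) * max(0.0, y2 - y1)

    regions = sorted(regions, key=lambda r: r[2] * r[3])  # smallest area first
    unassigned = set(range(len(frag_boxes)))
    groups = []
    for r in regions:
        # a layer joins the region when enough of its own area falls inside it
        grp = [i for i in unassigned
               if inter_area(r, frag_boxes[i])
               / (frag_boxes[i][2] * frag_boxes[i][3] + 1e-6) > thr]
        if grp:
            groups.append(grp)
            unassigned -= set(grp)
    return groups
```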
This embodiment also provides an associated fragmented layer merging device comprising a memory and one or more processors, wherein the memory stores executable code and the one or more processors, when executing the executable code, implement the associated fragmented layer merging method provided in this embodiment.
The specific step is as follows: the original GUI design draft file to be optimized is input into the classification and recognition model for analysis, obtaining an optimized, high-quality GUI design draft.
In summary, the model regresses the bounding boxes of merging groups while detecting fragmented layers, so grouping the fragmented layers becomes straightforward. Considering the overlapping relationship between background UI components and other components, the bounding boxes are sorted in ascending order of rectangle area so that semantically consistent layers are first grouped into the smaller bounding boxes. The sorted bounding-box list is traversed, and for each fragmented layer the proportion of its area intersecting the current bounding box is computed; if the value exceeds the threshold, the fragmented layer is assigned to the current merging group. In this way the fragmented layers in a design draft can be grouped effectively, which helps downstream code-generation platforms produce higher-quality code. A further advantage of the algorithm is that errors in the grouping result can be corrected with prior knowledge: text layers are typically not merged with other layers, whereas layers depicting geometric shapes are typically part of a merging group. When searching within a merging bounding box, layers predicted as non-fragmented but belonging to particular categories (e.g. ellipses, rectangles) can therefore be merged into the group as well, as sketched below.
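A sketch of this prior-knowledge correction; the set of "mergeable" category names is an assumption for illustration and reuses the Layer data model from the parsing sketch above.

```python
MERGEABLE = {"oval", "rectangle", "vector-path"}   # hypothetical category names

def refine_group(group, layers, region, thr=0.7):
    """group: indices already assigned to the merging box `region` (x, y, w, h)."""
    def inter_area(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2 = min(a[0] + a[2], b[0] + b[2])
        y2 = min(a[1] + a[3], b[1] + b[3])
        return max(0.0, x2 - x1) * max(0.0, y2 - y1)

    for i, layer in enumerate(layers):
        if i in group or layer.category not in MERGEABLE:
            continue  # text layers etc. are rarely merged with others
        ratio = inter_area(region, layer.bbox) / (layer.bbox[2] * layer.bbox[3] + 1e-6)
        if ratio > thr:
            group.append(i)   # geometric layer inside the box joins the group
    return group
```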

Claims (9)

1. An associated fragmented layer merging method based on a multi-modal graph neural network, characterized by comprising the following steps:
acquiring an original GUI design draft file, which comprises the coordinate position and size, image information, and category of each layer;
traversing all layers in descending order of layer size to construct the corresponding GUI layer view tree, and pairwise connecting layers at the same tree level to obtain a GUI graph structure expression;
for the layer at each node of the GUI graph structure, converting the layer's coordinate position and size into a high-dimensional vector with a high-frequency function, and embedding this high-dimensional vector, the visual feature vector corresponding to the image information, and the category embedding vector corresponding to the category into the same space through parameter matrices and adding them to obtain the corresponding node embedding vector, while generating the corresponding edge embedding vector from the high-dimensional encoding of the coordinate difference between the two layers of each edge;
labeling each acquired layer with whether it is a fragmented layer and with its layer merging region, and forming a dataset from the GUI graph structures, the layers, the corresponding node and edge embedding vectors, and the labels;
constructing a graph neural network comprising a feature extraction module, an analysis module, and a prediction module, wherein the feature extraction module extracts the node and edge embedding vectors of each layer in the original GUI design draft file, the analysis module updates layer relations according to the extracted node and edge embedding vectors to obtain the corresponding fused node embedding vectors, and the prediction module outputs, from the fused node embedding vectors, a prediction result comprising a classification of whether each layer is fragmented, the layer merging region in which the layer lies, and a confidence score between the layer's position and the layer merging region;
training the graph neural network on the dataset to obtain a classification and recognition model for classifying layers and locating layer merging regions;
and inputting the original GUI design draft file to be optimized into the classification and recognition model to obtain the corresponding prediction result, clustering the fragmented layers of similar layer merging regions in the prediction result with a bounding-box merging method, and grouping and merging the clustering results to generate a high-quality GUI design draft.
2. The associated fragmented layer merging method based on the multi-modal graph neural network according to claim 1, wherein the GUI graph structure expression is acquired as follows: sorting layers in descending order of the rectangle area corresponding to the layer size to obtain the corresponding layer list;
traversing the layer list in order and assigning the current layer, as a child node, to the last traversed layer that contains it and has the smallest area, so as to obtain the corresponding view tree;
and removing the edges between each node of the view tree and its child nodes while pairwise connecting nodes at the same tree level, the connections serving as edges between the layers represented by the nodes, so as to reconstruct the GUI graph structure expression.
3. The associated fragmented layer merging method based on the multi-modal graph neural network according to claim 1, wherein the visual feature vectors are obtained by extracting the image information with a ResNet-50 model as the backbone network.
4. The associated fragmented layer merging method based on the multi-modal graph neural network according to claim 1, wherein the analysis module performs the following update at the l-th graph neural layer:

$$\hat{X}^{l}_{M},\; E^{l+1} = \mathrm{MPNN}^{l}\!\left(X^{l}, E^{l}\right), \qquad \hat{X}^{l}_{T} = \mathrm{Self\text{-}attn}^{l}\!\left(X^{l}\right),$$
$$X^{l+1}_{M} = \mathrm{LayerNorm}\!\left(X^{l} + \mathrm{Dropout}\!\left(\hat{X}^{l}_{M}\right)\right), \qquad X^{l+1}_{T} = \mathrm{LayerNorm}\!\left(X^{l} + \mathrm{Dropout}\!\left(\hat{X}^{l}_{T}\right)\right),$$
$$X^{l+1} = \mathrm{MLP}\!\left(X^{l+1}_{M} + X^{l+1}_{T}\right),$$

wherein X denotes the node feature matrix, E the edge feature matrix, l the l-th graph neural layer, MPNN a graph message-passing function, Self-attn a self-attention function, LayerNorm a regularization function, Dropout a neuron-deactivation function, and MLP a multi-layer perceptron network.
5. The associated fragmented layer merging method based on the multi-modal graph neural network according to claim 1, wherein during training the parameters of the graph neural network are updated with a multi-part loss function of the form

$$\mathcal{L} = \lambda_{1}\,\mathcal{L}_{\mathrm{cls}}(p, y) + \lambda_{2}\,\mathcal{L}_{\mathrm{CIoU}}\!\left(b, b^{gt}\right) + \lambda_{3}\left(\hat{c} - \mathrm{IoU}\!\left(b, b^{gt}\right)\right)^{2},$$

wherein p denotes the classification probability, y the layer classification label, b the merging region in the prediction result, b^{gt} the merging-region label of the dataset, ĉ the confidence score in the prediction result, and IoU the intersection-over-union function of two merging regions, i.e. the area of their intersection divided by the area of their union.
6. The associated fragmented layer merging method based on the multi-modal graph neural network according to claim 1, wherein the bounding-box merging method comprises optimizing similar layer merging regions and clustering fragmented layers.
7. The associated fragmented layer merging method based on the multi-modal graph neural network according to claim 6, wherein the specific process of optimizing similar layer merging regions is as follows:
selecting the layer merging region with the highest confidence, computing the IoU values between the other layer merging regions and it, selecting the layer merging regions whose IoU exceeds a threshold, and averaging these layer merging regions to obtain an optimized layer merging region;
and, after removing the optimized layer merging regions, repeating the above operation until all layer merging regions have been optimized.
8. The associated fragmented layer merging method based on the multi-modal graph neural network according to claim 6, wherein the specific process of clustering fragmented layers is as follows:
selecting, among the optimized merging regions, the one with the smallest area, computing the proportion of each remaining fragmented layer's area that falls inside this region, and selecting the fragmented layers whose proportion exceeds a threshold as a clustering result to be merged and grouped;
and, after removing the merged and grouped fragmented layers, repeating the above operation until all fragmented layers have been processed.
9. An associated fragmented layer merging device, characterized by comprising a memory and one or more processors, wherein the memory stores executable code and the one or more processors, when executing the executable code, implement the associated fragmented layer merging method based on the multi-modal graph neural network according to any one of claims 1-8, with the specific step of:
inputting the original GUI design draft file to be optimized into the classification and recognition model for analysis to obtain an optimized, high-quality GUI design draft.
CN202311298208.1A 2023-10-09 2023-10-09 Associated fragmented layer merging method and device based on multi-modal graph neural network Active CN117032875B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311298208.1A CN117032875B (en) 2023-10-09 2023-10-09 Associated fragmented layer merging method and device based on multi-modal graph neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311298208.1A CN117032875B (en) 2023-10-09 2023-10-09 Associated fragmented layer merging method and device based on multi-modal graph neural network

Publications (2)

Publication Number Publication Date
CN117032875A (en) 2023-11-10
CN117032875B (en) 2024-02-13

Family

ID=88641650

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311298208.1A Active CN117032875B (en) 2023-10-09 2023-10-09 Associated fragmented layer merging method and device based on multi-modal graph neural network

Country Status (1)

Country Link
CN (1) CN117032875B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115291864A (en) * 2022-06-30 2022-11-04 Zhejiang University Fragmented layer detection method and device based on graph neural network
CN116385787A (en) * 2023-04-07 2023-07-04 Zhejiang University Layer processing method and device for UI (user interface) fragmented layers

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200050902A1 (en) * 2018-08-09 2020-02-13 Electronics And Telecommunications Research Institute Method for analyzing material using neural network based on multimodal input and apparatus using the same

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115291864A (en) * 2022-06-30 2022-11-04 Zhejiang University Fragmented layer detection method and device based on graph neural network
CN116385787A (en) * 2023-04-07 2023-07-04 Zhejiang University Layer processing method and device for UI (user interface) fragmented layers

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
UI Layers Group Detector: Grouping UI Layers via Text Fusion and Box Attention; Chen LQ et al.; Artificial Intelligence, CICAI 2022, Pt. II; Vol. 13605; 303-314 *
UI Layer Merger: merging UI layers based on computer vision and boundary priors; Chen Liuqing et al.; Frontiers of Information Technology & Electronic Engineering; Vol. 24, No. 3; 373-388 *

Also Published As

Publication number Publication date
CN117032875A (en) 2023-11-10

Similar Documents

Publication Publication Date Title
CN111310773B (en) Efficient license plate positioning method of convolutional neural network
CN106570453B (en) Method, device and system for pedestrian detection
WO2017113232A1 (en) Product classification method and apparatus based on deep learning
CN110991444B (en) License plate recognition method and device for complex scene
CN111461127A (en) Example segmentation method based on one-stage target detection framework
Li et al. Image co-saliency detection and instance co-segmentation using attention graph clustering based graph convolutional network
CN114049381A (en) Twin cross target tracking method fusing multilayer semantic information
CN110163111A (en) Method, apparatus of calling out the numbers, electronic equipment and storage medium based on recognition of face
CN110751027B (en) Pedestrian re-identification method based on deep multi-instance learning
CN115291864B (en) Method and device for detecting fragmented layers based on graphic neural network
CN115908908A (en) Remote sensing image gathering type target identification method and device based on graph attention network
CN115294563A (en) 3D point cloud analysis method and device based on Transformer and capable of enhancing local semantic learning ability
Chen et al. Page segmentation for historical handwritten document images using conditional random fields
Hu et al. Deep learning for distinguishing computer generated images and natural images: A survey
CN105956604B (en) Action identification method based on two-layer space-time neighborhood characteristics
CN112990202B (en) Scene graph generation method and system based on sparse representation
CN113361344B (en) Video event identification method, device, equipment and storage medium
CN113822134A (en) Instance tracking method, device, equipment and storage medium based on video
CN117032875B (en) Associated broken map layer merging method and device based on multi-modal map neural network
Li A deep learning-based text detection and recognition approach for natural scenes
CN106033546A (en) Behavior classification method based on top-down learning
CN112837332A (en) Creative design generation method, device, terminal, storage medium and processor
CN112101385B (en) Weak supervision text detection method
Sun et al. Flame Image Detection Algorithm Based on Computer Vision.
Wardak et al. Noise presence detection in QR code images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant