CN111985542B - Representative graph structure model, visual understanding model establishing method and application - Google Patents


Publication number
CN111985542B
Authority
CN
China
Prior art keywords: long, matrix, representative, module, feature
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number
CN202010778717.4A
Other languages: Chinese (zh)
Other versions: CN111985542A (en)
Inventors: 吴东岳 (Wu Dongyue), 余昌黔 (Yu Changqian), 高常鑫 (Gao Changxin), 桑农 (Sang Nong)
Current Assignee: Huazhong University of Science and Technology (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original Assignee: Huazhong University of Science and Technology
Priority date: (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Application filed by Huazhong University of Science and Technology
Priority to CN202010778717.4A
Publication of CN111985542A
Application granted
Publication of CN111985542B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; extraction of features in feature space; blind source separation
    • G06F 18/214 Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features

Abstract

The invention discloses a representative graph structure model, a method for establishing a visual understanding model, and applications thereof, belonging to the field of visual understanding. The representative graph structure model comprises: a feature mapping module, which extracts a value branch, a key value branch and a sequence branch from the input feature image and generates an offset matrix; a sampling module, which samples nodes (pixels or image grids) in the value branch and the key value branch according to the offset matrix to obtain representative features; a long-distance dependence information capturing module, which performs matrix multiplication on the representative features of the key value branch and the sequence branch followed by a Softmax operation to obtain a relation matrix, and then multiplies the representative features of the value branch by the relation matrix to obtain a long-distance dependence matrix; and a feature back-projection module, which encodes the long-distance dependence information into the input feature image. The invention can learn more refined long-distance dependence information and improve the accuracy of visual understanding tasks.

Description

Representative graph structure model, visual understanding model establishing method and application
Technical Field
The invention belongs to the field of visual understanding, and particularly relates to a representative graph structure model, a visual understanding model establishing method and application.
Background
Long-distance dependence refers to the semantic relationships that exist between regions or pixels lying far apart in an image. Modeling long-distance dependence is of great significance for visual understanding tasks such as semantic segmentation, object detection and object segmentation: for example, when judging which category a region/pixel in an image belongs to, other distant regions/pixels with similar features can be taken into account as influence factors. Previous mainstream methods rely on deep stacking of local operations, such as convolution operations, but this approach is computationally inefficient, difficult to optimize, and has a small receptive field.
To address these problems, the non-local method was proposed to capture long-distance dependence. For each position, the non-local operation takes a weighted sum over all other positions as the computed long-distance dependence information, with the weights taken from a dense relation matrix. The dense relation matrix is generated by convolutional-layer mappings and a series of matrix operations; for each position, it records the degree of importance of every other position to the current position. The relation matrix may, however, contain redundancy, which also leads to high computational complexity: for any given position, some other positions produce only a small response. Statistical studies show that a few positions make the major contribution to the response, while most positions contribute only marginally. Redundancy in the relation matrix therefore inevitably causes computational redundancy in the non-local computation.
In general, existing methods for capturing long-distance dependence have high computational complexity, which limits the efficiency and effect of long-distance dependence capture in practical applications and lowers the accuracy of various computer vision understanding tasks.
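For context, the dense non-local operation described above can be sketched in a few lines of NumPy. This is an illustrative toy (random weights, a flattened 16-position feature map, names chosen here for clarity), not code from the patent; note that the relation matrix is N × N, which is the source of the quadratic cost the invention targets:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def non_local(x, w_theta, w_phi, w_g):
    """Dense non-local block on a flattened feature map x of shape (N, C)."""
    q = x @ w_theta                  # (N, C') query embeddings
    k = x @ w_phi                    # (N, C') key embeddings
    v = x @ w_g                      # (N, C') value embeddings
    rel = softmax(q @ k.T, axis=1)   # (N, N) dense relation matrix
    return rel @ v, rel              # weighted sum over ALL N positions

rng = np.random.default_rng(0)
N, C, Cp = 16, 8, 4                  # N = H*W positions, C channels, C' reduced channels
x = rng.standard_normal((N, C))
w = [rng.standard_normal((C, Cp)) for _ in range(3)]
out, rel = non_local(x, *w)
```

Every row of `rel` is a full distribution over all N positions, so both memory and compute grow as O(N²) in the number of pixels; the representative graph structure below replaces the N columns with only S sampled representatives.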
Disclosure of Invention
In view of the defects and improvement needs of the prior art, the invention provides a representative graph structure model, a visual understanding model establishing method and applications thereof, aiming to solve the technical problem that existing methods for capturing long-distance dependence are computationally complex, which limits the efficiency and effect of long-distance dependence capture and leads to low accuracy in visual understanding tasks.
To achieve the above object, according to an aspect of the present invention, there is provided a representative graph structure model building method including:
establishing a representative graph structure model for capturing long-distance dependence information of an input characteristic image;
the representative graph structure model comprises: a feature mapping module, a sampling module, a long-distance dependence information capturing module and a feature back-projection module;
the feature mapping module is used for extracting a value branch, a key value branch and a sequence branch from an input feature image and generating an offset matrix indicating the coordinates of the sampling points;
the sampling module is used for sampling the neighbor nodes of each node in the value branch and the key value branch respectively according to the offset matrix, to obtain the representative features of the value branch and the representative features of the key value branch;
the long-distance dependence information capturing module is used for performing matrix multiplication on the representative features of the key value branch and the sequence branch and then performing a Softmax operation to obtain a relation matrix, in which the relation vector between each node and its sampling points is recorded; the long-distance dependence information capturing module is also used for performing matrix multiplication on the representative features of the value branch and the relation matrix to obtain a long-distance dependence matrix, which records the long-distance dependence information of each node;
the feature back-projection module is used for encoding the long-distance dependence information between the nodes into the input feature image and outputting a feature image containing the long-distance dependence information;
wherein the nodes are pixels or image grids.
Further, the representative graph structure model further comprises a channel division module and a feature integration module;
the channel division module is used for dividing the representative features of the value branch, the representative features of the key value branch and the sequence branch by channel to obtain a plurality of channel groups, and inputting the channel groups to the long-distance dependence information capturing module respectively, so as to capture the long-distance dependence information between the nodes within each channel group;
the feature integration module is used for integrating the long-distance dependence information between the nodes in each channel group to obtain the long-distance dependence information of each node in the input feature image, and inputting it to the feature back-projection module, so that the long-distance dependence information between the nodes is encoded into the input feature image by the feature back-projection module;
after channel division, the representative features of the value branch, the representative features of the key value branch and the sequence branch of the corresponding channels form one channel group.
In some optional embodiments, the feature mapping module comprises: a first convolution layer, a second convolution layer, a third convolution layer and a fourth convolution layer, wherein the convolution kernel size of each convolution layer is 1 × 1;
the first convolution layer, the second convolution layer, the third convolution layer and the fourth convolution layer are respectively used for carrying out convolution operation on the input characteristic image to obtain a value branch, an offset matrix, a key value branch and a sequence branch.
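Because every kernel is 1 × 1, each of these four convolutions is simply a per-pixel linear map over the channel dimension. The following NumPy sketch (toy shapes and random weights, assumed here only for illustration) shows the four branches being produced, including the offset matrix with 2S channels described later:

```python
import numpy as np

def conv1x1(x, w):
    # A 1x1 convolution is a per-pixel linear map over channels:
    # x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W)
    c, h, wdt = x.shape
    return (w @ x.reshape(c, h * wdt)).reshape(w.shape[0], h, wdt)

rng = np.random.default_rng(0)
C, Cp, S, H, W = 8, 4, 9, 6, 6       # S = 9 representative nodes per position
x = rng.standard_normal((C, H, W))

value  = conv1x1(x, rng.standard_normal((Cp, C)))     # value branch
key    = conv1x1(x, rng.standard_normal((Cp, C)))     # key value branch
query  = conv1x1(x, rng.standard_normal((Cp, C)))     # sequence branch
offset = conv1x1(x, rng.standard_normal((2 * S, C)))  # offset matrix: one (dy, dx) pair per sample point
```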
Further, the feature back-projection module comprises: a fifth convolution layer and a first aggregation layer, the convolution kernel size of the fifth convolution layer being 1 × 1;
the fifth convolution layer is configured to perform a convolution operation on the long-distance dependence matrix, restoring its size to that of the input feature image;
the first aggregation layer is used for performing an aggregation operation on the feature image input to the feature mapping module and the restored long-distance dependence matrix, encoding the long-distance dependence information between the nodes into the input feature image.
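The back-projection step amounts to restoring the channel count with a 1 × 1 convolution and then aggregating with the original input; a minimal NumPy sketch (random stand-in weights, element-wise addition chosen as the aggregation operation) under those assumptions:

```python
import numpy as np

def conv1x1(x, w):
    # 1x1 convolution as a per-pixel channel map: (C_in, H, W) -> (C_out, H, W)
    c, h, wdt = x.shape
    return (w @ x.reshape(c, h * wdt)).reshape(w.shape[0], h, wdt)

rng = np.random.default_rng(0)
C, Cp, H, W = 8, 4, 6, 6
x_in = rng.standard_normal((C, H, W))    # feature image fed to the feature mapping module
dep  = rng.standard_normal((Cp, H, W))   # long-distance dependence matrix, C' channels

restored = conv1x1(dep, rng.standard_normal((C, Cp)))  # fifth conv: back to C channels
out = x_in + restored                    # aggregation: element-wise (residual) addition
```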
In some optional embodiments, the feature mapping module comprises: a sixth convolution layer, a first batch normalization layer, a first activation layer and a seventh convolution layer, wherein the convolution kernel sizes of the sixth and seventh convolution layers are both 1 × 1;
the sixth convolution layer, the first batch normalization layer and the first activation layer are used for sequentially carrying out convolution operation, batch normalization operation and activation operation on the input characteristic image to obtain a value branch, a key value branch and a sequence branch;
and the seventh convolution layer is used for performing convolution operation on the output image of the first active layer to obtain an offset matrix.
Further, the feature back-projection module comprises: an eighth convolution layer, a second batch normalization layer, a second aggregation layer and a second activation layer, wherein the convolution kernel size of the eighth convolution layer is 1 × 1;
the eighth convolution layer and the second batch normalization layer are used for sequentially carrying out convolution operation and batch normalization operation on the long-distance dependence matrix and reducing the size of the long-distance dependence matrix to be the same as that of the input characteristic image;
and the second aggregation layer and the second activation layer are used for performing aggregation operation on the input characteristic image and the restored long-distance dependency matrix, then performing activation operation on an operation result, and encoding the long-distance dependency information between the nodes into the input characteristic image.
According to another aspect of the present invention, there is provided a visual understanding model building method, including:
inserting a representative graph structure model obtained by the representative graph structure model building method provided by the invention into a backbone network for executing a target visual understanding task, to obtain a visual understanding model;
training the visual understanding model by using a standard training set to obtain a trained visual understanding model;
each sample in the standard training set is composed of an image related to the target visual understanding task and a corresponding label truth value, and the label truth value is used for indicating a task result.
According to still another aspect of the present invention, there is provided a visual understanding task performing method including:
inputting an image of a visual understanding task to be executed into a trained visual understanding model to obtain a task result;
the trained visual understanding model is obtained by the visual understanding model establishing method provided by the invention.
According to still another aspect of the present invention, there is provided a computer-readable storage medium comprising a stored computer program which, when executed by a processor, controls an apparatus on which the computer-readable storage medium resides to perform the representative graph structure model building method provided by the invention, and/or the visual understanding model building method provided by the invention, and/or the visual understanding task execution method provided by the invention.
Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:
(1) In the representative graph structure model building method provided by the invention, the value branch, the key value branch and the sequence branch are extracted from the feature image separately, and the neighbor nodes of each node in the value branch and the key value branch are sampled dynamically to obtain the representative features of each node in these two branches. More refined long-distance dependence information can therefore be learned from the sampled representative nodes, which strengthens the representational power of the features and improves the accuracy of computer vision understanding tasks.
(2) In the representative graph structure model building method provided by the invention, dynamically sampling the neighbor nodes of each node in the value branch and the key value branch greatly reduces the computational complexity; this removes the limits that computational complexity places on some long-distance dependence methods in practice and broadens the application prospects of long-distance dependence.
(3) In the visual understanding model building method provided by the invention, the representative graph structure model is inserted into the backbone network that executes the visual understanding task, so the long-distance dependence information between nodes in the feature image can be captured while the backbone network executes the task, and this information can be used to improve the accuracy of the visual understanding task.
Drawings
FIG. 1 is a schematic diagram of a representative graph structure model provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of a simple representative graph structure model provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of a sampling module according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a bottleneck-shaped representative graph structure model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a rasterized representative graph structure model provided by an embodiment of the invention;
FIG. 6 is a schematic diagram of a channel-grouped representative graph structure model according to an embodiment of the present invention;
fig. 7(a) to 7(e) are visualization results of representative nodes sampled for different nodes in an automatic driving scene;
fig. 8(a) to 8(e) are visualization results of representative nodes sampled for different nodes in the geographic information system.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
In the present application, the terms "first," "second," and the like (if any) in the description and the drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Example 1:
a representative graph structure model building method comprises the following steps:
establishing a representative graph structure model for capturing long-distance dependence information of an input characteristic image;
as shown in fig. 1, the representative graph structure model comprises: a feature mapping module, a sampling module, a long-distance dependence information capturing module and a feature back-projection module;
the feature mapping module is used for extracting a value branch, a key value branch and a sequence branch from an input feature image and generating an offset matrix indicating the coordinates of the sampling points;
the sampling module is used for sampling the neighbor nodes of each node in the value branch and the key value branch respectively according to the offset matrix, to obtain the representative features of the value branch and the representative features of the key value branch;
the long-distance dependence information capturing module is used for performing matrix multiplication on the representative features of the key value branch and the sequence branch and then performing a Softmax operation to obtain a relation matrix, in which the relation vector between each node and its sampling points is recorded; the long-distance dependence information capturing module is also used for performing matrix multiplication on the representative features of the value branch and the relation matrix to obtain a long-distance dependence matrix, which records the long-distance dependence information of each node;
the feature back-projection module is used for encoding the long-distance dependence information between the nodes into the input feature image and outputting a feature image containing the long-distance dependence information;
in this embodiment, the nodes are pixels;
in this embodiment, the value branch, the key value branch and the sequence branch are extracted from the feature image separately, and the neighbor nodes of each node in the value branch and the key value branch are sampled dynamically to obtain their representative features; more refined long-distance dependence information can thus be learned from the sampled representative nodes, which strengthens the representational power of the features and improves the accuracy of computer vision understanding tasks. In addition, dynamically sampling the neighbor nodes of each node in the value branch and the key value branch greatly reduces the computational complexity, removing the limits that computational complexity places on some long-distance dependence methods in practice and broadening the application prospects of long-distance dependence.
As an optional implementation manner, the representative graph structure model provided in this embodiment is a simple representative graph structure model, whose structure is shown in fig. 2. In fig. 2, V, K and Q denote the value branch, the key value branch and the sequence branch respectively, and Wg, Wh, Wφ and Wθ denote weights. In the field of visual understanding, the constituent elements of a feature image can be viewed as a series of <Key, Value> data pairs, where Key and Value denote the key and the value respectively. The sequence branch contains the attention vector of each node; given a node, the values in its attention vector represent the weights of the edges between this node and the remaining nodes, i.e. the degree of importance of the remaining nodes to this node. N denotes the total number of nodes, C the number of channels of the original input feature, C' the number of channels after feature conversion, and S the number of representative (sampled) nodes;
as shown in fig. 2, in this embodiment, the feature mapping module comprises: a first convolution layer, a second convolution layer, a third convolution layer and a fourth convolution layer, wherein the convolution kernel size of each convolution layer is 1 × 1;
the first convolution layer, the second convolution layer, the third convolution layer and the fourth convolution layer are respectively used for carrying out convolution operation on the input characteristic image to obtain a value branch, an offset matrix, a key value branch and a sequence branch;
as shown in fig. 2, in this embodiment, the feature back-projection module comprises: a fifth convolution layer and a first aggregation layer, the convolution kernel size of the fifth convolution layer being 1 × 1;
the fifth convolution layer is configured to perform a convolution operation on the long-distance dependence matrix, restoring its size to that of the input feature image;
the first aggregation layer is used for performing an aggregation operation on the feature image input to the feature mapping module and the restored long-distance dependence matrix, encoding the long-distance dependence information between the nodes into the input feature image;
the aggregation operation denotes element-wise addition at corresponding positions or a concatenation operation.
As shown in fig. 2, in this embodiment, the sampling module comprises two samplers, which dynamically sample the nodes in the value branch and the key value branch respectively. The position information used for sampling comes from the offset matrix regressed by the 1 × 1 convolution operation of the second convolution layer. S representative nodes are sampled according to the offset matrix; since the position of each representative node is represented by a two-dimensional coordinate, the offset matrix has 2S channels, and its values are fractional;
as an optional implementation manner, in this embodiment, for each node p, 9 representative nodes S1 to S9 are sampled from its neighbor nodes; the sampling results are shown in fig. 3. Based on the sampling result, in the connection matrix of the input feature image, any node is in a neighbor relation only with its representative nodes, i.e. only the edges between a node and its representative nodes exist;
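Because the offset values are fractional, sampling at the regressed coordinates requires interpolation. The following NumPy bilinear sampler is a hedged sketch of that step (the node position, offset values and all shapes below are made up for illustration, not taken from the patent):

```python
import numpy as np

def bilinear_sample(feat, ys, xs):
    """Sample feat (C, H, W) at fractional coordinates (ys, xs) -> (C, n)."""
    c, h, w = feat.shape
    ys = np.clip(ys, 0, h - 1)                        # keep samples inside the image
    xs = np.clip(xs, 0, w - 1)
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)  # top-left corner of each cell
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    wy, wx = ys - y0, xs - x0                         # fractional interpolation weights
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x0 + 1] * wx
    bot = feat[:, y0 + 1, x0] * (1 - wx) + feat[:, y0 + 1, x0 + 1] * wx
    return top * (1 - wy) + bot * wy

rng = np.random.default_rng(0)
C, H, W, S = 4, 6, 6, 9
feat = rng.standard_normal((C, H, W))

# hypothetical fractional offsets for one node p at (2, 3), as the offset matrix would regress
py, px = 2, 3
offs = rng.uniform(-2.0, 2.0, size=(S, 2))
rep = bilinear_sample(feat, py + offs[:, 0], px + offs[:, 1])  # (C, S) representative features
```

When a coordinate happens to be an integer, the sampler reduces to a plain lookup, which is a useful sanity check on the interpolation weights.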
after sampling, the representative features of the key value branch are matrix-multiplied with the sequence branch: the representative features of each node in the key value branch are multiplied by the attention vector of the corresponding node in the sequence branch, giving the relation vector between that node and its representative nodes; a Softmax operation is then performed to obtain a relation matrix of dimension N × S.
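The key computational saving is visible in the shapes: each node attends only to its own S representatives, so the relation matrix is N × S rather than N × N. A minimal NumPy sketch under toy shapes (random features standing in for the sampled branches):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, Cp, S = 12, 4, 9
key_rep = rng.standard_normal((N, S, Cp))   # sampled key value features, S per node
val_rep = rng.standard_normal((N, S, Cp))   # sampled value features, S per node
query   = rng.standard_normal((N, Cp))      # attention (sequence branch) vector of each node

# each node's query is multiplied only with its own S representative keys -> N x S
logits = np.einsum('nsc,nc->ns', key_rep, query)
relation = softmax(logits, axis=1)                        # N x S relation matrix
dependency = np.einsum('ns,nsc->nc', relation, val_rep)   # (N, C') long-distance dependence matrix
```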
Example 2:
a representative diagram structure model building method, which is similar to that in embodiment 1, but is different from that in embodiment 1 in that the representative diagram structure model provided in this embodiment is a bottleneck-shaped representative diagram structure model, and the structure of the representative diagram structure model is specifically shown in fig. 4;
as shown in fig. 4, in this embodiment, the feature mapping module comprises: a sixth convolution layer, a first batch normalization layer, a first activation layer and a seventh convolution layer, wherein the convolution kernel sizes of the sixth and seventh convolution layers are both 1 × 1;
the sixth convolution layer, the first batch normalization layer and the first activation layer are used for sequentially carrying out convolution operation, batch normalization operation and activation operation on the input characteristic image to obtain a value branch, a key value branch and a sequence branch;
and the seventh convolution layer is used for performing convolution operation on the output image of the first active layer to obtain an offset matrix.
The feature back-projection module comprises: an eighth convolution layer, a second batch normalization layer, a second aggregation layer and a second activation layer, wherein the convolution kernel size of the eighth convolution layer is 1 × 1;
the eighth convolution layer and the second batch normalization layer are used for sequentially carrying out convolution operation and batch normalization operation on the long-distance dependence matrix and reducing the size of the long-distance dependence matrix to be the same as that of the input characteristic image;
the second aggregation layer and the second activation layer are used for performing aggregation operation on the input characteristic image and the restored long-distance dependency matrix, activating an operation result, and encoding long-distance dependency information between nodes into the input characteristic image;
in this embodiment, the activation functions used by the first activation layer and the second activation layer are both Relu activation functions.
Example 3:
a method for building a representative graph structure model, which is similar to embodiment 1, except that, as shown in fig. 5, in this embodiment, nodes represent image grids;
specifically, the feature mapping module rasterizes the input feature image spatially, dividing the positions in the input feature into different groups; the upper-left element of each group serves as the anchor position, and average pooling is used to aggregate information for regressing the offset matrix. Each grid acts as a node, and the learned offset matrix is applied to all anchor positions to sample the representative nodes of each grid;
as shown in fig. 5, in the present embodiment, the grid size is 3 × 3, specifically, as shown by the central box in the 3 × 3 input and its corresponding rasterized feature p, the anchor coordinate of each group is the coordinate of the pixel position at the upper left corner in the grid; g represents the number of pixels in each group along a certain dimension, and in the present embodiment, G is 2.
In this embodiment, image grids are used as nodes, which greatly reduces the number of nodes participating in the operation and thus the amount of calculation.
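The rasterization, per-group average pooling and upper-left anchoring can be sketched in NumPy as follows (toy 6 × 6 feature map with G = 2, random values chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W, G = 3, 6, 6, 2          # G: pixels per group along each spatial dimension
feat = rng.standard_normal((C, H, W))

# rasterize: partition the H x W plane into (H/G) x (W/G) groups of G x G pixels
groups = feat.reshape(C, H // G, G, W // G, G)
pooled = groups.mean(axis=(2, 4))            # average pooling per group -> (C, H/G, W/G)

# anchor position of each group = coordinate of its upper-left pixel
ay, ax = np.meshgrid(np.arange(0, H, G), np.arange(0, W, G), indexing='ij')
```

The learned offsets would then be added to `(ay, ax)` to place each grid's representative sample points, exactly as in the pixel-node case but with (H/G)·(W/G) nodes instead of H·W.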
Example 4:
a method for building a representative graph structure model, which is similar to that in embodiment 1, the difference is that, as shown in fig. 6, in this embodiment, the representative graph structure model further includes a channel partitioning module and a feature integration module;
the channel division module is used for dividing the representative features of the value branch, the representative features of the key value branch and the sequence branch by channel to obtain a plurality of channel groups, and inputting the channel groups to the long-distance dependence information capturing module respectively, so as to capture the long-distance dependence information between the nodes within each channel group;
the feature integration module is used for integrating the long-distance dependence information between the nodes in each channel group to obtain the long-distance dependence information of each node in the input feature image, and inputting it to the feature back-projection module, so that the long-distance dependence information between the nodes is encoded into the input feature image by the feature back-projection module;
after channel division, the representative features of the value branch, the representative features of the key value branch and the sequence branch of the corresponding channels form one channel group.
The method divides the representative features of the value branch, the representative features of the key value branch and the sequence branch by channel into a plurality of channel groups, and captures the long-distance dependence information between nodes within each channel group separately. Channel division allows the distinctive characteristics of the long-distance dependence information in each channel group to be captured, and reduces the feature dimension, thereby reducing computation and increasing model capacity.
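Channel division and feature integration can be illustrated with the following NumPy sketch (an assumed two-group split of C' = 8 channels on random toy features; the grouping scheme here is the simplest equal split, not necessarily the one used in the patent):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, Cp, S, n_groups = 12, 8, 9, 2
key_rep = rng.standard_normal((N, S, Cp))   # sampled key value features
val_rep = rng.standard_normal((N, S, Cp))   # sampled value features
query   = rng.standard_normal((N, Cp))      # sequence-branch attention vectors

outs = []
for g in range(n_groups):
    sl = slice(g * Cp // n_groups, (g + 1) * Cp // n_groups)  # channels of this group
    # capture long-distance dependence independently within the channel group
    rel = softmax(np.einsum('nsc,nc->ns', key_rep[:, :, sl], query[:, sl]), axis=1)
    outs.append(np.einsum('ns,nsc->nc', rel, val_rep[:, :, sl]))

integrated = np.concatenate(outs, axis=1)   # feature integration: back to (N, C')
```

Because each group computes its own N × S relation matrix from fewer channels, the groups can run in parallel, matching the parallelism noted below.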
In order to improve the parallelism among the channel groups, optionally, in this embodiment, a plurality of long-distance dependent information capturing modules may be correspondingly provided according to the number of the channel groups obtained by dividing;
each long-distance dependence information capturing module is used for capturing long-distance dependence information among nodes in one channel group and inputting the long-distance dependence information to the feature integration module.
Example 5:
a visual understanding model building method, comprising:
inserting a representative graph structure model obtained by the representative graph structure model building method provided by the invention into a backbone network for executing a target visual understanding task, to obtain a visual understanding model;
training the visual understanding model by using a standard training set to obtain a trained visual understanding model;
each sample in the standard training set consists of an image related to a target visual understanding task and a corresponding label truth value, and the label truth value is used for indicating a task result;
in this embodiment, after the representative graph structure model is inserted into the backbone network, the feature image input into the representative graph structure model is the feature image obtained after the original image features have passed in sequence through the modules preceding the insertion position in the backbone network;
which module's output features are used can be set according to the actual requirements of the target visual understanding task.
Example 6:
a visual understanding task execution method, comprising:
inputting an image of a visual understanding task to be executed into a trained visual understanding model to obtain a task result;
the trained visual understanding model is obtained by the visual understanding model establishing method provided in the above embodiment 5.
Example 7:
a computer-readable storage medium comprising a stored computer program which, when executed by a processor, controls an apparatus on which the computer-readable storage medium is located to perform the representative graph structure model establishing method provided in any one of embodiments 1 to 4, and/or the visual understanding model establishing method provided in embodiment 5, and/or the visual understanding task execution method provided in embodiment 6.
The following further explains the beneficial effects obtained by the present invention with reference to some specific application scenarios:
A bottleneck-shaped representative graph structure model is established according to embodiment 2. The model first passes the input through a 1 × 1 convolution layer, a batch normalization layer, and a ReLU activation function, and then performs the four-branch operation. The sampler module dynamically samples each node (or position) in the value branch and the key-value branch; each node selects nine sampling points (S = 9) as representative nodes, which determines the representative node set of each node, and the representative features of the two sampled branches are then obtained on the value branch and the key-value branch respectively by interpolation. The sampled features of a node's sampling points on the key-value branch are matrix-multiplied with the node's attention vector in the sequence branch to obtain the relation vector between the node and its sampling points; after this operation has been performed for all nodes, a Softmax operation yields a relation matrix of dimension N × S. The relation matrix is matrix-multiplied with the sampled representative features of the value branch, the result passes through a 1 × 1 convolution layer and a batch normalization layer and is aggregated with the original input feature image, and after a ReLU activation the feature image containing the long-distance dependency information is output.
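The interpolation step above can be illustrated with a minimal bilinear sampler in NumPy (a sketch under assumed shapes, not the patent's implementation): each node's S = 9 representative points are given by fractional coordinates (node position plus learned offsets from the offset matrix), and features at those coordinates are read off the feature map by bilinear interpolation.

```python
import numpy as np

def bilinear_sample(feat, coords):
    """Sample a feature map at fractional coordinates.

    feat:   (H, W, C) feature map
    coords: (N, S, 2) float (y, x) sampling coordinates per node
    returns (N, S, C) sampled representative features
    """
    H, W, _ = feat.shape
    y = np.clip(coords[..., 0], 0, H - 1)
    x = np.clip(coords[..., 1], 0, W - 1)
    y0 = np.floor(y).astype(int); x0 = np.floor(x).astype(int)
    y1 = np.minimum(y0 + 1, H - 1); x1 = np.minimum(x0 + 1, W - 1)
    wy = (y - y0)[..., None]; wx = (x - x0)[..., None]
    # weighted blend of the four neighboring grid values
    return (feat[y0, x0] * (1 - wy) * (1 - wx) + feat[y0, x1] * (1 - wy) * wx +
            feat[y1, x0] * wy * (1 - wx) + feat[y1, x1] * wy * wx)
```

Because the interpolation is differentiable in the coordinates, gradients can flow back into the offset matrix during training, which is what allows the sampler to be learned dynamically.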
Taking a semantic segmentation task as the target visual understanding task, ResNet is adopted as the backbone network for executing the target visual understanding task, and the representative graph structure model established in embodiment 2 is inserted at an appropriate position in ResNet, for example after stage 5 of ResNet-50, so as to capture long-distance dependency information; after the representative graph structure model is inserted into ResNet, the visual understanding model is obtained.
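Structurally, the insertion amounts to splicing the module into the backbone's sequence of stages. A minimal sketch with plain callables (illustrative only; the actual backbone stages and module are network layers):

```python
def insert_after_stage(stages, module, position):
    """Return a forward pipeline with `module` spliced in after stages[position].

    stages:   list of callables, each mapping a feature image to a feature image
    module:   callable to insert (e.g. the representative graph structure model)
    position: 0-based index of the stage after which to insert
    """
    pipeline = list(stages[:position + 1]) + [module] + list(stages[position + 1:])

    def forward(x):
        for stage in pipeline:
            x = stage(x)
        return x

    return forward
```

With ResNet-50, inserting after stage 5 corresponds to `position = 4` in this 0-based indexing; the module receives exactly the feature image produced by all preceding stages, as the embodiment describes.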
The visual understanding model was trained using the ADE20K dataset. ADE20K is an image segmentation dataset of complex scenes, comprising 20,000 training images, 2,000 validation images, and 3,000 test images; each pixel is labeled with one of 150 predefined semantic classes. The training set is randomly divided into training subsets of equal size, specifically of size 16 each, and the data of each training subset is augmented to improve accuracy. The augmentation specifically comprises the following operations: (1) computing the per-channel mean of the images in the training set; (2) subtracting the image mean from each image in the training subset; (3) randomly flipping horizontally and randomly scaling by any of {0.5, 0.75, 1.0, 1.5, 1.7}.
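The three augmentation operations can be sketched as follows in NumPy (a minimal illustration of the listed steps; nearest-neighbor rescaling is an assumption, as the embodiment does not specify the resampling method):

```python
import numpy as np

SCALES = (0.5, 0.75, 1.0, 1.5, 1.7)  # scale factors listed in the embodiment

def augment(img, channel_mean, rng):
    """img: (H, W, 3) float image; channel_mean: (3,) per-channel training-set mean."""
    img = img - channel_mean                   # (2) subtract the training-set mean
    if rng.random() < 0.5:                     # (3a) random horizontal flip
        img = img[:, ::-1]
    s = SCALES[rng.integers(len(SCALES))]      # (3b) random rescale
    H, W = img.shape[:2]
    ys = (np.arange(int(H * s)) / s).astype(int).clip(0, H - 1)
    xs = (np.arange(int(W * s)) / s).astype(int).clip(0, W - 1)
    return img[np.ix_(ys, xs)]                 # nearest-neighbor resample
```

Step (1), the per-channel mean, would be computed once over the whole training set, e.g. `imgs.mean(axis=(0, 1, 2))` for a stacked array of images, and reused for every subset.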
In the training process, one training subset is trained at a time, and one iteration ends when all training subsets have been trained; the training is repeated until the number of iterations reaches its upper limit, yielding the trained visual understanding model. In actual training, the upper limit of the number of iterations is preferably 100,000.
In the above iterative training, the process within one iteration is as follows: the network parameters of the visual understanding model are trained using the forward propagation and backpropagation algorithms; forward propagation computes the loss function for each training subset, and backpropagation obtains the corresponding gradients. The loss is computed with cross-entropy.
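For reference, the per-pixel cross-entropy loss and its gradient with respect to the class logits have the following closed form (a standard sketch, not code from the patent):

```python
import numpy as np

def pixel_cross_entropy(logits, labels):
    """Mean cross-entropy over pixels and its gradient w.r.t. the logits.

    logits: (N, K) per-pixel class scores (N pixels, K classes)
    labels: (N,) integer ground-truth class ids
    """
    z = logits - logits.max(axis=1, keepdims=True)   # numerically stable softmax
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    N = logits.shape[0]
    loss = -np.log(p[np.arange(N), labels]).mean()
    grad = p.copy()
    grad[np.arange(N), labels] -= 1.0                # softmax probabilities minus one-hot
    return loss, grad / N
```

The gradient `softmax(logits) - one_hot(labels)` is exactly what backpropagation feeds into the rest of the network in each iteration.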
The trained visual understanding model is used to execute target segmentation tasks in different scenes. Visualization results of the representative nodes of different nodes in different scenes are shown in fig. 7 and fig. 8: the results in the automatic driving scene are shown in figs. 7(a)-7(e), and the results in the geographic information system are shown in figs. 8(a)-8(e). The diamond-shaped points represent the current nodes, the dots represent the representative nodes obtained by sampling the current nodes, and the sizes of sampling points with very small weights have been adjusted for better display. The differences between the different nodes and their sampling points in fig. 7 and fig. 8 clearly show that the representative graph structure module captures long-distance dependency information for different nodes: for example, the sampling points of the nodes on the vegetation in fig. 7(a) are distributed on the vegetation, the sampling points of the nodes on the road in fig. 7(b) are mostly distributed on the road, and the sampling points of the nodes on different vehicles in figs. 7(c) and 7(d) are correspondingly distributed on the different vehicles. Figs. 7 and 8 also clearly show the weight differences among the sampling points of a node: for example, the weights of the sampling points on the road in fig. 7(b) are significantly greater than those of the sampling points on the vegetation.
In general, the method reduces computational complexity, accurately captures long-distance dependency information, and effectively improves the accuracy of the visual understanding task. The invention can be applied to fields such as automatic driving, geographic information systems, video surveillance, medical image analysis, and robotics, and can execute visual understanding tasks accurately.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (9)

1. A method for building a representative graph structure model is characterized by comprising the following steps:
establishing a representative graph structure model for capturing long-distance dependence information of an input characteristic image;
the representative graph structure model includes: the system comprises a feature mapping module, a sampling module, a long-distance dependence information capturing module and a feature reflection module;
the characteristic mapping module is used for extracting a value branch, a key value branch and a sequence branch from the input characteristic image and generating an offset matrix for indicating a sampling point coordinate;
the sampling module is used for respectively sampling the neighbor nodes of each node in the value branch and the key value branch according to the offset matrix, and respectively taking the sampling results as the representative characteristics of the value branch and the representative characteristics of the key value branch;
the long-distance dependence information capturing module is used for performing matrix multiplication on the representative characteristic of the key value branch and the sequence branch and then performing Softmax operation to obtain a relation matrix, wherein the relation matrix records a relation vector between each node and a sampling point of the node; the long-distance dependence information capturing module is further configured to perform matrix multiplication on the representative characteristics of the value branches and the relationship matrix to obtain a long-distance dependency matrix, where the long-distance dependency matrix records long-distance dependency information of each node;
the feature reflection module is used for encoding long-distance dependency information among nodes into the input feature image and outputting the feature image containing the long-distance dependency information;
wherein the nodes are pixels or image grids; the value branch and the key-value branch consist respectively of the values and the keys in a series of &lt;Key, Value&gt; data pairs constituting the input feature image, and the sequence branch contains the attention vector of each node in the input feature image.
2. The method of building a representative graph structure model according to claim 1, wherein the representative graph structure model further comprises a channel partitioning module and a feature integration module;
the channel dividing module is used for dividing the representative characteristics of the value branches, the representative characteristics of the key value branches and the sequence branches according to channels to obtain a plurality of channel groups, and inputting the channel groups to the long-distance dependence information capturing module respectively so as to capture long-distance dependence information between nodes in the channel groups;
the feature integration module is used for integrating long-distance dependency information among nodes in each channel group to obtain long-distance dependency information of each node in the input feature image, and inputting the long-distance dependency information to the feature reflection module so that the long-distance dependency information among the nodes is coded into the input feature image by the feature reflection module;
wherein, after division according to the channels, the representative characteristics of the value branches, the representative characteristics of the key value branches and the sequence branches of the corresponding channels form a channel group.
3. The representative graph structure model building method according to claim 1 or 2, wherein the feature mapping module includes: a first convolution layer, a second convolution layer, a third convolution layer and a fourth convolution layer, wherein the convolution kernel size of each convolution layer is 1 × 1;
the first convolution layer, the second convolution layer, the third convolution layer and the fourth convolution layer are respectively used for performing convolution operation on the input characteristic image to obtain a value branch, an offset matrix, a key value branch and a sequence branch.
4. The representative graph structure model building method according to claim 3, wherein the feature reflection module includes: a fifth convolution layer and a first aggregation layer, the convolution kernel size of the fifth convolution layer being 1 × 1;
the fifth convolution layer is used for performing convolution operation on the long-distance dependence matrix and reducing the size of the long-distance dependence matrix to be the same as that of the input characteristic image;
the first aggregation layer is used for performing aggregation operation on the feature images input into the feature mapping module and the restored long-distance dependency matrix and encoding long-distance dependency information among nodes into the input feature images.
5. The representative graph structure model building method according to claim 1 or 2, wherein the feature mapping module includes: a sixth convolution layer, a first batch normalization layer, a first activation layer and a seventh convolution layer, wherein the convolution kernel sizes of the sixth convolution layer and the seventh convolution layer are both 1 × 1;
the sixth convolution layer, the first batch normalization layer and the first activation layer are used for sequentially performing convolution operation, batch normalization operation and activation operation on the input feature image to obtain a value branch, a key value branch and a sequence branch;
and the seventh convolution layer is used for performing convolution operation on the output image of the first active layer to obtain an offset matrix.
6. The representative graph structure model building method according to claim 5, wherein the feature reflection module comprises: an eighth convolution layer, a second batch normalization layer, a second aggregation layer and a second activation layer, wherein the convolution kernel size of the eighth convolution layer is 1 × 1;
the eighth convolution layer and the second batch normalization layer are used for sequentially performing convolution operation and batch normalization operation on the long-distance dependence matrix and reducing the size of the long-distance dependence matrix to be the same as the feature image input into the feature mapping module;
and the second aggregation layer and the second activation layer are used for performing aggregation operation on the input characteristic image and the restored long-distance dependency matrix, then performing activation operation on an operation result, and encoding long-distance dependency information between nodes into the input characteristic image.
7. A visual understanding model building method, comprising:
inserting a representative graph structure model obtained by using the representative graph structure model establishing method of any one of claims 1 to 6 into a backbone network for executing a target visual understanding task to obtain a visual understanding model;
training the visual understanding model by using a standard training set to obtain a trained visual understanding model;
each sample in the standard training set is composed of an image related to the target visual understanding task and a corresponding label truth value, and the label truth value is used for indicating a task result.
8. A visual understanding task execution method, comprising:
inputting an image of a visual understanding task to be executed into a trained visual understanding model to obtain a task result;
wherein the trained visual understanding model is obtained by the visual understanding model establishing method of claim 7.
9. A computer-readable storage medium, comprising a stored computer program, wherein when the computer program is executed by a processor, the computer-readable storage medium controls an apparatus to execute the representative graph structure model building method according to any one of claims 1 to 6, the visual understanding model building method according to claim 7, or the visual understanding task executing method according to claim 8.
CN202010778717.4A 2020-08-05 2020-08-05 Representative graph structure model, visual understanding model establishing method and application Active CN111985542B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010778717.4A CN111985542B (en) 2020-08-05 2020-08-05 Representative graph structure model, visual understanding model establishing method and application

Publications (2)

Publication Number Publication Date
CN111985542A CN111985542A (en) 2020-11-24
CN111985542B true CN111985542B (en) 2022-07-12

Family

ID=73445129


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110705313A (en) * 2019-10-09 2020-01-17 沈阳航空航天大学 Text abstract generation method based on feature extraction and semantic enhancement
US10701394B1 (en) * 2016-11-10 2020-06-30 Twitter, Inc. Real-time video super-resolution with spatio-temporal networks and motion compensation
CN111444923A (en) * 2020-04-13 2020-07-24 中国人民解放军国防科技大学 Image semantic segmentation method and device under natural scene

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104517106B (en) * 2013-09-29 2017-11-28 北大方正集团有限公司 A kind of list recognition methods and system
US11853903B2 (en) * 2017-09-28 2023-12-26 Siemens Aktiengesellschaft SGCNN: structural graph convolutional neural network
US11615311B2 (en) * 2018-12-10 2023-03-28 Baidu Usa Llc Representation learning for input classification via topic sparse autoencoder and entity embedding
US10997690B2 (en) * 2019-01-18 2021-05-04 Ramot At Tel-Aviv University Ltd. Method and system for end-to-end image processing
CN111353988B (en) * 2020-03-03 2021-04-23 成都大成均图科技有限公司 KNN dynamic self-adaptive double-image convolution image segmentation method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Non-local Neural Networks";X. Wang等;《2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition》;20181217;全文 *
"Text Generation from Knowledge Graphs with Graph Transformers";Rik Koncel-Kedziorski等;《arXiv:1904.02342v2》;20190518;全文 *
"基于自然语言处理和图计算的情报分析研究";杨明川等;《电信技术》;20170630;第2017年卷(第06期);全文 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant