CN116740418A - Target detection method based on graph reconstruction network


Info

Publication number
CN116740418A
CN116740418A
Authority
CN
China
Prior art keywords
graph, time, space, dimension, target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310575816.6A
Other languages
Chinese (zh)
Inventor
邸江磊
江文隽
秦智坚
吴计
王萍
任振波
秦玉文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202310575816.6A
Publication of CN116740418A
Legal status: Pending

Classifications

    • G06V 10/765 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects, using rules for classification or partitioning the feature space
    • G06N 3/042 Knowledge-based neural networks; logical representations of neural networks
    • G06N 3/0442 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N 3/0455 Auto-encoder networks; encoder-decoder networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/0475 Generative networks
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06V 10/40 Extraction of image or video features
    • G06V 10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; pattern tracking
    • G06V 10/762 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using clustering, e.g. of similar faces in social networks
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level, of extracted features
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V 2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of target detection and discloses a target detection method based on a graph reconstruction network. The method first collects multispectral images over a period of time and performs dimension reduction and feature extraction on them; a physical feature graph, a spatial feature graph, and a spectral feature graph of the multispectral images are then extracted by graph embedding. The three feature graphs, which have different node types, are connected through link edges and nodes, and a self-attention-based graph pooling method yields a heterogeneous graph fusing the multi-source features. The fused graph data are ordered along the time dimension as input, their temporal and spatial dimension features are acquired, and multi-layer spatio-temporal graph convolution extracts information at different temporal and spatial scales. CNN convolution and fully connected operations then align the spatio-temporal convolution result with the predicted target dimensions, and a weight-sharing fully connected layer classifies and locates targets, completing the detection task. The spatio-temporal features obtained in this way make detection more accurate.

Description

Target detection method based on graph reconstruction network
Technical Field
The invention relates to the field of target detection, in particular to a target detection method based on a graph reconstruction network.
Background
Object detection is an important task in the field of computer vision; its goal is to accurately detect objects of interest in images or videos and to mark their positions. A multispectral image contains information from multiple bands: it carries not only the spatial information of the target but also its spectral information, overcoming the limited information content of single-modality images. For a weak, moving target, the multimodal information of a multispectral image allows the target region and position to be determined more accurately. Applying multispectral images to the target detection task, combining their characteristics, can therefore improve detection accuracy and reliability: with the multiple band information, an algorithm can better separate the target from the background and extract richer feature information from it, yielding more accurate detection results.
Early multispectral target recognition relied mainly on manual band selection, for example separating the detection target from a complex field background with a specific characteristic band, or fusing polarization and multispectral images to detect camouflaged targets according to their spectral characteristics. In recent years, convolutional neural networks have gradually replaced these traditional pipelines of hand-crafted feature selection and fusion. Zhang Shaoting of the University of North Carolina verified the impact of feature fusion at different CNN stages on multispectral target detection performance; Hangil et al. combined a CNN with support vector regression for joint feature extraction from visible and far-infrared images; and He Ming et al. of Northwestern Polytechnical University used deep residual networks to extract features of multispectral remote sensing images at different levels, achieving salient target detection end to end.
However, a CNN as the underlying network model only excels at processing regular spatial grid data and establishing local spatial neighborhood relationships between pixels; it readily ignores implicit relationships between visual information and the irregular structure of the data itself. The down-sampling in a CNN also reduces the spatial resolution of feature maps, inevitably losing small-target information and making it difficult for the detection network to learn representations from limited and distorted structural information; moreover, a CNN cannot extract temporal feature information between frames. A solution to these problems is therefore needed.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a target detection method based on a graph reconstruction network.
The technical scheme for solving the technical problems is as follows: the target detection method based on the graph reconstruction network comprises the following steps:
(S1) collecting multispectral images for a period of time, and carrying out dimension reduction and feature extraction on the multispectral images;
(S2) respectively extracting a physical characteristic diagram, a spatial characteristic diagram and a spectral characteristic diagram of the multispectral image in a diagram embedding mode;
(S3) connecting the obtained three feature graphs, which have different node types, through link edges and nodes, and obtaining a heterogeneous graph fusing the multi-source features with a self-attention-based graph pooling method;
(S4) ordering the fused graph data along the time dimension as input, and acquiring the temporal and spatial dimension features of the data and their spatio-temporal correlation with multi-layer spatio-temporal graph convolution;
(S5) aligning the spatio-temporal convolution result with the predicted target dimensions through CNN convolution and fully connected operations, and classifying and locating targets through a weight-sharing fully connected layer.
Preferably, in step (S1), the multispectral image is captured by a multispectral camera that can collect 3 or more spectral bands simultaneously.
Preferably, in step (S1), the dimension reduction and feature extraction use spatial-spectral embedding to assign weights according to the spectral-feature similarity of different pixels, and manifold learning performs similarity classification and feature dimension reduction on the spatial and spectral information of the local neighborhood.
Preferably, in step (S2), the three feature graphs are obtained as follows: the physical feature graph is extracted from the dimension-reduced spectral data combined with infrared spectral features; superpixel neighbor-node information is determined with a simple linear iterative clustering method, edge connections between nodes are built from the spatial connectivity of the superpixels, and the spatial feature graph is extracted; and, combining the spectral-feature similarity of the target, sampling and recombination across different spectral band dimensions yield the target's spectral feature distribution, with a graph neural network effectively representing the spectral data residing on a smooth manifold.
Preferably, in step (S3), the linking network model is a graph autoencoder, including but not limited to a graph convolutional autoencoder, a variational graph convolutional autoencoder, and an adversarially regularized graph autoencoder.
Preferably, in step (S3), the pooling method includes but is not limited to DiffPool, SAGPool, and ASAP.
Preferably, in step (S4), the spatio-temporal graph convolution extracts features in the time dimension and the space dimension by different methods: the network extracting the time dimension includes but is not limited to an RNN, GRU, LSTM, TC (temporal convolution) module, or Transformer, and the network extracting the space dimension includes but is not limited to a GCN, a GAT, or a GCN combined with a GAT.
Preferably, in step (S5), after four levels of spatio-temporal graph convolution, the extracted features are passed through a convolution module and finally fed to a target detection module, which classifies and locates the target to complete the recognition task.
Compared with the prior art, the invention has the following beneficial effects:
Dimension reduction, feature extraction, and graph embedding of the multispectral image yield multi-dimensional feature information covering spatial, physical, and spectral characteristics. Combining the resulting graph structures produces a heterogeneous graph of multi-source information, whose heterogeneous content better supplements the semantic associations and relative position information of targets.
Meanwhile, the node and edge features in graph data represent latent relationships among the data. Nodes are unordered and variable in number, and a node's state can be updated from neighbor nodes at arbitrary depth, thereby expressing attribute-feature relationships; these properties make nodes suitable for representing long-range spatial and spectral relationships in multispectral images.
In addition, since a CNN cannot extract temporal feature information between frames, spatio-temporal graph convolution is used to extract time-dimension features, which allows the associated features of a moving target and the global background information of inter-frame changes to be determined more accurately.
Drawings
Fig. 1 is a flowchart of a target detection method based on a graph reconstruction network according to the present invention.
Fig. 2 is a frame diagram of a target detection method based on a graph reconstruction network according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
Referring to fig. 1, the target detection method based on the graph reconstruction network of the present invention includes the following steps:
(S1) collecting multispectral images for a period of time, and carrying out dimension reduction and feature extraction on the multispectral images;
(S2) respectively extracting a physical characteristic diagram, a spatial characteristic diagram and a spectral characteristic diagram of the multispectral image in a diagram embedding mode;
(S3) connecting the obtained three feature graphs, which have different node types, through link edges and nodes, and obtaining a heterogeneous graph fusing the multi-source features with a self-attention-based graph pooling method;
(S4) ordering the fused graph data along the time dimension as input, and acquiring the temporal and spatial dimension features of the data and their spatio-temporal correlation with multi-layer spatio-temporal graph convolution;
(S5) aligning the spatio-temporal convolution result with the predicted target dimensions through CNN convolution and fully connected operations, and classifying and locating targets through a weight-sharing fully connected layer.
Referring to fig. 1, in step (S1), the acquired multispectral data consist of four-band multispectral images, and 1000 such images from different time periods are taken.
Referring to fig. 1, in step (S1), dimension reduction and feature extraction proceed as follows. Spectral and spatial information are fused in an augmented vector:

x = (u, v, b_1, b_2, ..., b_B) = (x_1, x_2, ..., x_{B+2})^T   (1)

where (u, v) is the position of pixel h(u, v) on the image and (b_1, b_2, ..., b_B) is its vector of band values.
In this embodiment, images of 4 bands are acquired, so B = 4.
The augmented vectors are used as training data: each x_i is normalized and same-class samples are grouped under supervision, a pixel local neighborhood is built with the k-nearest-neighbor algorithm, and manifold learning performs similarity classification and feature dimension reduction on the spatial and spectral information of the local neighborhood. Spatial-spectral polynomial local-area (or neighborhood) embedding assigns weights to the spectral-feature similarities of different pixels within the local neighborhood, and element-wise matrix multiplication finally establishes a low-dimensional nonlinear explicit mapping of the multispectral data.
In a specific embodiment, the number of marked elements in the augmentation vector is 6.
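The construction of the augmented vectors and the manifold-learning dimension reduction can be sketched as follows; scikit-learn's LocallyLinearEmbedding stands in for the manifold learner, and the image size, neighborhood size, and output dimension are illustrative assumptions, since the embodiment fixes only B = 4.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

# Sketch of step (S1): build augmented pixel vectors
# x = (u, v, b_1, ..., b_B)^T for a B = 4 band cube, then reduce
# dimension with manifold learning over k-nearest-neighbor
# neighborhoods. Image size, n_neighbors, and n_components are
# illustrative assumptions; LLE stands in for the manifold learner.

H, W, B = 64, 64, 4                       # height, width, number of bands
cube = np.random.rand(H, W, B)            # stand-in for a real multispectral image

# Augmented vectors: pixel coordinates concatenated with band values.
uu, vv = np.meshgrid(np.arange(W), np.arange(H))
x = np.concatenate([uu[..., None], vv[..., None], cube], axis=-1)
x = x.reshape(-1, B + 2)                  # (H*W, B+2) training vectors

# Normalize, then embed; the k-NN graph inside LLE plays the role of
# the pixel local neighborhood described in the text.
x = (x - x.mean(0)) / (x.std(0) + 1e-8)
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=3)
x_low = lle.fit_transform(x)              # (H*W, 3) low-dimensional features
print(x_low.shape)
```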
Referring to fig. 1, in step (S2), the physical feature graph, covering physical characteristics such as equivalent temperature and equivalent area, is represented as a graph by a random-walk graph embedding method.
Referring to fig. 1, in step (S2), the spatial feature graph is built by superpixel-segmenting the multispectral image with the SLIC algorithm: the spatial and spectral distances between pixels are computed and their weights balanced, the cluster centers and range boundaries of the superpixels are updated iteratively to obtain multispectral image data composed of superpixels, and edge connections between nodes are constructed from the spatial connectivity of the superpixels.
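A minimal sketch of this superpixel graph construction, assuming scikit-image's slic as the SLIC implementation; n_segments and compactness are illustrative assumptions not fixed by the embodiment.

```python
import numpy as np
from skimage.segmentation import slic

# Sketch of the spatial graph in step (S2): SLIC superpixels over a
# multispectral cube, then an edge list from the spatial adjacency of
# superpixel labels.

H, W, B = 64, 64, 4
cube = np.random.rand(H, W, B)

# SLIC balances spatial distance against spectral distance through
# `compactness`, iteratively updating cluster centers and boundaries.
labels = slic(cube, n_segments=100, compactness=10.0,
              channel_axis=-1, start_label=0)

# Node features: mean spectrum of each superpixel.
n_nodes = labels.max() + 1
feats = np.stack([cube[labels == i].mean(axis=0) for i in range(n_nodes)])

# Edges: pairs of superpixels that touch horizontally or vertically.
edges = set()
for a, b in [(labels[:, :-1], labels[:, 1:]),
             (labels[:-1, :], labels[1:, :])]:
    for i, j in zip(a.ravel(), b.ravel()):
        if i != j:
            edges.add((min(i, j), max(i, j)))
print(n_nodes, len(edges))
```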
Referring to fig. 1, the spectral feature graph in step (S2) is constructed via a semi-supervised adjacency matrix. Specifically, it is built from the information provided by a limited amount of labeled data and a large amount of unlabeled data: pseudo labels are constructed with a variational Dirichlet process mixture model, and the spatial-spectral adjacency matrix is realized with a clustering algorithm inherent to the data samples.
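The pseudo-label step can be sketched with scikit-learn's BayesianGaussianMixture, whose dirichlet_process prior gives a variational Dirichlet process mixture; connecting samples that share a pseudo label is one simple reading of the adjacency construction, used here as an assumption.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Sketch of the pseudo-label step: a variational Dirichlet process
# mixture clusters (mostly unlabeled) spectra, and the cluster ids act
# as pseudo labels for a spectral adjacency matrix. The component cap
# and the shared-label adjacency rule are illustrative assumptions.

spectra = np.random.rand(500, 4)            # stand-in spectra (N pixels, B bands)
dpmm = BayesianGaussianMixture(
    n_components=10,                        # upper bound; DP prior prunes extras
    weight_concentration_prior_type="dirichlet_process",
)
pseudo = dpmm.fit_predict(spectra)          # pseudo labels from clustering

# Semi-supervised adjacency: connect samples sharing a pseudo label.
A = (pseudo[:, None] == pseudo[None, :]).astype(float)
np.fill_diagonal(A, 0.0)
print(int(A.sum()))                         # number of (directed) edges
```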
Referring to fig. 1, in step (S3), the nodes and edges of the three obtained feature graphs are analyzed, and the three feature graphs with different node and edge types are connected using a network structure based on a graph autoencoder.
Specifically, each given graph is analyzed: node feature vectors across the different graphs are compared by cosine similarity, and the nodes with high similarity among the three graphs are retained. The three processed graphs are then computed with a graph convolutional network, yielding a node representation z_i for each node. The following formula is then used:

p_ij = σ(z_i^T z_j)   (2)

where p_ij is the predicted probability of a link between nodes (i, j) and σ is the sigmoid activation function. Node pairs with probability greater than 0.8 are linked and pairs with probability less than 0.2 are left unconnected, yielding the new graph that links the three graphs.
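A minimal sketch of this link decision, assuming node embeddings z from a graph convolutional encoder; the inner-product decoder and the 0.8/0.2 thresholds follow the text, while the node count and embedding width are illustrative assumptions.

```python
import torch

# Sketch of the link decision in step (S3): embeddings z are scored
# pairwise with p_ij = sigmoid(z_i^T z_j); links are kept above 0.8
# and rejected below 0.2, as in the text.

z = torch.randn(30, 16)                     # 30 nodes, 16-dim embeddings
prob = torch.sigmoid(z @ z.t())             # pairwise link probabilities

link_mask = prob > 0.8                      # confident links
no_link_mask = prob < 0.2                   # confident non-links
# Here the confident links alone form the merged graph's edge set.
edge_index = link_mask.nonzero(as_tuple=False).t()
print(edge_index.shape)                     # (2, number of links)
```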
Referring to fig. 1, in step (S3), the nodes of the new graph are extracted and aggregated with the SAGPool method.
Specifically, the new graph is processed by graph neural network convolution: the GCN learns a feature representation for each node v ∈ V by aggregating the features of its neighbor nodes. For each node v, a self-attention mechanism computes an attention score z. Top-k selection then retains the most important nodes, with the number kept determined by the pooling ratio k, here set to k = 0.5. The resulting attention-based mask is multiplied with the corresponding nodes of the originally input graph structure of fused heterogeneous information, giving the final output graph, i.e. the heterogeneous graph fusing the multi-source features.
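A minimal sketch of this pooling step, assuming PyTorch Geometric's SAGPooling; the graph size and feature width are illustrative assumptions, while the pooling ratio k = 0.5 follows the embodiment.

```python
import torch
from torch_geometric.nn import GCNConv, SAGPooling

# Sketch of step (S3)'s pooling: a GCN layer aggregates neighbor
# features, then SAGPooling scores nodes by self-attention and keeps
# the top-k fraction (ratio k = 0.5, as in the embodiment).

x = torch.randn(30, 16)                         # 30 nodes, 16 features
edge_index = torch.randint(0, 30, (2, 80))      # random stand-in edges

conv = GCNConv(16, 16)
pool = SAGPooling(16, ratio=0.5)                # keep half the nodes

h = conv(x, edge_index).relu()
h, edge_index, _, batch, perm, score = pool(h, edge_index)
print(h.shape)                                  # roughly (15, 16)
```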
Referring to fig. 1, in step (S4), the network extracting the time dimension is the TC (temporal convolution) module, which consists of two dilated inception layers.
Specifically, the output of the whole temporal convolution module comes from two branches: the module input is filtered by dilated inception layers built from groups of one-dimensional convolution filters, the two branches differing only in the activation function that follows. Each dilated inception layer combines filter sizes of 1x2, 1x3, 1x6, and 1x7, so that the time spans of interest can be covered by combinations of these filter sizes.
In this example, 10 graphs of fused heterogeneous features are input at a time, i.e. the forward and backward temporal feature relationships are extracted from 10 consecutive multispectral frames.
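The temporal module can be sketched as below; the kernel widths 2, 3, 6, 7 and the 10-frame window follow the text, while the tanh/sigmoid gating between the two branches and the channel counts are assumptions modeled on common dilated-inception temporal convolutions.

```python
import torch
import torch.nn as nn

# Sketch of the temporal module in step (S4): a dilated inception layer
# runs 1-D convolutions of widths 2, 3, 6, 7 over the time axis, and two
# such layers gate each other with tanh and sigmoid (the "different
# activation functions" of the two branches).

class DilatedInception(nn.Module):
    def __init__(self, c_in, c_out, dilation=1):
        super().__init__()
        assert c_out % 4 == 0
        self.branches = nn.ModuleList([
            nn.Conv2d(c_in, c_out // 4, kernel_size=(1, k),
                      dilation=(1, dilation))
            for k in (2, 3, 6, 7)
        ])

    def forward(self, x):                   # x: (batch, C, nodes, time)
        outs = [b(x) for b in self.branches]
        t = min(o.size(-1) for o in outs)   # align lengths after convolution
        return torch.cat([o[..., -t:] for o in outs], dim=1)

filter_conv = DilatedInception(16, 32)
gate_conv = DilatedInception(16, 32)
x = torch.randn(8, 16, 15, 10)              # batch 8, 15 nodes, 10 frames
out = torch.tanh(filter_conv(x)) * torch.sigmoid(gate_conv(x))
print(out.shape)                            # (8, 32, 15, 4)
```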
Referring to fig. 1, in step (S4), the network extracting the spatial dimension is a GCN combined with a GAT.
Specifically, after the temporal module, spatial features are extracted by a GCN layer, while a GAT graph attention layer passes information between nodes and captures the dependencies among them. The features then pass through the TC module, GCN, and GAT again, and features are extracted from the output of each such pass.
In this example, four levels of features are extracted and concatenated (concat) to obtain multi-scale spatio-temporal features.
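A minimal sketch of the spatial half of the block, assuming PyTorch Geometric's GCNConv and GATConv; the four repetitions and the final concatenation follow the text, while all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, GATConv

# Sketch of the spatial half of step (S4): after each temporal block, a
# GCN layer aggregates neighborhood features and a GAT layer re-weights
# message passing with attention; four levels are concatenated into
# multi-scale features.

x = torch.randn(15, 32)                     # 15 nodes, 32 features
edge_index = torch.randint(0, 15, (2, 40))  # random stand-in edges

gcns = nn.ModuleList([GCNConv(32, 32) for _ in range(4)])
gats = nn.ModuleList([GATConv(32, 32, heads=1) for _ in range(4)])

levels = []
h = x
for gcn, gat in zip(gcns, gats):            # four space-time levels
    h = gcn(h, edge_index).relu()
    h = gat(h, edge_index).relu()
    levels.append(h)

multi_scale = torch.cat(levels, dim=-1)     # concat, as in the text
print(multi_scale.shape)                    # (15, 128)
```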
Referring to fig. 1, in step (S5), the extracted spatio-temporal features are passed through five CNN convolution blocks so that the spatio-temporal convolution result matches the predicted target dimensions; finally an MLP classifies the acquired features, yielding the position and category information of targets in the image and completing the detection task.
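The detection head can be sketched as follows; the five convolution blocks and the MLP classifier follow the text, while every channel width, the feature-map size, and the 4-coordinate-plus-score output layout are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Sketch of step (S5): five stacked convolution blocks reshape the
# spatio-temporal features toward the prediction dimensions, and an MLP
# head emits box coordinates and a class score.

head = nn.Sequential(*[
    nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU())
    for c_in, c_out in [(128, 96), (96, 64), (64, 48), (48, 32), (32, 16)]
])
mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(16 * 8 * 8, 256), nn.ReLU(),
    nn.Linear(256, 4 + 1),                  # 4 box coordinates + 1 class score
)

feat = torch.randn(2, 128, 8, 8)            # batch of fused feature maps
pred = mlp(head(feat))
print(pred.shape)                           # (2, 5)
```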
In this example, training was performed on Ubuntu 18.04.3 Linux, in a PyCharm environment with the Python 3.9 language and the pytorch-cuda11.7 deep learning library, on a GeForce RTX 3090 graphics card.
Also in this example, the IoU threshold is set to 0.5, and the spatio-temporal graph convolutional network is built with PyTorch. The loss function is set to the binary cross-entropy loss:

L = -[y log(ŷ) + (1 - y) log(1 - ŷ)]   (3)

where y is the ground truth (1 for a positive sample and 0 for a negative sample) and ŷ is the probability predicted by the model. The learning rate is set to 0.01, the number of epochs to 300, and the batch size to 8, with 1000 images input.
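A minimal sketch of the training objective and quoted settings in PyTorch; BCEWithLogitsLoss realizes the binary cross-entropy above, the learning rate 0.01 and batch size 8 follow the text, and the placeholder linear model and 3-epoch loop stand in for the full network and 300-epoch schedule.

```python
import torch
import torch.nn as nn

# Sketch of the training loop: binary cross-entropy
# L = -[y*log(p) + (1-y)*log(1-p)], learning rate 0.01, batch size 8.

model = nn.Linear(10, 1)                    # placeholder for the detector
criterion = nn.BCEWithLogitsLoss()          # numerically stable BCE
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

inputs = torch.randn(8, 10)                 # batch size 8, as in the text
targets = torch.randint(0, 2, (8, 1)).float()  # ground-truth 1/0 labels

for epoch in range(3):
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
print(float(loss))
```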
The foregoing is only a preferred embodiment of the present invention, and the scope of the invention is not limited to the above example: all technical solutions within the spirit and principle of the present invention fall within its scope. Modifications and adaptations that do not depart from those principles are likewise intended to be within the scope of the present invention.

Claims (8)

1. The target detection method based on the graph reconstruction network is characterized by comprising the following steps of:
(S1) collecting multispectral images for a period of time, and carrying out dimension reduction and feature extraction on the multispectral images;
(S2) respectively extracting a physical characteristic diagram, a spatial characteristic diagram and a spectral characteristic diagram of the multispectral image in a diagram embedding mode;
(S3) connecting the obtained three feature graphs, which have different node types, through link edges and nodes, and obtaining a heterogeneous graph fusing the multi-source features with a self-attention-based graph pooling method;
(S4) ordering the fused graph data along the time dimension as input, and acquiring the temporal and spatial dimension features of the data and their spatio-temporal correlation with multi-layer spatio-temporal graph convolution;
(S5) aligning the spatio-temporal convolution result with the predicted target dimensions through CNN convolution and fully connected operations, and classifying and locating targets through a weight-sharing fully connected layer.
2. The method of claim 1, wherein in step (S1), the multispectral image is captured by a multispectral camera capable of capturing 3 or more spectral bands simultaneously.
3. The method for detecting a target based on a graph reconstruction network according to claim 1, wherein in step (S1), the dimension reduction and feature extraction use spatial-spectral embedding to assign weights according to the spectral-feature similarity of different pixels, and manifold learning performs similarity classification and feature dimension reduction on the spatial and spectral information of the local neighborhood.
4. The method for detecting an object based on a graph reconstruction network according to claim 1, wherein in step (S2), the three feature graphs are obtained as follows: the physical feature graph is extracted from the dimension-reduced spectral data combined with infrared spectral features; superpixel neighbor-node information is determined with a simple linear iterative clustering method, edge connections between nodes are built from the spatial connectivity of the superpixels, and the spatial feature graph is extracted; and, combining the spectral-feature similarity of the target, sampling and recombination across different spectral band dimensions yield the target's spectral feature distribution, with a graph neural network effectively representing the spectral data residing on a smooth manifold.
5. The method of claim 1, wherein in step (S3), the linking network model is a graph autoencoder, including but not limited to a graph convolutional autoencoder, a variational graph convolutional autoencoder, and an adversarially regularized graph autoencoder.
6. The method of claim 1, wherein in step (S3), the pooling method includes but is not limited to DiffPool, SAGPool, and ASAP.
7. The method of claim 1, wherein in step (S4), the spatio-temporal graph convolution extracts features in the time dimension and the space dimension respectively, wherein the network extracting the time dimension includes but is not limited to an RNN, GRU, LSTM, TCN, or Transformer, and the network extracting the space dimension includes but is not limited to a GCN, a GAT, or a GCN combined with a GAT.
8. The method for detecting a target based on a graph reconstruction network according to claim 1, wherein in step (S5), after four levels of spatio-temporal graph convolution, the extracted features are passed through a convolution module and finally fed to a target detection module, which classifies and locates the target to complete the recognition task.
CN202310575816.6A 2023-05-22 2023-05-22 Target detection method based on graph reconstruction network Pending CN116740418A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310575816.6A CN116740418A (en) 2023-05-22 2023-05-22 Target detection method based on graph reconstruction network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310575816.6A CN116740418A (en) 2023-05-22 2023-05-22 Target detection method based on graph reconstruction network

Publications (1)

Publication Number Publication Date
CN116740418A true CN116740418A (en) 2023-09-12

Family

ID=87914282

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310575816.6A Pending CN116740418A (en) 2023-05-22 2023-05-22 Target detection method based on graph reconstruction network

Country Status (1)

Country Link
CN (1) CN116740418A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116934754A (en) * 2023-09-18 2023-10-24 四川大学华西第二医院 Liver image identification method and device based on graph neural network
CN116934754B (en) * 2023-09-18 2023-12-01 四川大学华西第二医院 Liver image identification method and device based on graph neural network
CN117830752A (en) * 2024-03-06 2024-04-05 昆明理工大学 Self-adaptive space-spectrum mask graph convolution method for multi-spectrum point cloud classification
CN117830752B (en) * 2024-03-06 2024-05-07 昆明理工大学 Self-adaptive space-spectrum mask graph convolution method for multi-spectrum point cloud classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination