CN115620150A - Multi-modal image ground building identification method and device based on twin Transformer - Google Patents

Multi-modal image ground building identification method and device based on twin Transformer

Info

Publication number
CN115620150A
CN115620150A
Authority
CN
China
Prior art keywords
image
neural network
twin
twin neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211545426.6A
Other languages
Chinese (zh)
Other versions
CN115620150B (en)
Inventor
蒙顺开
瞿锐恒
李叶雨
Current Assignee
Dolphin Lezhi Technology Chengdu Co ltd
Original Assignee
Dolphin Lezhi Technology Chengdu Co ltd
Priority date
Filing date
Publication date
Application filed by Dolphin Lezhi Technology Chengdu Co ltd
Priority to CN202211545426.6A
Publication of CN115620150A
Application granted
Publication of CN115620150B
Active legal status
Anticipated expiration legal status

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/176Urban or other man-made structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10048Infrared image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

The invention discloses a twin Transformer-based multi-modal image ground building identification method and device, belonging to the technical field of ground building identification. The multi-modal image ground building identification method comprises the following steps: establishing a multi-twin neural network with N Transformer structures, wherein the multi-twin neural network is a pseudo-twin neural network; acquiring N target images in different modalities; and inputting the target images into the multi-twin neural network, which outputs a recognition result. The invention achieves accurate recognition of multi-platform, multi-modal ground building images.

Description

Multi-modal image ground building identification method and device based on twin Transformer
Technical Field
The invention belongs to the technical field of ground building identification, and particularly relates to a twin Transformer-based multi-modal image ground building identification method and device.
Background
With the continuous advance of urbanization, modern urban buildings occupy an ever larger share of land, the types of urban buildings grow richer, and buildings are increasingly interconnected; residential areas with different internal layouts, business districts with office buildings of varying heights, low-rise houses, and industrial parks covering wide areas all present different difficulties for ground building search.
From the perspective of reconnaissance image sources, visible light images, infrared images, SAR radar images and the like are currently the main sources for ground building target reconnaissance, and equipment such as remote sensing satellites or unmanned aerial vehicles can capture this image information. A visible light image mainly records the color and texture of the target; it has higher resolution, richer detail and light-dark contrast, and describes the target concretely, close to what the human eye sees, but visible light imaging is strongly affected by illumination and weather conditions. An infrared image captures the thermal radiation of the target, with strong penetrating power and strong contour capture, but generally lower resolution and poorer texture. A SAR image is a radar image: it works in all weather and at all times of day, is unaffected by meteorological conditions, offers high imaging resolution and a wide swath, records phase, amplitude and intensity information, and with focusing processing yields a clear, high-resolution grayscale image.
The omnidirectional sensing platforms for ground buildings include space-based, air-based, shore-based and sea-based platforms, which sense information about ground buildings, the environment and geography through various sensors. However, the targets photographed by these platforms vary greatly in angle and size, which makes ground building identification very difficult.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a twin Transformer-based multi-modal image ground building identification method and device.
The purpose of the invention is realized by the following technical scheme:
according to the first aspect of the invention, the twin Transformer-based multi-modal image ground building identification method comprises the following steps:
establishing a multi-twin neural network with N Transformer structures, wherein the multi-twin neural network is a pseudo-twin neural network;
acquiring N target images in different modalities;
inputting the target image into the multi-twin neural network, and outputting a recognition result by the multi-twin neural network;
the multi-twin neural network comprises a plurality of neural network units, each neural network unit comprising an image preprocessing network, a position and image data encoding network, an encoder network and a fully connected layer, wherein the encoder network comprises L encoders connected in series;
the image preprocessing network is used for converting an input target image into a normalized image feature map;
the position and image data encoding network is used for converting the image feature map into feature vectors containing position and image data;
the encoder network is used for extracting the feature vectors;
and the fully connected layer is used for mapping the feature vector output by the encoder network to the target class and outputting the class probability of the target.
Further, the target image is an infrared image, a visible light image, a SAR radar image, a multispectral image or a laser radar image.
Further, establishing a multi-twin neural network with N Transformer structures comprises:
acquiring a plurality of source images to form a data set, and labeling the data set to form training data of a multi-twin neural network;
establishing a joint loss function of the multi-twin neural network;
and training the multi-twin neural network by using the joint loss function to obtain parameters of the multi-twin neural network.
According to a second aspect of the present invention, a twin Transformer-based multi-modal image ground building recognition apparatus comprises:
the model building module is used for building a multi-twin neural network with N Transformer structures, wherein the multi-twin neural network is a pseudo-twin neural network;
the image acquisition module is used for acquiring N target images in different modes;
and the target recognition module is used for inputting the target image into the multi-twin neural network to obtain a recognition result output by the multi-twin neural network.
The invention has the beneficial effects that:
(1) The method utilizes the Transformer attention mechanism to extract globally effective information in a scene and to focus attention on local feature points, then uses a multi-twin neural network to perform feature extraction and similarity calculation on target images of multiple modalities and multiple viewing angles, realizing the associative synthesis of different information sources of the same target scene; this completes the holistic modeling of the target scene and achieves accurate identification of multi-platform, multi-modal ground building images;
(2) The method adopts the more typical pseudo-twin neural network and, by designing the loss function of the pseudo-twin network, constructs a neural network model with a consistent representation across data of different modalities, thereby solving the problem of matching targets across images of different modalities.
Drawings
FIG. 1 is a flow diagram of one embodiment of a method for multi-modal image ground structure identification in accordance with the present invention;
FIG. 2 is a schematic diagram of a pseudo-twin neural network;
FIG. 3 is a diagram of the Transformer as an encoder-decoder architecture;
FIG. 4 is a schematic diagram of a multi-twin neural network training process;
FIG. 5 is a schematic diagram of an image pre-processing network;
FIG. 6 is a schematic diagram of a plurality of modules obtained by dividing an image feature map by a location and image data encoding network;
FIG. 7 is a schematic diagram of an encoder;
FIG. 8 is a schematic diagram of a process of inputting a target image of a multi-twin neural network;
fig. 9 is a block diagram of the multi-modal image ground structure recognition apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments. It should be apparent that the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention fall within the scope of the present invention.
Referring to fig. 1 to 4, the present invention provides a twin Transformer-based multi-modal image ground building recognition method and apparatus:
in a first aspect of the present invention, a twin Transformer-based multimodal image ground structure recognition method is provided, as shown in fig. 1, the multimodal image ground structure recognition method includes steps S100 to S300, which are described in detail below.
S100, establishing a multi-twin neural network with N Transformer structures, wherein the multi-twin neural network is a pseudo-twin neural network.
As shown in fig. 2, the neural network structures employed in the input image branches of a pseudo-twin neural network are different, or their parameters are not shared. This embodiment adopts a pseudo-twin neural network and, by designing the loss function of the pseudo-twin network, constructs a neural network model with a consistent representation across data of different modalities, thereby solving the problem of matching targets across images of different modalities.
In this embodiment, distance measures such as the Euclidean distance, the cosine distance and the exponential distance were compared, and the measure that minimizes within-class distance while maximizing between-class distance was selected as the distance measure of the multi-twin neural network. The twin network maps its several inputs through deep neural networks into a new vector space; the goal is achieved as long as the distances between the resulting vectors can be compared, with same-class vectors closer together and different-class vectors farther apart.
When the Transformer is used as an encoder-decoder, it is based entirely on the attention mechanism, without any convolutional or recurrent neural network layers; its overall structure is shown in fig. 3. The embedded representations of the input (source) sequence and the output (target) sequence, plus position coding, are input to the encoder and decoder, respectively. Unlike a convolution, which only models relations between neighboring pixels, the Transformer is a global operation that can model relations between all pixels; it has stronger modeling capability, extracts the global features of a scene better, and highlights the relation between the local parts and the whole. This embodiment therefore uses a Transformer instead of convolution, achieving better scene feature extraction for the subsequent task. The Transformer and the twin network are combined into a multi-twin neural network with N Transformer structures, establishing a consistent representation model of the extended target under cross-modal, multi-view and varying-scale conditions.
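As an illustration, the sketch below shows one way such a pseudo-twin wrapper around N independent Transformer branches could look. This is a minimal PyTorch sketch under assumed names (MultiTwinNetwork, branches); the patent does not prescribe an implementation.

```python
import torch
import torch.nn as nn

class MultiTwinNetwork(nn.Module):
    """Pseudo-twin network: one independent branch per modality.

    Hypothetical sketch: each branch is a Transformer-based encoder for
    one modality. Parameters are deliberately NOT shared, which is what
    distinguishes a pseudo-twin from a true twin (Siamese) network.
    """
    def __init__(self, branches: list[nn.Module]):
        super().__init__()
        self.branches = nn.ModuleList(branches)

    def forward(self, images: list[torch.Tensor]) -> list[torch.Tensor]:
        # The i-th modality goes through the i-th branch; the joint loss
        # later pulls matching embeddings together across branches.
        return [branch(img) for branch, img in zip(self.branches, images)]
```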
In some embodiments, the multi-twin neural network includes a plurality of neural network units, each including an image preprocessing network, a position and image data encoding network, an encoder network and a fully connected layer; the encoder network includes L encoders in series, as shown in fig. 4.
The image preprocessing network is used for converting an input target image into a normalized image feature map.
Specifically, the image preprocessing network converts inputs of different resolutions and channel counts into a common image feature map. Suppose the input image of the image preprocessing network in the i-th neural network unit has size H_i × W_i × C_i; the network then outputs a normalized image feature map of size (M×P) × (K×P) × C. The structure of the image preprocessing network is shown in fig. 5: the input image is first interpolated to an image of size (M×P) × (K×P) × C_i, and a 1 × 1 convolution over the channels then yields the (M×P) × (K×P) × C image feature map.
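A minimal sketch of such a preprocessing network, assuming a PyTorch implementation; bilinear interpolation and the class and argument names are assumptions, as the patent only specifies the interpolation-then-1×1-convolution structure:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImagePreprocessing(nn.Module):
    """Map an H_i x W_i x C_i input to a normalized (M*P) x (K*P) x C map."""
    def __init__(self, c_in: int, c_out: int, m: int, k: int, p: int):
        super().__init__()
        self.size = (m * p, k * p)
        # 1x1 convolution maps the C_i input channels to the common C channels.
        self.proj = nn.Conv2d(c_in, c_out, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C_i, H_i, W_i) -> interpolate to (B, C_i, M*P, K*P)
        x = F.interpolate(x, size=self.size, mode="bilinear", align_corners=False)
        return self.proj(x)  # (B, C, M*P, K*P)
```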
The location and image data encoding network is used to convert the image feature map into feature vectors containing location and image data.
Specifically, the input of the position and image data encoding network is the output of the image preprocessing network, i.e. the normalized image feature map of size (M×P) × (K×P) × C. Its output is (M×K) feature vectors containing position and image data. As shown in fig. 6, the network divides the (M×P) × (K×P) × C normalized feature map into (M×K) blocks of size P × P, each block containing P × P × C values.
Expanding the three-dimensional P × P × C feature block yields a feature vector Z_t of size (P×P×C) × 1. Assume the position of the feature block in the feature map is (m, n), where 0 < m < M+1, 0 < n < K+1, and m, n are integers; the position code X_pos of the block is defined as:

X_pos = (n × M + m) / (M × K)

Combining the (P×P×C) × 1 block vector with the position code of the block gives the feature vector Z_p:

Z_p = [X_pos; Z_t]

The size of Z_p is (P×P×C + 1) × 1. Z_p cannot be input into the encoder module directly and needs normalization, performed as follows:

Z_po,i = sigmoid(BN(Z_p,i × W_p,i + B_i))

where 0 ≤ i < (M×K) and i is an integer, W_p,i is a (P×P×C + 1) × (P×P×C + 1) matrix, and B_i is a (P×P×C + 1) × 1 matrix; both W_p,i and B_i are learnable network parameters. The sigmoid function serves as the nonlinearity: it normalizes the module's output into the range (0, 1) while adding expressive power. It is defined as:

sigmoid(x) = 1 / (1 + e^(−x))
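The encoding step might be sketched as follows (hypothetical PyTorch code; the row-major patch ordering used for the position code is one possible linearization of (m, n), and the BatchNorm placement follows the formula above):

```python
import torch
import torch.nn as nn

class PositionImageEncoding(nn.Module):
    """Turn the (M*P) x (K*P) x C map into M*K vectors of size P*P*C + 1."""
    def __init__(self, m: int, k: int, p: int, c: int):
        super().__init__()
        self.m, self.k, self.p = m, k, p
        d = p * p * c + 1
        # One learnable matrix W_p,i and bias B_i per block position i.
        self.w = nn.Parameter(torch.randn(m * k, d, d) * d ** -0.5)
        self.b = nn.Parameter(torch.zeros(m * k, d))
        self.bn = nn.BatchNorm1d(m * k)  # normalizes each block over the batch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, M*P, K*P) -> (B, M*K, P*P*C) flattened patch vectors Z_t
        bsz = x.shape[0]
        patches = x.unfold(2, self.p, self.p).unfold(3, self.p, self.p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(bsz, self.m * self.k, -1)
        # Scalar position code X_pos = index / (M*K), prepended to each Z_t.
        pos = torch.arange(self.m * self.k, device=x.device, dtype=x.dtype)
        pos = (pos / (self.m * self.k)).expand(bsz, -1).unsqueeze(-1)
        z_p = torch.cat([pos, patches], dim=-1)              # (B, M*K, d)
        # Per-block affine map, then BatchNorm and sigmoid: Z_po = sigmoid(BN(.))
        z = torch.einsum("bnd,nde->bne", z_p, self.w) + self.b
        return torch.sigmoid(self.bn(z))                     # values in (0, 1)
```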
the encoder network is used to perform feature vector extraction.
Specifically, the encoder network is composed of L encoders connected in series, and the structure of each encoder is shown in fig. 7. The input to the l-th encoder (0 ≤ l ≤ L−1, l an integer) is (M×K) feature vectors Z_po^l of size (P×P×C + 1) × 1, and its output is (M×K) feature vectors Z_po^(l+1) of the same size.
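A sketch of the encoder network using the stock PyTorch Transformer encoder follows; treating each of the (M×K) vectors as a token of dimension d = P×P×C + 1 is the natural reading, but the internals of each encoder beyond fig. 7 are an assumption here:

```python
import torch
import torch.nn as nn

class EncoderNetwork(nn.Module):
    """L Transformer encoders in series over the M*K token vectors."""
    def __init__(self, d: int, num_layers: int, num_heads: int = 1):
        super().__init__()
        # num_heads must divide d; with d = P*P*C + 1 possibly odd, 1 is safe.
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=num_heads,
                                           batch_first=True)
        self.encoders = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, M*K, d); each layer l maps Z_po^l to Z_po^(l+1).
        return self.encoders(z)
```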
The fully connected layer maps the feature vector output by the encoder network to the target class and outputs the class probability of the target.
Specifically, assuming the number of target classes is T, the (M×K) feature vectors of size (P×P×C + 1) × 1 are accumulated to obtain a feature vector Z_M, which is itself a (P×P×C + 1) × 1 vector. The output Z_C of the fully connected layer relates to Z_M as follows:

Z_C = softmax(Z_M × W_M + B_M)

where Z_C is a T × 1 vector whose elements represent the probability, in the range (0, 1), that the input heterogeneous image belongs to each category. W_M and B_M, the weights and biases of the fully connected layer, are learnable parameters. The softmax function is defined as:

softmax(x_j) = e^(x_j) / Σ_k e^(x_k)
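A corresponding sketch of the classification head (hypothetical names; summation is used here for the "accumulation" of the (M×K) vectors, which the patent does not define further):

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Accumulate the M*K token vectors into Z_M, then map to T classes."""
    def __init__(self, d: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(d, num_classes)  # holds W_M and B_M

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        z_m = z.sum(dim=1)                   # (B, M*K, d) -> (B, d)
        # Z_C = softmax(Z_M * W_M + B_M): per-class probabilities in (0, 1).
        return torch.softmax(self.fc(z_m), dim=-1)
```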
in some embodiments, establishing a multi-twin neural network having N transform structures comprises:
s110, obtaining N source images to form a data set, and labeling the data set to form training data of a multi-twin neural network;
s120, establishing a joint loss function (L) of the multi-twin neural network, which is defined as:
Figure 419288DEST_PATH_IMAGE004
wherein Y represents whether the plurality of sample labels match, Y =1, representing a label match for N samples; y =0, representing two tags not matching; m is a set threshold value, belongs to a super parameter of the network and is obtained according to experience; h is the number of samples of single training;
Figure 605550DEST_PATH_IMAGE005
and outputting the distance between different twin network feature layer in the h training sample.
Figure 797497DEST_PATH_IMAGE006
Is defined as follows:
Figure DEST_PATH_IMAGE007
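A sketch of this joint loss over the N branch outputs follows. Accumulating the contrastive term over all branch pairs is an assumption; the patent only states that the distance is taken between the feature-layer outputs of different twin branches:

```python
import torch

def joint_contrastive_loss(feats: list[torch.Tensor],
                           y: torch.Tensor, margin: float) -> torch.Tensor:
    """Contrastive joint loss: matched samples (Y=1) are pulled together,
    unmatched samples (Y=0) are pushed beyond the margin m.

    feats: list of N tensors of shape (H, d), one per branch.
    y:     (H,) tensor with 1 where labels match and 0 where they do not.
    """
    y = y.float()
    loss = feats[0].new_zeros(())
    pairs = 0
    # Sum the pairwise contrastive terms over all branch pairs (i, j).
    for i in range(len(feats)):
        for j in range(i + 1, len(feats)):
            d_h = torch.norm(feats[i] - feats[j], dim=1)  # Euclidean D_h
            term = (y * d_h.pow(2)
                    + (1 - y) * torch.clamp(margin - d_h, min=0).pow(2))
            loss = loss + term.mean() / 2
            pairs += 1
    return loss / max(pairs, 1)
```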
s130, training the multi-twin neural network by using the joint loss function to obtain a parameter W of the multi-twin neural network 1 *,W 2 *,...,W N *。
S200, acquiring N target images in different modalities.
Generally, the target image is an infrared image, a visible light image, a SAR radar image, a multispectral image or a lidar image; specifically, the target images may be of one or more of these types.
Generally, the number of the target images is two or more. In this embodiment, there is no requirement for the shooting angle of the target image or the like.
S300, inputting the target images into the multi-twin neural network, which outputs the recognition result.
Specifically, the multi-twin neural network processes the input target images as follows: using the parameters W_1*, W_2*, ..., W_N* obtained in training, the input image from the i-th source passes through the i-th neural network unit, and the classification result of the target is obtained at the output Y of that neural network unit, as shown in fig. 8.
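Putting the pieces together, inference might look like the following sketch, reusing the hypothetical classes from the sketches above:

```python
import torch

@torch.no_grad()
def recognize(model: "MultiTwinNetwork",
              heads: list["ClassificationHead"],
              images: list[torch.Tensor]) -> list[torch.Tensor]:
    """Run the N modality images through their branches (with the trained
    parameters W_1*, ..., W_N*) and return per-branch class probabilities."""
    feats = model(images)                      # one token set per branch
    return [head(f) for head, f in zip(heads, feats)]
```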
The method uses the Transformer attention mechanism to extract globally effective information in the scene and to focus attention on local feature points, then uses the multi-twin neural network to perform feature extraction and similarity calculation on target images of multiple modalities and multiple viewing angles, realizing the associative synthesis of different information sources of the same target scene. This yields a consistent target feature representation across modalities, large viewing-angle changes and scale changes for ground buildings or large offshore ships, and achieves accurate identification of multi-platform, multi-modal ground building images.
In this embodiment, target images of different types are input into different networks; these networks have different structures and possibly different parameters. By defining the objective functions of the networks, identical feature vectors for the ground building target are obtained from the different image types, so that a common feature representation of the target is established across the input images of different modalities, finally achieving target identification. For example, an infrared image and a visible light image are input into two independent networks with different structures and possibly different parameters; by defining the objective functions of the two networks, the visible light image and the infrared image yield the same feature vectors for the ground building target.
A second aspect of the present invention provides a twin Transformer-based multi-modal image ground building recognition apparatus. As shown in fig. 9, the apparatus includes a model construction module, an image acquisition module and a target recognition module.
The model building module is used for building a multi-twin neural network with N Transformer structures, the multi-twin neural network being a pseudo-twin neural network. In this embodiment, the model building module may be configured to perform step S100 shown in fig. 1; for a detailed description of the model building module, refer to the description of step S100.
The image acquisition module is used for acquiring N target images in different modalities. In this embodiment, the image acquisition module may be configured to perform step S200 shown in fig. 1; for a detailed description of the image acquisition module, refer to the description of step S200.
The target recognition module is used for inputting the target images into the multi-twin neural network to obtain the recognition result output by the multi-twin neural network. In this embodiment, the target recognition module may be configured to perform step S300 shown in fig. 1; for a detailed description of the target recognition module, refer to the description of step S300.
The foregoing describes preferred embodiments of the invention. It is to be understood that the invention is not limited to the precise forms disclosed herein, and that various other combinations, modifications and environments falling within the scope of the inventive concept disclosed herein, whether described above or apparent to those skilled in the relevant art, may be resorted to. Modifications and variations effected by those skilled in the art without departing from the spirit and scope of the invention fall within the protection of the appended claims.

Claims (4)

1. The multi-modal image ground building identification method based on the twin Transformer, characterized by comprising the following steps:
establishing a multi-twin neural network with N Transformer structures, wherein the multi-twin neural network is a pseudo-twin neural network;
acquiring N target images in different modalities;
inputting the target images into the multi-twin neural network, and outputting a recognition result by the multi-twin neural network;
wherein the multi-twin neural network comprises a plurality of neural network units, each neural network unit comprising an image preprocessing network, a position and image data encoding network, an encoder network and a fully connected layer, the encoder network comprising L encoders connected in series;
the image preprocessing network is used for converting an input target image into a normalized image feature map;
the position and image data encoding network is used for converting the image feature map into feature vectors containing position and image data;
the encoder network is used for extracting the feature vectors;
and the fully connected layer is used for mapping the feature vector output by the encoder network to the target class and outputting the class probability of the target.
2. The twin Transformer-based multi-modal image ground building identification method according to claim 1, wherein the target image is an infrared image, a visible light image, a SAR radar image, a multispectral image or a lidar image.
3. The twin Transformer-based multi-modal image ground building identification method according to claim 1, wherein establishing a multi-twin neural network with N Transformer structures comprises:
obtaining a plurality of source images to form a data set, and labeling the data set to form training data of a multi-twin neural network;
establishing a joint loss function of the multi-twin neural network;
and training the multi-twin neural network by using the joint loss function to obtain parameters of the multi-twin neural network.
4. A twin Transformer-based multi-modal image ground building recognition device, characterized by comprising:
the model building module, used for building a multi-twin neural network with N Transformer structures, wherein the multi-twin neural network is a pseudo-twin neural network;
the image acquisition module is used for acquiring N target images in different modalities;
and the target recognition module is used for inputting the target image into the multi-twin neural network to obtain a recognition result output by the multi-twin neural network.
CN202211545426.6A 2022-12-05 2022-12-05 Multi-mode image ground building identification method and device based on twin transformers Active CN115620150B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211545426.6A CN115620150B (en) 2022-12-05 2022-12-05 Multi-mode image ground building identification method and device based on twin transformers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211545426.6A CN115620150B (en) 2022-12-05 2022-12-05 Multi-mode image ground building identification method and device based on twin transformers

Publications (2)

Publication Number Publication Date
CN115620150A true CN115620150A (en) 2023-01-17
CN115620150B CN115620150B (en) 2023-08-04

Family

ID=84879822

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211545426.6A Active CN115620150B (en) 2022-12-05 2022-12-05 Multi-mode image ground building identification method and device based on twin transformers

Country Status (1)

Country Link
CN (1) CN115620150B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115861822A (en) * 2023-02-07 2023-03-28 海豚乐智科技(成都)有限责任公司 Target local point and global structured matching method and device

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107122733A (en) * 2017-04-25 2017-09-01 西安电子科技大学 Hyperspectral image classification method based on NSCT and SAE
CN109492666A (en) * 2018-09-30 2019-03-19 北京百卓网络技术有限公司 Image recognition model training method, device and storage medium
CN110728330A (en) * 2019-10-23 2020-01-24 腾讯科技(深圳)有限公司 Object identification method, device, equipment and storage medium based on artificial intelligence
CN111368920A (en) * 2020-03-05 2020-07-03 中南大学 Quantum twin neural network-based binary classification method and face recognition method thereof
CN111461255A (en) * 2020-04-20 2020-07-28 武汉大学 Siamese network image identification method and system based on interval distribution
CN112215085A (en) * 2020-09-17 2021-01-12 云南电网有限责任公司昆明供电局 Power transmission corridor foreign matter detection method and system based on twin network
CN112749326A (en) * 2019-11-15 2021-05-04 腾讯科技(深圳)有限公司 Information processing method, information processing device, computer equipment and storage medium
CN112861988A (en) * 2021-03-04 2021-05-28 西南科技大学 Feature matching method based on attention-seeking neural network
CN114418956A (en) * 2021-12-24 2022-04-29 国网陕西省电力公司电力科学研究院 Method and system for detecting change of key electrical equipment of transformer substation
US20220172378A1 (en) * 2019-04-03 2022-06-02 Nec Corporation Image processing apparatus, image processing method and non-transitory computer readable medium
CN114581485A (en) * 2022-03-02 2022-06-03 上海瀚所信息技术有限公司 Target tracking method based on language modeling pattern twin network
CN115170575A (en) * 2022-09-09 2022-10-11 阿里巴巴(中国)有限公司 Method and equipment for remote sensing image change detection and model training
CN115272719A (en) * 2022-07-27 2022-11-01 上海工程技术大学 Cross-view-angle scene matching method for unmanned aerial vehicle image and satellite image
CN115424331A (en) * 2022-09-19 2022-12-02 四川轻化工大学 Human face relative relationship feature extraction and verification method based on global and local attention mechanism
CN115424155A (en) * 2022-11-04 2022-12-02 浙江大华技术股份有限公司 Illegal construction detection method, illegal construction detection device and computer storage medium

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107122733A (en) * 2017-04-25 2017-09-01 西安电子科技大学 Hyperspectral image classification method based on NSCT and SAE
CN109492666A (en) * 2018-09-30 2019-03-19 北京百卓网络技术有限公司 Image recognition model training method, device and storage medium
US20220172378A1 (en) * 2019-04-03 2022-06-02 Nec Corporation Image processing apparatus, image processing method and non-transitory computer readable medium
CN110728330A (en) * 2019-10-23 2020-01-24 腾讯科技(深圳)有限公司 Object identification method, device, equipment and storage medium based on artificial intelligence
CN112749326A (en) * 2019-11-15 2021-05-04 腾讯科技(深圳)有限公司 Information processing method, information processing device, computer equipment and storage medium
CN111368920A (en) * 2020-03-05 2020-07-03 中南大学 Quantum twin neural network-based binary classification method and face recognition method thereof
CN111461255A (en) * 2020-04-20 2020-07-28 武汉大学 Siamese network image identification method and system based on interval distribution
CN112215085A (en) * 2020-09-17 2021-01-12 云南电网有限责任公司昆明供电局 Power transmission corridor foreign matter detection method and system based on twin network
CN112861988A (en) * 2021-03-04 2021-05-28 西南科技大学 Feature matching method based on attention-seeking neural network
CN114418956A (en) * 2021-12-24 2022-04-29 国网陕西省电力公司电力科学研究院 Method and system for detecting change of key electrical equipment of transformer substation
CN114581485A (en) * 2022-03-02 2022-06-03 上海瀚所信息技术有限公司 Target tracking method based on language modeling pattern twin network
CN115272719A (en) * 2022-07-27 2022-11-01 上海工程技术大学 Cross-view-angle scene matching method for unmanned aerial vehicle image and satellite image
CN115170575A (en) * 2022-09-09 2022-10-11 阿里巴巴(中国)有限公司 Method and equipment for remote sensing image change detection and model training
CN115424331A (en) * 2022-09-19 2022-12-02 四川轻化工大学 Human face relative relationship feature extraction and verification method based on global and local attention mechanism
CN115424155A (en) * 2022-11-04 2022-12-02 浙江大华技术股份有限公司 Illegal construction detection method, illegal construction detection device and computer storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WANG MOYANG: "Multi-source remote sensing image change detection based on convolutional neural networks", China Master's Theses Full-text Database, Basic Sciences, pages 1-74 *
ZHAO CHENG: "Research on change detection algorithms for high-resolution remote sensing images", China Master's Theses Full-text Database, Engineering Science and Technology II, pages 1-59 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115861822A (en) * 2023-02-07 2023-03-28 海豚乐智科技(成都)有限责任公司 Target local point and global structured matching method and device
CN115861822B (en) * 2023-02-07 2023-05-12 海豚乐智科技(成都)有限责任公司 Target local point and global structured matching method and device

Also Published As

Publication number Publication date
CN115620150B (en) 2023-08-04

Similar Documents

Publication Publication Date Title
CN110070025B (en) Monocular image-based three-dimensional target detection system and method
CN113673425A (en) Multi-view target detection method and system based on Transformer
CN112949407B (en) Remote sensing image building vectorization method based on deep learning and point set optimization
CN115359372A (en) Unmanned aerial vehicle video moving object detection method based on optical flow network
CN115861619A (en) Airborne LiDAR (light detection and ranging) urban point cloud semantic segmentation method and system of recursive residual double-attention kernel point convolution network
CN115861591B (en) Unmanned aerial vehicle positioning method based on transformer key texture coding matching
CN115359474A (en) Lightweight three-dimensional target detection method, device and medium suitable for mobile terminal
CN112861700A (en) DeepLabv3+ based lane line network identification model establishment and vehicle speed detection method
CN111008979A (en) Robust night image semantic segmentation method
CN115620150A (en) Multi-modal image ground building identification method and device based on twin transform
CN115032648A (en) Three-dimensional target identification and positioning method based on laser radar dense point cloud
US11609332B2 (en) Method and apparatus for generating image using LiDAR
CN116958420A (en) High-precision modeling method for three-dimensional face of digital human teacher
CN112668662B (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
CN115393404A (en) Double-light image registration method, device and equipment and storage medium
CN114418913B (en) ISAR and infrared image pixel level fusion method based on wavelet transformation
CN114119615A (en) Radar segmentation method fusing space attention and self-attention transformation network
CN112233079A (en) Method and system for fusing images of multiple sensors
Chenguang et al. Application of Improved YOLO V5s Model for Regional Poverty Assessment Using Remote Sensing Image Target Detection
CN116503737B (en) Ship detection method and device based on space optical image
CN116563716B (en) GIS data processing system for forest carbon sink data acquisition
CN117115566B (en) Urban functional area identification method and system by utilizing full-season remote sensing images
WO2023241372A1 (en) Camera intrinsic parameter calibration method and related device
Yao et al. Semantic segmentation of remote sensing image based on U-NET
Kang 3D Objects Detection and Recognition from Colour and LiDAR Data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant