CN117173562A - SAR image ship identification method based on latent layer diffusion model technology - Google Patents


Publication number
CN117173562A
Authority: CN (China)
Prior art keywords: image, channel, module, layer, attention
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN202311065024.0A
Other languages: Chinese (zh)
Inventors: 王路, 亓宇航, 赵春晖, 李开誉, 刘浩东, 孙百良
Current assignee: Harbin Engineering University (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original assignee: Harbin Engineering University
Priority date (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Harbin Engineering University
Priority claimed from CN202311065024.0A
Publication of CN117173562A
Legal status: Pending


Landscapes

  • Image Analysis (AREA)

Abstract

The application provides a SAR image ship identification method based on a latent layer diffusion model technology. The method is used for SAR ship recognition tasks: the limited-sample data set is expanded by an image generation module, features are then extracted by the Transformer Layer of a T2T module augmented with an SE attention mechanism module, information fusion between adjacent features is enhanced, and key features are highlighted. Efficient and accurate SAR image ship identification is thereby realized.

Description

SAR image ship identification method based on latent layer diffusion model technology
Technical Field
The application belongs to the technical field of Synthetic Aperture Radar (SAR) target recognition, and particularly relates to a SAR image ship recognition method based on a latent layer diffusion model technology.
Background
In recent years, deep learning methods have been successfully applied to target recognition in SAR images. However, the identification of vessels in SAR images still presents significant challenges. First, due to the specificity of SAR images, traditional algorithms cannot model important local structures such as edges and lines between adjacent pixels, so the training efficiency on samples is low. Second, when the SAR ship identification data set is insufficient, the training samples provide very limited features for the backbone network, which prevents the model from achieving high classification accuracy. Therefore, how to efficiently and accurately recognize SAR ship targets with limited training samples is a problem that needs to be studied and solved.
Disclosure of Invention
The application aims to solve the problem of insufficient SAR ship sample data, and provides a SAR image ship identification method based on a latent layer diffusion model technology. The method uses a deep learning network to efficiently and accurately identify SAR ship targets and output the corresponding category information.
The application is realized by the following technical scheme, which provides a SAR image ship identification method based on a latent layer diffusion model technology, comprising the following steps:
step 1: after the current limited sample data is processed by an image generation module, generating a picture of a corresponding category according to text information or semantic description; dividing the enhanced data set into a training set, a verification set and a test set according to the proportion;
step 2: extracting features of an input image, selecting a T2T-ViT model as a feature extraction network, converting the input image into feature vectors after image segmentation and T2T module processing, and fusing adjacent feature information;
step 3: an SE attention mechanism module is added after the multi-head attention; the weight of each channel is calculated from the input feature map, increasing the attention paid to important features;
step 4: multi-task regression is performed on the features using the classification head, and a weight coefficient is assigned to adapt to the SAR ship identification scenario, finally obtaining the identification result.
Further, in step 1, the data set used is the OpenSARShip 2.0 ship data set, and 320 sample pictures of three kinds of ships are selected: cargo ship (Cargo), fishing vessel (Fishing) and tugboat (Tug).
Further, in step 1, for the image generation module based on the latent layer diffusion model, the image is converted from pixel space to latent space by an encoder; after noise addition and U-Net denoising, the image is converted back from latent space to pixel space by decoding. The objective of the whole process is simplified as:

L_LDM := E_{z, ε~N(0,1), t}[ ||ε − ε_θ(z_t, t)||_2^2 ]

wherein the neural backbone ε_θ(z_t, t) is a time-conditional U-Net, and z_t is generated by adding noise to the input latent feature vector z.
Further, in order that the image generation module can generate corresponding pictures on demand, the underlying U-Net backbone of the image generation module is enhanced with a cross-attention mechanism, turning it into a more flexible conditional image generator. To preprocess y from various modalities, a domain-specific encoder τ_θ is introduced, which projects y to an intermediate representation τ_θ(y) ∈ R^{M×d_τ}; this is then mapped into the intermediate layers of the U-Net by a layer implementing cross-attention:

Attention(Q, K, V) = softmax(QK^T/√d) · V,
with Q = W_Q^{(i)} · φ_i(z_t), K = W_K^{(i)} · τ_θ(y), V = W_V^{(i)} · τ_θ(y)

wherein φ_i(z_t) denotes a (flattened) intermediate representation of the U-Net implementing ε_θ, and W_Q^{(i)}, W_K^{(i)} and W_V^{(i)} are learnable projection matrices;
based on the image condition pair, the final learning condition is:
wherein τ θ And e θ And the optimization is combined through the formula.
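The denoising objective above can be illustrated with a minimal NumPy sketch. This is a toy illustration, not the patent's implementation: the function names are hypothetical, the noise-schedule value alpha_bar_t is arbitrary, and a constant array stands in for the U-Net prediction ε_θ(z_t, t).

```python
import numpy as np

rng = np.random.default_rng(0)

def noised_latent(z, eps, alpha_bar_t):
    """Forward-diffuse latent z: z_t = sqrt(a_bar_t) z + sqrt(1 - a_bar_t) eps."""
    return np.sqrt(alpha_bar_t) * z + np.sqrt(1.0 - alpha_bar_t) * eps

def ldm_loss(eps, eps_pred):
    """|| eps - eps_theta(z_t, t) ||^2 averaged over all elements."""
    return float(np.mean((eps - eps_pred) ** 2))

z = rng.standard_normal((4, 8, 8))     # toy latent feature map z
eps = rng.standard_normal(z.shape)     # Gaussian noise sample
z_t = noised_latent(z, eps, alpha_bar_t=0.5)
eps_pred = np.zeros_like(eps)          # stand-in for the U-Net output
loss = ldm_loss(eps, eps_pred)
```

Training the image generation module would amount to minimizing this loss over a real U-Net's prediction instead of the stand-in array.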
Further, in step 2, the image is divided into n image blocks by the T2T-ViT model, feature extraction and fusion are then performed on the image blocks by the T2T module to generate a certain number of tokens, and the class token and the position embedding are then combined for subsequent processing.
Further, each T2T module has two steps: re-structurization and soft split. For an input image I_i, it is converted into tokens by the soft split SS(·):

T_{i+1} = SS(I_i)

Then T_{i+1}' is generated by a Transformer encoder transformation:

T_{i+1}' = MLP(MSA(T_{i+1}))

wherein MSA is the layer-normalized multi-head self-attention operation and MLP is the layer-normalized multi-layer perceptron in a standard Transformer; these tokens are then reshaped in the spatial dimension into the image I_{i+1}:

I_{i+1} = Reshape(T_{i+1}')

wherein Reshape indicates that T_{i+1}' ∈ R^{l×c} is recombined as I_{i+1} ∈ R^{h×w×c}, where l is the length of T_{i+1}', h, w, c are the height, width and channel respectively, and l = h × w.
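The soft split and Reshape steps above can be sketched in NumPy as follows. This is an illustration under assumed kernel, stride and padding values (k = 3, s = 2, p = 1, which are not specified in the text); the function names are hypothetical, and the Transformer-encoder step between the two operations is omitted.

```python
import numpy as np

def soft_split(img, k=3, s=2, p=1):
    """Soft split SS(.): overlapping k x k patches with stride s and
    zero-padding p, each flattened into a token of length k*k*c."""
    h, w, c = img.shape
    pad = np.pad(img, ((p, p), (p, p), (0, 0)))
    rows = (h + 2 * p - k) // s + 1
    cols = (w + 2 * p - k) // s + 1
    tokens = np.stack([
        pad[i * s:i * s + k, j * s:j * s + k, :].ravel()
        for i in range(rows) for j in range(cols)
    ])
    return tokens, (rows, cols)

def reshape_tokens(tokens, hw):
    """Reshape(.): tokens of shape (l, c') back to an image (h, w, c'), l = h*w."""
    h, w = hw
    return tokens.reshape(h, w, -1)

img = np.arange(8 * 8 * 1, dtype=float).reshape(8, 8, 1)
tok, hw = soft_split(img)          # overlapping patches become tokens
new_img = reshape_tokens(tok, hw)  # tokens recombined in the spatial dimension
```

Because the patches overlap, each output "pixel" aggregates information from neighboring patches, which is what enables the fusion of adjacent feature information described above.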
Further, in step 3, the SE attention mechanism module is divided into two parts, squeeze and excitation, and channel weights are reconstructed by modeling the relationship between channels;
the squeeze uses global average pooling F_sq(·) to generate channel-level information, with the purpose of compressing the global spatial information into a channel descriptor vector Z ∈ R^C. The channel descriptor vector Z is considered a set of local features in which each element represents the global feature of one channel of U. Formally, Z = [z_1, z_2, …, z_C] is generated by compressing the feature map U = [u_1, u_2, …, u_C]; the c-th element of Z is computed over the spatial dimensions H × W of U as:

z_c = F_sq(u_c) = (1/(H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j)
after compressing the information, implementing excitation to fully capture the relationship between the channels of U, each element of the descriptor vector Z representing a global feature of the corresponding channel of U; thus, two fully connected layers are established, which are regarded as mapping functions F ex Parameterizing the nonlinear relationship of each element of Z, and then activating the parameters by an S-shaped activation function to obtain channel weights at the U-pixel level; the excitation equation is expressed as:
S=F ex (Z,W)=σ(g(Z,V))=σ(V 2 δ(V 1 Z))
wherein σ is a sigmoid function; delta is a reforming linear unit ReLU activation function; v (V) 1 ∈R C/R×C And V 2 ∈R C/R×C A weight matrix representing the full connection layer; C/R is the dimension-reducing gravity of the full-connection layer; s epsilon R C With a value falling between 0 and 1, representing the model's interest in each channel of the profile U;
the final output obtained in the SE attention mechanism module by activating S rescaled U is:
x c ′=F scale (u c ,s c )=u c s c
wherein X' = [ X ] 1 ′,x 2 ′,…,x c ′],F scale (u c ,s c ) Index quantity s c And a characteristic diagram u c ∈R H×W Channel multiplication between; the output X' of the SE attention mechanism module is the product of readjusting the channel weights on U.
Further, in step 4, a cross entropy loss function and label smoothing are used to prevent over-fitting; the cross entropy loss function is:

L = − Σ_i y_i log(x_i)

wherein x_i is the model output after softmax, and y_i indicates whether class i is the corresponding category label:

y_i = 1 if i is the target class, and y_i = 0 otherwise

Then a label smoothing method is applied, changing the probability distribution to:

y_i = 1 − ε if i is the target class, and y_i = ε/(N − 1) otherwise

where ε is a small constant and N is the number of classes.
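The cross entropy loss with label smoothing can be sketched as follows; a small illustration under the assumption stated above that the smoothed distribution places 1 − ε on the true class and ε/(N − 1) on each other class. Function names and the example logits are hypothetical.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def smoothed_targets(n_classes, target, eps=0.1):
    """Label smoothing: 1 - eps on the true class, eps/(N - 1) elsewhere."""
    y = np.full(n_classes, eps / (n_classes - 1))
    y[target] = 1.0 - eps
    return y

def cross_entropy(x, y):
    """L = -sum_i y_i * log(x_i), where x is the softmax output."""
    return float(-np.sum(y * np.log(x + 1e-12)))

x = softmax(np.array([2.0, 0.5, -1.0]))           # toy model output
y_hard = smoothed_targets(3, target=0, eps=0.0)   # one-hot target
y_soft = smoothed_targets(3, target=0, eps=0.1)   # smoothed target
```

With smoothing, the optimal softmax outputs are no longer exactly 1 and 0, so the model is penalized for becoming over-confident on noisy, highly similar SAR ship samples.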
The application provides electronic equipment, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the SAR image ship recognition method based on the latent layer diffusion model technology when executing the computer program.
The application provides a computer readable storage medium for storing computer instructions which when executed by a processor realize the steps of the SAR image ship identification method based on the latent layer diffusion model technology.
Compared with the prior art, the application has the beneficial effects that:
the application provides a SAR image ship recognition method based on a latent layer diffusion model technology, which is used for SAR ship recognition tasks. Efficient and accurate SAR image ship identification is realized.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a SAR image ship identification method based on a latent layer diffusion model technology.
Fig. 2 is a network architecture diagram. Wherein, (a) is an overall network structure diagram, (b) is a T2T module, and (c) is an SE attention mechanism module.
Fig. 3 is a schematic diagram of data generated by the image generation module in an embodiment, wherein (a) shows the Cargo remote sensing image generation result, (b) shows the Fishing remote sensing image generation result, and (c) shows the Tug remote sensing image generation result.
Fig. 4 is a schematic diagram of a SAR ship identification result in an embodiment.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The application provides an improved T2T-ViT model based on a latent layer diffusion model for a SAR image ship identification method. To solve the problem of insufficient SAR ship sample data, an image generation module following the idea of the latent layer diffusion model is added to the T2T-ViT model; it can be trained on the current limited SAR ship sample data and then generate SAR ship pictures of the corresponding types from input text information or semantic descriptions, compensating for the lack of training samples. Meanwhile, an SE attention mechanism (Squeeze-and-Excitation) is added to the T2T-ViT model: the weight of each channel is calculated from the input feature map, and the weighting makes the network pay more attention to the features that contribute most to tasks such as classification, improving the performance of the model. At the same time, thanks to global pooling, the SE attention mechanism can introduce more information without adding too many parameters, thus avoiding over-fitting.
With reference to fig. 1-4, the application provides a method for identifying a ship from an SAR image based on a latent layer diffusion model technology, which comprises the following steps:
step 1: and processing the current limited sample data by an image generation module, and generating pictures of corresponding categories according to the text information or the semantic description. The enhanced data set is proportionally divided into a training set, a verification set and a test set.
Step 2: and carrying out feature extraction on the input image, selecting a T2T-ViT model as a feature extraction network, converting the input image into feature vectors after image segmentation and T2T module processing, and fusing adjacent feature information.
Step 3: the traditional Transformer Layer is improved, an SE attention mechanism module is added after the multi-head attention, the weight of each channel is calculated according to the input feature map, and the attention degree of important features is increased.
Step 4: and (3) performing multi-task regression on the characteristics by using the classification head, and giving a weight coefficient to adapt to the scene of SAR ship identification, so as to finally obtain an identification result.
Optionally, the data set used in step 1 is the OpenSARShip 2.0 ship data set; considering the problems of class imbalance and excessive invalid samples, the data set is processed by the image generation module to expand it. The data set is then divided into a training set, a verification set and a test set. Finally, training parameters are set.
Optionally, in step 2, the image is divided into n image blocks through a T2T-ViT model, and then feature extraction and fusion are performed on the image blocks through a T2T module, so as to generate a certain number of tokens. And then combines the class token with the position code position embedding for subsequent processing.
Optionally, step 3 processes the acquired tokens through the Transformer Layer, adds an SE attention mechanism module after the multi-head attention, and makes the network pay more attention, in a weighted manner, to the features that contribute most to tasks such as recognition.
Optionally, step 4 uses a cross entropy loss function for the identification task. On this basis, label smoothing is applied to prevent over-fitting.
Examples
The application aims to solve the problem of SAR image ship identification, using a deep learning network to efficiently and accurately identify SAR ship targets and output the corresponding category information. To achieve this, the embodiment of the application provides a SAR image ship identification method based on a latent layer diffusion model technology, whose basic flow is shown in Fig. 1, comprising the following steps:
step 1: and processing the current limited sample data by an image generation module, and generating pictures of corresponding categories according to the text information or the semantic description. The enhanced data set is proportionally divided into a training set, a verification set and a test set.
Step 2: and carrying out feature extraction on the input image, selecting a T2T-ViT model as a feature extraction network, converting the input image into feature vectors after image segmentation and T2T module processing, and fusing adjacent feature information.
Step 3: the traditional Transformer Layer is improved, an SE attention mechanism module is added after the multi-head attention, the weight of each channel is calculated according to the input feature map, and the attention degree of important features is increased.
Step 4: and (3) performing multi-task regression on the characteristics by using the classification head, and giving a weight coefficient to adapt to the scene of SAR ship identification, so as to finally obtain an identification result.
The data set used in step 1 is the OpenSARShip 2.0 ship data set; 320 sample pictures of three kinds of ships are selected: cargo ship (Cargo), fishing vessel (Fishing) and tugboat (Tug). After the existing samples are processed by the image generation module, corresponding text information is input, such as SARCargo, SARFishing, SARTanker, to generate sample data, and each ship class is expanded by 400 pictures. Next, the data set is divided: the training set accounts for 80% of the total number of images and the test set for 20% (both randomly generated), and part of the training set is randomly selected as the verification set. During training, the input image size is fixed at 224×224. The training batch size is 8 and the number of training iterations is 500.
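The data split described above can be sketched as follows; a hypothetical helper assuming a random 80/20 train/test split of sample indices, with a 10% slice of the training indices held out as the verification set (the text does not specify the verification fraction).

```python
import numpy as np

def split_dataset(n, train_frac=0.8, seed=0):
    """Randomly split n sample indices into train/validation/test sets.
    The validation set is a slice of the training indices (assumed 10%)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(n * train_frac)
    train, test = idx[:n_train], idx[n_train:]
    val = train[: n_train // 10]   # assumption: 10% of train held out
    return train, val, test

# e.g. 320 original + 400 generated pictures per class, 3 classes
tr, va, te = split_dataset((320 + 400) * 3)
```

Seeding the generator keeps the split reproducible between training runs.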
For the image generation module based on the latent layer diffusion model, an image is converted from pixel space to latent space by an encoder; after noise addition and U-Net denoising, it is converted back from latent space to pixel space by decoding. The objective of the whole process can be simplified as:

L_LDM := E_{z, ε~N(0,1), t}[ ||ε − ε_θ(z_t, t)||_2^2 ]

wherein the neural backbone ε_θ(z_t, t) is a time-conditional U-Net, and z_t is generated by adding noise to the input latent feature vector z. For the image generation module to generate corresponding pictures on demand, the underlying U-Net backbone is enhanced with a cross-attention mechanism, turning it into a more flexible conditional image generator; to preprocess y from various modalities (such as text information), a domain-specific encoder τ_θ is introduced, which projects y to an intermediate representation τ_θ(y) ∈ R^{M×d_τ}, which is then mapped into the intermediate layers of the U-Net by a layer implementing cross-attention:

Attention(Q, K, V) = softmax(QK^T/√d) · V,
with Q = W_Q^{(i)} · φ_i(z_t), K = W_K^{(i)} · τ_θ(y), V = W_V^{(i)} · τ_θ(y)

wherein φ_i(z_t) denotes a (flattened) intermediate representation of the U-Net implementing ε_θ, and W_Q^{(i)}, W_K^{(i)} and W_V^{(i)} are learnable projection matrices.
Based on image-condition pairs, the final conditional learning objective is:

L_LDM := E_{z, y, ε~N(0,1), t}[ ||ε − ε_θ(z_t, t, τ_θ(y))||_2^2 ]

wherein τ_θ and ε_θ are jointly optimized through this formula.
Step 2 divides the original image into image blocks via the T2T-ViT model and extracts image features through the T2T module.
Each T2T module has two steps: re-structurization and soft split. For an input image I_i, it is converted into tokens by the soft split SS(·):

T_{i+1} = SS(I_i)

Then T_{i+1}' is generated by a classical Transformer encoder transformation:

T_{i+1}' = MLP(MSA(T_{i+1}))

wherein MSA is the layer-normalized multi-head self-attention operation and MLP is the layer-normalized multi-layer perceptron of a standard Transformer. These tokens are then reshaped in the spatial dimension into the image I_{i+1}:

I_{i+1} = Reshape(T_{i+1}')

wherein Reshape indicates that T_{i+1}' ∈ R^{l×c} is recombined as I_{i+1} ∈ R^{h×w×c}, where l is the length of T_{i+1}', h, w, c are the height, width and channel respectively, and l = h × w.
Step 3 improves on the traditional Transformer Layer by adding an SE attention mechanism module after the multi-head attention. The SE attention mechanism module is divided into two parts, squeeze and excitation, and channel weights are reconstructed by modeling the relationship between channels, so that the features of task-relevant channel regions become more pronounced.
The squeeze uses global average pooling F_sq(·) to generate channel-level information, with the purpose of compressing the global spatial information into a channel descriptor vector Z ∈ R^C. The channel descriptor vector Z is considered a set of local features in which each element represents the global feature of one channel of U. Formally, Z = [z_1, z_2, …, z_C] is generated by compressing the feature map U = [u_1, u_2, …, u_C]; the c-th element of Z is computed over the spatial dimensions H × W of U as:

z_c = F_sq(u_c) = (1/(H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j)

After compressing the information, excitation is implemented to fully capture the relationship between the channels of U. Each element of the descriptor vector Z represents the global feature of the corresponding channel of U. Thus, two fully connected layers are established, regarded as a mapping function F_ex(·) that parameterizes the nonlinear relationship between the elements of Z; the result is then passed through a sigmoid activation function to obtain the channel weights of U. The excitation is expressed as:

S = F_ex(Z, V) = σ(g(Z, V)) = σ(V_2 δ(V_1 Z))

wherein σ is the sigmoid function; δ is the rectified linear unit (ReLU) activation function; V_1 ∈ R^{C/r×C} and V_2 ∈ R^{C×C/r} are the weight matrices of the fully connected layers; C/r is the reduced dimension of the first fully connected layer, and the reduction ratio r is recommended to be 16. Thus S ∈ R^C, with values between 0 and 1, represents the model's attention to each channel of the feature map U.
The final output of the SE attention mechanism module, obtained by rescaling U with the activations S, is:

x_c' = F_scale(u_c, s_c) = u_c · s_c

wherein X' = [x_1', x_2', …, x_C'], and F_scale(u_c, s_c) denotes channel-wise multiplication between the scalar s_c and the feature map u_c ∈ R^{H×W}. The output X' of the SE attention mechanism module is thus U with its channel weights readjusted. In the task learning process, the weights of task-relevant channels are increased, improving the expressive capability of the features.
Step 4 employs a cross entropy loss function and label smoothing to prevent over-fitting.
The cross entropy loss function is:

L = − Σ_i y_i log(x_i)

wherein x_i is the model output after softmax, and y_i indicates whether class i is the corresponding category label:

y_i = 1 if i is the target class, and y_i = 0 otherwise

This treatment ignores the relationship between the real label and the other labels, which easily affects the model when handling classification and identification on SAR ship data sets with high sample similarity and high data noise. A label smoothing method is therefore applied, changing the probability distribution to:

y_i = 1 − ε if i is the target class, and y_i = ε/(N − 1) otherwise

where ε is a small constant and N is the number of classes; this makes the optimal probability targets in the softmax loss no longer exactly 1 and 0, which avoids over-fitting to some extent and also mitigates the effect of false labels.
The application provides a SAR image ship recognition method based on a latent layer diffusion model technology for SAR ship recognition tasks, realizing efficient and accurate SAR image ship identification.
The application provides electronic equipment, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the SAR image ship recognition method based on the latent layer diffusion model technology when executing the computer program.
The application provides a computer readable storage medium for storing computer instructions which when executed by a processor realize the steps of the SAR image ship identification method based on the latent layer diffusion model technology.
The memory in embodiments of the present application may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an electrically Erasable EPROM (EEPROM), or a flash memory. The volatile memory may be random access memory (random access memory, RAM) which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous DRAM (SLDRAM), and direct memory bus RAM (DRRAM). It should be noted that the memory of the methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital subscriber line (digital subscriber line, DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a high-density digital video disc (digital video disc, DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), or the like.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software. The steps of a method disclosed in connection with the embodiments of the present application may be embodied directly in a hardware processor for execution, or in a combination of hardware and software modules in the processor for execution. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in a memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method. To avoid repetition, a detailed description is not provided herein.
It should be noted that the processor in the embodiments of the present application may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method embodiments may be implemented by integrated logic circuits of hardware in a processor or instructions in software form. The processor may be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. The disclosed methods, steps, and logic blocks in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in a memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method.
The SAR image ship identification method based on the latent layer diffusion model technology is described in detail, and specific examples are applied to illustrate the principle and the implementation mode of the method, and the description of the examples is only used for helping to understand the method and the core idea of the method; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope in accordance with the ideas of the present application, the present description should not be construed as limiting the present application in view of the above.

Claims (10)

1. A SAR image ship identification method based on a latent layer diffusion model technology, characterized by comprising the following steps:
step 1: after the current limited sample data is processed by an image generation module, generating a picture of a corresponding category according to text information or semantic description; dividing the enhanced data set into a training set, a verification set and a test set according to the proportion;
step 2: extracting features of an input image, selecting a T2T-ViT model as a feature extraction network, converting the input image into feature vectors after image segmentation and T2T module processing, and fusing adjacent feature information;
step 3: an SE attention mechanism module is added after the multi-head attention; the weight of each channel is calculated from the input feature map, increasing the attention paid to important features;
step 4: multi-task regression is performed on the features using the classification head, and a weight coefficient is assigned to adapt to the SAR ship identification scenario, finally obtaining the identification result.
2. The method according to claim 1, characterized in that: in step 1, the data set used is the OpenSARShip 2.0 ship data set, and 320 sample pictures of three kinds of ships are selected: cargo ship (Cargo), fishing vessel (Fishing) and tugboat (Tug).
3. The method according to claim 1, characterized in that: in step 1, for the image generation module based on the latent layer diffusion model, the image is converted from pixel space to latent space by an encoder; after noise addition and U-Net denoising, the image is converted back from latent space to pixel space by decoding, and the objective of the whole flow is simplified as:

L_LDM := E_{z, ε~N(0,1), t}[ ||ε − ε_θ(z_t, t)||_2^2 ]

wherein the neural backbone ε_θ(z_t, t) is a time-conditional U-Net, and z_t is generated by adding noise to the input latent feature vector z.
4. A method according to claim 3, characterized in that: in order that the image generation module can generate corresponding pictures according to requirements, the U-Net backbone of the foundation of the image generation module is enhanced by using a cross attention mechanism, the U-Net backbone is converted into a more flexible conditional image generator, and in order to preprocess y from various modes, an encoder tau in a specific field is introduced θ The encoder will project to the intermediate representationIt is then mapped to the middle layer of the U-Net by implementing a cross-attention layer of attention:
wherein,representation e θ Andis represented in the middle of U-Net; />And->Is a learnable projection matrix;
based on the image condition pair, the final learning condition is:
wherein τ θ And e θ And the optimization is combined through the formula.
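The forward noising and denoising objective described in claims 3 and 4 can be sketched numerically. The following is a minimal illustration under stated assumptions, not the patent's implementation: the toy latent size, the `alpha_bar_t` value, and the use of a "perfect" denoiser in place of the U-Net are all assumptions chosen for demonstration.

```python
import numpy as np

def add_noise(z, eps, alpha_bar_t):
    """Forward diffusion step: z_t = sqrt(abar_t)*z + sqrt(1-abar_t)*eps."""
    return np.sqrt(alpha_bar_t) * z + np.sqrt(1.0 - alpha_bar_t) * eps

def ldm_loss(eps, eps_pred):
    """L = E[||eps - eps_theta(z_t, t)||^2], approximated by a batch mean."""
    return float(np.mean((eps - eps_pred) ** 2))

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 16))    # toy latent vectors from the encoder
eps = rng.normal(size=z.shape)  # Gaussian noise
z_t = add_noise(z, eps, alpha_bar_t=0.5)

# A denoiser that recovers the true noise exactly drives the loss to zero;
# training pushes the real U-Net toward this behaviour.
assert ldm_loss(eps, eps) == 0.0
assert z_t.shape == (4, 16)
```

In the conditional variant of claim 4, the denoiser would additionally receive τ_θ(y) through cross-attention; the loss term itself is unchanged.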
5. The method according to claim 4, characterized in that: in step 2, the image is divided into n image blocks by the T2T-ViT model; the T2T module performs feature extraction and fusion on the image blocks to generate a number of tokens, which are combined with a class token and position embeddings for subsequent processing.
6. The method according to claim 5, characterized in that: each T2T module has two steps, re-structurization and soft split; an input image I_i is converted into tokens by soft split:
T_{i+1} = SS(I_i)
and the tokens are then transformed by a Transformer encoder to produce T_{i+1}′:
T_{i+1}′ = MLP(MSA(T_{i+1}))
wherein MSA is a multi-head self-attention operation with layer normalization and MLP is the layer-normalized multi-layer perceptron of a standard Transformer; the tokens are then reshaped along the spatial dimension into the image I_{i+1}:
I_{i+1} = Reshape(T_{i+1}′)
wherein Reshape denotes recombining T_{i+1}′ ∈ R^{l×c} into I_{i+1} ∈ R^{h×w×c}, where l is the length of T_{i+1}′, h, w, c are the height, width and channel number respectively, and l = h×w.
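The soft split and reshape operations of claim 6 can be sketched as follows. This is a simplified sketch, not the patent's implementation: the kernel size, stride, padding, and toy image size are assumptions, and the Transformer encoder between the two operations is omitted.

```python
import numpy as np

def soft_split(img, k=3, s=2, p=1):
    """SS(I): extract overlapping k*k patches with stride s and padding p,
    flattening each patch into one token (a simplified unfold)."""
    h, w, c = img.shape
    img = np.pad(img, ((p, p), (p, p), (0, 0)))
    out_h = (h + 2 * p - k) // s + 1
    out_w = (w + 2 * p - k) // s + 1
    tokens = np.empty((out_h * out_w, k * k * c))
    idx = 0
    for i in range(out_h):
        for j in range(out_w):
            tokens[idx] = img[i*s:i*s+k, j*s:j*s+k, :].ravel()
            idx += 1
    return tokens, (out_h, out_w)

def reshape_tokens(tokens, h, w):
    """Reshape(T'): recombine l x c tokens into an h x w x c image, l = h*w."""
    l, c = tokens.shape
    assert l == h * w
    return tokens.reshape(h, w, c)

img = np.arange(8 * 8 * 1, dtype=float).reshape(8, 8, 1)
tokens, (h, w) = soft_split(img)        # 4x4 grid of 9-dim tokens
img_next = reshape_tokens(tokens, h, w) # back to a spatial layout
assert tokens.shape == (16, 9) and img_next.shape == (4, 4, 9)
```

Because neighbouring patches overlap, each token already mixes information from adjacent regions, which is the "fusion of adjacent feature information" the claims describe.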
7. The method according to claim 6, characterized in that: in step 3, the SE attention mechanism module consists of two parts, squeeze and excitation, and reconstructs channel weights by modeling the relationships between channels;
the squeeze step uses global average pooling F_sq(·) to generate channel-level information, compressing the global spatial information into a channel descriptor vector Z ∈ R^C; the channel descriptor vector Z is regarded as a set of local features in which each element represents the global feature of one channel of U; formally, Z = [z_1, z_2, …, z_C] is generated by compressing the feature map U = [u_1, u_2, …, u_C]; the c-th element of Z is computed over the spatial dimensions H × W of U as follows:
z_c = F_sq(u_c) = (1/(H·W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j)
after compressing the information, the excitation step is implemented to fully capture the relationships between the channels of U, each element of the descriptor vector Z representing the global feature of the corresponding channel of U; to this end, two fully connected layers are established and regarded as a mapping function F_ex that parameterizes the nonlinear relationships among the elements of Z; the result is then activated by a sigmoid activation function to obtain channel weights for U; the excitation equation is expressed as:
S = F_ex(Z, W) = σ(g(Z, V)) = σ(V_2 δ(V_1 Z))
wherein σ is the sigmoid function; δ is the rectified linear unit (ReLU) activation function; V_1 ∈ R^{C/R×C} and V_2 ∈ R^{C×C/R} are the weight matrices of the fully connected layers; C/R is the reduced dimension of the first fully connected layer, R being the reduction ratio; S ∈ R^C takes values between 0 and 1 and represents the model's attention to each channel of the feature map U;
the final output of the SE attention mechanism module, obtained by rescaling U with the activations S, is:
x_c′ = F_scale(u_c, s_c) = s_c·u_c
wherein X′ = [x_1′, x_2′, …, x_C′], and F_scale(u_c, s_c) denotes channel-wise multiplication between the scalar s_c and the feature map u_c ∈ R^{H×W}; the output X′ of the SE attention mechanism module is U with its channel weights readjusted.
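The squeeze, excitation and rescale steps of claim 7 can be sketched as a small numpy function. This is a minimal sketch under assumptions: random weight matrices stand in for the trained fully connected layers, and the toy shapes (H, W, C, R) are chosen only for demonstration.

```python
import numpy as np

def se_block(U, V1, V2):
    """Squeeze-and-Excitation on a feature map U of shape (H, W, C)."""
    # Squeeze: global average pooling -> channel descriptor z in R^C
    z = U.mean(axis=(0, 1))
    # Excitation: s = sigmoid(V2 @ relu(V1 @ z)) -> channel weights in (0, 1)
    relu = lambda a: np.maximum(a, 0.0)
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    s = sigmoid(V2 @ relu(V1 @ z))
    # Scale: channel-wise multiplication x'_c = s_c * u_c
    return U * s[None, None, :], s

rng = np.random.default_rng(0)
H, W, C, R = 4, 4, 8, 2
V1 = rng.normal(size=(C // R, C))  # reduction layer, shape C/R x C
V2 = rng.normal(size=(C, C // R))  # expansion layer, shape C x C/R
U = rng.normal(size=(H, W, C))
X, s = se_block(U, V1, V2)
assert X.shape == U.shape
assert np.all((s > 0) & (s < 1))   # sigmoid keeps weights in (0, 1)
```

Note the asymmetric weight shapes: V1 reduces C to C/R and V2 expands back to C, which is what makes the bottleneck a cheap model of inter-channel dependencies.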
8. The method according to claim 7, characterized in that: in step 4, a cross-entropy loss function and label smoothing are used to prevent the over-fitting problem;
the cross-entropy loss function is:
L = − Σ_{i=1}^{K} y_i log(x_i)
wherein x_i is the i-th output of the model after softmax, and y_i indicates whether i is the corresponding category label:
y_i = 1 if i equals the target class, and y_i = 0 otherwise;
the label smoothing method then changes this probability distribution to:
y_i = 1 − ε if i equals the target class, and y_i = ε/(K − 1) otherwise
where ε is a small constant and K is the number of classes.
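The loss of claim 8 can be sketched directly from the two formulas above. This is an illustrative sketch, not the patent's training code: the three-class probability vector and the value ε = 0.1 are assumptions chosen for the example.

```python
import numpy as np

def smoothed_targets(target, num_classes, eps=0.1):
    """Label smoothing: 1-eps on the true class, eps/(K-1) elsewhere."""
    y = np.full(num_classes, eps / (num_classes - 1))
    y[target] = 1.0 - eps
    return y

def cross_entropy(probs, y):
    """L = -sum_i y_i * log(x_i), with x_i the softmax outputs."""
    return float(-np.sum(y * np.log(probs)))

probs = np.array([0.7, 0.2, 0.1])         # softmax output for K = 3 classes
y_hard = smoothed_targets(0, 3, eps=0.0)  # reduces to a one-hot target
y_soft = smoothed_targets(0, 3, eps=0.1)  # smoothed target distribution

assert abs(y_soft.sum() - 1.0) < 1e-12    # smoothing keeps a valid distribution
assert abs(cross_entropy(probs, y_hard) + np.log(0.7)) < 1e-12
```

Relative to the one-hot target, the smoothed target assigns small probability to the wrong classes, which penalizes over-confident predictions and reduces over-fitting on the small SAR sample set.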
9. An electronic device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1-8.
10. A computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the method of any one of claims 1-8.
CN202311065024.0A 2023-08-23 2023-08-23 SAR image ship identification method based on latent layer diffusion model technology Pending CN117173562A (en)

Publications (1)

Publication Number Publication Date
CN117173562A true CN117173562A (en) 2023-12-05




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination