CN118037873A - Infrared target image generation method, device, equipment and storage medium - Google Patents


Info

Publication number
CN118037873A
Authority
CN
China
Prior art keywords
model
image
infrared
downsampling layer
visible light
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410088720.1A
Other languages
Chinese (zh)
Inventor
李冰
常燕
赵久奋
管冬冬
朱宏伟
杨奇松
武健
李雪瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rocket Force University of Engineering of PLA
Original Assignee
Rocket Force University of Engineering of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rocket Force University of Engineering of PLA
Priority to CN202410088720.1A
Publication of CN118037873A
Legal status: Pending


Abstract

Embodiments of the present disclosure provide a method, apparatus, device, and storage medium for generating an infrared target image based on an attention mechanism under a generative adversarial network, including: acquiring a training sample set, wherein the training sample set comprises visible light images and infrared real images corresponding to the visible light images; constructing a generation model and a discrimination model, wherein the generation model comprises an encoding-decoding network consisting of an encoding part, an intermediate part and a decoding part, the encoding part comprises a ResNet network and the self-attention model of a non-local neural network, the intermediate part comprises a hole convolution layer, the decoding part adopts a tanh activation function to form the output layer, and the discrimination model comprises a Markov discriminator; and training the generation model and the discrimination model based on the training sample set to obtain a target generation model, so that the generated infrared target image can truly reflect the image characteristics.

Description

Infrared target image generation method, device, equipment and storage medium
Technical Field
Embodiments of the present disclosure relate to the technical field of image processing, and in particular to a method, apparatus, device, and storage medium for generating an infrared target image based on an attention mechanism under a generative adversarial network.
Background
Infrared image simulation is a technology for rendering infrared images of targets and background scenes. It can provide reliable and varied basic data for tasks such as infrared imaging guidance, infrared target detection, recognition and evaluation, and target tracking, and is widely applied in fields such as aviation, aerospace and navigation.
Currently, mainstream infrared image simulation technologies can be divided into two main categories: image simulation based on infrared characteristic modeling and infrared image simulation based on deep learning. The former establishes a mathematical model of the infrared characteristics of the scene to be presented and displays the infrared simulation effect of the scene through computer simulation; the latter trains a deep learning algorithm on a large number of visible light scenes and their infrared characteristics to obtain a mapping model between visible light and infrared images of the scene, then generates the infrared characteristics of a visible light scene from the model and displays them as an image. In the prior art, infrared simulation based on infrared characteristic modeling can generate good target infrared texture, but its degree of automation is low and it has shortcomings in generating the infrared characteristics of a target scene under interference such as smoke and shadow and in processing data in batches, while the infrared images converted by traditional deep learning algorithms are not realistic.
In view of the problems existing in the prior art, how to use a deep learning network to convert between visible light images and infrared images is a technical problem to be urgently solved by those skilled in the art.
Disclosure of Invention
Embodiments described herein provide a method, apparatus, device, and storage medium for generating an infrared target image based on an attention mechanism under a generative adversarial network, addressing the problems of the prior art.
In a first aspect, according to the present disclosure, there is provided a method of generating an infrared target image based on an attention mechanism under a generative adversarial network, comprising:
Acquiring a training sample set, wherein the training sample set comprises visible light images and infrared real images corresponding to the visible light images;
Constructing a generation model and a discrimination model, wherein the generation model comprises an encoding-decoding network, the encoding-decoding network comprises an encoding part, an intermediate part and a decoding part, the encoding part comprises a ResNet network and the self-attention model of a non-local neural network, the ResNet network comprises a first downsampling layer, a second downsampling layer, a third downsampling layer, a fourth downsampling layer and a fifth downsampling layer, the self-attention model of the non-local neural network is arranged between the first Res-block and the second Res-block in each of the second downsampling layer, the third downsampling layer, the fourth downsampling layer and the fifth downsampling layer, the second Res-block in each of these layers receives the feature vectors of the first Res-block after they have been processed by the self-attention model of the non-local neural network, the intermediate part comprises a hole convolution layer, the decoding part adopts a tanh activation function to form an output layer, and the discrimination model comprises a Markov discriminator;
and training the generation model and the discrimination model based on the training sample set to obtain a target generation model.
In some embodiments of the disclosure, training the generation model and the discrimination model based on the training sample set to obtain a target generation model includes:
Inputting the visible light image and an infrared real image corresponding to the visible light image into the generation model to generate an infrared generation image;
Inputting the visible light image, the infrared real image corresponding to the visible light image and the infrared generated image into the discrimination model to obtain a discrimination result;
And determining a loss value of the loss function according to the discrimination result, and correcting parameters of the generation model and the discrimination model based on the loss value to obtain a target generation model.
In some embodiments of the disclosure, the inputting the visible light image, the infrared real image corresponding to the visible light image, and the infrared generation image into the discrimination model to obtain a discrimination result includes:
inputting the visible light image and the infrared generated image into the discrimination model to obtain a first discrimination result;
and inputting the visible light image and the infrared real image corresponding to the visible light image into the discrimination model to obtain a second discrimination result.
In some embodiments of the disclosure, the determining a loss value of a loss function according to the discrimination result, and correcting parameters of the generation model and the discrimination model based on the loss value to obtain a target generation model includes:
determining a loss value of a loss function according to the first discrimination result and the second discrimination result;
and correcting parameters of the generation model and the discrimination model according to the loss value, wherein when the loss value of the loss function meets a preset threshold, the corresponding generation model is the target generation model.
In some embodiments of the present disclosure, the size of the input feature image to the encoding section is 256×256, and the size of the output feature image of the encoding section is 8×8.
In some embodiments of the present disclosure, the middle portion is a hole convolution layer with hole coefficients of 0, 1, and 2.
In some embodiments of the present disclosure, the decoding portion includes 4 upsampling layers, and the decoding portion upsamples based on a transpose convolution.
In a second aspect, according to the present disclosure, there is provided an apparatus for generating an infrared target image based on an attention mechanism under a generative adversarial network, comprising:
the training sample set acquisition module is used for acquiring a training sample set, wherein the training sample set comprises visible light images and infrared real images corresponding to the visible light images;
A model building module, configured to build a generation model and a discrimination model, where the generation model includes an encoding-decoding network, the encoding-decoding network includes an encoding portion, an intermediate portion and a decoding portion, the encoding portion includes a ResNet network and the self-attention model of a non-local neural network, the ResNet network includes a first downsampling layer, a second downsampling layer, a third downsampling layer, a fourth downsampling layer and a fifth downsampling layer, the self-attention model of the non-local neural network is set between the first Res-block and the second Res-block in each of the second downsampling layer, the third downsampling layer, the fourth downsampling layer and the fifth downsampling layer, so that the second Res-block in each of these layers receives the feature vectors of the first Res-block after processing by the self-attention model of the non-local neural network, the intermediate portion includes a hole convolution layer, the decoding portion adopts a tanh activation function to form an output layer, and the discrimination model includes a Markov discriminator;
And the target model generating module is used for training the generation model and the discrimination model based on the training sample set to obtain a target generation model.
In a third aspect, according to the present disclosure, there is provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method as in any one of the embodiments above when the computer program is executed.
In a fourth aspect, according to the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a method as in any of the above embodiments.
The method, apparatus, device, and storage medium for generating an infrared target image based on an attention mechanism under a generative adversarial network provided by the embodiments of the disclosure first acquire a training sample set, wherein the training sample set comprises visible light images and infrared real images corresponding to the visible light images; then construct a generation model and a discrimination model, wherein the generation model comprises an encoding-decoding network consisting of an encoding part, an intermediate part and a decoding part, the encoding part comprises a ResNet network and the self-attention model of a non-local neural network, the intermediate part comprises a hole convolution layer, the decoding part adopts a tanh activation function to form the output layer, and the discrimination model comprises a Markov discriminator; finally, the generation model and the discrimination model are trained based on the training sample set to obtain a target generation model. That is, in the infrared target image generation method of the present application, the generation model and the discrimination model are trained using the visible light images and the infrared real images corresponding to the visible light images, so that the generated infrared target image can truly reflect the image characteristics. In addition, by adding the self-attention model of the non-local neural network in the encoding part of the generation model, arranged between the first Res-block and the second Res-block in each of the second, third, fourth and fifth downsampling layers, the generation model can better establish the dependency between the textures of the visible light image and of the corresponding infrared real image, improving the gray-scale information and texture fidelity of the infrared generated image.
The foregoing description is only an overview of the technical solutions of the embodiments of the present application. So that the technical means of the embodiments of the present application can be more clearly understood and implemented according to the content of the specification, specific embodiments of the present application are set forth below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the following brief description of the drawings of the embodiments will be given, it being understood that the drawings described below relate only to some embodiments of the present disclosure, not to limitations of the present disclosure, in which:
Fig. 1 is a schematic flow chart of an infrared target image generation method according to an embodiment of the disclosure;
FIG. 2 is a schematic diagram of a network architecture for generating a model according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a self-attention model of a non-local neural network provided by an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a structure of a middle portion in a generative model provided by an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a generation model and a discrimination model provided by an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of a discriminant model provided by an embodiment of the present disclosure;
Fig. 7 is a schematic structural diagram of an infrared target image generating apparatus according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure.
In the drawings, like elements are denoted by reference numerals whose last two digits are identical. It is noted that the elements in the drawings are schematic and are not drawn to scale.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings. It will be apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments, which can be made by those skilled in the art based on the described embodiments of the present disclosure without the need for creative efforts, are also within the scope of the protection of the present disclosure.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the presently disclosed subject matter belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the specification and relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein. As used herein, a statement that two or more parts are "connected" or "coupled" together shall mean that the parts are joined together either directly or joined through one or more intermediate parts.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of the phrase "an embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: there are three cases, a, B, a and B simultaneously. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.
Furthermore, in all embodiments of the present disclosure, terms such as "first" and "second" are used merely to distinguish one component (or portion of a component) from another component (or another portion of a component).
In the description of the present application, unless otherwise indicated, the meaning of "plurality" means two or more (including two), and similarly, "plural sets" means two or more (including two).
In order to make the person skilled in the art better understand the solution of the present application, the technical solution of the embodiment of the present application will be clearly and completely described below with reference to the accompanying drawings.
Based on the problems existing in the prior art, Fig. 1 is a schematic flow chart of a method for generating an infrared target image based on an attention mechanism under a generative adversarial network according to an embodiment of the present disclosure. As shown in Fig. 1, the specific process of generating an infrared target image based on an attention mechanism under a generative adversarial network includes:
s110, acquiring a training sample set.
The training sample set includes visible light images and infrared real images corresponding to each visible light image.
GANs (Generative Adversarial Networks) are implemented by making two networks compete with each other. One network, called the generator network (Generator Network, G), continuously draws data from the training library to generate new samples; the other, called the discriminator network (Discriminator Network, D), judges on the basis of the relevant data whether the data provided by the generator is sufficiently authentic. The training process of GANs is similar to a game between a counterfeit banknote maker and a banknote validator: the counterfeiter improves its manufacturing level as much as possible to produce counterfeit money as close to real money as possible and thus escape inspection, while the validator continuously improves its ability to discriminate counterfeit notes and strives to judge the authenticity of the input notes. As GANs are trained, the generating capability of the generator and the discrimination capability of the discriminator both improve continuously; even so, the accuracy of infrared image conversion based on ordinary GANs remains relatively low.
To address the problem that the original GANs cannot generate images with specified attributes, Conditional Generative Adversarial Networks (CGAN) integrate condition information into both the generator and the discriminator; the condition information can be any label information, such as the facial expression of a face image or the category of an image.
In the method for generating an infrared target image provided by the embodiments of the disclosure, a target generation model is built based on a conditional generative adversarial network: the generation model and the discrimination model in the conditional generative adversarial network are improved, training sample data are input into the improved generation model and discrimination model, whether the generation model and the discrimination model have converged is judged through a defined loss function, and when they have converged, the infrared target image is generated based on the converged generation model.
Based on the characteristics of the conditional generative adversarial network, the training sample set is constructed to comprise visible light images and the infrared real images corresponding to the visible light images, i.e., the infrared real image corresponding to a visible light image is used as the condition information.
S120, constructing a generation model and a discrimination model.
The generation model comprises an encoding-decoding network, the encoding-decoding network comprises an encoding part, an intermediate part and a decoding part, the encoding part comprises a ResNet network and the self-attention model of a non-local neural network, the ResNet network comprises a first downsampling layer, a second downsampling layer, a third downsampling layer, a fourth downsampling layer and a fifth downsampling layer, the self-attention model of the non-local neural network is arranged between the first Res-block and the second Res-block in each of the second downsampling layer, the third downsampling layer, the fourth downsampling layer and the fifth downsampling layer, so that the second Res-block in each of these layers receives the feature vectors of the first Res-block after they have been processed by the self-attention model of the non-local neural network, the intermediate part comprises a hole convolution layer, the decoding part adopts a tanh activation function to form the output layer, and the discrimination model comprises a Markov discriminator.
Based on research into the prior art, the present application improves the D-LinkNet semantic segmentation network so that it receives 256×256 feature images as input. The structure of the improved D-LinkNet network (named D-LinkNet-v, i.e., the generation model of the present application) is shown in Fig. 2; D-LinkNet-v consists of three parts, namely the encoding part, the central part and the decoding part. In Fig. 2, Conv denotes convolution, Res-blocks denotes residual blocks, stride=2 denotes a stride of 2 for the operation, max-pooling denotes max pooling, dilation=n denotes a hole (dilation) rate of n for the hole convolution, Transposed-Conv denotes transposed convolution, and Skip connection denotes a skip connection.
In D-LinkNet-v, the encoding part uses a ResNet34 pretrained on the ImageNet dataset as the encoder of the network; the ResNet34 structure is shown in Table 1.
Table 1 ResNet34 structure
Combining Table 1 and Fig. 2, the ResNet34 network includes a first downsampling layer, a second downsampling layer, a third downsampling layer, a fourth downsampling layer and a fifth downsampling layer, where in Table 1, conv1 denotes the first downsampling layer (i.e., Conv(7×7, stride=2) in Fig. 2), conv2_x denotes the second downsampling layer (i.e., 3×Res-blocks in Fig. 2), conv3_x denotes the third downsampling layer (i.e., 4×Res-blocks in Fig. 2), conv4_x denotes the fourth downsampling layer (i.e., 6×Res-blocks in Fig. 2), and conv5_x denotes the fifth downsampling layer (i.e., 3×Res-blocks in Fig. 2).
In the present application, the second downsampling layer comprises 3 Res-blocks, the third downsampling layer comprises 4 Res-blocks, the fourth downsampling layer comprises 6 Res-blocks, and the fifth downsampling layer comprises 3 Res-blocks. The self-attention model of the non-local neural network is arranged between the first Res-block and the second Res-block of the second downsampling layer, between the first Res-block and the second Res-block of the third downsampling layer, between the first Res-block and the second Res-block of the fourth downsampling layer, and between the first Res-block and the second Res-block of the fifth downsampling layer, so that the second Res-block of each of these layers receives the feature vectors of the first Res-block after processing by the self-attention model. In this way the generation model can better establish the dependency between the textures of the visible light image and of the corresponding infrared real image, improving the gray-scale information and texture fidelity of the infrared generated image. One possible assembly of such an encoder is sketched below.
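For illustration only, the following PyTorch sketch shows one way such an encoder could be assembled. The use of torchvision's ResNet34, the function name build_encoder and insertion of the attention block by list position are assumptions of the sketch and are not taken from the original disclosure.

```python
import torch.nn as nn
from torchvision.models import resnet34  # assumes torchvision >= 0.13


def build_encoder(attention_cls) -> nn.Sequential:
    """Hypothetical assembly of the encoding part: a ResNet34 backbone with a
    non-local self-attention block inserted between the first and second
    Res-block of each residual stage (the second to fifth downsampling layers)."""
    backbone = resnet34(weights=None)  # ImageNet-pretrained weights could be loaded instead
    stages = [backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4]
    channels = [64, 128, 256, 512]     # output channels of each ResNet34 stage
    wrapped = []
    for stage, ch in zip(stages, channels):
        blocks = list(stage.children())
        blocks.insert(1, attention_cls(ch))  # attention after the first Res-block
        wrapped.append(nn.Sequential(*blocks))
    stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
    return nn.Sequential(stem, *wrapped)
```

Here attention_cls stands for a self-attention module such as the one sketched after the attention formulas below.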
Wherein the structure of the self-attention model of the non-local neural network is shown in fig. 3.
As a specific embodiment, the feature vector output by the first Res-block of the second downsampling layer is input to the self-attention model of the non-local neural network, where it passes through a first convolution layer f(x), a second convolution layer q(x) and a third convolution layer v(x); the attention weights are then given by:
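The equation referenced here appears only as an image in the original and is not reproduced; a standard non-local self-attention form consistent with the symbols defined in the next sentence would be:

```latex
\alpha_{j,i} = \frac{\exp\!\big(f(x_i)^{\top} q(x_j)\big)}{\sum_{i=1}^{N} \exp\!\big(f(x_i)^{\top} q(x_j)\big)}
```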
where D represents the number of channels, N is the number of features, W_v and W_q are learned weights, and α_{j,i} represents the degree of attention that the self-attention model of the non-local neural network pays to the i-th location when generating the j-th region; the output of the attention layer can then be expressed as:
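Again, the original expression is not reproduced here; in the standard non-local formulation the attention layer output would read:

```latex
o_j = \sum_{i=1}^{N} \alpha_{j,i}\, v(x_i), \qquad v(x_i) = W_v x_i
```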
wherein f(x_i) = W_f x_i and q(x_i) = W_q x_i, and W_f and W_q are the learned weights.
Finally, the idea of a residual network is introduced and the input feature x is integrated, giving:
O_i = λ·o_i + x_i
where λ is the scale factor that can be learned.
That is, the feature vector input to the second Res-block of the second downsampling layer is O_i = λ·o_i + x_i. A minimal code sketch of such a self-attention block is given below.
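A minimal PyTorch sketch of such a non-local self-attention block, using the residual combination O_i = λ·o_i + x_i described above, might look as follows; the channel-reduction factor, branch names and zero initialisation of λ are assumptions of the sketch, not details taken from the original disclosure.

```python
import torch
import torch.nn as nn


class NonLocalSelfAttention(nn.Module):
    """Sketch of the non-local self-attention block: f and q are 1x1 convolution
    branches used to compute attention weights, v is the value branch, and
    lam is the learnable scale factor lambda of the residual combination."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        inner = max(channels // reduction, 1)
        self.f = nn.Conv2d(channels, inner, kernel_size=1)
        self.q = nn.Conv2d(channels, inner, kernel_size=1)
        self.v = nn.Conv2d(channels, channels, kernel_size=1)
        self.lam = nn.Parameter(torch.zeros(1))  # learnable scale factor lambda

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        n = h * w
        f = self.f(x).view(b, -1, n)                  # B x C' x N
        q = self.q(x).view(b, -1, n)                  # B x C' x N
        v = self.v(x).view(b, c, n)                   # B x C  x N
        # attn[b, i, j] = f(x_i) . q(x_j), normalised over i
        attn = torch.softmax(torch.bmm(f.transpose(1, 2), q), dim=1)  # B x N x N
        o = torch.bmm(v, attn).view(b, c, h, w)       # o_j = sum_i alpha_{j,i} v(x_i)
        return self.lam * o + x                       # O_i = lambda * o_i + x_i
```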
The central part of the D-LinkNet-v network adds hole (dilated) convolutions with direct (skip) connections, which effectively enlarges the receptive field, enhances the recognition capability of the network and fuses multi-scale information. Since the encoding part of the D-LinkNet-v network has 5 downsampling layers and the input size is 256×256, the output feature map of the encoding part is 8×8. When the hole number of the hole convolution is set to 2, the corresponding receptive field is 7; that is, the central part of the D-LinkNet-v network is designed as hole convolution layers with hole coefficients 0, 1 and 2, so that a feature point on the last central layer can see 7×7 points on the first central feature map, covering the main part of the first central feature map. The expanded view of the central part is shown in Fig. 4; the hole numbers of the hole convolutions in Fig. 4 are 2, 1 and 0 from top to bottom, the corresponding receptive fields are 7, 3 and 1, respectively, and the fused feature is obtained by summing the results of the three branches. A sketch of such a center block appears below.
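As an illustration of the central part, the following sketch sums an identity branch with two hole-convolution branches whose receptive fields are 1, 3 and 7. The exact branch layout (one dilation-1 convolution, and dilation-1 followed by dilation-2 convolutions) and the channel count of 512 are interpretations of the description of Fig. 4, not details confirmed by the original.

```python
import torch
import torch.nn as nn


class DilatedCenter(nn.Module):
    """Sketch of the central part: parallel branches with receptive fields
    1 (identity), 3 (one dilation-1 conv) and 7 (dilation-1 then dilation-2
    convs), summed into a fused feature map."""

    def __init__(self, channels: int = 512):
        super().__init__()

        def dconv(dilation: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3,
                          padding=dilation, dilation=dilation),
                nn.ReLU(inplace=True),
            )

        self.branch_rf3 = dconv(1)                           # receptive field 3
        self.branch_rf7 = nn.Sequential(dconv(1), dconv(2))  # receptive field 3 + 4 = 7

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # identity branch (receptive field 1) plus the two dilated branches
        return x + self.branch_rf3(x) + self.branch_rf7(x)
```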
In addition, in the original D-LinkNet network, the last output layer of the decoding part uses a Sigmoid activation function. The Sigmoid function is sensitive to changes of its input only within the range of [-1, 1]; as the input approaches or exceeds this interval, it saturates and loses sensitivity, which affects the prediction accuracy of the neural network. Because the output of the tanh activation function maintains a nonlinear, monotonic rising-and-falling relation with its input and has better fault tolerance, the decoding part of the D-LinkNet-v network adopts tanh as the activation function of the final output layer. The decoding part upsamples using transposed convolution. The settings of m and n in the decoder convolutions are shown in Table 2; sequence numbers ① to ④ in Table 2 correspond to the ① to ④ upsampling blocks of the decoder part in Fig. 2. A sketch of such a decoder is given after Table 2.
Table 2 up-sampling feature map channel number
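Since Table 2 itself is not reproduced in this text, the channel numbers in the following decoder sketch are placeholders; only the use of four transposed-convolution up-sampling blocks and a tanh output layer follows the description above, and the single-channel output is an assumption.

```python
import torch.nn as nn


def upsample_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """One decoder up-sampling block based on transposed convolution."""
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


# Four up-sampling blocks (1)-(4) take the 8x8 encoder output to 128x128; a final
# transposed convolution restores 256x256, and the tanh output layer maps the
# result to the infrared image range [-1, 1].
decoder = nn.Sequential(
    upsample_block(512, 256),   # block (1)
    upsample_block(256, 128),   # block (2)
    upsample_block(128, 64),    # block (3)
    upsample_block(64, 64),     # block (4)
    nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),
    nn.Conv2d(32, 1, kernel_size=3, padding=1),
    nn.Tanh(),
)
```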
The Markov discriminator (Markovian discriminator) is a discrimination model consisting of convolution layers; its output is an N×N matrix X, and the average value of the matrix X is finally taken as the true/false output. Each value in the output matrix corresponds to a receptive field in the original image, i.e., to a region block (patch) of the original image.
Taking an image with an input size of 256×256 as an example, a conventional discriminator maps the 256×256 image to a single scalar output representing "true" or "false", which does not easily reflect the local features of the image. The discrimination model in the present application instead includes a Markov discriminator that maps the 256×256 image to an N×N matrix X, where each X_ij indicates whether region block ij of the image is true or false. The Markov discriminator thus extracts and characterizes local image features, makes the model pay more attention to image details, and helps generate higher-quality images. A minimal sketch of such a discriminator is given below.
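A minimal PatchGAN-style sketch consistent with this description follows. The layer count, channel widths and the assumption that a 3-channel visible image and a 1-channel infrared image are concatenated as input are illustrative only.

```python
import torch
import torch.nn as nn


class MarkovDiscriminator(nn.Module):
    """Sketch of a Markov (patch-based) discriminator: maps a (visible, infrared)
    image pair to an N x N matrix X whose entries judge individual patches;
    the mean of X can then be taken as the overall real/fake score."""

    def __init__(self, in_channels: int = 4, base: int = 64):
        super().__init__()

        def block(cin, cout, norm=True):
            layers = [nn.Conv2d(cin, cout, kernel_size=4, stride=2, padding=1)]
            if norm:
                layers.append(nn.BatchNorm2d(cout))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            return layers

        self.net = nn.Sequential(
            *block(in_channels, base, norm=False),
            *block(base, base * 2),
            *block(base * 2, base * 4),
            nn.Conv2d(base * 4, 1, kernel_size=4, stride=1, padding=1),  # N x N patch scores
        )

    def forward(self, visible: torch.Tensor, infrared: torch.Tensor) -> torch.Tensor:
        # the condition (visible image) and the candidate infrared image are concatenated
        x = torch.cat([visible, infrared], dim=1)
        return self.net(x)
```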
S130, training the generation model and the discrimination model based on the training sample set to obtain a target generation model.
In a specific embodiment, training the generation model and the discrimination model based on the training sample set to obtain a target generation model includes:
inputting the visible light image and the infrared real image corresponding to the visible light image into a generation model to generate an infrared generation image; inputting the visible light image, the infrared real image corresponding to the visible light image and the infrared generated image into a discrimination model to obtain a discrimination result; and determining a loss value of the loss function according to the judging result, and correcting parameters of the generating model and the judging model based on the loss value to obtain the target generating model.
In a specific implementation process, inputting a visible light image, an infrared real image corresponding to the visible light image and an infrared generated image into a discrimination model to obtain a discrimination result, including: inputting the visible light image and the infrared generated image into a discrimination model to obtain a first discrimination result; and inputting the visible light image and the infrared real image corresponding to the visible light image into a discrimination model to obtain a second discrimination result.
Then determining a loss value of the loss function according to the first discrimination result and the second discrimination result; and correcting the generated model and the parameters of the judging model according to the loss value, wherein the generated model corresponding to the loss value of the loss function is the target generated model when the loss value of the loss function meets a preset threshold.
Referring to Fig. 5 and Fig. 6, first, a visible light image and the infrared real image corresponding to the visible light image are input into the generation model (i.e., the generator G) to generate an infrared generated image. The infrared generated image produced by the generation model is then input into the discrimination model (i.e., the discriminator D): the discrimination model obtains a first discrimination result from the visible light image and the infrared generated image, and a second discrimination result from the visible light image and the corresponding infrared real image. A loss value of the loss function is determined based on the first discrimination result and the second discrimination result, and the parameters of the generation model and the discrimination model are corrected based on the loss value, until the loss value of the loss function satisfies a preset threshold and the corresponding parameters of the generation model and the discrimination model are determined. A simplified sketch of one such training step is given below.
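The following sketch illustrates one adversarial training step along the lines of Fig. 5 and Fig. 6. The generator signature G(visible, ir_real), the use of binary cross-entropy terms only, and the omission of the structural loss term and its weight are simplifying assumptions of the sketch.

```python
import torch
import torch.nn.functional as F


def train_step(G, D, opt_G, opt_D, visible, ir_real):
    """One hypothetical conditional adversarial training step."""
    # --- update the discrimination model ---
    ir_fake = G(visible, ir_real).detach()
    d_fake = D(visible, ir_fake)                 # first discrimination result
    d_real = D(visible, ir_real)                 # second discrimination result
    loss_D = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # --- update the generation model ---
    ir_fake = G(visible, ir_real)
    d_fake = D(visible, ir_fake)
    loss_G = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```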
Specifically, in original GANs, the value function of GANs is defined as:
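The value function itself appears as an image in the original and is not reproduced; the standard form consistent with the symbol definitions below is:

```latex
V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big]
        + \mathbb{E}_{z \sim p_{z}}\big[\log\big(1 - D(G(z))\big)\big]
```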
wherein x ~ p_data denotes data drawn from the real distribution; z ~ p_z denotes data drawn from the distribution of the random noise z (e.g., a Gaussian noise distribution); D(x) denotes the probability that the discrimination model judges the real data x to be real; D(G(z)) denotes the probability that the discrimination model judges the data G(z) produced by the generation model to be real; and E denotes an expected value.
In GANs training, training D improves its discrimination ability, i.e., maximizes D(x), which corresponds to maximizing V(G, D); training G corresponds to minimizing log(1 − D(G(z))), i.e., minimizing V(G, D). The GANs training process can therefore be regarded as a min-max optimization, and the objective function can be defined as:
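The objective referenced here would, in the standard min-max formulation, be:

```latex
\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}}\big[\log\big(1 - D(G(z))\big)\big]
```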
At the initial stage of GANs training, the capabilities of both the generation model and the discrimination model are weak, but the difference between the generated data and the real data is large, so the discrimination model can still accurately distinguish generated data from real data. This causes log(1 − D(G(z))) to saturate. In this case, if the gradient passed from the discrimination model to the generation model is 0, the generation model can hardly learn anything useful from the discrimination model, learning cannot continue, and the subsequent network is difficult to adjust.
In terms of the definition of the GANs loss function, the CGAN loss function may be defined as:
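The CGAN loss equation is likewise not reproduced here; its standard conditional form, with x the input (visible light) image, y the real target image and z the random noise, is:

```latex
\mathcal{L}_{\mathrm{CGAN}}(G, D) = \mathbb{E}_{x, y}\big[\log D(x, y)\big]
  + \mathbb{E}_{x, z}\big[\log\big(1 - D(x, G(x, z))\big)\big]
```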
In order to better exploit the structural information of the input image, a loss function is introduced:
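The structural loss term is shown only as an image in the original; a structural-similarity (SSIM) expression consistent with the symbols defined in the next sentence would be as follows (whether it enters the loss as −SSIM or 1 − SSIM is not recoverable from this text):

```latex
\mathrm{SSIM}(x, z) = \frac{\big(2\mu_x\mu_z + C_1\big)\big(2\sigma_{xz} + C_2\big)}
                           {\big(\mu_x^2 + \mu_z^2 + C_1\big)\big(\sigma_x^2 + \sigma_z^2 + C_2\big)}
```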
where μ_x and μ_z are the means of the source-domain image (i.e., the infrared real image) x and of the target-domain image (i.e., the infrared generated image) generated from the random noise z, respectively; σ_x² and σ_z² are the variances of x and z; σ_xz is the covariance of x and z; C_1 = (γ_1·L)² and C_2 = (γ_2·L)² are constants used to maintain stability, with γ_1 = 0.01 and γ_2 = 0.03; and L is the dynamic range of the pixel values.
I.e. the loss function of the present application is ultimately defined as:
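The final loss is also shown only as an image in the original; a plausible combined form, with λ a weighting coefficient assumed here for illustration, is:

```latex
G^{*} = \arg\min_{G}\max_{D}\; \mathcal{L}_{\mathrm{CGAN}}(G, D) + \lambda\, \mathcal{L}_{\mathrm{SSIM}}(G)
```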
In the above formula, x ~ p_data denotes data drawn from the real distribution; z ~ p_z denotes data drawn from the distribution of the random noise z; D(x, y) denotes the probability that the discrimination model judges the real data to be real; G(x, z) denotes the image generated by the generation model from the source-domain image x and the random noise z (i.e., the infrared generated image); D(x, G(x, z)) denotes the probability that the discrimination model judges the generated image to be real; and E denotes an expected value.
In the above embodiments, after the target generation model is obtained, a target infrared image corresponding to a target visible light image is obtained by inputting the target visible light image into the target generation model.
As a specific example, in order to verify whether the infrared images generated by the generation model and the discrimination model constructed in the present application (i.e., by the IR-GANs algorithm) can truly reflect the characteristics of real captured infrared images, and to evaluate the quality of the simulated infrared images from the standpoint of their practical value, the infrared images generated by the algorithm are verified and evaluated for two application scenarios of precision-guided weapons: scene-matching guidance and target detection.
For the image matching task in scene-matching guidance, a visible light reference image is made from the visible light image of the target, and an infrared generated reference image is made from the infrared image of the target generated by the algorithm. The visible light reference image and the infrared generated reference image are used as templates respectively, and matching verification based on normalized cross-correlation is carried out against the real infrared image to verify whether the infrared generated reference image can improve matching accuracy and precision.
For the target detection task, on the basis of the proposed improved two-stage target detection algorithm, the algorithm is trained with the simulated image set F-IOD and the real captured image set IOD respectively to obtain two target detection models. The two models are then used to perform target detection on real captured infrared images and their average detection precisions are compared; if the difference in average detection precision between the two models is within 5%, the simulated image set can truly reflect the real infrared characteristics of the target.
According to the method for generating an infrared target image based on an attention mechanism under a generative adversarial network provided by the embodiments of the present disclosure, a training sample set is first acquired, the training sample set comprising visible light images and infrared real images corresponding to the visible light images; a generation model and a discrimination model are then constructed, wherein the generation model comprises an encoding-decoding network consisting of an encoding part, an intermediate part and a decoding part, the encoding part comprises a ResNet network and the self-attention model of a non-local neural network, the intermediate part comprises a hole convolution layer, the decoding part adopts a tanh activation function to form the output layer, and the discrimination model comprises a Markov discriminator; finally, the generation model and the discrimination model are trained based on the training sample set to obtain a target generation model. That is, in the infrared target image generation method of the present application, the generation model and the discrimination model are trained using the visible light images and the corresponding infrared real images, so that the generated infrared target image can truly reflect the image characteristics. In addition, by adding the self-attention model of the non-local neural network in the encoding part of the generation model, arranged between the first Res-block and the second Res-block in each of the second, third, fourth and fifth downsampling layers, the generation model can better establish the dependency between the textures of the visible light image and of the corresponding infrared real image, improving the gray-scale information and texture fidelity of the infrared generated image.
On the basis of the above embodiments, the present disclosure further provides an infrared target image generation apparatus, as shown in fig. 7, including:
The training sample set obtaining module 210 is configured to obtain a training sample set, where the training sample set includes visible light images and infrared real images corresponding to the visible light images;
The model building module 220 is configured to build a generation model and a discrimination model, where the generation model includes an encoding-decoding network, the encoding-decoding network includes an encoding part, an intermediate part and a decoding part, the encoding part includes a ResNet network and the self-attention model of a non-local neural network, the ResNet network includes a first downsampling layer, a second downsampling layer, a third downsampling layer, a fourth downsampling layer and a fifth downsampling layer, the self-attention model of the non-local neural network is set between the first Res-block and the second Res-block in each of the second downsampling layer, the third downsampling layer, the fourth downsampling layer and the fifth downsampling layer, so that the second Res-block in each of these layers receives the feature vectors of the first Res-block after processing by the self-attention model, the intermediate part includes a hole convolution layer, the decoding part adopts a tanh activation function to form the output layer, and the discrimination model includes a Markov discriminator;
The target model generating module 230 is configured to train the generating model and the discriminating model based on the training sample set to obtain a target generating model.
According to the apparatus for generating an infrared target image based on an attention mechanism under a generative adversarial network, the training sample set acquisition module acquires a training sample set comprising visible light images and infrared real images corresponding to the visible light images; the model building module constructs a generation model and a discrimination model, wherein the generation model comprises an encoding-decoding network consisting of an encoding part, an intermediate part and a decoding part, the encoding part comprises a ResNet network and the self-attention model of a non-local neural network, the intermediate part comprises a hole convolution layer, the decoding part adopts a tanh activation function to form the output layer, and the discrimination model comprises a Markov discriminator; the target model generating module trains the generation model and the discrimination model based on the training sample set to obtain a target generation model. Because the generation model and the discrimination model are trained using the visible light images and the corresponding infrared real images, the generated infrared target image can truly reflect the image characteristics. In addition, by adding the self-attention model of the non-local neural network in the encoding part of the generation model, arranged between the first Res-block and the second Res-block in each of the second, third, fourth and fifth downsampling layers, the generation model can better establish the dependency between the textures of the visible light image and of the corresponding infrared real image, improving the gray-scale information and texture fidelity of the infrared generated image.
In a specific embodiment, the target model generating module comprises a generating unit, a discrimination unit and a target model generating unit;
The generating unit is used for inputting the visible light image and the infrared real image corresponding to the visible light image into the generating model to generate an infrared generating image;
The discrimination unit is used for inputting the visible light image, the infrared real image corresponding to the visible light image and the infrared generated image into the discrimination model to obtain a discrimination result;
and the target model generating unit is used for determining the loss value of the loss function according to the judging result, and correcting parameters of the generating model and the judging model based on the loss value so as to obtain the target generating model.
In a specific embodiment, the discrimination unit includes a first discrimination subunit and a second discrimination subunit;
the first judging subunit is used for inputting the visible light image and the infrared generated image into a judging model to obtain a first judging result;
and the second judging subunit is used for inputting the visible light image and the infrared real image corresponding to the visible light image into the judging model to obtain a second judging result.
In a specific embodiment, the object model generating unit includes a loss value determining unit and a parameter correcting unit;
The loss value determining unit is used for determining a loss value of the loss function according to the first judging result and the second judging result;
and the parameter correction unit is used for correcting parameters of the generation model and the discrimination model according to the loss value, and the corresponding generation model is the target generation model when the loss value of the loss function meets the preset threshold value.
In a specific embodiment, the input feature image size input to the encoding section is 256×256, and the output feature image of the encoding section is 8×8.
In a specific embodiment, the middle portion is a hole convolution layer with hole coefficients of 0,1, and 2.
In a specific embodiment, the decoding portion includes 4 upsampling layers, and the decoding portion upsamples based on a transpose convolution.
The embodiment of the application also provides computer equipment. Referring specifically to fig. 8, fig. 8 is a basic structural block diagram of a computer device according to the present embodiment.
The computer device includes a memory 410 and a processor 420 communicatively coupled to each other via a system bus. It should be noted that only a computer device having components 410-420 is shown in the figure, but it should be understood that not all of the illustrated components need to be implemented, and more or fewer components may be implemented instead. It will be appreciated by those skilled in the art that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
The computer device may be a desktop computer, a notebook computer, a palm computer, a cloud server, or the like. The computer device can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.
The memory 410 includes at least one type of readable storage medium, including non-volatile memory or volatile memory, such as flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 410 may be an internal storage unit of the computer device, such as a hard disk or internal memory of the computer device. In other embodiments, the memory 410 may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the computer device. Of course, the memory 410 may also include both an internal storage unit of the computer device and an external storage device. In this embodiment, the memory 410 is typically used to store the operating system installed on the computer device and various types of application software, such as the program code of the above-described method. In addition, the memory 410 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 420 is typically used to perform the overall operations of the computer device. In this embodiment, the memory 410 is used for storing program codes or instructions, the program codes include computer operation instructions, and the processor 420 is used for executing the program codes or instructions stored in the memory 410 or processing data, such as the program codes for executing the above-mentioned method.
Herein, the bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus system may be classified into an address bus, a data bus, a control bus, etc. For ease of illustration, only one bold line is shown in the figure, but this does not mean that there is only one bus or one type of bus.
Still another embodiment of the present application provides a computer-readable medium, which may be a computer-readable signal medium or a computer-readable medium. A processor in a computer reads computer readable program code stored in a computer readable medium, such that the processor is capable of performing the functional actions specified in each step or combination of steps in the above-described method; a means for generating a functional action specified in each block of the block diagram or a combination of blocks.
The computer readable medium includes, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared memory or semiconductor system, apparatus or device, or any suitable combination of the foregoing, the memory storing program code or instructions, the program code including computer operating instructions, and the processor executing the program code or instructions of the above-described methods stored by the memory.
The definition of memory and processor may refer to the description of the embodiments of the computer device described above, and will not be repeated here.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The functional units or modules in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor (processor) to execute all or part of the steps of the methods of the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
As used herein and in the appended claims, the singular forms of words include the plural and vice versa, unless the context clearly dictates otherwise. Thus, when referring to the singular, the plural of the corresponding term is generally included. Similarly, the terms "comprising" and "including" are to be construed as being inclusive rather than exclusive. Likewise, the terms "comprising" and "or" should be interpreted as inclusive, unless such an interpretation is expressly prohibited herein. Where the term "example" is used herein, particularly when it follows a set of terms, the "example" is merely exemplary and illustrative and should not be considered exclusive or broad.
Further aspects and scope of applicability will become apparent from the description provided herein. It is to be understood that various aspects of the application may be implemented alone or in combination with one or more other aspects. It should also be understood that the description and specific examples are intended for purposes of illustration only and are not intended to limit the scope of the present disclosure.
While several embodiments of the present disclosure have been described in detail, it will be apparent to those skilled in the art that various modifications and variations can be made to the embodiments of the present disclosure without departing from the spirit and scope of the disclosure. The scope of the present disclosure is defined by the appended claims.

Claims (10)

1. A method of generating an infrared target image based on an attention mechanism under a generative adversarial network, comprising:
Acquiring a training sample set, wherein the training sample set comprises visible light images and infrared real images corresponding to the visible light images;
Constructing a generation model and a discrimination model, wherein the generation model comprises an encoding-decoding network, the encoding-decoding network comprises an encoding part, an intermediate part and a decoding part, the encoding part comprises a ResNet network and the self-attention model of a non-local neural network, the ResNet network comprises a first downsampling layer, a second downsampling layer, a third downsampling layer, a fourth downsampling layer and a fifth downsampling layer, the self-attention model of the non-local neural network is arranged between the first Res-block and the second Res-block in each of the second downsampling layer, the third downsampling layer, the fourth downsampling layer and the fifth downsampling layer, the second Res-block in each of these layers receives the feature vectors of the first Res-block after they have been processed by the self-attention model of the non-local neural network, the intermediate part comprises a hole convolution layer, the decoding part adopts a tanh activation function to form an output layer, and the discrimination model comprises a Markov discriminator;
and training the generation model and the discrimination model based on the training sample set to obtain a target generation model.
2. The method of claim 1, wherein training the generation model and the discrimination model based on the training sample set to obtain a target generation model comprises:
inputting the visible light image and the infrared real image corresponding to the visible light image into the generation model to generate an infrared generated image;
inputting the visible light image, the infrared real image corresponding to the visible light image and the infrared generated image into the discrimination model to obtain a discrimination result; and
determining a loss value of a loss function according to the discrimination result, and correcting parameters of the generation model and the discrimination model based on the loss value to obtain a target generation model.
3. The method according to claim 2, wherein inputting the visible light image, the infrared real image corresponding to the visible light image, and the infrared generated image to the discrimination model to obtain a discrimination result includes:
inputting the visible light image and the infrared generated image into the discrimination model to obtain a first discrimination result;
and inputting the visible light image and the infrared real image corresponding to the visible light image into the discrimination model to obtain a second discrimination result.
4. The method according to claim 3, wherein determining a loss value of a loss function according to the discrimination result and correcting parameters of the generation model and the discrimination model based on the loss value to obtain a target generation model comprises:
determining a loss value of the loss function according to the first discrimination result and the second discrimination result; and
correcting parameters of the generation model and the discrimination model according to the loss value, wherein, when the loss value of the loss function meets a preset threshold, the generation model corresponding to that loss value is the target generation model.
5. The method according to claim 1, wherein the size of the input feature image input to the encoding part is 256×256, and the size of the output feature image of the encoding part is 8×8.
6. The method of claim 1, wherein the intermediate part is a dilated convolution layer having dilation coefficients of 0, 1 and 2.
7. The method of claim 1, wherein the decoding part comprises 4 upsampling layers and performs upsampling based on transposed convolution.
8. An apparatus for generating an infrared target image based on an attention mechanism under an adversarial network, comprising:
a training sample set acquisition module, configured to acquire a training sample set, wherein the training sample set comprises visible light images and infrared real images corresponding to the visible light images;
a model construction module, configured to construct a generation model and a discrimination model, wherein the generation model comprises an encoding-decoding network, the encoding-decoding network comprises an encoding part, an intermediate part and a decoding part, the encoding part comprises a ResNet network and a self-attention model of a non-local neural network, the ResNet network comprises a first downsampling layer, a second downsampling layer, a third downsampling layer, a fourth downsampling layer and a fifth downsampling layer, the self-attention model of the non-local neural network is arranged between the first Res-block and the second Res-block in each of the second downsampling layer, the third downsampling layer, the fourth downsampling layer and the fifth downsampling layer, so that the second Res-block in each of these layers receives the feature vector output by the first Res-block after processing by the self-attention model, the intermediate part comprises a dilated convolution layer, the decoding part adopts a tanh activation function to form an output layer, and the discrimination model comprises a Markov discriminator; and
a target model generation module, configured to train the generation model and the discrimination model based on the training sample set to obtain a target generation model.
9. A computer device, comprising:
one or more processors; and
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-7.
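
The generator recited in claims 1 and 5-7 can be read as a ResNet-style encoder whose second to fifth downsampling layers insert a non-local self-attention block between their first and second Res-blocks, a dilated-convolution middle part, and a transposed-convolution decoder ending in a tanh output layer. The following is a minimal PyTorch sketch under that reading; the class names, channel widths, strides, normalization layers and single-channel infrared output are assumptions and are not taken from the patent. The claims recite four upsampling layers and dilation coefficients of 0, 1 and 2; this sketch uses five stride-2 transposed convolutions simply so that a 256×256 input returns to 256×256, and dilation rates of 1 and 2, since a dilation of 0 is not meaningful in standard frameworks.

import torch
import torch.nn as nn


class ResBlock(nn.Module):
    """Two 3x3 convolutions with a residual connection."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))


class NonLocalAttention(nn.Module):
    """Simplified non-local self-attention block: y = x + W_z(softmax(q k^T) v)."""
    def __init__(self, ch):
        super().__init__()
        self.theta = nn.Conv2d(ch, ch // 2, 1)   # query projection
        self.phi = nn.Conv2d(ch, ch // 2, 1)     # key projection
        self.g = nn.Conv2d(ch, ch // 2, 1)       # value projection
        self.out = nn.Conv2d(ch // 2, ch, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)          # (b, hw, c/2)
        k = self.phi(x).flatten(2)                            # (b, c/2, hw)
        v = self.g(x).flatten(2).transpose(1, 2)              # (b, hw, c/2)
        attn = torch.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(b, c // 2, h, w)
        return x + self.out(y)                                # residual connection


class DownStage(nn.Module):
    """One downsampling layer: stride-2 conv, first Res-block, optional attention, second Res-block."""
    def __init__(self, c_in, c_out, use_attention):
        super().__init__()
        layers = [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), ResBlock(c_out)]
        if use_attention:
            # self-attention sits between the first and second Res-block, as recited in claim 1
            layers.append(NonLocalAttention(c_out))
        layers.append(ResBlock(c_out))
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)


class Generator(nn.Module):
    def __init__(self, in_ch=3, out_ch=1, base=32):
        super().__init__()
        chs = [base, base * 2, base * 4, base * 8, base * 8]
        stages, c_prev = [], in_ch
        for i, c in enumerate(chs):               # five downsampling layers: 256 -> 8
            stages.append(DownStage(c_prev, c, use_attention=(i >= 1)))  # attention in layers 2-5
            c_prev = c
        self.encoder = nn.Sequential(*stages)
        self.middle = nn.Sequential(              # dilated ("hole") convolution middle part
            nn.Conv2d(c_prev, c_prev, 3, padding=1, dilation=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_prev, c_prev, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
        )
        ups, c = [], c_prev
        for c_next in reversed(chs[:-1]):         # transposed-convolution upsampling
            ups += [nn.ConvTranspose2d(c, c_next, 4, stride=2, padding=1), nn.ReLU(inplace=True)]
            c = c_next
        ups += [nn.ConvTranspose2d(c, out_ch, 4, stride=2, padding=1), nn.Tanh()]  # tanh output layer
        self.decoder = nn.Sequential(*ups)

    def forward(self, x):
        return self.decoder(self.middle(self.encoder(x)))


# Shape check: a 256x256 visible-light image maps to a 256x256 single-channel infrared image,
# and the encoder output is 8x8 as in claim 5.
# g = Generator(); g(torch.randn(1, 3, 256, 256)).shape  # torch.Size([1, 1, 256, 256])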
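
Claim 1 further recites a Markov discriminator. A common reading of this is a conditional PatchGAN: the visible light image and a (real or generated) infrared image are concatenated on the channel axis and mapped to a grid of per-patch real/fake scores rather than a single scalar, so each output element judges only its local receptive field. The sketch below follows that reading under the same PyTorch assumption; layer count, channel widths and the use of InstanceNorm are illustrative choices, not taken from the patent.

import torch
import torch.nn as nn


class MarkovDiscriminator(nn.Module):
    def __init__(self, vis_ch=3, ir_ch=1, base=64):
        super().__init__()

        def block(c_in, c_out, stride):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, 4, stride=stride, padding=1),
                nn.InstanceNorm2d(c_out),
                nn.LeakyReLU(0.2, inplace=True),
            )

        self.net = nn.Sequential(
            nn.Conv2d(vis_ch + ir_ch, base, 4, stride=2, padding=1),  # no norm on the first layer
            nn.LeakyReLU(0.2, inplace=True),
            block(base, base * 2, 2),
            block(base * 2, base * 4, 2),
            block(base * 4, base * 8, 1),
            nn.Conv2d(base * 8, 1, 4, stride=1, padding=1),           # per-patch logits
        )

    def forward(self, visible, infrared):
        # condition the judgement on the visible-light image by channel concatenation
        return self.net(torch.cat([visible, infrared], dim=1))


# d = MarkovDiscriminator()
# d(torch.randn(1, 3, 256, 256), torch.randn(1, 1, 256, 256)).shape  # torch.Size([1, 1, 30, 30])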
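
Claims 2-4 describe one training round: generate an infrared image, score the (visible, generated) pair as the first discrimination result and the (visible, real infrared) pair as the second discrimination result, derive a loss value, and correct both models until the loss meets a preset threshold. A minimal sketch of one such iteration follows; the binary cross-entropy adversarial loss, the added L1 term and its weight are assumptions, since the patent does not disclose the concrete loss function here, and the sketch feeds only the visible-light image to the generation model, which simplifies the claim wording.

import torch
import torch.nn.functional as F


def train_step(generator, discriminator, opt_g, opt_d, visible, infrared_real, l1_weight=100.0):
    # --- discriminator update ---
    infrared_fake = generator(visible).detach()
    pred_fake = discriminator(visible, infrared_fake)   # first discrimination result
    pred_real = discriminator(visible, infrared_real)   # second discrimination result
    loss_d = 0.5 * (
        F.binary_cross_entropy_with_logits(pred_real, torch.ones_like(pred_real))
        + F.binary_cross_entropy_with_logits(pred_fake, torch.zeros_like(pred_fake))
    )
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # --- generator update ---
    infrared_fake = generator(visible)
    pred_fake = discriminator(visible, infrared_fake)
    loss_g = (
        F.binary_cross_entropy_with_logits(pred_fake, torch.ones_like(pred_fake))
        + l1_weight * F.l1_loss(infrared_fake, infrared_real)
    )
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

In practice such a loop would run over the whole training sample set and stop once the loss value falls below the preset threshold of claim 4, at which point the current generation model is retained as the target generation model.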
CN202410088720.1A 2024-01-22 2024-01-22 Infrared target image generation method, device, equipment and storage medium Pending CN118037873A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410088720.1A CN118037873A (en) 2024-01-22 2024-01-22 Infrared target image generation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410088720.1A CN118037873A (en) 2024-01-22 2024-01-22 Infrared target image generation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN118037873A true CN118037873A (en) 2024-05-14

Family

ID=90990403

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410088720.1A Pending CN118037873A (en) 2024-01-22 2024-01-22 Infrared target image generation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN118037873A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination