WO2023070695A1 - Infrared image conversion training method, apparatus, device and storage medium - Google Patents

Infrared image conversion training method, apparatus, device and storage medium

Info

Publication number
WO2023070695A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
preset
true
infrared
domain set
Prior art date
Application number
PCT/CN2021/128161
Other languages
English (en)
French (fr)
Inventor
陈凯
王水根
王建生
王宏臣
Original Assignee
烟台艾睿光电科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 烟台艾睿光电科技有限公司
Publication of WO2023070695A1

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/29 Graphical models, e.g. Bayesian networks
    • G06F 18/295 Markov models or related models, e.g. semi-Markov models; Markov random fields; Networks embedding Markov models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/088 Non-supervised learning, e.g. competitive learning

Definitions

  • The invention relates to the technical field of image processing, and in particular to an infrared image conversion training method, apparatus, device, and computer-readable storage medium.
  • Infrared imaging technology can obtain images at night or in various extreme environments that the naked eye and various visible-light detectors cannot capture. With this unique advantage, it is widely used in many fields such as night monitoring, fire detection, and night driving assistance.
  • However, infrared images also have obvious shortcomings.
  • The grayscale images they form do not match human visual perception; in addition, compared with visible-light true-color images, infrared images lack much detailed texture information, which makes various back-end image algorithms difficult to implement.
  • Infrared true-color conversion algorithms based on deep neural networks have appeared. Compared with traditional fusion techniques, these algorithms can directly convert infrared images into visible-light true-color images, which gives them certain technical advantages.
  • However, deep neural networks that realize infrared true-color conversion have the following disadvantages: algorithms using supervised learning require strictly paired and registered samples, and for a cross-time conversion task such as infrared color night vision, such a data set is almost impossible to obtain.
  • Algorithms using unsupervised learning are basically based on the idea of cycle consistency. Although such algorithms do not require strictly paired samples, their training requirements are often too strict: the two-way mapping between infrared and visible-light images requires generators, discriminators, and other models in pairs, demanding high computing power; when the difference between infrared and visible-light images is too large, the cycle consistency idea is prone to mode collapse, i.e., training failure, so the quality of the generated images is not stable; and existing algorithms lack constraints on temporal continuity, so when the conversion task becomes a continuous-frame video conversion task, style drift or inter-frame flicker appears, greatly degrading the visual quality of the generated video.
  • The object of the present invention is to provide an infrared image conversion training method, apparatus, device, and computer-readable storage medium, so that the trained infrared image conversion model can generate realistic and detailed visible-light true-color images while avoiding inter-frame flicker.
  • the present invention provides a conversion training method of infrared images, including:
  • the first image domain set includes an infrared image corresponding to an infrared video
  • the second image domain set includes a true color image corresponding to a true color video
  • the scene of the infrared video is the same as that of the true-color video
  • the preset generator is used to convert the infrared image into a converted true-color image
  • the preset discriminator is used to judge whether an input true-color image is true or false
  • a conversion generator is obtained, so as to use the conversion generator to perform image conversion on the actual infrared video to obtain the target true-color video.
  • Optionally, iteratively training the weight parameters of the preset generator and the preset discriminator includes:
  • iteratively training the weight parameters of the preset generator and the preset discriminator by using the first image domain set and the second image domain set, so that the semantic information of corresponding images input to and output from the preset generator remains consistent.
  • Optionally, the idea of contrastive learning is implemented based on a semantic structure loss function, and the semantic structure loss function includes a multi-layer infrared image block contrast loss function and a multi-layer true-color image block contrast loss function.
  • The multi-layer infrared image block contrast loss function is $L_{PatchNCE}(G,H,X)=\mathbb{E}_{x\sim X}\sum_{l=1}^{L}\sum_{s=1}^{S_l}\ell(\hat{z}_l^s,z_l^s,z_l^{S\setminus s})$ and the multi-layer true-color image block contrast loss function is $L_{PatchNCE}(G,H,Y)=\mathbb{E}_{y\sim Y}\sum_{l=1}^{L}\sum_{s=1}^{S_l}\ell(\hat{z}_l^s,z_l^s,z_l^{S\setminus s})$, where X is the first image domain set, Y is the second image domain set, l is a target convolutional layer in the encoder of the preset generator, L is the number of target convolutional layers in the encoder, s is a target position in each target convolutional layer, $S_l$ is the number of target positions in the target convolutional layer, $z_l$ is the feature generated after passing through the encoder and a preset multi-layer perceptron network, $z_l^s$ is the feature at the target position of the infrared image or the true-color image that corresponds to the converted true-color image, $z_l^{S\setminus s}$ is a feature at a target position of the infrared image or the true-color image that does not correspond to the converted true-color image, and $\hat{z}_l^s$ is the feature at the corresponding target position of the converted true-color image.
  • Optionally, the inter-frame difference consistency idea is implemented based on an inter-frame difference consistency loss function, where the inter-frame difference consistency loss function is $L_{temp}=\sum_{t=1}^{T-1}\left\|\phi(I_t)-\phi(O_t)\right\|$;
  • T is the total number of frames of the infrared video;
  • $I_t$ is the input frame sequence of the preset generator and $O_t$ is the output frame sequence of the preset generator;
  • $\phi(x_t)=f_m(x_{t+1})-f_m(x_t)$;
  • $x_t$ is the gap between the (t+1)-th frame and the t-th frame;
  • m is the target feature layer;
  • $f_m(x_t)$ is the feature extracted by a convolutional layer of a preset convolutional neural network.
  • Optionally, iteratively training the weight parameters of the preset generator and the preset discriminator by using the first image domain set and the second image domain set further includes: simultaneously training both sets of weight parameters based on the generative adversarial idea.
  • The generative adversarial idea is implemented based on a generative adversarial loss function, where the generative adversarial loss function is $L_{GAN}(G,D,X,Y)=\mathbb{E}_{y\sim Y}[\log D(y_k)]+\mathbb{E}_{x\sim X}[\log(1-D(G(x_i)))]$;
  • $G(\cdot)$ is the output of the preset generator;
  • $D(\cdot)$ is the output of the preset discriminator;
  • X is the infrared image;
  • Y is the true-color image;
  • $y_k$ is the true-color video frame image of the k-th frame;
  • $x_i$ is the infrared video frame image of the i-th frame.
  • Optionally, iteratively training the weight parameters of the preset generator and the preset discriminator by using the first image domain set and the second image domain set to obtain the trained preset generator includes:
  • according to a preset loss function, iteratively training the weight parameters of the preset generator and the preset discriminator by using the first image domain set and the second image domain set, and obtaining the trained preset generator, wherein the preset loss function is the sum of the products of the semantic structure loss function, the inter-frame difference consistency loss function, and the generative adversarial loss function with their respective loss function weight coefficients.
  • Optionally, after obtaining the conversion generator according to the trained preset generator, the method further includes:
  • acquiring an image set to be converted, wherein the image set to be converted includes infrared images to be converted corresponding to the actual infrared video; and performing image conversion on the infrared images to be converted by using the conversion generator to obtain the target true-color video.
  • Optionally, acquiring the first image domain set and the second image domain set includes:
  • acquiring training video data, wherein the training video data includes the infrared video and the true-color video; dividing the training video data into frames to obtain single-frame images; converting the single-frame images to obtain target single-frame images of a preset image specification; and
  • splicing a preset number of consecutive target single-frame images in video frame order to obtain the infrared images corresponding to the infrared video and the true-color images corresponding to the true-color video.
  • the present invention also provides a conversion training device for infrared images, comprising:
  • an acquisition module, configured to acquire a first image domain set and a second image domain set, wherein the first image domain set includes infrared images corresponding to an infrared video, the second image domain set includes true-color images corresponding to a true-color video, and the scene of the infrared video is the same as that of the true-color video;
  • a training module, configured to iteratively train the weight parameters of a preset generator and a preset discriminator by using the first image domain set and the second image domain set based on inter-frame difference consistency and contrastive learning, and obtain a trained preset generator, wherein the preset generator is used to convert the infrared images into converted true-color images, and the preset discriminator is used to judge whether an input true-color image is true or false;
  • a generation module, configured to obtain a conversion generator according to the trained preset generator, so as to use the conversion generator to perform image conversion on an actual infrared video to obtain a target true-color video.
  • the present invention also provides an infrared image conversion training device, comprising:
  • a memory for storing a computer program; and a processor configured to implement the steps of the above infrared image conversion training method when executing the computer program.
  • The present invention also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the above infrared image conversion training method are implemented.
  • The infrared image conversion training method includes: acquiring a first image domain set and a second image domain set, wherein the first image domain set includes infrared images corresponding to an infrared video, the second image domain set includes true-color images corresponding to a true-color video, and the scenes of the infrared video and the true-color video are the same; based on inter-frame difference consistency and contrastive learning, iteratively training the weight parameters of a preset generator and a preset discriminator by using the first image domain set and the second image domain set, and obtaining the trained preset generator, wherein the preset generator is used to convert the infrared images into converted true-color images and the preset discriminator is used to judge whether an input true-color image is true or false; and obtaining a conversion generator according to the trained preset generator, so that the conversion generator can be used to perform image conversion on an actual infrared video to obtain a target true-color video.
  • By iteratively training the weight parameters of the preset generator and the preset discriminator with the first and second image domain sets based on inter-frame difference consistency and contrastive learning, the trained preset generator is obtained. Introducing the idea of contrastive learning avoids the strict two-way mapping of the existing cycle consistency idea, adapts better to infrared image conversion across a time span, and lets the converted daytime true-color image retain the original semantic structure information of the nighttime infrared image, realizing the conversion and generation of realistic, detailed daytime true-color images. Based on inter-frame difference consistency, the idea of inter-frame difference is used to constrain the inter-frame difference between input and output, which effectively prevents the generated true-color video from suffering inter-frame flicker.
  • In addition, the present invention also provides an infrared image conversion training apparatus, device, and computer-readable storage medium, which have the same beneficial effects.
  • Fig. 1 is a flow chart of an infrared image conversion training method provided by an embodiment of the present invention
  • FIG. 2 is a schematic structural diagram of a residual module of an infrared image conversion training method provided by an embodiment of the present invention
  • FIG. 3 is a schematic structural diagram of a Markov discriminator of an infrared image conversion training method provided by an embodiment of the present invention
  • FIG. 4 is a schematic structural diagram of a semantic structure loss function based on contrastive learning of an infrared image conversion training method provided by an embodiment of the present invention
  • Fig. 5 is a structural block diagram of an infrared image conversion training device provided by an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of an infrared image conversion training device provided by an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of an infrared image conversion training device provided in this embodiment.
  • FIG. 1 is a flowchart of an infrared image conversion training method provided by an embodiment of the present invention.
  • the method can include:
  • Step 101: Obtain a first image domain set and a second image domain set, wherein the first image domain set includes infrared images corresponding to an infrared video, the second image domain set includes true-color images corresponding to a true-color video, and the scene of the infrared video is the same as that of the true-color video.
  • It can be understood that the first image domain set in this step can be a set of infrared images corresponding to an infrared video (such as a nighttime infrared video), and the second image domain set in this step can be a set of true-color images corresponding to a true-color video of the same scene as the infrared video (such as a daytime true-color video).
  • The specific way in which the processor obtains the first image domain set and the second image domain set can be set by the designer according to practical scenarios and user needs.
  • For example, the processor can directly receive the first image domain set and the second image domain set.
  • The processor can also preprocess received infrared video and true-color video to construct the first image domain set and the second image domain set. For example, a vehicle-mounted binocular camera can collect dual-light video data of the same scene in both daytime and nighttime periods, i.e., daytime infrared video and visible-light true-color video as well as nighttime infrared video and visible-light true-color video, so that the processor can preprocess the received nighttime infrared video and daytime visible-light true-color video to construct the first and second image domain sets; the binocular camera can perform registration and synchronization in hardware to ensure that the frame rate and scene of the nighttime infrared video and the daytime visible-light true-color video are the same.
  • Correspondingly, the specific way in which the processor preprocesses the received infrared video and true-color video can be set by the designer. For example, the processor acquires training video data; divides the training video data into frames to obtain single-frame images; converts the single-frame images to obtain target single-frame images of a preset image specification; and splices a preset number of consecutive target single-frame images in video frame order to obtain the infrared images corresponding to the infrared video and the true-color images corresponding to the true-color video, where the training video data includes the infrared video and the true-color video.
  • For example, the processor can perform three preprocessing operations on the received nighttime infrared video and daytime true-color video: framing, conversion (resize), and splicing (merge). The framing operation converts the nighttime infrared video and the daytime true-color video into multiple consecutive single-frame images; the conversion operation uniformly resizes the single-frame images into target single-frame images of a preset image specification; the splicing operation splices a preset number (n) of consecutive target single-frame images, so that every n consecutive frames form one image (i.e., one infrared image or true-color image) of the set.
  • That is, whether in the first image domain set or the second image domain set, each image can be spliced from the target single-frame images corresponding to n consecutive original frame-sequence images (i.e., single-frame images), where n can be a positive integer between 2 and 5.
  • The processor may also perform only the two preprocessing operations of framing and splicing, or of framing and conversion, on the received nighttime infrared video and daytime true-color video, which is not limited in this embodiment.
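As a concrete illustration of the framing/resize/merge preprocessing just described, the following is a minimal Python sketch. It assumes OpenCV decoding, a 256*256 target size, and width-wise splicing of n consecutive frames; the function name and these specific choices are illustrative, not taken from the patent.

```python
import cv2
import numpy as np

def build_image_domain_set(video_path, n=3, size=(256, 256)):
    """Split a video into frames (framing), resize each frame (conversion),
    and splice every n consecutive frames into one image (merge)."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()                      # framing: one frame per read
        if not ok:
            break
        frames.append(cv2.resize(frame, size))      # resize to the preset spec
    cap.release()
    # merge: n consecutive target single-frame images spliced along the width
    return [np.concatenate(frames[i:i + n], axis=1)
            for i in range(0, len(frames) - n + 1, n)]
```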
  • Step 102: Based on inter-frame difference consistency and contrastive learning, use the first image domain set and the second image domain set to iteratively train the weight parameters of a preset generator and a preset discriminator, and obtain the trained preset generator, wherein the preset generator is used to convert the infrared images into converted true-color images, and the preset discriminator is used to judge whether an input true-color image is true or false.
  • It can be understood that this embodiment provides an infrared true-color night vision algorithm based on contrastive learning and an inter-frame difference loss: I2V-NET (Infrared to Visible Network).
  • It can obtain more detailed local style features than existing cycle consistency algorithms and can also effectively avoid the problem of inter-frame flicker.
  • The overall training structure of I2V-NET is simple: it can include one generator (i.e., the preset generator) and one discriminator (i.e., the preset discriminator) without other complicated auxiliary structures, and training is fast and effective.
  • Specifically, the preset generator can convert the infrared images corresponding to the infrared video into corresponding true-color images (i.e., converted true-color images); the preset discriminator can judge whether an input true-color image is true or false, that is, determine whether the input true-color image is a true-color image from the second image domain set (i.e., a real true-color image) or a converted true-color image generated by the preset generator.
  • The specific structures of the preset generator and the preset discriminator in this embodiment can be set by the designer according to practical scenarios and user needs. The preset generator can include an encoder (Encoder), a converter, and a decoder.
  • The encoder can include three groups of "Conv + instance norm + relu", which downsample the input frame-sequence image into a feature map with a set number of channels; Conv can be a convolutional layer, instance norm a normalization layer, and relu a non-linear activation layer.
  • The converter is composed of a preset number (m) of residual modules (Resnet_block), as shown in Figure 2; its function is to recombine and convert the feature maps obtained after the encoder's downsampling. The decoder restores the image size through deconvolution operations and finally obtains the generated image (i.e., the converted true-color image).
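The generator layout described above (a three-group downsampling encoder, m residual modules, and a deconvolutional decoder) could look roughly like the following PyTorch sketch. Channel counts, kernel sizes, and the final Tanh are illustrative assumptions, not specified by the patent.

```python
import torch.nn as nn

def conv_block(cin, cout, stride):
    # one "Conv + instance norm + relu" group
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1),
                         nn.InstanceNorm2d(cout), nn.ReLU(inplace=True))

class ResnetBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(conv_block(c, c, 1),
                                  nn.Conv2d(c, c, 3, 1, 1),
                                  nn.InstanceNorm2d(c))
    def forward(self, x):
        return x + self.body(x)        # residual connection (Fig. 2)

class Generator(nn.Module):
    def __init__(self, in_ch=3, base=64, m=9):
        super().__init__()
        # encoder: three "Conv + instance norm + relu" groups (downsampling)
        self.encoder = nn.Sequential(conv_block(in_ch, base, 1),
                                     conv_block(base, base * 2, 2),
                                     conv_block(base * 2, base * 4, 2))
        # converter: m residual modules recombine the downsampled features
        self.converter = nn.Sequential(*[ResnetBlock(base * 4) for _ in range(m)])
        # decoder: deconvolutions restore the image size
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(base * 4, base * 2, 3, 2, 1, output_padding=1),
            nn.InstanceNorm2d(base * 2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base * 2, base, 3, 2, 1, output_padding=1),
            nn.InstanceNorm2d(base), nn.ReLU(inplace=True),
            nn.Conv2d(base, in_ch, 7, 1, 3), nn.Tanh())
    def forward(self, x):
        return self.decoder(self.converter(self.encoder(x)))
```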
  • The preset discriminator can adopt the Markov discriminator (PatchGAN) shown in Figure 3, which can be composed entirely of convolutional layers and finally outputs a k*k matrix; the mean of the real-or-fake outputs of this output matrix is taken as the true-or-false result. Each output in the output matrix represents one receptive field in the input image (such as a converted true-color image or a true-color image), corresponding to one patch of the input image.
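A minimal sketch of a PatchGAN-style discriminator matching this description: all-convolutional, producing a k*k score map whose mean serves as the true-or-false output. Layer widths and kernel sizes are illustrative assumptions.

```python
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        layers, c = [], in_ch
        for cout in (base, base * 2, base * 4):
            layers += [nn.Conv2d(c, cout, 4, 2, 1),
                       nn.InstanceNorm2d(cout),
                       nn.LeakyReLU(0.2, inplace=True)]
            c = cout
        layers += [nn.Conv2d(c, 1, 4, 1, 1)]   # k*k patch score map
        self.net = nn.Sequential(*layers)
    def forward(self, x):
        # each entry of the k*k map scores one receptive field (patch);
        # the mean over the map is the final real/fake output
        return self.net(x).mean(dim=(1, 2, 3))
```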
  • It should be noted that the iterative training of the weight parameters of the preset generator and the preset discriminator of I2V-NET in this step can, after several iterations, yield suitable weight parameters (such as deep neural network weights) for the preset generator and the preset discriminator, so that a generator for image conversion of actual infrared video (i.e., the conversion generator) can be obtained from the trained preset generator.
  • Correspondingly, the specific way of iteratively training the weight parameters can be set by the designer. Based on the idea of inter-frame difference consistency, the processor can iteratively train the weight parameters of the preset generator and the preset discriminator by using the first image domain set and the second image domain set, so that the inter-frame difference between corresponding consecutive frame images input to and output from the preset generator is smaller than a preset value; based on the idea of contrastive learning of semantic structure, it can iteratively train the weight parameters by using the two image domain sets so that the semantic information of corresponding images input to and output from the preset generator remains consistent.
  • For example, the processor can use a preset loss function embodying the inter-frame difference consistency idea and the semantic-structure contrastive learning idea to iteratively train the weight parameters with the first and second image domain sets and obtain the trained preset generator. The preset loss function can include a semantic structure loss function based on contrastive learning, used to adjust the semantic structure information of the images generated by the preset generator, training the converted true-color images generated by the preset generator to retain the original semantic structure information of the infrared images so that the semantic information of corresponding input and output images stays consistent.
  • The processor can also use a preset network model based on the inter-frame difference consistency idea and the semantic-structure contrastive learning idea, and iteratively train the weight parameters of the preset generator and the preset discriminator with the first and second image domain sets to obtain the trained preset generator, which is not limited in this embodiment.
  • Correspondingly, in this step, based on the generative adversarial idea, the first image domain set and the second image domain set can also be used to iteratively train the weight parameters of the preset generator and the preset discriminator at the same time, so that the scores of the true-or-false results output by the preset discriminator for the true-color image and for the converted true-color image are equal (for example, both 0.5); the preset discriminator thus gains a sufficiently strong ability to distinguish true from false, and the converted true-color image generated by the preset generator looks more like an actual true-color image.
  • For example, the above preset loss function may also include a generative adversarial loss function, used to adjust the preset discriminator's outputs for the true-color image and the converted true-color image and to adjust the preset generator's output, so that the preset discriminator has a sufficiently strong ability to distinguish true from false and the converted true-color image generated by the preset generator looks more like an actual true-color image; that is, the generative adversarial idea can be realized based on the generative adversarial loss function. In other words, in this step, the weight parameters of the preset generator and the preset discriminator can be iteratively trained with the first and second image domain sets according to the inter-frame difference consistency loss function, the semantic structure loss function, and the generative adversarial loss function, to obtain the trained preset generator.
  • Specifically, for the generative adversarial loss function: since the preset generator (G) and the preset discriminator (D) of I2V-NET are trained at the same time, their training process is a two-player game.
  • The purpose of training the preset discriminator is to give it a sufficiently strong ability to distinguish true from false: when the generated result fake_B of the preset generator passes through the preset discriminator, the preset discriminator should give a low score, and when the original true-color image real_B from the second image domain set passes through the preset discriminator, it should give a high score. The opposite holds when training the preset generator.
  • The training goal of the preset generator is that its generated image fake_B obtains a high score from the preset discriminator, i.e., fake_B better matches the distribution of real_B, can pass for real in style, and "looks more like" the true-color images of the second image domain set (i.e., the target-domain true-color images) in appearance.
  • The two thus form a minimax game during training, and the loss function can be the generative adversarial loss (GAN_loss) function: $L_{GAN}(G,D,X,Y)=\mathbb{E}_{y\sim Y}[\log D(y_k)]+\mathbb{E}_{x\sim X}[\log(1-D(G(x_i)))]$
  • where $G(\cdot)$ can be the output of the preset generator;
  • $D(\cdot)$ can be the output of the preset discriminator;
  • X can be the infrared images in the first image domain set (i.e., source-domain sample images);
  • Y can be the true-color images in the second image domain set (i.e., target-domain sample images);
  • $y_k$ can be the true-color video frame image of the k-th frame;
  • $x_i$ can be the infrared video frame image of the i-th frame.
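For illustration, a hedged PyTorch sketch of the adversarial objective above. It uses the common log-likelihood form with the non-saturating generator variant, and assumes D returns one raw score per image; the patent's exact formulation may differ in detail.

```python
import torch

def gan_loss_D(D, G, real_B, real_A):
    """Discriminator: high score for real true-color frames, low for fake."""
    fake_B = G(real_A).detach()          # do not backprop into G here
    return -(torch.log(torch.sigmoid(D(real_B)) + 1e-8).mean()
             + torch.log(1 - torch.sigmoid(D(fake_B)) + 1e-8).mean())

def gan_loss_G(D, G, real_A):
    """Generator: make the converted image score high (fool D);
    non-saturating variant of the minimax objective."""
    fake_B = G(real_A)
    return -torch.log(torch.sigmoid(D(fake_B)) + 1e-8).mean()
```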
  • Correspondingly, for the semantic structure loss function based on contrastive learning: the traditional CycleGAN-based method maintains the consistency of the generator's input and output semantic structure through the two-way mapping formed by cycle consistency, which requires training two generators and two discriminators at the same time and consumes a great deal of computing power; moreover, this strict two-way mapping based on the original image easily fails in training, and the generated images lack detail.
  • The I2V-NET in this embodiment needs only one preset generator and one preset discriminator: it finds corresponding mutual information by contrasting input and output image blocks (patches), and obtains an output with strong content correspondence by maximizing this mutual information.
  • The mutual-information maximization method can follow the idea of contrastive learning: the query signal v, a positive signal $v^+$, and N negative signals $v^-$ are mapped to K-dimensional vectors, where $v,v^+\in\mathbb{R}^K$ and $v^-\in\mathbb{R}^{N\times K}$. To prevent mode collapse, these vectors are normalized onto the unit sphere, which establishes an (N+1)-way classification problem; finally, a cross-entropy loss is computed, representing the probability that the positive sample is selected.
  • The loss function can be the information noise-contrastive estimation (infoNCE) loss function: $\ell(v,v^+,v^-)=-\log\left[\frac{\exp(v\cdot v^+/\tau)}{\exp(v\cdot v^+/\tau)+\sum_{n=1}^{N}\exp(v\cdot v_n^-/\tau)}\right]$
  • where τ is a preset scaling factor.
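A minimal sketch of this infoNCE computation, assuming a single K-dimensional query v, one positive v_pos, N negatives v_neg, and the preset scaling factor tau (the default value here is an assumption).

```python
import torch
import torch.nn.functional as F

def info_nce(v, v_pos, v_neg, tau=0.07):
    """v: (K,), v_pos: (K,), v_neg: (N, K); returns -log P(positive selected)."""
    # normalize onto the unit sphere to prevent mode collapse
    v, v_pos = F.normalize(v, dim=-1), F.normalize(v_pos, dim=-1)
    v_neg = F.normalize(v_neg, dim=-1)
    pos = (v * v_pos).sum(-1, keepdim=True) / tau     # similarity to positive
    neg = v @ v_neg.t() / tau                         # similarities to negatives
    logits = torch.cat([pos, neg], dim=-1)            # (N+1)-way classification
    # the positive is class 0; cross-entropy gives -log P(positive selected)
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```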
  • In this embodiment, a multi-layer, paired contrastive learning method can be used. Under unsupervised learning conditions, the semantic information of the generated image can be constrained to stay consistent with the input image at both the image level and the image block (patch) level, so the image generated by the preset generator not only shares content features with the input over the whole image, but each corresponding input and output patch also has this correspondence.
  • As shown in Figure 4, for a patch of a car in the generated daytime true-color image, the corresponding nighttime infrared image before conversion should also contain the corresponding patch of the car, not a patch of trees or another background part.
  • The encoder part of the preset generator in I2V-NET can be denoted $G_{enc}$. By reusing $G_{enc}$ and appending a preset multi-layer perceptron (MLP) network $H_l$ after it (the MLP network in Figure 4), the input and output features of the preset generator are stacked; for example, patches of the real nighttime infrared image real_A and of the generated corresponding true-color image fake_B are fed into $G_{enc}$ and $H_l$ respectively to generate a series of features.
  • Then the corresponding PatchNCE loss function can be used to contrast the differences between image blocks, so as to match the image blocks at corresponding positions of the input and output.
  • Specifically, the semantic structure loss function can include a multi-layer infrared image block contrast loss function $L_{PatchNCE}(G,H,X)=\mathbb{E}_{x\sim X}\sum_{l=1}^{L}\sum_{s=1}^{S_l}\ell(\hat{z}_l^s,z_l^s,z_l^{S\setminus s})$
  • where l can be a target convolutional layer in the encoder of the preset generator, such as any single convolutional layer, several convolutional layers, or all convolutional layers in the encoder; the more layers attended to, the better the effect can be;
  • L can be the number of target convolutional layers in the encoder of the preset generator, i.e., the number of attended convolutional layers in the encoder;
  • X can be the first image domain set;
  • s can be a target position in each target convolutional layer, i.e., an attended position in each convolutional layer;
  • $S_l$ can be the number of target positions in the target convolutional layer, i.e., the number of attended positions;
  • $z_l$ can be the feature generated after passing through the encoder and the preset multi-layer perceptron network; $z_l^s$ can be the feature at the target position of the input image (such as an infrared image) corresponding to the converted true-color image; $z_l^{S\setminus s}$ can be a feature at a target position of the input image that does not correspond to the converted true-color image; and $\hat{z}_l^s$ can be the feature at the corresponding target position of the converted true-color image.
  • Correspondingly, a corresponding PatchNCE loss function can also be used for the true-color images in the second image domain set (i.e., target-domain sample images); that is, the semantic structure loss function can include a multi-layer true-color image block contrast loss function $L_{PatchNCE}(G,H,Y)=\mathbb{E}_{y\sim Y}\sum_{l=1}^{L}\sum_{s=1}^{S_l}\ell(\hat{z}_l^s,z_l^s,z_l^{S\setminus s})$
  • where Y can be the second image domain set; $z_l^s$ can be the feature at the target position of the input image (such as a true-color image) corresponding to the converted true-color image; $z_l^{S\setminus s}$ can be a feature at a target position of the input image that does not correspond to the converted true-color image; $\hat{z}_l^s$ can be the feature at the corresponding target position of the converted true-color image; and $y\sim Y$ can indicate that y obeys the probability distribution of the second image domain set Y.
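A sketch of how the multi-layer PatchNCE loss could be assembled from such features: for each target layer, patches at the same sampled position in input and output are positives and all other sampled positions are negatives. The feature shapes and the sampling scheme are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def patch_nce(feats_in, feats_out, tau=0.07):
    """feats_in/feats_out: lists over target layers l, each a (S_l, K) tensor
    of patch features after G_enc + H_l and position sampling; the patch at
    the same index s is the corresponding (positive) pair."""
    loss = 0.0
    for z, z_hat in zip(feats_in, feats_out):
        z = F.normalize(z, dim=1)
        z_hat = F.normalize(z_hat, dim=1)
        logits = z_hat @ z.t() / tau           # (S_l, S_l) similarity matrix
        targets = torch.arange(z.size(0))      # diagonal = corresponding patch
        loss = loss + F.cross_entropy(logits, targets)
    return loss
```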
  • Correspondingly, for the inter-frame difference consistency loss function: existing algorithms typically train an additional auxiliary structure to stabilize the generated frame sequence and thereby alleviate inter-frame flicker in the generated video, for example by using a dual-channel generator;
  • others obtain stable video by computing an optical-flow loss. But these algorithms require a large amount of computation and are inconvenient in practice.
  • In this embodiment, the inter-frame difference consistency loss function $L_{temp}$ can be used to alleviate the problem of inter-frame flicker in the generated video frame sequence.
  • The formula of the inter-frame difference consistency loss function can be: $L_{temp}=\sum_{t=1}^{T-1}\left\|\phi(I_t)-\phi(O_t)\right\|$
  • where T can be the total number of frames of the infrared video;
  • $I_t$ can be the input frame sequence of the preset generator;
  • $O_t$ can be the output frame sequence of the preset generator;
  • $\phi(x_t)=f_m(x_{t+1})-f_m(x_t)$;
  • $x_t$ can be the gap between the (t+1)-th frame and the t-th frame;
  • m can be the target feature layer;
  • $f_m(x_t)$ can be the feature extracted by a convolutional layer of a preset convolutional neural network, for example obtained from the output of each convolutional layer of a pre-trained vgg16 (a convolutional neural network, i.e., the preset convolutional neural network).
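A hedged sketch of this loss: $\phi(x_t)=f_m(x_{t+1})-f_m(x_t)$, with $f_m$ taken from a pre-trained vgg16 as stated above. The particular layer slice and the L1 distance are assumptions made for illustration.

```python
import torch
import torchvision.models as models

# frozen pre-trained vgg16 slice as the feature extractor f_m
vgg_features = models.vgg16(pretrained=True).features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def frame_diff(frames):
    """frames: (T, C, H, W); returns phi(x_t) = f_m(x_{t+1}) - f_m(x_t)."""
    f = vgg_features(frames)
    return f[1:] - f[:-1]

def temporal_loss(in_frames, out_frames):
    # constrain the output sequence to share the input's inter-frame variation
    return (frame_diff(in_frames) - frame_diff(out_frames)).abs().mean()
```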
  • It should be noted that the preset loss function in this step can be the sum of the products of each loss function with its corresponding loss function weight coefficient; for example, the preset loss function can include the above semantic structure loss function, inter-frame difference consistency loss function, and generative adversarial loss function, each multiplied by its own weight coefficient and summed.
  • Correspondingly, the specific process by which the processor uses the first image domain set and the second image domain set to iteratively train the weight parameters of the preset generator and the preset discriminator and obtain the trained preset generator can be set by the designer.
  • For example, the processor can build the preset generator and the preset discriminator; use the preset generator to perform image conversion on an infrared image to obtain a converted true-color image; let the preset discriminator discriminate the converted true-color image against the true-color image to obtain the true-or-false result corresponding to the converted true-color image; and judge whether a preset number of iterations is reached. If it is, training ends; if not, the preset loss function can be used to adjust the weight parameters of the preset generator and the preset discriminator, and after the adjustment the step of performing image conversion on the infrared image with the preset generator is repeated to continue the iterative training. For example, the weight parameters of the preset generator and the preset discriminator of I2V-NET in this embodiment can be trained with the pytorch 1.7.0 deep learning framework (an open-source Python machine learning library), using xavier (a parameter initialization method) random parameter initialization and the adam optimizer, with an initial learning rate of 0.0002.
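Putting the pieces together, a hedged sketch of the training configuration stated above (xavier initialization, the adam optimizer, initial learning rate 0.0002, and a weighted sum of the three losses). The weight coefficients and the reuse of the Generator/PatchDiscriminator classes from the earlier sketches are illustrative assumptions.

```python
import torch
import torch.nn as nn

def xavier_init(m):
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d, nn.Linear)):
        nn.init.xavier_normal_(m.weight)

G, D = Generator(), PatchDiscriminator()   # defined in the sketches above
G.apply(xavier_init)
D.apply(xavier_init)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)   # initial lr 0.0002
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)

def generator_loss(l_sem, l_temp, l_gan, w=(1.0, 1.0, 1.0)):
    # preset loss = sum of each loss times its weight coefficient
    return w[0] * l_sem + w[1] * l_temp + w[2] * l_gan
```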
  • Step 103: Obtain a conversion generator according to the trained preset generator, so as to use the conversion generator to perform image conversion on an actual infrared video to obtain a target true-color video.
  • It can be understood that, in this step, the processor can use the trained preset generator to obtain a generator model (i.e., the conversion generator) for true-color conversion of the nighttime infrared video that actually needs to be converted (i.e., the actual infrared video).
  • Correspondingly, the specific way of obtaining the conversion generator can be set by the designer. For example, the processor can load the weight parameters of the trained preset generator into a newly built generator model and determine the loaded generator model as the conversion generator; the processor may also directly determine the trained preset generator as the conversion generator, which is not limited in this embodiment.
  • Correspondingly, the method provided by this embodiment can also include the process of using the conversion generator to perform image conversion on the actual infrared video. For example, the processor can obtain an image set to be converted and use the conversion generator to perform image conversion on the infrared images to be converted, obtaining the target true-color video; the image set to be converted includes the infrared images to be converted corresponding to the actual infrared video.
  • That is, the processor can read the nighttime infrared video frame sequence in real time and send it to the loaded conversion generator to generate continuous and stable daytime true-color video, realizing the night-to-day infrared true-color conversion task.
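A minimal deployment sketch of this step: load the trained weights into a fresh generator model and convert infrared frames. The file name and tensor layout are illustrative assumptions.

```python
import torch

G = Generator()                                     # fresh generator model
G.load_state_dict(torch.load("conversion_generator.pth"))  # hypothetical file
G.eval()

@torch.no_grad()
def convert(ir_frames):
    """ir_frames: (B, C, H, W) infrared input; returns converted true color."""
    return G(ir_frames)
```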
  • Correspondingly, the method provided by this embodiment can also include a test process for the conversion generator or the trained preset generator. For example, the processor can obtain a test image set, perform image conversion on it to obtain a test converted true-color video, and display the frame sequence of the test converted true-color video against the sequence of the test true-color video for comparison;
  • the test image set can include test infrared images corresponding to a test infrared video (such as a nighttime infrared video);
  • the test true-color video can be a true-color video of the same scene as the test infrared video (such as a nighttime true-color video).
  • It can be seen that, in the embodiment of the present invention, the weight parameters of the preset generator and the preset discriminator are iteratively trained with the first image domain set and the second image domain set based on inter-frame difference consistency and contrastive learning to obtain the trained preset generator. Introducing the idea of contrastive learning avoids the strict two-way mapping of the existing cycle consistency idea, applies better to infrared image conversion across a time span, and lets the converted daytime true-color image retain the original semantic structure information of the nighttime infrared image, realizing the conversion and generation of realistic, detailed daytime true-color images; based on inter-frame difference consistency, the idea of inter-frame difference is used to constrain the inter-frame difference between input and output, which effectively prevents inter-frame flicker in the generated true-color video.
  • Correspondingly, the embodiment of the present invention also provides an infrared image conversion training apparatus; the infrared image conversion training apparatus described below and the infrared image conversion training method described above may be referred to in correspondence with each other.
  • FIG. 5 is a structural block diagram of an infrared image conversion training device provided by an embodiment of the present invention.
  • the device can include:
  • an acquisition module 10, configured to acquire a first image domain set and a second image domain set, wherein the first image domain set includes infrared images corresponding to an infrared video, the second image domain set includes true-color images corresponding to a true-color video, and the scene of the infrared video is the same as that of the true-color video;
  • a training module 20, used to iteratively train the weight parameters of a preset generator and a preset discriminator by using the first image domain set and the second image domain set based on inter-frame difference consistency and contrastive learning, and obtain a trained preset generator, wherein the preset generator is used to convert the infrared images into converted true-color images and the preset discriminator is used to judge whether an input true-color image is true or false;
  • a generation module 30, configured to obtain a conversion generator according to the trained preset generator, so as to use the conversion generator to perform image conversion on an actual infrared video to obtain a target true-color video.
  • the training module 20 may include:
  • an inter-frame difference consistency training sub-module, used to iteratively train the weight parameters of the preset generator and the preset discriminator by using the first image domain set and the second image domain set based on the idea of inter-frame difference consistency, so that the inter-frame difference between corresponding consecutive frame images input to and output from the preset generator is smaller than a preset value;
  • a contrastive learning training sub-module, used to iteratively train the weight parameters of the preset generator and the preset discriminator by using the first image domain set and the second image domain set based on the idea of contrastive learning of semantic structure, so that the semantic information of corresponding images input to and output from the preset generator remains consistent.
  • Optionally, the idea of contrastive learning is specifically implemented based on a semantic structure loss function, and the semantic structure loss function includes a multi-layer infrared image block contrast loss function and a multi-layer true-color image block contrast loss function;
  • the multi-layer infrared image block contrast loss function is $L_{PatchNCE}(G,H,X)=\mathbb{E}_{x\sim X}\sum_{l=1}^{L}\sum_{s=1}^{S_l}\ell(\hat{z}_l^s,z_l^s,z_l^{S\setminus s})$, and the multi-layer true-color image block contrast loss function is $L_{PatchNCE}(G,H,Y)=\mathbb{E}_{y\sim Y}\sum_{l=1}^{L}\sum_{s=1}^{S_l}\ell(\hat{z}_l^s,z_l^s,z_l^{S\setminus s})$; X is the first image domain set, Y is the second image domain set, l is a target convolutional layer in the encoder of the preset generator, L is the number of target convolutional layers in the encoder, s is a target position in each target convolutional layer, and $S_l$ is the number of target positions in the target convolutional layer;
  • $z_l$ is the feature generated after passing through the encoder and the preset multi-layer perceptron network; $z_l^s$ is the feature at the target position of the infrared image or the true-color image corresponding to the converted true-color image; $z_l^{S\setminus s}$ is a feature at a target position of the infrared image or the true-color image that does not correspond to the converted true-color image; and $\hat{z}_l^s$ is the feature at the corresponding target position of the converted true-color image.
  • Optionally, the idea of inter-frame difference consistency is implemented based on an inter-frame difference consistency loss function, where the inter-frame difference consistency loss function is $L_{temp}=\sum_{t=1}^{T-1}\left\|\phi(I_t)-\phi(O_t)\right\|$;
  • T is the total number of frames of the infrared video;
  • $I_t$ is the input frame sequence of the preset generator and $O_t$ is the output frame sequence of the preset generator;
  • $\phi(x_t)=f_m(x_{t+1})-f_m(x_t)$;
  • $x_t$ is the gap between the (t+1)-th frame and the t-th frame;
  • m is the target feature layer;
  • $f_m(x_t)$ is the feature extracted by a convolutional layer of the preset convolutional neural network.
  • the training module 20 may also include:
  • a generative adversarial training sub-module, used to iteratively train the weight parameters of the preset generator and the preset discriminator at the same time by using the first image domain set and the second image domain set based on the generative adversarial idea, so that the scores of the true-or-false results output by the preset discriminator for the true-color image and for the converted true-color image are equal.
  • Optionally, the generative adversarial idea is implemented based on a generative adversarial loss function, where the generative adversarial loss function is $L_{GAN}(G,D,X,Y)=\mathbb{E}_{y\sim Y}[\log D(y_k)]+\mathbb{E}_{x\sim X}[\log(1-D(G(x_i)))]$; $G(\cdot)$ is the output of the preset generator, $D(\cdot)$ is the output of the preset discriminator, X is the infrared image, Y is the true-color image, $y_k$ is the true-color video frame image of the k-th frame, and $x_i$ is the infrared video frame image of the i-th frame.
  • Optionally, the training module 20 can be specifically configured to iteratively train the weight parameters of the preset generator and the preset discriminator by using the first image domain set and the second image domain set according to a preset loss function, and obtain the trained preset generator, wherein the preset loss function is the sum of the products of the semantic structure loss function, the inter-frame difference consistency loss function, and the generative adversarial loss function with their respective loss function weight coefficients.
  • the device may also include:
  • the conversion acquisition module is used to obtain the image set to be converted; wherein, the image set to be converted includes the infrared image to be converted corresponding to the actual infrared video;
  • the conversion generating module is used for performing image conversion on the infrared image to be converted by using the conversion generator to obtain the target true-color video.
  • the acquisition module 10 may include:
  • a video acquisition sub-module, used to acquire training video data, wherein the training video data includes the infrared video and the true-color video;
  • a framing sub-module, used to divide the training video data into frames to obtain single-frame images;
  • a conversion sub-module, used to convert the single-frame images to obtain target single-frame images of a preset image specification;
  • a splicing sub-module, used to splice a preset number of consecutive target single-frame images in video frame order to obtain the infrared images corresponding to the infrared video and the true-color images corresponding to the true-color video.
  • It can be seen that, in the embodiment of the present invention, the training module 20 iteratively trains the weight parameters of the preset generator and the preset discriminator by using the first image domain set and the second image domain set based on inter-frame difference consistency and contrastive learning, to obtain the trained preset generator. Introducing the idea of contrastive learning avoids the strict two-way mapping of the existing cycle consistency idea, applies better to infrared image conversion across a time span, and lets the converted daytime true-color image retain the original semantic structure information of the nighttime infrared image, realizing the conversion and generation of realistic, detailed daytime true-color images; based on inter-frame difference consistency, the idea of inter-frame difference is used to constrain the inter-frame difference between input and output, which effectively prevents inter-frame flicker in the generated true-color video.
  • Correspondingly, the embodiment of the present invention also provides an infrared image conversion training device; the infrared image conversion training device described below and the infrared image conversion training method described above may be referred to in correspondence with each other.
  • FIG. 6 is a schematic structural diagram of an infrared image conversion training device provided by an embodiment of the present invention.
  • the conversion training device may include:
  • a memory D1 for storing a computer program;
  • a processor D2 configured to implement the steps of the infrared image conversion training method provided by the above method embodiments when executing the computer program.
  • FIG. 7 is a schematic structural diagram of an infrared image conversion training device provided in this embodiment.
  • Specifically, the conversion training device 310 may differ greatly due to different configurations or performance, and may include one or more processors (central processing units, CPU) 322 (for example, one or more processors), memory 332, and one or more storage media 330 (for example, one or more mass storage devices) for storing application programs 342 or data 344.
  • the memory 332 and the storage medium 330 may be temporary storage or persistent storage.
  • the program stored in the storage medium 330 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the electronic device.
  • the central processing unit 322 may be configured to communicate with the storage medium 330 , and execute a series of instruction operations in the storage medium 330 on the conversion training device 310 .
  • The conversion training device 310 may also include one or more power sources 326, one or more wired or wireless network interfaces 350, one or more input/output interfaces 358, and/or one or more operating systems 341, for example Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.
  • the steps in the infrared image conversion training method described above can be realized by the structure of the infrared image conversion training device.
  • the embodiment of the present invention also provides a computer-readable storage medium.
  • The computer-readable storage medium described below and the infrared image conversion training method described above may be referred to in correspondence with each other.
  • a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the steps of the infrared image conversion training method provided by the above method embodiments are implemented.
  • The computer-readable storage medium can specifically be a USB flash drive, a mobile hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or another readable storage medium that can store program code.
  • each embodiment in the description is described in a progressive manner, each embodiment focuses on the difference from other embodiments, and the same and similar parts of each embodiment can be referred to each other.
  • The description is relatively simple; for relevant details, please refer to the description of the method part.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

An infrared image conversion training method, apparatus, device, and computer-readable storage medium. The method includes: acquiring a first image domain set and a second image domain set; based on inter-frame difference consistency and contrastive learning, iteratively training the weight parameters of a preset generator and a preset discriminator by using the first image domain set and the second image domain set, and obtaining the trained preset generator; and obtaining a conversion generator according to the trained preset generator. The method introduces the idea of contrastive learning, avoiding the strict two-way mapping of the existing cycle consistency idea; it is better suited to infrared image conversion across a time span, and enables the converted daytime true-color image to retain the original semantic structure information of the nighttime infrared image. Based on inter-frame difference consistency, the idea of inter-frame difference is used to constrain the inter-frame difference between input and output, which effectively prevents inter-frame flicker in the generated true-color video.

Description

Infrared image conversion training method, apparatus, device and storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on October 26, 2021, with application number 202111247706.4 and invention title "Infrared image conversion training method, apparatus, device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the technical field of image processing, and in particular to an infrared image conversion training method, apparatus, device, and computer-readable storage medium.
Background
Owing to its unique thermal imaging principle, infrared imaging technology can obtain images at night or in various extreme environments that the naked eye and various visible-light detectors cannot capture. With this unique advantage, it is widely used in many fields such as night monitoring, fire detection, and night driving assistance. However, infrared images also have obvious shortcomings: the grayscale images they form do not match human visual perception, and, compared with visible-light true-color images, infrared images lack much detailed texture information, which makes various back-end image algorithms difficult to implement.
In recent years, deep learning has achieved results far exceeding traditional algorithms in various image processing tasks, and infrared true-color conversion algorithms based on deep neural networks have appeared in the field of infrared color night vision. Compared with traditional fusion techniques, these algorithms can directly convert infrared images into visible-light true-color images and thus have certain technical advantages. However, deep neural networks that realize infrared true-color conversion have the following disadvantages. Algorithms using supervised learning require strictly paired and registered samples, and for a cross-time conversion task such as infrared color night vision, such a data set is almost impossible to obtain. Algorithms using unsupervised learning are basically based on the idea of cycle consistency; although such algorithms do not require strictly paired samples, their training requirements are often too strict: the two-way mapping between infrared and visible-light images requires generators, discriminators, and other models in pairs, demanding high computing power; when the difference between infrared and visible-light images is too large, the cycle consistency idea is prone to mode collapse, i.e., training failure, and the quality of the generated images is not stable; moreover, existing algorithms lack constraints on temporal continuity, so when the conversion task becomes a continuous-frame video conversion task, style drift or inter-frame flicker appears, greatly degrading the visual quality of the generated video.
Therefore, how to train an infrared image conversion model that generates realistic and detailed visible-light true-color images while avoiding inter-frame flicker, thereby improving user experience, is a problem urgently needing a solution.
Summary of the Invention
The object of the present invention is to provide an infrared image conversion training method, apparatus, device, and computer-readable storage medium, so that the trained infrared image conversion model can generate realistic and detailed visible-light true-color images while avoiding inter-frame flicker.
To solve the above technical problem, the present invention provides an infrared image conversion training method, including:
acquiring a first image domain set and a second image domain set, wherein the first image domain set includes infrared images corresponding to an infrared video, the second image domain set includes true-color images corresponding to a true-color video, and the scene of the infrared video is the same as that of the true-color video;
based on inter-frame difference consistency and contrastive learning, iteratively training the weight parameters of a preset generator and a preset discriminator by using the first image domain set and the second image domain set, and obtaining the trained preset generator, wherein the preset generator is used to convert the infrared images into converted true-color images, and the preset discriminator is used to judge whether an input true-color image is true or false;
obtaining a conversion generator according to the trained preset generator, so as to use the conversion generator to perform image conversion on an actual infrared video to obtain a target true-color video.
Optionally, iteratively training the weight parameters of the preset generator and the preset discriminator by using the first image domain set and the second image domain set based on inter-frame difference consistency and contrastive learning includes:
based on the idea of inter-frame difference consistency, iteratively training the weight parameters of the preset generator and the preset discriminator by using the first image domain set and the second image domain set, so that the inter-frame difference between corresponding consecutive frame images input to and output from the preset generator is smaller than a preset value;
based on the idea of contrastive learning of semantic structure, iteratively training the weight parameters of the preset generator and the preset discriminator by using the first image domain set and the second image domain set, so that the semantic information of corresponding images input to and output from the preset generator remains consistent.
Optionally, the contrastive learning idea is specifically implemented based on a semantic structure loss function, and the semantic structure loss function includes a multi-layer infrared image block contrast loss function and a multi-layer true-color image block contrast loss function;
the multi-layer infrared image block contrast loss function is
$L_{PatchNCE}(G,H,X)=\mathbb{E}_{x\sim X}\sum_{l=1}^{L}\sum_{s=1}^{S_l}\ell(\hat{z}_l^s,z_l^s,z_l^{S\setminus s})$
and the multi-layer true-color image block contrast loss function is
$L_{PatchNCE}(G,H,Y)=\mathbb{E}_{y\sim Y}\sum_{l=1}^{L}\sum_{s=1}^{S_l}\ell(\hat{z}_l^s,z_l^s,z_l^{S\setminus s})$
where X is the first image domain set, Y is the second image domain set, l is a target convolutional layer in the encoder of the preset generator, L is the number of target convolutional layers in the encoder, s is a target position in each target convolutional layer, $S_l$ is the number of target positions in the target convolutional layer, and $z_l$ is the feature generated after passing through the encoder and a preset multi-layer perceptron network;
$z_l^s$ is the feature at the target position of the infrared image or the true-color image that corresponds to the converted true-color image;
$z_l^{S\setminus s}$ is a feature at a target position of the infrared image or the true-color image that does not correspond to the converted true-color image;
$\hat{z}_l^s$ is the feature at the corresponding target position of the converted true-color image.
Optionally, the inter-frame difference consistency idea is specifically implemented based on an inter-frame difference consistency loss function, where the inter-frame difference consistency loss function is
$L_{temp}=\sum_{t=1}^{T-1}\left\|\phi(I_t)-\phi(O_t)\right\|$
where T is the total number of frames of the infrared video, $I_t$ is the input frame sequence of the preset generator, $O_t$ is the output frame sequence of the preset generator, $\phi(x_t)=f_m(x_{t+1})-f_m(x_t)$, $x_t$ is the gap between the (t+1)-th frame and the t-th frame, m is the target feature layer, and $f_m(x_t)$ is the feature extracted by a convolutional layer of a preset convolutional neural network.
Optionally, iteratively training the weight parameters of the preset generator and the preset discriminator by using the first image domain set and the second image domain set based on inter-frame difference consistency and contrastive learning further includes:
based on the generative adversarial idea, simultaneously and iteratively training the weight parameters of the preset generator and the preset discriminator by using the first image domain set and the second image domain set, so that the scores of the true-or-false results output by the preset discriminator for the true-color image and for the converted true-color image are equal.
Optionally, the generative adversarial idea is specifically implemented based on a generative adversarial loss function, where the generative adversarial loss function is
$L_{GAN}(G,D,X,Y)=\mathbb{E}_{y\sim Y}[\log D(y_k)]+\mathbb{E}_{x\sim X}[\log(1-D(G(x_i)))]$
where $G(\cdot)$ is the output of the preset generator, $D(\cdot)$ is the output of the preset discriminator, X is the infrared image, Y is the true-color image, $y_k$ is the true-color video frame image of the k-th frame, and $x_i$ is the infrared video frame image of the i-th frame.
Optionally, iteratively training the weight parameters of the preset generator and the preset discriminator by using the first image domain set and the second image domain set based on inter-frame difference consistency and contrastive learning, and obtaining the trained preset generator, includes:
according to a preset loss function, iteratively training the weight parameters of the preset generator and the preset discriminator by using the first image domain set and the second image domain set, and obtaining the trained preset generator, wherein the preset loss function is the sum of the products of the semantic structure loss function, the inter-frame difference consistency loss function, and the generative adversarial loss function with their respective loss function weight coefficients.
Optionally, after obtaining the conversion generator according to the trained preset generator, the method further includes:
acquiring an image set to be converted, wherein the image set to be converted includes infrared images to be converted corresponding to the actual infrared video;
performing image conversion on the infrared images to be converted by using the conversion generator to obtain the target true-color video.
Optionally, acquiring the first image domain set and the second image domain set includes:
acquiring training video data, wherein the training video data includes the infrared video and the true-color video;
dividing the training video data into frames to obtain single-frame images;
converting the single-frame images to obtain target single-frame images of a preset image specification;
splicing a preset number of consecutive target single-frame images in video frame order to obtain the infrared images corresponding to the infrared video and the true-color images corresponding to the true-color video.
The present invention also provides an infrared image conversion training apparatus, including:
an acquisition module, configured to acquire a first image domain set and a second image domain set, wherein the first image domain set includes infrared images corresponding to an infrared video, the second image domain set includes true-color images corresponding to a true-color video, and the scene of the infrared video is the same as that of the true-color video;
a training module, configured to iteratively train the weight parameters of a preset generator and a preset discriminator by using the first image domain set and the second image domain set based on inter-frame difference consistency and contrastive learning, and obtain a trained preset generator, wherein the preset generator is used to convert the infrared images into converted true-color images, and the preset discriminator is used to judge whether an input true-color image is true or false;
a generation module, configured to obtain a conversion generator according to the trained preset generator, so as to use the conversion generator to perform image conversion on an actual infrared video to obtain a target true-color video.
The present invention also provides an infrared image conversion training device, including:
a memory for storing a computer program;
a processor configured to implement the steps of the above infrared image conversion training method when executing the computer program.
In addition, the present invention also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the above infrared image conversion training method are implemented.
The infrared image conversion training method provided by the present invention includes: acquiring a first image domain set and a second image domain set, wherein the first image domain set includes infrared images corresponding to an infrared video, the second image domain set includes true-color images corresponding to a true-color video, and the scenes of the infrared video and the true-color video are the same; based on inter-frame difference consistency and contrastive learning, iteratively training the weight parameters of a preset generator and a preset discriminator by using the first image domain set and the second image domain set, and obtaining the trained preset generator, wherein the preset generator is used to convert the infrared images into converted true-color images and the preset discriminator is used to judge whether an input true-color image is true or false; and obtaining a conversion generator according to the trained preset generator, so as to use the conversion generator to perform image conversion on an actual infrared video to obtain a target true-color video.
It can be seen that, in the present invention, by iteratively training the weight parameters of the preset generator and the preset discriminator with the first image domain set and the second image domain set based on inter-frame difference consistency and contrastive learning, the trained preset generator is obtained. Introducing the idea of contrastive learning avoids the strict two-way mapping of the existing cycle consistency idea, adapts better to infrared image conversion across a time span, and enables the converted daytime true-color image to retain the original semantic structure information of the nighttime infrared image, realizing the conversion and generation of realistic, detailed daytime true-color images; based on inter-frame difference consistency, the idea of inter-frame difference is used to constrain the inter-frame difference between input and output, effectively preventing inter-frame flicker in the generated true-color video. In addition, the present invention also provides an infrared image conversion training apparatus, device, and computer-readable storage medium, which have the same beneficial effects.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from the provided drawings without creative work.
Fig. 1 is a flowchart of an infrared image conversion training method provided by an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a residual module of an infrared image conversion training method provided by an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a Markov discriminator of an infrared image conversion training method provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a semantic structure loss function based on contrastive learning of an infrared image conversion training method provided by an embodiment of the present invention;
Fig. 5 is a structural block diagram of an infrared image conversion training apparatus provided by an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of an infrared image conversion training device provided by an embodiment of the present invention;
Fig. 7 is a specific schematic structural diagram of an infrared image conversion training device provided by this embodiment.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present invention.
Please refer to Fig. 1, which is a flowchart of an infrared image conversion training method provided by an embodiment of the present invention. The method may include:
Step 101: Obtain a first image domain set and a second image domain set, wherein the first image domain set includes infrared images corresponding to an infrared video, the second image domain set includes true-color images corresponding to a true-color video, and the scene of the infrared video is the same as that of the true-color video.
It can be understood that the first image domain set in this step can be a set of infrared images corresponding to an infrared video (such as a nighttime infrared video), and the second image domain set in this step can be a set of true-color images corresponding to a true-color video of the same scene as the infrared video (such as a daytime true-color video).
Specifically, the way in which the processor obtains the first image domain set and the second image domain set can be set by the designer according to practical scenarios and user needs. For example, the processor can directly receive the first image domain set and the second image domain set. The processor can also preprocess received infrared video and true-color video to construct the first image domain set and the second image domain set; for example, a vehicle-mounted binocular camera can collect dual-light video data of the same scene in both daytime and nighttime periods, i.e., daytime infrared video and visible-light true-color video as well as nighttime infrared video and visible-light true-color video, so that the processor can preprocess the received nighttime infrared video and daytime visible-light true-color video to construct the first image domain set and the second image domain set; the binocular camera can perform registration and synchronization in hardware to ensure that the frame rate and scene of the nighttime infrared video and the daytime visible-light true-color video are the same.
Correspondingly, the way in which the processor preprocesses the received infrared video and true-color video to construct the first image domain set and the second image domain set can be set by the designer. For example, the processor acquires training video data; divides the training video data into frames to obtain single-frame images; converts the single-frame images to obtain target single-frame images of a preset image specification; and splices a preset number of consecutive target single-frame images in video frame order to obtain the infrared images corresponding to the infrared video and the true-color images corresponding to the true-color video, where the training video data includes the infrared video and the true-color video. For example, the processor can perform three preprocessing operations on the received nighttime infrared video and daytime true-color video: framing, conversion (resize), and splicing (merge). The framing operation converts the nighttime infrared video and the daytime true-color video into multiple consecutive single-frame images; the conversion operation uniformly resizes the single-frame images into target single-frame images of a preset image specification (such as 256*256); the splicing operation splices a preset number (n) of consecutive target single-frame images, for example along the width or length direction, so that every n consecutive frames form one image (i.e., one infrared image or true-color image) of the set. That is, whether in the first image domain set corresponding to the infrared video (such as a nighttime infrared video) or in the second image domain set corresponding to the true-color video (such as a daytime true-color video), each image in the set (i.e., the sample set) can be spliced from the target single-frame images corresponding to n consecutive original frame-sequence images (i.e., single-frame images), where n can be a positive integer between 2 and 5. The processor may also perform only the two preprocessing operations of framing and splicing, or of framing and conversion, on the received nighttime infrared video and daytime true-color video, which is not limited in this embodiment.
Step 102: iteratively train, based on inter-frame difference consistency and contrastive learning, the weight parameters of a preset generator and a preset discriminator by using the first image domain set and the second image domain set, to obtain a trained preset generator; where the preset generator is used to convert the infrared images into converted true-color images, and the preset discriminator is used to discriminate a true/false result corresponding to an input true-color image.
It can be understood that this embodiment provides an infrared true-color night-vision algorithm based on contrastive learning and an inter-frame difference loss: I2V-NET (Infrared to Visible Network). It can obtain more detailed local style features than existing cycle-consistency algorithms and can effectively avoid inter-frame flicker. The overall training structure of I2V-NET is simple; it may include one generator (i.e., the preset generator) and one discriminator (i.e., the preset discriminator), without other complicated auxiliary structures, so training is fast and the effect is good.
Specifically, in this embodiment the preset generator may convert the infrared images corresponding to the infrared video into corresponding true-color images (i.e., converted true-color images); the preset discriminator may discriminate a true/false result for an input true-color image, i.e., judge whether the input is a true-color image from the second image domain set (a real true-color image) or a converted true-color image generated by the preset generator. The specific structures of the preset generator and preset discriminator may be set by the designer according to practical scenarios and user requirements. The preset generator may include an encoder, a translator, and a decoder; the encoder may include three groups of "Conv + instance norm + relu" and downsamples the input frame-sequence image into feature maps with a set number of channels, where Conv is a convolution layer, instance norm is a normalization layer, and relu is a non-linear activation layer; the translator consists of a preset number (m) of residual modules (Resnet_block), as shown in FIG. 2, which recombine and transform the feature maps produced by the encoder's downsampling; the decoder restores the image size through deconvolution operations and finally produces the generated image (i.e., the converted true-color image). The preset discriminator may adopt a Markovian discriminator (PatchGAN) as shown in FIG. 3, which may consist entirely of convolution layers and outputs a k*k matrix; the mean of the real-or-fake outputs of this matrix is taken as the true/false result, and each entry of the matrix represents one receptive field of the input image (e.g., a converted true-color image or a true-color image), corresponding to one patch of the input. A PyTorch sketch of such a layout follows.
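The sketch below is a non-authoritative rendering of the layout just described: the channel widths, kernel sizes, number m of residual modules, and three input channels are illustrative assumptions, since the embodiment does not fix them:

    import torch
    import torch.nn as nn

    class ResnetBlock(nn.Module):
        """One residual module of the translator (cf. FIG. 2)."""
        def __init__(self, ch):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(ch, ch, 3, 1, 1), nn.InstanceNorm2d(ch), nn.ReLU(True),
                nn.Conv2d(ch, ch, 3, 1, 1), nn.InstanceNorm2d(ch))
        def forward(self, x):
            return x + self.body(x)          # residual connection

    class Generator(nn.Module):
        def __init__(self, in_ch=3, base=64, m=9):
            super().__init__()
            def group(i, o, k, s, p):        # one "Conv+instance norm+relu" group
                return nn.Sequential(nn.Conv2d(i, o, k, s, p),
                                     nn.InstanceNorm2d(o), nn.ReLU(True))
            self.encoder = nn.Sequential(group(in_ch, base, 7, 1, 3),
                                         group(base, base * 2, 3, 2, 1),
                                         group(base * 2, base * 4, 3, 2, 1))
            self.translator = nn.Sequential(*[ResnetBlock(base * 4)
                                              for _ in range(m)])
            self.decoder = nn.Sequential(    # deconvolutions restore the size
                nn.ConvTranspose2d(base * 4, base * 2, 3, 2, 1, output_padding=1),
                nn.InstanceNorm2d(base * 2), nn.ReLU(True),
                nn.ConvTranspose2d(base * 2, base, 3, 2, 1, output_padding=1),
                nn.InstanceNorm2d(base), nn.ReLU(True),
                nn.Conv2d(base, 3, 7, 1, 3), nn.Tanh())
        def forward(self, x):
            return self.decoder(self.translator(self.encoder(x)))

    class PatchDiscriminator(nn.Module):
        """Markovian (PatchGAN) discriminator: fully convolutional."""
        def __init__(self, in_ch=3, base=64):
            super().__init__()
            layers = [nn.Conv2d(in_ch, base, 4, 2, 1), nn.LeakyReLU(0.2, True)]
            ch = base
            for _ in range(2):
                layers += [nn.Conv2d(ch, ch * 2, 4, 2, 1),
                           nn.InstanceNorm2d(ch * 2), nn.LeakyReLU(0.2, True)]
                ch *= 2
            layers += [nn.Conv2d(ch, 1, 4, 1, 1)]   # k*k map of patch scores
            self.net = nn.Sequential(*layers)
        def forward(self, x):
            return self.net(x)   # the mean of this map is the true/false score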
It should be noted that the process in this step of iteratively training the weight parameters of the I2V-NET preset generator and preset discriminator based on inter-frame difference consistency and contrastive learning can, after a number of iterations, yield suitable weight parameters (e.g., deep neural network weights) for both, so that the trained preset generator can be used to obtain a generator for performing image conversion on actual infrared videos (i.e., the conversion generator).
Correspondingly, the specific manner of iteratively training the weight parameters of the preset generator and the preset discriminator using the first and second image domain sets, based on inter-frame difference consistency and contrastive learning, may be set by the designer according to practical scenarios and user requirements. For example, the processor may iteratively train the weight parameters based on the inter-frame difference consistency idea, so that the inter-frame difference of the corresponding consecutive frame images input to and output from the preset generator is smaller than a preset value, and based on the contrastive learning idea of semantic structure, so that the semantic information of the corresponding images input to and output from the preset generator remains consistent. For example, the processor may use a preset loss function embodying both ideas to iteratively train the weight parameters and obtain the trained preset generator. The preset loss function may include a contrastive-learning-based semantic structure loss function, used to adjust the semantic structure information of the generated images so that, following the contrastive learning idea, the converted true-color images retain the original semantic structure information of the infrared images and the semantic information of corresponding input and output images remains consistent; that is, the semantic-structure contrastive learning idea may specifically be implemented on the basis of a semantic structure loss function. The preset loss function may also include an inter-frame difference consistency loss function, used to adjust the inter-frame difference between the generated consecutive frame images and the input consecutive frame images so that it is smaller than a preset value; with this temporal loss, on the premise that the preset generator completes the generation task correctly, the inter-frame variations of the generated consecutive true-color frames are made consistent with those of the original input video, i.e., they share the same temporal coherence, which alleviates inter-frame flicker in the generated video frame sequence. That is, the inter-frame difference consistency idea may specifically be implemented on the basis of an inter-frame difference consistency loss function. The processor may also use a preset network model embodying these two ideas to perform the iterative training, which is not limited in this embodiment.
Correspondingly, in this step the weight parameters of the preset generator and the preset discriminator may also be iteratively trained simultaneously using the first and second image domain sets based on the generative adversarial idea, so that the true/false result output by the preset discriminator for the true-color image and that for the converted true-color image have equal scores (e.g., both 0.5); the preset discriminator thereby gains a sufficiently strong ability to distinguish real from fake, and the converted true-color images generated by the preset generator look more like actual true-color images. For example, the preset loss function may further include a generative adversarial loss function, used to adjust the discriminator's outputs on true-color and converted true-color images and to adjust the generator's output; that is, the generative adversarial idea may specifically be implemented on the basis of a generative adversarial loss function. In other words, in this step the weight parameters of the preset generator and the preset discriminator may be iteratively trained according to the inter-frame difference consistency loss function, the semantic structure loss function, and the generative adversarial loss function using the first and second image domain sets, to obtain the trained preset generator.
Specifically, regarding the above generative adversarial loss function: since the preset generator (G) and the preset discriminator (D) of I2V-NET are trained at the same time, their training is a two-player game. The purpose of training the preset discriminator is to give it a sufficiently strong ability to distinguish real from fake, i.e., when the generation result fake_B of the preset generator passes through the discriminator, the discriminator gives a low score, while the original true-color image real_B from the second image domain set receives a high score. Training the preset generator aims at exactly the opposite: its generated image fake_B should obtain a high score from the discriminator, i.e., fake_B better fits the distribution of real_B, reaching a convincing level of verisimilitude in style and looking "more like" the true-color images of the second image domain set (i.e., the target-domain true-color images). The two thus form a minimax game during training, and the loss function may be the generative adversarial loss (GAN_loss) function:
$L_{gan}(G,D,X,Y)=\mathbb{E}_{y\sim Y}[\log D(y_k)]+\mathbb{E}_{x\sim X}[\log(1-D(G(x_i)))]$
where L_gan(G, D, X, Y) may be the generative adversarial loss function, G(·) may be the output of the preset generator, D(·) may be the output of the preset discriminator, X may be the infrared images of the first image domain set (i.e., source-domain sample images), Y may be the true-color images of the second image domain set (i.e., target-domain sample images), y_k may be the true-color video frame image of the k-th frame, and x_i may be the infrared video frame image of the i-th frame. A sketch of this objective follows.
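The following PyTorch sketch expresses this minimax objective in the standard binary cross-entropy form; this formulation is a common implementation choice, not one mandated by the embodiment:

    import torch
    import torch.nn.functional as F

    def gan_loss_d(D, real_b, fake_b):
        """Discriminator side: score real true-color images high, fakes low."""
        pred_real, pred_fake = D(real_b), D(fake_b.detach())
        loss_real = F.binary_cross_entropy_with_logits(
            pred_real, torch.ones_like(pred_real))
        loss_fake = F.binary_cross_entropy_with_logits(
            pred_fake, torch.zeros_like(pred_fake))
        return 0.5 * (loss_real + loss_fake)

    def gan_loss_g(D, fake_b):
        """Generator side: push D to score fake_B as real (a high score)."""
        pred_fake = D(fake_b)
        return F.binary_cross_entropy_with_logits(
            pred_fake, torch.ones_like(pred_fake))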
Correspondingly, regarding the above contrastive-learning-based semantic structure loss function: the traditional method based on CycleGAN (a cycle-consistent generative adversarial network) keeps the semantic structure of the generator's input and output consistent through the bidirectional mapping formed by cycle consistency. This requires training two generators and two discriminators at the same time, consumes a great deal of computing power, easily fails under such a strict bidirectional mapping based on the original image, and yields images lacking detail. The I2V-NET of this embodiment needs only one preset generator and one preset discriminator; it looks for corresponding mutual information by contrasting input and output image patches, and obtains outputs with strong content correspondence by maximizing mutual information. The method for maximizing mutual information may be the contrastive learning idea: a query signal v, a positive signal v⁺, and N negative signals v⁻ are mapped to K-dimensional vectors, where $v, v^+ \in \mathbb{R}^K$ and $v^- \in \mathbb{R}^{N\times K}$; to prevent mode collapse, these vectors are normalized onto the unit sphere, establishing an (N+1)-way classification problem; finally, the cross-entropy loss representing the probability of the positive sample being selected is computed, and the loss function may be the infoNCE loss function:
$\ell(v,v^+,v^-)=-\log\left[\dfrac{\exp(v\cdot v^+/\tau)}{\exp(v\cdot v^+/\tau)+\sum_{n=1}^{N}\exp(v\cdot v_n^-/\tau)}\right]$
where τ is a preset scaling factor. A sketch of this computation follows.
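A minimal sketch of the infoNCE computation, assuming the query and positive are K-dimensional vectors and the negatives form an N×K matrix; the default τ = 0.07 is a common choice from the contrastive-learning literature, not a value fixed here:

    import torch
    import torch.nn.functional as F

    def info_nce(v, v_pos, v_neg, tau=0.07):
        # v: (K,) query; v_pos: (K,) positive; v_neg: (N, K) negatives.
        # Normalise onto the unit sphere to prevent mode collapse.
        v = F.normalize(v, dim=0)
        v_pos = F.normalize(v_pos, dim=0)
        v_neg = F.normalize(v_neg, dim=1)
        # (N+1)-way classification: index 0 is the positive pair.
        logits = torch.cat([(v @ v_pos).view(1), v_neg @ v]) / tau
        target = torch.zeros(1, dtype=torch.long)
        return F.cross_entropy(logits.unsqueeze(0), target)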
In this embodiment, a multi-layer and pairwise contrastive learning method may be adopted, which under unsupervised learning constrains, at both the image and patch level, the semantic information of the generated image to remain consistent with the input image. Therefore, besides the whole generated image sharing content features with the input, each corresponding patch of input and output also has such a correspondence; as shown in FIG. 4, a patch of a car in the generated day-time true-color image should correspond to a patch of the same car in the night-time infrared image before conversion, rather than a patch of trees or other background. In this embodiment, the encoder part of the preset generator in I2V-NET may be G_enc; by reusing G_enc and appending a preset multilayer perceptron (MLP) network H_l after it (the MLP network in FIG. 4), the features of the generator's input and output are stacked; for example, patches of the real night-time infrared image real_A and of the generated corresponding true-color image fake_B are fed into G_enc and H_l respectively, producing a series of features $\{z_l\}=\{H_l(G_{enc}^{\,l}(x))\}$. The corresponding PatchNCE loss function can then be used to compare the differences between image patches, matching the patches at corresponding input/output positions. Taking the infrared images of the first image domain set (i.e., source-domain sample images) as input, the semantic structure loss function may include a multi-layer infrared patch contrastive loss function L_PatchNCE(G, H, X):
$L_{PatchNCE}(G,H,X)=\mathbb{E}_{x\sim X}\sum_{l=1}^{L}\sum_{s=1}^{S_l}\ell(\hat{z}_l^{\,s},\,z_l^{\,s},\,z_l^{\,S\setminus s})$
where l may be a target convolution layer in the encoder of the preset generator, e.g., any single attended layer, several layers, or all convolution layers of the encoder (the more layers attended to, the better the effect can be); L may be the number of target convolution layers in the encoder; X may be the first image domain set; s may be a target position in each target convolution layer, i.e., a position of interest in that layer; S_l may be the number of target positions in the target convolution layer; z_l is the feature generated after the encoder and the preset multilayer perceptron network; $\hat{z}_l^{\,s}$ is the feature at the target position of the converted true-color image corresponding to the input image (e.g., the infrared image); $z_l^{\,S\setminus s}$ are the features at target positions in the input image that do not correspond to the converted true-color image; $z_l^{\,s}$ may be the feature at the target position in the input image that corresponds to the converted true-color image; and x∼X may indicate that x follows the probability distribution of the first image domain set X.
To prevent the preset generator from making unnecessary changes, the corresponding PatchNCE loss function may also be applied to the true-color images of the second image domain set (i.e., the target-domain sample images); that is, the semantic structure loss function may include a multi-layer true-color patch contrastive loss function L_PatchNCE(G, H, Y):
$L_{PatchNCE}(G,H,Y)=\mathbb{E}_{y\sim Y}\sum_{l=1}^{L}\sum_{s=1}^{S_l}\ell(\hat{z}_l^{\,s},\,z_l^{\,s},\,z_l^{\,S\setminus s})$
where Y may be the second image domain set, $\hat{z}_l^{\,s}$ is the feature at the target position of the converted true-color image corresponding to the input image (here a true-color image), $z_l^{\,S\setminus s}$ are the features at target positions of the input image that do not correspond to the converted true-color image, $z_l^{\,s}$ may be the feature at the target position in the input image that corresponds to the converted true-color image, and y∼Y may indicate that y follows the probability distribution of the second image domain set Y. A per-layer sketch of this loss follows.
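The per-layer PatchNCE computation can be sketched as below. It assumes the S patch features of the generated image and of the input image have already been extracted through G_enc and the MLP head H_l, and reuses each non-matching position as a negative; the full loss sums this over the L chosen encoder layers, for both L_PatchNCE(G,H,X) and the identity term L_PatchNCE(G,H,Y):

    import torch
    import torch.nn.functional as F

    def patch_nce_layer(feat_q, feat_k, tau=0.07):
        # feat_q: (S, K) query features from the generated image, one per patch s
        # feat_k: (S, K) features of the input image at the same S positions;
        # feat_k[s] is the positive for feat_q[s], the other S-1 rows negatives.
        feat_q = F.normalize(feat_q, dim=1)
        feat_k = F.normalize(feat_k, dim=1)
        logits = feat_q @ feat_k.t() / tau         # (S, S) similarity matrix
        targets = torch.arange(feat_q.size(0))     # diagonal entries = positives
        return F.cross_entropy(logits, targets)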
Specifically, regarding the above inter-frame difference consistency loss function: existing algorithms stabilize the generated frame sequence by training an additional auxiliary structure, such as a dual-channel generator, to alleviate inter-frame flicker in the generated video, and some obtain stable video by computing an optical-flow loss; but these algorithms all require a large amount of computation and are inconvenient in practice. This embodiment needs no extra auxiliary structure to be trained and is therefore more practical and advantageous algorithmically. Since the original infrared video (or content video) is temporally coherent, and this coherence is encoded as inter-frame differences, a video approximately as stable as the input can be obtained by requiring the converted video, i.e., the generated true-color video, to have similar inter-frame differences in the model. In I2V-NET, inter-frame flicker in the generated frame sequence can be alleviated through the inter-frame difference consistency loss function L_temp, whose formula may be:
$L_{temp}=\sum_{t=1}^{T-1}\left\|\varphi(I_t)-\varphi(\hat{I}_t)\right\|_1$
where T may be the total number of frames of the infrared video, I_t may be the input frame sequence of the preset generator, $\hat{I}_t$ may be the output frame sequence of the preset generator, φ(x_t) = f_m(x_{t+1}) − f_m(x_t) encodes the gap between the (t+1)-th and the t-th frame, m may be the target feature layer, and f_m(x_t) may be the feature extracted by a convolution layer of a preset convolutional neural network, e.g., obtained from the convolution-layer outputs of a pre-trained vgg16 (a convolutional neural network, i.e., the preset convolutional neural network). A sketch of this loss follows.
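A sketch of the temporal loss, assuming a frozen pre-trained vgg16 truncated at an illustrative layer as f_m and treating the leading tensor dimension as the frame index; the L1 norm is one reasonable reading of the consistency requirement, not a form fixed by the embodiment:

    import torch
    import torchvision

    # Frozen feature extractor f_m: vgg16 truncated at an illustrative layer;
    # the embodiment does not fix which layer m is used.
    vgg = torchvision.models.vgg16(pretrained=True).features[:16].eval()
    for p in vgg.parameters():
        p.requires_grad_(False)

    def temporal_loss(inputs, outputs):
        # inputs / outputs: (T, 3, H, W) input and generated frame sequences
        f_in, f_out = vgg(inputs), vgg(outputs)
        phi_in = f_in[1:] - f_in[:-1]     # phi(I_t)  = f_m(I_{t+1}) - f_m(I_t)
        phi_out = f_out[1:] - f_out[:-1]  # phi for the generated frames
        return (phi_in - phi_out).abs().mean()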
That is, the preset loss function in this step (i.e., the total loss function of I2V-NET) may be the sum of the products of each loss function with its corresponding loss-function weight coefficient. For example, when the preset loss function includes the above semantic structure loss function, inter-frame difference consistency loss function, and generative adversarial loss function, it may be the sum of the products of these three with their respective weight coefficients, e.g.:
$L(G,H,D,X,Y)=L_{gan}(G,D,X,Y)+\lambda_X L_{PatchNCE}(G,H,X)+\lambda_Y L_{PatchNCE}(G,H,Y)+\lambda_T L_{temp}$
where λ_X, λ_Y, and λ_T may be the preset weight coefficients of the respective loss functions.
It can be understood that the specific process in which the processor iteratively trains the weight parameters of the preset generator and the preset discriminator according to the preset loss function, using the first and second image domain sets, to obtain the trained preset generator may be set by the designer. For example, the processor may construct the preset generator and the preset discriminator; perform image conversion on an infrared image using the preset generator to obtain a converted true-color image; discriminate the converted true-color image against the true-color images using the preset discriminator to obtain its true/false result; and judge whether a preset number of iterations has been reached. If so, the preset generator is determined to be trained and step 103 is executed; if not, the preset loss function may be used to adjust the weight parameters of the preset generator and the preset discriminator, after which the image-conversion step is executed again to continue the iterative training. For example, in this embodiment the weight parameters of the I2V-NET preset generator and preset discriminator may be trained with the pytorch 1.7.0 deep learning framework (an open-source Python machine learning library), using xavier (a parameter initialization method) random parameter initialization and the adam optimizer, with an initial learning rate of 0.0002. One training step might then look as sketched below.
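This sketch assumes G, H, D, a loader yielding (real_A, real_B) pairs, and the loss helpers above are already defined; patch_nce stands for a hypothetical wrapper that extracts the layer features through G's encoder and the MLP head H and applies the per-layer PatchNCE loss; the λ values and the Adam betas are illustrative assumptions, while lr=0.0002 follows the embodiment:

    import itertools
    import torch

    # Illustrative weight coefficients; the embodiment leaves them unspecified.
    lambda_x, lambda_y, lambda_t = 1.0, 1.0, 10.0
    g_opt = torch.optim.Adam(itertools.chain(G.parameters(), H.parameters()),
                             lr=0.0002, betas=(0.5, 0.999))
    d_opt = torch.optim.Adam(D.parameters(), lr=0.0002, betas=(0.5, 0.999))

    for real_a, real_b in loader:
        fake_b = G(real_a)

        d_opt.zero_grad()                  # discriminator update
        gan_loss_d(D, real_b, fake_b).backward()
        d_opt.step()

        g_opt.zero_grad()                  # generator (and MLP head) update
        loss = (gan_loss_g(D, fake_b)
                + lambda_x * patch_nce(G, H, real_a, fake_b)
                + lambda_y * patch_nce(G, H, real_b, G(real_b))
                + lambda_t * temporal_loss(real_a, fake_b))
        loss.backward()
        g_opt.step()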
Step 103: obtain a conversion generator according to the trained preset generator, so as to perform image conversion on an actual infrared video by using the conversion generator to obtain a target true-color video.
It can be understood that in this step the processor may use the trained preset generator to obtain a generator model (i.e., the conversion generator) that performs true-color conversion on the night-time infrared video that actually needs to be converted (i.e., the actual infrared video).
Specifically, the manner of obtaining the conversion generator according to the trained preset generator may be set by the designer. For example, the processor may load the weight parameters of the trained preset generator into a newly constructed generator model and determine the loaded model as the conversion generator; the processor may also directly determine the trained preset generator as the conversion generator, which is not limited in this embodiment.
Correspondingly, the method provided by this embodiment may further include a process of performing image conversion on the actual infrared video by using the conversion generator. For example, the processor may obtain an image set to be converted and perform image conversion on the infrared images to be converted using the conversion generator to obtain the target true-color video; the image set to be converted includes the infrared images to be converted corresponding to the actual infrared video. For example, the processor may read the night-time infrared video frame sequence in real time and feed it into the loaded conversion generator, generating a continuous and stable day-time true-color video and accomplishing the infrared true-color conversion task, as sketched below.
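A conversion-time sketch, assuming the generator weights were saved to a hypothetical file i2v_generator.pth, a hypothetical input file night_infrared.mp4, and frames normalised to [-1, 1] to match a Tanh output:

    import cv2
    import torch

    G.load_state_dict(torch.load("i2v_generator.pth"))  # hypothetical weights
    G.eval()

    cap = cv2.VideoCapture("night_infrared.mp4")        # actual infrared video
    with torch.no_grad():
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            x = (torch.from_numpy(cv2.resize(frame, (256, 256)))
                      .permute(2, 0, 1).float().unsqueeze(0) / 127.5 - 1.0)
            fake_b = G(x)                               # one converted frame
            out = ((fake_b[0].permute(1, 2, 0) + 1.0) * 127.5).clamp(0, 255)
            out = out.byte().cpu().numpy()              # day-time true-color frame
    cap.release()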
It should be noted that the method provided by this embodiment may further include a test process for the conversion generator or the trained preset generator. For example, the processor may obtain a test image set; perform image conversion on the test infrared images in the test image set using the conversion generator to obtain a test converted true-color video; and display the frame sequence of the test converted true-color video side by side with that of a test true-color video; where the test image set may include test infrared images corresponding to a test infrared video (e.g., a night-time infrared video), and the test true-color video may be a true-color video of the same scene as the test infrared video (e.g., a night-time true-color video).
In this embodiment, the weight parameters of the preset generator and the preset discriminator are iteratively trained based on inter-frame difference consistency and contrastive learning using the first and second image domain sets to obtain the trained preset generator. Introducing the contrastive learning idea avoids the strict bidirectional mapping of the existing cycle-consistency idea, adapts better to infrared image conversion across a time span, enables the converted day-time true-color images to retain the original semantic structure information of the night-time infrared images, and achieves realistic, detail-rich conversion and generation of day-time true-color images. Based on inter-frame difference consistency, the inter-frame difference idea constrains the difference between input and output, effectively preventing inter-frame flicker in the generated true-color video.
Corresponding to the above method embodiment, an embodiment of the present invention further provides an infrared image conversion training apparatus; the apparatus described below and the method described above may be referred to in correspondence with each other.
Please refer to FIG. 5, which is a structural block diagram of an infrared image conversion training apparatus provided by an embodiment of the present invention. The apparatus may include:
an obtaining module 10, configured to obtain a first image domain set and a second image domain set; where the first image domain set includes infrared images corresponding to an infrared video, the second image domain set includes true-color images corresponding to a true-color video, and the infrared video and the true-color video are of the same scene;
a training module 20, configured to iteratively train, based on inter-frame difference consistency and contrastive learning, the weight parameters of a preset generator and a preset discriminator by using the first image domain set and the second image domain set, to obtain a trained preset generator; where the preset generator is used to convert the infrared images into converted true-color images, and the preset discriminator is used to discriminate a true/false result corresponding to an input true-color image;
a generation module 30, configured to obtain a conversion generator according to the trained preset generator, so as to perform image conversion on an actual infrared video by using the conversion generator to obtain a target true-color video.
Optionally, the training module 20 may include:
an inter-frame difference consistency training submodule, configured to iteratively train, based on the inter-frame difference consistency idea, the weight parameters of the preset generator and the preset discriminator using the first and second image domain sets, so that the inter-frame difference of the corresponding consecutive frame images input to and output from the preset generator is smaller than a preset value;
a contrastive learning training submodule, configured to iteratively train, based on the contrastive learning idea of semantic structure, the weight parameters of the preset generator and the preset discriminator using the first and second image domain sets, so that the semantic information of the corresponding images input to and output from the preset generator remains consistent.
Optionally, the contrastive learning idea is specifically implemented on the basis of a semantic structure loss function, where the semantic structure loss function includes a multi-layer infrared patch contrastive loss function and a multi-layer true-color patch contrastive loss function;
where the multi-layer infrared patch contrastive loss function is
$L_{PatchNCE}(G,H,X)=\mathbb{E}_{x\sim X}\sum_{l=1}^{L}\sum_{s=1}^{S_l}\ell(\hat{z}_l^{\,s},\,z_l^{\,s},\,z_l^{\,S\setminus s})$
and the multi-layer true-color patch contrastive loss function is
$L_{PatchNCE}(G,H,Y)=\mathbb{E}_{y\sim Y}\sum_{l=1}^{L}\sum_{s=1}^{S_l}\ell(\hat{z}_l^{\,s},\,z_l^{\,s},\,z_l^{\,S\setminus s})$
X is the first image domain set, Y is the second image domain set, l is a target convolution layer in the encoder of the preset generator, L is the number of target convolution layers in the encoder, s is a target position in each target convolution layer, S_l is the number of target positions in the target convolution layer, z_l is the feature generated after the encoder and the preset multilayer perceptron network, $\hat{z}_l^{\,s}$ is the feature at the target position of the converted true-color image corresponding to the infrared image or the true-color image, $z_l^{\,S\setminus s}$ is the feature at a target position in the infrared image or the true-color image that does not correspond to the converted true-color image, and $z_l^{\,s}$ is the feature at the target position in the infrared image or the true-color image that corresponds to the converted true-color image.
Optionally, the inter-frame difference consistency idea is specifically implemented on the basis of an inter-frame difference consistency loss function, where the inter-frame difference consistency loss function is
$L_{temp}=\sum_{t=1}^{T-1}\left\|\varphi(I_t)-\varphi(\hat{I}_t)\right\|_1$
where T is the total number of frames of the infrared video, I_t is the input frame sequence of the preset generator, $\hat{I}_t$ is the output frame sequence of the preset generator, φ(x_t) = f_m(x_{t+1}) − f_m(x_t) encodes the gap between the (t+1)-th and the t-th frame, m is the target feature layer, and f_m(x_t) is the feature extracted by a convolution layer of a preset convolutional neural network.
Optionally, the training module 20 may further include:
a generative adversarial training submodule, configured to simultaneously and iteratively train, based on the generative adversarial idea, the weight parameters of the preset generator and the preset discriminator using the first and second image domain sets, so that the true/false result output by the preset discriminator for the true-color image and that for the converted true-color image have equal scores.
Optionally, the generative adversarial idea is specifically implemented on the basis of a generative adversarial loss function, where the generative adversarial loss function is
$L_{gan}(G,D,X,Y)=\mathbb{E}_{y\sim Y}[\log D(y_k)]+\mathbb{E}_{x\sim X}[\log(1-D(G(x_i)))]$
G(·) is the output of the preset generator, D(·) is the output of the preset discriminator, X is the infrared image, Y is the true-color image, y_k is the true-color video frame image of the k-th frame, and x_i is the infrared video frame image of the i-th frame.
Optionally, the training module 20 may be specifically configured to iteratively train, according to a preset loss function, the weight parameters of the preset generator and the preset discriminator using the first and second image domain sets, to obtain the trained preset generator; where the preset loss function is the sum of the products of the semantic structure loss function, the inter-frame difference consistency loss function, and the generative adversarial loss function with their respective loss-function weight coefficients.
Optionally, the apparatus may further include:
a conversion obtaining module, configured to obtain an image set to be converted; where the image set to be converted includes the infrared images to be converted corresponding to the actual infrared video;
a conversion generation module, configured to perform image conversion on the infrared images to be converted by using the conversion generator, to obtain the target true-color video.
Optionally, the obtaining module 10 may include:
a video obtaining submodule, configured to obtain training video data; where the training video data includes the infrared video and the true-color video;
a framing submodule, configured to split the training video data into frames to obtain single-frame images;
a conversion submodule, configured to convert the single-frame images into target single-frame images of a preset image specification;
a concatenation submodule, configured to concatenate a preset number of consecutive target single-frame images in video frame order, to obtain the infrared images corresponding to the infrared video and the true-color images corresponding to the true-color video.
In this embodiment, the training module 20 iteratively trains the weight parameters of the preset generator and the preset discriminator based on inter-frame difference consistency and contrastive learning using the first and second image domain sets to obtain the trained preset generator. Introducing the contrastive learning idea avoids the strict bidirectional mapping of the existing cycle-consistency idea, adapts better to infrared image conversion across a time span, enables the converted day-time true-color images to retain the original semantic structure information of the night-time infrared images, and achieves realistic, detail-rich conversion and generation of day-time true-color images. Based on inter-frame difference consistency, the inter-frame difference idea constrains the difference between input and output, effectively preventing inter-frame flicker in the generated true-color video.
Corresponding to the above method embodiment, an embodiment of the present invention further provides an infrared image conversion training device; the device described below and the method described above may be referred to in correspondence with each other.
Please refer to FIG. 6, which is a schematic structural diagram of an infrared image conversion training device provided by an embodiment of the present invention. The conversion training device may include:
a memory D1, configured to store a computer program;
a processor D2, configured to implement the steps of the infrared image conversion training method provided by the above method embodiment when executing the computer program.
Specifically, please refer to FIG. 7, which is a schematic diagram of a specific structure of an infrared image conversion training device provided by this embodiment. The conversion training device 310 may vary considerably with configuration or performance and may include one or more processors (central processing units, CPU) 322 (e.g., one or more processors), a memory 332, and one or more storage media 330 storing application programs 342 or data 344 (e.g., one or more mass storage devices). The memory 332 and the storage medium 330 may be transient or persistent storage. The program stored in the storage medium 330 may include one or more modules (not shown), each of which may include a series of instruction operations on the electronic device. Further, the central processing unit 322 may be configured to communicate with the storage medium 330 and execute, on the conversion training device 310, the series of instruction operations stored therein.
The conversion training device 310 may further include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input/output interfaces 358, and/or one or more operating systems 341, e.g., Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.
The steps of the infrared image conversion training method described above may be implemented by the structure of the infrared image conversion training device.
Corresponding to the above method embodiment, an embodiment of the present invention further provides a computer-readable storage medium; the computer-readable storage medium described below and the infrared image conversion training method described above may be referred to in correspondence with each other.
A computer-readable storage medium stores a computer program, where the computer program, when executed by a processor, implements the steps of the infrared image conversion training method provided by the above method embodiment.
The computer-readable storage medium may specifically be a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or any other readable storage medium capable of storing program code.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and for the same or similar parts the embodiments may be referred to one another. Since the apparatus, device, and computer-readable storage medium disclosed in the embodiments correspond to the method disclosed in the embodiments, their description is relatively brief, and reference may be made to the method description for relevant details.
The infrared image conversion training method, apparatus, device, and computer-readable storage medium provided by the present invention have been introduced in detail above. Specific examples are used herein to elaborate the principles and implementations of the present invention, and the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. It should be noted that those of ordinary skill in the art may make several improvements and modifications to the present invention without departing from its principles, and such improvements and modifications also fall within the protection scope of the claims of the present invention.

Claims (12)

  1. An infrared image conversion training method, comprising:
    obtaining a first image domain set and a second image domain set; wherein the first image domain set comprises infrared images corresponding to an infrared video, the second image domain set comprises true-color images corresponding to a true-color video, and the infrared video and the true-color video are of the same scene;
    iteratively training, based on inter-frame difference consistency and contrastive learning, weight parameters of a preset generator and a preset discriminator by using the first image domain set and the second image domain set, to obtain a trained preset generator; wherein the preset generator is configured to convert the infrared images into converted true-color images, and the preset discriminator is configured to discriminate a true/false result corresponding to an input true-color image;
    obtaining a conversion generator according to the trained preset generator, so as to perform image conversion on an actual infrared video by using the conversion generator to obtain a target true-color video.
  2. The infrared image conversion training method according to claim 1, wherein the iteratively training, based on inter-frame difference consistency and contrastive learning, the weight parameters of the preset generator and the preset discriminator by using the first image domain set and the second image domain set comprises:
    iteratively training, based on the inter-frame difference consistency idea, the weight parameters of the preset generator and the preset discriminator by using the first image domain set and the second image domain set, so that an inter-frame difference of corresponding consecutive frame images input to and output from the preset generator is smaller than a preset value;
    iteratively training, based on the contrastive learning idea of semantic structure, the weight parameters of the preset generator and the preset discriminator by using the first image domain set and the second image domain set, so that semantic information of corresponding images input to and output from the preset generator remains consistent.
  3. The infrared image conversion training method according to claim 2, wherein the contrastive learning idea is specifically implemented on the basis of a semantic structure loss function, and the semantic structure loss function comprises a multi-layer infrared patch contrastive loss function and a multi-layer true-color patch contrastive loss function:
    wherein the multi-layer infrared patch contrastive loss function is
    $L_{PatchNCE}(G,H,X)=\mathbb{E}_{x\sim X}\sum_{l=1}^{L}\sum_{s=1}^{S_l}\ell(\hat{z}_l^{\,s},\,z_l^{\,s},\,z_l^{\,S\setminus s})$
    and the multi-layer true-color patch contrastive loss function is
    $L_{PatchNCE}(G,H,Y)=\mathbb{E}_{y\sim Y}\sum_{l=1}^{L}\sum_{s=1}^{S_l}\ell(\hat{z}_l^{\,s},\,z_l^{\,s},\,z_l^{\,S\setminus s})$
    X is the first image domain set, Y is the second image domain set, l is a target convolution layer in an encoder of the preset generator, L is the number of target convolution layers in the encoder, s is a target position in each target convolution layer, S_l is the number of target positions in the target convolution layer, z_l is a feature generated after the encoder and a preset multilayer perceptron network, $\hat{z}_l^{\,s}$ is a feature at the target position of the converted true-color image corresponding to the infrared image or the true-color image, $z_l^{\,S\setminus s}$ is a feature at a target position in the infrared image or the true-color image that does not correspond to the converted true-color image, and $z_l^{\,s}$ is a feature at the target position in the infrared image or the true-color image that corresponds to the converted true-color image.
  4. The infrared image conversion training method according to claim 2, wherein the inter-frame difference consistency idea is specifically implemented on the basis of an inter-frame difference consistency loss function, and the inter-frame difference consistency loss function is
    $L_{temp}=\sum_{t=1}^{T-1}\left\|\varphi(I_t)-\varphi(\hat{I}_t)\right\|_1$
    wherein T is the total number of frames of the infrared video, I_t is an input frame sequence of the preset generator, $\hat{I}_t$ is an output frame sequence of the preset generator, φ(x_t) = f_m(x_{t+1}) − f_m(x_t) encodes the gap between the (t+1)-th frame and the t-th frame, m is a target feature layer, and f_m(x_t) is a feature extracted by a convolution layer of a preset convolutional neural network.
  5. The infrared image conversion training method according to claim 1, wherein the iteratively training, based on inter-frame difference consistency and contrastive learning, the weight parameters of the preset generator and the preset discriminator by using the first image domain set and the second image domain set further comprises:
    simultaneously and iteratively training, based on the generative adversarial idea, the weight parameters of the preset generator and the preset discriminator by using the first image domain set and the second image domain set, so that the true/false result output by the preset discriminator for the true-color image and the true/false result output for the converted true-color image have equal scores.
  6. The infrared image conversion training method according to claim 5, wherein the generative adversarial idea is specifically implemented on the basis of a generative adversarial loss function, and the generative adversarial loss function is
    $L_{gan}(G,D,X,Y)=\mathbb{E}_{y\sim Y}[\log D(y_k)]+\mathbb{E}_{x\sim X}[\log(1-D(G(x_i)))]$
    G(·) is the output of the preset generator, D(·) is the output of the preset discriminator, X is the infrared image, Y is the true-color image, y_k is the true-color video frame image of the k-th frame, and x_i is the infrared video frame image of the i-th frame.
  7. The infrared image conversion training method according to claim 5, wherein the iteratively training, based on inter-frame difference consistency and contrastive learning, the weight parameters of the preset generator and the preset discriminator by using the first image domain set and the second image domain set to obtain a trained preset generator comprises:
    iteratively training, according to a preset loss function, the weight parameters of the preset generator and the preset discriminator by using the first image domain set and the second image domain set, to obtain the trained preset generator; wherein the preset loss function is the sum of products of a semantic structure loss function, an inter-frame difference consistency loss function, and a generative adversarial loss function with their respective loss-function weight coefficients.
  8. The infrared image conversion training method according to claim 1, wherein after the obtaining a conversion generator according to the trained preset generator, the method further comprises:
    obtaining an image set to be converted; wherein the image set to be converted comprises infrared images to be converted corresponding to the actual infrared video;
    performing image conversion on the infrared images to be converted by using the conversion generator, to obtain the target true-color video.
  9. The infrared image conversion training method according to any one of claims 1 to 8, wherein the obtaining a first image domain set and a second image domain set comprises:
    obtaining training video data; wherein the training video data comprises the infrared video and the true-color video;
    splitting the training video data into frames to obtain single-frame images;
    converting the single-frame images to obtain target single-frame images of a preset image specification;
    concatenating a preset number of consecutive target single-frame images in video frame order, to obtain the infrared images corresponding to the infrared video and the true-color images corresponding to the true-color video.
  10. An infrared image conversion training apparatus, comprising:
    an obtaining module, configured to obtain a first image domain set and a second image domain set; wherein the first image domain set comprises infrared images corresponding to an infrared video, the second image domain set comprises true-color images corresponding to a true-color video, and the infrared video and the true-color video are of the same scene;
    a training module, configured to iteratively train, based on inter-frame difference consistency and contrastive learning, weight parameters of a preset generator and a preset discriminator by using the first image domain set and the second image domain set, to obtain a trained preset generator; wherein the preset generator is configured to convert the infrared images into converted true-color images, and the preset discriminator is configured to discriminate a true/false result corresponding to an input true-color image;
    a generation module, configured to obtain a conversion generator according to the trained preset generator, so as to perform image conversion on an actual infrared video by using the conversion generator to obtain a target true-color video.
  11. An infrared image conversion training device, comprising:
    a memory, configured to store a computer program;
    a processor, configured to implement the steps of the infrared image conversion training method according to any one of claims 1 to 9 when executing the computer program.
  12. A computer-readable storage medium, storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the infrared image conversion training method according to any one of claims 1 to 9.