CN113065585B - Training method and device of image synthesis model and electronic equipment - Google Patents

Training method and device of image synthesis model and electronic equipment

Info

Publication number
CN113065585B
Authority
CN
China
Prior art keywords
image sample
image
training
neural network
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110308488.4A
Other languages
Chinese (zh)
Other versions
CN113065585A (en)
Inventor
姚寒星
王锦申
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing LLvision Technology Co ltd
Original Assignee
Beijing LLvision Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing LLvision Technology Co ltd filed Critical Beijing LLvision Technology Co ltd
Priority to CN202110308488.4A
Publication of CN113065585A (application)
Application granted
Publication of CN113065585B (granted patent)
Legal status: Active

Classifications

    • G06F 18/214: Pattern recognition; Analysing; Design or setup of recognition systems or techniques; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 16/51: Information retrieval of still image data; Indexing; Data structures therefor; Storage structures
    • G06N 3/045: Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
    • G06N 3/084: Neural networks; Learning methods; Backpropagation, e.g. using gradient descent
    • G06T 5/50: Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T 5/70: Image enhancement or restoration; Denoising; Smoothing


Abstract

The present disclosure provides a training method and device for an image synthesis model, and an electronic device. The training method comprises: synthesizing a first grayscale image sample and a second grayscale mask image sample to obtain a synthesized image sample; inputting the synthesized image sample data into a convolutional neural network for training, the output of the convolutional neural network being a homography transformation matrix; transforming the first grayscale image sample according to the homography transformation matrix to obtain a first grayscale transformed image sample; determining a first pixel value at the ith index position of the first grayscale transformed image sample, determining a second pixel value at the ith index position of the second grayscale mask image sample, and determining the larger of the two; and determining the training loss of the convolutional neural network according to the first pixel value, the second pixel value and the pixel values at all index positions of the second grayscale mask image sample, and adjusting the configuration parameters accordingly. Through the embodiments of the present disclosure, the accuracy of homography transformation parameter estimation and the reliability and applicability of image synthesis are improved.

Description

Training method and device of image synthesis model and electronic equipment
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a training method and apparatus for an image synthesis model, and an electronic device.
Background
At present, homography transformation refers to the mapping relation between images obtained by photographing objects on the same plane from different viewing angles. It is widely used in fields such as image stitching, monocular SLAM and video stabilization, and is generally represented by a 3×3 matrix
$H = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix}$
or the equivalent 4-point form H_4pt. Homography estimation has traditionally been performed with local feature operators such as SIFT (Scale-Invariant Feature Transform, a feature extraction algorithm in computer vision for detecting and describing local features in an image) and ORB (Oriented FAST and Rotated BRIEF): feature matching is used to establish the correspondence between the key-point sets of the two images, and RANSAC (Random Sample Consensus) is then used to search for the optimal homography transformation parameter estimate. However, when enough image feature key points cannot be detected, or when key points are mismatched because of illumination or an excessively large viewing-angle difference between the images, homography transformation parameter estimation based on feature operators is often highly inaccurate.
In the related art, with the development of deep learning, research on homography transformation parameter estimation has shifted to deep-learning-based methods.
However, image synthesis schemes based on deep learning have at least the following technical problems:
(1) Because they lack a random-sample-consensus step, the influence of moving objects, or of objects that do not lie on the assumed plane, cannot be eliminated, which leads to inaccurate estimation of the homography transformation parameters.
(2) The loss function is computed directly in the image brightness domain and therefore lacks robustness to changes in ambient illumination.
(3) The image synthesis scheme is only applicable to describing the motion of a planar object or the motion caused by camera rotation.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure is directed to a method, an apparatus, and an electronic device for training an image synthesis model, which are used to overcome, at least to some extent, the problem of poor image synthesis effect due to the limitations and disadvantages of the related art.
According to a first aspect of the embodiments of the present disclosure, there is provided a training method for an image synthesis model, comprising: synthesizing a first grayscale image sample and a second grayscale mask image sample to obtain a synthesized image sample; inputting the synthesized image sample data into a convolutional neural network for training, the output of the convolutional neural network being a homography transformation matrix; transforming the first grayscale image sample according to the homography transformation matrix to obtain a first grayscale transformed image sample; determining a first pixel value at the ith index position of the first grayscale transformed image sample, determining a second pixel value at the ith index position of the second grayscale mask image sample, and determining the larger value between the first pixel value and the second pixel value; and determining the training loss of the convolutional neural network according to the first pixel value, the second pixel value and the pixel values at all index positions of the second grayscale mask image sample, and adjusting the configuration parameters of the convolutional neural network according to the training loss.
In an exemplary embodiment of the present disclosure, before the synthesizing of the first grayscale image sample and the second grayscale mask image sample, the method further includes: performing Gaussian filtering on the two grayscale images to be synthesized to obtain filtered images; applying a gradient operator to the filtered images to obtain gradient images; determining the maximum gray value in each gradient image, and dividing the pixel value of each point in the gradient image by that maximum to obtain normalized grayscale image samples; and determining one of the grayscale image samples as the first grayscale image sample and the other as the second grayscale mask image sample.
In an exemplary embodiment of the present disclosure, the synthesizing the first grayscale image sample and the second grayscale mask image sample to obtain a synthesized image sample specifically includes: combining the first gray level image sample and the second gray level mask image sample to obtain a combined image sample; and performing down-sampling processing on the merged image sample, and determining the result of the sampling processing as a synthesized image sample.
In an exemplary embodiment of the present disclosure, inputting the synthesized image sample data into a convolutional neural network for training, where the output of the convolutional neural network is a homography transformation matrix, specifically includes: inputting the synthesized image sample into the convolutional neural network, where the convolutional neural network performs a 1×1 convolution operation with 2 input channels and 3 output channels; performing a batch normalization operation on the result of the output channels; inputting the result of the batch normalization operation into a regression backbone network, which outputs homography transformation parameters; and inputting the homography transformation parameters into a direct linear transformation layer, which outputs a homography transformation matrix.
In an exemplary embodiment of the present disclosure, the convolutional neural network performs a convolution with a 5×5 kernel, a stride of 2, and a padding of 2.
In an exemplary embodiment of the present disclosure, transforming the first grayscale image sample according to the homography transformation matrix to obtain a first grayscale transformed image sample specifically includes: and inputting the first gray level image sample into a spatial transformation layer, and transforming the first gray level image sample through a homography transformation matrix of the spatial transformation layer to obtain a first gray level transformation image sample.
In an exemplary embodiment of the present disclosure, determining the training loss of the convolutional neural network according to the first pixel value, the second pixel value and the pixel values at all index positions of the second grayscale mask image sample, and adjusting the configuration parameters of the convolutional neural network according to the training loss, specifically includes: determining the absolute value of the difference between the first pixel value and the second pixel value at the ith index position; determining the pixel product between this absolute difference and the larger value at the ith index position; accumulating the pixel products over all index positions and determining the result as a first accumulated sum; accumulating the pixel values at all index positions of the second grayscale mask image sample and determining the result as a second accumulated sum; determining the training loss of the convolutional neural network according to the ratio of the first accumulated sum to the second accumulated sum; and adjusting the configuration parameters of the convolutional neural network according to the training loss.
According to a second aspect of the embodiments of the present disclosure, there is provided a training device for an image synthesis model, comprising: a synthesis module, configured to synthesize a first grayscale image sample and a second grayscale mask image sample to obtain a synthesized image sample; a training module, configured to input the synthesized image sample data into a convolutional neural network for training, the output of the convolutional neural network being a homography transformation matrix; a transformation module, configured to transform the first grayscale image sample according to the homography transformation matrix to obtain a first grayscale transformed image sample; and a determining module, configured to determine a first pixel value at the ith index position of the first grayscale transformed image sample and a second pixel value at the ith index position of the second grayscale mask image sample, to determine the larger value between the first pixel value and the second pixel value, to determine the training loss of the convolutional neural network according to the first pixel value, the second pixel value and the pixel values at all index positions of the second grayscale mask image sample, and to adjust the configuration parameters of the convolutional neural network according to the training loss.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: a memory; and a processor coupled to the memory, the processor configured to perform a method as in any above based on instructions stored in the memory.
According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a program which, when executed by a processor, implements a method of training an image synthesis model as in any one of the above.
According to the embodiments of the present disclosure, the first grayscale image sample and the second grayscale mask image sample are synthesized, and the larger value between the first pixel value and the second pixel value is determined, so that the pixel value at the ith index position is, at any stage of training, not smaller than the pixel value at the corresponding index position of the second grayscale mask image sample. This ensures that the pixel difference information at image edge positions is fully used in training, and improves the accuracy of homography transformation parameter estimation as well as the image synthesis effect, reliability and applicability.
Furthermore, forward inference of the neural network model only needs to be executed up to the direct linear transformation module, which outputs the homography transformation matrix H, so the computation of the neural network model is greatly reduced.
Further, the present disclosure proposes an "input-as-mask" method: the training image pair (I_a, I_b) is preprocessed so that only key position information such as edges is kept, generating a preprocessed image pair (I_a^p, I_b^p) that serves simultaneously as the network input and as the mask, where the mask image is produced by the preprocessing and remains unchanged during training, which avoids the problem of the mask image easily sliding to 0 during training.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the scope of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
FIG. 1 is a flow diagram of a method for training an image synthesis model in an exemplary embodiment of the present disclosure;
FIG. 2 is a flow chart of a method of training an image synthesis model in another exemplary embodiment of the present disclosure;
FIG. 3 is a flow chart of a method of training an image synthesis model in another exemplary embodiment of the present disclosure;
FIG. 4 is a flow chart of a method of training an image synthesis model in another exemplary embodiment of the present disclosure;
FIG. 5 is a flow chart of a method of training an image synthesis model in another exemplary embodiment of the present disclosure;
FIG. 6 is a flow chart of a method of training an image synthesis model in another exemplary embodiment of the present disclosure;
FIG. 7 is a flow chart of a method of training an image synthesis model in another exemplary embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a training model of an image synthesis model in another exemplary embodiment of the present disclosure;
FIG. 9 is a schematic diagram of a training model of an image synthesis model in another exemplary embodiment of the present disclosure;
FIG. 10 is a schematic illustration of a training image of an image synthesis model in an exemplary embodiment of the present disclosure;
FIG. 11 is a schematic illustration of a training image of an image synthesis model in another exemplary embodiment of the present disclosure;
FIG. 12 is a schematic illustration of a training image of an image synthesis model in another exemplary embodiment of the present disclosure;
FIG. 13 is a schematic illustration of a training image of an image synthesis model in another exemplary embodiment of the present disclosure;
FIG. 14 is a block diagram of an apparatus for training an image synthesis model according to an exemplary embodiment of the present disclosure;
fig. 15 is a block diagram of an electronic device in an exemplary embodiment of the disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Further, the drawings are merely schematic illustrations of the present disclosure, in which the same reference numerals denote the same or similar parts, and thus, a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The following describes exemplary embodiments of the present disclosure in detail with reference to fig. 1 to 15 of the drawings.
Fig. 1 is a flowchart of a training method of an image synthesis model in an exemplary embodiment of the present disclosure.
Referring to fig. 1, the training method of the image synthesis model may include:
step S102, a first gray scale image sample and a second gray scale mask image sample are subjected to synthesis processing to obtain a synthesized image sample.
And step S104, inputting the synthesized image sample data into a convolutional neural network for training, wherein the output of the convolutional neural network is a homography transformation matrix.
And step S106, transforming the first gray-scale image sample according to the homography transformation matrix to obtain a first gray-scale transformed image sample.
Step S108, determining a first pixel value of the ith index position of the first gray scale transformed image sample, a second pixel value of the ith index position of the second gray scale masked image sample, and determining a larger value between the first pixel value and the second pixel value.
Step S110, determining the training loss of the convolutional neural network according to the first pixel value, the second pixel value and the pixel values at all index positions of the second grayscale mask image sample, and adjusting the configuration parameters of the convolutional neural network according to the training loss.
In the above embodiment, the first grayscale image sample and the second grayscale mask image sample are synthesized to determine the larger value between the first pixel value and the second pixel value, so that the pixel value at the i-th index position at any stage of training is not less than the pixel value at the corresponding index position of the second grayscale mask image sample, thereby ensuring that the pixel difference information at the edge position of the image is fully utilized in the training, and improving the accuracy of the homography transformation parameter estimation, the image synthesis effect, the reliability and the applicability.
Furthermore, forward inference of the neural network model only needs to be executed up to the direct linear transformation module, which outputs the homography transformation matrix H, so the computation of the neural network model is greatly reduced.
Further, to address the situation in which the mask image easily slides to 0 during training, the present disclosure also proposes an "input-as-mask" method: the training image pair (I_a, I_b) is preprocessed so that only key position information such as edges is kept, generating a preprocessed image pair (I_a^p, I_b^p) that serves simultaneously as the network input and as the mask, where the mask image is produced by the preprocessing and remains unchanged during training, which reduces the likelihood of the mask image sliding to 0 during training.
The following describes each step of the training method of the image synthesis model in detail with reference to fig. 2 to 7.
As shown in fig. 2, before the synthesizing process is performed on the first gray-scale image sample and the second gray-scale mask image sample, the method further includes:
step S202, performing gaussian filtering on the two grayscale images to be synthesized to obtain a filtered image.
And step S204, performing gradient operator processing on the filtered image to obtain a gradient image.
Step S206, determine the maximum value of the gray scale in the gradient image, and divide the pixel value of each point in the gradient image by the maximum value of the gray scale to obtain a normalized gray scale image sample.
In step S208, one of the grayscale image samples is determined as a first grayscale image sample, and the other of the grayscale image samples is determined as a second grayscale mask image sample.
In the above embodiment, Gaussian filtering is applied to the two grayscale images to be synthesized, and the gradient operator is applied to the filtered images to obtain normalized grayscale image samples. This denoises the images, retains and normalizes the high-frequency information, and normalizes images of different brightness to a maximum of 1, which improves stability during training.
As shown in fig. 3, the synthesizing the first grayscale image sample and the second grayscale mask image sample to obtain a synthesized image sample specifically includes:
step S302, a merging process is performed on the first grayscale image sample and the second grayscale mask image sample to obtain a merged image sample.
In step S304, the merged image sample is subjected to down-sampling processing, and the result of the sampling processing is determined as the synthesized image sample.
In the above embodiment, down-sampling the merged image sample and taking the result as the synthesized image sample reduces the size of the feature map subsequently input to the regression backbone network, and thus reduces the computation of the regression backbone network.
As shown in fig. 4, inputting the synthesized image sample data into a convolutional neural network for training, where the output of the convolutional neural network is a homography transformation matrix, specifically includes:
Step S402, inputting the synthesized image sample into the convolutional neural network, where the convolutional neural network performs a 1×1 convolution operation with 2 input channels and 3 output channels.
In step S404, a batch normalization operation is performed on the result of the output channels.
Step S406, inputting the result of the batch normalization operation into a regression backbone network, which outputs the homography transformation parameters in 4-point form.
Step S408, inputting the homography transformation parameters into a Direct Linear Transformation (DLT) layer, which outputs the homography transformation matrix.
In an embodiment of the present disclosure, the homography transformation parameters in 4-point form consist of 4 point coordinates output by the regression backbone network; the 4 corresponding regression reference points generally lie in the original input image, close to the upper-left, upper-right, lower-left and lower-right corners of the original input image, respectively.
In the above embodiment, the direct linear transformation solves for the homography transformation matrix parameters directly from the coordinates of the 4 regression reference points and the 4 corresponding point coordinates output by the regression backbone network.
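For illustration, the sketch below shows how a standard DLT solve can recover H from four reference points and the four regressed corresponding points by solving the usual 8×8 linear system. It is a minimal NumPy sketch under common assumptions; the reference-point coordinates and offsets in the usage example are hypothetical, not values from this disclosure.

```python
import numpy as np

def dlt_solve(src_pts: np.ndarray, dst_pts: np.ndarray) -> np.ndarray:
    """Solve the 3x3 homography H mapping src_pts -> dst_pts (4 points each, shape (4, 2))."""
    A, b = [], []
    for (x, y), (u, v) in zip(src_pts, dst_pts):
        # From u = (h11*x + h12*y + h13) / (h31*x + h32*y + 1) and the analogous equation for v.
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = np.linalg.solve(np.asarray(A, dtype=np.float64), np.asarray(b, dtype=np.float64))
    return np.append(h, 1.0).reshape(3, 3)

# Usage example: reference points near the four image corners plus hypothetical 4-point offsets.
ref = np.array([[0, 0], [319, 0], [0, 179], [319, 179]], dtype=np.float64)
h4pt = np.array([[2.0, -1.5], [0.5, 1.0], [-1.0, 2.5], [1.5, -0.5]])  # hypothetical offsets
H = dlt_solve(ref, ref + h4pt)
```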
In an exemplary embodiment of the present disclosure, the convolutional neural network performs a convolution with a 5×5 kernel, a stride of 2, and a padding of 2.
As shown in fig. 5, transforming the first gray-scale image sample according to the homography transformation matrix to obtain a first gray-scale transformed image sample specifically includes:
step S502, inputting the first gray level image sample into a space transformation layer, and transforming the first gray level image sample through a homography transformation matrix of the space transformation layer to obtain a first gray level transformation image sample.
In the above embodiment, the Spatial Transformer layer is a differentiable module that applies the homography transformation to the input image through inverse warping, so it supports training by back-propagation.
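For illustration, a minimal PyTorch sketch of such a differentiable inverse-warping layer, assuming a single 3×3 homography shared by the whole batch; the function name and numerical details are illustrative assumptions, not the patented layer:

```python
import torch
import torch.nn.functional as F

def warp_by_homography(img: torch.Tensor, hom: torch.Tensor) -> torch.Tensor:
    """Differentiable inverse warping of img (N, C, H, W) with a single 3x3 homography hom."""
    n, _, h, w = img.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=img.dtype, device=img.device),
        torch.arange(w, dtype=img.dtype, device=img.device),
        indexing="ij",
    )
    ones = torch.ones_like(xs)
    # For every output pixel, find where it comes from in the source image (inverse warping).
    dst = torch.stack([xs, ys, ones], dim=-1).reshape(-1, 3).T        # (3, H*W)
    src = torch.inverse(hom) @ dst                                    # (3, H*W)
    src = src[:2] / (src[2:] + 1e-8)                                  # perspective divide
    # grid_sample expects sampling locations normalized to [-1, 1].
    gx = 2.0 * src[0] / (w - 1) - 1.0
    gy = 2.0 * src[1] / (h - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).reshape(1, h, w, 2).repeat(n, 1, 1, 1)
    return F.grid_sample(img, grid, mode="bilinear", align_corners=True)
```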
As shown in fig. 6, determining a training loss of the convolutional neural network according to the first pixel value, the second pixel value, and pixel values of all index positions of the second gray mask image sample, and adjusting configuration parameters of the convolutional neural network according to the training loss specifically includes:
in step S602, the absolute value of the difference between the first pixel value and the second pixel value at the ith index position is determined.
In step S604, the pixel product between the absolute value of the difference at the ith index position and the larger value determined for the ith index position is computed.
In step S606, the pixel products at all index positions are accumulated, and the accumulated result is determined as a first accumulated sum.
In step S608, the pixel values of all the index positions of the second gray mask image sample are accumulated, and the accumulated result is determined as a second accumulated sum.
And step S610, determining the training loss of the convolutional neural network according to the proportional relation between the first accumulated sum and the second accumulated sum.
And step S612, adjusting configuration parameters of the convolutional neural network according to the training loss.
In the above embodiment, by determining the maximum value between the first pixel value and the second pixel value of the ith index position, the value of the ith index position is not less than the value of the ith index position corresponding to the second gray-scale mask image sample at any stage of training, so that the pixel difference information of the image edge position is fully utilized in the training, and the occurrence of the situation that the pixel value of the mask image is close to 0 is reduced.
Further, the weight of each training image pair is determined by the ratio of the first accumulated sum to the second accumulated sum: training image pairs with moving objects or large depth differences receive a larger weight, and training image pairs without them receive a smaller weight. This improves the image synthesis effect and widens the applicable scenarios of image synthesis, i.e., it is applicable to, but not limited to, static images and dynamic images.
As shown in fig. 7, the training method of the image synthesis model further includes the following embodiments:
(1) I_a and I_b are the input image pair.
(2) F_a and F_b are the features extracted from the input images by the feature extraction module f(·).
(3) M_a and M_b are the mask images generated from the input image pair by the mask generation module m(·).
(4) G_a is the dot product of I_a and M_a; correspondingly, G_b is the dot product of I_b and M_b.
(5) After G_a and G_b are merged, they are input into the convolutional neural network model 800, which outputs the homography transformation parameters H_ab.
(6) Applying H_ab to M_a yields M'_a, and applying H_ab to F_a yields F'_a. The loss function is expressed as:
[equation image: a Triplet-Loss-style loss L_m over the masked feature differences M'_a, M_b and ‖F'_a − F_b‖_1]
where m is the margin of the Triplet Loss; the Triplet Loss is a loss function used in deep learning for training on samples with small differences, such as faces.
The input original image 1000 to be synthesized is shown in fig. 10, and the mask map 1100 obtained from the input image is shown in fig. 11; a mask map close to all zeros is easily produced because there is no constraint on the mask generation module.
By analyzing the loss function L_m, the applicant found that L_m applies the triplet-loss constraint to the feature extraction module f(·) but not to the mask generation module m(·). Since in most training image pairs (I_a, I_b) the objects have more or less depth difference and local motion, the homography model cannot completely describe the geometric transformation of the pair (I_a, I_b); as a result, ‖F'_a − F_b‖_1 is positive at most positions, and the values of M'_a or M_b at most position points easily slide to 0 during training in order to minimize the loss L_m.
As can be seen from fig. 7, the mask image can help the homography transformation parameter estimation model ignore image information at non-key positions during training and attend only to the key position information, which has a positive effect on model training.
Based on the solution shown in fig. 7, as shown in fig. 8 and 9, the present disclosure further proposes an "input-as-mask" method: the training image pair (I_a, I_b) is preprocessed so that only key position information such as edges is kept, generating a preprocessed image pair (I_a^p, I_b^p) that serves simultaneously as the network input and as the mask images.
Further, as shown in fig. 8, the training method of the image synthesis model according to the present disclosure further includes the following steps:
in step S802, the image data pairs Ia and Ib are input as training samples.
Step S804, for IaAnd IbRespectively pretreated to generate Ia pAnd Ib p
Step S806, the preprocessed results are combined, and down-sampling with a step size of 2 is performed.
Step S808, inputting the down sampling result into the backbone regression network, and outputting H by the backbone regression network4pt
And step S810, inputting the result of the previous step into DLT Solver to solve H.
Step S812, adding H and Ia pInput ST Layer becomes Iap
Step S814, adding Ib pAnd Ia p' the input loss function module calculates a training loss.
As shown in fig. 8 and 9, the mask image 1200 shown in fig. 12 is obtained by the "input-as-mask" method; the mask image 1200 is generated by the preprocessing and remains unchanged during training, so the problem of the mask image easily sliding to 0 during training is avoided.
As shown in fig. 9, the input grayscale image pair (I_a, I_b) is preprocessed to form (I_a^p, I_b^p), which are merged into I_C and then input to the down-sampling module, which outputs I_D.
I_D is input to the regression backbone network 900, which outputs the homography transformation parameters H_4pt in 4-point form; the homography transformation matrix H is then solved by the DLT solver module.
I_a^p is transformed by H to obtain I_a^{p'}, and I_a^{p'} together with I_b^p is input to the loss module L, which computes the loss of the training process.
The improvement of the above embodiment is mainly embodied in the preprocessing module, the down-sampling module and the loss module. The method comprises the following training stage processes:
(1) The input grayscale image pair (I_a, I_b) is preprocessed; the two images are each processed according to the following steps:
(1.1) Input a grayscale image I, apply Gaussian blur filtering to I, and output I_blur.
(1.2) Apply the Laplacian (gradient) operator to I_blur and output I_hf.
(1.3) Compute the maximum gray value V_max of the image I_hf, divide each pixel value of I_hf by V_max, and output the normalized result I_norm; the specific expression is
$I_{norm} = \dfrac{I_{hf}}{V_{max} + \varepsilon}$
where ε is a small value close to 0 chosen to avoid overflow in the division.
This denoises the image, retains and normalizes the high-frequency information, and normalizes images of different brightness to a maximum of 1, which improves stability during training.
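For illustration, a minimal OpenCV/NumPy sketch of steps (1.1)–(1.3); the kernel size, σ and ε values are illustrative assumptions, and the absolute value of the Laplacian response is taken here (an assumption) so that the normalized output is non-negative:

```python
import cv2
import numpy as np

def preprocess(gray: np.ndarray, ksize: int = 5, sigma: float = 1.0, eps: float = 1e-6) -> np.ndarray:
    """Blur -> Laplacian -> normalize to [0, 1], keeping only edge-like high-frequency content."""
    blur = cv2.GaussianBlur(gray, (ksize, ksize), sigma)   # (1.1) denoise with Gaussian blur
    hf = np.abs(cv2.Laplacian(blur, cv2.CV_32F))           # (1.2) gradient / high-frequency response
    v_max = float(hf.max())                                # (1.3) normalize by the maximum gray value
    return hf / (v_max + eps)
```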
(2) The preprocessed image pair (I_a^p, I_b^p) is merged into a composite image I_C, and I_C is then down-sampled to output the sampled image I_D. The down-sampling module operates as follows:
(2.1) I_C is input to the module, which performs a depthwise convolution with a 5×5 kernel, a stride of 2 and a padding of 2, followed by a BatchNorm operation and ReLU activation.
BatchNorm denotes batch normalization; the BatchNorm layer prevents the convergence rate from dropping due to vanishing gradients during training, and after normalization the inputs fall into the sensitive region of the activation function, which accelerates network convergence.
In addition, the ReLU is a Linear rectification function (Rectified Linear Unit), which is a commonly used activation function in an artificial neural network.
(2.2) the 1x1 convolution operation is performed with the input channel of the 1x1 convolution operation being 2 and the output channel of the 1x1 convolution operation being 3, followed by the BatchNorm operation.
The down-sampling module is used for reducing the size of a feature diagram which is subsequently input to the regression backbone network and reducing the calculated amount of the backbone network.
Converting the 2-channel image input to the regression backbone network into a 3-channel image makes it possible to use the parameters of a pre-trained backbone network model during training, because the input image of a backbone network model generally has 3 channels; with a 2-channel input, the pre-trained parameters of the first layer of the backbone network could not be used effectively for initialization.
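A hedged PyTorch sketch of a down-sampling module with this structure (depthwise 5×5 stride-2 convolution over the 2-channel input, BatchNorm and ReLU, then a 1×1 convolution from 2 to 3 channels with BatchNorm); the class and layer names are illustrative, not the actual implementation:

```python
import torch
import torch.nn as nn

class DownSample(nn.Module):
    def __init__(self):
        super().__init__()
        # (2.1) depthwise conv: kernel 5x5, stride 2, padding 2, one filter per input channel
        self.dw = nn.Conv2d(2, 2, kernel_size=5, stride=2, padding=2, groups=2, bias=False)
        self.bn1 = nn.BatchNorm2d(2)
        self.act = nn.ReLU(inplace=True)
        # (2.2) 1x1 conv: 2 input channels -> 3 output channels, so a pretrained 3-channel
        # backbone's first-layer weights can be reused for initialization
        self.pw = nn.Conv2d(2, 3, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.bn1(self.dw(x)))
        return self.bn2(self.pw(x))

# Usage: a merged pair of preprocessed grayscale images, shape (N, 2, H, W)
# feats = DownSample()(torch.randn(1, 2, 360, 540))   # -> (1, 3, 180, 270)
```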
(3) I_D is input to the regression backbone network, which outputs the homography transformation parameters H_4pt in 4-point form. The regression backbone network is a convolutional network, including but not limited to a Residual Network (ResNet), a MobileNet, and the like.
(4) H_4pt is input to the DLT (Direct Linear Transformation) solver module to solve for the homography transformation matrix H.
The DLT (Direct Linear Transformation) solver module is a direct linear transformation module.
(5) The preprocessed image I_a^p is input to the ST (Spatial Transformer) Layer and transformed by the homography transformation matrix H to obtain I_a^{p'}.
The ST (Spatial Transformer) Layer is a spatial transformation layer.
(6) I_a^{p'} and I_b^p are input to the loss function module L, which computes the loss of the training process; L is calculated as:
$L = \dfrac{\sum_i \max\left(I_a^{p'}(i),\, I_b^{p}(i)\right)\cdot\left|I_a^{p'}(i) - I_b^{p}(i)\right|}{\sum_i I_b^{p}(i)}$
where the ith index position refers to a pixel position of the preprocessed image. The term $\max\left(I_a^{p'}(i),\, I_b^{p}(i)\right)$ takes the larger of the pixel values of I_a^{p'} and I_b^p at index i as the mask pixel value at index i. $\left|I_a^{p'}(i) - I_b^{p}(i)\right|$ is the absolute value of the difference between the pixel values of I_a^{p'} and I_b^p at index i, i.e., an L1 penalty. $\sum_i I_b^{p}(i)$ is the sum of the pixel values of the image I_b^p over all index positions.
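A minimal PyTorch sketch of this loss, using the notation above; the function name and the small constant guarding the division are illustrative assumptions:

```python
import torch

def edge_weighted_loss(Ia_pw: torch.Tensor, Ib_p: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """L = sum_i max(Ia_pw_i, Ib_p_i) * |Ia_pw_i - Ib_p_i|  /  sum_i Ib_p_i."""
    mask = torch.maximum(Ia_pw, Ib_p)                 # per-pixel mask: the larger of the two values
    num = (mask * (Ia_pw - Ib_p).abs()).sum()         # first accumulated sum
    den = Ib_p.sum()                                  # second accumulated sum
    return num / (den + eps)
```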
In the loss function shown in fig. 7, the mask images M'_a and M_b are multiplied directly; absent any other constraint, this causes M'_a × M_b to tend to 0 at most index positions during training, i.e., the pixel information at most index positions is ignored, which affects the robustness of the trained result.
To address this technical problem of the embodiment of fig. 7, in the loss function shown in fig. 8 and 9, I_a^{p'} and I_b^p themselves are used as the mask images. In the loss function, the mask value $\max\left(I_a^{p'}(i),\, I_b^{p}(i)\right)$ at index i is, at any stage of training, not smaller than the value of I_b^p at the corresponding position, which ensures that the pixel difference information at image edge positions is fully used in training; during training it is always greater than or equal to I_b^p and tends to become consistent with I_b^p.
Here a weight w is defined (its expression is given as an equation image; it corresponds to the ratio of the first accumulated sum to the second accumulated sum described above). w is larger for hard examples (training image pairs with moving objects or larger depth differences) and smaller for easy examples, which means that hard examples carry a larger weight than easy examples during training; this helps improve the performance of the algorithm.
In addition, in the testing stage, forward inference of the model only needs to be executed up to the DLT solver module, which outputs the homography transformation matrix H.
As shown in FIG. 13, embodiments of the present disclosure employ a training set of "dhpairs-train" that contains approximately 800k training data pairs.
The applicant found that, because the "dhpairs-train" training data lack geometric transformations such as vertical translation, rotation and scale change, their usefulness in applications such as AR anti-shake is limited. Therefore, the following augmentation methods may be applied to the image pair (img1, img2), individually or in combination (a sketch of these augmentations follows the list below):
(1) img2 is rotated by 90 degrees with a probability of 50%.
(2) img2 is rotated by a uniformly random angle between -1 degree and +1 degree.
(3) img2 is scaled by a uniformly random factor between -1% and +1%.
(4) img2 is translated horizontally and vertically by uniformly random offsets between -7.5 and +7.5 pixels.
(5) The order of img1 and img2 is swapped with a probability of 50%.
(6) After augmentation, the training data set "dhpairs-train+" is formed.
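For illustration, a minimal OpenCV sketch applying augmentations (1)–(5) to an image pair; the helper name and the use of a single affine warp to combine rotation, scaling and translation are illustrative assumptions:

```python
import random
import cv2
import numpy as np

def augment_pair(img1: np.ndarray, img2: np.ndarray):
    """Apply augmentations (1)-(5) above to (img1, img2); magnitudes follow the listed ranges."""
    h, w = img2.shape[:2]
    if random.random() < 0.5:                                      # (1) 90-degree rotation
        img2 = cv2.rotate(img2, cv2.ROTATE_90_CLOCKWISE)
        h, w = img2.shape[:2]
    angle = random.uniform(-1.0, 1.0)                              # (2) rotation in degrees
    scale = 1.0 + random.uniform(-0.01, 0.01)                      # (3) scale change of +/- 1%
    tx, ty = random.uniform(-7.5, 7.5), random.uniform(-7.5, 7.5)  # (4) translation in pixels
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    M[:, 2] += (tx, ty)
    img2 = cv2.warpAffine(img2, M, (w, h))
    if random.random() < 0.5:                                      # (5) swap the order of the pair
        img1, img2 = img2, img1
    return img1, img2
```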
As shown in fig. 13, the embodiments of the present disclosure are tested on "dhpairs-test", which contains about 4.2k test data pairs and comprises 5 subsets: RE (regular), LT (low texture), LL (low illumination), SF (containing a small foreground object), and LF (containing a large foreground object).
Each test data pair is labeled with 6 pairs of match points, such as the first set of match points 1302, the second set of match points 1304, the third set of match points 1306, the fourth set of match points 1308, the fifth set of match points 1310, and the sixth set of match points 1312 shown in fig. 13.
Further, the match error of a test image data pair is calculated as:
$\text{err} = \dfrac{1}{N}\sum_{i=1}^{N}\left\| \mathcal{H}\left(p_i^{(1)}\right) - p_i^{(2)} \right\|_2$
where $p_i^{(1)}$ are the coordinates of the ith labeled point of image 1 of the data pair, $p_i^{(2)}$ are the coordinates of the ith labeled point of image 2 of the data pair, N is the number of labeled matching points, and $\mathcal{H}(\cdot)$ denotes applying the estimated homography transformation to a point.
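For illustration, a small NumPy sketch of this metric; the function name and inputs are hypothetical:

```python
import numpy as np

def match_error(H: np.ndarray, pts1: np.ndarray, pts2: np.ndarray) -> float:
    """Mean L2 distance between H-transformed labeled points of image 1 and labeled points of image 2."""
    ones = np.ones((pts1.shape[0], 1))
    proj = np.hstack([pts1, ones]) @ H.T            # apply H to homogeneous coordinates
    proj = proj[:, :2] / proj[:, 2:3]               # perspective divide
    return float(np.linalg.norm(proj - pts2, axis=1).mean())

# Example with the 6 labeled match points per test pair described above (coordinates hypothetical):
# err = match_error(H_est, pts_img1, pts_img2)
```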
Training is carried out on the "dhpairs-train+" data set; the base learning rate is set to 1e-3, AdamW is selected as the optimizer, and a MobileNet tiny RFB backbone network is adopted.
In one embodiment of the present disclosure, the batch size is configured as 80 and the number of training iterations is 164k. The average match errors on "dhpairs-test" of a model trained with the prior-art method and a model trained with the method of the present disclosure are shown in Table 1 below:
TABLE 1
Method | Input size | RE | LT | LL | SF | LF | Avg
Prior art | 640x360 | 7.46 | 7.68 | 7.05 | 7.84 | 4.06 | 6.82
Present disclosure | 540x360 | 2.59 | 3.60 | 4.20 | 3.47 | 2.30 | 3.23
It can be seen that the method of the present disclosure has a significant reduction in average match error, on average about 52.6%, over the prior art methods.
With the lightweight design of the feature extraction and mask prediction network, the computation of the present disclosure compared with the prior art is shown in Table 2 below:
TABLE 2
Method | Input size | Computation (FLOPs)
Prior art | 640x360 | 349.8M
Present disclosure | 540x360 | 61.5M
Corresponding to the above method embodiment, the present disclosure also provides a training apparatus for an image synthesis model, which may be used to execute the above method embodiment.
Fig. 14 is a block diagram of an apparatus for training an image synthesis model according to an exemplary embodiment of the present disclosure.
Referring to fig. 14, the training apparatus 1400 of the image synthesis model may include:
a synthesizing module 1402, configured to perform synthesizing processing on the first grayscale image sample and the second grayscale mask image sample to obtain a synthesized image sample.
The training module 1404 is configured to input the synthesized image sample data into a convolutional neural network for training, where the output of the convolutional neural network is a homography transformation matrix.
The transforming module 1406 is configured to transform the first grayscale image sample according to the homography transformation matrix to obtain a first grayscale transformed image sample.
A determining module 1408 for determining a first pixel value for the i-th index position of the first gray scale transformed image sample, a second pixel value for the i-th index position of the second gray scale masked image sample, and determining a larger value between the first pixel value and the second pixel value.
The determining module 1408 is further configured to determine a training loss of the convolutional neural network according to the first pixel value, the second pixel value and the pixel values of all index positions of the second gray mask image sample, and adjust a configuration parameter of the convolutional neural network according to the training loss.
In an exemplary embodiment of the disclosure, before the synthesizing process is performed on the first gray-scale image sample and the second gray-scale mask image sample, the training device 1400 of the image synthesis model is further configured to: performing Gaussian filtering on the two gray level images to be synthesized to obtain a filtered image; carrying out gradient operator processing on the filtered image to obtain a gradient image; determining a gray scale maximum value in the gradient image, and dividing the pixel value of each point in the gradient image by the gray scale maximum value to obtain a normalized gray scale image sample; one of the grayscale image samples is determined to be a first grayscale image sample and another of the grayscale image samples is determined to be a second grayscale mask image sample.
In an exemplary embodiment of the present disclosure, the synthesis module 1402 is further configured to: combining the first gray level image sample and the second gray level mask image sample to obtain a combined image sample; and performing down-sampling processing on the merged image sample, and determining the result of the sampling processing as a synthesized image sample.
In an exemplary embodiment of the present disclosure, the training module 1404 is further configured to: inputting the synthesized image sample into a convolutional neural network, wherein the convolutional neural network executes 1 multiplied by 1 convolutional operation, the input channel of the convolutional neural network is 2, and the output channel of the convolutional neural network is 3; performing block normalization operation on the result of the output channel; inputting the result of the block normalization operation into a regression backbone network, and outputting a homography transformation parameter by the regression backbone network; and inputting the homography transformation parameters into a direct linear transformation layer, and outputting the homography transformation parameters into a homography transformation matrix by the direct linear transformation layer.
In an exemplary embodiment of the present disclosure, the convolutional neural network performs a convolution with a 5×5 kernel, a stride of 2, and a padding of 2.
In an exemplary embodiment of the disclosure, the transformation module 1406 is further for: and inputting the first gray level image sample into a spatial transformation layer, and transforming the first gray level image sample through a homography transformation matrix of the spatial transformation layer to obtain a first gray level transformation image sample.
In an exemplary embodiment of the disclosure, the determining module 1408 is further configured to: determine the absolute value of the difference between the first pixel value and the second pixel value at the ith index position; determine the pixel product between this absolute difference and the larger value at the ith index position; accumulate the pixel products over all index positions and determine the result as a first accumulated sum; accumulate the pixel values at all index positions of the second grayscale mask image sample and determine the result as a second accumulated sum; determine the training loss of the convolutional neural network according to the ratio of the first accumulated sum to the second accumulated sum; and adjust the configuration parameters of the convolutional neural network according to the training loss.
Since the functions of the apparatus 1400 have been described in detail in the corresponding method embodiments, the disclosure is not repeated herein.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module" or "system."
An electronic device 1500 according to this embodiment of the invention is described below with reference to fig. 15. The electronic device 1500 shown in fig. 15 is only an example and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 15, electronic device 1500 is in the form of a general purpose computing device. Components of electronic device 1500 may include, but are not limited to: the at least one processing unit 1510, the at least one memory unit 1520, and the bus 1530 that connects the various system components (including the memory unit 1520 and the processing unit 1510).
Wherein the memory unit stores program code that is executable by the processing unit 1510 to cause the processing unit 1510 to perform steps according to various exemplary embodiments of the present invention as described in the above section "exemplary methods" of the present specification. For example, the processing unit 1510 may perform a method as shown in embodiments of the present disclosure.
The storage unit 1520 may include readable media in the form of volatile storage units, such as a random access memory unit (RAM)15201 and/or a cache memory unit 15202, and may further include a read only memory unit (ROM) 15203.
Storage unit 1520 may also include a program/utility 15204 having a set (at least one) of program modules 15205, such program modules 15205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 1530 may be any bus representing one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 1500 may also communicate with one or more external devices 1540 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 1500, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 1500 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interface 1550. Also, the electronic device 1500 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 1560. As shown, the network adapter 1560 communicates with the other modules of the electronic device 1500 over the bus 1530. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 1500, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above section "exemplary methods" of the present description, when said program product is run on the terminal device.
The program product for implementing the above method according to an embodiment of the present invention may employ a portable compact disc read only memory (CD-ROM) and include program codes, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).
Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (9)

1. A method for training an image synthesis model, comprising:
synthesizing a first grayscale image sample and a second grayscale mask image sample to obtain a synthesized image sample;
inputting the synthesized image sample into a convolutional neural network for training, wherein an output of the convolutional neural network is a homography transformation matrix;
transforming the first grayscale image sample according to the homography transformation matrix to obtain a first grayscale transformed image sample;
determining a first pixel value at an ith index position of the first grayscale transformed image sample and a second pixel value at the ith index position of the second grayscale mask image sample, and determining the larger of the first pixel value and the second pixel value;
determining a training loss of the convolutional neural network according to the first pixel value, the second pixel value, and the pixel values of all index positions of the second grayscale mask image sample, and adjusting configuration parameters of the convolutional neural network according to the training loss, comprising:
determining an absolute value of a difference between the first pixel value and the second pixel value at the ith index position;
determining a pixel product of the absolute value of the difference at the ith index position and the larger value;
accumulating the pixel products over all index positions, and taking the accumulation result as a first accumulated sum;
accumulating the pixel values of all index positions of the second grayscale mask image sample, and taking the accumulation result as a second accumulated sum;
determining the training loss of the convolutional neural network according to the ratio of the first accumulated sum to the second accumulated sum;
and adjusting the configuration parameters of the convolutional neural network according to the training loss.
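As a readability aid only (not part of the claim language), the loss described above can be sketched in PyTorch roughly as follows; the tensor names, the (N, 1, H, W) shape, and the small eps stabilizer are illustrative assumptions rather than requirements of the claim.

    import torch

    def synthesis_training_loss(warped_gray: torch.Tensor,
                                mask_gray: torch.Tensor,
                                eps: float = 1e-8) -> torch.Tensor:
        """Sketch of the claim-1 loss: weighted absolute differences divided by
        the total pixel intensity of the mask sample.

        warped_gray: first grayscale transformed image sample, e.g. shape (N, 1, H, W)
        mask_gray:   second grayscale mask image sample, same shape
        """
        abs_diff = (warped_gray - mask_gray).abs()      # |first - second| at every index position
        larger = torch.maximum(warped_gray, mask_gray)  # larger of the two pixel values
        first_sum = (abs_diff * larger).sum()           # first accumulated sum
        second_sum = mask_gray.sum()                    # second accumulated sum
        return first_sum / (second_sum + eps)           # loss as the ratio of the two sums

Weighting each difference by the larger of the two pixel values emphasizes positions where either image has a strong response, while dividing by the mask's total intensity keeps the loss scale comparable across samples.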
2. The method for training an image synthesis model according to claim 1, wherein before synthesizing the first grayscale image sample and the second grayscale mask image sample, the method further comprises:
performing Gaussian filtering on each of the two grayscale images to be synthesized to obtain filtered images;
applying a gradient operator to each filtered image to obtain a gradient image;
determining the maximum gray value in each gradient image, and dividing the pixel value of each point in the gradient image by the maximum gray value to obtain a normalized grayscale image sample;
and determining one of the normalized grayscale image samples as the first grayscale image sample and the other as the second grayscale mask image sample.
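A minimal OpenCV/NumPy sketch of the preprocessing in claim 2, under assumed choices: the 5×5 Gaussian kernel and the Sobel operator are not specified by the claim, which only requires Gaussian filtering and a gradient operator.

    import cv2
    import numpy as np

    def preprocess_to_gradient_sample(gray: np.ndarray) -> np.ndarray:
        """Gaussian filtering -> gradient operator -> normalization by the maximum gray value."""
        blurred = cv2.GaussianBlur(gray, (5, 5), 0)                # Gaussian filtering
        gx = cv2.Sobel(blurred, cv2.CV_32F, 1, 0)                  # gradient operator, x direction
        gy = cv2.Sobel(blurred, cv2.CV_32F, 0, 1)                  # gradient operator, y direction
        gradient = cv2.magnitude(gx, gy)                           # gradient image
        max_gray = float(gradient.max())                           # maximum gray value
        return gradient / max_gray if max_gray > 0 else gradient   # normalized grayscale image sample

Applying this to each of the two input images yields the two normalized samples from which the first grayscale image sample and the second grayscale mask image sample are taken.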
3. The method for training an image synthesis model according to claim 1, wherein synthesizing the first grayscale image sample and the second grayscale mask image sample to obtain a synthesized image sample comprises:
merging the first grayscale image sample and the second grayscale mask image sample to obtain a merged image sample;
and performing downsampling on the merged image sample, and taking the downsampling result as the synthesized image sample.
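One illustrative reading of claim 3: the two single-channel samples are merged by channel-wise concatenation (consistent with the 2-channel network input of claim 4), followed by a simple downsampling step. The bilinear mode and scale factor below are assumptions.

    import torch
    import torch.nn.functional as F

    def synthesize(first_gray: torch.Tensor, second_mask: torch.Tensor,
                   scale: float = 0.5) -> torch.Tensor:
        """Concatenate the two single-channel samples along the channel axis, then downsample."""
        merged = torch.cat([first_gray, second_mask], dim=1)        # merged image sample, (N, 2, H, W)
        return F.interpolate(merged, scale_factor=scale,            # downsampled synthesized image sample
                             mode="bilinear", align_corners=False)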
4. The method for training an image synthesis model according to claim 1, wherein the step of inputting the synthesized image sample into the convolutional neural network for training, with the output of the convolutional neural network being the homography transformation matrix, specifically comprises:
inputting the synthesized image sample into the convolutional neural network, wherein the convolutional neural network performs a 1 × 1 convolution operation with 2 input channels and 3 output channels;
performing a block normalization operation on the results of the output channels;
inputting the result of the block normalization operation into a regression backbone network, the regression backbone network outputting homography transformation parameters;
and inputting the homography transformation parameters into a direct linear transformation layer, the direct linear transformation layer converting them into the homography transformation matrix.
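The pipeline of claim 4 might be organized as sketched below. The regression backbone and the direct linear transformation (DLT) layer are injected placeholders because the claim does not fix their internals here, and implementing the block normalization operation as batch normalization is an assumption.

    import torch
    import torch.nn as nn

    class HomographyEstimator(nn.Module):
        """Sketch of claim 4: 1x1 convolution (2 -> 3 channels), normalization,
        regression backbone, then a direct linear transformation (DLT) layer."""

        def __init__(self, backbone: nn.Module, dlt_layer: nn.Module):
            super().__init__()
            self.entry = nn.Conv2d(in_channels=2, out_channels=3, kernel_size=1)  # 1x1 convolution
            self.norm = nn.BatchNorm2d(num_features=3)  # assumed reading of the normalization step
            self.backbone = backbone                    # regresses the homography transformation parameters
            self.dlt = dlt_layer                        # converts the parameters into a 3x3 homography matrix

        def forward(self, synthesized_sample: torch.Tensor) -> torch.Tensor:
            x = self.norm(self.entry(synthesized_sample))
            params = self.backbone(x)                   # e.g. four corner offsets or eight parameters
            return self.dlt(params)                     # homography transformation matrix, shape (N, 3, 3)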
5. The method for training an image synthesis model according to claim 1, wherein the convolutional neural network uses a convolution kernel of 5 × 5, a stride of 2, and a padding of 2.
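In PyTorch terms, the convolution configuration of claim 5 corresponds to a layer like the one below; the channel counts are illustrative, since the claim fixes only the kernel size, stride, and padding.

    import torch.nn as nn

    # 5 x 5 kernel, stride 2, padding 2, as stated in claim 5; channel counts are illustrative
    conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=5, stride=2, padding=2)

With padding 2, a 5 × 5 kernel preserves the spatial extent, so each such layer reduces the feature-map resolution only through its stride of 2 (roughly halving the height and width).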
6. The method for training an image synthesis model according to claim 1, wherein transforming the first grayscale image sample according to the homography transformation matrix to obtain the first grayscale transformed image sample specifically comprises:
inputting the first grayscale image sample into a spatial transformation layer, and transforming the first grayscale image sample through the homography transformation matrix in the spatial transformation layer to obtain the first grayscale transformed image sample.
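One differentiable way to realize the spatial transformation layer of claim 6 is a perspective warp such as kornia's warp_perspective, sketched below; the choice of kornia is an implementation assumption, as the claim only requires warping the sample by the homography matrix.

    import torch
    from kornia.geometry.transform import warp_perspective

    def spatial_transform(first_gray: torch.Tensor, homography: torch.Tensor) -> torch.Tensor:
        """Warp the first grayscale image sample (N, 1, H, W) with a batch of
        3x3 homography matrices (N, 3, 3) to obtain the transformed sample."""
        _, _, h, w = first_gray.shape
        return warp_perspective(first_gray, homography, dsize=(h, w))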
7. An apparatus for training an image synthesis model, comprising:
a synthesis module, configured to synthesize a first grayscale image sample and a second grayscale mask image sample to obtain a synthesized image sample;
a training module, configured to input the synthesized image sample into a convolutional neural network for training, wherein an output of the convolutional neural network is a homography transformation matrix;
a transformation module, configured to transform the first grayscale image sample according to the homography transformation matrix to obtain a first grayscale transformed image sample;
a determining module, configured to determine a first pixel value at an ith index position of the first grayscale transformed image sample and a second pixel value at the ith index position of the second grayscale mask image sample, and to determine the larger of the first pixel value and the second pixel value;
wherein the determining module is further configured to determine a training loss of the convolutional neural network according to the first pixel value, the second pixel value, and the pixel values of all index positions of the second grayscale mask image sample, and to adjust configuration parameters of the convolutional neural network according to the training loss, comprising:
determining an absolute value of a difference between the first pixel value and the second pixel value at the ith index position;
determining a pixel product of the absolute value of the difference at the ith index position and the larger value;
accumulating the pixel products over all index positions, and taking the accumulation result as a first accumulated sum;
accumulating the pixel values of all index positions of the second grayscale mask image sample, and taking the accumulation result as a second accumulated sum;
determining the training loss of the convolutional neural network according to the ratio of the first accumulated sum to the second accumulated sum;
and adjusting the configuration parameters of the convolutional neural network according to the training loss.
8. An electronic device, comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the method of training an image synthesis model according to any one of claims 1-6 based on instructions stored in the memory.
9. A computer-readable storage medium on which a program is stored which, when executed by a processor, implements the method of training an image synthesis model according to any one of claims 1 to 6.
CN202110308488.4A 2021-03-23 2021-03-23 Training method and device of image synthesis model and electronic equipment Active CN113065585B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110308488.4A CN113065585B (en) 2021-03-23 2021-03-23 Training method and device of image synthesis model and electronic equipment

Publications (2)

Publication Number Publication Date
CN113065585A CN113065585A (en) 2021-07-02
CN113065585B true CN113065585B (en) 2021-12-28

Family

ID=76563443

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110308488.4A Active CN113065585B (en) 2021-03-23 2021-03-23 Training method and device of image synthesis model and electronic equipment

Country Status (1)

Country Link
CN (1) CN113065585B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113823308B (en) * 2021-09-18 2023-11-28 东南大学 Method for denoising voice by using single voice sample with noise

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886971A (en) * 2019-01-24 2019-06-14 西安交通大学 A kind of image partition method and system based on convolutional neural networks
CN111784822A (en) * 2020-07-02 2020-10-16 郑州迈拓信息技术有限公司 Smart city CIM real-time imaging method with image semantic perception
CN112115983A (en) * 2020-08-28 2020-12-22 浙大城市学院 Deep learning-based crop fruit sorting algorithm

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10699388B2 (en) * 2018-01-24 2020-06-30 Adobe Inc. Digital image fill
US10867195B2 (en) * 2018-03-12 2020-12-15 Microsoft Technology Licensing, Llc Systems and methods for monitoring driver state
CN109544482A (en) * 2018-11-29 2019-03-29 厦门美图之家科技有限公司 A kind of convolutional neural networks model generating method and image enchancing method
CN110598748B (en) * 2019-08-13 2021-09-21 清华大学 Heterogeneous image change detection method and device based on convolutional neural network fusion
CN111079556B (en) * 2019-11-25 2023-08-15 航天时代飞鸿技术有限公司 Multi-temporal unmanned aerial vehicle video image change region detection and classification method
CN111833237B (en) * 2020-01-19 2023-06-06 宁波大学 Image registration method based on convolutional neural network and local homography transformation
CN112084849A (en) * 2020-07-31 2020-12-15 华为技术有限公司 Image recognition method and device

Also Published As

Publication number Publication date
CN113065585A (en) 2021-07-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant