CN117474817B - Method for content unification of composite continuous images - Google Patents

Method for content unification of composite continuous images

Info

Publication number
CN117474817B
CN117474817B (application CN202311800961.6A)
Authority
CN
China
Prior art keywords
frame
cluster
image
video
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311800961.6A
Other languages
Chinese (zh)
Other versions
CN117474817A (en)
Inventor
翟晓东
汝乐
夏哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Austin Photoelectric Technology Co ltd
Original Assignee
Jiangsu Austin Photoelectric Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Austin Photoelectric Technology Co ltd filed Critical Jiangsu Austin Photoelectric Technology Co ltd
Priority to CN202311800961.6A priority Critical patent/CN117474817B/en
Publication of CN117474817A publication Critical patent/CN117474817A/en
Application granted granted Critical
Publication of CN117474817B publication Critical patent/CN117474817B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for content unification of synthetic continuous images, which comprises the following steps: Step 1, extract variance, mean, and color histogram features from each image frame in a synthetic video; Step 2, use the variance, mean, and color histogram features as input data of a K-means algorithm, group images with similar features into the same category, namely the same cluster, obtain the optimal feature cluster, take one image in the optimal feature cluster as a sample image, and adjust the contrast, brightness, and color histograms of the other images in the synthesized video according to the sample image. This ensures content consistency of the video and achieves the best effect.

Description

Method for content unification of composite continuous images
Technical Field
The invention belongs to the field of video synthesis, and in particular relates to a method for unifying, along the timeline, pictures synthesized frame by frame whose appearance varies because of discontinuous texture and motion changes.
Background Art
The field of image synthesis is now mature and can achieve virtual effects so convincing that people cannot tell real from fake. In many cases, however, these composite images are each generated from semantics, so correlation between frames is lacking. Even with template-video schemes, the synthesized result still exhibits "jerkiness": between consecutive frames, the perceived change is abrupt because texture and motion change discontinuously.
Existing schemes fall mainly into two categories: video-to-video translation, and post-processing of per-frame transformed video.
The first, video-to-video translation, adds a temporal consistency loss to the design and training of the network to improve temporal correlation. However, this approach has two drawbacks. First, it requires knowledge of the correlation to redesign the algorithm and retrain the deep model, and it requires video datasets for training; video datasets are quite scarce, especially for supervised algorithms. Second, these methods are slow because optical flow must be computed at test time.
The second scheme performs post-processing on the per-frame transformed (image-enhanced) video so that it becomes temporally consistent. This post-processing technique does not require retraining the image enhancement algorithm; any image enhancement algorithm can be applied to the original video and temporal consistency is then achieved.
However, the second scheme is based on the premise that frame 1 is completely reliable and then propagates from frame 1 to frames 2 and 3, and so on until the last frame. Its disadvantage is obvious: a highly satisfactory frame 1 must be selected manually or by machine, and if frame 1 is defective, all subsequent frames are affected.
Disclosure of Invention
Facing the existing problems of the per-frame video post-processing schemes, and based on existing artificial intelligence technology, the invention provides a method for content unification of synthetic video, which comprises the following steps:
Step 1, extract variance, mean, and color histogram features from each image frame in the synthetic video;
Step 2, use the variance, mean, and color histogram features as input data of a K-means algorithm, group images with similar features into the same category, namely the same cluster, obtain the optimal feature cluster, take one image in the optimal feature cluster as a sample image, and adjust the contrast, brightness, and color histograms of the other images in the synthesized video according to the sample image.
Further, in step 2, taking one image in the optimal feature cluster as the sample image specifically comprises: calculating the cluster center of the optimal feature cluster; for each sample point (i.e., image) in the optimal feature cluster, calculating its Euclidean distance to the cluster center; and taking the sample point closest to the cluster center as the sample image.
Further, the method for content unification of synthetic video of the invention further comprises step 3: performing temporal unification processing on the synthesized video with an image conversion network to obtain the processed video image frame O_t at time t. Specifically, let the first frame O_1 = P_1; the currently processed frame P_t in the synthesized video, the original video frame I_t, the original video frame I_{t-1}, and the output frame O_{t-1} of the previous time are input into the image conversion network, and after temporal unification processing the video frame O_t at time t is output; the original video refers to the video before synthesis.
Further, in step 2, the within-cluster sum of squares WCSS and the silhouette coefficient index are used to select the optimal feature cluster, specifically:
The within-cluster sum of squares WCSS is calculated as:
WCSS = Σ_{i=1}^{n} || x_i - C_k ||²
where i indexes the sample points (i.e., the histogram feature maps), x_i is the i-th sample, C_k is the cluster center, and n represents the number of sample points in the cluster;
The silhouette coefficient S(i) of a single sample is expressed as:
S(i) = (b(i) - a(i)) / max(a(i), b(i))
where a(i) represents the cohesion of the sample point, b(i) represents the minimum of the distances between the sample point and the other classes, and a(i) is calculated as follows:
a(i) = (1 / (n - 1)) Σ_{j≠i} distance(i, j)
where j represents the other sample points in the same cluster as sample i, and distance(i, j) represents the distance between sample point i and sample point j.
Further, in step 3, the image conversion network is an encoder-decoder architecture, and a ConvLSTM module is inserted into the encoder-decoder.
The image conversion network comprises an encoder, a ConvLSTM module and a decoder which are sequentially linked, and skip connection is added between the encoder and the decoder; the encoder comprises a first downsampling convolution layer, a second downsampling convolution layer, a splicing layer and a residual block, wherein a normalization layer is arranged behind each downsampling convolution layer;
The currently processed frame P_t and the output frame O_{t-1} of the previous time are input to the first downsampling convolution layer, and the original video frames I_t and I_{t-1} are input to the second downsampling convolution layer; after being downsampled separately, they are spliced in the splicing layer and then, after passing through the residual block and the ConvLSTM module, decoded by the decoder.
Further, the overall loss function for training the image conversion network is:
L = λ_f·L_f + λ_st·L_st + λ_lt·L_lt
where L_f is the overall feature consistency loss, L_st is the short-term loss, L_lt is the long-term loss, and λ_f, λ_st and λ_lt are the weights of the overall feature consistency loss, the short-term loss, and the long-term loss, respectively.
The feature consistency loss is calculated by using the relu1-2 layer of the pretrained VGG-19 to extract the shallow feature information of the image:
L_f1 = || μ(φ(O_t)) - μ(φ(I_t)) || + || σ(φ(O_t)) - σ(φ(I_t)) ||
where μ(·) represents the average over the channel dimension, σ(·) represents the standard deviation, O_t^(i) ∈ R³ denotes the RGB pixel value of the output at time t, and φ_l(·) represents the feature activation of the VGG-19 network at layer l;
at the same time, a feature consistency constraint is also imposed between O_t and O_{t-1}:
L_f2 = || μ(φ(O_t)) - μ(φ(O_{t-1})) || + || σ(φ(O_t)) - σ(φ(O_{t-1})) ||
Thus, the overall feature consistency loss function L_f is:
L_f = L_f1 + L_f2
The short-term loss L_st is expressed as:
L_st = Σ_{t=2}^{T} Σ_i M_{t⇒t-1}^(i) || O_t^(i) - Ô_{t-1}^(i) ||_1
where Ô_{t-1} is the image obtained after warping the frame O_{t-1} with the optical flow F_{t⇒t-1}, and M_{t⇒t-1} is a visibility mask calculated from the warp error between the input frame I_t and the warped input frame Î_{t-1}; the optical flow F_{t⇒t-1} is the backward flow between I_t and I_{t-1}.
A long-term temporal loss L_lt is applied between the first output frame and all output frames:
L_lt = Σ_{t=2}^{T} Σ_i M_{t⇒1}^(i) || O_t^(i) - Ô_1^(i) ||_1
where M_{t⇒1} is a visibility mask calculated from the warp error between the input frame I_t and the warped input frame Î_1.
Beneficial effects: the method for content unification of synthetic video provided by the invention finds the optimal features among all images of the video and migrates all images toward those optimal features, thereby ensuring content consistency of the video and achieving the best effect; in addition, the invention redesigns the loss function of the network.
Drawings
FIG. 1 is a flow chart of dominant frequency signal feature selection in an embodiment of the invention;
FIG. 2 is a diagram of the internal structure of ConvLSTM in an embodiment of the invention;
FIG. 3 is a diagram of an encoder-decoder architecture in an embodiment of the invention;
fig. 4 is a diagram of an encoder architecture in an embodiment of the present invention.
Detailed Description
The invention is based on the premise that the input consecutive single-frame images are composite images synthesized with a template video as the reference object, i.e., they form a continuous video in which every frame is correlated. For example, the template video Vmp is a video of two professional dancers (MP), one male and one female, dancing; the synthesized videos Vhp1 and Vhp2 are, respectively, a video of a male dancer imitating the motion of the male dancer in the template video Vmp and a video of a female dancer imitating the motion of the female dancer in Vmp; the synthesized video is a new video Vin containing all the people to be synthesized (HP). The video Vin is correlated in its action content, unlike an unrelated video in which, for example, the first frame is a still image of a cat and the second frame is a still image of a dog.
Under this condition, the invention works as follows:
in the composite video, due to the fact that the original pictures in the earlier stage are subjected toPicture set synthesized one by one in individual picture formUnder different training iteration cycles or input conditions, the style or color contrast distribution of each frame of picture of the video may be inconsistent. Therefore, the invention needs to carry out post-processing steps on the enhanced image, searches the optimal dominant frequency style signal, so as to restrict the overall video frame style and achieve the video with better overall consistency. The invention adopts image clustering based on principal component analysis (Principal Component Analysis, PCA) and K-means algorithm to find the optimal characteristics.
Step 1, feature extraction
Variance, mean, and color histogram features are extracted from each image frame in the composite video; the variance and the mean reflect the brightness and contrast characteristics of the image, and the color histogram describes the color distribution within the image.
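As an illustration, a minimal Python sketch of this feature-extraction step is given below; it assumes OpenCV and NumPy are available, and the function names and the histogram bin count are illustrative choices rather than values fixed by the invention.

```python
import cv2
import numpy as np

def extract_frame_features(frame_bgr, hist_bins=16):
    """Variance, mean, and a normalized per-channel color histogram for one frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    variance = float(gray.var())   # reflects contrast
    mean = float(gray.mean())      # reflects brightness
    hists = []
    for ch in range(3):            # B, G, R histograms, normalized by pixel count
        h = cv2.calcHist([frame_bgr], [ch], None, [hist_bins], [0, 256]).ravel()
        hists.append(h / (h.sum() + 1e-8))
    return np.concatenate([[variance, mean], *hists])

def video_feature_matrix(video_path):
    """Stack one feature vector per frame of the composite video."""
    cap = cv2.VideoCapture(video_path)
    feats = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        feats.append(extract_frame_features(frame))
    cap.release()
    return np.stack(feats)
```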
Step 2, searching optimal characteristics through clustering
The K-means algorithm divides a set of data points into K different clusters; each cluster consists of the data points assigned to it, and the data points within a cluster are similar to each other.
The dimensionality-reduced images are clustered with the K-means algorithm: the variance, mean, and color histogram features are used as the input data, images with similar features are grouped into the same category (the same cluster), and the central mean of each cluster is found. For example, if among 10 images the color histograms of 7 are reddish and those of 3 are yellowish, two clusters may result. The optimal feature cluster is then selected using the within-cluster sum of squares and the silhouette coefficient index. The within-cluster sum of squares (Within-Cluster Sum of Squares, WCSS) is the sum of the squared Euclidean distances between the samples in each cluster and the cluster center; a smaller WCSS value indicates a higher degree of tightness of the samples within the cluster. It is calculated as:
WCSS = Σ_{i=1}^{n} || x_i - C_k ||²
where i indexes the sample points (i.e., the histogram feature maps), x_i is the i-th sample, C_k is the cluster center, and n denotes the number of sample points in the cluster; the value of WCSS is the sum of the squared distances of each data point from its cluster center.
The silhouette coefficient is calculated for each sample and represents both the similarity between the sample and the other samples in its own cluster and the dissimilarity between the sample and the nearest neighboring cluster. The silhouette coefficient of a single sample is calculated as:
S(i) = (b(i) - a(i)) / max(a(i), b(i))
where a(i) represents the cohesion of the sample point and is calculated as follows:
a(i) = (1 / (n - 1)) Σ_{j≠i} distance(i, j)
where j represents the other sample points in the same cluster as sample i, distance(i, j) represents the distance between i and j, and n represents the number of sample points in the cluster; a smaller a(i) therefore indicates a tighter class.
b(i) is calculated in the same way as a(i), except that the other clusters (m in total) must be traversed, which yields several values b_m(i); the smallest of these is selected as the final result:
b(i) = min_m b_m(i)
So the silhouette coefficient of a single sample S(i) can be written as:
S(i) = (b(i) - a(i)) / max(a(i), b(i))
From the above it can be seen that when b(i) > a(i), i.e., when the within-class distance is smaller than the between-class distance, the clustering result is compact; the value of S(i) then approaches 1, and the closer it is to 1, the more distinct the silhouette. Conversely, when b(i) < a(i), i.e., when the within-class distance is larger than the between-class distance, the clustering result is loose; the value of S(i) then approaches -1, and the closer it is to -1, the worse the clustering effect.
The optimal feature cluster is obtained from the silhouette coefficient S(i) of the individual samples. Within the optimal feature cluster, the Euclidean distance between each sample point (i.e., image) in the cluster and the cluster center is calculated, and the sample point closest to the center is taken as the optimal sample image sought by the invention. With this sample image as the standard reference, the contrast, brightness, and color histogram of each target image are adjusted correspondingly so that the target images have visual characteristics consistent with the sample image.
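The selection of the sample image and the adjustment of the remaining frames could then be sketched as follows; skimage.exposure.match_histograms (scikit-image 0.19 or later, with the channel_axis argument) is used here as one plausible way to transfer the sample image's color histogram, brightness, and contrast, and is an assumption rather than an operation prescribed by the patent.

```python
import numpy as np
from skimage.exposure import match_histograms

def pick_sample_frame(features, km, best_cluster):
    """Index of the frame closest (Euclidean) to the center of the optimal feature cluster."""
    idx = np.where(km.labels_ == best_cluster)[0]
    center = km.cluster_centers_[best_cluster]
    dists = np.linalg.norm(features[idx] - center, axis=1)
    return int(idx[np.argmin(dists)])

def align_to_sample(frames, sample_idx):
    """Match every frame's color histogram to the sample frame (RGB images as arrays)."""
    ref = frames[sample_idx]
    return [match_histograms(f, ref, channel_axis=-1) for f in frames]
```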
Step 3, image unification
The invention uses a deep recursive network as the basic propagation module; the idea of the recursive network is to infer the current output by combining all previous frame information with the current frame information. The original video I_t and the currently synthesized video P_t are taken as input, and a temporally consistent output video O_t is generated, where t ranges from 1 to T and T is the total number of frames in the original video. The first output frame is given by P_1, i.e., the first output frame is set to O_1 = P_1. At each time step, the image conversion network learns to generate an output frame O_t that is temporally consistent with O_{t-1}. The current output frame is then used as input for the next time step.
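A minimal sketch of this recursive inference loop is shown below; it assumes the frames are PyTorch tensors of shape (1, 3, H, W) with values in [0, 1] and that net implements the image conversion network F with the four inputs described above, an interface assumed here for illustration.

```python
import torch

@torch.no_grad()
def unify_video(net, processed_frames, original_frames):
    """Recursive inference: O_1 = P_1, then O_t = P_t + F(P_t, O_{t-1}, I_t, I_{t-1})."""
    outputs = [processed_frames[0]]                    # O_1 = P_1
    for t in range(1, len(processed_frames)):
        p_t, o_prev = processed_frames[t], outputs[-1]
        i_t, i_prev = original_frames[t], original_frames[t - 1]
        residual = net(p_t, o_prev, i_t, i_prev)       # image conversion network F
        outputs.append((p_t + residual).clamp(0, 1))   # residual prediction
    return outputs
```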
Step 3.1 image converting network structure
The image conversion network consists of a classical encoder-decoder architecture. To capture the spatio-temporal correlation of video, a ConvLSTM module is inserted in the encoder-decoder network, as shown in fig. 4.
By integrating the ConvLSTM module into the image conversion network, the network can capture timing information in the video sequence and learn the spatio-temporal correlation between video frames using the memory unit of the ConvLSTM module. The ConvLSTM module compresses the information into a hidden state that can be used to estimate the current state; this hidden state captures spatial information of the entire input sequence and allows the ConvLSTM module to learn temporal consistency coherently. Combined with the temporal losses, the ConvLSTM module gives satisfactory results in promoting the temporal consistency of the video-style transfer network. The ConvLSTM module turns the 2D input of an LSTM into a 3D tensor whose last two dimensions are spatial (rows and columns). For the data at each time t, the ConvLSTM module replaces part of the connection operations in the LSTM with convolution operations, i.e., predictions are made from the current input and the past states of local neighbors: a convolution operation that extracts spatial features is added to the LSTM network, replacing part of its connection operations. The internal structure of the ConvLSTM module is shown in FIG. 2, where H_{t-1} denotes the hidden-layer state of the neuron at time t-1 and C_{t-1} denotes the output of the neuron at time t-1. X_t denotes the value of the time-series data X at time t; in the present embodiment the quantity fed in at time t is kept under the parameter symbol X to facilitate comparison with the original ConvLSTM.
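For reference, a generic ConvLSTM cell can be sketched in PyTorch as below; this is the standard formulation in which the LSTM gates are computed by a convolution over the concatenated input and hidden state, not the patent's exact module.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: LSTM gates computed with convolutions instead of matmuls."""
    def __init__(self, in_ch, hid_ch, kernel_size=3):
        super().__init__()
        self.hid_ch = hid_ch
        # One convolution produces all four gates (i, f, o, g) at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel_size,
                               padding=kernel_size // 2)

    def forward(self, x, state=None):
        b, _, h, w = x.shape
        if state is None:                               # zero initial state
            h_prev = x.new_zeros(b, self.hid_ch, h, w)
            c_prev = x.new_zeros(b, self.hid_ch, h, w)
        else:
            h_prev, c_prev = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h_prev], dim=1)), 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c_prev + i * g          # cell state C_t
        h_new = o * torch.tanh(c)       # hidden state H_t
        return h_new, (h_new, c)
```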
The present invention uses the basic encoder-decoder architecture shown in fig. 3. The input to the image conversion network comprises the currently processed frame P_t, the previous output frame O_{t-1}, and the unprocessed frames I_t and I_{t-1} at the current and previous time steps. Since the output frame is usually similar to the currently processed frame, the image conversion network is trained to predict the residual rather than the actual pixel values, i.e. O_t = P_t + F(P_t, O_{t-1}, I_t, I_{t-1}), where F denotes the image conversion network. P_t and O_{t-1} are used to learn content and feature information, while the temporal information between O_t and O_{t-1} is constrained by learning the spatio-temporal relationship between the unprocessed frames I_t and I_{t-1}.
The encoder consists of two downsampling strided convolution layers, each followed by instance normalization. The encoder is followed by 5 residual blocks and a ConvLSTM module. The decoder, placed after the ConvLSTM module, consists of two transposed convolution layers, each followed by instance normalization.
As shown in fig. 4, a skip connection is added between the encoder and the decoder to provide high reconstruction quality and reduce information loss. The skip connection lets low-level features pass directly to the decoder so that it can also access higher-level feature representations from the encoder, which helps the network retain detail during reconstruction. However, a skip connection may also transfer low-level information (e.g., color) from the input frames to the output frame and create visual artifacts. Therefore, the input to the encoder is split into two streams: one stream for the processed frames P_t and O_{t-1}, and another stream for the input frames I_t and I_{t-1}. Only the skip connection from the processed-frame stream is added, to avoid transmitting low-level information from the input frames.
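Putting these pieces together, one possible PyTorch sketch of the two-stream encoder, residual blocks, ConvLSTM, and decoder with a skip connection taken only from the processed-frame stream is given below; the channel widths, kernel sizes, and the concatenation-style skip connection are assumptions, and it reuses the ConvLSTMCell sketched above.

```python
import torch
import torch.nn as nn

def conv_in_relu(in_ch, out_ch, stride):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1),
                         nn.InstanceNorm2d(out_ch), nn.ReLU(inplace=True))

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(conv_in_relu(ch, ch, 1),
                                  nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))
    def forward(self, x):
        return x + self.body(x)

class TransformNet(nn.Module):
    """Two-stream encoder -> residual blocks -> ConvLSTM -> decoder; the skip
    connection is taken only from the processed-frame stream. Call reset() per video."""
    def __init__(self, ch=32):
        super().__init__()
        self.enc_p = nn.Sequential(conv_in_relu(6, ch, 2), conv_in_relu(ch, ch, 2))  # P_t, O_{t-1}
        self.enc_i = nn.Sequential(conv_in_relu(6, ch, 2), conv_in_relu(ch, ch, 2))  # I_t, I_{t-1}
        self.res = nn.Sequential(*[ResidualBlock(2 * ch) for _ in range(5)])
        self.lstm = ConvLSTMCell(2 * ch, 2 * ch)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(3 * ch, ch, 4, stride=2, padding=1),
            nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(ch, 3, 4, stride=2, padding=1))
        self.state = None

    def reset(self):
        self.state = None

    def forward(self, p_t, o_prev, i_t, i_prev):
        f_p = self.enc_p(torch.cat([p_t, o_prev], dim=1))   # processed-frame stream
        f_i = self.enc_i(torch.cat([i_t, i_prev], dim=1))   # input-frame stream
        feat = self.res(torch.cat([f_p, f_i], dim=1))       # splice layer + residual blocks
        feat, self.state = self.lstm(feat, self.state)
        # Skip connection only from the processed stream; output is the residual image.
        return self.dec(torch.cat([feat, f_p], dim=1))
```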
Step 3.2 loss function
The object of the present invention is to reduce temporal inconsistency in the output video while maintaining content and feature similarity to the processed frames.
[1] Content-aware loss
The perceptual loss, which measures the similarity between O_t and P_t, is computed with a pretrained VGG classification network and is defined as:
L_p = (1/N) Σ_i || φ_l(O_t)^(i) - φ_l(P_t)^(i) ||_1
where O_t^(i) ∈ R³ denotes the RGB pixel value of the output at time t at pixel i, N is the total number of pixels in the frame, and φ_l(·) denotes the feature activation of the VGG-19 network at layer l. The relu4-3 layer is selected to calculate the perceptual loss.
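A sketch of this perceptual (content-aware) loss follows, using torchvision's pretrained VGG-19; the layer indices used for relu1_2 and relu4_3, the L1 distance, and the weights argument (recent torchvision) are stated assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg19

class VGGFeatures(nn.Module):
    """Frozen VGG-19 slices: relu1_2 for the feature consistency loss, relu4_3 for content."""
    def __init__(self):
        super().__init__()
        feats = vgg19(weights="IMAGENET1K_V1").features.eval()
        self.relu1_2 = nn.Sequential(*feats[:4])    # conv1_1 .. relu1_2
        self.relu4_3 = nn.Sequential(*feats[:25])   # conv1_1 .. relu4_3
        for p in self.parameters():
            p.requires_grad_(False)

def content_loss(vgg, o_t, p_t):
    """Mean L1 distance between relu4_3 activations of O_t and P_t (inputs assumed
    to be ImageNet-normalized tensors of shape (B, 3, H, W))."""
    return F.l1_loss(vgg.relu4_3(o_t), vgg.relu4_3(p_t))
```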
[2] Feature consistency loss
The content-aware loss considers a pixel-level comparison between O_t and P_t; such a loss is too strict and may cause the overall features to deviate from the original video. The invention therefore proposes a new feature consistency loss aimed at ensuring that the generated video frames do not change the feature distribution of the original video.
The relu1-2 layer of the pretrained VGG-19 described above is also used to extract shallow feature information of the image. The feature consistency loss is:
L_f1 = || μ(φ(O_t)) - μ(φ(I_t)) || + || σ(φ(O_t)) - σ(φ(I_t)) ||
where μ(·) represents the average over the channel dimension and σ(·) represents the standard deviation.
At the same time, the generated image O_t may lose feature association with its preceding and following frames. Therefore, a feature consistency constraint is also imposed between O_t and O_{t-1}:
L_f2 = || μ(φ(O_t)) - μ(φ(O_{t-1})) || + || σ(φ(O_t)) - σ(φ(O_{t-1})) ||
Thus, the overall feature consistency loss function is:
L_f = L_f1 + L_f2
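The feature consistency loss can then be sketched as below, reusing the VGGFeatures module above; comparing channel-wise mean and standard deviation of the relu1_2 features against the original frame I_t and the previous output O_{t-1} follows the description, while the specific L1 distance between the statistics is an assumption.

```python
import torch

def stat_loss(f_a, f_b):
    """L1 distance between channel-wise mean and std of two feature maps (B, C, H, W)."""
    mu = (f_a.mean(dim=(2, 3)) - f_b.mean(dim=(2, 3))).abs().mean()
    sd = (f_a.std(dim=(2, 3)) - f_b.std(dim=(2, 3))).abs().mean()
    return mu + sd

def feature_consistency_loss(vgg, o_t, o_prev, i_t):
    """L_f = L_f1 + L_f2: keep O_t's shallow feature statistics close to the original
    frame I_t and to the previous output O_{t-1} (relu1_2 features)."""
    f_o, f_op, f_i = vgg.relu1_2(o_t), vgg.relu1_2(o_prev), vgg.relu1_2(i_t)
    return stat_loss(f_o, f_i) + stat_loss(f_o, f_op)
```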
[3] Temporal loss
Short-term temporal loss. The temporal loss is formulated as a warp error between output frames:
L_st = Σ_{t=2}^{T} Σ_i M_{t⇒t-1}^(i) || O_t^(i) - Ô_{t-1}^(i) ||_1
where Ô_{t-1} is the image obtained after warping the frame O_{t-1} with the optical flow F_{t⇒t-1}, and M_{t⇒t-1} = exp(-α || I_t - Î_{t-1} ||²) is a visibility mask calculated from the warp error between the input frame I_t and the warped input frame Î_{t-1}; the optical flow F_{t⇒t-1} is the backward flow between I_t and I_{t-1}. The optical flow is computed with FlowNet2, frames are warped with a bilinear sampling layer, and α is set empirically (pixel values lie in the range [0, 1]).
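A sketch of the short-term temporal loss with backward warping and the exponential visibility mask is given below; the flow layout (B, 2, H, W), the bilinear grid_sample warping, and the mean reduction are implementation assumptions, and α is left as an explicit parameter because the patent only states that it is set empirically.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp img (B, C, H, W) with a dense flow field (B, 2, H, W): x-offset, y-offset."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=img.device),
                            torch.arange(w, device=img.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow
    gx = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0      # normalize to [-1, 1] for grid_sample
    gy = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0
    return F.grid_sample(img, torch.stack((gx, gy), dim=-1),
                         mode="bilinear", align_corners=True)

def short_term_loss(o_t, o_prev, i_t, i_prev, flow, alpha):
    """L_st: visibility-masked L1 warp error between O_t and the warped previous output."""
    o_prev_w = warp(o_prev, flow)
    i_prev_w = warp(i_prev, flow)
    # Visibility mask computed from the warp error of the *input* frames.
    mask = torch.exp(-alpha * (i_t - i_prev_w).pow(2).sum(dim=1, keepdim=True))
    return (mask * (o_t - o_prev_w).abs()).mean()
```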
Long-term temporal loss. Although the short-term temporal loss enforces temporal consistency between successive frames, long-term coherence (e.g., beyond 5 frames) is not guaranteed. A straightforward way to enforce long-term temporal consistency is to apply a temporal loss over all pairs of output frames; however, such a strategy incurs a significant computational cost (e.g., for optical flow estimation), and it is meaningless to calculate the temporal loss between two intermediate outputs before the network converges. Therefore, a long-term temporal loss is imposed between the first output frame and all output frames:
L_lt = Σ_{t=2}^{T} Σ_i M_{t⇒1}^(i) || O_t^(i) - Ô_1^(i) ||_1
During training, long-term temporal coherence is enforced over at most 10 frames (T = 10).
[4] Total loss
The overall loss function for training the image conversion network is:
L = λ_f·L_f + λ_st·L_st + λ_lt·L_lt
where λ_f, λ_st and λ_lt are the weights of the feature consistency loss, the short-term loss, and the long-term loss, respectively.
By implementing the invention, the existing problems of per-frame video post-processing schemes can be addressed: the optimal features among all the images are identified automatically and all images are migrated toward those optimal features, thereby ensuring content consistency of the whole video and achieving the best effect.

Claims (9)

1. A method for content unification of composite video, comprising the steps of:
step 1, extracting variance, mean value and color histogram features from each image frame in a synthetic video;
step 2, taking the variance, the mean value and the color histogram feature as input data of a K-means algorithm, classifying images with similar features into the same category, namely the same cluster, obtaining an optimal feature cluster, taking one image in the optimal feature cluster as a sample image, and adjusting contrast, brightness and color histogram of other images in the synthesized video according to the sample image;
the method for acquiring the optimal feature cluster specifically comprises the following steps:
the calculation formula of the intra-cluster sum of squares WCSS is as follows:
WCSS = Σ_{i=1}^{n} || x_i - C_k ||²
wherein i indexes the sample points (i.e., the histogram feature maps), x_i is the i-th sample point, C_k is the cluster center, and n represents the number of sample points in the cluster;
the silhouette coefficient S(i) of a single sample is expressed as:
S(i) = (b(i) - a(i)) / max(a(i), b(i))
where a(i) represents the cohesion of the sample point, b(i) represents the minimum of the distances between the sample point and the other classes, and a(i) is calculated as follows:
a(i) = (1 / (n - 1)) Σ_{j≠i} distance(i, j)
where j represents the other sample points in the same cluster as sample i, and distance represents the distance between sample point i and sample point j.
2. The method for content unification of composite video according to claim 1, further comprising step 3 of performing temporal unification processing on the composite video using an image conversion network to obtain a processed video image frame O_t at time t;
specifically, let the first frame O_1 = P_1; the currently processed frame P_t in the synthesized video, the original video frame I_t, the original video frame I_{t-1}, and the output frame O_{t-1} of the previous time are input into the image conversion network, and after temporal unification processing the video frame O_t at time t is output; the original video refers to the video before synthesis.
3. The method for content unification of composite video according to claim 1, wherein one image in the optimal feature cluster is taken as a sample image, specifically:
and calculating the cluster center of the optimal feature cluster in the optimal feature cluster, and for each sample point in the optimal feature cluster, calculating the distance between each sample point in the cluster and the cluster center by using Euclidean distance to find the sample point closest to the cluster center as a sample image.
4. The method of claim 2, wherein in step 3, the image conversion network is an encoder-decoder architecture, and wherein the ConvLSTM module is inserted into the encoder-decoder.
5. The method for content unification of synthetic video according to claim 2, wherein the image conversion network comprises an encoder, a ConvLSTM module, and a decoder linked in sequence, and a skip connection is added between the encoder and the decoder;
the encoder comprises a first downsampling convolution layer, a second downsampling convolution layer, a splicing layer and a residual block, wherein a normalization layer is arranged behind each downsampling convolution layer;
the currently processed frame P_t and the output frame O_{t-1} of the previous time are input to the first downsampling convolution layer, and the original video frames I_t and I_{t-1} are input to the second downsampling convolution layer; after being downsampled separately, they are spliced in the splicing layer and then, after passing through the residual block and the ConvLSTM module, decoded by the decoder.
6. A method for content unification of composite video according to claim 2, wherein the overall loss function for training the image conversion network is:
L = λ_f·L_f + λ_st·L_st + λ_lt·L_lt
wherein L_f is the overall feature consistency loss, L_st is the short-term loss, L_lt is the long-term loss, and λ_f, λ_st and λ_lt are the weights of the overall feature consistency loss, the short-term loss, and the long-term loss, respectively.
7. The method for content unification of composite video according to claim 6, wherein the relu1-2 layer of the pretrained VGG-19 is used to extract shallow feature information of the image, and the feature consistency loss is:
L_f1 = || μ(φ(O_t)) - μ(φ(I_t)) || + || σ(φ(O_t)) - σ(φ(I_t)) ||
where μ(·) represents the average over the channel dimension, σ(·) represents the standard deviation, O_t^(i) ∈ R³ denotes the RGB pixel value of the output O at time t, and φ_l(·) represents the feature activation of the VGG-19 network at layer l;
at the same time, a feature consistency constraint is also imposed between O_t and O_{t-1}:
L_f2 = || μ(φ(O_t)) - μ(φ(O_{t-1})) || + || σ(φ(O_t)) - σ(φ(O_{t-1})) ||
thus, the overall feature consistency loss function L_f is:
L_f = L_f1 + L_f2.
8. The method for content unification of synthetic video according to claim 6, wherein the short-term loss L_st is expressed as:
L_st = Σ_{t=2}^{T} Σ_i M_{t⇒t-1}^(i) || O_t^(i) - Ô_{t-1}^(i) ||_1
wherein Ô_{t-1} is the image obtained after warping the frame O_{t-1} with the optical flow F_{t⇒t-1}, and M_{t⇒t-1} is a visibility mask calculated from the warp error between the input frame I_t and the warped input frame Î_{t-1}; the optical flow F_{t⇒t-1} is the backward flow between I_t and I_{t-1}.
9. The method for content unification of synthetic video according to claim 6, wherein the long-term temporal loss L_lt applied between the first output frame and all output frames is expressed as:
L_lt = Σ_{t=2}^{T} Σ_i M_{t⇒1}^(i) || O_t^(i) - Ô_1^(i) ||_1
where M_{t⇒1} is a visibility mask calculated from the warp error between the input frame I_t and the warped input frame Î_1.
CN202311800961.6A 2023-12-26 2023-12-26 Method for content unification of composite continuous images Active CN117474817B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311800961.6A CN117474817B (en) 2023-12-26 2023-12-26 Method for content unification of composite continuous images

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311800961.6A CN117474817B (en) 2023-12-26 2023-12-26 Method for content unification of composite continuous images

Publications (2)

Publication Number Publication Date
CN117474817A CN117474817A (en) 2024-01-30
CN117474817B true CN117474817B (en) 2024-03-15

Family

ID=89625941

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311800961.6A Active CN117474817B (en) 2023-12-26 2023-12-26 Method for content unification of composite continuous images

Country Status (1)

Country Link
CN (1) CN117474817B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117812275B (en) * 2024-02-28 2024-05-28 哈尔滨学院 Image optimization communication method for volleyball auxiliary training

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109672874A (en) * 2018-10-24 2019-04-23 福州大学 A kind of consistent three-dimensional video-frequency color calibration method of space-time
CN111768469A (en) * 2019-11-13 2020-10-13 中国传媒大学 Data visualization color matching extraction method based on image clustering
CN114048350A (en) * 2021-11-08 2022-02-15 湖南大学 Text-video retrieval method based on fine-grained cross-modal alignment model

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109672874A (en) * 2018-10-24 2019-04-23 福州大学 A kind of consistent three-dimensional video-frequency color calibration method of space-time
CN111768469A (en) * 2019-11-13 2020-10-13 中国传媒大学 Data visualization color matching extraction method based on image clustering
CN114048350A (en) * 2021-11-08 2022-02-15 湖南大学 Text-video retrieval method based on fine-grained cross-modal alignment model

Also Published As

Publication number Publication date
CN117474817A (en) 2024-01-30

Similar Documents

Publication Publication Date Title
WO2019120110A1 (en) Image reconstruction method and device
WO2018000752A1 (en) Monocular image depth estimation method based on multi-scale cnn and continuous crf
CN117474817B (en) Method for content unification of composite continuous images
CN113673307A (en) Light-weight video motion recognition method
CN109948721B (en) Video scene classification method based on video description
Guo et al. Image dehazing via enhancement, restoration, and fusion: A survey
CN114494297B (en) Adaptive video target segmentation method for processing multiple priori knowledge
CN114170286A (en) Monocular depth estimation method based on unsupervised depth learning
Guo et al. A survey on image enhancement for Low-light images
CN113158905A (en) Pedestrian re-identification method based on attention mechanism
Yu et al. Fla-net: multi-stage modular network for low-light image enhancement
CN111861939A (en) Single image defogging method based on unsupervised learning
CN115035011A (en) Low-illumination image enhancement method for self-adaptive RetinexNet under fusion strategy
CN112990340B (en) Self-learning migration method based on feature sharing
Singh et al. Action recognition in dark videos using spatio-temporal features and bidirectional encoder representations from transformers
CN112270691B (en) Monocular video structure and motion prediction method based on dynamic filter network
CN116091955A (en) Segmentation method, segmentation device, segmentation equipment and computer readable storage medium
CN115690917B (en) Pedestrian action identification method based on intelligent attention of appearance and motion
US20220224934A1 (en) Machine-learned in-loop predictor for video compression
CN115689871A (en) Unsupervised portrait image color migration method based on generation countermeasure network
Xu et al. Attention‐based multi‐channel feature fusion enhancement network to process low‐light images
CN116029916A (en) Low-illumination image enhancement method based on dual-branch network combined with dense wavelet
CN115393491A (en) Ink video generation method and device based on instance segmentation and reference frame
CN114140334A (en) Complex coal mine image defogging method based on improved generation countermeasure network
Stival et al. Survey on Video Colorization: Concepts, Methods and Applications

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant