CN112686922B - Method for separating animation special effect and background content based on multi-scale motion information - Google Patents

Method for separating animation special effect and background content based on multi-scale motion information

Info

Publication number
CN112686922B
CN112686922B (application CN202110101404.XA; application publication CN112686922A)
Authority
CN
China
Prior art keywords
special effect
frame
scale
sequence
frames
Prior art date
Legal status
Active
Application number
CN202110101404.XA
Other languages
Chinese (zh)
Other versions
CN112686922A (en)
Inventor
徐雪妙
屈玮
韩楚
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology (SCUT)
Priority to CN202110101404.XA
Publication of CN112686922A
Application granted
Publication of CN112686922B
Legal status: Active


Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a method for separating animation special effects and background content based on multi-scale motion information, which comprises the following steps: 1) acquiring sequence frames containing special effect fragments from an animation video; 2) calculating a single-scale special effect prediction map set between each frame and the other frames in the sequence; 3) merging the single-scale special effect prediction map set of each frame into a multi-scale special effect prediction; 4) obtaining self-attention multi-scale special effect set features through a self-attention mechanism; 5) extracting the features of the input sequence frames through a three-dimensional convolutional neural network layer; 6) combining the features of the sequence frames with the self-attention multi-scale special effect set features; 7) separating the special effect sequence frames and the transparent channel information through a three-dimensional residual convolutional neural network; 8) subtracting the special effect sequence frames from the input sequence frames to obtain damaged background sequence frames; 9) obtaining the repaired background sequence frames through a three-dimensional convolutional neural network. The method can be applied to special effect migration and can improve the accuracy of segmenting and recognizing specific objects in animation.

Description

Method for separating animation special effect and background content based on multi-scale motion information
Technical Field
The invention relates to the technical field of video separation, and in particular to a method for separating animation special effects and background content based on multi-scale motion information.
Background
Cartoon special effects are widely used in animation as a form of visual artistic expression. To depict weather conditions and environments, artists often add various cartoon special effects such as rain, snow, falling leaves and falling petals. These special effects not only represent the environment but also enrich the visual expressiveness of the animation. Although animation-oriented vision research has received wide attention in recent years, part of the background information in an animation scene is often occluded by cartoon special effects, so that information is lost when the animation background is analyzed, which hinders research directions such as segmentation of specific objects in animation and analysis of animation backgrounds. At the same time, since cartoon special effects are visually prominent elements, layering and migrating them are widely applied, so separating cartoon special effects from the background in animation has become an urgent research direction.
However, cartoon special effects in animation move irregularly, and they are complex in type and varied in size, which increases the difficulty of separating them with traditional rule-based methods. Meanwhile, cartoon special effect data sets are scarce, which increases the difficulty of separating special effects with deep learning methods. At present, some methods separate foreground and background through deep learning or traditional temporal methods, but they are not suitable for separating cartoon special effects in animation, because the videos these methods usually consider are natural videos, and the foreground distribution in natural videos is generally inconsistent with the special effect distribution in cartoon animation; for example, the size and shape of special effects in animation are highly variable and their positions of appearance are more unpredictable. Therefore, how to accurately separate cartoon special effects and background in animation has become a key problem.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art, and provides a method for separating animation special effects and background content based on multi-scale motion information. The method can separate an animation video clip into fine special effects and clean background content, can be effectively applied to the separation of different special effects in animation, and, after the special effects are effectively separated, can further repair the background content behind the special effects, which greatly benefits downstream applications such as segmentation, recognition and special effect migration.
In order to achieve the above purpose, the technical scheme provided by the invention is as follows: a method for separating an animation special effect and background content based on multi-scale motion information, comprising the following steps (a high-level code sketch of the full pipeline follows the list of steps):
1) Acquiring data, including sequence frames with special effect fragments in the animation video as input;
2) Calculating a single-scale special effect prediction image set between each frame and other frames in the input sequence frames;
3) Merging the single-scale special effect prediction image set of each frame to be used as multi-scale special effect prediction;
4) Adjusting the multi-scale special effect prediction through a self-attention mechanism to obtain self-attention multi-scale special effect set characteristics;
5) Extracting the characteristics of an input sequence frame through a three-dimensional convolution neural network layer;
6) Combining the characteristics of the input sequence frame and the self-attention multi-scale special effect set characteristics;
7) Through a three-dimensional residual convolutional neural network, adding a Non-local module into each residual module to strengthen time sequence information association, and then outputting a separated special effect sequence frame and transparent channel information;
8) Obtaining a damaged background sequence frame by subtracting the separated special effect sequence frame from the input sequence frame;
9) Combining the damaged background sequence frame with the transparent channel information and inputting them into a three-dimensional convolutional neural network, in which all convolution layers are replaced by gated convolutions for dynamic feature selection and the middle layers are replaced by dilated convolutions with different dilation rates, and finally outputting the repaired background sequence frame.
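The following Python/PyTorch-style sketch is provided for illustration only and is not the patented implementation itself: it strings the nine steps together to make the data flow concrete. All helper names (single_scale_predictions, multi_scale_prediction, attention_set_feature, gate, encoder, separator, inpainter) are assumptions of this sketch and are defined in the per-step sketches given further below.

    import torch

    def separate_effects(frames, flow_net, gate, encoder, separator, inpainter):
        # frames: list of 5 consecutive RGB frames, each a tensor of shape (1, 3, H, W).
        d_list = []
        for i in range(5):
            preds = single_scale_predictions(frames, flow_net, i)   # step 2
            d_list.append(multi_scale_prediction(preds))            # step 3
        attention_set = attention_set_feature(d_list, gate)         # step 4
        clip = torch.stack(frames, dim=2)                           # (1, 3, 5, H, W)
        fused = encoder(clip, attention_set)                        # steps 5-6
        effects, alpha = separator(fused)                           # step 7: E, A
        damaged = clip - effects                                    # step 8: C_r = I - E
        background = inpainter(damaged, alpha)                      # step 9: C = G_c(C_r, A)
        return effects, alpha, background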
In step 1), the animation video with special effect fragments refers to video clips cut, using the professional video editing software Adobe Premiere, from collected animation videos that contain special effects; the special effect types include rain, snow, falling petals and falling leaves. A sequence frame refers to consecutive image frames sampled from a video clip at 25 frames per second, and through data preprocessing the consecutive images are divided into sequences of 5 frames each:

I = {I_1, I_2, I_3, I_4, I_5}, I_i ∈ ℝ^{C×H×W}

where I denotes the input sequence of 5 consecutive frames, I_1 denotes the first frame, I_2 the second frame, and I_5 the fifth frame; ℝ is the real number set, C is the number of channels, and H and W denote the height and width of a frame.
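As a small illustration of the preprocessing described above (a sketch assuming the clip has already been cut in Adobe Premiere and decoded at 25 frames per second; the helper name is hypothetical), consecutive frames can be grouped into units of 5 as follows:

    def split_into_sequences(frames, window=5):
        # frames: decoded image frames of one special-effect clip, in temporal order.
        # Returns non-overlapping groups of `window` consecutive frames; a trailing
        # remainder shorter than `window` is dropped.
        return [frames[k:k + window] for k in range(0, len(frames) - window + 1, window)]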
In step 2), a single-scale special effect prediction map set between each frame and the other frames in the input sequence is calculated, comprising the following steps:

2.1) Using the optical flow estimation neural network FlowNet2, the optical flow between each frame I_i and every other frame I_{j|j≠i} in the sequence is computed, and I_{j|j≠i} is affine-warped back by the optical flow:

Î_{i→j} = W(I_{j|j≠i}, V(I_i, I_{j|j≠i}))

where V denotes the optical flow estimation neural network FlowNet2, I_i denotes the i-th frame, I_{j|j≠i} denotes the j-th frame with j ≠ i, V(I_i, I_{j|j≠i}) denotes the optical flow estimated from the i-th frame to the j-th frame, W denotes the affine (warping) transformation, i.e. I_{j|j≠i} is warped back to I_i by the estimated optical flow, and Î_{i→j} denotes the result of the affine transformation from the i-th frame to the j-th frame;

2.2) Based on the difference in speed and direction between the motion fields of the special effect and the background content, the single-scale special effect prediction map of each frame I_i is computed as:

D_{i→j} = Σ_C ‖I_i − Î_{i→j}‖_2

where I_i denotes the i-th frame, Î_{i→j} denotes the affine transformation result from the i-th frame to the j-th frame, Σ_C ‖·‖_2 denotes computing the Euclidean distance between I_i and Î_{i→j} channel by channel over the channels C and accumulating the results, D_{i→j} denotes the computed single-scale special effect prediction map of I_i from the j-th frame, and D_{i→j} ∈ ℝ^{1×H×W}, where ℝ is the real number set, 1 is the channel size, and H and W denote the height and width of the single-scale special effect prediction map;

2.3) After the single-scale special effect prediction maps from the i-th frame I_i to all other frames I_{j|j≠i} in the input sequence are computed, the set of single-scale special effect prediction maps of I_i is obtained as {D_{i→j} | j ≠ i, i, j ∈ [1,5]}, where D_{i→j} denotes the computed single-scale special effect prediction map of the i-th frame from the j-th frame, j ≠ i, and i and j take values in the closed interval from 1 to 5.
In step 3), the single-scale special effect prediction map sets computed for each frame are merged, making full use of the information at different time scales in the input sequence to assist the prediction of special effects moving at different rates; the multi-scale special effect prediction is computed as:

D_i = Max({D_{i→j} | j ≠ i, i, j ∈ [1,5]})

where D_{i→j} denotes the computed single-scale special effect prediction map of the i-th frame from the j-th frame; {D_{i→j} | j ≠ i, i, j ∈ [1,5]} denotes the set of single-scale special effect prediction maps of the i-th frame from all other j-th frames, with i and j taking values in the closed interval from 1 to 5; Max denotes taking the element-wise maximum over the time dimension of the 4 single-scale special effect prediction maps D_{i→j} with different time spans; D_i denotes the obtained multi-scale special effect prediction belonging to I_i, with D_i ∈ ℝ^{1×H×W}, where ℝ is the real number set, 1 is the channel size, and H and W denote the height and width of the multi-scale special effect prediction.
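Continuing the same illustrative helpers, step 3 reduces the four single-scale maps of a frame to one multi-scale prediction by a per-pixel maximum over the time spans:

    import torch

    def multi_scale_prediction(preds):
        # preds: dict {j: tensor of shape (1, 1, H, W)} from single_scale_predictions.
        stacked = torch.stack(list(preds.values()), dim=0)   # (4, 1, 1, H, W)
        return stacked.max(dim=0).values                     # element-wise max over time spans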
In step 4), the multi-scale special effect prediction is adjusted through a self-attention mechanism, i.e. the noise of non-effective regions is suppressed by re-weighting, comprising the following steps:

4.1) The multi-scale special effect prediction of each frame is passed through a self-attention mechanism to obtain a new weight, and the response of each feature in the multi-scale special effect prediction is re-balanced by this weight:

M_i = Sigmoid(H(D_i))

where D_i denotes the multi-scale special effect prediction belonging to I_i, H denotes a convolution layer with kernel size 1×1, Sigmoid denotes the activation function applied to the computed features, and M_i denotes the computed weight;

4.2) The computed weight M_i is combined with the multi-scale special effect prediction D_i as:

D̂_i = M_i ⊙ D_i

where ⊙ denotes element-wise matrix multiplication and D̂_i denotes the re-weighted self-attention multi-scale special effect feature of the i-th frame;

4.3) The self-attention multi-scale special effect features of all input sequence frames are combined along the time dimension to obtain the self-attention multi-scale special effect set feature:

D = Concat_T(D̂_1, D̂_2, ..., D̂_5)

where Concat_T denotes vector concatenation along the time dimension T; D̂_1, D̂_2 and D̂_5 denote the self-attention multi-scale special effect features of the 1st, 2nd and 5th frames respectively; D is the combined self-attention multi-scale special effect set feature, with D ∈ ℝ^{1×5×H×W}, where ℝ is the real number set, 1 is the channel size, 5 is the size of the time dimension, and H and W denote the height and width of the self-attention multi-scale special effect set feature.
In step 5), the features of the input sequence of 5 consecutive frames are extracted through a three-dimensional convolutional neural network layer as:

F = Conv(I)

where I is the input sequence frame, Conv is a three-dimensional convolution layer with kernel size 5×5×3, F is the extracted feature of the input sequence frames, and F ∈ ℝ^{C×5×H×W}, where ℝ is the real number set, C is the channel size, 5 is the size of the time dimension, and H and W denote the height and width of the features of the input sequence frames.
In step 6), the computed self-attention multi-scale special effect set feature is used to guide the extraction of the special effect part, and it is fused with the features of the input sequence frames:

F_e = F ⊙ D

where F is the extracted feature of the input sequence frames, D is the self-attention multi-scale special effect set feature, ⊙ denotes element-wise matrix multiplication, and F_e is the fused image frame feature.
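A sketch covering steps 5 and 6 under the same assumptions: one 3D convolution extracts the sequence features F, and the attention set feature D gates them element-wise to give F_e. The kernel ordering (3 temporal × 5×5 spatial) is this sketch's reading of the patent's "5×5×3", and the channel width is arbitrary.

    import torch.nn as nn

    class SequenceEncoder(nn.Module):
        def __init__(self, in_channels=3, out_channels=64):
            super().__init__()
            # Conv3d kernels are ordered (T, H, W); "5x5x3" is read here as 3 temporal x 5x5 spatial.
            self.conv = nn.Conv3d(in_channels, out_channels,
                                  kernel_size=(3, 5, 5), padding=(1, 2, 2))

        def forward(self, frames, attention_set):
            # frames: (1, 3, 5, H, W); attention_set: (1, 1, 5, H, W) from step 4.
            feat = self.conv(frames)                   # F = Conv(I)
            return feat * attention_set                # F_e = F ⊙ D (broadcast over channels)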
In step 7), a Non-local module is added to each residual module of a three-dimensional residual convolutional neural network to strengthen temporal information association, and the separated special effect sequence frames and transparent channel information are then output, comprising the following steps:

7.1) The sequence frame features are encoded and decoded by a three-dimensional residual convolutional neural network whose structure consists of 2 parameter-sharing down-sampling convolution layers, 4 residual modules with Non-local layers, 1 up-sampling convolution layer and 2 up-sampling convolution layers that do not share parameters, with all three-dimensional convolution kernels of size 3×3×3;

7.2) The separated special effect sequence frames and transparent channel information are output:

(E, A) = G_e(F_e)

where G_e denotes the three-dimensional residual convolutional neural network; E denotes the separated special effect sequence frames, with E ∈ ℝ^{3×5×H×W}, i.e. 5 consecutive frames of size 3×H×W; A denotes the transparent channel information, with A ∈ ℝ^{1×5×H×W}, i.e. 5 consecutive frames of size 1×H×W.
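A simplified sketch of the separation network G_e is shown below for illustration: a residual 3D block with an embedded-Gaussian non-local module, stacked between down-sampling and up-sampling stages, with two unshared heads for the effect frames E and the alpha channel A. Channel widths, activation placement, the sigmoid on the alpha head, and the omission of parameter sharing in the down-sampling layers are simplifying assumptions of this sketch, not details fixed by the patent.

    import torch
    import torch.nn as nn

    class NonLocal3d(nn.Module):
        # Simplified non-local (self-attention) block over space-time.
        def __init__(self, channels):
            super().__init__()
            inter = max(channels // 2, 1)
            self.theta = nn.Conv3d(channels, inter, 1)
            self.phi = nn.Conv3d(channels, inter, 1)
            self.g = nn.Conv3d(channels, inter, 1)
            self.out = nn.Conv3d(inter, channels, 1)

        def forward(self, x):
            b, c, t, h, w = x.shape
            q = self.theta(x).flatten(2).transpose(1, 2)      # (B, THW, C')
            k = self.phi(x).flatten(2)                        # (B, C', THW)
            v = self.g(x).flatten(2).transpose(1, 2)          # (B, THW, C')
            attn = torch.softmax(q @ k, dim=-1)               # pairwise space-time affinity
            y = (attn @ v).transpose(1, 2).reshape(b, -1, t, h, w)
            return x + self.out(y)                            # residual connection

    class ResNonLocalBlock3d(nn.Module):
        # Residual 3D convolution block with a non-local module.
        def __init__(self, channels):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv3d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv3d(channels, channels, 3, padding=1))
            self.non_local = NonLocal3d(channels)

        def forward(self, x):
            return self.non_local(x + self.body(x))

    class EffectSeparator(nn.Module):
        # Skeleton of G_e: spatial down-sampling, 4 residual/non-local blocks, up-sampling,
        # then two unshared heads for the effect frames E (3 channels) and alpha A (1 channel).
        def __init__(self, channels=64):
            super().__init__()
            self.down = nn.Sequential(
                nn.Conv3d(channels, channels, 3, stride=(1, 2, 2), padding=1), nn.ReLU(inplace=True),
                nn.Conv3d(channels, channels, 3, stride=(1, 2, 2), padding=1), nn.ReLU(inplace=True))
            self.blocks = nn.Sequential(*[ResNonLocalBlock3d(channels) for _ in range(4)])
            self.up = nn.Sequential(
                nn.Upsample(scale_factor=(1, 2, 2), mode="trilinear", align_corners=False),
                nn.Conv3d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
            self.head_e = nn.Sequential(
                nn.Upsample(scale_factor=(1, 2, 2), mode="trilinear", align_corners=False),
                nn.Conv3d(channels, 3, 3, padding=1))
            self.head_a = nn.Sequential(
                nn.Upsample(scale_factor=(1, 2, 2), mode="trilinear", align_corners=False),
                nn.Conv3d(channels, 1, 3, padding=1))

        def forward(self, fe):                                # fe: (B, C, 5, H, W)
            x = self.up(self.blocks(self.down(fe)))
            return self.head_e(x), torch.sigmoid(self.head_a(x))   # E, A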
In step 8), the separated special effect sequence frames are subtracted from the input sequence frames to obtain the damaged background sequence frames C_r without special effects:

C_r = I − E

where I denotes the input sequence frames, E denotes the separated special effect sequence frames, and C_r denotes the computed damaged background sequence frames.
In step 9), the damaged background sequence frames and the transparent channel information are combined and input into a three-dimensional convolutional neural network, in which all convolution layers are replaced by gated convolutions for dynamic feature selection and the middle layers are replaced by dilated convolutions with different dilation rates; the repaired background sequence frames are finally output, comprising the following steps:

9.1) The damaged background sequence frames are repaired by encoding and decoding through a three-dimensional convolutional neural network; the network is designed to pass sequentially through 2 down-sampling convolution layers, 4 dilated-convolution blocks with different dilation rates for perceiving damaged regions, and 2 up-sampling layers; all convolution layers are replaced by gated convolution layers so that channel information is fully exploited and redundancy is avoided, and all convolution kernels are 3×3×3 three-dimensional kernels;

9.2) The damaged background sequence frames and the transparent channel information are jointly input into the three-dimensional convolutional neural network to obtain the repaired background sequence frames:

C = G_c(C_r, A)

where G_c denotes the three-dimensional convolutional neural network, C_r is the damaged background sequence frame, A denotes the transparent channel information, C denotes the repaired background sequence frames, and C ∈ ℝ^{3×5×H×W}, i.e. 5 consecutive frames of size 3×H×W.
After the special effect layer is separated from the input sequence frames of the original animation video clip and the repaired background content layer is output, the special effect and the background content in the animation are separated.
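A sketch of the inpainting network G_c in step 9 is given below for illustration, showing the gated-convolution idea (a feature branch modulated element-wise by a sigmoid gate branch) and a dilated middle stage; the specific dilation rates, channel widths and activations are assumptions of this sketch rather than values taken from the patent.

    import torch
    import torch.nn as nn

    class GatedConv3d(nn.Module):
        # Gated convolution: a feature branch is modulated element-wise by a sigmoid gate branch.
        def __init__(self, in_ch, out_ch, dilation=1, stride=1):
            super().__init__()
            pad = dilation
            self.feature = nn.Conv3d(in_ch, out_ch, 3, stride=stride, padding=pad, dilation=dilation)
            self.gate = nn.Conv3d(in_ch, out_ch, 3, stride=stride, padding=pad, dilation=dilation)

        def forward(self, x):
            return torch.relu(self.feature(x)) * torch.sigmoid(self.gate(x))

    class BackgroundInpainter(nn.Module):
        # Skeleton of G_c: gated down-sampling, gated dilated middle blocks, up-sampling.
        def __init__(self, channels=64):
            super().__init__()
            self.down = nn.Sequential(
                GatedConv3d(4, channels, stride=(1, 2, 2)),        # input: damaged frames (3) + alpha (1)
                GatedConv3d(channels, channels, stride=(1, 2, 2)))
            self.middle = nn.Sequential(*[GatedConv3d(channels, channels, dilation=d)
                                          for d in (1, 2, 4, 8)])   # dilation rates are illustrative
            self.up = nn.Sequential(
                nn.Upsample(scale_factor=(1, 2, 2), mode="trilinear", align_corners=False),
                GatedConv3d(channels, channels),
                nn.Upsample(scale_factor=(1, 2, 2), mode="trilinear", align_corners=False),
                nn.Conv3d(channels, 3, 3, padding=1))

        def forward(self, damaged, alpha):
            # damaged: (B, 3, 5, H, W); alpha: (B, 1, 5, H, W); returns C = G_c(C_r, A).
            return self.up(self.middle(self.down(torch.cat((damaged, alpha), dim=1))))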
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention is the first to use a deep neural network to separate an animation video containing special effects, splitting an animation video clip into a special effect layer and a background content layer.
2. The method can extract many different kinds of special effects in animation, and while extracting the special effects accurately it can also recover complete background content free of special effects.
3. The invention is the first to propose perceiving motion-field differences at multiple scales, which makes it possible to perceive and locate special effects whose direction, speed and shape distributions differ greatly; this perception is embedded into the learning process of the neural network as prior knowledge to guide the network to capture special effect motion characteristics and thus further help it learn how to separate special effects.
4. The invention proposes a self-attention mechanism to assist the guidance given by the multi-scale perception of motion differences, which further guides the network to obtain more accurate special effect motion priors and avoid noise errors.
5. The invention introduces a three-dimensional convolutional neural network to repair the damaged background; damaged blocks of different sizes are perceived by taking the transparency information as a soft auxiliary input and by dilated convolutions with different receptive fields, and the three-dimensional convolution takes temporal consistency into account, so that the repaired background is clearer and more complete.
6. The method has broad applicability in animation video processing tasks, with short inference time and good generalization.
Drawings
FIG. 1 is a logic flow diagram of the method of the present invention.
Fig. 2-1 to fig. 2-5 show input sequence frames of the method of the present invention.
Figs. 3-1 to 3-4 are the single-scale special effect prediction map set calculated with the frame of fig. 2-5 as the current frame.
Fig. 4 is the multi-scale special effect prediction of the current frame obtained by merging.
Fig. 5 is the self-attention multi-scale special effect feature obtained by applying the self-attention mechanism to the multi-scale special effect prediction.
Fig. 6-1 to 6-5 are separated special effect sequence frames.
Fig. 7-1 to 7-5 are sequence frames of background content obtained after the separation and repair.
Detailed Description
The present invention is further illustrated by the following examples.
As shown in fig. 1, the method for separating an animated special effect and background content based on multi-scale motion information provided by this embodiment includes the following steps:
1) Sequence frames with special effect fragments are acquired from the animation video; each frame is an image whose background contains a special effect. Video sequence frames with special effect segments are obtained using the professional video editing software Adobe Premiere: an animation video with special effects is first collected, then the fragments containing the special effects are cut from the video, and the sequence frames are the consecutive image frames sampled at 25 frames per second. The special effects include four different types: rain, snow, falling petals and falling leaves. The animation sequence frames are first preprocessed, and all sequence frames are divided into inputs of 5 consecutive frames. As shown in figs. 2-1 to 2-5, the frames are adjacent to each other.
2) A single-scale special effect prediction map set between each frame of the input animation sequence and the other frames is calculated; the set contains the single-scale special effect prediction maps between the current frame and all other frames. Through the optical flow estimation neural network FlowNet2, the optical flow between every i-th frame I_i and the other frames I_{j|j≠i} in the sequence is estimated, and the other frames are affine-warped back:

Î_{i→j} = W(I_{j|j≠i}, V(I_i, I_{j|j≠i}))

where V denotes the classical optical flow estimation neural network FlowNet2, I_i denotes the i-th frame, I_{j|j≠i} denotes the j-th frame with j ≠ i, V(I_i, I_{j|j≠i}) denotes the optical flow estimated from the i-th frame to the j-th frame, W denotes the affine (warping) transformation, i.e. I_{j|j≠i} is warped back to I_i by the estimated optical flow, and Î_{i→j} denotes the result of the affine transformation from the i-th frame to the j-th frame.

After the affine transformation results between the current frame and all other frames are obtained, the single-scale special effect prediction map D_{i→j} of the current frame is computed, according to the speed and direction difference between the motion fields of the special effect and the background content, as:

D_{i→j} = Σ_C ‖I_i − Î_{i→j}‖_2

where Σ_C ‖·‖_2 denotes computing the Euclidean distance between the current frame I_i and the warped frame Î_{i→j} channel by channel and accumulating the results, giving the single-scale special effect prediction map from the i-th frame to the j-th frame; D_{i→j} ∈ ℝ^{1×H×W}, where ℝ is the real number set, 1 is the channel size, and H and W denote the height and width of the single-scale special effect prediction map. After the single-scale special effect prediction maps from the current frame i to all other frames j are computed, the set of single-scale special effect prediction maps of the current frame I_i is obtained. Taking the frame of fig. 2-5 as the current frame, figs. 3-1 to 3-4 show the computed set of single-scale special effect prediction maps.
3) The obtained single-scale special effect prediction maps D_{i→j} of each frame are then merged into the multi-scale special effect prediction by a maximum operation:

D_i = Max({D_{i→j} | j ≠ i, i, j ∈ [1,5]})

where {D_{i→j} | j ≠ i} denotes the set of single-scale special effect prediction maps computed for the current i-th frame from the other frames j, with j ≠ i and i, j taking integer values from 1 to 5; Max denotes taking the element-wise maximum over the time dimension of the 4 single-scale special effect prediction maps D_{i→j} with different time spans, which gives the multi-scale special effect prediction D_i belonging to the i-th frame; D_i ∈ ℝ^{1×H×W}, where ℝ is the real number set, 1 is the channel size, and H and W denote the height and width of the multi-scale special effect prediction, as shown in fig. 4.
4) After the multi-scale special effect prediction of each frame is obtained, the self-attention multi-scale special effect set feature of the sequence frames is computed. First, the self-attention multi-scale special effect feature is computed from the multi-scale special effect prediction of each frame, and each position in the multi-scale special effect prediction is re-balanced by a learned weight:

M_i = Sigmoid(H(D_i))

where H denotes a convolution layer with kernel size 1×1, Sigmoid denotes the activation function, and the obtained M_i is the self-attention weight.

The self-attention weight is then combined with the original multi-scale special effect prediction as:

D̂_i = M_i ⊙ D_i

where ⊙ denotes element-wise matrix multiplication and D̂_i denotes the re-weighted self-attention multi-scale special effect feature. Fig. 5 shows the self-attention multi-scale special effect feature obtained after applying the self-attention mechanism to fig. 4.

Finally, the self-attention multi-scale special effect features of all frames in the sequence are fused along the time dimension:

D = Concat_T(D̂_1, D̂_2, ..., D̂_5)

where Concat_T denotes vector concatenation along the time dimension T; D̂_1, D̂_2 and D̂_5 denote the self-attention multi-scale special effect features of the 1st, 2nd and 5th frames respectively; D is the combined self-attention multi-scale special effect set feature, with D ∈ ℝ^{1×5×H×W}, where ℝ is the real number set, 1 is the channel size, 5 is the size of the time dimension, and H and W denote the height and width of the self-attention multi-scale special effect set feature.
5) For the input sequence frames, feature extraction is performed through a three-dimensional convolutional neural network layer:

F = Conv(I)

where Conv is an ordinary three-dimensional convolution layer with kernel size 5×5×3 and F is the extracted image feature; the abstract features of the input sequence frames are obtained by convolution.
6) Encoding and decoding are performed by combining the input sequence frame features with the self-attention special effect set feature. The self-attention special effect set feature computed in step 4) guides the extraction of the special effect, and it is fused with the features of the input sequence frames:

F_e = F ⊙ D

where ⊙ denotes element-wise matrix multiplication: the self-attention special effect set feature and the features of the input sequence frames are multiplied along the channel dimension, and F_e is the fused image frame feature.
7) The image frame features are then encoded and decoded by a three-dimensional residual convolutional neural network to obtain the separated special effect sequence frames and the transparent channel information:

(E, A) = G_e(F_e)

where G_e denotes the three-dimensional residual convolutional neural network, whose structure consists of 2 down-sampling convolution layers that do not share parameters, 4 residual modules with Non-local layers, 1 up-sampling convolution layer and 2 up-sampling convolution layers that do not share parameters; all three-dimensional convolution kernels are of size 3×3×3, and the last 2 unshared up-sampling convolution layers output the special effect sequence frames E and the transparent channel information A, respectively. E denotes the separated special effect sequence frames, with E ∈ ℝ^{3×5×H×W}, i.e. 5 consecutive frames of size 3×H×W; A denotes the transparent channel information, with A ∈ ℝ^{1×5×H×W}, i.e. 5 consecutive frames of size 1×H×W.

As shown in figs. 6-1 to 6-5, the separated special effect sequence frames containing channel information are obtained.
8) After the separated special effect sequence frames and the transparent channel are obtained, the damaged background sequence frames to be repaired are obtained by subtracting the separated special effect sequence frames from the input sequence frames: C_r = I − E, where I denotes the input sequence frames, E denotes the separated special effect sequence frames, and C_r denotes the computed damaged background sequence frames.
9) The separated animation background content is obtained by the three-dimensional convolutional neural network combined with the preceding transparent channel information. The background sequence frames to be repaired and the transparent channel information are jointly input into a three-dimensional convolutional neural network, which passes sequentially through 2 down-sampling convolution layers, 4 dilated convolutions with different dilation rates for perceiving damaged blocks of different sizes, and 2 up-sampling layers; all convolution layers are replaced by gated convolution layers that control the channel information so that effective channel information is fully utilized and redundancy is avoided, and all convolution kernels are 3×3×3 three-dimensional kernels. The separated animation background content sequence frames are obtained as:

C = G_c(C_r, A)

where G_c denotes the three-dimensional convolutional neural network and C denotes the repaired background sequence frames, i.e. 5 consecutive frames of size 3×H×W, as shown in figs. 7-1 to 7-5, which are the 5 consecutive background sequence frames obtained by the repair.
After the special effect sequence frames and the repaired animation background sequence frames are obtained from the original input animation video sequence, the animation special effect and the background content are separated.
The above-mentioned embodiments are merely preferred embodiments of the present invention, and the scope of the present invention is not limited thereto; any change made according to the shape and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for separating an animation special effect and background content based on multi-scale motion information, characterized by comprising the following steps:
1) Acquiring data, including sequence frames with special effect fragments in the animation video as input;
2) Calculating a single-scale special effect prediction image set between each frame and other frames in the input sequence frames;
3) Merging the single-scale special effect prediction image set of each frame as multi-scale special effect prediction;
4) Adjusting the multi-scale special effect prediction through a self-attention mechanism to obtain self-attention multi-scale special effect set characteristics;
5) Extracting the characteristics of an input sequence frame through a three-dimensional convolution neural network layer;
6) Combining the characteristics of the input sequence frame and the self-attention multi-scale special effect set characteristics;
7) Through a three-dimensional residual convolutional neural network, adding a Non-local module into each residual module to strengthen time sequence information association, and then outputting a separated special effect sequence frame and transparent channel information;
8) Obtaining a damaged background sequence frame by subtracting the separated special effect sequence frame from the input sequence frame;
9) Combining the damaged background sequence frame with the transparent channel information and inputting them into a three-dimensional convolutional neural network, in which all convolution layers are replaced by gated convolutions for dynamic feature selection and the middle layers are replaced by dilated convolutions with different dilation rates, and finally outputting the repaired background sequence frame.
2. The method for separating an animation special effect and background content based on multi-scale motion information according to claim 1, wherein in step 1), the animation video with special effect fragments refers to video clips cut, using the professional video editing software Adobe Premiere, from collected animation videos that contain special effects, the special effect types including rain, snow, falling petals and falling leaves; a sequence frame refers to consecutive image frames sampled from a video clip at 25 frames per second, and through data preprocessing the consecutive images are divided into sequences of 5 frames each:

I = {I_1, I_2, I_3, I_4, I_5}, I_i ∈ ℝ^{C×H×W}

where I denotes the input sequence of 5 consecutive frames, I_1 denotes the first frame, I_2 the second frame, and I_5 the fifth frame; ℝ is the real number set, C is the number of channels, and H and W denote the height and width of a frame.
3. The method for separating an animation special effect and background content based on multi-scale motion information according to claim 1, wherein in step 2), the calculation of a single-scale special effect prediction map set between each frame and the other frames in the input sequence comprises the following steps:

2.1) using the optical flow estimation neural network FlowNet2, computing the optical flow between each frame I_i and every other frame I_{j|j≠i} in the sequence, and affine-warping I_{j|j≠i} back by the optical flow:

Î_{i→j} = W(I_{j|j≠i}, V(I_i, I_{j|j≠i}))

where V denotes the optical flow estimation neural network FlowNet2, I_i denotes the i-th frame, I_{j|j≠i} denotes the j-th frame with j ≠ i, V(I_i, I_{j|j≠i}) denotes the optical flow estimated from the i-th frame to the j-th frame, W denotes the affine transformation, i.e. I_{j|j≠i} is warped back to I_i by the estimated optical flow, and Î_{i→j} denotes the result of the affine transformation from the i-th frame to the j-th frame;

2.2) based on the difference in speed and direction between the motion fields of the special effect and the background content, computing the single-scale special effect prediction map of each frame I_i as:

D_{i→j} = Σ_C ‖I_i − Î_{i→j}‖_2

where I_i denotes the i-th frame, Î_{i→j} denotes the affine transformation result from the i-th frame to the j-th frame, Σ_C ‖·‖_2 denotes computing the Euclidean distance between I_i and Î_{i→j} channel by channel over the channels C and accumulating the results, D_{i→j} denotes the computed single-scale special effect prediction map of I_i from the j-th frame, and D_{i→j} ∈ ℝ^{1×H×W}, where ℝ is the real number set, 1 is the channel size, and H and W denote the height and width of the single-scale special effect prediction map;

2.3) after computing the single-scale special effect prediction maps from the i-th frame I_i to all other frames I_{j|j≠i} in the input sequence, obtaining the set of single-scale special effect prediction maps of I_i as {D_{i→j} | j ≠ i, i, j ∈ [1,5]}, where D_{i→j} denotes the computed single-scale special effect prediction map of the i-th frame from the j-th frame, j ≠ i, and i and j take values in the closed interval from 1 to 5.
4. The method for separating an animation special effect and background content based on multi-scale motion information according to claim 1, wherein in step 3), the single-scale special effect prediction map sets computed for each frame are merged, making full use of the information at different time scales in the input sequence to assist the prediction of special effects moving at different rates, and the multi-scale special effect prediction is computed as:

D_i = Max({D_{i→j} | j ≠ i, i, j ∈ [1,5]})

where D_{i→j} denotes the computed single-scale special effect prediction map of the i-th frame from the j-th frame; {D_{i→j} | j ≠ i, i, j ∈ [1,5]} denotes the set of single-scale special effect prediction maps of the i-th frame from all other j-th frames, with i and j taking values in the closed interval from 1 to 5; Max denotes taking the element-wise maximum over the time dimension of the 4 single-scale special effect prediction maps D_{i→j} with different time spans; D_i denotes the obtained multi-scale special effect prediction belonging to I_i, with D_i ∈ ℝ^{1×H×W}, where ℝ is the real number set, 1 is the channel size, and H and W denote the height and width of the multi-scale special effect prediction.
5. The method for separating an animation special effect and background content based on multi-scale motion information according to claim 1, wherein in step 4), the multi-scale special effect prediction is adjusted through a self-attention mechanism, i.e. the noise of non-effective regions is suppressed by re-weighting, comprising the following steps:

4.1) passing the multi-scale special effect prediction of each frame through a self-attention mechanism to obtain a new weight, and re-balancing the response of each feature in the multi-scale special effect prediction by this weight:

M_i = Sigmoid(H(D_i))

where D_i denotes the multi-scale special effect prediction belonging to I_i, H denotes a convolution layer with kernel size 1×1, Sigmoid denotes the activation function applied to the computed features, and M_i denotes the computed weight;

4.2) combining the computed weight M_i with the multi-scale special effect prediction D_i as:

D̂_i = M_i ⊙ D_i

where ⊙ denotes element-wise matrix multiplication and D̂_i denotes the re-weighted self-attention multi-scale special effect feature of the i-th frame;

4.3) combining the self-attention multi-scale special effect features of all input sequence frames along the time dimension to obtain the self-attention multi-scale special effect set feature:

D = Concat_T(D̂_1, D̂_2, ..., D̂_5)

where Concat_T denotes vector concatenation along the time dimension T; D̂_1, D̂_2 and D̂_5 denote the self-attention multi-scale special effect features of the 1st, 2nd and 5th frames respectively; D is the combined self-attention multi-scale special effect set feature, with D ∈ ℝ^{1×5×H×W}, where ℝ is the real number set, 1 is the channel size, 5 is the size of the time dimension, and H and W denote the height and width of the self-attention multi-scale special effect set feature.
6. The method for separating an animation special effect and background content based on multi-scale motion information according to claim 1, wherein in step 5), the features of the input sequence of 5 consecutive frames are extracted through a three-dimensional convolutional neural network layer as:

F = Conv(I)

where I is the input sequence frame, Conv is a three-dimensional convolution layer with kernel size 5×5×3, F is the extracted feature of the input sequence frames, and F ∈ ℝ^{C×5×H×W}, where ℝ is the real number set, C is the channel size, 5 is the size of the time dimension, and H and W denote the height and width of the features of the input sequence frames.
7. The method for separating an animation special effect and background content based on multi-scale motion information according to claim 1, wherein in step 6), the computed self-attention multi-scale special effect set feature is used to guide the extraction of the special effect part and is fused with the features of the input sequence frames:

F_e = F ⊙ D

where F is the extracted feature of the input sequence frames, D is the self-attention multi-scale special effect set feature, ⊙ denotes element-wise matrix multiplication, and F_e is the fused image frame feature.
8. The method for separating an animation special effect and background content based on multi-scale motion information according to claim 1, wherein in step 7), a Non-local module is added to each residual module of a three-dimensional residual convolutional neural network to strengthen temporal information association, and the separated special effect sequence frames and transparent channel information are then output, comprising the following steps:

7.1) encoding and decoding the sequence frame features through a three-dimensional residual convolutional neural network whose structure consists of 2 parameter-sharing down-sampling convolution layers, 4 residual modules with Non-local layers, 1 up-sampling convolution layer and 2 up-sampling convolution layers that do not share parameters, with all three-dimensional convolution kernels of size 3×3×3;

7.2) outputting the separated special effect sequence frames and transparent channel information:

(E, A) = G_e(F_e)

where G_e denotes the three-dimensional residual convolutional neural network; E denotes the separated special effect sequence frames, with E ∈ ℝ^{3×5×H×W}, i.e. 5 consecutive frames of size 3×H×W; A denotes the transparent channel information, with A ∈ ℝ^{1×5×H×W}, i.e. 5 consecutive frames of size 1×H×W.
9. The method for separating an animation special effect and background content based on multi-scale motion information according to claim 1, wherein in step 8), the separated special effect sequence frames are subtracted from the input sequence frames to obtain the damaged background sequence frames C_r without special effects:

C_r = I − E

where I denotes the input sequence frames, E denotes the separated special effect sequence frames, and C_r denotes the computed damaged background sequence frames.
10. The method for separating an animation special effect and background content based on multi-scale motion information according to claim 1, wherein in step 9), the damaged background sequence frames and the transparent channel information are combined and input into a three-dimensional convolutional neural network, in which all convolution layers are replaced by gated convolutions for dynamic feature selection and the middle layers are replaced by dilated convolutions with different dilation rates, and the repaired background sequence frames are finally output, comprising the following steps:

9.1) repairing the damaged background sequence frames by encoding and decoding through a three-dimensional convolutional neural network, the network being designed to pass sequentially through 2 down-sampling convolution layers, 4 dilated-convolution blocks with different dilation rates for perceiving damaged regions, and 2 up-sampling layers, with all convolution layers replaced by gated convolution layers so that channel information is fully exploited and redundancy is avoided, and all convolution kernels being 3×3×3 three-dimensional kernels;

9.2) jointly inputting the damaged background sequence frames and the transparent channel information into the three-dimensional convolutional neural network to obtain the repaired background sequence frames:

C = G_c(C_r, A)

where G_c denotes the three-dimensional convolutional neural network, C_r is the damaged background sequence frame, A denotes the transparent channel information, C denotes the repaired background sequence frames, and C ∈ ℝ^{3×5×H×W}, i.e. 5 consecutive frames of size 3×H×W;

after the special effect layer is separated from the input sequence frames of the original animation video clip and the repaired background content layer is output, the special effect and the background content in the animation are separated.
Application CN202110101404.XA (priority date 2021-01-26, filing date 2021-01-26): Method for separating animation special effect and background content based on multi-scale motion information — granted as CN112686922B (en), status Active.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110101404.XA | 2021-01-26 | 2021-01-26 | Method for separating animation special effect and background content based on multi-scale motion information

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202110101404.XA | 2021-01-26 | 2021-01-26 | Method for separating animation special effect and background content based on multi-scale motion information

Publications (2)

Publication Number | Publication Date
CN112686922A (en) | 2021-04-20
CN112686922B (en) | 2022-10-25

Family

ID=75459206

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202110101404.XA | Method for separating animation special effect and background content based on multi-scale motion information (Active, CN112686922B) | 2021-01-26 | 2021-01-26

Country Status (1)

Country Link
CN (1) CN112686922B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015052514A2 (en) * 2013-10-08 2015-04-16 Digimania Limited Rendering composites/layers for video animations
CN108520501A (en) * 2018-03-30 2018-09-11 西安交通大学 A kind of video and removes rain snow method based on multiple dimensioned convolution sparse coding
CN111259782A (en) * 2020-01-14 2020-06-09 北京大学 Video behavior identification method based on mixed multi-scale time sequence separable convolution operation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8565525B2 (en) * 2005-12-30 2013-10-22 Telecom Italia S.P.A. Edge comparison in segmentation of video sequences


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Xinchen Ye et al.; "Foreground–Background Separation From Video Clips via Motion-Assisted Matrix Restoration"; IEEE Transactions on Circuits and Systems for Video Technology, Vol. 25, Issue 11, November 2015; 2015-01-19; pp. 1721-1734 *
缪佩琦 (Miao Peiqi); "Sketch Simplification and Animation Special Effect Transfer Based on a Depth-Aware Network" (基于深度感知网络的草图简化和动画特效迁移); China Masters' Theses Full-text Database, Information Science and Technology; 2020-02-15; I138-I248 *

Also Published As

Publication number Publication date
CN112686922A (en) 2021-04-20


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant