CN112686922A - Method for separating animation special effect and background content based on multi-scale motion information

Method for separating animation special effect and background content based on multi-scale motion information

Info

Publication number
CN112686922A (application CN202110101404.XA; granted as CN112686922B)
Authority
CN (China)
Prior art keywords
special effect, frame, scale, sequence, frames
Legal status
Granted; currently active
Application number
CN202110101404.XA
Other languages
Chinese (zh)
Other versions
CN112686922B (granted publication)
Inventors
徐雪妙, 屈玮, 韩楚
Assignee
South China University of Technology SCUT
Filing
Application filed by South China University of Technology SCUT; priority to CN202110101404.XA
Publications
CN112686922A (application publication), CN112686922B (grant publication)

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a method for separating animation special effects and background content based on multi-scale motion information, which comprises the following steps: 1) acquiring sequence frames containing a special effect fragment in an animation video; 2) calculating a single-scale special effect prediction map set between each frame and the other frames in the sequence; 3) merging the single-scale special effect prediction map set of each frame into a multi-scale special effect prediction; 4) obtaining self-attention multi-scale special effect set features through a self-attention mechanism; 5) extracting features of the input sequence frames through a three-dimensional convolutional neural network layer; 6) combining the sequence frame features with the self-attention multi-scale special effect set features; 7) separating the special effect sequence frames and the transparent channel information through a three-dimensional residual convolutional neural network; 8) subtracting the special effect sequence frames from the input sequence frames to obtain damaged background sequence frames; 9) obtaining the repaired background sequence frames through a three-dimensional convolutional neural network. The method can be applied to special effect migration and can improve the accuracy of segmenting and recognizing specific objects in animation.

Description

Method for separating animation special effect and background content based on multi-scale motion information
Technical Field
The invention relates to the technical field of video separation, and in particular to a method for separating animation special effects and background content based on multi-scale motion information.
Background
As a visual form of artistic expression, cartoons have been widely used in the field of animation. To present weather conditions and environments in animation, artists often use various cartoon special effects, such as rain, snow, fallen leaves and fallen petals. These special effects not only represent the environment but also enrich the visual expressiveness of the animation. Although cartoon-vision research based on animation has gained wide attention in recent years, part of the background information in an animation scene is often occluded by cartoon special effects, so that some information is missing when the animation background is analyzed; this hinders research directions such as the segmentation of specific objects in animation and the analysis of animation backgrounds. Meanwhile, as visually displayed effects, cartoon special effects are also widely layered and migrated, so separating cartoon special effects from the background in animation is an urgent research direction.
However, cartoon special effects in animation move irregularly, and they are complex in type and varied in size, which increases the difficulty of separating them with traditional rule-based methods. Meanwhile, cartoon special effect data sets are scarce, which also increases the difficulty of separating special effects with deep learning methods. At present, some methods separate foreground and background through deep learning or traditional temporal methods, but they are not suitable for separating cartoon special effects in animation, because the videos they consider are usually natural videos, and the foreground distribution in natural videos is usually inconsistent with the special effect distribution in cartoon animation; for example, the size and shape of special effects in animation are highly variable and their positions of appearance are more unpredictable. Therefore, how to accurately separate cartoon special effects from the background in animation becomes a key problem.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art, and provides a method for separating animation special effects and background content based on multi-scale motion information. The method can separate an animation video clip into fine special effects and clean background content, is effectively applicable to separating different special effects in animation, and can further repair the background content behind the special effects after they are separated, which greatly benefits downstream applications such as segmentation, recognition and special effect migration.
In order to achieve the above purpose, the technical solution provided by the invention is as follows: a method for separating an animation special effect and background content based on multi-scale motion information, comprising the following steps:
1) acquiring data, wherein the data comprises, as input, sequence frames containing special effect fragments in an animation video;
2) calculating a single-scale special effect prediction map set between each frame and the other frames in the input sequence frames;
3) merging the single-scale special effect prediction map set of each frame into a multi-scale special effect prediction;
4) adjusting the multi-scale special effect prediction through a self-attention mechanism to obtain self-attention multi-scale special effect set features;
5) extracting features of the input sequence frames through a three-dimensional convolutional neural network layer;
6) combining the features of the input sequence frames with the self-attention multi-scale special effect set features;
7) outputting the separated special effect sequence frames and the transparent channel information through a three-dimensional residual convolutional neural network in which a Non-local module is added to each residual module to strengthen temporal information association;
8) subtracting the separated special effect sequence frames from the input sequence frames to obtain damaged background sequence frames;
9) combining the damaged background sequence frames with the transparent channel information and inputting them into a three-dimensional convolutional neural network in which all convolution layers are replaced with gated convolutions for dynamic feature selection and the middle layers are replaced with dilated convolutions with different dilation rates, and finally outputting the repaired background sequence frames.
In step 1), the animation video with special effect fragments refers to video clips obtained by cutting collected animation videos containing special effects with the professional video editing software Adobe Premiere, the special effect types including rain, snow, fallen petals and fallen leaves; the sequence frames refer to continuous image frames sampled from a video clip at 25 frames per second, and the continuous images are divided by data preprocessing into sequences of 5 frames each:
I = {I_1, I_2, I_3, I_4, I_5}, I_i ∈ ℝ^(C×H×W)

where I denotes that the input is a sequence of 5 consecutive frames, I_1 denotes the first frame, I_2 the second frame, and I_5 the fifth frame; ℝ is the real number set, C is the number of channels, and H and W denote the height and width of a frame.
In step 2), the single-scale special effect prediction map set between each frame and the other frames in the input sequence frames is calculated, comprising the following steps:
2.1) Using the optical flow estimation neural network FlowNet2, calculate the optical flow between each frame I_i and every other frame I_{j|j≠i} in the sequence frames, and warp I_j back to I_i by an affine transformation:

Ĩ_{i→j} = W(I_{j|j≠i}, V(I_i, I_{j|j≠i}))

where V denotes the optical flow estimation neural network FlowNet2, I_i denotes the i-th frame, I_{j|j≠i} denotes the j-th frame with j not equal to i, and V(I_i, I_{j|j≠i}) denotes the optical flow estimated from the i-th frame to the j-th frame; W denotes the affine transformation, i.e. I_{j|j≠i} is affine-transformed back to I_i by the estimated optical flow; Ĩ_{i→j} denotes the result of the affine transformation from the i-th frame to the j-th frame;
2.2) According to the differences in speed and direction between the special effect and the background-content motion fields, calculate the single-scale special effect prediction map of each frame I_i as:

D_{i→j} = Σ_{c=1}^{C} ||I_i^c − Ĩ_{i→j}^c||_2

where I_i denotes the i-th frame, Ĩ_{i→j} denotes the result of the affine transformation from the i-th frame to the j-th frame, the Euclidean distance between I_i and the affine-transformed result Ĩ_{i→j} is calculated channel by channel over the C channels and the results are accumulated and summed, and D_{i→j} ∈ ℝ^(1×H×W) denotes the calculated single-scale special effect prediction map of I_i from the j-th frame, where ℝ is the real number set, 1 is the channel size, and H and W denote the height and width of the single-scale special effect prediction map;
2.3) After the single-scale special effect prediction maps from the i-th frame I_i to all other frames I_{j|j≠i} in the input sequence frames are calculated, the single-scale special effect prediction map set of I_i is obtained as {D_{i→j} | j ≠ i, i, j ∈ [1, 5]}, where D_{i→j} denotes the calculated single-scale special effect prediction map of the i-th frame from the j-th frame, and i, j ∈ [1, 5] indicates that i and j take values in the closed interval from 1 to 5.
In step 3), the single-scale special effect prediction map sets calculated for each frame are merged, making full use of the different time-scale information in the input sequence to assist the prediction of special effects moving at different rates; the multi-scale special effect prediction is calculated as:

D_i = Max({D_{i→j} | j ≠ i, i, j ∈ [1, 5]})

where D_{i→j} denotes the calculated single-scale special effect prediction map of the i-th frame from the j-th frame; {D_{i→j} | j ≠ i, i, j ∈ [1, 5]} denotes the set of single-scale special effect prediction maps of the i-th frame from all other j-th frames, with j not equal to i and i, j taking values in the closed interval from 1 to 5; Max means that the 4 single-scale special effect prediction maps D_{i→j} with different time spans are combined by taking the maximum value along the time dimension; and D_i ∈ ℝ^(1×H×W) denotes the obtained multi-scale special effect prediction of I_i, where ℝ is the real number set, 1 is the channel size, and H and W denote the height and width of the multi-scale special effect prediction.
In step 4), the multi-scale special effect prediction is adjusted through a self-attention mechanism, i.e. the noise in non-effective regions is suppressed by re-weighting, comprising the following steps:
4.1) Pass the multi-scale special effect prediction of each frame through the self-attention mechanism to obtain new weights, and re-balance the response of each feature in the multi-scale special effect prediction through these weights:

M_i = Sigmoid(H(D_i))

where D_i denotes the multi-scale special effect prediction belonging to I_i, H denotes a convolution layer with kernel size 1 × 1, Sigmoid denotes the activation function used to compute the weights, and M_i denotes the calculated weights;
4.2) Combine the calculated weights M_i with the multi-scale special effect prediction D_i as follows:

D̂_i = M_i ⊙ D_i

where ⊙ denotes element-wise matrix multiplication and D̂_i denotes the re-weighted self-attention multi-scale special effect feature of the i-th frame;
4.3) Concatenate the self-attention multi-scale special effect features of all the input sequence frames along the channel to obtain the self-attention multi-scale special effect set feature:

D = Concat_T(D̂_1, D̂_2, ..., D̂_5)

where Concat_T denotes vector concatenation along the time dimension T; D̂_1, D̂_2 and D̂_5 denote the self-attention multi-scale special effect features of the 1st, 2nd and 5th frames respectively; and D ∈ ℝ^(1×5×H×W) is the merged self-attention multi-scale special effect set feature, where ℝ is the real number set, 1 is the channel size, 5 is the time dimension size, and H and W denote the height and width of the self-attention multi-scale special effect set feature.
In step 5), the features of the input sequence of 5 consecutive frames are extracted through a three-dimensional convolutional neural network layer as:

F = Conv(I)

where I is the input sequence frames, Conv is a three-dimensional convolution layer with kernel size 5 × 5 × 3, and F ∈ ℝ^(C×5×H×W) is the extracted feature of the input sequence frames, where ℝ is the real number set, C is the channel size, 5 is the time dimension size, and H and W denote the height and width of the features of the input sequence frames.
In step 6), the calculated self-attention multi-scale special effect set feature is used to guide the extraction of the special effect part and is fused with the features of the input sequence frames:

F_e = F ⊙ D

where F is the extracted feature of the input sequence frames, D is the self-attention multi-scale special effect set feature, ⊙ denotes element-wise matrix multiplication, and F_e is the fused image frame feature.
In step 7), the separated special effect sequence frames and the transparent channel information are output through a three-dimensional residual convolutional neural network in which a Non-local module is added to each residual module to strengthen temporal information association, comprising the following steps:
7.1) Encode and decode the sequence frame features through the three-dimensional residual convolutional neural network, whose structure consists of 2 parameter-sharing downsampling convolution layers, 4 residual modules with Non-local layers added, 1 upsampling convolution layer and 2 upsampling convolution layers that do not share parameters, with all three-dimensional convolution kernels of size 3 × 3;
7.2) Output the separated special effect sequence frames and the transparent channel information respectively:

E, A = G_e(F_e)

where G_e denotes the three-dimensional residual convolutional neural network; E denotes the separated special effect sequence frames, E ∈ ℝ^(3×5×H×W), i.e. 5 consecutive frames of size 3 × H × W; and A denotes the transparent channel information, A ∈ ℝ^(1×5×H×W), i.e. 5 consecutive frames of size 1 × H × W.
In step 8), the separated special effect sequence frames are subtracted from the input sequence frames to obtain the damaged background sequence frames C_r without special effects:

C_r = I − E

where I denotes the input sequence frames, E denotes the separated special effect sequence frames, and C_r denotes the calculated damaged background sequence frames.
In step 9), the damaged background sequence frames and the transparent channel information are combined and input into a three-dimensional convolutional neural network in which all convolution layers are replaced with gated convolutions for dynamic feature selection and the middle layers are replaced with dilated convolutions with different dilation rates, and the repaired background sequence frames are finally output, comprising the following steps:
9.1) Encode and decode the damaged background sequence frames through the three-dimensional convolutional neural network, whose structure passes sequentially through 2 downsampling convolution layers, 4 dilated convolutions with different dilation rates that perceive damaged blocks of different sizes, and 2 upsampling layers; all convolution layers are replaced with gated convolution layers so that channel information is controlled and fully utilized to avoid redundancy, and all convolution kernels are 3 × 3 three-dimensional convolution kernels;
9.2) Input the damaged background sequence frames and the transparent channel information jointly into the three-dimensional convolutional neural network to obtain the repaired background sequence frames:

C = G_c(C_r, A)

where G_c denotes the three-dimensional convolutional neural network, C_r is the damaged background sequence frames, A denotes the transparent channel information, and C denotes the repaired background sequence frames, C ∈ ℝ^(3×5×H×W), i.e. 5 consecutive frames of size 3 × H × W;
after the special effect layer of the input sequence frames of the original animation video clip is separated and the repaired background content layer is output, the special effect and the background content in the animation are separated.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention is the first to propose using a deep-learning neural network to separate animation videos containing special effects, decomposing an animation video clip into a special effect layer and a background content layer.
2. The method can extract a variety of different special effects in animation, and can recover complete, special-effect-free background content while accurately extracting the special effects.
3. The invention is the first to propose perceiving motion-field differences at multiple scales, which can perceive and locate special effects with large differences in direction, speed and shape distribution; this is embedded into the learning process of the neural network as prior knowledge to guide the network to capture special effect motion characteristics and further help the network learn how to separate special effects.
4. The invention proposes a self-attention mechanism to assist the guidance of multi-scale perceived motion differences, which further guides the network to obtain more accurate special effect motion priors and avoid noise errors.
5. The method introduces a three-dimensional convolutional neural network to repair the damaged background; by using transparency information as a soft auxiliary input and modeling with dilated convolutions of different receptive fields it perceives damaged blocks of different sizes, and by using three-dimensional convolution it takes temporal consistency into account, so that the repaired background is clearer and more complete.
6. The method has wide application in animation video processing tasks, with short inference time and good generalization.
Drawings
FIG. 1 is a logic flow diagram of the method of the present invention.
Fig. 2-1 to fig. 2-5 show input sequence frames of the method of the present invention.
Fig. 3-1 to 3-4 are the single-scale special effect prediction map set calculated with fig. 2-5 as the current frame.
Fig. 4 shows the multi-scale special effect prediction of the current frame obtained by merging.
FIG. 5 is the self-attention multi-scale special effect feature obtained by applying the self-attention mechanism to the multi-scale special effect prediction.
Fig. 6-1 to 6-5 are separated special effect sequence frames.
Fig. 7-1 to 7-5 are sequence frames of background content obtained after the separation and repair.
Detailed Description
The present invention will be further described with reference to the following specific examples.
As shown in fig. 1, the method for separating an animated special effect and background content based on multi-scale motion information provided by this embodiment includes the following steps:
1) Acquire sequence frames containing special effect fragments in an animation video, where each frame is an image whose background contains a special effect. Video sequence frames with special effect segments are obtained using the professional video editing software Adobe Premiere: an animation video with special effects is first collected, and then the segments containing special effects are cut from the video; the sequence frames are continuous image frames sampled at 25 frames per second. The special effects include four different types: rain, snow, fallen petals and fallen leaves. The animation sequence frames are first preprocessed, and all sequence frames are divided into inputs of 5 consecutive frames. As shown in fig. 2-1 to 2-5, the frames are adjacent to each other.
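For illustration only (this code does not appear in the patent), the following minimal Python sketch shows one way to perform the preprocessing described above, assuming the special-effect clips have already been cut out with Adobe Premiere and are decoded at the clip's native 25 frames per second; the function name clip_to_sequences is hypothetical.

```python
# Hypothetical preprocessing sketch: sample frames from a cut clip and group
# them into 5-frame sequences, as described in step 1).
import cv2
import numpy as np

def clip_to_sequences(video_path, seq_len=5):
    """Read a clip and return a list of (seq_len, H, W, 3) uint8 arrays."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()  # frames come out at the clip's native rate (25 fps here)
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    # drop the tail so that every sequence has exactly seq_len frames
    n = len(frames) // seq_len * seq_len
    return [np.stack(frames[k:k + seq_len]) for k in range(0, n, seq_len)]
```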
2) Calculate the single-scale special effect prediction map set between each frame of the input animation sequence and the other frames, i.e. the single-scale special effect prediction maps between the current frame and all other frames. Using the optical flow estimation neural network FlowNet2, calculate for each i-th frame I_i the optical flow to every other frame I_{j|j≠i} in the sequence and the affine transformation back:

Ĩ_{i→j} = W(I_{j|j≠i}, V(I_i, I_{j|j≠i}))

where V denotes the classical optical flow estimation neural network FlowNet2, I_i denotes the i-th frame, I_{j|j≠i} denotes the j-th frame with j not equal to i, and V(I_i, I_{j|j≠i}) denotes the optical flow estimated from the i-th frame to the j-th frame; W denotes the affine transformation, i.e. I_{j|j≠i} is affine-transformed back to I_i by the estimated optical flow; Ĩ_{i→j} denotes the result of the affine transformation from the i-th frame to the j-th frame.
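As an illustration of this warping step (not part of the patent text), the sketch below shows a backward warp by an estimated optical flow in PyTorch; estimate_flow stands for any optical-flow network such as FlowNet2 and is assumed, not defined here, to return a flow field of shape (N, 2, H, W) in pixels.

```python
# Hypothetical sketch of the affine/warping operation W: warp frame I_j back
# to the viewpoint of I_i using the flow estimated from frame i to frame j.
import torch
import torch.nn.functional as F

def warp_by_flow(img_j, flow_i_to_j):
    """img_j: (N, C, H, W) float tensor; flow_i_to_j: (N, 2, H, W) flow in pixels."""
    n, _, h, w = img_j.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(img_j.device)   # (2, H, W) pixel grid
    coords = base.unsqueeze(0) + flow_i_to_j                       # follow the estimated flow
    # normalize sampling coordinates to [-1, 1] as required by grid_sample
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                           # (N, H, W, 2)
    return F.grid_sample(img_j, grid, align_corners=True)

# Usage (estimate_flow is a placeholder for FlowNet2):
# warped = warp_by_flow(I_j, estimate_flow(I_i, I_j))
```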
After the affine transformation results between the current frame and all other frames are obtained, the single-scale special effect prediction map D_{i→j} of the current frame is calculated according to the differences in speed and direction between the special effect and the background-content motion fields:

D_{i→j} = Σ_{c=1}^{C} ||I_i^c − Ĩ_{i→j}^c||_2

where the Euclidean distance between the current frame I_i and the affine-transformed frame Ĩ_{i→j} is calculated channel by channel over the C channels and the results are accumulated and summed, giving the single-scale special effect prediction map from the i-th frame to the j-th frame; D_{i→j} ∈ ℝ^(1×H×W), where ℝ is the real number set, 1 is the channel size, and H and W denote the height and width of the single-scale special effect prediction map. After the single-scale special effect prediction maps from the current frame i to all other frames j are calculated, the single-scale special effect prediction map set of the current frame I_i is obtained. Assuming the current frame is the one shown in fig. 2-5, fig. 3-1 to fig. 3-4 show the calculated single-scale special effect prediction map set.
3) For the obtained single-scale special effect prediction maps D_{i→j}, the set of single-scale special effect prediction maps of each frame is merged into the multi-scale special effect prediction by a maximum operation:

D_i = Max({D_{i→j} | j ≠ i, i, j ∈ [1, 5]})

where {D_{i→j} | j ≠ i, i, j ∈ [1, 5]} denotes the single-scale special effect prediction map set of the current i-th frame calculated from the other frames j, with j not equal to i and values taken in the integer range from 1 to 5; Max denotes the maximum operation that combines the 4 single-scale special effect prediction maps D_{i→j} with different time spans by taking the maximum value along the time dimension, giving the multi-scale special effect prediction D_i ∈ ℝ^(1×H×W) belonging to the i-th frame, where ℝ is the real number set, 1 is the channel size, and H and W denote the height and width of the multi-scale special effect prediction, as shown in fig. 4.
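A minimal sketch of the maximum-based merge (illustrative only); it assumes the four single-scale maps of the current frame have already been computed as above.

```python
# Sketch of the multi-scale special effect prediction D_i: element-wise maximum
# over the 4 single-scale maps computed against the other frames.
import torch

def multi_scale_prediction(single_scale_maps):
    """single_scale_maps: list of 4 tensors (N, 1, H, W) -> (N, 1, H, W)."""
    stacked = torch.stack(single_scale_maps, dim=0)  # (4, N, 1, H, W)
    return stacked.max(dim=0).values
```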
4) After the multi-scale special effect prediction of each frame is obtained, the self-attention multi-scale special effect set feature of the sequence frames is calculated. First, the self-attention multi-scale special effect feature is calculated for the multi-scale special effect prediction of each frame, and each position feature in the multi-scale special effect prediction is re-balanced through the learned weights:

M_i = Sigmoid(H(D_i))

where H denotes a convolution layer with kernel size 1 × 1, Sigmoid denotes the activation function, and the obtained M_i is the self-attention weight.
The calculated self-attention weight is then combined with the original multi-scale special effect prediction:

D̂_i = M_i ⊙ D_i

where ⊙ denotes element-wise matrix multiplication and D̂_i denotes the re-weighted self-attention multi-scale special effect feature. Fig. 5 shows the self-attention multi-scale special effect feature obtained after applying the self-attention mechanism to fig. 4.
Finally, the self-attention multi-scale special effect features of all frames in the whole sequence are fused along the time dimension, i.e. the self-attention multi-scale special effect features containing the attention features of each frame are merged in the time dimension:

D = Concat_T(D̂_1, D̂_2, ..., D̂_5)

where Concat_T denotes vector concatenation along the time dimension T; D̂_1, D̂_2 and D̂_5 denote the self-attention multi-scale special effect features of the 1st, 2nd and 5th frames respectively; and D ∈ ℝ^(1×5×H×W) is the merged self-attention multi-scale special effect set feature, where ℝ is the real number set, 1 is the channel size, 5 is the time dimension size, and H and W denote the height and width of the self-attention multi-scale special effect set feature.
5) For the input sequence frames, features are extracted through a three-dimensional convolutional neural network layer:

F = Conv(I)

where Conv is an ordinary three-dimensional convolution layer with kernel size 5 × 5 × 3 and F is the extracted image feature. The abstract features of the input sequence frames are obtained by convolution.
6) The input sequence frame features and the self-attention special effect set feature are combined for encoding and decoding. The self-attention special effect set feature calculated in step 4) is used to guide the extraction of the special effect, and is fused with the features of the input sequence frames:

F_e = F ⊙ D

where ⊙ denotes element-wise matrix multiplication, i.e. the self-attention special effect set feature and the features of the input sequence frames are multiplied along the channel dimension, and F_e is the fused image frame feature.
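A minimal sketch of steps 5) and 6) (illustrative, not the patent's exact layer): the patent states a 5 × 5 × 3 convolution kernel, written below in PyTorch's (T, H, W) ordering as (3, 5, 5); the channel width of 64 is an assumption.

```python
# Sketch of feature extraction and fusion: a 3D convolution extracts the
# sequence features F, which are gated element-wise by the self-attention
# special effect set feature D to give F_e.
import torch
import torch.nn as nn

class SequenceFeatureExtractor(nn.Module):
    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        # kernel spans 3 frames in time and 5x5 spatially; padding preserves the size
        self.conv = nn.Conv3d(in_ch, feat_ch, kernel_size=(3, 5, 5), padding=(1, 2, 2))

    def forward(self, frames, effect_set):
        """frames: (N, 3, 5, H, W); effect_set D: (N, 1, 5, H, W) from the attention step."""
        feats = self.conv(frames)          # F = Conv(I), shape (N, feat_ch, 5, H, W)
        return feats * effect_set          # F_e = F ⊙ D, broadcast over the channel dimension
```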
7) The image frame features are then encoded and decoded through a three-dimensional residual convolutional neural network to obtain the separated special effect sequence frames and the transparent channel information:

E, A = G_e(F_e)

where G_e denotes the three-dimensional residual convolutional neural network, whose structure consists of 2 downsampling convolution layers that do not share parameters, 4 residual modules with Non-local layers added, 1 upsampling convolution layer and 2 upsampling convolution layers that do not share parameters, with all three-dimensional convolution kernels of size 3 × 3; the special effect sequence frames E and the transparent channel information A are output respectively through the last 2 unshared upsampling convolution layers. E denotes the separated special effect sequence frames, E ∈ ℝ^(3×5×H×W), i.e. 5 consecutive frames of size 3 × H × W; A denotes the transparent channel information, A ∈ ℝ^(1×5×H×W), i.e. 5 consecutive frames of size 1 × H × W.
As shown in fig. 6-1 to 6-5, the separated special effect sequence frames containing transparent channel information are obtained.
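The following is an illustrative sketch of such a separation branch, not the patent's exact network: a 3D encoder, residual blocks each followed by a standard embedded-Gaussian Non-local (self-attention) layer, a shared upsampling stage, and two unshared heads for the special effect frames E and the transparency A. Channel widths, activations and normalization choices are assumptions.

```python
# Hypothetical sketch of the separation network G_e with Non-local residual blocks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocal3d(nn.Module):
    """Embedded-Gaussian non-local block attending over all T*H*W positions (costly; for illustration)."""
    def __init__(self, ch):
        super().__init__()
        self.theta, self.phi, self.g = (nn.Conv3d(ch, ch // 2, 1) for _ in range(3))
        self.out = nn.Conv3d(ch // 2, ch, 1)

    def forward(self, x):
        n, c, t, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)        # (N, THW, C/2)
        k = self.phi(x).flatten(2)                          # (N, C/2, THW)
        v = self.g(x).flatten(2).transpose(1, 2)            # (N, THW, C/2)
        attn = torch.softmax(q @ k, dim=-1)                 # pairwise relations across time and space
        y = (attn @ v).transpose(1, 2).reshape(n, c // 2, t, h, w)
        return x + self.out(y)                              # residual connection

class ResNonLocalBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(nn.Conv3d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                                  nn.Conv3d(ch, ch, 3, padding=1))
        self.nl = NonLocal3d(ch)

    def forward(self, x):
        return self.nl(F.relu(x + self.body(x)))

class EffectSeparationNet(nn.Module):
    def __init__(self, in_ch=64, mid_ch=128):
        super().__init__()
        self.down = nn.Sequential(        # 2 downsampling 3D convolutions (spatial stride 2)
            nn.Conv3d(in_ch, mid_ch, 3, stride=(1, 2, 2), padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(mid_ch, mid_ch, 3, stride=(1, 2, 2), padding=1), nn.ReLU(inplace=True))
        self.blocks = nn.Sequential(*[ResNonLocalBlock(mid_ch) for _ in range(4)])
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=(1, 2, 2), mode="trilinear", align_corners=False),
            nn.Conv3d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True))
        # two unshared output heads: special effect frames E (3 channels) and transparency A (1 channel)
        self.head_e = nn.Sequential(
            nn.Upsample(scale_factor=(1, 2, 2), mode="trilinear", align_corners=False),
            nn.Conv3d(mid_ch, 3, 3, padding=1))
        self.head_a = nn.Sequential(
            nn.Upsample(scale_factor=(1, 2, 2), mode="trilinear", align_corners=False),
            nn.Conv3d(mid_ch, 1, 3, padding=1))

    def forward(self, fe):
        x = self.up(self.blocks(self.down(fe)))
        return self.head_e(x), torch.sigmoid(self.head_a(x))   # E, A = G_e(F_e)
```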
8) After the separated special effect sequence frames and the transparent channel are obtained, the damaged background sequence frames to be repaired are obtained by subtracting the separated special effect sequence frames from the input sequence frames: C_r = I − E, where I denotes the input sequence frames, E denotes the separated special effect sequence frames, and C_r denotes the calculated damaged background sequence frames.
9) The separated animation background content is obtained through a three-dimensional convolutional neural network combined with the transparent channel information obtained above. The damaged background sequence frames to be repaired and the transparent channel information are jointly input into a three-dimensional convolutional neural network that passes sequentially through 2 downsampling convolution layers, 4 dilated convolutions with different dilation rates that perceive damaged blocks of different sizes, and 2 upsampling layers; all convolution layers are replaced with gated convolution layers so that channel information is controlled and the effective channel information is fully utilized to avoid redundancy, and all convolution kernels are 3 × 3 three-dimensional convolution kernels. The separated animation background content sequence frames are obtained as:

C = G_c(C_r, A)

where G_c denotes the three-dimensional convolutional neural network and C denotes the repaired background sequence frames, 5 consecutive frames of size 3 × H × W, as shown in fig. 7-1 to 7-5.
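For illustration only, the sketch below shows one way such a repair branch could be written: every convolution is a gated 3D convolution (a feature path modulated by a sigmoid gate), and the middle layers use dilated convolutions with increasing dilation rates; channel widths and the dilation schedule (1, 2, 4, 8) are assumptions, not taken from the patent.

```python
# Hypothetical sketch of the background repair network G_c with gated and dilated 3D convolutions.
import torch
import torch.nn as nn

class GatedConv3d(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1, dilation=1):
        super().__init__()
        pad = dilation  # keeps the size for a 3x3x3 kernel at stride 1
        self.feat = nn.Conv3d(in_ch, out_ch, 3, stride=stride, padding=pad, dilation=dilation)
        self.gate = nn.Conv3d(in_ch, out_ch, 3, stride=stride, padding=pad, dilation=dilation)

    def forward(self, x):
        # the sigmoid gate performs the "dynamic feature selection" described above
        return torch.relu(self.feat(x)) * torch.sigmoid(self.gate(x))

class BackgroundRepairNet(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.down = nn.Sequential(GatedConv3d(4, ch, stride=(1, 2, 2)),
                                  GatedConv3d(ch, ch, stride=(1, 2, 2)))
        # dilated gated convolutions perceive damaged regions of different sizes
        self.middle = nn.Sequential(*[GatedConv3d(ch, ch, dilation=d) for d in (1, 2, 4, 8)])
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=(1, 2, 2), mode="trilinear", align_corners=False),
            GatedConv3d(ch, ch),
            nn.Upsample(scale_factor=(1, 2, 2), mode="trilinear", align_corners=False),
            nn.Conv3d(ch, 3, 3, padding=1))

    def forward(self, damaged_bg, alpha):
        x = torch.cat([damaged_bg, alpha], dim=1)   # C_r (3 channels) and A (1 channel) as joint input
        return self.up(self.middle(self.down(x)))   # repaired background C = G_c(C_r, A)
```

Given C_r = I − E and the transparency A from the separation branch, a call such as BackgroundRepairNet()(C_r, A) corresponds to C = G_c(C_r, A) in the text above.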
After the special effect sequence frames are separated from the originally input animation video sequence and the repaired animation background sequence frames are obtained, the animation special effect and the background content are separated.
The above-described embodiments are merely preferred embodiments of the present invention, and the scope of the present invention is not limited thereto; any change made according to the shape and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for separating an animation special effect and background content based on multi-scale motion information, characterized by comprising the following steps:
1) acquiring data, wherein the data comprises, as input, sequence frames containing special effect fragments in an animation video;
2) calculating a single-scale special effect prediction map set between each frame and the other frames in the input sequence frames;
3) merging the single-scale special effect prediction map set of each frame into a multi-scale special effect prediction;
4) adjusting the multi-scale special effect prediction through a self-attention mechanism to obtain self-attention multi-scale special effect set features;
5) extracting features of the input sequence frames through a three-dimensional convolutional neural network layer;
6) combining the features of the input sequence frames with the self-attention multi-scale special effect set features;
7) outputting the separated special effect sequence frames and the transparent channel information through a three-dimensional residual convolutional neural network in which a Non-local module is added to each residual module to strengthen temporal information association;
8) subtracting the separated special effect sequence frames from the input sequence frames to obtain damaged background sequence frames;
9) combining the damaged background sequence frames with the transparent channel information and inputting them into a three-dimensional convolutional neural network in which all convolution layers are replaced with gated convolutions for dynamic feature selection and the middle layers are replaced with dilated convolutions with different dilation rates, and finally outputting the repaired background sequence frames.
2. The method of claim 1, wherein in step 1), the animation video with special effect fragments refers to video clips obtained by cutting collected animation videos containing special effects with the professional video editing software Adobe Premiere, the special effect types including rain, snow, fallen petals and fallen leaves; the sequence frames refer to continuous image frames sampled from a video clip at 25 frames per second, and the continuous images are divided by data preprocessing into sequences of 5 frames each:

I = {I_1, I_2, I_3, I_4, I_5}, I_i ∈ ℝ^(C×H×W)

where I denotes that the input is a sequence of 5 consecutive frames, I_1 denotes the first frame, I_2 the second frame, and I_5 the fifth frame; ℝ is the real number set, C is the number of channels, and H and W denote the height and width of a frame.
3. The method of claim 1, wherein in step 2), the single-scale special effect prediction map set between each frame and the other frames in the input sequence frames is calculated, comprising the following steps:
2.1) using the optical flow estimation neural network FlowNet2, calculating the optical flow between each frame I_i and every other frame I_{j|j≠i} in the sequence frames, and warping I_j back to I_i by an affine transformation:

Ĩ_{i→j} = W(I_{j|j≠i}, V(I_i, I_{j|j≠i}))

where V denotes the optical flow estimation neural network FlowNet2, I_i denotes the i-th frame, I_{j|j≠i} denotes the j-th frame with j not equal to i, and V(I_i, I_{j|j≠i}) denotes the optical flow estimated from the i-th frame to the j-th frame; W denotes the affine transformation, i.e. I_{j|j≠i} is affine-transformed back to I_i by the estimated optical flow; Ĩ_{i→j} denotes the result of the affine transformation from the i-th frame to the j-th frame;
2.2) according to the differences in speed and direction between the special effect and the background-content motion fields, calculating the single-scale special effect prediction map of each frame I_i as:

D_{i→j} = Σ_{c=1}^{C} ||I_i^c − Ĩ_{i→j}^c||_2

where I_i denotes the i-th frame, Ĩ_{i→j} denotes the result of the affine transformation from the i-th frame to the j-th frame, the Euclidean distance between I_i and the affine-transformed result Ĩ_{i→j} is calculated channel by channel over the C channels and the results are accumulated and summed, and D_{i→j} ∈ ℝ^(1×H×W) denotes the calculated single-scale special effect prediction map of I_i from the j-th frame, where ℝ is the real number set, 1 is the channel size, and H and W denote the height and width of the single-scale special effect prediction map;
2.3) after the single-scale special effect prediction maps from the i-th frame I_i to all other frames I_{j|j≠i} in the input sequence frames are calculated, the single-scale special effect prediction map set of I_i is obtained as {D_{i→j} | j ≠ i, i, j ∈ [1, 5]}, where D_{i→j} denotes the calculated single-scale special effect prediction map of the i-th frame from the j-th frame, and i, j ∈ [1, 5] indicates that i and j take values in the closed interval from 1 to 5.
4. The method of claim 1, wherein in step 3), the single-scale special effect prediction map sets calculated for each frame are merged, making full use of the different time-scale information in the input sequence to assist the prediction of special effects moving at different rates; the multi-scale special effect prediction is calculated as:

D_i = Max({D_{i→j} | j ≠ i, i, j ∈ [1, 5]})

where D_{i→j} denotes the calculated single-scale special effect prediction map of the i-th frame from the j-th frame; {D_{i→j} | j ≠ i, i, j ∈ [1, 5]} denotes the set of single-scale special effect prediction maps of the i-th frame from all other j-th frames, with j not equal to i and i, j taking values in the closed interval from 1 to 5; Max means that the 4 single-scale special effect prediction maps D_{i→j} with different time spans are combined by taking the maximum value along the time dimension; and D_i ∈ ℝ^(1×H×W) denotes the obtained multi-scale special effect prediction of I_i, where ℝ is the real number set, 1 is the channel size, and H and W denote the height and width of the multi-scale special effect prediction.
5. The method of claim 1, wherein in step 4), the multi-scale special effect prediction is adjusted through a self-attention mechanism, i.e. the noise in non-effective regions is suppressed by re-weighting, comprising the following steps:
4.1) passing the multi-scale special effect prediction of each frame through the self-attention mechanism to obtain new weights, and re-balancing the response of each feature in the multi-scale special effect prediction through these weights:

M_i = Sigmoid(H(D_i))

where D_i denotes the multi-scale special effect prediction belonging to I_i, H denotes a convolution layer with kernel size 1 × 1, Sigmoid denotes the activation function used to compute the weights, and M_i denotes the calculated weights;
4.2) combining the calculated weights M_i with the multi-scale special effect prediction D_i as follows:

D̂_i = M_i ⊙ D_i

where ⊙ denotes element-wise matrix multiplication and D̂_i denotes the re-weighted self-attention multi-scale special effect feature of the i-th frame;
4.3) concatenating the self-attention multi-scale special effect features of all the input sequence frames along the channel to obtain the self-attention multi-scale special effect set feature:

D = Concat_T(D̂_1, D̂_2, ..., D̂_5)

where Concat_T denotes vector concatenation along the time dimension T; D̂_1, D̂_2 and D̂_5 denote the self-attention multi-scale special effect features of the 1st, 2nd and 5th frames respectively; and D ∈ ℝ^(1×5×H×W) is the merged self-attention multi-scale special effect set feature, where ℝ is the real number set, 1 is the channel size, 5 is the time dimension size, and H and W denote the height and width of the self-attention multi-scale special effect set feature.
6. The method of claim 1, wherein in step 5), the features of the input sequence of 5 consecutive frames are extracted through a three-dimensional convolutional neural network layer as:

F = Conv(I)

where I is the input sequence frames, Conv is a three-dimensional convolution layer with kernel size 5 × 5 × 3, and F ∈ ℝ^(C×5×H×W) is the extracted feature of the input sequence frames, where ℝ is the real number set, C is the channel size, 5 is the time dimension size, and H and W denote the height and width of the features of the input sequence frames.
7. The method of claim 1, wherein in step 6), the calculated self-attention multi-scale special effect set feature is used to guide the extraction of the special effect part and is fused with the features of the input sequence frames:

F_e = F ⊙ D

where F is the extracted feature of the input sequence frames, D is the self-attention multi-scale special effect set feature, ⊙ denotes element-wise matrix multiplication, and F_e is the fused image frame feature.
8. The method of claim 1, wherein in step 7), the separated special effect sequence frames and the transparent channel information are output through a three-dimensional residual convolutional neural network in which a Non-local module is added to each residual module to strengthen temporal information association, comprising the following steps:
7.1) encoding and decoding the sequence frame features through the three-dimensional residual convolutional neural network, whose structure consists of 2 parameter-sharing downsampling convolution layers, 4 residual modules with Non-local layers added, 1 upsampling convolution layer and 2 upsampling convolution layers that do not share parameters, with all three-dimensional convolution kernels of size 3 × 3;
7.2) outputting the separated special effect sequence frames and the transparent channel information respectively:

E, A = G_e(F_e)

where G_e denotes the three-dimensional residual convolutional neural network; E denotes the separated special effect sequence frames, E ∈ ℝ^(3×5×H×W), i.e. 5 consecutive frames of size 3 × H × W; and A denotes the transparent channel information, A ∈ ℝ^(1×5×H×W), i.e. 5 consecutive frames of size 1 × H × W.
9. The method of claim 1, wherein in step 8), the separated special effect sequence frames are subtracted from the input sequence frames to obtain the damaged background sequence frames C_r without special effects:

C_r = I − E

where I denotes the input sequence frames, E denotes the separated special effect sequence frames, and C_r denotes the calculated damaged background sequence frames.
10. The method of claim 1, wherein in step 9), the damaged background sequence frames and the transparent channel information are combined and input into a three-dimensional convolutional neural network in which all convolution layers are replaced with gated convolutions for dynamic feature selection and the middle layers are replaced with dilated convolutions with different dilation rates, and the repaired background sequence frames are finally output, comprising the following steps:
9.1) encoding and decoding the damaged background sequence frames through the three-dimensional convolutional neural network, whose structure passes sequentially through 2 downsampling convolution layers, 4 dilated convolutions with different dilation rates that perceive damaged blocks of different sizes, and 2 upsampling layers; all convolution layers are replaced with gated convolution layers so that channel information is controlled and fully utilized to avoid redundancy, and all convolution kernels are 3 × 3 three-dimensional convolution kernels;
9.2) inputting the damaged background sequence frames and the transparent channel information jointly into the three-dimensional convolutional neural network to obtain the repaired background sequence frames:

C = G_c(C_r, A)

where G_c denotes the three-dimensional convolutional neural network, C_r is the damaged background sequence frames, A denotes the transparent channel information, and C denotes the repaired background sequence frames, C ∈ ℝ^(3×5×H×W), i.e. 5 consecutive frames of size 3 × H × W;
after the special effect layer of the input sequence frames of the original animation video clip is separated and the repaired background content layer is output, the special effect and the background content in the animation are separated.
CN202110101404.XA, filed 2021-01-26 (priority date 2021-01-26): Method for separating animation special effect and background content based on multi-scale motion information. Active; granted as CN112686922B.

Priority Applications (1)

Application Number: CN202110101404.XA
Priority Date / Filing Date: 2021-01-26
Title: Method for separating animation special effect and background content based on multi-scale motion information

Publications (2)

Publication Number    Publication Date
CN112686922A          2021-04-20
CN112686922B          2022-10-25

Family

ID: 75459206

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090154807A1 (en) * 2005-12-30 2009-06-18 Telecom Italia S.P.A. Edge Comparison in Segmentation of Video Sequences
WO2015052514A2 (en) * 2013-10-08 2015-04-16 Digimania Limited Rendering composites/layers for video animations
CN108520501A (en) * 2018-03-30 2018-09-11 西安交通大学 A kind of video and removes rain snow method based on multiple dimensioned convolution sparse coding
CN111259782A (en) * 2020-01-14 2020-06-09 北京大学 Video behavior identification method based on mixed multi-scale time sequence separable convolution operation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XINCHEN YE 等: "Foreground–Background Separation From Video Clips via Motion-Assisted Matrix Restoration", 《IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY ( VOLUME: 25, ISSUE: 11, NOVEMBER 2015)》 *
缪佩琦: "基于深度感知网络的草图简化和动画特效迁移", 《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant