CN112686922B - Method for separating animation special effect and background content based on multi-scale motion information - Google Patents

Method for separating animation special effect and background content based on multi-scale motion information

Info

Publication number
CN112686922B
CN112686922B (application CN202110101404.XA; application publication CN112686922A)
Authority
CN
China
Prior art keywords
special effect
frame
scale
sequence
frames
Prior art date
Legal status
Active
Application number
CN202110101404.XA
Other languages
Chinese (zh)
Other versions
CN112686922A (en)
Inventor
徐雪妙
屈玮
韩楚
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology (SCUT)
Priority to CN202110101404.XA
Publication of CN112686922A
Application granted
Publication of CN112686922B
Legal status: Active


Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a method for separating animation special effects and background content based on multi-scale motion information, which comprises the following steps: 1) acquiring sequence frames containing special effect fragments from an animation video; 2) calculating a single-scale special effect prediction map set between each frame and the other frames in the sequence; 3) merging the single-scale special effect prediction map set of each frame into a multi-scale special effect prediction; 4) obtaining self-attention multi-scale special effect set features through a self-attention mechanism; 5) extracting the features of the input sequence frames through a three-dimensional convolutional neural network layer; 6) combining the features of the sequence frames with the self-attention multi-scale special effect set features; 7) separating the special effect sequence frames and the transparent channel information through a three-dimensional residual convolutional neural network; 8) subtracting the special effect sequence frames from the input sequence frames to obtain damaged background sequence frames; 9) obtaining the repaired background sequence frames through a three-dimensional convolutional neural network. The method can be applied to special effect migration and can improve the accuracy of segmenting and recognizing specific objects in animation.

Description

Method for separating animation special effect and background content based on multi-scale motion information
Technical Field
The invention relates to the technical field of video separation, and in particular to a method for separating animation special effects and background content based on multi-scale motion information.
Background
Cartoon special effects are widely used in animation as a form of visual artistic expression. To depict weather conditions and environments, artists often add various cartoon special effects such as rain, snow, falling leaves and falling petals. These special effects not only represent the environment but also enrich the visual expressiveness of the animation. Although animation-oriented vision research has received wide attention in recent years, part of the background information in an animation scene is often occluded by cartoon special effects, so that information is lost when the animation background is analyzed, which hinders research directions such as segmentation of specific objects in animation and analysis of animation backgrounds. At the same time, since cartoon special effects are visually prominent elements, layering and migrating them are widely applied, so separating cartoon special effects from the background in animation has become an urgent research direction.
However, cartoon special effects in animation move irregularly, and they are complex in type and varied in size, which increases the difficulty of separating them with traditional rule-based methods. Meanwhile, cartoon special effect data sets are scarce, which increases the difficulty of separating special effects with deep learning methods. At present, some methods separate foreground and background through deep learning or traditional temporal methods, but they are not suitable for separating cartoon special effects in animation, because the videos these methods usually consider are natural videos, and the foreground distribution in natural videos is generally inconsistent with the special effect distribution in cartoon animation; for example, the size and shape of special effects in animation are highly variable and their positions of appearance are more unpredictable. Therefore, how to accurately separate cartoon special effects and background in animation has become a key problem.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art, and provides a method for separating animation special effects and background content based on multi-scale motion information. The method can separate an animation video clip into fine special effects and clean background content, can be effectively applied to the separation of different special effects in animation, and, after the special effects are effectively separated, can further repair the background content behind the special effects, which greatly benefits downstream applications such as segmentation, recognition and special effect migration.
In order to achieve the above purpose, the technical scheme provided by the invention is as follows: a method for separating an animation special effect and background content based on multi-scale motion information, comprising the following steps (a high-level code sketch of the full pipeline follows the list of steps):
1) Acquiring data, including sequence frames with special effect fragments in the animation video as input;
2) Calculating a single-scale special effect prediction image set between each frame and other frames in the input sequence frames;
3) Merging the single-scale special effect prediction image set of each frame to be used as multi-scale special effect prediction;
4) Adjusting the multi-scale special effect prediction through a self-attention mechanism to obtain self-attention multi-scale special effect set characteristics;
5) Extracting the characteristics of an input sequence frame through a three-dimensional convolution neural network layer;
6) Combining the characteristics of the input sequence frame and the self-attention multi-scale special effect set characteristics;
7) Through a three-dimensional residual convolutional neural network, adding a Non-local module into each residual module to strengthen time sequence information association, and then outputting a separated special effect sequence frame and transparent channel information;
8) Obtaining a damaged background sequence frame by subtracting the separated special effect sequence frame from the input sequence frame;
9) Combining the damaged background sequence frame with the transparent channel information and inputting them into a three-dimensional convolutional neural network, in which all convolution layers are replaced by gated convolutions for dynamic feature selection and the middle layers are replaced by dilated convolutions with different dilation rates, and finally outputting the repaired background sequence frame.
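The following Python/PyTorch-style sketch is provided for illustration only and is not the patented implementation itself: it strings the nine steps together to make the data flow concrete. All helper names (single_scale_predictions, multi_scale_prediction, attention_set_feature, gate, encoder, separator, inpainter) are assumptions of this sketch and are defined in the per-step sketches given further below.

    import torch

    def separate_effects(frames, flow_net, gate, encoder, separator, inpainter):
        # frames: list of 5 consecutive RGB frames, each a tensor of shape (1, 3, H, W).
        d_list = []
        for i in range(5):
            preds = single_scale_predictions(frames, flow_net, i)   # step 2
            d_list.append(multi_scale_prediction(preds))            # step 3
        attention_set = attention_set_feature(d_list, gate)         # step 4
        clip = torch.stack(frames, dim=2)                           # (1, 3, 5, H, W)
        fused = encoder(clip, attention_set)                        # steps 5-6
        effects, alpha = separator(fused)                           # step 7: E, A
        damaged = clip - effects                                    # step 8: C_r = I - E
        background = inpainter(damaged, alpha)                      # step 9: C = G_c(C_r, A)
        return effects, alpha, background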
In step 1), the animation video with special effect fragments refers to video clips cut, using the professional video editing software Adobe Premiere, from collected animation videos that contain special effects; the special effect types include rain, snow, falling petals and falling leaves. A sequence frame refers to consecutive image frames sampled from a video clip at 25 frames per second, and through data preprocessing the consecutive images are divided into sequences of 5 frames each:

I = {I_1, I_2, I_3, I_4, I_5}, I_i ∈ ℝ^{C×H×W}

where I denotes the input sequence of 5 consecutive frames, I_1 denotes the first frame, I_2 the second frame, and I_5 the fifth frame; ℝ is the real number set, C is the number of channels, and H and W denote the height and width of a frame.
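As a small illustration of the preprocessing described above (a sketch assuming the clip has already been cut in Adobe Premiere and decoded at 25 frames per second; the helper name is hypothetical), consecutive frames can be grouped into units of 5 as follows:

    def split_into_sequences(frames, window=5):
        # frames: decoded image frames of one special-effect clip, in temporal order.
        # Returns non-overlapping groups of `window` consecutive frames; a trailing
        # remainder shorter than `window` is dropped.
        return [frames[k:k + window] for k in range(0, len(frames) - window + 1, window)]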
In step 2), a single-scale special effect prediction map set between each frame and the other frames in the input sequence is calculated, comprising the following steps:

2.1) Using the optical flow estimation neural network FlowNet2, the optical flow between each frame I_i and every other frame I_{j|j≠i} in the sequence is computed, and I_{j|j≠i} is affine-warped back by the optical flow:

Î_{i→j} = W(I_{j|j≠i}, V(I_i, I_{j|j≠i}))

where V denotes the optical flow estimation neural network FlowNet2, I_i denotes the i-th frame, I_{j|j≠i} denotes the j-th frame with j ≠ i, V(I_i, I_{j|j≠i}) denotes the optical flow estimated from the i-th frame to the j-th frame, W denotes the affine (warping) transformation, i.e. I_{j|j≠i} is warped back to I_i by the estimated optical flow, and Î_{i→j} denotes the result of the affine transformation from the i-th frame to the j-th frame;

2.2) Based on the difference in speed and direction between the motion fields of the special effect and the background content, the single-scale special effect prediction map of each frame I_i is computed as:

D_{i→j} = Σ_C ‖I_i − Î_{i→j}‖_2

where I_i denotes the i-th frame, Î_{i→j} denotes the affine transformation result from the i-th frame to the j-th frame, Σ_C ‖·‖_2 denotes computing the Euclidean distance between I_i and Î_{i→j} channel by channel over the channels C and accumulating the results, D_{i→j} denotes the computed single-scale special effect prediction map of I_i from the j-th frame, and D_{i→j} ∈ ℝ^{1×H×W}, where ℝ is the real number set, 1 is the channel size, and H and W denote the height and width of the single-scale special effect prediction map;

2.3) After the single-scale special effect prediction maps from the i-th frame I_i to all other frames I_{j|j≠i} in the input sequence are computed, the set of single-scale special effect prediction maps of I_i is obtained as {D_{i→j} | j ≠ i, i, j ∈ [1,5]}, where D_{i→j} denotes the computed single-scale special effect prediction map of the i-th frame from the j-th frame, j ≠ i, and i and j take values in the closed interval from 1 to 5.
In step 3), the single-scale special effect prediction map sets computed for each frame are merged, making full use of the information at different time scales in the input sequence to assist the prediction of special effects moving at different rates; the multi-scale special effect prediction is computed as:

D_i = Max({D_{i→j} | j ≠ i, i, j ∈ [1,5]})

where D_{i→j} denotes the computed single-scale special effect prediction map of the i-th frame from the j-th frame; {D_{i→j} | j ≠ i, i, j ∈ [1,5]} denotes the set of single-scale special effect prediction maps of the i-th frame from all other j-th frames, with i and j taking values in the closed interval from 1 to 5; Max denotes taking the element-wise maximum over the time dimension of the 4 single-scale special effect prediction maps D_{i→j} with different time spans; D_i denotes the obtained multi-scale special effect prediction belonging to I_i, with D_i ∈ ℝ^{1×H×W}, where ℝ is the real number set, 1 is the channel size, and H and W denote the height and width of the multi-scale special effect prediction.
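Continuing the same illustrative helpers, step 3 reduces the four single-scale maps of a frame to one multi-scale prediction by a per-pixel maximum over the time spans:

    import torch

    def multi_scale_prediction(preds):
        # preds: dict {j: tensor of shape (1, 1, H, W)} from single_scale_predictions.
        stacked = torch.stack(list(preds.values()), dim=0)   # (4, 1, 1, H, W)
        return stacked.max(dim=0).values                     # element-wise max over time spans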
In step 4), the multi-scale special effect prediction is adjusted through a self-attention mechanism, i.e. the noise of non-effective regions is suppressed by re-weighting, comprising the following steps:

4.1) The multi-scale special effect prediction of each frame is passed through a self-attention mechanism to obtain a new weight, and the response of each feature in the multi-scale special effect prediction is re-balanced by this weight:

M_i = Sigmoid(H(D_i))

where D_i denotes the multi-scale special effect prediction belonging to I_i, H denotes a convolution layer with kernel size 1×1, Sigmoid denotes the activation function applied to the computed features, and M_i denotes the computed weight;

4.2) The computed weight M_i is combined with the multi-scale special effect prediction D_i as:

D̂_i = M_i ⊙ D_i

where ⊙ denotes element-wise matrix multiplication and D̂_i denotes the re-weighted self-attention multi-scale special effect feature of the i-th frame;

4.3) The self-attention multi-scale special effect features of all input sequence frames are combined along the time dimension to obtain the self-attention multi-scale special effect set feature:

D = Concat_T(D̂_1, D̂_2, ..., D̂_5)

where Concat_T denotes vector concatenation along the time dimension T; D̂_1, D̂_2 and D̂_5 denote the self-attention multi-scale special effect features of the 1st, 2nd and 5th frames respectively; D is the combined self-attention multi-scale special effect set feature, with D ∈ ℝ^{1×5×H×W}, where ℝ is the real number set, 1 is the channel size, 5 is the size of the time dimension, and H and W denote the height and width of the self-attention multi-scale special effect set feature.
In step 5), the features of the input sequence of 5 consecutive frames are extracted through a three-dimensional convolutional neural network layer as:

F = Conv(I)

where I is the input sequence frame, Conv is a three-dimensional convolution layer with kernel size 5×5×3, F is the extracted feature of the input sequence frames, and F ∈ ℝ^{C×5×H×W}, where ℝ is the real number set, C is the channel size, 5 is the size of the time dimension, and H and W denote the height and width of the features of the input sequence frames.
In step 6), the computed self-attention multi-scale special effect set feature is used to guide the extraction of the special effect part, and it is fused with the features of the input sequence frames:

F_e = F ⊙ D

where F is the extracted feature of the input sequence frames, D is the self-attention multi-scale special effect set feature, ⊙ denotes element-wise matrix multiplication, and F_e is the fused image frame feature.
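A sketch covering steps 5 and 6 under the same assumptions: one 3D convolution extracts the sequence features F, and the attention set feature D gates them element-wise to give F_e. The kernel ordering (3 temporal × 5×5 spatial) is this sketch's reading of the patent's "5×5×3", and the channel width is arbitrary.

    import torch.nn as nn

    class SequenceEncoder(nn.Module):
        def __init__(self, in_channels=3, out_channels=64):
            super().__init__()
            # Conv3d kernels are ordered (T, H, W); "5x5x3" is read here as 3 temporal x 5x5 spatial.
            self.conv = nn.Conv3d(in_channels, out_channels,
                                  kernel_size=(3, 5, 5), padding=(1, 2, 2))

        def forward(self, frames, attention_set):
            # frames: (1, 3, 5, H, W); attention_set: (1, 1, 5, H, W) from step 4.
            feat = self.conv(frames)                   # F = Conv(I)
            return feat * attention_set                # F_e = F ⊙ D (broadcast over channels)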
In step 7), a Non-local module is added to each residual module of a three-dimensional residual convolutional neural network to strengthen temporal information association, and the separated special effect sequence frames and transparent channel information are then output, comprising the following steps:

7.1) The sequence frame features are encoded and decoded by a three-dimensional residual convolutional neural network whose structure consists of 2 parameter-sharing down-sampling convolution layers, 4 residual modules with Non-local layers, 1 up-sampling convolution layer and 2 up-sampling convolution layers that do not share parameters, with all three-dimensional convolution kernels of size 3×3×3;

7.2) The separated special effect sequence frames and transparent channel information are output:

(E, A) = G_e(F_e)

where G_e denotes the three-dimensional residual convolutional neural network; E denotes the separated special effect sequence frames, with E ∈ ℝ^{3×5×H×W}, i.e. 5 consecutive frames of size 3×H×W; A denotes the transparent channel information, with A ∈ ℝ^{1×5×H×W}, i.e. 5 consecutive frames of size 1×H×W.
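A simplified sketch of the separation network G_e is shown below for illustration: a residual 3D block with an embedded-Gaussian non-local module, stacked between down-sampling and up-sampling stages, with two unshared heads for the effect frames E and the alpha channel A. Channel widths, activation placement, the sigmoid on the alpha head, and the omission of parameter sharing in the down-sampling layers are simplifying assumptions of this sketch, not details fixed by the patent.

    import torch
    import torch.nn as nn

    class NonLocal3d(nn.Module):
        # Simplified non-local (self-attention) block over space-time.
        def __init__(self, channels):
            super().__init__()
            inter = max(channels // 2, 1)
            self.theta = nn.Conv3d(channels, inter, 1)
            self.phi = nn.Conv3d(channels, inter, 1)
            self.g = nn.Conv3d(channels, inter, 1)
            self.out = nn.Conv3d(inter, channels, 1)

        def forward(self, x):
            b, c, t, h, w = x.shape
            q = self.theta(x).flatten(2).transpose(1, 2)      # (B, THW, C')
            k = self.phi(x).flatten(2)                        # (B, C', THW)
            v = self.g(x).flatten(2).transpose(1, 2)          # (B, THW, C')
            attn = torch.softmax(q @ k, dim=-1)               # pairwise space-time affinity
            y = (attn @ v).transpose(1, 2).reshape(b, -1, t, h, w)
            return x + self.out(y)                            # residual connection

    class ResNonLocalBlock3d(nn.Module):
        # Residual 3D convolution block with a non-local module.
        def __init__(self, channels):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv3d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv3d(channels, channels, 3, padding=1))
            self.non_local = NonLocal3d(channels)

        def forward(self, x):
            return self.non_local(x + self.body(x))

    class EffectSeparator(nn.Module):
        # Skeleton of G_e: spatial down-sampling, 4 residual/non-local blocks, up-sampling,
        # then two unshared heads for the effect frames E (3 channels) and alpha A (1 channel).
        def __init__(self, channels=64):
            super().__init__()
            self.down = nn.Sequential(
                nn.Conv3d(channels, channels, 3, stride=(1, 2, 2), padding=1), nn.ReLU(inplace=True),
                nn.Conv3d(channels, channels, 3, stride=(1, 2, 2), padding=1), nn.ReLU(inplace=True))
            self.blocks = nn.Sequential(*[ResNonLocalBlock3d(channels) for _ in range(4)])
            self.up = nn.Sequential(
                nn.Upsample(scale_factor=(1, 2, 2), mode="trilinear", align_corners=False),
                nn.Conv3d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
            self.head_e = nn.Sequential(
                nn.Upsample(scale_factor=(1, 2, 2), mode="trilinear", align_corners=False),
                nn.Conv3d(channels, 3, 3, padding=1))
            self.head_a = nn.Sequential(
                nn.Upsample(scale_factor=(1, 2, 2), mode="trilinear", align_corners=False),
                nn.Conv3d(channels, 1, 3, padding=1))

        def forward(self, fe):                                # fe: (B, C, 5, H, W)
            x = self.up(self.blocks(self.down(fe)))
            return self.head_e(x), torch.sigmoid(self.head_a(x))   # E, A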
In step 8), the separated special effect sequence frames are subtracted from the input sequence frames to obtain the damaged background sequence frames C_r without special effects:

C_r = I − E

where I denotes the input sequence frames, E denotes the separated special effect sequence frames, and C_r denotes the computed damaged background sequence frames.
In step 9), the damaged background sequence frames and the transparent channel information are combined and input into a three-dimensional convolutional neural network, in which all convolution layers are replaced by gated convolutions for dynamic feature selection and the middle layers are replaced by dilated convolutions with different dilation rates; the repaired background sequence frames are finally output, comprising the following steps:

9.1) The damaged background sequence frames are repaired by encoding and decoding through a three-dimensional convolutional neural network; the network is designed to pass sequentially through 2 down-sampling convolution layers, 4 dilated-convolution blocks with different dilation rates for perceiving damaged regions, and 2 up-sampling layers; all convolution layers are replaced by gated convolution layers so that channel information is fully exploited and redundancy is avoided, and all convolution kernels are 3×3×3 three-dimensional kernels;

9.2) The damaged background sequence frames and the transparent channel information are jointly input into the three-dimensional convolutional neural network to obtain the repaired background sequence frames:

C = G_c(C_r, A)

where G_c denotes the three-dimensional convolutional neural network, C_r is the damaged background sequence frame, A denotes the transparent channel information, C denotes the repaired background sequence frames, and C ∈ ℝ^{3×5×H×W}, i.e. 5 consecutive frames of size 3×H×W.
After the special effect layer is separated from the input sequence frames of the original animation video clip and the repaired background content layer is output, the special effect and the background content in the animation are separated.
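A sketch of the inpainting network G_c in step 9 is given below for illustration, showing the gated-convolution idea (a feature branch modulated element-wise by a sigmoid gate branch) and a dilated middle stage; the specific dilation rates, channel widths and activations are assumptions of this sketch rather than values taken from the patent.

    import torch
    import torch.nn as nn

    class GatedConv3d(nn.Module):
        # Gated convolution: a feature branch is modulated element-wise by a sigmoid gate branch.
        def __init__(self, in_ch, out_ch, dilation=1, stride=1):
            super().__init__()
            pad = dilation
            self.feature = nn.Conv3d(in_ch, out_ch, 3, stride=stride, padding=pad, dilation=dilation)
            self.gate = nn.Conv3d(in_ch, out_ch, 3, stride=stride, padding=pad, dilation=dilation)

        def forward(self, x):
            return torch.relu(self.feature(x)) * torch.sigmoid(self.gate(x))

    class BackgroundInpainter(nn.Module):
        # Skeleton of G_c: gated down-sampling, gated dilated middle blocks, up-sampling.
        def __init__(self, channels=64):
            super().__init__()
            self.down = nn.Sequential(
                GatedConv3d(4, channels, stride=(1, 2, 2)),        # input: damaged frames (3) + alpha (1)
                GatedConv3d(channels, channels, stride=(1, 2, 2)))
            self.middle = nn.Sequential(*[GatedConv3d(channels, channels, dilation=d)
                                          for d in (1, 2, 4, 8)])   # dilation rates are illustrative
            self.up = nn.Sequential(
                nn.Upsample(scale_factor=(1, 2, 2), mode="trilinear", align_corners=False),
                GatedConv3d(channels, channels),
                nn.Upsample(scale_factor=(1, 2, 2), mode="trilinear", align_corners=False),
                nn.Conv3d(channels, 3, 3, padding=1))

        def forward(self, damaged, alpha):
            # damaged: (B, 3, 5, H, W); alpha: (B, 1, 5, H, W); returns C = G_c(C_r, A).
            return self.up(self.middle(self.down(torch.cat((damaged, alpha), dim=1))))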
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention is the first to use a deep neural network to separate an animation video containing special effects, splitting an animation video clip into a special effect layer and a background content layer.
2. The method can extract many different kinds of special effects in animation, and while extracting the special effects accurately it can also recover complete background content free of special effects.
3. The invention is the first to propose perceiving motion-field differences at multiple scales, which makes it possible to perceive and locate special effects whose direction, speed and shape distributions differ greatly; this perception is embedded into the learning process of the neural network as prior knowledge to guide the network to capture special effect motion characteristics and thus further help it learn how to separate special effects.
4. The invention proposes a self-attention mechanism to assist the guidance given by the multi-scale perception of motion differences, which further guides the network to obtain more accurate special effect motion priors and avoid noise errors.
5. The invention introduces a three-dimensional convolutional neural network to repair the damaged background; damaged blocks of different sizes are perceived by taking the transparency information as a soft auxiliary input and by dilated convolutions with different receptive fields, and the three-dimensional convolution takes temporal consistency into account, so that the repaired background is clearer and more complete.
6. The method has broad applicability in animation video processing tasks, with short inference time and good generalization.
Drawings
FIG. 1 is a logic flow diagram of the method of the present invention.
Fig. 2-1 to fig. 2-5 show input sequence frames of the method of the present invention.
Figs. 3-1 to 3-4 are the single-scale special effect prediction map set calculated with the frame of fig. 2-5 as the current frame.
Fig. 4 is the multi-scale special effect prediction of the current frame obtained by merging.
Fig. 5 is the self-attention multi-scale special effect feature obtained by applying the self-attention mechanism to the multi-scale special effect prediction.
Fig. 6-1 to 6-5 are separated special effect sequence frames.
Fig. 7-1 to 7-5 are sequence frames of background content obtained after the separation and repair.
Detailed Description
The present invention is further illustrated by the following examples.
As shown in fig. 1, the method for separating an animated special effect and background content based on multi-scale motion information provided by this embodiment includes the following steps:
1) Sequence frames with special effect fragments are acquired from the animation video; each frame is an image whose background contains a special effect. Video sequence frames with special effect segments are obtained using the professional video editing software Adobe Premiere: an animation video with special effects is first collected, then the fragments containing the special effects are cut from the video, and the sequence frames are the consecutive image frames sampled at 25 frames per second. The special effects include four different types: rain, snow, falling petals and falling leaves. The animation sequence frames are first preprocessed, and all sequence frames are divided into inputs of 5 consecutive frames. As shown in figs. 2-1 to 2-5, the frames are adjacent to each other.
2) A single-scale special effect prediction map set between each frame of the input animation sequence and the other frames is calculated; the set contains the single-scale special effect prediction maps between the current frame and all other frames. Through the optical flow estimation neural network FlowNet2, the optical flow between every i-th frame I_i and the other frames I_{j|j≠i} in the sequence is estimated, and the other frames are affine-warped back:

Î_{i→j} = W(I_{j|j≠i}, V(I_i, I_{j|j≠i}))

where V denotes the classical optical flow estimation neural network FlowNet2, I_i denotes the i-th frame, I_{j|j≠i} denotes the j-th frame with j ≠ i, V(I_i, I_{j|j≠i}) denotes the optical flow estimated from the i-th frame to the j-th frame, W denotes the affine (warping) transformation, i.e. I_{j|j≠i} is warped back to I_i by the estimated optical flow, and Î_{i→j} denotes the result of the affine transformation from the i-th frame to the j-th frame.

After the affine transformation results between the current frame and all other frames are obtained, the single-scale special effect prediction map D_{i→j} of the current frame is computed, according to the speed and direction difference between the motion fields of the special effect and the background content, as:

D_{i→j} = Σ_C ‖I_i − Î_{i→j}‖_2

where Σ_C ‖·‖_2 denotes computing the Euclidean distance between the current frame I_i and the warped frame Î_{i→j} channel by channel and accumulating the results, giving the single-scale special effect prediction map from the i-th frame to the j-th frame; D_{i→j} ∈ ℝ^{1×H×W}, where ℝ is the real number set, 1 is the channel size, and H and W denote the height and width of the single-scale special effect prediction map. After the single-scale special effect prediction maps from the current frame i to all other frames j are computed, the set of single-scale special effect prediction maps of the current frame I_i is obtained. Taking the frame of fig. 2-5 as the current frame, figs. 3-1 to 3-4 show the computed set of single-scale special effect prediction maps.
3) The obtained single-scale special effect prediction maps D_{i→j} of each frame are then merged into the multi-scale special effect prediction by a maximum operation:

D_i = Max({D_{i→j} | j ≠ i, i, j ∈ [1,5]})

where {D_{i→j} | j ≠ i} denotes the set of single-scale special effect prediction maps computed for the current i-th frame from the other frames j, with j ≠ i and i, j taking integer values from 1 to 5; Max denotes taking the element-wise maximum over the time dimension of the 4 single-scale special effect prediction maps D_{i→j} with different time spans, which gives the multi-scale special effect prediction D_i belonging to the i-th frame; D_i ∈ ℝ^{1×H×W}, where ℝ is the real number set, 1 is the channel size, and H and W denote the height and width of the multi-scale special effect prediction, as shown in fig. 4.
4) After the multi-scale special effect prediction of each frame is obtained, the self-attention multi-scale special effect set feature of the sequence frames is computed. First, the self-attention multi-scale special effect feature is computed from the multi-scale special effect prediction of each frame, and each position in the multi-scale special effect prediction is re-balanced by a learned weight:

M_i = Sigmoid(H(D_i))

where H denotes a convolution layer with kernel size 1×1, Sigmoid denotes the activation function, and the obtained M_i is the self-attention weight.

The self-attention weight is then combined with the original multi-scale special effect prediction as:

D̂_i = M_i ⊙ D_i

where ⊙ denotes element-wise matrix multiplication and D̂_i denotes the re-weighted self-attention multi-scale special effect feature. Fig. 5 shows the self-attention multi-scale special effect feature obtained after applying the self-attention mechanism to fig. 4.

Finally, the self-attention multi-scale special effect features of all frames in the sequence are fused along the time dimension:

D = Concat_T(D̂_1, D̂_2, ..., D̂_5)

where Concat_T denotes vector concatenation along the time dimension T; D̂_1, D̂_2 and D̂_5 denote the self-attention multi-scale special effect features of the 1st, 2nd and 5th frames respectively; D is the combined self-attention multi-scale special effect set feature, with D ∈ ℝ^{1×5×H×W}, where ℝ is the real number set, 1 is the channel size, 5 is the size of the time dimension, and H and W denote the height and width of the self-attention multi-scale special effect set feature.
5) For the input sequence frames, feature extraction is performed through a three-dimensional convolutional neural network layer:

F = Conv(I)

where Conv is an ordinary three-dimensional convolution layer with kernel size 5×5×3 and F is the extracted image feature; the abstract features of the input sequence frames are obtained by convolution.
6) Encoding and decoding are performed by combining the input sequence frame features with the self-attention special effect set feature. The self-attention special effect set feature computed in step 4) guides the extraction of the special effect, and it is fused with the features of the input sequence frames:

F_e = F ⊙ D

where ⊙ denotes element-wise matrix multiplication: the self-attention special effect set feature and the features of the input sequence frames are multiplied along the channel dimension, and F_e is the fused image frame feature.
7) The image frame features are then encoded and decoded by a three-dimensional residual convolutional neural network to obtain the separated special effect sequence frames and the transparent channel information:

(E, A) = G_e(F_e)

where G_e denotes the three-dimensional residual convolutional neural network, whose structure consists of 2 down-sampling convolution layers that do not share parameters, 4 residual modules with Non-local layers, 1 up-sampling convolution layer and 2 up-sampling convolution layers that do not share parameters; all three-dimensional convolution kernels are of size 3×3×3, and the last 2 unshared up-sampling convolution layers output the special effect sequence frames E and the transparent channel information A, respectively. E denotes the separated special effect sequence frames, with E ∈ ℝ^{3×5×H×W}, i.e. 5 consecutive frames of size 3×H×W; A denotes the transparent channel information, with A ∈ ℝ^{1×5×H×W}, i.e. 5 consecutive frames of size 1×H×W.

As shown in figs. 6-1 to 6-5, the separated special effect sequence frames containing channel information are obtained.
8) After the separated special effect sequence frames and the transparent channel are obtained, the damaged background sequence frames to be repaired are obtained by subtracting the separated special effect sequence frames from the input sequence frames: C_r = I − E, where I denotes the input sequence frames, E denotes the separated special effect sequence frames, and C_r denotes the computed damaged background sequence frames.
9) The separated animation background content is obtained by the three-dimensional convolutional neural network combined with the preceding transparent channel information. The background sequence frames to be repaired and the transparent channel information are jointly input into a three-dimensional convolutional neural network, which passes sequentially through 2 down-sampling convolution layers, 4 dilated convolutions with different dilation rates for perceiving damaged blocks of different sizes, and 2 up-sampling layers; all convolution layers are replaced by gated convolution layers that control the channel information so that effective channel information is fully utilized and redundancy is avoided, and all convolution kernels are 3×3×3 three-dimensional kernels. The separated animation background content sequence frames are obtained as:

C = G_c(C_r, A)

where G_c denotes the three-dimensional convolutional neural network and C denotes the repaired background sequence frames, i.e. 5 consecutive frames of size 3×H×W, as shown in figs. 7-1 to 7-5, which are the 5 consecutive background sequence frames obtained by the repair.
After the special effect sequence frames and the repaired animation background sequence frames are obtained from the original input animation video sequence, the animation special effect and the background content are separated.
The above-mentioned embodiments are merely preferred embodiments of the present invention, and the scope of the present invention is not limited thereto; any change made according to the shape and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for separating an animation special effect and background content based on multi-scale motion information, characterized by comprising the following steps:
1) Acquiring data, including sequence frames with special effect fragments in the animation video as input;
2) Calculating a single-scale special effect prediction image set between each frame and other frames in the input sequence frames;
3) Merging the single-scale special effect prediction image set of each frame as multi-scale special effect prediction;
4) Adjusting the multi-scale special effect prediction through a self-attention mechanism to obtain self-attention multi-scale special effect set characteristics;
5) Extracting the characteristics of an input sequence frame through a three-dimensional convolution neural network layer;
6) Combining the characteristics of the input sequence frame and the self-attention multi-scale special effect set characteristics;
7) Through a three-dimensional residual convolutional neural network, adding a Non-local module into each residual module to strengthen time sequence information association, and then outputting a separated special effect sequence frame and transparent channel information;
8) Obtaining a damaged background sequence frame by subtracting the separated special effect sequence frame from the input sequence frame;
9) Combining the damaged background sequence frame with the transparent channel information and inputting them into a three-dimensional convolutional neural network, in which all convolution layers are replaced by gated convolutions for dynamic feature selection and the middle layers are replaced by dilated convolutions with different dilation rates, and finally outputting the repaired background sequence frame.
2. The method for separating an animation special effect and background content based on multi-scale motion information according to claim 1, wherein in step 1), the animation video with special effect fragments refers to video clips cut, using the professional video editing software Adobe Premiere, from collected animation videos that contain special effects, the special effect types including rain, snow, falling petals and falling leaves; a sequence frame refers to consecutive image frames sampled from a video clip at 25 frames per second, and through data preprocessing the consecutive images are divided into sequences of 5 frames each:

I = {I_1, I_2, I_3, I_4, I_5}, I_i ∈ ℝ^{C×H×W}

where I denotes the input sequence of 5 consecutive frames, I_1 denotes the first frame, I_2 the second frame, and I_5 the fifth frame; ℝ is the real number set, C is the number of channels, and H and W denote the height and width of a frame.
3. The method for separating an animation special effect and background content based on multi-scale motion information according to claim 1, wherein in step 2), the calculation of a single-scale special effect prediction map set between each frame and the other frames in the input sequence comprises the following steps:

2.1) using the optical flow estimation neural network FlowNet2, computing the optical flow between each frame I_i and every other frame I_{j|j≠i} in the sequence, and affine-warping I_{j|j≠i} back by the optical flow:

Î_{i→j} = W(I_{j|j≠i}, V(I_i, I_{j|j≠i}))

where V denotes the optical flow estimation neural network FlowNet2, I_i denotes the i-th frame, I_{j|j≠i} denotes the j-th frame with j ≠ i, V(I_i, I_{j|j≠i}) denotes the optical flow estimated from the i-th frame to the j-th frame, W denotes the affine transformation, i.e. I_{j|j≠i} is warped back to I_i by the estimated optical flow, and Î_{i→j} denotes the result of the affine transformation from the i-th frame to the j-th frame;

2.2) based on the difference in speed and direction between the motion fields of the special effect and the background content, computing the single-scale special effect prediction map of each frame I_i as:

D_{i→j} = Σ_C ‖I_i − Î_{i→j}‖_2

where I_i denotes the i-th frame, Î_{i→j} denotes the affine transformation result from the i-th frame to the j-th frame, Σ_C ‖·‖_2 denotes computing the Euclidean distance between I_i and Î_{i→j} channel by channel over the channels C and accumulating the results, D_{i→j} denotes the computed single-scale special effect prediction map of I_i from the j-th frame, and D_{i→j} ∈ ℝ^{1×H×W}, where ℝ is the real number set, 1 is the channel size, and H and W denote the height and width of the single-scale special effect prediction map;

2.3) after computing the single-scale special effect prediction maps from the i-th frame I_i to all other frames I_{j|j≠i} in the input sequence, obtaining the set of single-scale special effect prediction maps of I_i as {D_{i→j} | j ≠ i, i, j ∈ [1,5]}, where D_{i→j} denotes the computed single-scale special effect prediction map of the i-th frame from the j-th frame, j ≠ i, and i and j take values in the closed interval from 1 to 5.
4. The method for separating an animation special effect and background content based on multi-scale motion information according to claim 1, wherein in step 3), the single-scale special effect prediction map sets computed for each frame are merged, making full use of the information at different time scales in the input sequence to assist the prediction of special effects moving at different rates, and the multi-scale special effect prediction is computed as:

D_i = Max({D_{i→j} | j ≠ i, i, j ∈ [1,5]})

where D_{i→j} denotes the computed single-scale special effect prediction map of the i-th frame from the j-th frame; {D_{i→j} | j ≠ i, i, j ∈ [1,5]} denotes the set of single-scale special effect prediction maps of the i-th frame from all other j-th frames, with i and j taking values in the closed interval from 1 to 5; Max denotes taking the element-wise maximum over the time dimension of the 4 single-scale special effect prediction maps D_{i→j} with different time spans; D_i denotes the obtained multi-scale special effect prediction belonging to I_i, with D_i ∈ ℝ^{1×H×W}, where ℝ is the real number set, 1 is the channel size, and H and W denote the height and width of the multi-scale special effect prediction.
5. The method for separating an animation special effect and background content based on multi-scale motion information according to claim 1, wherein in step 4), the multi-scale special effect prediction is adjusted through a self-attention mechanism, i.e. the noise of non-effective regions is suppressed by re-weighting, comprising the following steps:

4.1) passing the multi-scale special effect prediction of each frame through a self-attention mechanism to obtain a new weight, and re-balancing the response of each feature in the multi-scale special effect prediction by this weight:

M_i = Sigmoid(H(D_i))

where D_i denotes the multi-scale special effect prediction belonging to I_i, H denotes a convolution layer with kernel size 1×1, Sigmoid denotes the activation function applied to the computed features, and M_i denotes the computed weight;

4.2) combining the computed weight M_i with the multi-scale special effect prediction D_i as:

D̂_i = M_i ⊙ D_i

where ⊙ denotes element-wise matrix multiplication and D̂_i denotes the re-weighted self-attention multi-scale special effect feature of the i-th frame;

4.3) combining the self-attention multi-scale special effect features of all input sequence frames along the time dimension to obtain the self-attention multi-scale special effect set feature:

D = Concat_T(D̂_1, D̂_2, ..., D̂_5)

where Concat_T denotes vector concatenation along the time dimension T; D̂_1, D̂_2 and D̂_5 denote the self-attention multi-scale special effect features of the 1st, 2nd and 5th frames respectively; D is the combined self-attention multi-scale special effect set feature, with D ∈ ℝ^{1×5×H×W}, where ℝ is the real number set, 1 is the channel size, 5 is the size of the time dimension, and H and W denote the height and width of the self-attention multi-scale special effect set feature.
6. The method for separating an animation special effect and background content based on multi-scale motion information according to claim 1, wherein in step 5), the features of the input sequence of 5 consecutive frames are extracted through a three-dimensional convolutional neural network layer as:

F = Conv(I)

where I is the input sequence frame, Conv is a three-dimensional convolution layer with kernel size 5×5×3, F is the extracted feature of the input sequence frames, and F ∈ ℝ^{C×5×H×W}, where ℝ is the real number set, C is the channel size, 5 is the size of the time dimension, and H and W denote the height and width of the features of the input sequence frames.
7. The method for separating an animation special effect and background content based on multi-scale motion information according to claim 1, wherein in step 6), the computed self-attention multi-scale special effect set feature is used to guide the extraction of the special effect part and is fused with the features of the input sequence frames:

F_e = F ⊙ D

where F is the extracted feature of the input sequence frames, D is the self-attention multi-scale special effect set feature, ⊙ denotes element-wise matrix multiplication, and F_e is the fused image frame feature.
8. The method for separating an animation special effect and background content based on multi-scale motion information according to claim 1, wherein in step 7), a Non-local module is added to each residual module of a three-dimensional residual convolutional neural network to strengthen temporal information association, and the separated special effect sequence frames and transparent channel information are then output, comprising the following steps:

7.1) encoding and decoding the sequence frame features through a three-dimensional residual convolutional neural network whose structure consists of 2 parameter-sharing down-sampling convolution layers, 4 residual modules with Non-local layers, 1 up-sampling convolution layer and 2 up-sampling convolution layers that do not share parameters, with all three-dimensional convolution kernels of size 3×3×3;

7.2) outputting the separated special effect sequence frames and transparent channel information:

(E, A) = G_e(F_e)

where G_e denotes the three-dimensional residual convolutional neural network; E denotes the separated special effect sequence frames, with E ∈ ℝ^{3×5×H×W}, i.e. 5 consecutive frames of size 3×H×W; A denotes the transparent channel information, with A ∈ ℝ^{1×5×H×W}, i.e. 5 consecutive frames of size 1×H×W.
9. The method for separating an animation special effect and background content based on multi-scale motion information according to claim 1, wherein in step 8), the separated special effect sequence frames are subtracted from the input sequence frames to obtain the damaged background sequence frames C_r without special effects:

C_r = I − E

where I denotes the input sequence frames, E denotes the separated special effect sequence frames, and C_r denotes the computed damaged background sequence frames.
10. The method for separating an animation special effect and background content based on multi-scale motion information according to claim 1, wherein in step 9), the damaged background sequence frames and the transparent channel information are combined and input into a three-dimensional convolutional neural network, in which all convolution layers are replaced by gated convolutions for dynamic feature selection and the middle layers are replaced by dilated convolutions with different dilation rates, and the repaired background sequence frames are finally output, comprising the following steps:

9.1) repairing the damaged background sequence frames by encoding and decoding through a three-dimensional convolutional neural network, the network being designed to pass sequentially through 2 down-sampling convolution layers, 4 dilated-convolution blocks with different dilation rates for perceiving damaged regions, and 2 up-sampling layers, with all convolution layers replaced by gated convolution layers so that channel information is fully exploited and redundancy is avoided, and all convolution kernels being 3×3×3 three-dimensional kernels;

9.2) jointly inputting the damaged background sequence frames and the transparent channel information into the three-dimensional convolutional neural network to obtain the repaired background sequence frames:

C = G_c(C_r, A)

where G_c denotes the three-dimensional convolutional neural network, C_r is the damaged background sequence frame, A denotes the transparent channel information, C denotes the repaired background sequence frames, and C ∈ ℝ^{3×5×H×W}, i.e. 5 consecutive frames of size 3×H×W;

after the special effect layer is separated from the input sequence frames of the original animation video clip and the repaired background content layer is output, the special effect and the background content in the animation are separated.
Application CN202110101404.XA (priority date 2021-01-26, filing date 2021-01-26): Method for separating animation special effect and background content based on multi-scale motion information — granted as CN112686922B (en), status Active.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110101404.XA | 2021-01-26 | 2021-01-26 | Method for separating animation special effect and background content based on multi-scale motion information

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202110101404.XA | 2021-01-26 | 2021-01-26 | Method for separating animation special effect and background content based on multi-scale motion information

Publications (2)

Publication Number | Publication Date
CN112686922A (en) | 2021-04-20
CN112686922B (en) | 2022-10-25

Family

ID=75459206

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202110101404.XA | Method for separating animation special effect and background content based on multi-scale motion information (Active, CN112686922B) | 2021-01-26 | 2021-01-26

Country Status (1)

Country Link
CN (1) CN112686922B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015052514A2 (en) * 2013-10-08 2015-04-16 Digimania Limited Rendering composites/layers for video animations
CN108520501A (en) * 2018-03-30 2018-09-11 西安交通大学 A kind of video and removes rain snow method based on multiple dimensioned convolution sparse coding
CN111259782A (en) * 2020-01-14 2020-06-09 北京大学 Video behavior identification method based on mixed multi-scale time sequence separable convolution operation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8565525B2 (en) * 2005-12-30 2013-10-22 Telecom Italia S.P.A. Edge comparison in segmentation of video sequences


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Xinchen Ye et al.; "Foreground–Background Separation From Video Clips via Motion-Assisted Matrix Restoration"; IEEE Transactions on Circuits and Systems for Video Technology, Vol. 25, Issue 11, November 2015; 2015-01-19; pp. 1721-1734 *
缪佩琦 (Miao Peiqi); "Sketch Simplification and Animation Special Effect Transfer Based on a Depth-Aware Network" (基于深度感知网络的草图简化和动画特效迁移); China Masters' Theses Full-text Database, Information Science and Technology; 2020-02-15; I138-I248 *

Also Published As

Publication number Publication date
CN112686922A (en) 2021-04-20


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant