CN112686922A - Method for separating animation special effect and background content based on multi-scale motion information - Google Patents
- Publication number
- CN112686922A (application No. CN202110101404.XA)
- Authority
- CN
- China
- Prior art keywords
- special effect
- frame
- scale
- sequence
- frames
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Processing Or Creating Images (AREA)
Abstract
The invention discloses a method for separating animation special effects and background content based on multi-scale motion information, which comprises the following steps: 1) acquiring sequence frames of special effect fragments in an animation video; 2) calculating a single-scale special effect prediction map set between each frame and the other frames in the sequence; 3) merging each frame's single-scale special effect prediction map set into a multi-scale special effect prediction; 4) obtaining self-attention multi-scale special effect set features through a self-attention mechanism; 5) extracting features of the input sequence frames through a three-dimensional convolutional neural network layer; 6) combining the sequence frame features with the self-attention multi-scale special effect set features; 7) separating the special effect sequence frames and the transparent channel information through a three-dimensional residual convolutional neural network; 8) subtracting the special effect sequence frames from the input sequence frames to obtain damaged background sequence frames; 9) obtaining repaired background sequence frames through a three-dimensional convolutional neural network. The method can be applied to special effect migration and improves the accuracy of segmenting and recognizing specific objects in animation.
Description
Technical Field
The invention relates to the technical field of video separation, in particular to a method for separating animation special effects and background contents based on multi-scale motion information.
Background
Cartoons, as a visually oriented form of artistic expression, have been widely used in the field of animation. To present weather conditions and environments in animation, artists often use various cartoon special effects, such as rain, snow, fallen leaves, and falling petals. These special effects not only represent the environment but also enrich the visual expressiveness of the animation. Although cartoon-vision research based on animation has received wide attention in recent years, part of the background information in an animation scene is often occluded by cartoon special effects, so some information is lost when the animation background is analyzed; this hinders research directions such as segmentation of specific objects in animation and analysis of the animation background. Meanwhile, since cartoon special effects are visually prominent, layering and migrating them are also widely used, so separating cartoon special effects from the background in animation is an urgent research direction.
However, cartoon special effects in animation move irregularly, and they vary widely in type and size, which makes it difficult to separate them with traditional rule-based methods. Meanwhile, cartoon special effect data sets are scarce, which also increases the difficulty of separating special effects with deep learning methods. At present, some methods separate foreground and background through deep learning or traditional temporal methods, but they are not suitable for separating cartoon special effects in animation, because such methods usually target natural videos, and the foreground distribution in natural videos is generally inconsistent with the special effect distribution in cartoon animation; for example, the size and shape of special effects in animation are highly variable, and their positions of appearance are harder to predict. Therefore, how to accurately separate cartoon special effects from the background in animation has become a key problem.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art and provides a method for separating animation special effects and background content based on multi-scale motion information. The method can separate an animation video clip into fine special effects and clean background content, is effectively applicable to separating different special effects in animation, and can further repair the background content behind the special effects after they are separated, which greatly benefits downstream applications such as segmentation, recognition, and special effect migration.
In order to achieve the purpose, the technical scheme provided by the invention is as follows: the method for separating the animation special effect and the background content based on the multi-scale motion information comprises the following steps:
1) acquiring data, wherein the data comprises sequence frames with special effect fragments in an animation video as input;
2) calculating a single-scale special effect prediction image set between each frame and other frames in the input sequence frames;
3) merging the single-scale special effect prediction image set of each frame to be used as multi-scale special effect prediction;
4) adjusting the multi-scale special effect prediction through a self-attention mechanism to obtain self-attention multi-scale special effect set characteristics;
5) extracting the characteristics of an input sequence frame through a three-dimensional convolution neural network layer;
6) combining the characteristics of the input sequence frame and the self-attention multi-scale special effect set characteristics;
7) adding a Non-local module into each residual module to strengthen time sequence information association through a three-dimensional residual convolutional neural network, and then outputting a separated special effect sequence frame and transparent channel information;
8) obtaining a damaged background sequence frame by subtracting the separated special effect sequence frame from the input sequence frame;
9) combining the damaged background sequence frame with the transparent channel information and inputting them into a three-dimensional convolutional neural network, in which all convolution layers are replaced with gated convolutions for dynamic feature selection and the middle layers are replaced with dilated convolutions of different dilation rates, and finally outputting a repaired background sequence frame.
In step 1), the animation video with special effect fragments is obtained by cutting the collected animation videos containing special effects with the professional video editing software Adobe Premiere; the special effect types include rain, snow, falling petals, and fallen leaves. The sequence frames are consecutive image frames sampled from a video clip at 25 frames per second; through data preprocessing, the consecutive images are divided into sequences of 5 frames each:
I = {I_1, I_2, I_3, I_4, I_5} ∈ ℝ^{C×5×H×W}

wherein I denotes that the input is a sequence of 5 consecutive frames, I_1 denotes the first frame, I_2 the second frame, and I_5 the fifth frame; ℝ is the real number set, C is the number of channels, and H and W denote the height and width of a frame.
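As an illustration of this preprocessing, a minimal sketch in Python (PyTorch) is given below; it assumes the frames have already been extracted at 25 frames per second as C × H × W tensors, and the helper name make_clips is hypothetical rather than part of the invention.

```python
import torch

def make_clips(frames, clip_len=5):
    """Group consecutive sampled frames into non-overlapping clips of clip_len frames.

    frames: list of (C, H, W) tensors extracted from one video clip.
    Returns a tensor of shape (num_clips, C, clip_len, H, W), matching the
    I in R^{C x 5 x H x W} layout described above for each clip.
    """
    usable = len(frames) - len(frames) % clip_len
    clips = [torch.stack(frames[k:k + clip_len], dim=1)   # (C, clip_len, H, W)
             for k in range(0, usable, clip_len)]
    return torch.stack(clips, dim=0)
```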
In step 2), a single-scale special effect prediction image set between each frame and other frames in the input sequence frames is calculated, and the method comprises the following steps:
2.1) Calculating, by the optical flow estimation neural network FlowNet2, the optical flow between each frame I_i and every other frame I_j (j ≠ i) in the sequence, and affine-transforming I_j back by the optical flow:

Î_{i→j} = W(I_j, V(I_i, I_j)), j ≠ i

In the formula, V denotes the optical flow estimation neural network FlowNet2, I_i denotes the i-th frame, I_j (j ≠ i) denotes the j-th frame with j not equal to i, and V(I_i, I_j) denotes the optical flow estimated from the i-th frame to the j-th frame; W denotes the affine transformation, i.e. I_j is affine-transformed back to I_i by the estimated optical flow, and Î_{i→j} denotes the result of the affine transformation from the i-th frame to the j-th frame;
2.2) Calculating, according to the differences in speed and direction between the motion field of the special effects and that of the background content, the single-scale special effect prediction map of each frame I_i as follows:

D_{i→j} = Σ_{c=1}^{C} ‖ I_i^c − Î_{i→j}^c ‖₂, D_{i→j} ∈ ℝ^{1×H×W}

In the formula, I_i denotes the i-th frame, Î_{i→j} denotes the result of the affine transformation from the i-th frame to the j-th frame, the Euclidean distance between I_i and Î_{i→j} is computed channel by channel over the C channels and the results are accumulated, and D_{i→j} denotes the single-scale special effect prediction map of I_i computed from the j-th frame; ℝ is the real number set, 1 is the channel size of the vector, and H and W denote the height and width of the single-scale special effect prediction map;
2.3) After calculating the single-scale special effect prediction maps from the i-th frame I_i to all other frames I_j (j ≠ i) in the input sequence, obtaining the set of single-scale special effect prediction maps of I_i, {D_{i→j} | j ≠ i, i, j ∈ [1,5]}, wherein D_{i→j} denotes the single-scale special effect prediction map of the i-th frame computed from the j-th frame, the set collects the prediction maps of the i-th frame from all other frames I_j with j not equal to i, and i, j ∈ [1,5] means that i and j take values in the closed interval from 1 to 5.
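A minimal sketch of step 2) is given below. It assumes the optical flow is supplied by an external estimator (e.g. a FlowNet2 wrapper, exposed here as the hypothetical callable flow_fn returning pixel displacements); the backward-warping helper and the channel-wise accumulation follow one plausible reading of the formulas above and are not a definitive implementation.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Affine-transform img (N, C, H, W) back with flow (N, 2, H, W) via bilinear sampling."""
    n, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(img.device)        # pixel grid (2, H, W)
    coords = base.unsqueeze(0) + flow                                  # absolute sampling positions
    grid_x = 2.0 * coords[:, 0] / (w - 1) - 1.0                        # normalise to [-1, 1]
    grid_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)                       # (N, H, W, 2)
    return F.grid_sample(img, grid, mode="bilinear", align_corners=True)

def single_scale_prediction_sets(frames, flow_fn):
    """For every frame I_i, build the set {D_{i->j} | j != i} of single-scale maps.

    frames: list of 5 tensors, each (1, C, H, W); flow_fn(a, b) is assumed to
    return the optical flow from a to b as a (1, 2, H, W) tensor.
    """
    sets = []
    for i, ref in enumerate(frames):
        maps = []
        for j, other in enumerate(frames):
            if j == i:
                continue
            warped = warp(other, flow_fn(ref, other))                  # affine-transformed I_j
            # per-channel Euclidean (absolute) differences, accumulated over channels
            d = (ref - warped).abs().sum(dim=1, keepdim=True)          # (1, 1, H, W)
            maps.append(d)
        sets.append(maps)
    return sets
```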
In step 3), the single-scale special effect prediction map set computed for each frame is merged, making full use of the information at different time scales in the input sequence to assist the prediction of special effects moving at different rates; the multi-scale special effect prediction is computed as:

D_i = max_{j≠i} D_{i→j}, D_i ∈ ℝ^{1×H×W}

In the formula, D_{i→j} denotes the single-scale special effect prediction map of the i-th frame computed from the j-th frame, {D_{i→j} | j ≠ i, i, j ∈ [1,5]} is the set of single-scale prediction maps of the i-th frame from all other j-th frames, with i and j taking values in the closed interval from 1 to 5; max means that the 4 single-scale special effect prediction maps D_{i→j} of different time spans are merged by taking the element-wise maximum along the time dimension; D_i denotes the resulting multi-scale special effect prediction of I_i; ℝ is the real number set, 1 is the channel size of the vector, and H and W denote the height and width of the multi-scale special effect prediction.
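Continuing the sketch above, the multi-scale merge of step 3) reduces the 4 single-scale maps of one frame with an element-wise maximum along the time dimension; this is a minimal illustration that consumes the output of the previous sketch, not the patented implementation itself.

```python
import torch

def multi_scale_prediction(single_scale_maps):
    """Merge the 4 single-scale maps D_{i->j} of one frame into D_i = max_j D_{i->j}."""
    stacked = torch.stack(single_scale_maps, dim=0)     # (4, 1, 1, H, W)
    return stacked.max(dim=0).values                    # (1, 1, H, W)
```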
In step 4), the multi-scale special effect prediction is adjusted through a self-attention mechanism, i.e. the noise of non-effective regions is suppressed by re-weighting, which comprises the following steps:
4.1) Passing the multi-scale special effect prediction of each frame through the self-attention mechanism to obtain a new weight, and re-balancing the response of each feature in the multi-scale special effect prediction with this weight:

M_i = Sigmoid(H(D_i))

In the formula, D_i denotes the multi-scale special effect prediction belonging to frame I_i, H denotes a convolution layer with kernel size 1 × 1, Sigmoid denotes the activation function used to compute the weight, and M_i denotes the computed weight;
4.2) Combining the computed weight M_i with the multi-scale special effect prediction D_i as follows:

D'_i = M_i ⊙ D_i

In the formula, ⊙ denotes element-wise matrix multiplication, and D'_i denotes the re-weighted self-attention multi-scale special effect feature of the i-th frame;
4.3) combining the self-attention multi-scale special effect characteristics of all the input sequence frames on the channel to obtain a self-attention multi-scale special effect set characteristic as follows:
D = Concat_T(D'_1, D'_2, …, D'_5), D ∈ ℝ^{1×5×H×W}

In the formula, Concat_T denotes vector concatenation along the time dimension T, D'_1, D'_2 and D'_5 denote the self-attention multi-scale special effect features of the 1st, 2nd and 5th frames respectively, and D is the merged self-attention multi-scale special effect set feature; ℝ is the real number set, 1 is the channel size of the vector, 5 is the size of the time dimension, and H and W denote the height and width of the self-attention multi-scale special effect set feature.
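The self-attention adjustment of step 4) can be sketched as follows: a minimal PyTorch module under the assumption that the 1 × 1 convolution H operates on the single-channel prediction maps; the module and variable names are illustrative.

```python
import torch
import torch.nn as nn

class EffectSelfAttention(nn.Module):
    """Re-weight each frame's multi-scale prediction with a sigmoid gate from a
    1x1 convolution, then concatenate the re-weighted maps along the time dimension."""

    def __init__(self):
        super().__init__()
        self.h = nn.Conv2d(1, 1, kernel_size=1)          # H(.): 1x1 convolution

    def forward(self, multi_scale_preds):
        # multi_scale_preds: list of 5 tensors D_i, each of shape (N, 1, H, W)
        weighted = []
        for d in multi_scale_preds:
            m = torch.sigmoid(self.h(d))                 # M_i = Sigmoid(H(D_i))
            weighted.append(m * d)                       # element-wise re-weighting
        return torch.stack(weighted, dim=2)              # D: (N, 1, 5, H, W)
```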
In step 5), extracting the characteristics of the input sequence frames of 5 continuous frames through a three-dimensional convolutional neural network layer as follows:
F = Conv(I), F ∈ ℝ^{C×5×H×W}

wherein I is the input sequence frame, Conv is a three-dimensional convolution layer with kernel size 5 × 5 × 3, and F is the extracted feature of the input sequence frames; ℝ is the real number set, C is the channel size of the vector, 5 is the size of the time dimension, and H and W denote the height and width of the features of the input sequence frames.
In step 6), the extraction of the special effect part is guided by using the self-attention multi-scale special effect set characteristics obtained by calculation, and the self-attention multi-scale special effect set characteristics and the characteristics of the input sequence frame are fused:
F_e = F ⊙ D

wherein F is the extracted feature of the input sequence frames, D is the self-attention multi-scale special effect set feature, ⊙ denotes element-wise matrix multiplication, and F_e is the fused image frame feature.
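Steps 5) and 6) amount to one 3D convolution followed by an element-wise product in which the single-channel attention map broadcasts across the feature channels. The sketch below assumes a feature width of 64 channels and reads the stated kernel size 5 × 5 × 3 as 5 × 5 spatial by 3 temporal; both are assumptions rather than values fixed by the text.

```python
import torch
import torch.nn as nn

# assumed: 64 output channels; kernel (T, H, W) = (3, 5, 5); padding keeps the size
feature_layer = nn.Conv3d(in_channels=3, out_channels=64,
                          kernel_size=(3, 5, 5), padding=(1, 2, 2))

def fuse(frames, attn_effect_set):
    """frames I: (N, 3, 5, H, W); attention set feature D: (N, 1, 5, H, W).

    Computes F = Conv(I) and F_e = F * D, broadcasting the single attention
    channel over the 64 feature channels.
    """
    feats = feature_layer(frames)            # F: (N, 64, 5, H, W)
    return feats * attn_effect_set           # F_e
```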
In step 7), a Non-local module is added to each residual module of the three-dimensional residual convolutional neural network to strengthen temporal information association, and the separated special effect sequence frames and the transparent channel information are then output, which comprises the following steps:
7.1) Encoding and decoding the sequence frame features through the three-dimensional residual convolutional neural network, whose structure consists of 2 parameter-sharing downsampling convolution layers, 4 residual modules with added Non-local layers, 1 upsampling convolution layer, and 2 upsampling convolution layers that do not share parameters; all three-dimensional convolution kernels are of size 3 × 3;
7.2) respectively outputting the separated special effect sequence frame and the transparent channel information:
(E, A) = G_e(F_e)

In the formula, G_e denotes the three-dimensional residual convolutional neural network, E denotes the separated special effect sequence frames, and E ∈ ℝ^{3×5×H×W}, i.e. 5 consecutive frames of size 3 × H × W; A denotes the transparent channel information, and A ∈ ℝ^{1×5×H×W}, i.e. 5 consecutive frames of size 1 × H × W.
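A compact sketch of a network in the spirit of step 7) is shown below. The channel width, strides, activations, the reading of the stated 3 × 3 kernels as 3 × 3 × 3 three-dimensional kernels, and the sigmoid on the alpha head are assumptions added to make the sketch runnable; the text only fixes the overall layout (shared downsampling, 4 residual modules with Non-local layers, shared upsampling, two unshared output heads). All class names are illustrative.

```python
import torch
import torch.nn as nn

class NonLocal3d(nn.Module):
    """Simplified embedded-Gaussian Non-local block over all T*H*W positions."""
    def __init__(self, ch):
        super().__init__()
        self.theta, self.phi, self.g = (nn.Conv3d(ch, ch // 2, 1) for _ in range(3))
        self.out = nn.Conv3d(ch // 2, ch, 1)

    def forward(self, x):
        n, c, t, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)          # (N, THW, C/2)
        k = self.phi(x).flatten(2)                            # (N, C/2, THW)
        v = self.g(x).flatten(2).transpose(1, 2)              # (N, THW, C/2)
        attn = torch.softmax(q @ k, dim=-1)                   # pairwise position affinities
        y = (attn @ v).transpose(1, 2).reshape(n, c // 2, t, h, w)
        return x + self.out(y)                                # residual connection

class ResBlock3d(nn.Module):
    """Residual module with an added Non-local layer to strengthen temporal association."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(nn.Conv3d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                                  nn.Conv3d(ch, ch, 3, padding=1))
        self.non_local = NonLocal3d(ch)

    def forward(self, x):
        return self.non_local(x + self.body(x))

class EffectSeparator(nn.Module):
    """G_e sketch: shared downsampling, 4 residual+Non-local blocks, shared upsampling,
    then two unshared heads producing the effect frames E and the alpha channel A."""
    def __init__(self, ch=64):
        super().__init__()
        self.down = nn.Sequential(nn.Conv3d(ch, ch, 3, stride=(1, 2, 2), padding=1), nn.ReLU(inplace=True),
                                  nn.Conv3d(ch, ch, 3, stride=(1, 2, 2), padding=1), nn.ReLU(inplace=True))
        self.blocks = nn.Sequential(*[ResBlock3d(ch) for _ in range(4)])
        self.up = nn.Sequential(nn.Upsample(scale_factor=(1, 2, 2), mode="trilinear", align_corners=False),
                                nn.Conv3d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))
        self.head_e = nn.Sequential(nn.Upsample(scale_factor=(1, 2, 2), mode="trilinear", align_corners=False),
                                    nn.Conv3d(ch, 3, 3, padding=1))
        self.head_a = nn.Sequential(nn.Upsample(scale_factor=(1, 2, 2), mode="trilinear", align_corners=False),
                                    nn.Conv3d(ch, 1, 3, padding=1))

    def forward(self, fused):                                  # F_e: (N, ch, 5, H, W)
        x = self.up(self.blocks(self.down(fused)))
        return self.head_e(x), torch.sigmoid(self.head_a(x))  # E, A (sigmoid on A is an assumption)
```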
In step 8), the separated special effect sequence frames are subtracted from the input sequence frames to obtain the special-effect-free damaged background sequence frames C_r:

C_r = I − E

wherein I denotes the input sequence frames, E denotes the separated special effect sequence frames, and C_r denotes the computed damaged background sequence frames.
In step 9), the damaged background sequence frames and the transparent channel information are merged and input into a three-dimensional convolutional neural network, in which all convolution layers are replaced with gated convolutions for dynamic feature selection and the middle layers are replaced with dilated convolutions of different dilation rates; the repaired background sequence frames are finally output, which comprises the following steps:
9.1) Encoding and decoding the damaged background sequence frames through the three-dimensional convolutional neural network, which passes the input sequentially through 2 downsampling convolution layers, 4 dilated convolutions with different dilation rates that perceive damaged regions of different sizes, and 2 upsampling layers; all convolution layers are replaced with gated convolution layers that control which channel information is fully utilized so as to avoid redundancy, and all convolution kernels are 3 × 3 three-dimensional kernels;
9.2) jointly inputting the damaged background sequence frame and the transparent channel information into a three-dimensional convolution neural network to obtain a repaired background sequence frame, wherein the repaired background sequence frame is as follows:
C = G_c(C_r, A)

In the formula, G_c denotes the three-dimensional convolutional neural network, C_r is the damaged background sequence frames, A denotes the transparent channel information, C denotes the repaired background sequence frames, and C ∈ ℝ^{3×5×H×W}, i.e. 5 consecutive frames of size 3 × H × W;
after separating the special effect layer of the input sequence frame of the original animation video clip and outputting the repaired background content layer, the special effect and the background content in the animation are separated.
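For step 9), a minimal sketch of a gated, dilated 3D inpainting network is given below; the channel width, the specific dilation rates (1, 2, 4, 8), spatial-only dilation, the reading of the 3 × 3 kernels as 3 × 3 × 3, and the activation choices are assumptions made to keep the example concrete and runnable, not values fixed by the text.

```python
import torch
import torch.nn as nn

class GatedConv3d(nn.Module):
    """Gated convolution: a feature branch modulated by a learned sigmoid gate,
    used in place of a plain convolution for dynamic feature selection."""
    def __init__(self, cin, cout, stride=(1, 1, 1), dilation=1):
        super().__init__()
        dil = (1, dilation, dilation)                   # dilate spatially only (assumption)
        pad = (1, dilation, dilation)
        self.feat = nn.Conv3d(cin, cout, 3, stride=stride, padding=pad, dilation=dil)
        self.gate = nn.Conv3d(cin, cout, 3, stride=stride, padding=pad, dilation=dil)

    def forward(self, x):
        return torch.relu(self.feat(x)) * torch.sigmoid(self.gate(x))

class BackgroundRepairNet(nn.Module):
    """G_c sketch: damaged background C_r and alpha A are concatenated, then passed
    through 2 downsampling gated convolutions, 4 gated dilated convolutions with
    different dilation rates, and 2 upsampling stages to produce the repaired C."""
    def __init__(self, ch=64):
        super().__init__()
        self.down = nn.Sequential(GatedConv3d(4, ch, stride=(1, 2, 2)),
                                  GatedConv3d(ch, ch, stride=(1, 2, 2)))
        self.middle = nn.Sequential(*[GatedConv3d(ch, ch, dilation=d) for d in (1, 2, 4, 8)])
        self.up = nn.Sequential(nn.Upsample(scale_factor=(1, 2, 2), mode="trilinear", align_corners=False),
                                GatedConv3d(ch, ch),
                                nn.Upsample(scale_factor=(1, 2, 2), mode="trilinear", align_corners=False),
                                GatedConv3d(ch, 3))

    def forward(self, damaged_bg, alpha):
        x = torch.cat([damaged_bg, alpha], dim=1)       # (N, 3 + 1, 5, H, W)
        return self.up(self.middle(self.down(x)))       # repaired background C
```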
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention is the first to propose separating special-effect-containing animation videos with deep neural networks, decomposing an animation video clip into a special effect layer and a background content layer.
2. The method can extract various different special effects in animation, and can recover complete, special-effect-free background content while accurately extracting the special effects.
3. The invention is the first to propose multi-scale perception of motion-field differences, which can perceive and localize special effects whose direction, speed, and shape distributions differ greatly, and embeds this into the learning process of the neural network as prior knowledge to guide the network to capture special effect motion characteristics, further helping the neural network learn how to separate the special effects.
4. The invention introduces a self-attention mechanism to assist the guidance from multi-scale perception of motion differences, further guiding the network to obtain more accurate special effect motion priors and avoid noise errors.
5. The method introduces a three-dimensional convolutional neural network to repair the damaged background; by using the transparency information as a soft auxiliary input and dilated convolutions with different receptive fields, it perceives damaged regions of different sizes, and the three-dimensional convolution takes temporal consistency into account, so the repaired background is clearer and more complete.
6. The method has broad applicability in animation video processing tasks, with short inference time and good generalization.
Drawings
FIG. 1 is a logic flow diagram of the method of the present invention.
Fig. 2-1 to fig. 2-5 show input sequence frames of the method of the present invention.
Fig. 3-1 to Fig. 3-4 show the single-scale special effect prediction map set calculated using Fig. 2-5 as the current frame.
Fig. 4 shows the multi-scale special effect prediction of the current frame obtained by the combination.
FIG. 5 is the self-attention multi-scale special effect feature obtained by applying the self-attention mechanism to the multi-scale special effect prediction.
Fig. 6-1 to 6-5 are separated special effect sequence frames.
Fig. 7-1 to 7-5 are sequence frames of background content obtained after the separation and repair.
Detailed Description
The present invention will be further described with reference to the following specific examples.
As shown in fig. 1, the method for separating an animated special effect and background content based on multi-scale motion information provided by this embodiment includes the following steps:
1) Acquire sequence frames of special effect fragments in the animation video, where each frame is an image containing a special effect over the background. Video sequence frames with special effect fragments are obtained with the professional video editing software Adobe Premiere: animation videos with special effects are first collected, and the fragments containing special effects are then cut from the video clips; the sequence frames are consecutive image frames sampled at 25 frames per second. The special effects include four different types: rain, snow, falling petals, and fallen leaves. The animation sequence frames are first preprocessed, and all sequence frames are divided into sequences of 5 consecutive frames as input. As shown in Figs. 2-1 to 2-5, the frames are adjacent to each other.
2) Calculate the single-scale special effect prediction map set between each input animation sequence frame and the other frames, which contains the single-scale special effect prediction maps between the current frame and all other frames. The optical flow between each i-th frame I_i and every other frame I_j (j ≠ i) in the sequence is estimated with the optical flow estimation neural network FlowNet2, and I_j is affine-transformed back by the optical flow:

Î_{i→j} = W(I_j, V(I_i, I_j)), j ≠ i

In the formula, V denotes the classical optical flow estimation neural network FlowNet2, I_i denotes the i-th frame, I_j (j ≠ i) denotes the j-th frame with j not equal to i, and V(I_i, I_j) denotes the optical flow estimated from the i-th frame to the j-th frame; W denotes the affine transformation, i.e. I_j is affine-transformed back to I_i by the estimated optical flow, and Î_{i→j} denotes the result of the affine transformation from the i-th frame to the j-th frame.
After the affine transformation results between the current frame and all other frames are obtained, the single-scale special effect prediction map D_{i→j} of the current frame is calculated according to the differences in speed and direction between the motion field of the special effects and that of the background content:

D_{i→j} = Σ_{c=1}^{C} ‖ I_i^c − Î_{i→j}^c ‖₂, D_{i→j} ∈ ℝ^{1×H×W}

In the formula, the Euclidean distance between the current frame I_i and the affine-transformed frame Î_{i→j} is computed channel by channel over the C channels and accumulated to obtain the single-scale special effect prediction map from the i-th frame to the j-th frame; ℝ is the real number set, 1 is the channel size of the vector, and H and W denote the height and width of the single-scale special effect prediction map. After the single-scale special effect prediction maps from the current frame I_i to all other frames j are calculated, the set of single-scale special effect prediction maps of the current frame I_i is obtained. Taking the frame in Fig. 2-5 as the current frame, Figs. 3-1 to 3-4 show the calculated set of single-scale special effect prediction maps.
3) The obtained single-scale special effect prediction maps D_{i→j} of each frame are merged into the multi-scale special effect prediction by the maximum operation:

D_i = max_{j≠i} D_{i→j}, D_i ∈ ℝ^{1×H×W}

In the formula, {D_{i→j} | j ≠ i} is the set of single-scale special effect prediction maps computed for the current i-th frame from the other frames j, with j not equal to i and taking integer values from 1 to 5. Max means that the 4 single-scale special effect prediction maps D_{i→j} of different time spans are merged by taking the element-wise maximum along the time dimension, yielding the multi-scale special effect prediction D_i of the i-th frame; ℝ is the real number set, 1 is the channel size of the vector, and H and W denote the height and width of the multi-scale special effect prediction, as shown in Fig. 4.
4) After the multi-scale special effect prediction of each frame is obtained, the self-attention multi-scale special effect set characteristics of each frame in the sequence frames are calculated. Firstly, self-attention multi-scale special effect features are calculated for multi-scale special effect prediction of each frame, and each position feature in the multi-scale special effect prediction is balanced again through a weight value obtained by learning:
M_i = Sigmoid(H(D_i))

In the formula, H denotes a convolution layer with kernel size 1 × 1, Sigmoid denotes the activation function, and the obtained M_i is the self-attention weight.
Then the self-attention weight is combined with the originally predicted multi-scale special effect prediction:

D'_i = M_i ⊙ D_i

In the formula, ⊙ denotes element-wise matrix multiplication, and D'_i denotes the re-weighted self-attention multi-scale special effect feature. Fig. 5 shows the self-attention multi-scale special effect feature obtained after applying the self-attention mechanism to Fig. 4.
Finally, the self-attention multi-scale special effect features of all frames in the sequence are merged along the time dimension, i.e. the attention-weighted features of each frame are fused in the time dimension:

D = Concat_T(D'_1, D'_2, …, D'_5), D ∈ ℝ^{1×5×H×W}

In the formula, Concat_T denotes vector concatenation along the time dimension T, D'_1, D'_2 and D'_5 denote the self-attention multi-scale special effect features of the 1st, 2nd and 5th frames respectively, and D is the merged self-attention multi-scale special effect set feature; ℝ is the real number set, 1 is the channel size of the vector, 5 is the size of the time dimension, and H and W denote the height and width of the self-attention multi-scale special effect set feature.
5) Feature extraction is performed on the input sequence frames through a three-dimensional convolutional neural network layer:

F = Conv(I)

In the formula, Conv is a standard three-dimensional convolutional neural network layer with kernel size 5 × 5 × 3, and F is the extracted image feature. Abstract features of the input sequence frames are obtained by convolution.
6) Encoding and decoding are performed on the combination of the input sequence frame features and the self-attention special effect set features. The self-attention special effect set feature computed in step 4) guides the extraction of the special effects and is fused with the features of the input sequence frames:

F_e = F ⊙ D

In the formula, ⊙ denotes element-wise matrix multiplication, in which the self-attention special effect set feature is multiplied with the features of the input sequence frames along the channel dimension, and F_e is the fused image frame feature.
7) And then coding and decoding the image frame characteristics through a three-dimensional residual convolution neural network to obtain separated special effect sequence frames and transparent channel information:
in the formula: geThe method is characterized in that a three-dimensional residual convolutional neural network is represented, the structural design of the neural network is composed of 2 downsampling convolutional layers which do not share parameters, 4 residual modules which are replaced by adding Non-local layers, 1 upsampling convolutional layer and 2 upsampling convolutional layers which do not share parameters, the size of all three-dimensional convolutional cores is 3 multiplied by 3, and a special effect sequence frame E and transparent channel information A are respectively output through the last 2 unshared upsampling convolutional layers. E denotes the special effect sequence frame obtained by the separation,indicates that E belongs toReal number set, 5 consecutive frames of size 3 × H × W; a represents the transparent channel information and a is,indicates that A belongs toReal number set, 5 consecutive ones of size 1 × H × W.
As shown in fig. 6-1 to 6-5, the frames are separated to obtain special effect sequence frames containing channel information.
8) After the separated special effect sequence frames and the transparent channel are obtained, the damaged background sequence frames to be repaired are obtained by subtracting the separated special effect sequence frames from the input sequence frames, C_r = I − E, wherein I denotes the input sequence frames, E the separated special effect sequence frames, and C_r the computed damaged background sequence frames.
9) The separated animation background content is obtained through a three-dimensional convolutional neural network combined with the preceding transparent channel information. The damaged background sequence frames to be repaired and the transparent channel information are jointly input into the three-dimensional convolutional neural network, which passes them sequentially through 2 downsampling convolution layers, 4 dilated convolutions with different dilation rates that perceive damaged regions of different sizes, and 2 upsampling layers; all convolution layers are replaced with gated convolution layers that control which channel information is fully utilized to avoid redundancy, and all convolution kernels are 3 × 3 three-dimensional kernels. The separated animation background content sequence frames are obtained as follows:
C = G_c(C_r, A)

In the formula, G_c denotes the three-dimensional convolutional neural network, and C denotes the repaired background sequence frames, 5 consecutive frames of size 3 × H × W, as shown in Figs. 7-1 to 7-5.
After separating the special effect sequence frame and the separated animation background sequence frame of the original input animation video sequence, the animation special effect and the background content are separated.
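Tying the embodiment together, the sketch below shows one way the pieces described above could be composed at inference time; it reuses the hypothetical helpers and modules from the earlier sketches (single_scale_prediction_sets, multi_scale_prediction, EffectSelfAttention, fuse, EffectSeparator, BackgroundRepairNet) and an external flow_fn, and is an illustration rather than the patented implementation.

```python
import torch

def separate_clip(frames, flow_fn, attn, separator, repairer):
    """End-to-end composition of the earlier sketches.

    frames: list of 5 tensors, each (1, 3, H, W); flow_fn, attn, separator and
    repairer are the hypothetical helpers/modules defined in the earlier sketches."""
    sets = single_scale_prediction_sets(frames, flow_fn)        # step 2: {D_{i->j}}
    d_multi = [multi_scale_prediction(m) for m in sets]          # step 3: D_i per frame
    d = attn(d_multi)                                            # step 4: (1, 1, 5, H, W)
    clip = torch.stack(frames, dim=2)                            # I: (1, 3, 5, H, W)
    f_e = fuse(clip, d)                                          # steps 5-6: fused features
    effect, alpha = separator(f_e)                               # step 7: E, A
    damaged = clip - effect                                      # step 8: C_r = I - E
    repaired = repairer(damaged, alpha)                          # step 9: C = G_c(C_r, A)
    return effect, alpha, repaired
```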
The above-mentioned embodiments are merely preferred embodiments of the present invention, and the scope of the present invention is not limited thereto; any change made according to the shape and principle of the present invention shall be covered within the protection scope of the present invention.
Claims (10)
1. The method for separating the animation special effect and the background content based on the multi-scale motion information is characterized by comprising the following steps of:
1) acquiring data, wherein the data comprises sequence frames with special effect fragments in an animation video as input;
2) calculating a single-scale special effect prediction image set between each frame and other frames in the input sequence frames;
3) merging the single-scale special effect prediction image set of each frame to be used as multi-scale special effect prediction;
4) adjusting the multi-scale special effect prediction through a self-attention mechanism to obtain self-attention multi-scale special effect set characteristics;
5) extracting the characteristics of an input sequence frame through a three-dimensional convolution neural network layer;
6) combining the characteristics of the input sequence frame and the self-attention multi-scale special effect set characteristics;
7) adding a Non-local module into each residual module to strengthen time sequence information association through a three-dimensional residual convolutional neural network, and then outputting a separated special effect sequence frame and transparent channel information;
8) obtaining a damaged background sequence frame by subtracting the separated special effect sequence frame from the input sequence frame;
9) combining the damaged background sequence frame with the transparent channel information and inputting them into a three-dimensional convolutional neural network, in which all convolution layers are replaced with gated convolutions for dynamic feature selection and the middle layers are replaced with dilated convolutions of different dilation rates, and finally outputting a repaired background sequence frame.
2. The method for separating animation special effects and background content based on multi-scale motion information according to claim 1, wherein: in step 1), the animation video with special effect fragments is obtained by cutting the collected animation videos containing special effects with the professional video editing software Adobe Premiere; the special effect types include rain, snow, falling petals, and fallen leaves; the sequence frames are consecutive image frames sampled from a video clip at 25 frames per second, and through data preprocessing the consecutive images are divided into sequences of 5 frames each:

I = {I_1, I_2, I_3, I_4, I_5} ∈ ℝ^{C×5×H×W}
3. The method for separating animation special effects and background content based on multi-scale motion information according to claim 1, wherein: in step 2), the single-scale special effect prediction map set between each frame and the other frames in the input sequence is calculated as follows:

2.1) calculating, by the optical flow estimation neural network FlowNet2, the optical flow between each frame I_i and every other frame I_j (j ≠ i) in the sequence, and affine-transforming I_j back by the optical flow:

Î_{i→j} = W(I_j, V(I_i, I_j)), j ≠ i

wherein V denotes the optical flow estimation neural network FlowNet2, I_i denotes the i-th frame, I_j (j ≠ i) denotes the j-th frame with j not equal to i, and V(I_i, I_j) denotes the optical flow estimated from the i-th frame to the j-th frame; W denotes the affine transformation, i.e. I_j is affine-transformed back to I_i by the estimated optical flow, and Î_{i→j} denotes the result of the affine transformation from the i-th frame to the j-th frame;

2.2) calculating, according to the differences in speed and direction between the motion field of the special effects and that of the background content, the single-scale special effect prediction map of each frame I_i:

D_{i→j} = Σ_{c=1}^{C} ‖ I_i^c − Î_{i→j}^c ‖₂, D_{i→j} ∈ ℝ^{1×H×W}

wherein I_i denotes the i-th frame, Î_{i→j} denotes the result of the affine transformation from the i-th frame to the j-th frame, the Euclidean distance between I_i and Î_{i→j} is computed channel by channel over the C channels and the results are accumulated, and D_{i→j} denotes the single-scale special effect prediction map of I_i computed from the j-th frame; ℝ is the real number set, 1 is the channel size of the vector, and H and W denote the height and width of the single-scale special effect prediction map;

2.3) after calculating the single-scale special effect prediction maps from the i-th frame I_i to all other frames I_j (j ≠ i) in the input sequence, obtaining the set of single-scale special effect prediction maps {D_{i→j} | j ≠ i, i, j ∈ [1,5]} of I_i, wherein D_{i→j} denotes the single-scale special effect prediction map of the i-th frame computed from the j-th frame, the set collects the prediction maps of the i-th frame from all other frames I_j with j not equal to i, and i, j ∈ [1,5] means that i and j take values in the closed interval from 1 to 5.
4. The method for separating animation special effects and background content based on multi-scale motion information according to claim 1, wherein: in step 3), the single-scale special effect prediction map set computed for each frame is merged, making full use of the information at different time scales in the input sequence to assist the prediction of special effects moving at different rates; the multi-scale special effect prediction is computed as:

D_i = max_{j≠i} D_{i→j}, D_i ∈ ℝ^{1×H×W}

wherein D_{i→j} denotes the single-scale special effect prediction map of the i-th frame computed from the j-th frame, {D_{i→j} | j ≠ i, i, j ∈ [1,5]} is the set of single-scale prediction maps of the i-th frame from all other j-th frames, with i and j taking values in the closed interval from 1 to 5; max means that the 4 single-scale special effect prediction maps D_{i→j} of different time spans are merged by taking the element-wise maximum along the time dimension; D_i denotes the resulting multi-scale special effect prediction of I_i; ℝ is the real number set, 1 is the channel size of the vector, and H and W denote the height and width of the multi-scale special effect prediction.
5. The method for separating animation special effects and background content based on multi-scale motion information according to claim 1, wherein: in step 4), the multi-scale special effect prediction is adjusted through the self-attention mechanism, i.e. the noise of non-effective regions is suppressed by re-weighting, comprising the following steps:

4.1) passing the multi-scale special effect prediction of each frame through the self-attention mechanism to obtain a new weight, and re-balancing the response of each feature in the multi-scale special effect prediction with this weight:

M_i = Sigmoid(H(D_i))

wherein D_i denotes the multi-scale special effect prediction belonging to frame I_i, H denotes a convolution layer with kernel size 1 × 1, Sigmoid denotes the activation function used to compute the weight, and M_i denotes the computed weight;

4.2) combining the computed weight M_i with the multi-scale special effect prediction D_i:

D'_i = M_i ⊙ D_i

wherein ⊙ denotes element-wise matrix multiplication, and D'_i denotes the re-weighted self-attention multi-scale special effect feature of the i-th frame;

4.3) merging the self-attention multi-scale special effect features of all input sequence frames along the channel to obtain the self-attention multi-scale special effect set feature:

D = Concat_T(D'_1, D'_2, …, D'_5), D ∈ ℝ^{1×5×H×W}

wherein Concat_T denotes vector concatenation along the time dimension T, D'_1, D'_2 and D'_5 denote the self-attention multi-scale special effect features of the 1st, 2nd and 5th frames respectively, and D is the merged self-attention multi-scale special effect set feature; ℝ is the real number set, 1 is the channel size of the vector, 5 is the size of the time dimension, and H and W denote the height and width of the self-attention multi-scale special effect set feature.
6. The method for separating animation special effects and background content based on multi-scale motion information according to claim 1, wherein: in step 5), the features of the input sequence of 5 consecutive frames are extracted through a three-dimensional convolutional neural network layer as:

F = Conv(I), F ∈ ℝ^{C×5×H×W}

wherein I is the input sequence frame, Conv is a three-dimensional convolution layer with kernel size 5 × 5 × 3, and F is the extracted feature of the input sequence frames; ℝ is the real number set, C is the channel size of the vector, 5 is the size of the time dimension, and H and W denote the height and width of the features of the input sequence frames.
7. The method for separating animation special effects and background content based on multi-scale motion information according to claim 1, wherein: in step 6), the extraction of the special effect part is guided by the computed self-attention multi-scale special effect set feature, which is fused with the features of the input sequence frames:

F_e = F ⊙ D

wherein F is the extracted feature of the input sequence frames, D is the self-attention multi-scale special effect set feature, ⊙ denotes element-wise matrix multiplication, and F_e is the fused image frame feature.
8. The method for separating animation special effects and background content based on multi-scale motion information according to claim 1, wherein: in step 7), a Non-local module is added to each residual module of the three-dimensional residual convolutional neural network to strengthen temporal information association, and the separated special effect sequence frames and the transparent channel information are then output, comprising the following steps:

7.1) encoding and decoding the sequence frame features through the three-dimensional residual convolutional neural network, whose structure consists of 2 parameter-sharing downsampling convolution layers, 4 residual modules with added Non-local layers, 1 upsampling convolution layer, and 2 upsampling convolution layers that do not share parameters; all three-dimensional convolution kernels are of size 3 × 3;

7.2) outputting the separated special effect sequence frames and the transparent channel information respectively:

(E, A) = G_e(F_e)

wherein G_e denotes the three-dimensional residual convolutional neural network, E denotes the separated special effect sequence frames, and E ∈ ℝ^{3×5×H×W}, i.e. 5 consecutive frames of size 3 × H × W; A denotes the transparent channel information, and A ∈ ℝ^{1×5×H×W}, i.e. 5 consecutive frames of size 1 × H × W.
9. The method for separating animation special effects and background content based on multi-scale motion information according to claim 1, wherein: in step 8), the separated special effect sequence frames are subtracted from the input sequence frames to obtain the special-effect-free damaged background sequence frames C_r:

C_r = I − E

wherein I denotes the input sequence frames, E denotes the separated special effect sequence frames, and C_r denotes the computed damaged background sequence frames.
10. The method for separating animation special effects and background content based on multi-scale motion information according to claim 1, wherein: in step 9), the damaged background sequence frames and the transparent channel information are merged and input into a three-dimensional convolutional neural network, in which all convolution layers are replaced with gated convolutions for dynamic feature selection and the middle layers are replaced with dilated convolutions of different dilation rates, and the repaired background sequence frames are finally output, comprising the following steps:

9.1) encoding and decoding the damaged background sequence frames through the three-dimensional convolutional neural network, which passes the input sequentially through 2 downsampling convolution layers, 4 dilated convolutions with different dilation rates that perceive damaged regions of different sizes, and 2 upsampling layers; all convolution layers are replaced with gated convolution layers that control which channel information is fully utilized so as to avoid redundancy, and all convolution kernels are 3 × 3 three-dimensional kernels;

9.2) jointly inputting the damaged background sequence frames and the transparent channel information into the three-dimensional convolutional neural network to obtain the repaired background sequence frames:

C = G_c(C_r, A)

wherein G_c denotes the three-dimensional convolutional neural network, C_r is the damaged background sequence frames, A denotes the transparent channel information, C denotes the repaired background sequence frames, and C ∈ ℝ^{3×5×H×W}, i.e. 5 consecutive frames of size 3 × H × W;

after the special effect layer of the input sequence frames of the original animation video clip is separated and the repaired background content layer is output, the special effects and the background content in the animation are separated.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110101404.XA CN112686922B (en) | 2021-01-26 | 2021-01-26 | Method for separating animation special effect and background content based on multi-scale motion information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110101404.XA CN112686922B (en) | 2021-01-26 | 2021-01-26 | Method for separating animation special effect and background content based on multi-scale motion information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112686922A true CN112686922A (en) | 2021-04-20 |
CN112686922B CN112686922B (en) | 2022-10-25 |
Family
ID=75459206
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110101404.XA Active CN112686922B (en) | 2021-01-26 | 2021-01-26 | Method for separating animation special effect and background content based on multi-scale motion information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112686922B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090154807A1 (en) * | 2005-12-30 | 2009-06-18 | Telecom Italia S.P.A. | Edge Comparison in Segmentation of Video Sequences |
WO2015052514A2 (en) * | 2013-10-08 | 2015-04-16 | Digimania Limited | Rendering composites/layers for video animations |
CN108520501A (en) * | 2018-03-30 | 2018-09-11 | 西安交通大学 | A kind of video and removes rain snow method based on multiple dimensioned convolution sparse coding |
CN111259782A (en) * | 2020-01-14 | 2020-06-09 | 北京大学 | Video behavior identification method based on mixed multi-scale time sequence separable convolution operation |
Non-Patent Citations (2)
Title |
---|
XINCHEN YE et al.: "Foreground–Background Separation From Video Clips via Motion-Assisted Matrix Restoration", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 25, Issue 11, November 2015 * |
MIAO Peiqi: "Sketch Simplification and Animation Special Effect Transfer Based on Depth-Perception Networks", China Excellent Master's Theses Full-Text Database (Master), Information Science and Technology Series * |
Also Published As
Publication number | Publication date |
---|---|
CN112686922B (en) | 2022-10-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111539887B (en) | Channel attention mechanism and layered learning neural network image defogging method based on mixed convolution | |
CN112926396B (en) | Action identification method based on double-current convolution attention | |
CN109299274B (en) | Natural scene text detection method based on full convolution neural network | |
CN110111366B (en) | End-to-end optical flow estimation method based on multistage loss | |
CN111563909B (en) | Semantic segmentation method for complex street view image | |
CN111612807B (en) | Small target image segmentation method based on scale and edge information | |
CN110276354B (en) | High-resolution streetscape picture semantic segmentation training and real-time segmentation method | |
CN112991350B (en) | RGB-T image semantic segmentation method based on modal difference reduction | |
CN111079532A (en) | Video content description method based on text self-encoder | |
CN112232134B (en) | Human body posture estimation method based on hourglass network and attention mechanism | |
CN110781850A (en) | Semantic segmentation system and method for road recognition, and computer storage medium | |
CN111241963B (en) | First person view video interactive behavior identification method based on interactive modeling | |
Makarov et al. | Self-supervised recurrent depth estimation with attention mechanisms | |
CN112508960A (en) | Low-precision image semantic segmentation method based on improved attention mechanism | |
CN114048822A (en) | Attention mechanism feature fusion segmentation method for image | |
CN113313810A (en) | 6D attitude parameter calculation method for transparent object | |
CN111652081A (en) | Video semantic segmentation method based on optical flow feature fusion | |
Lu et al. | MFNet: Multi-feature fusion network for real-time semantic segmentation in road scenes | |
CN111476133A (en) | Unmanned driving-oriented foreground and background codec network target extraction method | |
CN116596966A (en) | Segmentation and tracking method based on attention and feature fusion | |
CN115830094A (en) | Unsupervised stereo matching method | |
CN115527096A (en) | Small target detection method based on improved YOLOv5 | |
CN114820423A (en) | Automatic cutout method based on saliency target detection and matching system thereof | |
Zheng et al. | Dcu-net: Self-supervised monocular depth estimation based on densely connected u-shaped convolutional neural networks | |
CN117252892A (en) | Automatic double-branch portrait matting model based on light visual self-attention network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||