CN114898021B - Intelligent cartoon method for music stage performance video - Google Patents
- Publication number: CN114898021B
- Application number: CN202210812946.2A
- Authority
- CN
- China
- Prior art keywords
- cartoon
- image
- loss function
- representing
- generation model
- Prior art date
- Legal status: Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/40—3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/7715—Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
Abstract
The invention provides an intelligent cartoonization method for music stage performance videos, comprising the following steps: step one, acquiring a real stage image data set and a cartoon image data set, and preprocessing the image data; step two, performing semantic segmentation of the characters, props, and background in the music stage performance video; step three, constructing and training different cartoonization video generation models for the music stage performance, one each for characters, props, and backgrounds; step four, inputting the music stage performance video into the models to obtain a cartoonized music stage performance video; and step five, constructing a composite image harmonization model to perform image harmonization on the cartoonized music stage performance video. The invention can cartoonize music stage performance videos for use in fields such as music performance and animation production, and favors generating music stage performance videos with clean contours, clear boundaries, and harmonious colors.
Description
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an intelligent cartoon method for music stage performance videos.
Background
In recent years, with the continuous development of artificial intelligence, many algorithms have been applied in the field of image processing, such as image style transfer. The cartoon is currently a very popular artistic form, widely used across society in advertising, games, film and television works, photography, and more. Most young people today have grown up influenced by Japanese cartoons, and cartoons are genuinely influential worldwide. However, because cartoons are drawn by hand and then rendered by computer, producing them is time-consuming and labor-intensive and cannot be done by people without a drawing background; modern cartoon animation workflows therefore allow artists to draw on various resources to create content. Some famous cartoons have been created by converting real-world pictures into usable cartoon scene material, a process called image cartoonization.
The image cartoonization method can also be applied to the field of music education. Rendering a music stage performance video with a cartoonization method presents the performance in a cartoon-style artistic form, which can attract children's interest in music stage performances. Although existing cartoonization methods are applied in many fields, applications in music education are scarce. In addition, existing methods cannot cartoonize characters, props, and backgrounds at the same time, and perform no harmonization on the cartoonized characters, props, and background, so they do not form a unified cartoon animation.
Definitions of terms:
Semantic-segmentation-based DCNN model: a model for semantic segmentation of images using a deep convolutional neural network (DCNN).
GAN-based cartoonization model: a model that cartoonizes images using a generative adversarial network (GAN).
Disclosure of Invention
Aiming at the shortcomings of existing image stylization methods, the invention provides a novel intelligent cartoonization method for music stage performance videos, which semantically segments the different contents of a complex scene and cartoonizes them with different image stylization methods.
The purpose of the invention is realized by the following technical scheme:
an intelligent cartoon method for music stage performance videos comprises the following steps:
acquiring image data and preprocessing the image data; the image data comprises a real stage image data set and a cartoon image data set; the real stage image dataset is obtained from a music stage performance video;
constructing a semantic segmentation model, wherein the semantic segmentation model carries out semantic segmentation on characters, props and backgrounds in the image data;
step three, constructing and training different cartoonization video generation models for the music stage performance for characters, props, and backgrounds respectively, obtaining a trained character cartoonization video generation model, a trained prop cartoonization video generation model, and a trained background cartoonization video generation model:
a character cartoonization video generation model, a prop cartoonization video generation model, and a background cartoonization video generation model, corresponding to characters, props, and backgrounds respectively, are each built on a GAN-based cartoonization model;
3.1) The total loss function L_body of the character cartoonization video generation model is as follows:

L_body = λ_1·L_surface + λ_2·L_structure + λ_3·L_texture + λ_4·L_content + λ_5·L_tv + λ_6·L_1

where λ_1, λ_2, λ_3, λ_4, λ_5, λ_6 are the weights of the character surface information loss function L_surface, the character structure information loss function L_structure, the character texture information loss function L_texture, the character content information loss function L_content, the character total variation loss function L_tv, and the l1 regularization term L_1 respectively; giving them different weights controls the information emphasis of the generated image;
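The weighted combination just described can be sketched as follows; the individual loss values and the default weights are placeholders, since the patent does not disclose its actual λ settings:

```python
def total_character_loss(l_surface, l_structure, l_texture,
                         l_content, l_tv, l_l1,
                         weights=(1.0, 1.0, 1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the six character-model loss terms; `weights` plays the
    role of (lambda_1, ..., lambda_6) and controls which information the
    generated image emphasizes."""
    terms = (l_surface, l_structure, l_texture, l_content, l_tv, l_l1)
    return sum(w * t for w, t in zip(weights, terms))
```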
3.11) The character surface information loss function L_surface is as follows:

L_surface(G, D_s) = log D_s(F_dgf(I_c, I_c)) + log(1 − D_s(F_dgf(G(I_p), G(I_p))))

Edge-preserving filtering is performed with a differentiable guided filter, denoted F_dgf, which takes an image I as input, uses the image itself as the guide map, and returns the extracted surface representation F_dgf(I, I) with textures and details removed. A discriminator D_s is introduced to judge whether the model output and the reference cartoon images have similar surfaces, and to guide the generator G to learn the information stored in the extracted surface representation; where G denotes the generator, D_s denotes the surface information discriminator, I_c denotes a cartoon image, and I_p denotes a real image;
3.12) The character structure information loss function L_structure is as follows:

L_structure = || VGG(G(I_p)) − VGG(F_st(G(I_p))) ||

High-level features are extracted with a pre-trained VGG16 network, and a spatial constraint is then enforced between the character cartoon images generated by the character cartoonization video generation model and the structure representations extracted from them. F_st denotes structure representation extraction from the generated character cartoon image, i.e., a selective search over the picture followed by color filling of each region; VGG(·) denotes the high-level features extracted from the generated character cartoon image by the VGG network;
The color of each region is computed as a weighted sum of the region's median and mean values:

S̃_{i,j} = θ_1·mean(S) + θ_2·median(S)

where S_{i,j} denotes the pixel value of the region at position (i, j), mean(S) denotes the average of the current region's pixel values, and median(S) denotes the median of the current region's pixel values; i denotes the row and j the column; σ(S) denotes the standard deviation of S, and the weights (θ_1, θ_2) are chosen adaptively according to σ(S) against the thresholds γ_1 and γ_2;
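A minimal sketch of this adaptive region coloring, using the γ1 = 20 and γ2 = 40 thresholds given later in the text; the exact piecewise choice of mean/median weights is an assumption:

```python
import statistics

def region_color(pixels, gamma1=20.0, gamma2=40.0):
    """Adaptive coloring for one segmented region: blend the region's mean and
    median pixel values, with weights chosen by the region's standard
    deviation sigma(S) against the (gamma1, gamma2) thresholds."""
    mean = statistics.mean(pixels)
    median = statistics.median(pixels)
    sigma = statistics.pstdev(pixels)
    if sigma < gamma1:          # low-variance region: median alone is stable
        w_mean, w_median = 0.0, 1.0
    elif sigma < gamma2:        # mid-variance region: equal blend
        w_mean, w_median = 0.5, 0.5
    else:                       # high-variance region: mean smooths outliers
        w_mean, w_median = 1.0, 0.0
    return w_mean * mean + w_median * median
```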
3.13) The character texture information loss function L_texture is as follows:

L_texture(G, D_t) = log D_t(F_rcs(I_c)) + log(1 − D_t(F_rcs(G(I_p))))

where F_rcs denotes a random color shift algorithm that extracts a single-channel texture representation from a color image, and D_t denotes the texture discriminator;

The single-channel texture representation F_rcs(I_rgb) is extracted from the color image with the random color shift algorithm:

F_rcs(I_rgb) = (1 − α)(β_1·I_r + β_2·I_g + β_3·I_b) + α·Y

where I_rgb denotes a 3-channel RGB color image, I_r, I_g, and I_b denote the three color channels, and Y denotes the standard gray image converted from the RGB color image. The discriminator D_t is introduced to distinguish the character cartoon image output of the character cartoonization video generation model from the texture representations extracted from the model-generated images, and to guide the generator to learn the clear contours and fine textures stored in the texture representation; α denotes the weight of the standard gray image, and β_1, β_2, β_3 denote the weights of the r, g, and b channels respectively, with values in the range (−1, 1);
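A sketch of the random color shift under one plausible reading of the description (α weighting the grayscale image Y, the β weights drawn from U(−1, 1)); the BT.601 luma coefficients for Y and the exact mixing form are assumptions:

```python
import random

def random_color_shift(r, g, b, alpha=0.8, rng=None):
    """Random color shift: collapse a 3-channel RGB image (given as per-channel
    pixel lists) into one texture channel. alpha weights the grayscale image Y;
    beta1..beta3 are random channel weights drawn from U(-1, 1)."""
    rng = rng or random.Random(0)
    beta1, beta2, beta3 = (rng.uniform(-1, 1) for _ in range(3))
    # ITU-R BT.601 luma as the "standard gray image" Y (an assumption)
    y = [0.299 * ri + 0.587 * gi + 0.114 * bi for ri, gi, bi in zip(r, g, b)]
    mix = [beta1 * ri + beta2 * gi + beta3 * bi for ri, gi, bi in zip(r, g, b)]
    return [(1 - alpha) * mi + alpha * yi for mi, yi in zip(mix, y)]
```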
3.14) The character content information loss function L_content is as follows:

L_content = || VGG(G(I_p)) − VGG(I_p) ||_1

where VGG(·) denotes the feature map of a VGG layer; l1 sparse regularization over the VGG feature maps of the input photo and the generated picture is used to refine the semantic content loss;
3.15) The character total variation loss function L_tv is as follows:

L_tv = (1 / (H·W·C)) · || ∇_x G(I_p) + ∇_y G(I_p) ||

where H, W, C denote the spatial dimensions of the image; ∇_x denotes the horizontal difference and ∇_y denotes the vertical difference of the generated image;
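A minimal single-channel sketch of the total variation penalty using forward differences; the normalization by H·W here stands in for the H·W·C factor above:

```python
def total_variation_loss(img):
    """Total variation loss: sum of absolute forward differences along rows
    and columns of a single-channel image, normalized by the image size.
    Penalizes high-frequency noise such as salt-and-pepper noise."""
    h, w = len(img), len(img[0])
    tv = 0.0
    for i in range(h):
        for j in range(w):
            if i + 1 < h:
                tv += abs(img[i + 1][j] - img[i][j])  # vertical difference
            if j + 1 < w:
                tv += abs(img[i][j + 1] - img[i][j])  # horizontal difference
    return tv / (h * w)
```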
3.16) The l1 regularization term:

L_1 = || G(I_p) ||_1

where || G(I_p) ||_1 denotes the one-norm of the character cartoon image generated by the character cartoonization video generation model;
3.2) The total loss function L_prop of the prop cartoonization video generation model is as follows:

L_prop = a·L_adv + b·L_con + c·L_tex + d·L_1 + e·L_IS

where a, b, c, d, e are the weights of L_adv, L_con, L_tex, L_1, and L_IS, which are respectively the edge-promoting adversarial loss function, the content information loss function, the texture information loss function, the l1 regularization term, and the illumination smoothing loss;
3.21) Edge-promoting adversarial loss:

For each image c_i ∈ Sdata(c), the following three steps are applied: (1) detect edge pixels with a standard Canny edge detector; (2) dilate the edge regions; (3) apply Gaussian smoothing to the dilated edge regions, yielding Sdata(e). Here Sdata(c) denotes the set of cartoon images, Sdata(e) denotes the set of cartoon images whose clear boundaries have been removed, c_i denotes the i-th image of the cartoon image set Sdata(c), e_j denotes the j-th image of the edge-smoothed set Sdata(e), and p_k denotes the k-th image of the set of images to be cartoonized;

The edge-promoting adversarial loss function L_adv is therefore:

L_adv(G, D) = E_{c_i ∼ Sdata(c)}[log D(c_i)] + E_{e_j ∼ Sdata(e)}[log(1 − D(e_j))] + E_{p_k ∼ Sdata(p)}[log(1 − D(G(p_k)))]

where E[·] denotes the expectation of the discrete variable under the indicated distribution, D denotes the discriminator, G denotes the generator, and G(p_k) denotes the image generated by the generator G from p_k;
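The three-step construction of Sdata(e) (edge detection, dilation, smoothing) can be sketched as follows; a real pipeline would use cv2.Canny, cv2.dilate, and cv2.GaussianBlur, so the pure-Python one-pixel dilation and box blur here are simplified stand-ins:

```python
def smooth_edges(edge_mask, img):
    """Given a binary edge mask (stand-in for a Canny detector's output) and a
    single-channel image, dilate the edge region by one pixel and replace the
    dilated-edge pixels with a 3x3 neighborhood average, removing the image's
    sharp boundaries as in the Sdata(e) construction."""
    h, w = len(img), len(img[0])
    # step 2: dilate the edge mask by one pixel in each direction
    dilated = [[any(edge_mask[di][dj]
                    for di in range(max(0, i - 1), min(h, i + 2))
                    for dj in range(max(0, j - 1), min(w, j + 2)))
                for j in range(w)] for i in range(h)]
    # step 3: blur only the dilated edge region (box blur as a stand-in for
    # Gaussian smoothing)
    out = [row[:] for row in img]
    for i in range(h):
        for j in range(w):
            if dilated[i][j]:
                nb = [img[di][dj]
                      for di in range(max(0, i - 1), min(h, i + 2))
                      for dj in range(max(0, j - 1), min(w, j + 2))]
                out[i][j] = sum(nb) / len(nb)
    return out
```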
3.22) The content information loss function L_con is as follows:

L_con = || VGG(G(p_k)) − VGG(p_k) ||_1

where VGG(·) denotes the feature map of a VGG layer;
3.23) The texture information loss function L_tex takes the same form as the character texture information loss function, using the discriminator D_t on the single-channel texture representations extracted by the random color shift algorithm;
3.3) The total loss function L_background of the background cartoonization video generation model is as follows:

L_background = e·L_adv + f·L_con + g·L_str + h·L_1

where e, f, g, h are the weights of L_adv, L_con, L_str, and L_1 respectively in the background cartoonization video generation model;
The character, prop, and background cartoonization video generation models are trained separately so as to minimize their respective total loss functions, yielding the trained character cartoonization video generation model, prop cartoonization video generation model, and background cartoonization video generation model. The different parts of the video to be cartoonized, obtained by semantic segmentation, are input into the trained character, prop, and background cartoonization video generation models respectively to obtain a character cartoon video, a prop cartoon video, and a background cartoon video; each frame of the character, prop, and background cartoon videos is then composited into a composite image, thereby obtaining a composite cartoonized music stage performance video;
step four, preprocessing the music stage performance video to be processed, segmenting out the characters, props, and background with the semantic segmentation model, and inputting them into the trained character, prop, and background cartoonization video generation models respectively, obtaining a cartoonized music stage performance video;
step five, constructing a composite image harmonization model to perform image harmonization on the cartoonized music stage performance video, obtaining the final cartoonized music stage performance video.
In a further improvement, in the first step, the preprocessing method includes image enhancement and image normalization.
In a further improvement, in step two, the semantic segmentation model is a semantic-segmentation-based DCNN model;
First, a picture is fed into the semantic-segmentation-based DCNN model, and atrous (dilated) convolutions are added to extract features, obtaining high-level semantic features and low-level semantic features. The atrous convolution is:

y[i] = Σ_k x[i + τ·k] · w[k]

where y[i] denotes the atrous convolution output at position i, x[i + τ·k] denotes the input at position i + τ·k, K denotes the length of the convolution kernel, w[k] denotes the convolution filter of length K, and τ denotes the sampling stride (dilation rate) over the input signal;
The low-level semantic features are the feature information obtained after one atrous convolution with rate 1; the high-level semantic features are the feature information obtained after four atrous convolutions. The extracted high-level semantic features are input into an atrous spatial pyramid pooling module and convolved with atrous convolution layers of different rates (1, 6, 12, and 18) to obtain four feature maps; the extracted high-level semantic features are also pooled to obtain a further feature map. The five feature maps obtained from all branches are concatenated to obtain the first feature map;
The first feature map is passed through a multi-layer channel attention module to obtain the second feature map; the second feature map is upsampled by bilinear interpolation and merged with the low-level semantic features to obtain a merged feature map; the decoder recovers the spatial information of the merged feature map with 3×3 convolutions and refines the target boundary over the bilinear interpolation, obtaining the segmentation result;
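The bilinear-interpolation upsampling used before merging with the low-level features can be sketched for a single-channel map; the align-corners convention below is an assumption, since the decoder's convention is not specified:

```python
def bilinear_upsample(img, scale):
    """Bilinear-interpolation upsampling of a single-channel feature map by an
    integer scale factor (align-corners convention)."""
    h, w = len(img), len(img[0])
    H, W = h * scale, w * scale
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            # map output coordinates back into the input grid
            y = i * (h - 1) / (H - 1) if H > 1 else 0.0
            x = j * (w - 1) / (W - 1) if W > 1 else 0.0
            y0, x0 = int(y), int(x)
            y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
            dy, dx = y - y0, x - x0
            out[i][j] = (img[y0][x0] * (1 - dy) * (1 - dx)
                         + img[y0][x1] * (1 - dy) * dx
                         + img[y1][x0] * dy * (1 - dx)
                         + img[y1][x1] * dy * dx)
    return out
```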
Since the image segmentation task contains multiple object classes, a multi-class cross-entropy loss function is used:

L = − Σ_{i=1}^{C} y_i · log(p_i)

where p_i denotes the probability that the sample belongs to class i; y_i is the indicator of the sample label: y_i = 1 when the sample belongs to class i, and y_i = 0 when it does not; C denotes the number of classes;
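For a single sample this cross-entropy collapses to −log of the probability assigned to the true class, since y_i is a one-hot indicator; a minimal sketch:

```python
import math

def multiclass_cross_entropy(probs, true_class):
    """Multi-class cross-entropy for one sample: -sum_i y_i * log(p_i).
    Because y_i = 1 only for the true class, only that term survives."""
    return -math.log(probs[true_class])
```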
through the process, the characters and the props are separated from the stage background.
In a further improvement, γ_1 = 20 and γ_2 = 40.
In a further improvement, the concrete steps of step five are as follows:
The composite image is decomposed into a reflectance intrinsic image and an illumination intrinsic image, I = R ⊙ L, where ⊙ denotes the element-wise product;
Harmonization is embedded into the process from composite-image decomposition to real-image reconstruction through an image reconstruction loss function L_rec:

L_rec = || Ĥ − H ||_2²

where || Ĥ − H ||_2 denotes the two-norm of the spatial distance between the output harmonized image Ĥ and the real image H;
Taking ∇R̂ ≈ ∇Ĥ as the constraint for harmonizing the reflectance yields the reflectance harmonization loss L_RH:

L_RH = || ∇R̂ − ∇Ĥ ||

where ∇R̂ denotes the reflectance gradient of the harmonized image, ∇Ĥ denotes the gradient of the harmonized image, and || ∇R̂ − ∇Ĥ || denotes the norm of the difference between them; R̂ denotes the harmonized intrinsic image and ∇ denotes the gradient;
To harmonize the illumination, the foreground and background illumination are made compatible: the light is first learned and then transferred from the background to the foreground, under the premise that the image gradient corresponding to the illumination is smooth; the constraint ∇L ≈ 0 provides the decoupling, yielding the illumination smoothing loss L_IS;
The illumination harmonization loss L_IH is set as follows:

L_IH = || L̂ − L ||_2²

where L denotes the illumination of the real image, L̂ denotes the harmonized illumination intrinsic image, and || L̂ − L ||_2² denotes the squared two-norm of the spatial distance between the harmonized intrinsic image and the real image;
where sim(·, ·) is a similarity function; the encoder takes the composite image as input and produces a disharmony feature map as output; C is the number of channels of that feature map, which is averaged over the channels and compared against a grayscale real image reduced to the same size;
The total loss function L_harm is obtained as follows:

L_harm = L_rec + λ_RH·L_RH + λ_IS·L_IS + λ_IH·L_IH + λ_IF·L_IF

Training to minimize the total loss function L_harm yields the final harmonization model, and the composite cartoonized music stage performance video is input into the final harmonization model to obtain the harmonized music stage performance video; λ_RH, λ_IS, λ_IH, and λ_IF are the weights of L_RH, L_IS, L_IH, and L_IF respectively.
The invention has the advantages that:
Compared with the prior art, the invention can cartoonize a captured music stage performance video, applying different cartoonization to characters, props, backgrounds, and other objects. The edge-promoting adversarial loss and l1 sparse regularization on high-level feature maps of the VGG network provide good flexibility for reproducing smooth shading. An image harmonization model is trained to harmonize the composited cartoon video so that foreground and background are consistent. The method favors generating clean contours, clear boundaries, and harmonious colors, and the generated cartoonized music stage performance videos can be widely applied in the field of music education to increase children's interest in music.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a test video screenshot;
FIG. 3 is a video semantic segmentation result screenshot;
fig. 4 is a video screenshot after image harmonization.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following examples.
The invention relates to an intelligent cartoon method for music stage performance videos, which comprises the following steps:
the method comprises the following steps of firstly, acquiring a real stage image data set and a cartoon image data set, and preprocessing the image data:
and collecting a real scene image data set and a cartoon image data set, carrying out image preprocessing, and constructing a training set and a testing set.
Step two, performing semantic segmentation on characters, props and backgrounds in the music stage performance video:
The original music stage performance video input by the user is split into frames, and a DCNN model based on atrous convolution is designed to semantically segment each frame image, extracting features with the DCNN model and predicting a label, e.g., character, background, or prop;
designing a loss function to measure the difference between the predicted label and the real label;
calculating the gradient of each layer of parameters according to the difference, and then updating the gradient;
repeating the previous steps until the predicted label and the real label reach a certain accuracy;
given a picture, each pixel will output a probability of different categories, thereby generating a corresponding mask to segment characters, props and backgrounds in the music stage performance video.
Step three, constructing and training different cartoon video generation models for the music stage performance aiming at different objects such as characters, props, backgrounds and the like respectively:
designing and training a cartoon model based on the generated countermeasure network to cartoon different objects.
Constructing a cartoon model based on the generated countermeasure network to cartoon the character, wherein the total loss function is as follows:
whereinλ 1 、λ 2 、λ 3 、λ 4 、λ 5 、λ 6 Respectively as a function of loss of information on the surface of the personL surface Loss function of character structure informationL structure Loss function of character texture informationL texture Loss function of character content informationL content Figure total variation loss functionL tv And l1 regularization termL 1 The information emphasis point of the generated image is controlled by giving different weights. Wherein the content of the first and second substances,
1) Surface information loss:

L_surface(G, D_s) = log D_s(F_dgf(I_c, I_c)) + log(1 − D_s(F_dgf(G(I_p), G(I_p))))

Edge-preserving filtering is performed with a differentiable guided filter, denoted F_dgf, which takes an image I as input, uses it as its own guide map, and returns the extracted surface representation F_dgf(I, I) with texture and detail removed. A discriminator D_s is introduced to determine whether the model output and the reference cartoon images have similar surfaces, and to direct the generator G to learn the information stored in the extracted surface representation.
2) Structure information loss:

L_structure = || VGG(G(I_p)) − VGG(F_st(G(I_p))) ||

High-level features extracted with the pre-trained VGG16 network enforce a spatial constraint between our results and the extracted structure representation, where F_st denotes structure representation extraction. An adaptive filtering algorithm combining median filtering and mean filtering is used here, computing each region's color as a weighted sum of the region's mean and median selected by the standard deviation σ(S) against the thresholds γ_1 and γ_2, where γ_1 = 20 and γ_2 = 40.
3) Texture information loss:

A single-channel texture representation is extracted from the color image using the random color shift algorithm:

F_rcs(I_rgb) = (1 − α)(β_1·I_r + β_2·I_g + β_3·I_b) + α·Y

We set α = 0.8 and draw β_1, β_2, β_3 ∼ U(−1, 1). A discriminator D_t is introduced to distinguish the texture representations extracted from the model output and from the cartoon images, and to direct the generator to learn the clear contours and fine textures stored in the texture representation.
4) Content information loss:

L_content = || VGG(G(I_p)) − VGG(I_p) ||_1

where VGG(·) denotes the feature map of a particular VGG layer; l_1 sparse regularization over the VGG feature maps of the input photo and the generated picture is used to refine the semantic content loss. Sparse regularization copes better with the effect of large style differences on the feature maps.
5) Total variation loss:

L_tv = (1 / (H·W·C)) · || ∇_x G(I_p) + ∇_y G(I_p) ||

The total variation loss function L_tv imposes spatial smoothness on the generated image and also reduces high-frequency noise, such as salt-and-pepper noise; H, W, C denote the spatial dimensions of the image.
6) l1 regularization term:

L_1 = || G(I_p) ||_1
A GAN-based cartoonization model is constructed to cartoonize the props, with total loss function:

L_prop = a·L_adv + b·L_con + c·L_tex + d·L_1 + e·L_IS

where a, b, c, d, e are weights balancing the given losses, and L_adv, L_con, L_tex, L_1, L_IS are respectively the adversarial loss function, the content information loss function, the texture information loss function, the l1 regularization term, and the illumination smoothing loss.
1) Adversarial loss:

For each cartoon image: (1) a standard Canny edge detector detects edge pixels; (2) the edge region is dilated; (3) Gaussian smoothing is applied to the dilated edge region, yielding Sdata(e). The edge-promoting adversarial loss function is therefore:

L_adv(G, D) = E_{c_i ∼ Sdata(c)}[log D(c_i)] + E_{e_j ∼ Sdata(e)}[log(1 − D(e_j))] + E_{p_k ∼ Sdata(p)}[log(1 − D(G(p_k)))]
2) Content information loss:

The content information loss function ensures that the cartoonization result and the input photo remain semantically unchanged; it is likewise computed over the pre-trained VGG16 feature space.
3) Texture information loss:

A discriminator D_t is introduced to distinguish the texture representations extracted from the model output and from the cartoon images, and to direct the generator to learn the clear contours and fine textures stored in the texture representation.
4) l1 regularization term:

L_1 = || G(p_k) ||_1
5) Illumination smoothing loss:

To harmonize the lighting, the foreground illumination must be adjusted to approximately match the background illumination, making the lighting of foreground and background compatible. We design a new illumination strategy: first learn the light, then transfer the light from the background to the foreground, under the premise that the image gradient corresponding to the illumination is small (i.e., the illumination is smooth); the constraint ∇L ≈ 0 provides the decoupling, giving the illumination smoothing loss L_IS.
A GAN-based cartoonization model is constructed to cartoonize the background, with total loss function:

L_background = f·L_adv + g·L_con + h·L_str + i·L_1

where f, g, h, i are weights balancing the given losses, and L_adv, L_con, L_str, L_1 are respectively the adversarial loss function, the content information loss function, the structure information loss, and the l1 regularization term.
1) Adversarial loss: same form as the edge-promoting adversarial loss of the prop model;
2) Content information loss: same form as the content information loss of the prop model;
3) Structure information loss:

High-level features extracted with the pre-trained VGG16 network enforce a spatial constraint between our results and the extracted structure representation, where F_st denotes structure representation extraction.
4) l1 regularization term:
The three GAN-based cartoonization models for the different objects are trained continuously with the above loss functions.
Step four, inputting the music stage performance video into the model to obtain cartoon music stage performance video:
The music stage performance video is input into the music stage performance cartoonization video generation models to obtain a music stage performance video with cartoon effect. First, the original video frames are extracted with OpenCV; each frame is then semantically segmented, the different categories of image content are cartoonized with their different style transfer algorithms, and image harmonization is applied, using the methods of steps two, three, and five. Each harmonized cartoon frame of the music stage performance is then written back into a video, producing the complete harmonized cartoonized music stage performance video. The audio of the original video is extracted with MoviePy and added to the cartoonized video, obtaining the final cartoon effect for the music stage performance video.
And fifthly, constructing a composite image coordination model to carry out image harmony processing on the cartoon music stage performance video, and acquiring the cartoon music stage performance video with harmonious colors:
The cartoonized video is harmonized using transfer learning. A new composite image harmonization method performs image harmonization on the cartoon video generated by the models, eliminating disharmony mainly through separable reflectance and illumination intrinsic image harmonization so that foreground and background blend better. First, an autoencoder-based framework is constructed that decomposes the composite image into a reflectance intrinsic image and an illumination intrinsic image; the reflectance is then harmonized by penalizing material inconsistency, while the illumination is harmonized by making the foreground illumination compatible with the background. A model of the harmony relation between foreground and background is further established to guide the harmonization of the intrinsic images; a mask separates foreground and background during the illumination and guidance processes; finally, the trained model turns the input video into a harmonious performance video.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit its protection scope. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit and scope.
Claims (5)
1. An intelligent cartoonization method for a music stage performance video, characterized by comprising the following steps:
acquiring image data and preprocessing the image data; the image data comprises a real stage image data set and a cartoon image data set; the real stage image dataset is obtained from a music stage performance video;
constructing a semantic segmentation model, wherein the semantic segmentation model performs semantic segmentation on the characters, props and backgrounds in the image data;
step three, constructing and training different cartoon video generation models for the music stage performance for characters, props and backgrounds respectively, obtaining a trained character cartoon video generation model, a trained prop cartoon video generation model and a trained background cartoon video generation model:
respectively constructing, on the basis of a generative adversarial network cartoonization model, a character cartoon video generation model, a prop cartoon video generation model and a background cartoon video generation model corresponding to characters, props and backgrounds;
3.1) The total loss function L_body of the character cartoonization video generation model is as follows:

L_body = λ1·L_surface + λ2·L_structure + λ3·L_texture + λ4·L_content + λ5·L_tv + λ6·L_1

wherein λ1, λ2, λ3, λ4, λ5, λ6 are respectively the weights of the character surface information loss function L_surface, the character structure information loss function L_structure, the character texture information loss function L_texture, the character content information loss function L_content, the character total variation loss function L_tv and the l1 regularization term L_1; assigning different weights controls the information emphasis of the generated image;
3.11) The character surface information loss function L_surface is as follows:

L_surface(G, D_s) = log D_s(F_dgf(I_c, I_c)) + log(1 − D_s(F_dgf(G(I_p), G(I_p))))

wherein edge-preserving filtering is performed with a guided filter, denoted F_dgf(I, I), which takes an image I as input and, with the image itself as the guide map, returns an extracted surface representation with textures and details removed; a discriminator D_s is introduced to judge whether the model output and the reference cartoon image have similar surfaces, and to guide the generator G to learn the information stored in the extracted surface representation; wherein G represents the generator, D_s represents the surface information discriminator, I_c represents a cartoon image, and I_p represents a real image;
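A minimal self-guided filter in numpy illustrates the surface extraction F_dgf(I, I). This is a sketch of the classic guided-filter equations with the image as its own guide, not the patent's implementation; the radius and regularization eps are assumed values:

```python
import numpy as np

def box_mean(x, r):
    # mean filter over a (2r+1) x (2r+1) window, edge-padded (simple sketch)
    pad = np.pad(x, r, mode="edge")
    k = 2 * r + 1
    h = sum(pad[i:i + x.shape[0], :] for i in range(k)) / k
    return sum(h[:, j:j + x.shape[1]] for j in range(k)) / k

def guided_filter_self(I, r=2, eps=1e-2):
    """F_dgf(I, I): guided filter with the image as its own guide map;
    an edge-preserving smoothing that removes textures and fine details."""
    mean_I = box_mean(I, r)
    var_I = box_mean(I * I, r) - mean_I ** 2
    a = var_I / (var_I + eps)          # ~1 at strong edges, ~0 in flat areas
    b = mean_I - a * mean_I
    return box_mean(a, r) * I + box_mean(b, r)
```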
3.12) The character structure information loss function L_structure is as follows:

L_structure = ‖ VGG_n(G(I_p)) − VGG_n(F_st(G(I_p))) ‖

wherein high-level features are extracted with a pre-trained VGG16 network, and a spatial constraint is enforced between the character cartoon image generated by the character cartoon video generation model and the structure representation extracted from the generated character cartoon image; F_st(·) denotes the structure representation extraction of the generated character cartoon image, namely a selective-search process followed by color filling of the picture regions; VGG_n(·) denotes the high-level features extracted from the generated character cartoon image by the VGG network;
and computing the region color as a weighted sum of the region's median and average pixel values, with the formula:

S_{i,j} = θ1·S̄ + θ2·S̃, where (θ1, θ2) = (0, 1) if σ(S) < γ1; (0.5, 0.5) if γ1 ≤ σ(S) < γ2; (1, 0) if σ(S) ≥ γ2

wherein S_{i,j} represents the pixel value of the region at position (i, j), S̄ represents the average of the current region's pixel values, S̃ represents the median of the current region's pixel values; i denotes the row and j denotes the column; σ(S) represents the standard deviation of S;
3.13) The character texture information loss function L_texture is as follows:

L_texture(G, D_t) = log D_t(F_rcs(I_c)) + log(1 − D_t(F_rcs(G(I_p))))

wherein F_rcs represents a random color shift algorithm that extracts a single-channel texture representation from a color image; D_t represents the texture information discriminator;
The random color shift algorithm extracts the single-channel texture representation F_rcs(I_rgb) from the color image with the formula:

F_rcs(I_rgb) = (1 − α)(β1·I_r + β2·I_g + β3·I_b) + α·Y

wherein I_rgb represents a 3-channel RGB color image; I_r, I_g and I_b represent the three color channels; Y represents the standard grayscale image converted from the RGB color image. The discriminator D_t is introduced to distinguish the character cartoon image output by the character cartoon video generation model from the texture representation extracted from it, guiding the generator to learn the sharp contours and fine textures stored in the texture representation; α represents the weight of the standard grayscale image, and β1, β2, β3 represent the weights of the r, g and b channels respectively, each in the range (−1, 1);
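The random color shift can be sketched directly from the formula above. The grayscale weight α = 0.8 and the BT.601 grayscale conversion are assumed values, not fixed by the claim:

```python
import numpy as np

def random_color_shift(I_rgb, rng=None):
    """F_rcs: collapse a 3-channel image into one channel by mixing randomly
    weighted color channels with the standard grayscale image Y."""
    rng = rng or np.random.default_rng()
    alpha = 0.8                             # weight of the grayscale image (assumed)
    b1, b2, b3 = rng.uniform(-1, 1, 3)      # channel weights in (-1, 1)
    Ir, Ig, Ib = I_rgb[..., 0], I_rgb[..., 1], I_rgb[..., 2]
    Y = 0.299 * Ir + 0.587 * Ig + 0.114 * Ib   # ITU-R BT.601 grayscale
    return (1 - alpha) * (b1 * Ir + b2 * Ig + b3 * Ib) + alpha * Y
```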
3.14) The character content information loss function L_content is as follows:

L_content = ‖ VGG_n(G(I_p)) − VGG_n(I_p) ‖_1

wherein VGG_n(·) represents the feature map of a VGG layer; l1 sparse regularization of the VGG feature maps between the input photograph and the generated picture is used to refine the semantic content loss;
3.15) The character total variation loss function is as follows:

L_tv = (1 / (H·W·C)) · ‖ ∇_x(G(I_p)) + ∇_y(G(I_p)) ‖

wherein H, W, C represent the spatial dimensions of the image; ∇_x represents the horizontal difference of G(I_p), and ∇_y represents the vertical difference of G(I_p);
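The total variation term can be computed with forward differences as follows (an illustrative sketch; the absolute-value norm is an assumption):

```python
import numpy as np

def total_variation_loss(img):
    """L_tv: summed absolute forward differences in x and y, normalized by
    the image volume H*W*C; it penalizes high-frequency noise in the output."""
    H, W, C = img.shape
    dx = img[:, 1:, :] - img[:, :-1, :]   # horizontal differences
    dy = img[1:, :, :] - img[:-1, :, :]   # vertical differences
    return (np.abs(dx).sum() + np.abs(dy).sum()) / (H * W * C)
```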
3.16) The l1 regularization term is as follows:

L_1 = ‖ G(I_p) ‖_1

wherein ‖ G(I_p) ‖_1 represents the one-norm of the character cartoon image generated by the character cartoon video generation model;
3.2) The total loss function L_prop of the prop cartoon video generation model is as follows:

L_prop = a·L_adv + b·L_con + c·L_tex + d·L_1 + e·L_IS

wherein a, b, c, d, e are the weights of L_adv, L_con, L_tex, L_1, L_IS, which are respectively the edge-promoted adversarial loss function, the content information loss function, the texture information loss function, the l1 regularization term and the illumination smoothing loss;
3.21) Edge-promoted adversarial loss:

For each image c_i ∈ Sdata(c), the following three steps are applied: (1) detecting edge pixels with a standard Canny edge detector; (2) dilating the edge regions; (3) applying Gaussian smoothing to the dilated edge regions, obtaining Sdata(e); wherein Sdata(c) represents the set of cartoon images, Sdata(e) represents the set of cartoon images with their sharp boundaries removed, c_i represents the i-th image in the cartoon image set Sdata(c), e_j represents the j-th image in the set Sdata(e) of cartoon images with sharp boundaries removed, and p_k represents the k-th image in the set of images to be cartoonized;
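Steps (2) and (3) can be sketched in numpy on an already-detected binary edge map; step (1) would typically use cv2.Canny, and the dilation radius and Gaussian kernel size here are illustrative choices:

```python
import numpy as np

def promote_edges(edge_map, dilate_r=1, sigma=1.0):
    """Expand a binary edge map (step 2), then Gaussian-smooth it (step 3),
    producing the blurred-boundary images collected in Sdata(e)."""
    e = edge_map.astype(float)
    # (2) dilation: maximum over a (2r+1) x (2r+1) neighbourhood
    p = np.pad(e, dilate_r, mode="edge")
    k = 2 * dilate_r + 1
    d = np.max([p[i:i + e.shape[0], j:j + e.shape[1]]
                for i in range(k) for j in range(k)], axis=0)
    # (3) separable 5-tap Gaussian smoothing
    xs = np.arange(-2, 3)
    g = np.exp(-xs ** 2 / (2 * sigma ** 2))
    g /= g.sum()
    p2 = np.pad(d, 2, mode="edge")
    h = sum(g[i] * p2[i:i + d.shape[0], :] for i in range(5))
    return sum(g[j] * h[:, j:j + d.shape[1]] for j in range(5))
```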
Thus, the edge-promoted adversarial loss function L_adv is as follows:

L_adv(G, D) = E_{c_i∼Sdata(c)}[log D(c_i)] + E_{e_j∼Sdata(e)}[log(1 − D(e_j))] + E_{p_k∼Sdata(p)}[log(1 − D(G(p_k)))]

wherein E_{c_i∼Sdata(c)}[·] represents the expectation of the discrete variable c_i under the probability distribution Sdata(c), E_{e_j∼Sdata(e)}[·] represents the expectation of e_j under Sdata(e), and E_{p_k∼Sdata(p)}[·] represents the expectation of p_k under Sdata(p); D represents the discriminator, G represents the generator, and G(p_k) represents the image generated by the generator G;
3.22) The content information loss function L_con is as follows:

L_con(G, D) = E_{p_k∼Sdata(p)}[ ‖ VGG_l(G(p_k)) − VGG_l(p_k) ‖_1 ]

wherein VGG_l(·) denotes the feature map of the l-th layer of a pre-trained VGG network;
3.23) The texture information loss function L_tex is as follows:

L_tex(G, D_t) = E_{c_i∼Sdata(c)}[log D_t(F_rcs(c_i))] + E_{p_k∼Sdata(p)}[log(1 − D_t(F_rcs(G(p_k))))]

wherein F_rcs represents the random color shift algorithm and D_t represents the texture information discriminator;
3.3) The total loss function L_background of the background cartoon video generation model is as follows:

L_background = e·L_adv + f·L_con + g·L_str + h·L_1

wherein e, f, g, h are respectively the weights of L_adv, L_con, L_str and L_1 in the background cartoon video generation model;
respectively training the character cartoon video generation model, the prop cartoon video generation model and the background cartoon video generation model so as to minimize their respective total loss functions, thereby obtaining the trained character cartoon video generation model, the trained prop cartoon video generation model and the trained background cartoon video generation model; respectively inputting the semantically segmented parts of the video to be cartoonized into the trained character cartoon video generation model, the trained prop cartoon video generation model and the trained background cartoon video generation model to obtain a character cartoon video, a prop cartoon video and a background cartoon video; then compositing each frame of the character cartoon video, the prop cartoon video and the background cartoon video into a composite image, thereby obtaining a composite cartoon music stage performance video;
step four, preprocessing the music stage performance video to be processed; after segmenting out the characters, props and background with the semantic segmentation model, inputting them respectively into the trained character cartoon video generation model, the trained prop cartoon video generation model and the trained background cartoon video generation model to obtain the cartoon music stage performance video;
and fifthly, constructing a composite image coordination model to carry out image harmony processing on the cartoon music stage performance video to obtain the final cartoon music stage performance video.
2. The intelligent cartoonification method of a music stage performance video according to claim 1, wherein in the first step, the preprocessing method comprises image enhancement and image normalization.
3. The intelligent cartoonization method for a music stage performance video according to claim 1, wherein in the second step, the semantic segmentation model is a deep convolutional neural network (DCNN) based semantic segmentation model;
firstly, a picture is fed into the DCNN-based semantic segmentation model, and hole (atrous) convolution is added to extract features, obtaining high-level semantic features and low-level semantic features; the hole convolution process is:

y[i] = Σ_k x[i + τ·k] · w[k]

wherein y[i] represents the hole convolution output at position i, x[i + τ·k] represents the input at position i + τ·k, K represents the length of the convolution kernel, w[k] represents the convolution filter of length K, and τ represents the sampling stride (dilation rate) of the input signal;
the low-level semantic features are the feature information obtained after one hole convolution with a hole rate of 1, and the high-level semantic features are the feature information obtained after four hole convolutions; the extracted high-level semantic features are input into an atrous spatial pyramid pooling module and convolved with hole convolution layers of different hole rates, obtaining four feature maps, the hole rates being 1, 6, 12 and 18 respectively; pooling the extracted high-level semantic features yields a further feature map; all branches together produce five feature maps, which are spliced together to obtain a first feature map;
putting the first feature map into a multilayer channel attention module to obtain a second feature map; performing bilinear interpolation upsampling on the second feature map and merging it with the low-level semantic features to obtain a merged feature map; the decoder part recovers the spatial information of the merged feature map using 3×3 convolution and refines the target boundary by bilinear interpolation upsampling, obtaining the segmentation result;
since there are multiple objects in the image segmentation task, a multi-class cross-entropy loss function is used, with the formula:

L = −Σ_{i=1}^{C} y_i · log(p_i)

wherein p_i represents the probability that the sample belongs to category i; y_i is the indicator of the sample label: when the sample belongs to category i, y_i = 1, and when the sample does not belong to category i, y_i = 0; C represents the number of categories;
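The multi-class cross-entropy above, for one sample with a one-hot label, is a two-liner (the clipping epsilon is an assumption for numerical safety):

```python
import numpy as np

def cross_entropy(p, y, eps=1e-12):
    """L = -sum_i y_i * log(p_i) over the C categories; y is one-hot,
    p holds the predicted class probabilities."""
    p = np.clip(p, eps, 1.0)   # avoid log(0)
    return -np.sum(y * np.log(p))
```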
through the process, the characters and the props are separated from the stage background.
4. The intelligent cartoonization method for a music stage performance video as claimed in claim 3, wherein γ1 = 20 and γ2 = 40.
5. The intelligent cartoonization method of a music stage performance video according to claim 1, wherein the concrete steps of the fifth step are as follows:
the composite image is decomposed into a reflectance intrinsic image and an illumination intrinsic image whose element-wise product recomposes the image, wherein ⊙ denotes the element-wise product;
Harmonization is embedded into the process from composite image decomposition to real image reconstruction through an image reconstruction loss function L_rec:

L_rec = E[ ‖ Ĩ − I ‖_2 ]

wherein E[ ‖ Ĩ − I ‖_2 ] represents the expectation of the two-norm of the spatial distance between the output harmonized image and the real image, and Ĩ represents the output harmonized image;
Taking ∇R̃ ≈ ∇Ĩ as a constraint for coordinating the reflectance, a reflectance harmonization loss L_RH is generated:

L_RH = E[ ‖ ∇R̃ − ∇Ĩ ‖ ]

wherein ∇R̃ represents the reflectance gradient of the harmonized image, ∇Ĩ represents the gradient of the harmonized image, and E[·] represents the expectation of the norm of the difference between the reflectance gradient of the harmonized image and the gradient of the harmonized image; Ĩ represents the harmonized natural image and ∇ represents the gradient;
to coordinate the illumination, the foreground and background illumination are made compatible: the light is first learned and then transferred from the background to the foreground, on the premise that the image gradient corresponding to the illumination is smooth; the constraint that this gradient is close to 0 provides the decoupling, yielding the illumination smoothing loss L_IS;
The illumination harmonization loss L_IH is set as follows:

L_IH = E[ ‖ L̃ − L ‖_2 ]

wherein I represents the real image, L̃ represents the harmonized illumination intrinsic image, L represents the illumination intrinsic image of the real image, and E[·] represents the expectation of the two-norm of the spatial distance between the harmonized intrinsic image and that of the real image;
wherein sim(·, ·) is a similarity function; the encoder receives the composite image as input and produces a disharmony feature map as output; C is the number of channels of that feature map; the expectation is taken over the channels; and the grayscale real image is reduced to the same size as the feature map;
The total loss function L_harm is obtained as follows:

L_harm = L_rec + λ_RH·L_RH + λ_IS·L_IS + λ_IH·L_IH + λ_IF·L_IF

Through training that minimizes the total loss function L_harm, a final harmonization model is obtained; inputting the obtained composite cartoon music stage performance video into the final harmonization model yields the harmonized music stage performance video; λ_RH, λ_IS, λ_IH and λ_IF are respectively the weights of L_RH, L_IS, L_IH and L_IF.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210812946.2A CN114898021B (en) | 2022-07-12 | 2022-07-12 | Intelligent cartoon method for music stage performance video |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114898021A CN114898021A (en) | 2022-08-12 |
CN114898021B true CN114898021B (en) | 2022-09-27 |
Family
ID=82729610
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115100334B (en) * | 2022-08-24 | 2022-11-25 | 广州极尚网络技术有限公司 | Image edge tracing and image animation method, device and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112295211A (en) * | 2019-07-31 | 2021-02-02 | 上海虞姿信息技术有限公司 | Stage performance virtual entertainment practical training system and method |
CN112561786A (en) * | 2020-12-22 | 2021-03-26 | 作业帮教育科技(北京)有限公司 | Online live broadcast method and device based on image cartoonization and electronic equipment |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2011045768A2 (en) * | 2009-10-15 | 2011-04-21 | Yeda Research And Development Co. Ltd. | Animation of photo-images via fitting of combined models |
US10671838B1 (en) * | 2019-08-19 | 2020-06-02 | Neon Evolution Inc. | Methods and systems for image and voice processing |
CN112070080A (en) * | 2020-08-19 | 2020-12-11 | 湖南师范大学 | Method for classifying cartoon characters playing songs based on Faster R-CNN |
CN112102153B (en) * | 2020-08-20 | 2023-08-01 | 北京百度网讯科技有限公司 | Image cartoon processing method and device, electronic equipment and storage medium |
CN112132922A (en) * | 2020-09-24 | 2020-12-25 | 扬州大学 | Method for realizing cartoon of images and videos in online classroom |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||