CN114898021B - Intelligent cartoon method for music stage performance video - Google Patents
- Publication number: CN114898021B
- Application number: CN202210812946.2A
- Authority
- CN
- China
- Prior art keywords
- cartoon
- image
- loss function
- representing
- generation model
- Prior art date
- Legal status: Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/40—3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/7715—Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
Abstract
The invention provides an intelligent cartoonization method for music stage performance videos, comprising the following steps: step one, acquiring a real stage image data set and a cartoon image data set, and preprocessing the image data; step two, performing semantic segmentation of the characters, props, and background in the music stage performance video; step three, constructing and training different cartoonization video generation models for the music stage performance, one each for characters, props, and backgrounds; step four, inputting the music stage performance video into the models to obtain a cartoonized music stage performance video; and step five, constructing a composite image harmonization model to perform image harmonization on the cartoonized music stage performance video. The invention can cartoonize music stage performance videos for use in fields such as music performance and animation production, and favors generating music stage performance videos with clean contours, clear boundaries, and harmonious colors.
Description
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an intelligent cartoon method for music stage performance videos.
Background
In recent years, with the continuous development of artificial intelligence, many algorithms have been applied in the field of image processing, such as image style transfer. The cartoon is currently a very popular artistic form, widely used across society in advertising, games, film and television works, photography, and more. Most young people today have grown up influenced by Japanese cartoons, and cartoons are genuinely influential worldwide. However, because cartoons are drawn by hand and then rendered by computer, producing them is time-consuming and labor-intensive and cannot be done by people without a drawing background; modern cartoon animation workflows therefore allow artists to draw on various resources to create content. Some famous cartoons have been created by converting real-world pictures into usable cartoon scene material, a process called image cartoonization.
The image cartoonization method can also be applied to the field of music education. Rendering a music stage performance video with a cartoonization method presents the performance in a cartoon-style artistic form, which can attract children's interest in music stage performances. Although existing cartoonization methods are applied in many fields, applications in music education are scarce. In addition, existing methods cannot cartoonize characters, props, and backgrounds at the same time, and perform no harmonization on the cartoonized characters, props, and background, so they do not form a unified cartoon animation.
Definitions of terms:
Semantic-segmentation-based DCNN model: a model for semantic segmentation of images using a deep convolutional neural network (DCNN).
GAN-based cartoonization model: a model that cartoonizes images using a generative adversarial network (GAN).
Disclosure of Invention
Aiming at the shortcomings of existing image stylization methods, the invention provides a novel intelligent cartoonization method for music stage performance videos, which semantically segments the different contents of a complex scene and cartoonizes them with different image stylization methods.
The purpose of the invention is realized by the following technical scheme:
an intelligent cartoon method for music stage performance videos comprises the following steps:
acquiring image data and preprocessing the image data; the image data comprises a real stage image data set and a cartoon image data set; the real stage image dataset is obtained from a music stage performance video;
constructing a semantic segmentation model, wherein the semantic segmentation model carries out semantic segmentation on characters, props and backgrounds in the image data;
step three, constructing and training different cartoonization video generation models for the music stage performance for characters, props, and backgrounds respectively, obtaining a trained character cartoonization video generation model, a trained prop cartoonization video generation model, and a trained background cartoonization video generation model:
a character cartoonization video generation model, a prop cartoonization video generation model, and a background cartoonization video generation model, corresponding to characters, props, and backgrounds respectively, are each built on a GAN-based cartoonization model;
3.1) The total loss function L_body of the character cartoonization video generation model is as follows:

L_body = λ_1·L_surface + λ_2·L_structure + λ_3·L_texture + λ_4·L_content + λ_5·L_tv + λ_6·L_1

where λ_1, λ_2, λ_3, λ_4, λ_5, λ_6 are the weights of the character surface information loss function L_surface, the character structure information loss function L_structure, the character texture information loss function L_texture, the character content information loss function L_content, the character total variation loss function L_tv, and the l1 regularization term L_1 respectively; giving them different weights controls the information emphasis of the generated image;
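The weighted combination just described can be sketched as follows; the individual loss values and the default weights are placeholders, since the patent does not disclose its actual λ settings:

```python
def total_character_loss(l_surface, l_structure, l_texture,
                         l_content, l_tv, l_l1,
                         weights=(1.0, 1.0, 1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the six character-model loss terms; `weights` plays the
    role of (lambda_1, ..., lambda_6) and controls which information the
    generated image emphasizes."""
    terms = (l_surface, l_structure, l_texture, l_content, l_tv, l_l1)
    return sum(w * t for w, t in zip(weights, terms))
```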
3.11) The character surface information loss function L_surface is as follows:

L_surface(G, D_s) = log D_s(F_dgf(I_c, I_c)) + log(1 − D_s(F_dgf(G(I_p), G(I_p))))

Edge-preserving filtering is performed with a differentiable guided filter, denoted F_dgf, which takes an image I as input, uses the image itself as the guide map, and returns the extracted surface representation F_dgf(I, I) with textures and details removed. A discriminator D_s is introduced to judge whether the model output and the reference cartoon images have similar surfaces, and to guide the generator G to learn the information stored in the extracted surface representation; where G denotes the generator, D_s denotes the surface information discriminator, I_c denotes a cartoon image, and I_p denotes a real image;
3.12) The character structure information loss function L_structure is as follows:

L_structure = || VGG(G(I_p)) − VGG(F_st(G(I_p))) ||

High-level features are extracted with a pre-trained VGG16 network, and a spatial constraint is then enforced between the character cartoon images generated by the character cartoonization video generation model and the structure representations extracted from them. F_st denotes structure representation extraction from the generated character cartoon image, i.e., a selective search over the picture followed by color filling of each region; VGG(·) denotes the high-level features extracted from the generated character cartoon image by the VGG network;
The color of each region is computed as a weighted sum of the region's median and mean values:

S̃_{i,j} = θ_1·mean(S) + θ_2·median(S)

where S_{i,j} denotes the pixel value of the region at position (i, j), mean(S) denotes the average of the current region's pixel values, and median(S) denotes the median of the current region's pixel values; i denotes the row and j the column; σ(S) denotes the standard deviation of S, and the weights (θ_1, θ_2) are chosen adaptively according to σ(S) against the thresholds γ_1 and γ_2;
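A minimal sketch of this adaptive region coloring, using the γ1 = 20 and γ2 = 40 thresholds given later in the text; the exact piecewise choice of mean/median weights is an assumption:

```python
import statistics

def region_color(pixels, gamma1=20.0, gamma2=40.0):
    """Adaptive coloring for one segmented region: blend the region's mean and
    median pixel values, with weights chosen by the region's standard
    deviation sigma(S) against the (gamma1, gamma2) thresholds."""
    mean = statistics.mean(pixels)
    median = statistics.median(pixels)
    sigma = statistics.pstdev(pixels)
    if sigma < gamma1:          # low-variance region: median alone is stable
        w_mean, w_median = 0.0, 1.0
    elif sigma < gamma2:        # mid-variance region: equal blend
        w_mean, w_median = 0.5, 0.5
    else:                       # high-variance region: mean smooths outliers
        w_mean, w_median = 1.0, 0.0
    return w_mean * mean + w_median * median
```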
3.13) The character texture information loss function L_texture is as follows:

L_texture(G, D_t) = log D_t(F_rcs(I_c)) + log(1 − D_t(F_rcs(G(I_p))))

where F_rcs denotes a random color shift algorithm that extracts a single-channel texture representation from a color image, and D_t denotes the texture discriminator;

The single-channel texture representation F_rcs(I_rgb) is extracted from the color image with the random color shift algorithm:

F_rcs(I_rgb) = (1 − α)(β_1·I_r + β_2·I_g + β_3·I_b) + α·Y

where I_rgb denotes a 3-channel RGB color image, I_r, I_g, and I_b denote the three color channels, and Y denotes the standard gray image converted from the RGB color image. The discriminator D_t is introduced to distinguish the character cartoon image output of the character cartoonization video generation model from the texture representations extracted from the model-generated images, and to guide the generator to learn the clear contours and fine textures stored in the texture representation; α denotes the weight of the standard gray image, and β_1, β_2, β_3 denote the weights of the r, g, and b channels respectively, with values in the range (−1, 1);
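A sketch of the random color shift under one plausible reading of the description (α weighting the grayscale image Y, the β weights drawn from U(−1, 1)); the BT.601 luma coefficients for Y and the exact mixing form are assumptions:

```python
import random

def random_color_shift(r, g, b, alpha=0.8, rng=None):
    """Random color shift: collapse a 3-channel RGB image (given as per-channel
    pixel lists) into one texture channel. alpha weights the grayscale image Y;
    beta1..beta3 are random channel weights drawn from U(-1, 1)."""
    rng = rng or random.Random(0)
    beta1, beta2, beta3 = (rng.uniform(-1, 1) for _ in range(3))
    # ITU-R BT.601 luma as the "standard gray image" Y (an assumption)
    y = [0.299 * ri + 0.587 * gi + 0.114 * bi for ri, gi, bi in zip(r, g, b)]
    mix = [beta1 * ri + beta2 * gi + beta3 * bi for ri, gi, bi in zip(r, g, b)]
    return [(1 - alpha) * mi + alpha * yi for mi, yi in zip(mix, y)]
```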
3.14) The character content information loss function L_content is as follows:

L_content = || VGG(G(I_p)) − VGG(I_p) ||_1

where VGG(·) denotes the feature map of a VGG layer; l1 sparse regularization over the VGG feature maps of the input photo and the generated picture is used to refine the semantic content loss;
3.15) The character total variation loss function L_tv is as follows:

L_tv = (1 / (H·W·C)) · || ∇_x G(I_p) + ∇_y G(I_p) ||

where H, W, C denote the spatial dimensions of the image; ∇_x denotes the horizontal difference and ∇_y denotes the vertical difference of the generated image;
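A minimal single-channel sketch of the total variation penalty using forward differences; the normalization by H·W here stands in for the H·W·C factor above:

```python
def total_variation_loss(img):
    """Total variation loss: sum of absolute forward differences along rows
    and columns of a single-channel image, normalized by the image size.
    Penalizes high-frequency noise such as salt-and-pepper noise."""
    h, w = len(img), len(img[0])
    tv = 0.0
    for i in range(h):
        for j in range(w):
            if i + 1 < h:
                tv += abs(img[i + 1][j] - img[i][j])  # vertical difference
            if j + 1 < w:
                tv += abs(img[i][j + 1] - img[i][j])  # horizontal difference
    return tv / (h * w)
```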
3.16) The l1 regularization term:

L_1 = || G(I_p) ||_1

where || G(I_p) ||_1 denotes the one-norm of the character cartoon image generated by the character cartoonization video generation model;
3.2) The total loss function L_prop of the prop cartoonization video generation model is as follows:

L_prop = a·L_adv + b·L_con + c·L_tex + d·L_1 + e·L_IS

where a, b, c, d, e are the weights of L_adv, L_con, L_tex, L_1, and L_IS, which are respectively the edge-promoting adversarial loss function, the content information loss function, the texture information loss function, the l1 regularization term, and the illumination smoothing loss;
3.21) Edge-promoting adversarial loss:

For each image c_i ∈ Sdata(c), the following three steps are applied: (1) detect edge pixels with a standard Canny edge detector; (2) dilate the edge regions; (3) apply Gaussian smoothing to the dilated edge regions, yielding Sdata(e). Here Sdata(c) denotes the set of cartoon images, Sdata(e) denotes the set of cartoon images whose clear boundaries have been removed, c_i denotes the i-th image of the cartoon image set Sdata(c), e_j denotes the j-th image of the edge-smoothed set Sdata(e), and p_k denotes the k-th image of the set of images to be cartoonized;

The edge-promoting adversarial loss function L_adv is therefore:

L_adv(G, D) = E_{c_i ∼ Sdata(c)}[log D(c_i)] + E_{e_j ∼ Sdata(e)}[log(1 − D(e_j))] + E_{p_k ∼ Sdata(p)}[log(1 − D(G(p_k)))]

where E[·] denotes the expectation of the discrete variable under the indicated distribution, D denotes the discriminator, G denotes the generator, and G(p_k) denotes the image generated by the generator G from p_k;
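The three-step construction of Sdata(e) (edge detection, dilation, smoothing) can be sketched as follows; a real pipeline would use cv2.Canny, cv2.dilate, and cv2.GaussianBlur, so the pure-Python one-pixel dilation and box blur here are simplified stand-ins:

```python
def smooth_edges(edge_mask, img):
    """Given a binary edge mask (stand-in for a Canny detector's output) and a
    single-channel image, dilate the edge region by one pixel and replace the
    dilated-edge pixels with a 3x3 neighborhood average, removing the image's
    sharp boundaries as in the Sdata(e) construction."""
    h, w = len(img), len(img[0])
    # step 2: dilate the edge mask by one pixel in each direction
    dilated = [[any(edge_mask[di][dj]
                    for di in range(max(0, i - 1), min(h, i + 2))
                    for dj in range(max(0, j - 1), min(w, j + 2)))
                for j in range(w)] for i in range(h)]
    # step 3: blur only the dilated edge region (box blur as a stand-in for
    # Gaussian smoothing)
    out = [row[:] for row in img]
    for i in range(h):
        for j in range(w):
            if dilated[i][j]:
                nb = [img[di][dj]
                      for di in range(max(0, i - 1), min(h, i + 2))
                      for dj in range(max(0, j - 1), min(w, j + 2))]
                out[i][j] = sum(nb) / len(nb)
    return out
```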
3.22) The content information loss function L_con is as follows:

L_con = || VGG(G(p_k)) − VGG(p_k) ||_1

where VGG(·) denotes the feature map of a VGG layer;
3.23) The texture information loss function L_tex takes the same form as the character texture information loss function, using the discriminator D_t on the single-channel texture representations extracted by the random color shift algorithm;
3.3) The total loss function L_background of the background cartoonization video generation model is as follows:

L_background = e·L_adv + f·L_con + g·L_str + h·L_1

where e, f, g, h are the weights of L_adv, L_con, L_str, and L_1 respectively in the background cartoonization video generation model;
The character, prop, and background cartoonization video generation models are trained separately so as to minimize their respective total loss functions, yielding the trained character cartoonization video generation model, prop cartoonization video generation model, and background cartoonization video generation model. The different parts of the video to be cartoonized, obtained by semantic segmentation, are input into the trained character, prop, and background cartoonization video generation models respectively to obtain a character cartoon video, a prop cartoon video, and a background cartoon video; each frame of the character, prop, and background cartoon videos is then composited into a composite image, thereby obtaining a composite cartoonized music stage performance video;
step four, preprocessing the music stage performance video to be processed, segmenting out the characters, props, and background with the semantic segmentation model, and inputting them into the trained character, prop, and background cartoonization video generation models respectively, obtaining a cartoonized music stage performance video;
step five, constructing a composite image harmonization model to perform image harmonization on the cartoonized music stage performance video, obtaining the final cartoonized music stage performance video.
In a further improvement, in the first step, the preprocessing method includes image enhancement and image normalization.
In a further improvement, in step two, the semantic segmentation model is a semantic-segmentation-based DCNN model;
First, a picture is fed into the semantic-segmentation-based DCNN model, and atrous (dilated) convolutions are added to extract features, obtaining high-level semantic features and low-level semantic features. The atrous convolution is:

y[i] = Σ_k x[i + τ·k] · w[k]

where y[i] denotes the atrous convolution output at position i, x[i + τ·k] denotes the input at position i + τ·k, K denotes the length of the convolution kernel, w[k] denotes the convolution filter of length K, and τ denotes the sampling stride (dilation rate) over the input signal;
The low-level semantic features are the feature information obtained after one atrous convolution with rate 1; the high-level semantic features are the feature information obtained after four atrous convolutions. The extracted high-level semantic features are input into an atrous spatial pyramid pooling module and convolved with atrous convolution layers of different rates (1, 6, 12, and 18) to obtain four feature maps; the extracted high-level semantic features are also pooled to obtain a further feature map. The five feature maps obtained from all branches are concatenated to obtain the first feature map;
The first feature map is passed through a multi-layer channel attention module to obtain the second feature map; the second feature map is upsampled by bilinear interpolation and merged with the low-level semantic features to obtain a merged feature map; the decoder recovers the spatial information of the merged feature map with 3×3 convolutions and refines the target boundary over the bilinear interpolation, obtaining the segmentation result;
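The bilinear-interpolation upsampling used before merging with the low-level features can be sketched for a single-channel map; the align-corners convention below is an assumption, since the decoder's convention is not specified:

```python
def bilinear_upsample(img, scale):
    """Bilinear-interpolation upsampling of a single-channel feature map by an
    integer scale factor (align-corners convention)."""
    h, w = len(img), len(img[0])
    H, W = h * scale, w * scale
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            # map output coordinates back into the input grid
            y = i * (h - 1) / (H - 1) if H > 1 else 0.0
            x = j * (w - 1) / (W - 1) if W > 1 else 0.0
            y0, x0 = int(y), int(x)
            y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
            dy, dx = y - y0, x - x0
            out[i][j] = (img[y0][x0] * (1 - dy) * (1 - dx)
                         + img[y0][x1] * (1 - dy) * dx
                         + img[y1][x0] * dy * (1 - dx)
                         + img[y1][x1] * dy * dx)
    return out
```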
Since the image segmentation task contains multiple object classes, a multi-class cross-entropy loss function is used:

L = − Σ_{i=1}^{C} y_i · log(p_i)

where p_i denotes the probability that the sample belongs to class i; y_i is the indicator of the sample label: y_i = 1 when the sample belongs to class i, and y_i = 0 when it does not; C denotes the number of classes;
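For a single sample this cross-entropy collapses to −log of the probability assigned to the true class, since y_i is a one-hot indicator; a minimal sketch:

```python
import math

def multiclass_cross_entropy(probs, true_class):
    """Multi-class cross-entropy for one sample: -sum_i y_i * log(p_i).
    Because y_i = 1 only for the true class, only that term survives."""
    return -math.log(probs[true_class])
```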
through the process, the characters and the props are separated from the stage background.
In a further improvement, γ_1 = 20 and γ_2 = 40.
In a further improvement, the concrete steps of step five are as follows:
The composite image is decomposed into a reflectance intrinsic image and an illumination intrinsic image, I = R ⊙ L, where ⊙ denotes the element-wise product;
Harmonization is embedded into the process from composite-image decomposition to real-image reconstruction through an image reconstruction loss function L_rec:

L_rec = || Ĥ − H ||_2²

where || Ĥ − H ||_2 denotes the two-norm of the spatial distance between the output harmonized image Ĥ and the real image H;
Taking ∇R̂ ≈ ∇Ĥ as the constraint for harmonizing the reflectance yields the reflectance harmonization loss L_RH:

L_RH = || ∇R̂ − ∇Ĥ ||

where ∇R̂ denotes the reflectance gradient of the harmonized image, ∇Ĥ denotes the gradient of the harmonized image, and || ∇R̂ − ∇Ĥ || denotes the norm of the difference between them; R̂ denotes the harmonized intrinsic image and ∇ denotes the gradient;
To harmonize the illumination, the foreground and background illumination are made compatible: the light is first learned and then transferred from the background to the foreground, under the premise that the image gradient corresponding to the illumination is smooth; the constraint ∇L ≈ 0 provides the decoupling, yielding the illumination smoothing loss L_IS;
The illumination harmonization loss L_IH is set as follows:

L_IH = || L̂ − L ||_2²

where L denotes the illumination of the real image, L̂ denotes the harmonized illumination intrinsic image, and || L̂ − L ||_2² denotes the squared two-norm of the spatial distance between the harmonized intrinsic image and the real image;
where sim(·, ·) is a similarity function; the encoder takes the composite image as input and produces a disharmony feature map as output; C is the number of channels of that feature map, which is averaged over the channels and compared against a grayscale real image reduced to the same size;
The total loss function L_harm is obtained as follows:

L_harm = L_rec + λ_RH·L_RH + λ_IS·L_IS + λ_IH·L_IH + λ_IF·L_IF

Training to minimize the total loss function L_harm yields the final harmonization model, and the composite cartoonized music stage performance video is input into the final harmonization model to obtain the harmonized music stage performance video; λ_RH, λ_IS, λ_IH, and λ_IF are the weights of L_RH, L_IS, L_IH, and L_IF respectively.
The invention has the advantages that:
Compared with the prior art, the invention can cartoonize a captured music stage performance video, applying different cartoonization to characters, props, backgrounds, and other objects. The edge-promoting adversarial loss and l1 sparse regularization on high-level feature maps of the VGG network provide good flexibility for reproducing smooth shading. An image harmonization model is trained to harmonize the composited cartoon video so that foreground and background are consistent. The method favors generating clean contours, clear boundaries, and harmonious colors, and the generated cartoonized music stage performance videos can be widely applied in the field of music education to increase children's interest in music.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a test video screenshot;
FIG. 3 is a video semantic segmentation result screenshot;
fig. 4 is a video screenshot after image harmonization.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following examples.
The invention relates to an intelligent cartoon method for music stage performance videos, which comprises the following steps:
the method comprises the following steps of firstly, acquiring a real stage image data set and a cartoon image data set, and preprocessing the image data:
and collecting a real scene image data set and a cartoon image data set, carrying out image preprocessing, and constructing a training set and a testing set.
Step two, performing semantic segmentation on characters, props and backgrounds in the music stage performance video:
The original music stage performance video input by the user is split into frames, and a DCNN model based on atrous convolution is designed to semantically segment each frame image, extracting features with the DCNN model and predicting a label, e.g., character, background, or prop;
designing a loss function to measure the difference between the predicted label and the real label;
calculating the gradient of each layer of parameters according to the difference, and then updating the gradient;
repeating the previous steps until the predicted label and the real label reach a certain accuracy;
given a picture, each pixel will output a probability of different categories, thereby generating a corresponding mask to segment characters, props and backgrounds in the music stage performance video.
Step three, constructing and training different cartoon video generation models for the music stage performance aiming at different objects such as characters, props, backgrounds and the like respectively:
designing and training a cartoon model based on the generated countermeasure network to cartoon different objects.
Constructing a cartoon model based on the generated countermeasure network to cartoon the character, wherein the total loss function is as follows:
whereinλ 1 、λ 2 、λ 3 、λ 4 、λ 5 、λ 6 Respectively as a function of loss of information on the surface of the personL surface Loss function of character structure informationL structure Loss function of character texture informationL texture Loss function of character content informationL content Figure total variation loss functionL tv And l1 regularization termL 1 The information emphasis point of the generated image is controlled by giving different weights. Wherein the content of the first and second substances,
1) Surface information loss:

L_surface(G, D_s) = log D_s(F_dgf(I_c, I_c)) + log(1 − D_s(F_dgf(G(I_p), G(I_p))))

Edge-preserving filtering is performed with a differentiable guided filter, denoted F_dgf, which takes an image I as input, uses it as its own guide map, and returns the extracted surface representation F_dgf(I, I) with texture and detail removed. A discriminator D_s is introduced to determine whether the model output and the reference cartoon images have similar surfaces, and to direct the generator G to learn the information stored in the extracted surface representation.
2) Structure information loss:

L_structure = || VGG(G(I_p)) − VGG(F_st(G(I_p))) ||

High-level features extracted with the pre-trained VGG16 network enforce a spatial constraint between our results and the extracted structure representation, where F_st denotes structure representation extraction. An adaptive filtering algorithm combining median filtering and mean filtering is used here, computing each region's color as a weighted sum of the region's mean and median selected by the standard deviation σ(S) against the thresholds γ_1 and γ_2, where γ_1 = 20 and γ_2 = 40.
3) Texture information loss:

A single-channel texture representation is extracted from the color image using the random color shift algorithm:

F_rcs(I_rgb) = (1 − α)(β_1·I_r + β_2·I_g + β_3·I_b) + α·Y

We set α = 0.8 and draw β_1, β_2, β_3 ∼ U(−1, 1). A discriminator D_t is introduced to distinguish the texture representations extracted from the model output and from the cartoon images, and to direct the generator to learn the clear contours and fine textures stored in the texture representation.
4) Content information loss:

L_content = || VGG(G(I_p)) − VGG(I_p) ||_1

where VGG(·) denotes the feature map of a particular VGG layer; l_1 sparse regularization over the VGG feature maps of the input photo and the generated picture is used to refine the semantic content loss. Sparse regularization copes better with the effect of large style differences on the feature maps.
5) Total variation loss:

L_tv = (1 / (H·W·C)) · || ∇_x G(I_p) + ∇_y G(I_p) ||

The total variation loss function L_tv imposes spatial smoothness on the generated image and also reduces high-frequency noise, such as salt-and-pepper noise; H, W, C denote the spatial dimensions of the image.
6) l1 regularization term:

L_1 = || G(I_p) ||_1
A GAN-based cartoonization model is constructed to cartoonize the props, with total loss function:

L_prop = a·L_adv + b·L_con + c·L_tex + d·L_1 + e·L_IS

where a, b, c, d, e are weights balancing the given losses, and L_adv, L_con, L_tex, L_1, L_IS are respectively the adversarial loss function, the content information loss function, the texture information loss function, the l1 regularization term, and the illumination smoothing loss.
1) Adversarial loss:

For each cartoon image: (1) a standard Canny edge detector detects edge pixels; (2) the edge region is dilated; (3) Gaussian smoothing is applied to the dilated edge region, yielding Sdata(e). The edge-promoting adversarial loss function is therefore:

L_adv(G, D) = E_{c_i ∼ Sdata(c)}[log D(c_i)] + E_{e_j ∼ Sdata(e)}[log(1 − D(e_j))] + E_{p_k ∼ Sdata(p)}[log(1 − D(G(p_k)))]
2) Content information loss:

The content information loss function ensures that the cartoonization result and the input photo remain semantically unchanged; it is likewise computed over the pre-trained VGG16 feature space.
3) Texture information loss:

A discriminator D_t is introduced to distinguish the texture representations extracted from the model output and from the cartoon images, and to direct the generator to learn the clear contours and fine textures stored in the texture representation.
4) l1 regularization term:

L_1 = || G(p_k) ||_1
5) Illumination smoothing loss:

To harmonize the lighting, the foreground illumination must be adjusted to approximately match the background illumination, making the lighting of foreground and background compatible. We design a new illumination strategy: first learn the light, then transfer the light from the background to the foreground, under the premise that the image gradient corresponding to the illumination is small (i.e., the illumination is smooth); the constraint ∇L ≈ 0 provides the decoupling, giving the illumination smoothing loss L_IS.
A GAN-based cartoonization model is constructed to cartoonize the background, with total loss function:

L_background = f·L_adv + g·L_con + h·L_str + i·L_1

where f, g, h, i are weights balancing the given losses, and L_adv, L_con, L_str, L_1 are respectively the adversarial loss function, the content information loss function, the structure information loss, and the l1 regularization term.
1) Adversarial loss: same form as the edge-promoting adversarial loss of the prop model;
2) Content information loss: same form as the content information loss of the prop model;
3) Structure information loss:

High-level features extracted with the pre-trained VGG16 network enforce a spatial constraint between our results and the extracted structure representation, where F_st denotes structure representation extraction.
4) l1 regularization term:
The three GAN-based cartoonization models for the different objects are trained continuously with the above loss functions.
Step four, inputting the music stage performance video into the model to obtain cartoon music stage performance video:
The music stage performance video is input into the music stage performance cartoonization video generation models to obtain a music stage performance video with cartoon effect. First, the original video frames are extracted with OpenCV; each frame is then semantically segmented, the different categories of image content are cartoonized with their different style transfer algorithms, and image harmonization is applied, using the methods of steps two, three, and five. Each harmonized cartoon frame of the music stage performance is then written back into a video, producing the complete harmonized cartoonized music stage performance video. The audio of the original video is extracted with MoviePy and added to the cartoonized video, obtaining the final cartoon effect for the music stage performance video.
And fifthly, constructing a composite image coordination model to carry out image harmony processing on the cartoon music stage performance video, and acquiring the cartoon music stage performance video with harmonious colors:
The cartoonized video is harmonized using transfer learning. A new composite image harmonization method performs image harmonization on the cartoon video generated by the models, eliminating disharmony mainly through separable reflectance and illumination intrinsic image harmonization so that foreground and background blend better. First, an autoencoder-based framework is constructed that decomposes the composite image into a reflectance intrinsic image and an illumination intrinsic image; the reflectance is then harmonized by penalizing material inconsistency, while the illumination is harmonized by making the foreground illumination compatible with the background. A model of the harmony relation between foreground and background is further established to guide the harmonization of the intrinsic images; a mask separates foreground and background during the illumination and guidance processes; finally, the trained model turns the input video into a harmonious performance video.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit its protection scope. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit and scope.
Claims (5)
1. An intelligent cartoonization method for a music stage performance video, characterized by comprising the following steps:
acquiring image data and preprocessing the image data; the image data comprises a real stage image data set and a cartoon image data set; the real stage image dataset is obtained from a music stage performance video;
constructing a semantic segmentation model, wherein the semantic segmentation model performs semantic segmentation on the characters, props and backgrounds in the image data;
step three, constructing and training different cartoon video generation models for the music stage performance for characters, props and backgrounds respectively, obtaining a trained character cartoon video generation model, a trained prop cartoon video generation model and a trained background cartoon video generation model:
respectively constructing, on the basis of a generative adversarial network cartoonization model, a character cartoon video generation model, a prop cartoon video generation model and a background cartoon video generation model corresponding to characters, props and backgrounds;
3.1) The total loss function L_body of the character cartoonization video generation model is as follows:

L_body = λ1·L_surface + λ2·L_structure + λ3·L_texture + λ4·L_content + λ5·L_tv + λ6·L_1

wherein λ1, λ2, λ3, λ4, λ5, λ6 are respectively the weights of the character surface information loss function L_surface, the character structure information loss function L_structure, the character texture information loss function L_texture, the character content information loss function L_content, the character total variation loss function L_tv and the l1 regularization term L_1; assigning different weights controls the information emphasis of the generated image;
3.11) The character surface information loss function L_surface is as follows:

L_surface(G, D_s) = log D_s(F_dgf(I_c, I_c)) + log(1 − D_s(F_dgf(G(I_p), G(I_p))))

wherein edge-preserving filtering is performed with a guided filter, denoted F_dgf(I, I), which takes an image I as input and, with the image itself as the guide map, returns an extracted surface representation with textures and details removed; a discriminator D_s is introduced to judge whether the model output and the reference cartoon image have similar surfaces, and to guide the generator G to learn the information stored in the extracted surface representation; wherein G represents the generator, D_s represents the surface information discriminator, I_c represents a cartoon image, and I_p represents a real image;
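A minimal self-guided filter in numpy illustrates the surface extraction F_dgf(I, I). This is a sketch of the classic guided-filter equations with the image as its own guide, not the patent's implementation; the radius and regularization eps are assumed values:

```python
import numpy as np

def box_mean(x, r):
    # mean filter over a (2r+1) x (2r+1) window, edge-padded (simple sketch)
    pad = np.pad(x, r, mode="edge")
    k = 2 * r + 1
    h = sum(pad[i:i + x.shape[0], :] for i in range(k)) / k
    return sum(h[:, j:j + x.shape[1]] for j in range(k)) / k

def guided_filter_self(I, r=2, eps=1e-2):
    """F_dgf(I, I): guided filter with the image as its own guide map;
    an edge-preserving smoothing that removes textures and fine details."""
    mean_I = box_mean(I, r)
    var_I = box_mean(I * I, r) - mean_I ** 2
    a = var_I / (var_I + eps)          # ~1 at strong edges, ~0 in flat areas
    b = mean_I - a * mean_I
    return box_mean(a, r) * I + box_mean(b, r)
```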
3.12) The character structure information loss function L_structure is as follows:

L_structure = ‖ VGG_n(G(I_p)) − VGG_n(F_st(G(I_p))) ‖

wherein high-level features are extracted with a pre-trained VGG16 network, and a spatial constraint is enforced between the character cartoon image generated by the character cartoon video generation model and the structure representation extracted from the generated character cartoon image; F_st(·) denotes the structure representation extraction of the generated character cartoon image, namely a selective-search process followed by color filling of the picture regions; VGG_n(·) denotes the high-level features extracted from the generated character cartoon image by the VGG network;
and computing the region color as a weighted sum of the region's median and average pixel values, with the formula:

S_{i,j} = θ1·S̄ + θ2·S̃, where (θ1, θ2) = (0, 1) if σ(S) < γ1; (0.5, 0.5) if γ1 ≤ σ(S) < γ2; (1, 0) if σ(S) ≥ γ2

wherein S_{i,j} represents the pixel value of the region at position (i, j), S̄ represents the average of the current region's pixel values, S̃ represents the median of the current region's pixel values; i denotes the row and j denotes the column; σ(S) represents the standard deviation of S;
3.13) The character texture information loss function L_texture is as follows:

L_texture(G, D_t) = log D_t(F_rcs(I_c)) + log(1 − D_t(F_rcs(G(I_p))))

wherein F_rcs represents a random color shift algorithm that extracts a single-channel texture representation from a color image; D_t represents the texture information discriminator;
The random color shift algorithm extracts the single-channel texture representation F_rcs(I_rgb) from the color image with the formula:

F_rcs(I_rgb) = (1 − α)(β1·I_r + β2·I_g + β3·I_b) + α·Y

wherein I_rgb represents a 3-channel RGB color image; I_r, I_g and I_b represent the three color channels; Y represents the standard grayscale image converted from the RGB color image. The discriminator D_t is introduced to distinguish the character cartoon image output by the character cartoon video generation model from the texture representation extracted from it, guiding the generator to learn the sharp contours and fine textures stored in the texture representation; α represents the weight of the standard grayscale image, and β1, β2, β3 represent the weights of the r, g and b channels respectively, each in the range (−1, 1);
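The random color shift can be sketched directly from the formula above. The grayscale weight α = 0.8 and the BT.601 grayscale conversion are assumed values, not fixed by the claim:

```python
import numpy as np

def random_color_shift(I_rgb, rng=None):
    """F_rcs: collapse a 3-channel image into one channel by mixing randomly
    weighted color channels with the standard grayscale image Y."""
    rng = rng or np.random.default_rng()
    alpha = 0.8                             # weight of the grayscale image (assumed)
    b1, b2, b3 = rng.uniform(-1, 1, 3)      # channel weights in (-1, 1)
    Ir, Ig, Ib = I_rgb[..., 0], I_rgb[..., 1], I_rgb[..., 2]
    Y = 0.299 * Ir + 0.587 * Ig + 0.114 * Ib   # ITU-R BT.601 grayscale
    return (1 - alpha) * (b1 * Ir + b2 * Ig + b3 * Ib) + alpha * Y
```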
3.14) The character content information loss function L_content is as follows:

L_content = ‖ VGG_n(G(I_p)) − VGG_n(I_p) ‖_1

wherein VGG_n(·) represents the feature map of a VGG layer; l1 sparse regularization of the VGG feature maps between the input photograph and the generated picture is used to refine the semantic content loss;
3.15) The character total variation loss function is as follows:

L_tv = (1 / (H·W·C)) · ‖ ∇_x(G(I_p)) + ∇_y(G(I_p)) ‖

wherein H, W, C represent the spatial dimensions of the image; ∇_x represents the horizontal difference of G(I_p), and ∇_y represents the vertical difference of G(I_p);
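The total variation term can be computed with forward differences as follows (an illustrative sketch; the absolute-value norm is an assumption):

```python
import numpy as np

def total_variation_loss(img):
    """L_tv: summed absolute forward differences in x and y, normalized by
    the image volume H*W*C; it penalizes high-frequency noise in the output."""
    H, W, C = img.shape
    dx = img[:, 1:, :] - img[:, :-1, :]   # horizontal differences
    dy = img[1:, :, :] - img[:-1, :, :]   # vertical differences
    return (np.abs(dx).sum() + np.abs(dy).sum()) / (H * W * C)
```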
3.16) The l1 regularization term is as follows:

L_1 = ‖ G(I_p) ‖_1

wherein ‖ G(I_p) ‖_1 represents the one-norm of the character cartoon image generated by the character cartoon video generation model;
3.2) The total loss function L_prop of the prop cartoon video generation model is as follows:

L_prop = a·L_adv + b·L_con + c·L_tex + d·L_1 + e·L_IS

wherein a, b, c, d, e are the weights of L_adv, L_con, L_tex, L_1, L_IS, which are respectively the edge-promoted adversarial loss function, the content information loss function, the texture information loss function, the l1 regularization term and the illumination smoothing loss;
3.21) Edge-promoted adversarial loss:

For each image c_i ∈ Sdata(c), the following three steps are applied: (1) detecting edge pixels with a standard Canny edge detector; (2) dilating the edge regions; (3) applying Gaussian smoothing to the dilated edge regions, obtaining Sdata(e); wherein Sdata(c) represents the set of cartoon images, Sdata(e) represents the set of cartoon images with their sharp boundaries removed, c_i represents the i-th image in the cartoon image set Sdata(c), e_j represents the j-th image in the set Sdata(e) of cartoon images with sharp boundaries removed, and p_k represents the k-th image in the set of images to be cartoonized;
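Steps (2) and (3) can be sketched in numpy on an already-detected binary edge map; step (1) would typically use cv2.Canny, and the dilation radius and Gaussian kernel size here are illustrative choices:

```python
import numpy as np

def promote_edges(edge_map, dilate_r=1, sigma=1.0):
    """Expand a binary edge map (step 2), then Gaussian-smooth it (step 3),
    producing the blurred-boundary images collected in Sdata(e)."""
    e = edge_map.astype(float)
    # (2) dilation: maximum over a (2r+1) x (2r+1) neighbourhood
    p = np.pad(e, dilate_r, mode="edge")
    k = 2 * dilate_r + 1
    d = np.max([p[i:i + e.shape[0], j:j + e.shape[1]]
                for i in range(k) for j in range(k)], axis=0)
    # (3) separable 5-tap Gaussian smoothing
    xs = np.arange(-2, 3)
    g = np.exp(-xs ** 2 / (2 * sigma ** 2))
    g /= g.sum()
    p2 = np.pad(d, 2, mode="edge")
    h = sum(g[i] * p2[i:i + d.shape[0], :] for i in range(5))
    return sum(g[j] * h[:, j:j + d.shape[1]] for j in range(5))
```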
Thus, the edge-promoted adversarial loss function L_adv is as follows:

L_adv(G, D) = E_{c_i∼Sdata(c)}[log D(c_i)] + E_{e_j∼Sdata(e)}[log(1 − D(e_j))] + E_{p_k∼Sdata(p)}[log(1 − D(G(p_k)))]

wherein E_{c_i∼Sdata(c)}[·] represents the expectation of the discrete variable c_i under the probability distribution Sdata(c), E_{e_j∼Sdata(e)}[·] represents the expectation of e_j under Sdata(e), and E_{p_k∼Sdata(p)}[·] represents the expectation of p_k under Sdata(p); D represents the discriminator, G represents the generator, and G(p_k) represents the image generated by the generator G;
3.22) The content information loss function L_con is as follows:

L_con(G, D) = E_{p_k∼Sdata(p)}[ ‖ VGG_l(G(p_k)) − VGG_l(p_k) ‖_1 ]

wherein VGG_l(·) denotes the feature map of the l-th layer of a pre-trained VGG network;
3.23) The texture information loss function L_tex is as follows:

L_tex(G, D_t) = E_{c_i∼Sdata(c)}[log D_t(F_rcs(c_i))] + E_{p_k∼Sdata(p)}[log(1 − D_t(F_rcs(G(p_k))))]

wherein F_rcs represents the random color shift algorithm and D_t represents the texture information discriminator;
3.3) The total loss function L_background of the background cartoon video generation model is as follows:

L_background = e·L_adv + f·L_con + g·L_str + h·L_1

wherein e, f, g, h are respectively the weights of L_adv, L_con, L_str and L_1 in the background cartoon video generation model;
respectively training the character cartoon video generation model, the prop cartoon video generation model and the background cartoon video generation model so as to minimize their respective total loss functions, thereby obtaining the trained character cartoon video generation model, the trained prop cartoon video generation model and the trained background cartoon video generation model; respectively inputting the semantically segmented parts of the video to be cartoonized into the trained character cartoon video generation model, the trained prop cartoon video generation model and the trained background cartoon video generation model to obtain a character cartoon video, a prop cartoon video and a background cartoon video; then compositing each frame of the character cartoon video, the prop cartoon video and the background cartoon video into a composite image, thereby obtaining a composite cartoon music stage performance video;
step four, preprocessing the music stage performance video to be processed; after segmenting out the characters, props and background with the semantic segmentation model, inputting them respectively into the trained character cartoon video generation model, the trained prop cartoon video generation model and the trained background cartoon video generation model to obtain the cartoon music stage performance video;
and fifthly, constructing a composite image coordination model to carry out image harmony processing on the cartoon music stage performance video to obtain the final cartoon music stage performance video.
2. The intelligent cartoonification method of a music stage performance video according to claim 1, wherein in the first step, the preprocessing method comprises image enhancement and image normalization.
3. The intelligent cartoonization method for a music stage performance video according to claim 1, wherein in the second step, the semantic segmentation model is a deep convolutional neural network (DCNN) based semantic segmentation model;
firstly, a picture is fed into the DCNN-based semantic segmentation model, and hole (atrous) convolution is added to extract features, obtaining high-level semantic features and low-level semantic features; the hole convolution process is:

y[i] = Σ_k x[i + τ·k] · w[k]

wherein y[i] represents the hole convolution output at position i, x[i + τ·k] represents the input at position i + τ·k, K represents the length of the convolution kernel, w[k] represents the convolution filter of length K, and τ represents the sampling stride (dilation rate) of the input signal;
the low-level semantic features are the feature information obtained after one hole convolution with a hole rate of 1, and the high-level semantic features are the feature information obtained after four hole convolutions; the extracted high-level semantic features are input into an atrous spatial pyramid pooling module and convolved with hole convolution layers of different hole rates, obtaining four feature maps, the hole rates being 1, 6, 12 and 18 respectively; pooling the extracted high-level semantic features yields a further feature map; all branches together produce five feature maps, which are spliced together to obtain a first feature map;
putting the first feature map into a multilayer channel attention module to obtain a second feature map; performing bilinear interpolation upsampling on the second feature map and merging it with the low-level semantic features to obtain a merged feature map; the decoder part recovers the spatial information of the merged feature map using 3×3 convolution and refines the target boundary by bilinear interpolation upsampling, obtaining the segmentation result;
since there are multiple objects in the image segmentation task, a multi-class cross-entropy loss function is used, with the formula:

L = −Σ_{i=1}^{C} y_i · log(p_i)

wherein p_i represents the probability that the sample belongs to category i; y_i is the indicator of the sample label: when the sample belongs to category i, y_i = 1, and when the sample does not belong to category i, y_i = 0; C represents the number of categories;
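The multi-class cross-entropy above, for one sample with a one-hot label, is a two-liner (the clipping epsilon is an assumption for numerical safety):

```python
import numpy as np

def cross_entropy(p, y, eps=1e-12):
    """L = -sum_i y_i * log(p_i) over the C categories; y is one-hot,
    p holds the predicted class probabilities."""
    p = np.clip(p, eps, 1.0)   # avoid log(0)
    return -np.sum(y * np.log(p))
```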
through the process, the characters and the props are separated from the stage background.
4. The intelligent cartoonization method for a music stage performance video as claimed in claim 3, wherein γ1 = 20 and γ2 = 40.
5. The intelligent cartoonization method of a music stage performance video according to claim 1, wherein the concrete steps of the fifth step are as follows:
the composite image is decomposed into a reflectance intrinsic image and an illumination intrinsic image whose element-wise product recomposes the image, wherein ⊙ denotes the element-wise product;
Harmonization is embedded into the process from composite image decomposition to real image reconstruction through an image reconstruction loss function L_rec:

L_rec = E[ ‖ Ĩ − I ‖_2 ]

wherein E[ ‖ Ĩ − I ‖_2 ] represents the expectation of the two-norm of the spatial distance between the output harmonized image and the real image, and Ĩ represents the output harmonized image;
Taking ∇R̃ ≈ ∇Ĩ as a constraint for coordinating the reflectance, a reflectance harmonization loss L_RH is generated:

L_RH = E[ ‖ ∇R̃ − ∇Ĩ ‖ ]

wherein ∇R̃ represents the reflectance gradient of the harmonized image, ∇Ĩ represents the gradient of the harmonized image, and E[·] represents the expectation of the norm of the difference between the reflectance gradient of the harmonized image and the gradient of the harmonized image; Ĩ represents the harmonized natural image and ∇ represents the gradient;
to coordinate the illumination, the foreground and background illumination are made compatible: the light is first learned and then transferred from the background to the foreground, on the premise that the image gradient corresponding to the illumination is smooth; the constraint that this gradient is close to 0 provides the decoupling, yielding the illumination smoothing loss L_IS;
The illumination harmonization loss L_IH is set as follows:

L_IH = E[ ‖ L̃ − L ‖_2 ]

wherein I represents the real image, L̃ represents the harmonized illumination intrinsic image, L represents the illumination intrinsic image of the real image, and E[·] represents the expectation of the two-norm of the spatial distance between the harmonized intrinsic image and that of the real image;
wherein sim(·, ·) is a similarity function; the encoder receives the composite image as input and produces a disharmony feature map as output; C is the number of channels of that feature map; the expectation is taken over the channels; and the grayscale real image is reduced to the same size as the feature map;
The total loss function L_harm is obtained as follows:

L_harm = L_rec + λ_RH·L_RH + λ_IS·L_IS + λ_IH·L_IH + λ_IF·L_IF

Through training that minimizes the total loss function L_harm, a final harmonization model is obtained; inputting the obtained composite cartoon music stage performance video into the final harmonization model yields the harmonized music stage performance video; λ_RH, λ_IS, λ_IH and λ_IF are respectively the weights of L_RH, L_IS, L_IH and L_IF.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210812946.2A CN114898021B (en) | 2022-07-12 | 2022-07-12 | Intelligent cartoon method for music stage performance video |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114898021A CN114898021A (en) | 2022-08-12 |
CN114898021B true CN114898021B (en) | 2022-09-27 |
Family
ID=82729610
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115100334B (en) * | 2022-08-24 | 2022-11-25 | 广州极尚网络技术有限公司 | Image edge tracing and image animation method, device and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112295211A (en) * | 2019-07-31 | 2021-02-02 | 上海虞姿信息技术有限公司 | Stage performance virtual entertainment practical training system and method |
CN112561786A (en) * | 2020-12-22 | 2021-03-26 | 作业帮教育科技(北京)有限公司 | Online live broadcast method and device based on image cartoonization and electronic equipment |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2011045768A2 (en) * | 2009-10-15 | 2011-04-21 | Yeda Research And Development Co. Ltd. | Animation of photo-images via fitting of combined models |
US10671838B1 (en) * | 2019-08-19 | 2020-06-02 | Neon Evolution Inc. | Methods and systems for image and voice processing |
CN112070080A (en) * | 2020-08-19 | 2020-12-11 | 湖南师范大学 | Method for classifying cartoon characters playing songs based on Faster R-CNN |
CN112102153B (en) * | 2020-08-20 | 2023-08-01 | 北京百度网讯科技有限公司 | Image cartoon processing method and device, electronic equipment and storage medium |
CN112132922A (en) * | 2020-09-24 | 2020-12-25 | 扬州大学 | Method for realizing cartoon of images and videos in online classroom |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||