CN109859095B - Automatic cartoon generation system and method - Google Patents
- Publication number
- CN109859095B
- Application number
- CN201811545763.9A
- Authority
- CN
- China
- Prior art keywords
- cartoon
- page
- key frame
- key
- video
- Prior art date
- 2018-12-18
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Processing Or Creating Images (AREA)
Abstract
The invention provides a system and a method for automatically generating cartoons, belonging to the fields of computer vision and media. The method first selects key frames and stylizes the video, then obtains the four parameters required for page layout design, then performs multi-page cartoon layout design, and finally generates emotion-driven cartoon dialog boxes and places them on the cartoon to produce the final result. The cartoon page layout of the invention treats the number of pages as a parameter from the outset and is optimized globally; this is closer to actual usage, and the generated result better matches real cartoons. Compared with traditional dialog boxes, the dialog boxes generated by the method come in more styles, and the style chosen for each dialog better matches its emotion.
Description
Technical Field
The invention belongs to the fields of computer vision and media, and more particularly relates to generating the elements required for a cartoon, and the final work, using deep learning methods.
Background
Comics are a popular art form, but traditional comic creation is time-consuming and labor-intensive, and remains difficult despite the many tools available to assist it. Much research has therefore been devoted to automatic cartoon generation systems that take a video stream as input and, without manual assistance, output a comic book containing all the important and interesting information in the video. Such a system generally comprises three parts: extracting key frames from the input video, stylizing the extracted key frames, and laying out the stylized pictures to obtain a complete cartoon.
Regarding automatic cartoon generation, as early as 2008, Ryu et al. (D.-S. Ryu, S.-H. Park, J.-W. Lee, D.-H. Lee, and H.-G. Cho, "Cinetoon: A semi-automated system for rendering black/white comic books from video streams," in IEEE International Conference on Computer and Information Technology Workshops, 2008, pp. 336-341) proposed a semi-automated cartoon generation system that extracts key frames manually, stylizes them automatically, and does not design the cartoon page layout; its results are therefore rough, and fully automatic generation is not achieved. In 2012, Wang et al. (M. Wang, R. Hong, X.-T. Yuan, S. Yan, and T.-S. Chua, "Movie2Comics: Towards a lively video content presentation," IEEE Transactions on Multimedia, vol. 14, no. 3, pp. 858-870, 2012) proposed a fully automated system including key frame selection, stylization, and page layout; but because the page layout is only selected from a few existing templates, and only single-page generation is considered rather than the multi-page scenario of actual use, the results are relatively monotonous. In 2015, Chu et al. (W.-T. Chu, C.-H. Yu, and H.-H. Wang, "Optimized comics-based storytelling for temporal image sequences," IEEE Transactions on Multimedia, vol. 17, no. 2, pp. 201-215, 2015) designed an automatic cartoon generation system from an optimization perspective, but the system accepts only animated films rather than ordinary movies and is thus limited in use; moreover, it does not design the page layout, so its results are very rough.
These systems suffer from several problems. First, only single-page cartoon generation is considered, not multi-page generation, so they have limited practical value. Second, the page layout is either not designed at all or designed too crudely, making the generated cartoon hard for readers to accept. Third, the influence of dialog box shape on emotional expression is ignored: a single shape (oval or rectangular) is used, far from the variety of dialog boxes in real cartoons.
Regarding page layout design, in 2015, Jing et al. (G. Jing, Y. Hu, Y. Guo, Y. Yu, and W. Wang, "Content-aware video2comics with manga-style layout," IEEE Transactions on Multimedia, vol. 17, no. 12, pp. 2122-2133, 2015) proposed designing the page layout from an optimization perspective: first determining the local structure of the page according to the video content, then automatically generating the initial layout of the cartoon page, and finally obtaining the final panel layout by running a layout optimization algorithm. In 2012, Cao et al. (Y. Cao, A. B. Chan, and R. W. H. Lau, "Automatic stylistic manga layout," ACM Transactions on Graphics, vol. 31, no. 6, pp. 1-10, 2012) proposed designing the panel layout in a data-driven manner: first creating the initial layout best suited to the input artwork under a generative probabilistic framework of layout structure models, then jointly refining the layout and the geometry of each panel with an efficient optimization procedure, yielding a professional-looking layout design.
However, these methods have their own problems: the page layouts produced by the method of Jing et al. differ from real cartoons, and the method of Cao et al. requires four manually derived parameters as input, namely the ROI (region of interest), the importance level of each panel (ImportanceRank), the number of panels per page, and the relatedness between panels.
Regarding cartoon dialog box generation, in 2007, Preuß (J. Preuß, "From movie to comic, informed by the screenplay," in ACM SIGGRAPH, 2007, p. 99) performed text analysis on the human dialog in a video and assigned it, according to the statistical properties of the text, to one of two dialog box types: a Speech dialog box or a Noise dialog box.
Although this work only classifies dialog by text statistics, and into just two simple categories, it suggests the idea of using text emotion analysis to assist cartoon dialog box generation (i.e., analyzing the emotion of the dialog in the cartoon to help select an appropriate dialog box shape). Especially today, with machine learning techniques in wide use, detailed emotion analysis of text is no longer difficult, and it helps generate rich and accurate cartoon dialog boxes.
Disclosure of Invention
The invention aims to solve the technical problems that manual cartoon creation has a high skill threshold and is time-consuming, and designs a method for automatically generating cartoons.
The technical scheme of the invention is as follows:
An automatic cartoon generation method is carried out using an automatic cartoon generation system that comprises a key frame selection module, a stylization module, a multi-page cartoon page layout design module, and an emotion-driven cartoon dialog box generation module; the method comprises the following steps:
Step one, selecting video key frames
(1.1) Inputting a piece of video material into the key frame selection module and determining the start and end time points of each subtitle in the material.
(1.2) Dividing the video into subtitle segments and no-subtitle segments using the subtitle start and end time points.
(1.3) For each subtitle segment, computing the similarity of the GIST image features of every two consecutive frames; when the similarity falls below a threshold θ₁, selecting the latter frame as a key frame, and continuing the selection until the subtitle segment ends. When no key frame can be selected by similarity within a subtitle segment, the frame at the segment's middle time point is taken as the key frame by default.
(1.4) For each no-subtitle segment, first selecting key frames by the same method as step (1.3); then computing the similarity between each selected key frame and the key frames selected in step (1.3), and deleting any no-subtitle key frame whose similarity exceeds a threshold θ₂, completing the key frame screening.
(1.5) All the subtitle key frames, together with the no-subtitle key frames screened in step (1.4), form the key frames of the whole input video.
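As an illustration of steps (1.3)-(1.4), a minimal Python sketch of this selection logic follows, assuming the embodiment's thresholds θ₁ = 0.81 and θ₂ = 0.73 and a cosine similarity over GIST features; the gist_descriptor stub is a hypothetical placeholder for any real GIST implementation and is not part of the patent.

```python
import numpy as np

def gist_descriptor(frame: np.ndarray) -> np.ndarray:
    """Placeholder for a real GIST feature extractor (hypothetical)."""
    raise NotImplementedError

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_keyframes(frames, theta1=0.81, is_subtitle_segment=True):
    """Emit a key frame whenever similarity of two consecutive frames drops below theta1."""
    feats = [gist_descriptor(f) for f in frames]
    keys = [frames[i] for i in range(1, len(frames))
            if cosine_sim(feats[i - 1], feats[i]) < theta1]
    if not keys and is_subtitle_segment:
        keys = [frames[len(frames) // 2]]  # default: middle frame of the subtitle segment
    return keys

def merge_keyframes(sub_keys, nonsub_keys, theta2=0.73):
    """Drop no-subtitle key frames too similar (> theta2) to any subtitle key frame."""
    sub_feats = [gist_descriptor(k) for k in sub_keys]
    kept = [k for k in nonsub_keys
            if all(cosine_sim(gist_descriptor(k), s) <= theta2 for s in sub_feats)]
    return sub_keys + kept
```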
Step two, stylizing the video
(2.1) Inputting the key frames of the whole input video obtained in step one into the stylization module.
(2.2) Obtaining an edge image i₁ of each key frame picture using the Difference of Gaussians (DoG) method.
(2.3) Performing 8-bit quantization sampling on each key frame picture to reduce its number of colors, obtaining a color picture i₂.
(2.4) Combining the edge image i₁ and the color picture i₂ to compose the final picture: I = i₁ + i₂.
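A short OpenCV sketch of steps (2.2)-(2.4) follows. Here "8-bit" color is read as one byte per pixel (3-3-2 bits for R-G-B), which is an assumption about the patent's exact quantization, and the DoG sigmas and edge threshold are illustrative values rather than values fixed by the text.

```python
import cv2
import numpy as np

def stylize(frame: np.ndarray) -> np.ndarray:
    """frame: BGR uint8 key frame; returns the stylized picture I = i1 + i2."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Edge image i1: Difference of Gaussians (two 3x3 blurs with different sigmas).
    dog = cv2.subtract(cv2.GaussianBlur(gray, (3, 3), 0.5),
                       cv2.GaussianBlur(gray, (3, 3), 1.5))
    edges = cv2.threshold(dog, 10, 255, cv2.THRESH_BINARY)[1]
    # Color picture i2: quantize to 8 bits per pixel (RGB332).
    quant = frame.copy()
    quant[:, :, 0] &= 0b11000000  # blue: keep 2 bits
    quant[:, :, 1] &= 0b11100000  # green: keep 3 bits
    quant[:, :, 2] &= 0b11100000  # red: keep 3 bits
    # Compose: overlay the edges on the quantized image; edge pixels win (drawn black).
    quant[edges > 0] = 0
    return quant
```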
Step three, page layout design of multi-page cartoon
(3.1) Inputting the key frames of the whole input video obtained in step one into the multi-page cartoon page layout design module.
(3.2) Obtaining seven heat maps of each picture via the CAM (Class Activation Mapping) algorithm, merging the seven heat maps, and finding the minimal bounding box according to a threshold θ₃: points below θ₃ are regarded as outside the bounding box and points above θ₃ as inside it. The minimal bounding box is the ROI (region of interest) of the image.
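A sketch of step (3.2) in Python is given below; how the seven CAM maps are merged is not spelled out in the text, so max-pooling them is an assumption, while θ₃ = 217 follows the embodiment.

```python
import numpy as np

def roi_from_heatmaps(heatmaps, theta3=217):
    """heatmaps: list of seven HxW uint8 CAM heat maps for one key frame."""
    merged = np.max(np.stack(heatmaps), axis=0)  # merged grey map
    ys, xs = np.nonzero(merged > theta3)         # points inside the bounding box
    if xs.size == 0:
        h, w = merged.shape
        return 0, 0, w, h                        # fallback: whole frame as ROI
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1
```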
(3.3) construction of LSTM neural network
The LSTM neural network consists of two layers of LSTM units: the first layer is a forward LSTM that models the video frame sequence, and the second layer is a reverse LSTM that models the relationships between video frames at different earlier and later time points. The input to the network is a series of picture feature vectors (x₁, x₂, x₃, …, xₙ), where xₙ is the feature vector of the n-th frame picture; each vector has length 1024 and is the output of the 5th pooling layer (Pool5) of GoogLeNet (Inception v4) when the picture is fed into the network. The output of the network is the importance level (ImportanceRank) corresponding to each input video frame picture.
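A PyTorch sketch of such a network is shown below; a single bidirectional nn.LSTM layer supplies the forward and reverse passes described above, while the hidden size and the linear scoring head are assumptions (the text fixes only the 1024-d Pool5 input).

```python
import torch
import torch.nn as nn

class ImportanceLSTM(nn.Module):
    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        # bidirectional=True provides both the forward and the reverse LSTM.
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)  # per-frame importance score

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_frames, 1024) GoogLeNet Pool5 features
        h, _ = self.lstm(x)               # (batch, n_frames, 2*hidden)
        return self.head(h).squeeze(-1)   # (batch, n_frames) ImportanceRank

scores = ImportanceLSTM()(torch.randn(1, 20, 1024))  # e.g. a 20-key-frame video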
(3.4) Inputting the key frames of the whole input video into the LSTM neural network, which finally outputs the importance level (ImportanceRank) of each key frame's panel. Key frames corresponding to the same subtitle sentence are set as closely related frames, and each key frame corresponds to one panel of the cartoon, thereby yielding the relatedness (PanelRelation) between panels. The number of panels on each page is then obtained with a genetic algorithm. The optimization objective of the genetic algorithm, formula (1), combines the following quantities: the preset weights α₁, α₂, α₃, α₄, α₅; the preset total number of cartoon pages NP, with i = 1, 2, …, NP; the number nᵢ of panels on page i; the total number of key frames N; the preset minimum and maximum number of panels per page, N_min and N_max; the standard deviation SD of the sequence n₁, n₂, n₃, …, n_NP; and the relatedness R between panels.
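Formula (1) itself appears only as an image in the original, so the sketch below uses a hedged reconstruction of the objective from the quantities above: it penalizes page panel counts outside [N_min, N_max], mismatch between the panel total and N, uneven pages (SD), and splitting of related panels (R). The exact form, the role of each α, and the GA hyper-parameters are assumptions, with the embodiment's weights used for illustration.

```python
import random
import statistics

ALPHA = (0.35, 0.35, 0.15, 0.05, 0.10)  # alpha_1..alpha_5 from the embodiment

def cost(pages, N, n_min=2, n_max=9, split_penalty=0.0):
    """Hedged reconstruction of formula (1); pages = [n_1, ..., n_NP]."""
    a1, a2, a3, a4, a5 = ALPHA
    bounds = sum(max(0, n_min - n) + max(0, n - n_max) for n in pages)
    return (a1 * bounds                      # keep each page within [N_min, N_max]
            + a2 * abs(sum(pages) - N)       # use all N key frames exactly
            + a3 * statistics.pstdev(pages)  # SD: balance pages
            + a4 * split_penalty             # R: keep related panels on one page
            + a5 * 0.0)                      # remaining term unknown; placeholder

def assign_panels(NP, N, pop_size=50, gens=200):
    """Simple genetic algorithm over panel counts per page."""
    pop = [[random.randint(1, 12) for _ in range(NP)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=lambda p: cost(p, N))
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, NP) if NP > 1 else 0
            child = a[:cut] + b[cut:]                     # one-point crossover
            if random.random() < 0.2:                     # mutation
                child[random.randrange(NP)] = random.randint(1, 12)
            children.append(child)
        pop = parents + children
    return min(pop, key=lambda p: cost(p, N))
```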
(3.5) Inputting the four parameters obtained above, together with the key frames, into a data-driven model to complete the multi-page cartoon page layout design. The four parameters are: the ROI, the importance level of each panel, the relatedness between panels, and the number of panels on each page. The data-driven model has two stages: the first is a training stage, in which a data set of cartoon page designs is collected and its probability distribution is learned; in the second stage, the four parameters are fed into this probability distribution to obtain the probability of each candidate page design, and the design with the highest probability is selected as the final result. The output of the model is a cartoon with its pages laid out.
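The second stage reduces to an argmax over candidate designs, as the minimal sketch below shows; the LayoutModel-style `model.probability` interface is hypothetical, standing in for the learned distribution of Cao et al.'s data-driven framework.

```python
def best_layout(candidates, model, roi, importance, relatedness, panels_per_page):
    """Pick the candidate page design with the highest learned probability."""
    params = (roi, importance, relatedness, panels_per_page)
    return max(candidates, key=lambda design: model.probability(design, params))
```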
Step four, generation of cartoon dialog box driven by emotion
(4.1) Inputting the subtitles corresponding to the video into the emotion-driven cartoon dialog box generation module.
(4.2) Collecting the texts of many cartoon dialog boxes together with the boxes' shape information, and feeding each text into a text emotion analyzer (MixedEmotions) to obtain scores on six indexes (Joy, Sadness, Disgust, Anger, Surprise, Fear); the scores serve as features and the corresponding dialog box shape as the GT (Ground Truth), and both are input into a classifier for training.
(4.3) Inputting the six index scores of a text that needs a dialog box into the trained classifier, which outputs the dialog box shape.
(4.4) Rendering the dialog text into the dialog box, completing the cartoon dialog box generation.
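A scikit-learn sketch of the step-four classifier, under the embodiment's choice of an SVM, is given below; the emotion analyzer is external and stubbed here, and the balloon-shape label set in the comment is a hypothetical example, not a set fixed by the patent.

```python
import numpy as np
from sklearn.svm import SVC

def emotion_scores(text: str) -> np.ndarray:
    """Stub for the external text emotion analyzer; returns the six index
    scores (Joy, Sadness, Disgust, Anger, Surprise, Fear)."""
    raise NotImplementedError

def train_selector(X: np.ndarray, y: np.ndarray) -> SVC:
    """X: (n_samples, 6) emotion scores; y: ground-truth balloon-shape labels,
    e.g. 0=oval, 1=rectangle, 2=spiky, 3=cloud (label set is illustrative)."""
    clf = SVC(kernel="rbf")
    clf.fit(X, y)
    return clf

def select_shape(clf: SVC, subtitle: str) -> int:
    """Score a subtitle line and let the trained classifier pick the shape."""
    return int(clf.predict(emotion_scores(subtitle).reshape(1, -1))[0])
```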
The invention has the beneficial effects that:
(1) A fully automated cartoon generation system: traditional cartoon generation systems usually need manual assistance or have demanding input requirements, making them only semi-automatic. In contrast, the present system is fully automatic and more intelligent: it needs only a video and its corresponding subtitles as input, with no manual assistance, to generate a complete comic book. These inputs are readily available, which lowers the usage threshold and time cost and lets users complete cartoon creation more efficiently.
(2) Multi-page cartoon page layout design: unlike conventional cartoon generation systems, which consider only single-page generation and ignore the multi-page case, the page layout of the invention treats the number of pages as a parameter from the outset and optimizes globally. It is therefore closer to actual usage, and the generated result better matches real cartoons.
(3) Varied cartoon dialog boxes based on emotion analysis: traditional cartoon dialog boxes use only a single oval or rectangle, which is monotonous and unlikely to attract readers' interest. The invention therefore provides an emotion-driven dialog box generation system that uses text emotion analysis and a trained classifier to help the user select the most appropriate dialog box. Compared with traditional dialog boxes, the generated dialog boxes come in more styles, and the style chosen for each dialog better matches its emotion.
Drawings
FIG. 1 is a flow chart of the system of the present invention. The input is a video; the pipeline runs key frame selection, stylization, acquisition of the four parameters required for page layout design, multi-page cartoon layout design, and emotion-driven dialog box generation, and finally places the generated dialog boxes on the cartoon to produce the result.
FIG. 2 is a flow chart of key frame selection in the present invention. First, the video is divided into subtitle segments and no-subtitle segments using the timing information in the subtitles; second, key frames are selected for the subtitle segments; third, key frames are selected for the no-subtitle segments; fourth, the two sets of key frames are merged according to the strategy described above.
FIG. 3 is a flow chart of stylization in the present invention. First, the image edges of a key frame are obtained by Difference of Gaussians; second, 8-bit color quantization is applied to the key frame image; third, the edges and the quantized image are combined to obtain the stylized image.
FIG. 4 is a flow chart of the training and use of the cartoon dialog box selector in the present invention. First, emotion analysis is performed on the text in existing cartoon dialog boxes to obtain scores on six indexes (Joy, Sadness, Disgust, Anger, Surprise, Fear); second, a classifier is trained with the six indexes as variables and the shape of the dialog box holding each text as the GT (Ground Truth); third, emotion analysis is applied to the subtitle text of the video to obtain its six indexes; fourth, the six indexes are fed into the trained classifier, which helps select a suitable dialog box shape, and the text is rendered into it to form the final result.
Detailed Description
The present invention will be described in further detail with reference to specific embodiments, but the present invention is not limited to the specific embodiments.
Examples
A fully automated cartoon generation system includes the following components: key frame selection, stylization, multi-page cartoon layout design, and cartoon dialog box generation and placement.
1. Key frame selection
This part corresponds to the first stage of FIG. 1; the specific steps are shown in FIG. 2. The input is the original video. In the embodiment, to reduce time cost, we first sample the video at 0.5-second intervals with the FFmpeg software and use the sampled frames in place of the original video. The video is then divided into subtitle and no-subtitle segments using the timing information in the subtitles, and key frames are selected with the respective strategies. Finally, the two sets of key frames are merged to obtain the required key frames; for the specific strategy, see step one of the Disclosure. In practice, the selection threshold θ₁ for the subtitle and no-subtitle segments can be set as required; in our tests, θ₁ = 0.81 gave the best results. The merging threshold θ₂ can likewise be set as required; in our tests, θ₂ = 0.73 gave the best results.
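The sampling step can be reproduced with a one-line FFmpeg call, sketched here via subprocess (fps=2 yields one frame every 0.5 seconds); the output naming pattern is illustrative.

```python
import subprocess

def sample_frames(video_path: str, out_dir: str) -> None:
    """Extract frames at 0.5-second intervals, as in the embodiment."""
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", "fps=2", f"{out_dir}/%06d.png"],
        check=True,
    )
```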
2. Stylization
This part corresponds to the stylization stage of FIG. 1; the specific steps are shown in FIG. 3. First, the image edges are obtained with the Difference of Gaussians (other operators, such as the Laplacian, could replace the Gaussians here, but in practice the Gaussian operator worked best for us); the Gaussian kernel size used is 3 × 3, and the tool platform is OpenCV. Meanwhile, 8-bit color quantization is applied to the original key frame; this step reduces the number of colors in the picture and brings it closer to a cartoon style. Other quantization levels, such as 16-bit, can be chosen in practice, but 8-bit worked best in our tests. Finally, the image edges and the quantized image are combined into the stylized picture; the combination rule in practice is to overlay the edges on the color-quantized image, so that wherever the two overlap, the pixel takes the value of the edge image.
3. Layout design of multi-page cartoon
The input of this part is the stylized key frames and the subtitle information obtained in the previous stages. In practice, we use the LSTM network proposed in 2016 by Zhang et al. (K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, "Video summarization with long short-term memory," in ECCV, 2016, pp. 766-782): after the key frame sequence is fed into the network, it outputs the importance level (ImportanceRank) of each key frame. Then, for each key frame, the CAM algorithm produces seven classification heat maps, which are merged into one grey map; a fixed threshold θ₃ is applied and the minimal bounding box is found, the region inside which is the ROI. θ₃ can be set as required; in our tests, θ₃ = 217 gave the best results. Next, the relationships between key frames are obtained from the information gathered in the key frame selection stage: in practice, key frames corresponding to the same subtitle are considered related, and key frames corresponding to different subtitles unrelated. Then a genetic algorithm assigns the number of panels to each page of the cartoon; its optimization target is formula (1), whose quantities are introduced in step three of the Disclosure, and the best parameter values in our tests are α₁ = 0.35, α₂ = 0.35, α₃ = 0.15, α₄ = 0.05, α₅ = 0.10. Finally, we feed the four obtained parameters into the data-driven model proposed in 2012 by Cao et al. (Y. Cao, A. B. Chan, and R. W. H. Lau, "Automatic stylistic manga layout," ACM Transactions on Graphics, vol. 31, no. 6, pp. 1-10, 2012) to obtain the final multi-page layout design of the comic book.
4. Cartoon dialog generation
First, the texts and dialog box shapes in the dialog boxes of existing comic books are collected. Each text is then sent to the text emotion analyzer (MixedEmotions) to obtain six emotion indexes: Joy, Sadness, Disgust, Anger, Surprise, Fear. Next, a classifier is trained with the six emotion indexes as variables and the dialog box shape as the GT (Ground Truth). The classifier can be a Support Vector Machine (SVM), a random forest, or a neural network; the SVM worked best in our tests. At run time, a text is sent to the emotion analyzer to obtain its six indexes, which are fed into the trained classifier; the classifier then completes the selection of the dialog box shape. Finally, the text is rendered into the selected dialog box.
Claims (1)
1. An automatic cartoon generation method, characterized in that the method is carried out using an automatic cartoon generation system comprising a key frame selection module, a stylization module, a multi-page cartoon page layout design module, and an emotion-driven cartoon dialog box generation module, and comprises the following steps:
step one, selecting video key frames
(1.1) inputting a piece of video material into the key frame selection module and determining the start and end time points of each subtitle in the material;
(1.2) dividing the video into subtitle segments and no-subtitle segments using the subtitle start and end time points;
(1.3) for each subtitle segment, computing the similarity of the GIST image features of every two consecutive frames; when the similarity falls below a threshold θ₁, selecting the latter frame as a key frame, and continuing the selection until the subtitle segment ends; when no key frame can be selected by similarity within a subtitle segment, taking the frame at the segment's middle time point as the key frame by default;
(1.4) for each no-subtitle segment, first selecting key frames by the same method as step (1.3); then computing the similarity between each selected key frame and the key frames selected in step (1.3), and deleting any no-subtitle key frame whose similarity exceeds a threshold θ₂, completing the key frame screening;
(1.5) all the subtitle key frames, together with the no-subtitle key frames screened in step (1.4), forming the key frames of the whole input video;
step two, stylizing the video
(2.1) inputting the key frame of the whole input video obtained in the first step into a stylization module;
(2.2) obtaining an edge image i₁ of each key frame picture using the Difference of Gaussians method;
(2.3) performing 8-bit quantization sampling on each key frame picture to reduce its number of colors, obtaining a color picture i₂;
(2.4) combining the edge image i₁ and the color picture i₂ to compose the final picture: I = i₁ + i₂;
Step three, page layout design of multi-page cartoon
(3.1) inputting the key frame of the whole input video obtained in the first step into a multi-page cartoon page layout design module;
(3.2) obtaining seven heat maps of each picture through the CAM algorithm, merging the seven heat maps, and finding the minimal bounding box according to a threshold θ₃, wherein points below θ₃ are regarded as outside the bounding box, points above θ₃ are regarded as inside it, and the minimal bounding box is the ROI of the image;
(3.3) construction of LSTM neural network
The LSTM neural network consists of two layers of LSTM units, wherein the first layer is a forward LSTM that models the video frame sequence and the second layer is a reverse LSTM that models the relationships between video frames at different earlier and later time points; the input to the LSTM neural network is a series of picture feature vectors (x₁, x₂, x₃, …, xₙ), wherein xₙ is the feature vector of the n-th frame picture, each vector having length 1024 and being the output of the 5th pooling layer of GoogLeNet when the picture is input into GoogLeNet; the output of the LSTM neural network is the importance level corresponding to each input video frame picture;
(3.4) inputting the key frames of the whole input video into the LSTM neural network, which finally outputs the importance level of each key frame's panel; setting key frames corresponding to the same subtitle sentence as closely related frames, wherein each key frame corresponds to one panel of the cartoon, thereby obtaining the relatedness between panels; and obtaining the number of panels on each page with a genetic algorithm whose optimization objective, formula (1), combines the following quantities: the preset weights α₁, α₂, α₃, α₄, α₅; the preset total number of cartoon pages NP, with i = 1, 2, …, NP; the number nᵢ of panels on page i; the total number of key frames N; the preset minimum and maximum number of panels per page, N_min and N_max; the standard deviation SD of the sequence n₁, n₂, n₃, …, n_NP; and the relatedness R between panels;
(3.5) inputting the four obtained parameters, together with the key frames, into a data-driven model to complete the multi-page cartoon page layout design, the four parameters being the ROI, the importance level of each panel, the relatedness between panels, and the number of panels on each page; the data-driven model comprising two stages, the first being a training stage in which a data set of cartoon page designs is obtained and its probability distribution learned, and the second feeding the four parameters into the probability distribution to obtain the probability of each candidate page design and selecting the design with the highest probability as the final result; the output of the model being a cartoon with its pages laid out;
step four, generation of cartoon dialog box driven by emotion
(4.1) inputting subtitles corresponding to the video into an emotion-driven cartoon dialog box generation module;
(4.2) collecting the texts of many cartoon dialog boxes together with the boxes' shape information, and inputting each text into a text emotion analyzer to obtain scores on six indexes: Joy, Sadness, Disgust, Anger, Surprise, Fear; inputting the scores as features and the corresponding dialog box shape as the GT into a classifier for training;
(4.3) inputting the six index scores of a text that needs a dialog box into the trained classifier, the classifier giving the shape of the dialog box;
(4.4) rendering the dialog text into the dialog box, completing the generation of the cartoon dialog box.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811545763.9A CN109859095B (en) | 2018-12-18 | 2018-12-18 | Automatic cartoon generation system and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109859095A CN109859095A (en) | 2019-06-07 |
CN109859095B (en) | 2022-09-20 |
Family
ID=66891474
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811545763.9A Active CN109859095B (en) | Automatic cartoon generation system and method | 2018-12-18 | 2018-12-18 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109859095B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111415399B * | 2020-03-19 | 2023-12-22 | Beijing QIYI Century Science and Technology Co., Ltd. | Image processing method, device, electronic equipment and computer readable storage medium |
CN111385644A * | 2020-03-27 | 2020-07-07 | MIGU Culture Technology Co., Ltd. | Video processing method, electronic equipment and computer readable storage medium |
CN111429341B * | 2020-03-27 | 2023-08-18 | MIGU Culture Technology Co., Ltd. | Video processing method, device and computer readable storage medium |
CN112417873B * | 2020-11-05 | 2024-02-09 | Wuhan University | Automatic cartoon generation method and system based on BBWC model and MCMC |
CN113301268A * | 2021-04-30 | 2021-08-24 | Nanjing University | Method for automatically generating comic book from video based on style migration and voice recognition |
CN113743520A * | 2021-09-09 | 2021-12-03 | Guangzhou Mengying Animation Network Technology Co., Ltd. | Cartoon generation method, system, medium and electronic terminal |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105631914A * | 2014-10-31 | 2016-06-01 | Hongfujin Precision Industry (Wuhan) Co., Ltd. | Comic creation system and method |
CN108320319A * | 2018-02-02 | 2018-07-24 | Guangdong Fengzhushou Network Technology Co., Ltd. | Caricature synthesis method, device, equipment and computer readable storage medium |
Non-Patent Citations (2)
Title |
---|
Comic frame recognition and automatic ordering based on segmentation lines; Jiao Limin et al.; Computer Science; 2013-06-15; full text * |
Portrait sketch caricature generation system based on correlation analysis; Hua Bo et al.; Computer Applications and Software; 2015-07-15 (No. 07); full text * |
Also Published As
Publication number | Publication date |
---|---|
CN109859095A (en) | 2019-06-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||