CN112995537A - Video construction method and system - Google Patents

Video construction method and system

Info

Publication number
CN112995537A
CN112995537A
Authority
CN
China
Prior art keywords
video
representation
input information
information
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110175132.8A
Other languages
Chinese (zh)
Other versions
CN112995537B (en)
Inventor
张旻晋
许达文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Shihaixintu Microelectronics Co ltd
Original Assignee
Chengdu Shihaixintu Microelectronics Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Shihaixintu Microelectronics Co., Ltd.
Priority to CN202110175132.8A
Publication of CN112995537A
Application granted
Publication of CN112995537B
Legal status: Active (granted)


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N 5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/50 Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20212 Image combination
    • G06T 2207/20221 Image fusion; Image merging

Abstract

The invention discloses a video construction method and system. The method first performs feature conversion on multiple kinds of input information that describe the same video to obtain feature representation information for each input; it then obtains a representation abstract model view and a representation video for each input in turn, fuses all representation videos into a fused image set, harmonizes that set, and finally outputs the harmonized fused image set as the constructed video of all the inputs, yielding a smooth video work. The method can generate videos for different styles and scenes, fuses and harmonizes the generated videos into a smooth final work, and accelerates the parallelizable operations, reducing computation and memory usage and lightening the workload of edge devices so that they can construct videos quickly.

Description

Video construction method and system
Technical Field
The invention relates to the technical field of video animation, in particular to a video construction method and a video construction system.
Background
Deep learning perception algorithms give electronic devices accurate semantic perception capabilities, such as text-based, speech-based, and image-based semantic recognition, and provide a sound methodological basis for a device to describe and characterize its environment and the user's intent. Video construction based on semantic information has likewise achieved good results in character video prediction and generation, and the ability to generate video from speech, text, and images improves design efficiency in industries such as animation, media, education, and architecture.
Current intelligent algorithms can generate videos of human poses, expressions, mouth shapes, gestures, and scenes from simple planar compositions, and can be trained to generate painted images in particular styles. However, each of the intelligent methods in current use predicts only one type of video, whereas real-time applications require predicting multiple types of video and fusing the differently predicted videos.
In addition, current video construction methods involve enormous amounts of computation, and their running time on terminal devices can hardly meet users' requirements.
Disclosure of Invention
To overcome these technical shortcomings, the invention provides a video construction method and system that can generate videos of different styles and scenes from different input types, and that fuses and harmonizes the generated videos to construct a smooth final video work.
The invention is realized by the following technical scheme:
the video construction method provided by the scheme comprises the following steps:
s1, respectively carrying out feature conversion on multiple kinds of input information describing the same video to obtain feature representation information of each input information;
s2, respectively matching the characteristic representation information of each input information with an abstract model library to generate a representation abstract model view based on each input information;
s3, respectively inputting representation abstract model views of all input information into a video generation algorithm model to generate corresponding representation videos;
s4, performing imaging processing on each representation video respectively to obtain a group of fusion image sets;
and S5, performing harmony processing on the fusion image set to generate a harmony fusion atlas, and outputting the harmony fusion atlas serving as a constructed video of all input information.
The working principle of this scheme is as follows. With the video construction method provided here, different pieces of description information describing the same video can serve simultaneously as input information for constructing a complete video. Current intelligent methods can generate videos of human poses, expressions, mouth shapes, gestures, and scenes from simple planar compositions, and can be trained to generate painted images in particular styles; however, each existing method can only construct video from a single input format, whereas real-time applications require predicting multiple types of video and fusing the differently predicted videos. The method of this scheme accepts video description information in several formats at once (for example, speech description information as the first input, and image description information, or image plus speech description information, as the second input), processes the multiple inputs synchronously to obtain a representation video for each, fuses all representation videos into a fused image set, and finally yields a complete video containing the characteristics of all the inputs. This video construction method not only generates videos of different styles and scenes but also processes multiple inputs synchronously, effectively improving video construction efficiency.
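To make the data flow of steps S1 to S5 concrete, the following minimal Python sketch runs the five stages end to end on toy data. Every function in it (extract_features, match_abstract_view, generate_video, fuse_frames, harmonize) is a stand-in invented for illustration, not the patented implementation; the matching, fusion, and harmonization stages are refined in later sections.

```python
import numpy as np

# Toy stand-ins for the five stages (all hypothetical, for illustration only).
def extract_features(x):                          # S1: feature conversion
    return np.asarray(x, dtype=float).ravel()

def match_abstract_view(feat, library):           # S2: nearest library entry wins
    best = min(library, key=lambda m: np.linalg.norm(feat - m["feature"]))
    return best["view"]

def generate_video(view, n_frames=3):             # S3: trivial "generator"
    return [view.copy() for _ in range(n_frames)]

def fuse_frames(frames):                          # S4: toy fusion = pixel average
    return np.mean(frames, axis=0)

def harmonize(frame):                             # S5: identity placeholder
    return frame

def construct_video(inputs, library):
    feats  = [extract_features(x) for x in inputs]              # S1
    views  = [match_abstract_view(f, library) for f in feats]   # S2
    videos = [generate_video(v) for v in views]                 # S3
    fused  = [fuse_frames(list(fr)) for fr in zip(*videos)]     # S4
    return [harmonize(f) for f in fused]                        # S5

# Two "inputs" describing the same 2x2 scene, and a two-entry library.
library = [{"feature": np.zeros(4), "view": np.zeros((2, 2))},
           {"feature": np.ones(4),  "view": np.ones((2, 2))}]
video = construct_video([[0, 0, 0, 0], [1, 1, 1, 1]], library)
print(len(video), video[0].shape)   # 3 constructed frames of shape (2, 2)
```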
In a further optimization, the input information in S1 is one or more of hand-drawn abstract sketches, speech, text, and images.
In a further optimization, the feature representation information in S1 includes: text representation information, semantic representation information, and feature map representation information.
In a further optimization, the abstract model library in S2 includes, but is not limited to, abstract model information for poses, gestures, mouth shapes, expressions, or scenes, and its representation forms include, but are not limited to, vector data, coordinate sets, and point cloud data.
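By way of illustration, one possible layout for a library entry that carries all three representation forms named above (vector data, a coordinate set, a point cloud) is sketched below. The schema and field names are assumptions; the patent does not fix a data layout.

```python
from dataclasses import dataclass, field
from typing import List, Tuple
import numpy as np

@dataclass
class AbstractModelEntry:
    """Hypothetical library record; the patent only requires that entries
    cover poses, gestures, mouth shapes, expressions, or scenes and be
    represented as vector data, coordinate sets, or point cloud data."""
    category: str                 # "pose" | "gesture" | "mouth" | "expression" | "scene"
    feature: np.ndarray           # vector data matched against in S2
    coordinates: List[Tuple[float, float]] = field(default_factory=list)  # 2-D coordinate set
    point_cloud: np.ndarray = field(default_factory=lambda: np.empty((0, 3)))  # N x 3 points

entry = AbstractModelEntry(
    category="pose",
    feature=np.random.rand(128),
    coordinates=[(0.1, 0.2), (0.4, 0.8)],   # e.g. skeleton keypoints
    point_cloud=np.random.rand(50, 3),      # e.g. a coarse 3-D body scan
)
```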
In a further optimization, the imaging processing of each input's representation video in S4 proceeds as follows:
S41, performing image segmentation on each frame of each input's representation video, so that each frame yields a plurality of segmented image blocks;
S42, labelling each segmented image block as foreground or background, so that each input yields a scene set;
S43, fusing the foreground and background of each scene set frame by frame to generate a fused image set.
In a further optimization, the foreground-background fusion method includes, but is not limited to: spatial-domain fusion, transform-domain fusion, and neural-network-based image fusion.
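A minimal sketch of steps S42 and S43 with the spatial-domain variant follows: a brightness threshold stands in for a real segmentation model when labelling foreground, and labelled foreground pixels are composited over a background frame. The thresholding rule and the function names are illustrative assumptions.

```python
import numpy as np

def label_foreground(frame, thresh=0.5):
    """S42 stand-in: mark pixels above a brightness threshold as foreground.
    A real system would derive the mask from image segmentation or
    semantic recognition instead."""
    return frame > thresh

def fuse_fg_bg(scene_frames):
    """S43, spatial-domain variant: composite each labelled foreground over
    the first scene's frame, which serves as the background."""
    fused = scene_frames[0].copy()           # background frame
    for frame in scene_frames[1:]:
        mask = label_foreground(frame)
        fused[mask] = frame[mask]            # paste foreground pixels
    return fused

# Frame-by-frame fusion of two representation videos (lists of frames).
grass  = [np.random.rand(64, 64) for _ in range(3)]   # background video
person = [np.random.rand(64, 64) for _ in range(3)]   # foreground video
fused_set = [fuse_fg_bg([g, p]) for g, p in zip(grass, person)]
```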
In a further optimization, the method for labelling the foreground or background includes, but is not limited to, image-segmentation-based labelling and image-semantics-based labelling.
In a further optimization, the harmonization method in step S5 is a neural-network-based image harmonization method.
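The text fixes only that harmonization is neural-network-based. The sketch below applies a stand-in encoder-decoder network, written in PyTorch, to each fused frame; the architecture is an assumption chosen for illustration and echoes the encoder-decoder structure used in Example 2 below.

```python
import torch
import torch.nn as nn

class TinyHarmonizer(nn.Module):
    """Stand-in encoder-decoder harmonization network (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.encode = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.decode = nn.Conv2d(16, 3, 3, padding=1)

    def forward(self, x):                     # x: (N, 3, H, W) fused frames
        return torch.sigmoid(self.decode(self.encode(x)))

harmonizer = TinyHarmonizer().eval()          # pretrained weights assumed in practice
fused_set = [torch.rand(1, 3, 64, 64) for _ in range(3)]   # output of S4
with torch.no_grad():
    harmonized_set = [harmonizer(f) for f in fused_set]    # output of S5
```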
Based on the video construction method, the invention also provides a video construction system, which comprises: a first feature extraction device, an abstract model matching device, a representation video generation device, an imaging processing device, a fusion device, and a computing device;
the first feature extraction device is used for performing feature conversion on each piece of input information to obtain its feature representation information;
the abstract model matching device is used for matching the feature representation information of each input against an abstract model library to generate a representation abstract model view based on each input;
the representation video generation device inputs the representation abstract model view of each input into a video generation algorithm model to generate a corresponding representation video;
the imaging processing device performs imaging processing on each representation video to obtain a scene set for each representation video;
the fusion device fuses the scene sets of the representation videos to generate a fused image set;
and the computing device executes a harmonization algorithm on the fused image set to generate a harmonized fused image set, which is output as the constructed video of all the input information.
In a further optimization, the input information is one or more of hand-drawn abstract pictures, speech, text, and images.
The video construction system also comprises a memory module, a table look-up matching module, a parallel operation acceleration module and a main control processor module;
the memory module comprises a storage medium, is used for transmitting and temporarily storing external input information, and temporarily storing the weight of the intelligent algorithm, the characteristic representation information, the abstract characteristic library, the abstract model view, the representation video, the image segmentation result, the scene graph set, the fusion image set and the constructed video;
the table look-up matching module internally comprises a matching operation unit and an address mapping unit and is used for searching and matching the information in the abstract characteristic library in the memory aiming at the characteristic representation information and outputting an abstract model view aiming at the characteristic representation information;
the parallel operation acceleration module internally comprises at least one parallel operation processing unit and is used for accelerating the execution of harmony processing operation in the information characteristic extraction, video generation operation, image segmentation nursing and harmony image set;
the main control processor module internally comprises a control logic module and a processor module and is used for controlling the memory module, searching the matching module, accelerating data transmission among the modules through parallel operation, executing fusion processing calculation aiming at the foreground and the background and performing non-parallel operation in the video construction method.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the invention provides a video construction method and a video construction system, which can realize video generation aiming at different styles and scenes, simultaneously carry out fusion and harmonious processing on the generated videos, and finally construct smooth video works; the video construction system formed by connecting the video construction device and the interactive equipment can construct the video works with the styles and descriptions required by the users according to various input information.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention.
FIG. 1 is a flow chart of a video construction method of the present invention;
fig. 2 is a schematic diagram of the video construction system of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.
Example 1
As shown in fig. 1, this embodiment provides a video construction method, which specifically includes the steps of:
step 1, performing feature conversion on the first input information to acquire first feature representation information for the first input information;
step 2, matching the first representation information obtained in step 1 against an abstract model library to generate a first representation abstract model view based on the first input information;
step 3, the video generation algorithm model executes the video generation operation on the first abstract model view generated in step 2 to generate a first representation video for the first input information;
step 4, performing feature conversion on the second input information to acquire second feature representation information for the second input information; matching the second representation information against the abstract model library to generate a second representation abstract model view based on the second input information; inputting the second abstract model view into the video generation algorithm model and executing the video generation operation to generate a second representation video for the second input information;
step 5, performing image segmentation on each frame of the first representation video from step 3 and each frame of the second representation video from step 4 to obtain a segmentation result for each frame of each video;
step 6, labelling the segmented image blocks in each frame from step 5 as foreground or background to obtain a first scene set and a second scene set;
step 7, fusing the foreground and background of the labelled first and second scene sets frame by frame to generate a fused image set;
and step 8, harmonizing each frame in the fused image set generated in step 7 to generate a harmonized fused image set and obtain the constructed video.
It should be noted that the input information includes, but is not limited to, hand-drawn abstract sketches, speech, text, and images;
the feature representation information includes, but is not limited to, text, semantic, and feature map information;
the abstract feature library includes, but is not limited to, abstract information for poses, gestures, mouth shapes, expressions, and scenes, and its representation forms include, but are not limited to, vector data, coordinate sets, and point cloud data.
The input information includes, but is not limited to, first, second, and third input information, and at least one representation video is generated.
It should be noted that the foreground/background labelling method in step 6 includes, but is not limited to, image-segmentation-based labelling and image-semantics-based labelling;
the foreground-background image fusion method in step 7 includes, but is not limited to, spatial-domain fusion, transform-domain fusion, and neural-network-based image fusion;
and the image harmonization method in step 8 includes, but is not limited to, neural-network-based image harmonization.
Example 2
Scene video construction in an animation task is taken as a further embodiment.
This embodiment takes the construction of a "scene where a character walks on a grassland" in an animation video as an example: an initial image and a text description serve as input information; an ink-wash-style abstract model library is used; a generative adversarial network performs the video generation; an encoder-decoder neural network performs the image segmentation with foreground/background labelling and the image harmonization; and a deep-neural-network-based image fusion method fuses each frame of the videos.
step s1, performing feature conversion on the original hand-drawn character picture and its descriptive text as input information to acquire feature representation information for the character;
step s2, matching the character feature representation information obtained in step s1 against the ink-wash-style abstract model library to generate a representation abstract model view based on the character information;
step s3, the generative adversarial network algorithm model executes the video generation operation on the character representation abstract model view to generate a first representation video for the character;
step s4, performing feature conversion on the hand-drawn grassland sketch and its accompanying text to acquire representation information for the grassland; matching the grassland representation information against the ink-wash-style abstract model library to generate a representation abstract model view based on the grassland information; inputting the grassland abstract model view into the video generation algorithm model and executing the video generation operation to generate a second representation video of the ink-wash-style grassland;
step s5, performing image segmentation on each frame of the character's first representation video from step s3 and the grassland's second representation video from step s4 to obtain a segmentation result for each frame of each video;
step s6, labelling the segmented image blocks in each frame from step s5 as foreground or background to obtain a character scene set and a grassland scene set;
step s7, fusing the foreground and background of the labelled character and grassland scene sets frame by frame to generate a fused image set;
and step s8, harmonizing each frame in the fused image set generated in step s7 to generate a harmonized fused image set and obtain the ink-wash-style video of the character walking on the grassland.
Example 3
As shown in fig. 2, the video construction system provided in this embodiment includes: a first feature extraction device, an abstract model matching device, a representation video generation device, an imaging processing device, a fusion device, and a computing device;
the first feature extraction device is used for performing feature conversion on each piece of input information to obtain its feature representation information;
the abstract model matching device is used for matching the feature representation information of each input against an abstract model library to generate a representation abstract model view based on each input;
the representation video generation device inputs the representation abstract model view of each input into a video generation algorithm model to generate a corresponding representation video;
the imaging processing device performs imaging processing on each representation video to obtain a scene set for each representation video;
the fusion device fuses the scene sets of the representation videos to generate a fused image set;
and the computing device executes a harmonization algorithm on the fused image set to generate a harmonized fused image set, which is output as the constructed video of all the input information.
The system also comprises a memory module, a table look-up matching module, a parallel operation acceleration module, and a main control processor module;
the memory module comprises a storage medium and is used for transferring and temporarily storing external input information, as well as the intelligent-algorithm weights, feature representation information, the abstract feature library, abstract model views, representation videos, image segmentation results, scene sets, fused image sets, and the constructed video;
the table look-up matching module internally comprises a matching operation unit and an address mapping unit, and is used for searching and matching the feature representation information against the abstract feature library in memory and outputting the corresponding abstract model view;
the parallel operation acceleration module internally comprises at least one parallel processing unit and is used for accelerating feature extraction, video generation, image segmentation, and the harmonization of the fused image set (see the sketch after this module list);
the main control processor module internally comprises a control logic module and a processor module, and is used for controlling data transfer among the memory module, the table look-up matching module, and the parallel operation acceleration module, executing the foreground-background fusion computation, and performing the non-parallel operations of the video construction method.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A video construction method, comprising the steps of:
s1, respectively carrying out feature conversion on multiple kinds of input information describing the same video to obtain feature representation information of each input information;
s2, respectively matching the characteristic representation information of each input information with an abstract model library to generate a representation abstract model view based on each input information;
s3, respectively inputting representation abstract model views of all input information into a video generation algorithm model to generate corresponding representation videos;
s4, performing imaging processing on each representation video respectively to obtain a group of fusion image sets;
and S5, performing harmony processing on the fusion image set to generate a harmony fusion atlas, and outputting the harmony fusion atlas serving as a constructed video of all input information.
2. The video construction method according to claim 1, wherein the input information in S1 is in the form of one or more of hand-drawn abstract sketches, speech, text, and images.
3. The video construction method according to claim 1, wherein the feature representation information in S1 includes: text representation information, semantic representation information, and feature map representation information.
4. The video construction method according to claim 1, wherein the abstract model library in S2 includes, but is not limited to, abstract model information for poses, gestures, mouth shapes, expressions, or scenes, and its representation forms include, but are not limited to, vector data, coordinate sets, and point cloud data.
5. The video construction method according to claim 1, wherein the imaging processing of each input's representation video in S4 comprises:
S41, performing image segmentation on each frame of each input's representation video, so that each frame yields a plurality of segmented image blocks;
S42, labelling each segmented image block as foreground or background, so that each input yields a scene set;
S43, fusing the foreground and background of each scene set frame by frame to generate a fused image set.
6. The video construction method according to claim 5, wherein the foreground-background fusion method includes, but is not limited to: spatial-domain fusion, transform-domain fusion, and neural-network-based image fusion.
7. The video construction method according to claim 5, wherein the method for labelling the foreground or background includes, but is not limited to, image-segmentation-based labelling and image-semantics-based labelling.
8. The video construction method according to claim 1, wherein the harmonization method in S5 is a neural-network-based image harmonization method.
9. A video construction system for use in the video construction method of any one of claims 1 to 8, comprising: a first feature extraction device, an abstract model matching device, a representation video generation device, an imaging processing device, a fusion device, and a computing device;
the first feature extraction device is used for performing feature conversion on each piece of input information to obtain its feature representation information;
the abstract model matching device is used for matching the feature representation information of each input against an abstract model library to generate a representation abstract model view based on each input;
the representation video generation device inputs the representation abstract model view of each input into a video generation algorithm model to generate a corresponding representation video;
the imaging processing device performs imaging processing on each representation video to obtain a scene set for each representation video;
the fusion device fuses the scene sets of the representation videos to generate a fused image set;
and the computing device executes a harmonization algorithm on the fused image set to generate a harmonized fused image set, which is output as the constructed video of all the input information.
10. A video construction system according to claim 9, wherein the input information is one or more of hand-drawn abstract sketches, speech, text, and images.
CN202110175132.8A 2021-02-09 2021-02-09 Video construction method and system Active CN112995537B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110175132.8A CN112995537B (en) 2021-02-09 2021-02-09 Video construction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110175132.8A CN112995537B (en) 2021-02-09 2021-02-09 Video construction method and system

Publications (2)

Publication Number Publication Date
CN112995537A true CN112995537A (en) 2021-06-18
CN112995537B CN112995537B (en) 2023-02-24

Family

ID=76347959

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110175132.8A Active CN112995537B (en) 2021-02-09 2021-02-09 Video construction method and system

Country Status (1)

Country Link
CN (1) CN112995537B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060217979A1 (en) * 2005-03-22 2006-09-28 Microsoft Corporation NLP tool to dynamically create movies/animated scenes
US20120130717A1 (en) * 2010-11-19 2012-05-24 Microsoft Corporation Real-time Animation for an Expressive Avatar
WO2016001793A1 (en) * 2014-06-30 2016-01-07 Mario Amura Electronic image creating, image editing and simplified audio/video editing device, movie production method starting from still images and audio tracks and associated computer program
JP2018124890A (en) * 2017-02-03 2018-08-09 日本電信電話株式会社 Image processing apparatus, image processing method, and image processing program
US20190164257A1 (en) * 2017-11-30 2019-05-30 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Image processing method, apparatus and device
CN108986186A (en) * 2018-08-14 2018-12-11 山东师范大学 The method and system of text conversion video
CN109522908A (en) * 2018-11-16 2019-03-26 董静 Image significance detection method based on area label fusion
CN110717054A (en) * 2019-09-16 2020-01-21 清华大学 Method and system for generating video by crossing modal characters based on dual learning
CN111669515A (en) * 2020-05-30 2020-09-15 华为技术有限公司 Video generation method and related device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
于海涛 (YU Haitao) et al., "Adversarial Video Generation Method Based on Multimodal Input" (基于多模态输入的对抗式视频生成方法), Journal of Computer Research and Development (计算机研究与发展) *

Also Published As

Publication number Publication date
CN112995537B (en) 2023-02-24

Similar Documents

Publication Publication Date Title
US11783491B2 (en) Object tracking method and apparatus, storage medium, and electronic device
CN110021051B (en) Human image generation method based on generation of confrontation network through text guidance
CN109086683B (en) Human hand posture regression method and system based on point cloud semantic enhancement
Rastgoo et al. Sign language production: A review
CN111709497B (en) Information processing method and device and computer readable storage medium
CN114550177B (en) Image processing method, text recognition method and device
CN111160164A (en) Action recognition method based on human body skeleton and image fusion
CN109712108B (en) Visual positioning method for generating network based on diversity discrimination candidate frame
CN113704531A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
KR20200059993A (en) Apparatus and method for generating conti for webtoon
CN113761153A (en) Question and answer processing method and device based on picture, readable medium and electronic equipment
CN115393872B (en) Method, device and equipment for training text classification model and storage medium
CN110427864B (en) Image processing method and device and electronic equipment
CN117078790B (en) Image generation method, device, computer equipment and storage medium
CN112995537B (en) Video construction method and system
CN116309983B (en) Training method and generating method and device of virtual character model and electronic equipment
CN116468886A (en) Scene sketch semantic segmentation method and device based on strokes
CN116167014A (en) Multi-mode associated emotion recognition method and system based on vision and voice
CN113052156B (en) Optical character recognition method, device, electronic equipment and storage medium
CN115775300A (en) Reconstruction method of human body model, training method and device of human body reconstruction model
CN114120443A (en) Classroom teaching gesture recognition method and system based on 3D human body posture estimation
CN111382301A (en) Three-dimensional model generation method and system based on generation countermeasure network
Zhang et al. Expression recognition algorithm based on CM-PFLD key point detection
WO2024066549A1 (en) Data processing method and related device
Chen et al. Two-stream lightweight sign language transformer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant