CN114363712A - AI digital person video generation method, device and equipment based on templated editing - Google Patents


Info

Publication number
CN114363712A
CN114363712A (application CN202210039411.6A)
Authority
CN
China
Prior art keywords
video
broadcast
template
digital
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210039411.6A
Other languages
Chinese (zh)
Other versions
CN114363712B (en)
Inventor
刘玉婷
丁淑华
刘子健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Dlp Digital Technology Co ltd
Shenzhen Dillop Intelligent Technology Co ltd
Original Assignee
Shenzhen Dlp Digital Technology Co ltd
Shenzhen Dillop Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Dlp Digital Technology Co ltd, Shenzhen Dillop Intelligent Technology Co ltd filed Critical Shenzhen Dlp Digital Technology Co ltd
Priority to CN202210039411.6A priority Critical patent/CN114363712B/en
Publication of CN114363712A publication Critical patent/CN114363712A/en
Application granted granted Critical
Publication of CN114363712B publication Critical patent/CN114363712B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses an AI digital human video generation method, device and equipment based on templated editing. The method comprises: collecting audio and video data of a target object and training an AI model to generate an AI digital human image package; performing templated graphics-and-video editing and previewing on a 3D template and the image package to obtain a broadcast list containing broadcast items; combining the broadcast items, which contain blue-background preview videos, with text content to generate broadcast item manuscripts; performing AI inference with the AI digital human image model on each broadcast item manuscript to generate blue-background AI digital human videos and control instructions; and keying out the blue background of each AI digital human video, then rendering and compositing it with the corresponding 3D template to obtain a composite video. The invention belongs to the technical field of artificial intelligence: a composite video containing an AI digital human video is generated by templated editing, and AI digital human video content can be produced rapidly in batches merely by modifying template parameters, greatly improving video generation efficiency.

Description

AI digital person video generation method, device and equipment based on templated editing
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to an AI digital human video generation method, device and equipment based on templated editing.
Background
The operation flow of existing AI virtual-anchor and virtual-avatar products is anchor video acquisition -> data processing -> model training -> avatar output. In the production stage, video is generated and output from the trained anchor avatar according to the input manuscript and voice, plus optional anchor emotion, background pictures or videos, and standing or sitting postures. Such products can only drive expressions from text and sound, with no limb movement; they can only composite pictures or prefabricated videos as backgrounds and cannot composite three-dimensional graphics-packaging content; and they cannot control the playout of three-dimensional graphics-packaging content during video generation. Existing video generation methods based on AI virtual anchors therefore suffer from insufficient flexibility and low video generation efficiency.
Disclosure of Invention
The embodiments of the invention provide an AI digital human video generation method, device, equipment and medium based on templated editing, aiming to solve the problem that AI digital human video cannot be generated efficiently in the prior art.
In a first aspect, an embodiment of the present invention provides an AI digital human video generation method based on templated editing, where the method includes:
generating, according to a preset AI model, an AI digital human image package corresponding to audio and video data of a target object collected in a blue box, wherein the image package comprises a plurality of combined images for each target object and a blue-background preview video for each combined image; the image package may include a 2D AI digital human data packet and/or a 3D AI digital human data packet;
performing templated graphics-and-video editing and previewing on a 3D template and each blue-background preview video in the image package, to produce a broadcast list consisting of broadcast items corresponding to each blue-background preview video;
combining the broadcast items containing the blue-background preview videos with text content to generate a broadcast item manuscript for each broadcast item, wherein the text content comprises text, emotions and limb actions;
exporting the changeable/replaceable content in the broadcast items and broadcast item manuscripts as template parameters, so as to convert the broadcast list into a broadcast template;
performing AI inference on each broadcast item manuscript in the broadcast list, or in the broadcast list corresponding to the broadcast template, to generate either a corresponding blue-background AI digital human video and control instruction, or corresponding facial expression data, skeleton model data and a control instruction; keying the blue-background AI digital human video corresponding to each manuscript and then rendering and compositing it with the 3D template corresponding to that manuscript to obtain a composite video for each broadcast item; or rendering and compositing the facial expression data and skeleton model data corresponding to each manuscript, which drive an AI digital human 3D model, with the 3D template corresponding to that manuscript to obtain a composite video for each broadcast item; the control instruction is used for playout control of the broadcast content when the composite video is generated.
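The flow of the first aspect — build a broadcast list from the image package, attach manuscripts, export changeable content as template parameters — can be sketched as a minimal data model. All class and function names below are illustrative, not from the patent:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class BroadcastItem:
    item_id: str
    preview_video: str                   # path to the blue-background preview clip
    template_id: str                     # the 3D template this item is edited into
    manuscript: Optional[dict] = None    # text content + emotion/action markers

@dataclass
class BroadcastList:
    items: List[BroadcastItem] = field(default_factory=list)
    template_params: Dict[str, str] = field(default_factory=dict)

def build_broadcast_list(image_package: Dict[str, str], template_id: str) -> BroadcastList:
    """One broadcast item per blue-background preview video in the package."""
    bl = BroadcastList()
    for combined_image_id, preview in image_package.items():
        bl.items.append(BroadcastItem(combined_image_id, preview, template_id))
    return bl

def attach_manuscripts(bl: BroadcastList, scripts: Dict[str, dict]) -> None:
    """Combine each broadcast item with its text content (text/emotion/actions)."""
    for item in bl.items:
        item.manuscript = scripts.get(item.item_id)

def export_as_template(bl: BroadcastList, editable: Dict[str, str]) -> BroadcastList:
    """Export the changeable/replaceable content as template parameters,
    turning the broadcast list into a reusable broadcast template."""
    bl.template_params = dict(editable)
    return bl
```

Batch production then reduces to re-running the same broadcast template with different `template_params` values, which is where the claimed efficiency gain comes from.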
In a second aspect, an embodiment of the present invention provides an AI digital human video generating apparatus based on templated editing, including:
an image package generation unit for generating, according to a preset AI model, an AI digital human image package corresponding to the audio and video data of a target object collected in a blue box, wherein the image package comprises a plurality of combined images for each target object and a blue-background preview video for each combined image, and may include a 2D AI digital human data packet and/or a 3D AI digital human data packet;
a broadcast list generation unit for performing templated graphics-and-video editing and previewing on a 3D template and each blue-background preview video in the image package, to produce a broadcast list consisting of broadcast items corresponding to each blue-background preview video;
a broadcast item manuscript generation unit for combining the broadcast items containing the blue-background preview videos with text content to generate a broadcast item manuscript for each broadcast item, wherein the text content comprises text, emotions and limb actions;
a broadcast template acquisition unit for exporting the changeable/replaceable content in the broadcast items and broadcast item manuscripts as template parameters, so as to convert the broadcast list into a broadcast template;
a composite video generation unit for performing AI inference on each broadcast item manuscript in the broadcast list, or in the broadcast list corresponding to the broadcast template, to generate either a corresponding blue-background AI digital human video and control instruction, or corresponding facial expression data, skeleton model data and a control instruction; keying the blue-background AI digital human video corresponding to each manuscript and then rendering and compositing it with the 3D template corresponding to that manuscript to obtain a composite video for each broadcast item; or rendering and compositing the facial expression data and skeleton model data corresponding to each manuscript, which drive an AI digital human 3D model, with the 3D template corresponding to that manuscript to obtain a composite video for each broadcast item; the control instruction is used for playout control of the broadcast content when the composite video is generated.
In a third aspect, an embodiment of the present invention further provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, it implements the AI digital human video generation method based on templated editing according to the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the AI digital human video generation method based on templated editing according to the first aspect.
The embodiments of the invention provide an AI digital human video generation method, device and equipment based on templated editing. The method comprises: collecting audio and video data of a target object and training an AI model to generate an AI digital human image package; performing templated graphics-and-video editing and previewing on a 3D template and the image package to obtain a broadcast list containing broadcast items; combining the broadcast items, which contain blue-background preview videos, with text content to generate broadcast item manuscripts; performing AI inference with the AI digital human image model on each broadcast item manuscript to generate blue-background AI digital human videos and control instructions; and keying out the blue background of each AI digital human video, then rendering and compositing it with the corresponding 3D template to obtain a composite video. In this way, a composite video containing an AI digital human video is produced by templated editing, and AI digital human video content can be produced quickly in batches merely by modifying template parameters, greatly improving the generation efficiency of AI digital human video content.
Drawings
To illustrate the technical solutions of the embodiments more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the invention; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of an AI digital human video generation method based on templated editing according to an embodiment of the present invention;
FIG. 2 is a schematic sub-flow chart of a method for generating an AI digital human video based on templated editing according to an embodiment of the present invention;
FIG. 3 is a schematic view of another sub-flow chart of the AI digital human video generation method based on templated editing according to the embodiment of the present invention;
FIG. 4 is a schematic view of another sub-flow chart of the AI digital human video generation method based on templated editing according to the embodiment of the present invention;
FIG. 5 is a schematic view of another sub-process of the AI digital human video generation method based on templated editing according to the embodiment of the present invention;
FIG. 6 is a schematic block diagram of an AI digital human video generation device based on templated editing according to an embodiment of the present invention;
FIG. 7 is a schematic block diagram of a computer device provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the invention; all other embodiments derived from them by those skilled in the art without creative effort fall within the protection scope of the invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1, fig. 1 is a schematic flow chart of an AI digital human video generation method based on templated editing according to an embodiment of the present invention. The method is applied to a user terminal or a management server and is executed by application software installed there. The user terminal may be a desktop computer, notebook computer, tablet computer, mobile phone or other terminal device; the management server is a server that executes the method to generate AI anchor video content in a templated-editing manner, such as a server set up by an enterprise or a government department. As shown in fig. 1, the method includes steps S110 to S150.
S110: generating, according to a preset AI model, an AI digital human image package corresponding to the audio and video data of the target object collected in a blue box, wherein the image package comprises a plurality of combined images for each target object and a blue-background preview video for each combined image.
Specifically, the audio and video data of each target object can be collected in a blue box. The blue box is an enclosed backdrop for video shooting; despite the name, its backing color can be red, green, blue or another single color. An AI model can be trained on the features of the target object's audio and video data, and the AI digital human image package is generated from the trained AI digital human image model.
Specifically, video and audio data of the target object are collected, and an AI digital human image of the target object is generated on an AI digital human technology platform. A 2D AI digital human data packet is trained through the following steps: collect the target object's video and audio in the blue box and generate its 2D AI digital human image on the platform; collect the corresponding blue-box video matting parameters; generate an AI digital human preview video; and combine these with the target object's ID number into the 2D data packet. The training process for a 3D data packet is: shoot the target object from multiple angles, or 3D-scan it, and collect its audio; generate its 3D AI digital human image on the platform; generate a preview video; and combine it with the ID number into the 3D data packet. The group of data packets trained for a target object forms that object's AI digital human image package, which can be stored in an AI digital human image library for convenient lookup by ID number.
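The data packet contents listed above (image, digital human ID, matting parameters, preview video) and the per-object grouping can be modeled as simple records; the field names and default values below are illustrative, not from the patent:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class MattingParams:
    """Blue-box chroma-key parameters captured with the footage (2D packets only)."""
    key_color_bgr: Tuple[int, int, int] = (255, 0, 0)   # pure blue backdrop
    tolerance: int = 40
    edge_softness: int = 3

@dataclass
class AvatarDataPacket:
    digital_human_id: str                  # ID number of the target object
    kind: str                              # "2D" or "3D"
    preview_video: str                     # path to the preview video
    matting_params: Optional[MattingParams] = None

@dataclass
class AvatarPackage:
    """The group of data packets trained for one target object,
    stored in the image library for lookup by ID number."""
    target_id: str
    packets: List[AvatarDataPacket] = field(default_factory=list)

    def find(self, digital_human_id: str) -> Optional[AvatarDataPacket]:
        return next((p for p in self.packets
                     if p.digital_human_id == digital_human_id), None)
```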
The specific types of AI digital human image include 2D real-person and 3D real-person images, so the image package may comprise a 2D data packet and/or a 3D data packet. The 2D data packet includes, but is not limited to, the 2D AI digital human image, a digital human ID, the corresponding blue-box video matting parameters, and a preview video; the 3D data packet includes, but is not limited to, the 3D AI digital human image, a digital human ID, and a preview video.
In one embodiment, as shown in FIG. 2, step S110 includes sub-steps S111 and S112.
S111: training the AI model with the audio and video data of the target object collected in the blue box to obtain an AI digital human image model for each combined image; and S112: generating, from each AI digital human image model, the blue-background preview video for the corresponding combined image, wherein each combined image is a combination of a posture, a limb action and an expression.
Specifically, each target object may correspond to a plurality of combined images. A combined image is the combination of a specific posture, a specific limb action and a specific expression: for example, the posture may be sitting or standing, the limb action may be raising a hand to point or nodding, and the expression may be smiling or serious. The target object (a real person) records losslessly compressed video in segments in the blue box; the video resolution may be 1280x720, 1920x1080 or 3840x2160, and the head region requires a resolution of no less than 192x192 pixels. Videos are recorded for combinations of the two postures (standing and sitting), expressions such as smiling and surprise, and hand motions such as a raised-hand indication.
Each combined image corresponds to a segment of audio and video. For each segment, image features and audio features are extracted and fed into the AI model for training, with the features as input and the corresponding audio and video as the training target; the loss between the video frames and audio output during training and the training target is fed back to adjust the parameter values in the AI model. In this way, each combined image of each target object is trained into an AI digital human image model, and each model generates its corresponding blue-background preview video: with blue as the background, a segment of video frames and audio output by the model serves as the preview of that AI digital human image. The preview video may last, for example, 10, 15 or 30 seconds. An object-image label is added to each blue-background preview video in the image package; the label classifies the preview videos and enables fast search during later production.
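The loss-feedback loop described above (features in, audio/video target out, loss adjusts parameters) can be illustrated with a deliberately tiny stand-in model. A real system would train a deep generative model on high-dimensional features, but the feedback structure is the same; everything here is a toy for illustration only:

```python
class LinearAVModel:
    """Toy stand-in for the AI digital human image model: it learns
    target = w0 * image_feature + w1 * audio_feature by gradient steps."""
    def __init__(self):
        self.w = [0.0, 0.0]

    def forward(self, img_feat: float, aud_feat: float) -> float:
        return self.w[0] * img_feat + self.w[1] * aud_feat

    def fit_step(self, img_feat: float, aud_feat: float,
                 target: float, lr: float) -> float:
        """Feed the loss back to adjust the parameter values."""
        err = self.forward(img_feat, aud_feat) - target
        self.w[0] -= lr * 2.0 * err * img_feat
        self.w[1] -= lr * 2.0 * err * aud_feat
        return err * err            # squared-error loss for this sample

def train(model: "LinearAVModel", samples, lr: float = 0.05,
          epochs: int = 500) -> "LinearAVModel":
    """samples: (image_feature, audio_feature, target) triples,
    one per segment of collected audio and video."""
    for _ in range(epochs):
        for img, aud, target in samples:
            model.fit_step(img, aud, target, lr)
    return model
```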
Alternatively, the target object can be filmed from multiple angles for three-dimensional modeling, or a three-dimensional model image of the target object can be built by 3D scanning. When a three-dimensional model image is constructed directly, AI inference can, besides generating a blue-background AI digital human video, directly generate facial expression data and skeleton data to be sent to the compositing unit, which drives the AI digital human 3D model to render and composite the video.
And S120, performing templated image-text video editing and previewing on the 3D template and each blue background preview video in the AI digital human image package to manufacture a broadcast list consisting of broadcast items corresponding to each blue background preview video.
A 3D template and an AI digital human preview video are selected for graphics-and-video editing and previewing, and the parameters of the various graphic elements are set, to produce a broadcast list containing broadcast items.
In an embodiment, as shown in fig. 3, step S120 includes substeps S121 and S122.
S121: according to the image identifiers in the 3D template, obtaining from the AI digital human image package the blue-background preview video of the combined image corresponding to each identifier, as a scene element that replaces the identifier.
S122: after matting the blue-background preview video added to the 3D template to generate a video frame sequence with an Alpha channel, combining the sequence with the three-dimensional virtual elements in the 3D template, and performing templated graphics-and-video editing and previewing to produce the broadcast item for each target object.
Specifically, based on a modular graphics editing tool such as a VR editor or graphics column packaging, each AI digital human image is treated as an element in the three-dimensional scene and replaced by its corresponding blue-background preview video. After matting, a video frame sequence with an Alpha channel is generated and combined with the various three-dimensional virtual elements in the 3D template, such as three-dimensional scenes, three-dimensional objects, combined animation, three-dimensional simulation, Internet-of-Things information access, converged-media information access, big-data acquisition, arbitrary algorithm control, AI driving, visual display of charts and data, PPT, pictures, videos and text. The video is packaged and previewed in this modular graphics-packaging manner to produce broadcast items, and several broadcast items are combined into a broadcast list (or broadcast template). The playout modes of the broadcast items in the broadcast list include automatic, manual, timed, sequential, hot-key-triggered and VR-handle-triggered playout, and a broadcast item can be played alone or together with other broadcast items.
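The matting-then-compositing step — keying the blue background into an Alpha channel, then rendering the digital human over a 3D-template frame — can be sketched per frame with NumPy. This is a minimal distance-to-key-color matte; a production keyer would also do spill suppression and edge softening, and the key color and tolerance below are illustrative assumptions:

```python
import numpy as np

def key_out_blue(frame_bgr: np.ndarray, tolerance: float = 60.0) -> np.ndarray:
    """Chroma-key a blue-background frame into a BGRA frame whose Alpha
    channel is 0 on the backdrop and 255 on the digital human."""
    key = np.array([255, 0, 0], dtype=np.float32)          # pure blue in BGR
    dist = np.linalg.norm(frame_bgr.astype(np.float32) - key, axis=-1)
    alpha = np.where(dist < tolerance, 0, 255).astype(np.uint8)
    return np.dstack([frame_bgr, alpha])                    # H x W x 4 (BGRA)

def composite_over(template_bgr: np.ndarray, fg_bgra: np.ndarray) -> np.ndarray:
    """Alpha-blend the keyed digital human over a rendered 3D-template frame."""
    a = fg_bgra[..., 3:4].astype(np.float32) / 255.0
    out = fg_bgra[..., :3].astype(np.float32) * a \
        + template_bgr.astype(np.float32) * (1.0 - a)
    return out.astype(np.uint8)
```

Running both functions over every frame of the blue-background AI digital human video, with the template frame rendered by the 3D engine, yields the composite video frame sequence.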
In this embodiment, a research object is an object displayed by a 3D template. Its constituent elements include, but are not limited to: three-dimensional models (e.g., in formats such as .obj/.fbx/.3ds/.ac/.stl/.wrl/.igs), three-dimensional animation, special effects, converged-media access, algorithms, databases, data access, text, pictures, graphics and video. A three-dimensional model can be obtained either by directly importing an existing research object or by making a new one. When importing an existing research object, the relevant objects are selected in external modeling software and imported into the VR editor. To make a new research object, one can be built directly in the VR editor according to actual requirements, or built in external modeling software according to those requirements and transmitted to the VR editor through the association between the two.
On the display interface of an external device or any display terminal, the user selects how to obtain the three-dimensional model: a historical template can be imported directly from the database, or a new model can be built as needed by 3D-modeling one or more research objects in any manner; the template animation designer then reads in the obtained three-dimensional model.
In this embodiment, the 3D template shows the change process of the attributes and/or state values of the research object, which includes but is not limited to a three-dimensional model. A 3D template is a set of attribute/state changes of one or more research objects: based on the various 3D template elements, it produces attribute connections, attribute simulations and combined animations according to specific logical relations and state-change requirements, implements internally the parts that need no external connection, and exposes the parts that do, classified by attribute/state value, input and output.
A 3D template may be used alone or inside other 3D templates. In either case it can be imported directly by the VR editor, or imported and modified, including but not limited to the following: the research object can be replaced with an arbitrary object; different logical-relation triggers can be adopted; the mechanism model can be modified, including but not limited to physical formulas, mathematical functions and biochemical characteristics; and the associations of the attributes (both internal and external associations) can be changed, thereby modifying the data source on which the attribute-change process of the research object is based. That data source is not limited to the attribute associations set inside the 3D template; it can be any data structure (e.g., stocks, meteorological data, sensor data, and converged media such as microblogs, WeChat, short messages, self-media and websites) or any external program/algorithm.
When a 3D template is used, many attributes and/or state values need to be adjusted and modified in real time, so attribute association must be performed when the 3D template is made, making the corresponding real-time adjustment convenient at use time. The association modes fall into two types: internal association, in which attribute and/or state values are input, adjusted and modified directly, or assigned through function computation; and external association, in which an attribute is associated with an arbitrary external data structure, external program/algorithm, or the like, so that its attribute and/or state value updates in real time.
The first method of internal association is to input values directly to adjust and modify the corresponding attribute and/or state value. The second is to assign values computed through functions: custom script functions, such as y1 = sin(x) and y2 = 2x^2, are added to an attribute; for example, when x = 1, y1 = sin(1) and y2 = 2, and the computed result is assigned to the attribute parameter, realizing custom adjustment of the attribute value. External association covers associating the corresponding attribute and/or state value with any data structure, any program/algorithm, and the like. Any data structure includes, but is not limited to: stocks, meteorological data, sensor data, and converged-media access such as microblog, WeChat, short message, self-media and websites. Association with any program/algorithm includes, but is not limited to, programs written in any programming language that can read and modify any data in the 3D template.
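The two internal-association modes — direct input, and assignment through a custom script function such as y1 = sin(x) or y2 = 2x^2 — can be sketched as follows; the `TemplateAttribute` class and its method names are invented for illustration:

```python
import math

class TemplateAttribute:
    """One attribute of a 3D template, supporting the two internal
    association modes: direct value input, or a bound script function
    whose result is assigned to the attribute."""
    def __init__(self, name: str, value: float = 0.0):
        self.name = name
        self.value = value
        self._script = None

    def set_value(self, v: float) -> None:      # mode 1: direct input
        self.value = v

    def bind_script(self, fn) -> None:          # mode 2: custom script function
        self._script = fn

    def update(self, x: float) -> float:
        """Recompute the attribute from the bound function, if any."""
        if self._script is not None:
            self.value = self._script(x)
        return self.value
```

For example, binding `math.sin` reproduces y1 = sin(x), and binding `lambda x: 2 * x ** 2` reproduces y2 = 2x^2.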
External association achieves real-time update by linking to any external data structure. By reserving an interface for external data access and modules for various data sources, the user can associate the 3D template's attributes and/or state values with external data sources without programming, including but not limited to text files, Excel documents, ODBC data and SQL data, all associated in real time and updated into the 3D template's content in real time. For example: meteorological data can be connected to the relevant attributes of a model to complete a visual weather display for virtual simulation; converged-media information such as microblog, WeChat and website feeds can be connected to fuse information systems in real time; and motion-capture data can be connected in real time to support skeletal-animation editing and biological virtual simulation.
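External association — a reserved interface that pulls fresh values from an outside data source into a template attribute, with no programming by the user — can be sketched as a polling binding. Here `source` is any zero-argument callable standing in for a real feed (weather API, sensor, spreadsheet column); all names are hypothetical:

```python
from typing import Any, Callable, Dict

class ExternalBinding:
    """Binds one template attribute to an external data source.  Calling
    refresh() pulls the current value from the source and pushes it into
    the attribute, keeping the 3D template updated in real time."""
    def __init__(self, attributes: Dict[str, Any], key: str,
                 source: Callable[[], Any]):
        self.attributes = attributes
        self.key = key
        self.source = source

    def refresh(self) -> Any:
        self.attributes[self.key] = self.source()
        return self.attributes[self.key]
```

In a running template, `refresh()` would be invoked by the template's update loop each frame or on a timer, so a weather feed bound to a model's temperature attribute keeps the visual display current.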
The attribute external association can also control changes of attributes and/or state values through an external program/algorithm. For example, the flight attitude and power-system attributes of an aircraft model can be controlled in real time by an external flight-control algorithm to realize aircraft simulation, and flow-field parameters inside the 3D template can be controlled by a flow-field simulation control algorithm to realize aerodynamic-shape test simulation for a new-energy automobile, and the like.
In addition, attribute external association not only allows an external data structure or external program/algorithm to adjust and modify the 3D template's attributes and/or state values, but can also output the 3D template's attributes and/or state values to an external data structure or external algorithm, so that the 3D template can in turn drive or control external data and/or external algorithms.
And S130, combining the broadcast items containing the blue background preview video with text content to generate a broadcast item manuscript corresponding to each broadcast item, wherein the text content comprises text, emotion and limb actions.
The text content, which includes text, emotion, and limb action, can be combined with the corresponding broadcast item; the emotion and limb action serve as identifiers attached to the text. For example, the text of a weather forecast is matched with the combined image corresponding to a standing posture, a hand-raising gesture, and a smile. Based on templated image-text editing tools such as the VR editor and image-text column packaging, the corresponding text and identifiers such as emotion and action are input into the edited broadcast items containing the AI digital human model to generate the broadcast item manuscripts.
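A broadcast item manuscript as described above can be sketched as a simple record combining the text with its emotion and limb-action identifiers. The field names and the manuscript-builder function are illustrative assumptions, not the embodiment's actual data format.

```python
def make_manuscript(item_id, text, emotion, action):
    """Sketch of a broadcast-item manuscript: plain text plus emotion
    and limb-action identifiers attached to it (field names assumed)."""
    return {
        "item_id": item_id,
        "text": text,
        "emotion": emotion,  # identifier, e.g. "smile"
        "action": action,    # identifier, e.g. "raise_hand"
    }

# Hypothetical weather-forecast manuscript matching the example above.
doc = make_manuscript("weather_001",
                      "Tomorrow will be sunny with light winds.",
                      emotion="smile", action="raise_hand")
```

Such a record is what would be sent to the AI inference stage, which uses the identifiers to select matching expression and action footage.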
And S140, exporting the broadcasting items and the changeable/replaceable contents in the broadcasting item manuscript as template parameters so as to convert the broadcasting list into a broadcasting template.
The contents of the broadcast items that need to be changed or replaced when a video is generated are exported as template parameters, so that the play list is encapsulated into an easy-to-use broadcast template. When generating a video, videos with different contents can be produced merely by modifying the template parameters, avoiding the complex editing and production process of the play list.
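The export-and-substitute workflow can be sketched with plain string placeholders: changeable fragments are exported as named parameters, and new manuscripts are produced by substituting new values. The placeholder syntax and function names are assumptions for illustration; the patent does not specify a concrete serialization.

```python
import string

def export_template(manuscript_text, params):
    """Replace changeable fragments with named placeholders so the
    play list can be reused as a broadcast template."""
    for name, value in params.items():
        manuscript_text = manuscript_text.replace(value, "${" + name + "}")
    return manuscript_text

def fill_template(template_text, params):
    """Produce a new manuscript by substituting template parameters."""
    return string.Template(template_text).safe_substitute(params)

# Hypothetical manuscript with two changeable fragments.
original = "Today's high is 25 degrees in Shenzhen."
template = export_template(original, {"temp": "25", "city": "Shenzhen"})
new_doc = fill_template(template, {"temp": "18", "city": "Beijing"})
```

Once exported, only the parameter values need to change to produce a video with different content, which is the efficiency gain the text describes.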
S150, carrying out AI inference according to each broadcast item manuscript in the broadcast list or each broadcast item manuscript in the broadcast list corresponding to the broadcast template to generate a corresponding blue background AI digital human video and a control instruction, or generating corresponding facial expression data, skeleton model data and a control instruction; after keying the image of the AI digital human video with the blue background corresponding to each broadcast item manuscript, rendering and synthesizing the image with a 3D template corresponding to the broadcast item manuscript to obtain a synthesized video corresponding to each broadcast item; or, rendering and synthesizing the facial expression data and the skeleton model data corresponding to each broadcast item manuscript to drive an AI digital person 3D model and a 3D template corresponding to each broadcast item manuscript to obtain a synthesized video corresponding to each broadcast item, wherein the control instruction is used for performing broadcast control on broadcast contents when the synthesized video is generated.
For broadcast items in the play list that contain an AI digital human image, the corresponding manuscript content is input and sent to the AI inference unit to infer and generate AI digital human data; for the broadcast template, the corresponding manuscript content is modified by inputting template parameters and then sent to the AI inference unit to infer and generate AI digital human data. The AI digital human data includes: 1) an AI digital human video with a blue background, or 3D expression data and 3D limb-action data of the AI digital human, and the like; 2) voice data corresponding to the manuscript text; 3) broadcast control instructions. The manuscript content includes text, emotion, limb actions, and control instructions.
Specifically, the broadcast item manuscript can be pushed to an AI inference server to generate a blue background AI digital human video and control instructions. Videos whose mouth shapes, emotions, and limb actions match the text can be inferred from the AI digital human image library according to the text, emotion, and limb-action content in the broadcast item manuscript, and combined to generate an AI digital human video with a blue background. Keywords in the text are matched and parsed to obtain the corresponding control instructions; keywords corresponding to control instructions include 'play', 'next page', 'pause', and the like, and the control instructions are used to control PPT annotation, animation playback, text display, or video playback.
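The keyword-to-instruction matching can be sketched as a lookup over the manuscript text. The keyword list comes from the text ('play', 'next page', 'pause'); the instruction codes and the first-occurrence matching rule are assumptions for illustration.

```python
# Keyword table; the instruction identifiers on the right are hypothetical.
CONTROL_KEYWORDS = {
    "play": "PLAY_MEDIA",
    "next page": "PPT_NEXT_PAGE",
    "pause": "PAUSE_MEDIA",
}

def extract_control_instructions(text):
    """Scan manuscript text and emit a control instruction for every
    keyword found, ordered by position of first appearance."""
    hits = []
    lowered = text.lower()
    for kw, instr in CONTROL_KEYWORDS.items():
        pos = lowered.find(kw)
        if pos >= 0:
            hits.append((pos, instr))
    return [instr for _, instr in sorted(hits)]

cmds = extract_control_instructions("Please play the clip, then next page.")
```

The resulting instruction stream would then drive PPT annotation, animation playback, text display, or video playback during rendering.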
Alternatively, AI inference is performed on each broadcast item manuscript to generate corresponding facial expression data, skeleton model data, and control instructions. Specifically, facial expression data consistent with the mouth shape and emotion of the text can be inferred from the AI digital human image library according to the text and emotion in the broadcast item manuscript; the facial expression data records the change characteristics of the AI digital human's face, from which the facial video of the AI digital human can be restored. Skeleton model data consistent with the limb actions can be inferred from the AI digital human image library according to the limb actions in the broadcast item manuscript; the skeleton model data records the limb-action change characteristics of the AI digital human, from which the limb-action video of the AI digital human can be restored. Keywords in the text are matched and parsed to obtain the corresponding control instructions, which are used to control PPT annotation, animation playback, text display, or video playback.
The rendering and synthesizing unit receives the AI digital human data generated by the AI inference unit together with the corresponding play list or broadcast template, and renders and synthesizes them to obtain the AI digital human composite video. The control instructions are used to control the broadcast content when the composite video is generated, and the voice data is output synchronously with the AI digital human composite video.
The rendering server obtains the blue background AI digital human video corresponding to the broadcast item manuscript from the AI inference server, performs image matting, renders and synthesizes the result with various three-dimensional virtual elements, and outputs the final composite video. It also obtains the control instructions corresponding to the broadcast item manuscript from the AI inference server to control the broadcast content (PPT annotation, animation playback, text display, video playback, and the like), realizing interaction between the AI digital human and the broadcast content.
Specifically, the facial expression data and the bone model data corresponding to each broadcast entry manuscript may drive an AI digital human 3D model to be rendered and synthesized with the 3D template corresponding to the broadcast entry manuscript, so as to obtain a synthesized video corresponding to each broadcast entry.
That is, the facial expression data is rendered based on the AI digital human 3D model to obtain the facial video corresponding to the facial expression data, and the skeleton model data is rendered based on the AI digital human 3D model to obtain the limb-action video corresponding to the skeleton model data, while other objects such as hair and clothing remain consistent with the original content of the AI digital human 3D model. Speech is synthesized from the text in the broadcast item manuscript using the audio features in the AI digital human 3D model to obtain the spoken text content. The above information is combined and rendered into the AI digital human video corresponding to the facial expression data and the skeleton model, and the composite video is then generated based on the AI digital human video, as follows.
In one embodiment, as shown in fig. 4, step S150 includes sub-steps S151, S152, and S153.
S151, keying the blue background AI digital human video corresponding to each broadcast item manuscript to obtain the video matting.
The blue background AI digital human video can be keyed, that is, the blue background is matted out, leaving the video matting that contains the AI digital human image. The video matting can then be composited into a video as an element.
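A very simplified chroma-key sketch of this step is shown below: a pixel is treated as blue background when its blue channel dominates the other channels, and the resulting alpha channel makes the background transparent. The thresholding rule is an illustrative assumption; production keyers are considerably more sophisticated.

```python
import numpy as np

def blue_key_matte(frame_rgb, threshold=60):
    """Naive chroma-key sketch: mark a pixel as background when its
    blue channel exceeds max(red, green) by `threshold`, and return
    an RGBA frame whose alpha is 0 on the background."""
    r = frame_rgb[..., 0].astype(int)
    g = frame_rgb[..., 1].astype(int)
    b = frame_rgb[..., 2].astype(int)
    background = (b - np.maximum(r, g)) > threshold
    alpha = np.where(background, 0, 255).astype(np.uint8)
    return np.dstack([frame_rgb, alpha])

# 2x2 test frame: one pure-blue background pixel, three foreground pixels.
frame = np.array([[[0, 0, 255], [200, 180, 160]],
                  [[120, 80, 60], [90, 200, 40]]], dtype=np.uint8)
rgba = blue_key_matte(frame)
```

The RGBA output corresponds to the "video frame sequence with an Alpha channel" mentioned later in the text, ready to be composited with the 3D template's virtual elements.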
S152, adjusting the 3D template and configuring parameters.
In one embodiment, as shown in fig. 5, step S152 includes sub-steps S1521, S1522 and S1523.
S1521, adjusting the attribute and/or the state value of the research object through an internally set relationship to update the attribute and/or the state value of the research object in real time; and/or adjusting through the connection relation between the 3D templates to update the attribute and/or the state value of the research object in real time; and/or adjusting the connection relation between the attribute and/or the state value of the research object and external data and/or an external algorithm so as to update the attribute and/or the state value of the research object in real time.
When the 3D template is adjusted, corresponding trigger conditions may also be set. The trigger conditions include triggering of processes such as VR simulation, the start and end of VR animations, and intermediate state changes. The trigger conditions can be set in at least one of the following ways, including but not limited to: timeline sequential triggering, event triggering, and condition triggering.
Timeline sequential triggering means that VR animations, VR simulations, and the like are triggered in sequence at the times set by the editor. For example, if a car is edited to start and drive off at 1 minute 30 seconds on the corresponding timeline, the car starts and drives off when 1 minute 30 seconds have elapsed. Event triggering means that a VR animation or VR simulation is triggered when a certain event occurs: for example, a left mouse click on a fan makes the fan turn, or when the sky darkens and dark clouds drift in, the trees begin to sway. Condition triggering means that a VR animation or VR simulation is triggered when a certain condition is met: for example, water at standard atmospheric pressure boils when its temperature attribute reaches 100 °C or higher.
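The three trigger modes can be sketched with a small dispatcher that checks the timeline, pending events, and condition predicates on each tick. The class, method names, and action labels are illustrative assumptions; only the three modes themselves come from the text.

```python
class TriggerManager:
    """Sketch of the three trigger modes named in the text: timeline,
    event, and condition triggering (API names are assumptions)."""
    def __init__(self):
        self.timeline = []    # (time_sec, action)
        self.events = {}      # event name -> action
        self.conditions = []  # (predicate over state, action)
        self.fired = []       # actions triggered so far, in order

    def at_time(self, t, action):
        self.timeline.append((t, action))

    def on_event(self, name, action):
        self.events[name] = action

    def when(self, predicate, action):
        self.conditions.append((predicate, action))

    def tick(self, now, state, event=None):
        for t, action in self.timeline:
            if now >= t and action not in self.fired:
                self.fired.append(action)
        if event in self.events and self.events[event] not in self.fired:
            self.fired.append(self.events[event])
        for pred, action in self.conditions:
            if pred(state) and action not in self.fired:
                self.fired.append(action)

# The three worked examples from the text, as hypothetical actions.
mgr = TriggerManager()
mgr.at_time(90, "car_drives_off")                     # 1 min 30 s
mgr.on_event("mouse_left_click_fan", "fan_turns")
mgr.when(lambda s: s["water_temp"] >= 100, "water_boils")

mgr.tick(now=95, state={"water_temp": 100}, event="mouse_left_click_fan")
```

A real editor would evaluate such triggers every frame; the sketch evaluates all three modes in a single tick to show the dispatch logic.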
S1522, configuring relative position parameters between the video matting and the 3D template.
The corresponding relation between the video matting and the VR elements in the 3D template can be configured, such as adjusting the position of the video matting in the three-dimensional scene in the 3D template.
S1523, configuring position parameters of the subtitle templates in the 3D templates.
The position of the caption template can be configured in the 3D template, so that the characters in the manuscript can be displayed in the synthetic video through the caption, and the corresponding sound is output along with the generated synthetic video.
S153, generating VR simulations and VR animations based on the 3D template and the video matting according to the configured parameters, and triggering the attributes and/or states of the research objects in the 3D template to change to the corresponding values, to obtain the composite video corresponding to each broadcast item.
VR simulations and VR animations can be generated according to the configured parameters based on the 3D template and the video matting, and the attributes and/or states of the research objects in the 3D template can be triggered to change to the corresponding values. For example, an AR/VR model is enlarged, reduced, or otherwise adjusted according to the configured adjustment information, and the changes of the three-dimensional model materials in the 3D template are combined with the playback of the video matting and the broadcast content (PPT annotation, animation playback, text display, video playback, and the like) to generate the composite video corresponding to each broadcast item.
Any material can be edited, combined, transplanted, and mixed, including but not limited to: three-dimensional models, three-dimensional animations, special effects, converged-media feeds, algorithms, databases, data access, text, pictures, graphics, and videos. VR content is edited and produced based on the 3D template, which can be a template of multiple function types for any content. Any attribute values within a 3D template can be associated with one another, attribute values can be associated across 3D templates, and the 3D template can be associated with external data structures and external algorithms, all in support of VR simulation and VR animation. A powerful motion-effect editing function supports 3D track animation and keyframe animation, realizing animation of any coordinate parameter, attitude parameter, appearance parameter, or attribute of any research object (including but not limited to a three-dimensional model). During 3D template production, the output effect can be previewed at any time, so that animation and simulation effects can be modified in real time, realizing visual, interactive editing.
A research object can define any custom attribute, including but not limited to weight, speed, density, pressure, molecular weight, pH value, illumination, magnetic field intensity, resistivity, and hardness, and any association can be established among any attributes. Values can be assigned directly and modified in real time, or custom adjustment of attribute values can be realized by supplying mathematical functions. Various broadcast modes are supported, such as automatic, manual, timed, sequential, and hotkey-triggered broadcasting, and multiple broadcast items can be broadcast in combination to realize various combined animation broadcast modes. Editing and broadcasting are controlled based on timeline tracks and event-triggered modes, realizing complex, logic-driven dynamic virtual simulation of real scenes for virtual simulation and data visualization. Non-programming real-time connection of external data to any attribute is realized, as is real-time fusion of converged media such as social media and website information. Motion-capture data can be connected for the editing and production of skeletal animation and for biological virtual simulation. The edited VR content can be played on various terminals, including but not limited to VR all-in-one machines, helmets, stereoscopic LED large screens/projections, and naked-eye 3D screens.
Taking courseware production as an example: add a lecture 3D template (including a blackboard and a lectern); add a virtual large-screen model for playing and displaying courseware; select a combined image of the target object from the AI digital human image package and add the blue background preview video corresponding to that combined image to the current 3D template to generate a broadcast item. Set the relative positions of the three-dimensional lecture scene, the virtual large-screen model, and the AI digital human in the 3D template, and set entrance and exit animations for the virtual large screen (adjusting the parameters of the model elements in the 3D template). Add the text content of the lecture to obtain the blue background AI digital human video and the control instructions; after image matting, the AI digital human video is fused with the three-dimensional lecture scene, the text content, and the virtual large-screen model. Add a subtitle template and adjust its position so that the target object's manuscript subtitles are displayed when the video is generated. The virtual large screen displays courseware content such as pictures, videos, and PPT slides, and the composite video corresponding to the broadcast item is finally generated.
The above editing and production can further generate a template: the contents that need to be changed or replaced when generating a video can be exported as modifiable template parameters. When generating a video, videos with different contents can be produced merely by modifying the template parameters through a simple web page or mobile phone app, avoiding a complex editing and production process.
In the templated-editing-based AI digital human video generation method provided by the embodiment of the invention, audio and video data of a target object are collected for AI model training to generate an AI digital human image package; templated image-text video editing and previewing are performed on the 3D template and the AI digital human image package to produce a play list containing broadcast items; the broadcast items containing the blue background preview video are combined with text content to generate broadcast item manuscripts; AI inference is performed according to the AI digital human image model and each broadcast item manuscript to generate the blue background AI digital human video and control instructions; and the blue background AI digital human video is matted and then rendered and synthesized with the corresponding 3D template to obtain the composite video. In this way, a composite video containing the AI digital human video is generated by templated editing and production, and AI digital human video content can be produced rapidly in batches merely by modifying template parameters, greatly improving the generation efficiency of AI digital human video content.
The embodiment of the invention also provides an AI digital human video generating device based on templated editing, which can be configured in a user terminal or a management server, and is used for executing any embodiment of the aforementioned AI digital human video generating method based on templated editing. Specifically, referring to fig. 6, fig. 6 is a schematic block diagram of an AI digital human video generating device based on templated editing according to an embodiment of the present invention.
As shown in fig. 6, the AI digital human video generating apparatus 100 based on templated editing includes a character pack generating unit 110, a play list generating unit 120, a play item document generating unit 130, a play template acquiring unit 140, and a composite video generating unit 150.
The image package generating unit 110 is configured to generate, according to a preset AI model, an AI digital human image package corresponding to the audio and video data of the target object collected in the blue box, where the AI digital human image package includes a plurality of combined images corresponding to the target object and a blue background preview video corresponding to each combined image.
In a specific embodiment, the image packet generating unit 110 includes sub-units: the AI model training unit is used for training the AI model by using the audio and video data of the target object collected in the blue box to obtain an AI digital human image model corresponding to each combined image; and the blue background preview video generating unit is used for respectively generating a blue background preview video corresponding to each combined image according to the AI digital human image model, and each combined image is obtained by combining postures, limb actions and expressions.
And the play list generating unit 120 is configured to perform templated image-text video editing and previewing on the 3D template and each blue background preview video in the AI digital avatar package to create a play list composed of play items corresponding to each blue background preview video.
In a specific embodiment, the play-out list generating unit 120 includes sub-units: and the scene element replacing unit is used for acquiring a blue background preview video of the combined image corresponding to each image identifier in the AI digital human image packet as a scene element to replace the image identifiers according to the image identifiers in the 3D template. And the broadcast item generating unit is used for matting the blue background preview video added into the 3D template to generate a video frame sequence with an Alpha channel, combining the video frame sequence with the three-dimensional virtual elements in the 3D template, editing and previewing the templated image-text video, and making and obtaining the broadcast item corresponding to each target object.
A broadcast item manuscript generating unit 130, configured to combine broadcast items including the blue background preview video with text content to generate a broadcast item manuscript corresponding to each of the broadcast items, where the text content includes text, emotion, and body movement.
A broadcast template acquisition unit 140, configured to derive the broadcast items and the content that can be changed/replaced in the broadcast item document as template parameters, so as to convert the broadcast list into a broadcast template.
A composite video generating unit 150, configured to perform AI inference according to each broadcast item manuscript in the broadcast list or each broadcast item manuscript in the broadcast list corresponding to the broadcast template to generate a corresponding blue background AI digital human video and a control instruction, or generate corresponding facial expression data, skeleton model data, and a control instruction; after keying the image of the AI digital human video with the blue background corresponding to each broadcast item manuscript, rendering and synthesizing the image with a 3D template corresponding to the broadcast item manuscript to obtain a synthesized video corresponding to each broadcast item; or, rendering and synthesizing the facial expression data and the skeleton model data corresponding to each broadcast item manuscript to drive an AI digital person 3D model and a 3D template corresponding to the broadcast item manuscript to obtain a synthesized video corresponding to each broadcast item; and the control instruction is used for playing and controlling the playing content when the composite video is generated.
In a specific embodiment, the composite video generating unit 150 includes sub-units: and the image matting unit is used for matting the video of the blue background AI digital person corresponding to each broadcast item manuscript to obtain the video image matting. The adjusting and configuring unit is used for adjusting the 3D template and configuring parameters; and the video generating unit is used for generating VR simulation and VR animation based on the 3D template and the video matting according to the configured parameters and triggering the attribute and/or the state of a research object in the 3D template to change the corresponding value so as to obtain a composite video corresponding to each broadcast item.
In an embodiment, the adjusting and configuring unit includes a sub-unit: the adjusting unit is used for adjusting the attribute and/or the state value of the research object through an internally set relationship so as to update the attribute and/or the state value of the research object in real time; and/or adjusting through the connection relation between the 3D templates to update the attribute and/or the state value of the research object in real time; and/or adjusting based on the connection relation between the attribute and/or the state value of the research object and external data and/or an external algorithm so as to update the attribute and/or the state value of the research object in real time; the first position parameter configuration unit is used for configuring relative position parameters between the video matting and the 3D template; and the second position parameter configuration unit is used for configuring the position parameters of the caption template in the 3D template.
The AI digital human video generation device based on templated editing provided by the embodiment of the invention applies the AI digital human video generation method based on templated editing, acquires audio and video data of a target object to perform AI model training to generate an AI digital human image package, performs templated image-text video editing and previewing on a 3D template and the AI digital human image package to manufacture a broadcast list containing broadcast items, combines the broadcast items containing blue background preview video with text content to generate broadcast item manuscripts, performs AI inference according to the AI digital human image model and each broadcast item manuscript to generate blue background AI digital human video and control instructions, performs image matting on the blue background AI digital human video and then renders and synthesizes the blue background AI digital human video with a corresponding 3D template to obtain a synthesized video. By the method, the composite video containing the AI digital human video is generated by the template editing manufacturing method, and the AI digital human video content can be quickly manufactured in batches only by modifying the template parameters, so that the generation efficiency of the AI digital human video content is greatly improved.
The above-described AI digital human video generating apparatus based on templated editing may be implemented in the form of a computer program that can be run on a computer device as shown in fig. 7.
Referring to fig. 7, fig. 7 is a schematic block diagram of a computer device according to an embodiment of the present invention. The computer device may be a user terminal or a management server for executing the AI digital human video generation method based on templated editing, so as to generate AI digital human video content in a templated-editing manner.
Referring to fig. 7, the computer device 500 includes a processor 502, memory, and a network interface 505 connected by a system bus 501, where the memory may include a storage medium 503 and an internal memory 504.
The storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032, when executed, may cause the processor 502 to perform the AI digital human video generation method based on templated editing, where the storage medium 503 may be a volatile storage medium or a non-volatile storage medium.
The processor 502 is used to provide computing and control capabilities that support the operation of the overall computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 in the storage medium 503, and when the computer program 5032 is executed by the processor 502, the processor 502 can be caused to execute the AI digital human video generation method based on templated editing.
The network interface 505 is used for network communication, such as providing transmission of data information. Those skilled in the art will appreciate that the configuration shown in fig. 7 is a block diagram of only a portion of the configuration associated with aspects of the present invention and does not limit the computer device 500 to which aspects of the present invention may be applied; a particular computer device 500 may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the corresponding functions in the AI digital human video generation method based on templated editing as described above.
Those skilled in the art will appreciate that the embodiment of a computer device illustrated in fig. 7 does not constitute a limitation on the specific construction of the computer device, and that in other embodiments a computer device may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components. For example, in some embodiments, the computer device may only include a memory and a processor, and in such embodiments, the structures and functions of the memory and the processor are consistent with those of the embodiment shown in fig. 7, and are not described herein again.
It should be understood that, in the embodiment of the present invention, the processor 502 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
In another embodiment of the invention, a computer-readable storage medium is provided. The computer readable storage medium may be a volatile or non-volatile computer readable storage medium. The computer-readable storage medium stores a computer program, wherein the computer program, when executed by a processor, implements the steps included in the above-described AI digital human video generation method based on templated editing.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the apparatuses, devices, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both; to clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above in general terms of their functions. Whether such functions are implemented as hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementations should not be considered beyond the scope of the present invention.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus, device, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is only a logical functional division; in actual implementation there may be other divisions: units with the same function may be grouped into one unit, multiple units or components may be combined or integrated into another system, and some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment of the present invention.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer-readable storage medium. Based on this understanding, the part of the technical solution of the present invention that in essence contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a computer-readable storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned computer-readable storage media include various media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions will readily occur to those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. An AI digital human video generation method based on templated editing, the method comprising:
generating an AI digital human image package corresponding to audio and video data of a target object collected in a blue box according to a preset AI model, wherein the AI digital human image package comprises a plurality of combined images corresponding to each target object and a blue-background preview video corresponding to each combined image; the AI digital human image package may comprise a 2D AI digital human image data package and/or a 3D AI digital human image data package;
performing templated image-text video editing and previewing on the 3D template and each blue-background preview video in the AI digital human image package to produce a broadcast list composed of broadcast items corresponding to each blue-background preview video;
combining the broadcast items containing the blue-background preview videos with text content to generate a broadcast item manuscript corresponding to each broadcast item, wherein the text content comprises text, emotions, and body actions;
exporting the changeable/replaceable contents in the broadcast items and the broadcast item manuscripts as template parameters, so as to convert the broadcast list into a broadcast template;
performing AI inference according to each broadcast item manuscript in the broadcast list, or each broadcast item manuscript in the broadcast list corresponding to the broadcast template, to generate a corresponding blue-background AI digital human video and a control instruction, or to generate corresponding facial expression data, skeleton model data, and a control instruction; after matting the blue-background AI digital human video corresponding to each broadcast item manuscript, rendering and compositing it with the 3D template corresponding to that broadcast item manuscript to obtain a composite video corresponding to each broadcast item; or, rendering and compositing the AI digital human 3D model driven by the facial expression data and skeleton model data corresponding to each broadcast item manuscript with the 3D template corresponding to that broadcast item manuscript to obtain a composite video corresponding to each broadcast item; wherein the control instruction is used for playback control of the broadcast content when the composite video is generated.
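To picture the templated-editing idea of claim 1 — a broadcast list whose changeable fields are exported as template parameters, so that new broadcast item manuscripts can be produced in batches by substituting parameter values — here is a minimal, hypothetical sketch. All class, field, and function names are illustrative, not taken from the patent:

```python
from dataclasses import dataclass, field

@dataclass
class BroadcastItem:
    """One entry in the broadcast list: a preview video plus text content."""
    preview_video: str          # path to the blue-background preview video
    text: str                   # narration text, may contain {placeholders}
    emotion: str = "neutral"    # emotion tag driving the AI inference
    gesture: str = "none"       # body-action tag

@dataclass
class BroadcastTemplate:
    """A broadcast list whose changeable fields are exported as parameters."""
    items: list = field(default_factory=list)
    params: dict = field(default_factory=dict)

    def instantiate(self, **overrides):
        """Produce a concrete broadcast list by substituting template parameters."""
        merged = {**self.params, **overrides}
        return [
            BroadcastItem(
                preview_video=item.preview_video,
                text=item.text.format(**merged),
                emotion=item.emotion,
                gesture=item.gesture,
            )
            for item in self.items
        ]

template = BroadcastTemplate(
    items=[BroadcastItem("host.mp4", "Today's top story: {headline}")],
    params={"headline": "placeholder"},
)
items = template.instantiate(headline="AI digital humans enter broadcasting")
print(items[0].text)  # the parameterized manuscript text
```

Each instantiated manuscript would then be fed to AI inference to produce a blue-background video plus control instructions; only modifying parameters is needed to regenerate content.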
2. The AI digital human video generation method based on templated editing according to claim 1, wherein generating the AI digital human image package corresponding to the audio and video data of the target object collected in the blue box according to the preset AI model comprises:
training the AI model with the audio and video data of the target object collected in the blue box to obtain an AI digital human image model corresponding to each combined image;
and generating, according to the AI digital human image model, a blue-background preview video corresponding to each combined image, wherein each combined image is obtained by combining a posture, a body action, and an expression.
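Claim 2 states that each combined image is a combination of posture, body action, and expression. Enumerating those combinations — one preview video per triple — might look like the following hypothetical sketch (the attribute values are invented examples):

```python
from itertools import product

postures = ["standing", "seated"]
gestures = ["none", "wave", "point"]
expressions = ["neutral", "smile"]

# Each combined image is one (posture, gesture, expression) triple,
# for which the trained image model renders a blue-background preview video.
combined_images = [
    {"posture": p, "gesture": g, "expression": e}
    for p, g, e in product(postures, gestures, expressions)
]
print(len(combined_images))  # 2 * 3 * 2 = 12 combined images
```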
3. The AI digital human video generation method based on templated editing according to claim 1, wherein performing templated image-text video editing and previewing on the 3D template and each blue-background preview video in the AI digital human image package to produce a broadcast list composed of broadcast items corresponding to each blue-background preview video comprises:
obtaining, according to the image identifiers in the 3D template, the blue-background preview video of the combined image corresponding to each image identifier in the AI digital human image package as a scene element to replace that image identifier;
and after matting the blue-background preview video added to the 3D template to generate a video frame sequence with an Alpha channel, combining the video frame sequence with the three-dimensional virtual elements in the 3D template for templated image-text video editing and previewing, so as to produce a broadcast item corresponding to each target object.
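The blue-background matting step in claim 3 can be illustrated with a toy chroma-key sketch: pixels whose blue channel dominates red and green are treated as background and given zero alpha, yielding a frame with an Alpha channel. This is an assumption-laden simplification — production keyers use far more robust spill suppression and soft edges:

```python
import numpy as np

def blue_key_to_alpha(frame: np.ndarray, threshold: int = 60) -> np.ndarray:
    """Convert an RGB frame shot against a blue background into RGBA.

    A pixel is treated as background when its blue channel exceeds the
    larger of red and green by more than `threshold`; its alpha becomes 0.
    """
    r = frame[..., 0].astype(int)
    g = frame[..., 1].astype(int)
    b = frame[..., 2].astype(int)
    background = (b - np.maximum(r, g)) > threshold
    alpha = np.where(background, 0, 255).astype(np.uint8)
    return np.dstack([frame, alpha])

# A 2x2 test frame: pure-blue (background) and skin-tone (foreground) pixels.
frame = np.array([[[0, 0, 255], [0, 0, 255]],
                  [[220, 180, 150], [220, 180, 150]]], dtype=np.uint8)
rgba = blue_key_to_alpha(frame)
print(rgba[..., 3].tolist())  # [[0, 0], [255, 255]]
```

Applied per frame, this produces the Alpha-channel video frame sequence that is then composited with the 3D template's virtual elements.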
4. The AI digital human video generation method based on templated editing according to claim 1, wherein the broadcast items in the broadcast list are broadcast in automatic, manual, timed, sequential, hot-key-triggered, and VR-handle-triggered modes, and each broadcast item can be broadcast alone or in combination with other broadcast items.
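The trigger modes of claim 4 could be modeled as an enumeration with a dispatch function. Only sequential advance is sketched below; all names are illustrative, not drawn from the patent:

```python
from enum import Enum, auto

class TriggerMode(Enum):
    AUTOMATIC = auto()
    MANUAL = auto()
    TIMED = auto()
    SEQUENTIAL = auto()
    HOTKEY = auto()
    VR_HANDLE = auto()

def next_item(broadcast_list, current, mode):
    """Pick the next broadcast item; only sequential advance is implemented here."""
    if mode is TriggerMode.SEQUENTIAL:
        i = broadcast_list.index(current)
        return broadcast_list[(i + 1) % len(broadcast_list)]
    raise NotImplementedError(f"{mode.name} dispatch not sketched")

print(next_item(["intro", "news", "outro"], "news", TriggerMode.SEQUENTIAL))
```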
5. The AI digital human video generation method based on templated editing according to claim 1, wherein the control instruction is used for playback control such as PPT annotation, animation playback, text display, or video playback.
6. The AI digital human video generation method based on templated editing according to claim 1, wherein, after matting the blue-background AI digital human video corresponding to each broadcast item manuscript, rendering and compositing it with the 3D template corresponding to that broadcast item manuscript to obtain a composite video corresponding to each broadcast item comprises:
matting the blue-background AI digital human video corresponding to each broadcast item manuscript to obtain a video matte;
adjusting the 3D template and configuring parameters;
and generating, according to the configured parameters, VR simulations and VR animations based on the 3D template and the video matte, and triggering the attributes and/or states of the research objects in the 3D template to change to corresponding values, so as to obtain the composite video corresponding to each broadcast item.
7. The AI digital human video generation method based on templated editing according to claim 6, wherein adjusting the 3D template and configuring parameters comprises:
adjusting the attribute and/or state values of the research object through internally set relationships so as to update them in real time; and/or adjusting through the connection relationships between 3D templates so as to update the attribute and/or state values of the research object in real time; and/or adjusting based on the connection relationships between the attribute and/or state values of the research object and external data and/or external algorithms so as to update the attribute and/or state values of the research object in real time;
configuring the relative position parameters between the video matte and the 3D template;
and configuring the position parameters of the subtitle template in the 3D template.
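Claim 7's position parameters — the matte's placement relative to the 3D template and the subtitle position — can be pictured as a small layout configuration resolved to pixels at render time. This is a hypothetical sketch; the field names and normalized-coordinate convention are ours:

```python
from dataclasses import dataclass

@dataclass
class CompositeLayout:
    """Relative placement of the keyed digital-human matte inside a 3D template."""
    matte_offset: tuple = (0.6, 0.1)     # normalized (x, y) of the matte's top-left
    matte_scale: float = 0.5             # matte size relative to the template frame
    subtitle_anchor: tuple = (0.5, 0.9)  # normalized subtitle position

    def to_pixels(self, width: int, height: int) -> dict:
        """Resolve the normalized parameters to pixel coordinates for rendering."""
        x, y = self.matte_offset
        sx, sy = self.subtitle_anchor
        return {
            "matte_xy": (round(x * width), round(y * height)),
            "matte_wh": (round(self.matte_scale * width), round(self.matte_scale * height)),
            "subtitle_xy": (round(sx * width), round(sy * height)),
        }

layout = CompositeLayout()
print(layout.to_pixels(1920, 1080)["matte_xy"])  # (1152, 108)
```

Keeping the parameters normalized means one broadcast template can be rendered at any output resolution without re-editing.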
8. An AI digital human video generation apparatus based on templated editing, the apparatus comprising:
an image package generating unit, configured to generate an AI digital human image package corresponding to audio and video data of a target object collected in a blue box according to a preset AI model, wherein the AI digital human image package comprises a plurality of combined images corresponding to each target object and a blue-background preview video corresponding to each combined image; the AI digital human image package may comprise a 2D AI digital human image data package and/or a 3D AI digital human image data package;
a broadcast list generating unit, configured to perform templated image-text video editing and previewing on the 3D template and each blue-background preview video in the AI digital human image package, so as to produce a broadcast list composed of broadcast items corresponding to each blue-background preview video;
a broadcast item manuscript generating unit, configured to combine the broadcast items containing the blue-background preview videos with text content to generate a broadcast item manuscript corresponding to each broadcast item, wherein the text content comprises text, emotions, and body actions;
a broadcast template acquisition unit, configured to export the changeable/replaceable contents in the broadcast items and the broadcast item manuscripts as template parameters, so as to convert the broadcast list into a broadcast template;
a composite video generating unit, configured to perform AI inference according to each broadcast item manuscript in the broadcast list, or each broadcast item manuscript in the broadcast list corresponding to the broadcast template, to generate a corresponding blue-background AI digital human video and a control instruction, or to generate corresponding facial expression data, skeleton model data, and a control instruction; after matting the blue-background AI digital human video corresponding to each broadcast item manuscript, render and composite it with the 3D template corresponding to that broadcast item manuscript to obtain a composite video corresponding to each broadcast item; or render and composite the AI digital human 3D model driven by the facial expression data and skeleton model data corresponding to each broadcast item manuscript with the 3D template corresponding to that broadcast item manuscript to obtain a composite video corresponding to each broadcast item; wherein the control instruction is used for playback control of the broadcast content when the composite video is generated.
9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the AI digital human video generation method based on templated editing according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the AI digital human video generation method based on templated editing according to any one of claims 1 to 7.
CN202210039411.6A 2022-01-13 2022-01-13 AI digital person video generation method, device and equipment based on templated editing Active CN114363712B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210039411.6A CN114363712B (en) 2022-01-13 2022-01-13 AI digital person video generation method, device and equipment based on templated editing

Publications (2)

Publication Number Publication Date
CN114363712A true CN114363712A (en) 2022-04-15
CN114363712B CN114363712B (en) 2024-03-19

Family

ID=81110091

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210039411.6A Active CN114363712B (en) 2022-01-13 2022-01-13 AI digital person video generation method, device and equipment based on templated editing

Country Status (1)

Country Link
CN (1) CN114363712B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20080010564A (en) * 2006-07-27 2008-01-31 주식회사 와이즌와이드 System for multimedia naration using 3d virtual agent and method thereof
KR20080104415A (en) * 2007-03-26 2008-12-03 이병환 System and method of editing moving picture and recording medium having the method embodied program
KR20120130627A (en) * 2011-05-23 2012-12-03 한국전자통신연구원 Apparatus and method for generating animation using avatar
CN110059199A (en) * 2019-04-11 2019-07-26 深圳迪乐普智能科技有限公司 A kind of implementation method and 3D PowerPoint of 3D PowerPoint
CN112669165A (en) * 2019-09-27 2021-04-16 徐蔚 Unified access method applying digital personal code chain
CN110996112A (en) * 2019-12-05 2020-04-10 成都市喜爱科技有限公司 Video editing method, device, server and storage medium
CN113572977A (en) * 2021-07-06 2021-10-29 上海哔哩哔哩科技有限公司 Video production method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Zhang Yujing; Chen Shengjian: "Synthesizing Video from Image Sequences Using DES", Computer Technology and Development, no. 06, 10 June 2010 (2010-06-10) *
Wang Daini; Cong Zhihai; Chang Chang: "Research on a Video Editing and Playback System Based on Intermediate Files", New Media Research, no. 15, 23 August 2018 (2018-08-23) *
Deng Xiujun; Liu Mengqi: "A Study of the Self-Presentation Behavior of 'AI Face-Swap' Users under Gaze Awareness", Modern Communication (Journal of Communication University of China), no. 08, 15 August 2020 (2020-08-15) *
Yan Xiaoyang; Chen Naxin; Zhang Xiaodong; Li Xin: "Application of a Networked Automated Subtitle Production and Broadcasting System on the Broadcast Line", Radio & TV Broadcast Engineering, no. 12, 15 December 2008 (2008-12-15) *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114885206A (en) * 2022-04-24 2022-08-09 上海墨百意信息科技有限公司 Audio and video synthesis method, device and system and storage medium
CN116801043A (en) * 2022-04-28 2023-09-22 北京生数科技有限公司 Video synthesis method, related device and storage medium
CN116801043B (en) * 2022-04-28 2024-03-19 北京生数科技有限公司 Video synthesis method, related device and storage medium
CN115442519A (en) * 2022-08-08 2022-12-06 珠海普罗米修斯视觉技术有限公司 Video processing method, device and computer readable storage medium
CN115442519B (en) * 2022-08-08 2023-12-15 珠海普罗米修斯视觉技术有限公司 Video processing method, apparatus and computer readable storage medium
CN115810074A (en) * 2023-02-06 2023-03-17 武汉织梦动漫文化设计有限公司 Digital animation production method of digital ink-imitating woodcut
CN116193098A (en) * 2023-04-23 2023-05-30 子亥科技(成都)有限公司 Three-dimensional video generation method, device, equipment and storage medium
CN116193098B (en) * 2023-04-23 2023-07-21 子亥科技(成都)有限公司 Three-dimensional video generation method, device, equipment and storage medium
CN117077722A * 2023-09-07 2023-11-17 北京中科江南信息技术股份有限公司 AI digital intelligent human construction method and device

Also Published As

Publication number Publication date
CN114363712B (en) 2024-03-19

Similar Documents

Publication Publication Date Title
CN114363712B (en) AI digital person video generation method, device and equipment based on templated editing
KR102658960B1 (en) System and method for face reenactment
KR102503413B1 (en) Animation interaction method, device, equipment and storage medium
EP2370969A2 (en) System and methods for dynamically injecting expression information into an animated facial mesh
CN111667557B (en) Animation production method and device, storage medium and terminal
US6924803B1 (en) Methods and systems for a character motion animation tool
CN113302622A (en) System and method for providing personalized video
CN114266621A (en) Image processing method, image processing system and electronic equipment
KR20230026344A (en) Customization of text messages in editable videos in multimedia messaging applications
CN113763518A (en) Multi-mode infinite expression synthesis method and device based on virtual digital human
KR101977893B1 (en) Digital actor managing method for image contents
CN115362474A (en) Scoods and hairstyles in modifiable video for custom multimedia messaging applications
CN111598983A (en) Animation system, animation method, storage medium, and program product
KR20110045719A (en) Animation production method, computer readable medium in which program for executing the method is stored and animation production system in online using the method
US9396574B2 (en) Choreography of animated crowds
KR102108422B1 (en) System and Method for Optimizing Facial Expression of Virtual Characters through AI-based Facial Expression Classification and Retargeting, and Computer Readable Storage Medium
EP4276828A1 (en) Integrated media processing pipeline
KR101780496B1 (en) Method for producing 3D digital actor image based on character modelling by computer graphic tool
CN113891079A (en) Automatic teaching video generation method and device, computer equipment and storage medium
Berson et al. Intuitive facial animation editing based on a generative RNN framework
JP2017167619A (en) Generation method of three-dimensional content, program and client device
CN114741541A (en) Interactive control method and device for interactive control of AI digital person on PPT (Power Point) based on templated editing
Liu Light image enhancement based on embedded image system application in animated character images
WO2024027285A1 (en) Facial expression processing method and apparatus, computer device and storage medium
Bai et al. Bring Your Own Character: A Holistic Solution for Automatic Facial Animation Generation of Customized Characters

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant