CN114741541A - Interactive control method and device for interactive control of AI digital person on PPT (Power Point) based on templated editing - Google Patents


Info

Publication number
CN114741541A
Authority
CN
China
Prior art keywords
ppt
digital
page
frame
digital human
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210369117.1A
Other languages
Chinese (zh)
Other versions
CN114741541B (en)
Inventor
吴天生
Current Assignee
Guangdong Polytechnic Normal University
Original Assignee
Guangdong Polytechnic Normal University
Priority date
Filing date
Publication date
Application filed by Guangdong Polytechnic Normal University filed Critical Guangdong Polytechnic Normal University
Priority to CN202210369117.1A
Publication of CN114741541A
Application granted
Publication of CN114741541B
Legal status: Active (current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/40: Information retrieval of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/43: Querying
    • G06F 16/438: Presentation of query results
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048: Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0484: Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a templated-editing-based method and device for interactive control of a PPT by an AI digital person. The method comprises the following steps: determining attribute connection relations between media material elements; generating a playing template according to the set combination of display area and display mode; extracting page information from a PPT file and constructing a 2D mapping relation between the lecture text content and the PPT pages; performing inference on the lecture text content to generate AI digital human video frames, AI digital human voice frames and subtitle frames; adding the AI digital human video frame and the page picture to the playing template for display; adding a prompt graphic and a laser pen identifier in the corresponding display area according to the 2D mapping relation to render an image frame; and synchronously outputting the image frame and the AI digital human voice frame. The invention belongs to the technical field of artificial intelligence: by adding a prompt graphic and a laser pen identifier in the corresponding display area according to the 2D mapping relation, PPT background materials can be controlled synchronously during video synthesis, greatly extending the application functions of AI digital human videos.

Description

Interactive control method and device for interactive control of AI digital person on PPT (Power Point) based on templated editing
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an AI digital person-to-PPT interactive control method and device based on templated editing.
Background
The operation flow of existing AI virtual anchor and virtual image products is: anchor video acquisition -> data processing -> model training -> image output. In the production and output stage, video is generated from a trained anchor image according to the input manuscript and voice, plus optional anchor emotions, background pictures, videos, standing postures, sitting postures and the like. However, in current AI digital human video generation systems data flows in a single direction: only pictures and videos can be input as backgrounds to be synthesized with the AI digital human, and background materials such as 3D pictures, videos and 3D templates cannot be controlled synchronously during synthesis; in particular, PPT content cannot be annotated synchronously. The generated AI digital human videos therefore lack interaction functions, which limits their application. In short, the prior art cannot synchronously control PPT background materials while synthesizing AI digital human videos.
Disclosure of Invention
The embodiment of the invention provides an interactive control method, device, equipment and medium of an AI digital person to PPT based on templated editing, aiming at solving the problem that PPT background materials cannot be synchronously controlled in the process of synthesizing AI digital person videos in the prior art.
In a first aspect, an embodiment of the present invention provides a method for controlling interaction between an AI digital person and a PPT based on templated editing, where the method includes:
importing media material elements and generating attribute connection, attribute simulation and combined animation corresponding to the media material elements according to specific logical relations and state change requirements;
setting a display area and a display mode of a PPT page and a combination mode of AI digital persons and the attribute connection, attribute simulation and combined animation according to the input setting parameters, thereby generating a corresponding playing template;
importing a PPT file into the playing template, and extracting page information corresponding to each PPT page in the PPT file, wherein the page information comprises page pictures and lecture text contents corresponding to each PPT page;
establishing a 2D mapping relation between the text content of the lecture and a PPT page;
transmitting the lecture text content to an AI digital human reasoning model to generate an AI digital human video frame, an AI digital human voice frame and a subtitle frame corresponding to the rendered synthetic frame;
displaying an AI digital man video frame in an AI digital man display area of the playing template according to the set combination mode of the AI digital man and the attribute connection, the attribute simulation and the combined animation;
displaying the page picture in the PPT display area of the playing template according to the display mode according to the display area and the display mode of the PPT page;
acquiring, according to the 2D mapping relation between the lecture text content and the PPT page, the region coordinates on the PPT page of the text content corresponding to the subtitle frame, and generating a prompt graphic for the text content corresponding to the current AI digital human voice frame;
rendering and drawing, according to the 2D mapping relation between the lecture text content and the PPT page, a circular bright point serving as a laser pen identifier at the region coordinates on the PPT page of the text content corresponding to the subtitle frame, so as to render and generate an image frame;
and synchronously outputting the rendered and synthesized image frame and the current AI digital human voice frame to a video board card, recording them to a file, or generating network push-stream output.
In a second aspect, an embodiment of the present invention provides an interactive control device for PPT based on AI digital people with templated editing, including:
the media material element attribute acquisition unit is used for importing media material elements and generating attribute connection, attribute simulation and combined animation corresponding to the media material elements according to specific logical relations and state change requirements;
the playing template generating unit is used for setting, according to the input setting parameters, the display area and display mode of the PPT page and the combination mode of the AI digital person with the attribute connection, attribute simulation and combined animation, thereby generating a corresponding playing template;
the page information extraction unit is used for importing a PPT file into the playing template and extracting page information corresponding to each PPT page in the PPT file, wherein the page information comprises page pictures and lecture text contents corresponding to each PPT page;
the mapping relation construction unit is used for establishing a 2D mapping relation between the text contents of the lecture and the PPT page;
the reasoning generation unit is used for sending the text content of the lecture to the AI digital human reasoning model so as to generate an AI digital human video frame, an AI digital human voice frame and a subtitle frame corresponding to the rendering synthetic frame;
the AI digital human video frame display unit is used for displaying AI digital human video frames in an AI digital human display area of the playing template according to the set combination mode of the AI digital human and the attribute connection, the attribute simulation and the combined animation;
the page picture display unit is used for displaying the page picture in the PPT display area of the playing template according to the display area and the display mode of the PPT page;
the prompt graphic generation unit is used for acquiring the region coordinates of the text content corresponding to the caption frame on the PPT page according to the 2D mapping relation between the lecture text content and the PPT page, and generating a prompt graphic of the text content corresponding to the current AI digital person voice frame;
the image frame rendering unit is used for rendering and drawing, according to the 2D mapping relation between the lecture text content and the PPT page, a circular bright point serving as a laser pen identifier at the region coordinates on the PPT page, so as to render and generate an image frame;
and the output unit is used for synchronously outputting the rendered and synthesized image frame and the current AI digital human voice frame to a video board card, recording them to a file, or generating network push-stream output.
In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the method for controlling interaction between a PPT and an AI digital person based on templated editing according to the first aspect.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, causes the processor to execute the method for controlling interaction between PPT and AI digital persons based on templated editing according to the first aspect.
The embodiment of the invention provides a templated-editing-based AI digital person-to-PPT interactive control method, device, equipment and medium. The method determines the attribute connection relations among media material elements; generates a playing template according to the set combination of display area and display mode; extracts page information from a PPT file and constructs a 2D mapping relation between the lecture text content and the PPT pages; performs inference on the lecture text content to generate AI digital human video frames, AI digital human voice frames and subtitle frames; adds the AI digital human video frame and the page picture to the playing template for display; adds a prompt graphic and a laser pen identifier to the corresponding display area of the PPT page according to the 2D mapping relation to render and generate an image frame; and synchronously outputs the image frame and the AI digital human voice frame. In this way, page information can be extracted from the PPT file, a corresponding 2D mapping relation constructed, and the prompt graphic and laser pen identifier added to the corresponding display area of the PPT page according to that relation, so that PPT background materials are controlled synchronously while the AI digital human video is synthesized, greatly extending the application functions of AI digital human videos.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart illustrating an interactive control method for PPT by AI digital people based on templated editing according to an embodiment of the present invention;
fig. 2 is a sub-flow diagram of an interaction control method for PPT by AI digital people based on templated editing according to an embodiment of the present invention;
fig. 3 is another flowchart illustrating an AI digital person-to-PPT interaction control method based on templated editing according to an embodiment of the present invention;
fig. 4 is a schematic sub-flow chart of an interaction control method for PPT by AI digital people based on templated editing according to an embodiment of the present invention;
fig. 5 is a schematic sub-flow chart of a method for controlling interaction between an AI digital person and a PPT based on templated editing according to an embodiment of the present invention;
fig. 6 is a schematic sub-flowchart of a method for interactive control of PPT by AI digital people based on templated editing according to an embodiment of the present invention;
fig. 7 is a schematic block diagram of an interactive control device for PPT by AI digital people based on templated editing according to an embodiment of the present invention;
fig. 8 is a schematic block diagram of a computer device provided in an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1, fig. 1 is a schematic flowchart of an interactive control method for PPT by an AI digital person based on templated editing according to an embodiment of the present invention. The method is applied to a user terminal or a management server and is executed through application software installed in the user terminal or the management server. The user terminal executes the method to generate, from parameter information input by the user and a PPT file, synchronized output to a video board card, recording to a file, or network push-stream output; the user terminal can be a terminal device such as a desktop computer, a notebook computer, a tablet computer or a mobile phone. The management server is a server side, such as one built inside an enterprise, that executes the method on the parameter information and PPT file uploaded by the user terminal to produce the same outputs. As shown in fig. 1, the method includes steps S101 to S110.
S101, importing media material elements and generating attribute connection, attribute simulation and combined animation corresponding to the media material elements according to specific logical relations and state change requirements.
Specifically, the user can import various media material elements through the template making unit, the media material elements include pictures, videos, flash, webpages, streaming media, 3D models, animations, sounds and the like, and various attribute connections, attribute simulations and combined animations of the various media material elements are generated according to specific logical relations and state change requirements.
And S102, setting, according to the input setting parameters, the display area and display mode of the PPT page and the combination mode of the AI digital person with the attribute connection, attribute simulation and combined animation, thereby generating a corresponding playing template.
The user can also set parameters through the template making unit, namely, the display area and mode (such as windowing or full screen) of the PPT page and the synthesis mode (such as picture-in-picture, full screen and the like) of the AI digital person and the image-text elements can be set according to the input set parameters, so that a corresponding playing template is generated, and the image-text elements are the attribute connection, the attribute simulation and the combined animation.
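The template settings described above can be sketched as a small configuration object. This is an illustrative data model only; all names (`PlayTemplate`, `make_template`, the field names and defaults) are assumptions, not the patent's actual structures.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a play template: PPT display area/mode plus the
# composition mode of the AI digital human with the graphic-text elements
# (attribute connections, attribute simulations, combined animations).

@dataclass
class PlayTemplate:
    ppt_area: tuple          # (x, y, width, height) of the PPT display area
    ppt_mode: str            # "windowed" or "fullscreen"
    human_area: tuple        # display area of the AI digital human
    compose_mode: str        # "picture_in_picture" or "fullscreen"
    animations: list = field(default_factory=list)  # combined animations, etc.

def make_template(params: dict) -> PlayTemplate:
    """Build a play template from user-supplied setting parameters."""
    return PlayTemplate(
        ppt_area=tuple(params.get("ppt_area", (0, 0, 1280, 720))),
        ppt_mode=params.get("ppt_mode", "windowed"),
        human_area=tuple(params.get("human_area", (900, 360, 380, 360))),
        compose_mode=params.get("compose_mode", "picture_in_picture"),
        animations=params.get("animations", []),
    )
```

Unspecified parameters fall back to illustrative defaults, mirroring how a template-making unit might fill in unset options.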
And S103, importing a PPT file into the playing template, and extracting page information corresponding to each PPT page in the PPT file, wherein the page information comprises page pictures and lecture text contents corresponding to each PPT page.
The manufactured playing template can be opened in the rendering synthesis unit, and the user can import a PPT file into it. The PPT file comprises a plurality of PPT pages; a page picture of each PPT page is obtained from the PPT file, and the lecture text content of each PPT page is extracted.
In an embodiment, as shown in fig. 2, step S103 includes sub-steps S131 and S132.
S131, generating a page picture corresponding to each page in the PPT file.
For example, a screen shot can be performed on a PPT page in a PPT file to obtain a corresponding page picture.
S132, extracting the text content of the lecture of each page in the PPT file.
For example, for a PPT page in text mode, the content of the page can be read directly to obtain the corresponding lecture text content; for a PPT page whose content is embedded as a picture, OCR (Optical Character Recognition) technology may be adopted to extract the corresponding lecture text content from the PPT page, or the lecture text content may be input manually.
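The two extraction paths above (direct text read versus OCR fallback) can be sketched as a single dispatch function. `Page`, its fields, and the `ocr_fn` callback are illustrative assumptions; the patent does not name its data structures.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Illustrative sketch: read text directly when the page carries a text
# layer, otherwise fall back to an OCR callback (or, failing that, signal
# that manual input is needed by returning an empty string).

@dataclass
class Page:
    text: Optional[str]      # present for text-mode pages
    image: Optional[bytes]   # rendered page picture for image-mode pages

def extract_lecture_text(page: Page, ocr_fn: Callable[[bytes], str]) -> str:
    if page.text:                      # text-mode PPT page: read directly
        return page.text
    if page.image is not None:         # picture-embedded page: OCR fallback
        return ocr_fn(page.image)
    return ""                          # nothing recoverable: manual input
```

In practice `ocr_fn` would wrap a real OCR engine; here it is just a stub parameter.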
And S104, establishing a 2D mapping relation between the text content of the lecture and the PPT page.
Specifically, two virtual windows can be constructed: one displays the lecture text content and the other displays the PPT page picture. A text segment, delimited by punctuation marks, is selected in the lecture-text window, and the display region coordinates of the corresponding text segment are selected in the PPT page window, thereby establishing the 2D mapping relation from the lecture text to the PPT page.
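The mapping step above can be illustrated with a minimal sketch: cut the lecture text into segments at punctuation marks, pair each segment with the region coordinates selected on the page picture, and look regions up by segment. The function names and region tuples are assumptions for illustration.

```python
import re

def split_segments(lecture_text: str) -> list:
    """Split lecture text into segments at Chinese/Western punctuation."""
    parts = re.split(r"[，。；！？,.;!?]", lecture_text)
    return [p.strip() for p in parts if p.strip()]

def build_mapping(lecture_text: str, regions: list) -> dict:
    """Pair each text segment with its (x, y, w, h) region on the PPT page."""
    segments = split_segments(lecture_text)
    return dict(zip(segments, regions))

def region_for(mapping: dict, subtitle: str):
    """Look up the page region for the subtitle currently being spoken."""
    return mapping.get(subtitle)
```

During playback, the subtitle frame being voiced keys into this mapping to place the prompt graphic and the laser-pen bright point.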
And S105, transmitting the text content of the lecture to the AI digital human reasoning model so as to generate an AI digital human video frame, an AI digital human voice frame and a subtitle frame corresponding to the rendered synthetic frame.
The obtained lecture text content can be sent to a pre-trained AI digital human reasoning model for reasoning so as to generate corresponding AI digital human video, AI digital human voice and caption, and corresponding AI digital human video frame, AI digital human voice frame and caption frame are respectively obtained based on the AI digital human video, the AI digital human voice and the caption.
In one embodiment, step S105 is preceded by step S1501, as shown in fig. 3.
S1501, generating, according to a preset AI model, an AI digital human reasoning model corresponding to the audio and video data of target objects collected in a blue box, wherein the AI digital human reasoning model comprises a plurality of combined images corresponding to each target object and a blue-background preview video corresponding to each combined image; the AI digital human reasoning model comprises a 2D AI digital human and/or a 3D AI digital human.
Specifically, the audio and video data of each target object can be collected in a blue box. The blue box is an external scene for video shooting, and can be blue, green, red or another single color. An AI model can then be trained on the characteristics of the target object's audio and video data to obtain the AI digital human reasoning model, which can contain an AI digital human image library.
Specifically, the video and audio data of the target object are collected, and an AI digital human reasoning model of the target object is generated based on an AI digital human technology platform. The training process of the 2D AI digital human reasoning model comprises: collecting the video and audio of the target object in a blue box and generating a 2D AI digital human image of the target object based on the AI digital human technology platform; collecting the corresponding blue-box video keying parameters of the target object; generating an AI digital human image preview video; and generating, in combination with the ID number of the target object, an AI digital human reasoning model containing the 2D AI digital human image. The training process of the 3D AI digital human reasoning model comprises: shooting the target object from multiple angles or 3D-scanning it while collecting its audio, and generating a 3D AI digital human image of the target object based on the AI digital human technology platform; generating an AI digital human image preview video; and generating, in combination with the ID number of the target object, an AI digital human reasoning model containing the AI digital human image. For each target object, the group of AI digital human images generated by training forms the AI digital human image of that object, which can be stored in the AI digital human image library of the AI digital human reasoning model for convenient retrieval and use by ID number.
The specific types of AI digital human images comprise 2D real persons and 3D real persons. The AI digital human image library can store 2D and/or 3D AI digital human images, where a 2D AI digital human image includes, but is not limited to, the 2D AI digital human image itself, the digital human ID, the corresponding blue-box video matting parameters and a preview video, and a 3D AI digital human image includes, but is not limited to, the 3D AI digital human image itself, the digital human ID and a preview video.
Specifically, each target object may correspond to a plurality of combined images. A combined image is formed by combining a specific posture, a specific body movement and a specific expression: for example, the posture is sitting or standing, the body movement is a hand-raising indication or a head nod, and the expression is smiling or serious. For example, the target object (a person) records losslessly compressed video in segments in a blue box; the video resolution can be 1280x720P, 1920x1080P or 3840x2160P, and the head region requires a resolution of no less than 192 pixels by 192 pixels. Videos are recorded as combinations of the two postures (standing and sitting), expressions such as smiling and surprise, and hand motions such as a hand-raising indication.
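The recording constraints just quoted (allowed resolutions and the minimum head-region size) can be checked mechanically. This is a hedged sketch: the function and parameter names are illustrative, and only the numeric constraints come from the text above.

```python
# Constraints taken from the description: three allowed segment resolutions,
# and a head region of at least 192 x 192 pixels.
ALLOWED_RESOLUTIONS = {(1280, 720), (1920, 1080), (3840, 2160)}

def check_recording(resolution: tuple, head_region: tuple) -> bool:
    """Return True if a blue-box recording segment meets the stated constraints."""
    if resolution not in ALLOWED_RESOLUTIONS:
        return False
    head_w, head_h = head_region
    return head_w >= 192 and head_h >= 192
```

A capture tool could run such a check before accepting a segment into the training set.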
Each combined image corresponds to a section of audio and video. The image features and audio features of each section are extracted and input into the AI model for training: the image and audio features serve as the input, the corresponding audio and video serve as the training target, and the parameter values in the AI model are adjusted by feeding back the loss between the video frames and audio output during training and the training target. Each combined image of each target object can thus be trained into an AI digital human reasoning model, and each model generates a corresponding blue-background preview video, i.e. a section of video frames and audio output by the model against a blue background, used as the preview video of the AI digital human image; the preview video duration can be, for example, 10, 15 or 30 seconds. An object image label is added to each blue-background preview video so that the preview videos can be classified and identified, facilitating fast search during post-production.
The video of the target object can be shot from a plurality of angles for three-dimensional modeling, or a three-dimensional model image of the target object can be constructed by three-dimensional scanning. Because this approach directly constructs a three-dimensional model of the target object, during AI inference it can not only generate a blue-background AI digital human video but also directly generate facial expression data and skeleton data to build an AI digital human reasoning model based on the AI digital human 3D model, and the video is rendered and synthesized through that model.
Intelligent recognition can be performed on the input lecture text content to obtain information such as the emotional features and limb action features corresponding to it; the lecture text content, emotional features, limb action features and other information are then sent to the AI reasoning unit to generate AI digital person data by inference. The AI digital person data includes: 1) a blue-background AI digital human video, or AI digital human 3D expression data and AI digital human 3D limb action data; and 2) voice data corresponding to the manuscript text.
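The two-part AI digital person data enumerated above (either a blue-background video or 3D expression plus skeleton data, together with the voice data) suggests a simple data model. The class and field names below are assumptions made for illustration, not the patent's actual definitions.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative container for the AI digital person data described above:
# the 2D route carries a blue-background video; the 3D route carries
# expression and skeleton data; both carry voice data for the manuscript.

@dataclass
class AIDigitalHumanData:
    voice: bytes                              # speech for the manuscript text
    blue_bg_video: Optional[bytes] = None     # 2D route: blue-background video
    expression_3d: Optional[list] = None      # 3D route: facial expression data
    skeleton_3d: Optional[list] = None        # 3D route: limb/skeleton data

    def is_3d(self) -> bool:
        """True when the 3D route (expression + skeleton data) is used."""
        return self.blue_bg_video is None and self.expression_3d is not None
```

A renderer could branch on `is_3d()` to choose between chroma-keying the 2D video and driving a 3D model.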
In one embodiment, as shown in fig. 4, step S105 includes sub-steps S151, S152, S153, and S154.
And S151, reasoning the blue background AI digital human video which accords with the text information from the AI digital human reasoning model according to the text information in the text content of the lecture.
In an embodiment, step S151 specifically includes: and deducing videos which accord with the mouth shape, the emotion and the body action of characters in the text information from the AI digital human reasoning model according to the text information in the text content of the lecture so as to combine the videos to generate the blue background AI digital human video.
Specifically, an AI digital figure can be selected from an AI digital figure library of the AI digital figure inference model, and video which is the AI digital figure and conforms to the mouth shape, emotion and limb actions of characters in the text content is inferred from the AI digital figure inference model and combined according to the content of the lecture text content, emotion and limb actions, so as to generate a blue background AI digital figure video corresponding to the AI digital figure.
In an embodiment, step S151 may alternatively include: inferring, from the AI digital human reasoning model according to the text information in the lecture text content, facial expression data conforming to the mouth shapes and emotions of the words in the text information, and skeletal model data conforming to the body movements in the text information; restoring a facial video of the AI digital person from the facial expression data, restoring a body-movement video of the AI digital person from the skeletal model data, and combining the two to generate the blue-background AI digital human video.

Specifically, an AI digital human figure can be selected from the AI digital human figure library of the AI digital human reasoning model. According to the lecture text content and its emotions, facial expression data of that figure conforming to the mouth shapes and emotions of the words in the text content is inferred from the model; the facial expression data records the change characteristics of the AI digital human face, and a facial video of the AI digital person can be restored from it. Likewise, according to the body movements in the lecture text, skeletal model data of that figure conforming to those movements is inferred from the model; the skeletal model data records the body-movement change characteristics of the AI digital person, and a body-movement video of the AI digital person can be restored from it.
In addition, keywords in the lecture text content can be matched and analyzed to obtain corresponding control instructions, which are used to control playback on the PPT, such as annotation, animation playing, text display or video playing.
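As a rough illustration of this keyword-matching step, the sketch below maps keywords found in the lecture text to playback-control instructions. The keyword table and instruction names are hypothetical; the patent does not specify a concrete matching scheme.

```python
# Hypothetical sketch: map keywords in the lecture text to PPT playback-control
# instructions (annotation, animation, text or video playback). The keyword
# table and the instruction names are illustrative only.
KEYWORD_INSTRUCTIONS = {
    "as shown in the figure": "PLAY_ANIMATION",
    "please note": "ANNOTATE",
    "watch the following clip": "PLAY_VIDEO",
}

def match_control_instructions(lecture_text: str) -> list:
    """Return the control instructions triggered by keywords in the text."""
    return [instruction
            for keyword, instruction in KEYWORD_INSTRUCTIONS.items()
            if keyword in lecture_text]
```

In a real system the table would be configurable per playing template, but the lookup itself stays this simple.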
And S152, carrying out voice synthesis according to the text information in the text content of the lecture and the audio characteristics in the AI digital human reasoning model to obtain the AI digital human voice.
Specifically, an AI digital figure can be selected from an AI digital figure library of the AI digital figure inference model, and an AI digital figure voice corresponding to the AI digital figure is obtained by performing voice synthesis according to text information in the lecture text and audio features of the AI digital figure.
S153, splitting the lecture text content into paragraphs to obtain the subtitles.

The lecture content can be segmented and split; for example, the lecture text content is split according to a maximum character count into subtitles comprising a plurality of text segments, where the number of characters in each segment does not exceed the maximum. If the maximum character count is 30, punctuation marks such as commas and full stops are used as splitting nodes to segment the lecture content, thereby obtaining subtitles comprising the corresponding text segments.
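A minimal sketch of this splitting rule, assuming breaks at punctuation marks plus a forced break at the maximum character count (30 in the example above); the function and constant names are illustrative:

```python
# Split lecture text into subtitle segments: break at punctuation nodes, and
# force a break when a segment reaches max_chars so no segment exceeds it.
SPLIT_MARKS = set("，。；！？,.;!?")

def split_subtitles(text: str, max_chars: int = 30) -> list:
    segments, current = [], ""
    for ch in text:
        current += ch
        if ch in SPLIT_MARKS or len(current) == max_chars:
            segments.append(current)
            current = ""
    if current:                      # keep any trailing partial segment
        segments.append(current)
    return segments
```

Concatenating the segments reproduces the original text, which matters later when each segment is timestamped against the voice track.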
S154, acquiring the AI digital human video frame, AI digital human voice frame and subtitle frame corresponding to the rendered composite frame according to the timestamps of the AI digital human video, the AI digital human voice and the subtitles.

Specifically, the AI digital human video and the AI digital human voice have equal durations, and each text segment in the subtitles corresponds to a time point in the video or voice. A video segment is a combination of multiple video frames, and a voice segment is a combination of multiple voice frames. The time point at which a subtitle text segment corresponds to the video or voice content can be used as the timestamp of that segment, and the text segment and its corresponding video and voice content are split based on the timestamp, thereby obtaining the AI digital human video frame, AI digital human voice frame and subtitle frame corresponding to the current rendered composite frame.
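The lookup described above can be sketched as follows. The frame rates and all names are assumptions for illustration; the point is that one shared timestamp selects the video frame, voice frame and subtitle segment for each composite frame:

```python
from bisect import bisect_right

VIDEO_FPS = 25   # assumed video frame rate
VOICE_FPS = 50   # assumed 20 ms voice frames

def frames_for_composite(frame_idx, subtitle_starts, subtitles):
    """subtitle_starts: sorted start times (seconds) of each subtitle segment."""
    t = frame_idx / VIDEO_FPS                       # timestamp of this frame
    voice_idx = int(t * VOICE_FPS)                  # voice frame at the same time
    sub_idx = max(0, bisect_right(subtitle_starts, t) - 1)  # active subtitle
    return frame_idx, voice_idx, subtitles[sub_idx]
```

For example, composite frame 50 at 25 fps falls at t = 2.0 s, which selects voice frame 100 and the subtitle segment that started at 1.5 s.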
S106, displaying the AI digital human video frame in the AI digital human display area of the playing template according to the set combination of the AI digital person with the attribute connections, attribute simulations and combined animations.

The AI digital human video frame is inserted into the AI digital human display area of the playing template according to the combination of attribute connections, attribute simulations and combined animations set between the AI digital person and the playing template, and is displayed in the specified manner.
S107, displaying the page picture in the PPT display area of the playing template according to the set display area and display mode of the PPT page.

The PPT page picture is inserted into the PPT display area of the playing template according to the display area and display mode of the PPT page set in the template, and is displayed in the specified display mode.
S108, acquiring the region coordinates of the text content corresponding to the subtitle frame on the PPT page according to the 2D mapping relationship between the lecture text content and the PPT page, and generating a prompt graphic for the text content corresponding to the current AI digital human voice frame.
In one embodiment, as shown in FIG. 5, step S108 includes sub-steps S181 and S182.
S181, acquiring region coordinates of the text content corresponding to the caption frame on the PPT page according to the 2D mapping relation between the text content of the lecture and the PPT page, and determining the geometric vertex parameter of the PPT page in a 3D space according to the camera parameter of 3D rendering; and S182, covering the prompt graphic corresponding to the AI digital human voice frame to the corresponding PPT page picture through 3D mapping, and realizing synchronous marking of the AI digital human voice and the corresponding PPT lecture text.
Specifically, mapping between the 2D space and the 3D space can be realized based on a projection matrix, so as to construct the camera parameters for 3D rendering. The PPT page picture and the AI digital human video frame displayed by the playing template are presented in 2D at the camera end, and a 3D positional relationship between them and a virtual camera can be constructed to form the 3D space. The camera parameters include the position and orientation of the camera in the 3D space; they determine how the camera views the 3D space and how the camera's view is displayed on the screen. A corresponding prompt graphic is generated according to the AI digital human voice; the region coordinates of the text content corresponding to the subtitle frame on the PPT page are acquired based on the 2D mapping relationship between the lecture text content and the PPT page; the geometric vertex parameters of the PPT page in the 3D space (i.e., the three-dimensional coordinates of the four corners of the PPT page) are determined according to the camera parameters; and the prompt graphic is overlaid onto the corresponding PPT page picture by 3D mapping, thereby realizing synchronous marking of the AI digital human voice and the corresponding PPT lecture text.
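One way to picture the 3D mapping of a page region: if the PPT page is a planar quad in the 3D scene given by its four geometric vertices, a 2D region coordinate (u, v) on the page can be lifted onto the quad by bilinear interpolation, which locates where the prompt graphic is overlaid. The vertex values and names below are assumptions for the sketch:

```python
def lerp(a, b, t):
    """Linearly interpolate between two 3D points a and b."""
    return tuple(ai + t * (bi - ai) for ai, bi in zip(a, b))

PAGE_QUAD = (                      # assumed vertices: TL, TR, BR, BL
    (-1.0,  0.75, 0.0),
    ( 1.0,  0.75, 0.0),
    ( 1.0, -0.75, 0.0),
    (-1.0, -0.75, 0.0),
)

def page_point_to_3d(u, v, quad=PAGE_QUAD):
    """Map normalized page coordinates (u rightward, v downward, in [0, 1])
    to the corresponding 3D point on the page quad."""
    tl, tr, br, bl = quad
    top = lerp(tl, tr, u)          # point on the top edge
    bottom = lerp(bl, br, u)       # point on the bottom edge
    return lerp(top, bottom, v)    # interpolate between the two edges
```

The centre of the page, (u, v) = (0.5, 0.5), then maps to the centre of the quad.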
S109, rendering and drawing a circular bright point at the region coordinates, on the PPT page, of the text content corresponding to the subtitle frame, according to the 2D mapping relationship between the lecture text content and the PPT page, so as to perform laser-pointer identification and render and generate the image frame.
In one embodiment, as shown in FIG. 6, step S109 includes sub-steps S191, S192, and S193.
S191, acquiring the region coordinates of the text content corresponding to the subtitle frame on the PPT page according to the 2D mapping relationship between the lecture text content and the PPT page, and calculating the 2D coordinates of each subtitle character on the PPT page; S192, converting the 2D coordinates of the subtitle characters into 3D coordinates through the 3D-rendering camera parameters and the geometric vertex parameters of the PPT page in the 3D space; S193, converting the 3D coordinates of the subtitle characters into screen coordinates through the 3D-rendering camera parameters, and rendering and drawing a circular bright point at those coordinates, thereby realizing laser-pointer identification in the PPT page.
The mapping between the 2D space and the 3D space can be realized based on the projection matrix to obtain the 3D-rendering camera parameters, in the same manner as in the preceding steps. The region coordinates of the text content corresponding to the subtitle frame on the PPT page are acquired according to the 2D mapping relationship between the lecture text content and the PPT page, and the 2D coordinates of each subtitle character on the PPT page are then obtained. The 2D coordinates of the subtitle characters are converted into 3D coordinates by mapping between the 2D and 3D spaces according to the geometric vertex parameters of the PPT page in the 3D space: specifically, a corresponding view matrix is obtained from the camera parameters, and the geometric coordinate positions of the subtitle characters in the 2D space are mapped into the 3D scene through the view matrix and the projection matrix, giving the corresponding 3D coordinates. The 3D coordinates of the subtitle characters are then converted into screen coordinates, that is, the 3D scene coordinates are mapped into 2D screen coordinates through the view matrix and the projection matrix, completing the conversion from PPT-page 2D coordinates to screen coordinates. A circular bright point is then rendered and drawn at the screen coordinates, so that a movable bright point is displayed within the PPT page in the video, realizing laser-pointer identification in the PPT page, i.e., simulating the function of a laser pointer in a real presentation.
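The final 3D-to-screen step of this chain follows the standard view/projection pipeline. The sketch below assumes plain 4x4 row-major matrices; matrix values, names and the screen size are illustrative, not taken from the patent:

```python
# Transform a 3D point by the view and projection matrices, perform the
# perspective divide to normalized device coordinates (NDC), and map NDC to
# the pixel position where the circular bright point would be drawn.
def mat_vec(m, v):
    """Multiply a 4x4 matrix (rows of 4 floats) by a 4-vector."""
    return tuple(sum(m[r][c] * v[c] for c in range(4)) for r in range(4))

def world_to_screen(point, view, proj, width, height):
    x, y, z = point
    clip = mat_vec(proj, mat_vec(view, (x, y, z, 1.0)))   # clip space
    ndc = tuple(c / clip[3] for c in clip[:3])            # perspective divide
    sx = (ndc[0] + 1.0) * 0.5 * width                     # NDC x -> pixels
    sy = (1.0 - ndc[1]) * 0.5 * height                    # y flipped: origin top-left
    return sx, sy

IDENTITY = tuple(tuple(1.0 if r == c else 0.0 for c in range(4)) for r in range(4))
```

With identity view and projection matrices the scene origin lands at the screen centre, which is a quick sanity check before substituting real camera parameters.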
And S110, synchronously outputting the rendered and synthesized image frame and the current AI digital human voice frame to a video board, recording to a file or generating network plug-flow output.
The rendered composite image frame and the current AI digital human voice frame can be combined and synchronously output to a video board card, synchronously recorded to generate a video file, or synchronously used to generate a network push-stream output (for example, a live video stream).
In the interactive control method for PPT by an AI digital person based on templated editing provided by the embodiment of the invention, the attribute connection relationships between media material elements are determined; a playing template is generated according to the set combination of display areas and display modes; page information is extracted from a PPT file; a 2D mapping relationship between the lecture text content and the PPT pages is constructed; AI digital human video frames, voice frames and subtitle frames are generated by inference according to the lecture text content; the AI digital human video frames and the page pictures are respectively added to the playing template for display; prompt graphics and laser-pointer marks are added to the corresponding display areas of the PPT pages according to the 2D mapping relationship to render and generate image frames; and the image frames and the AI digital human voice frames are synchronously output. In this way, page information can be extracted from the PPT file, the corresponding 2D mapping relationship can be constructed, and prompt graphics and laser-pointer marks can be added to the corresponding display areas of the PPT pages according to that relationship, so that the PPT background material is controlled synchronously during synthesis of the AI digital human video, greatly enhancing the applicability of the AI digital human video.
The embodiment of the invention also provides an interactive control device for the PPT based on the templated editing AI digital person, which can be configured in a user terminal or a management server, and is used for executing any one of the aforementioned interactive control methods for the PPT based on the templated editing AI digital person. Specifically, referring to fig. 7, fig. 7 is a schematic block diagram of an interactive control device for PPT based on AI digital human pairs edited by templating according to an embodiment of the present invention.
As shown in fig. 7, the interactive control device 100 for the PPT based on the templated edited AI digital person includes a media material element attribute acquisition unit 101, a playback template generation unit 102, a page information extraction unit 103, a mapping relationship construction unit 104, an inference generation unit 105, an AI digital person video frame display unit 106, a page picture display unit 107, a prompt graph generation unit 108, an image frame rendering unit 109, and an output unit 110.
The media material element attribute acquisition unit 101 is configured to import media material elements and generate attribute connections, attribute simulations, and combined animations corresponding to the media material elements according to specific logical relationships and state change requirements.
And the playing template generating unit 102 is configured to set a display area and a display mode of the PPT page according to the input setting parameters, and a combination mode of AI digital people and the attribute connection, attribute simulation and combined animation, so as to generate a corresponding playing template.
The page information extracting unit 103 is configured to import a PPT file into the play template, and extract page information corresponding to each PPT page in the PPT file, where the page information includes page pictures and text contents of a lecture corresponding to each PPT page.
And the mapping relationship establishing unit 104 is configured to establish a 2D mapping relationship between the text content of the lecture and the PPT page.
And an inference generating unit 105, configured to send the text content of the lecture to the AI digital human reasoning model to generate an AI digital human video frame, an AI digital human voice frame, and a subtitle frame corresponding to the rendered composite frame.
And an AI digital human video frame display unit 106, configured to display an AI digital human video frame in the AI digital human display area of the playing template according to the set combination manner of the AI digital human and the attribute connection, the attribute simulation, and the combined animation.
And the page picture display unit 107 is configured to display the page picture in the display mode in the PPT display region of the play template according to the display region and the display mode of the set PPT page.
And a prompt graphic generation unit 108, configured to obtain, according to the 2D mapping relationship between the lecture text content and the PPT page, the region coordinates of the text content corresponding to the subtitle frame on the PPT page, and generate a prompt graphic of the text content corresponding to the current AI digital person voice frame.
And the image frame rendering unit 109 is used for rendering and drawing a circular bright point in the region coordinate of the PPT page for the text content corresponding to the caption frame according to the 2D mapping relation between the lecture text content and the PPT page, and performing laser pointer identification to generate an image frame in a rendering manner.
And the output unit 110 is configured to render the synthesized image frame and the current AI digital human voice frame and output the rendered image frame and the current AI digital human voice frame to the video board, record the rendered image frame and the current AI digital human voice frame to a file, or generate a network plug flow output.
The interactive control device for PPT by an AI digital person based on templated editing provided by the embodiment of the invention applies the foregoing interactive control method: it determines the attribute connection relationships between media material elements, generates a playing template according to the set combination of display areas and display modes, extracts page information from a PPT file, constructs a 2D mapping relationship between the lecture text content and the PPT pages, generates AI digital human video frames, voice frames and subtitle frames by inference according to the lecture text content, adds the AI digital human video frames and the page pictures to the playing template for display, adds prompt graphics and laser-pointer marks to the corresponding display areas of the PPT pages according to the 2D mapping relationship to render and generate image frames, and synchronously outputs the image frames and the AI digital human voice frames. In this way, page information can be extracted from the PPT file, the corresponding 2D mapping relationship can be constructed, and prompt graphics and laser-pointer marks can be added according to that relationship, so that the PPT background material is controlled synchronously during synthesis of the AI digital human video, greatly enhancing the applicability of the AI digital human video.
The above-mentioned interactive control device for the PPT based on the AI digital person edited by the template may be implemented in the form of a computer program, which can be run on a computer device as shown in fig. 8.
Referring to fig. 8, fig. 8 is a schematic block diagram of a computer device according to an embodiment of the present invention. The computer device may be a user terminal or a management server for executing the interactive control method for PPT by an AI digital person based on templated editing, so as to output to a video board card, record to a file, or generate a network push-stream output according to the input parameter information and the PPT file.
Referring to fig. 8, the computer device 500 includes a processor 502, memory, and a network interface 505 connected by a system bus 501, where the memory may include a storage medium 503 and an internal memory 504.
The storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032, when executed, can cause the processor 502 to execute the interactive control method for PPT based on the AI digital person with templated editing, wherein the storage medium 503 can be a volatile storage medium or a non-volatile storage medium.
The processor 502 is used to provide computing and control capabilities that support the operation of the overall computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 in the storage medium 503, and when the computer program 5032 is executed by the processor 502, the processor 502 can be enabled to execute the interactive control method for PPT based on the AI digital person edited by templating.
The network interface 505 is used for network communication, such as providing transmission of data information. Those skilled in the art will appreciate that the configuration shown in fig. 8 is a block diagram of only a portion of the configuration associated with aspects of the present invention and is not intended to limit the computing device 500 to which aspects of the present invention may be applied, and that a particular computing device 500 may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
The processor 502 is configured to run a computer program 5032 stored in the memory to implement the corresponding functions in the interactive control method for PPT based on the templated edited AI digital human.
Those skilled in the art will appreciate that the embodiment of a computer device illustrated in fig. 8 does not constitute a limitation on the specific construction of the computer device, and that in other embodiments a computer device may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components. For example, in some embodiments, the computer device may only include a memory and a processor, and in such embodiments, the structures and functions of the memory and the processor are consistent with those of the embodiment shown in fig. 8, and are not described herein again.
It should be understood that, in the embodiment of the present invention, the Processor 502 may be a Central Processing Unit (CPU), and the Processor 502 may also be other general-purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, and the like. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
In another embodiment of the invention, a computer-readable storage medium is provided. The computer readable storage medium may be a volatile or non-volatile computer readable storage medium. The computer readable storage medium stores a computer program, wherein the computer program, when executed by a processor, implements the steps included in the above-described interactive control method for PPT based on templated edited AI digital human.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses, devices and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that the components and steps of the examples have been described in a functional general in the foregoing description for the purpose of illustrating clearly the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only a logical division, and there may be other divisions in actual implementation, or units with the same function may be grouped into one unit, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiments of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention essentially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product stored in a computer-readable storage medium, which includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned computer-readable storage media comprise: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. An interactive control method for PPT by an AI digital person based on templated editing, wherein the method comprises:
importing media material elements and generating attribute connection, attribute simulation and combined animation corresponding to the media material elements according to specific logical relations and state change requirements;
setting a display area and a display mode of a PPT page according to the input setting parameters, and a combination mode of AI digital persons and the attribute connection, attribute simulation and combined animation, thereby generating a corresponding playing template;
importing a PPT file into the playing template, and extracting page information corresponding to each PPT page in the PPT file, wherein the page information comprises page pictures and lecture text contents corresponding to each PPT page;
establishing a 2D mapping relation between the text content of the lecture and a PPT page;
transmitting the lecture text content to an AI digital human reasoning model to generate an AI digital human video frame, an AI digital human voice frame and a subtitle frame corresponding to the rendered synthetic frame;
displaying an AI digital man video frame in an AI digital man display area of the playing template according to the set combination mode of the AI digital man and the attribute connection, the attribute simulation and the combined animation;
displaying the page picture in the PPT display area of the playing template according to the display mode according to the display area and the display mode of the PPT page;
acquiring the region coordinates of the text content corresponding to the caption frame on the PPT page according to the 2D mapping relation between the text content of the lecture and the PPT page to generate a prompt graphic representation of the text content corresponding to the current AI digital human voice frame;
rendering and drawing a circular bright point at the region coordinates, on the PPT page, of the text content corresponding to the caption frame, according to the 2D mapping relationship between the lecture text content and the PPT page, so as to perform laser-pointer identification and render and generate an image frame;
and synchronously outputting the rendered and synthesized image frame and the current AI digital human voice frame to a video board card, recording the image frame and the current AI digital human voice frame to a file or generating network plug flow output.
2. The interactive control method for PPT based on templated editing AI digital people as claimed in claim 1, wherein the extracting page information corresponding to each page in the PPT file comprises:
generating a page picture corresponding to each page in the PPT file;
and extracting the text content of the lecture of each page in the PPT file.
3. The method for interactive control over PPT by AI digital people based on templated editing according to claim 1, wherein before sending the lecture text to the AI digital people inference model to generate the AI digital people video frames, AI digital people voice frames, and subtitle frames corresponding to the rendered composite frames, further comprising:
generating an AI digital human reasoning model corresponding to audio and video data of a target object collected in a blue box according to a preset AI model, wherein the AI digital human reasoning model comprises a plurality of combined images corresponding to each target object and a blue background preview video corresponding to each combined image; the AI digital human reasoning model comprises 2D AI digital human and/or 3D AI digital human.
4. The method for interactive control of PPT by AI digital people based on templated editing according to claim 1, wherein the sending of the lecture text content to the AI digital human reasoning model to generate the AI digital human video frames, AI digital human voice frames and subtitle frames corresponding to the rendered composite frame comprises:
reasoning a blue background AI digital man video which accords with text information from the AI digital man reasoning model according to the text information in the text content of the lecture;
carrying out voice synthesis according to text information in the text content of the lecture and audio characteristics in the AI digital human reasoning model to obtain AI digital human voice;
according to the text content of the lecture, paragraph splitting is carried out to obtain subtitles;
and acquiring an AI digital human video frame, an AI digital human voice frame and a caption frame corresponding to the rendered composite frame according to the timestamps of the AI digital human video, the AI digital human voice and the caption.
5. The method for interactive control over PPT by AI digital people based on templated editing according to claim 4, wherein the step of inferring a blue background AI digital people video conforming to text information from an AI digital people inference model based on text information in the lecture text content comprises:
deducing videos which accord with mouth shapes, emotions and limb actions of characters in the text information from the AI digital human reasoning model according to the text information in the text content of the lecture, and combining the videos to generate a blue background AI digital human video;
or the following steps:
according to text information in the text content of the lecture, facial expression data which accord with the mouth shape and the emotion of characters in the text information are deduced from the AI digital human reasoning model, and skeletal model data which accord with the limb actions in the text information are deduced from the AI digital human reasoning model;
and restoring to obtain a facial video of the AI digital person according to the facial expression data, restoring to obtain a limb action video of the AI digital person according to the skeleton model data, and combining to generate a blue background AI digital person video.
6. The method according to claim 1, wherein the step of acquiring the region coordinates of the text content corresponding to the subtitle frame on the PPT page according to the 2D mapping relationship between the lecture text content and the PPT page, so as to generate a prompt graphic for the text content corresponding to the current AI digital human voice frame, comprises:
acquiring the region coordinates of the text content corresponding to the subtitle frame on the PPT page according to the 2D mapping relationship between the lecture text content and the PPT page, and determining the geometric vertex parameters of the PPT page in 3D space from the camera parameters of the 3D rendering;
and overlaying the prompt graphic corresponding to the AI digital human voice frame onto the corresponding PPT page picture through 3D mapping, thereby marking the AI digital human voice and the corresponding PPT lecture text synchronously.
7. The method for interactive control of a PPT by an AI digital human based on templated editing according to claim 1, wherein the step of acquiring the region coordinates of the text content corresponding to the subtitle frame on the PPT page according to the 2D mapping relationship between the lecture text content and the PPT page, and rendering and drawing a circular highlight at those coordinates as a laser-pointer mark so as to render and generate the image frame, comprises:
acquiring the region coordinates of the text content corresponding to the subtitle frame on the PPT page according to the 2D mapping relationship between the lecture text content and the PPT page, and calculating the 2D coordinates of each subtitle character on the PPT page;
converting the 2D coordinates of the subtitle characters into 3D coordinates through the camera parameters of the 3D rendering and the geometric vertex parameters of the PPT page in 3D space;
and converting the 3D coordinates of the subtitle characters into screen coordinates through the camera parameters of the 3D rendering; rendering and drawing a circular highlight at those coordinates realizes the laser-pointer mark on the PPT page.
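The 2D-to-3D-to-screen coordinate chain in claim 7 can be sketched as below. This is an illustrative sketch under assumed parameters — the page quad's corner positions, the identity view and the simple perspective matrix are inventions for the example, not values from the patent.

```python
import numpy as np

def ppt_2d_to_3d(u, v, corners):
    """Bilinear map of normalized PPT page coords (u, v in [0, 1]) onto the
    page quad's 3D geometric vertices: (top-left, top-right, bottom-left,
    bottom-right)."""
    tl, tr, bl, br = corners
    top = tl + u * (tr - tl)
    bottom = bl + u * (br - bl)
    return top + v * (bottom - top)

def project_to_screen(p, view_proj, width, height):
    """Project a 3D point to pixel coordinates with a 4x4 view-projection
    matrix (GL-style clip space, y flipped for screen coordinates)."""
    clip = view_proj @ np.append(p, 1.0)
    ndc = clip[:3] / clip[3]
    x = (ndc[0] * 0.5 + 0.5) * width
    y = (1.0 - (ndc[1] * 0.5 + 0.5)) * height
    return x, y

# Hypothetical page quad facing the camera in the z = -5 plane.
corners = [np.array(c, dtype=float) for c in
           [(-2, 1.5, -5), (2, 1.5, -5), (-2, -1.5, -5), (2, -1.5, -5)]]
# Identity view combined with a simple perspective projection.
proj = np.array([[1, 0,  0,  0],
                 [0, 1,  0,  0],
                 [0, 0, -1, -2],
                 [0, 0, -1,  0]], dtype=float)

p3d = ppt_2d_to_3d(0.5, 0.5, corners)          # centre of the page
x, y = project_to_screen(p3d, proj, 1920, 1080)
# The circular highlight serving as the laser-pointer mark would be drawn at (x, y).
```

The same projection is applied to each subtitle character's 2D page coordinates, so the highlight tracks the narrated text even when the page quad is tilted or scaled in the 3D scene.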
8. An interactive control device for a PPT by an AI digital human based on templated editing, characterized in that the device comprises:
a media material element attribute acquisition unit, configured to import media material elements and generate attribute connections, attribute simulations and combined animations corresponding to the media material elements according to specific logical relationships and state-change requirements;
a playing template generation unit, configured to set, according to the input setting parameters, the display area and display mode of the PPT page and the combination mode of the AI digital human with the attribute connections, attribute simulations and combined animations, so as to generate a corresponding playing template;
a page information extraction unit, configured to import a PPT file into the playing template and extract the page information corresponding to each PPT page in the PPT file, the page information comprising the page picture and lecture text content corresponding to each PPT page;
a mapping relationship construction unit, configured to establish a 2D mapping relationship between the lecture text content and the PPT page;
an inference generation unit, configured to send the lecture text content to the AI digital human inference model so as to generate the AI digital human video frame, AI digital human voice frame and subtitle frame corresponding to the rendered composite frame;
an AI digital human video frame display unit, configured to display the AI digital human video frame in the AI digital human display area of the playing template according to the set combination mode of the AI digital human with the attribute connections, attribute simulations and combined animations;
a page picture display unit, configured to display the page picture in the PPT display area of the playing template according to the display area and display mode of the PPT page;
a prompt graphic generation unit, configured to acquire the region coordinates of the text content corresponding to the subtitle frame on the PPT page according to the 2D mapping relationship between the lecture text content and the PPT page, and generate a prompt graphic for the text content corresponding to the current AI digital human voice frame;
an image frame rendering unit, configured to render and draw a circular highlight, as a laser-pointer mark, at the region coordinates on the PPT page of the text content corresponding to the subtitle frame according to the 2D mapping relationship between the lecture text content and the PPT page, so as to render and generate an image frame;
and an output unit, configured to render the composite image frame together with the current AI digital human voice frame and synchronously output them to a video board card, record them to a file, or generate a network push-stream output.
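The per-frame flow through the device's display, prompt and rendering units can be sketched as a small pipeline. Everything here is a hypothetical stand-in for illustration — the `PlaybackTemplate` fields, area tuples and method names are not from the patent.

```python
from dataclasses import dataclass, field

@dataclass
class PlaybackTemplate:
    """Minimal stand-in for the playing template of claim 8: the PPT display
    area, the AI digital-human display area, and the configured combined
    animations (x, y, width, height tuples are an assumed convention)."""
    ppt_area: tuple
    human_area: tuple
    animations: list = field(default_factory=list)

class RenderPipeline:
    """Hypothetical ordering of the device's units for one composite frame."""
    def __init__(self, template):
        self.template = template
        self.layers = []

    def render_frame(self, page_picture, video_frame, subtitle, hint_xy):
        # page picture display unit: page picture -> PPT display area
        self.layers.append(("ppt", self.template.ppt_area, page_picture))
        # AI digital-human video frame display unit -> digital-human area
        self.layers.append(("human", self.template.human_area, video_frame))
        # prompt graphic unit + circular laser-pointer highlight at hint_xy
        self.layers.append(("hint", hint_xy, subtitle))
        return len(self.layers)  # number of layers composited this frame

tmpl = PlaybackTemplate(ppt_area=(0, 0, 1280, 1080),
                        human_area=(1280, 0, 640, 1080))
pipe = RenderPipeline(tmpl)
n = pipe.render_frame("page-1.png", "human-frame-0", "paragraph 1", (960, 540))
```

The output unit would then hand the composited layers, paired with the current voice frame, to the video board card, a file recorder, or a network push stream.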
9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the interactive control method for a PPT by an AI digital human based on templated editing according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the interactive control method for a PPT by an AI digital human based on templated editing according to any one of claims 1 to 7.
CN202210369117.1A 2022-04-08 Method and device for interactive control of AI digital person to PPT based on templated editing Active CN114741541B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210369117.1A CN114741541B (en) 2022-04-08 Method and device for interactive control of AI digital person to PPT based on templated editing

Publications (2)

Publication Number Publication Date
CN114741541A true CN114741541A (en) 2022-07-12
CN114741541B CN114741541B (en) 2024-07-12


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7490031B1 (en) * 2002-12-03 2009-02-10 Gang Qiu Mechanization of modeling, simulation, amplification, and intelligence of software
CN107741971A (en) * 2017-10-10 2018-02-27 国网浙江省电力公司电力科学研究院 A kind of method of the online visual analysis PPT based on self-defined dynamic data
US20210042662A1 (en) * 2019-08-06 2021-02-11 Ninghua Albert Pu Interactive Information Capture and Retrieval with User-Defined and/or Machine Intelligence Augmented Prompts and Prompt Processing
CN113194348A (en) * 2021-04-22 2021-07-30 清华珠三角研究院 Virtual human lecture video generation method, system, device and storage medium
WO2022068533A1 (en) * 2020-09-29 2022-04-07 北京字跳网络技术有限公司 Interactive information processing method and apparatus, device and medium

Similar Documents

Publication Publication Date Title
Guo et al. Ad-nerf: Audio driven neural radiance fields for talking head synthesis
US5623587A (en) Method and apparatus for producing an electronic image
WO2022001593A1 (en) Video generation method and apparatus, storage medium and computer device
US6919892B1 (en) Photo realistic talking head creation system and method
Abrantes et al. MPEG-4 facial animation technology: Survey, implementation, and results
US7027054B1 (en) Do-it-yourself photo realistic talking head creation system and method
US8988436B2 (en) Training system and methods for dynamically injecting expression information into an animated facial mesh
US6112177A (en) Coarticulation method for audio-visual text-to-speech synthesis
CN110557625A (en) live virtual image broadcasting method, terminal, computer equipment and storage medium
KR20210119438A (en) Systems and methods for face reproduction
US8553037B2 (en) Do-It-Yourself photo realistic talking head creation system and method
CN114363712B (en) AI digital person video generation method, device and equipment based on templated editing
US20100085363A1 (en) Photo Realistic Talking Head Creation, Content Creation, and Distribution System and Method
US7630897B2 (en) Coarticulation method for audio-visual text-to-speech synthesis
US20030163315A1 (en) Method and system for generating caricaturized talking heads
US20030085901A1 (en) Method and system for the automatic computerized audio visual dubbing of movies
US7117155B2 (en) Coarticulation method for audio-visual text-to-speech synthesis
CN113395569B (en) Video generation method and device
Tan et al. Style2talker: High-resolution talking head generation with emotion style and art style
US20080122867A1 (en) Method for displaying expressional image
KR20160010810A (en) Realistic character creation method and creating system capable of providing real voice
CN114741541B (en) Method and device for interactive control of AI digital person to PPT based on templated editing
CN112637692A (en) Interaction method, device and equipment
CN114741541A (en) Interactive control method and device for interactive control of AI digital person on PPT (Power Point) based on templated editing
Perng et al. Image talk: a real time synthetic talking head using one single image with chinese text-to-speech capability

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant