CN114401431A - Virtual human explanation video generation method and related device
- Publication number: CN114401431A (application CN202210061976.4A)
- Authority: CN (China)
- Prior art keywords: video, virtual human, target document, database, explanation
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- H04N21/4307 — Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
- G06T13/40 — 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
- H04N21/44 — Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/8106 — Monomedia components thereof involving special audio data, e.g. different tracks for different languages
- H04N21/816 — Monomedia components thereof involving special video data, e.g. 3D video
Abstract
The embodiment of the application discloses a virtual human explanation video generation method and a related device, wherein the virtual human explanation video generation method comprises the following steps: receiving question information input by a user; acquiring a target document related to the question information from a database; generating an animation video based on the target document, wherein the animation video comprises voice audio; acquiring a character image and a standard character model from the database, and forming a virtual human based on the character image and the standard character model; and fusing the virtual human into the animation video to form a virtual human explanation video. In the embodiment of the application, the generated virtual human explanation video answers the question information initiated by the user, thereby resolving the user's problem; because the virtual human simulates a real person giving a lecture, the user can more easily understand the content that the virtual human explanation video presents.
Description
Technical Field
The invention relates to the technical field of data conversion, in particular to a virtual human explanation video generation method and a related device.
Background
In the prior art, when a user has a question, a search engine is used to search for a related document. The retrieved document contains a large amount of text, and the process of reading that text to understand the meaning of the document's content is tedious: some users cannot keep their attention on the text, and others become distracted while working through the document, so their understanding remains incomplete.
Disclosure of Invention
The technical problem to be solved by the embodiments of the present invention is to provide a virtual human explanation video generation method and a related device, which resolve a user's problem by generating a virtual human explanation video that answers the question information initiated by the user.
In a first aspect, an embodiment of the present application provides a method for generating a virtual human explanation video, including:
receiving question information input by a user;
acquiring a target document related to the question information from a database;
generating an animation video based on the target document, wherein the animation video comprises voice audio;
acquiring a character image and a standard character model from the database, and forming a virtual human based on the character image and the standard character model;
and fusing the virtual human into the animation video to form a virtual human explanation video.
In a possible implementation manner, the obtaining, from a database, a target document related to the question information includes:
interpreting the question information to obtain the meaning content of the question information;
extracting keywords from the meaning content;
and acquiring the target document corresponding to the meaning content from the database based on the keyword.
In a possible implementation manner, the obtaining, from a database, a target document corresponding to the meaning content based on the keyword includes:
inputting the keywords into a database to query to obtain a pre-stored document set associated with the keywords;
and screening out target documents consistent with the meaning content from the pre-stored document set.
In one possible implementation, the generating an animated video based on the target document includes:
acquiring a field of a target document;
generating the voice audio based on the field;
acquiring a video template matched with the field from a database;
and inserting the field and the voice audio into the video template to form the animation video.
In one possible implementation, the inserting the field and the voice audio into the video template to form the animation video includes:
decoding the video template to obtain a plurality of video frames, wherein the video frames are provided with subtitle boxes into which fields can be inserted;
aligning the starting time point and the ending time point of the voice audio with the starting video frame and the ending video frame of the video template respectively to determine the corresponding relation between the voice audio and the plurality of video frames;
splitting the field to form a plurality of subfields, and determining correspondence of the plurality of subfields to a plurality of video frames in the video template;
and inserting each subfield into the subtitle box in the video frame corresponding to the subfield to form the animation video.
In one possible implementation, the forming a virtual human based on the character image and a standard character model includes:
collecting characteristic parameters of the character image and standard character model parameters;
and forming the virtual human on the basis of the characteristic parameters and the standard character model parameters.
In one possible implementation, the voice audio includes a plurality of pronunciations, and the fusing the virtual human into the animation video to form a virtual human explanation video includes:
determining mouth shapes of the virtual human corresponding to the pronunciations according to the pronunciations of the voice audio;
determining lip movement tracks of the virtual human on the basis of mouth shapes of the virtual human corresponding to the pronunciations;
and synchronizing the motion trail of the lip of the virtual human with the voice audio to form a virtual human explanation video.
In a second aspect, an embodiment of the present application provides a virtual human explanation video generating apparatus, where the virtual human explanation video generating apparatus includes:
the receiving module is used for receiving question information input by a user;
the acquisition module is used for acquiring a target document related to the question information from a database;
the animation video generation module is used for generating an animation video based on the target document, and the animation video comprises voice audio;
the virtual human forming module is used for acquiring the character image and the standard character model from the database and forming a virtual human based on the character image and the standard character model;
and the fusion module is used for fusing the virtual human into the animation video so as to form a virtual human explanation video.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a storage and a processor, where the storage is used to store computer instructions, and the processor is used to call the computer instructions to execute the method described above.
In a fourth aspect, embodiments of the present application provide a computer storage medium storing computer instructions that, when executed by a processor, implement a method as described above.
In the embodiment provided by the application, after receiving question information from a user, the virtual human explanation video generation device queries a database for a target document related to the question information, generates an animation video based on the target document, and fuses a generated virtual human into the animation video to produce the virtual human explanation video. The question raised by the user is answered through the virtual human explanation video, resolving the user's doubt; at the same time, the user can more easily understand the meaning of the target document's content, which improves the efficiency with which the user understands the target document.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of the present application; for those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
Fig. 1 is a schematic flowchart of a method for generating a virtual human explanation video according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a virtual human explanation video generation apparatus according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the drawings in the embodiments of the present application.
The terms "including" and "having," and any variations thereof, in the description and claims of this application and the drawings described above, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
As noted in the background, in the prior art, when a user has a question, a search engine is used to search for a related document. The retrieved document contains a large amount of text, and reading that text to understand the document's content is tedious: some users cannot keep their attention on the text, and others become distracted while working through the document, so their understanding remains incomplete.
Referring to fig. 1, the embodiment of the present application discloses a method for generating a virtual human explanation video, which includes, but is not limited to, steps S1-S5.
And S1, receiving the question information input by the user.
The method may be executed by the virtual human explanation video generation apparatus 100, which may be an intelligent device such as a computer or a mobile phone.
In the embodiment provided by the present application, the question information may be a question the user puts to the virtual human explanation video generation apparatus 100 by voice, or a question the user manually inputs at the virtual human explanation video generation apparatus 100.
Correspondingly, the virtual human explanation video generation apparatus 100 may receive a question uttered by the user's voice and may also receive question information manually input by the user, for example, "what insurance is suitable for purchase with low back and leg pain".
And S2, acquiring the target document related to the question information from the database.
In the embodiment provided by the present application, after receiving the question information, the virtual human explanation video generation apparatus 100 interprets the question information, identifies its meaning content, and then queries the database for a target document associated with the question, where the content in the target document is used to answer the question information. For example, when the question information is "what role does buying insurance play", the content in the target document may be "insurance can help individuals or organizations reduce economic losses, enhance their risk-management awareness, and ensure timely recovery and risk transfer when the individual or organization suffers a loss".
In the embodiment provided by the present application, a plurality of documents are pre-stored in the database, and when the virtual human explanation video generation apparatus 100 receives the question information, it acquires from the database a target document capable of answering the question information.
S3, generating an animation video based on the target document, wherein the animation video comprises voice audio.
In the embodiment provided by the application, the target document may contain fields and images. When the animation video is generated based on the target document, the images in the target document may be used as video frames, the voice audio may be generated from the fields in the target document, and the voice audio and the video may be combined to form the animation video.
In generating the animation video based on the target document, the voice audio may be generated based on the fields of the target document, a video template may be selected from the database, and the voice audio may be inserted into the video template to form the animation video.
In the embodiment provided by the present application, the animation video carries the voice audio corresponding to the fields of the target document. After the virtual human explanation video is generated, the virtual human explanation video generation apparatus 100 can play the voice audio to answer the user's question, making the content convenient for the user to understand.
S4, acquiring a character image and a standard character model from the database, and forming the virtual human based on the character image and the standard character model.
In the embodiment provided by the application, the character image and the standard character model are pre-stored in the database, and the virtual human can be constructed from the character image and the standard character model.
The character image is a two-dimensional image, and the standard character model can be a three-dimensional character model. When the virtual human is constructed, the characteristic parameters of the two-dimensional image and the standard character model parameters can be obtained, and the virtual human is generated from these characteristic parameters and standard character model parameters.
S5, fusing the virtual human into the animation video to form a virtual human explanation video.
In the embodiment provided by the application, the virtual human is inserted into the animation video, so that when the resulting virtual human explanation video is played, the virtual human appears in the video picture.
Specifically, after the virtual human is inserted into the animation video, the action and expression of the virtual human can be driven to stay synchronous with the played voice while the animation video plays, so as to simulate the virtual human speaking.
In the embodiment provided by the application, after receiving the user's question information, the virtual human explanation video generation device 100 queries the database for a target document related to the question information, generates an animation video based on the target document, and fuses the generated virtual human into the animation video to produce the virtual human explanation video. The question raised by the user is answered through the virtual human explanation video, resolving the user's doubt; at the same time, the user can more easily understand the meaning of the target document's content, improving the efficiency with which the user understands the target document.
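To make the overall flow of steps S1 to S5 concrete, the following minimal Python sketch chains the steps together. Every helper name, the in-memory database dictionary, and the returned structure are illustrative assumptions rather than the patented implementation; each stub stands in for a component detailed in the sections below.

```python
from dataclasses import dataclass

@dataclass
class Document:
    fields: str  # the explanatory text of a pre-stored document

def find_target_document(documents: list[Document], question: str) -> Document:
    # S2 (sketch): pick the document sharing the most words with the question.
    words = question.lower().split()
    return max(documents, key=lambda d: sum(w in d.fields.lower() for w in words))

def synthesize_voice(fields: str) -> bytes:
    # S3 (stub): stand-in for a real text-to-speech engine.
    return fields.encode("utf-8")

def generate_explanation_video(question: str, database: dict) -> dict:
    document = find_target_document(database["documents"], question)    # S2
    voice_audio = synthesize_voice(document.fields)                     # S3
    animation_video = {"template": database["video_template"],          # S3
                       "subtitles": document.fields,
                       "audio": voice_audio}
    virtual_human = {"image": database["character_image"],              # S4
                     "model": database["standard_character_model"]}
    return {"video": animation_video, "virtual_human": virtual_human}   # S5
```

A real apparatus 100 would replace each stub with the retrieval, text-to-speech, template-insertion, and fusion components described below.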
The obtaining of the target document related to the question information from the database includes:
interpreting the question information to obtain the meaning content of the question information;
extracting keywords from the meaning content;
and acquiring the target document corresponding to the meaning content from the database based on the keyword.
In the embodiment provided by the present application, the question information input by the user may be expressed colloquially, so the virtual human explanation video generation apparatus 100 needs to understand the question information and, after reorganizing the question, extract keywords from it.
For example, when the question information input by the user is "what insurance is suitable for purchase with low back and leg pain", the virtual human explanation video generation apparatus 100 may extract keywords such as "low back and leg pain" and "insurance", and use the keywords to obtain, from the database, a target document whose meaning content matches the question information.
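The patent does not name a keyword-extraction algorithm; the following is a minimal stopword-filtering sketch, in which the stopword list is an assumption. A production system would more likely use a proper segmenter with term weighting (for Chinese text, a word-segmentation tool) rather than whitespace splitting.

```python
STOPWORDS = {"what", "is", "are", "suitable", "for", "with", "and",
             "a", "an", "the", "to", "of"}

def extract_keywords(question: str) -> list[str]:
    # Keep the content words of the interpreted question, dropping stopwords.
    tokens = question.lower().replace("?", "").split()
    return [token for token in tokens if token not in STOPWORDS]

extract_keywords("what insurance is suitable for purchase with low back and leg pain")
# -> ['insurance', 'purchase', 'low', 'back', 'leg', 'pain']
```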
The acquiring of the target document corresponding to the meaning content from the database based on the keyword comprises:
inputting the keywords into a database for querying to obtain a pre-stored document set associated with the keywords;
and screening out target documents consistent with the meaning content from the pre-stored document set.
In this embodiment of the application, various documents are pre-stored in the database. Through the keywords "low back and leg pain" and "insurance", the virtual human explanation video generation device 100 can query the database for pre-stored documents about "low back and leg pain" and "insurance"; the number of pre-stored documents returned may be large, so when the target document is actually obtained, the document most consistent with the question information is screened out from the pre-stored documents.
In the embodiment provided by the application, when the target document most consistent with the question information is screened out from the pre-stored documents, the pre-stored documents associated with the keywords can be queried, the number of keyword occurrences in each pre-stored document is counted, the pre-stored documents are ranked accordingly, and the pre-stored document containing the most keywords is determined to be the target document.
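A short sketch of this ranking step, under the assumption that "the number of keywords" means the total count of keyword occurrences in each pre-stored document:

```python
def screen_target_document(pre_stored_documents: list[str], keywords: list[str]) -> str:
    """Rank the pre-stored documents by total keyword occurrences and return the top one."""
    def score(document: str) -> int:
        text = document.lower()
        return sum(text.count(keyword.lower()) for keyword in keywords)
    return max(pre_stored_documents, key=score)

documents = ["Insurance can help transfer risk when a loss occurs.",
             "With low back and leg pain, accident and health insurance are worth considering."]
screen_target_document(documents, ["insurance", "low back and leg pain"])
# -> returns the second document (two keyword hits versus one)
```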
In the embodiment provided by the application, when an animation video is generated based on the target document, a field of the target document is obtained. The field may include a plurality of characters; the language of the field may be Chinese, English, and so on, and the field may also include a plurality of characters of mixed languages.
In the embodiment provided by the application, the field of the target document is converted into voice audio. Specifically, the pronunciation of each character in the field is obtained by reading the meaning content of the field in the target document, and the pronunciations of the characters in the field are concatenated to form the voice audio.
In the embodiment provided by the application, taking a Chinese field as an example, the characters in the field may include polyphones. Specifically, when the pronunciation audio is generated, the pronunciation is judged according to the overall meaning of the subfield in which the polyphone appears; by identifying the content meaning of the subfield, the pronunciation of each character in the subfield can be determined.
When the field includes words of mixed languages, taking a field containing Chinese characters and English words as an example, the Chinese characters are pronounced in Chinese and the English words are pronounced in English when the voice audio is generated from the field.
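Before the pronunciations can be generated, a mixed-language field must be segmented into single-language runs so that each run is routed to the matching pronunciation engine. The sketch below shows one plausible segmentation; the polyphone disambiguation described above would additionally require a context-aware pronunciation dictionary, which is omitted here.

```python
import re

def segment_by_language(field: str) -> list[tuple[str, str]]:
    """Split a mixed field into (language, run) pairs so each run can be
    handed to the matching pronunciation engine: Chinese runs are read in
    Chinese, English runs in English."""
    runs = []
    for match in re.finditer(r"[\u4e00-\u9fff]+|[A-Za-z][A-Za-z' ]*[A-Za-z]|[A-Za-z]", field):
        text = match.group().strip()
        language = "zh" if re.match(r"[\u4e00-\u9fff]", text) else "en"
        runs.append((language, text))
    return runs

segment_by_language("购买insurance保险很重要")
# -> [('zh', '购买'), ('en', 'insurance'), ('zh', '保险很重要')]
```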
In the embodiment provided by the application, a video template can be obtained from a database, and the duration of the video template is greater than or equal to the duration of the voice audio.
In the embodiment provided by the present application, when the duration of the video template is greater than the duration of the voice audio, the video template may be clipped so that the duration of the video template matches the duration of the voice audio.
In the embodiment provided by the present application, a field of the target document may be inserted into a video template to serve as a subtitle of the video template, and then the voice audio is inserted into the video template to form the animation video, wherein when the animation video is played, a presentation process of the field serving as the subtitle is synchronized with a playing process of the voice audio.
When the field is inserted into a video template to be used as a subtitle of the video template, the field can be split into a plurality of sub-fields, and the plurality of sub-fields can be displayed one by one when the animation video is played.
In the embodiment provided by the present application, when the animation video is formed, the video template may be decoded to obtain a plurality of video frames, where each video frame has a subtitle box into which a field can be inserted.
In the embodiment provided by the present application, a video frame also has an image frame into which an image can be inserted. When the animation video is formed, the virtual human explanation video generation apparatus 100 can analyze the fields of the target document to identify the content meaning of the target document, query the database for images related to that content meaning, and insert the images into the image frames.
Specifically, the field has a plurality of subfields, and each subfield corresponds to at least one video frame. When images are inserted into the image frames, the content meaning of each subfield is analyzed, and images are queried from the database according to that content meaning. When a subfield corresponds to a plurality of video frames, the virtual human explanation video generation apparatus 100 can query the database for a plurality of consecutive images related to the subfield's content meaning and insert them into the image frames of those video frames respectively. The number of consecutive images queried from the database can be the same as or different from the number of video frames corresponding to the subfield: when the number of consecutive images is greater than the number of video frames, a plurality of images may be inserted into the image frame of one video frame at the same time; when the number of consecutive images is less than the number of video frames, no image may be inserted into the remaining video frames of the subfield.
In a possible implementation manner, when the animation video is formed, the background of each video frame in the video template may be removed, and an image obtained by querying the database is then inserted into the image frame of each video frame to serve as the background of that video frame.
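A sketch of the image-to-frame assignment described above, covering all three cases (as many, more, or fewer images than video frames). The grouping policy when images outnumber frames is an assumption, since the text only says several images may share one video frame:

```python
def assign_images_to_frames(images: list, num_frames: int) -> list[list]:
    """Return, for each of the subfield's video frames, the list of images
    to insert into its image frame (possibly empty)."""
    if num_frames <= 0:
        return []
    if len(images) >= num_frames:
        # More (or as many) images than frames: spread them so some frames
        # receive several images at once.
        per_frame = -(-len(images) // num_frames)  # ceiling division
        return [images[i * per_frame:(i + 1) * per_frame] for i in range(num_frames)]
    # Fewer images than frames: the trailing frames get no image.
    return [[image] for image in images] + [[] for _ in range(num_frames - len(images))]

assign_images_to_frames(["img1", "img2", "img3"], 2)  # -> [['img1', 'img2'], ['img3']]
assign_images_to_frames(["img1"], 3)                  # -> [['img1'], [], []]
```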
In an embodiment provided by the present application, the voice audio has a start time point and an end time point, and the video template likewise has a start time point and an end time point. When the voice audio is inserted into the animation video, the start time point and the end time point of the voice audio are aligned with the start video frame and the end video frame of the video template respectively; at this point, the correspondence between the voice audio and the plurality of video frames can be determined.
In the embodiments provided in the present application, the size of the subtitle box of each video frame may be predetermined, and the size of the text in a subfield may be set manually. When a subfield is inserted into a predetermined subtitle box, the number of characters of the subfield that can fit into the subtitle box is limited. For example, when a subfield has 20 characters and the subtitle box of its corresponding video frame is limited to 15 characters, the subfield may be split into two subfields, each with no more than 15 characters: the 20-character subfield may be split into two subfields of 10 characters each, or into a first subfield of 15 characters and a second subfield of 5 characters. When a subfield is split in this way, the original meaning of the subfield is not changed.
Specifically, when a field in the target document is split into a plurality of subfields, the voice audio is correspondingly split into a plurality of sub-audio segments. When the start time point and the end time point of the voice audio are aligned with the start video frame and the end video frame of the video template respectively, the video frames corresponding to each subfield can be determined, and the video frames corresponding to each sub-audio segment can be determined at the same time.
After the correspondence between each subfield and the video frames in the video template is determined, each subfield is inserted into the subtitle box of the video frames corresponding to it.
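The splitting and frame-alignment logic can be sketched as follows. The fixed-width split and the proportional frame allocation are simplifying assumptions; as noted above, a real split must also keep the subfield's meaning intact, not just respect the subtitle box's character limit.

```python
def split_subfield(subfield: str, char_limit: int) -> list[str]:
    """Split a subfield into pieces that each fit the subtitle box."""
    return [subfield[i:i + char_limit] for i in range(0, len(subfield), char_limit)]

def align_subfields_to_frames(subfields: list[str], num_frames: int) -> list[range]:
    """After pinning the voice audio's start and end points to the first and
    last video frame, give each subfield a contiguous frame range whose length
    is proportional to the subfield's character count."""
    total_chars = sum(len(s) for s in subfields)
    frame_ranges, start = [], 0
    for i, subfield in enumerate(subfields):
        if i == len(subfields) - 1:
            end = num_frames  # the last subfield absorbs any rounding slack
        else:
            end = start + round(num_frames * len(subfield) / total_chars)
        frame_ranges.append(range(start, end))
        start = end
    return frame_ranges

split_subfield("a" * 20, 15)                            # -> two pieces of 15 and 5 characters
align_subfields_to_frames(["hello", "world!!!!!"], 30)  # -> [range(0, 10), range(10, 30)]
```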
In the embodiment of the present application, when forming the virtual human, the characteristic parameters of the character image and the standard character model parameters may be acquired, and the virtual human is then formed based on those characteristic parameters and the standard character model parameters.
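The patent leaves open how the two-dimensional characteristic parameters act on the standard character model; the sketch below assumes both reduce to named scale factors, with the standard model's neutral values replaced wherever the character image supplies a measurement. A production pipeline would instead fit a full 3D morphable model, so treat every name and value here as hypothetical.

```python
def form_virtual_human(characteristic_params: dict[str, float],
                       standard_model_params: dict[str, float]) -> dict[str, float]:
    """Overlay the character image's measured parameters onto the standard
    character model's neutral parameters to obtain the virtual human's parameters."""
    virtual_human = dict(standard_model_params)   # start from the neutral 3D model
    virtual_human.update(characteristic_params)   # apply what the 2D image provides
    return virtual_human

standard = {"face_width": 1.0, "eye_distance": 1.0, "mouth_width": 1.0}
measured = {"face_width": 1.1, "mouth_width": 0.9}  # hypothetical values from the 2D image
form_virtual_human(measured, standard)
# -> {'face_width': 1.1, 'eye_distance': 1.0, 'mouth_width': 0.9}
```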
In the embodiment provided by the application, the voice audio includes a plurality of pronunciations, and when the virtual human is fused into the animation video, the motion trail of the virtual human is synchronized with the voice audio so as to simulate the virtual human speaking.
Specifically, for the virtual human to simulate speaking, its lips need a movement trajectory. In the embodiment provided by the application, the mouth shape of the virtual human corresponding to each pronunciation is therefore determined from each pronunciation of the voice audio.
The lip movement trajectory of the virtual human is then determined on the basis of the mouth shapes corresponding to the pronunciations, so that the movement trajectory of the virtual human stays synchronous with the voice audio.
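A sketch of the pronunciation-to-mouth-shape step. The mouth-shape table, the syllable format, and the (pronunciation, start, end) timing triples are all assumptions; a production system would use a full viseme inventory and interpolate lip keyframes between successive shapes.

```python
# Hypothetical mapping from a pronunciation's main vowel to a mouth shape.
MOUTH_SHAPES = {"a": "wide_open", "o": "rounded", "e": "half_open",
                "i": "spread", "u": "pursed"}

def lip_trajectory(pronunciations: list[tuple[str, float, float]]) -> list[dict]:
    """Build one lip keyframe per pronunciation, timed against the voice audio
    so the lip movement stays synchronous with playback.
    Each pronunciation is (syllable, start_seconds, end_seconds)."""
    keyframes = []
    for syllable, start, end in pronunciations:
        vowel = next((ch for ch in syllable if ch in MOUTH_SHAPES), None)
        keyframes.append({"time": (start + end) / 2,
                          "shape": MOUTH_SHAPES.get(vowel, "neutral")})
    return keyframes

lip_trajectory([("bao", 0.0, 0.3), ("xian", 0.3, 0.6)])
# -> keyframes near t=0.15 ('wide_open') and t=0.45 ('spread')
```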
Referring to fig. 2, an embodiment of the present application further provides a virtual human explanation video generating apparatus 100, where the virtual human explanation video generating apparatus 100 includes:
a receiving module 110, configured to receive question information input by a user;
an obtaining module 120, configured to obtain a target document related to the question information from a database;
an animation video generation module 130, configured to generate an animation video based on the target document, where the animation video includes a voice audio;
a virtual human forming module 140 for acquiring the character image and the standard character model from the database, and forming a virtual human based on the character image and the standard character model;
and the fusion module 150 is used for fusing the virtual human into the animation video to form a virtual human explanation video.
For the concepts, explanations, details and other steps related to the technical solutions provided in the embodiments of the present application, please refer to the description of the method or the contents of the method steps executed by the apparatus in other embodiments, which are not described herein again.
Referring to fig. 3, an electronic device provided in the embodiments of the present application may include a processor 210, a storage 220, and a communication interface 230. The processor 210, the storage 220, and the communication interface 230 are connected by a bus 240, the storage 220 for storing instructions, the processor 210 for executing instructions stored by the storage 220.
The processor 210 is used to execute the instructions stored in the storage 220 to control the communication interface 230 to receive and transmit signals, and to complete the steps of the above-mentioned method. The storage 220 may be integrated in the processor 210, or may be provided separately from the processor 210.
In one possible implementation, the function of the communication interface 230 may be implemented by a transceiver circuit or a dedicated chip for transceiving. Processor 210 may be considered to be implemented by a dedicated processing chip, processing circuit, processor, or a general-purpose chip.
Embodiments of the present application also provide a computer storage medium, which stores computer instructions, and when the computer instructions are executed by a processor, the method described above is implemented.
In another possible implementation manner, the apparatus provided by the embodiment of the present application may be implemented by using a general-purpose computer. Program code that implements the functions of the processor 210 and the communication interface 230 is stored in the storage 220, and a general-purpose processor implements the functions of the processor 210 and the communication interface 230 by executing the code in the storage 220.
As another implementation of the present embodiment, a computer program product is provided that contains instructions that, when executed, perform the method in the above-described method embodiments.
Those skilled in the art will appreciate that in an actual terminal or server, there may be multiple processors and storage. The storage may also be referred to as a storage medium or a storage device, and the like, which is not limited in this application.
It should be understood that, in the embodiments of the present application, the processor may be a Central Processing Unit (CPU), and the processor may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
It should also be understood that references to a memory in embodiments of the present application may be either volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be Random Access Memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DRRAM).
It should be noted that when the processor is a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, the memory (memory module) is integrated in the processor.
It should be noted that the storage described herein is intended to include, but not be limited to, these and any other suitable types of storage.
The bus may include a power bus, a control bus, a status signal bus, and the like, in addition to the data bus. However, for clarity of illustration, the various buses are all labeled as the bus in the figures.
It should also be understood that reference herein to first, second, third, fourth, and various numerical designations is made only for ease of description and should not be used to limit the scope of the present application.
It should be understood that the term "and/or" herein is merely one type of association relationship that describes an associated object, meaning that three relationships may exist, e.g., a and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software. The steps of a method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in a processor. The software module may be located in a random access memory, a flash memory, a read-only memory, a programmable read-only memory or an electrically erasable programmable memory, a register, etc. storage media that are well known in the art. The storage medium is located in a storage, and the processor reads information in the storage and completes the steps of the method in combination with hardware of the processor. To avoid repetition, it is not described in detail here.
In the embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative logical blocks and steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
The modules described as separate components may or may not be physically separate. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the application to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, fiber optic, digital subscriber line) or wirelessly (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk), among others.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. A virtual human explanation video generation method is characterized by comprising the following steps:
receiving question information input by a user;
acquiring a target document related to the question information from a database;
generating an animation video based on the target document, wherein the animation video comprises voice audio;
acquiring a character image and a standard character model from the database, and forming a virtual human based on the character image and the standard character model;
and fusing the virtual human into the animation video to form a virtual human explanation video.
2. A virtual human explanation video generation method as claimed in claim 1, wherein said obtaining a target document related to said question information from a database comprises:
interpreting the question information to obtain the meaning content of the question information;
extracting keywords from the meaning content;
and acquiring the target document corresponding to the meaning content from the database based on the keyword.
3. The virtual human explanation video generation method as claimed in claim 2, wherein the obtaining of the target document corresponding to the meaning content from the database based on the keyword comprises:
inputting the keywords into a database to query to obtain a pre-stored document set associated with the keywords;
and screening out target documents consistent with the meaning content from the pre-stored document set.
4. The virtual human explanation video generation method according to any one of claims 1 to 3, wherein the generating an animation video based on the target document comprises:
acquiring a field of a target document;
generating the voice audio based on the field;
acquiring a video template matched with the field from a database;
inserting the field and the voice audio into the video template to form the animation video.
5. The virtual human explanation video generation method according to claim 4, wherein the inserting the field and the voice audio into the video template to form the animation video comprises:
decoding the video template to obtain a plurality of video frames, wherein the video frames are provided with subtitle boxes into which fields can be inserted;
aligning the starting time point and the ending time point of the voice audio with the starting video frame and the ending video frame of the video template respectively to determine the corresponding relation between the voice audio and the plurality of video frames;
splitting the field to form a plurality of subfields, and determining correspondence of the plurality of subfields to a plurality of video frames in the video template;
and inserting each subfield into the subtitle box in the video frame corresponding to the subfield to form the animation video.
6. The virtual human explanation video generation method according to claim 4, wherein the forming a virtual human based on the character image and a standard character model comprises:
collecting characteristic parameters of the character image and standard character model parameters;
and forming the virtual human on the basis of the characteristic parameters and the standard character model parameters.
7. The virtual human explanation video generation method according to claim 6, wherein the voice audio includes a plurality of pronunciations, and the fusing the virtual human into the animation video to form a virtual human explanation video comprises:
determining mouth shapes of the virtual human corresponding to the pronunciations according to the pronunciations of the voice audio;
determining lip movement tracks of the virtual human on the basis of mouth shapes of the virtual human corresponding to the pronunciations;
and synchronizing the motion trail of the lip of the virtual human with the voice audio to form a virtual human explanation video.
8. A virtual human explanation video generation apparatus, characterized by comprising:
the receiving module is used for receiving question information input by a user;
the acquisition module is used for acquiring a target document related to the question information from a database;
the animation video generation module is used for generating an animation video based on the target document, and the animation video comprises voice audio;
the virtual human forming module is used for acquiring the character image and the standard character model from the database and forming a virtual human based on the character image and the standard character model;
and the fusion module is used for fusing the virtual human into the animation video so as to form a virtual human explanation video.
9. An electronic device, comprising storage to store computer instructions and a processor to invoke the computer instructions to perform the method of any of claims 1-7.
10. A computer storage medium storing computer instructions which, when executed by a processor, implement the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210061976.4A CN114401431B (en) | 2022-01-19 | 2022-01-19 | Virtual person explanation video generation method and related device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210061976.4A CN114401431B (en) | 2022-01-19 | 2022-01-19 | Virtual person explanation video generation method and related device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114401431A true CN114401431A (en) | 2022-04-26 |
CN114401431B CN114401431B (en) | 2024-04-09 |
Family
ID=81231643
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210061976.4A Active CN114401431B (en) | 2022-01-19 | 2022-01-19 | Virtual person explanation video generation method and related device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114401431B (en) |
- 2022-01-19: CN application CN202210061976.4A filed, later granted as patent CN114401431B (en), status Active
Patent Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103258340A (en) * | 2013-04-17 | 2013-08-21 | 中国科学技术大学 | Pronunciation method of three-dimensional visual Chinese mandarin pronunciation dictionary with pronunciation being rich in emotion expression ability |
CN104731959A (en) * | 2015-04-03 | 2015-06-24 | 北京威扬科技有限公司 | Video abstraction generating method, device and system based on text webpage content |
CN110876024A (en) * | 2018-08-31 | 2020-03-10 | 百度在线网络技术(北京)有限公司 | Method and device for determining lip action of avatar |
CN109377539A (en) * | 2018-11-06 | 2019-02-22 | 北京百度网讯科技有限公司 | Method and apparatus for generating animation |
CN110381266A (en) * | 2019-07-31 | 2019-10-25 | 百度在线网络技术(北京)有限公司 | A kind of video generation method, device and terminal |
JP2020005309A (en) * | 2019-09-19 | 2020-01-09 | 株式会社オープンエイト | Moving image editing server and program |
CN110866968A (en) * | 2019-10-18 | 2020-03-06 | 平安科技(深圳)有限公司 | Method for generating virtual character video based on neural network and related equipment |
JP2020065307A (en) * | 2020-01-31 | 2020-04-23 | 株式会社オープンエイト | Server, program, and moving image distribution system |
JP2020096373A (en) * | 2020-03-05 | 2020-06-18 | 株式会社オープンエイト | Server, program, and video distribution system |
CN112328742A (en) * | 2020-11-03 | 2021-02-05 | 平安科技(深圳)有限公司 | Training method and device based on artificial intelligence, computer equipment and storage medium |
CN112785667A (en) * | 2021-01-25 | 2021-05-11 | 北京有竹居网络技术有限公司 | Video generation method, device, medium and electronic equipment |
CN113160366A (en) * | 2021-03-22 | 2021-07-23 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | 3D face animation synthesis method and system |
CN113781610A (en) * | 2021-06-28 | 2021-12-10 | 武汉大学 | Virtual face generation method |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115515002A (en) * | 2022-09-22 | 2022-12-23 | 深圳市木愚科技有限公司 | Intelligent admire class generation method and device based on virtual digital person and storage medium |
CN115761114A (en) * | 2022-10-28 | 2023-03-07 | 如你所视(北京)科技有限公司 | Video generation method and device and computer readable storage medium |
CN115761114B (en) * | 2022-10-28 | 2024-04-30 | 如你所视(北京)科技有限公司 | Video generation method, device and computer readable storage medium |
CN115767202A (en) * | 2022-11-10 | 2023-03-07 | 兴业银行股份有限公司 | Lip language synchronous optimization method and system for virtual character video generation |
CN116520982A (en) * | 2023-04-18 | 2023-08-01 | 广州市宇境科技有限公司 | Virtual character switching method and system based on multi-mode data |
CN116520982B (en) * | 2023-04-18 | 2023-12-15 | 云南骏宇国际文化博览股份有限公司 | Virtual character switching method and system based on multi-mode data |
CN117221465A (en) * | 2023-09-20 | 2023-12-12 | 北京约来健康科技有限公司 | Digital video content synthesis method and system |
CN117221465B (en) * | 2023-09-20 | 2024-04-16 | 北京约来健康科技有限公司 | Digital video content synthesis method and system |
Also Published As
Publication number | Publication date |
---|---|
CN114401431B (en) | 2024-04-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110647636B (en) | Interaction method, interaction device, terminal equipment and storage medium | |
CN114401431A (en) | Virtual human explanation video generation method and related device | |
CN110517689B (en) | Voice data processing method, device and storage medium | |
US10192544B2 (en) | Method and system for constructing a language model | |
CN109461437B (en) | Verification content generation method and related device for lip language identification | |
CN109979450B (en) | Information processing method and device and electronic equipment | |
CN110602516A (en) | Information interaction method and device based on live video and electronic equipment | |
CN110910903B (en) | Speech emotion recognition method, device, equipment and computer readable storage medium | |
CN109545183A (en) | Text handling method, device, electronic equipment and storage medium | |
CN109256133A (en) | A kind of voice interactive method, device, equipment and storage medium | |
CN114390220B (en) | Animation video generation method and related device | |
CN116821290A (en) | Multitasking dialogue-oriented large language model training method and interaction method | |
CN117453871A (en) | Interaction method, device, computer equipment and storage medium | |
CN109065019B (en) | Intelligent robot-oriented story data processing method and system | |
CN114064943A (en) | Conference management method, conference management device, storage medium and electronic equipment | |
CN111160051B (en) | Data processing method, device, electronic equipment and storage medium | |
CN113542797A (en) | Interaction method and device in video playing and computer readable storage medium | |
CN111914115B (en) | Sound information processing method and device and electronic equipment | |
CN109241331B (en) | Intelligent robot-oriented story data processing method | |
CN116895087A (en) | Face five sense organs screening method and device and face five sense organs screening system | |
CN111523343A (en) | Reading interaction method, device, equipment, server and storage medium | |
CN114267324A (en) | Voice generation method, device, equipment and storage medium | |
CN114037946A (en) | Video classification method and device, electronic equipment and medium | |
CN114595314A (en) | Emotion-fused conversation response method, emotion-fused conversation response device, terminal and storage device | |
CN115408500A (en) | Question-answer consistency evaluation method and device, electronic equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |