CN115690277A

CN115690277A - Video generation method, system, device, electronic equipment and computer storage medium

Info

Publication number: CN115690277A
Application number: CN202211311695.6A
Authority: CN
Inventors: 杨晓波; 段瑞波; 洪星; 曹凯; 李岩; 刘建双; 蒋健
Original assignee: Mobvoi Innovation Technology Co Ltd
Current assignee: Mobvoi Innovation Technology Co Ltd
Priority date: 2022-10-25
Filing date: 2022-10-25
Publication date: 2023-02-03

Abstract

An embodiment of the invention discloses a video generation method, system, and apparatus, an electronic device, and a computer storage medium. The method comprises: acquiring text information corresponding to a target document; generating corresponding audio information according to the text information; driving a preset target digital image to generate an avatar video based on the audio information; and generating a target video according to the avatar video and the video corresponding to the target document, so that the target document is explained by the target digital image in the target video. Because the target video combines the avatar video with the video corresponding to the target document, the explanation of the text information in the target document is more concrete and vivid. This improves the text demonstration effect without additional labor time or production cost, which helps reduce the cost of text demonstration.

Description

Video generation method, system, device, electronic equipment and computer storage medium

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a video generation method, system, apparatus, electronic device, and computer storage medium.

Background

Conventional text demonstrations take a single display form and offer a poor viewing experience. Obtaining a better display effect requires a large investment of labor time and production cost. With the development of artificial intelligence, techniques for generating virtual digital image videos are steadily maturing. Combined with existing multimedia technology, virtual digital images will appear in more and more application scenarios, such as information broadcasting and cultural promotion. On this basis, text demonstrations can be combined with digital image video generation technology to improve the text demonstration effect and reduce its cost.

Disclosure of Invention

In view of this, embodiments of the present invention aim to provide a video generation method, system, apparatus, electronic device, and computer storage medium, so as to improve the text demonstration effect and reduce the text demonstration cost.

In a first aspect, an embodiment of the present invention is directed to a video generation method, where the method includes:

acquiring text information corresponding to a target document;

generating corresponding audio information according to the text information;

driving a preset target digital image to generate an avatar video based on the audio information;

and generating a target video according to the avatar video and the video corresponding to the target document, so as to explain the target document through the target digital image in the target video.

Further, the target digital image in the target video has editability, and the method further comprises:

adjusting attribute information of a target digital image in the target video, wherein the attribute information comprises at least one of a display position and a display size of the target digital image in the target video.

Further, the generating the target video according to the avatar video and the video corresponding to the target document includes:

converting the target document into a corresponding video file;

and synthesizing the avatar video and the video file to generate the target video.

Further, the audio information corresponding to the avatar video has a temporal correspondence with the text information displayed in the video file.

Further, the driving a preset target digital image to generate an avatar video based on the audio information includes:

generating a part action sequence corresponding to the text information according to the audio information;

and generating a corresponding avatar video according to the part action sequence and the preset target digital image.

Further, the generating corresponding audio information according to the text information includes:

and inputting the text information into a preset speech synthesis model to generate corresponding audio information.

In a second aspect, an embodiment of the present invention is directed to a video generation system, including:

a front-end layer configured to determine a target document and a corresponding target digital image based on an interactive operation;

a back-end service layer configured to acquire text information corresponding to the target document, generate corresponding audio information according to the text information, drive a preset target digital image to generate an avatar video based on the audio information, and generate a target video according to the avatar video and the video corresponding to the target document, so as to explain the target document through the target digital image in the target video;

a back-end base layer configured to provide service support to the back-end service layer.

In a third aspect, an embodiment of the present invention is directed to a video generating apparatus, where the apparatus includes:

the acquisition unit is used for acquiring text information corresponding to the target document;

the processing unit is used for generating corresponding audio information according to the text information;

the generating unit is used for driving a preset target digital image to generate an avatar video based on the audio information;

and the display unit is used for generating a target video according to the avatar video and the video corresponding to the target document, so as to explain the content in the target document through the target digital image in the target video.

In a fourth aspect, embodiments of the present invention are directed to an electronic device, including a memory and a processor, the memory storing one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method as in any one of the above.

In a fifth aspect, embodiments of the present invention are directed to a computer-readable storage medium having a computer program stored therein, where the computer program is executed by a processor to implement the method steps of any one of the above.

The technical scheme of the embodiment of the invention comprises: acquiring text information corresponding to a target document; generating corresponding audio information according to the text information; driving a preset target digital image to generate an avatar video based on the audio information; and generating a target video according to the avatar video and the video corresponding to the target document, so that the target document is explained by the target digital image in the target video. Because the target video combines the avatar video with the video corresponding to the target document, the explanation of the text information in the target document is more concrete and vivid. This improves the text demonstration effect without additional labor time or production cost, which helps reduce the cost of text demonstration.

Drawings

The above and other objects, features and advantages of the present invention will become more apparent from the following description of the embodiments of the present invention with reference to the accompanying drawings, in which:

FIG. 1 is a schematic diagram of a video generation system of an embodiment of the present invention;

FIG. 2 is a flow chart of a video generation method of an embodiment of the present invention;

FIG. 3 is a flow chart of generating an avatar video in accordance with an embodiment of the present invention;

FIG. 4 is a flow diagram of generating a target video according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a video generation apparatus of an embodiment of the present invention;

FIG. 6 is a schematic diagram of an electronic device of an embodiment of the present invention.

Detailed Description

The present invention will be described below based on examples, but the present invention is not limited to only these examples. In the following detailed description of the present invention, certain specific details are set forth. It will be apparent to one skilled in the art that the present invention may be practiced without these specific details. Well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the present invention.

Furthermore, those of ordinary skill in the art will appreciate that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale.

Unless the context clearly requires otherwise, throughout the description, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including, but not limited to".

In the description of the present invention, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.

Conventional text demonstrations take a single display form, lack vividness, and offer a poor viewing experience. Meanwhile, improving the display effect of a text demonstration is costly. In view of this, embodiments of the present invention aim to provide a video generation method that improves the text demonstration effect by applying a digital image, and that can reduce the text demonstration cost.

Fig. 1 is a schematic diagram of a video generation system of an embodiment of the present invention. As shown in fig. 1, the video generation system of the present embodiment includes a front-end layer 100, a back-end service layer 200, and a back-end base layer 300. The front-end layer 100 is user-facing and is configured to determine a target document and a corresponding target digital image based on an interactive operation. The back-end service layer 200 is configured to acquire text information corresponding to the target document, generate corresponding audio information according to the text information, drive a preset target digital image to generate an avatar video based on the audio information, and generate a target video according to the avatar video and the video corresponding to the target document, so as to explain the target document through the target digital image in the target video. The back-end base layer 300 is configured to provide service support to the back-end service layer 200.

Optionally, the target document in this embodiment is a presentation-format file (for example, .ppt or .pptx). The front-end layer 100 in this embodiment includes a document editing module and a digital image editing module. The document editing module is provided with text demonstration software; a user can edit and modify the target document through the editing function of that software, and a video file corresponding to the target document can be generated through the software's text-to-video function. The digital image editing module provides a digital image management function. Through interactive operations related to this function, a user can select a target digital image from the digital image management library according to actual needs and can set its display characteristics (including pronunciation tone, pronunciation rhythm, image size, accessories, and the like).

Optionally, the back-end service layer 200 in this embodiment includes an interface service module, a service application module, and a basic service module. The interface service module comprises a digital image interface and a document interface. The digital image interface receives the target digital image determined by the digital image editing module, so that the back-end service layer 200 obtains the target digital image corresponding to each task (a task may be an order) and then generates the corresponding avatar video and target video according to that target digital image. The document interface receives the target document determined by the document editing module, so as to obtain the text information corresponding to the target document.

Optionally, the service application module in this embodiment includes a document conversion service module and a synthesis service module. The document conversion service module is configured to convert the target document into a corresponding video file when the text-to-video function is invoked from the front-end layer 100. The synthesis service module is configured to generate corresponding audio information according to the text information of the target document, drive the target digital image to generate an avatar video based on the audio information, and generate the target video according to the avatar video and the video file corresponding to the target document, so that the target document is explained by the target digital image in the target video.

Optionally, the basic service module in this embodiment includes a permission management unit, an authentication management unit, an OSS (operation support unit, configured to provide technical support and management), Consul (open-source software that provides service registration and discovery), and the like. The permission management unit grants data or function permissions to each module through preset security rules or policies. The authentication management unit checks the legality and integrity of user data to improve system security. The units of the basic service module thus cooperate to ensure the reliability and security of the functions implemented by the service application module.

Optionally, the back-end base layer in this embodiment includes a basic data layer and a basic resource layer. The basic data layer includes Redis (Remote Dictionary Server), MySQL (a relational database management system), MongoDB (a database based on distributed file storage, whose data can be replicated to multiple servers), and RabbitMQ (message-oriented middleware; open-source message broker software that implements the Advanced Message Queuing Protocol, AMQP). The basic resource layer comprises a cloud service unit, a link monitoring unit, and Docker. The module units of the back-end base layer together form an integrated fault-handling model in which network processing and function processing are combined, and the system response rate is high (SSL can reach 99.9%). Meanwhile, the distributed registration center allows multiple services to be registered and brought into operation in a timely manner, and supports high-level fault-tolerance mechanisms such as service liveness, group isolation, and service degradation.

Through the system architecture described above, the technical scheme of the embodiment of the invention acquires the text information corresponding to the target document, generates corresponding audio information according to the text information, drives the preset target digital image to generate an avatar video based on the audio information, and generates the target video according to the avatar video and the video corresponding to the target document, so that the target document is explained by the target digital image in the target video. Because the target video combines the avatar video with the video corresponding to the target document, the explanation of the text information in the target document is more concrete and vivid, which improves the text demonstration effect. The user only needs to edit the target document and the target digital image to generate a target video for text demonstration, so no additional labor time or production cost is required, which helps reduce the text demonstration cost. Moreover, the technical scheme of this embodiment can be deployed on an existing computer system architecture without major changes to the hardware configuration, which helps reduce the text demonstration cost further.

Fig. 2 is a flowchart of a video generation method according to an embodiment of the present invention. As shown in fig. 2, the video generation method in the present embodiment includes the following steps.

In step S100, text information corresponding to the target document is acquired.

In this embodiment, the text information includes content in different forms, such as characters, numbers, and symbols, and is displayed in sequence. After receiving the target document, the back-end service layer of the video generation system parses the target document to obtain the corresponding text information.
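The parsing step can be illustrated with a minimal sketch. It assumes the target document is a .pptx file, which is a ZIP archive whose slide text is stored in DrawingML `<a:t>` elements under `ppt/slides/`. The function name `extract_slide_texts` is hypothetical, and the sketch further assumes that the slide file numbering matches the display order, which a real presentation does not guarantee (the authoritative order is recorded in the presentation part). It is not the parser described in this embodiment.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace used for text runs (<a:t>) in .pptx slide XML.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"


def extract_slide_texts(pptx_bytes: bytes) -> list[list[str]]:
    """Return, for each slide file in numeric order, its list of text runs."""
    with zipfile.ZipFile(io.BytesIO(pptx_bytes)) as zf:
        names = [
            n for n in zf.namelist()
            if re.fullmatch(r"ppt/slides/slide\d+\.xml", n)
        ]
        # Sort numerically so that slide10 comes after slide2.
        names.sort(key=lambda n: int(re.search(r"(\d+)\.xml$", n).group(1)))
        slides = []
        for name in names:
            root = ET.fromstring(zf.read(name))
            slides.append([t.text for t in root.iter(A_NS + "t") if t.text])
        return slides
```

Keeping the text grouped per slide preserves the sequential order that later steps rely on when assigning timestamps.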

In step S200, corresponding audio information is generated according to the text information.

In this embodiment, the text information is input into the preset speech synthesis model to generate corresponding audio information, where the audio information includes multiple consecutive frames of audio. The preset speech synthesis model may be a TTS (text-to-speech) model obtained by training on a text corpus stored in a database in advance.

When generating the audio information from the text information, this embodiment may first determine the phoneme information in the text information, then predict the duration (i.e., pronunciation time) and the fundamental frequency (i.e., pitch and intonation) of each phoneme according to its pronunciation rules, and finally synthesize the phonemes with their durations and fundamental frequencies into a raw waveform, so as to output a sound signal (i.e., the audio information). Optionally, the duration and fundamental frequency of the phoneme information corresponding to the text information may be determined by a segmentation model embedded in the speech synthesis model. Generating the audio information through the speech synthesis model is efficient, which improves the efficiency of the video generation method. In addition, the processing of the text information allows the audio information to carry different accents, emotions, breathing sounds, and other components of the human voice, which makes the audio information closer to a real human voice and improves the demonstration of the text information corresponding to the target document.
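The final stage of this process, turning phonemes with predicted durations and fundamental frequencies into a raw waveform, can be illustrated with a toy sketch. A real speech synthesis model uses a trained vocoder; here each phoneme is rendered as a plain sine tone at its fundamental frequency purely to show how duration and pitch determine the output samples. The function name `synthesize`, the tuple layout, and the convention that a fundamental frequency of 0 means silence are all assumptions made for illustration.

```python
import math


def synthesize(phonemes, sample_rate=16000):
    """Render (label, duration_seconds, f0_hz) triples as a list of samples.

    An f0 of 0 is treated as an unvoiced or silent segment.
    """
    samples = []
    phase = 0.0  # carried across phonemes so the tone has no jumps
    for _label, duration, f0 in phonemes:
        n = round(duration * sample_rate)  # duration fixes the sample count
        for _ in range(n):
            samples.append(0.5 * math.sin(phase) if f0 > 0 else 0.0)
            phase += 2.0 * math.pi * f0 / sample_rate  # f0 fixes the pitch
    return samples
```

For example, `synthesize([("n", 0.05, 120.0), ("i", 0.10, 180.0)])` yields 2400 samples at 16 kHz, with the second phoneme sounding higher than the first.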

Optionally, in this embodiment, TTS models with different parameters may be selected in different scenarios to generate different forms of audio information. For example, audio information with different timbres, morphemes, volumes or tones can be generated by adjusting different model parameters, audio information with different audio coding formats at different sampling rates can be generated, and audio information with more characteristics can be generated according to emotion or other forms of labels of text information. Therefore, different forms of audio information are generated by selecting the speech synthesis models with different parameters, the use requirements under different use scenes are met, and the applicability of the video generation method in the embodiment is improved.
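As one concrete instance of the variations mentioned above, the sketch below encodes the same samples at a caller-chosen sampling rate. It assumes 16-bit mono PCM in a WAV container, which is only one of many possible audio coding formats, and the function name `encode_wav` is hypothetical.

```python
import io
import struct
import wave


def encode_wav(samples, sample_rate):
    """Encode floats in [-1.0, 1.0] as 16-bit mono PCM WAV bytes."""
    frames = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wf.setframerate(sample_rate)
        wf.writeframes(frames)
    return buf.getvalue()
```

Calling `encode_wav(samples, 8000)` and `encode_wav(samples, 44100)` produces files that declare different sampling rates; the samples themselves would need to be generated or resampled for the chosen rate.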

In step S300, a preset target digital character is driven based on audio information to generate an avatar video.

In this embodiment, the preset target digital image is selected from a plurality of digital images according to the user's requirements received at the front-end layer of the video generation system. The target digital image may be any human image; it may be determined from a photograph containing a real person, or by extracting a frame containing a real person from a video. For example, when the target digital image is determined from a video containing a person, the footage may be shot against a green or blue background, and the background may be removed in post-production by color keying (chroma keying) on a special-effects device. Because this way of determining the target digital image is convenient and widely used, it helps improve the efficiency and reduce the cost of generating the avatar video.
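The color-key background removal mentioned above can be sketched in software as follows. The sketch marks a pixel as background when its green channel is both bright and clearly dominant over red and blue. The function name `chroma_key_mask` and the thresholds `dominance` and `min_green` are illustrative assumptions, not values from this embodiment; production keyers also handle soft edges and color spill.

```python
def chroma_key_mask(pixels, dominance=1.3, min_green=100):
    """Return an alpha mask for rows of (r, g, b) pixels.

    0 marks green-screen background, 255 marks foreground to keep.
    """
    mask = []
    for row in pixels:
        mask.append([
            0 if (g >= min_green and g > dominance * max(r, b)) else 255
            for r, g, b in row
        ])
    return mask
```

The resulting mask can be used as the alpha channel of the extracted human image, so that only the person is carried into the avatar video.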

Fig. 3 is a flowchart of generating an avatar video according to an embodiment of the present invention. As shown in fig. 3, the method of generating an avatar video in the present embodiment includes the following steps.

In step S310, a part action sequence corresponding to the text information is generated from the audio information.

In this embodiment, the part action sequence includes a facial action sequence. When generating the avatar video, this embodiment processes the audio information with a pre-trained neural network model for generating action sequences, thereby producing the corresponding facial action sequence. The neural network model converts the audio information into feature vectors and predicts the facial action sequence from those feature vectors, which can greatly improve the efficiency of presenting facial actions in the avatar video corresponding to the target digital image.

Optionally, the facial action sequence in this embodiment includes a mouth action sequence. In this embodiment, a sequence model is used to generate the mouth action sequence corresponding to the text information according to the audio information. The sequence model may be a GRU, LSTM, RNN, or similar model. To generate the mouth action sequence, the sequence model first processes the audio information and extracts its Mel spectrum features. From these features, the coordinates of the key parts of the mouth of the target digital image and the phoneme labels during the broadcast of the audio information are predicted. The coordinates of the key parts of the mouth, the phoneme labels, and the audio information are then spliced to produce a splicing result, and the splicing result is finally processed to obtain the action parameters of the mouth shape for each output frame of audio.
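The Mel spectrum features mentioned above are computed on the mel scale, a perceptual frequency scale that is finer at low frequencies. The text does not state which mel formula is used; the sketch below assumes the widely used form m = 2595 log10(1 + f / 700) and shows how band center frequencies are spaced evenly in mel rather than in hertz. The function names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    """Convert hertz to mel using the common 2595 * log10(1 + f / 700) form."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_bands, f_min, f_max):
    """Return n_bands center frequencies in Hz, evenly spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bands)]
```

Because the bands are evenly spaced in mel, their spacing in hertz widens toward higher frequencies, which concentrates resolution in the range most relevant to speech.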

Optionally, to make the generated avatar video more vivid, after the expressionless mouth action sequence is generated, this embodiment further analyzes the text information, locates the emotion keywords in it, and labels them to generate emotion tags. A mouth action sequence with specific expression changes is then generated from the expressionless mouth action sequence and the emotion tags. Each emotion tag includes the emotion keyword (such as positive, happy, positive energy, and the like), the broadcast timestamp of the emotion keyword, its emotion characteristics, and the like.
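A minimal sketch of emotion tagging follows. It assumes that word-level broadcast timestamps are already available and uses a small hard-coded lexicon; the names `EmotionTag`, `EMOTION_LEXICON`, and `tag_emotions`, as well as the lexicon entries, are hypothetical and stand in for whatever text analysis an implementation actually uses.

```python
from dataclasses import dataclass


@dataclass
class EmotionTag:
    keyword: str    # the emotion keyword as it appears in the text
    start: float    # broadcast timestamp of the keyword, in seconds
    emotion: str    # emotion characteristic assigned to the keyword


# Hypothetical lexicon mapping keywords to emotion characteristics.
EMOTION_LEXICON = {"happy": "joy", "great": "joy", "sorry": "sadness"}


def tag_emotions(words):
    """words: list of (word, start_time_seconds) in broadcast order."""
    tags = []
    for word, start in words:
        emotion = EMOTION_LEXICON.get(word.lower().strip(".,!?"))
        if emotion:
            tags.append(EmotionTag(word, start, emotion))
    return tags
```

Because each tag carries its broadcast timestamp, a later stage can apply the expression change to exactly the frames in which the keyword is spoken.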

Optionally, after generating the facial action sequence with expression changes, this embodiment may further add an eye action sequence to the facial action sequence, and/or add a limb action sequence to the part action sequence of the target digital image. For example, when the mouth action of the target digital image corresponds to a positive keyword in the "look" category, actions such as blinking or giving a thumbs-up may be added. Combining the mouth, eye, and limb action sequences makes the avatar video more vivid. In addition, because the emotion tags and the corresponding audio information share timestamps, the mouth, eye, and limb actions associated with the same audio information stay in correspondence with one another and are displayed while that audio information is broadcast, which makes the avatar video more lifelike.
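The combination of several action sequences on a shared time axis can be sketched as a merge of time-sorted tracks. The representation of each action as a `(timestamp, action)` pair and the function name `merge_tracks` are assumptions made for illustration.

```python
import heapq


def merge_tracks(tracks):
    """Merge per-part action tracks into one time-ordered sequence.

    tracks: dict mapping a part name (for example "mouth") to a list of
    (timestamp_seconds, action) pairs that is already sorted by timestamp.
    Returns a list of (timestamp, part, action) tuples.
    """
    tagged = [
        [(t, part, action) for t, action in seq]
        for part, seq in tracks.items()
    ]
    return list(heapq.merge(*tagged))
```

Actions that share a timestamp end up adjacent in the merged sequence, so a renderer can apply the mouth, eye, and limb actions for the same instant together.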

In step S320, a corresponding avatar video is generated according to the part motion sequence and the preset target digital avatar.

In this embodiment, after the part action sequence corresponding to the audio information is generated, the preset target digital image and the part action sequence corresponding to the audio information of the target document are input into a preset neural network model. The model generates a corresponding part action map according to the action parameters in the part action sequence, and the part action map is spliced onto the position of the corresponding part in the preset target digital image. The part action sequence is thereby adapted to the target digital image, and an avatar video matching the audio information corresponding to the text information in the target document is generated.

Optionally, to improve the quality of the generated avatar video, the part action map may be adjusted according to the part action map and the corresponding position in the target digital image. For example, an image warping algorithm, such as inverse distance weighted (IDW) interpolation, may be used to fit the part action map precisely to the corresponding position in the target digital image. This improves how well the action sequence corresponding to the part action map fits the target digital image, which in turn improves the quality of the generated avatar video and the quality and display effect of the subsequently generated target video.
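Inverse distance weighted interpolation, named above as an example warping technique, can be sketched as follows. Given control points with known displacements (for instance, mouth key points and how far each must move), the displacement at any other pixel is a weighted average in which nearer control points count more. The function name `idw_interpolate`, the `power` parameter default of 2, and the data layout are illustrative assumptions; the sketch expects at least one control point.

```python
import math


def idw_interpolate(point, controls, power=2.0):
    """Interpolate a displacement at `point` from control displacements.

    controls: non-empty list of ((x, y), (dx, dy)) pairs.
    Weights are 1 / distance**power, so closer controls dominate.
    """
    num_x = num_y = den = 0.0
    for (cx, cy), (dx, dy) in controls:
        d = math.hypot(point[0] - cx, point[1] - cy)
        if d == 0.0:
            # Exactly on a control point: use its displacement directly.
            return (dx, dy)
        w = 1.0 / d ** power
        num_x += w * dx
        num_y += w * dy
        den += w
    return (num_x / den, num_y / den)
```

Evaluating this for every pixel of the part action map gives a smooth displacement field that moves the map's key points onto the matching positions in the target digital image.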

In step S400, a target video is generated according to the avatar video and a video corresponding to the target document, so that the target document is interpreted by a target digital avatar in the target video.

In the embodiment, the target video is generated according to the virtual image video and the video corresponding to the target document, so that the target digital image explains the target document, the interestingness of target document demonstration is enhanced, the text demonstration effect can be improved, and meanwhile, the explanation cost of a real person can be reduced.

Fig. 4 is a flow chart of generating a target video according to an embodiment of the present invention. As shown in fig. 4, the target video is generated by the following steps in the present embodiment.

In step S410, the target document is converted into a corresponding video file.

In this embodiment, after the text-to-video function is invoked from the front-end layer, the document conversion service module in the service application module converts the target document into a corresponding video file. The text information in the target document is assigned timestamps in sequence according to its display order in the target document. Accordingly, the text information displayed in the generated video file still carries those timestamps, which preserves the display order of each piece of text information during the text demonstration.

In step S420, the avatar video is synthesized with the video file to generate a target video.

In this embodiment, the avatar video is added to the video file to generate the target video, which carries out the text demonstration of the target document. The target video combines three display modes: the video file displays the text information in the target document, the audio information broadcasts the content of the text information, and the actions of the target digital image simulate a presenter explaining the text information. The resulting broadcast has a good audio-visual effect, makes the text demonstration more engaging, and helps the demonstration meet the needs of different audiences.

Further, in the present embodiment, the audio information corresponding to the avatar video has a temporal correspondence with the text information displayed in the video file. Specifically, the action or behavior of the target digital character and the audio information in the avatar video correspond to text information at a word level, and the text information corresponding to the avatar video is the same as the text information corresponding to the video file, so that the timestamps corresponding to the same text information fields in the avatar video and the video file are the same when the explanation process is presented. Therefore, through the setting and the correspondence of the timestamps, in the process of playing the target video, the broadcasting of the audio information is aligned with the display time of the corresponding text information, and the action behavior of the target digital image is aligned with the corresponding audio information time, so that the matching of the audio information, the text information and the action behavior of the target digital image can be ensured, and the text demonstration effect is further optimized.
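The timestamp correspondence described above can be sketched as a shared timeline. The sketch assumes that each text field has a known audio duration and derives a start timestamp for each field from the cumulative durations; both the video file and the avatar video can then look up which field is active at a given playback time. The function names `build_timeline` and `field_at` and the data layout are hypothetical.

```python
import bisect
from itertools import accumulate


def build_timeline(fields):
    """fields: list of (text, audio_duration_seconds) in display order.

    Returns a list of (start_timestamp, text), one entry per field.
    """
    ends = list(accumulate(duration for _, duration in fields))
    starts = [0.0] + ends[:-1]
    return [(start, text) for start, (text, _) in zip(starts, fields)]


def field_at(timeline, t):
    """Return the text field being displayed and broadcast at time t."""
    starts = [start for start, _ in timeline]
    i = bisect.bisect_right(starts, t) - 1
    return timeline[max(i, 0)][1]
```

Since the avatar video and the video file both resolve a playback time through the same timeline, the broadcast audio, the displayed text, and the avatar's actions refer to the same text field at every instant.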

Further, in order to further optimize the text demonstration effect, the target digital image in the target video of the embodiment has editability, and the video generating method of the embodiment further includes:

In step S500, attribute information of the target digital image in the target video is adjusted.

In this embodiment, the attribute information includes at least one of a display position and a display size of the target digital character in the target video. Therefore, the display position and the display size of the target digital image in the target video are adjusted by adjusting the attribute information of the target digital image in the target video, and the display effect of the target video and the text demonstration effect of the corresponding target document are further optimized.

Optionally, in this embodiment, the avatar video and the video file are placed on separate layers, each displaying its own content, so that the display effect of each can be adjusted within its own layer. For example, after the layer corresponding to the avatar video is selected, the attribute information of the target digital image in the avatar video may be adjusted through the operation options of that layer; after the layer corresponding to the video file is selected, display information such as the text font or text spacing in the video file may be adjusted through the operation options of that layer. With separate layers, the contents of the avatar video and the video file can be adjusted without interfering with each other, which yields a better display effect, makes the presentation of the target document more engaging, and optimizes its text demonstration effect.
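Adjusting the display position and display size of the avatar layer can be sketched as computing a rectangle on the video canvas. The `Layer` type, the function name `place_avatar`, and the choice to clamp the avatar so it stays fully inside the canvas are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    x: int       # left edge on the canvas, in pixels
    y: int       # top edge on the canvas, in pixels
    width: int
    height: int


def place_avatar(canvas, avatar_size, scale=1.0, position=(0, 0)):
    """Return the avatar layer rectangle after scaling and positioning.

    canvas and avatar_size are (width, height); position is the requested
    top-left corner. The result is clamped to stay inside the canvas.
    """
    canvas_w, canvas_h = canvas
    w = min(max(1, round(avatar_size[0] * scale)), canvas_w)
    h = min(max(1, round(avatar_size[1] * scale)), canvas_h)
    x = min(max(position[0], 0), canvas_w - w)
    y = min(max(position[1], 0), canvas_h - h)
    return Layer("avatar", x, y, w, h)
```

Only the avatar layer's rectangle changes; the layer holding the video file is untouched, which mirrors the non-interference between layers described above.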

The technical scheme of the embodiment of the invention comprises the steps of acquiring text information corresponding to a target document, generating corresponding audio information according to the text information, driving a preset target digital image to generate an avatar video based on the audio information, and generating the target video according to the avatar video and the video corresponding to the target document so as to explain the target document through the target digital image in the target video. Because the target video is generated by combining the avatar video with the video corresponding to the target document, the explanation of the text information in the target document is more specific and vivid, and the text demonstration effect can be improved. In addition, adjusting the attribute information of the target digital image in the target video can make the presentation of the target document more engaging and further optimize its text demonstration effect. Moreover, the user only needs to prepare the target document and select the target digital image, and the video generation system then produces a complete and engaging text demonstration. The scheme is simple to implement, generates the target video efficiently, supports batch processing, and requires no additional labor time or production cost, which helps reduce the cost of text demonstration.

Fig. 5 is a schematic diagram of a video generation apparatus according to an embodiment of the present invention. As shown in fig. 5, the video generation apparatus of this embodiment includes an acquisition unit 1, a processing unit 2, a generating unit 3, and a display unit 4. The acquisition unit 1 is used for acquiring text information corresponding to a target document. The processing unit 2 is configured to generate corresponding audio information according to the text information. The generating unit 3 is used for driving a preset target digital image to generate an avatar video based on the audio information. The display unit 4 is used for generating a target video according to the avatar video and the video corresponding to the target document so as to explain the content in the target document through the target digital image in the target video.

Optionally, the processing unit 2 of this embodiment is specifically configured to input the text information to a preset speech synthesis model to generate the audio information corresponding to the text information. The generating unit 3 of this embodiment is specifically configured to generate a part motion sequence corresponding to the text information according to the audio information, and to generate a corresponding avatar video according to the part motion sequence and a preset target digital image. The display unit 4 is specifically configured to convert the target document into a corresponding video file, and to synthesize the avatar video and the video file to generate a target video. Further, in this embodiment, the audio information corresponding to the avatar video has a temporal correspondence with the text information displayed in the video file, so that during the text presentation the sound emitted by the target digital image is aligned in time both with its mouth shape and with the display of the corresponding text in the video file.
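The data flow between the four units can be summarized in a short sketch. Every function body here is a placeholder: the embodiment does not disclose the speech synthesis model, the motion generator, or the video encoder, so strings and fixed per-word durations stand in for them. All names (`acquire_text`, `synthesize_audio`, `generate_avatar_video`, `compose_target_video`, `generate_target_video`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Audio:
    words: List[str]
    durations: List[float]  # seconds per word (placeholder for real audio)

    @property
    def length(self) -> float:
        return sum(self.durations)


@dataclass
class AvatarVideo:
    avatar: str
    motions: List[str]  # placeholder for the part motion sequence
    length: float


@dataclass
class TargetVideo:
    avatar_video: AvatarVideo
    document_video: str  # placeholder for the converted video file
    length: float


def acquire_text(document: str) -> str:
    """Acquisition unit: extract text information (here, normalize whitespace)."""
    return " ".join(document.split())


def synthesize_audio(text: str, seconds_per_word: float = 0.5) -> Audio:
    """Processing unit: stand-in for a preset speech synthesis model."""
    words = text.split()
    return Audio(words, [seconds_per_word] * len(words))


def generate_avatar_video(audio: Audio, avatar: str) -> AvatarVideo:
    """Generating unit: derive one placeholder motion per word from the audio."""
    motions = [f"mouth:{word}" for word in audio.words]
    return AvatarVideo(avatar, motions, audio.length)


def compose_target_video(avatar_video: AvatarVideo, document: str) -> TargetVideo:
    """Display unit: convert the document to a video file and combine the two."""
    document_video = f"video({document!r})"
    return TargetVideo(avatar_video, document_video, avatar_video.length)


def generate_target_video(document: str, avatar: str) -> TargetVideo:
    text = acquire_text(document)
    audio = synthesize_audio(text)
    avatar_video = generate_avatar_video(audio, avatar)
    return compose_target_video(avatar_video, document)


result = generate_target_video("Hello  digital\nworld", "presenter")
```

The sketch only shows the order of the stages and what each one hands to the next; in a real system each placeholder would be replaced by the corresponding model or encoder.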

Optionally, as shown in fig. 1, the target digital image in the target video of this embodiment is editable. The video generation apparatus in this embodiment further includes an adjusting unit 5. The adjusting unit 5 is used for adjusting the attribute information of the target digital image in the target video, where the attribute information includes at least one of a display position and a display size of the target digital image in the target video.

According to the technical scheme, the target video is generated by combining the virtual image video and the video corresponding to the target document, so that the explanation of the text information in the target document is more specific and vivid, the text demonstration effect can be improved, extra manual time and manufacturing cost are not required to be increased, and the text demonstration cost is favorably reduced.

Fig. 6 is a schematic diagram of an electronic device of an embodiment of the invention. The electronic device shown in fig. 6 is a general address query device, which includes a general computer hardware structure comprising at least a processor 61 and a memory 62. The processor 61 and the memory 62 are connected by a bus 63. The memory 62 is adapted to store instructions or programs executable by the processor 61. The processor 61 may be a stand-alone microprocessor or a collection of one or more microprocessors. Thus, the processor 61 processes data and controls other devices by executing the instructions stored in the memory 62, thereby performing the method flows of the embodiments of the present invention described above. The bus 63 connects the above components to one another and also connects them to a display controller 64, a display device, and input/output (I/O) devices 65. The input/output (I/O) devices 65 may be a mouse, keyboard, modem, network interface, touch input device, motion-sensitive input device, printer, or other devices known in the art. Typically, the input/output devices 65 are connected to the system through an input/output (I/O) controller 66.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, apparatus (device) or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may employ a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations of methods, apparatus (devices) and computer program products according to embodiments of the application. It will be understood that each flow in the flow diagrams can be implemented by computer program instructions.

These computer program instructions may be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows.

These computer program instructions may also be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows.

Another embodiment of the invention is directed to a non-transitory storage medium storing a computer-readable program for causing a computer to perform some or all of the above-described method embodiments.

That is, as can be understood by those skilled in the art, all or part of the steps of the methods in the embodiments described above may be accomplished by a program instructing the relevant hardware, where the program is stored in a storage medium and includes several instructions that enable a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made to the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method of video generation, the method comprising:

acquiring text information corresponding to a target document;

generating corresponding audio information according to the text information;

2. The method of claim 1, wherein the target digital avatar in the target video is editable, the method further comprising:

3. The method of claim 1, wherein generating a target video from the avatar video and a video corresponding to a target document comprises:

converting the target document into a corresponding video file;

4. The method of claim 3, wherein the audio information corresponding to the avatar video has a temporal correspondence with the text information displayed in the video file.

5. The method of claim 1, wherein said driving a preset target digital image to generate an avatar video based on said audio information comprises:

6. The method of claim 1, wherein generating corresponding audio information from the textual information comprises:

7. A video generation system, the system comprising:

8. A video generation apparatus, characterized in that the apparatus comprises:

9. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method of any of claims 1-6.

10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1-6.