CN116074576A - Video generation method, device, electronic equipment and storage medium - Google Patents

Video generation method, device, electronic equipment and storage medium

Info

Publication number
CN116074576A
CN116074576A (application CN202211534770.5A)
Authority
CN
China
Prior art keywords
target
video
data
sequence data
moment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211534770.5A
Other languages
Chinese (zh)
Inventor
董浩
孙昊
胡晓文
范茂伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211534770.5A
Publication of CN116074576A
Legal status: Pending

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44012Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving rendering scenes according to scene graphs, e.g. MPEG-4 scene graphs
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/20Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The disclosure provides a video generation method, relates to the technical field of artificial intelligence, in particular to the technical fields of augmented reality, virtual reality, computer vision, deep learning and the like, and can be applied to scenarios such as the metaverse and virtual digital humans. The specific implementation scheme is as follows: rendering a target avatar by using a target computing unit according to a target action driving coefficient among a plurality of action driving coefficients to obtain a target image; in response to obtaining the target image, encoding the target image with the target computing unit to obtain target video sequence data; obtaining current video data according to the previous video sequence data of the target time and the target video sequence data, wherein the previous video sequence data of the target time corresponds to a period preceding the target time; and in response to determining that the target time satisfies a preset condition, obtaining a target video according to the current video data. The disclosure also provides a video generating apparatus, an electronic device, and a storage medium.

Description

Video generation method, device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of augmented reality, virtual reality, computer vision, deep learning and the like, and can be applied to scenarios such as the metaverse and virtual digital humans. More particularly, the present disclosure provides a video generation method, apparatus, electronic device, and storage medium.
Background
With the development of artificial intelligence technology, application scenarios for avatars are increasing. An avatar may be rendered to generate a video, in which the avatar may perform a plurality of actions.
Disclosure of Invention
The present disclosure provides a video generation method, apparatus, device, and storage medium.
According to an aspect of the present disclosure, there is provided a video generation method, the method including: rendering a target avatar by using a target computing unit according to a target action driving coefficient among a plurality of action driving coefficients to obtain a target image, wherein the plurality of action driving coefficients correspond to a plurality of times respectively, and the target action driving coefficient corresponds to a target time among the plurality of times; in response to obtaining the target image, encoding the target image with the target computing unit to obtain target video sequence data; obtaining current video data according to the previous video sequence data of the target time and the target video sequence data, wherein the previous video sequence data of the target time corresponds to a preceding period of the target time, and the preceding period of the target time includes at least one time before the target time; and in response to determining that the target time satisfies a preset condition, obtaining a target video according to the current video data.
According to another aspect of the present disclosure, there is provided a video generating apparatus including: a rendering module for rendering a target avatar by using a target computing unit according to a target action driving coefficient among a plurality of action driving coefficients to obtain a target image, wherein the plurality of action driving coefficients correspond to a plurality of times respectively, and the target action driving coefficient corresponds to a target time among the plurality of times; an encoding module for, in response to obtaining the target image, encoding the target image with the target computing unit to obtain target video sequence data; a first obtaining module for obtaining current video data according to the previous video sequence data of the target time and the target video sequence data, wherein the previous video sequence data of the target time corresponds to a preceding period of the target time, and the preceding period of the target time includes at least one time before the target time; and a second obtaining module for obtaining a target video according to the current video data in response to determining that the target time satisfies a preset condition.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method provided in accordance with the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method provided according to the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method provided according to the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of an exemplary system architecture to which video generation methods and apparatus may be applied, according to one embodiment of the present disclosure;
FIG. 2 is a flow chart of a video generation method according to one embodiment of the present disclosure;
FIG. 3 is a flow chart of a video generation method according to another embodiment of the present disclosure;
FIG. 4 is a flow chart of a video generation method according to another embodiment of the present disclosure;
FIG. 5 is a schematic diagram of an interactive interface according to one embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a video frame according to one embodiment of the present disclosure;
fig. 7 is a block diagram of a video generating apparatus according to one embodiment of the present disclosure; and
fig. 8 is a block diagram of an electronic device to which a video generation method may be applied according to one embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Text may be broadcast by an avatar to generate a related video. For example, the broadcast text may be displayed on an interactive interface. For another example, the broadcast text may be processed using a Voice-to-Animation (VTA) algorithm to obtain a processing result. The processing result may include a timestamp for each character in the broadcast text and at least one action driving coefficient corresponding to each character. The action driving coefficient may be a Blend Shape (BS) coefficient. Action driving coefficients may be applied to a plurality of blend shape bases of the avatar's face so that the shapes of the blend shape bases change and the avatar performs the corresponding facial expression.
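For illustration only, the following sketch shows the kind of per-character processing result described above: a timestamp plus one or more blend shape frames per character. The class and function names and the 52-value blend shape frame are assumptions made for the example, not the output format of any particular VTA implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CharacterTiming:
    char: str                     # one character of the broadcast text
    timestamp_ms: int             # when the character is spoken in the audio
    bs_frames: List[List[float]]  # one or more blend shape (BS) coefficient frames

def toy_vta(broadcast_text: str, frame_ms: int = 40) -> List[CharacterTiming]:
    """Toy stand-in for a VTA pass: assigns each character one timestamp
    and a single zeroed 52-value blend shape frame."""
    return [
        CharacterTiming(char=ch, timestamp_ms=i * frame_ms, bs_frames=[[0.0] * 52])
        for i, ch in enumerate(broadcast_text)
    ]
```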
In some embodiments, according to the voice animation synthesis technique, a central processing unit (CPU) may be used to process the broadcast text to obtain a processing result. According to the action driving coefficients in the processing result, a rendering engine deployed on a graphics processing unit (GPU) renders the avatar, yielding a plurality of images. These images may be stored in memory or on a hard disk, and the CPU then reads them back to perform video encoding. However, the images occupy a large amount of storage space, and the time the central processing unit needs to read them is long, so video generation efficiency is low. For example, for one segment of broadcast text, thousands or tens of thousands of images may be rendered, occupying tens of gigabytes (GB) of storage.
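A quick back-of-envelope calculation supports the storage figure quoted above: ten thousand uncompressed 1080p RGB frames already amount to tens of gigabytes.

```python
frames = 10_000
bytes_per_frame = 1920 * 1080 * 3          # 24-bit RGB, about 6.2 MB per 1080p frame
total_gb = frames * bytes_per_frame / 1e9
print(f"{total_gb:.0f} GB")                # -> 62 GB
```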
Fig. 1 is a schematic diagram of an exemplary system architecture to which video generation methods and apparatus may be applied, according to one embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios.
As shown in fig. 1, a system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links, and the like.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for websites browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that the video generating method provided by the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the video generating apparatus provided by the embodiments of the present disclosure may be generally provided in the server 105. The video generation method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the video generating apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
Fig. 2 is a flowchart of a video generation method according to one embodiment of the present disclosure.
As shown in fig. 2, the method 200 may include operations S210 to S240.
In operation S210, a target avatar is rendered using a target computing unit according to a target action driving coefficient among a plurality of action driving coefficients, to obtain a target image.
In the embodiment of the present disclosure, the plurality of action driving coefficients correspond to a plurality of times respectively, and the target action driving coefficient corresponds to a target time among the plurality of times. For example, the plurality of action driving coefficients may be obtained by processing target text using a preset algorithm. The target text may be the broadcast text described above.
In the disclosed embodiments, the target text may be user-entered. For example, the target text Text1 may be "Hello, welcome to experience the voice animation synthesis technology".
In the embodiment of the present disclosure, the target action driving coefficient may be any one of the plurality of action driving coefficients. For example, the target action driving coefficient may correspond to one character in the target text. It is understood that one character in the target text may correspond to at least one action driving coefficient.
In the disclosed embodiments, the target computing unit may be a variety of general and/or special purpose processing components having processing and computing capabilities. For example, some examples of computing units may include central processing units, graphics processing units, various specialized artificial intelligence (Artificial Intelligence, AI) computing chips, various computing units running machine learning model algorithms, digital signal processors (Digital Signal Processor, DSP), and any suitable processors, controllers, microcontrollers, and the like.
In the embodiments of the present disclosure, the target avatar may be any of various avatars, such as a virtual human, a virtual animal, and the like.
In response to obtaining the target image, the target image is encoded with a target computing unit to obtain target video sequence data in operation S220.
In embodiments of the present disclosure, both the rendering and encoding operations may be performed with the target computing unit. For example, the rendering operations described above may be performed using a graphics processing unit, and the encoding operation may then also be performed using the same graphics processing unit.
In the disclosed embodiments, the encoding operation may be a video encoding operation. For example, the target image may be encoded based on various video encoding standards. In one example, the video encoding standard may be the MPEG-4 (Moving Picture Experts Group 4, mp4) encoding standard.
In operation S230, current video data is obtained from the previous video sequence data of the target time and the target video sequence data.
In an embodiment of the present disclosure, the preceding video sequence data of the target time corresponds to a preceding period of the target time, the preceding period of the target time including at least one time preceding the target time. For example, the preceding period may include all times before the target time.
In the embodiment of the disclosure, the target video sequence data may be fused with the previous video sequence data of the target time to obtain the current video data.
In operation S240, in response to determining that the target time satisfies the preset condition, a target video is obtained from the current video data.
In the embodiment of the present disclosure, the preset condition may be any of various conditions. For example, the preset condition may include: the target time is a preset time among the plurality of times. The preset time may be any one of the plurality of times (e.g., the second-to-last time or the last time).
By using the target computing unit to perform both the rendering and video encoding operations according to the embodiments of the present disclosure, the time for transferring data between the computing unit (e.g., a graphics processing unit) and a storage unit can be saved, as can the time another computing unit (e.g., a central processing unit) would spend reading the data from the storage unit.
In addition, according to the embodiments of the present disclosure, performing the rendering and video encoding operations with the target computing unit saves the hard disk space that would otherwise be required to store the plurality of images corresponding to the plurality of action driving coefficients. Moreover, after the target image is obtained, the target computing unit can video-encode it, and once encoded the target image can be deleted from the cache, effectively saving the cache space it occupies.
It will be appreciated that the method flow of the present disclosure has been described above. The plurality of action driving coefficients of the present disclosure will be described below.
In some embodiments, the plurality of action driving coefficients may be K action driving coefficients. The K action driving coefficients correspond to K times respectively, the target action driving coefficient is the i-th action driving coefficient, the target time is the i-th time, i is an integer greater than 1 and less than or equal to K, and K is an integer greater than 1.
In an embodiment of the present disclosure, the plurality of action driving coefficients are obtained by processing target data using a preset algorithm, the target data including at least one of target text and target audio. For example, the target data may be target text, and at least one of the plurality of action driving coefficients may correspond to one character of the target text.
In an embodiment of the present disclosure, the preset algorithm may include a voice animation synthesis algorithm. For example, after processing with the preset algorithm, a timestamp may be obtained for each character in the target text. From the timestamp of each character, the at least one time corresponding to the at least one action driving coefficient of that character may be determined.
In the embodiment of the present disclosure, the action driving coefficient may be the blend shape coefficient described above. Action driving coefficients may be applied to a plurality of blend shape bases of the avatar's face so that the shapes of the blend shape bases change and the avatar performs the corresponding facial expression. According to the embodiments of the present disclosure, driving the target avatar to perform facial expressions with the action driving coefficients allows the target avatar to show the corresponding mouth shape while broadcasting the content, which improves the realism of the target avatar and the user experience.
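As a hedged sketch of the underlying arithmetic (not the patent's rendering engine), a blend-shape-driven face can be expressed as the neutral mesh plus a weighted sum of per-basis vertex offsets, with the action driving coefficients as the weights:

```python
import numpy as np

def apply_blend_shapes(neutral: np.ndarray,   # (V, 3) neutral face mesh
                       deltas: np.ndarray,    # (B, V, 3) vertex offsets of each blend shape basis
                       weights: np.ndarray    # (B,) action driving (BS) coefficients
                       ) -> np.ndarray:
    # deformed = neutral + sum over bases b of weights[b] * deltas[b]
    return neutral + np.tensordot(weights, deltas, axes=1)
```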
It will be appreciated that the motion driving coefficients of the present disclosure are described above, and the preset conditions will be described below in connection with the related embodiments.
In some embodiments, the preset condition may include: the target time is the last time among the plurality of times. For example, in the case where i = K, it may be determined that the i-th time, as the target time, satisfies the preset condition. For another example, in the case where i < K, it may be determined that the i-th time, as the target time, does not satisfy the preset condition. According to the embodiments of the present disclosure, the plurality of action driving coefficients can be rendered into images to generate a video corresponding to the target text or the target audio, so that an avatar video can be generated quickly and conveniently, improving user experience.
It will be appreciated that the motion driving coefficients and preset conditions of the present disclosure are described above, and the method of the present disclosure will be further described below with reference to the related embodiments.
Fig. 3 is a flowchart of a video generation method according to another embodiment of the present disclosure.
As shown in fig. 3, the method 300 may include operations S310 to S330 and operations S341 to S343. Further, in the embodiment of the present disclosure, before operation S310 shown in fig. 3, rendering with the target computing unit may include: rendering by using the target computing unit according to the 1st action driving coefficient to obtain the 1st image, and obtaining the previous video sequence data of the i-th time according to the 1st image.
For example, after the K action driving coefficients are received, the 1st action driving coefficient may be taken as the target action driving coefficient. The graphics processing unit may perform the 1st rendering of the target avatar to obtain the 1st image. Next, the graphics processing unit may also video-encode the 1st image to obtain the 1st video sequence data as the previous video sequence data of the i-th time. It is understood that i is an integer greater than 1 and less than or equal to K. Next, operation S310 may be performed.
In operation S310, the target avatar is rendered using the target computing unit according to the i-th action driving coefficient among the K action driving coefficients, to obtain the i-th image.
In the embodiment of the present disclosure, in the process of rendering the target avatar, the target avatar may be rendered based on preset material. For example, the preset material may include preset background material. As another example, the graphics processing unit may be deployed with various rendering engines. Rendering according to the i-th action driving coefficient yields the i-th rendering result, which is then fused with the preset background material to obtain the i-th image.
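A minimal compositing sketch of this fusion, assuming the rendering result carries an alpha channel; the array shapes and dtypes are illustrative:

```python
import numpy as np

def compose_frame(render_rgba: np.ndarray,    # (H, W, 4) float rendering result with alpha
                  background_rgb: np.ndarray  # (H, W, 3) float preset background material
                  ) -> np.ndarray:
    alpha = render_rgba[..., 3:4]             # avatar opacity mask
    return render_rgba[..., :3] * alpha + background_rgb * (1.0 - alpha)
```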
In response to the i-th image being obtained, the i-th image is encoded using the target computing unit to obtain i-th video sequence data in operation S320.
In the embodiment of the present disclosure, after the i-th image is obtained, the graphics processing unit may video-encode the i-th image. For example, the graphics processing unit may be deployed with an FFmpeg (Fast Forward MPEG) video synthesis engine, so that video encoding can be performed using the graphics processing unit.
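As an illustrative sketch of this idea (not the patent's actual engine wiring), raw frames can be piped straight into an FFmpeg encoder so that no intermediate image files touch the disk. h264_nvenc is NVIDIA's GPU encoder; whether it is available depends on the local FFmpeg build and hardware.

```python
import subprocess
import numpy as np

W, H, FPS = 1280, 720, 25
encoder = subprocess.Popen(
    ["ffmpeg", "-y",
     "-f", "rawvideo", "-pix_fmt", "rgb24", "-s", f"{W}x{H}", "-r", str(FPS),
     "-i", "-",                    # read raw frames from stdin
     "-c:v", "h264_nvenc",         # encode on the GPU
     "out.mp4"],
    stdin=subprocess.PIPE,
)
for _ in range(FPS * 2):           # two seconds of placeholder frames
    frame = np.zeros((H, W, 3), dtype=np.uint8)
    encoder.stdin.write(frame.tobytes())
encoder.stdin.close()
encoder.wait()
```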
In operation S330, current video data is obtained from the previous video sequence data at the i-th time and the i-th video sequence data.
For example, the i-th video sequence data may be fused with the video sequence data preceding the i-th time to obtain the current video data.
Next, it may be determined whether the i-th time satisfies a preset condition. In an embodiment of the present disclosure, the preset condition may include: the i-th time is the last time of the K times. For example, it may be determined whether i is equal to K.
In operation S341, it is determined whether i is equal to K.
In the embodiment of the present disclosure, in response to determining that i is not equal to K, it may be determined that the target time does not satisfy the preset condition, operation S342 may be performed.
In the embodiment of the present disclosure, in response to determining that i is equal to K, it may be determined that the target time satisfies the preset condition, operation S343 may be performed.
In operation S342, the current video data is determined as the previous video sequence data of the (i+1)-th time.
In the embodiment of the present disclosure, in the case where i is not equal to K, the current video data obtained from the i-th video sequence data may be taken as the previous video sequence data of the (i+1)-th time. Next, operations S310 to S330 may be performed at least once according to the other action driving coefficients, until operations S310 to S330 have been performed according to the K-th action driving coefficient. According to the embodiments of the present disclosure, in the case where the target time does not satisfy the preset condition, images can be rendered and video-encoded continuously. In this process, each image is encoded into video sequence data as soon as it is generated and fused with the previous video sequence data to obtain the current video data. Therefore, in the process of rendering and video encoding, the storage space required for the images is greatly reduced, storage resources are saved, and video generation efficiency is improved.
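The loop formed by operations S310 to S343 can be summarized in the following hedged sketch, where render_frame and encode_frame stand in for the GPU rendering and encoding engines and byte concatenation stands in for the fusing of sequence data:

```python
def generate_video(coefficients, render_frame, encode_frame) -> bytes:
    video_data = b""                          # previous video sequence data so far
    for i, coeff in enumerate(coefficients, start=1):
        image = render_frame(coeff)           # S310: render the i-th image
        packet = encode_frame(image)          # S320: encode it to video sequence data
        video_data += packet                  # S330: fuse with the previous data
        del image                             # the frame can leave the cache at once
        if i == len(coefficients):            # S341: i == K, preset condition met
            return video_data                 # S343: the target video
    return video_data                         # only reached for an empty input
```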
In operation S343, a target video is obtained from the current video data.
For example, after the K-th video sequence data is fused with the previous video sequence data of the K-th time, the resulting current video data may be used as the target video.
According to the embodiments of the present disclosure, the graphics processing unit has strong image processing capability, and performing rendering and encoding with the graphics processing unit can improve video generation efficiency.
It will be appreciated that the target computing unit described above may be deployed at a server, and the target text for determining the action driving coefficients may be obtained from a client. The method of the present disclosure will be further described in connection with a client and a server.
Fig. 4 is a flowchart of a video generation method according to another embodiment of the present disclosure.
As shown in fig. 4, the method 400 may be performed jointly by a client and a server. The client may perform operations S4101 to S4105.
In operation S4101, a target text input by a user is acquired.
For example, a user may enter target text on the interactive interface of the client. The target text Text2 entered by the user may be "Hello, welcome to experience the voice animation synthesis technology. Next, let's watch a piece of news: a certain company announced that a certain product has officially opened directional internal testing".
In operation S4102, request data for generating a video is constructed and transmitted.
For example, the target text Text2 and related information may be used as the request data. The client may send the request data to the server.
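The payload below is purely illustrative of what such request data might look like; the field names are assumptions, not the service's actual schema.

```python
import json

request_data = {
    "text": "Hello, welcome to experience the voice animation synthesis technology. ...",
    "avatar_id": "default_host",   # hypothetical identifier for the target avatar
    "video_format": "mp4",         # mp4 / mov / webm, see the format discussion below
}
payload = json.dumps(request_data, ensure_ascii=False)
```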
Next, the server may receive the request data and perform operation S4201 according to the request data.
In operation S4201, a preset material is acquired.
For example, the preset background material may be obtained from a preset material library.
In operation S4202, target audio is obtained.
For example, the target text Text2 may be converted to target audio based on a Text-to-Speech (TTS) algorithm.
In operation S4203, the target text is processed using a preset algorithm, resulting in a processing result.
For example, the target text may be processed using a speech animation synthesis algorithm to obtain the processing result. For another example, the processing result may include a plurality of moments related to the target text and a plurality of action driving coefficients corresponding to the target text. It is understood that operation S4203 may be performed by a central processing unit of the server.
In operation S4200, the target avatar is rendered to obtain target images, and encoding is performed according to the target images to obtain the target video.
It is understood that operation S4200 may be performed by a graphics processing unit of a server. It is also understood that operation S4200 is the same as or similar to method 200 or method 300 described above, and is not repeated herein.
It is understood that the video format of the target video may be the MPEG-4 (mp4) format, a video format with a transparent channel (e.g., mov), or the webm format. Video in the MPEG-4 format may be based on H.264 encoding, video with a transparent channel may be based on qtrle encoding, and the webm format may be based on VP9 encoding.
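A hedged mapping from the three formats above to FFmpeg codec arguments might look as follows; the exact flag sets are illustrative, and the alpha-capable pixel formats are what preserve the transparent channel:

```python
FORMAT_ARGS = {
    "mp4":  ["-c:v", "libx264",    "-pix_fmt", "yuv420p"],   # H.264 in an mp4 container
    "mov":  ["-c:v", "qtrle",      "-pix_fmt", "argb"],      # QuickTime RLE, keeps alpha
    "webm": ["-c:v", "libvpx-vp9", "-pix_fmt", "yuva420p"],  # VP9, alpha-capable
}
```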
In operation S4204, a target video is uploaded and a target video link is generated.
For example, the target video may be uploaded to cloud storage. For another example, a link to the target video may also be generated. It is understood that the links may be uniform resource locators (Uniform Resource Locator, URLs).
In operation S4205, the video generation state is updated.
For example, the video generation status may be updated to "complete".
It is understood that after the client performs operation S4102, the client may also perform operation S4103.
In operation S4103, the success of the request is confirmed.
For example, the client may obtain relevant information from the server to confirm whether the request was successful.
In operation S4104, the generation state is polled.
For example, the video generation status may be queried at preset time intervals.
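A minimal polling sketch for this step; the endpoint path, the task identifier, and the "complete" status string are assumptions based on the flow described here.

```python
import time
import requests

def wait_for_video(task_id: str, interval_s: float = 2.0) -> str:
    while True:
        status = requests.get(f"https://example.com/tasks/{task_id}").json()
        if status.get("state") == "complete":  # updated by the server in S4205
            return status["video_url"]         # the target video link (URL)
        time.sleep(interval_s)
```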
It is understood that, after operation S4205, the next time the video generation state is queried, the video generation state that the client can query is "complete". Further, after operation S4205, the server may perform operation S4206 in response to receiving the query request of the client.
In operation S4206, a target video link is transmitted.
For example, the target video link generated in operation S4204 may be transmitted to the client.
After receiving the target video link, the client may perform operation S4105.
In operation S4105, a target video link is returned.
For example, the client may present the target video link on the relevant interactive interface to return the link to the user.
The method of the present disclosure will be described with reference to the relevant schematic drawings.
Fig. 5 is a schematic diagram of an interactive interface according to one embodiment of the present disclosure.
The user may enter the target text in the interactive interface. For example, the target text Text2 may be "Hello, welcome to experience the voice animation synthesis technology. Next, let's watch a piece of news: a certain company announced that a certain product has officially opened directional internal testing". Target text T510 is shown in fig. 5.
In the embodiment of the present disclosure, preset news material may also be fused with the rendering result. For example, the preset news material M520 may be fused with a rendering result to obtain an image. It will be appreciated that the preset news material M520 corresponds to the "news" mentioned in the target text T510.
It will be appreciated that the client may send the target text T510 of the interactive interface to the server such that the server receives the target text. After receiving the target text, the server may generate a video.
Fig. 6 is a schematic diagram of a video frame according to one embodiment of the present disclosure.
As shown in fig. 6, a preset news material M620 and a target avatar 630 are shown in the video frame F600.
It will be appreciated that the above description takes Chinese characters as example characters. However, the present disclosure is not limited thereto; in embodiments of the present disclosure, the characters may be characters of other languages. For example, the characters may be English characters. For another example, the target text Text3 may be "Hello, welcome to experience voice animation synthesis technology". The preset algorithm can also process the target text Text3 to obtain a plurality of action driving coefficients for the target text Text3. Next, video generation may be performed according to the method 200 described above.
Fig. 7 is a block diagram of a video generating apparatus according to one embodiment of the present disclosure.
As shown in fig. 7, the apparatus 700 may include a rendering module 710, an encoding module 720, a first obtaining module 730, and a second obtaining module 740.
A rendering module 710 for rendering the target avatar by using the target computing unit according to the target action driving coefficient among the plurality of action driving coefficients to obtain a target image. For example, the plurality of action driving coefficients correspond to a plurality of times respectively, and the target action driving coefficient corresponds to a target time among the plurality of times.
And an encoding module 720, configured to encode the target image by using the target computing unit in response to obtaining the target image, so as to obtain target video sequence data.
The first obtaining module 730 is configured to obtain current video data according to the previous video sequence data at the target time and the target video sequence data. For example, the preceding video sequence data of the target time corresponds to a preceding period of the target time, the preceding period of the target time including at least one time preceding the target time.
The second obtaining module 740 is configured to obtain the target video according to the current video data in response to determining that the target time satisfies the preset condition.
In some embodiments, the preset conditions include: the target time is the last time of the plurality of times.
In some embodiments, the plurality of action driving coefficients is K action driving coefficients, the K action driving coefficients correspond to K times respectively, the target action driving coefficient is the i-th action driving coefficient, the target time is the i-th time, i is an integer greater than 1 and less than K, and K is an integer greater than 1. The apparatus 700 further includes: a determining module for determining the current video data as the previous video sequence data of the (i+1)-th time in response to determining that the target time does not satisfy the preset condition.
In some embodiments, the rendering module includes: a rendering unit for rendering by using the target computing unit according to the 1st action driving coefficient to obtain the 1st image; and an obtaining unit for obtaining the previous video sequence data of the i-th time according to the 1st image.
In some embodiments, the plurality of action driving coefficients are obtained by processing target data using a preset algorithm, the target data including at least one of target text and target audio.
In some embodiments, the target data is target text, and a character of the target text corresponds to at least one action driving coefficient.
In some embodiments, the target computing unit comprises a graphics processing unit.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of users' personal information comply with the relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 8 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, for example, a video generation method. For example, in some embodiments, the video generation method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 800 via ROM 802 and/or communication unit 809. When a computer program is loaded into RAM 803 and executed by computing unit 801, one or more steps of the video generation method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the video generation method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) display or an LCD (liquid crystal display)) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (17)

1. A video generation method, comprising:
rendering a target avatar by using a target computing unit according to a target action driving coefficient among a plurality of action driving coefficients to obtain a target image, wherein the plurality of action driving coefficients correspond to a plurality of times respectively, and the target action driving coefficient corresponds to a target time among the plurality of times;
in response to obtaining the target image, encoding the target image by using the target computing unit to obtain target video sequence data;
obtaining current video data according to previous video sequence data of the target time and the target video sequence data, wherein the previous video sequence data of the target time corresponds to a preceding period of the target time, and the preceding period of the target time comprises at least one time before the target time; and
in response to determining that the target time satisfies a preset condition, obtaining a target video according to the current video data.
2. The method of claim 1, wherein the preset condition comprises: the target time is the last time among the plurality of times.
3. The method according to claim 2, wherein the plurality of action driving coefficients is K action driving coefficients, the K action driving coefficients correspond to K times respectively, the target action driving coefficient is an i-th action driving coefficient, the target time is an i-th time, i is an integer greater than 1 and less than K, and K is an integer greater than 1,
the method further comprising:
in response to determining that the target time does not satisfy the preset condition, determining the current video data as the previous video sequence data of the (i+1)-th time.
4. The method of claim 1, wherein the rendering the target avatar by using the target computing unit comprises:
rendering by using the target computing unit according to the 1st action driving coefficient to obtain a 1st image; and
obtaining the previous video sequence data of the i-th time according to the 1st image.
5. The method of claim 1, wherein the plurality of action driving coefficients are obtained by processing target data using a preset algorithm, the target data comprising at least one of target text and target audio.
6. The method of claim 5, wherein the target data is target text, and a character of the target text corresponds to at least one action driving coefficient.
7. The method of claim 1, wherein the target computing unit comprises a graphics processing unit.
8. A video generating apparatus comprising:
a rendering module for rendering a target avatar by using a target computing unit according to a target action driving coefficient among a plurality of action driving coefficients to obtain a target image, wherein the plurality of action driving coefficients correspond to a plurality of times respectively, and the target action driving coefficient corresponds to a target time among the plurality of times;
an encoding module for, in response to obtaining the target image, encoding the target image by using the target computing unit to obtain target video sequence data;
a first obtaining module for obtaining current video data according to previous video sequence data of the target time and the target video sequence data, wherein the previous video sequence data of the target time corresponds to a preceding period of the target time, and the preceding period of the target time comprises at least one time before the target time; and
a second obtaining module for obtaining a target video according to the current video data in response to determining that the target time satisfies a preset condition.
9. The apparatus of claim 8, wherein the preset condition comprises: the target time is the last time among the plurality of times.
10. The apparatus according to claim 9, wherein the plurality of action driving coefficients is K action driving coefficients, the K action driving coefficients correspond to K times respectively, the target action driving coefficient is an i-th action driving coefficient, the target time is an i-th time, i is an integer greater than 1 and less than K, and K is an integer greater than 1,
the apparatus further comprising:
a determining module for determining the current video data as the previous video sequence data of the (i+1)-th time in response to determining that the target time does not satisfy the preset condition.
11. The apparatus of claim 8, wherein the rendering module comprises:
a rendering unit for rendering by using the target computing unit according to the 1st action driving coefficient to obtain a 1st image; and
an obtaining unit for obtaining the previous video sequence data of the i-th time according to the 1st image.
12. The apparatus of claim 8, wherein the plurality of action driving coefficients are obtained by processing target data using a preset algorithm, the target data comprising at least one of target text and target audio.
13. The apparatus of claim 12, wherein the target data is target text, and a character of the target text corresponds to at least one action driving coefficient.
14. The apparatus of claim 8, wherein the target computing unit comprises a graphics processing unit.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 7.
16. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 7.
CN202211534770.5A 2022-11-30 2022-11-30 Video generation method, device, electronic equipment and storage medium Pending CN116074576A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211534770.5A CN116074576A (en) 2022-11-30 2022-11-30 Video generation method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211534770.5A CN116074576A (en) 2022-11-30 2022-11-30 Video generation method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116074576A true CN116074576A (en) 2023-05-05

Family

ID=86182917

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211534770.5A Pending CN116074576A (en) 2022-11-30 2022-11-30 Video generation method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116074576A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117115317A (en) * 2023-08-10 2023-11-24 北京百度网讯科技有限公司 Avatar driving and model training method, apparatus, device and storage medium

Similar Documents

Publication Publication Date Title
CN112233210B (en) Method, apparatus, device and computer storage medium for generating virtual character video
CN112102448B (en) Virtual object image display method, device, electronic equipment and storage medium
US20200322570A1 (en) Method and apparatus for aligning paragraph and video
CN114693934B (en) Training method of semantic segmentation model, video semantic segmentation method and device
CN113014937B (en) Video frame insertion method, device, equipment and storage medium
CN115357755B (en) Video generation method, video display method and device
CN116074576A (en) Video generation method, device, electronic equipment and storage medium
CN114125498B (en) Video data processing method, device, equipment and storage medium
CN114339069B (en) Video processing method, video processing device, electronic equipment and computer storage medium
CN114071190B (en) Cloud application video stream processing method, related device and computer program product
CN113163198B (en) Image compression method, decompression method, device, equipment and storage medium
CN113724398A (en) Augmented reality method, apparatus, device and storage medium
CN110852057A (en) Method and device for calculating text similarity
US11195248B2 (en) Method and apparatus for processing pixel data of a video frame
CN115665363A (en) Video conference method, device, equipment and storage medium
CN115942039B (en) Video generation method, device, electronic equipment and storage medium
CN113905248A (en) Live video data transmission method, device, equipment and storage medium
CN112651449A (en) Method and device for determining content characteristics of video, electronic equipment and storage medium
CN113627354B (en) A model training and video processing method, which comprises the following steps, apparatus, device, and storage medium
CN114979471B (en) Interface display method, device, electronic equipment and computer readable storage medium
CN116320535B (en) Method, device, electronic equipment and storage medium for generating video
CN115190295B (en) Video frame processing method, device, equipment and storage medium
CN114267358B (en) Audio processing method, device, equipment and storage medium
CN113849689A (en) Audio and video data processing method and device, electronic equipment and medium
CN117097955A (en) Video processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination