CN115942039B - Video generation method, device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN115942039B
CN115942039B
Authority
CN
China
Prior art keywords
target
text
initial
time information
action
Prior art date
Legal status
Active
Application number
CN202211534769.2A
Other languages
Chinese (zh)
Other versions
CN115942039A (en)
Inventor
李浩文
刘朋
董浩
谢帅
孙昊
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211534769.2A
Publication of CN115942039A
Application granted
Publication of CN115942039B
Legal status: Active
Anticipated expiration

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The disclosure provides a video generation method, and relates to the technical field of artificial intelligence, in particular to the technical fields of augmented reality, virtual reality, computer vision, deep learning and the like, and can be applied to scenarios such as the metaverse and virtual digital humans. The specific implementation scheme is as follows: in response to receiving a target text, determining at least one piece of first target time information corresponding to at least one target action tag text according to a plurality of pieces of initial time information related to the target text, wherein the target text is obtained by processing an initial text by using the at least one target action tag text, and the target action tag text corresponds to a preset action; rendering a target avatar according to the at least one piece of first target time information to obtain at least one first video clip, wherein the first video clip corresponds to the preset action; and generating a target video according to the at least one first video clip. The disclosure also provides a video generating apparatus, an electronic device, and a storage medium.

Description

Video generation method, device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of augmented reality, virtual reality, computer vision, deep learning and the like, and can be applied to scenarios such as the metaverse and virtual digital humans. More particularly, the present disclosure provides a video generation method, apparatus, electronic device, and storage medium.
Background
With the development of artificial intelligence technology, application scenarios of avatars are increasing. An avatar may be rendered to generate a video in which the avatar performs a plurality of actions.
Disclosure of Invention
The present disclosure provides a video generation method, apparatus, device, and storage medium.
According to an aspect of the present disclosure, there is provided a video generation method, the method including: in response to receiving a target text, determining at least one piece of first target time information corresponding to at least one target action tag text according to a plurality of pieces of initial time information related to the target text, wherein the target text is obtained by processing an initial text by using the at least one target action tag text, the target action tag text corresponds to a preset action, and the initial time information corresponds to characters in the target text; rendering a target avatar according to the at least one piece of first target time information to obtain at least one first video clip, wherein the first video clip corresponds to the preset action; and generating a target video according to the at least one first video clip.
According to another aspect of the present disclosure, there is provided a video generating apparatus, including: a determining module, configured to determine, in response to receiving a target text, at least one piece of first target time information respectively corresponding to at least one target action tag text according to a plurality of pieces of initial time information related to the target text, wherein the target text is obtained by processing an initial text by using the at least one target action tag text, the target action tag text corresponds to a preset action, and the initial time information corresponds to characters in the target text; a rendering module, configured to render a target avatar according to the at least one piece of first target time information to obtain at least one first video clip, wherein the first video clip corresponds to the preset action; and a generating module, configured to generate a target video according to the at least one first video clip.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method provided in accordance with the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method provided according to the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method provided according to the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of an exemplary system architecture to which video generation methods and apparatus may be applied, according to one embodiment of the present disclosure;
FIG. 2 is a flow chart of a video generation method according to one embodiment of the present disclosure;
FIG. 3 is a flow chart of a video generation method according to another embodiment of the present disclosure;
FIG. 4 is a schematic diagram of an interactive interface according to one embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a third video frame according to one embodiment of the present disclosure;
fig. 6 is a block diagram of a video generating apparatus according to one embodiment of the present disclosure; and
fig. 7 is a block diagram of an electronic device to which a video generation method may be applied according to one embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Text may be broadcast by an avatar to generate a related video. For example, the broadcast text may be displayed on an interactive interface. A user can insert a preset action tag into the text displayed on the interactive interface so as to specify the execution time of the preset action corresponding to the preset action tag. The preset action tag may come from a preset action library, and each preset action tag may correspond to one preset action. However, when a large amount of text is broadcast or videos are generated in batches, the user needs to frequently insert preset action tags on the interactive interface, which is costly and results in a poor user experience.
Fig. 1 is a schematic diagram of an exemplary system architecture to which video generation methods and apparatus may be applied, according to one embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios.
As shown in fig. 1, a system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links, and the like.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for websites browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that the video generating method provided by the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the video generating apparatus provided by the embodiments of the present disclosure may be generally provided in the server 105. The video generation method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the video generating apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
Fig. 2 is a flowchart of a video generation method according to one embodiment of the present disclosure.
As shown in fig. 2, the method 200 may include operations S210 to S240.
In response to receiving the target text, at least one first target time information corresponding to at least one target action tag text, respectively, is determined according to a plurality of initial time information related to the target text in operation S210.
In an embodiment of the present disclosure, the target text may be obtained by processing the initial text using at least one target action tag text. For example, at least one target action tag text may correspond to the initial text.
In the embodiments of the present disclosure, the initial text may be entered by a user. For example, the initial text Text1 may be "Hello, welcome to experience the voice animation synthesis technology".
In the embodiments of the present disclosure, the target action tag text may correspond to a preset action. For example, the target action tag text may be "waving". The preset action corresponding to this target action tag text may be the avatar waving both hands.
In the embodiments of the present disclosure, the initial text may be processed using the target action tag text in various ways. For example, one of the initial sub-texts may be replaced with the target action tag text. In one example, the initial sub-text "Hello" in the initial text may be replaced with the target action tag text "waving". For another example, the target action tag text may be added to the initial text. In one example, the target action tag text "waving" may be added at the text position immediately after the initial sub-text "Hello".
In the embodiments of the present disclosure, the plurality of pieces of initial time information of the text may be obtained in various ways.
In the embodiments of the present disclosure, the initial time information corresponds to characters in the target text. For example, the initial time information may be implemented as timestamps, and each timestamp corresponds to a character in the target text. For another example, the target text may be converted into a first audio. The first audio may include a plurality of audio clips, and each audio clip corresponds to a character of the target text. It will be appreciated that each audio clip may correspond to a timestamp. Thus, the timestamp of each character of the target text can be determined. It will also be appreciated that other means of determining the timestamp of each character in the target text may be used, and the present disclosure is not limited in this regard.
In the embodiments of the present disclosure, after the initial text is processed using the target action tag text, the target action tag text may be included in the target text. For example, as described above, a timestamp of each character in the target text may be determined. On this basis, a timestamp of each character in the target action tag text may be determined. In one example, the timestamps of the two characters corresponding to the target action tag text "waving" may be used as the first target time information of the target action tag text.
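As a minimal illustrative sketch of this step in Python (helper and field names are assumptions, not taken from the disclosure), the per-character timestamps could be derived from the durations of the audio clips as follows:

from dataclasses import dataclass

@dataclass
class CharTiming:
    char: str
    start: float  # second at which this character's audio clip begins
    end: float    # second at which this character's audio clip ends

def character_timestamps(target_chars, clip_durations):
    """Assign each character of the target text the time span of its audio clip,
    assuming the first audio has already been split into one clip per character."""
    timings, t = [], 0.0
    for ch, dur in zip(target_chars, clip_durations):
        timings.append(CharTiming(ch, t, t + dur))
        t += dur
    return timings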
In operation S220, the target avatar is rendered according to the at least one first target time information, resulting in at least one first video clip.
In the embodiments of the present disclosure, the target avatar may be any of various avatars. For example, the target avatar may be a virtual human, a virtual animal, and the like.
In the embodiment of the disclosure, the first video clip corresponds to a preset action.
In the embodiments of the present disclosure, a plurality of first video frames may be obtained by rendering the target avatar, and the first video clip may be obtained from the plurality of first video frames. For example, the first video clip corresponding to the preset action "waving" may include a plurality of first video frames. In different first video frames, the hands of the target avatar are located at different positions. Thus, the first video clip may show the process of the target avatar performing the preset action "waving".
In operation S230, a target video is generated from at least one first video clip.
For example, at least one first video clip may be stitched into one video as the target video.
According to the embodiments of the present disclosure, the avatar is rendered according to the text processed with the action tag text, so that a video can be generated efficiently based on the text and the operation efficiency of the user is improved. In particular, when the amount of text is very large, the labor cost and the time cost of setting appropriate actions for the avatar can be reduced, the difficulty of generating avatar videos can be reduced, and videos including custom avatar actions can be generated in batches.
It will be appreciated that the method flow of the present disclosure is described above and initial text of the present disclosure will be described below.
In some embodiments, the initial text may include M initial sub-text. The M initial sub-texts may be extracted from the initial texts.
In the embodiments of the present disclosure, M may be an integer greater than or equal to 1. For example, the initial text Text2 may be "Hello, welcome to experience the voice animation synthesis technology. Next, let us look at a piece of news: a certain company announced that a certain product has officially opened its invitation-only beta test." The initial text may be processed with a semantic model to extract 2 initial sub-texts. The 2 initial sub-texts may include the initial sub-text "Hello" and the initial sub-text "news". It will be appreciated that, for this initial text, M may be 2. It is understood that the semantic model may be any of various trained deep learning models. In one example, the backbone network of the semantic model may be a convolutional neural network. According to the embodiments of the present disclosure, the sub-texts in the initial text can be extracted automatically by using the semantic model, which can greatly reduce the labor cost and improve the video generation efficiency.
In the embodiments of the present disclosure, the at least one target action tag text is N target action tag texts, the N target action tag texts are respectively matched with N initial sub-texts, and N is an integer greater than or equal to 1 and less than or equal to M. For example, the M initial sub-texts may be matched with a plurality of preset action tag texts of a preset action library. In the case that an initial sub-text matches a preset action tag text, the preset action tag text may be used as a target action tag text. For another example, the matching may be performed according to the semantic similarity between the initial sub-text and the preset action tag text. In one example, the semantic similarity between the initial sub-text "Hello" and the preset action tag text "waving" is greater than a preset action similarity threshold, so it may be determined that the two match, and the preset action tag text "waving" can be used as the target action tag text. According to the embodiments of the present disclosure, by matching the extracted initial sub-texts with the texts in the preset action library, the action tag text matching an initial sub-text can be determined quickly and automatically, so that action tags can be added to the text efficiently, the labor cost of subsequent processing can be reduced, and the initial text can be processed efficiently and automatically.
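A minimal Python sketch of this matching step follows, assuming an embedding function embed and a preset similarity threshold; all names and the threshold value are illustrative only, and the embodiments are not limited to this implementation:

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_action_tags(initial_sub_texts, preset_action_tags, embed, threshold=0.8):
    """For each extracted initial sub-text, select the preset action tag text whose
    semantic similarity exceeds the threshold (if any) as its target action tag text."""
    matches = {}
    for sub_text in initial_sub_texts:
        best_tag, best_score = None, threshold
        for tag_text in preset_action_tags:
            score = cosine_similarity(embed(sub_text), embed(tag_text))
            if score > best_score:
                best_tag, best_score = tag_text, score
        if best_tag is not None:
            matches[sub_text] = best_tag
    return matches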
It will be appreciated that while the initial text is described above, some implementations of processing initial text using target action tag text will be described below in connection with related embodiments.
In some embodiments, the target text may be obtained by processing the initial text as follows: determining N pieces of first target position information according to the N initial sub-texts respectively matched with the N target action tag texts; and adding the N target action tag texts to the initial text respectively according to the N pieces of first target position information to obtain the target text. For example, for the initial text Text2 described above, the initial sub-text "Hello" matches the target action tag text "waving". The initial sub-text "Hello" corresponds to the 1st character and the 2nd character of the initial text Text2, and the first target position information determined according to this initial sub-text may correspond to the 3rd character position of the initial text. For another example, the target action tag text "waving" may be added to the initial text such that its two characters become the 3rd character and the 4th character, respectively. It will be appreciated that, after the target action tag text is added, the character that was originally the 3rd character of the initial text Text2 becomes the 5th character. After the target action tag text is added to the initial text Text2, the target text TextT2 can be obtained. According to the embodiments of the present disclosure, after the action tag text is inserted after the initial sub-text, the timestamp of the action tag text is determined, and thus the execution time of the preset action corresponding to the action tag text is determined, which can further improve the generation efficiency of video clips and, in turn, the generation efficiency of the video.
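The insertion of tag texts at the first target positions could look like the following minimal Python sketch, in which each tag text is inserted directly after the initial sub-text it matches; function names are assumptions:

def add_tag_texts(initial_text, matches):
    """Add each target tag text immediately after the initial sub-text it matches,
    producing the target text. Insertions are applied from right to left so that an
    earlier insertion does not shift the positions computed for later ones."""
    positions = []
    for sub_text, tag_text in matches.items():
        index = initial_text.find(sub_text)
        if index >= 0:
            positions.append((index + len(sub_text), tag_text))
    target_text = initial_text
    for position, tag_text in sorted(positions, reverse=True):
        target_text = target_text[:position] + tag_text + target_text[position:]
    return target_text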
It will be appreciated that while the initial text is described above, some embodiments of determining the first target time information will be described below.
In some embodiments, in some implementations of operation S210 described above, determining the at least one piece of first target time information respectively corresponding to the at least one target action tag text according to the plurality of pieces of initial time information related to the target text may include: processing the target text by using a preset algorithm to obtain a processing result; determining at least one piece of initial time information corresponding to the target action tag text according to at least one character corresponding to the target action tag text; and determining the first target time information corresponding to the target action tag text according to the at least one piece of initial time information corresponding to the target action tag text.
For example, the preset algorithm may include a Voice-to-Animation (VTA) algorithm. For example, the processing result may include a plurality of pieces of initial time information. For another example, after processing by the preset algorithm, the timestamp of each character in the target text may be obtained as a piece of initial time information. For another example, the target text TextT2 may include the target action tag text "waving", and the processing result may include a timestamp for each of the two characters of this tag text. The timestamp of the last character may be taken as the first target time information. According to the embodiments of the present disclosure, the time information corresponding to the action tag text can be determined, and thus the time at which the avatar performs the preset action in the video can be determined, so that the avatar is rendered at the corresponding time to obtain the corresponding video clip.
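Assuming the processing result is a list of per-character timestamps aligned with the target text, the first target time information could be selected as in the following illustrative Python sketch:

def first_target_time_info(target_text, char_timings, tag_text):
    """Look up the characters of the target action tag text inside the target text,
    collect their timestamps, and use the last one as the first target time information."""
    start = target_text.find(tag_text)
    if start < 0:
        return None
    tag_timings = char_timings[start:start + len(tag_text)]
    return tag_timings[-1]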
It will be appreciated that some embodiments of determining the first target time information are described above and some embodiments of obtaining the first video clip will be described below.
In some embodiments, the preset action may correspond to at least one first action driving coefficient. For example, the preset action "waving" may correspond to at least one first action driving coefficient. For another example, the preset action "bowing" may also correspond to at least one first action driving coefficient.
In some embodiments, the first target time information corresponds to a first target time. For example, the first target time information corresponding to the target action tag text "waving" may correspond to the 1st second. It will be appreciated that, for the initial sub-text "Hello", its last character may correspond to, for example, the 0.4th second.
In some embodiments, in some implementations of operation S220 described above, rendering the target avatar according to the at least one piece of first target time information may include: determining at least one first target action driving coefficient according to the preset action corresponding to the target action tag text; and rendering the target avatar according to the first target time and the at least one first target action driving coefficient to obtain the first video clip. For example, in the first video clip, the target avatar may start to perform the preset action corresponding to the target action tag text at the first target time. For example, the target action tag text "waving" corresponds to a preset action, and the at least one first action driving coefficient of the preset action may be used as the at least one first target action driving coefficient. The avatar may be rendered at least once according to the at least one first target action driving coefficient to obtain at least one first video frame, and the first video clip may be obtained from the at least one first video frame. It will be appreciated that, in the case where the first target time information corresponding to the target action tag text "waving" corresponds to the 1st second, the first video frame of the first video clip may be, for example, the 25th video frame of the target video.
According to the embodiments of the present disclosure, after the target text carrying the target action tag text is obtained, the action driving coefficient can be determined according to the preset action corresponding to the action tag text, and the avatar can then be rendered based on the action driving coefficient to obtain a plurality of video frames, so that a video can be generated from the text, which greatly improves the video generation efficiency. Since the target action tag text corresponds to the first target time information, the position of the first video clip in the target video can be determined, which helps to further improve the video generation efficiency.
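A minimal Python sketch of this rendering step follows; the 25 frames-per-second rate and the render_frame function are assumptions used only to illustrate how the first target time can map to a frame index:

FPS = 25  # assumed frame rate, so that the 1st second corresponds to the 25th frame

def render_action_clip(avatar, target_action_coeffs, first_target_second, render_frame):
    """Render the target avatar once per first target action driving coefficient,
    starting at the first target time, to obtain the first video clip."""
    start_frame = int(first_target_second * FPS)
    frames = []
    for offset, coeff in enumerate(target_action_coeffs):
        # each driving coefficient poses the avatar for one frame of the preset action
        frames.append(render_frame(avatar, coeff, frame_index=start_frame + offset))
    return {"start_frame": start_frame, "frames": frames}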
It will be appreciated that the method of the present disclosure has been described above with an example in which the processing result includes a plurality of pieces of initial time information. However, the present disclosure is not limited thereto, and the processing result may further include second action driving coefficients, which will be described in detail below.
In some embodiments, the processing result may further include a plurality of second action driving coefficients, and the second action driving coefficients may correspond to the characters of the target text. In the embodiments of the present disclosure, a second action driving coefficient may be a Blend Shape (BS) coefficient. The second action driving coefficients may be applied to a plurality of blend shape bases of the avatar's face so that the shapes of the blend shape bases change and the avatar performs a corresponding expression action. According to the embodiments of the present disclosure, the target avatar is driven by the second action driving coefficients to perform expression actions, so that the target avatar can display the corresponding mouth shape when broadcasting the content, which improves the realism of the target avatar and the user experience.
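The following Python sketch illustrates, under the usual blend shape formulation, how a vector of BS coefficients could deform a neutral face mesh; array shapes and names are assumptions:

import numpy as np

def apply_blend_shapes(neutral_vertices, blend_shape_bases, bs_coefficients):
    """Deform the avatar's neutral face mesh with a vector of blend shape (BS)
    coefficients: each basis stores per-vertex offsets, and its coefficient scales
    how strongly that basis (e.g. 'jaw open') is applied."""
    vertices = neutral_vertices.copy()                # shape (V, 3), neutral face
    for basis, weight in zip(blend_shape_bases, bs_coefficients):
        vertices = vertices + weight * basis          # basis shape (V, 3), per-vertex offsets
    return vertices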
Further embodiments of rendering the target avatar will be described below in conjunction with the second action driving coefficients.
In some embodiments, the initial text may include K characters, and the plurality of second action driving coefficients may include K second target action driving coefficients respectively corresponding to the K characters of the initial text, where K is an integer greater than or equal to 1. For example, for the initial text Text2, K may be 41. It will be appreciated that, unlike the initial text, the target text may also include the target action tag text, so the number of characters of the target text may be greater than that of the initial text. For example, the number of characters of the target text TextT2 is greater than that of the initial text Text2. The K second target action driving coefficients respectively corresponding to the K characters of the initial text Text2 can be determined from the plurality of second action driving coefficients of the target text.
In some embodiments, in other implementations of operation S220 described above, rendering the target avatar further includes: determining K pieces of second target time information respectively corresponding to the K characters of the initial text according to the plurality of pieces of initial time information; and rendering the target avatar according to the K second target action driving coefficients and the K pieces of second target time information to obtain K second video clips.
For example, after processing by the preset algorithm, the timestamp of each character in the target text may be obtained as a piece of initial time information. For another example, the target text TextT2 may include the first character of the initial sub-text "Hello", the processing result may include the timestamp of this character, and this timestamp may be taken as a piece of second target time information.
In the embodiments of the present disclosure, the second target time information corresponds to a second target time. For example, the first character of the initial sub-text "Hello" may correspond to the 0.2nd second. According to the embodiments of the present disclosure, the target avatar is rendered using the K pieces of second target time information respectively corresponding to the K characters of the initial text, so that the produced sound is consistent with the displayed mouth shape, which substantially improves the realism of the avatar. In addition, by rendering the target avatar separately using the K pieces of second target time information of the initial text and the time information corresponding to the action tag text, the mouth shape of the target avatar can be better coordinated with its body movements.
In the embodiments of the present disclosure, rendering the target avatar may include: rendering the target avatar according to the second target action driving coefficient corresponding to a character of the initial text to obtain a second video clip. For example, at the second target time, the plurality of blend shape bases of the face of the target avatar may be adjusted according to the second target action driving coefficient, so that the target avatar performs the expression action corresponding to that driving coefficient at the second target time.
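A minimal Python sketch of this per-character expression rendering follows, assuming the same hypothetical render_frame helper and frame rate as above:

def render_expression_clips(avatar, second_target_times, second_target_coeffs, render_frame, fps=25):
    """For each character of the initial text, render the target avatar at that character's
    second target time with its second target action driving coefficient, yielding one
    second video clip per character."""
    clips = []
    for t, coeff in zip(second_target_times, second_target_coeffs):
        frame_index = int(t * fps)
        clips.append({"frame_index": frame_index,
                      "frame": render_frame(avatar, coeff, frame_index=frame_index)})
    return clips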
It will be appreciated that the method of the present disclosure is described above in connection with action tag text. However, the present disclosure is not limited thereto, and the initial text may further include an initial sub-text matched with the preset material tag text, which will be described in detail below.
In some embodiments, the M initial sub-texts include I initial sub-texts that match the I target material tag texts.
In the embodiments of the present disclosure, M may be an integer greater than or equal to 1, and I may be an integer greater than or equal to 1 and less than or equal to M. For example, the initial text Text2 includes the initial sub-text "Hello" and the initial sub-text "news". For another example, the M initial sub-texts may be matched with a plurality of preset material tag texts of a preset material library. In the case that an initial sub-text matches a preset material tag text, the preset material tag text can be used as a target material tag text. For another example, the matching may be performed according to the semantic similarity between the initial sub-text and the preset material tag text. In one example, the semantic similarity between the initial sub-text "news" and the preset material tag text "news02" is greater than a preset material similarity threshold, so it may be determined that the two match, and the preset material tag text "news02" may be used as the target material tag text "news02".
It will be appreciated that the initial text of the present disclosure is further described above and the target text is further described below.
In some embodiments, the target text is obtained by processing the initial text as follows: determining I pieces of second target position information according to the I initial sub-texts respectively matched with the I target material tag texts; and adding the I target material tag texts to the initial text respectively according to the I pieces of second target position information to obtain the target text. For example, for the initial text Text2 described above, the initial sub-text "news" matches the target material tag text "news02". The second target position information determined from the initial sub-text "news" may indicate the text position immediately after this sub-text. For another example, the target material tag text "news02" may be added to the initial text Text2. It will be appreciated that the initial text Text2 may also be processed using the target action tag text "waving" described above. Thus, the target text TextT2' can be obtained. According to the embodiments of the present disclosure, the avatar is rendered according to the text processed with the material tag text, so that a video can be generated efficiently based on the material and the text, and the operation efficiency of the user is improved. In particular, when the amount of text is very large, the labor cost and the time cost of setting appropriate materials for the avatar can be reduced, the difficulty of generating avatar videos can be reduced, and videos containing custom materials can be generated in batches.
Further embodiments of determining the first target time information will be described below in connection with target material tag text.
In the embodiments of the present disclosure, determining the at least one piece of first target time information respectively corresponding to the at least one target action tag text according to the plurality of pieces of initial time information related to the target text further includes: determining I pieces of third target time information respectively corresponding to the I target material tag texts according to the plurality of pieces of initial time information. For example, after processing by the preset algorithm, the timestamp of each character in the target text TextT2' may be obtained as a piece of initial time information. For another example, the target text TextT2' may include the target material tag text "news02", and the processing result may include a timestamp for each character of this tag text. The timestamp of the last character "2" may be taken as the third target time information. According to the embodiments of the present disclosure, the time information corresponding to the material tag text can be determined, and thus the time at which the avatar displays the preset material in the video can be determined, so that the avatar is rendered at the corresponding time to obtain the corresponding video clip.
Further embodiments of rendering the target avatar will be described below in connection with target material tag text.
In some embodiments, in another implementation of operation S220 described above, rendering the target avatar according to the at least one piece of first target time information further includes: rendering the target avatar according to the I pieces of third target time information to obtain I third video clips.
In the embodiment of the disclosure, the third video clip corresponds to a preset material.
In the embodiment of the present disclosure, the third target time information corresponds to a third target time.
In the embodiments of the present disclosure, rendering the target avatar may further include: displaying the preset material corresponding to the target material tag text at the third target time to obtain the third video clip. For example, at the third target time, news material corresponding to the target material tag text "news02" may be used as one or more third video frames to obtain the third video clip. According to the embodiments of the present disclosure, by rendering the target avatar separately using the I pieces of third target time information of the material tag texts, the K pieces of second target time information of the initial text, and the time information corresponding to the action tag text, the mouth shape of the target avatar can be better coordinated with its body movements, and the material displayed by the avatar can be better coordinated with the mouth shape of the target avatar.
It is understood that the preset material may be various materials such as news material, image material, and the like.
It will be appreciated that various ways of rendering the target avatar are described in detail above, and that some embodiments of generating the target video will be described below.
In some embodiments, the processing result may also include target audio, and the target audio may correspond to the initial text. For example, the preset algorithm may also include a Text-to-Speech (TTS) algorithm, and the speech synthesis algorithm may convert the initial text Text2 into the target audio.
In some embodiments, in some implementations of operation S230 described above, generating the target video from the at least one first video clip may include: generating the target video according to the target audio, the at least one first video clip, the K second video clips, and the I third video clips. For example, the target audio, the N first video clips, the K second video clips, and the I third video clips may be input into a Fast Forward MPEG (FFmpeg) video composition engine to generate the target video. According to the embodiments of the present disclosure, the target video is generated from the audio and the various video clips, which can improve the overall realism of the video and the user experience.
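As an illustration, the clips and the target audio could be composed with the FFmpeg command-line tool roughly as in the following Python sketch; file names and encoder choices are assumptions, and the disclosure is not limited to this particular invocation:

import subprocess

def compose_target_video(clip_paths, audio_path, output_path="target.mp4"):
    """Concatenate the rendered video clips and mux in the target audio with FFmpeg.
    The concat demuxer reads a plain-text file listing the clips in playback order."""
    with open("clips.txt", "w") as f:
        for path in clip_paths:
            f.write(f"file '{path}'\n")
    subprocess.run([
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0", "-i", "clips.txt",  # video clips in order
        "-i", audio_path,                                  # target audio from TTS
        "-c:v", "libx264", "-c:a", "aac", "-shortest",
        output_path,
    ], check=True)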
It will be appreciated that the target text described above may be server-generated or client-generated. The method of the present disclosure will be further described in connection with a client and a server.
Fig. 3 is a flowchart of a video generation method according to another embodiment of the present disclosure.
As shown in fig. 3, the method 300 may be performed jointly by a client and a server. The client may perform operations S3101 to S3106.
In operation S3101, an initial text input by the user is acquired.
For example, a user may enter the initial text on the interactive interface of the client. The initial text input by the user may be the initial text Text2 described above, namely "Hello, welcome to experience the voice animation synthesis technology. Next, let us look at a piece of news: a certain company announced that a certain product has officially opened its invitation-only beta test."
In operation S3102, an initial sub-text is extracted and the initial text is processed using the target action tag text, resulting in a target text.
For example, the initial text Text2 may be processed using the semantic model to obtain the initial sub-text "Hello" and the initial sub-text "news". For another example, the initial sub-texts are matched with a plurality of preset action tag texts of the preset action library. In the case that the initial sub-text "Hello" matches the preset action tag text "waving", the preset action tag text "waving" may be used as the target action tag text. For another example, the initial sub-texts are matched with a plurality of preset material tag texts of the preset material library. In the case that the initial sub-text "news" matches the preset material tag text "news02", the preset material tag text "news02" may be used as the target material tag text.
For example, the target action tag text "hand-put" may be added to the text position after the initial sub-text "your good", or the target material tag text "news 02" may be added to the text position after the initial sub-text "news". Thus, the target text 2' can be obtained.
In operation S3103, request data for generating a video is constructed and transmitted.
For example, the target text TextT2' and related information may be used as the request data. The client may send the request data to the server.
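A minimal Python sketch of constructing and sending the request data follows; the endpoint, field names, and use of an HTTP client are assumptions, since the description above does not specify the request format:

import requests  # assumed HTTP client; the transport is not specified above

def build_request_data(target_text, avatar_id, output_format="mp4"):
    """Construct illustrative request data for video generation; field names are hypothetical."""
    return {
        "target_text": target_text,    # text already processed with action/material tag texts
        "avatar_id": avatar_id,        # which target avatar to render
        "output_format": output_format,
    }

def send_request(server_url, request_data):
    response = requests.post(f"{server_url}/video/generate", json=request_data, timeout=30)
    response.raise_for_status()
    return response.json()  # e.g. a task id used later when polling the generation state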
Next, the server may receive the request data and perform operation S3201 according to the request data.
In operation S3201, a preset material is acquired.
For example, news material corresponding to the target material tag text "news02" may be acquired from the preset material library.
In operation S3202, the target audio is obtained.
For example, the initial text Text2 may be converted into the target audio based on the speech synthesis algorithm.
In operation S3211, the target text is processed using a preset algorithm to obtain a processing result.
For example, the target text may be processed using the voice animation synthesis algorithm to obtain the processing result. For another example, the processing result may include a plurality of pieces of initial time information related to the target text and a plurality of second action driving coefficients respectively corresponding to a plurality of characters in the target text.
For another example, from the plurality of pieces of initial time information, the first target time information corresponding to the target action tag text "waving" may be determined, the K pieces of second target time information corresponding to the K characters in the initial text may be determined, and the third target time information corresponding to the target material tag text "news02" may be determined.
For another example, the K second target action driving coefficients corresponding to the K characters in the initial text may be determined from the plurality of second action driving coefficients.
In operation S3220, the target avatar is rendered.
For example, according to the first target time information and the preset action corresponding to the target action tag text "waving", the target avatar may be rendered to obtain the first video clip.
For example, according to the K pieces of second target time information and the K second target action driving coefficients, the target avatar may be rendered K times to obtain the K second video clips.
For example, according to the third target time information, the news material corresponding to the target material tag text "news02" may be displayed to obtain the third video clip.
In operation S3230, a target video is generated.
For example, the target audio, the first video clip, the second video clips, and the third video clip may be input into the video composition engine to obtain the target video. It will be appreciated that the video format of the target video may be the MPEG-4 (mp4) format, a format with a transparent channel (e.g., mov), or the webm format. The mp4 video may be based on H.264 encoding, the video with a transparent channel may be based on qtrle encoding, and the webm video may be based on VP9 encoding.
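The format-to-encoder correspondence described above could be expressed, for example, as the following mapping of standard FFmpeg encoder names; the mapping itself is illustrative:

# Illustrative mapping of the output formats mentioned above to FFmpeg video encoders;
# the encoder names are standard FFmpeg options, not taken from the description above.
CODEC_BY_FORMAT = {
    "mp4": ["-c:v", "libx264"],     # H.264-encoded MPEG-4
    "mov": ["-c:v", "qtrle"],       # QuickTime RLE, keeps the transparent (alpha) channel
    "webm": ["-c:v", "libvpx-vp9"], # VP9-encoded WebM
}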
In operation S3241, a target video is uploaded and a target video link is generated.
For example, the target video may be uploaded to cloud storage, and a link to the target video may be generated. It is understood that the link may be a Uniform Resource Locator (URL).
In operation S3242, the video generation state is updated.
For example, after operation S3241, the video generation state may be updated to "complete".
It is understood that after the client performs operation S3103, the client may also perform operation S3104.
In operation S3104, it is confirmed whether the request is successful.
For example, the client may obtain relevant information from the server to confirm whether the request was successful.
In operation S3105, the generation state is polled.
For example, the video generation status may be queried at preset time intervals.
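A minimal Python sketch of such polling follows; the endpoint, field names, and interval are assumptions:

import time
import requests  # assumed HTTP client

def poll_generation_state(server_url, task_id, interval_seconds=2.0, timeout_seconds=600.0):
    """Query the video generation state at a preset interval until it reports 'complete';
    the endpoint and field names are hypothetical."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        response = requests.get(f"{server_url}/video/state", params={"task_id": task_id}, timeout=10)
        response.raise_for_status()
        state = response.json()
        if state.get("status") == "complete":
            return state.get("video_url")  # the target video link uploaded to cloud storage
        time.sleep(interval_seconds)
    raise TimeoutError("video generation did not complete within the timeout")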
It will be appreciated that, after operation S3242, the next time the client queries the video generation state, the state it obtains is "complete". Further, after operation S3242, the server may perform operation S3243 in response to receiving a query request from the client.
In operation S3243, a target video link is transmitted.
For example, the target video link generated in operation S3241 may be transmitted to the client.
After receiving the target video link, the client may perform operation S3106.
In operation S3106, the target video link is returned.
For example, the client may present the target video link on the relevant interactive interface to return the link to the user.
The method of the present disclosure will be described with reference to the relevant schematic drawings.
FIG. 4 is a schematic diagram of an interactive interface, according to one embodiment of the present disclosure.
The user may enter the initial text in the interactive interface. After the initial text is received, it may be processed with the target action tag text. For example, the initial text Text2 may be "Hello, welcome to experience the voice animation synthesis technology. Next, let us look at a piece of news: a certain company announced that a certain product has officially opened its invitation-only beta test." A plurality of initial sub-texts of the initial text Text2 can be extracted with the semantic model: the initial sub-text "Hello" and the initial sub-text "news". Next, it may be determined that the initial sub-text "Hello" matches the target action tag text "waving" and that the initial sub-text "news" matches the target material tag text "news02". The target action tag text "waving" and the target material tag text "news02" may be added to the initial text respectively, resulting in the target text T410 shown in the interactive interface of fig. 4. It is understood that the target text T410 may be the target text TextT2' described above.
As shown in fig. 4, the target text T410 includes a target action tag text T420 and a target material tag text T430. It will be appreciated that the target action tag text T420 may be the target action tag text "waving" described above, and the target material tag text T430 may be the target material tag text "news02" described above.
It will be appreciated that the client may send the target text T410 of the interactive interface to the server such that the server receives the target text. After receiving the target text, the server may generate a video.
Fig. 5 is a schematic diagram of a third video frame according to one embodiment of the present disclosure.
As shown in fig. 5, the third video frame F500 may come from a third video clip, and the third video clip may correspond to the target material tag text "news02".
As shown in fig. 5, news material M530 is shown in the third video frame F500. The news material M530 may correspond to the target material tag text "news02". A target avatar 540 is also shown in the third video frame F500.
It will be appreciated that the above description takes Chinese characters as an example. However, the present disclosure is not limited thereto; in the embodiments of the present disclosure, the characters may be characters of other languages. For example, a character may be an English character. For another example, the initial text Text3 may be "Hello, welcome to experience voice animation synthesis technology". The initial text Text3 may include at least the initial sub-text "Hello". The initial sub-text "Hello" may be matched with the action tag text "Waving", and the preset action corresponding to the action tag text "Waving" may be the avatar waving both hands. The action tag text "Waving" may be added to the initial text Text3. In one example, the action tag text "Waving" may be added at the text position immediately after the initial sub-text "Hello". Thus, a target text can be obtained. Next, video generation may be performed according to the method 200 described above.
Fig. 6 is a block diagram of a video generating apparatus according to one embodiment of the present disclosure.
As shown in fig. 6, the apparatus 600 may include a determination module 610, a rendering module 620, and a generation module 630.
A determining module 610, configured to determine, in response to receiving the target text, at least one first target time information corresponding to at least one target action tag text, respectively, according to a plurality of initial time information related to the target text. For example, the target text is obtained by processing an initial text by using at least one target action tag text, the target action tag text corresponds to a preset action, and the initial time information corresponds to characters in the target text;
and a rendering module 620, configured to render the target avatar according to the at least one first target time information, to obtain at least one first video segment, where the first video segment corresponds to a preset action.
The generating module 630 is configured to generate a target video according to at least one first video segment.
In some embodiments, the initial text includes M initial sub-texts, M is an integer greater than or equal to 1, M initial sub-texts are extracted from the initial text, at least one target action tag text is N target action tag texts, the N target action tag texts are respectively matched with the N initial sub-texts, and N is an integer greater than or equal to 1 and less than or equal to M.
In some embodiments, the target text is obtained by performing the related operations through the following sub-modules: a first determining sub-module, configured to determine N pieces of first target position information according to the N initial sub-texts respectively matched with the N target action tag texts; and a first adding sub-module, configured to add the N target action tag texts to the initial text respectively according to the N pieces of first target position information to obtain the target text.
In some embodiments, the determining module includes: a processing sub-module, configured to process the target text by using a preset algorithm to obtain a processing result, where the processing result includes a plurality of pieces of initial time information; a third determining sub-module, configured to determine at least one piece of initial time information corresponding to the target action tag text according to at least one character corresponding to the target action tag text; and a fourth determining sub-module, configured to determine the first target time information corresponding to the target action tag text according to the at least one piece of initial time information corresponding to the target action tag text.
In some embodiments, the first target time information corresponds to a first target time, the preset action corresponds to at least one first action driving coefficient, and the rendering module includes: a fifth determining sub-module, configured to determine at least one first target action driving coefficient according to the preset action corresponding to the target action tag text; and a first rendering sub-module, configured to render the target avatar according to the first target time and the at least one first target action driving coefficient to obtain the first video clip.
In some embodiments, the processing results further include a plurality of second action driving coefficients, the second action driving coefficients corresponding to characters of the target text.
In some embodiments, the initial text includes K characters, K is an integer greater than or equal to 1, the plurality of second action driving coefficients include K second target action driving coefficients respectively corresponding to the K characters of the initial text, and the rendering module further includes: a sixth determining sub-module, configured to determine K pieces of second target time information respectively corresponding to the K characters of the initial text according to the plurality of pieces of initial time information; and a second rendering sub-module, configured to render the target avatar according to the K second target action driving coefficients and the K pieces of second target time information to obtain K second video clips.
In some embodiments, the second target time information corresponds to a second target time, and the second rendering sub-module includes: a first rendering unit, configured to render the target avatar according to the second target action driving coefficient and the second target time information corresponding to a character of the initial text to obtain a second video clip.
In some embodiments, the initial text includes M initial sub-texts, M is an integer greater than or equal to 1, and the M initial sub-texts are extracted from the initial text. The M initial sub-texts include I initial sub-texts matched with I target material tag texts, where I is an integer greater than or equal to 1 and less than or equal to M. The target text is obtained by performing the related operations through the following sub-modules: a seventh determining sub-module, configured to determine I pieces of second target position information according to the I initial sub-texts respectively matched with the I target material tag texts; and a second adding sub-module, configured to add the I target material tag texts to the initial text respectively according to the I pieces of second target position information to obtain the target text.
In some embodiments, the determining module further includes: an eighth determining sub-module, configured to determine I pieces of third target time information respectively corresponding to the I target material tag texts according to the plurality of pieces of initial time information.
In some embodiments, the rendering module further includes: a third rendering sub-module, configured to render the target avatar according to the I pieces of third target time information to obtain I third video clips, where the third video clips correspond to preset materials.
In some embodiments, the third target time information corresponds to a third target time, and the third rendering sub-module includes: a display unit, configured to display the preset material corresponding to the target material tag text at the third target time to obtain the third video clip.
In some embodiments, the processing result further includes target audio, the target audio corresponds to the initial text, and the generating module includes: a generating sub-module, configured to generate the target video according to the target audio, the at least one first video clip, the K second video clips, and the I third video clips.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of user personal information involved all comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 7 illustrates a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the apparatus 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in device 700 are connected to I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the respective methods and processes described above, for example, the video generation method. For example, in some embodiments, the video generation method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 700 via ROM 702 and/or communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the video generation method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the video generation method by any other suitable means (e.g., by means of firmware).
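As an assumed example only, the video generation method could be packaged as a small console program executed by the computing unit 701; generate_video below is a hypothetical stand-in for that method and does not reflect any actual entry point of the disclosure.

# Hypothetical entry point; generate_video is a placeholder, not the real method.
import sys

def generate_video(target_text: str) -> str:
    # Placeholder standing in for the video generation method described above;
    # a real implementation would return the path of the rendered target video.
    return "target_video.mp4"

if __name__ == "__main__":
    text = sys.argv[1] if len(sys.argv) > 1 else sys.stdin.read()
    print(generate_video(text))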
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, where the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) display or an LCD (liquid crystal display)) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions of the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (26)

1. A video generation method, comprising:
in response to receiving a target text, determining at least one first target time information corresponding to at least one target action tag text respectively according to a plurality of initial time information related to the target text, wherein the target text is obtained by processing the initial text by using at least one target action tag text, the target action tag text corresponds to a preset action, the initial time information corresponds to characters in the target text, and the preset action comprises a limb action;
rendering the target virtual image according to at least one piece of first target time information to obtain at least one first video segment, wherein the first video segment corresponds to the preset action, the first target time information corresponds to a first target moment, and the preset action corresponds to at least one first action driving coefficient; and
generating a target video according to the at least one first video segment;
wherein said rendering the target avatar according to at least one of said first target time information comprises:
determining at least one first target action driving coefficient according to a preset action corresponding to the target action label text; and
rendering the target virtual image according to the first target moment and the at least one first target action driving coefficient to obtain the first video segment.
2. The method of claim 1, wherein the initial text includes M initial sub-texts, M being an integer greater than or equal to 1, M of the initial sub-texts being extracted from the initial text,
at least one target action label text is N target action label texts, the N target action label texts are respectively matched with N initial sub-texts, and N is an integer which is greater than or equal to 1 and less than or equal to M.
3. The method of claim 2, wherein the target text is obtained by processing the initial text by:
determining N pieces of first target position information according to N pieces of initial sub-texts respectively matched with N pieces of target action label texts; and
respectively adding the N target action label texts to the initial text according to the N pieces of first target position information to obtain the target text.
4. The method of claim 1, wherein the determining at least one first target time information corresponding to at least one of the target action tag texts, respectively, from a plurality of initial time information related to the target texts comprises:
processing the target text by using a preset algorithm to obtain a processing result, wherein the processing result comprises a plurality of initial time information;
determining at least one piece of initial time information corresponding to the target action tag text according to at least one character corresponding to the target action tag text; and
determining the first target time information corresponding to the target action tag text according to the at least one piece of initial time information corresponding to the target action tag text.
5. The method of claim 4, wherein the processing result further comprises a plurality of second action driving coefficients, the second action driving coefficients corresponding to the characters of the target text.
6. The method of claim 5, wherein the initial text includes K characters, K being an integer greater than or equal to 1, the plurality of second action driving coefficients including K second target action driving coefficients corresponding to the K characters of the initial text, respectively,
the rendering of the target avatar further includes:
determining K pieces of second target time information corresponding to K characters of the initial text according to the initial time information; and
rendering the target virtual image according to the K second target action driving coefficients and the K second target time information to obtain K second video clips.
7. The method of claim 6, wherein the second target time information corresponds to a second target time,
the rendering of the target avatar includes:
rendering the target virtual image according to the second target action driving coefficient and the second target time information corresponding to the character of the initial text to obtain the second video segment.
8. The method of claim 6, wherein the initial text includes M initial sub-texts, M being an integer greater than or equal to 1, M of the initial sub-texts being extracted from the initial text,
the M initial sub-texts comprise I initial sub-texts matched with I target material tag texts, wherein I is an integer greater than or equal to 1 and less than or equal to M,
the target text is obtained by processing the initial text by:
determining I second target position information according to the I initial sub-texts respectively matched with the I target material tag texts; and
respectively adding the I target material tag texts to the initial text according to the I second target position information to obtain the target text.
9. The method of claim 8, wherein said determining at least one first target time information corresponding to at least one of the target action tag texts, respectively, from a plurality of initial time information related to the target text further comprises:
determining I pieces of third target time information respectively corresponding to the I target material tag texts according to the initial time information.
10. The method of claim 9, wherein the rendering the target avatar according to at least one of the first target time information further comprises:
rendering the target virtual image according to the I pieces of third target time information to obtain I third video clips, wherein the third video clips correspond to preset materials.
11. The method of claim 10, wherein the third target time information corresponds to a third target time,
the rendering of the target avatar includes:
displaying the preset material corresponding to the target material tag text at the third target moment to obtain the third video clip.
12. The method of claim 10, wherein the processing result further comprises a target audio, the target audio corresponding to the initial text,
the generating the target video according to at least one first video segment comprises:
generating the target video according to the target audio, the at least one first video segment, the K second video clips and the I third video clips.
13. A video generating apparatus comprising:
the determining module is used for determining at least one first target time information corresponding to at least one target action label text respectively according to a plurality of initial time information related to the target text in response to receiving the target text, wherein the target text is obtained by processing the initial text by utilizing at least one target action label text, the target action label text corresponds to a preset action, the initial time information corresponds to characters in the target text, and the preset action comprises a limb action;
the rendering module is used for rendering the target virtual image according to at least one piece of first target time information to obtain at least one first video clip, wherein the first video clip corresponds to the preset action, the first target time information corresponds to a first target moment, and the preset action corresponds to at least one first action driving coefficient; and
a generating module for generating a target video according to at least one of the first video clips,
wherein the rendering module comprises:
a fifth determining submodule, configured to determine at least one first target action driving coefficient according to a preset action corresponding to the target action tag text; and
a first rendering sub-module, used for rendering the target virtual image according to the first target moment and the at least one first target action driving coefficient to obtain the first video clip.
14. The apparatus of claim 13, wherein the initial text comprises M initial sub-texts, M being an integer greater than or equal to 1, M of the initial sub-texts being extracted from the initial text,
at least one target action label text is N target action label texts, the N target action label texts are respectively matched with N initial sub-texts, and N is an integer which is greater than or equal to 1 and less than or equal to M.
15. The apparatus of claim 14, wherein the target text is obtained by performing related operations through the following sub-modules:
the first determining sub-module is used for determining N pieces of first target position information according to N pieces of initial sub-texts which are respectively matched with N pieces of target action label texts; and
a first adding sub-module, used for respectively adding the N target action label texts to the initial text according to the N pieces of first target position information to obtain the target text.
16. The apparatus of claim 13, wherein the determining module comprises:
the processing sub-module is used for processing the target text by utilizing a preset algorithm to obtain a processing result, wherein the processing result comprises a plurality of initial time information;
a third determining sub-module, configured to determine at least one piece of initial time information corresponding to the target action tag text according to at least one of the characters corresponding to the target action tag text; and
a fourth determining sub-module, configured to determine, according to the at least one piece of initial time information corresponding to the target action tag text, the first target time information corresponding to the target action tag text.
17. The apparatus of claim 16, wherein the processing result further comprises a plurality of second action driving coefficients, the second action driving coefficients corresponding to the characters of the target text.
18. The apparatus of claim 17, wherein the initial text includes K characters, K being an integer greater than or equal to 1, the plurality of second action driving coefficients including K second target action driving coefficients corresponding to the K characters of the initial text, respectively,
the rendering module further includes:
a sixth determining submodule, configured to determine, according to the plurality of pieces of initial time information, K pieces of second target time information respectively corresponding to the K characters of the initial text; and
a second rendering sub-module, used for rendering the target virtual image according to the K second target action driving coefficients and the K second target time information to obtain K second video clips.
19. The apparatus of claim 18, wherein the second target time information corresponds to a second target time,
the second rendering submodule includes:
a first rendering unit, used for rendering the target virtual image according to the second target action driving coefficient and the second target time information corresponding to the character of the initial text to obtain the second video clip.
20. The apparatus of claim 18, wherein the initial text includes M initial sub-texts, M being an integer greater than or equal to 1, M of the initial sub-texts being extracted from the initial text,
the M initial sub-texts comprise I initial sub-texts matched with I target material tag texts, wherein I is an integer greater than or equal to 1 and less than or equal to M,
the target text is obtained by performing related operations through the following submodules:
a seventh determining sub-module, configured to determine I second target position information according to the I initial sub-texts respectively matched with the I target material tag texts; and
a second adding sub-module, used for respectively adding the I target material tag texts to the initial text according to the I second target position information to obtain the target text.
21. The apparatus of claim 20, wherein the determining module further comprises:
an eighth determining submodule, used for determining I pieces of third target time information respectively corresponding to the I target material tag texts according to the initial time information.
22. The apparatus of claim 21, wherein the rendering module further comprises:
a third rendering sub-module, used for rendering the target virtual image according to the I pieces of third target time information to obtain I third video clips, wherein the third video clips correspond to preset materials.
23. The apparatus of claim 22, wherein the third target time information corresponds to a third target time,
the third rendering submodule includes:
a display unit, used for displaying the preset material corresponding to the target material tag text at the third target moment to obtain the third video clip.
24. The apparatus of claim 23, wherein the processing result further comprises a target audio, the target audio corresponding to the initial text,
the generation module comprises:
a generation sub-module, used for generating the target video according to the target audio, the at least one first video clip, the K second video clips and the I third video clips.
25. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 12.
26. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1 to 12.
CN202211534769.2A 2022-11-30 2022-11-30 Video generation method, device, electronic equipment and storage medium Active CN115942039B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211534769.2A CN115942039B (en) 2022-11-30 2022-11-30 Video generation method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211534769.2A CN115942039B (en) 2022-11-30 2022-11-30 Video generation method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115942039A CN115942039A (en) 2023-04-07
CN115942039B true CN115942039B (en) 2024-02-23

Family

ID=86648715

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211534769.2A Active CN115942039B (en) 2022-11-30 2022-11-30 Video generation method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115942039B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112002301A (en) * 2020-06-05 2020-11-27 四川纵横六合科技股份有限公司 Text-based automatic video generation method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709248B (en) * 2020-05-28 2023-07-11 北京百度网讯科技有限公司 Training method and device for text generation model and electronic equipment

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112002301A (en) * 2020-06-05 2020-11-27 四川纵横六合科技股份有限公司 Text-based automatic video generation method

Also Published As

Publication number Publication date
CN115942039A (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN109168026B (en) Instant video display method and device, terminal equipment and storage medium
US11758088B2 (en) Method and apparatus for aligning paragraph and video
CN109218629B (en) Video generation method, storage medium and device
CN109660865B (en) Method and device for automatically labeling videos, medium and electronic equipment
WO2019227429A1 (en) Method, device, apparatus, terminal, server for generating multimedia content
CN114339069B (en) Video processing method, video processing device, electronic equipment and computer storage medium
CN110784662A (en) Method, system, device and storage medium for replacing video background
CN114222196A (en) Method and device for generating short video of plot commentary and electronic equipment
CN113778419A (en) Multimedia data generation method and device, readable medium and electronic equipment
CN114187405B (en) Method, apparatus, medium and product for determining avatar
CN113656125A (en) Virtual assistant generation method and device and electronic equipment
CN115357755B (en) Video generation method, video display method and device
CN115942039B (en) Video generation method, device, electronic equipment and storage medium
CN108509059B (en) Information processing method, electronic equipment and computer storage medium
CN111898338A (en) Text generation method and device and electronic equipment
CN116168108A (en) Method and device for generating image through text, storage medium and electronic equipment
CN114125498B (en) Video data processing method, device, equipment and storage medium
CN112672202B (en) Bullet screen processing method, equipment and storage medium
CN116957669A (en) Advertisement generation method, advertisement generation device, computer readable medium and electronic equipment
US10910014B2 (en) Method and apparatus for generating video
CN115967833A (en) Video generation method, device and equipment meter storage medium
CN113923477A (en) Video processing method, video processing device, electronic equipment and storage medium
CN116074576A (en) Video generation method, device, electronic equipment and storage medium
CN113312516B (en) Video processing method and related device
CN116166125B (en) Avatar construction method, apparatus, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant