CN114173067A - Video generation method, device, equipment and storage medium - Google Patents


Info

Publication number: CN114173067A
Application number: CN202111574773.7A
Authority: CN (China)
Prior art keywords: video, voice, script, playing, special effect
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 疏坤, 何山, 殷兵, 胡金水, 刘聪
Current Assignee: iFlytek Co Ltd
Original Assignee: iFlytek Co Ltd
Application filed by iFlytek Co Ltd

Classifications

    • H04N5/2621: Cameras specially adapted for the electronic generation of special effects during image pickup, e.g. digital cameras, camcorders, video cameras having integrated special effects capability
    • G06F16/71: Information retrieval of video data; indexing; data structures and storage structures therefor
    • G06F16/7867: Retrieval characterised by using manually generated metadata, e.g. tags, keywords, comments, title and artist information
    • H04N21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N21/44016: Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • H04N21/8352: Generation of protective data involving content or source identification data, e.g. Unique Material Identifier [UMID]
    • H04N5/278: Subtitling


Abstract

The application provides a video generation method, apparatus, device and storage medium. The method includes: determining, from a pre-constructed video resource library, the video resources that match the script keywords in a video script, and determining play labels in one-to-one correspondence with the script keywords, where each play label at least includes index information of the video resource matched with its script keyword; determining special effect labels according to the video resources corresponding to the play labels, where the special effect labels include video resource playing special effect labels and/or video transition special effect labels between adjacent video resources; and performing video clipping at least according to the play labels and the special effect labels to obtain a video file. With this method, a video file can be generated automatically, so the labor cost and time cost of video production can be reduced.

Description

Video generation method, device, equipment and storage medium
Technical Field
The present application relates to the field of multimedia data processing technologies, and in particular, to a video generation method, apparatus, device, and storage medium.
Background
With the rise of internet self-media and short-video platforms, the number of netizens obtaining information from television, video websites and mobile short-video apps has grown markedly in recent years. At the same time, more and more people are entering video-production-related fields such as news and information, everyday life knowledge, and popular science; producing videos efficiently while winning public acceptance and a good reputation has become the primary task of video creators.
At present, video editing software on the market still relies on manual production: various elements such as audio, subtitles and video must be handled together, so video production efficiency is low. Producing a video of only a few minutes usually takes several hours of manual work, and when the number of video materials is large the process consumes a great deal of manpower and time.
Disclosure of Invention
In view of this technical situation, the present application provides a video generation method, apparatus, device and storage medium that can generate a video file automatically, thereby reducing the labor cost and time cost of video production.
A video generation method, comprising:
determining video resources matched with various script keywords in a video script from a pre-constructed video resource library, and determining various playing labels corresponding to the script keywords one by one; the playing labels at least comprise index information of video resources matched with the script keywords;
determining special effect labels according to video resources corresponding to the playing labels; the special effect labels comprise video resource playing special effect labels and/or video transition special effect labels between adjacent video resources;
and performing video clipping processing at least according to each playing label and the special effect label to obtain a video file.
Optionally, determining a video transition special effect label between adjacent video resources according to the video resource corresponding to each play label, including:
and determining the video transition special effect between the video resources corresponding to the adjacent playing labels from the preset video transition special effects according to the key frame density of the video resources corresponding to each playing label.
Optionally, the play tag further includes index information and play mode information of an audio resource matched with the scenario keyword.
Optionally, the method further includes:
generating video voice and/or subtitle labels according to the video script; the caption label comprises an identifier of a script text corresponding to the video voice and display starting and ending time of the script text;
performing video clipping processing at least according to each playing label and the special effect label to obtain a video file, wherein the video file comprises:
and according to each playing label, the special effect label and the video voice and/or subtitle label, carrying out video clipping processing to obtain a video file.
Optionally, the method further includes:
and carrying out video clipping processing according to the playing label and/or the special effect label and/or the subtitle label modified by the user to obtain a video file.
Optionally, when the video scenario is a scenario text, generating a video voice and a subtitle tag according to the video scenario includes:
synthesizing voice according to the video script to obtain video voice;
determining the playing start-stop time of each sentence in the video voice and the display start-stop time of each text sentence in the video script according to the sentence boundary information of the synthesized voice;
and generating a caption label based on the identification of each text sentence in the video script and the display starting and ending time.
Optionally, synthesizing the voice according to the video scenario to obtain the video voice, including:
determining a text semantic label by performing semantic understanding on the video script, wherein the text semantic label comprises at least one of user group information oriented to the video script, the gender of a speaker suitable for the video script when the video script synthesizes voice, emotion state information of the speaker when the video script synthesizes voice and voice atmosphere information when the video script synthesizes voice;
determining a voice synthesis parameter according to the text semantic tag and a pronunciation matching rule, wherein the voice synthesis parameter comprises at least one of voice pronouncing person, speed of speech, emotion and volume;
and synthesizing voice according to the video script and the voice synthesis parameters to obtain video voice.
Optionally, when the video scenario is a scenario voice, the scenario voice is taken as the video voice;
generating caption tags from the video transcript, comprising:
performing voice recognition processing on the video script to obtain a script text;
determining the playing start-stop time of each sentence in the video voice according to the sentence boundary information of the voice recognition, and determining the identification and the display start-stop time of each text sentence in the script text;
and generating a subtitle label according to the script text and the identification and the display starting and ending time of each text sentence in the script text.
Optionally, the video clip processing is performed according to each playing label, the special effect label, and the video voice and subtitle label, so as to obtain a video file, including:
acquiring video resources and audio resources corresponding to each playing label from a pre-constructed video resource library;
performing image coding processing on the acquired video resources and texts in the subtitle labels according to the special effect labels and the subtitle labels; and performing audio coding processing on the audio resource and the video voice;
and synthesizing the image coding result and the audio coding result to obtain a video file.
Optionally, the method further includes:
and cutting the video resource and the audio resource corresponding to each playing label according to the playing starting and ending time of each sentence in the video voice and/or the displaying starting and ending time of each text sentence in the caption label, so that the lengths of the video resource and the audio resource are matched with the playing duration of the corresponding video voice and/or the displaying duration of the caption text.
A video generation apparatus comprising:
the resource retrieval unit is used for determining video resources matched with all scenario keywords in the video scenario from a pre-constructed video resource library and determining all playing labels corresponding to all scenario keywords one by one; the playing labels at least comprise index information of video resources matched with the script keywords;
the special effect setting unit is used for determining the special effect labels according to the video resources corresponding to the playing labels; the special effect labels comprise video resource playing special effect labels and/or video transition special effect labels between adjacent video resources;
and the video generating unit is used for carrying out video clipping processing at least according to each playing label and the special effect label to obtain a video file.
A video generation device comprising:
a memory and a processor;
wherein the memory is connected with the processor;
the processor is used for implementing the video generation method by running the program in the memory.
A storage medium having stored thereon a computer program which, when executed by a processor, implements the video generation method described above.
According to the video generation method, video resources corresponding to the scenario keywords can be determined from a pre-constructed video resource library automatically according to the scenario keywords of the video scenario, video effects matched with the video resources can be determined automatically, and video editing and synthesizing processing can be carried out according to the video effects and the video resources to obtain a video file. When the user applies the video generation method to make a video, only the video script needs to be determined, and the video generation method can automatically generate a video file which accords with the video script input by the user based on the video script input by the user, so that the labor cost and the time cost of video making can be reduced.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a schematic flowchart of a video generation method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of another video generation method provided in an embodiment of the present application;
fig. 3 is a schematic structural diagram of a video generating apparatus according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a video generation device according to an embodiment of the present application.
Detailed Description
The embodiment of the application is suitable for an application scene of automatically generating the video, and by adopting the technical scheme of the embodiment of the application, the video file meeting the user requirement can be automatically generated based on the pre-constructed video resource library, so that the video generation efficiency can be improved, and the time cost and the labor cost of video production can be reduced.
For example, the technical solution of the present application may be executed by a hardware processor in devices such as smart phones, tablet computers and computers, or its processing flow may be packaged as a software program. When the hardware processor executes the processing procedure of the technical solution, or the software program is run, a video file can be generated automatically. The embodiment of the present application only introduces the specific processing procedure of the technical solution by way of example and does not limit its execution form; any implementation capable of executing the processing procedure of the technical solution of the present application may be adopted.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
An embodiment of the present application provides a video generation method, which is shown in fig. 1 and includes:
s101, determining video resources matched with the script keywords in the video script from a pre-constructed video resource library, and determining playing labels corresponding to the script keywords one by one.
Specifically, the video resource library is a database built by collecting and accumulating copyrighted media files such as audio, pictures and videos; it serves as the source of materials for video generation.
The media files in the video resource library are clearly classified by video frame rate, resolution and audio sampling rate, so that a user can conveniently retrieve from the library the materials that match the frame rate, resolution and sampling rate required for the video to be produced.
For each media file in the video asset library, attribute information is determined separately, and the attribute information includes but is not limited to content type of the media file, characters and things involved in the content of the media file, and attributes of the characters and things.
In combination with attribute information of each media file in the video repository and content of each media file, the embodiment of the present application further sets tag keywords for each media file in the video repository, and each media file may correspond to a plurality of tag keywords. The label key words of the media files are used for representing the content and attribute information of the media files. For example, for videos and pictures, the setting of the label keywords for the video and picture resources can be realized through an image detection and image recognition classification algorithm. For audio, the setting of tag keywords can be performed through voice recognition. Alternatively, the tag keywords may be manually set for each media file.
In addition, the media files in the video resource library are divided into different video resource sub-libraries according to the type of file content, with each sub-library storing one type of media file (audio, pictures or videos). When a user needs to retrieve a media file (audio, video or picture) containing a certain category of content, it can be retrieved directly from the sub-library of the corresponding category.
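For concreteness only, the video resource library described above can be pictured as the following minimal sketch in Python; the field names (sub-library id, tag keywords, attribute dictionary, frame rate, resolution, sampling rate) are illustrative assumptions rather than a prescribed schema.

```python
from dataclasses import dataclass, field


@dataclass
class MediaRecord:
    """One media file (audio, picture or video) in the video resource library."""
    file_id: int
    media_type: str                      # "video", "audio" or "picture"
    path: str
    frame_rate: float | None = None      # videos only
    resolution: str | None = None        # e.g. "1920x1080"; videos and pictures
    sample_rate: int | None = None       # audio only
    tag_keywords: set[str] = field(default_factory=set)   # content/attribute keywords
    attributes: dict = field(default_factory=dict)         # people, things, their attributes


class VideoResourceLibrary:
    """Pre-constructed library; sub-libraries are keyed by content category."""

    def __init__(self) -> None:
        self.sub_libraries: dict[int, list[MediaRecord]] = {}

    def add(self, sub_library_id: int, record: MediaRecord) -> None:
        self.sub_libraries.setdefault(sub_library_id, []).append(record)

    def records(self, media_type: str):
        """Iterate (sub_library_id, record) pairs of one media type."""
        for sub_id, records in self.sub_libraries.items():
            for rec in records:
                if rec.media_type == media_type:
                    yield sub_id, rec
```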
The video scenario refers to the voice or text content input by the user as the basis for generating the video; it may appear in the finally generated video in the form of voice-over or subtitles. When a user needs to make a video, the user only needs to design a video scenario and input it, in voice or text form, into the video synthesizer; by executing the technical solution of the embodiment of the present application, the video synthesizer generates a video file that contains, and matches, the video scenario input by the user.
Based on the video resource library setting, when a video scenario input by a user is acquired, the embodiment of the application first determines each scenario keyword in the video scenario. The scenario keyword is a keyword capable of representing the content of the scenario, and the keyword may be a certain keyword or some keywords in the video scenario, or a keyword obtained by summarizing and summarizing the content of the video scenario.
As an exemplary implementation, in the embodiment of the present application the video scenario is first split into sentences to obtain the text sentences of the video scenario, and then keyword extraction or semantic understanding is performed on each text sentence to obtain the keywords corresponding to that sentence.
Or, semantic understanding may be performed on the video scenario, and at least one keyword capable of representing the content of the video scenario is determined as the keyword of the video scenario.
When determining the keywords of the video scenario, the keywords of the video scenario are sequentially determined in the order of the front and back of the content of the video scenario. That is, each keyword of the video scenario has a definite precedence relationship, and the precedence relationship is matched with the precedence relationship of the content of the video scenario corresponding to the keyword.
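The keyword step can be sketched as follows, assuming a sentence splitter and a keyword extractor are available; extract_keywords is a hypothetical placeholder for the keyword extraction or semantic understanding mentioned above, and the sketch mainly illustrates that the keywords keep the front-to-back order of the scenario content.

```python
import re


def split_sentences(script_text: str) -> list[str]:
    """Split the scenario text into sentences on common end-of-sentence marks."""
    parts = re.split(r"[。！？!?.\n]+", script_text)
    return [p.strip() for p in parts if p.strip()]


def extract_keywords(sentence: str) -> list[str]:
    """Hypothetical placeholder for keyword extraction or semantic understanding
    of one scenario sentence (e.g. a TF-IDF/TextRank extractor or an NLU model)."""
    raise NotImplementedError


def scenario_keywords(script_text: str) -> list[str]:
    """Scenario keywords in the same front-to-back order as the scenario content."""
    keywords: list[str] = []
    for sentence in split_sentences(script_text):
        keywords.extend(extract_keywords(sentence))
    return keywords
```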
When determining each scenario keyword of a video scenario input by a user, the embodiment of the application searches and determines a video resource matched with each scenario keyword of the video scenario from a pre-constructed video resource library, and determines each playing label corresponding to each scenario keyword one by one according to the determined information of the video resource matched with each scenario keyword.
Illustratively, the script keywords are compared with the label keywords of each video resource in the video resource library, and the video resource with the highest matching degree between the label keywords and the script keywords is selected from the video resource library as the video resource matched with the script keywords.
Because the data volume of the video resources is large, in order to reduce the data volume, the video content in the key frame of the video resources matched with the scenario keywords is used as the video resources matched with the scenario keywords in the embodiment of the application.
Further, according to the related information of the video resource matching with the scenario keyword, a playing label corresponding to the scenario keyword is generated, and the playing label at least includes index information of the video resource matching with the scenario keyword, such as a video resource (sub) library where the video resource is located, an identifier of a video resource file, a position of the video resource in the video file, and the like.
Furthermore, while video resources matching the respective scenario keywords of the video scenario are determined from the video resource library, audio resources matching the respective scenario keywords may also be determined from the video resource library. The audio resource may be an audio resource corresponding to a video resource, for example, for a certain video file, video and audio therein are separately and separately stored in a video resource library, and when a certain video resource is acquired, the audio resource corresponding to the video resource may also be acquired. In addition, the audio resource may also be the audio of the background music set by the user for the video file to be generated.
The audio resource matching with the scenario keyword may be obtained in the same manner as the above-mentioned video resource matching with the scenario keyword.
When the audio resources matched with the scenario keywords in the video scenario are obtained, the index information and the playing mode information of the audio resources matched with the scenario keywords can be added to the playing labels corresponding to the scenario keywords. The index information of the audio resource may include a video resource (sub) library where the audio resource is located, an identifier of an audio resource file, a location of the audio resource in the audio file, and the like. The playing mode of the audio resource may include a loop mode, a playing volume, a playing cut-in and cut-out effect, and the like.
For example, assume that the play label generated according to the above method for a certain scenario keyword in a video scenario is [hub=3; video=1; section=30:200; audio=3; volume=50; mode=1]. This play label indicates that, for this scenario keyword, clipping is performed with the audio and video material in video resource (sub-)library 3: frames 30 to 200 of video file 1 are selected for playing, audio file 3 is selected for playing with the volume set to 50, and the play mode is loop playback (1 corresponds to the loop play mode).
And for each scenario keyword in the video scenario, determining each playing label corresponding to each scenario keyword one by one according to the processing.
It is understood that the video resources and audio resources matching the scenario keyword can be uniquely indexed from the video resource library based on the playtags corresponding to the scenario keyword.
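Building on the library sketch above, the retrieval and play-label construction of step S101 might look as follows; the PlayTag fields, the match_degree similarity measure and the key_frame_section attribute are assumptions, since the embodiment only requires that the resource whose tag keywords best match the scenario keyword is selected and that the label carries the index (and, where applicable, audio index and play-mode) information.

```python
from dataclasses import dataclass


@dataclass
class PlayTag:
    """One play label, in one-to-one correspondence with a scenario keyword."""
    keyword: str
    sub_library_id: int
    video_file_id: int
    frame_section: tuple[int, int]       # frames of the resource selected for playing
    audio_file_id: int | None = None
    audio_volume: int | None = None
    audio_play_mode: int | None = None   # e.g. 1 = loop playback


def match_degree(keyword: str, tag_keywords: set[str]) -> float:
    """Assumed similarity measure between a scenario keyword and a resource's tag keywords."""
    if not tag_keywords:
        return 0.0
    hits = sum(1 for t in tag_keywords if keyword in t or t in keyword)
    return hits / len(tag_keywords)


def build_play_tag(keyword: str, library: "VideoResourceLibrary") -> PlayTag:
    """Select the video resource whose tag keywords best match the scenario keyword."""
    sub_id, record = max(
        library.records("video"),
        key=lambda item: match_degree(keyword, item[1].tag_keywords),
    )
    start, end = record.attributes.get("key_frame_section", (0, 0))
    return PlayTag(keyword=keyword, sub_library_id=sub_id,
                   video_file_id=record.file_id, frame_section=(start, end))
```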
S102, determining special effect labels according to the video resources corresponding to the playing labels.
Specifically, when a video is generated, video resources matched with keywords of each scenario are retrieved from a video resource library based on a video scenario input by a user, and then the retrieved video resources are spliced and edited to obtain a complete video file.
Splicing and clipping a plurality of video resources relates to the setting of the playing special effect and the switching special effect of the video resources. The special effects are set for each video resource participating in the editing or splicing, so that the video file obtained by final editing or splicing is smoother, and stronger in harmony and integrity.
According to the video content or video type and other characteristics of the video resources corresponding to the playing labels, the special effects are automatically set for the video resources, and the special effect labels when the video resources are spliced or clipped are determined.
The special effect label may specifically be composed of a label of a play special effect of each video resource and/or a label of a video transition special effect between adjacent video resources.
The video resource playing special effect refers to a display special effect when the video resource is played, such as edge white filling, split screen display and the like; the video transition special effect between adjacent video resources refers to switching and skipping special effects of the adjacent video resources in the playing process, that is, when the playing of the current video resource is finished, the next video resource is accessed to be played in a special effect form, for example, special effects such as gradual change and rotation can be achieved. In addition, the video transition special effect may further include a duration of the transition special effect, a sound special effect matched with the transition special effect, and the like.
According to the video content or video type of the video resource corresponding to each play label, the playing special effect suitable for each video resource and the video transition special effect suitable for adjacent video resources are determined and recorded in the form of special effect labels. For example, the special effect label [scene=2, jump=2] indicates that the video resource is played in split-screen mode and that a left-slide effect is used as the transition between adjacent video resources.
The playing special effects of different video resources can be the same or different; the video transition special effects between different adjacent video resources can be the same or different. That is, one or more special effect tags corresponding to the respective play tags may be finally obtained. When there are multiple special effect tags, the video resources corresponding to the special effect tags need to be determined.
Further, when each play label further includes an audio resource, the finally determined special effect labels may further include an audio resource play special effect label and a switch special effect label between adjacent audio resources, in addition to the transition special effect label between the video resource play label and the adjacent video resource. The specific setting mode of the audio resource playing special effect and the switching special effect between the adjacent audio resources can refer to the setting mode of the video resource playing special effect and the transition special effect between the adjacent video resources.
Illustratively, when the video transition special effect label between the adjacent video resources is determined according to the video resource corresponding to each playing label, the video transition special effect between the video resources corresponding to the adjacent playing labels can be determined from the preset video transition special effects according to the key frame density of the video resource corresponding to each playing label.
Specifically, the higher the key frame density of a video resource, the richer its content. If the key frame densities of the video resources corresponding to adjacent play labels are both high, or the key frame density of one of them is high, the adjacent video resources should be switched with a slower transition, or a transition special effect with higher texture complexity should be selected, so that the transition matches the content complexity of the video resources and does not present the user with an overly abrupt visual switch.
If the key frame densities of the video resources corresponding to the adjacent playing labels are smaller, the content complexity of the adjacent video resources is lower, and at the moment, when the adjacent video resources are switched, the switching can be carried out at a higher switching speed or by using a switching special effect with lower texture complexity.
Correspondingly, when the transition special effect between the video resources corresponding to the adjacent playing labels is determined, the switching special effect can be set for the audio resources corresponding to the adjacent playing labels. For example, a suitable audio switching special effect may be selected according to the phoneme complexity of the neighboring audio resources. The specific sound effect switching special effect setting can refer to the setting mode of the video switching special effect.
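A minimal sketch of the key-frame-density rule follows; the preset transition table, the density threshold and the concrete numbers are assumptions, since the embodiment only specifies that content-rich (key-frame-dense) neighbours get a slower or more texture-complex transition and sparse neighbours a faster or simpler one.

```python
from dataclasses import dataclass


@dataclass
class TransitionEffect:
    name: str                 # e.g. "slide_left", "fade", "rotate"
    duration_s: float         # slower transitions suit content-rich neighbours
    texture_complexity: int


# Assumed preset table of video transition special effects.
PRESET_TRANSITIONS = [
    TransitionEffect("slide_left", duration_s=0.4, texture_complexity=1),
    TransitionEffect("fade",       duration_s=0.8, texture_complexity=2),
    TransitionEffect("rotate",     duration_s=1.2, texture_complexity=3),
]


def key_frame_density(key_frame_count: int, duration_s: float) -> float:
    """Key frames per second of a video resource."""
    return key_frame_count / max(duration_s, 1e-6)


def pick_transition(density_a: float, density_b: float,
                    dense_threshold: float = 0.5) -> TransitionEffect:
    """If either neighbouring resource is key-frame dense, choose the slowest /
    most texture-complex preset transition; otherwise the fastest / simplest one."""
    if max(density_a, density_b) >= dense_threshold:
        return max(PRESET_TRANSITIONS, key=lambda t: t.texture_complexity)
    return min(PRESET_TRANSITIONS, key=lambda t: t.texture_complexity)
```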
S103, video clipping processing is carried out at least according to the playing labels and the special effect labels, and a video file is obtained.
After the playing labels corresponding to the video script are determined and the special effect labels corresponding to the playing labels are determined, video synthesis and editing processing can be performed according to the playing labels and the special effect labels, and a complete video file is obtained.
Specifically, first, according to the video resource index information contained in each play label, the corresponding video resources are read from the video resource library to obtain the video resource corresponding to each play label. Then, according to the video playing special effect labels and/or the video transition special effect labels between adjacent video resources contained in the special effect labels, a playing special effect is set for each video resource and/or a video transition special effect is set for adjacent video resources. Next, each video resource is rendered with its playing special effect, and transition special-effect animations are generated according to the transition special effects between adjacent video resources; when generating a transition animation, pictures may also be retrieved from the video resource library. The generation of transition special-effect animations mainly involves radial transformation and perspective transformation algorithms in digital image processing; for the specific generation process, reference may be made to existing transition-animation generation techniques, which are not detailed in this embodiment.
And finally, splicing and synthesizing the video resources with the set special playing effect and transition special effect animations between adjacent video resources to obtain a complete video file.
In addition, when the play tag includes the index information and the play mode information of the audio resource, during the above-mentioned video clip processing, it is also necessary to read the corresponding audio resource from the video resource library according to the index information of the audio resource, and set the play mode for the read audio resource according to the play mode in the play tag. Then, when the video resources are spliced and synthesized, the audio resources with the playing modes are spliced and synthesized, so that the finally obtained video file is a complete multimedia file with synchronous video and audio.
The transition special effect animation between adjacent video resources can be retrieved from the video resource library, and the transition special effect animation is synchronized with the corresponding animation audio when the video file is synthesized. Or, the audio data stream corresponding to the transition special effect animation may be directly and correspondingly filled with the all-0 binary data stream, and in this case, the finally generated transition special effect animation of the video file is in a mute state when played.
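The overall assembly of step S103 can be summarised by the sketch below; the helper functions are hypothetical stand-ins for the resource reading, playing-special-effect rendering and transition-animation generation described above and would have to be supplied by a concrete implementation.

```python
from typing import Any, List

Frames = List[Any]  # placeholder for a sequence of decoded image frames


def read_video_resource(library: Any, play_tag: Any) -> Frames:
    """Hypothetical helper: fetch the frames indexed by a play label."""
    raise NotImplementedError


def apply_play_effect(frames: Frames, play_effect: Any) -> Frames:
    """Hypothetical helper: apply a playing special effect (split screen, edge padding, ...)."""
    raise NotImplementedError


def make_transition_animation(transition: Any) -> Frames:
    """Hypothetical helper: generate the transition special-effect animation
    between two neighbouring resources (geometric / perspective warps)."""
    raise NotImplementedError


def assemble_video(play_tags: list, play_effects: list, transitions: list,
                   library: Any) -> Frames:
    """Splice the play-effect-processed resources and the transition animations
    into one continuous frame sequence, in play-label order."""
    segments: Frames = []
    for i, tag in enumerate(play_tags):
        segments.extend(apply_play_effect(read_video_resource(library, tag),
                                          play_effects[i]))
        if i + 1 < len(play_tags):        # transition before the next resource
            segments.extend(make_transition_animation(transitions[i]))
    return segments
```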
As can be seen from the above description, the video generation method provided in the embodiment of the present application can automatically determine, according to the scenario keywords of the video scenario, the video resources corresponding to the scenario keywords from the pre-constructed video resource library, automatically determine the video special effects matched with the video resources, and perform video editing and synthesizing processing according to the video special effects and the video resources to obtain the video file. When the user applies the video generation method to make a video, only the video script needs to be determined, and the video generation method can automatically generate a video file which accords with the video script input by the user based on the video script input by the user, so that the labor cost and the time cost of video making can be reduced.
As a preferred implementation manner, referring to fig. 2, after determining each play label corresponding to each scenario keyword one-to-one, and determining a special effect label, the embodiment of the present application further performs step S203 to generate a video voice and/or subtitle label according to the video scenario.
Specifically, the embodiment of the application further generates a video voice and/or subtitle tag according to the video scenario, and when the video file is generated, the video scenario input by the user is added to the generated video file in the form of the video voice and/or the video subtitle.
In the following, the embodiments of the present application take the addition of video voice and video subtitles to a video file as an example, and describe the above-mentioned processing procedure of generating video voice and subtitle tags from a video scenario and the processing of adding the generated video voice and subtitle text to the video file. In practical application of the technical solution of the embodiment of the present application, one of a video speech and a subtitle text may be generated according to a video scenario and added to a video file, and at this time, specific processing content may be executed by referring to the description of the embodiment of the present application, which is not described in detail in the present application.
The caption label comprises an identification of a script text corresponding to the video voice determined based on the video script and a display starting and ending time of the script text. For example, in the embodiment of the application, the text sentence number of the script text is used as the text sentence mark of the script text, and meanwhile, the display start time and the display end time of the text sentence of the script text in the whole video file are used as the display start-stop time of the script text.
The video scenario may be in a voice form or a text form. When the form of the video scenario is different, the video voice and the text subtitle should be acquired by different methods. The following describes specific processing procedures for generating video speech and subtitle tags when the video scenario is a scenario text and a scenario speech, respectively.
When the video scenario is a scenario text, generating video voice and subtitle tags from the video scenario may be implemented by performing the following steps a 1-A3:
and A1, synthesizing voice according to the video script to obtain video voice.
Specifically, a speech synthesis algorithm is adopted, so that the video script in the text form can be synthesized into speech, and the speech is used as video speech.
In order to improve the effect of the synthesized video voice, the embodiment of the application firstly carries out semantic understanding on the video script, determines the text semantic tag, and then carries out voice synthesis processing based on the text semantic tag, so that the synthesized voice is more vivid and lively.
Specifically, the embodiment of the present application implements speech synthesis processing on a video scenario according to the following steps a11-a 13:
a11, determining text semantic labels by performing semantic understanding on the video script.
The text semantic tag comprises at least one of user group information oriented to the video script, the sex of a speaker suitable for the video script when the video script synthesizes voice, emotional state information of the speaker when the video script synthesizes voice and voice atmosphere information when the video script synthesizes voice.
Illustratively, the video scenario is input into a pre-trained semantic understanding engine to obtain semantic information of the video scenario, where the semantic information includes, but is not limited to, one or more of the following: the user group the video scenario is oriented to, such as children, young people or middle-aged people; the speaker gender suitable for synthesizing the scenario into speech, such as male, female or unrestricted; the speaker's emotional state for speech synthesis, such as happy, moved, solemn, soothing, surprised or thoughtful; and the speech atmosphere for synthesis, such as celebratory or warm.
More semantic information is determined through the semantic understanding mode and is used as text semantic labels of the videos. In practical application, a group of text semantic tags can be determined for the whole video script, and text semantic tags corresponding to each text segment of the video script can be determined for each text segment of the video script.
And A12, determining the speech synthesis parameters according to the text semantic labels and the pronunciation matching rules.
Specifically, based on the text semantic tags, the speech synthesis parameters are determined according to the pronunciation matching rule. The pronunciation matching rule selects different speech synthesis speakers according to the target user group tag and the suitable speaker gender, and, according to the emotional state tag and the speech atmosphere tag, sets the emotional fullness, speech rate (faster for positive emotions, slower for negative emotions) and volume (larger for a festive atmosphere, slightly smaller for a warm atmosphere) of the pronunciation as input parameters of the speech synthesis engine.
The specific content of the pronunciation matching rule can be preset or adopt the rule commonly used in the industry. Through the above processing, a speech synthesis parameter including at least one of a speaker, a speech speed, an emotion, and a volume of speech can be determined.
And A13, synthesizing voice according to the video script and the voice synthesis parameters to obtain video voice.
Specifically, the script text is converted into the voice corresponding to the above-mentioned voice synthesis parameters by performing the voice synthesis process according to the script text content of the video script and the above-mentioned voice synthesis parameters, and the voice is used as the video voice.
For a specific processing procedure for synthesizing a text into a speech, a conventional speech synthesis technical scheme may be referred to, and details of the embodiment of the present application are not described.
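Steps A11-A13 might be sketched as follows; the semantic understanding engine and the speech synthesis engine are represented by caller-supplied callables, and the concrete speaker names, speeds and volumes in the mapping are assumptions standing in for the pronunciation matching rule.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SynthesisParams:
    speaker: str
    speed: float     # > 1.0 faster, < 1.0 slower
    emotion: str
    volume: int      # 0-100


POSITIVE_EMOTIONS = {"happy", "surprised"}
NEGATIVE_EMOTIONS = {"solemn", "sad"}


def params_from_semantic_tags(tags: dict) -> SynthesisParams:
    """A12: map text semantic tags to synthesis parameters (assumed concrete values)."""
    if tags.get("user_group") == "children":
        speaker = "child_voice"
    elif tags.get("speaker_gender") == "female":
        speaker = "female_1"
    else:
        speaker = "male_1"
    emotion = tags.get("emotion", "neutral")
    speed = 1.1 if emotion in POSITIVE_EMOTIONS else 0.9 if emotion in NEGATIVE_EMOTIONS else 1.0
    volume = 80 if tags.get("atmosphere") == "celebration" else 60   # warm -> slightly smaller
    return SynthesisParams(speaker=speaker, speed=speed, emotion=emotion, volume=volume)


def synthesize_video_speech(script_text: str,
                            semantic_engine: Callable[[str], dict],
                            tts_engine: Callable[[str, SynthesisParams], bytes]) -> bytes:
    """A11: semantic tags; A12: synthesis parameters; A13: synthesize the video speech."""
    tags = semantic_engine(script_text)       # hypothetical semantic understanding call
    params = params_from_semantic_tags(tags)
    return tts_engine(script_text, params)    # hypothetical speech synthesis call
```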
A2, determining the playing start and stop time of each sentence in the video speech and determining the display start and stop time of each text sentence in the video script according to the sentence boundary information of the synthesized speech.
Specifically, in the speech synthesis process, the speech synthesis algorithm can synchronously determine sentence boundary information of the synthesized speech. According to the sentence boundary information of the synthesized voice and the time length of the finally generated video voice, the accurate time period of each sentence in the video voice can be calculated and determined, namely the playing start and stop time of each sentence in the video voice can be calculated and determined.
Accordingly, after the play start/stop time of each sentence in the video speech is determined, the play start/stop time of the sentence is set as the display start/stop time of the text sentence corresponding to the sentence for each sentence in the video speech, so that the display start/stop time of each text sentence in the video script can be determined separately.
And A3, generating a caption label based on the identification of each text sentence in the video script and the display starting and ending time.
Specifically, in the embodiment of the application, the text sentences in the video scenario are sorted according to the front-back sequence of the text sentences in the video scenario, and the serial number corresponding to each text sentence is used as the identifier of the text sentence.
When the caption label of the video script is generated, the serial number of a text sentence of the video script and the display starting and ending time of that text sentence are used to generate a label, namely the caption label corresponding to the text sentence. For example, [21 00:00:45.240-00:00:47.040] indicates that the 21st text sentence in the video transcript begins to be displayed at the 45.24th second of the video and ends at the 47.04th second of the video.
Furthermore, the content of the text sentence can be added directly to the caption label, so that the caption label comprises the identification of the text sentence, the display starting and ending time of the text sentence and the content of the text sentence. For example, [21 00:00:45.240-00:00:47.040 Dajiahao] indicates that the 21st text sentence of the video scenario starts to be displayed at the 45.24th second of the video and stops at the 47.04th second, and the display content is "Dajiahao" ("hello everyone").
In the above manner, a caption tag corresponding to each text sentence in the video scenario may be generated, or alternatively, a caption tag including an identifier of all text sentences and a display start/stop time of each text sentence may be generated for all text sentences in the video scenario.
In summary, based on the generated caption tags, the display start-stop time of each text sentence in the video script in the finally generated video can be clarified, and the display start-stop time of each text sentence in the video is synchronized with the play start-stop time of the corresponding video voice.
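A sketch of steps A2-A3 follows, assuming the synthesis engine returns sentence boundaries as (start, end) times in seconds; the tag string follows the [number start-end content] format of the examples above.

```python
def format_time(seconds: float) -> str:
    """Render seconds as HH:MM:SS.mmm, e.g. 45.24 -> '00:00:45.240'."""
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"


def subtitle_tags(sentences: list[str],
                  boundaries: list[tuple[float, float]]) -> list[str]:
    """One caption label per text sentence: [number start-end content].
    The display start/stop time of a sentence equals the play start/stop time
    of the corresponding sentence in the video speech."""
    tags = []
    for number, (text, (start, end)) in enumerate(zip(sentences, boundaries), start=1):
        tags.append(f"[{number} {format_time(start)}-{format_time(end)} {text}]")
    return tags


# e.g. subtitle_tags(["Dajiahao"], [(45.24, 47.04)])
#      -> ["[1 00:00:45.240-00:00:47.040 Dajiahao]"]
```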
When the video scenario is the scenario voice, the scenario voice can be directly used as the video voice.
On the basis, only the subtitle label is generated according to the video script. Specifically, the subtitle tag may be generated by performing the following processing of steps B1-B3:
and B1, performing voice recognition processing on the video script to obtain a script text.
Illustratively, the speech of the video script is input into a speech recognition engine, and the speech recognition engine performs speech recognition processing on the script speech to obtain a speech recognition text, which is used as the script text.
And B2, determining the playing start-stop time of each sentence in the video voice according to the sentence boundary information of the voice recognition, and determining the identification and the display start-stop time of each text sentence in the script text.
Specifically, in the speech recognition process, the speech recognition algorithm can synchronously determine sentence boundary information of the recognized speech. According to the sentence boundary information of the voice and the total duration of the script voice, the accurate time period of each sentence in the script voice can be calculated and determined, namely the playing start and stop time of each sentence in the script voice can be calculated and determined. Since the script speech is regarded as the video speech, the above processing determines the play start/stop time of each sentence in the finally determined video speech.
Meanwhile, sentence division can be performed on the recognition text of the script speech according to the sentence boundary information of the script speech. On the basis, the text sentences are sequenced according to the sequence of the text sentences in the script speech recognition text, and the sequence numbers corresponding to the text sentences are set as the marks of the text sentences.
After the play start and stop time of each sentence in the video speech is determined, the play start and stop time of the sentence is set as the display start and stop time of the text sentence corresponding to the sentence for each sentence in the video speech, so that the display start and stop time of each text sentence in the video script can be respectively determined.
And B3, generating a caption label according to the script text and the identification and display starting and ending time of each text sentence in the script text.
Specifically, when generating the subtitle tags of the video scenario, in order to determine the subtitle content through the subtitle tags, the embodiment of the present application directly adds the subtitle content to the subtitle tags.
That is, the sequence number of a text sentence of the video script, the text content of the sentence and the display starting and ending time of the sentence are used to generate a label, which is the caption label of the corresponding text sentence. For example, [21 00:00:45.240-00:00:47.040 Dajiahao] indicates that the 21st text sentence of the video scenario starts to be displayed at the 45.24th second of the video and stops at the 47.04th second, and the display content is "Dajiahao".
In the above manner, for each sentence in the video scenario, a subtitle tag corresponding to the sentence may be generated, or for all sentences in the video scenario, a subtitle tag including an identifier of a text sentence corresponding to all sentences, text sentence content, and display start/stop time of each text sentence may be generated.
After the above processing, the playing labels and the special effect labels corresponding to the video scenario, and the video voice and subtitle labels corresponding to the video scenario are determined, and then video editing processing can be performed. That is, step S204 may be continuously performed to perform video clipping processing according to each play tag, the special effect tag, and the video voice and/or subtitle tag, so as to obtain a video file.
As a preferred implementation manner, in the embodiment of the present application, a video clip is processed according to the following steps C1-C3, so as to obtain a video file:
and C1, acquiring the video resource and the audio resource corresponding to each playing label from a pre-constructed video resource library.
Specifically, according to the index information of the video resource and the audio resource in each play label, the corresponding video resource and audio resource are retrieved from the pre-constructed video resource library, and the video resource and the audio resource corresponding to each play label can be obtained.
C2, according to the special effect label and the caption label, carrying out image coding processing on the acquired video resources and the text in the caption label; and carrying out audio coding processing on the audio resource and the video voice.
Specifically, according to the special effect label, the embodiment of the application firstly carries out playing special effect setting on the retrieved video resource and carries out video resource transition special effect animation generation. And meanwhile, setting the playing mode of the audio resource according to the audio playing mode information in the playing label.
Then, the embodiment of the application divides the video and audio coding queues, and respectively codes the video resource and the audio resource.
For the video resources (including each video resource retrieved from the video resource library, the text in the subtitle labels and the transition special-effect animations), encoding proceeds as follows: according to the playing special effects and transition special effects in the special effect labels and the display starting and ending times of the subtitle text in the subtitle labels, the video resources are decoded into the corresponding image frames, the subtitle text is added to the image frames, and the image frames and the transition special-effect animations are then image-encoded. As a result, the encoded video contains the subtitles, transition special effects appear between adjacent video resources, and the playing special effects of the video meet the requirements.
For audio resources (including audio resources and video speech retrieved from a video resource library), when encoding the audio resources, firstly, the audio resources and the video speech are mixed, and then, audio encoding is performed on the mixed result.
Further, in order to ensure time synchronization of video resources, audio resources, video voices and subtitles, the embodiment of the present application further performs a clipping process on the video resources and the audio resources corresponding to each playing tag according to the playing start-stop time of each sentence in the video voices and/or the display start-stop time of each text sentence in the subtitle tags, so that the lengths of the video resources and the audio resources are matched with the playing duration of the corresponding video voices and/or the display duration of the subtitle texts.
The play start/stop time of each sentence in the video speech generated based on the video scenario is synchronized with the display start/stop time of the text sentence corresponding to each sentence. Therefore, it is only necessary to ensure that the play start and stop times of the video resources and the audio resources corresponding to the video voice and the subtitle text are synchronized with the start and stop times of the video voice and the subtitle text.
Because the determination of the playing label is based on the scenario keywords in the video scenario, the embodiment of the application determines the corresponding relationship between the video resource and the audio resource, and the video voice and the subtitle text according to the scenario segment corresponding to the scenario keywords.
Specifically, first, a scenario segment corresponding to a scenario keyword corresponding to a video resource and an audio resource is determined, and then, according to each text sentence included in the scenario segment, each video voice and subtitle text corresponding to the scenario segment is determined.
Based on the determination of the corresponding relationship, for each scenario segment, according to the playing time length of each video voice contained in the scenario segment and/or the display time length of each subtitle text sentence contained in the scenario segment, the lengths of the video resource and the audio resource corresponding to the scenario segment are cut, so that the lengths of the video resource and the audio resource corresponding to the scenario segment are the same as the playing time lengths of all video voices contained in the scenario segment or the display time lengths of all subtitle text sentences contained in the scenario segment.
Through this clipping, the durations of the video resource, the audio resource, the video voice and the subtitles corresponding to each script segment become equal, so that the voice, video, audio and subtitles in the finally generated video file remain synchronized.
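The duration-matching rule can be sketched as below, assuming each script segment is represented as a dictionary holding its speech sentences and subtitle sentences with start and end times in seconds; this structure is an assumption made for illustration, not the embodiment's data format.

def clip_length_for_segment(resource_duration: float, segment: dict) -> float:
    """Return the playback length to keep for a resource bound to this segment."""
    speech_total = sum(s["end"] - s["start"] for s in segment.get("speech", []))
    subtitle_total = sum(s["end"] - s["start"] for s in segment.get("subtitles", []))
    target = speech_total or subtitle_total  # prefer the voice duration when present
    return min(resource_duration, target)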
And C3, synthesizing the image coding result and the audio coding result to obtain a video file.
Specifically, the image coding result and the audio coding result finally obtained are merged while the video stream and the audio stream are kept synchronized, so that a complete video file is obtained.
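As one possible realization of this synthesis step (the embodiment does not prescribe a particular muxer), the encoded streams could be combined with the ffmpeg command-line tool, assuming they have been written to intermediate files:

import subprocess

def mux(video_path: str, audio_path: str, out_path: str) -> None:
    """Combine the image coding result and the audio coding result into one file."""
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video_path,   # encoded video stream (subtitles already burned in)
        "-i", audio_path,   # mixed and encoded audio stream
        "-c", "copy",       # streams are already encoded, so only remux
        "-shortest",        # keep video and audio aligned in length
        out_path,
    ], check=True)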
Furthermore, when the generated video file is output, the embodiment of the present application also outputs the labels used to generate it, for example the playing labels, the special effect labels and the subtitle labels, so that the user can adjust these labels with reference to the output video file.
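Purely as an illustration of what such output labels might carry (the field names and values below are assumptions, not the embodiment's actual format), the output could resemble:

example_labels = {
    "play_labels": [
        {"keyword": "sunrise", "video_index": "lib/clip_0153",
         "audio_index": "lib/bgm_004", "play_mode": "loop"},
    ],
    "effect_labels": [
        {"target": "lib/clip_0153", "play_effect": "slow_motion"},
        {"between": ["lib/clip_0153", "lib/clip_0211"], "transition": "fade"},
    ],
    "subtitle_labels": [
        {"sentence_id": "s001", "text": "...", "start": 0.0, "end": 2.8},
    ],
}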
After the playing label and/or special effect label and/or subtitle label modified by the user is obtained, video clipping processing is performed according to the user-modified labels to obtain a video file reflecting the user's modifications. The specific video clipping process is the same as that described above.
Therefore, the video generation method provided by the embodiment of the present application not only automatically generates a video file from the video script input by the user, but also allows the user to adjust the generated video file by modifying the labels, thereby producing a video file that genuinely meets the user's needs.
In addition, steps S201 and S202 in the video generation method shown in fig. 2 correspond to steps S101 and S102 in the video generation method shown in fig. 1, respectively; for details, refer to the corresponding description of the method embodiment shown in fig. 1, which is not repeated here.
In correspondence with the above video generation method, an embodiment of the present application further provides a video generation apparatus, as shown in fig. 3, the apparatus includes:
a resource retrieval unit 100, configured to determine, from a pre-constructed video resource library, video resources that match the scenario keywords in the video scenario, and determine play labels that correspond to the scenario keywords one to one; the playing labels at least comprise index information of video resources matched with the script keywords;
a special effect setting unit 110, configured to determine a special effect label according to a video resource corresponding to each play label; the special effect labels comprise video resource playing special effect labels and/or video transition special effect labels between adjacent video resources;
and a video generating unit 120, configured to perform video clipping processing at least according to each playing label and the special effect label to obtain a video file.
As an optional implementation manner, determining a video transition special effect label between adjacent video resources according to a video resource corresponding to each play label includes:
determining, from preset video transition special effects, the video transition special effect between the video resources corresponding to adjacent playing labels according to the key frame density of the video resource corresponding to each playing label.
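One way to read this rule, sketched under assumed thresholds and transition names (none of which are specified by the embodiment): clips whose key frames are dense, i.e. whose content changes quickly, receive a shorter transition, and sparse clips receive a longer one.

def pick_transition(keyframe_density_a: float, keyframe_density_b: float) -> str:
    """Choose a preset transition from the key frame densities (key frames per second)
    of the two adjacent video resources."""
    density = (keyframe_density_a + keyframe_density_b) / 2.0
    if density > 1.0:
        return "hard_cut"
    if density > 0.3:
        return "fade"
    return "dissolve"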
As an optional implementation manner, the play label further includes index information and play mode information of an audio resource matched with the scenario keyword.
As an optional implementation, the apparatus further comprises:
a subtitle setting unit, configured to generate video voice and/or subtitle labels according to the video script, where the subtitle label includes an identifier of the script text corresponding to the video voice and the display start and end time of the script text;
and the performing of video clipping processing at least according to each playing label and the special effect label to obtain a video file includes:
and according to each playing label, the special effect label and the video voice and/or subtitle label, carrying out video clipping processing to obtain a video file.
As an optional implementation manner, the video generating unit 120 is further configured to:
and carrying out video clipping processing according to the playing label and/or the special effect label and/or the subtitle label modified by the user to obtain a video file.
As an optional implementation, when the video scenario is a scenario text, generating a video voice and a subtitle tag according to the video scenario includes:
synthesizing voice according to the video script to obtain video voice;
determining the playing start-stop time of each sentence in the video voice and the display start-stop time of each text sentence in the video script according to the sentence boundary information of the synthesized voice;
and generating a caption label based on the identification of each text sentence in the video script and the display starting and ending time.
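For illustration, turning the synthesizer's sentence boundary information into subtitle labels might look like the following, assuming the boundaries arrive as (sentence_id, start, end) tuples in seconds aligned with the script sentences; this data shape is an assumption, not the embodiment's interface.

def build_subtitle_labels(script_sentences, boundaries):
    """Pair each script sentence with its playing start/end time from synthesis."""
    labels = []
    for (sentence_id, start, end), text in zip(boundaries, script_sentences):
        labels.append({"sentence_id": sentence_id, "text": text,
                       "start": start, "end": end})
    return labels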
As an optional implementation, synthesizing voice from the video script to obtain video voice includes:
determining text semantic labels by performing semantic understanding on the video script, where the text semantic labels include at least one of: the user group that the video script is oriented to, the speaker gender suitable for synthesizing voice from the video script, the emotional state of the speaker when voice is synthesized from the video script, and the voice atmosphere when voice is synthesized from the video script;
determining voice synthesis parameters according to the text semantic labels and pronunciation matching rules, where the voice synthesis parameters include at least one of speaker, speech rate, emotion and volume;
and synthesizing voice according to the video script and the voice synthesis parameters to obtain video voice.
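A hedged sketch of such rule-based matching is given below; the label values, voice names and defaults are illustrative assumptions rather than the embodiment's actual rules.

def match_pronunciation_rules(semantic_labels: dict) -> dict:
    """Map text semantic labels to voice synthesis parameters."""
    params = {"speaker": "neutral_female", "rate": 1.0, "emotion": "calm", "volume": 0.8}
    if semantic_labels.get("audience") == "children":
        params.update(speaker="young_female", rate=0.9, volume=0.9)
    if semantic_labels.get("speaker_gender") == "male":
        params["speaker"] = "neutral_male"
    if semantic_labels.get("emotion") == "excited":
        params.update(emotion="cheerful", rate=1.1)
    if semantic_labels.get("atmosphere") == "solemn":
        params.update(emotion="serious", rate=0.95)
    return params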
As an optional implementation, when the video script is a script voice, the script voice is taken as the video voice;
and generating subtitle labels according to the video script includes:
performing voice recognition processing on the video script to obtain a script text;
determining the playing start-stop time of each sentence in the video voice according to the sentence boundary information of the voice recognition, and determining the identification and the display start-stop time of each text sentence in the script text;
and generating a subtitle label according to the script text and the identification and the display starting and ending time of each text sentence in the script text.
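When the script is supplied as voice, the subtitle labels can likewise be assembled from the recognizer's output; the sketch below assumes the recognition result is a list of sentence segments with text and start/end timestamps in seconds, which is a shape most speech recognition toolkits can provide, not a specific API of the embodiment.

def subtitle_labels_from_recognition(recognized_sentences):
    """Build subtitle labels from recognized sentences with timestamps."""
    return [{"sentence_id": f"s{idx:03d}", "text": seg["text"],
             "start": seg["start"], "end": seg["end"]}
            for idx, seg in enumerate(recognized_sentences, start=1)]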
As an optional implementation manner, performing video clip processing according to each play tag, the special effect tag, and the video voice and subtitle tag to obtain a video file includes:
acquiring video resources and audio resources corresponding to each playing label from a pre-constructed video resource library;
performing image coding processing on the acquired video resources and texts in the subtitle labels according to the special effect labels and the subtitle labels; and performing audio coding processing on the audio resource and the video voice;
and synthesizing the image coding result and the audio coding result to obtain a video file.
As an optional implementation manner, the video generating unit 120 is further configured to:
and cutting the video resource and the audio resource corresponding to each playing label according to the playing starting and ending time of each sentence in the video voice and/or the displaying starting and ending time of each text sentence in the caption label, so that the lengths of the video resource and the audio resource are matched with the playing duration of the corresponding video voice and/or the displaying duration of the caption text.
For the specific operation of each unit of the video generation apparatus, refer to the corresponding processing steps of the video generation method described above, which are not repeated here.
Another embodiment of the present application further provides a video generating apparatus, as shown in fig. 4, the apparatus including:
a memory 200 and a processor 210;
wherein, the memory 200 is connected to the processor 210 for storing programs;
the processor 210 is configured to implement the video generation method disclosed in any of the above embodiments by running the program stored in the memory 200.
Specifically, the video generation device may further include: a bus, a communication interface 220, an input device 230, and an output device 240.
The processor 210, the memory 200, the communication interface 220, the input device 230, and the output device 240 are connected to each other through a bus. Wherein:
a bus may include a path that transfers information between components of a computer system.
The processor 210 may be a general-purpose processor, such as a general-purpose central processing unit (CPU) or microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs in the solution of the present invention. It may also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components.
The processor 210 may include a main processor and may also include a baseband chip, modem, and the like.
The memory 200 stores a program for executing the technical solution of the present invention, and may also store an operating system and other key services. In particular, the program may include program code, and the program code includes computer operating instructions. More specifically, the memory 200 may include a read-only memory (ROM), other types of static storage devices that can store static information and instructions, a random access memory (RAM), other types of dynamic storage devices that can store information and instructions, disk storage, flash memory, and so on.
The input device 230 may include a means for receiving data and information input by a user, such as a keyboard, mouse, camera, scanner, light pen, voice input device, touch screen, pedometer, or gravity sensor, among others.
Output device 240 may include equipment that allows output of information to a user, such as a display screen, a printer, speakers, and the like.
The communication interface 220 may include any apparatus that uses a transceiver or the like to communicate with other devices or communication networks, such as Ethernet, a radio access network (RAN) or a wireless local area network (WLAN).
By running the program stored in the memory 200 and invoking the other devices, the processor 210 may implement the steps of any video generation method provided by the above embodiments of the present application.
Another embodiment of the present application further provides a storage medium, where a computer program is stored on the storage medium, and when the computer program is executed by a processor, the computer program implements the steps of any one of the video generation methods provided in the foregoing embodiments of the present application.
For the specific operation of each part of the video generation device, and for the specific processing performed when the computer program on the storage medium is executed by the processor, refer to the embodiments of the video generation method described above, which are not repeated here.
While, for purposes of simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present application is not limited by the order of acts or acts described, as some steps may occur in other orders or concurrently with other steps in accordance with the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The steps in the method of each embodiment of the present application may be sequentially adjusted, combined, and deleted according to actual needs, and technical features described in each embodiment may be replaced or combined.
The modules and sub-modules in the device and the terminal in the embodiments of the application can be combined, divided and deleted according to actual needs.
In the several embodiments provided in the present application, it should be understood that the disclosed terminal, apparatus and method may be implemented in other manners. For example, the above-described terminal embodiments are merely illustrative, and for example, the division of a module or a sub-module is only one logical division, and there may be other divisions when the terminal is actually implemented, for example, a plurality of sub-modules or modules may be combined or integrated into another module, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
The modules or sub-modules described as separate parts may or may not be physically separate, and parts that are modules or sub-modules may or may not be physical modules or sub-modules, may be located in one place, or may be distributed over a plurality of network modules or sub-modules. Some or all of the modules or sub-modules can be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, each functional module or sub-module in the embodiments of the present application may be integrated into one processing module, or each module or sub-module may exist alone physically, or two or more modules or sub-modules may be integrated into one module. The integrated modules or sub-modules may be implemented in the form of hardware, or may be implemented in the form of software functional modules or sub-modules.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software unit executed by a processor, or in a combination of the two. The software unit may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (13)

1. A method of video generation, comprising:
determining video resources matched with various script keywords in a video script from a pre-constructed video resource library, and determining various playing labels corresponding to the script keywords one by one; the playing labels at least comprise index information of video resources matched with the script keywords;
determining special effect labels according to video resources corresponding to the playing labels; the special effect labels comprise video resource playing special effect labels and/or video transition special effect labels between adjacent video resources;
and performing video clipping processing at least according to each playing label and the special effect label to obtain a video file.
2. The method of claim 1, wherein determining video transition special effect labels between adjacent video resources according to the video resources corresponding to the respective playing labels comprises:
and determining the video transition special effect between the video resources corresponding to the adjacent playing labels from the preset video transition special effects according to the key frame density of the video resources corresponding to each playing label.
3. The method according to claim 1, wherein the play tab further includes index information and play mode information of the audio resource matching the scenario keyword.
4. The method of claim 1, further comprising:
generating video voice and/or subtitle labels according to the video script; the caption label comprises an identifier of a script text corresponding to the video voice and display starting and ending time of the script text;
performing video clipping processing at least according to each playing label and the special effect label to obtain a video file, wherein the video file comprises:
and according to each playing label, the special effect label and the video voice and/or subtitle label, carrying out video clipping processing to obtain a video file.
5. The method of claim 4, further comprising:
and carrying out video clipping processing according to the playing label and/or the special effect label and/or the subtitle label modified by the user to obtain a video file.
6. The method of claim 4, wherein generating video speech and subtitle tags from a video transcript when the video transcript is a transcript text comprises:
synthesizing voice according to the video script to obtain video voice;
determining the playing start-stop time of each sentence in the video voice and the display start-stop time of each text sentence in the video script according to the sentence boundary information of the synthesized voice;
and generating a caption label based on the identification of each text sentence in the video script and the display starting and ending time.
7. The method of claim 6, wherein synthesizing speech from the video transcript to obtain video speech comprises:
determining a text semantic label by performing semantic understanding on the video script, wherein the text semantic label comprises at least one of user group information oriented to the video script, the gender of a speaker suitable for the video script when the video script synthesizes voice, emotion state information of the speaker when the video script synthesizes voice and voice atmosphere information when the video script synthesizes voice;
determining a voice synthesis parameter according to the text semantic tag and a pronunciation matching rule, wherein the voice synthesis parameter comprises at least one of voice pronouncing person, speed of speech, emotion and volume;
and synthesizing voice according to the video script and the voice synthesis parameters to obtain video voice.
8. The method according to claim 4, wherein when the video scenario is a scenario voice, the scenario voice is taken as a video voice;
generating caption tags from the video transcript, comprising:
performing voice recognition processing on the video script to obtain a script text;
determining the playing start-stop time of each sentence in the video voice according to the sentence boundary information of the voice recognition, and determining the identification and the display start-stop time of each text sentence in the script text;
and generating a subtitle label according to the script text and the identification and the display starting and ending time of each text sentence in the script text.
9. The method of claim 4, wherein performing video clipping processing according to each play tag, the special effect tag, and the video voice and subtitle tag to obtain a video file comprises:
acquiring video resources and audio resources corresponding to each playing label from a pre-constructed video resource library;
performing image coding processing on the acquired video resources and texts in the subtitle labels according to the special effect labels and the subtitle labels; and performing audio coding processing on the audio resource and the video voice;
and synthesizing the image coding result and the audio coding result to obtain a video file.
10. The method of claim 9, further comprising:
and cutting the video resource and the audio resource corresponding to each playing label according to the playing starting and ending time of each sentence in the video voice and/or the displaying starting and ending time of each text sentence in the caption label, so that the lengths of the video resource and the audio resource are matched with the playing duration of the corresponding video voice and/or the displaying duration of the caption text.
11. A video generation apparatus, comprising:
the resource retrieval unit is used for determining video resources matched with all scenario keywords in the video scenario from a pre-constructed video resource library and determining all playing labels corresponding to all scenario keywords one by one; the playing labels at least comprise index information of video resources matched with the script keywords;
the special effect setting unit is used for determining the special effect labels according to the video resources corresponding to the playing labels; the special effect labels comprise video resource playing special effect labels and/or video transition special effect labels between adjacent video resources;
and the video generating unit is used for carrying out video clipping processing at least according to each playing label and the special effect label to obtain a video file.
12. A video generation device, comprising:
a memory and a processor;
wherein the memory is connected with the processor;
the processor is configured to implement the video generation method according to any one of claims 1 to 10 by executing the program in the memory.
13. A storage medium, having stored thereon a computer program which, when executed by a processor, implements a video generation method according to any one of claims 1 to 10.
CN202111574773.7A 2021-12-21 2021-12-21 Video generation method, device, equipment and storage medium Pending CN114173067A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111574773.7A CN114173067A (en) 2021-12-21 2021-12-21 Video generation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114173067A true CN114173067A (en) 2022-03-11

Family

ID=80487710

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111574773.7A Pending CN114173067A (en) 2021-12-21 2021-12-21 Video generation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114173067A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004363828A (en) * 2003-06-03 2004-12-24 Canon Inc Data reproducing device, method, computer program, and computer-readable storage medium
US20100325547A1 (en) * 2009-06-18 2010-12-23 Cyberlink Corp. Systems and Methods for Sharing Multimedia Editing Projects
CN110381371A (en) * 2019-07-30 2019-10-25 维沃移动通信有限公司 A kind of video clipping method and electronic equipment
CN111541936A (en) * 2020-04-02 2020-08-14 腾讯科技(深圳)有限公司 Video and image processing method and device, electronic equipment and storage medium
CN111770375A (en) * 2020-06-05 2020-10-13 百度在线网络技术(北京)有限公司 Video processing method and device, electronic equipment and storage medium
CN112015949A (en) * 2020-08-26 2020-12-01 腾讯科技(上海)有限公司 Video generation method and device, storage medium and electronic equipment
CN112800263A (en) * 2021-02-03 2021-05-14 上海艾麒信息科技股份有限公司 Video synthesis system, method and medium based on artificial intelligence
CN113473182A (en) * 2021-09-06 2021-10-01 腾讯科技(深圳)有限公司 Video generation method and device, computer equipment and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115150661A (en) * 2022-06-23 2022-10-04 深圳市大头兄弟科技有限公司 Method and related device for packaging video key fragments
CN115150661B (en) * 2022-06-23 2024-04-09 深圳市闪剪智能科技有限公司 Method and related device for packaging video key fragments
CN117156079A (en) * 2023-11-01 2023-12-01 北京美摄网络科技有限公司 Video processing method, device, electronic equipment and readable storage medium
CN117156079B (en) * 2023-11-01 2024-01-23 北京美摄网络科技有限公司 Video processing method, device, electronic equipment and readable storage medium
CN117749960A (en) * 2024-02-07 2024-03-22 成都每经新视界科技有限公司 Video synthesis method and device and electronic equipment

Similar Documents

Publication Publication Date Title
CN114173067A (en) Video generation method, device, equipment and storage medium
KR101326897B1 (en) Device and Method for Providing a Television Sequence
CN111415399B (en) Image processing method, device, electronic equipment and computer readable storage medium
US20140161356A1 (en) Multimedia message from text based images including emoticons and acronyms
US20140163980A1 (en) Multimedia message having portions of media content with audio overlay
US20140164507A1 (en) Media content portions recommended
US20140164371A1 (en) Extraction of media portions in association with correlated input
US20140163957A1 (en) Multimedia message having portions of media content based on interpretive meaning
CN107731219B (en) Speech synthesis processing method, device and equipment
US20220208155A1 (en) Systems and methods for transforming digital audio content
CN112689189A (en) Video display and generation method and device
CN104021152A (en) Picture display method and device based on audio file playing
KR20200045852A (en) Speech and image service platform and method for providing advertisement service
CN113722535B (en) Method for generating book recommendation video, electronic device and computer storage medium
US20140161423A1 (en) Message composition of media portions in association with image content
CN107122393B (en) electronic album generating method and device
CN110750996A (en) Multimedia information generation method and device and readable storage medium
CN110781346A (en) News production method, system, device and storage medium based on virtual image
CN112929746A (en) Video generation method and device, storage medium and electronic equipment
CN112199932A (en) PPT generation method, device, computer-readable storage medium and processor
US20140163956A1 (en) Message composition of media portions in association with correlated text
CN111125384B (en) Multimedia answer generation method and device, terminal equipment and storage medium
CN113676772A (en) Video generation method and device
CN109376145A (en) The method for building up of movie dialogue database establishes device and storage medium
CN115129806A (en) Data processing method and device, electronic equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination