CN117749960A - Video synthesis method and device and electronic equipment - Google Patents

Video synthesis method and device and electronic equipment

Info

Publication number: CN117749960A (application CN202410175135.5A); granted publication: CN117749960B
Original language: Chinese (zh)
Inventors: 刘学东, 刘林鹏, 李强
Assignee (current and original): Chengdu Meijing Xinshijie Technology Co., Ltd.
Priority and filing date: 2024-02-07
Legal status: Granted; active (the legal status is an assumption, not a legal conclusion)

Abstract

An embodiment of the present application provides a video synthesis method, a video synthesis device, and an electronic device, relating to the technical field of video synthesis. The method comprises: receiving an input text and identifying its text basic units; matching the text basic units against a media asset tag library, and determining each unit that matches at least one media asset tag as a word to be matched; weighting the words to be matched, and determining the matching order of all words to be matched of the input text according to their weight scores; determining the range of media assets to be matched for each word; and, in the matching order of all the words to be matched, randomly selecting media assets in turn from the range corresponding to each word until the video picture reaches a preset duration and video synthesis is complete, the preset duration being determined by the length of the input text. The method solves the problem of how to synthesize video automatically, quickly, and efficiently from different types of material.

Description

Video synthesis method and device and electronic equipment
Technical Field
Embodiments of the present application relate to the technical field of video synthesis, and in particular to a video synthesis method, a video synthesis device, and an electronic device.
Background
After the 'content' of a video, such as the script and the material, has been prepared and edited, turning material into creative work and then into a finished video still requires processes such as editing and integration, visual-effect addition, composition, rendering, and output; under traditional video composition logic and methods, however, composition time and cost are limited by many factors such as hardware and material.
In the video synthesis stage, synthesis speed is affected by the computer's configuration; at the same time, the raw material may need a background superimposed or the subject matted out, and files in transparent-channel formats occupy an extremely large amount of memory, so the processing speed of video synthesis is severely limited. A shorter, simple video may take only a few minutes, but generating a large amount of more complex, high-resolution video from different types of material may take several hours or more, and there has been no efficient and convenient video synthesis method.
Therefore, how to synthesize video automatically, quickly, and efficiently from different types of material has become a pressing technical problem.
Disclosure of Invention
The embodiments of the present application provide a video synthesis method, a video synthesis device, and an electronic device, which are used to solve the prior-art problem of how to synthesize video automatically, quickly, and efficiently from different types of material. To solve this technical problem, the invention is realized as follows:
In a first aspect, an embodiment of the present application provides a video synthesis method, including:
receiving an input text and identifying text basic units of the input text;
matching the text basic units against a media asset tag library, and determining each text basic unit that matches at least one media asset tag as a word to be matched;
weighting the words to be matched, and determining the matching order of all words to be matched of the input text according to their weight scores;
determining the range of media assets to be matched for each word to be matched;
and, according to the matching order of all the words to be matched, randomly selecting media assets in turn from the range of media assets to be matched corresponding to each word to fill the video picture, until the video picture reaches a preset duration and video synthesis is complete, wherein the preset duration is determined by the length of the input text.
Optionally, randomly selecting media assets in turn from the range of media assets to be matched corresponding to each word, according to the matching order of all the words to be matched, until the video picture reaches a preset duration and video synthesis is complete includes:
randomly selecting, in the matching order of all the words to be matched, one media asset in turn from the range corresponding to each word to fill the video picture;
and if every word to be matched has been filled once and the video picture has still not reached the preset duration, continuing in the same matching order to randomly select one media asset in turn from the remaining range of media assets to be matched corresponding to each word, until the video picture reaches the preset duration.
Optionally, the tags of the media assets include depth-of-field tags and/or color tags of the media assets, and randomly selecting media assets in turn from the range of media assets to be matched corresponding to each word, according to the matching order of all the words to be matched, until the video picture reaches a preset duration and video synthesis is complete includes:
determining the target media asset of the first matching word to fill into the video picture, and obtaining the depth-of-field tag and/or color tag of that target media asset;
and determining the target media asset of each next matching word to fill into the picture according to the depth-of-field tag and/or color tag of the previous target media asset and a preset depth-of-field matching sequence and/or color matching rule, until the target media assets of all words to be matched have been matched.
Optionally, weighting the words to be matched includes at least one of the following modes:
determining the part of speech of each word to be matched according to its semantic relations in the input text, and assigning each word a weight according to a preset part-of-speech rule;
weighting from high to low according to the order in which each word to be matched appears in the input text;
and establishing a fixed-weight lexicon, and re-weighting any word to be matched that falls within it.
Optionally, before matching the text basic units against the media asset tag library and determining each text basic unit that matches at least one media asset tag as a word to be matched, the method further includes:
establishing a media asset tag library, wherein the media asset tags have multiple levels and the tags at each level are assigned different weights.
Optionally, determining the range of media assets to be matched for each word to be matched includes:
calculating a score for each media asset corresponding to each word to be matched according to the asset's tags, and determining the highest asset score for each word;
and determining the range of media assets to be matched for each word according to its highest asset score.
Optionally, randomly selecting media assets in turn from the range of media assets to be matched corresponding to each word, according to the matching order of all the words to be matched, until the video picture reaches a preset duration and video synthesis is complete includes:
if it is detected that the target media asset of the current matching word has already been filled into the video picture, randomly selecting another media asset from the remaining range of media assets to be matched corresponding to the word to fill the video picture.
Optionally, the media asset types include:
intelligent media assets obtained by inputting at least one of audio, text, pictures, and video into an intelligent media asset generation model.
In a second aspect, an embodiment of the present application further provides a video synthesis device, including:
a receiving module configured to receive an input text and identify text basic units of the input text;
a matching module configured to match the text basic units against a media asset tag library and determine each text basic unit that matches at least one media asset tag as a word to be matched;
a weighting module configured to weight the words to be matched and determine the matching order of all words to be matched of the input text according to their weight scores;
a determining module configured to determine the range of media assets to be matched for each word to be matched;
and a filling module configured to randomly select media assets in turn from the range of media assets to be matched corresponding to each word, according to the matching order of the words to be matched, to fill the video picture until it reaches the preset duration and video synthesis is complete, wherein the preset duration is determined by the length of the input text.

In a third aspect, an embodiment of the present application further provides an electronic device, including:
a memory for storing a computer program;
a processor for implementing the steps of the video synthesis method of any of the first aspect when executing the computer program.
In a fourth aspect, embodiments of the present application further provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the video synthesis method of any of the first aspect.
In the embodiments of the present application, an input text is received and its text basic units are identified; the text basic units are matched against a media asset tag library, and each unit that matches at least one media asset tag is determined to be a word to be matched; the words to be matched are weighted, and the matching order of all words to be matched of the input text is determined from their weight scores; the range of media assets to be matched is determined for each word; and, in the matching order of all the words to be matched, media assets are randomly selected in turn from the range corresponding to each word until the video picture reaches a preset duration and video synthesis is complete, the preset duration being determined by the length of the input text. This provides a method for associating text with media assets of many types and synthesizing video: text splitting, media asset tagging, and matching after several rounds of weighting jointly optimize the richness and integrity of the synthesized video, so that it is strongly related to the input text, its integrity and smoothness are guaranteed, videos can be synthesized automatically from input text in large batches, and the videos synthesized from the same input text are diversified.
In addition, corresponding to the above video synthesis method embodiment, the present application further provides a video synthesis device that has corresponding functional modules and achieves corresponding effects.
The invention further provides an electronic device and a computer-readable storage medium, which likewise have the above beneficial effects.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
fig. 1 is a schematic flow chart of a video synthesizing method according to an embodiment of the present application.
Fig. 2 is a second flow chart of a video synthesizing method according to an embodiment of the present application.
Fig. 3 is a schematic structural diagram of a video synthesizer according to an embodiment of the present application.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will now be described clearly and completely with reference to the drawings in the embodiments. Evidently, the described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
Referring to fig. 1, fig. 1 provides a video synthesis method according to an embodiment of the present application, including:
step S11: an input text is received and text base units of the input text are identified.
In this embodiment of the present application, the input text refers to a text input by a user in an algorithm program, and the identifying a text basic unit of the input text refers to dividing an original text into two or more paragraphs according to a text structure, where each sentence may be further divided according to a user's requirement until the text basic unit is divided. The text basic unit refers to a basic unit with actual scene meaning in the text, wherein the actual scene meaning refers to scenes capable of reacting time, places, characters (bodies), behaviors and phenomena, such as time scenes of night, daytime, spring, winter, places, channels, gymnasiums, persons (bodies), old people, doctors, fish, factories, bicycles, basketball, steel making, transportation, explosion, lifting and lowering. Through text segmentation of the input text, setting of video time length according to the text length, determining of media asset quantity to be filled and media asset filling, a user can synthesize corresponding video through the input text. The video is converted into voice by using a Text To Speech (TTS) technology according To the Text length, and it can be understood that the voice time length is substantially identical To the video time length, and the video time length is slightly longer than the voice time length in practical application.
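To make the segmentation and duration logic concrete, here is a minimal sketch in Python; the sentence-splitting rule and the characters-per-second speaking rate are illustrative assumptions, not values fixed by this embodiment.

```python
import re

TTS_CHARS_PER_SECOND = 5.0  # assumed speaking rate; the embodiment does not fix one


def split_into_basic_units(text: str) -> list[str]:
    """Split the input text into sentence-level candidates for basic units."""
    # Split on Chinese and Western sentence-ending punctuation.
    sentences = re.split(r"[。！？.!?]+", text)
    return [s.strip() for s in sentences if s.strip()]


def preset_duration_seconds(text: str, margin: float = 1.0) -> float:
    """Estimate the preset video duration from the text length; the video runs
    slightly longer than the TTS speech, modeled here as a fixed margin."""
    return len(text) / TTS_CHARS_PER_SECOND + margin


text = "From July 26 to July 29, Chengdu implements odd-even traffic restrictions."
print(split_into_basic_units(text))
print(f"preset duration is about {preset_duration_seconds(text):.1f}s")
```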
Step S12: matching the text basic units against the media asset tag library, and determining each text basic unit that matches at least one media asset tag as a word to be matched.
A tag describes a characteristic of stock media asset material that is relevant to asset matching and selection when converting text into video; in practice, the degree of refinement and the levels of the tags can be defined according to the product's needs and the characteristics of the materials it uses, which is not limited here. It will be appreciated that the material characteristics to be considered when matching media assets for text-to-video include, but are not limited to: the main subject of the material content, the details of the material content, and the theme of the material content. For example, the main subject of the material content might be 'Chengdu city great lake', the details 'tourist' and 'water bird', and the themes 'travel', 'play', and 'holiday'.
In this embodiment, the text basic units are identified through natural language processing and matched against the tag library to determine the words to be matched. Semantic relations mainly comprise vertical (paradigmatic) combination relations, horizontal (syntagmatic) combination relations, and logical relations. In practice, a paradigmatic relation is a substitutional, vertical relation established by comparing the meanings of language units, including synonymy, antonymy, near-synonymy, and the like; a syntagmatic relation is formed by language units collocating with one another in the language system and in the flow of speech, including application, generic, limiting, parallel, dominant, judging, and descriptive-supplement relations, among others. In this embodiment, the text content is analyzed with NER (named entity recognition), a basic natural-language-processing task model, which recognizes tags of the text's basic units such as times, places, subjects, behaviors, and phenomena; the recognized entries are matched against the material's media asset tag library, and each entry that matches at least one media asset tag is taken as a word to be matched.
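As a minimal sketch of this step, assuming a toy tag library and a stubbed-out NER model (a real implementation would use the trained NER model described above):

```python
ASSET_TAG_LIBRARY = {"Chengdu", "traffic", "night", "doctor"}  # illustrative tags


def extract_entities(text: str) -> list[str]:
    """Stand-in for the NER model that recognizes times, places, subjects,
    behaviors, and phenomena; here it simply tokenizes on whitespace."""
    return [token.strip(",.") for token in text.split()]


def words_to_be_matched(text: str) -> list[str]:
    """Keep only the recognized entries that match at least one asset tag."""
    return [e for e in extract_entities(text) if e in ASSET_TAG_LIBRARY]


print(words_to_be_matched("Chengdu implements odd-even traffic restrictions"))
# -> ['Chengdu', 'traffic']
```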
In some embodiments of the present application, optionally, before matching the text basic units against the media asset tag library and determining each text basic unit that matches at least one media asset tag as a word to be matched, the method further includes:
establishing a media asset tag library, wherein the media asset tags have multiple levels and the tags at each level are assigned different weights.
In this embodiment, the degree and level of tagging of material characteristics are not limited; in practice, the tagging scheme can be defined as needed. In one specific implementation, the 'visual subject' of an image or video is used as its first-level tag, 'detail things' as its second-level tag, and 'associations about the visual subject' as its third-level tag, forming a three-level tag system that expresses the characteristics of the media material.
In this embodiment, a multi-level media asset tag library is established, and the weight of each tag level can be customized, which further improves the diversity of video synthesis and the richness of the media asset tags.
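A small data structure is enough to carry such a three-level tag system. In the sketch below, the per-level weights reuse the 1 / 0.8 / 0.5 scores from the scoring example later in this description, and all names and tag values are illustrative:

```python
from dataclasses import dataclass, field

LEVEL_WEIGHTS = {1: 1.0, 2: 0.8, 3: 0.5}  # customizable weight per tag level


@dataclass
class MediaAsset:
    asset_id: str
    duration: float  # seconds
    tags: dict[int, set[str]] = field(default_factory=dict)  # level -> tag names


lake_clip = MediaAsset(
    asset_id="clip_001",
    duration=6.0,
    tags={
        1: {"Chengdu city great lake"},    # level 1: visual subject
        2: {"tourist", "water bird"},      # level 2: detail things
        3: {"travel", "play", "holiday"},  # level 3: associations
    },
)
```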
In some embodiments of the present application, optionally, the media asset types include:
intelligent media assets obtained by inputting at least one of audio, text, pictures, and video into an intelligent media asset generation model.
In this embodiment, media assets are the materials in an asset library that are AI-generated, owned, or purchased with commercial rights obtained; they may be carried as audio, pictures, video, and other media. Video adaptation classification means classifying image materials according to the aspect ratio of their resolution: materials with an aspect ratio greater than 1.6 are classified as horizontal materials, and all other image materials as vertical materials. Stock media materials such as images and text are tagged, and the materials are classified for horizontal and vertical video adaptation according to image resolution.
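The horizontal/vertical adaptation rule reduces to a one-line check; the sketch below uses the 1.6 threshold given in this paragraph:

```python
def orientation(width: int, height: int) -> str:
    """Classify image material by the aspect ratio of its resolution."""
    return "horizontal" if width / height > 1.6 else "vertical"


print(orientation(1920, 1080))  # 16:9 is about 1.78 -> horizontal
print(orientation(1080, 1920))  # 9:16 is about 0.56 -> vertical
```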
In this embodiment, videos of the target duration can be synthesized by freely splicing media assets of many kinds, which greatly improves the diversity of the synthesized videos.
Step S13: weighting the words to be matched, and determining the matching order of all words to be matched of the input text according to their weight scores.
In some embodiments of the present application, optionally, weighting the words to be matched includes at least one of the following modes:
determining the part of speech of each word to be matched according to its semantic relations in the input text, and assigning each word a weight according to a preset part-of-speech rule;
weighting from high to low according to the order in which each word to be matched appears in the input text;
and establishing a fixed-weight lexicon, and re-weighting any word to be matched that falls within it.
In this embodiment, the part of speech of each word to be matched can be determined from semantic relations, giving the words their first weighting. Here the part of speech of a word to be matched refers to its role within the sentence, such as agent, patient, adverbial, or predicate. For example, in 'From July 26 to July 29, Chengdu implements odd-even traffic restrictions', 'Chengdu' is the agent, 'odd-even traffic restriction' is the patient, 'implements' is the predicate, and 'from July 26 to July 29' is the time adverbial.
The scheme does not limit the weighting mode; a specific implementation can define the weight scores as needed, for example assigning an agent 1, a patient 0.8, an adverbial 0.5, and a predicate 0.1. If 'Chengdu' and 'traffic' in the example sentence are words to be matched, the first weight score of 'Chengdu' is 1 and the first weight score of 'traffic' is 0.8.
Further, a second weighting is applied from high to low according to the order of the words to be matched, that is, the order in which they appear in the sentence; the specific scores are again not limited. For example, if 'Chengdu' and 'traffic' appear in that order, 'Chengdu' is assigned 1 and 'traffic' 0.9, so after the second weighting the actual weight of 'Chengdu' is 1 × 1 = 1 and that of 'traffic' is 0.8 × 0.9 = 0.72.
Furthermore, a fixed-weight tag lexicon can be set up according to usage needs for a third weighting, which yields the final weight score and determines the priority of each word to be matched. In a fixed-weight tag lexicon, the weight scores of the tag words within it are always fixed, and the lexicon is determined by the user's needs. The scheme does not limit the fixed weights; in practice, high-weight or low-weight words can be fixed according to the user's needs. For example, for a financial media outlet reporting on listed companies, descriptive words such as 'development' and 'economy' appear frequently in news flashes and reports but carry little actual meaning, so their weight score is fixed at 0.1; for a city-level media outlet, words for local landmarks, historical culture, historical figures, and the like are representative, so their weight score is fixed at 0.8.
The priority matching words are obtained by weighting the words to be matched three times: ranked by weight score, words with high weights are taken as priority matching words and words with low weights as secondary matching words.
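The three passes can be sketched as follows; the role weights, the 0.1-per-position decay, and the fixed-weight lexicon reproduce the example scores above, while role assignment is stubbed out (a real system would take it from the semantic analysis of step S12):

```python
ROLE_WEIGHTS = {"agent": 1.0, "patient": 0.8, "adverbial": 0.5, "predicate": 0.1}
FIXED_WEIGHTS = {"development": 0.1, "economy": 0.1}  # always-fixed tag words


def weight_words(words_with_roles: list[tuple[str, str]]) -> list[tuple[str, float]]:
    """Return (word, final score) pairs sorted into matching order, best first."""
    scored = []
    for position, (word, role) in enumerate(words_with_roles):
        first = ROLE_WEIGHTS[role]     # pass 1: sentence role
        second = 1.0 - 0.1 * position  # pass 2: order of appearance
        # pass 3: the fixed-weight lexicon overrides the product
        score = FIXED_WEIGHTS.get(word, round(first * second, 4))
        scored.append((word, score))
    return sorted(scored, key=lambda ws: ws[1], reverse=True)


# "Chengdu" (agent, appears first) and "traffic" (patient, appears second):
print(weight_words([("Chengdu", "agent"), ("traffic", "patient")]))
# -> [('Chengdu', 1.0), ('traffic', 0.72)]
```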
In this embodiment, the matching order of all words to be matched of the input text is determined by weighting the words and ranking their weight scores. Because the weights of the words to be matched can be set in many ways, there are many possible matching orders and, correspondingly, many videos that can be synthesized from those orders, which greatly improves the richness of the synthesized videos.
Step S14: determining the range of media assets to be matched for each word to be matched.
In some embodiments of the present application, optionally, determining the range of media assets to be matched for each word to be matched includes:
calculating a score for each media asset corresponding to each word to be matched according to the asset's tags, and determining the highest asset score for each word;
and determining the range of media assets to be matched for each word according to its highest asset score.
In this embodiment, the media assets to be matched are the assets having at least one tag that matches a priority or secondary matching word. Assigning matching-degree scores to the assets to be matched means giving an asset different scores according to the level, within the three-level tag system, of each of its matching tags: for example, 1 point for a match on a first-level tag, 0.8 for a second-level match, and 0.5 for a third-level match, the asset's matching degree being determined by its score. If an asset's first-level tag is 'Chengdu' and its second-level tag is 'traffic', its total score is 1.8, higher than an asset whose only matching tag is the first-level 'Chengdu', making it the asset with the highest matching degree. Grouping the assets to be matched means treating all scored assets as available assets, grouping them according to the priority of the matching words, deriving a threshold by multiplying the highest score of each group by a score coefficient, and removing from the group every asset scoring below the threshold to obtain the group of assets to be matched. A group obtained in this way can be regarded as picture content whose visual subject and theme are highly correlated and consistent with the corresponding matching word in the sentence segment.
In one specific implementation, the media material tag system is used to assign matching-degree scores to the candidate assets per matching word; the assets scored for the same matching word form one group, a threshold of 0.6 times the highest score in the group is taken, and assets below that threshold are removed from the group to obtain the group of assets to be matched. The same asset may appear in several groups to be matched: for example, given the two matching words 'Chengdu' and 'traffic', an asset bearing both the 'Chengdu' tag and the 'traffic' tag appears in the groups corresponding to both matching words at the same time.
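A minimal sketch of this scoring and grouping, under one reading of the embodiment: an asset's score sums the level weights of every tag of it that matches a word of the segment, each asset joins the group of every word it matches, and a group keeps only assets scoring at least 0.6 times its highest score. Names and tag values are illustrative:

```python
LEVEL_WEIGHTS = {1: 1.0, 2: 0.8, 3: 0.5}


def asset_score(words: list[str], tags: dict[int, set[str]]) -> float:
    """Sum the level weights of every tag that matches a word of the segment."""
    return sum(LEVEL_WEIGHTS[level]
               for level, names in tags.items()
               for word in words if word in names)


def groups_to_be_matched(words, assets, coeff=0.6):
    """assets: asset_id -> {level: {tag, ...}}; returns word -> kept asset ids."""
    scores = {a: asset_score(words, tags) for a, tags in assets.items()}
    groups = {}
    for word in words:
        members = [a for a, tags in assets.items()
                   if any(word in names for names in tags.values())]
        if not members:
            continue
        threshold = coeff * max(scores[a] for a in members)  # 0.6 x group maximum
        groups[word] = [a for a in members if scores[a] >= threshold]
    return groups


assets = {
    "clip_A": {1: {"Chengdu"}, 2: {"traffic"}},  # total score 1.8: two tags match
    "clip_B": {1: {"Chengdu"}},                  # total score 1.0
    "clip_C": {2: {"traffic"}},                  # total score 0.8
}
print(groups_to_be_matched(["Chengdu", "traffic"], assets))
# -> {'Chengdu': ['clip_A'], 'traffic': ['clip_A']}: one asset, shared by both groups
```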
In this embodiment, the score of each media asset corresponding to each word to be matched is calculated from the asset's tags, the highest asset score of each word is determined, and the range of media assets to be matched for the word is determined from that highest score. The user can freely set the tag scores of the media assets and thereby customize the range of assets to be matched, so the assets used in the synthesized video preserve the correlation between video and text while better guaranteeing the diversity of the video.
Step S15: according to the matching order of all the words to be matched, randomly selecting media assets in turn from the range of media assets to be matched corresponding to each word to fill the video picture, until the video picture reaches a preset duration and video synthesis is complete, wherein the preset duration is determined by the length of the input text.
In some embodiments of the present application, optionally, randomly selecting media assets in turn from the range of media assets to be matched corresponding to each word, according to the matching order of all the words to be matched, until the video picture reaches a preset duration and video synthesis is complete includes:
randomly selecting, in the matching order of all the words to be matched, one media asset in turn from the range corresponding to each word to fill the video picture;
and if every word to be matched has been filled once and the video picture has still not reached the preset duration, continuing in the same matching order to randomly select one media asset in turn from the remaining range of media assets to be matched corresponding to each word, until the video picture reaches the preset duration.
In this embodiment, when all the media assets to be matched in a segment, each selected once, still cannot cover the segment's video duration, one asset is randomly selected per matching word, from high priority to low, from the corresponding group of assets to be matched, with already-used assets automatically skipped, until the material duration approaches the video duration. By convention, material is spliced until its total duration falls within roughly 3 seconds of the video duration, from about 3 seconds shorter to about 3 seconds longer, and the material is then extended or trimmed appropriately so that the material duration exactly matches the video duration. In one specific embodiment, assume the video duration is twenty seconds, each asset runs about five seconds, and the only two words to be matched are 'Chengdu' and 'traffic': assets are then selected repeatedly from the corresponding groups in the order 'Chengdu', 'traffic'. When one asset selected from the group of the segment's priority matching word already completes the filling of the video content, there is no need to select assets from the group of the secondary matching word: assume the video duration is 3 seconds, each asset again about five seconds, and the words to be matched are again 'Chengdu' and 'traffic'; one asset selected from the 'Chengdu' group, trimmed or extended to roughly the video duration, suffices, and no asset needs to be selected from the 'traffic' group. When all the assets to be matched of all the matching words have been used and the video duration still cannot be filled, the record of used assets is cleared, and one asset is again randomly selected in turn, in the matching order of all the words to be matched, from the range of assets to be matched corresponding to each word, to fill the video picture.
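The filling loop can be sketched as below, with two stated assumptions: `groups` and `durations` are hypothetical inputs keyed like the earlier snippets, and the roughly 3-second slack plus the clearing of the used-asset record once every group is exhausted encode the behavior described above (the final trim or extension to the exact duration is not shown):

```python
import random


def fill_video(groups: dict[str, list[str]], durations: dict[str, float],
               preset: float, slack: float = 3.0) -> list[str]:
    """Pick assets word by word, in matching order, until the timeline is
    within `slack` seconds of the preset duration."""
    order = list(groups)        # words already sorted by matching priority
    timeline: list[str] = []
    used: set[str] = set()
    total = 0.0
    while total < preset - slack:
        progressed = False
        for word in order:
            remaining = [a for a in groups[word] if a not in used]
            if not remaining:
                continue        # this word's group is exhausted for now
            pick = random.choice(remaining)
            used.add(pick)
            timeline.append(pick)
            total += durations[pick]
            progressed = True
            if total >= preset - slack:
                break
        if not progressed:
            if not any(groups.values()):
                break           # no assets at all; avoid looping forever
            used.clear()        # every group used up: clear the used record
    return timeline
```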
In some embodiments of the present application, optionally, the tags of the media assets include depth-of-field tags and/or color tags of the media assets, and randomly selecting media assets in turn from the range of media assets to be matched corresponding to each word, according to the matching order of all the words to be matched, until the video picture reaches a preset duration and video synthesis is complete includes:
determining the target media asset of the first matching word to fill into the video picture, and obtaining the depth-of-field tag and/or color tag of that target media asset;
and determining the target media asset of each next matching word to fill into the picture according to the depth-of-field tag and/or color tag of the previous target media asset and a preset depth-of-field matching sequence and/or color matching rule, until the target media assets of all words to be matched have been matched.
In this embodiment, the depth-of-field tag describes how far the subject is within the picture or video, that is, the shot scale: long shot, medium shot, or close-up; the arrangement order may run from near to far or from far to near. The color tag describes the dominant color type of the picture or video: the three primary colors red, yellow, and blue are used to divide the dominant colors, and pictures, videos, and other media with the same dominant color type are grouped together. According to the video duration and the number of media assets to be filled, the assets within each group to be matched are arranged and grouped by visual elements such as their depth-of-field tags and/or color tags, following the corresponding preset depth-of-field matching sequence and/or color matching rules. One specific embodiment may be: according to shot scale and color, select assets from the corresponding groups to be matched in the order of the priority matching words and the secondary matching words to fill the picture, then trim and extend the assets appropriately to the TTS speech duration after the picture is filled.
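One way to sketch the shot-scale and color rule: given the asset just placed, prefer a candidate whose depth-of-field tag is the next step of a preset far-to-near sequence and whose dominant-color tag matches. The tag values, the ordering, and the fallback are illustrative assumptions:

```python
DEPTH_ORDER = ["far", "medium", "near"]  # preset far-to-near matching sequence


def next_asset(previous: dict, candidates: list[dict]) -> dict:
    """Choose the candidate that continues the shot-scale order and keeps
    the dominant color convergent; fall back to the first candidate."""
    step = min(DEPTH_ORDER.index(previous["depth"]) + 1, len(DEPTH_ORDER) - 1)
    wanted_depth = DEPTH_ORDER[step]
    for cand in candidates:
        if cand["depth"] == wanted_depth and cand["color"] == previous["color"]:
            return cand
    return candidates[0]  # no candidate satisfies both rules


prev = {"id": "clip_A", "depth": "far", "color": "blue"}
cands = [{"id": "clip_B", "depth": "near", "color": "red"},
         {"id": "clip_C", "depth": "medium", "color": "blue"}]
print(next_asset(prev, cands)["id"])  # -> clip_C
```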
In this embodiment, the depth-of-field tags (shot-scale order) and/or color tags of the media assets are considered first when assets are selected: assets are chosen from far to near or from near to far, and their tones are kept convergent, which reduces the sense of visual jumping and enhances the visual effect of the video. The quality of the synthesized video is thus optimized from an aesthetic point of view, and the consistency and integrity of the synthesized video are greatly improved.
In some embodiments of the present application, optionally, randomly selecting media assets in turn from the range of media assets to be matched corresponding to each word, according to the matching order of all the words to be matched, until the video picture reaches a preset duration and video synthesis is complete includes:
if it is detected that the target media asset of the current matching word has already been filled into the video picture, randomly selecting another media asset from the remaining range of media assets to be matched corresponding to the word to fill the video picture.
In this embodiment, if it is detected that the target media asset of the current matching word has already been filled into the video picture, another asset is randomly selected from the remaining range of media assets to be matched corresponding to the word, which avoids reusing the same asset within the same video synthesis and guarantees the quality of the synthesis.
In the embodiments of the present application, an input text is received and its text basic units are identified; the text basic units are matched against a media asset tag library, and each unit that matches at least one media asset tag is determined to be a word to be matched; the words to be matched are weighted, and the matching order of all words to be matched of the input text is determined from their weight scores; the range of media assets to be matched is determined for each word; and, in the matching order of all the words to be matched, media assets are randomly selected in turn from the range corresponding to each word until the video picture reaches a preset duration and video synthesis is complete, the preset duration being determined by the length of the input text. This provides a method for associating text with media assets of many types and synthesizing video: text splitting, media asset tagging, and matching after several rounds of weighting jointly optimize the richness and integrity of the synthesized video, so that it is strongly related to the input text, its integrity and smoothness are guaranteed, videos can be synthesized automatically from input text in large batches, and the videos synthesized from the same input text are diversified.
Referring to fig. 2, fig. 2 is a second schematic flow chart of a video synthesis method according to an embodiment of the present application; the video synthesis method includes the following steps.
Step S21: establishing a media asset tag library, wherein the media asset tags have multiple levels, the tags at each level are assigned different weights, and the tags of the media assets include depth-of-field tags and/or color tags.
Step S22: receiving an input text and identifying text basic units of the input text.
Step S23: matching the text basic units against the media asset tag library, and determining each text basic unit that matches at least one media asset tag as a word to be matched.
Step S24: weighting the words to be matched, and determining the matching order of all words to be matched of the input text according to their weight scores.
Step S25: calculating a score for each media asset corresponding to each word to be matched according to the asset's tags, and determining the highest asset score for each word to be matched.
Step S26: determining the range of media assets to be matched for each word according to its highest asset score.
Step S27: determining, according to the matching order of all the words to be matched, the target media asset of the first matching word to fill into the video picture, and obtaining the depth-of-field tag and/or color tag of that target media asset.
Step S28: determining the target media asset of each next matching word to fill into the picture according to the depth-of-field tag and/or color tag of the previous target media asset and the preset depth-of-field matching sequence and/or color matching rules, until the target media assets of all words to be matched have been matched.
Step S29: if every word to be matched has been filled once and the video picture has still not reached the preset duration, randomly selecting, in the matching order of all the words to be matched, one media asset from the remaining range of media assets to be matched corresponding to each word until the video picture reaches the preset duration, wherein the preset duration is determined by the length of the input text.
In this embodiment, the tagging of the media materials, in particular their depth-of-field and color tags, is used together with the words to be matched that require asset filling: the algorithm determines the priority of the words to be matched and the matching-degree score weights of the candidate assets, and decides from the resulting matching degree whether a candidate material is used as picture material for the video corresponding to the text, so that the correspondence between text content and video content is completed automatically in the text-to-video production process. Tagging lets the content a material represents be characterized concretely, so that content features such as the time, place, and subject of the material can be confirmed quickly. Semantic analysis of the text captures its basic units, and weighting those units yields the priority matching object and the priority levels of the other objects, determining the key picture content of each sentence when the text is converted into video. The matching words are compared for coincidence against the tags of the media materials, candidate materials are screened to form groups to be matched, assets are selected in turn from the corresponding groups by matching-word priority while the shot-scale order and tone across assets are also considered, and the matching process of converting the whole text into video is completed. In the step of identifying the words to be matched, the subjects and semantic relations in the text can also be recognized directly by a large language model, while a cross-modal model recognizes the assets in the asset library, extracting the visual subjects, detail things, and extended themes in pictures and videos, so that text and assets are matched directly through relevance comparison. Richness, diversity, consistency, integrity, and automation of video synthesis from text and other media materials are thus achieved in many respects.
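Tying the steps together, a minimal end-to-end sketch of the fig. 2 flow could read as below. It reuses the hypothetical helpers sketched earlier in this description (split_into_basic_units, words_to_be_matched, weight_words, groups_to_be_matched, preset_duration_seconds, fill_video) rather than any API defined by this application, and it stubs the semantic roles since no NER model is wired in:

```python
def synthesize_video(text: str, assets: dict, durations: dict) -> list[str]:
    units = split_into_basic_units(text)                              # step S22
    words = [w for unit in units for w in words_to_be_matched(unit)]  # step S23
    ranked = weight_words([(w, "agent") for w in words])              # step S24
    ordered = [w for w, _score in ranked]
    groups = groups_to_be_matched(ordered, assets)                    # steps S25-S26
    preset = preset_duration_seconds(text)                            # from text length
    return fill_video(groups, durations, preset)                      # steps S27-S29
```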
Corresponding to the above video synthesis method embodiment, an embodiment of the present application further provides a video synthesis device with corresponding functional modules; for the specific workflow of each functional module, reference may be made to the corresponding content disclosed in the foregoing embodiments, with corresponding effects.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a video synthesis device according to an embodiment of the present application. The embodiment of the present application provides a video synthesis device 20, including:
a receiving module 21 configured to receive an input text and identify text basic units of the input text;
a matching module 22 configured to match the text basic units against a media asset tag library and determine each text basic unit that matches at least one media asset tag as a word to be matched;
a weighting module 23 configured to weight the words to be matched and determine the matching order of all words to be matched of the input text according to their weight scores;
a determining module 24 configured to determine the range of media assets to be matched for each word to be matched;
and a filling module 25 configured to randomly select media assets in turn from the range of media assets to be matched corresponding to each word, according to the matching order of the words to be matched, to fill the video picture until it reaches a preset duration and video synthesis is complete, where the preset duration is determined by the length of the input text.
In the embodiments of the present application, the video synthesis device receives an input text and identifies its text basic units; matches the text basic units against a media asset tag library and determines each unit that matches at least one media asset tag as a word to be matched; weights the words to be matched and determines the matching order of all words to be matched of the input text from their weight scores; determines the range of media assets to be matched for each word; and, in the matching order of all the words to be matched, randomly selects media assets in turn from the range corresponding to each word until the video picture reaches a preset duration and video synthesis is complete, the preset duration being determined by the length of the input text. This provides a method for associating text with media assets of many types and synthesizing video: text splitting, media asset tagging, and matching after several rounds of weighting jointly optimize the richness and integrity of the synthesized video, so that it is strongly related to the input text, its integrity and smoothness are guaranteed, videos can be synthesized automatically from input text in large batches, and the videos synthesized from the same input text are diversified.
Referring to fig. 4, an embodiment of the present application further provides an electronic device 30, including:
a memory 31 for storing a computer program;
a processor 32 for implementing the steps of the video synthesis method according to any of the embodiments described above when executing the computer program.
For the specific process of the video synthesis method, reference may be made to the corresponding content disclosed in the foregoing embodiments, and no further description is given here.
The memory 31 may be a carrier for storing resources, such as a read-only memory, a random access memory, a magnetic disk, or an optical disc, and the storage may be transient or permanent.
In addition, the electronic device 30 further includes a power supply 33, a communication interface 34, an input/output interface 35, and a communication bus 36. The power supply 33 provides the operating voltage for each hardware device on the electronic device 30; the communication interface 34 creates a data transmission channel between the electronic device 30 and peripheral devices, following any communication protocol applicable to the technical solution of the present application, which is not specifically limited here; and the input/output interface 35 obtains external input data or outputs data externally, its specific interface type being selectable according to the needs of the specific application and not limited here.
Further, an embodiment of the present application also discloses a computer-readable storage medium storing a computer program which, when executed by a processor, implements the video synthesis method disclosed in the foregoing embodiments.
For the specific process of the video synthesis method, reference may be made to the corresponding content disclosed in the foregoing embodiments, and no further description is given here.
In this specification, the embodiments are described progressively, each emphasizing its differences from the others; for the same or similar parts, the embodiments may be referred to one another. Since the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and relevant points can be found in the description of the method.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The video synthesis method, device, electronic device, and computer-readable storage medium provided by the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application, and the above embodiments are intended only to help in understanding the method and its core idea. Since those skilled in the art may make modifications to the specific implementation and application scope in accordance with the ideas of the present application, the content of this description should not be construed as limiting the present application.

Claims (10)

1. A video synthesis method, comprising:
receiving an input text and identifying text basic units of the input text;
matching the text basic units against a media asset tag library, and determining each text basic unit that matches at least one media asset tag as a word to be matched;
weighting the words to be matched, and determining the matching order of all words to be matched of the input text according to their weight scores;
determining the range of media assets to be matched for each word to be matched;
and, according to the matching order of all the words to be matched, randomly selecting media assets in turn from the range of media assets to be matched corresponding to each word to fill the video picture, until the video picture reaches a preset duration and video synthesis is complete, wherein the preset duration is determined by the length of the input text.
2. The video synthesis method according to claim 1, wherein randomly selecting media assets in turn from the range of media assets to be matched corresponding to each word, according to the matching order of all the words to be matched, until the video picture reaches a preset duration and video synthesis is complete comprises:
randomly selecting, in the matching order of all the words to be matched, one media asset in turn from the range corresponding to each word to fill the video picture;
and if every word to be matched has been filled once and the video picture has still not reached the preset duration, continuing in the same matching order to randomly select one media asset in turn from the remaining range of media assets to be matched corresponding to each word, until the video picture reaches the preset duration.
3. The video synthesis method according to claim 1, wherein the tags of the media assets include depth-of-field tags and/or color tags of the media assets, and randomly selecting media assets in turn from the range of media assets to be matched corresponding to each word, according to the matching order of all the words to be matched, until the video picture reaches a preset duration and video synthesis is complete comprises:
determining the target media asset of the first matching word to fill into the video picture, and obtaining the depth-of-field tag and/or color tag of that target media asset;
and determining the target media asset of each next matching word to fill into the picture according to the depth-of-field tag and/or color tag of the previous target media asset and a preset depth-of-field matching sequence and/or color matching rule, until the target media assets of all words to be matched have been matched.
4. The video synthesis method according to claim 1, wherein weighting the words to be matched comprises at least one of the following modes:
determining the part of speech of each word to be matched according to its semantic relations in the input text, and assigning each word a weight according to a preset part-of-speech rule;
weighting from high to low according to the order in which each word to be matched appears in the input text;
and establishing a fixed-weight lexicon, and re-weighting any word to be matched that falls within it.
5. The video synthesis method according to claim 1, further comprising, before matching the text basic units against the media asset tag library and determining each text basic unit that matches at least one media asset tag as a word to be matched:
establishing the media asset tag library, wherein the media asset tags have multiple levels and the tags at each level are assigned different weights.
6. The video synthesis method according to claim 5, wherein determining the range of media assets to be matched for each word to be matched comprises:
calculating a score for each media asset corresponding to each word to be matched according to the asset's tags, and determining the highest asset score for each word;
and determining the range of media assets to be matched for each word according to its highest asset score.
7. The video synthesis method according to claim 5, wherein randomly selecting media assets in turn from the range of media assets to be matched corresponding to each word, according to the matching order of all the words to be matched, until the video picture reaches a preset duration and video synthesis is complete comprises:
if it is detected that the target media asset of the current matching word has already been filled into the video picture, randomly selecting another media asset from the remaining range of media assets to be matched corresponding to the word to fill the video picture.
8. The video synthesis method of claim 1, wherein the media asset types comprise:
intelligent media assets, obtained by inputting at least one of audio, text, pictures and video into an intelligent media asset generation model.
9. A video synthesis apparatus, comprising:
a receiving module, configured to receive an input text and identify the text basic units of the input text;
a matching module, configured to match the text basic units with a media asset tag library and determine as words to be matched the text basic units that match at least one media asset tag;
a weighting module, configured to weight the words to be matched and determine the matching order of all the words to be matched of the input text according to their weight scores;
a determining module, configured to determine the to-be-matched asset range of each word to be matched;
and a filling module, configured to randomly select, in the matching order of all the words to be matched, media assets from each word's to-be-matched asset range to fill the video picture until the video picture reaches a preset duration and video synthesis is complete, the preset duration being determined by the length of the input text.
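Reading the apparatus as software, the five modules compose into a single pipeline. The sketch below wires together the helper sketches from earlier in this section; tokenize() is a naive stand-in (real Chinese input would need a word segmenter), the uniform "noun" tagging is a placeholder, and the class layout is an assumption, not the claimed apparatus.

```python
def tokenize(text):
    return text.split()  # naive stand-in; Chinese input would need a segmenter

class VideoCompositor:
    """Assumed wiring of the five claimed modules, reusing the sketches above."""
    def __init__(self, tag_library, asset_tags, asset_durations):
        self.tag_library = tag_library          # tag -> (level, asset ids)
        self.asset_tags = asset_tags            # asset_id -> set of tags
        self.asset_durations = asset_durations  # asset_id -> seconds

    def compose(self, text, preset_duration):
        units = tokenize(text)                                    # receiving module
        words = [u for u in units if u in self.tag_library]       # matching module
        # Placeholder part-of-speech tags; a real system would run a POS tagger.
        ordered = weight_words([(w, "noun") for w in words])      # weighting module
        pools = {w: [Asset(a, self.asset_durations[a])            # determining module
                     for a in matching_range(w, self.asset_tags, tag_weight)]
                 for w, _ in ordered}
        return fill_video([w for w, _ in ordered],                # filling module
                          pools, preset_duration)
```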
10. An electronic device, comprising:
a memory, configured to store a computer program;
and a processor, configured to implement the steps of the video synthesis method of any one of claims 1 to 8 when executing the computer program.
CN202410175135.5A 2024-02-07 2024-02-07 Video synthesis method and device and electronic equipment Active CN117749960B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410175135.5A CN117749960B (en) 2024-02-07 2024-02-07 Video synthesis method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN117749960A (en) 2024-03-22
CN117749960B (en) 2024-06-14

Family

ID=90279584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410175135.5A Active CN117749960B (en) 2024-02-07 2024-02-07 Video synthesis method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN117749960B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105868176A (en) * 2016-03-02 2016-08-17 北京同尘世纪科技有限公司 Text based video synthesis method and system
CN108009293A (en) * 2017-12-26 2018-05-08 北京百度网讯科技有限公司 Video tab generation method, device, computer equipment and storage medium
CN109344291A (en) * 2018-09-03 2019-02-15 腾讯科技(武汉)有限公司 A kind of video generation method and device
CN112004163A (en) * 2020-08-31 2020-11-27 北京市商汤科技开发有限公司 Video generation method and device, electronic equipment and storage medium
US20210004602A1 (en) * 2019-07-02 2021-01-07 Baidu Usa Llc Method and apparatus for determining (raw) video materials for news
CN114173067A (en) * 2021-12-21 2022-03-11 科大讯飞股份有限公司 Video generation method, device, equipment and storage medium
CN116017043A (en) * 2022-12-12 2023-04-25 维沃移动通信有限公司 Video generation method, device, electronic equipment and storage medium
KR102560609B1 (en) * 2022-10-27 2023-07-27 주식회사 일만백만 Video generation method and server performing thereof
US20230259253A1 (en) * 2021-08-31 2023-08-17 Tencent Technology (Shenzhen) Company Limited Video generation
CN117201715A (en) * 2023-09-13 2023-12-08 中国联合网络通信集团有限公司 Video generation method and device and readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张亚娜 (ZHANG Yana); 高子婷 (GAO Ziting); 胡溢 (HU Yi); 杨成 (YANG Cheng): "Chinese Comment Keyword Extraction in Converged-Media News Production" [融媒体新闻生产中的中文评论关键词提取], 人工智能 (Artificial Intelligence), no. 02, 10 April 2020 (2020-04-10) *

Also Published As

Publication number Publication date
CN117749960B (en) 2024-06-14

Similar Documents

Publication Publication Date Title
CN111626049B (en) Title correction method and device for multimedia information, electronic equipment and storage medium
CN110134968A (en) Poem generation method, device, equipment and storage medium based on deep learning
CN109660865B (en) Method and device for automatically labeling videos, medium and electronic equipment
CN114419387A (en) Cross-modal retrieval system and method based on pre-training model and recall ranking
CN111125457A (en) Deep cross-modal Hash retrieval method and device
CN109977382B (en) Poetry sentence generation model training method, automatic poetry writing method and device
CN112199932A (en) PPT generation method, device, computer-readable storage medium and processor
CN110765313A (en) Classified playing method and system for network video barrage
CN110263218A (en) Video presentation document creation method, device, equipment and medium
CN115129806A (en) Data processing method and device, electronic equipment and computer storage medium
CN113987274A (en) Video semantic representation method and device, electronic equipment and storage medium
CN112800263A (en) Video synthesis system, method and medium based on artificial intelligence
Luo et al. Synchronous bidirectional learning for multilingual lip reading
CN112800177B (en) FAQ knowledge base automatic generation method and device based on complex data types
CN115101042A (en) Text processing method, device and equipment
CN111858879B (en) Question and answer method and system based on machine reading understanding, storage medium and computer equipment
CN109657096A (en) A kind of ancillary statistics report-generating method based on teaching of low school age audio-video
CN117749960B (en) Video synthesis method and device and electronic equipment
CN116701636A (en) Data classification method, device, equipment and storage medium
KR20210097314A (en) Artificial intelligence based image generation system
CN111223014B (en) Method and system for online generation of subdivision scene teaching courses from a large number of subdivision teaching contents
CN113257225B (en) Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics
CN116129868A (en) Method and system for generating structured photo
CN110767201A (en) Score generation method, storage medium and terminal equipment
CN111681676B (en) Method, system, device and readable storage medium for constructing audio frequency by video object identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant