CN116915925A - Video generation method, system, electronic equipment and medium based on video template - Google Patents

Video generation method, system, electronic equipment and medium based on video template

Info

Publication number
CN116915925A
Authority
CN
China
Prior art keywords
template
video
vector
picture
proportion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310735557.9A
Other languages
Chinese (zh)
Other versions
CN116915925B (en)
Inventor
郝德禄
彭杰
吴伟芬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iMusic Culture and Technology Co Ltd
Original Assignee
iMusic Culture and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iMusic Culture and Technology Co Ltd filed Critical iMusic Culture and Technology Co Ltd
Priority to CN202310735557.9A priority Critical patent/CN116915925B/en
Publication of CN116915925A publication Critical patent/CN116915925A/en
Application granted granted Critical
Publication of CN116915925B publication Critical patent/CN116915925B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/265Mixing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/751Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip

Abstract

The invention discloses a video generation method and system, electronic equipment, and a medium based on a video template, comprising the following steps: acquiring user side features and video template side features, and inputting them into a double-tower recall model to obtain a plurality of first video templates; determining a picture tag vector of a picture group to be synthesized and a template tag vector of each first video template, and calculating the tag correlation coefficient of each first video template and the picture group to be synthesized; determining a picture proportion vector of the picture group to be synthesized and a template proportion vector of each first video template, and calculating the proportion correlation coefficient of each first video template and the picture group to be synthesized; and determining the matching degree of each first video template and the picture group to be synthesized according to the tag correlation coefficient and the proportion correlation coefficient, selecting a second video template according to the matching degree, and generating a first video from the picture group to be synthesized and the second video template. The invention improves video synthesis efficiency and user experience and yields a better video generation effect, and can be applied to the technical field of video synthesis.

Description

Video generation method, system, electronic equipment and medium based on video template
Technical Field
The invention relates to the technical field of video synthesis, in particular to a video generation method and system based on a video template, electronic equipment and a medium.
Background
Video content production is ubiquitous in daily life: users record their lives, express their individuality, and share value by producing video content. Video production is usually carried out in two ways: first, recording a video oneself and editing it into shape; second, uploading several pictures to generate a video from a specific template. Because it is convenient to operate and rich in effects, template-based video synthesis has become a main mode of video content sharing, and major Internet companies have researched and provided related template video generation capabilities, such as Volcano Engine and Jianying (CapCut).
With the growing richness and variety of video template effects, the demands and scenarios for producing and sharing videos based on video templates keep increasing. At present, video template synthesis mainly has two operation modes:
1) The user selects a template autonomously. The user finds templates on related topics through tags, queries them one by one, tries synthesis one by one, checks the video effect, and finally selects a suitable video template. This approach is cumbersome and inefficient: many users need multiple synthesis attempts to find the most suitable template, which harms the user experience.
2) One-click synthesis with a universal template. Video synthesis is carried out based on a batch of universal video templates: after the user uploads pictures, templates matching the specified number of pictures are found, and when the image proportions do not match, proportion adaptation and synthesis of the universal template are completed through automatic cropping or Gaussian blur processing. However, synthesis with a universal video template often yields mismatched effects, which harms the video generation result and the user experience.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems existing in the prior art to a certain extent.
Therefore, an object of the embodiments of the present invention is to provide a video generation method based on a video template that improves video synthesis efficiency and the user experience, so that the video generation effect is better.
It is another object of an embodiment of the present invention to provide a video generating system based on a video template.
In order to achieve the technical purpose, the technical scheme adopted by the embodiment of the invention comprises the following steps:
in a first aspect, an embodiment of the present invention provides a video generating method based on a video template, including the following steps:
acquiring user side features and video template side features, and inputting the user side features and the video template side features into a double-tower recall model to obtain a plurality of recalled first video templates;
Determining a picture tag vector of a picture group to be synthesized and a template tag vector of the first video template, and calculating tag correlation coefficients of each first video template and the picture group to be synthesized according to the picture tag vector and the template tag vector;
determining a picture proportion vector of the picture group to be synthesized and a template proportion vector of the first video template, and calculating a proportion correlation coefficient of each first video template and the picture group to be synthesized according to the picture proportion vector and the template proportion vector;
and determining the matching degree of each first video template and the picture group to be synthesized according to the label correlation coefficient and the proportion correlation coefficient, selecting a second video template according to the matching degree, and generating a first video according to the picture group to be synthesized and the second video template.
Further, in one embodiment of the present invention, the step of obtaining a user side feature and a video template side feature, and inputting the user side feature and the video template side feature into a dual-tower recall model to obtain a first video template of a plurality of recalls specifically includes:
acquiring template use time, template use frequency and template use preference of a target user, and determining the user side characteristics according to the template use time, the template use frequency and the template use preference;
Acquiring a template style, a template type and a template rhythm of a target video template, and determining side characteristics of the video template according to the template style, the template type and the template rhythm;
inputting the user side features and the video template side features into a double-tower recall model, and outputting the recall rate of each target video template;
and determining the target video template with the recall rate larger than or equal to a preset first threshold value as the first video template.
Further, in one embodiment of the present invention, the step of determining a picture tag vector of a picture group to be synthesized and a template tag vector of the first video template, and calculating tag correlation coefficients of each first video template and the picture group to be synthesized according to the picture tag vector and the template tag vector specifically includes:
acquiring a picture group to be synthesized, which is uploaded by a target user, wherein the picture group to be synthesized comprises a plurality of pictures to be synthesized;
performing label classification on each picture to be synthesized through a convolutional neural network to obtain a first picture label of each picture to be synthesized, and generating a picture label vector according to the first picture label;
Performing label classification on the first video templates through a ResNet residual error network to obtain a plurality of first template labels of the first video templates, and generating template label vectors according to the first template labels;
and determining cosine similarity of the picture tag vector and the template tag vector, and determining tag correlation coefficients of the first video templates and the picture group to be synthesized according to the cosine similarity.
Further, in one embodiment of the present invention, the step of determining a picture scale vector of the to-be-synthesized picture group and a template scale vector of the first video template, and calculating a scale correlation coefficient between each of the first video templates and the to-be-synthesized picture group according to the picture scale vector and the template scale vector specifically includes:
determining a first picture proportion of each picture to be synthesized, and generating a picture proportion vector according to the first picture proportion;
determining a first region proportion of each template region in the first video template, and generating the template proportion vector according to the first region proportion;
comparing the picture proportion vector with the template proportion vector in vector dimension, and filling the picture proportion vector/the template proportion vector through a preset filling vector when the vector dimension of the picture proportion vector and the template proportion vector are inconsistent, so as to obtain a picture proportion vector and a template proportion vector with consistent dimensions;
And determining the normalized distance between the picture proportion vector and the template proportion vector with the consistent dimensions, and determining the proportion correlation coefficient of each first video template and the picture group to be synthesized according to the normalized distance.
Further, in one embodiment of the present invention, the normalized distance of the dimension-consistent picture scale vector from the template scale vector is determined according to the following equation:
where $r_{pic\_k}$ denotes the $k$-th first picture proportion in the picture proportion vector $V_{pic}$, $r_{video\_k}$ denotes the $k$-th first region proportion in the template proportion vector $V_{video}$, $N$ denotes the vector dimension of $V_{pic}$ and $V_{video}$, and $D_{pic\_video}(V_{pic}, V_{video})$ denotes the normalized distance between $V_{pic}$ and $V_{video}$.
Further, in one embodiment of the present invention, the step of determining the matching degree between each of the first video templates and the group of pictures to be synthesized according to the tag correlation coefficient and the scale correlation coefficient specifically includes:
determining a content attribute weight and an effect attribute weight of the first video template;
and taking the content attribute weight as the weight of the tag correlation coefficient, taking the effect attribute weight as the weight of the proportion correlation coefficient, and carrying out weighted summation on the tag correlation coefficient and the proportion correlation coefficient to obtain the matching degree of the first video template and the picture group to be synthesized.
Further, in an embodiment of the present invention, the step of selecting a second video template according to the matching degree, and further generating a first video according to the group of pictures to be synthesized and the second video template specifically includes:
selecting a plurality of first video templates with matching degree larger than or equal to a preset second threshold value as a second video template, or selecting a plurality of first video templates with matching degree ranking smaller than or equal to a preset third threshold value as a second video template;
and carrying out video synthesis on the picture group to be synthesized according to the second video template to generate the first video.
In a second aspect, an embodiment of the present invention provides a video generating system based on a video template, including:
the video template recall module is used for acquiring user side features and video template side features, inputting the user side features and the video template side features into a double-tower recall model, and obtaining a plurality of recalled first video templates;
the label correlation coefficient calculation module is used for determining a picture label vector of a picture group to be synthesized and a template label vector of the first video template, and calculating label correlation coefficients of the first video template and the picture group to be synthesized according to the picture label vector and the template label vector;
The proportion correlation coefficient calculation module is used for determining a picture proportion vector of the picture group to be synthesized and a template proportion vector of the first video template, and calculating proportion correlation coefficients of the first video template and the picture group to be synthesized according to the picture proportion vector and the template proportion vector;
the video template selection module is used for determining the matching degree of each first video template and the picture group to be synthesized according to the label correlation coefficient and the proportion correlation coefficient, selecting a second video template according to the matching degree, and generating a first video according to the picture group to be synthesized and the second video template.
In a third aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for implementing a connection communication between the processor and the memory, where the program, when executed by the processor, implements a video template-based video generation method as described in the first aspect above.
In a fourth aspect, an embodiment of the present invention further provides a storage medium, where the storage medium is a computer readable storage medium, where one or more programs are stored, and the one or more programs are executable by one or more processors to implement the video generating method based on a video template according to the first aspect.
The advantages and benefits of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
The embodiment of the invention obtains user side characteristics and video template side characteristics, inputs the user side characteristics and the video template side characteristics into a double-tower recall model to obtain a plurality of recalled first video templates, then determines picture tag vectors of a picture group to be synthesized and template tag vectors of the first video templates, calculates tag correlation coefficients of each first video template and the picture group to be synthesized according to the picture tag vectors and the template tag vectors, then determines picture proportion vectors of the picture group to be synthesized and template proportion vectors of the first video templates, calculates proportion correlation coefficients of each first video template and the picture group to be synthesized according to the picture proportion vectors and the template proportion vectors, finally determines matching degree of each first video template and the picture group to be synthesized according to the tag correlation coefficients and the proportion correlation coefficients, selects a second video template according to the matching degree, and further generates a first video according to the picture group to be synthesized and the second video template. According to the embodiment of the invention, the plurality of first video templates which accord with the user characteristics are screened out through the double-tower recall model, and then the matching degree of each first video template and the picture group to be synthesized is determined based on the label correlation coefficient and the proportion correlation coefficient, so that the second video template with higher matching degree can be automatically selected to carry out video synthesis on the picture group to be synthesized, the condition that the topic content or the template proportion of the video template is not matched with the picture group to be synthesized is avoided, the video synthesis efficiency and the user experience are improved, and the video generation effect is better.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the following description will refer to the drawings that are needed in the embodiments of the present invention, and it should be understood that the drawings in the following description are only for convenience and clarity to describe some embodiments in the technical solutions of the present invention, and other drawings may be obtained according to these drawings without any inventive effort for those skilled in the art.
Fig. 1 is a flowchart of steps of a video generating method based on a video template according to an embodiment of the present invention;
fig. 2 is a flowchart of step S101 provided in the embodiment of the present invention;
fig. 3 is a flowchart of step S102 provided in the embodiment of the present invention;
fig. 4 is a flowchart of step S103 provided in the embodiment of the present invention;
fig. 5 is a flowchart of step S104 provided in the embodiment of the present invention;
fig. 6 is another flowchart of step S104 provided in the embodiment of the present invention;
FIG. 7 is a schematic diagram of a dual tower recall model provided by an embodiment of the present invention;
fig. 8 is a schematic diagram of a calculation process of a tag correlation coefficient according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of a video generating system based on a video template according to an embodiment of the present invention;
Fig. 10 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the application. It should be noted that although functional block division is performed in a system diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the system diagram or the sequence in the flowchart. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
In the description of the present application, the plurality means two or more, and if the description is made to the first and second for the purpose of distinguishing technical features, it should not be construed as indicating or implying relative importance or implicitly indicating the number of the indicated technical features or implicitly indicating the precedence of the indicated technical features. Furthermore, unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
The video generating method based on the video template provided by the embodiment of the application can be applied to a terminal, a server, and software running in the terminal or the server. In some embodiments, the terminal may be a smart phone, tablet, notebook, desktop, etc.; the server side can be configured as an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and basic cloud computing services such as big data and artificial intelligence platforms; the software may be an application or the like that implements a video template-based video generation method, but is not limited to the above forms.
The application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In the embodiments of the present application, when related processing is performed according to user information, user behavior data, user history data, user location information, and other data related to user identity or characteristics, permission or consent of the user is obtained first, and the collection, use, processing, and the like of the data comply with related laws and regulations and standards of related countries and regions. In addition, when the embodiment of the application needs to acquire the sensitive personal information of the user, the independent permission or independent consent of the user is acquired through popup or jump to a confirmation page and the like, and after the independent permission or independent consent of the user is definitely acquired, the necessary relevant data of the user for enabling the embodiment of the application to normally operate is acquired.
Referring to fig. 1, a step flow chart of a video generating method based on a video template provided in an embodiment of the present application is shown, and referring to fig. 1, the embodiment of the present application provides a video generating method based on a video template, which specifically includes the following steps:
s101, acquiring user side features and video template side features, and inputting the user side features and the video template side features into a double-tower recall model to obtain a plurality of recalled first video templates.
Specifically, different types of users select different types of templates; for example, younger user groups prefer lively, beat-synced, fast-paced templates, while older user groups prefer blessing-themed, split-screen, and flower-and-plant templates. As video template libraries grow, in order to produce a suitable template video, video template content suited to the user's features needs to be extracted from a huge number of video templates. The embodiment of the invention constructs user side features and template side features and completes the recall of video templates based on the double-tower recall model.
Referring to fig. 2, which is a flowchart of step S101 provided by the embodiment of the present invention, further as an alternative implementation manner, the step of acquiring user side features and video template side features and inputting them into a double-tower recall model to obtain a plurality of recalled first video templates specifically includes:
S1011, acquiring the template use time, template use frequency and template use preference of a target user, and determining the user side features according to the template use time, the template use frequency and the template use preference;
S1012, acquiring the template style, template type and template rhythm of a target video template, and determining the video template side features according to the template style, the template type and the template rhythm;
S1013, inputting the user side features and the video template side features into the double-tower recall model, and outputting the recall rate of each target video template;
S1014, determining the target video templates whose recall rate is greater than or equal to a preset first threshold value as the first video templates.
Specifically, the embodiment of the invention completes the recall operation of video templates through the double-tower recall model: user side features are constructed based on user history data such as template use time, template use frequency and template use preference; video template side features are constructed based on template style, template type and template rhythm; the user side features and the video template side features are input to the double-tower recall model and, combined with the data of the interaction layer, the template content meeting the user's requirements is extracted.
As shown in fig. 7, which is a schematic diagram of the double-tower recall model provided by the embodiment of the present invention, it can be understood that the input user side features and video template side features are each passed through a DNN to extract feature vectors, recall is then calculated based on cosine similarity or Euclidean distance, and the target video templates with a recall rate greater than or equal to the first threshold are selected as the first video templates.
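For illustration only, the following is a minimal sketch of the scoring step of such a double-tower model, assuming both towers are already trained; the feature dimensions, tower shapes, and the 0.5 threshold are hypothetical placeholders, not values from the patent.

```python
import numpy as np

def tower(x: np.ndarray, weights: list[np.ndarray]) -> np.ndarray:
    """A toy DNN tower (stacked linear layers + ReLU), as sketched in fig. 7."""
    for w in weights:
        x = np.maximum(x @ w, 0.0)
    return x / np.linalg.norm(x)  # unit-normalize so a dot product = cosine similarity

rng = np.random.default_rng(0)
# Hypothetical shapes: 16-dim user-side features, 24-dim template-side features,
# both towers projecting into a shared 8-dim embedding space.
user_weights = [rng.standard_normal((16, 32)), rng.standard_normal((32, 8))]
tmpl_weights = [rng.standard_normal((24, 32)), rng.standard_normal((32, 8))]

user_feat = rng.standard_normal(16)  # encodes template use time / frequency / preference
tmpl_feat = rng.standard_normal(24)  # encodes template style / type / rhythm

# Recall score = cosine similarity of the two tower outputs; templates whose
# score clears the preset first threshold become the first video templates.
score = tower(user_feat, user_weights) @ tower(tmpl_feat, tmpl_weights)
print(score >= 0.5)  # 0.5 stands in for the preset first threshold
```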
In some alternative embodiments, the video template recall based on the trained double-tower recall model is performed as follows:
1. Offline processing:
1) Template embedding generation: clustering the video templates;
2) Online simulation verification on the GPU server (positive sample recall hit rate 40%):
a. read in the generated template embeddings;
b. obtain the synthesis records and template information;
c. before each user synthesizes a video, find that user's 3 most recent production records;
d. sum and average the template embeddings corresponding to the 3 production records to obtain the user's embedding;
e. according to the user_emb, use faiss to find the 10 nearest template ids;
f. check the recall hit rate of the positive samples;
3) If the verification in step 2) passes, update the template embeddings generated in step 1) to the Milvus server.
2. The online use steps are as follows:
1) Connect to the Milvus server and fetch data in a RESTful manner;
2) According to the 3 most recently used template ids in the input history, acquire the 3 template embeddings from the Milvus server;
3) Sum and average the 3 template embeddings to obtain the user's user_emb;
4) According to the user_emb, search the Milvus server for the 20 nearest template ids and return them (see the sketch after this list).
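As a rough illustration of steps 2)–4), the sketch below uses faiss in place of the Milvus service described above; the embedding dimension, template count, and ids are hypothetical.

```python
import numpy as np
import faiss  # used here as a stand-in for the Milvus service described above

# Hypothetical data: 1000 template embeddings of dimension 64,
# generated offline by the template tower (step 1 above).
dim = 64
template_embs = np.random.rand(1000, dim).astype("float32")
faiss.normalize_L2(template_embs)   # normalize so inner product = cosine similarity
index = faiss.IndexFlatIP(dim)
index.add(template_embs)

# Steps 2)-4): average the embeddings of the user's 3 most recent templates
# to form user_emb, then retrieve the 20 nearest template ids.
recent_ids = [12, 407, 873]         # hypothetical production history
user_emb = template_embs[recent_ids].mean(axis=0, keepdims=True)
faiss.normalize_L2(user_emb)
_, nearest_template_ids = index.search(user_emb, 20)
print(nearest_template_ids[0])      # candidate first video templates
```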
It can be appreciated that the embodiment of the invention obtains video templates with a higher recall rate through the double-tower recall model, which effectively narrows the subsequent template matching range, reduces the computation required for the subsequent matching degree calculation, and improves the matching efficiency of video templates, thereby improving video synthesis efficiency.
S102, determining a picture tag vector of a picture group to be synthesized and a template tag vector of a first video template, and calculating tag correlation coefficients of each first video template and the picture group to be synthesized according to the picture tag vector and the template tag vector.
Specifically, the picture tag vector of the picture group to be synthesized and the template tag vector of the first video template are determined respectively; they reflect the content attributes of the picture group to be synthesized and of the first video template, and the correlation between each first video template and the content tags of the picture group to be synthesized can be determined by calculating the tag correlation coefficient. The embodiment of the invention performs intelligent tag identification on the pictures uploaded by the user and calculates the correlation with the recalled video templates.
Referring to fig. 3, further as an alternative implementation manner, a step of determining a picture tag vector of a to-be-synthesized picture group and a template tag vector of a first video template, and calculating tag correlation coefficients of each first video template and the to-be-synthesized picture group according to the picture tag vector and the template tag vector is shown in fig. 3, which is a flowchart of step S102 provided by the embodiment of the present invention, and specifically includes:
S1021, obtaining a picture group to be synthesized, which is uploaded by a target user and comprises a plurality of pictures to be synthesized;
s1022, carrying out label classification on each picture to be synthesized through a convolutional neural network to obtain a first picture label of each picture to be synthesized, and generating a picture label vector according to the first picture label;
s1023, carrying out label classification on the first video templates through a ResNet residual error network to obtain a plurality of first template labels of the first video templates, and generating template label vectors according to the first template labels;
s1024, determining cosine similarity of the picture tag vector and the template tag vector, and determining tag correlation coefficients of each first video template and the picture group to be synthesized according to the cosine similarity.
Specifically, fig. 8 is a schematic diagram of the calculation process of the tag correlation coefficient according to the embodiment of the present invention. When a user synthesizes a video, the corresponding pictures are selected and uploaded; after uploading, tag identification and intelligent classification are performed on each uploaded picture based on Mask R-CNN, the tag data of all uploaded pictures are merged, and the picture tag vector of the user picture group is output. Tagging of the video template is performed through two steps: ResNet101 network model processing and post-processing. Cosine similarity is then calculated between the picture tag vector of the user picture group and the template tag vector of each video template to obtain the tag correlation coefficient $R_{pic\_video}$ of each video template with respect to the picture group.
It should be noted that, in the embodiment of the invention, the user picture group tags and the video template tags share one tag system; two sets of tag vectors are constructed based on this unified tag system, and the tag correlation coefficient $R_{pic\_video}$ of the video template and the picture group to be synthesized is obtained through cosine similarity calculation. Through this step, video templates with higher tag correlation to the user's pictures can be found, ensuring that the subsequent video synthesis effect accords with the user's usage scenario.
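A minimal sketch of this tag correlation calculation, assuming a shared tag system as described above; the tag vocabulary and multi-hot encoding here are illustrative assumptions, not the patent's actual tag system.

```python
import numpy as np

# A shared tag system: the picture group and the video template are both
# represented over the same tag vocabulary (a hypothetical 6-tag example).
TAGS = ["travel", "family", "festival", "pet", "food", "scenery"]

def tag_vector(tags: list[str]) -> np.ndarray:
    """Multi-hot vector counting how often each tag occurs."""
    v = np.zeros(len(TAGS))
    for t in tags:
        v[TAGS.index(t)] += 1
    return v

# Merged tags of the uploaded picture group vs. tags of one recalled template.
pic_vec = tag_vector(["travel", "scenery", "scenery", "food"])
tmpl_vec = tag_vector(["travel", "scenery"])

# Tag correlation coefficient R_pic_video = cosine similarity of the tag vectors.
r_pic_video = pic_vec @ tmpl_vec / (np.linalg.norm(pic_vec) * np.linalg.norm(tmpl_vec))
print(round(float(r_pic_video), 3))
```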
S103, determining a picture proportion vector of the picture group to be synthesized and a template proportion vector of the first video templates, and calculating proportion correlation coefficients of the first video templates and the picture group to be synthesized according to the picture proportion vector and the template proportion vector.
Specifically, the embodiment of the invention calculates the proportion vector of the picture uploaded by the user, and performs normalized distance calculation by combining the proportion vector of the video template to obtain the proportion correlation coefficient of each first video template and the picture group to be synthesized.
Referring to fig. 4, as an alternative implementation manner, further referring to fig. 4, a step of determining a picture scale vector of a group of pictures to be synthesized and a template scale vector of a first video template, and calculating scale correlation coefficients of each first video template and the group of pictures to be synthesized according to the picture scale vector and the template scale vector is shown in fig. 4, where the step specifically includes:
S1031, determining a first picture proportion of each picture to be synthesized, and generating a picture proportion vector according to the first picture proportion;
s1032, determining a first area proportion of each template area in the first video template, and generating a template proportion vector according to the first area proportion;
s1033, comparing the picture proportion vector with the template proportion vector in vector dimension, and filling the picture proportion vector/the template proportion vector through a preset filling vector when the vector dimensions of the picture proportion vector and the template proportion vector are inconsistent, so as to obtain the picture proportion vector and the template proportion vector with consistent dimensions;
s1034, determining normalized distances of the picture proportion vectors and the template proportion vectors with consistent dimensions, and determining proportion correlation coefficients of the first video templates and the picture groups to be synthesized according to the normalized distances.
Specifically, for each picture uploaded by the user, the aspect ratio is calculated and defined as $r_{pic\_i}$, where $i$ ranges from 1 to $m$ and $m$ is the number of pictures uploaded by the user. From the aspect ratio of each picture, the picture proportion vector is obtained as follows:
$$V_{pic} = (r_{pic\_1}, r_{pic\_2}, \dots, r_{pic\_m})$$
In the video template, each candidate position (i.e., template region) has an optimal proportion value, defined as $r_{video\_j}$, where $j$ ranges from 1 to $n$ and $n$ is the number of candidate positions. From the optimal proportion of each candidate position, the template proportion vector is obtained as follows:
$$V_{video} = (r_{video\_1}, r_{video\_2}, \dots, r_{video\_n})$$
The dimensions of the picture proportion vector and the template proportion vector are compared, and the lower-dimensional vector is padded with zeros to ensure consistent vector dimensions. The dimension of the processed proportion vectors is $N$.
And carrying out normalized distance calculation on the picture proportion vector and the template proportion vector with consistent dimensions to obtain the difference between the two proportion vectors.
Further as an alternative embodiment, the normalized distance of the dimension-consistent picture scale vector and the template scale vector is determined according to the following equation:
where $r_{pic\_k}$ denotes the $k$-th first picture proportion in the picture proportion vector $V_{pic}$, $r_{video\_k}$ denotes the $k$-th first region proportion in the template proportion vector $V_{video}$, $N$ denotes the vector dimension of $V_{pic}$ and $V_{video}$, and $D_{pic\_video}(V_{pic}, V_{video})$ denotes the normalized distance between $V_{pic}$ and $V_{video}$.
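The equation itself is not preserved in this text. One plausible reconstruction, consistent with the properties stated in the next paragraph (the distance is 0 when all proportions are identical and never exceeds 1), is:

$$D_{pic\_video}(V_{pic}, V_{video}) = \frac{1}{N}\sum_{k=1}^{N}\frac{\lvert r_{pic\_k} - r_{video\_k}\rvert}{\max\left(r_{pic\_k},\, r_{video\_k}\right)}$$

This particular form is an assumption; the patent may normalize the distance differently.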
Specifically, when the proportions of the pictures uploaded by the user are 100% identical to the proportions of the video template, the distance $D_{pic\_video}$ is 0; the larger the distance, the larger the difference between the uploaded pictures and the video template, and its maximum value does not exceed 1. After the proportion correlation coefficient is obtained in this step, the comprehensive calculation of the matching degree can be performed next.
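A sketch combining the zero-padding of step S1033 with a normalized distance of the assumed form above; the distance formula and the mapping to a proportion correlation coefficient (taken here as 1 − distance) are assumptions, since the original equation is not preserved in this text.

```python
import numpy as np

def pad_to_match(a: np.ndarray, b: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Zero-pad the lower-dimensional proportion vector (step S1033)."""
    n = max(len(a), len(b))
    return np.pad(a, (0, n - len(a))), np.pad(b, (0, n - len(b)))

def normalized_distance(v_pic: np.ndarray, v_video: np.ndarray) -> float:
    """Assumed form of D_pic_video: 0 for identical proportions, at most 1."""
    v_pic, v_video = pad_to_match(v_pic, v_video)
    denom = np.maximum(np.maximum(v_pic, v_video), 1e-9)  # guard the padded zeros
    return float(np.mean(np.abs(v_pic - v_video) / denom))

# Aspect ratios of 3 uploaded pictures vs. 4 template regions (hypothetical).
v_pic = np.array([16 / 9, 4 / 3, 1.0])
v_video = np.array([16 / 9, 16 / 9, 3 / 4, 1.0])

d = normalized_distance(v_pic, v_video)
prop_coef = 1.0 - d  # assumed mapping from distance to proportion correlation coefficient
print(round(d, 3), round(prop_coef, 3))
```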
S104, determining the matching degree of each first video template and the picture group to be synthesized according to the label correlation coefficient and the proportion correlation coefficient, selecting a second video template according to the matching degree, and generating a first video according to the picture group to be synthesized and the second video template.
Referring to fig. 5, as an alternative implementation manner, further referring to fig. 5, the step of determining the matching degree between each first video template and the group of pictures to be synthesized according to the tag correlation coefficient and the scale correlation coefficient specifically includes:
s1041, determining content attribute weight and effect attribute weight of a first video template;
s1042, taking the content attribute weight as the weight of the tag correlation coefficient, taking the effect attribute weight as the weight of the proportion correlation coefficient, and carrying out weighted summation on the tag correlation coefficient and the proportion correlation coefficient to obtain the matching degree of the first video template and the picture group to be synthesized.
Specifically, combining the tag correlation coefficient and the proportion correlation coefficient determined in the previous steps, the matching degree between the picture group to be synthesized and each first video template can be calculated as follows:
$$S_{pic\_video} = \alpha \cdot R_{pic\_video} + \beta \cdot D_{pic\_video}$$
where $R_{pic\_video}$ denotes the tag correlation coefficient of the user pictures and the video template, $D_{pic\_video}$ denotes the proportion correlation coefficient of the user pictures and the video template, and $\alpha$ and $\beta$ denote the content attribute weight and the effect attribute weight, respectively.
The values of $\alpha$ and $\beta$ differ across video templates. Content-biased templates have a higher $\alpha$, such as festival and theme templates; effect-biased templates have a higher $\beta$, such as templates with full-screen transition effects and frame-layer special effects. The embodiment of the invention determines the content attribute weight and the effect attribute weight of each first video template and performs a weighted summation of the tag correlation coefficient and the proportion correlation coefficient based on these two weights, so that the matching degree of each first video template and the picture group to be synthesized can be obtained accurately.
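A minimal sketch of this weighted fusion; the coefficient values and weights below are hypothetical.

```python
def matching_degree(r_tag: float, r_prop: float, alpha: float, beta: float) -> float:
    """S_pic_video = alpha * R_pic_video + beta * D_pic_video, where D_pic_video
    here denotes the proportion correlation coefficient used in the weighted sum."""
    return alpha * r_tag + beta * r_prop

# A content-biased (e.g., festival) template weighs the tag coefficient heavily;
# an effect-biased (e.g., full-screen transition) template weighs the proportion one.
print(matching_degree(0.82, 0.67, alpha=0.7, beta=0.3))  # hypothetical values
print(matching_degree(0.82, 0.67, alpha=0.3, beta=0.7))
```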
Referring to fig. 6, as another flowchart of step S104 provided by the embodiment of the present invention, further as an alternative implementation manner, a step of selecting a second video template according to the matching degree, and further generating a first video according to a group of pictures to be synthesized and the second video template specifically includes:
s1043, selecting a plurality of first video templates with matching degree larger than or equal to a preset second threshold value as a second video template, or selecting a plurality of first video templates with matching degree ranking smaller than or equal to a preset third threshold value as a second video template;
s1044, performing video synthesis on the picture group to be synthesized according to the second video template to generate a first video.
Specifically, matching degrees are calculated in batch for the screened first video templates, and the matching degree values of the different templates are obtained and sorted, so that the Top K templates meeting the user's requirements can be obtained and the video synthesis effect optimized.
In some alternative embodiments, the top-ranked template of the derived Top K may be extracted for video synthesis and the video result returned. Meanwhile, the other template data in the Top K can be fed back to the product side, so that the user can conveniently select and view them.
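A sketch of the Top K selection described above; the template ids and matching degrees are hypothetical.

```python
# Hypothetical matching results per recalled template: (template_id, matching_degree).
candidates = [(101, 0.83), (205, 0.91), (317, 0.64), (408, 0.77)]

K = 3
top_k = sorted(candidates, key=lambda t: t[1], reverse=True)[:K]

best_template_id = top_k[0][0]  # used for the one-click video synthesis
alternatives = top_k[1:]        # fed back to the product side for the user to browse
print(best_template_id, alternatives)
```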
The method steps of the embodiments of the present invention are described above. It can be understood that the embodiment of the invention constructs user side features and template side features and completes recall over massive video templates based on the double-tower model; performs intelligent tag identification on the pictures uploaded by the user and calculates tag correlation coefficients with the recalled video templates; meanwhile, calculates the proportion vector of the uploaded pictures and performs a normalized distance calculation with the proportion vector of each video template to obtain the proportion correlation coefficient; and then comprehensively evaluates the tag correlation coefficient and the proportion correlation coefficient to obtain and sort the matching degree values of the different video templates, synthesizes the video based on the video template with the highest matching degree, and generates the video content with the best effect in one click to meet the user's requirements.
According to the embodiment of the invention, the plurality of first video templates which accord with the user characteristics are screened out through the double-tower recall model, and then the matching degree of each first video template and the picture group to be synthesized is determined based on the label correlation coefficient and the proportion correlation coefficient, so that the second video template with higher matching degree can be automatically selected to carry out video synthesis on the picture group to be synthesized, the condition that the topic content or the template proportion of the video template is not matched with the picture group to be synthesized is avoided, the video synthesis efficiency and the user experience are improved, and the video generation effect is better.
Fig. 9 is a schematic structural diagram of a video generating system based on a video template according to an embodiment of the present invention. Referring to fig. 9, an embodiment of the present invention provides a video generating system based on a video template, including:
the video template recall module is used for acquiring user side features and video template side features, inputting the user side features and the video template side features into the double-tower recall model, and obtaining a plurality of recalled first video templates;
the label correlation coefficient calculation module is used for determining a picture label vector of the picture group to be synthesized and a template label vector of the first video templates, and calculating label correlation coefficients of each first video template and the picture group to be synthesized according to the picture label vector and the template label vector;
the proportional correlation coefficient calculation module is used for determining a picture proportional vector of the picture group to be synthesized and a template proportional vector of the first video templates, and calculating the proportional correlation coefficient of each first video template and the picture group to be synthesized according to the picture proportional vector and the template proportional vector;
the video template selection module is used for determining the matching degree of each first video template and the picture group to be synthesized according to the label correlation coefficient and the proportion correlation coefficient, selecting a second video template according to the matching degree, and generating a first video according to the picture group to be synthesized and the second video template.
The content in the method embodiment is applicable to the system embodiment, the functions specifically realized by the system embodiment are the same as those of the method embodiment, and the achieved beneficial effects are the same as those of the method embodiment.
The embodiment of the invention also provides electronic equipment, comprising: a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for realizing connection communication between the processor and the memory, wherein the program, when executed by the processor, implements the above video generation method based on a video template. The electronic equipment can be any intelligent terminal, including a tablet computer, a vehicle-mounted computer, and the like.
Referring to fig. 10, a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present invention is shown in fig. 10, where the embodiment of the present invention provides an electronic device, including:
the processor 1001 may be implemented by using a general-purpose CPU (central processing unit), a microprocessor, an application-specific integrated circuit (ApplicationSpecificIntegratedCircuit, ASIC), or one or more integrated circuits, etc. to execute related programs to implement the technical solution provided by the embodiments of the present invention;
The memory 1002 may be implemented in the form of Read Only Memory (ROM), static storage, dynamic storage, or Random Access Memory (RAM). The memory 1002 may store an operating system and other application programs; when the technical solutions provided in the embodiments of the present specification are implemented by software or firmware, the relevant program code is stored in the memory 1002, and the processor 1001 invokes it to perform the video template-based video generation method of the embodiments of the present invention;
an input/output interface 1003 for implementing information input and output;
the communication interface 1004 is configured to implement communication interaction between the present device and other devices, and may implement communication in a wired manner (e.g. USB, network cable, etc.), or may implement communication in a wireless manner (e.g. mobile network, WIFI, bluetooth, etc.);
a bus 1005 for transferring information between the various components of the device (e.g., the processor 1001, memory 1002, input/output interface 1003, and communication interface 1004);
wherein the processor 1001, the memory 1002, the input/output interface 1003, and the communication interface 1004 realize communication connection between each other inside the device through the bus 1005.
The embodiment of the invention also provides a storage medium, which is a computer readable storage medium and is used for computer readable storage, the storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to realize the video generating method based on the video template.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Embodiments of the present invention also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the method shown in fig. 1.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the present invention has been described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the functions and/or features described above may be integrated in a single physical device and/or software module or one or more of the functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the invention, which is to be defined in the appended claims and their full scope of equivalents.
The above functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part thereof contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the above-described methods of the various embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
Logic and/or steps represented in the flowcharts or otherwise described herein, for example an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber device, and a portable Compact Disc Read-Only Memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium upon which the program is printed, as the program may be electronically captured via, for instance, optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, the steps may be implemented using any one of, or a combination of, the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, Programmable Gate Arrays (PGAs), Field-Programmable Gate Arrays (FPGAs), and the like.
In the foregoing description of the present specification, references to the terms "one embodiment/example", "another embodiment/example", "certain embodiments/examples", and the like mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present application have been shown and described, it will be understood by those of ordinary skill in the art that many changes, modifications, substitutions, and variations may be made to these embodiments without departing from the spirit and principles of the application, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present application have been described in detail, the present application is not limited to the above embodiments; various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present application, and such equivalent modifications and substitutions are intended to fall within the scope of the present application as defined by the appended claims.

Claims (10)

1. A video generation method based on a video template, characterized by comprising the following steps:
acquiring user side features and video template side features, and inputting the user side features and the video template side features into a double-tower recall model to obtain a plurality of recalled first video templates;
determining a picture tag vector of a picture group to be synthesized and a template tag vector of the first video template, and calculating tag correlation coefficients of each first video template and the picture group to be synthesized according to the picture tag vector and the template tag vector;
determining a picture proportion vector of the picture group to be synthesized and a template proportion vector of the first video template, and calculating a proportion correlation coefficient of each first video template and the picture group to be synthesized according to the picture proportion vector and the template proportion vector;
and determining the matching degree of each first video template and the picture group to be synthesized according to the label correlation coefficient and the proportion correlation coefficient, selecting a second video template according to the matching degree, and generating a first video according to the picture group to be synthesized and the second video template.
2. The video generation method according to claim 1, wherein the step of acquiring user side features and video template side features, inputting the user side features and the video template side features into a double-tower recall model, and obtaining a plurality of recalled first video templates specifically comprises:
acquiring template use time, template use frequency and template use preference of a target user, and determining the user side features according to the template use time, the template use frequency and the template use preference;
acquiring a template style, a template type and a template rhythm of a target video template, and determining the video template side features according to the template style, the template type and the template rhythm;
inputting the user side features and the video template side features into the double-tower recall model, and outputting recall rates of the target video templates;
and determining the target video templates with recall rates greater than or equal to a preset first threshold value as the first video templates.
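For concreteness, the following is a minimal Python/NumPy sketch of the double-tower recall of claim 2. The tower architecture (one hidden layer per tower), the sigmoid-scored inner product, and all dimensions and weights are assumptions; the claim fixes only the inputs (template use time, frequency, and preference on the user side; style, type, and rhythm on the template side), the double-tower structure, and the first-threshold filter on the recall rate.

    import numpy as np

    rng = np.random.default_rng(0)

    def tower(x, w1, w2):
        # One hidden layer with ReLU, as a stand-in for each tower network.
        return np.maximum(x @ w1, 0.0) @ w2

    # User side: template use time, template use frequency, template use preference.
    user_feats = np.array([[0.7, 0.3, 0.9]])
    # Template side, per candidate: template style, template type, template rhythm.
    template_feats = rng.random((100, 3))

    # Randomly initialised tower weights; in practice both towers are trained jointly.
    d_hidden, d_embed = 16, 8
    w_u1 = rng.normal(size=(3, d_hidden)); w_u2 = rng.normal(size=(d_hidden, d_embed))
    w_t1 = rng.normal(size=(3, d_hidden)); w_t2 = rng.normal(size=(d_hidden, d_embed))

    user_vec = tower(user_feats, w_u1, w_u2)           # shape (1, d_embed)
    template_vecs = tower(template_feats, w_t1, w_t2)  # shape (100, d_embed)

    # Recall rate per candidate: sigmoid of the inner product of the tower outputs.
    recall_rates = 1.0 / (1.0 + np.exp(-(template_vecs @ user_vec.T).ravel()))

    first_threshold = 0.5  # the "preset first threshold" of claim 2
    first_templates = np.flatnonzero(recall_rates >= first_threshold)
    print(f"recalled {first_templates.size} first video templates")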
3. The video generation method according to claim 1, wherein the step of determining a picture tag vector of a picture group to be synthesized and a template tag vector of the first video template, and calculating tag correlation coefficients of each first video template and the picture group to be synthesized according to the picture tag vector and the template tag vector, specifically comprises:
acquiring a picture group to be synthesized, which is uploaded by a target user, wherein the picture group to be synthesized comprises a plurality of pictures to be synthesized;
performing tag classification on each picture to be synthesized through a convolutional neural network to obtain a first picture tag of each picture to be synthesized, and generating the picture tag vector according to the first picture tags;
performing tag classification on the first video templates through a ResNet residual network to obtain a plurality of first template tags of the first video templates, and generating the template tag vectors according to the first template tags;
and determining cosine similarity of the picture tag vector and the template tag vector, and determining tag correlation coefficients of the first video templates and the picture group to be synthesized according to the cosine similarity.
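A minimal sketch of the tag correlation step of claim 3. The tag vocabulary and the multi-hot encoding are assumptions standing in for the outputs of the convolutional neural network and the ResNet classifier; the claim only requires that picture tags and template tags be expressed as vectors and compared by cosine similarity.

    import numpy as np

    TAGS = ["travel", "family", "food", "sport", "pet"]  # hypothetical vocabulary

    def to_multi_hot(tags):
        # Encode a set of tag strings as a multi-hot vector over TAGS.
        return np.array([1.0 if t in tags else 0.0 for t in TAGS])

    def cosine_similarity(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / denom) if denom else 0.0

    # Tags as a CNN / ResNet classifier might emit them for each input.
    picture_tags = to_multi_hot({"travel", "food"})     # from the picture group
    template_tags = to_multi_hot({"travel", "family"})  # from a first video template

    tag_corr = cosine_similarity(picture_tags, template_tags)
    print(f"tag correlation coefficient: {tag_corr:.3f}")  # 0.500 here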
4. The video generation method according to claim 3, wherein the step of determining a picture proportion vector of the picture group to be synthesized and a template proportion vector of the first video template, and calculating a proportion correlation coefficient of each first video template and the picture group to be synthesized according to the picture proportion vector and the template proportion vector, specifically comprises:
determining a first picture proportion of each picture to be synthesized, and generating a picture proportion vector according to the first picture proportion;
determining a first region proportion of each template region in the first video template, and generating the template proportion vector according to the first region proportion;
comparing the vector dimensions of the picture proportion vector and the template proportion vector, and, when the vector dimensions are inconsistent, padding the picture proportion vector or the template proportion vector with a preset filling vector, so as to obtain a picture proportion vector and a template proportion vector of consistent dimensions;
and determining the normalized distance between the picture proportion vector and the template proportion vector of consistent dimensions, and determining the proportion correlation coefficient of each first video template and the picture group to be synthesized according to the normalized distance.
5. The video generation method according to claim 4, wherein the normalized distance between the dimension-consistent picture proportion vector and the template proportion vector is determined according to the following equation:

[equation defining the normalized distance D_pic_video(V_pic, V_video); not reproduced in this text]

wherein r_pic_k represents the kth first picture proportion in the picture proportion vector V_pic, r_video_k represents the kth first region proportion in the template proportion vector V_video, N represents the common vector dimension of the picture proportion vector V_pic and the template proportion vector V_video, and D_pic_video(V_pic, V_video) represents the normalized distance between the picture proportion vector V_pic and the template proportion vector V_video.
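Because the claimed equation did not survive extraction, the following sketch substitutes one plausible normalization, the mean per-component relative difference, and maps it to a proportion correlation coefficient as 1 - D; the actual patented formula may differ. The padding value 1.0 (a square frame) is likewise an assumption standing in for the "preset filling vector" of claim 4.

    import numpy as np

    def pad_to_match(a, b, fill=1.0):
        # Right-pad the shorter proportion vector so the dimensions agree (claim 4).
        n = max(a.size, b.size)
        return (np.pad(a, (0, n - a.size), constant_values=fill),
                np.pad(b, (0, n - b.size), constant_values=fill))

    def normalized_distance(v_pic, v_video):
        # Assumed form: mean per-slot relative difference, always in [0, 1].
        return float(np.mean(np.abs(v_pic - v_video) / np.maximum(v_pic, v_video)))

    v_pic = np.array([16 / 9, 4 / 3, 1.0])  # first picture proportions r_pic_k
    v_video = np.array([16 / 9, 1.0])       # first region proportions r_video_k
    v_pic, v_video = pad_to_match(v_pic, v_video)

    d = normalized_distance(v_pic, v_video)
    proportion_corr = 1.0 - d  # one way to turn a distance into a correlation
    print(f"normalized distance {d:.3f}, proportion correlation {proportion_corr:.3f}")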
6. The video generation method according to claim 1, wherein the step of determining the matching degree of each first video template and the picture group to be synthesized according to the tag correlation coefficient and the proportion correlation coefficient specifically comprises:
determining a content attribute weight and an effect attribute weight of the first video template;
and taking the content attribute weight as the weight of the tag correlation coefficient, taking the effect attribute weight as the weight of the proportion correlation coefficient, and carrying out weighted summation on the tag correlation coefficient and the proportion correlation coefficient to obtain the matching degree of the first video template and the picture group to be synthesized.
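A one-function sketch of the weighted summation of claim 6. The weight values 0.6 and 0.4 are illustrative only; the claim fixes just that the content attribute weight multiplies the tag correlation coefficient and the effect attribute weight multiplies the proportion correlation coefficient.

    def matching_degree(tag_corr, proportion_corr,
                        content_weight=0.6, effect_weight=0.4):
        # Weighted summation of claim 6: content attribute weight on the tag
        # correlation, effect attribute weight on the proportion correlation.
        return content_weight * tag_corr + effect_weight * proportion_corr

    print(matching_degree(tag_corr=0.5, proportion_corr=0.917))  # 0.6668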
7. The video generation method according to any one of claims 1 to 6, wherein the step of selecting a second video template according to the matching degree and generating a first video according to the picture group to be synthesized and the second video template specifically comprises:
selecting, as second video templates, a plurality of first video templates whose matching degree is greater than or equal to a preset second threshold value, or selecting, as second video templates, a plurality of first video templates whose matching degree rank is less than or equal to a preset third threshold value;
and carrying out video synthesis on the picture group to be synthesized according to the second video template to generate the first video.
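The two alternative selection rules of claim 7 can be sketched as follows; the threshold values and the name "top_k" for the preset third threshold are assumptions.

    def select_second_templates(degrees, second_threshold=None, top_k=None):
        # degrees maps a template id to its matching degree. One of the two
        # rules of claim 7 applies: a score threshold (the "preset second
        # threshold") or a rank cutoff (top_k, the "preset third threshold").
        ranked = sorted(degrees, key=degrees.get, reverse=True)
        if second_threshold is not None:
            return [t for t in ranked if degrees[t] >= second_threshold]
        return ranked[:top_k]

    degrees = {"template_a": 0.82, "template_b": 0.67, "template_c": 0.74}
    print(select_second_templates(degrees, second_threshold=0.7))  # ['template_a', 'template_c']
    print(select_second_templates(degrees, top_k=2))               # ['template_a', 'template_c']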
8. A video generation system based on a video template, comprising:
the video template recall module is used for acquiring user side features and video template side features, inputting the user side features and the video template side features into a double-tower recall model, and obtaining a plurality of recalled first video templates;
the label correlation coefficient calculation module is used for determining a picture label vector of a picture group to be synthesized and a template label vector of the first video template, and calculating label correlation coefficients of the first video template and the picture group to be synthesized according to the picture label vector and the template label vector;
the proportion correlation coefficient calculation module is used for determining a picture proportion vector of the picture group to be synthesized and a template proportion vector of the first video template, and calculating proportion correlation coefficients of the first video template and the picture group to be synthesized according to the picture proportion vector and the template proportion vector;
the video template selection module is used for determining the matching degree of each first video template and the picture group to be synthesized according to the label correlation coefficient and the proportion correlation coefficient, selecting a second video template according to the matching degree, and generating a first video according to the picture group to be synthesized and the second video template.
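As an illustration of how the four claimed modules of claim 8 might be composed, here is a hypothetical wiring class; the class and method names are not from the patent, and the callables it holds stand in for the recall, tag, proportion, and selection logic.

    class VideoTemplateSystem:
        # The four module slots mirror claim 8.
        def __init__(self, recall_module, tag_module, proportion_module, selection_module):
            self.recall = recall_module
            self.tag = tag_module
            self.proportion = proportion_module
            self.selection = selection_module

        def generate_video(self, user_feats, template_feats, pictures):
            firsts = self.recall(user_feats, template_feats)
            tag_corrs = {t: self.tag(pictures, t) for t in firsts}
            prop_corrs = {t: self.proportion(pictures, t) for t in firsts}
            # The selection module computes matching degrees, picks the second
            # template, and synthesises the first video from the picture group.
            return self.selection(pictures, tag_corrs, prop_corrs)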
9. An electronic device, characterized in that the electronic device comprises a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for enabling communication between the processor and the memory, the program, when executed by the processor, implementing the steps of the video generation method based on a video template according to any one of claims 1 to 7.
10. A storage medium, the storage medium being a computer-readable storage medium for computer-readable storage, characterized in that the storage medium stores one or more programs, the one or more programs being executable by one or more processors to implement the steps of the video generation method based on a video template according to any one of claims 1 to 7.
CN202310735557.9A 2023-06-20 2023-06-20 Video generation method, system, electronic equipment and medium based on video template Active CN116915925B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310735557.9A CN116915925B (en) 2023-06-20 2023-06-20 Video generation method, system, electronic equipment and medium based on video template

Publications (2)

Publication Number Publication Date
CN116915925A true CN116915925A (en) 2023-10-20
CN116915925B CN116915925B (en) 2024-02-23

Family

ID=88355394

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310735557.9A Active CN116915925B (en) 2023-06-20 2023-06-20 Video generation method, system, electronic equipment and medium based on video template

Country Status (1)

Country Link
CN (1) CN116915925B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111105819A (en) * 2019-12-13 2020-05-05 北京达佳互联信息技术有限公司 Clipping template recommendation method and device, electronic equipment and storage medium
US10657176B1 (en) * 2019-06-11 2020-05-19 Amazon Technologies, Inc. Associating object related keywords with video metadata
CN112989118A (en) * 2021-02-04 2021-06-18 北京奇艺世纪科技有限公司 Video recall method and device
CN113094552A (en) * 2021-03-19 2021-07-09 北京达佳互联信息技术有限公司 Video template searching method and device, server and readable storage medium
CN114501076A (en) * 2022-02-07 2022-05-13 浙江核新同花顺网络信息股份有限公司 Video generation method, apparatus, and medium

Also Published As

Publication number Publication date
CN116915925B (en) 2024-02-23

Similar Documents

Publication Publication Date Title
CN110574387B (en) Recommending live streaming content using machine learning
CN110321422B (en) Method for training model on line, pushing method, device and equipment
US10747802B1 (en) Image recommendations for thumbnails for online media items based on user activity
CN109511015B (en) Multimedia resource recommendation method, device, storage medium and equipment
CN110909205A (en) Video cover determination method and device, electronic equipment and readable storage medium
US20220107978A1 (en) Method for recommending video content
CN110149529B (en) Media information processing method, server and storage medium
US20140222831A1 (en) Method and system for personalized delivery of media content
WO2022007626A1 (en) Video content recommendation method and apparatus, and computer device
CN113704506A (en) Media content duplication eliminating method and related device
CN109819002B (en) Data pushing method and device, storage medium and electronic device
US9959322B1 (en) Ranking channels in search
CN108563648B (en) Data display method and device, storage medium and electronic device
CN116915925B (en) Video generation method, system, electronic equipment and medium based on video template
CN110569447A (en) network resource recommendation method and device and storage medium
US10635676B2 (en) Algorithmic radio for arbitrary textual queries
US10216842B2 (en) Method for clustering results from a same channel
US11977599B2 (en) Matching video content to podcast episodes
US11727046B2 (en) Media item matching using search query analysis
JP6294371B2 (en) Information processing apparatus, information processing system, information processing method, and program
WO2023059653A9 (en) Matching video content to podcast episodes
EP2608058A1 (en) Method for obtaining user personalized data on audio/video content and corresponding device
CN113987222A (en) Multimedia content recommendation method, device, equipment and storage medium
CN115080789A (en) Model training and recommending method and device, electronic equipment and storage medium
CN114637913A (en) Click prediction model training method, object recommendation method, device and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant