CN116614672A - Method for automatically mixing and cutting video based on text-video retrieval - Google Patents

Method for automatically mixing and cutting video based on text-video retrieval

Info

Publication number
CN116614672A
Authority
CN
China
Prior art keywords
video
text
model
results
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310588510.4A
Other languages
Chinese (zh)
Inventor
丁岩
柴兆虎
赵宇迪
施侃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Shuchuan Data Technology Co ltd
Original Assignee
Shanghai Shuchuan Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Shuchuan Data Technology Co ltd
Priority to CN202310588510.4A
Publication of CN116614672A
Legal status: Pending

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/466Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/4662Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms
    • H04N21/4666Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms using neural networks, e.g. processing the feedback provided by the user
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/475End-user interface for inputting end-user data, e.g. personal identification number [PIN], preference data
    • H04N21/4756End-user interface for inputting end-user data, e.g. personal identification number [PIN], preference data for rating content, e.g. scoring a recommended movie
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The invention relates to the technical field of video editing, and in particular to a method for automatically generating mixed-cut videos based on text-video retrieval, comprising the following steps: video cutting and library building: source videos are divided into segments to form a video material library; text retrieval: a text paragraph contains L sentences, and for each sentence the best-matching video clips are retrieved; video stitching: among the possible combinations of the retrieved clips, the combination with the highest consistency is selected. The invention has the advantages that a user can obtain a mixed-cut video corresponding to the text content simply by editing the text, and the segments of the mixed-cut video are close in style and highly consistent. Because the video is cut according to semantics with the help of an image-text model, the segmentation results are more accurate and more tightly aligned with the semantics, yielding several semantically self-contained clips that are convenient for subsequent use. A scoring model is introduced to score the feature combinations of candidate clips, so that the most suitable result, with the most consistent internal style, is selected.

Description

Method for automatically mixing and cutting video based on text-video retrieval
Technical Field
The invention relates to the technical field of video editing, and in particular to a method for automatically generating mixed-cut videos based on text-video retrieval.
Background
With the development of technology in recent years, vast amounts of video data are shared on the network every moment. How to organize the enormous video resources on the network so that people can use them, and so that their great value can be realized, is a problem that urgently needs to be solved. Video mix-cutting is currently one of the popular applications: suitable video clips are selected according to a number of sentences and assembled into a complete video.
Faced with massive video resources, common mix-cutting workflows require the cutting, searching, splicing and other work to be done manually; this is slow, costly and often inaccurate, and more valuable videos may be missed; as the scale of the video data grows, these drawbacks are multiplied.
To replace manual video cutting, various methods for automatically cutting video have been proposed in recent years, such as cutting video with image-processing techniques like histograms, or introducing deep-learning methods to predict the cut positions; however, the former performs poorly and cannot cut shots accurately, while the latter requires a large amount of labeled data and covers only limited scenes, so that when the test data differs greatly from the training data the segmentation results are inaccurate. Moreover, these methods segment at the image level without considering the semantics of the video segments, so the resulting segments are semantically incomplete or contain several semantics, which degrades the quality of the subsequently spliced video.
When the clips retrieved for a number of sentences are spliced into a video, current practice often splices randomly selected results directly, so the spliced video tends to be ambiguous in style and insufficiently coordinated; manual intervention has been proposed to replace and adjust clips, since there is no data that can be used directly; but annotating videos manually involves an enormous workload, wastes time and labor, and the final effect is often unstable because of subjective differences between people.
Disclosure of Invention
The object of the invention is to provide an automatic video mix-cutting method based on text-video retrieval, which has the advantages that a user can obtain a mixed-cut video corresponding to the text content simply by editing the text and that the segments of the mixed-cut video are close in style and highly consistent, thereby solving the problems described in the background art.
In order to achieve the above object, the present invention provides the following technical solution: a method for automatically mixing and cutting video based on text-video retrieval, comprising the following steps:
s1: video cutting and library building: dividing source videos into segments to form a video material library;
s2: text retrieval: a text paragraph contains L sentences, and for each sentence the best-matching video clips are retrieved;
s3: video stitching: among the combinations of different clips, selecting the combination with the highest consistency;
s4: post-processing: adjusting and unifying the resolutions of the clips to finally obtain a video that is consistent in style and conforms to the text.
Preferably, step S1 uses a multimodal image-text model to cut the video into single-shot, semantically independent segments, and comprises the following steps:
s1.1, first decoding the original video and converting it into a sequence of frames;
s1.2, feeding the obtained frames sequentially into the multimodal image-text model to obtain a vector serving as the feature of each frame;
s1.3, computing the similarity between the features of adjacent frames: the higher the score, the closer the two frames; the lower the score, the greater the difference between them;
s1.4, judging from the similarity of adjacent frames whether the current position is a cut point: if the similarity of adjacent frames is below a threshold, the difference is large, and the position is taken as a cut point so that the preceding and following frames fall into different segments;
s1.5, finally, computing the feature of every frame of each segmented video clip with the multimodal model, averaging all the features, and taking the average vector as the representative feature of that clip.
Preferably, in step S2, a text paragraph is split into a number of sentences; each sentence is fed into the multimodal model to extract its feature, the feature is then used to traverse the video material library and compute the similarity with the feature of every video clip, and finally the N highest-scoring results are obtained;
since one text paragraph contains L sentences and each sentence is used to search the videos in turn, returning N results each time, L×N results are obtained in total.
Preferably, step S3 comprises the following steps:
s3.1, random splicing: to splice the results obtained for the L sentences in S2 into a video, the simplest method is to randomly select one of the N results of each sentence and directly concatenate them into a video;
s3.2, using a scoring model: for the retrieval results of S2, one of the N results returned for each sentence must be selected such that, when it is spliced with the results selected for the other sentences, the video styles are closest and the score is highest.
Preferably, step S3.2 comprises the following steps:
s3.2.1, training the scoring model requires a large amount of data; however, no suitable dataset has been published so far, and manual annotation would involve too much work to be practical; the invention therefore constructs the data automatically from existing original videos;
s3.2.2, after the model converges, the scoring model can be applied to select suitable clips: it scores an input combination of video-clip features, assigning a higher score to combinations with high internal consistency and, conversely, a lower score otherwise.
Preferably, in step S3.2.1, the implementation flow is as follows:
first, a large number of videos taken from films, short videos and the like are collected, and each video is ensured to be continuous;
then, each original video is segmented to obtain the features of several video clips; these features form a sequence that serves as a positive sample with label 1, and at the same time the clips and their features are added to the video material library;
then, for each positive sample, the video material library is searched with the feature of each of its clips to obtain the closest clip and its feature, and the retrieved features are assembled into a sequence that serves as a negative sample with label 0;
this process is repeated many times to obtain sufficient positive and negative sample data.
Preferably, in step S3.2.1, after the positive and negative sample data are obtained, they are used to train and optimize the model, with cross-entropy loss as the loss function;
during training, as the model converges, online hard-example mining can be applied to the training data: a sample is fed into the model and its loss is computed, and the loss determines whether the sample should be trained on additional times; if the loss is large, the sample is added to the next batch of training data for further training.
Preferably, in step S3.2.2, the retrieval results of S2 are combined sentence by sentence with the results of the preceding sentences, and the combined features are fed into the scoring model. For example, the first sentence has N results, and when the N results of the second sentence are appended there are N×N combinations;
continuing in the same way, judging all the possibilities for the L sentences gives N^L combinations in total, which are fed into the scoring model for judgment to obtain the optimal result.
Preferably, in step S4, because the resolutions of the selected video clips differ, the maximum resolution among the clips is selected when the video is synthesized, so that the video format is uniform, and each clip is padded with black borders before the clips are spliced.
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a method for automatically constructing data, which trains a model by using the data to obtain a better result, has the advantages that a user can obtain a mixed-cut video corresponding to text content by editing the text, each segment of the mixed-cut video has a similar style and higher consistency, and solves the following problems:
1. cutting video according to semantics: when the traditional method is used for cutting the video, semantic information is not considered, the obtained video fragments contain a plurality of semantemes, and the effect of finally forming the video can be influenced.
2. Ensuring video consistency: when selecting video clips, the traditional video splicing method is used for randomly selecting video clips, the inherent consistency of a plurality of video clips is not considered, or manual selection is used, the efficiency is low, the mixed shearing video method is used for introducing a scoring model to score the characteristic combination of various video clips, and the result with similar inherent style and more proper effect is selected.
Drawings
FIG. 1 is a flow chart of the automatic video mix-cutting method according to the present invention.
Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings; it is apparent that the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without making any inventive effort fall within the scope of the invention.
Referring to FIG. 1, a method for automatically mixing and cutting video based on text-video retrieval comprises the following steps:
s1: video cutting and library building: dividing source videos into segments to form a video material library;
s2: text retrieval: a text paragraph contains L sentences, and for each sentence the best-matching video clips are retrieved;
s3: video stitching: among the combinations of different clips, selecting the combination with the highest consistency;
s4: post-processing: adjusting and unifying the resolutions of the clips to finally obtain a video that is consistent in style and conforms to the text.
As a further preferred scheme of the invention, step S1 uses a multimodal image-text model to cut the video into single-shot, semantically independent segments, and comprises the following steps:
s1.1, first decoding the original video and converting it into a sequence of frames;
s1.2, feeding the obtained frames sequentially into the multimodal image-text model to obtain a vector serving as the feature of each frame;
s1.3, computing the similarity between the features of adjacent frames: the higher the score, the closer the two frames; the lower the score, the greater the difference between them;
s1.4, judging from the similarity of adjacent frames whether the current position is a cut point: if the similarity of adjacent frames is below a threshold, the difference is large, and the position is taken as a cut point so that the preceding and following frames fall into different segments;
s1.5, finally, computing the feature of every frame of each segmented video clip with the multimodal model, averaging all the features, and taking the average vector as the representative feature of that clip (an illustrative sketch of this procedure is given below).
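The following is a minimal, non-limiting sketch of steps S1.1-S1.5, assuming OpenCV for decoding and an open-source CLIP-style image encoder (open_clip, ViT-B/32) as the multimodal image-text model; the sampling stride, model name and similarity threshold are illustrative assumptions rather than part of the claimed method.

```python
# Illustrative sketch of steps S1.1-S1.5 (shot cutting and clip features).
# Assumptions: OpenCV for decoding, an open_clip ViT-B/32 image encoder as the
# multimodal image-text model, cosine similarity, and a hypothetical threshold.
import cv2
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
model.eval()

def frame_features(video_path, sample_every=5):
    """S1.1-S1.2: decode the video and embed sampled frames."""
    cap, feats, idx = cv2.VideoCapture(video_path), [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:
            img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            with torch.no_grad():
                f = model.encode_image(preprocess(img).unsqueeze(0))
            feats.append(torch.nn.functional.normalize(f, dim=-1).squeeze(0))
        idx += 1
    cap.release()
    return torch.stack(feats)                     # (num_frames, dim)

def cut_into_clips(feats, threshold=0.85):
    """S1.3-S1.5: cut where adjacent-frame similarity drops below the threshold,
    then represent each clip by the mean of its frame features."""
    sims = (feats[:-1] * feats[1:]).sum(dim=-1)   # cosine similarity of neighbours
    cuts = (sims < threshold).nonzero().flatten().tolist()
    bounds = [0] + [c + 1 for c in cuts] + [len(feats)]
    return [(s, e, feats[s:e].mean(dim=0))        # (start, end, representative feature)
            for s, e in zip(bounds, bounds[1:]) if e > s]
```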
In step S2, a text paragraph is split into a number of sentences; each sentence is fed into the multimodal model to extract its feature, the feature is then used to traverse the video material library and compute the similarity with the feature of every video clip, and finally the N highest-scoring results are obtained;
since one text paragraph contains L sentences and each sentence is used to search the videos in turn, returning N results each time, L×N results are obtained in total. A sketch of this retrieval step follows.
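As a non-limiting illustration, the sketch below reuses the open_clip model from the previous snippet and assumes the material library is stored as a single feature matrix; the function and parameter names are hypothetical.

```python
# Illustrative sketch of step S2 (per-sentence text retrieval).
# Assumptions: the open_clip text encoder from the snippet above and a material
# library stored as one feature matrix plus clip records.
import torch
import open_clip

tokenizer = open_clip.get_tokenizer("ViT-B-32")

def retrieve_per_sentence(sentences, library_feats, top_n=5):
    """Return, for each of the L sentences, the indices of its N best clips
    (L x N results in total)."""
    with torch.no_grad():
        text_feats = model.encode_text(tokenizer(sentences))
    text_feats = torch.nn.functional.normalize(text_feats, dim=-1)
    lib = torch.nn.functional.normalize(library_feats, dim=-1)
    scores = text_feats @ lib.T                   # (L, num_clips) similarity matrix
    return [row.topk(top_n).indices.tolist() for row in scores]
```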
As a further preferable embodiment of the present invention, step S3 comprises the following steps:
s3.1, random splicing: to splice the results obtained for the L sentences in S2 into a video, the simplest method is to randomly select one of the N results of each sentence and directly concatenate them into a video;
s3.2, using a scoring model: for the retrieval results of S2, one of the N results returned for each sentence must be selected such that, when it is spliced with the results selected for the other sentences, the video styles are closest and the score is highest.
As a further preferred embodiment of the present invention, step S3.2 comprises the following steps:
s3.2.1, training the scoring model requires a large amount of data; however, no suitable dataset has been published so far, and manual annotation would involve too much work to be practical; the invention therefore constructs the data automatically from existing original videos;
s3.2.2, after the model converges, the scoring model can be applied to select suitable clips: it scores an input combination of video-clip features, assigning a higher score to combinations with high internal consistency and, conversely, a lower score otherwise (an assumed model architecture is sketched below).
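The patent does not specify the scoring model's architecture; purely as an assumption for illustration, a small Transformer encoder over the clip-feature sequence with a sigmoid head could serve as the model that maps a feature combination to a consistency score.

```python
# Assumed scoring-model architecture for S3.2: a small Transformer encoder over
# the sequence of clip features with a sigmoid "consistency" head. The patent
# only requires a model that maps a feature combination to a score.
import torch
import torch.nn as nn

class ConsistencyScorer(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, layers=2, heads=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden)
        enc_layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.head = nn.Linear(hidden, 1)

    def forward(self, clip_feats):                # (batch, num_clips, feat_dim)
        h = self.encoder(self.proj(clip_feats))
        return torch.sigmoid(self.head(h.mean(dim=1))).squeeze(-1)   # (batch,)

# Usage: a higher score means the clip combination is more internally consistent.
scorer = ConsistencyScorer()
print(scorer(torch.randn(1, 3, 512)))             # one candidate combination of 3 clips
```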
As a further preferred embodiment of the present invention, in step S3.2.1, the implementation flow is as follows (a sketch is given after this list):
first, a large number of videos taken from films, short videos and the like are collected, and each video is ensured to be continuous;
then, each original video is segmented to obtain the features of several video clips; these features form a sequence that serves as a positive sample with label 1, and at the same time the clips and their features are added to the video material library;
then, for each positive sample, the video material library is searched with the feature of each of its clips to obtain the closest clip and its feature, and the retrieved features are assembled into a sequence that serves as a negative sample with label 0;
this process is repeated many times to obtain sufficient positive and negative sample data.
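A minimal sketch of this data construction follows. Reading the text as replacing every clip of a positive sequence with its nearest neighbour from the material library when forming a negative is an assumption; the exact replacement strategy is not fixed by the patent.

```python
# Illustrative sketch of the automatic construction of positive/negative samples.
# Assumption: a negative is built by replacing every clip feature of a positive
# sequence with its nearest neighbour retrieved from the material library.
import torch

def build_samples(per_video_feats, library_feats):
    """per_video_feats: list of tensors (num_clips, dim), one per original video,
    in the original clip order. Returns (sequences, labels): label 1 for the
    original order, label 0 for the retrieved mixture."""
    sequences, labels = [], []
    lib = torch.nn.functional.normalize(library_feats, dim=-1)
    for feats in per_video_feats:
        sequences.append(feats)                   # positive: contiguous original clips
        labels.append(1.0)
        q = torch.nn.functional.normalize(feats, dim=-1)
        sims = q @ lib.T                          # similarity to every library clip
        # mask out the exact self-match so the nearest *other* clip is retrieved
        sims.scatter_(1, sims.argmax(dim=1, keepdim=True), -1.0)
        sequences.append(library_feats[sims.argmax(dim=1)])   # negative sequence
        labels.append(0.0)
    return sequences, torch.tensor(labels)
```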
As a further preferred embodiment of the present invention, in step S3.2.1, after the positive and negative sample data are obtained, they are used to train and optimize the model, with cross-entropy loss as the loss function;
for example, suppose the original videos A, B and C are segmented into the feature sequences A0-A1-A2, B0-B1-B2 and C0-C1-C2-C3; the following positive samples can then be obtained:
A0-A1-A2, B0-B1-B2, C0-C1-C2 and C1-C2-C3, each with label 1;
at the same time, negative samples are obtained:
A0-C1-A2, A0-B1-C2, C0-A1-B2, A1-C3-B2 and so on, each with label 0;
during training, as the model converges, online hard-example mining can be applied to the training data: a sample is fed into the model and its loss is computed, and the loss determines whether the sample should be trained on additional times; if the loss is large, the sample is added to the next batch of training data for further training. A sketch of this training procedure follows.
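A minimal training-loop sketch is shown below, using binary cross-entropy (the cross-entropy loss for 0/1 labels) and a simple interpretation of the online hard-example mining described above; the learning rate, epoch count and hard-sample threshold are assumptions.

```python
# Illustrative training loop: binary cross-entropy on the 0/1 labels and a
# simple form of online hard-example mining in which high-loss samples are
# queued again for the next pass. Hyper-parameters are assumptions.
import torch
import torch.nn as nn
from collections import deque

def train_scorer(scorer, sequences, labels, epochs=10, lr=1e-4, hard_loss=0.5):
    opt = torch.optim.Adam(scorer.parameters(), lr=lr)
    bce = nn.BCELoss()
    hard_queue = deque()
    for _ in range(epochs):
        order = list(range(len(sequences))) + list(hard_queue)
        hard_queue.clear()
        for i in order:
            pred = scorer(sequences[i].unsqueeze(0))          # (1,)
            loss = bce(pred, labels[i].unsqueeze(0))
            opt.zero_grad()
            loss.backward()
            opt.step()
            if loss.item() > hard_loss:                       # hard sample: train again
                hard_queue.append(i)
    return scorer
```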
As a further preferred embodiment of the present invention, in step S3.2.2, the retrieval results of S2 are combined sentence by sentence with the results of the preceding sentences, and the combined features are fed into the scoring model. For example, the first sentence has N results, and when the N results of the second sentence are appended there are N×N combinations;
continuing in the same way, judging all the possibilities for the L sentences gives N^L combinations in total, which are fed into the scoring model for judgment to obtain the optimal result.
The search space of this exhaustive method is huge, and devices such as embedded equipment cannot bear the computational load; a compromise method can therefore be adopted that reduces the search space and still obtains relatively good results, so that different strategies can be chosen according to the performance of the device;
for example, assume N=2; for the sentences A, B and C, after searching the video material library, the video-clip features A0 and A1, B0 and B1, and C0 and C1 are returned respectively.
For sentences A and B there are 4 combinations: A0-B0, A0-B1, A1-B0 and A1-B1; after model scoring, the 2 highest-scoring combinations are kept, say A1-B0 and A0-B1.
On this basis sentence C is added, giving the 4 combinations A1-B0-C0, A1-B0-C1, A0-B1-C0 and A0-B1-C1; after model scoring the highest-scoring combination is obtained, say A1-B0-C0, which is the finally selected combination of video clips.
Through the judgment of the scoring model, the selected video clips are more similar in style and the whole video is better coordinated. A sketch of this beam-style search is given below.
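The compromise described above is essentially a beam search over the N^L combinations; the sketch below is an assumed implementation in which only a fixed number of best-scoring partial combinations is kept per sentence.

```python
# Assumed beam-search implementation of the compromise described above: at each
# sentence only `beam` best-scoring partial combinations are kept, instead of
# enumerating all N^L combinations.
import torch

def select_combination(scorer, per_sentence_feats, beam=2):
    """per_sentence_feats: list of L lists, each holding the N retrieved clip
    features (tensors of shape (dim,)) for one sentence."""
    beams = [[f] for f in per_sentence_feats[0]]
    for feats in per_sentence_feats[1:]:
        candidates = [prefix + [f] for prefix in beams for f in feats]
        scores = scorer(torch.stack([torch.stack(c) for c in candidates]))
        keep = scores.topk(min(beam, len(candidates))).indices.tolist()
        beams = [candidates[i] for i in keep]
    final_scores = scorer(torch.stack([torch.stack(c) for c in beams]))
    return beams[final_scores.argmax().item()]    # highest-scoring clip combination
```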
As a further preferred scheme of the invention, in step S4, because the resolutions of the selected video clips differ, the maximum resolution among the clips is selected as the output resolution when the video is synthesized, so that the video format is uniform; every clip is padded with black borders and the clips are then spliced, and the finally obtained post-processed video is better coordinated and more attractive. A minimal sketch of this step follows.
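A minimal OpenCV-based sketch of the post-processing step, assuming black-border (letterbox) padding to the maximum resolution followed by simple concatenation; the codec, frame rate and file handling are illustrative assumptions.

```python
# Illustrative post-processing for S4: letterbox every frame to the maximum
# resolution with black borders, then concatenate the clips with OpenCV.
# The codec and the fixed frame rate are simplifying assumptions.
import cv2

def letterbox(frame, out_w, out_h):
    """Scale the frame to fit and centre it on a black canvas of the target size."""
    h, w = frame.shape[:2]
    scale = min(out_w / w, out_h / h)
    resized = cv2.resize(frame, (int(w * scale), int(h * scale)))
    top = (out_h - resized.shape[0]) // 2
    left = (out_w - resized.shape[1]) // 2
    return cv2.copyMakeBorder(resized, top, out_h - resized.shape[0] - top,
                              left, out_w - resized.shape[1] - left,
                              cv2.BORDER_CONSTANT, value=(0, 0, 0))

def stitch(clip_paths, out_path, out_w, out_h, fps=25):
    """Splice the selected clips into one video at the unified resolution."""
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (out_w, out_h))
    for path in clip_paths:
        cap = cv2.VideoCapture(path)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            writer.write(letterbox(frame, out_w, out_h))
        cap.release()
    writer.release()
```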
The generation effect of the method is illustrated by the following example.
The input text is:
A. Every girl has a dream of being Snow White;
B. Marrying the Prince Charming she loves is her life's goal;
C. She too has dreamed the Snow White dream;
D. She often secretly admires the Snow White of the fairy tale, with her jet-black hair;
E. Bright red lips and snow-white skin;
F. Dancing with Prince Charming in a blooming sea of lilies.
The video material library is searched according to this text to obtain a number of candidate video-clip combinations, and the scoring model is used to select the most suitable combination.
To sum up: the invention provides a method for automatically constructing training data and trains a model with this data to obtain better results; its advantages are that a user can obtain a mixed-cut video corresponding to the text content simply by editing the text and that the segments of the mixed-cut video are similar in style and highly consistent. It solves the following problems:
1. Cutting video according to semantics: when traditional methods cut video, semantic information is not considered, so the resulting clips may contain several semantics, which affects the quality of the final composed video; the invention instead cuts the video according to semantics with the help of the image-text model.
2. Ensuring video consistency: when selecting video clips, traditional splicing methods either pick clips at random, ignoring the internal consistency among the clips, or rely on inefficient manual selection; the present mix-cutting method introduces a scoring model that scores the feature combinations of candidate clips and selects the result whose internal style is most similar and most suitable.
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising" or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article or apparatus. Without further limitation, an element preceded by "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or apparatus that comprises the element.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (9)

1. A method for automatically mixing and cutting video based on text-video retrieval, characterized by comprising the following steps:
s1: video cutting and library building: dividing source videos into segments to form a video material library;
s2: text retrieval: a text paragraph contains L sentences, and for each sentence the best-matching video clips are retrieved;
s3: video stitching: among the combinations of different clips, selecting the combination with the highest consistency;
s4: post-processing: adjusting and unifying the resolutions of the clips to finally obtain a video that is consistent in style and conforms to the text.
2. The method for automatically mixing and cutting video based on text-video retrieval according to claim 1, characterized in that: step S1 uses a multimodal image-text model to cut the video into single-shot, semantically independent segments, and comprises the following steps:
s1.1, first decoding the original video and converting it into a sequence of frames;
s1.2, feeding the obtained frames sequentially into the multimodal image-text model to obtain a vector serving as the feature of each frame;
s1.3, computing the similarity between the features of adjacent frames, wherein the higher the score, the closer the two frames, and the lower the score, the greater the difference between them;
s1.4, judging from the similarity of adjacent frames whether the current position is a cut point: if the similarity of adjacent frames is below a threshold, the difference is large, and the position is taken as a cut point so that the preceding and following frames fall into different segments;
s1.5, finally, computing the feature of every frame of each segmented video clip with the multimodal model, averaging all the features, and taking the average vector as the representative feature of that clip.
3. The method for automatically mixing and cutting video based on text-video retrieval according to claim 1, characterized in that: in step S2, a text paragraph is split into a number of sentences; each sentence is fed into the multimodal model to extract its feature, the feature is then used to traverse the video material library and compute the similarity with the feature of every video clip, and finally the N highest-scoring results are obtained;
since one text paragraph contains L sentences and each sentence is used to search the videos in turn, returning N results each time, L×N results are obtained in total.
4. The method for automatically mixing and cutting video based on text-video retrieval according to claim 1, characterized in that: step S3 comprises the following steps:
s3.1, random splicing: to splice the results obtained for the L sentences in S2 into a video, the simplest method is to randomly select one of the N results of each sentence and directly concatenate them into a video;
s3.2, using a scoring model: for the retrieval results of S2, one of the N results returned for each sentence must be selected such that, when it is spliced with the results selected for the other sentences, the video styles are closest and the score is highest.
5. The method for automatically mixing and cutting video based on text-video retrieval according to claim 4, characterized in that: step S3.2 comprises the following steps:
s3.2.1, training the scoring model requires a large amount of data; however, no suitable dataset has been published so far, and manual annotation would involve too much work to be practical; the invention therefore constructs the data automatically from existing original videos;
s3.2.2, after the model converges, the scoring model can be applied to select suitable clips: it scores an input combination of video-clip features, assigning a higher score to combinations with high internal consistency and, conversely, a lower score otherwise.
6. The method for automatically mixing and cutting video based on text-video retrieval according to claim 5, characterized in that: in step S3.2.1, the implementation flow is as follows:
first, a large number of videos taken from films, short videos and the like are collected, and each video is ensured to be continuous;
then, each original video is segmented to obtain the features of several video clips; these features form a sequence that serves as a positive sample with label 1, and at the same time the clips and their features are added to the video material library;
then, for each positive sample, the video material library is searched with the feature of each of its clips to obtain the closest clip and its feature, and the retrieved features are assembled into a sequence that serves as a negative sample with label 0;
this process is repeated many times to obtain sufficient positive and negative sample data.
7. The method for automatically mixing and cutting video based on text-video retrieval according to claim 5, characterized in that: in step S3.2.1, after the positive and negative sample data are obtained, they are used to train and optimize the model, with cross-entropy loss as the loss function;
during training, as the model converges, online hard-example mining can be applied to the training data: a sample is fed into the model and its loss is computed, and the loss determines whether the sample should be trained on additional times; if the loss is large, the sample is added to the next batch of training data for further training.
8. The method for automatically mixing and cutting video based on text-video retrieval according to claim 5, characterized in that: in step S3.2.2, the retrieval results of S2 are combined sentence by sentence with the results of the preceding sentences, and the combined features are fed into the scoring model; for example, the first sentence has N results, and when the N results of the second sentence are appended there are N×N combinations;
continuing in the same way, judging all the possibilities for the L sentences gives N^L combinations in total, which are fed into the scoring model for judgment to obtain the optimal result.
9. The method for automatically mixing and cutting video based on text-video retrieval according to claim 1, characterized in that: in step S4, because the resolutions of the selected video clips differ, the maximum resolution among the clips is selected when the video is synthesized, so that the video format is uniform, and each clip is padded with black borders before the clips are spliced.
CN202310588510.4A 2023-05-24 2023-05-24 Method for automatically mixing and cutting video based on text-video retrieval Pending CN116614672A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310588510.4A CN116614672A (en) 2023-05-24 2023-05-24 Method for automatically mixing and cutting video based on text-video retrieval

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310588510.4A CN116614672A (en) 2023-05-24 2023-05-24 Method for automatically mixing and cutting video based on text-video retrieval

Publications (1)

Publication Number Publication Date
CN116614672A true CN116614672A (en) 2023-08-18

Family

ID=87679650

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310588510.4A Pending CN116614672A (en) 2023-05-24 2023-05-24 Method for automatically mixing and cutting video based on text-video retrieval

Country Status (1)

Country Link
CN (1) CN116614672A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117692676A (en) * 2023-12-08 2024-03-12 广东创意热店互联网科技有限公司 Video quick editing method based on artificial intelligence technology


Similar Documents

Publication Publication Date Title
CN112015949B (en) Video generation method and device, storage medium and electronic equipment
CN108769801B (en) Synthetic method, device, equipment and the storage medium of short-sighted frequency
CN113709561B (en) Video editing method, device, equipment and storage medium
CN101872346B (en) Method for generating video navigation system automatically
CN112511854B (en) Live video highlight generation method, device, medium and equipment
CN111866585A (en) Video processing method and device
CN113641859B (en) Script generation method, system, computer storage medium and computer program product
CN109756751A (en) Multimedia data processing method and device, electronic equipment, storage medium
CN111083393A (en) Method for intelligently making short video
CN114419387A (en) Cross-modal retrieval system and method based on pre-training model and recall ranking
CN110019852A (en) Multimedia resource searching method and device
CN109005451B (en) Video strip splitting method based on deep learning
CN116614672A (en) Method for automatically mixing and cutting video based on text-video retrieval
CN114731458A (en) Video processing method, video processing apparatus, terminal device, and storage medium
CN113115055B (en) User portrait and live video file editing method based on viewing behavior
CN112784078A (en) Video automatic editing method based on semantic recognition
CN112004138A (en) Intelligent video material searching and matching method and device
CN108710860B (en) Video news segmentation method and device
CN110610500A (en) News video self-adaptive strip splitting method based on dynamic semantic features
CN117201715A (en) Video generation method and device and readable storage medium
CN113704506A (en) Media content duplication eliminating method and related device
CN114938473B (en) Comment video generation method and comment video generation device
CN112800263A (en) Video synthesis system, method and medium based on artificial intelligence
CN113660526B (en) Script generation method, system, computer storage medium and computer program product
CN113992973B (en) Video abstract generation method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination