CN116614672A - Method for automatically mixing and cutting video based on text-video retrieval - Google Patents

Method for automatically mixing and cutting video based on text-video retrieval

Info

Publication number
CN116614672A
Authority
CN
China
Prior art keywords
video
text
model
results
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310588510.4A
Other languages
Chinese (zh)
Inventor
丁岩
柴兆虎
赵宇迪
施侃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Shuchuan Data Technology Co ltd
Original Assignee
Shanghai Shuchuan Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Shuchuan Data Technology Co ltd
Priority to CN202310588510.4A
Publication of CN116614672A
Legal status: Pending

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/466Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/4662Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms
    • H04N21/4666Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms using neural networks, e.g. processing the feedback provided by the user
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/475End-user interface for inputting end-user data, e.g. personal identification number [PIN], preference data
    • H04N21/4756End-user interface for inputting end-user data, e.g. personal identification number [PIN], preference data for rating content, e.g. scoring a recommended movie
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The invention relates to the technical field of video editing, and in particular to a method for automatically generating mixed-cut videos based on text-video retrieval, comprising the following steps: video cutting and library building: source videos are divided into segments to form a video material library; text retrieval: a text paragraph contains L sentences, and for each sentence the best-matching video clips are retrieved; video stitching: among the possible combinations of the retrieved clips, the combination with the highest consistency is selected. The invention has the advantages that a user can obtain a mixed-cut video corresponding to the text content simply by editing the text, and the segments of the mixed-cut video are close in style and highly consistent. Because the video is cut according to semantics with the help of an image-text model, the segmentation results are more accurate and more tightly aligned with the semantics, yielding several semantically self-contained clips that are convenient for subsequent use. A scoring model is introduced to score the feature combinations of candidate clips, so that the most suitable result, with the most consistent internal style, is selected.

Description

Method for automatically mixing and cutting video based on text-video retrieval
Technical Field
The invention relates to the technical field of video editing, and in particular to a method for automatically generating mixed-cut videos based on text-video retrieval.
Background
With the development of technology in recent years, vast amounts of video data are shared on the network every moment. How to organize the enormous video resources on the network so that people can use them, and so that their great value can be realized, is a problem that urgently needs to be solved. Video mix-cutting is currently one of the popular applications: suitable video clips are selected according to a number of sentences and assembled into a complete video.
Faced with massive video resources, common mix-cutting workflows require the cutting, searching, splicing and other work to be done manually; this is slow, costly and often inaccurate, and more valuable videos may be missed; as the scale of the video data grows, these drawbacks are multiplied.
To replace manual video cutting, various methods for automatically cutting video have been proposed in recent years, such as cutting video with image-processing techniques like histograms, or introducing deep-learning methods to predict the cut positions; however, the former performs poorly and cannot cut shots accurately, while the latter requires a large amount of labeled data and covers only limited scenes, so that when the test data differs greatly from the training data the segmentation results are inaccurate. Moreover, these methods segment at the image level without considering the semantics of the video segments, so the resulting segments are semantically incomplete or contain several semantics, which degrades the quality of the subsequently spliced video.
When the clips retrieved for a number of sentences are spliced into a video, current practice often splices randomly selected results directly, so the spliced video tends to be ambiguous in style and insufficiently coordinated; manual intervention has been proposed to replace and adjust clips, since there is no data that can be used directly; but annotating videos manually involves an enormous workload, wastes time and labor, and the final effect is often unstable because of subjective differences between people.
Disclosure of Invention
The object of the invention is to provide an automatic video mix-cutting method based on text-video retrieval, which has the advantages that a user can obtain a mixed-cut video corresponding to the text content simply by editing the text and that the segments of the mixed-cut video are close in style and highly consistent, thereby solving the problems described in the background art.
In order to achieve the above object, the present invention provides the following technical solution: a method for automatically mixing and cutting video based on text-video retrieval, comprising the following steps:
s1: video cutting and library building: dividing source videos into segments to form a video material library;
s2: text retrieval: a text paragraph contains L sentences, and for each sentence the best-matching video clips are retrieved;
s3: video stitching: among the combinations of different clips, selecting the combination with the highest consistency;
s4: post-processing: adjusting and unifying the resolutions of the clips to finally obtain a video that is consistent in style and conforms to the text.
Preferably, step S1 uses a multimodal image-text model to cut the video into single-shot, semantically independent segments, and comprises the following steps:
s1.1, first decoding the original video and converting it into a sequence of frames;
s1.2, feeding the obtained frames sequentially into the multimodal image-text model to obtain a vector serving as the feature of each frame;
s1.3, computing the similarity between the features of adjacent frames: the higher the score, the closer the two frames; the lower the score, the greater the difference between them;
s1.4, judging from the similarity of adjacent frames whether the current position is a cut point: if the similarity of adjacent frames is below a threshold, the difference is large, and the position is taken as a cut point so that the preceding and following frames fall into different segments;
s1.5, finally, computing the feature of every frame of each segmented video clip with the multimodal model, averaging all the features, and taking the average vector as the representative feature of that clip.
Preferably, in step S2, a text paragraph is split into a number of sentences; each sentence is fed into the multimodal model to extract its feature, the feature is then used to traverse the video material library and compute the similarity with the feature of every video clip, and finally the N highest-scoring results are obtained;
since one text paragraph contains L sentences and each sentence is used to search the videos in turn, returning N results each time, L×N results are obtained in total.
Preferably, step S3 comprises the following steps:
s3.1, random splicing: to splice the results obtained for the L sentences in S2 into a video, the simplest method is to randomly select one of the N results of each sentence and directly concatenate them into a video;
s3.2, using a scoring model: for the retrieval results of S2, one of the N results returned for each sentence must be selected such that, when it is spliced with the results selected for the other sentences, the video styles are closest and the score is highest.
Preferably, step S3.2 comprises the following steps:
s3.2.1, training the scoring model requires a large amount of data; however, no suitable dataset has been published so far, and manual annotation would involve too much work to be practical; the invention therefore constructs the data automatically from existing original videos;
s3.2.2, after the model converges, the scoring model can be applied to select suitable clips: it scores an input combination of video-clip features, assigning a higher score to combinations with high internal consistency and, conversely, a lower score otherwise.
Preferably, in step S3.2.1, the implementation flow is as follows:
first, a large number of videos taken from films, short videos and the like are collected, and each video is ensured to be continuous;
then, each original video is segmented to obtain the features of several video clips; these features form a sequence that serves as a positive sample with label 1, and at the same time the clips and their features are added to the video material library;
then, for each positive sample, the video material library is searched with the feature of each of its clips to obtain the closest clip and its feature, and the retrieved features are assembled into a sequence that serves as a negative sample with label 0;
this process is repeated many times to obtain sufficient positive and negative sample data.
Preferably, in step S3.2.1, after the positive and negative sample data are obtained, they are used to train and optimize the model, with cross-entropy loss as the loss function;
during training, as the model converges, online hard-example mining can be applied to the training data: a sample is fed into the model and its loss is computed, and the loss determines whether the sample should be trained on additional times; if the loss is large, the sample is added to the next batch of training data for further training.
Preferably, in step S3.2.2, the retrieval results of S2 are combined sentence by sentence with the results of the preceding sentences, and the combined features are fed into the scoring model. For example, the first sentence has N results, and when the N results of the second sentence are appended there are N×N combinations;
continuing in the same way, judging all the possibilities for the L sentences gives N^L combinations in total, which are fed into the scoring model for judgment to obtain the optimal result.
Preferably, in step S4, because the resolutions of the selected video clips differ, the maximum resolution among the clips is selected when the video is synthesized, so that the video format is uniform, and each clip is padded with black borders before the clips are spliced.
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a method for automatically constructing data, which trains a model by using the data to obtain a better result, has the advantages that a user can obtain a mixed-cut video corresponding to text content by editing the text, each segment of the mixed-cut video has a similar style and higher consistency, and solves the following problems:
1. cutting video according to semantics: when the traditional method is used for cutting the video, semantic information is not considered, the obtained video fragments contain a plurality of semantemes, and the effect of finally forming the video can be influenced.
2. Ensuring video consistency: when selecting video clips, the traditional video splicing method is used for randomly selecting video clips, the inherent consistency of a plurality of video clips is not considered, or manual selection is used, the efficiency is low, the mixed shearing video method is used for introducing a scoring model to score the characteristic combination of various video clips, and the result with similar inherent style and more proper effect is selected.
Drawings
FIG. 1 is a flow chart of the automatic video mix-cutting method according to the present invention.
Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings; it is apparent that the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without making any inventive effort fall within the scope of the invention.
Referring to FIG. 1, a method for automatically mixing and cutting video based on text-video retrieval comprises the following steps:
s1: video cutting and library building: dividing source videos into segments to form a video material library;
s2: text retrieval: a text paragraph contains L sentences, and for each sentence the best-matching video clips are retrieved;
s3: video stitching: among the combinations of different clips, selecting the combination with the highest consistency;
s4: post-processing: adjusting and unifying the resolutions of the clips to finally obtain a video that is consistent in style and conforms to the text.
As a further preferred scheme of the invention, step S1 uses a multimodal image-text model to cut the video into single-shot, semantically independent segments, and comprises the following steps:
s1.1, first decoding the original video and converting it into a sequence of frames;
s1.2, feeding the obtained frames sequentially into the multimodal image-text model to obtain a vector serving as the feature of each frame;
s1.3, computing the similarity between the features of adjacent frames: the higher the score, the closer the two frames; the lower the score, the greater the difference between them;
s1.4, judging from the similarity of adjacent frames whether the current position is a cut point: if the similarity of adjacent frames is below a threshold, the difference is large, and the position is taken as a cut point so that the preceding and following frames fall into different segments;
s1.5, finally, computing the feature of every frame of each segmented video clip with the multimodal model, averaging all the features, and taking the average vector as the representative feature of that clip (an illustrative sketch of this procedure is given below).
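The following is a minimal, non-limiting sketch of steps S1.1-S1.5, assuming OpenCV for decoding and an open-source CLIP-style image encoder (open_clip, ViT-B/32) as the multimodal image-text model; the sampling stride, model name and similarity threshold are illustrative assumptions rather than part of the claimed method.

```python
# Illustrative sketch of steps S1.1-S1.5 (shot cutting and clip features).
# Assumptions: OpenCV for decoding, an open_clip ViT-B/32 image encoder as the
# multimodal image-text model, cosine similarity, and a hypothetical threshold.
import cv2
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
model.eval()

def frame_features(video_path, sample_every=5):
    """S1.1-S1.2: decode the video and embed sampled frames."""
    cap, feats, idx = cv2.VideoCapture(video_path), [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:
            img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            with torch.no_grad():
                f = model.encode_image(preprocess(img).unsqueeze(0))
            feats.append(torch.nn.functional.normalize(f, dim=-1).squeeze(0))
        idx += 1
    cap.release()
    return torch.stack(feats)                     # (num_frames, dim)

def cut_into_clips(feats, threshold=0.85):
    """S1.3-S1.5: cut where adjacent-frame similarity drops below the threshold,
    then represent each clip by the mean of its frame features."""
    sims = (feats[:-1] * feats[1:]).sum(dim=-1)   # cosine similarity of neighbours
    cuts = (sims < threshold).nonzero().flatten().tolist()
    bounds = [0] + [c + 1 for c in cuts] + [len(feats)]
    return [(s, e, feats[s:e].mean(dim=0))        # (start, end, representative feature)
            for s, e in zip(bounds, bounds[1:]) if e > s]
```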
In step S2, a text paragraph is split into a number of sentences; each sentence is fed into the multimodal model to extract its feature, the feature is then used to traverse the video material library and compute the similarity with the feature of every video clip, and finally the N highest-scoring results are obtained;
since one text paragraph contains L sentences and each sentence is used to search the videos in turn, returning N results each time, L×N results are obtained in total. A sketch of this retrieval step follows.
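As a non-limiting illustration, the sketch below reuses the open_clip model from the previous snippet and assumes the material library is stored as a single feature matrix; the function and parameter names are hypothetical.

```python
# Illustrative sketch of step S2 (per-sentence text retrieval).
# Assumptions: the open_clip text encoder from the snippet above and a material
# library stored as one feature matrix plus clip records.
import torch
import open_clip

tokenizer = open_clip.get_tokenizer("ViT-B-32")

def retrieve_per_sentence(sentences, library_feats, top_n=5):
    """Return, for each of the L sentences, the indices of its N best clips
    (L x N results in total)."""
    with torch.no_grad():
        text_feats = model.encode_text(tokenizer(sentences))
    text_feats = torch.nn.functional.normalize(text_feats, dim=-1)
    lib = torch.nn.functional.normalize(library_feats, dim=-1)
    scores = text_feats @ lib.T                   # (L, num_clips) similarity matrix
    return [row.topk(top_n).indices.tolist() for row in scores]
```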
As a further preferable embodiment of the present invention, step S3 comprises the following steps:
s3.1, random splicing: to splice the results obtained for the L sentences in S2 into a video, the simplest method is to randomly select one of the N results of each sentence and directly concatenate them into a video;
s3.2, using a scoring model: for the retrieval results of S2, one of the N results returned for each sentence must be selected such that, when it is spliced with the results selected for the other sentences, the video styles are closest and the score is highest.
As a further preferred embodiment of the present invention, step S3.2 comprises the following steps:
s3.2.1, training the scoring model requires a large amount of data; however, no suitable dataset has been published so far, and manual annotation would involve too much work to be practical; the invention therefore constructs the data automatically from existing original videos;
s3.2.2, after the model converges, the scoring model can be applied to select suitable clips: it scores an input combination of video-clip features, assigning a higher score to combinations with high internal consistency and, conversely, a lower score otherwise (an assumed model architecture is sketched below).
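The patent does not specify the scoring model's architecture; purely as an assumption for illustration, a small Transformer encoder over the clip-feature sequence with a sigmoid head could serve as the model that maps a feature combination to a consistency score.

```python
# Assumed scoring-model architecture for S3.2: a small Transformer encoder over
# the sequence of clip features with a sigmoid "consistency" head. The patent
# only requires a model that maps a feature combination to a score.
import torch
import torch.nn as nn

class ConsistencyScorer(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, layers=2, heads=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden)
        enc_layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.head = nn.Linear(hidden, 1)

    def forward(self, clip_feats):                # (batch, num_clips, feat_dim)
        h = self.encoder(self.proj(clip_feats))
        return torch.sigmoid(self.head(h.mean(dim=1))).squeeze(-1)   # (batch,)

# Usage: a higher score means the clip combination is more internally consistent.
scorer = ConsistencyScorer()
print(scorer(torch.randn(1, 3, 512)))             # one candidate combination of 3 clips
```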
As a further preferred embodiment of the present invention, in step S3.2.1, the implementation flow is as follows (a sketch is given after this list):
first, a large number of videos taken from films, short videos and the like are collected, and each video is ensured to be continuous;
then, each original video is segmented to obtain the features of several video clips; these features form a sequence that serves as a positive sample with label 1, and at the same time the clips and their features are added to the video material library;
then, for each positive sample, the video material library is searched with the feature of each of its clips to obtain the closest clip and its feature, and the retrieved features are assembled into a sequence that serves as a negative sample with label 0;
this process is repeated many times to obtain sufficient positive and negative sample data.
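A minimal sketch of this data construction follows. Reading the text as replacing every clip of a positive sequence with its nearest neighbour from the material library when forming a negative is an assumption; the exact replacement strategy is not fixed by the patent.

```python
# Illustrative sketch of the automatic construction of positive/negative samples.
# Assumption: a negative is built by replacing every clip feature of a positive
# sequence with its nearest neighbour retrieved from the material library.
import torch

def build_samples(per_video_feats, library_feats):
    """per_video_feats: list of tensors (num_clips, dim), one per original video,
    in the original clip order. Returns (sequences, labels): label 1 for the
    original order, label 0 for the retrieved mixture."""
    sequences, labels = [], []
    lib = torch.nn.functional.normalize(library_feats, dim=-1)
    for feats in per_video_feats:
        sequences.append(feats)                   # positive: contiguous original clips
        labels.append(1.0)
        q = torch.nn.functional.normalize(feats, dim=-1)
        sims = q @ lib.T                          # similarity to every library clip
        # mask out the exact self-match so the nearest *other* clip is retrieved
        sims.scatter_(1, sims.argmax(dim=1, keepdim=True), -1.0)
        sequences.append(library_feats[sims.argmax(dim=1)])   # negative sequence
        labels.append(0.0)
    return sequences, torch.tensor(labels)
```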
As a further preferred embodiment of the present invention, in step S3.2.1, after the positive and negative sample data are obtained, they are used to train and optimize the model, with cross-entropy loss as the loss function;
for example, suppose the original videos A, B and C are segmented into the feature sequences A0-A1-A2, B0-B1-B2 and C0-C1-C2-C3; the following positive samples can then be obtained:
A0-A1-A2, B0-B1-B2, C0-C1-C2 and C1-C2-C3, each with label 1;
at the same time, negative samples are obtained:
A0-C1-A2, A0-B1-C2, C0-A1-B2, A1-C3-B2 and so on, each with label 0;
during training, as the model converges, online hard-example mining can be applied to the training data: a sample is fed into the model and its loss is computed, and the loss determines whether the sample should be trained on additional times; if the loss is large, the sample is added to the next batch of training data for further training. A sketch of this training procedure follows.
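A minimal training-loop sketch is shown below, using binary cross-entropy (the cross-entropy loss for 0/1 labels) and a simple interpretation of the online hard-example mining described above; the learning rate, epoch count and hard-sample threshold are assumptions.

```python
# Illustrative training loop: binary cross-entropy on the 0/1 labels and a
# simple form of online hard-example mining in which high-loss samples are
# queued again for the next pass. Hyper-parameters are assumptions.
import torch
import torch.nn as nn
from collections import deque

def train_scorer(scorer, sequences, labels, epochs=10, lr=1e-4, hard_loss=0.5):
    opt = torch.optim.Adam(scorer.parameters(), lr=lr)
    bce = nn.BCELoss()
    hard_queue = deque()
    for _ in range(epochs):
        order = list(range(len(sequences))) + list(hard_queue)
        hard_queue.clear()
        for i in order:
            pred = scorer(sequences[i].unsqueeze(0))          # (1,)
            loss = bce(pred, labels[i].unsqueeze(0))
            opt.zero_grad()
            loss.backward()
            opt.step()
            if loss.item() > hard_loss:                       # hard sample: train again
                hard_queue.append(i)
    return scorer
```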
As a further preferred embodiment of the present invention, in step S3.2.2, the retrieval results of S2 are combined sentence by sentence with the results of the preceding sentences, and the combined features are fed into the scoring model. For example, the first sentence has N results, and when the N results of the second sentence are appended there are N×N combinations;
continuing in the same way, judging all the possibilities for the L sentences gives N^L combinations in total, which are fed into the scoring model for judgment to obtain the optimal result.
The search space of this exhaustive method is huge, and devices such as embedded equipment cannot bear the computational load; a compromise method can therefore be adopted that reduces the search space and still obtains relatively good results, so that different strategies can be chosen according to the performance of the device;
for example, assume N=2; for the sentences A, B and C, after searching the video material library, the video-clip features A0 and A1, B0 and B1, and C0 and C1 are returned respectively.
For sentences A and B there are 4 combinations: A0-B0, A0-B1, A1-B0 and A1-B1; after model scoring, the 2 highest-scoring combinations are kept, say A1-B0 and A0-B1.
On this basis sentence C is added, giving the 4 combinations A1-B0-C0, A1-B0-C1, A0-B1-C0 and A0-B1-C1; after model scoring the highest-scoring combination is obtained, say A1-B0-C0, which is the finally selected combination of video clips.
Through the judgment of the scoring model, the selected video clips are more similar in style and the whole video is better coordinated. A sketch of this beam-style search is given below.
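The compromise described above is essentially a beam search over the N^L combinations; the sketch below is an assumed implementation in which only a fixed number of best-scoring partial combinations is kept per sentence.

```python
# Assumed beam-search implementation of the compromise described above: at each
# sentence only `beam` best-scoring partial combinations are kept, instead of
# enumerating all N^L combinations.
import torch

def select_combination(scorer, per_sentence_feats, beam=2):
    """per_sentence_feats: list of L lists, each holding the N retrieved clip
    features (tensors of shape (dim,)) for one sentence."""
    beams = [[f] for f in per_sentence_feats[0]]
    for feats in per_sentence_feats[1:]:
        candidates = [prefix + [f] for prefix in beams for f in feats]
        scores = scorer(torch.stack([torch.stack(c) for c in candidates]))
        keep = scores.topk(min(beam, len(candidates))).indices.tolist()
        beams = [candidates[i] for i in keep]
    final_scores = scorer(torch.stack([torch.stack(c) for c in beams]))
    return beams[final_scores.argmax().item()]    # highest-scoring clip combination
```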
As a further preferred scheme of the invention, in step S4, because the resolutions of the selected video clips differ, the maximum resolution among the clips is selected as the output resolution when the video is synthesized, so that the video format is uniform; every clip is padded with black borders and the clips are then spliced, and the finally obtained post-processed video is better coordinated and more attractive. A minimal sketch of this step follows.
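A minimal OpenCV-based sketch of the post-processing step, assuming black-border (letterbox) padding to the maximum resolution followed by simple concatenation; the codec, frame rate and file handling are illustrative assumptions.

```python
# Illustrative post-processing for S4: letterbox every frame to the maximum
# resolution with black borders, then concatenate the clips with OpenCV.
# The codec and the fixed frame rate are simplifying assumptions.
import cv2

def letterbox(frame, out_w, out_h):
    """Scale the frame to fit and centre it on a black canvas of the target size."""
    h, w = frame.shape[:2]
    scale = min(out_w / w, out_h / h)
    resized = cv2.resize(frame, (int(w * scale), int(h * scale)))
    top = (out_h - resized.shape[0]) // 2
    left = (out_w - resized.shape[1]) // 2
    return cv2.copyMakeBorder(resized, top, out_h - resized.shape[0] - top,
                              left, out_w - resized.shape[1] - left,
                              cv2.BORDER_CONSTANT, value=(0, 0, 0))

def stitch(clip_paths, out_path, out_w, out_h, fps=25):
    """Splice the selected clips into one video at the unified resolution."""
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (out_w, out_h))
    for path in clip_paths:
        cap = cv2.VideoCapture(path)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            writer.write(letterbox(frame, out_w, out_h))
        cap.release()
    writer.release()
```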
The generation effect of the method is illustrated by the following example.
The input text is:
A. Every girl has a dream of being Snow White;
B. Marrying the Prince Charming she loves is her life's goal;
C. She too has dreamed the Snow White dream;
D. She often secretly admires the Snow White of the fairy tale, with her jet-black hair;
E. Bright red lips and snow-white skin;
F. Dancing with Prince Charming in a blooming sea of lilies.
The video material library is searched according to this text to obtain a number of candidate video-clip combinations, and the scoring model is used to select the most suitable combination.
To sum up: the invention provides a method for automatically constructing training data and trains a model with this data to obtain better results; its advantages are that a user can obtain a mixed-cut video corresponding to the text content simply by editing the text and that the segments of the mixed-cut video are similar in style and highly consistent. It solves the following problems:
1. Cutting video according to semantics: when traditional methods cut video, semantic information is not considered, so the resulting clips may contain several semantics, which affects the quality of the final composed video; the invention instead cuts the video according to semantics with the help of the image-text model.
2. Ensuring video consistency: when selecting video clips, traditional splicing methods either pick clips at random, ignoring the internal consistency among the clips, or rely on inefficient manual selection; the present mix-cutting method introduces a scoring model that scores the feature combinations of candidate clips and selects the result whose internal style is most similar and most suitable.
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising" or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article or apparatus. Without further limitation, an element preceded by "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or apparatus that comprises the element.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (9)

1. A method for automatically mixing and cutting video based on text-video retrieval, characterized by comprising the following steps:
s1: video cutting and library building: dividing source videos into segments to form a video material library;
s2: text retrieval: a text paragraph contains L sentences, and for each sentence the best-matching video clips are retrieved;
s3: video stitching: among the combinations of different clips, selecting the combination with the highest consistency;
s4: post-processing: adjusting and unifying the resolutions of the clips to finally obtain a video that is consistent in style and conforms to the text.
2. The method for automatically mixing and cutting video based on text-video retrieval according to claim 1, characterized in that: step S1 uses a multimodal image-text model to cut the video into single-shot, semantically independent segments, and comprises the following steps:
s1.1, first decoding the original video and converting it into a sequence of frames;
s1.2, feeding the obtained frames sequentially into the multimodal image-text model to obtain a vector serving as the feature of each frame;
s1.3, computing the similarity between the features of adjacent frames, wherein the higher the score, the closer the two frames, and the lower the score, the greater the difference between them;
s1.4, judging from the similarity of adjacent frames whether the current position is a cut point: if the similarity of adjacent frames is below a threshold, the difference is large, and the position is taken as a cut point so that the preceding and following frames fall into different segments;
s1.5, finally, computing the feature of every frame of each segmented video clip with the multimodal model, averaging all the features, and taking the average vector as the representative feature of that clip.
3. The method for automatically mixing and cutting video based on text-video retrieval according to claim 1, characterized in that: in step S2, a text paragraph is split into a number of sentences; each sentence is fed into the multimodal model to extract its feature, the feature is then used to traverse the video material library and compute the similarity with the feature of every video clip, and finally the N highest-scoring results are obtained;
since one text paragraph contains L sentences and each sentence is used to search the videos in turn, returning N results each time, L×N results are obtained in total.
4. The method for automatically mixing and cutting video based on text-video retrieval according to claim 1, characterized in that: step S3 comprises the following steps:
s3.1, random splicing: to splice the results obtained for the L sentences in S2 into a video, the simplest method is to randomly select one of the N results of each sentence and directly concatenate them into a video;
s3.2, using a scoring model: for the retrieval results of S2, one of the N results returned for each sentence must be selected such that, when it is spliced with the results selected for the other sentences, the video styles are closest and the score is highest.
5. The method for automatically mixing and cutting video based on text-video retrieval according to claim 4, characterized in that: step S3.2 comprises the following steps:
s3.2.1, training the scoring model requires a large amount of data; however, no suitable dataset has been published so far, and manual annotation would involve too much work to be practical; the invention therefore constructs the data automatically from existing original videos;
s3.2.2, after the model converges, the scoring model can be applied to select suitable clips: it scores an input combination of video-clip features, assigning a higher score to combinations with high internal consistency and, conversely, a lower score otherwise.
6. The method for automatically mixing and cutting video based on text-video retrieval according to claim 5, characterized in that: in step S3.2.1, the implementation flow is as follows:
first, a large number of videos taken from films, short videos and the like are collected, and each video is ensured to be continuous;
then, each original video is segmented to obtain the features of several video clips; these features form a sequence that serves as a positive sample with label 1, and at the same time the clips and their features are added to the video material library;
then, for each positive sample, the video material library is searched with the feature of each of its clips to obtain the closest clip and its feature, and the retrieved features are assembled into a sequence that serves as a negative sample with label 0;
this process is repeated many times to obtain sufficient positive and negative sample data.
7. The method for automatically mixing and cutting video based on text-video retrieval according to claim 5, characterized in that: in step S3.2.1, after the positive and negative sample data are obtained, they are used to train and optimize the model, with cross-entropy loss as the loss function;
during training, as the model converges, online hard-example mining can be applied to the training data: a sample is fed into the model and its loss is computed, and the loss determines whether the sample should be trained on additional times; if the loss is large, the sample is added to the next batch of training data for further training.
8. The method for automatically mixing and cutting video based on text-video retrieval according to claim 5, characterized in that: in step S3.2.2, the retrieval results of S2 are combined sentence by sentence with the results of the preceding sentences, and the combined features are fed into the scoring model; for example, the first sentence has N results, and when the N results of the second sentence are appended there are N×N combinations;
continuing in the same way, judging all the possibilities for the L sentences gives N^L combinations in total, which are fed into the scoring model for judgment to obtain the optimal result.
9. The method for automatically mixing and cutting video based on text-video retrieval according to claim 1, characterized in that: in step S4, because the resolutions of the selected video clips differ, the maximum resolution among the clips is selected when the video is synthesized, so that the video format is uniform, and each clip is padded with black borders before the clips are spliced.
CN202310588510.4A 2023-05-24 2023-05-24 Method for automatically mixing and cutting video based on text-video retrieval Pending CN116614672A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310588510.4A CN116614672A (en) 2023-05-24 2023-05-24 Method for automatically mixing and cutting video based on text-video retrieval

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310588510.4A CN116614672A (en) 2023-05-24 2023-05-24 Method for automatically mixing and cutting video based on text-video retrieval

Publications (1)

Publication Number Publication Date
CN116614672A true CN116614672A (en) 2023-08-18

Family

ID=87679650

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310588510.4A Pending CN116614672A (en) 2023-05-24 2023-05-24 Method for automatically mixing and cutting video based on text-video retrieval

Country Status (1)

Country Link
CN (1) CN116614672A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117692676A (en) * 2023-12-08 2024-03-12 广东创意热店互联网科技有限公司 Video quick editing method based on artificial intelligence technology


Similar Documents

Publication Publication Date Title
CN112015949B (en) Video generation method and device, storage medium and electronic equipment
CN108769801B (en) Synthetic method, device, equipment and the storage medium of short-sighted frequency
CN113709561B (en) Video editing method, device, equipment and storage medium
CN101872346B (en) Method for generating video navigation system automatically
CN112511854B (en) Live video highlight generation method, device, medium and equipment
CN111866585A (en) Video processing method and device
CN113641859B (en) Script generation method, system, computer storage medium and computer program product
CN109756751A (en) Multimedia data processing method and device, electronic equipment, storage medium
CN111083393A (en) Method for intelligently making short video
CN114419387A (en) Cross-modal retrieval system and method based on pre-training model and recall ranking
CN110019852A (en) Multimedia resource searching method and device
CN109005451B (en) Video strip splitting method based on deep learning
CN116614672A (en) Method for automatically mixing and cutting video based on text-video retrieval
CN114731458A (en) Video processing method, video processing apparatus, terminal device, and storage medium
CN113115055B (en) User portrait and live video file editing method based on viewing behavior
CN112784078A (en) Video automatic editing method based on semantic recognition
CN112004138A (en) Intelligent video material searching and matching method and device
CN108710860B (en) Video news segmentation method and device
CN110610500A (en) News video self-adaptive strip splitting method based on dynamic semantic features
CN117201715A (en) Video generation method and device and readable storage medium
CN113704506A (en) Media content duplication eliminating method and related device
CN114938473B (en) Comment video generation method and comment video generation device
CN112800263A (en) Video synthesis system, method and medium based on artificial intelligence
CN113660526B (en) Script generation method, system, computer storage medium and computer program product
CN113992973B (en) Video abstract generation method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination