CN117544822B - Video editing automation method and system - Google Patents

Video editing automation method and system

Info

Publication number
CN117544822B
Authority
CN
China
Prior art keywords
video
text
target
script
clips
Prior art date
Legal status
Active
Application number
CN202410027805.9A
Other languages
Chinese (zh)
Other versions
CN117544822A (en)
Inventor
吴晨辉
周葭芜
陈涛
柴杰
邓晓宇
Current Assignee
Hangzhou Wayward Intelligent Technology Co ltd
Original Assignee
Hangzhou Wayward Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Wayward Intelligent Technology Co ltd
Priority to CN202410027805.9A
Publication of CN117544822A
Application granted
Publication of CN117544822B
Legal status: Active
Anticipated expiration


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/435Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/466Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/4662Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms
    • H04N21/4665Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms involving classification methods, e.g. Decision trees

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The application relates to a video editing automation method and system. The method comprises the following steps: slicing the video to be clipped to obtain a video clip set containing a plurality of video clips; calculating the matching similarity between the video script text and the video clips based on the text feature vector of the video script text and the video feature vectors of the video clips; obtaining the target video clip corresponding to the video script text through a global optimal algorithm based on the matching similarity; and synthesizing the target video clips based on the video script text to obtain the target video, and deleting the target video clips from the video clip set. The method and system solve the problem of how to improve the effect of automated video editing: matched target video clips are deleted from the video clip set, which reduces the degree of picture repetition in subsequently generated target videos, and the use of a global optimal algorithm improves the overall matching accuracy of the video clips.

Description

Video editing automation method and system
Technical Field
The present application relates to the field of video data processing, and in particular, to a video clip automation method and system.
Background
In recent years, with the rapid development of internet technology, video has become one of the main modes of information transfer. In many video application scenarios, such as short video platforms, users often need to edit the videos they shoot in daily life or while travelling before publishing them. However, for users who have not learned the relevant skills or who rarely work with video editing, getting started with editing involves a certain difficulty. How to lower the entry threshold of video editing is therefore one of the problems to be solved in current video sharing and communication scenarios.
The patent with publication number CN115967833A discloses a video generation method, apparatus, device, and storage medium. Specifically, features are extracted from the explanation text to obtain word feature vectors; the video to be clipped is sliced to obtain a plurality of video clips; multi-modal feature extraction is performed on each of the plurality of video clips to obtain their multi-modal feature vectors; the word feature vectors and the multi-modal feature vectors of the plurality of video clips are input into a text-video matching model to determine at least one target video clip; and the target video is generated according to the explanation text and the at least one target video clip. It follows that this patent can indeed realize automatic video editing, but when the same material needs to be automatically edited according to several different explanation texts, it is likely to generate multiple target videos with a high degree of picture repetition.
At present, no effective solution has been proposed in the related art for the problem of how to improve the effect of automated video editing.
Disclosure of Invention
The embodiments of the present application provide a video editing automation method and system, which at least solve the problem in the related art of how to improve the effect of automated video editing.
In a first aspect, embodiments of the present application provide a video clip automation method, the method including:
performing slicing treatment on the video to be clipped to obtain a video clip set containing a plurality of video clips;
calculating to obtain a video feature vector of the video clip;
based on the text feature vector of the video script text and the video feature vector, calculating to obtain the matching similarity between the video script text and the video fragment;
obtaining a target video segment corresponding to the video script text through a global optimal algorithm based on the matching similarity;
and synthesizing the target video clips based on the video script to obtain a target video, and deleting the target video clips from the video clip set.
In some of these embodiments, before synthesizing the target video segment based on the video script to obtain a target video and deleting the target video segment from the set of video segments, the method includes:
and carrying out video classification on the video clips based on the video feature vectors to obtain a video classification set.
In some embodiments, synthesizing the target video segment based on the video script to obtain a target video, and deleting the target video segment from the video segment set includes:
synthesizing the target video clips based on the video script to obtain target videos, and deleting the target video clips from the video classification set;
and after all the video clips in the video classification set are deleted, recalling the deleted video clips through a recall mechanism of the video classification set, and storing the video clips in the video classification set again.
In some embodiments, performing video classification on the video segments based on the video feature vectors, and obtaining a video classification set includes:
and calculating the classification similarity among the video clips through a video classification model based on the video feature vectors, and classifying the video clips with the classification similarity larger than a preset threshold value into the same category to obtain a video classification set, wherein the classification similarity is measured through Euclidean distance, manhattan distance or cosine distance.
In some of these embodiments, calculating the matching similarity between the video script text and the video clip based on the text feature vector of the video script text and the video feature vector comprises:
and based on the text feature vectors of the video script texts and the video feature vectors, calculating to obtain matching similarity between each video script text and all the video fragments through a text video matching model, wherein the matching similarity is measured through Euclidean distance, manhattan distance or cosine distance.
In some embodiments, based on the matching similarity, obtaining, by a global optimization algorithm, a target video clip corresponding to the video script text includes:
and based on the matching similarity between each video script text and all the video fragments, maximizing the global similarity between the video script text and the video fragments through a global optimal algorithm, and obtaining a target video fragment corresponding to the video script text.
In some of these embodiments, synthesizing the target video segment based on the video script includes:
and splicing the target video fragments corresponding to the video script texts according to the sequence based on the sequence of the video script texts, so as to obtain the target video.
In some embodiments, synthesizing the target video segment based on the video script further comprises:
converting the video script text into target audio, and carrying out coding fusion on the target audio and a corresponding target video segment to obtain a fused video segment;
and splicing the fusion video fragments according to the sequence based on the sequence of the video script texts, so as to obtain a target video.
In some of these embodiments, before calculating a matching similarity between the video script text and the video clip based on the text feature vector of the video script text and the video feature vector, the method comprises:
and calculating a text feature vector of the video script text through a word vector tool.
In a second aspect, an embodiment of the present application provides a video clip automation system, where the system is configured to perform the method described in the first aspect, and the system includes a video processing module, a text processing module, a matching module, and a synthesizing module;
the video processing module is used for carrying out slicing processing on the video to be clipped to obtain a video clip set containing a plurality of video clips; calculating to obtain a video feature vector of the video clip;
the text processing module is used for calculating a text feature vector of the video script text through the word vector tool;
the matching module is used for calculating and obtaining matching similarity between the video script text and the video fragment according to the text feature vector of the video script text and the video feature vector; obtaining a target video segment corresponding to the video script text through a global optimal algorithm according to the matching similarity;
and the synthesis module is used for synthesizing the target video fragments according to the video script to obtain target videos, and deleting the target video fragments from the video fragment set.
Compared with the related art, the video editing automation method and system provided by the embodiments of the present application slice the video to be clipped to obtain a video clip set containing a plurality of video clips; calculate the video feature vector of each video clip; calculate the matching similarity between the video script text and the video clips based on the text feature vector of the video script text and the video feature vectors; obtain the target video clip corresponding to the video script text through a global optimal algorithm based on the matching similarity; and synthesize the target video clips based on the video script text to obtain the target video, deleting the target video clips from the video clip set. This solves the problem of how to improve the effect of automated video editing: matched target video clips are deleted from the video clip set, which reduces the degree of picture repetition in subsequently generated target videos, and the use of a global optimal algorithm improves the overall matching accuracy of the video clips.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a flow chart of steps of a video clip automation method according to an embodiment of the present application;
FIG. 2 is a block diagram of a video clip automation system according to an embodiment of the present application;
fig. 3 is a schematic diagram of an internal structure of an electronic device according to an embodiment of the present application.
Reference numerals in the drawings: 21, video processing module; 22, text processing module; 23, matching module; 24, synthesis module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described and illustrated below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided herein without creative effort shall fall within the scope of protection of the present application.
It is apparent that the drawings in the following description are only some examples or embodiments of the present application, and those of ordinary skill in the art may apply the present application to other similar situations according to these drawings without inventive effort. Moreover, it should be appreciated that while such a development effort might be complex and lengthy, it would nevertheless be a routine undertaking of design, fabrication, and manufacture for those of ordinary skill having the benefit of this disclosure.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is to be expressly and implicitly understood by those of ordinary skill in the art that the embodiments described herein can be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs. References to "a," "an," "the," and similar terms herein do not denote a limitation of quantity, but rather denote the singular or plural. The terms "comprising," "including," "having," and any variations thereof are intended to cover a non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to only those steps or elements but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "connected," "coupled," and the like in this application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as used herein refers to two or more. "And/or" describes an association relationship between associated objects, meaning that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects are in an "or" relationship. The terms "first," "second," "third," and the like, as used herein, merely distinguish between similar objects and do not represent a particular ordering of objects.
An embodiment of the present application provides a video clip automation method, and fig. 1 is a flowchart of steps of the video clip automation method according to an embodiment of the present application, as shown in fig. 1, and the method includes the following steps:
step S102, performing slicing processing on a video to be clipped to obtain a video clip set containing a plurality of video clips;
It should be noted that the video to be clipped may be shot by the user or downloaded from the network with the user's authorization. In this embodiment, the video to be clipped may be sliced by duration, which can be understood as cutting the video to be clipped into video clips of a fixed length. A sketch of this slicing step is given below.
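The embodiment does not specify a slicing tool or a particular fixed duration. Purely as an illustration, the following Python sketch assumes ffmpeg is installed and uses a 5-second clip length (both assumptions, not requirements of the method):

import subprocess
from pathlib import Path

def slice_video(video_path: str, out_dir: str, clip_seconds: int = 5) -> list[str]:
    """Cut the video to be clipped into clips of a (roughly) fixed duration."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    pattern = str(Path(out_dir) / "clip_%04d.mp4")
    # ffmpeg's segment muxer splits the input without re-encoding (stream copy);
    # cuts land on keyframes, so clip lengths are approximately clip_seconds.
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-c", "copy", "-map", "0",
         "-f", "segment", "-segment_time", str(clip_seconds),
         "-reset_timestamps", "1", pattern],
        check=True,
    )
    return sorted(str(p) for p in Path(out_dir).glob("clip_*.mp4"))

# video_clip_set = slice_video("video_to_clip.mp4", "clips")  # the video clip set of step S102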
Step S104, calculating to obtain video feature vectors of the video clips;
It should be noted that the video feature vector calculated for a single video clip may be a multidimensional array; finally, a video feature vector set WHOLE_VIDEO_EMB_SET covering all video clips is obtained (this set is reused in step S1101 below). A sketch of one way to build such vectors follows.
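The embodiment likewise does not name a particular video encoder. As one illustration only, a clip-level feature vector could be formed by averaging per-frame embeddings; in the sketch below, encode_frame is a hypothetical placeholder for any image encoder, and OpenCV is assumed for frame reading:

import cv2  # assumed available for frame extraction
import numpy as np

def encode_frame(frame: np.ndarray) -> np.ndarray:
    """Hypothetical placeholder: any image encoder returning a 1-D embedding for one frame."""
    raise NotImplementedError

def video_feature_vector(clip_path: str, sample_every: int = 30) -> np.ndarray:
    """Average sampled frame embeddings into a single multidimensional vector for one clip."""
    cap = cv2.VideoCapture(clip_path)
    embeddings, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % sample_every == 0:  # roughly one frame per second at 30 fps
            embeddings.append(encode_frame(frame))
        index += 1
    cap.release()
    return np.mean(embeddings, axis=0)

# WHOLE_VIDEO_EMB_SET = {path: video_feature_vector(path) for path in video_clip_set}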
Step S106, calculating to obtain the matching similarity between the video script text and the video fragment based on the text feature vector and the video feature vector of the video script text;
step S106, specifically, a plurality of video script texts under a clipping theme input by a user are obtained, and text feature vectors of the video script texts are obtained through calculation by a word vector tool (such as bert); and calculating the matching similarity between each video script text and all video fragments through a text video matching model based on the text feature vector and the video feature vector of the video script text, wherein the matching similarity is measured through Euclidean distance, manhattan distance or cosine distance.
Step S108, obtaining a target video clip corresponding to the video script text through a global optimal algorithm based on the matching similarity;
specifically, in step S108, based on the matching similarity between each video script text and all the video clips, the global similarity between the video script text and the video clips is maximized by using a global optimization algorithm, so as to obtain the target video clip corresponding to the video script text.
It should be noted that the patent with publication number CN115967833A discloses a video generation method, apparatus, device, and storage medium. Specifically, paragraphs [0056]-[0066] of its specification disclose three ways of identifying a target video clip based on the word feature vectors of the text and the multi-modal feature vectors of the video:
(1) Inputting the multi-modal feature vectors of the plurality of video clips into a text-video matching model, and sequentially inputting the plurality of sub-word feature vectors into the text-video matching model to obtain the candidate video segments respectively corresponding to the plurality of sub-word feature vectors; for each sub-word feature vector, the video segment selected by the user from the candidate video segments is determined to be the target video segment.
(2) Inputting the word feature vectors into a word feature embedding module, and outputting the word characterization vectors respectively corresponding to each modal feature; for each video segment, sequentially inputting the multi-modal feature vectors of the video segment into a multi-modal feature vector processing module, and outputting the modal characterization vectors respectively corresponding to the modal features; inputting the word characterization vectors and the modal characterization vectors into a similarity determination module, and outputting the similarity between the video segment and the explanation text.
(3) Inputting the word feature vectors into a weight distribution unit, and outputting the weight of each modal feature; inputting the word feature vectors into a word feature embedding module, and outputting the word characterization vectors respectively corresponding to each modal feature; for each video segment, sequentially inputting the multi-modal feature vectors of the video segment into a feature fusion unit and a multi-modal transformation unit, and outputting the modal characterization vectors respectively corresponding to the modal features; inputting the modal characterization vectors and the word characterization vectors into a similarity calculation unit to obtain the similarity between each modal feature and the explanation text; and respectively inputting the similarity between each modal feature and the explanation text and the weight of each modal feature into a similarity weighting unit, and outputting the similarity between the video segment and the explanation text.
It can be seen that all three methods sequentially select, for each video script text, the video segment with the highest matching similarity as its target video segment. Such matching can find the best video segment for some of the texts, but it easily falls into a local-optimum trap: one portion of the texts is matched to the best video segments while another portion is matched to unrelated video segments, so the final synthesized target video is of poor quality.
In step S108, by contrast, the video segment with the highest matching similarity to each video script text is not selected sequentially. Instead, based on the matching similarity between each video script text and all the video segments, the global similarity between the video script texts and the video segments is maximized by a global optimization algorithm, and the target video segment corresponding to each video script text is obtained.
A simple example is as follows:
the matching similarity of the video script text a1 and the video clip v1 is 90%;
the matching similarity of the video script text a1 and the video clip v2 is 80%;
the matching similarity of the video script text a1 and the video clip v3 is 70%;
the matching similarity of the video script text a2 and the video clip v1 is 91%;
the matching similarity of the video script text a2 and the video clip v2 is 55%;
the matching similarity of the video script text a2 and the video clip v3 is 58%;
the matching similarity of the video script text a3 and the video clip v1 is 70%;
the matching similarity of the video script text a3 and the video clip v2 is 80%;
the matching similarity of the video script text a3 and the video clip v3 is 90%.
If the matching method of the patent with publication number CN115967833A is used, the target video clip of a1 is v1, the target video clip of a2 is v3, and the target video clip of a3 is v2; the matching similarity between a2 and its matched target video clip v3 is only 58%, so the target video synthesized from the matched video clips is of poor quality. In this embodiment, step S108 instead maximizes the global similarity between the video script texts and the video clips (i.e. 80% + 91% + 90%) through the global optimization algorithm: the target video clip of a1 is v2, that of a2 is v1, and that of a3 is v3. By using the global optimization algorithm, step S108 improves the overall matching accuracy of the video clips and the quality of the subsequently synthesized target video. A sketch of such a global assignment on the example values above is given below.
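The embodiment does not name the concrete global optimal algorithm. One common choice for this kind of one-to-one text-to-clip assignment is the Hungarian algorithm; the sketch below (an assumption, not the patent's stated method) applies SciPy's linear_sum_assignment to the example similarity values above and reproduces the globally optimal matching:

import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows: video script texts a1, a2, a3; columns: video clips v1, v2, v3 (values from the example above).
similarity = np.array([
    [0.90, 0.80, 0.70],  # a1
    [0.91, 0.55, 0.58],  # a2
    [0.70, 0.80, 0.90],  # a3
])

# Maximize the summed (global) similarity instead of greedily taking each row's maximum.
row_idx, col_idx = linear_sum_assignment(similarity, maximize=True)
for t, c in zip(row_idx, col_idx):
    print(f"a{t + 1} -> v{c + 1} ({similarity[t, c]:.0%})")
# Prints a1 -> v2 (80%), a2 -> v1 (91%), a3 -> v3 (90%): total 2.61,
# versus 0.90 + 0.58 + 0.80 = 2.28 for the sequential (greedy) matching.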
Step S110, synthesizing the target video clips based on the video script text to obtain target videos, and deleting the target video clips from the video clip set.
Step S110 specifically further includes the steps of:
step S1101, performing video classification on the video clips based on the video feature vectors to obtain a video classification set.
Specifically, in step S1101, based on the video feature vectors, the classification similarity between the video clips is calculated by the video classification model, and the video clips with the classification similarity greater than the preset threshold are classified into the same category, so as to obtain a video classification set, wherein the classification similarity is measured by Euclidean distance, Manhattan distance, or cosine distance.
It should be noted that, as described in step S104 above, the video feature vector set of all video clips is WHOLE_VIDEO_EMB_SET. A video clip vector is randomly selected from WHOLE_VIDEO_EMB_SET as the reference video vector, and the reference video vector together with all the other video vectors is input into the video classification model to obtain a video set S with high similarity to the reference video; the clips in S are treated as the same video (for example, as class 1). The already classified set S is then removed from WHOLE_VIDEO_EMB_SET, a new video vector is randomly selected from WHOLE_VIDEO_EMB_SET as the reference video vector, and the process is repeated. The role of the video classification model is to calculate the similarity between the reference video vector and the other video vectors; the videos whose similarity is greater than the preset threshold are regarded as one set of the same class, and that set is returned. Finally, a video classification set WHOLE_CATEGORY_SET is obtained in the following format: {class id1: [video clip 1, video clip 2], class id2: [video clip 3, video clip 4, video clip 5], class id3: [video clip 6]}. A sketch of this grouping loop follows.
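A minimal sketch of this grouping loop, assuming the cosine-distance variant of the classification similarity, a threshold of 0.8, and that WHOLE_VIDEO_EMB_SET maps clip identifiers to NumPy vectors (all of these are assumptions consistent with, but not fixed by, the text above):

import random
import numpy as np

def build_category_set(whole_video_emb_set: dict[str, np.ndarray],
                       threshold: float = 0.8) -> dict[int, list[str]]:
    """Greedily group clips whose similarity to a randomly chosen reference clip exceeds the threshold."""
    remaining = dict(whole_video_emb_set)
    whole_category_set: dict[int, list[str]] = {}
    category_id = 1
    while remaining:
        ref_id = random.choice(list(remaining))          # reference video vector
        ref_vec = remaining[ref_id]
        same_category = []
        for clip_id, vec in remaining.items():
            sim = float(np.dot(ref_vec, vec) /
                        (np.linalg.norm(ref_vec) * np.linalg.norm(vec)))
            if sim > threshold:                          # the reference clip itself scores 1.0
                same_category.append(clip_id)
        whole_category_set[category_id] = same_category
        for clip_id in same_category:                    # remove classified clips before the next round
            remaining.pop(clip_id)
        category_id += 1
    return whole_category_set

# WHOLE_CATEGORY_SET = build_category_set(WHOLE_VIDEO_EMB_SET)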
Step S1102, synthesizing a target video clip based on the video script text to obtain a target video, and deleting the target video clip from the video classification set;
Optionally, in step S1102, based on the order of the video script texts, the target video clips corresponding to the video script texts are spliced in that order to obtain the target video.
Alternatively, in step S1102, the video script text is converted into target audio, the target audio is encoded and fused with the corresponding target video clip to obtain a fused video clip, and the fused video clips are spliced in the order of the video script texts to obtain the target video. A sketch of this fusion-and-splicing variant follows.
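The embodiment does not fix a synthesis tool for this variant. The sketch below assumes ffmpeg for muxing and concatenation and uses a hypothetical text_to_speech helper for the text-to-audio conversion (both are illustrative assumptions):

import subprocess
from pathlib import Path

def text_to_speech(script_text: str, audio_path: str) -> None:
    """Hypothetical placeholder: any TTS engine that writes the narration for one script text to audio_path."""
    raise NotImplementedError

def fuse_and_splice(script_texts: list[str], target_clips: list[str], out_path: str) -> None:
    """Mux each target audio over its target clip, then concatenate the fused clips in script order."""
    fused_paths = []
    for i, (text, clip) in enumerate(zip(script_texts, target_clips)):
        audio, fused = f"audio_{i}.wav", f"fused_{i}.mp4"
        text_to_speech(text, audio)
        # Replace the clip's audio track with the generated narration; the video stream is copied.
        subprocess.run(
            ["ffmpeg", "-i", clip, "-i", audio, "-map", "0:v:0", "-map", "1:a:0",
             "-c:v", "copy", "-shortest", fused],
            check=True,
        )
        fused_paths.append(fused)
    # Concatenate the fused clips in the order of the video script texts.
    Path("concat_list.txt").write_text("".join(f"file '{p}'\n" for p in fused_paths))
    subprocess.run(
        ["ffmpeg", "-f", "concat", "-safe", "0", "-i", "concat_list.txt", "-c", "copy", out_path],
        check=True,
    )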
The target video clips used for synthesis in step S1102 are removed from the video classification set WHOLE_CATEGORY_SET obtained in step S1101. For example, step S1101 obtains the video classification set {class id1: [video clip 1, video clip 2], class id2: [video clip 3, video clip 4, video clip 5], class id3: [video clip 6]}. If video clip 1 and video clip 3 are used in step S1102, then video clip 1 and video clip 3 are removed from the video classification set, giving the new video classification set {class id1: [video clip 2], class id2: [video clip 4, video clip 5], class id3: [video clip 6]}. In this way, matched target video clips are deleted from the video classification set, which reduces the degree of picture repetition in subsequently generated target videos.
In step S1103, when all the video clips in the video classification set are deleted, the deleted video clips are recalled by the recall mechanism of the video classification set and are restored to the video classification set.
It should be noted that, in step S1103, after all the video clips in the video classification set have been deleted, the deleted video clips are recalled by the recall mechanism of the video classification set and stored in it again, which improves the reuse rate of the video clips obtained from slicing. A sketch of the deletion-and-recall bookkeeping follows.
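A minimal sketch of the deletion and recall bookkeeping, assuming WHOLE_CATEGORY_SET is the dictionary built in step S1101 and that the recall mechanism simply restores the original set once every clip has been used (an assumption about how the recall is realized):

class ClipPool:
    """Tracks which clips in the video classification set are still available for matching."""

    def __init__(self, whole_category_set: dict[int, list[str]]):
        self._original = {cid: list(clips) for cid, clips in whole_category_set.items()}
        self.available = {cid: list(clips) for cid, clips in whole_category_set.items()}

    def delete_used(self, used_clips: list[str]) -> None:
        """Remove clips that were just synthesized into a target video (step S1102)."""
        for cid in self.available:
            self.available[cid] = [c for c in self.available[cid] if c not in used_clips]
        if all(not clips for clips in self.available.values()):
            self.recall()  # step S1103: every clip has been used, so recall them all

    def recall(self) -> None:
        """Recall mechanism: restore the deleted clips to the video classification set."""
        self.available = {cid: list(clips) for cid, clips in self._original.items()}

# pool = ClipPool(WHOLE_CATEGORY_SET)
# pool.delete_used(["video clip 1", "video clip 3"])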
Through steps S102 to S110 of this embodiment, the problem of how to improve the effect of automated video editing is solved: matched target video clips are deleted from the video clip set, which reduces the degree of picture repetition in subsequently generated target videos, and the use of the global optimal algorithm improves the overall matching accuracy of the video clips.
It should be noted that the steps illustrated in the above-described flow or flow diagrams of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flow diagrams, in some cases, the steps illustrated or described may be performed in an order other than that illustrated herein.
An embodiment of the present application provides a video clip automation system, fig. 2 is a block diagram of the structure of the video clip automation system according to an embodiment of the present application, and as shown in fig. 2, the system includes a video processing module 21, a text processing module 22, a matching module 23, and a synthesizing module 24;
the video processing module 21 is configured to perform slicing processing on a video to be clipped to obtain a video clip set including a plurality of video clips; calculating to obtain video feature vectors of the video clips;
a text processing module 22, configured to calculate a text feature vector of the video script text through a word vector tool;
the matching module 23 is configured to calculate, according to the text feature vector and the video feature vector of the video script text, a matching similarity between the video script text and the video clip; according to the matching similarity, obtaining a target video segment corresponding to the video script text through a global optimal algorithm;
and the synthesizing module 24 is used for synthesizing the target video fragments according to the video script text to obtain target videos, and deleting the target video fragments from the video fragment set.
Through the video processing module 21, the text processing module 22, the matching module 23 and the synthesizing module 24 of this embodiment, the problem of how to improve the effect of automated video editing is solved: matched target video clips are deleted from the video clip set, which reduces the degree of picture repetition in subsequently generated target videos, and the use of the global optimal algorithm improves the overall matching accuracy of the video clips.
The above-described respective modules may be functional modules or program modules, and may be implemented by software or hardware. For modules implemented in hardware, the various modules described above may be located in the same processor; or the above modules may be located in different processors in any combination.
The present embodiment also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, where the transmission device is connected to the processor, and the input/output device is connected to the processor.
It should be noted that, specific examples in this embodiment may refer to examples described in the foregoing embodiments and alternative implementations, and this embodiment is not repeated herein.
In addition, in combination with the video clip automation method in the above embodiment, the embodiment of the application may be implemented by providing a storage medium. The storage medium has a computer program stored thereon; the computer program, when executed by a processor, implements any of the video clip automation methods of the above embodiments.
In one embodiment, a computer device is provided, which may be a terminal. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a video clip automation method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
In one embodiment, fig. 3 is a schematic diagram of an internal structure of an electronic device according to an embodiment of the present application, and as shown in fig. 3, an electronic device is provided, which may be a server, and an internal structure diagram thereof may be shown in fig. 3. The electronic device includes a processor, a network interface, an internal memory, and a non-volatile memory connected by an internal bus, where the non-volatile memory stores an operating system, computer programs, and a database. The processor is for providing computing and control capabilities, the network interface is for communicating with an external terminal via a network connection, the internal memory is for providing an environment for the operation of an operating system and a computer program which when executed by the processor implements a video clip automation method, and the database is for storing data.
It will be appreciated by those skilled in the art that the structure shown in fig. 3 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the electronic device to which the present application is applied, and that a particular electronic device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
It should be understood by those skilled in the art that the technical features of the above-described embodiments may be combined in any manner, and for brevity, all of the possible combinations of the technical features of the above-described embodiments are not described, however, they should be considered as being within the scope of the description provided herein, as long as there is no contradiction between the combinations of the technical features.
The above examples merely represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the invention. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the claims appended hereto.

Claims (8)

1. A video clip automation method, the method comprising:
performing slicing treatment on the video to be clipped to obtain a video clip set containing a plurality of video clips;
calculating to obtain a video feature vector of the video clip;
based on the text feature vector of the video script text and the video feature vector, calculating to obtain the matching similarity between the video script text and the video fragment;
obtaining a target video segment corresponding to the video script text through a global optimal algorithm based on the matching similarity;
performing video classification on the video clips based on the video feature vectors to obtain a video classification set;
synthesizing the target video clips based on the video script to obtain target videos, and deleting the target video clips from the video classification set;
and after all the video clips in the video classification set are deleted, recalling the deleted video clips through a recall mechanism of the video classification set, and storing the video clips in the video classification set again.
2. The method of claim 1, wherein video classifying the video segments based on the video feature vectors to obtain a set of video classifications comprises:
and calculating the classification similarity among the video clips through a video classification model based on the video feature vectors, and classifying the video clips with the classification similarity larger than a preset threshold value into the same category to obtain a video classification set, wherein the classification similarity is measured through Euclidean distance, manhattan distance or cosine distance.
3. The method of claim 1, wherein calculating a matching similarity between the video script text and the video clip based on a text feature vector of the video script text and the video feature vector comprises:
and based on the text feature vectors of the video script texts and the video feature vectors, calculating to obtain matching similarity between each video script text and all the video fragments through a text video matching model, wherein the matching similarity is measured through Euclidean distance, manhattan distance or cosine distance.
4. The method of claim 3, wherein deriving, based on the matching similarity, a target video clip corresponding to the video script text by a global optimization algorithm comprises:
and based on the matching similarity between each video script text and all the video fragments, maximizing the global similarity between the video script text and the video fragments through a global optimal algorithm, and obtaining a target video fragment corresponding to the video script text.
5. The method of claim 1, wherein synthesizing the target video segment based on the video script comprises:
and splicing the target video fragments corresponding to the video script texts according to the sequence based on the sequence of the video script texts, so as to obtain the target video.
6. The method of claim 1, wherein synthesizing the target video segment based on the video script further comprises:
converting the video script text into target audio, and carrying out coding fusion on the target audio and a corresponding target video segment to obtain a fused video segment;
and splicing the fusion video fragments according to the sequence based on the sequence of the video script texts, so as to obtain a target video.
7. The method of claim 1, wherein prior to calculating a matching similarity between the video script text and the video clip based on a text feature vector of the video script text and the video feature vector, the method comprises:
and calculating a text feature vector of the video script text through a word vector tool.
8. A video clip automation system for performing the method of any of the preceding claims 1 to 7, the system comprising a video processing module, a text processing module, a matching module and a compositing module;
the video processing module is used for carrying out slicing processing on the video to be clipped to obtain a video clip set containing a plurality of video clips; calculating to obtain a video feature vector of the video clip;
the text processing module is used for calculating a text feature vector of the video script text through the word vector tool;
the matching module is used for calculating and obtaining matching similarity between the video script text and the video fragment according to the text feature vector of the video script text and the video feature vector; obtaining a target video segment corresponding to the video script text through a global optimal algorithm according to the matching similarity;
and the synthesis module is used for synthesizing the target video fragments according to the video script to obtain target videos, and deleting the target video fragments from the video fragment set.
CN202410027805.9A 2024-01-09 2024-01-09 Video editing automation method and system Active CN117544822B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410027805.9A CN117544822B (en) 2024-01-09 2024-01-09 Video editing automation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410027805.9A CN117544822B (en) 2024-01-09 2024-01-09 Video editing automation method and system

Publications (2)

Publication Number Publication Date
CN117544822A CN117544822A (en) 2024-02-09
CN117544822B true CN117544822B (en) 2024-03-26

Family

ID=89790330

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410027805.9A Active CN117544822B (en) 2024-01-09 2024-01-09 Video editing automation method and system

Country Status (1)

Country Link
CN (1) CN117544822B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018040059A1 (en) * 2016-09-02 2018-03-08 Microsoft Technology Licensing, Llc Clip content categorization
KR20180054537A (en) * 2016-09-30 2018-05-24 주식회사 요쿠스 Video editing system and method
CN111866610A (en) * 2019-04-08 2020-10-30 百度时代网络技术(北京)有限公司 Method and apparatus for generating information
CN114222196A (en) * 2022-01-04 2022-03-22 阿里巴巴新加坡控股有限公司 Method and device for generating short video of plot commentary and electronic equipment
CN114339392A (en) * 2021-11-12 2022-04-12 腾讯科技(深圳)有限公司 Video editing method and device, computer equipment and storage medium
CN115134646A (en) * 2022-08-25 2022-09-30 荣耀终端有限公司 Video editing method and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110121118B (en) * 2019-06-17 2021-08-06 腾讯科技(深圳)有限公司 Video clip positioning method and device, computer equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018040059A1 (en) * 2016-09-02 2018-03-08 Microsoft Technology Licensing, Llc Clip content categorization
KR20180054537A (en) * 2016-09-30 2018-05-24 주식회사 요쿠스 Video editing system and method
CN111866610A (en) * 2019-04-08 2020-10-30 百度时代网络技术(北京)有限公司 Method and apparatus for generating information
CN114339392A (en) * 2021-11-12 2022-04-12 腾讯科技(深圳)有限公司 Video editing method and device, computer equipment and storage medium
CN114222196A (en) * 2022-01-04 2022-03-22 阿里巴巴新加坡控股有限公司 Method and device for generating short video of plot commentary and electronic equipment
CN115134646A (en) * 2022-08-25 2022-09-30 荣耀终端有限公司 Video editing method and electronic equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Ratnesh Kumar; "Hierarchical representation of videos with spatio-temporal fibers"; IEEE Winter Conference on Applications of Computer Vision; 2014; full text. *
A video similarity matching algorithm supporting different time scales; 邓智, 贾克斌; Application Research of Computers; 2009-01-15 (01) *
Research on short video classification based on deep association representation learning; 崔天舒; China Masters' Theses Full-text Database; 2023-02-15; full text *

Also Published As

Publication number Publication date
CN117544822A (en) 2024-02-09

Similar Documents

Publication Publication Date Title
CN110162669B (en) Video classification processing method and device, computer equipment and storage medium
CN110263150B (en) Text generation method, device, computer equipment and storage medium
CN111523413B (en) Method and device for generating face image
CN110781347A (en) Video processing method, device, equipment and readable storage medium
JP7112537B2 (en) Information processing method and device, electronic device, computer-readable storage medium and program
CN110009059B (en) Method and apparatus for generating a model
CN111460290B (en) Information recommendation method, device, equipment and storage medium
CN115937033B (en) Image generation method and device and electronic equipment
CN115359314A (en) Model training method, image editing method, device, medium and electronic equipment
CN112995749A (en) Method, device and equipment for processing video subtitles and storage medium
CN110263218A (en) Video presentation document creation method, device, equipment and medium
CN116306603A (en) Training method of title generation model, title generation method, device and medium
CN113032001A (en) Intelligent contract classification method and device
CN115757725A (en) Question and answer processing method and device, computer equipment and storage medium
CN117544822B (en) Video editing automation method and system
CN113110843A (en) Contract generation model training method, contract generation method and electronic equipment
CN115952854B (en) Training method of text desensitization model, text desensitization method and application
JP2021051709A (en) Text processing apparatus, method, device, and computer-readable recording medium
CN111401032B (en) Text processing method, device, computer equipment and storage medium
CN112528674B (en) Text processing method, training device, training equipment and training equipment for model and storage medium
CN114880709A (en) E-commerce data protection method and server applying artificial intelligence
CN114564606A (en) Data processing method and device, electronic equipment and storage medium
CN111552871A (en) Information pushing method and device based on application use record and related equipment
CN113469197A (en) Image-text matching method, device, equipment and storage medium
CN112287159A (en) Retrieval method, electronic device and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant