WO2023184636A1 - Automatic video editing method and system, and terminal and storage medium - Google Patents

Automatic video editing method and system, and terminal and storage medium Download PDF

Info

Publication number
WO2023184636A1
WO2023184636A1 (PCT/CN2022/089560)
Authority
WO
WIPO (PCT)
Prior art keywords
video
edited
key frames
asr
information
Prior art date
Application number
PCT/CN2022/089560
Other languages
French (fr)
Chinese (zh)
Inventor
唐小初
舒畅
陈又新
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2023184636A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23424Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement

Definitions

  • This application relates to the technical field of cluster analysis in artificial intelligence, and in particular to an automatic video editing method, system, terminal and storage medium.
  • This application provides an automatic video editing method, system, terminal and storage medium, aiming to solve the technical problems of existing video editing, which relies on human labor and therefore incurs high cost and low editing efficiency.
  • An automatic video editing method includes:
  • an automatic video editing system including:
  • the first acquisition module, used to acquire the key frames of the video to be edited, self-label the key frames with an image comparison algorithm, and generate an unsupervised vector representation of the key frames;
  • the second acquisition module is used to obtain the corpus information of the video to be edited, and uses a text comparison algorithm to obtain the unsupervised vector representation of the corpus information;
  • Video segmentation module used to segment the video to be edited according to the key frames and generate video segments corresponding to the number of key frames;
  • the video merging module, used to calculate the similarity of adjacent video segments based on the unsupervised vector representations of the key frames and the corpus information, and to merge adjacent video segments whose similarity is greater than a set similarity threshold, generating the video editing result of the video to be edited.
  • a terminal includes a processor and a memory coupled to the processor, wherein,
  • the memory stores program instructions for implementing the above-mentioned automatic video editing method
  • the processor is configured to execute the program instructions stored in the memory to perform the automatic video clipping operation.
  • a storage medium that stores program instructions executable by a processor, and the program instructions are used to execute the above-mentioned automatic video editing method.
  • the automatic video editing method, system, terminal and storage medium of the embodiments of the present application collect key frames and corpus information of the video to be edited, divide the video into multiple video segments at the key frames, calculate the similarity of adjacent segments from the vector representations of the key frames and the corpus information, and merge the segments with higher similarity to obtain the final video editing result.
  • the embodiment of the present application utilizes both image and text information, avoids manual data annotation, realizes automatic video editing, and greatly improves the efficiency of video editing.
  • Figure 1 is a schematic flow chart of an automatic video editing method according to the first embodiment of the present application
  • FIG. 2 is a schematic flowchart of the automatic video editing method according to the second embodiment of the present application.
  • Figure 3 is a schematic structural diagram of an automatic video editing system according to an embodiment of the present application.
  • Figure 4 is a schematic structural diagram of a terminal according to an embodiment of the present application.
  • Figure 5 is a schematic structural diagram of a storage medium according to an embodiment of the present application.
  • the terms "first", "second" and "third" in this application are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the quantity of the indicated technical features. Thus, features defined as "first", "second" or "third" may explicitly or implicitly include at least one such feature.
  • "plurality" means at least two, such as two or three, unless otherwise clearly and specifically limited. All directional indications (such as up, down, left, right, front, back...) in the embodiments of this application are only used to explain the relative positional relationship, movement and the like between components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indication changes accordingly.
  • reference herein to "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application.
  • the appearances of this phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art understand, both explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
  • FIG. 1 is a schematic flowchart of an automatic video editing method according to the first embodiment of the present application.
  • the automatic video editing method in the first embodiment of the present application includes the following steps S101-S104:
  • S101 Obtain the key frames of the video to be edited, and use an image comparison algorithm to self-mark the key frames to generate an unsupervised vector representation of the key frames;
  • a key frame is a frame in which a key action occurs in the movement of a character or object in the video to be edited.
  • the key frames are acquired as follows: ffmpeg is used to extract frames from the video to be edited. FFmpeg is an open-source suite of computer programs for recording and converting digital audio and video and turning them into streams, with functions such as video capture, video format conversion, video screenshotting and video watermarking. For all extracted images, the similarity between adjacent images is calculated, and images whose similarity is lower than a set threshold are taken as key frames.
  • self-labeling the key frames with the image comparison algorithm is specifically: based on the acquired key frames, an unsupervised algorithm is used to train the Self label model.
  • the Self label model uses the image comparison algorithm to learn an unsupervised vector representation of each key-frame image, self-labels the key frames through clustering and representation learning, and outputs self_label(frame_k), where frame_k denotes the k-th key-frame image.
  • S102 Obtain the corpus information of the video to be edited, and use the text comparison algorithm to obtain the unsupervised vector representation of the corpus information;
  • the corpus information is obtained as follows: ASR technology is used to collect the ASR speech information of the video to be edited, and the collected ASR speech information is cut into ASR text information of a set length; OCR technology is used to obtain OCR text information from the extracted frames; the cut ASR text information and the OCR text information serve as the corpus information of the video to be edited.
  • obtaining the unsupervised vector representation of the corpus information with the text comparison algorithm is specifically: the SimCSE model is trained on the corpus information.
  • the SimCSE model uses the text comparison algorithm to learn unsupervised vector representations of the ASR text information and OCR text information, and outputs the text vectors simcse(asr_k) and simcse(ocr_k) of the video to be edited, where asr_k denotes the k-th ASR text information of the video and ocr_k denotes the OCR text information of the k-th key-frame image.
  • S103 Segment the video to be edited according to the key frames, generating video segments corresponding to the number of key frames; the video is segmented as follows: each key frame is used as a cutting point, the video to be edited is divided into video segments corresponding to the number of key frames, and each segment includes one key-frame image together with the ASR text information and OCR text information corresponding to that segment.
  • S104 Calculate the similarity between adjacent video clips from the unsupervised vector representations of the key frames and the corpus information, merge adjacent clips whose similarity is greater than the preset similarity threshold, and generate the video editing result of the video to be edited;
  • the similarity of adjacent video clips is calculated as follows. First, the similarities of the key frames, ASR text information and OCR text information of the adjacent clips are computed:

    simi1 = cos(self_label(frame_k), self_label(frame_{k+1}))   (1)
    simi2 = cos(simcse(asr_k), simcse(asr_{k+1}))   (2)
    simi3 = cos(simcse(ocr_k), simcse(ocr_{k+1}))   (3)

  • simi1, simi2 and simi3 respectively represent the similarities of the key frames, ASR text information and OCR text information in adjacent video clips; the similarity of the clips themselves is then

    simi = α*simi1 + β*simi2 + (1 − α − β)*simi3   (4)

  • simi represents the similarity of the adjacent video clips, and α and β are adjustable parameters.
  • the automatic video editing method of the first embodiment of the present application obtains the key frames and corpus information of the video to be edited, uses an image comparison algorithm to learn unsupervised vector representations of the key-frame images and a text comparison algorithm to learn unsupervised vector representations of the corpus information, divides the video into multiple segments at the key frames, calculates the similarity of adjacent segments from these vector representations, and merges segments with higher similarity to obtain the final video editing result.
  • the embodiment of the present application utilizes both image and text information, avoids manual data annotation, realizes automatic video editing, and greatly improves the efficiency of video editing.
  • FIG. 2 is a schematic flowchart of an automatic video editing method according to the second embodiment of the present application.
  • the automatic video editing method in the second embodiment of the present application includes the following steps S201-S209:
  • S201 Collect at least one video to be edited
  • S202 Perform frame extraction processing on the video to be edited, and obtain the key frames of the video to be edited;
  • in this step, the ffmpeg program is used to extract frames from the collected video to be edited. FFmpeg is an open-source suite of computer programs for recording and converting digital audio and video and turning them into streams, with functions such as video capture, video format conversion, video screenshotting and video watermarking.
  • Key frames refer to the frames where key actions occur in the movement changes of characters or objects in the video to be edited.
  • the key frames are obtained by calculating the similarity between adjacent images over all extracted frames and taking the image frames whose similarity is lower than the set threshold as key frames. While acquiring the key frames, a certain number of the remaining (non-key) frames are retained according to a set ratio; the number k of retained frames can be set arbitrarily.
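  • As a rough sketch of this step, the snippet below drives the ffmpeg CLI to extract one frame per second and keeps each frame that differs enough from its predecessor. The sampling rate, the histogram signature and the 0.9 threshold are illustrative assumptions, not values from the patent:

```python
import glob
import os
import subprocess

import numpy as np
from PIL import Image

def extract_frames(video_path: str, out_dir: str, fps: int = 1) -> list[str]:
    """Extract frames with the ffmpeg CLI (here: one frame per second)."""
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", f"fps={fps}", f"{out_dir}/frame_%05d.png"],
        check=True,
    )
    return sorted(glob.glob(f"{out_dir}/frame_*.png"))

def hist_vector(path: str) -> np.ndarray:
    """Cheap image signature: L2-normalized RGB histogram of a downscaled frame."""
    hist = np.asarray(Image.open(path).convert("RGB").resize((128, 128)).histogram(), dtype=float)
    return hist / np.linalg.norm(hist)

def select_key_frames(frame_paths: list[str], threshold: float = 0.9) -> list[str]:
    """Keep frames whose similarity to the previous frame falls below the threshold."""
    keys = [frame_paths[0]]  # the first frame opens the first shot
    prev = hist_vector(frame_paths[0])
    for path in frame_paths[1:]:
        cur = hist_vector(path)
        if float(prev @ cur) < threshold:  # cosine similarity of the signatures
            keys.append(path)
        prev = cur
    return keys
```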
  • S203 Based on the acquired key frames, an unsupervised algorithm is used to train the Self label model.
  • the Self label model self-labels the key frames through clustering and representation learning;
  • the Self label model is a self-supervised algorithm that assigns labels by maximizing the mutual information between data and labels. It uses the image comparison algorithm to learn an unsupervised vector representation of each key-frame image, self-labels the key-frame images through clustering and representation learning, and outputs self_label(frame_k), where frame_k denotes the k-th key-frame image.
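  • The full Self label training loop (simultaneous clustering and representation learning under a mutual-information objective) is beyond a short sketch; as a heavily simplified stand-in, the snippet below pseudo-labels fixed key-frame embeddings with k-means and returns soft cluster assignments, so the result is still a vector per frame and can be plugged into equation (1). Only the input/output contract matches the text; the clustering algorithm is an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

def self_label(frame_vectors: np.ndarray, n_clusters: int = 16) -> np.ndarray:
    """Soft pseudo-labels for key-frame embeddings, one row per frame.

    frame_vectors has shape (num_key_frames, dim); n_clusters must not exceed
    the number of key frames. Each returned row plays the role of
    self_label(frame_k) in the similarity formulas.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(frame_vectors)
    dist = km.transform(frame_vectors)               # (n_frames, n_clusters) distances
    logits = -dist                                   # closer cluster -> larger logit
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    return probs / probs.sum(axis=1, keepdims=True)  # soft assignment vectors
```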
  • S204 Use ASR (Automatic Speech Recognition) technology and OCR (Optical Character Recognition) technology to collect the corpus information of the video to be edited;
  • video is a typical multi-modal data, including images and rich text information.
  • the corpus information of the video to be edited includes ASR voice information in the video to be edited and OCR text information in the extracted frame image.
  • the corpus information of the video to be edited is obtained as follows: ASR technology is used to collect the ASR speech information of the video, and the collected ASR speech information is cut into ASR text information of a set length; at the same time, OCR technology is used to obtain OCR text information from the extracted frames, and the cut ASR text information and the OCR text information serve as the corpus information of the video to be edited.
  • the cutting length of ASR voice information is 100, which can be set according to the actual application.
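  • A minimal sketch of the cutting step, assuming the set length (100 in this embodiment) is counted in characters of the recognized transcript:

```python
def cut_asr_text(transcript: str, seg_len: int = 100) -> list[str]:
    """Cut the ASR transcript into consecutive segments of at most seg_len characters."""
    return [transcript[i:i + seg_len] for i in range(0, len(transcript), seg_len)]
```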
  • S205 Train the SimCSE model based on the collected corpus information, and output the SimCSE text vector of the video to be edited through the SimCSE model;
  • the SimCSE model can be trained without supervision: it builds on self-supervised training of a BERT model, uses dropout as a semantics-preserving natural-language data augmentation, learns unsupervised text vector representations with the text comparison algorithm, and outputs the text vectors simcse(asr_k) and simcse(ocr_k) of the video to be edited, where asr_k denotes the k-th ASR text information of the video and ocr_k denotes the OCR text information of the k-th key-frame image.
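  • A sketch of producing such text vectors with a public SimCSE checkpoint via Hugging Face transformers; the patent trains its own SimCSE model on the video's ASR/OCR corpus, and the checkpoint name here is only an illustrative substitute:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Example public checkpoint; a model trained on the video's own corpus would be used instead.
CKPT = "princeton-nlp/unsup-simcse-bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModel.from_pretrained(CKPT)

def simcse_vectors(texts: list[str]) -> torch.Tensor:
    """Return one unsupervised sentence vector per input text (ASR or OCR segment)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    return out.pooler_output  # pooled [CLS] vectors, as in the SimCSE reference code
```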
  • S206 Use each key frame as a cutting point, divide the video to be edited into multiple video clips, and make each clip include one key frame together with the ASR text information and OCR text information corresponding to that clip;
  • in this step, the long video is divided into multiple shorter clips at the key frames. Each clip includes one key-frame image and the ASR text and OCR text corresponding to the clip, i.e. each clip is represented as (frame, asr, ocr).
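  • A minimal sketch of this clip representation, assuming the k-th ASR segment and OCR text have already been aligned to the k-th key frame:

```python
from dataclasses import dataclass

@dataclass
class Clip:
    frame: str  # key-frame image (e.g. file path)
    asr: str    # ASR text segment belonging to this clip
    ocr: str    # OCR text of the key frame

def split_into_clips(key_frames: list[str], asr_segments: list[str], ocr_texts: list[str]) -> list[Clip]:
    """Build one (frame, asr, ocr) clip per key frame, as described in the text."""
    return [Clip(f, a, o) for f, a, o in zip(key_frames, asr_segments, ocr_texts)]
```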
  • S207 Based on the key frames, ASR text information and OCR text of the video to be edited, calculate the similarity of each pair of adjacent video clips;
  • the similarity of adjacent clips is calculated as follows: first, the similarities of the key frames, ASR text and OCR text of the adjacent clips are computed separately; then the similarity of the adjacent clips is computed from these three similarities.
  • the specific calculation formulas are as follows:

    simi1 = cos(self_label(frame_k), self_label(frame_{k+1}))   (1)
    simi2 = cos(simcse(asr_k), simcse(asr_{k+1}))   (2)
    simi3 = cos(simcse(ocr_k), simcse(ocr_{k+1}))   (3)
    simi = α*simi1 + β*simi2 + (1 − α − β)*simi3   (4)

  • simi1, simi2 and simi3 respectively represent the similarities of the key frames, ASR text and OCR text in adjacent video clips, simi represents the similarity of the adjacent clips, and α and β are adjustable parameters.
  • the values of ⁇ and ⁇ are set to 0.45.
  • S208 Determine whether the similarity between two adjacent video clips is greater than the set similarity threshold; if it is greater than the preset similarity threshold, execute S209;
  • the similarity threshold is set to 0.5: if the similarity between two adjacent video clips is greater than 0.5, the two clips are considered similar enough to be merged; otherwise, the two clips are discarded.
  • S209 Merge the adjacent video clips whose similarity is greater than the preset similarity threshold to obtain the final video editing result; the edited short video obtained by merging highly similar clips is smoother and improves the viewer's experience.
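  • A sketch of S208-S209 under the stated 0.5 threshold: runs of mutually similar adjacent clips are merged into one output segment, and clips not similar to a neighbour are dropped, following the embodiment's description; the greedy run-building is an assumption about how pairwise merges compose:

```python
def merge_clips(clips: list, sims: list[float], threshold: float = 0.5) -> list[list]:
    """Merge adjacent clips; sims[i] is the similarity of clips[i] and clips[i + 1]."""
    segments, run = [], [clips[0]]
    for clip, s in zip(clips[1:], sims):
        if s > threshold:
            run.append(clip)              # similar enough: extend the current segment
        else:
            if len(run) > 1:
                segments.append(run)      # keep only merged (multi-clip) segments
            run = [clip]
    if len(run) > 1:
        segments.append(run)
    return segments
```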
  • the automatic video editing method of the second embodiment of the present application collects the key frames and corpus information of the video to be edited, uses the image comparison algorithm to learn unsupervised vector representations of the key-frame images and the text comparison algorithm to learn unsupervised vector representations of the corpus information, divides the video into multiple clips at the key frames, calculates the similarity of adjacent clips from these vector representations, and merges clips with higher similarity to obtain the final video editing result.
  • the embodiment of the present application utilizes both image and text information, avoids manual data annotation, realizes automatic video editing, and greatly improves the efficiency of video editing.
  • the results of the automatic video editing method can also be uploaded to the blockchain.
  • the corresponding summary information is obtained based on the result of the automatic video editing method.
  • the summary information is obtained by hashing the result of the automatic video editing method, for example, using the sha256s algorithm.
  • Uploading summary information to the blockchain ensures its security and fairness and transparency to users. Users can download this summary information from the blockchain to verify whether the results of the automatic video editing method have been tampered with.
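  • A sketch of the digest step, reading "sha256s" as SHA-256 and hashing the bytes of the edited output file with Python's hashlib; what exactly constitutes "the result" (file bytes, clip list, metadata) is not specified and is assumed here:

```python
import hashlib

def result_digest(edited_video_path: str) -> str:
    """SHA-256 digest of the editing result; the hex string would be uploaded to the chain."""
    h = hashlib.sha256()
    with open(edited_video_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # stream in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()
```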
  • the blockchain referred to in this example is a new application model of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms.
  • a blockchain is essentially a decentralized database: a chain of data blocks generated and linked using cryptographic methods, where each block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • Blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • FIG 3 is a schematic structural diagram of an automatic video editing system according to an embodiment of the present application.
  • the automatic video editing system 40 in the embodiment of this application includes:
  • the first acquisition module 41 is used to obtain the key frames of the video to be edited and to self-label them with an image comparison algorithm, generating unsupervised vector representations of the key frames; here a key frame is a frame in which a key action occurs in the movement of a character or object in the video to be edited.
  • the key frame acquisition method is: use ffmpeg to extract frames from the video to be edited; for all images after frame extraction, calculate the similarity between adjacent images, and use the images with a similarity lower than the set threshold as key frames.
  • the first acquisition module 41 self-labels the key frames with the image comparison algorithm as follows: based on the acquired key frames, an unsupervised algorithm is used to train the Self label model. The Self label model uses the image comparison algorithm to learn an unsupervised vector representation of each key-frame image, self-labels the key frames through clustering and representation learning, and outputs self_label(frame_k), where frame_k denotes the k-th key-frame image.
  • the second acquisition module 42 is used to obtain the corpus information of the video to be edited and to obtain an unsupervised vector representation of the corpus information with a text comparison algorithm. The corpus information is obtained as follows: ASR technology is used to collect the ASR speech information of the video, and the collected speech information is cut into ASR text information of a set length; OCR technology is used to obtain OCR text information from the extracted frames; the cut ASR text information and the OCR text information serve as the corpus information of the video to be edited.
  • the second acquisition module 42 obtains the unsupervised vector representation of the corpus information with the text comparison algorithm as follows: the SimCSE model is trained on the corpus information. The SimCSE model uses the text comparison algorithm to learn unsupervised vector representations of the ASR text information and OCR text information, and outputs the text vectors simcse(asr_k) and simcse(ocr_k) of the video to be edited, where asr_k denotes the k-th ASR text information of the video and ocr_k denotes the OCR text information of the k-th key-frame image.
  • the video segmentation module 43 is used to segment the video to be edited at the key frames, generating video segments corresponding to the number of key frames. Specifically, each key frame is used as a cutting point, the video is divided into segments corresponding to the number of key frames, and each segment includes one key-frame image together with the ASR text information and OCR text information corresponding to that segment.
  • the video merging module 44 is used to calculate the similarity of adjacent video segments from the unsupervised vector representations of the key frames and the corpus information, merge adjacent segments whose similarity is greater than the set similarity threshold, and generate the video editing result of the video to be edited. The similarity of adjacent segments is calculated as in equations (1)-(4) above, where simi1, simi2 and simi3 represent the similarities of the key frames, ASR text information and OCR text information in adjacent clips, simi represents the similarity of the adjacent clips, and α and β are adjustable parameters.
  • the automatic video editing system of the embodiment of the present application obtains the key frames and corpus information of the video to be edited, uses the image comparison algorithm to learn unsupervised vector representations of the key-frame images and the text comparison algorithm to learn unsupervised vector representations of the corpus information, divides the video into multiple segments at the key frames, calculates the similarity of adjacent segments from these vector representations, and merges the segments with higher similarity to obtain the final video editing result.
  • the embodiment of the present application utilizes both image and text information, avoids manual data annotation, realizes automatic video editing, and greatly improves the efficiency of video editing.
  • the terminal 50 includes a processor 51 and a memory 52 coupled to the processor 51 .
  • the memory 52 stores program instructions for implementing the above-mentioned automatic video editing method.
  • the processor 51 is configured to execute program instructions stored in the memory 52 to perform automatic video editing operations.
  • the processor 51 can also be called a CPU (Central Processing Unit).
  • the processor 51 may be an integrated circuit chip with signal processing capabilities.
  • the processor 51 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor or any conventional processor.
  • FIG. 5 is a schematic structural diagram of a storage medium according to an embodiment of the present application.
  • the storage medium in the embodiment of the present application stores program files 61 that can implement all the above methods.
  • the program files 61 can be stored in the above-mentioned storage medium in the form of software products.
  • the storage medium can be non-volatile or volatile, and includes a number of instructions that cause a computer device (which can be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the methods of the embodiments of this application.
  • the aforementioned storage media include media that can store program code, such as USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks and optical discs, as well as terminal devices such as computers, servers, mobile phones and tablets.
  • each functional unit in each embodiment of the present application can be integrated into one processing unit, each unit can exist physically alone, or two or more units can be integrated into one unit.
  • the above integrated units can be implemented in the form of hardware or software functional units.

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Signal Processing (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Marketing (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

Disclosed in the present application are an automatic video editing method and system, and a terminal and a storage medium. The method comprises: acquiring key frames of a video to be edited, and self-tagging the key frames by using an image comparison algorithm, so as to generate unsupervised vector representations of the key frames; acquiring corpus information of said video, and acquiring an unsupervised vector representation of the corpus information by using a text comparison algorithm; segmenting said video according to the key frames, so as to generate video clips corresponding to the number of key frames; and calculating the similarity between adjacent video clips according to the unsupervised vector representations of the key frames and the unsupervised vector representation of the corpus information, and combining adjacent video clips between which the similarity is greater than a set similarity threshold value, so as to generate a video editing result of said video. In the embodiments of the present application, image information and text information are used, thereby avoiding manual data labeling, realizing automatic editing of a video, and greatly improving the video editing efficiency.

Description

An automatic video editing method, system, terminal and storage medium
This application claims priority to Chinese patent application No. 202210318902.4, filed with the China Patent Office on March 29, 2022 and entitled "Automatic video editing method, system, terminal and storage medium", the entire content of which is incorporated herein by reference.
Technical field
This application relates to the technical field of cluster analysis in artificial intelligence, and in particular to an automatic video editing method, system, terminal and storage medium.
Background
With the development of 4G networks, short video technology has flourished. With the emergence of a large number of video apps such as Douyin, Kuaishou and Bilibili, the number of videos has grown exponentially. Although videos are more intuitive than text and pictures, watching them takes a lot of time. In a long video, the segments that are valuable or interesting to the user often make up only a fraction of its total length, so the demand for video editing grows by the day.
The inventors realized that video editing in the prior art usually relies on human labor, which is both expensive and inefficient, and this hinders the development of short video technology to a certain extent.
Summary of the invention
This application provides an automatic video editing method, system, terminal and storage medium, aiming to solve the technical problems of existing video editing, which relies on human labor and therefore incurs high cost and low editing efficiency.
To solve the above technical problems, the technical solution adopted by this application is:
An automatic video editing method, the method including:
obtaining the key frames of the video to be edited, self-labeling the key frames with an image comparison algorithm, and generating unsupervised vector representations of the key frames;
obtaining the corpus information of the video to be edited, and obtaining an unsupervised vector representation of the corpus information with a text comparison algorithm;
segmenting the video to be edited according to the key frames, generating video segments corresponding to the number of key frames;
calculating the similarity between adjacent video segments from the unsupervised vector representations of the key frames and the corpus information, merging adjacent video segments whose similarity is greater than a preset similarity threshold, and generating the video editing result of the video to be edited.
Another technical solution adopted by the embodiments of this application is an automatic video editing system, including:
a first acquisition module, used to obtain the key frames of the video to be edited, self-label the key frames with an image comparison algorithm, and generate unsupervised vector representations of the key frames;
a second acquisition module, used to obtain the corpus information of the video to be edited and to obtain an unsupervised vector representation of the corpus information with a text comparison algorithm;
a video segmentation module, used to segment the video to be edited according to the key frames, generating video segments corresponding to the number of key frames;
a video merging module, used to calculate the similarity of adjacent video segments from the unsupervised vector representations of the key frames and the corpus information, and to merge adjacent video segments whose similarity is greater than a set similarity threshold, generating the video editing result of the video to be edited.
Yet another technical solution adopted by the embodiments of this application is a terminal, the terminal including a processor and a memory coupled to the processor, wherein
the memory stores program instructions for implementing the above automatic video editing method;
the processor is configured to execute the program instructions stored in the memory to perform the automatic video editing operations.
Yet another technical solution adopted by the embodiments of this application is a storage medium storing program instructions executable by a processor, the program instructions being used to execute the above automatic video editing method.
The automatic video editing method, system, terminal and storage medium of the embodiments of this application collect the key frames and corpus information of the video to be edited, divide the video into multiple video segments at the key frames, calculate the similarity of adjacent segments from the vector representations of the key frames and the corpus information, and merge segments with higher similarity to obtain the final video editing result. The embodiments of this application use both image and text information, avoid manual data annotation, realize automatic video editing, and greatly improve video editing efficiency.
Brief description of the drawings
Figure 1 is a schematic flowchart of the automatic video editing method according to the first embodiment of this application;
Figure 2 is a schematic flowchart of the automatic video editing method according to the second embodiment of this application;
Figure 3 is a schematic structural diagram of the automatic video editing system according to an embodiment of this application;
Figure 4 is a schematic structural diagram of the terminal according to an embodiment of this application;
Figure 5 is a schematic structural diagram of the storage medium according to an embodiment of this application.
Detailed description of the embodiments
The technical solutions in the embodiments of this application will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the scope of protection of this application.
The terms "first", "second" and "third" in this application are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the quantity of the indicated technical features. Thus, features defined as "first", "second" or "third" may explicitly or implicitly include at least one such feature. In the description of this application, "plurality" means at least two, such as two or three, unless otherwise clearly and specifically limited. All directional indications (such as up, down, left, right, front, back...) in the embodiments of this application are only used to explain the relative positional relationship, movement and the like between components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indication changes accordingly. Furthermore, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that includes a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally includes other steps or units inherent to such a process, method, product or device.
Reference herein to "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment can be included in at least one embodiment of this application. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor are they separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art understand, both explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
Please refer to Figure 1, which is a schematic flowchart of the automatic video editing method according to the first embodiment of this application. The method includes the following steps S101-S104:
S101: Obtain the key frames of the video to be edited, self-label the key frames with an image comparison algorithm, and generate unsupervised vector representations of the key frames;
A key frame is a frame in which a key action occurs in the movement of a character or object in the video to be edited. The key frames are acquired as follows: ffmpeg is used to extract frames from the video to be edited. FFmpeg is an open-source suite of computer programs for recording and converting digital audio and video and turning them into streams, with functions such as video capture, video format conversion, video screenshotting and video watermarking. For all extracted images, the similarity between adjacent images is calculated, and images whose similarity is lower than a set threshold are taken as key frames.
Self-labeling the key frames with the image comparison algorithm is specifically: based on the acquired key frames, an unsupervised algorithm is used to train the Self label model. The Self label model uses the image comparison algorithm to learn an unsupervised vector representation of each key-frame image, self-labels the key frames through clustering and representation learning, and outputs self_label(frame_k), where frame_k denotes the k-th key-frame image.
S102: Obtain the corpus information of the video to be edited, and obtain an unsupervised vector representation of the corpus information with a text comparison algorithm;
The corpus information is obtained as follows: ASR technology is used to collect the ASR speech information of the video to be edited, and the collected speech information is cut into ASR text information of a set length; OCR technology is used to obtain OCR text information from the extracted frames; the cut ASR text information and the OCR text information serve as the corpus information of the video to be edited.
Obtaining the unsupervised vector representation of the corpus information with the text comparison algorithm is specifically: the SimCSE model is trained on the corpus information. The SimCSE model uses the text comparison algorithm to learn unsupervised vector representations of the ASR and OCR text, and outputs the text vectors simcse(asr_k) and simcse(ocr_k) of the video to be edited, where asr_k denotes the k-th ASR text information of the video and ocr_k denotes the OCR text information of the k-th key-frame image.
S103: Segment the video to be edited according to the key frames, generating video segments corresponding to the number of key frames;
The video is segmented as follows: each key frame is used as a cutting point, the video to be edited is divided into video segments corresponding to the number of key frames, and each segment includes one key-frame image together with the ASR text information and OCR text information corresponding to that segment.
S104: Calculate the similarity between adjacent video segments from the unsupervised vector representations of the key frames and the corpus information, merge adjacent segments whose similarity is greater than the preset similarity threshold, and generate the video editing result of the video to be edited;
The similarity of adjacent video segments is calculated as follows:
First, the similarities of the key frames, ASR text information and OCR text information of adjacent segments are computed:
simi1 = cos(self_label(frame_k), self_label(frame_{k+1}))   (1)
simi2 = cos(simcse(asr_k), simcse(asr_{k+1}))   (2)
simi3 = cos(simcse(ocr_k), simcse(ocr_{k+1}))   (3)
where simi1, simi2 and simi3 denote the similarities of the key frames, ASR text information and OCR text information of the adjacent video segments;
Then, the similarity of the adjacent segments is computed from these three similarities:
simi = α*simi1 + β*simi2 + (1 − α − β)*simi3   (4)
where simi denotes the similarity of the adjacent video segments, and α and β are adjustable parameters.
The automatic video editing method of the first embodiment of this application obtains the key frames and corpus information of the video to be edited, uses the image comparison algorithm to learn unsupervised vector representations of the key-frame images and the text comparison algorithm to learn unsupervised vector representations of the corpus information, divides the video into multiple segments at the key frames, calculates the similarity of adjacent segments from these vector representations, and merges segments with higher similarity to obtain the final video editing result. The embodiments of this application use both image and text information, avoid manual data annotation, realize automatic video editing, and greatly improve video editing efficiency.
Please refer to Figure 2, which is a schematic flowchart of the automatic video editing method according to the second embodiment of this application. The method includes the following steps S201-S209:
S201: Collect at least one video to be edited;
S202: Perform frame extraction on the video to be edited and obtain its key frames;
In this step, the ffmpeg program is used to extract frames from the collected video to be edited. FFmpeg is an open-source suite of computer programs for recording and converting digital audio and video and turning them into streams, with functions such as video capture, video format conversion, video screenshotting and video watermarking. Key frames are the frames in which key actions occur in the movement of characters or objects in the video to be edited. In this embodiment, the key frames are obtained by calculating the similarity between adjacent images over all extracted frames and taking the frames whose similarity is lower than a set threshold as key frames. While acquiring the key frames, a certain number of the remaining (non-key) frames are retained according to a set ratio; the number k of retained frames can be set arbitrarily.
S203: Based on the acquired key frames, train the Self label model with an unsupervised algorithm; the Self label model self-labels the key frames through clustering and representation learning;
In this step, the Self label model is a self-supervised algorithm that assigns labels by maximizing the mutual information between data and labels. It uses the image comparison algorithm to learn an unsupervised vector representation of each key-frame image, self-labels the key-frame images through clustering and representation learning, and outputs self_label(frame_k), where frame_k denotes the k-th key-frame image.
S204: Use ASR (Automatic Speech Recognition) technology and OCR (Optical Character Recognition) technology to collect the corpus information of the video to be edited;
In this step, video is typical multi-modal data containing images as well as rich text information. The corpus information of the video to be edited includes the ASR speech information in the video and the OCR text information in the extracted frames. In this embodiment, the corpus information is obtained as follows: ASR technology is used to collect the ASR speech information of the video, and the collected speech information is cut into ASR text information of a set length; at the same time, OCR technology is used to obtain OCR text information from the extracted frames, and the cut ASR text information and the OCR text information serve as the corpus information of the video to be edited. The cutting length of the ASR speech information is 100, and can be set according to the actual application.
S205: Train the SimCSE model on the collected corpus information, and output the SimCSE text vectors of the video to be edited through the SimCSE model;
In this embodiment, the SimCSE model can be trained without supervision: it builds on self-supervised training of a BERT model, uses dropout as a semantics-preserving natural-language data augmentation, learns unsupervised text vector representations with the text comparison algorithm, and outputs the text vectors simcse(asr_k) and simcse(ocr_k) of the video to be edited, where asr_k denotes the k-th ASR text information of the video and ocr_k denotes the OCR text information of the k-th key-frame image.
S206: Use each key frame as a cutting point, divide the video to be edited into multiple video clips, and make each clip include one key frame together with the ASR text information and OCR text information corresponding to that clip;
In this step, the long video is divided into multiple shorter clips at the key frames of the video to be edited. Each clip includes one key-frame image and the ASR text and OCR text corresponding to the clip, i.e. each clip is represented as (frame, asr, ocr).
S207: Based on the key frames, ASR text information and OCR text of the video to be edited, calculate the similarity of each pair of adjacent video clips;
In this step, the similarity of adjacent clips is calculated as follows: first, the similarities of the key frames, ASR text and OCR text of the adjacent clips are computed separately; then the similarity of the adjacent clips is computed from these three similarities. The specific calculation formulas are as follows:
simi1 = cos(self_label(frame_k), self_label(frame_{k+1}))   (1)
simi2 = cos(simcse(asr_k), simcse(asr_{k+1}))   (2)
simi3 = cos(simcse(ocr_k), simcse(ocr_{k+1}))   (3)
simi = α*simi1 + β*simi2 + (1 − α − β)*simi3   (4)
where simi1, simi2 and simi3 denote the similarities of the key frames, ASR text and OCR text of adjacent clips, and simi denotes the similarity of the adjacent clips. α and β are adjustable parameters; preferably, this embodiment sets both α and β to 0.45.
S208: Determine whether the similarity between two adjacent clips is greater than the set similarity threshold; if the similarity between the two adjacent clips is greater than the preset similarity threshold, execute S209;
In this step, the similarity threshold is set to 0.5: if the similarity between two adjacent clips is greater than 0.5, the clips are considered similar enough to be merged; otherwise, the two clips are discarded.
S209: Merge adjacent clips whose similarity is greater than the preset similarity threshold to obtain the final video editing result;
In this step, the edited short video is obtained by merging clips with high similarity, which makes the edited short video smoother and improves the viewer's experience.
Based on the above, the automatic video editing method of the second embodiment of this application collects the key frames and corpus information of the video to be edited, uses the image comparison algorithm to learn unsupervised vector representations of the key-frame images and the text comparison algorithm to learn unsupervised vector representations of the corpus information, divides the video into multiple clips at the key frames, calculates the similarity of adjacent clips from these vector representations, and merges clips with higher similarity to obtain the final video editing result. The embodiments of this application use both image and text information, avoid manual data annotation, realize automatic video editing, and greatly improve video editing efficiency.
In an optional implementation, the result of the automatic video editing method can also be uploaded to a blockchain.
Specifically, corresponding summary information is obtained from the result of the automatic video editing method; the summary information is obtained by hashing the result, for example with the sha256s algorithm. Uploading the summary information to the blockchain ensures its security and its fairness and transparency to users. A user can download the summary information from the blockchain to verify whether the result of the automatic video editing method has been tampered with. The blockchain referred to in this example is a new application model of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated and linked using cryptographic methods, where each block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain can include an underlying blockchain platform, a platform product service layer and an application service layer.
Please refer to FIG. 3, which is a schematic structural diagram of the automatic video editing system according to an embodiment of the present application. The automatic video editing system 40 of the embodiment of the present application includes:
First acquisition module 41: used to acquire the key frames of the video to be edited, self-label the key frames using an image comparison algorithm, and generate unsupervised vector representations of the key frames. The key frames are the frames in which key actions occur in the motion of characters or objects in the video to be edited. The key frames are acquired as follows: ffmpeg is used to extract frames from the video to be edited; for all extracted images, the similarity between adjacent images is calculated, and the images whose similarity is lower than a set threshold are taken as key frames.
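A minimal Python sketch of this acquisition step follows, assuming ffmpeg is installed on the system; the sampling rate, the histogram-intersection measure, and the 0.9 threshold are illustrative assumptions, since the application does not fix a particular image similarity measure:

    import subprocess
    from pathlib import Path

    import numpy as np
    from PIL import Image

    def extract_frames(video: str, out_dir: str, fps: int = 1) -> list[Path]:
        # Frame extraction with ffmpeg, one image per second (assumed rate).
        Path(out_dir).mkdir(parents=True, exist_ok=True)
        subprocess.run(
            ["ffmpeg", "-i", video, "-vf", f"fps={fps}", f"{out_dir}/frame_%05d.png"],
            check=True,
        )
        return sorted(Path(out_dir).glob("frame_*.png"))

    def histogram(path: Path) -> np.ndarray:
        img = Image.open(path).convert("L").resize((256, 256))
        hist, _ = np.histogram(np.asarray(img), bins=64, range=(0, 255))
        return hist / hist.sum()

    def select_key_frames(frames: list[Path], threshold: float = 0.9) -> list[Path]:
        # A frame becomes a key frame when it differs enough from its predecessor;
        # treating the first frame as a key frame is an assumption of this sketch.
        keys = [frames[0]]
        prev = histogram(frames[0])
        for f in frames[1:]:
            cur = histogram(f)
            similarity = float(np.minimum(prev, cur).sum())  # histogram intersection
            if similarity < threshold:
                keys.append(f)
            prev = cur
        return keys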
The first acquisition module 41 self-labels the key frames with the image comparison algorithm as follows: based on the acquired key frames, a Self label model is trained with an unsupervised algorithm; the Self label model learns unsupervised vector representations of the key frame images using the image comparison algorithm, self-labels the key frames through clustering and representation learning, and outputs self_label(frame_k) for each key frame, where frame_k denotes the k-th key frame image.
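The Self label model itself is a trained contrastive model; purely as a hedged illustration of the clustering side of self-labelling, the following sketch assigns cluster ids to precomputed key-frame embeddings with k-means:

    import numpy as np
    from sklearn.cluster import KMeans

    def self_label(embeddings: np.ndarray, n_clusters: int = 10) -> np.ndarray:
        # embeddings: one row per key frame (e.g. from a contrastive encoder);
        # the returned cluster id plays the role of self_label(frame_k).
        # n_clusters = 10 is an illustrative assumption.
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        return km.fit_predict(embeddings)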
Second acquisition module 42: used to acquire the corpus information of the video to be edited and obtain unsupervised vector representations of the corpus information using a text comparison algorithm. The corpus information is acquired as follows: ASR technology is used to collect the ASR speech information of the video to be edited, and the collected ASR speech information is cut into ASR text of a set length; OCR technology is used to obtain OCR text from the extracted frame images; the cut ASR text and the OCR text serve as the corpus information of the video to be edited.
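As one possible realization, assuming the open-source openai-whisper and pytesseract packages stand in for the generic "ASR technology" and "OCR technology" (the application names no specific engines), the corpus could be collected as follows:

    import whisper
    import pytesseract
    from PIL import Image

    def collect_corpus(video: str, frame_paths: list, chunk_len: int = 64):
        # ASR: transcribe the video's audio track, then cut the transcript
        # into pieces of a set length (chunk_len = 64 is an assumption).
        asr_model = whisper.load_model("base")
        transcript = asr_model.transcribe(video)["text"]
        asr_chunks = [transcript[i : i + chunk_len]
                      for i in range(0, len(transcript), chunk_len)]
        # OCR: read on-screen text from each extracted frame image.
        ocr_texts = [pytesseract.image_to_string(Image.open(p)) for p in frame_paths]
        return asr_chunks, ocr_texts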
The second acquisition module 42 obtains the unsupervised vector representations of the corpus information with the text comparison algorithm as follows: a SimCSE model is trained on the corpus information; the SimCSE model learns unsupervised vector representations of the ASR text and the OCR text using the text comparison algorithm and outputs the text vectors simcse(asr_k) and simcse(ocr_k) of the video to be edited, where asr_k denotes the k-th piece of ASR text of the video to be edited and ocr_k denotes the OCR text of the k-th key frame image.
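A hedged sketch of producing simcse(asr_k) and simcse(ocr_k) with the Hugging Face transformers library; the public checkpoint name is an assumption made for illustration, whereas the application trains the SimCSE model on the video's own corpus:

    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")
    enc = AutoModel.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")

    def simcse(texts: list[str]) -> torch.Tensor:
        # One embedding per input text; the [CLS] vector is used as the
        # sentence representation, a common SimCSE convention.
        batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            out = enc(**batch)
        return out.last_hidden_state[:, 0]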
Video segmentation module 43: used to segment the video to be edited according to the key frames and generate video segments corresponding to the number of key frames. The video segmentation module segments the video as follows: each key frame is taken as a cut point, the video to be edited is divided into video segments corresponding to the number of key frames, and each video segment contains one key frame image together with the ASR text and OCR text corresponding to that segment.
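An illustrative sketch of the segmentation step, assuming the key-frame timestamps are known (for example, recovered from the frame-extraction rate) and that ffmpeg stream copy is acceptable for the cut:

    import subprocess

    def split_video(video: str, key_times: list[float], duration: float) -> list[str]:
        # key_times: timestamps of the key frames, strictly inside (0, duration),
        # each used as a cut point; output names are illustrative.
        bounds = [0.0] + key_times + [duration]
        clips = []
        for i, (start, end) in enumerate(zip(bounds, bounds[1:])):
            out = f"clip_{i:03d}.mp4"
            subprocess.run(
                ["ffmpeg", "-i", video, "-ss", str(start), "-to", str(end),
                 "-c", "copy", out],
                check=True,
            )
            clips.append(out)
        return clips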
Video merging module 44: used to calculate the similarity of adjacent video segments based on the unsupervised vector representations of the key frames and of the corpus information, merge the adjacent video segments whose similarity is greater than the set similarity threshold, and generate the video editing result of the video to be edited. The similarity of adjacent video segments is calculated as follows:
First, the similarities of the key frames, the ASR text, and the OCR text of adjacent video segments are calculated separately:

simi1 = cos(self_label(frame_k), self_label(frame_{k+1}))    (1)

simi2 = cos(simcse(asr_k), simcse(asr_{k+1}))    (2)

simi3 = cos(simcse(ocr_k), simcse(ocr_{k+1}))    (3)

where simi1, simi2 and simi3 denote the similarities of the key frames, the ASR text and the OCR text in the adjacent video segments, respectively.

Then, the similarity of the adjacent video segments is calculated from the similarities of the key frames, the ASR text and the OCR text:

simi = α·simi1 + β·simi2 + (1 − α − β)·simi3    (4)

where simi denotes the similarity of the adjacent video segments, and α and β are adjustable parameters.
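The four equations translate directly into code; the following sketch assumes the vectors have already been produced by the Self label and SimCSE models, and the values of α and β are illustrative:

    import numpy as np

    def cos(u: np.ndarray, v: np.ndarray) -> float:
        # Cosine similarity between two vectors.
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def segment_similarity(frame_k, frame_k1, asr_k, asr_k1, ocr_k, ocr_k1,
                           alpha: float = 0.4, beta: float = 0.3) -> float:
        simi1 = cos(frame_k, frame_k1)   # key-frame similarity, eq. (1)
        simi2 = cos(asr_k, asr_k1)       # ASR text similarity, eq. (2)
        simi3 = cos(ocr_k, ocr_k1)       # OCR text similarity, eq. (3)
        return alpha * simi1 + beta * simi2 + (1 - alpha - beta) * simi3  # eq. (4)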
The automatic video editing system of the embodiment of the present application acquires the key frames and corpus information of the video to be edited, uses an image comparison algorithm to learn unsupervised vector representations of the key frame images and a text comparison algorithm to learn unsupervised vector representations of the corpus information, segments the video to be edited into multiple video segments at the key frames, calculates the similarity of adjacent video segments based on the vector representations of the key frames and the corpus information, and merges the video segments with high similarity to obtain the final video editing result. The embodiment of the present application uses image and text information simultaneously, avoids manual data annotation, realizes automatic video editing, and greatly improves video editing efficiency.
Please refer to FIG. 4, which is a schematic structural diagram of a terminal according to an embodiment of the present application. The terminal 50 includes a processor 51 and a memory 52 coupled to the processor 51.
The memory 52 stores program instructions for implementing the above automatic video editing method.
The processor 51 is configured to execute the program instructions stored in the memory 52 to perform the automatic video editing operations.
The processor 51 may also be referred to as a CPU (Central Processing Unit). The processor 51 may be an integrated circuit chip with signal processing capability. The processor 51 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
Please refer to FIG. 5, which is a schematic structural diagram of a storage medium according to an embodiment of the present application. The storage medium of the embodiment of the present application stores a program file 61 capable of implementing all of the above methods. The program file 61 may be stored in the storage medium in the form of a software product, and the storage medium may be non-volatile or volatile. It includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of this application. The aforementioned storage media include media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc, as well as terminal devices such as computers, servers, mobile phones, and tablets.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the system embodiments described above are merely illustrative: the division into units is only a division by logical function, and there may be other divisions in actual implementations; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated units may be implemented in the form of hardware or in the form of software functional units. The above are only embodiments of this application and do not limit the patent scope of this application; any equivalent structural or process transformation made using the contents of the specification and drawings of this application, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of this application.

Claims (20)

  1. An automatic video editing method, wherein the method comprises:
    acquiring key frames of a video to be edited, self-labelling the key frames using an image comparison algorithm, and generating unsupervised vector representations of the key frames;
    acquiring corpus information of the video to be edited, and obtaining unsupervised vector representations of the corpus information using a text comparison algorithm;
    segmenting the video to be edited according to the key frames to generate video segments corresponding to the number of key frames; and
    calculating the similarity between adjacent video segments according to the unsupervised vector representations of the key frames and the unsupervised vector representations of the corpus information, and merging adjacent video segments whose similarity is greater than a preset similarity threshold to generate a video editing result of the video to be edited.
  2. The automatic video editing method according to claim 1, wherein the key frames are frames in which key actions occur in the motion of characters or objects in the video to be edited, and acquiring the key frames of the video to be edited comprises:
    extracting frames from the video to be edited using ffmpeg; and
    for all images after frame extraction, calculating the similarity between adjacent images, and taking images whose similarity is lower than a set threshold as key frames.
  3. The automatic video editing method according to claim 2, wherein self-labelling the key frames using the image comparison algorithm comprises:
    training a Self label model with an unsupervised algorithm based on the acquired key frames, wherein the Self label model learns unsupervised vector representations of the key frame images using the image comparison algorithm, self-labels the key frames through clustering and representation learning, and outputs self_label(frame_k) for each key frame, where frame_k denotes the k-th key frame image.
  4. The automatic video editing method according to any one of claims 1 to 3, wherein acquiring the corpus information of the video to be edited comprises:
    collecting ASR speech information of the video to be edited using ASR technology, and cutting the collected ASR speech information into ASR text of a set length;
    obtaining OCR text from the images after frame extraction using OCR technology; and
    taking the cut ASR text and the OCR text as the corpus information of the video to be edited.
  5. The automatic video editing method according to claim 4, wherein obtaining the unsupervised vector representations of the corpus information using the text comparison algorithm comprises:
    training a SimCSE model based on the corpus information, wherein the SimCSE model learns unsupervised vector representations of the ASR text and the OCR text using the text comparison algorithm and outputs text vectors simcse(asr_k) and simcse(ocr_k) of the video to be edited, where asr_k denotes the k-th piece of ASR text of the video to be edited and ocr_k denotes the OCR text of the k-th key frame image.
  6. The automatic video editing method according to claim 5, wherein segmenting the video to be edited according to the key frames to generate video segments corresponding to the number of key frames comprises:
    taking each key frame as a cut point, dividing the video to be edited into video segments corresponding to the number of key frames, and making each video segment include one key frame image together with the ASR text and OCR text corresponding to that video segment.
  7. The automatic video editing method according to claim 3 or 5, wherein calculating the similarity of adjacent video segments according to the unsupervised vector representations of the key frames and the unsupervised vector representations of the corpus information comprises:
    calculating the similarities of the key frames, the ASR text, and the OCR text of the adjacent video segments separately:
    simi1 = cos(self_label(frame_k), self_label(frame_{k+1}))
    simi2 = cos(simcse(asr_k), simcse(asr_{k+1}))
    simi3 = cos(simcse(ocr_k), simcse(ocr_{k+1}))
    where simi1, simi2 and simi3 denote the similarities of the key frames, the ASR text and the OCR text in the adjacent video segments, respectively; and
    calculating the similarity of the adjacent video segments from the similarities of the key frames, the ASR text and the OCR text:
    simi = α·simi1 + β·simi2 + (1 − α − β)·simi3,
    where simi denotes the similarity of the adjacent video segments, and α and β are adjustable parameters.
  8. An automatic video editing system, wherein the system comprises:
    a first acquisition module, configured to acquire key frames of a video to be edited, self-label the key frames using an image comparison algorithm, and generate unsupervised vector representations of the key frames;
    a second acquisition module, configured to acquire corpus information of the video to be edited and obtain unsupervised vector representations of the corpus information using a text comparison algorithm;
    a video segmentation module, configured to segment the video to be edited according to the key frames and generate video segments corresponding to the number of key frames; and
    a video merging module, configured to calculate the similarity of adjacent video segments according to the unsupervised vector representations of the key frames and the unsupervised vector representations of the corpus information, merge adjacent video segments whose similarity is greater than a set similarity threshold, and generate a video editing result of the video to be edited.
  9. A terminal, wherein the terminal comprises a processor and a memory coupled to the processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the following steps: acquiring key frames of a video to be edited, self-labelling the key frames using an image comparison algorithm, and generating unsupervised vector representations of the key frames; acquiring corpus information of the video to be edited, and obtaining unsupervised vector representations of the corpus information using a text comparison algorithm; segmenting the video to be edited according to the key frames to generate video segments corresponding to the number of key frames; and calculating the similarity between adjacent video segments according to the unsupervised vector representations of the key frames and the unsupervised vector representations of the corpus information, and merging adjacent video segments whose similarity is greater than a preset similarity threshold to generate a video editing result of the video to be edited.
  10. The terminal according to claim 9, wherein the key frames are frames in which key actions occur in the motion of characters or objects in the video to be edited, and acquiring the key frames of the video to be edited comprises:
    extracting frames from the video to be edited using ffmpeg; and
    for all images after frame extraction, calculating the similarity between adjacent images, and taking images whose similarity is lower than a set threshold as key frames.
  11. The terminal according to claim 10, wherein self-labelling the key frames using the image comparison algorithm comprises:
    training a Self label model with an unsupervised algorithm based on the acquired key frames, wherein the Self label model learns unsupervised vector representations of the key frame images using the image comparison algorithm, self-labels the key frames through clustering and representation learning, and outputs self_label(frame_k) for each key frame, where frame_k denotes the k-th key frame image.
  12. The terminal according to any one of claims 9 to 11, wherein acquiring the corpus information of the video to be edited comprises:
    collecting ASR speech information of the video to be edited using ASR technology, and cutting the collected ASR speech information into ASR text of a set length;
    obtaining OCR text from the images after frame extraction using OCR technology; and
    taking the cut ASR text and the OCR text as the corpus information of the video to be edited.
  13. The terminal according to claim 12, wherein obtaining the unsupervised vector representations of the corpus information using the text comparison algorithm comprises:
    training a SimCSE model based on the corpus information, wherein the SimCSE model learns unsupervised vector representations of the ASR text and the OCR text using the text comparison algorithm and outputs text vectors simcse(asr_k) and simcse(ocr_k) of the video to be edited, where asr_k denotes the k-th piece of ASR text of the video to be edited and ocr_k denotes the OCR text of the k-th key frame image.
  14. The terminal according to claim 13, wherein segmenting the video to be edited according to the key frames to generate video segments corresponding to the number of key frames comprises:
    taking each key frame as a cut point, dividing the video to be edited into video segments corresponding to the number of key frames, and making each video segment include one key frame image together with the ASR text and OCR text corresponding to that video segment.
  15. A storage medium storing a program file capable of implementing the following steps: acquiring key frames of a video to be edited, self-labelling the key frames using an image comparison algorithm, and generating unsupervised vector representations of the key frames; acquiring corpus information of the video to be edited, and obtaining unsupervised vector representations of the corpus information using a text comparison algorithm; segmenting the video to be edited according to the key frames to generate video segments corresponding to the number of key frames; and calculating the similarity between adjacent video segments according to the unsupervised vector representations of the key frames and the unsupervised vector representations of the corpus information, and merging adjacent video segments whose similarity is greater than a preset similarity threshold to generate a video editing result of the video to be edited.
  16. The storage medium according to claim 15, wherein the key frames are frames in which key actions occur in the motion of characters or objects in the video to be edited, and acquiring the key frames of the video to be edited comprises:
    extracting frames from the video to be edited using ffmpeg; and
    for all images after frame extraction, calculating the similarity between adjacent images, and taking images whose similarity is lower than a set threshold as key frames.
  17. The storage medium according to claim 16, wherein self-labelling the key frames using the image comparison algorithm comprises:
    training a Self label model with an unsupervised algorithm based on the acquired key frames, wherein the Self label model learns unsupervised vector representations of the key frame images using the image comparison algorithm, self-labels the key frames through clustering and representation learning, and outputs self_label(frame_k) for each key frame, where frame_k denotes the k-th key frame image.
  18. The storage medium according to any one of claims 15 to 17, wherein acquiring the corpus information of the video to be edited comprises:
    collecting ASR speech information of the video to be edited using ASR technology, and cutting the collected ASR speech information into ASR text of a set length;
    obtaining OCR text from the images after frame extraction using OCR technology; and
    taking the cut ASR text and the OCR text as the corpus information of the video to be edited.
  19. The storage medium according to claim 18, wherein obtaining the unsupervised vector representations of the corpus information using the text comparison algorithm comprises:
    training a SimCSE model based on the corpus information, wherein the SimCSE model learns unsupervised vector representations of the ASR text and the OCR text using the text comparison algorithm and outputs text vectors simcse(asr_k) and simcse(ocr_k) of the video to be edited, where asr_k denotes the k-th piece of ASR text of the video to be edited and ocr_k denotes the OCR text of the k-th key frame image.
  20. The storage medium according to claim 19, wherein segmenting the video to be edited according to the key frames to generate video segments corresponding to the number of key frames comprises:
    taking each key frame as a cut point, dividing the video to be edited into video segments corresponding to the number of key frames, and making each video segment include one key frame image together with the ASR text and OCR text corresponding to that video segment.
PCT/CN2022/089560 2022-03-29 2022-04-27 Automatic video editing method and system, and terminal and storage medium WO2023184636A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210318902.4 2022-03-29
CN202210318902.4A CN114694070A (en) 2022-03-29 2022-03-29 Automatic video editing method, system, terminal and storage medium

Publications (1)

Publication Number Publication Date
WO2023184636A1

Family

ID=82140927

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/089560 WO2023184636A1 (en) 2022-03-29 2022-04-27 Automatic video editing method and system, and terminal and storage medium

Country Status (2)

Country Link
CN (1) CN114694070A (en)
WO (1) WO2023184636A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9620168B1 (en) * 2015-12-21 2017-04-11 Amazon Technologies, Inc. Cataloging video and creating video summaries
CN108882057A (en) * 2017-05-09 2018-11-23 北京小度互娱科技有限公司 Video abstraction generating method and device
CN111526382A (en) * 2020-04-20 2020-08-11 广东小天才科技有限公司 Live video text generation method, device, equipment and storage medium
CN111797850A (en) * 2019-04-09 2020-10-20 Oppo广东移动通信有限公司 Video classification method and device, storage medium and electronic equipment
CN113709561A (en) * 2021-04-14 2021-11-26 腾讯科技(深圳)有限公司 Video editing method, device, equipment and storage medium


Also Published As

Publication number Publication date
CN114694070A (en) 2022-07-01


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22934462

Country of ref document: EP

Kind code of ref document: A1