WO2023184636A1 - Automatic video editing method and system, and terminal and storage medium - Google Patents

Automatic video editing method and system, and terminal and storage medium Download PDF

Info

Publication number
WO2023184636A1
WO2023184636A1 (PCT/CN2022/089560)
Authority
WO
WIPO (PCT)
Prior art keywords
video
edited
key frames
asr
information
Prior art date
Application number
PCT/CN2022/089560
Other languages
French (fr)
Chinese (zh)
Inventor
唐小初
舒畅
陈又新
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2023184636A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23424Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement

Definitions

  • This application relates to the technical field of cluster analysis in artificial intelligence, and in particular to an automatic video editing method, system, terminal and storage medium.
  • This application provides an automatic video editing method, system, terminal and storage medium, aiming to solve the technical problems of existing video editing, which relies on human labor and therefore incurs high cost and low editing efficiency.
  • An automatic video editing method includes:
  • an automatic video editing system including:
  • the first acquisition module, used to acquire the key frames of the video to be edited, self-label the key frames with an image comparison algorithm, and generate an unsupervised vector representation of the key frames;
  • the second acquisition module is used to obtain the corpus information of the video to be edited, and uses a text comparison algorithm to obtain the unsupervised vector representation of the corpus information;
  • Video segmentation module used to segment the video to be edited according to the key frames and generate video segments corresponding to the number of key frames;
  • the video merging module, used to calculate the similarity of adjacent video segments based on the unsupervised vector representations of the key frames and the corpus information, and to merge adjacent video segments whose similarity is greater than a set similarity threshold, generating the video editing result of the video to be edited.
  • a terminal includes a processor and a memory coupled to the processor, wherein,
  • the memory stores program instructions for implementing the above-mentioned automatic video editing method
  • the processor is configured to execute the program instructions stored in the memory to perform the automatic video clipping operation.
  • a storage medium that stores program instructions executable by a processor, and the program instructions are used to execute the above-mentioned automatic video editing method.
  • the automatic video editing method, system, terminal and storage medium of the embodiments of the present application collect key frames and corpus information of the video to be edited, divide the video into multiple video segments at the key frames, calculate the similarity of adjacent segments from the vector representations of the key frames and the corpus information, and merge the segments with higher similarity to obtain the final video editing result.
  • the embodiment of the present application utilizes both image and text information, avoids manual data annotation, realizes automatic video editing, and greatly improves the efficiency of video editing.
  • Figure 1 is a schematic flow chart of an automatic video editing method according to the first embodiment of the present application
  • FIG. 2 is a schematic flowchart of the automatic video editing method according to the second embodiment of the present application.
  • Figure 3 is a schematic structural diagram of an automatic video editing system according to an embodiment of the present application.
  • Figure 4 is a schematic structural diagram of a terminal according to an embodiment of the present application.
  • Figure 5 is a schematic structural diagram of a storage medium according to an embodiment of the present application.
  • the terms "first", "second" and "third" in this application are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the quantity of the indicated technical features. Thus, features defined as "first", "second" or "third" may explicitly or implicitly include at least one such feature.
  • "plurality" means at least two, such as two or three, unless otherwise clearly and specifically limited. All directional indications (such as up, down, left, right, front, back...) in the embodiments of this application are only used to explain the relative positional relationship, movement and the like between components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indication changes accordingly.
  • reference herein to "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application.
  • the appearances of this phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art understand, both explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
  • FIG. 1 is a schematic flowchart of an automatic video editing method according to the first embodiment of the present application.
  • the automatic video editing method in the first embodiment of the present application includes the following steps S101-S104:
  • S101 Obtain the key frames of the video to be edited, and use an image comparison algorithm to self-mark the key frames to generate an unsupervised vector representation of the key frames;
  • a key frame is a frame in which a key action occurs in the movement of a character or object in the video to be edited.
  • the key frames are acquired as follows: ffmpeg is used to extract frames from the video to be edited. FFmpeg is an open-source suite of computer programs for recording and converting digital audio and video and turning them into streams, with functions such as video capture, video format conversion, video screenshotting and video watermarking. For all extracted images, the similarity between adjacent images is calculated, and images whose similarity is lower than a set threshold are taken as key frames.
  • self-labeling the key frames with the image comparison algorithm is specifically: based on the acquired key frames, an unsupervised algorithm is used to train the Self label model.
  • the Self label model uses the image comparison algorithm to learn an unsupervised vector representation of each key-frame image, self-labels the key frames through clustering and representation learning, and outputs self_label(frame_k), where frame_k denotes the k-th key-frame image.
  • S102 Obtain the corpus information of the video to be edited, and use the text comparison algorithm to obtain the unsupervised vector representation of the corpus information;
  • the corpus information is obtained as follows: ASR technology is used to collect the ASR speech information of the video to be edited, and the collected ASR speech information is cut into ASR text information of a set length; OCR technology is used to obtain OCR text information from the extracted frames; the cut ASR text information and the OCR text information serve as the corpus information of the video to be edited.
  • obtaining the unsupervised vector representation of the corpus information with the text comparison algorithm is specifically: the SimCSE model is trained on the corpus information.
  • the SimCSE model uses the text comparison algorithm to learn unsupervised vector representations of the ASR text information and OCR text information, and outputs the text vectors simcse(asr_k) and simcse(ocr_k) of the video to be edited, where asr_k denotes the k-th ASR text information of the video and ocr_k denotes the OCR text information of the k-th key-frame image.
  • S103 Segment the video to be edited according to the key frames, generating video segments corresponding to the number of key frames; the video is segmented as follows: each key frame is used as a cutting point, the video to be edited is divided into video segments corresponding to the number of key frames, and each segment includes one key-frame image together with the ASR text information and OCR text information corresponding to that segment.
  • S104 Calculate the similarity between adjacent video clips from the unsupervised vector representations of the key frames and the corpus information, merge adjacent clips whose similarity is greater than the preset similarity threshold, and generate the video editing result of the video to be edited;
  • the similarity of adjacent video clips is calculated as follows. First, the similarities of the key frames, ASR text information and OCR text information of the adjacent clips are computed:

    simi1 = cos(self_label(frame_k), self_label(frame_{k+1}))   (1)
    simi2 = cos(simcse(asr_k), simcse(asr_{k+1}))   (2)
    simi3 = cos(simcse(ocr_k), simcse(ocr_{k+1}))   (3)

  • simi1, simi2 and simi3 respectively represent the similarities of the key frames, ASR text information and OCR text information in adjacent video clips; the similarity of the clips themselves is then

    simi = α*simi1 + β*simi2 + (1 − α − β)*simi3   (4)

  • simi represents the similarity of the adjacent video clips, and α and β are adjustable parameters.
  • the automatic video editing method of the first embodiment of the present application obtains the key frames and corpus information of the video to be edited, uses an image comparison algorithm to learn unsupervised vector representations of the key-frame images and a text comparison algorithm to learn unsupervised vector representations of the corpus information, divides the video into multiple segments at the key frames, calculates the similarity of adjacent segments from these vector representations, and merges segments with higher similarity to obtain the final video editing result.
  • the embodiment of the present application utilizes both image and text information, avoids manual data annotation, realizes automatic video editing, and greatly improves the efficiency of video editing.
  • FIG. 2 is a schematic flowchart of an automatic video editing method according to the second embodiment of the present application.
  • the automatic video editing method in the second embodiment of the present application includes the following steps S201-S209:
  • S201 Collect at least one video to be edited
  • S202 Perform frame extraction processing on the video to be edited, and obtain the key frames of the video to be edited;
  • in this step, the ffmpeg program is used to extract frames from the collected video to be edited. FFmpeg is an open-source suite of computer programs for recording and converting digital audio and video and turning them into streams, with functions such as video capture, video format conversion, video screenshotting and video watermarking.
  • Key frames refer to the frames where key actions occur in the movement changes of characters or objects in the video to be edited.
  • the key frames are obtained by calculating the similarity between adjacent images over all extracted frames and taking the image frames whose similarity is lower than the set threshold as key frames. While acquiring the key frames, a certain number of the remaining (non-key) frames are retained according to a set ratio; the number k of retained frames can be set arbitrarily.
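  • As a rough sketch of this step, the snippet below drives the ffmpeg CLI to extract one frame per second and keeps each frame that differs enough from its predecessor. The sampling rate, the histogram signature and the 0.9 threshold are illustrative assumptions, not values from the patent:

```python
import glob
import os
import subprocess

import numpy as np
from PIL import Image

def extract_frames(video_path: str, out_dir: str, fps: int = 1) -> list[str]:
    """Extract frames with the ffmpeg CLI (here: one frame per second)."""
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", f"fps={fps}", f"{out_dir}/frame_%05d.png"],
        check=True,
    )
    return sorted(glob.glob(f"{out_dir}/frame_*.png"))

def hist_vector(path: str) -> np.ndarray:
    """Cheap image signature: L2-normalized RGB histogram of a downscaled frame."""
    hist = np.asarray(Image.open(path).convert("RGB").resize((128, 128)).histogram(), dtype=float)
    return hist / np.linalg.norm(hist)

def select_key_frames(frame_paths: list[str], threshold: float = 0.9) -> list[str]:
    """Keep frames whose similarity to the previous frame falls below the threshold."""
    keys = [frame_paths[0]]  # the first frame opens the first shot
    prev = hist_vector(frame_paths[0])
    for path in frame_paths[1:]:
        cur = hist_vector(path)
        if float(prev @ cur) < threshold:  # cosine similarity of the signatures
            keys.append(path)
        prev = cur
    return keys
```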
  • S203 Based on the acquired key frames, an unsupervised algorithm is used to train the Self label model.
  • the Self label model self-labels the key frames through clustering and representation learning;
  • the Self label model is a self-supervised algorithm that assigns labels by maximizing the mutual information between data and labels. It uses the image comparison algorithm to learn an unsupervised vector representation of each key-frame image, self-labels the key-frame images through clustering and representation learning, and outputs self_label(frame_k), where frame_k denotes the k-th key-frame image.
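  • The full Self label training loop (simultaneous clustering and representation learning under a mutual-information objective) is beyond a short sketch; as a heavily simplified stand-in, the snippet below pseudo-labels fixed key-frame embeddings with k-means and returns soft cluster assignments, so the result is still a vector per frame and can be plugged into equation (1). Only the input/output contract matches the text; the clustering algorithm is an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

def self_label(frame_vectors: np.ndarray, n_clusters: int = 16) -> np.ndarray:
    """Soft pseudo-labels for key-frame embeddings, one row per frame.

    frame_vectors has shape (num_key_frames, dim); n_clusters must not exceed
    the number of key frames. Each returned row plays the role of
    self_label(frame_k) in the similarity formulas.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(frame_vectors)
    dist = km.transform(frame_vectors)               # (n_frames, n_clusters) distances
    logits = -dist                                   # closer cluster -> larger logit
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    return probs / probs.sum(axis=1, keepdims=True)  # soft assignment vectors
```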
  • S204 Use ASR (Automatic Speech Recognition) technology and OCR (Optical Character Recognition) technology to collect the corpus information of the video to be edited;
  • video is a typical multi-modal data, including images and rich text information.
  • the corpus information of the video to be edited includes ASR voice information in the video to be edited and OCR text information in the extracted frame image.
  • the corpus information of the video to be edited is obtained as follows: ASR technology is used to collect the ASR speech information of the video, and the collected ASR speech information is cut into ASR text information of a set length; at the same time, OCR technology is used to obtain OCR text information from the extracted frames, and the cut ASR text information and the OCR text information serve as the corpus information of the video to be edited.
  • the cutting length of ASR voice information is 100, which can be set according to the actual application.
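  • A minimal sketch of the cutting step, assuming the set length (100 in this embodiment) is counted in characters of the recognized transcript:

```python
def cut_asr_text(transcript: str, seg_len: int = 100) -> list[str]:
    """Cut the ASR transcript into consecutive segments of at most seg_len characters."""
    return [transcript[i:i + seg_len] for i in range(0, len(transcript), seg_len)]
```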
  • S205 Train the SimCSE model based on the collected corpus information, and output the SimCSE text vector of the video to be edited through the SimCSE model;
  • the SimCSE model can be trained without supervision: it builds on self-supervised training of a BERT model, uses dropout as a semantics-preserving natural-language data augmentation, learns unsupervised text vector representations with the text comparison algorithm, and outputs the text vectors simcse(asr_k) and simcse(ocr_k) of the video to be edited, where asr_k denotes the k-th ASR text information of the video and ocr_k denotes the OCR text information of the k-th key-frame image.
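  • A sketch of producing such text vectors with a public SimCSE checkpoint via Hugging Face transformers; the patent trains its own SimCSE model on the video's ASR/OCR corpus, and the checkpoint name here is only an illustrative substitute:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Example public checkpoint; a model trained on the video's own corpus would be used instead.
CKPT = "princeton-nlp/unsup-simcse-bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModel.from_pretrained(CKPT)

def simcse_vectors(texts: list[str]) -> torch.Tensor:
    """Return one unsupervised sentence vector per input text (ASR or OCR segment)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    return out.pooler_output  # pooled [CLS] vectors, as in the SimCSE reference code
```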
  • S206 Use each key frame as a cutting point, divide the video to be edited into multiple video clips, and make each clip include one key frame together with the ASR text information and OCR text information corresponding to that clip;
  • in this step, the long video is divided into multiple shorter clips at the key frames. Each clip includes one key-frame image and the ASR text and OCR text corresponding to the clip, i.e. each clip is represented as (frame, asr, ocr).
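  • A minimal sketch of this clip representation, assuming the k-th ASR segment and OCR text have already been aligned to the k-th key frame:

```python
from dataclasses import dataclass

@dataclass
class Clip:
    frame: str  # key-frame image (e.g. file path)
    asr: str    # ASR text segment belonging to this clip
    ocr: str    # OCR text of the key frame

def split_into_clips(key_frames: list[str], asr_segments: list[str], ocr_texts: list[str]) -> list[Clip]:
    """Build one (frame, asr, ocr) clip per key frame, as described in the text."""
    return [Clip(f, a, o) for f, a, o in zip(key_frames, asr_segments, ocr_texts)]
```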
  • S207 Based on the key frames, ASR text information and OCR text of the video to be edited, calculate the similarity of each pair of adjacent video clips;
  • the similarity of adjacent clips is calculated as follows: first, the similarities of the key frames, ASR text and OCR text of the adjacent clips are computed separately; then the similarity of the adjacent clips is computed from these three similarities.
  • the specific calculation formulas are as follows:

    simi1 = cos(self_label(frame_k), self_label(frame_{k+1}))   (1)
    simi2 = cos(simcse(asr_k), simcse(asr_{k+1}))   (2)
    simi3 = cos(simcse(ocr_k), simcse(ocr_{k+1}))   (3)
    simi = α*simi1 + β*simi2 + (1 − α − β)*simi3   (4)

  • simi1, simi2 and simi3 respectively represent the similarities of the key frames, ASR text and OCR text in adjacent video clips, simi represents the similarity of the adjacent clips, and α and β are adjustable parameters.
  • the values of ⁇ and ⁇ are set to 0.45.
  • S208 Determine whether the similarity between two adjacent video clips is greater than the set similarity threshold; if it is greater than the preset similarity threshold, execute S209;
  • the similarity threshold is set to 0.5: if the similarity between two adjacent video clips is greater than 0.5, the two clips are considered similar enough to be merged; otherwise, the two clips are discarded.
  • S209 Merge the adjacent video clips whose similarity is greater than the preset similarity threshold to obtain the final video editing result; the edited short video obtained by merging highly similar clips is smoother and improves the viewer's experience.
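  • A sketch of S208-S209 under the stated 0.5 threshold: runs of mutually similar adjacent clips are merged into one output segment, and clips not similar to a neighbour are dropped, following the embodiment's description; the greedy run-building is an assumption about how pairwise merges compose:

```python
def merge_clips(clips: list, sims: list[float], threshold: float = 0.5) -> list[list]:
    """Merge adjacent clips; sims[i] is the similarity of clips[i] and clips[i + 1]."""
    segments, run = [], [clips[0]]
    for clip, s in zip(clips[1:], sims):
        if s > threshold:
            run.append(clip)              # similar enough: extend the current segment
        else:
            if len(run) > 1:
                segments.append(run)      # keep only merged (multi-clip) segments
            run = [clip]
    if len(run) > 1:
        segments.append(run)
    return segments
```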
  • the automatic video editing method of the second embodiment of the present application collects the key frames and corpus information of the video to be edited, uses the image comparison algorithm to learn unsupervised vector representations of the key-frame images and the text comparison algorithm to learn unsupervised vector representations of the corpus information, divides the video into multiple clips at the key frames, calculates the similarity of adjacent clips from these vector representations, and merges clips with higher similarity to obtain the final video editing result.
  • the embodiment of the present application utilizes both image and text information, avoids manual data annotation, realizes automatic video editing, and greatly improves the efficiency of video editing.
  • the results of the automatic video editing method can also be uploaded to the blockchain.
  • the corresponding summary information is obtained based on the result of the automatic video editing method.
  • the summary information is obtained by hashing the result of the automatic video editing method, for example, using the sha256s algorithm.
  • Uploading summary information to the blockchain ensures its security and fairness and transparency to users. Users can download this summary information from the blockchain to verify whether the results of the automatic video editing method have been tampered with.
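  • A sketch of the digest step, reading "sha256s" as SHA-256 and hashing the bytes of the edited output file with Python's hashlib; what exactly constitutes "the result" (file bytes, clip list, metadata) is not specified and is assumed here:

```python
import hashlib

def result_digest(edited_video_path: str) -> str:
    """SHA-256 digest of the editing result; the hex string would be uploaded to the chain."""
    h = hashlib.sha256()
    with open(edited_video_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # stream in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()
```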
  • the blockchain referred to in this example is a new application model of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms.
  • a blockchain is essentially a decentralized database: a chain of data blocks generated and linked using cryptographic methods, where each block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • Blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • FIG 3 is a schematic structural diagram of an automatic video editing system according to an embodiment of the present application.
  • the automatic video editing system 40 in the embodiment of this application includes:
  • the first acquisition module 41 is used to obtain the key frames of the video to be edited and to self-label them with an image comparison algorithm, generating unsupervised vector representations of the key frames; here a key frame is a frame in which a key action occurs in the movement of a character or object in the video to be edited.
  • the key frame acquisition method is: use ffmpeg to extract frames from the video to be edited; for all images after frame extraction, calculate the similarity between adjacent images, and use the images with a similarity lower than the set threshold as key frames.
  • the first acquisition module 41 self-labels the key frames with the image comparison algorithm as follows: based on the acquired key frames, an unsupervised algorithm is used to train the Self label model. The Self label model uses the image comparison algorithm to learn an unsupervised vector representation of each key-frame image, self-labels the key frames through clustering and representation learning, and outputs self_label(frame_k), where frame_k denotes the k-th key-frame image.
  • the second acquisition module 42 is used to obtain the corpus information of the video to be edited and to obtain an unsupervised vector representation of the corpus information with a text comparison algorithm. The corpus information is obtained as follows: ASR technology is used to collect the ASR speech information of the video, and the collected speech information is cut into ASR text information of a set length; OCR technology is used to obtain OCR text information from the extracted frames; the cut ASR text information and the OCR text information serve as the corpus information of the video to be edited.
  • the second acquisition module 42 obtains the unsupervised vector representation of the corpus information with the text comparison algorithm as follows: the SimCSE model is trained on the corpus information. The SimCSE model uses the text comparison algorithm to learn unsupervised vector representations of the ASR text information and OCR text information, and outputs the text vectors simcse(asr_k) and simcse(ocr_k) of the video to be edited, where asr_k denotes the k-th ASR text information of the video and ocr_k denotes the OCR text information of the k-th key-frame image.
  • the video segmentation module 43 is used to segment the video to be edited at the key frames, generating video segments corresponding to the number of key frames. Specifically, each key frame is used as a cutting point, the video is divided into segments corresponding to the number of key frames, and each segment includes one key-frame image together with the ASR text information and OCR text information corresponding to that segment.
  • the video merging module 44 is used to calculate the similarity of adjacent video segments from the unsupervised vector representations of the key frames and the corpus information, merge adjacent segments whose similarity is greater than the set similarity threshold, and generate the video editing result of the video to be edited. The similarity of adjacent segments is calculated as in equations (1)-(4) above, where simi1, simi2 and simi3 represent the similarities of the key frames, ASR text information and OCR text information in adjacent clips, simi represents the similarity of the adjacent clips, and α and β are adjustable parameters.
  • the automatic video editing system of the embodiment of the present application obtains the key frames and corpus information of the video to be edited, uses the image comparison algorithm to learn unsupervised vector representations of the key-frame images and the text comparison algorithm to learn unsupervised vector representations of the corpus information, divides the video into multiple segments at the key frames, calculates the similarity of adjacent segments from these vector representations, and merges the segments with higher similarity to obtain the final video editing result.
  • the embodiment of the present application utilizes both image and text information, avoids manual data annotation, realizes automatic video editing, and greatly improves the efficiency of video editing.
  • the terminal 50 includes a processor 51 and a memory 52 coupled to the processor 51 .
  • the memory 52 stores program instructions for implementing the above-mentioned automatic video editing method.
  • the processor 51 is configured to execute program instructions stored in the memory 52 to perform automatic video editing operations.
  • the processor 51 can also be called a CPU (Central Processing Unit).
  • the processor 51 may be an integrated circuit chip with signal processing capabilities.
  • the processor 51 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor or any conventional processor.
  • FIG. 5 is a schematic structural diagram of a storage medium according to an embodiment of the present application.
  • the storage medium in the embodiment of the present application stores program files 61 that can implement all the above methods.
  • the program files 61 can be stored in the above-mentioned storage medium in the form of software products.
  • the storage medium can be non-volatile or volatile, and includes a number of instructions that cause a computer device (which can be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the methods of the embodiments of this application.
  • the aforementioned storage media include media that can store program code, such as USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks and optical discs, as well as terminal devices such as computers, servers, mobile phones and tablets.
  • each functional unit in each embodiment of the present application can be integrated into one processing unit, each unit can exist physically alone, or two or more units can be integrated into one unit.
  • the above integrated units can be implemented in the form of hardware or software functional units.

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Signal Processing (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Marketing (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

Disclosed in the present application are an automatic video editing method and system, and a terminal and a storage medium. The method comprises: acquiring key frames of a video to be edited, and self-tagging the key frames by using an image comparison algorithm, so as to generate unsupervised vector representations of the key frames; acquiring corpus information of said video, and acquiring an unsupervised vector representation of the corpus information by using a text comparison algorithm; segmenting said video according to the key frames, so as to generate video clips corresponding to the number of key frames; and calculating the similarity between adjacent video clips according to the unsupervised vector representations of the key frames and the unsupervised vector representation of the corpus information, and combining adjacent video clips between which the similarity is greater than a set similarity threshold value, so as to generate a video editing result of said video. In the embodiments of the present application, image information and text information are used, thereby avoiding manual data labeling, realizing automatic editing of a video, and greatly improving the video editing efficiency.

Description

An automatic video editing method, system, terminal and storage medium
This application claims priority to Chinese patent application No. 202210318902.4, filed with the China Patent Office on March 29, 2022 and entitled "Automatic video editing method, system, terminal and storage medium", the entire content of which is incorporated herein by reference.
Technical field
This application relates to the technical field of cluster analysis in artificial intelligence, and in particular to an automatic video editing method, system, terminal and storage medium.
Background
With the development of 4G networks, short video technology has flourished. With the emergence of a large number of video apps such as Douyin, Kuaishou and Bilibili, the number of videos has grown exponentially. Although videos are more intuitive than text and pictures, watching them takes a lot of time. In a long video, the segments that are valuable or interesting to the user often make up only a fraction of its total length, so the demand for video editing grows by the day.
The inventors realized that video editing in the prior art usually relies on human labor, which is both expensive and inefficient, and this hinders the development of short video technology to a certain extent.
Summary of the invention
This application provides an automatic video editing method, system, terminal and storage medium, aiming to solve the technical problems of existing video editing, which relies on human labor and therefore incurs high cost and low editing efficiency.
To solve the above technical problems, the technical solution adopted by this application is:
An automatic video editing method, the method including:
obtaining the key frames of the video to be edited, self-labeling the key frames with an image comparison algorithm, and generating unsupervised vector representations of the key frames;
obtaining the corpus information of the video to be edited, and obtaining an unsupervised vector representation of the corpus information with a text comparison algorithm;
segmenting the video to be edited according to the key frames, generating video segments corresponding to the number of key frames;
calculating the similarity between adjacent video segments from the unsupervised vector representations of the key frames and the corpus information, merging adjacent video segments whose similarity is greater than a preset similarity threshold, and generating the video editing result of the video to be edited.
Another technical solution adopted by the embodiments of this application is an automatic video editing system, including:
a first acquisition module, used to obtain the key frames of the video to be edited, self-label the key frames with an image comparison algorithm, and generate unsupervised vector representations of the key frames;
a second acquisition module, used to obtain the corpus information of the video to be edited and to obtain an unsupervised vector representation of the corpus information with a text comparison algorithm;
a video segmentation module, used to segment the video to be edited according to the key frames, generating video segments corresponding to the number of key frames;
a video merging module, used to calculate the similarity of adjacent video segments from the unsupervised vector representations of the key frames and the corpus information, and to merge adjacent video segments whose similarity is greater than a set similarity threshold, generating the video editing result of the video to be edited.
Yet another technical solution adopted by the embodiments of this application is a terminal, the terminal including a processor and a memory coupled to the processor, wherein
the memory stores program instructions for implementing the above automatic video editing method;
the processor is configured to execute the program instructions stored in the memory to perform the automatic video editing operations.
Yet another technical solution adopted by the embodiments of this application is a storage medium storing program instructions executable by a processor, the program instructions being used to execute the above automatic video editing method.
The automatic video editing method, system, terminal and storage medium of the embodiments of this application collect the key frames and corpus information of the video to be edited, divide the video into multiple video segments at the key frames, calculate the similarity of adjacent segments from the vector representations of the key frames and the corpus information, and merge segments with higher similarity to obtain the final video editing result. The embodiments of this application use both image and text information, avoid manual data annotation, realize automatic video editing, and greatly improve video editing efficiency.
Brief description of the drawings
Figure 1 is a schematic flowchart of the automatic video editing method according to the first embodiment of this application;
Figure 2 is a schematic flowchart of the automatic video editing method according to the second embodiment of this application;
Figure 3 is a schematic structural diagram of the automatic video editing system according to an embodiment of this application;
Figure 4 is a schematic structural diagram of the terminal according to an embodiment of this application;
Figure 5 is a schematic structural diagram of the storage medium according to an embodiment of this application.
Detailed description of the embodiments
The technical solutions in the embodiments of this application will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the scope of protection of this application.
The terms "first", "second" and "third" in this application are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the quantity of the indicated technical features. Thus, features defined as "first", "second" or "third" may explicitly or implicitly include at least one such feature. In the description of this application, "plurality" means at least two, such as two or three, unless otherwise clearly and specifically limited. All directional indications (such as up, down, left, right, front, back...) in the embodiments of this application are only used to explain the relative positional relationship, movement and the like between components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indication changes accordingly. Furthermore, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that includes a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally includes other steps or units inherent to such a process, method, product or device.
Reference herein to "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment can be included in at least one embodiment of this application. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor are they separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art understand, both explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
Please refer to Figure 1, which is a schematic flowchart of the automatic video editing method according to the first embodiment of this application. The method includes the following steps S101-S104:
S101: Obtain the key frames of the video to be edited, self-label the key frames with an image comparison algorithm, and generate unsupervised vector representations of the key frames;
A key frame is a frame in which a key action occurs in the movement of a character or object in the video to be edited. The key frames are acquired as follows: ffmpeg is used to extract frames from the video to be edited. FFmpeg is an open-source suite of computer programs for recording and converting digital audio and video and turning them into streams, with functions such as video capture, video format conversion, video screenshotting and video watermarking. For all extracted images, the similarity between adjacent images is calculated, and images whose similarity is lower than a set threshold are taken as key frames.
Self-labeling the key frames with the image comparison algorithm is specifically: based on the acquired key frames, an unsupervised algorithm is used to train the Self label model. The Self label model uses the image comparison algorithm to learn an unsupervised vector representation of each key-frame image, self-labels the key frames through clustering and representation learning, and outputs self_label(frame_k), where frame_k denotes the k-th key-frame image.
S102: Obtain the corpus information of the video to be edited, and obtain an unsupervised vector representation of the corpus information with a text comparison algorithm;
The corpus information is obtained as follows: ASR technology is used to collect the ASR speech information of the video to be edited, and the collected speech information is cut into ASR text information of a set length; OCR technology is used to obtain OCR text information from the extracted frames; the cut ASR text information and the OCR text information serve as the corpus information of the video to be edited.
Obtaining the unsupervised vector representation of the corpus information with the text comparison algorithm is specifically: the SimCSE model is trained on the corpus information. The SimCSE model uses the text comparison algorithm to learn unsupervised vector representations of the ASR and OCR text, and outputs the text vectors simcse(asr_k) and simcse(ocr_k) of the video to be edited, where asr_k denotes the k-th ASR text information of the video and ocr_k denotes the OCR text information of the k-th key-frame image.
S103: Segment the video to be edited according to the key frames, generating video segments corresponding to the number of key frames;
The video is segmented as follows: each key frame is used as a cutting point, the video to be edited is divided into video segments corresponding to the number of key frames, and each segment includes one key-frame image together with the ASR text information and OCR text information corresponding to that segment.
S104: Calculate the similarity between adjacent video segments from the unsupervised vector representations of the key frames and the corpus information, merge adjacent segments whose similarity is greater than the preset similarity threshold, and generate the video editing result of the video to be edited;
The similarity of adjacent video segments is calculated as follows:
First, the similarities of the key frames, ASR text information and OCR text information of adjacent segments are computed:
simi1 = cos(self_label(frame_k), self_label(frame_{k+1}))   (1)
simi2 = cos(simcse(asr_k), simcse(asr_{k+1}))   (2)
simi3 = cos(simcse(ocr_k), simcse(ocr_{k+1}))   (3)
where simi1, simi2 and simi3 denote the similarities of the key frames, ASR text information and OCR text information of the adjacent video segments;
Then, the similarity of the adjacent segments is computed from these three similarities:
simi = α*simi1 + β*simi2 + (1 − α − β)*simi3   (4)
where simi denotes the similarity of the adjacent video segments, and α and β are adjustable parameters.
The automatic video editing method of the first embodiment of this application obtains the key frames and corpus information of the video to be edited, uses the image comparison algorithm to learn unsupervised vector representations of the key-frame images and the text comparison algorithm to learn unsupervised vector representations of the corpus information, divides the video into multiple segments at the key frames, calculates the similarity of adjacent segments from these vector representations, and merges segments with higher similarity to obtain the final video editing result. The embodiments of this application use both image and text information, avoid manual data annotation, realize automatic video editing, and greatly improve video editing efficiency.
Please refer to Figure 2, which is a schematic flowchart of the automatic video editing method according to the second embodiment of this application. The method includes the following steps S201-S209:
S201: Collect at least one video to be edited;
S202: Perform frame extraction on the video to be edited and obtain its key frames;
In this step, the ffmpeg program is used to extract frames from the collected video to be edited. FFmpeg is an open-source suite of computer programs for recording and converting digital audio and video and turning them into streams, with functions such as video capture, video format conversion, video screenshotting and video watermarking. Key frames are the frames in which key actions occur in the movement of characters or objects in the video to be edited. In this embodiment, the key frames are obtained by calculating the similarity between adjacent images over all extracted frames and taking the frames whose similarity is lower than a set threshold as key frames. While acquiring the key frames, a certain number of the remaining (non-key) frames are retained according to a set ratio; the number k of retained frames can be set arbitrarily.
S203: Based on the acquired key frames, train the Self label model with an unsupervised algorithm; the Self label model self-labels the key frames through clustering and representation learning;
In this step, the Self label model is a self-supervised algorithm that assigns labels by maximizing the mutual information between data and labels. It uses the image comparison algorithm to learn an unsupervised vector representation of each key-frame image, self-labels the key-frame images through clustering and representation learning, and outputs self_label(frame_k), where frame_k denotes the k-th key-frame image.
S204: Use ASR (Automatic Speech Recognition) technology and OCR (Optical Character Recognition) technology to collect the corpus information of the video to be edited;
In this step, video is typical multi-modal data containing images as well as rich text information. The corpus information of the video to be edited includes the ASR speech information in the video and the OCR text information in the extracted frames. In this embodiment, the corpus information is obtained as follows: ASR technology is used to collect the ASR speech information of the video, and the collected speech information is cut into ASR text information of a set length; at the same time, OCR technology is used to obtain OCR text information from the extracted frames, and the cut ASR text information and the OCR text information serve as the corpus information of the video to be edited. The cutting length of the ASR speech information is 100, and can be set according to the actual application.
S205: Train the SimCSE model on the collected corpus information, and output the SimCSE text vectors of the video to be edited through the SimCSE model;
In this embodiment, the SimCSE model can be trained without supervision: it builds on self-supervised training of a BERT model, uses dropout as a semantics-preserving natural-language data augmentation, learns unsupervised text vector representations with the text comparison algorithm, and outputs the text vectors simcse(asr_k) and simcse(ocr_k) of the video to be edited, where asr_k denotes the k-th ASR text information of the video and ocr_k denotes the OCR text information of the k-th key-frame image.
S206: Use each key frame as a cutting point, divide the video to be edited into multiple video clips, and make each clip include one key frame together with the ASR text information and OCR text information corresponding to that clip;
In this step, the long video is divided into multiple shorter clips at the key frames of the video to be edited. Each clip includes one key-frame image and the ASR text and OCR text corresponding to the clip, i.e. each clip is represented as (frame, asr, ocr).
S207: Based on the key frames, ASR text information and OCR text of the video to be edited, calculate the similarity of each pair of adjacent video clips;
In this step, the similarity of adjacent clips is calculated as follows: first, the similarities of the key frames, ASR text and OCR text of the adjacent clips are computed separately; then the similarity of the adjacent clips is computed from these three similarities. The specific calculation formulas are as follows:
simi1 = cos(self_label(frame_k), self_label(frame_{k+1}))   (1)
simi2 = cos(simcse(asr_k), simcse(asr_{k+1}))   (2)
simi3 = cos(simcse(ocr_k), simcse(ocr_{k+1}))   (3)
simi = α*simi1 + β*simi2 + (1 − α − β)*simi3   (4)
where simi1, simi2 and simi3 denote the similarities of the key frames, ASR text and OCR text of adjacent clips, and simi denotes the similarity of the adjacent clips. α and β are adjustable parameters; preferably, this embodiment sets both α and β to 0.45.
S208: Determine whether the similarity between two adjacent clips is greater than the set similarity threshold; if the similarity between the two adjacent clips is greater than the preset similarity threshold, execute S209;
In this step, the similarity threshold is set to 0.5: if the similarity between two adjacent clips is greater than 0.5, the clips are considered similar enough to be merged; otherwise, the two clips are discarded.
S209: Merge adjacent clips whose similarity is greater than the preset similarity threshold to obtain the final video editing result;
In this step, the edited short video is obtained by merging clips with high similarity, which makes the edited short video smoother and improves the viewer's experience.
Based on the above, the automatic video editing method of the second embodiment of this application collects the key frames and corpus information of the video to be edited, uses the image comparison algorithm to learn unsupervised vector representations of the key-frame images and the text comparison algorithm to learn unsupervised vector representations of the corpus information, divides the video into multiple clips at the key frames, calculates the similarity of adjacent clips from these vector representations, and merges clips with higher similarity to obtain the final video editing result. The embodiments of this application use both image and text information, avoid manual data annotation, realize automatic video editing, and greatly improve video editing efficiency.
In an optional implementation, the result of the automatic video editing method can also be uploaded to a blockchain.
Specifically, corresponding summary information is obtained from the result of the automatic video editing method; the summary information is obtained by hashing the result, for example with the sha256s algorithm. Uploading the summary information to the blockchain ensures its security and its fairness and transparency to users. A user can download the summary information from the blockchain to verify whether the result of the automatic video editing method has been tampered with. The blockchain referred to in this example is a new application model of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated and linked using cryptographic methods, where each block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain can include an underlying blockchain platform, a platform product service layer and an application service layer.
Please refer to FIG. 3, which is a schematic structural diagram of the automatic video editing system according to an embodiment of the present application. The automatic video editing system 40 of the embodiment of the present application includes:
First acquisition module 41: used to acquire the key frames of the video to be edited, self-label the key frames using an image comparison algorithm, and generate unsupervised vector representations of the key frames. The key frames are the frames in which key actions occur in the motion of characters or objects in the video to be edited. The key frames are acquired as follows: ffmpeg is used to extract frames from the video to be edited; for all extracted images, the similarity between adjacent images is calculated, and the images whose similarity is lower than a set threshold are taken as key frames.
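A minimal Python sketch of this acquisition step follows, assuming ffmpeg is installed on the system; the sampling rate, the histogram-intersection measure, and the 0.9 threshold are illustrative assumptions, since the application does not fix a particular image similarity measure:

    import subprocess
    from pathlib import Path

    import numpy as np
    from PIL import Image

    def extract_frames(video: str, out_dir: str, fps: int = 1) -> list[Path]:
        # Frame extraction with ffmpeg, one image per second (assumed rate).
        Path(out_dir).mkdir(parents=True, exist_ok=True)
        subprocess.run(
            ["ffmpeg", "-i", video, "-vf", f"fps={fps}", f"{out_dir}/frame_%05d.png"],
            check=True,
        )
        return sorted(Path(out_dir).glob("frame_*.png"))

    def histogram(path: Path) -> np.ndarray:
        img = Image.open(path).convert("L").resize((256, 256))
        hist, _ = np.histogram(np.asarray(img), bins=64, range=(0, 255))
        return hist / hist.sum()

    def select_key_frames(frames: list[Path], threshold: float = 0.9) -> list[Path]:
        # A frame becomes a key frame when it differs enough from its predecessor;
        # treating the first frame as a key frame is an assumption of this sketch.
        keys = [frames[0]]
        prev = histogram(frames[0])
        for f in frames[1:]:
            cur = histogram(f)
            similarity = float(np.minimum(prev, cur).sum())  # histogram intersection
            if similarity < threshold:
                keys.append(f)
            prev = cur
        return keys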
The first acquisition module 41 self-labels the key frames with the image comparison algorithm as follows: based on the acquired key frames, a Self label model is trained with an unsupervised algorithm; the Self label model learns unsupervised vector representations of the key frame images using the image comparison algorithm, self-labels the key frames through clustering and representation learning, and outputs self_label(frame_k) for each key frame, where frame_k denotes the k-th key frame image.
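The Self label model itself is a trained contrastive model; purely as a hedged illustration of the clustering side of self-labelling, the following sketch assigns cluster ids to precomputed key-frame embeddings with k-means:

    import numpy as np
    from sklearn.cluster import KMeans

    def self_label(embeddings: np.ndarray, n_clusters: int = 10) -> np.ndarray:
        # embeddings: one row per key frame (e.g. from a contrastive encoder);
        # the returned cluster id plays the role of self_label(frame_k).
        # n_clusters = 10 is an illustrative assumption.
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        return km.fit_predict(embeddings)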
Second acquisition module 42: used to acquire the corpus information of the video to be edited and obtain unsupervised vector representations of the corpus information using a text comparison algorithm. The corpus information is acquired as follows: ASR technology is used to collect the ASR speech information of the video to be edited, and the collected ASR speech information is cut into ASR text of a set length; OCR technology is used to obtain OCR text from the extracted frame images; the cut ASR text and the OCR text serve as the corpus information of the video to be edited.
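As one possible realization, assuming the open-source openai-whisper and pytesseract packages stand in for the generic "ASR technology" and "OCR technology" (the application names no specific engines), the corpus could be collected as follows:

    import whisper
    import pytesseract
    from PIL import Image

    def collect_corpus(video: str, frame_paths: list, chunk_len: int = 64):
        # ASR: transcribe the video's audio track, then cut the transcript
        # into pieces of a set length (chunk_len = 64 is an assumption).
        asr_model = whisper.load_model("base")
        transcript = asr_model.transcribe(video)["text"]
        asr_chunks = [transcript[i : i + chunk_len]
                      for i in range(0, len(transcript), chunk_len)]
        # OCR: read on-screen text from each extracted frame image.
        ocr_texts = [pytesseract.image_to_string(Image.open(p)) for p in frame_paths]
        return asr_chunks, ocr_texts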
The second acquisition module 42 obtains the unsupervised vector representations of the corpus information with the text comparison algorithm as follows: a SimCSE model is trained on the corpus information; the SimCSE model learns unsupervised vector representations of the ASR text and the OCR text using the text comparison algorithm and outputs the text vectors simcse(asr_k) and simcse(ocr_k) of the video to be edited, where asr_k denotes the k-th piece of ASR text of the video to be edited and ocr_k denotes the OCR text of the k-th key frame image.
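A hedged sketch of producing simcse(asr_k) and simcse(ocr_k) with the Hugging Face transformers library; the public checkpoint name is an assumption made for illustration, whereas the application trains the SimCSE model on the video's own corpus:

    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")
    enc = AutoModel.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")

    def simcse(texts: list[str]) -> torch.Tensor:
        # One embedding per input text; the [CLS] vector is used as the
        # sentence representation, a common SimCSE convention.
        batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            out = enc(**batch)
        return out.last_hidden_state[:, 0]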
Video segmentation module 43: used to segment the video to be edited according to the key frames and generate video segments corresponding to the number of key frames. The video segmentation module segments the video as follows: each key frame is taken as a cut point, the video to be edited is divided into video segments corresponding to the number of key frames, and each video segment contains one key frame image together with the ASR text and OCR text corresponding to that segment.
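An illustrative sketch of the segmentation step, assuming the key-frame timestamps are known (for example, recovered from the frame-extraction rate) and that ffmpeg stream copy is acceptable for the cut:

    import subprocess

    def split_video(video: str, key_times: list[float], duration: float) -> list[str]:
        # key_times: timestamps of the key frames, strictly inside (0, duration),
        # each used as a cut point; output names are illustrative.
        bounds = [0.0] + key_times + [duration]
        clips = []
        for i, (start, end) in enumerate(zip(bounds, bounds[1:])):
            out = f"clip_{i:03d}.mp4"
            subprocess.run(
                ["ffmpeg", "-i", video, "-ss", str(start), "-to", str(end),
                 "-c", "copy", out],
                check=True,
            )
            clips.append(out)
        return clips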
Video merging module 44: used to calculate the similarity of adjacent video segments based on the unsupervised vector representations of the key frames and of the corpus information, merge the adjacent video segments whose similarity is greater than the set similarity threshold, and generate the video editing result of the video to be edited. The similarity of adjacent video segments is calculated as follows:
First, the similarities of the key frames, the ASR text, and the OCR text of adjacent video segments are calculated separately:

simi1 = cos(self_label(frame_k), self_label(frame_{k+1}))    (1)

simi2 = cos(simcse(asr_k), simcse(asr_{k+1}))    (2)

simi3 = cos(simcse(ocr_k), simcse(ocr_{k+1}))    (3)

where simi1, simi2 and simi3 denote the similarities of the key frames, the ASR text and the OCR text in the adjacent video segments, respectively.

Then, the similarity of the adjacent video segments is calculated from the similarities of the key frames, the ASR text and the OCR text:

simi = α·simi1 + β·simi2 + (1 − α − β)·simi3    (4)

where simi denotes the similarity of the adjacent video segments, and α and β are adjustable parameters.
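The four equations translate directly into code; the following sketch assumes the vectors have already been produced by the Self label and SimCSE models, and the values of α and β are illustrative:

    import numpy as np

    def cos(u: np.ndarray, v: np.ndarray) -> float:
        # Cosine similarity between two vectors.
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def segment_similarity(frame_k, frame_k1, asr_k, asr_k1, ocr_k, ocr_k1,
                           alpha: float = 0.4, beta: float = 0.3) -> float:
        simi1 = cos(frame_k, frame_k1)   # key-frame similarity, eq. (1)
        simi2 = cos(asr_k, asr_k1)       # ASR text similarity, eq. (2)
        simi3 = cos(ocr_k, ocr_k1)       # OCR text similarity, eq. (3)
        return alpha * simi1 + beta * simi2 + (1 - alpha - beta) * simi3  # eq. (4)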
The automatic video editing system of the embodiment of the present application acquires the key frames and corpus information of the video to be edited, uses an image comparison algorithm to learn unsupervised vector representations of the key frame images and a text comparison algorithm to learn unsupervised vector representations of the corpus information, segments the video to be edited into multiple video segments at the key frames, calculates the similarity of adjacent video segments based on the vector representations of the key frames and the corpus information, and merges the video segments with high similarity to obtain the final video editing result. The embodiment of the present application uses image and text information simultaneously, avoids manual data annotation, realizes automatic video editing, and greatly improves video editing efficiency.
Please refer to FIG. 4, which is a schematic structural diagram of a terminal according to an embodiment of the present application. The terminal 50 includes a processor 51 and a memory 52 coupled to the processor 51.
The memory 52 stores program instructions for implementing the above automatic video editing method.
The processor 51 is configured to execute the program instructions stored in the memory 52 to perform the automatic video editing operations.
The processor 51 may also be referred to as a CPU (Central Processing Unit). The processor 51 may be an integrated circuit chip with signal processing capability. The processor 51 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
Please refer to FIG. 5, which is a schematic structural diagram of a storage medium according to an embodiment of the present application. The storage medium of the embodiment of the present application stores a program file 61 capable of implementing all of the above methods. The program file 61 may be stored in the storage medium in the form of a software product, and the storage medium may be non-volatile or volatile. It includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of this application. The aforementioned storage media include media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc, as well as terminal devices such as computers, servers, mobile phones, and tablets.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the system embodiments described above are merely illustrative: the division into units is only a division by logical function, and there may be other divisions in actual implementations; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated units may be implemented in the form of hardware or in the form of software functional units. The above are only embodiments of this application and do not limit the patent scope of this application; any equivalent structural or process transformation made using the contents of the specification and drawings of this application, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of this application.

Claims (20)

  1. An automatic video editing method, wherein the method comprises:
    acquiring key frames of a video to be edited, self-labelling the key frames using an image comparison algorithm, and generating unsupervised vector representations of the key frames;
    acquiring corpus information of the video to be edited, and obtaining unsupervised vector representations of the corpus information using a text comparison algorithm;
    segmenting the video to be edited according to the key frames to generate video segments corresponding to the number of key frames; and
    calculating the similarity between adjacent video segments according to the unsupervised vector representations of the key frames and the unsupervised vector representations of the corpus information, and merging adjacent video segments whose similarity is greater than a preset similarity threshold to generate a video editing result of the video to be edited.
  2. The automatic video editing method according to claim 1, wherein the key frames are frames in which key actions occur in the motion of characters or objects in the video to be edited, and acquiring the key frames of the video to be edited comprises:
    extracting frames from the video to be edited using ffmpeg; and
    for all images after frame extraction, calculating the similarity between adjacent images, and taking images whose similarity is lower than a set threshold as key frames.
  3. The automatic video editing method according to claim 2, wherein self-labelling the key frames using the image comparison algorithm comprises:
    training a Self label model with an unsupervised algorithm based on the acquired key frames, wherein the Self label model learns unsupervised vector representations of the key frame images using the image comparison algorithm, self-labels the key frames through clustering and representation learning, and outputs self_label(frame_k) for each key frame, where frame_k denotes the k-th key frame image.
  4. The automatic video editing method according to any one of claims 1 to 3, wherein acquiring the corpus information of the video to be edited comprises:
    collecting ASR speech information of the video to be edited using ASR technology, and cutting the collected ASR speech information into ASR text of a set length;
    obtaining OCR text from the images after frame extraction using OCR technology; and
    taking the cut ASR text and the OCR text as the corpus information of the video to be edited.
  5. The automatic video editing method according to claim 4, wherein obtaining the unsupervised vector representations of the corpus information using the text comparison algorithm comprises:
    training a SimCSE model based on the corpus information, wherein the SimCSE model learns unsupervised vector representations of the ASR text and the OCR text using the text comparison algorithm and outputs text vectors simcse(asr_k) and simcse(ocr_k) of the video to be edited, where asr_k denotes the k-th piece of ASR text of the video to be edited and ocr_k denotes the OCR text of the k-th key frame image.
  6. The automatic video editing method according to claim 5, wherein segmenting the video to be edited according to the key frames to generate video segments corresponding to the number of key frames comprises:
    taking each key frame as a cut point, dividing the video to be edited into video segments corresponding to the number of key frames, and making each video segment include one key frame image together with the ASR text and OCR text corresponding to that video segment.
  7. The automatic video editing method according to claim 3 or 5, wherein calculating the similarity of adjacent video segments according to the unsupervised vector representations of the key frames and the unsupervised vector representations of the corpus information comprises:
    calculating the similarities of the key frames, the ASR text, and the OCR text of the adjacent video segments separately:
    simi1 = cos(self_label(frame_k), self_label(frame_{k+1}))
    simi2 = cos(simcse(asr_k), simcse(asr_{k+1}))
    simi3 = cos(simcse(ocr_k), simcse(ocr_{k+1}))
    where simi1, simi2 and simi3 denote the similarities of the key frames, the ASR text and the OCR text in the adjacent video segments, respectively; and
    calculating the similarity of the adjacent video segments from the similarities of the key frames, the ASR text and the OCR text:
    simi = α·simi1 + β·simi2 + (1 − α − β)·simi3,
    where simi denotes the similarity of the adjacent video segments, and α and β are adjustable parameters.
  8. An automatic video editing system, wherein the system comprises:
    a first acquisition module, configured to acquire key frames of a video to be edited, self-label the key frames using an image comparison algorithm, and generate unsupervised vector representations of the key frames;
    a second acquisition module, configured to acquire corpus information of the video to be edited and obtain unsupervised vector representations of the corpus information using a text comparison algorithm;
    a video segmentation module, configured to segment the video to be edited according to the key frames and generate video segments corresponding to the number of key frames; and
    a video merging module, configured to calculate the similarity of adjacent video segments according to the unsupervised vector representations of the key frames and the unsupervised vector representations of the corpus information, merge adjacent video segments whose similarity is greater than a set similarity threshold, and generate a video editing result of the video to be edited.
  9. A terminal, wherein the terminal comprises a processor and a memory coupled to the processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the following steps: acquiring key frames of a video to be edited, self-labelling the key frames using an image comparison algorithm, and generating unsupervised vector representations of the key frames; acquiring corpus information of the video to be edited, and obtaining unsupervised vector representations of the corpus information using a text comparison algorithm; segmenting the video to be edited according to the key frames to generate video segments corresponding to the number of key frames; and calculating the similarity between adjacent video segments according to the unsupervised vector representations of the key frames and the unsupervised vector representations of the corpus information, and merging adjacent video segments whose similarity is greater than a preset similarity threshold to generate a video editing result of the video to be edited.
  10. The terminal according to claim 9, wherein the key frames are frames in which key actions occur in the motion of characters or objects in the video to be edited, and acquiring the key frames of the video to be edited comprises:
    extracting frames from the video to be edited using ffmpeg; and
    for all images after frame extraction, calculating the similarity between adjacent images, and taking images whose similarity is lower than a set threshold as key frames.
  11. The terminal according to claim 10, wherein self-labelling the key frames using the image comparison algorithm comprises:
    training a Self label model with an unsupervised algorithm based on the acquired key frames, wherein the Self label model learns unsupervised vector representations of the key frame images using the image comparison algorithm, self-labels the key frames through clustering and representation learning, and outputs self_label(frame_k) for each key frame, where frame_k denotes the k-th key frame image.
  12. The terminal according to any one of claims 9 to 11, wherein acquiring the corpus information of the video to be edited comprises:
    collecting ASR speech information of the video to be edited using ASR technology, and cutting the collected ASR speech information into ASR text of a set length;
    obtaining OCR text from the images after frame extraction using OCR technology; and
    taking the cut ASR text and the OCR text as the corpus information of the video to be edited.
  13. The terminal according to claim 12, wherein obtaining the unsupervised vector representations of the corpus information using the text comparison algorithm comprises:
    training a SimCSE model based on the corpus information, wherein the SimCSE model learns unsupervised vector representations of the ASR text and the OCR text using the text comparison algorithm and outputs text vectors simcse(asr_k) and simcse(ocr_k) of the video to be edited, where asr_k denotes the k-th piece of ASR text of the video to be edited and ocr_k denotes the OCR text of the k-th key frame image.
  14. The terminal according to claim 13, wherein segmenting the video to be edited according to the key frames to generate video segments corresponding to the number of key frames comprises:
    taking each key frame as a cut point, dividing the video to be edited into video segments corresponding to the number of key frames, and making each video segment include one key frame image together with the ASR text and OCR text corresponding to that video segment.
  15. A storage medium storing a program file capable of implementing the following steps: acquiring key frames of a video to be edited, self-labelling the key frames using an image comparison algorithm, and generating unsupervised vector representations of the key frames; acquiring corpus information of the video to be edited, and obtaining unsupervised vector representations of the corpus information using a text comparison algorithm; segmenting the video to be edited according to the key frames to generate video segments corresponding to the number of key frames; and calculating the similarity between adjacent video segments according to the unsupervised vector representations of the key frames and the unsupervised vector representations of the corpus information, and merging adjacent video segments whose similarity is greater than a preset similarity threshold to generate a video editing result of the video to be edited.
  16. The storage medium according to claim 15, wherein the key frames are frames in which key actions occur in the motion of characters or objects in the video to be edited, and acquiring the key frames of the video to be edited comprises:
    extracting frames from the video to be edited using ffmpeg; and
    for all images after frame extraction, calculating the similarity between adjacent images, and taking images whose similarity is lower than a set threshold as key frames.
  17. The storage medium according to claim 16, wherein self-labelling the key frames using the image comparison algorithm comprises:
    training a Self label model with an unsupervised algorithm based on the acquired key frames, wherein the Self label model learns unsupervised vector representations of the key frame images using the image comparison algorithm, self-labels the key frames through clustering and representation learning, and outputs self_label(frame_k) for each key frame, where frame_k denotes the k-th key frame image.
  18. The storage medium according to any one of claims 15 to 17, wherein acquiring the corpus information of the video to be edited comprises:
    collecting ASR speech information of the video to be edited using ASR technology, and cutting the collected ASR speech information into ASR text of a set length;
    obtaining OCR text from the images after frame extraction using OCR technology; and
    taking the cut ASR text and the OCR text as the corpus information of the video to be edited.
  19. The storage medium according to claim 18, wherein obtaining the unsupervised vector representations of the corpus information using the text comparison algorithm comprises:
    training a SimCSE model based on the corpus information, wherein the SimCSE model learns unsupervised vector representations of the ASR text and the OCR text using the text comparison algorithm and outputs text vectors simcse(asr_k) and simcse(ocr_k) of the video to be edited, where asr_k denotes the k-th piece of ASR text of the video to be edited and ocr_k denotes the OCR text of the k-th key frame image.
  20. The storage medium according to claim 19, wherein segmenting the video to be edited according to the key frames to generate video segments corresponding to the number of key frames comprises:
    taking each key frame as a cut point, dividing the video to be edited into video segments corresponding to the number of key frames, and making each video segment include one key frame image together with the ASR text and OCR text corresponding to that video segment.
PCT/CN2022/089560 2022-03-29 2022-04-27 Automatic video editing method and system, and terminal and storage medium WO2023184636A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210318902.4 2022-03-29
CN202210318902.4A CN114694070A (en) 2022-03-29 2022-03-29 Automatic video editing method, system, terminal and storage medium

Publications (1)

Publication Number Publication Date
WO2023184636A1

Family

ID=82140927

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/089560 WO2023184636A1 (en) 2022-03-29 2022-04-27 Automatic video editing method and system, and terminal and storage medium

Country Status (2)

Country Link
CN (1) CN114694070A (en)
WO (1) WO2023184636A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9620168B1 (en) * 2015-12-21 2017-04-11 Amazon Technologies, Inc. Cataloging video and creating video summaries
CN108882057A (en) * 2017-05-09 2018-11-23 北京小度互娱科技有限公司 Video abstraction generating method and device
CN111526382A (en) * 2020-04-20 2020-08-11 广东小天才科技有限公司 Live video text generation method, device, equipment and storage medium
CN111797850A (en) * 2019-04-09 2020-10-20 Oppo广东移动通信有限公司 Video classification method and device, storage medium and electronic equipment
CN113709561A (en) * 2021-04-14 2021-11-26 腾讯科技(深圳)有限公司 Video editing method, device, equipment and storage medium


Also Published As

Publication number Publication date
CN114694070A (en) 2022-07-01


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22934462

Country of ref document: EP

Kind code of ref document: A1