WO2023184636A1 - Automatic video editing method and system, terminal and storage medium - Google Patents

Automatic video editing method and system, terminal and storage medium

Info

Publication number
WO2023184636A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
edited
key frames
ASR
information
Prior art date
2022-03-29
Application number
PCT/CN2022/089560
Other languages
English (en)
Chinese (zh)
Inventor
唐小初
舒畅
陈又新
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
2022-04-27
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2023184636A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23424Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement

Definitions

  • This application relates to the technical field of artificial-intelligence cluster analysis, and in particular to an automatic video editing method, system, terminal and storage medium.
  • This application provides an automatic video editing method, system, terminal and storage medium, aiming to solve the technical problems of existing video editing, which relies on manual labor and therefore suffers from high cost and low editing efficiency.
  • An automatic video editing method includes:
  • An automatic video editing system includes:
  • a first acquisition module, used to acquire the key frames of the video to be edited, self-label the key frames using an image comparison algorithm, and generate unsupervised vector representations of the key frames;
  • a second acquisition module, used to obtain the corpus information of the video to be edited and to obtain an unsupervised vector representation of the corpus information using a text comparison algorithm;
  • a video segmentation module, used to segment the video to be edited according to the key frames and generate video segments corresponding to the number of key frames;
  • a video merging module, used to calculate the similarity of adjacent video segments based on the unsupervised vector representations of the key frames and of the corpus information, and to merge adjacent video segments whose similarity is greater than a set similarity threshold, generating the video editing result of the video to be edited.
  • A terminal includes a processor and a memory coupled to the processor, wherein:
  • the memory stores program instructions for implementing the above-mentioned automatic video editing method;
  • the processor is configured to execute the program instructions stored in the memory to perform the automatic video editing operations.
  • A storage medium stores program instructions executable by a processor, and the program instructions are used to execute the above-mentioned automatic video editing method.
  • The automatic video editing method, system, terminal and storage medium of the embodiments of the present application collect key frames and corpus information of the video to be edited, divide the video to be edited into multiple video segments at the key frames, calculate the similarity of adjacent video segments from the vector representations of the key frames and the corpus information, and merge the video segments with higher similarity to obtain the final video editing result.
  • the embodiment of the present application utilizes both image and text information, avoids manual data annotation, realizes automatic video editing, and greatly improves the efficiency of video editing.
  • Figure 1 is a schematic flowchart of an automatic video editing method according to the first embodiment of the present application.
  • Figure 2 is a schematic flowchart of the automatic video editing method according to the second embodiment of the present application.
  • Figure 3 is a schematic structural diagram of an automatic video editing system according to an embodiment of the present application.
  • Figure 4 is a schematic structural diagram of a terminal according to an embodiment of the present application.
  • Figure 5 is a schematic structural diagram of a storage medium according to an embodiment of the present application.
  • The terms “first”, “second” and “third” in this application are only used for descriptive purposes and cannot be understood as indicating or implying relative importance or implicitly indicating the quantity of the indicated technical features. Thus, features defined as “first”, “second” and “third” may explicitly or implicitly include at least one such feature.
  • In this application, “plurality” means at least two, such as two, three, etc., unless otherwise clearly and specifically limited. All directional indications (such as up, down, left, right, front, rear, etc.) in the embodiments of this application are only used to explain the relative positional relationship, movement conditions, etc., between components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indication changes accordingly.
  • Reference to “an embodiment” means that a particular feature, structure or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application.
  • The appearances of this phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art understand, both explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
  • Figure 1 is a schematic flowchart of an automatic video editing method according to the first embodiment of the present application.
  • the automatic video editing method in the first embodiment of the present application includes the following steps S101-S104:
  • S101 Obtain the key frames of the video to be edited, and use an image comparison algorithm to self-label the key frames, generating unsupervised vector representations of the key frames;
  • A key frame is a frame in which a key action occurs in the movement of a character or object in the video to be edited.
  • The key frames are acquired as follows: use ffmpeg to extract frames from the video to be edited. FFmpeg is a set of open-source computer programs that can be used to record and convert digital audio and video, and to turn them into streams; it provides video capture, video format conversion, video watermarking and other functions. For all images after frame extraction, the similarity between adjacent images is calculated, and the images whose similarity is lower than the set threshold are taken as key frames.
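  • As an illustration of this step, the following is a minimal sketch in Python. It assumes a fixed sampling rate (one frame per second) and uses cosine similarity of downscaled grayscale pixels as the image-comparison measure; the application fixes neither choice.

```python
import subprocess
from pathlib import Path

import numpy as np
from PIL import Image

def extract_frames(video_path: str, out_dir: str, fps: int = 1) -> list[Path]:
    """Dump frames from the video with ffmpeg (here: one frame per second)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", f"fps={fps}", f"{out_dir}/frame_%05d.jpg"],
        check=True,
    )
    return sorted(Path(out_dir).glob("frame_*.jpg"))

def frame_vector(path: Path, size: int = 64) -> np.ndarray:
    """Downscaled grayscale pixel vector, L2-normalized for cosine similarity."""
    img = Image.open(path).convert("L").resize((size, size))
    v = np.asarray(img, dtype=np.float32).ravel()
    return v / (np.linalg.norm(v) + 1e-8)

def select_key_frames(frames: list[Path], threshold: float = 0.9) -> list[Path]:
    """Keep a frame as a key frame when its similarity to the previous frame is below the set threshold."""
    keys = [frames[0]]
    prev = frame_vector(frames[0])
    for f in frames[1:]:
        cur = frame_vector(f)
        if float(prev @ cur) < threshold:
            keys.append(f)
        prev = cur
    return keys
```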
  • The specific steps of using the image comparison algorithm to self-label the key frames are: based on the acquired key frames, an unsupervised algorithm is used to train the Self label model.
  • The Self label model uses the image comparison algorithm to learn unsupervised vector representations of the key frame images and, through clustering and representation learning, self-labels the key frames, outputting the self_label(frame_k) of each key frame, where frame_k represents the k-th key frame image.
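  • The Self label model itself is not spelled out in this text. As a stand-in, the sketch below clusters the unsupervised frame vectors and uses the cluster assignments as pseudo-labels, which captures the clustering-plus-representation-learning idea; the embedding source and the number of clusters are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def self_label(frame_vectors: np.ndarray, n_clusters: int = 10) -> np.ndarray:
    """Return a pseudo-label self_label(frame_k) for each key-frame vector."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return kmeans.fit_predict(frame_vectors)  # one cluster id per key frame
```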
  • S102 Obtain the corpus information of the video to be edited, and use the text comparison algorithm to obtain the unsupervised vector representation of the corpus information;
  • The corpus information is obtained as follows: use ASR technology to collect the ASR speech information of the video to be edited, and cut the collected ASR speech information into ASR text information of a set length; use OCR technology to obtain OCR text information from the images after frame extraction; and use the cut ASR text information and the OCR text information as the corpus information of the video to be edited.
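  • A sketch of this collection step is given below. The ASR backend is left abstract (any recognizer that yields a transcript string will do), OCR is shown with pytesseract as one possible choice, and the segment length of 100 follows the second embodiment below.

```python
from pathlib import Path

import pytesseract
from PIL import Image

def cut_asr_text(transcript: str, seg_len: int = 100) -> list[str]:
    """Cut the ASR transcript into fixed-length text segments asr_1..asr_n."""
    return [transcript[i:i + seg_len] for i in range(0, len(transcript), seg_len)]

def ocr_key_frames(key_frames: list[Path]) -> list[str]:
    """Extract OCR text ocr_k from each key-frame image."""
    return [pytesseract.image_to_string(Image.open(p)) for p in key_frames]
```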
  • The specific steps of using the text comparison algorithm to obtain the unsupervised vector representation of the corpus information are: train the SimCSE model based on the corpus information.
  • The SimCSE model uses the text comparison algorithm to learn unsupervised vector representations of the ASR text information and the OCR text information, and outputs the text vectors simcse(asr_k) and simcse(ocr_k) of the video to be edited, where asr_k represents the k-th ASR text information of the video to be edited and ocr_k represents the OCR text information of the k-th key frame image.
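  • For illustration, the snippet below produces simcse(asr_k) and simcse(ocr_k) style vectors. The application trains its own SimCSE model; here the public checkpoint princeton-nlp/unsup-simcse-bert-base-uncased stands in for it, and the [CLS] hidden state is taken as the sentence vector.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "princeton-nlp/unsup-simcse-bert-base-uncased"  # public stand-in
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

@torch.no_grad()
def simcse(texts: list[str]) -> torch.Tensor:
    """Encode a batch of texts into sentence vectors."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return model(**batch).last_hidden_state[:, 0]  # (len(texts), hidden_dim)
```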
  • S103 Segment the video to be edited according to the key frames, and generate video segments corresponding to the number of key frames; the video segmentation method is as follows: each key frame is used as a cutting point, the video to be edited is divided into video segments corresponding to the number of key frames, and each video segment includes a key frame image together with the ASR text information and OCR text information corresponding to that segment.
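  • The cutting itself can again be done with ffmpeg. The sketch below assumes the key-frame timestamps (in seconds) are known, e.g. from the sampling rate used during frame extraction, and produces one clip per key frame.

```python
import subprocess

def split_at_key_frames(video_path: str, cut_points: list[float]) -> list[str]:
    """Cut the video at key-frame timestamps: clip i spans [t_i, t_{i+1})."""
    clips = []
    bounds = cut_points + [None]  # the last clip runs to the end of the video
    for i, (start, end) in enumerate(zip(bounds[:-1], bounds[1:])):
        out = f"clip_{i:03d}.mp4"
        cmd = ["ffmpeg", "-y", "-i", video_path, "-ss", str(start)]
        if end is not None:
            cmd += ["-to", str(end)]
        # stream copy avoids re-encoding; cuts land on encoder keyframes,
        # so re-encode instead if frame-exact boundaries are needed
        cmd += ["-c", "copy", out]
        subprocess.run(cmd, check=True)
        clips.append(out)
    return clips
```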
  • S104 Calculate the similarity between adjacent video clips based on the unsupervised vector representations of the key frames and of the corpus information, merge adjacent video clips whose similarity is greater than the preset similarity threshold, and generate the video editing result of the video to be edited;
  • The similarity of adjacent video clips is calculated as follows:
  • simi1, simi2 and simi3 respectively represent the similarity of the key frames, the ASR text information and the OCR text information of the adjacent video clips;
  • simi represents the overall similarity of the adjacent video clips, and α and β are adjustable parameters.
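  • The combining formula itself does not survive in this text. A plausible reconstruction, consistent with the definitions above (two adjustable weights α and β, with the third weight fixed by normalization), is the weighted sum below; this is an assumption, not the verbatim formula of the application:

    simi = α·simi1 + β·simi2 + (1 - α - β)·simi3

  • With the values α = β = 0.45 given in the second embodiment below, the key-frame and ASR-text similarities each receive weight 0.45 and the OCR-text similarity receives weight 0.1.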
  • The automatic video editing method of the first embodiment of the present application obtains the key frames and corpus information of the video to be edited, uses an image comparison algorithm to learn unsupervised vector representations of the key frame images, and uses a text comparison algorithm to learn an unsupervised vector representation of the corpus information; it divides the video to be edited into multiple video segments at the key frames, calculates the similarity of adjacent video segments based on the vector representations of the key frames and the corpus information, and merges the video segments with higher similarity to obtain the final video editing result.
  • the embodiment of the present application utilizes both image and text information, avoids manual data annotation, realizes automatic video editing, and greatly improves the efficiency of video editing.
  • Figure 2 is a schematic flowchart of an automatic video editing method according to the second embodiment of the present application.
  • the automatic video editing method in the second embodiment of the present application includes the following steps S201-S209:
  • S201 Collect at least one video to be edited
  • S202 Perform frame extraction processing on the video to be edited, and obtain the key frames of the video to be edited;
  • The ffmpeg program is used to extract frames from the collected videos to be edited.
  • FFmpeg is a set of open-source computer programs that can be used to record and convert digital audio and video, and to turn them into streams.
  • FFmpeg provides video capture, video format conversion, video watermarking and other functions.
  • Key frames refer to the frames where key actions occur in the movement changes of characters or objects in the video to be edited.
  • The key frames are obtained by calculating the similarity between adjacent images over all extracted frames and taking the image frames whose similarity is lower than the set threshold as key frames. While acquiring the key frames, a certain number of remaining frames are retained according to a set ratio; the remaining frames are non-key frames, and their number k can be set as required.
  • S203 Based on the acquired key frames, train the Self label model with an unsupervised algorithm; the Self label model self-labels the key frames through clustering and representation learning;
  • the Self label model is a self-supervised algorithm that calibrates labels by maximizing the mutual information between data and labels.
  • The Self label model uses the image comparison algorithm to learn unsupervised vector representations of the key frame images and, through clustering and representation learning, self-labels the key frame images, outputting the self_label(frame_k) of each key frame, where frame_k represents the k-th key frame image.
  • S204 Use ASR (Automatic Speech Recognition) technology and OCR (Optical Character Recognition) technology to collect the corpus information of the video to be edited;
  • Video is typical multi-modal data, containing both images and rich text information.
  • The corpus information of the video to be edited includes the ASR speech information in the video and the OCR text information in the extracted frame images.
  • The corpus information is obtained as follows: use ASR technology to collect the ASR speech information of the video to be edited, and cut the collected ASR speech information into ASR text information of a set length; at the same time, use OCR technology to obtain OCR text information from the extracted images; and use the cut ASR text information and OCR text information as the corpus information of the video to be edited.
  • In this embodiment, the cutting length of the ASR text is 100; this value can be set according to the actual application.
  • S205 Train the SimCSE model based on the collected corpus information, and output the SimCSE text vectors of the video to be edited through the SimCSE model;
  • The SimCSE model can be trained without supervision: building on the self-supervised training of the BERT model, it uses dropout to produce semantically equivalent views of the same natural-language data, learns unsupervised vector representations of the text with the text comparison algorithm, and outputs the text vectors of the video to be edited.
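  • The dropout-based objective can be sketched as follows: the same batch is encoded twice with dropout active, the two passes form positive pairs, and the other sentences in the batch serve as negatives. The temperature value follows the SimCSE paper, not this application.

```python
import torch
import torch.nn.functional as F

def simcse_unsup_loss(encoder, batch, temperature: float = 0.05) -> torch.Tensor:
    """Unsupervised SimCSE loss; the encoder must be in train mode so dropout is active."""
    z1 = encoder(**batch).last_hidden_state[:, 0]  # first dropout pass
    z2 = encoder(**batch).last_hidden_state[:, 0]  # second, independent dropout pass
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sims = z1 @ z2.T / temperature                 # (batch, batch) cosine similarities
    labels = torch.arange(sims.size(0), device=sims.device)
    return F.cross_entropy(sims, labels)           # diagonal entries are the positives
```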
  • S206 Use each key frame as a cutting point, divide the video to be edited into multiple video clips, and make each video clip include a key frame and the ASR text information and OCR text information corresponding to the clip;
  • Each video clip thus includes a key frame image and the ASR text and OCR text corresponding to the clip; that is, each video clip is represented as (frame, asr, ocr).
  • S207 Based on the key frames, ASR text information and OCR text of the video to be edited, calculate the similarity of each pair of adjacent video clips;
  • The similarity of adjacent video clips is calculated as follows: first, calculate the similarities of the key frames, the ASR text and the OCR text of the adjacent clips separately, and then combine these three similarities into the similarity of the adjacent video clips.
  • The specific calculation follows the weighted combination sketched above, where:
  • simi1, simi2 and simi3 respectively represent the similarity of the key frames, the ASR text and the OCR text of the adjacent video clips;
  • simi represents the overall similarity of the adjacent video clips;
  • α and β are adjustable parameters; in this embodiment, the values of α and β are both set to 0.45.
  • S208 Determine whether the similarity between two adjacent video clips is greater than the set similarity threshold; if it is, execute S209;
  • In this embodiment, the similarity threshold is set to 0.5; that is, if the similarity between two adjacent video clips is greater than 0.5, the two clips are considered similar enough to be merged; otherwise, the two clips are discarded.
  • S209 Merge the video clips with high similarity to obtain the edited short video; merging similar clips makes the edited short video smoother and improves the viewing experience.
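  • A sketch of the S207-S209 pass over adjacent clips is given below. Cosine similarity and the greedy left-to-right order are illustrative assumptions; the weights and threshold follow the values given in this embodiment. Note that the sketch keeps a dissimilar clip as the start of a new segment, whereas the embodiment as worded discards it; that is a one-line change.

```python
import numpy as np

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def merge_clips(clips: list[dict], alpha: float = 0.45, beta: float = 0.45,
                threshold: float = 0.5) -> list[dict]:
    """clips: dicts with 'frame', 'asr', 'ocr' vectors and a 'span' (start, end)."""
    result = [clips[0]]
    for nxt in clips[1:]:
        cur = result[-1]
        simi1 = cos(cur["frame"], nxt["frame"])  # key-frame similarity
        simi2 = cos(cur["asr"], nxt["asr"])      # ASR-text similarity
        simi3 = cos(cur["ocr"], nxt["ocr"])      # OCR-text similarity
        simi = alpha * simi1 + beta * simi2 + (1 - alpha - beta) * simi3
        if simi > threshold:
            cur["span"] = (cur["span"][0], nxt["span"][1])  # merge: extend the clip
        else:
            result.append(nxt)  # start a new segment (or discard, per the text)
    return result
```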
  • The automatic video editing method of the second embodiment of the present application collects the key frames and corpus information of the video to be edited, uses an image comparison algorithm to learn unsupervised vector representations of the key frame images, and uses a text comparison algorithm to learn an unsupervised vector representation of the corpus information.
  • It divides the video to be edited into multiple video clips at the key frames, calculates the similarity of adjacent video clips based on the vector representations of the key frames and the corpus information, and merges the video clips with higher similarity to obtain the final video editing result.
  • the embodiment of the present application utilizes both image and text information, avoids manual data annotation, realizes automatic video editing, and greatly improves the efficiency of video editing.
  • the results of the automatic video editing method can also be uploaded to the blockchain.
  • The corresponding summary information is obtained from the result of the automatic video editing method.
  • Specifically, the summary information is obtained by hashing the result of the automatic video editing method, for example with the SHA-256 algorithm.
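  • A minimal sketch of producing the digest, assuming the editing result is serialized as JSON before hashing (the serialization format is not specified in the application):

```python
import hashlib
import json

def summarize_result(result: dict) -> str:
    """SHA-256 digest of the (serialized) automatic video editing result."""
    payload = json.dumps(result, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```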
  • Uploading the summary information to the blockchain ensures its security, fairness and transparency to users; users can download the summary information from the blockchain to verify whether the results of the automatic video editing method have been tampered with.
  • the blockchain referred to in this example is a new application model of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithms.
  • A blockchain is essentially a decentralized database: a series of data blocks generated using cryptographic methods, where each data block contains a batch of network transaction information and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block.
  • Blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
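  • As a toy illustration of the hash-linking just described (didactic only, not a production blockchain): each block stores the digest of its predecessor, so altering any block invalidates every later one.

```python
import hashlib
import json
import time

def make_block(data: dict, prev_hash: str) -> dict:
    """Create a block whose hash covers its payload and its predecessor's hash."""
    block = {"timestamp": time.time(), "data": data, "prev_hash": prev_hash}
    block["hash"] = hashlib.sha256(
        json.dumps(block, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return block
```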
  • Figure 3 is a schematic structural diagram of an automatic video editing system according to an embodiment of the present application.
  • the automatic video editing system 40 in the embodiment of this application includes:
  • The first acquisition module 41 is used to obtain the key frames of the video to be edited and uses an image comparison algorithm to self-label the key frames, generating unsupervised vector representations of the key frames; the key frames are the frames in which key actions occur in the movement of characters or objects in the video to be edited.
  • The key frames are acquired by using ffmpeg to extract frames from the video to be edited, calculating the similarity between adjacent extracted images, and taking the images whose similarity is lower than the set threshold as key frames.
  • The first acquisition module 41 uses the image comparison algorithm to self-label the key frames: specifically, based on the acquired key frames, an unsupervised algorithm is used to train the Self label model.
  • The Self label model uses the image comparison algorithm to learn unsupervised vector representations of the key frame images; the key frames are self-labeled through clustering and representation learning, and the self_label(frame_k) of each key frame is output, where frame_k represents the k-th key frame image.
  • The second acquisition module 42 is used to obtain the corpus information of the video to be edited and to obtain an unsupervised vector representation of the corpus information using a text comparison algorithm. The corpus information is acquired as follows: use ASR technology to collect the ASR speech information of the video to be edited, and cut the collected ASR speech information into ASR text information of a set length; use OCR technology to obtain OCR text information from the images after frame extraction; and use the cut ASR text information and OCR text information as the corpus information of the video to be edited.
  • The second acquisition module 42 uses the text comparison algorithm to obtain the unsupervised vector representation of the corpus information: specifically, it trains the SimCSE model based on the corpus information.
  • The SimCSE model uses the text comparison algorithm to learn unsupervised vector representations of the ASR text information and the OCR text information, and outputs the text vectors simcse(asr_k) and simcse(ocr_k) of the video to be edited, where asr_k represents the k-th ASR text information of the video to be edited and ocr_k represents the OCR text information of the k-th key frame image.
  • The video segmentation module 43 is used to segment the video to be edited according to the key frames and generate video segments corresponding to the number of key frames. Each key frame is used as a cutting point, the video to be edited is divided into video segments corresponding to the number of key frames, and each video segment includes a key frame image together with the ASR text information and OCR text information corresponding to that segment.
  • The video merging module 44 is used to calculate the similarity of adjacent video segments based on the unsupervised vector representations of the key frames and of the corpus information, merge adjacent video segments whose similarity is greater than the set similarity threshold, and generate the video editing result of the video to be edited. The similarity of adjacent video segments is calculated as in the weighted combination described above:
  • simi1, simi2 and simi3 respectively represent the similarity of the key frames, the ASR text information and the OCR text information of the adjacent video segments;
  • simi represents the overall similarity of the adjacent video segments, and α and β are adjustable parameters.
  • The automatic video editing system of the embodiment of the present application obtains the key frames and corpus information of the video to be edited, uses an image comparison algorithm to learn unsupervised vector representations of the key frame images, and uses the text comparison algorithm to learn an unsupervised vector representation of the corpus information.
  • The video to be edited is divided into multiple video segments at the key frames, the similarity of adjacent video segments is calculated based on the vector representations of the key frames and the corpus information, and the video segments with higher similarity are merged to obtain the final video editing result.
  • the embodiment of the present application utilizes both image and text information, avoids manual data annotation, realizes automatic video editing, and greatly improves the efficiency of video editing.
  • As shown in Figure 4, the terminal 50 includes a processor 51 and a memory 52 coupled to the processor 51.
  • the memory 52 stores program instructions for implementing the above-mentioned automatic video editing method.
  • the processor 51 is configured to execute program instructions stored in the memory 52 to perform automatic video editing operations.
  • the processor 51 can also be called a CPU (Central Processing Unit).
  • the processor 51 may be an integrated circuit chip with signal processing capabilities.
  • The processor 51 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general-purpose processor may be a microprocessor or the processor may be any conventional processor, etc.
  • Figure 5 is a schematic structural diagram of a storage medium according to an embodiment of the present application.
  • the storage medium in the embodiment of the present application stores program files 61 that can implement all the above methods.
  • the program files 61 can be stored in the above-mentioned storage medium in the form of software products.
  • The storage medium can be non-volatile or volatile, and includes a number of instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the methods of the embodiments of this application.
  • The aforementioned storage media include media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, as well as terminal devices such as computers, servers, mobile phones and tablets.
  • Each functional unit in each embodiment of the present application can be integrated into one processing unit, or each unit can exist physically alone, or two or more units can be integrated into one unit.
  • the above integrated units can be implemented in the form of hardware or software functional units.

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Signal Processing (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Marketing (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The present application discloses an automatic video editing method and system, as well as a terminal and a storage medium. The method includes: acquiring key frames of a video to be edited and self-labeling the key frames using an image comparison algorithm, so as to generate unsupervised vector representations of the key frames; acquiring corpus information of the video and obtaining an unsupervised vector representation of the corpus information using a text comparison algorithm; segmenting the video according to the key frames so as to generate video clips corresponding to the number of key frames; and calculating the similarity between adjacent video clips according to the unsupervised vector representations of the key frames and the unsupervised vector representation of the corpus information, and merging adjacent video clips whose similarity is greater than a set similarity threshold, so as to generate a video editing result for the video. In the embodiments of the present application, both image information and text information are used, which avoids manual data labeling, achieves automatic video editing, and considerably improves video editing efficiency.
PCT/CN2022/089560 2022-03-29 2022-04-27 Automatic video editing method and system, terminal and storage medium WO2023184636A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210318902.4 2022-03-29
CN202210318902.4A CN114694070A (zh) Automatic video editing method, system, terminal and storage medium

Publications (1)

Publication Number Publication Date
WO2023184636A1 (fr)

Family

Family ID: 82140927

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/089560 WO2023184636A1 (fr) 2022-03-29 2022-04-27 Procédé et système de montage vidéo automatique, ainsi que terminal et support de stockage

Country Status (2)

Country Link
CN (1) CN114694070A (fr)
WO (1) WO2023184636A1 (fr)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9620168B1 (en) * 2015-12-21 2017-04-11 Amazon Technologies, Inc. Cataloging video and creating video summaries
CN108882057A (zh) * 2017-05-09 2018-11-23 北京小度互娱科技有限公司 Video summary generation method and apparatus
CN111797850A (zh) * 2019-04-09 2020-10-20 Oppo广东移动通信有限公司 Video classification method and apparatus, storage medium and electronic device
CN111526382A (zh) * 2020-04-20 2020-08-11 广东小天才科技有限公司 Live-streaming video text generation method, apparatus, device and storage medium
CN113709561A (zh) * 2021-04-14 2021-11-26 腾讯科技(深圳)有限公司 Video editing method, apparatus, device and storage medium

Also Published As

Publication number Publication date
CN114694070A (zh) 2022-07-01

Similar Documents

Publication Publication Date Title
CN108833973B (zh) Video feature extraction method and apparatus, and computer device
WO2022142014A1 (fr) Text classification method based on multi-modal information fusion and related device
CN109756751B (zh) Multimedia data processing method and apparatus, electronic device and storage medium
US8503523B2 (en) Forming a representation of a video item and use thereof
WO2018177139A1 (fr) Video summary generation method and apparatus, server and storage medium
WO2023011094A1 (fr) Video editing method and apparatus, electronic device and storage medium
CN112929744A (zh) Method, apparatus, device, medium and program product for segmenting video clips
WO2023197979A1 (fr) Data processing method and apparatus, computer device and storage medium
CN113010703A (zh) Information recommendation method and apparatus, electronic device and storage medium
CN112100440B (zh) Video pushing method, device and medium
CN114339360B (zh) Video processing method, related apparatus and device
WO2022227218A1 (fr) Drug name recognition method and apparatus, computer device and storage medium
WO2020135756A1 (fr) Video segment extraction method, apparatus and device, and computer-readable storage medium
CN114390368B (zh) Live video data processing method and apparatus, device and readable medium
WO2023173539A1 (fr) Video content processing method and system, terminal and storage medium
US20200151208A1 (en) Time code to byte indexer for partial object retrieval
US20190311746A1 (en) Indexing media content library using audio track fingerprinting
KR20210047467A (ko) Method and system for automatically generating multiple captions for an image
CN115734024A (zh) Audio data processing method, apparatus, device and storage medium
JP6730760B2 (ja) Server and program, and video distribution system
WO2023184636A1 (fr) Automatic video editing method and system, terminal and storage medium
CN116017088A (zh) Video subtitle processing method and apparatus, electronic device and storage medium
KR102526263B1 (ko) Method and system for automatically generating multiple captions for an image
CN115359492A (zh) Text-image matching model training method, image annotation method, apparatus and device
JP6713183B1 (ja) Server and program

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22934462

Country of ref document: EP

Kind code of ref document: A1