WO2024001057A1 - Video retrieval method based on attention segment prompts - Google Patents

Video retrieval method based on attention segment prompts

Info

Publication number
WO2024001057A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
text
features
similarity
visual
Prior art date
Application number
PCT/CN2022/137814
Other languages
English (en)
Chinese (zh)
Inventor
乔宇
陈思然
许清林
王亚立
马跃
Original Assignee
深圳先进技术研究院
Priority date: 2022-07-01
Filing date: 2022-12-09
Publication date: 2024-01-04
Application filed by 深圳先进技术研究院 (Shenzhen Institute of Advanced Technology)
Publication of WO2024001057A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70: Information retrieval of video data
    • G06F 16/73: Querying
    • G06F 16/78: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783: Retrieval using metadata automatically derived from the content
    • G06F 16/7844: Retrieval using original textual content or text extracted from visual content or a transcript of audio data

Definitions

  • the present invention relates to the technical field of video retrieval, and more specifically, to a video retrieval method based on attention segment prompts.
  • the purpose of the present invention is to overcome the above-mentioned shortcomings of the prior art and provide a video retrieval method based on attention segment prompts.
  • the method includes the following steps:
  • Extract visual information from the video, then calculate the corresponding global features, and extract text features based on the query text;
  • Pass the visual information through a temporal transformer so that the features of each frame carry contextual information; search the video for clips similar to the text features, and take the information of the video clip most similar to the query text as the key visual features; weight and sum the key features and the global features to obtain the final visual features; and calculate the similarity between the query text and the video based on the final visual features to retrieve a target video satisfying the similarity requirement;
  • the advantage of the present invention is that it is designed around the inherent characteristics of video-text retrieval: it directly uses the text to query the most relevant video frames and, while retaining global information, assigns more weight to the frames most likely to be relevant, which helps improve the accuracy and efficiency of retrieval.
  • the present invention retains its advantages even on small data sets, and can be readily migrated to and integrated into other model frameworks to further improve retrieval performance. Moreover, it is plug-and-play, simple, and convenient.
  • Figure 1 is a flow chart of a video retrieval method based on attention segment prompts according to one embodiment of the present invention
  • Figure 2 is a schematic diagram of a video retrieval framework based on attention segment prompts according to an embodiment of the present invention
  • Figure 3 illustrates the effect of migrating the attention segment prompt module into an existing model according to an embodiment of the present invention.
  • any specific values are to be construed as illustrative only and not as limiting. Accordingly, other examples of the exemplary embodiments may have different values.
  • the present invention proposes a concise attentional segment prompting (ASP) framework that can dynamically utilize text-related segments in videos to facilitate retrieval.
  • the attention segment prompting framework includes two important parts: segment prompting and video aggregation.
  • segment prompting: for a given text-video pair, segment prompts construct text-driven visual prompts to dynamically obtain the video segments relevant to the text query.
  • Video aggregation can obtain global information of videos.
  • the inventive model framework can effectively learn a robust, text-related visual representation.
  • the provided video retrieval method based on attention segment prompts includes the following steps:
  • Step S110: extract visual features of the video and text features of the query text.
  • FIG. 2 illustrates the framework in the form of functional modules.
  • based on the query text, the most similar video clips in the video can be dynamically searched to obtain the key features of the video with respect to the query text; the video information is then globally integrated to obtain the global features of the video; finally, the key features and global features are weighted and summed to obtain the final visual features (or visual information), from which the video retrieval results are obtained.
  • the framework provided by the present invention is more effective than the existing technology because many original videos contain information from multiple scenes, while the query text often describes only part of the video, which makes it difficult to match the text information with the rich video features.
  • the present invention can dynamically find the video clips described based on the text, significantly improving the video retrieval efficiency.
  • module 1 is used to extract text features and visual features. For example, the video is average-downsampled and converted into a picture sequence; a ViT visual feature extractor then extracts the visual features (or visual information), and a text feature extractor extracts the text features.
  • the text feature extractor can use BERT or other language models.
  • the ViT and BERT encoders are initialized with weights obtained from CLIP pre-training.
  • the CLIP model uses an image-text matching pre-training task to learn cross-modal visual-text information from large-scale image-text data through self-supervised contrastive learning. CLIP can transfer zero-shot to other downstream tasks, and its framework can also serve as a basis for other visual-text tasks to achieve better performance.
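  • as a concrete illustration, the following is a minimal sketch of this feature-extraction stage using the Hugging Face transformers CLIP implementation; the checkpoint name, uniform frame sampling, and L2 normalization are assumptions made for the example and are not fixed by the present description.

```python
# Sketch of Module 1: CLIP-initialized image/text encoders for frames and
# query text. Checkpoint choice and frame sampling are illustrative assumptions.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def extract_features(frames, query_text):
    """frames: list of PIL.Image objects sampled uniformly from the video."""
    with torch.no_grad():
        image_inputs = processor(images=frames, return_tensors="pt")
        frame_feats = model.get_image_features(**image_inputs)   # (T, D)
        text_inputs = processor(text=[query_text], return_tensors="pt",
                                padding=True, truncation=True)
        text_feats = model.get_text_features(**text_inputs)      # (1, D)
    # L2-normalize so that dot products equal cosine similarities
    frame_feats = frame_feats / frame_feats.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    return frame_feats, text_feats
```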
  • Step S120: extract key features of the video with respect to the query text, and fuse the key features with the global visual features to obtain the final visual features.
  • a novel attention segment-based prompt module (Module 2) is proposed.
  • the visual information is passed through a temporal transformer module so that the features of each frame carry contextual information; then, based on the query text, the similarity with the features of each video frame is computed, and the information of the video clip most similar to the query text is dynamically obtained as the key visual features.
  • the visual information extracted by the ViT visual feature extractor is averaged to obtain the global visual features; the key features and global features are then weighted and summed to obtain the final visual features.
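  • the following PyTorch sketch shows one plausible realization of this step; the single-layer temporal transformer, the softmax attention over frame-text similarities, and the fusion weight alpha are illustrative assumptions rather than details fixed by the present description.

```python
# Sketch of Module 2 (attention segment prompt): a temporal transformer
# contextualizes frame features, the query text attends to the frames to
# obtain text-relevant key features, mean pooling yields global features,
# and a weighted sum fuses the two. Layer sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionSegmentPrompt(nn.Module):
    def __init__(self, dim=512, nhead=8, num_layers=1, alpha=0.5):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=nhead,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.alpha = alpha  # dataset-tuned weight of key vs. global features

    def forward(self, frame_feats, text_feats):
        # frame_feats: (B, T, D) per-frame features; text_feats: (B, D)
        ctx = self.temporal(frame_feats)           # contextualized frame features
        ctx_n = F.normalize(ctx, dim=-1)
        txt_n = F.normalize(text_feats, dim=-1)
        sim = torch.einsum("btd,bd->bt", ctx_n, txt_n)     # frame-text cosine sims
        attn = sim.softmax(dim=-1)                         # attention over frames
        key_feats = torch.einsum("bt,btd->bd", attn, ctx)  # text-driven key feats
        global_feats = frame_feats.mean(dim=1)             # video-level aggregation
        return self.alpha * key_feats + (1 - self.alpha) * global_feats
```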
  • the present invention focuses attention on important information that is highly relevant to the query text, thereby improving calculation efficiency and accuracy of feature extraction, and reducing the amount of parameters.
  • the key features and global features can be fused in the form of linear weighting or exponential weighting.
  • optimal hyperparameters are selected to balance the importance of the key features (clip level) and the global features (video level); for different data sets, these hyperparameters are tuned according to the characteristics of the data set itself, ultimately achieving better performance on various types of data sets.
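  • as a sketch, the two fusion variants could look as follows; the reading of "exponential weighting" as softmax-normalized exponential coefficients, and the placeholder values of alpha and tau, are assumptions for illustration only.

```python
import torch

def fuse_linear(key_feats, global_feats, alpha=0.5):
    # linear weighting: convex combination of clip-level and video-level features
    return alpha * key_feats + (1 - alpha) * global_feats

def fuse_exponential(key_feats, global_feats, alpha=0.5, tau=1.0):
    # one plausible exponential weighting: exp-scaled, normalized coefficients
    w = torch.softmax(torch.tensor([alpha, 1 - alpha]) / tau, dim=0)
    return w[0] * key_feats + w[1] * global_feats
```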
  • Step S130: based on the final visual features, calculate the similarity between the video and the query text to obtain the retrieval results.
  • the video retrieval results are output through further cross-modal similarity calculation (see module 3 in Figure 2).
  • cosine similarity is used as an evaluation criterion for similarity between videos and texts to calculate the similarity between the query text and each video.
  • during training, the loss function is calculated from the resulting similarity matrix to learn the model.
  • the most similar video or videos are selected based on the similarity and output as the retrieved target video.
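  • a minimal sketch of this retrieval step follows; the top-k interface and the tensor shapes are assumptions for illustration.

```python
# Sketch of Module 3: rank videos by cosine similarity to the query text
# and return the top-k indices as the retrieval result.
import torch
import torch.nn.functional as F

def retrieve(text_feat, video_feats, k=5):
    # text_feat: (D,) final text features; video_feats: (M, D) final visual
    # features of the candidate videos. Returns top-k indices and similarities.
    sims = F.cosine_similarity(text_feat.unsqueeze(0), video_feats, dim=-1)
    topk = sims.topk(k=min(k, video_feats.size(0)))
    return topk.indices, topk.values
```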
  • the loss function of the present invention takes into account the similarity between the query text and the video; through this design, when video frames are selected in a targeted manner, the frames with the greatest correlation can be chosen directly.
  • the loss function is set as follows:
  • T(i) represents the text features of the i-th text in the batch;
  • Z(i,j) represents the visual features of the i-th video in the batch, generated through the segment prompts of the j-th text;
  • ⟨·,·⟩ represents the cosine similarity between two features;
  • M represents the number of videos.
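  • the formula itself is not reproduced in this rendering; a text-to-video contrastive (InfoNCE-style) loss consistent with the symbols defined above would be the following reconstruction, which is an assumption rather than a quotation of the original:

```latex
% Reconstructed text-to-video contrastive loss over a batch of M pairs:
% the positive pair is video i prompted by its own text i; the negatives
% are the other videos prompted by text i.
\mathcal{L}_{t2v} = -\frac{1}{M}\sum_{i=1}^{M}
  \log\frac{\exp\!\left(\langle T^{(i)},\, Z^{(i,i)}\rangle\right)}
           {\sum_{j=1}^{M}\exp\!\left(\langle T^{(i)},\, Z^{(j,i)}\rangle\right)}
```

  • a symmetric video-to-text term is typically added in CLIP4Clip-style training, and a learnable temperature may scale the similarities; both are assumptions here rather than details stated in this rendering.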
  • with the present invention, for a given text description, one or more videos most relevant to the text can be effectively found from a massive video library.
  • a video clip may contain more than one scene, including camera switches, a target character performing multiple consecutive activities, and so on; for example, an animation clip in which two people drink and talk at a restaurant table and then walk out of the restaurant while talking.
  • 20 different annotators watched the video and then gave their own annotations.
  • a considerable number of the annotations described only part of the video, such as "two animated characters at the dining table" or "two animated characters walking and talking". Based on such observations, the present invention puts forward the idea of video segment prompts.
  • the video representation can dynamically compute a correlation score with the video for a specific text input and find the part of the video most relevant to the text query, thereby meeting users' personalized query needs.
  • the present invention dynamically aligns the query text with the video clips to obtain the clips most relevant to the text, thereby enhancing the robustness of retrieval and achieving better experimental results even with a small training set. See Table 2, which compares against the CLIP4Clip experimental results when the training data sizes are 30, 300, and 3000, respectively.
  • the present invention at least has the following technical effects:
  • the designed plug-and-play attention segment prompt module can dynamically obtain relevant video segments based on the query text, so that different query texts can effectively match different related content in the video, enhancing the retrieval effect.
  • the present invention can give full play to its advantages even on small data sets, and has strong portability.
  • the ASP module can be transplanted to other retrieval models to further improve retrieval results.
  • the baseline models are CLIP and CLIP4Clip, the latter being an adaptation of CLIP for video.
  • the present invention adds its own contribution on this basis, and the retrieval performance is significantly better than the baseline of CLIP4Clip.
  • the invention can be applied to electronic devices, servers or clouds to retrieve one or more target videos based on the query text.
  • the electronic device can be a terminal device or a server.
  • terminal devices include mobile phones, tablet computers, personal digital assistants (PDA), point-of-sale terminals (POS), smart wearable devices (smart watches, virtual reality glasses, virtual reality headsets, etc.), and similar terminal equipment.
  • Servers include but are not limited to application servers or web servers, and can be independent servers, cluster servers, cloud servers, etc.
  • the invention may be a system, method and/or computer program product.
  • a computer program product may include a computer-readable storage medium having computer-readable program instructions thereon for causing a processor to implement various aspects of the invention.
  • Computer-readable storage media may be tangible devices that can retain and store instructions for use by an instruction execution device.
  • the computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, and mechanical encoding devices such as punched cards or raised structures in grooves on which instructions are stored.
  • Computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber-optic cables), or electrical signals transmitted through wires.
  • Computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to various computing/processing devices, or to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium in the respective computing/processing device.
  • Computer program instructions for performing operations of the present invention may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state setting data, or instructions in one or more programming languages.
  • the computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • in some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), can execute computer-readable program instructions to implement various aspects of the invention.
  • These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create an apparatus that implements the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer-readable program instructions may also be stored in a computer-readable storage medium; the instructions cause the computer, programmable data processing apparatus, and/or other equipment to work in a specific manner, such that the computer-readable medium storing the instructions constitutes an article of manufacture that includes instructions implementing aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other equipment, causing a series of operating steps to be performed on it to produce a computer-implemented process, so that the instructions executed on the computer, other programmable data processing apparatus, or other equipment implement the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions that contains one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the blocks may occur out of the order noted in the figures; for example, two consecutive blocks may actually execute substantially in parallel, or sometimes in the reverse order, depending on the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks therein, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by a combination of special-purpose hardware and computer instructions. It is well known to those skilled in the art that implementation through hardware, through software, and through a combination of software and hardware are all equivalent.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a video retrieval method based on attention segment prompts. The method comprises: extracting visual information from a video, calculating the corresponding global features, and extracting text features from a query text; passing the visual information through a temporal transformer so that the features of each frame carry contextual information; searching for similar video clips in the video according to the text features and acquiring the video clip information most similar to the query text as the key visual features; performing a weighted sum of the key features and the global features to obtain the final visual features; and calculating the similarity between the query text and the video according to the final visual features so as to retrieve a target video satisfying a similarity requirement. According to the present invention, more weight is given to the most relevant frames while the global information is retained, so that the target video can be retrieved accurately.
PCT/CN2022/137814 2022-07-01 2022-12-09 Video retrieval method based on attention segment prompts WO2024001057A1 (fr)

Applications Claiming Priority (2)

Application Number: CN202210768147.XA (published as CN115269913A) · Priority Date: 2022-07-01 · Filing Date: 2022-07-01
Title: 一种基于注意力片段提示的视频检索方法 (A video retrieval method based on attention segment prompts)
CN202210768147.X · 2022-07-01

Publications (1)

Publication Number Publication Date
WO2024001057A1 (fr) · 2024-01-04

Family

ID: 83763289

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/137814 WO2024001057A1 (fr) 2022-12-09 Video retrieval method based on attention segment prompts

Country Status (2)

Country Link
CN (1) CN115269913A (fr)
WO (1) WO2024001057A1 (fr)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115269913A (zh) * 2022-07-01 2022-11-01 深圳先进技术研究院 一种基于注意力片段提示的视频检索方法 (A video retrieval method based on attention segment prompts)
CN116089653B (zh) * 2023-03-20 2023-06-27 山东大学 一种基于场景信息的视频检索方法 (A video retrieval method based on scene information)
CN116091984B (zh) * 2023-04-12 2023-07-18 中国科学院深圳先进技术研究院 视频目标分割方法、装置、电子设备及存储介质 (Video object segmentation method and apparatus, electronic device, and storage medium)
CN116821417B (zh) * 2023-08-28 2023-12-12 中国科学院自动化研究所 视频标签序列生成方法和装置 (Video tag sequence generation method and apparatus)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102750339A (zh) * 2012-06-05 2012-10-24 北京交通大学 一种基于视频重构的重复片段定位方法 (A repeated-clip localization method based on video reconstruction)
CN110933518A (zh) * 2019-12-11 2020-03-27 浙江大学 一种利用卷积多层注意力网络机制生成面向查询的视频摘要的方法 (A method for generating query-oriented video summaries using a convolutional multi-layer attention network mechanism)
US20210248376A1 (en) * 2020-02-06 2021-08-12 Adobe Inc. Generating a response to a user query utilizing visual features of a video segment and a query-response-neural network
CN113590881A (zh) * 2021-08-09 2021-11-02 北京达佳互联信息技术有限公司 视频片段检索方法、视频片段检索模型的训练方法及装置 (Video clip retrieval method, and training method and apparatus for a video clip retrieval model)
CN114037945A (zh) * 2021-12-10 2022-02-11 浙江工商大学 一种基于多粒度特征交互的跨模态检索方法 (A cross-modal retrieval method based on multi-granularity feature interaction)
CN115269913A (zh) * 2022-07-01 2022-11-01 深圳先进技术研究院 一种基于注意力片段提示的视频检索方法 (A video retrieval method based on attention segment prompts)


Also Published As

Publication number Publication date
CN115269913A (zh) 2022-11-01


Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22949133

Country of ref document: EP

Kind code of ref document: A1