CN117473120A - Video retrieval method based on shot features - Google Patents

Video retrieval method based on shot features

Info

Publication number
CN117473120A
Authority
CN
China
Prior art keywords
video
shot
retrieval
similarity
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311815386.7A
Other languages
Chinese (zh)
Inventor
陈丹伟
陶玉翰
纪翀
罗圣美
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202311815386.7A priority Critical patent/CN117473120A/en
Publication of CN117473120A publication Critical patent/CN117473120A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/71 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Library & Information Science (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a video retrieval method based on shot features, belonging to the field of video retrieval. The method comprises three steps: shot segmentation, feature extraction, and similarity calculation and ranking. Shot segmentation uses the AutoShot model to divide the video into shots; feature extraction uses the MC3-18 model to extract features from the segmented shots; and similarity calculation and ranking computes the cosine similarity between shot feature vectors and ranks the output results. This video-to-video retrieval method effectively solves the problem of retrieving a complete video from a video clip, and changes in picture order caused by later video editing have little influence on the retrieval result.

Description

Video retrieval method based on shot features
Technical Field
The invention relates to the field of video retrieval, and in particular to a video retrieval method based on shot features.
Background
The implementation of the invention relies on several other technologies, which are briefly described below as background:
Video retrieval is an important research area in computer vision and information retrieval. Its main objective is to efficiently retrieve specific video clips or information from large-scale video data. With the popularity of the internet and mobile devices, the amount of video data generated and shared has grown exponentially: a great deal of video content is produced by social media, online education, monitoring systems, and other fields. Efficient retrieval therefore depends on being able to process and manage these large amounts of video data. The invention discloses a video retrieval method based on shot features and involves shot segmentation, video retrieval, and video similarity comparison technologies.
The document Zhu, Wentao et al., "AutoShot: A Short Video Dataset and State-of-the-Art Shot Boundary Detection," 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2023): 2238-2247 discloses a shot boundary detection method using a new neural network architecture, AutoShot, a model improved on the basis of the TransNet V2 model. Its main principle is to process the input video frame by frame with the model and compare the similarity of the output features of adjacent frames; if the similarity falls below a certain threshold, the position is regarded as a shot boundary, and the shot boundaries then divide the whole video into different shots. This method is the technology used in the shot segmentation stage of the invention.
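For illustration, the adjacent-frame decision rule can be sketched in Python as follows. This is a minimal sketch that assumes per-frame features are already available from some boundary-detection network; AutoShot's actual architecture, features, and threshold value are not reproduced here.

import numpy as np

def detect_shot_boundaries(frame_features, threshold=0.5):
    """Return the frame numbers assumed to open a new shot.

    frame_features: (num_frames, dim) array of per-frame features from a
    boundary-detection network (a stand-in for AutoShot's output here).
    A boundary is declared between frames i and i+1 when the cosine
    similarity of their features falls below the threshold.
    """
    boundaries = []
    for i in range(len(frame_features) - 1):
        a, b = frame_features[i], frame_features[i + 1]
        sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        if sim < threshold:
            boundaries.append(i + 1)  # frame i+1 opens the next shot
    return boundaries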
A video feature extraction method using the neural network architecture MC3 is disclosed in Tran, Du et al., "A Closer Look at Spatiotemporal Convolutions for Action Recognition," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018): 6450-6459. The architecture is obtained from an 18-layer 3D residual network by keeping 3D convolutions only in the first three convolution groups and replacing the subsequent 3D convolutions with 2D convolutions. In this way, the parameter count is reduced while the network's ability to extract temporal features is retained, which effectively improves training efficiency and helps prevent vanishing gradients. This method is the technique used in the feature extraction stage of the invention.
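The mixed-convolution layout can be illustrated with a small PyTorch sketch: full 3D kernels in the early layers, (1, 3, 3) kernels (i.e., per-frame 2D convolutions) afterwards. This sketch only demonstrates the layout; the real MC3-18 is an 18-layer residual network that torchvision ships as torchvision.models.video.mc3_18.

import torch
import torch.nn as nn

# Early layers use full 3D kernels (time x height x width); later layers use
# (1, 3, 3) kernels, i.e. 2D convolutions applied frame by frame.
mixed_conv = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=(3, 7, 7), padding=(1, 3, 3)),    # 3D: temporal + spatial
    nn.ReLU(inplace=True),
    nn.Conv3d(64, 64, kernel_size=(3, 3, 3), padding=1),           # 3D
    nn.ReLU(inplace=True),
    nn.Conv3d(64, 128, kernel_size=(1, 3, 3), padding=(0, 1, 1)),  # "2D": spatial only
    nn.ReLU(inplace=True),
)

clip = torch.randn(1, 3, 16, 112, 112)  # (batch, channels, frames, height, width)
print(mixed_conv(clip).shape)           # torch.Size([1, 128, 16, 112, 112])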
Chinese patent publication No. CN116226442A, published 2023.06.06, is entitled "A similar video search method, apparatus, system, and storage medium". The patent discloses a retrieval method for similar videos that adopts an image-pair training scheme and learns various editing deformations in a self-supervised manner. The method can retrieve visually similar videos, but it cannot solve the problem of retrieving a complete video from a video clip.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art and provides a video retrieval method based on shot features, which can effectively solve the problem of retrieving a complete video from a video clip.
The invention adopts the following technical scheme to solve the technical problems:
a video retrieval method based on lens features comprises the following steps:
step 1: performing shot segmentation, namely performing shot segmentation on the video by using an Autoshot model;
step 2: feature extraction, namely performing feature extraction on the divided lenses by using a MC 3-18 model;
step 3: and (5) calculating and sequencing the similarity, and calculating cosine similarity among shot feature vectors and arranging the output results.
Further, in the shot segmentation of step 1, the AutoShot model is used to process the input video frame by frame and output the probability that each frame is a shot boundary; the frame numbers of the shot boundaries are determined according to a set threshold, and the video name-shot boundary frame number correspondence is stored as an index file.
Further, the feature extraction of step 2 uses the MC3-18 model to extract features from the input video shots. The shot division of the video is obtained by reading the video name-shot boundary frame number index file, the number of frames of each shot that participate in the calculation is determined by the num_frame parameter value, and feature extraction is performed shot by shot to obtain the feature vector of each shot.
Further, the similarity calculation and ranking of step 3 uses cosine similarity: the cosine similarity is computed one by one between the shot feature vectors of the video clip to be retrieved and the shot feature vectors of the videos in the library, and the similarities are ranked so that the complete video with the most similar shots is taken as the retrieval result.
Further, the videos within the retrieval scope must first be added to a video library for preprocessing. The processing method is the same as that for the video clip to be retrieved, and the result is an index library of video name-shot feature vectors. Video retrieval can only be performed after the index library has been generated; the retrieval result is determined by comparing the shot feature vectors of the video clip to be retrieved one by one against the index library.
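One plausible way to persist such an index library is sketched below. The JSON and NumPy .npz formats, file names, and function names are illustrative assumptions; the patent does not fix any storage format.

import json
import numpy as np

def save_index_library(boundaries, features,
                       boundary_path="shot_boundaries.json",
                       feature_path="shot_features.npz"):
    """boundaries: {video name: list of shot boundary frame numbers}
    features: {video name: (num_shots, dim) array of shot feature vectors}"""
    with open(boundary_path, "w", encoding="utf-8") as f:
        json.dump(boundaries, f, ensure_ascii=False)
    np.savez(feature_path, **features)

def load_feature_index(feature_path="shot_features.npz"):
    data = np.load(feature_path)
    return {name: data[name] for name in data.files}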
Compared with the prior art, the invention has the following beneficial effects:
(1) It is good at finding the complete video from an incomplete video clip.
(2) It improves the success rate of video retrieval when shots switch frequently.
(3) Clipping and re-splicing of the video has little influence on the retrieval result.
Drawings
Fig. 1 is the complete flowchart of the video retrieval method based on shot features according to the present invention.
Fig. 2 is the basic schematic diagram of the video retrieval method based on shot features according to the present invention.
Fig. 3 is the AutoShot model architecture diagram.
Fig. 4 is the structure diagram of the MC3-18 model.
Detailed Description
The present invention is described below by way of examples, but it is not limited to these examples.
The complete flowchart of the invention is shown in Fig. 1 and the basic schematic diagram in Fig. 2. The system flow comprises establishing a video retrieval index library, extracting features from the video clip to be retrieved, calculating similarity, and ranking the output results. The invention comprises three steps, shot segmentation, feature extraction, and similarity calculation and ranking, which are introduced one by one below:
Step 1 is shot segmentation. Its main principle is to process the input complete video with the deep learning model AutoShot, whose architecture diagram is shown in Fig. 3. The model extracts high-level semantic features frame by frame and distinguishes shot boundaries according to the feature differences between adjacent frames. In general, the variation between frames within a single shot is not significant; if the magnitude of the feature change between adjacent frames exceeds a certain threshold, a shot transition is considered to occur, i.e., a shot boundary lies between those adjacent frames. After traversing all frames of the video, the model returns the frame numbers of all shot boundaries, and shot segmentation is complete.
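The post-processing of this stage can be sketched as follows: the boundary frame numbers returned by the model are turned into per-shot frame ranges, and the video name-shot boundary correspondence is written to an index file. The JSON format and file name are assumptions for illustration.

import json

def boundaries_to_shots(num_frames, boundary_frames):
    """Turn boundary frame numbers into (start, end) ranges, one per shot;
    each boundary frame is taken as the first frame of the next shot."""
    starts = [0] + list(boundary_frames)
    ends = list(boundary_frames) + [num_frames]
    return list(zip(starts, ends))

# Store the video name -> shot boundary frame numbers correspondence as an
# index file (format and name are illustrative assumptions).
index = {"example.mp4": [120, 347, 589]}
with open("video_shot_index.json", "w", encoding="utf-8") as f:
    json.dump(index, f)

print(boundaries_to_shots(700, index["example.mp4"]))
# [(0, 120), (120, 347), (347, 589), (589, 700)]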
Step 2 is feature extraction. Its main principle is to process each shot separately with the deep learning model MC3-18 and extract the shot's high-level semantic features. The structure of the MC3-18 model is shown in Fig. 4. MC3-18 is an 18-layer residual network containing both 3D and 2D convolution layers, so it can capture the spatial semantic information and the temporal logic of a video at the same time, and it expresses overall video features well. In this stage, the different shots are distinguished by the shot boundary frame numbers output in the first stage, so feature extraction is performed within each shot. Considering computational complexity, not all frames of a shot participate in the calculation; instead a num_frame value is set and num_frame frames are selected from the shot at equal intervals, which speeds up shot feature extraction. The num_frame value is adjustable and is generally set to 16. This stage finally outputs the high-dimensional feature vectors of all shots of the video, completing feature extraction.
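A sketch of this stage using torchvision's published MC3-18 model follows. The classification head is replaced by an identity so that one pooled 512-dimensional feature vector is returned per shot; frame decoding, resizing, and normalization are omitted, and the weight name "KINETICS400_V1" is torchvision's, not something the patent specifies.

import torch
import torch.nn as nn
from torchvision.models.video import mc3_18

model = mc3_18(weights="KINETICS400_V1")
model.fc = nn.Identity()  # keep the pooled 512-d feature instead of class logits
model.eval()

def shot_feature(shot_frames, num_frame=16):
    """shot_frames: (T, 3, H, W) float tensor holding one shot's frames."""
    t = shot_frames.shape[0]
    idx = torch.linspace(0, t - 1, steps=num_frame).long()    # equally spaced frames
    clip = shot_frames[idx].permute(1, 0, 2, 3).unsqueeze(0)  # (1, 3, num_frame, H, W)
    with torch.no_grad():
        return model(clip).squeeze(0)                         # (512,)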
Step 3 is similarity calculation and ranking, the last stage of the method. Its main principle is cosine similarity, which measures the cosine of the angle between two vectors: the closer the value is to 1, the more similar the two vectors are; the closer to 0, the less similar. Concretely, the cosine similarity is computed between the feature vector of each shot of the video clip to be retrieved and the feature vector of each shot of every original video in the library, and similarity scores are obtained from these comparisons. If the most similar shots for the different shots of the query clip belong to one original video, the similarity score of that original video is taken as the average of the scores of those shots. Finally, the similarity scores of the original videos in the library are ranked, and the original video with the highest score is obtained. That original video is regarded as the complete video to which the video clip belongs, and the video retrieval is complete.
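A sketch of one plausible reading of this scoring scheme follows: each query shot contributes the cosine similarity of its best-matching shot within a library video, and the video's overall score is the mean over the query shots. The function and variable names are illustrative.

import torch
import torch.nn.functional as F

def rank_library(query_feats, library):
    """query_feats: (num_query_shots, dim) tensor of the query clip's shot
    vectors; library: {video name: (num_shots, dim) tensor}. Returns video
    names with scores, most similar first."""
    q = F.normalize(query_feats, dim=1)
    scores = {}
    for name, feats in library.items():
        v = F.normalize(feats, dim=1)
        sim = q @ v.T  # pairwise cosine similarities, query shots x video shots
        scores[name] = sim.max(dim=1).values.mean().item()
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)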
The present invention is not limited to the above embodiments. Any equivalent structure or equivalent process transformation made using the content of the present description, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of the present invention.

Claims (5)

1. A video retrieval method based on shot features, characterized by comprising the following steps:
step 1: shot segmentation, namely segmenting the video into shots using the AutoShot model;
step 2: feature extraction, namely extracting features from the segmented shots using the MC3-18 model;
step 3: similarity calculation and ranking, namely computing the cosine similarity between shot feature vectors and ranking the output results.
2. The video retrieval method based on shot features according to claim 1, wherein in the shot segmentation of step 1, the AutoShot model is used to process the input video frame by frame and output the probability that each frame is a shot boundary, the frame numbers of the shot boundaries are determined according to a set threshold, and the video name-shot boundary frame number correspondence is stored as an index file.
3. The video retrieval method based on shot features according to claim 1, wherein the feature extraction of step 2 uses the MC3-18 model to extract features from the input video shots, obtains the shot division of the video by reading the video name-shot boundary frame number index file, determines the number of frames of each shot that participate in the calculation through the num_frame parameter value, and performs feature extraction shot by shot to obtain the feature vector of each shot.
4. The video retrieval method based on shot features according to claim 1, wherein the similarity calculation and ranking of step 3 uses cosine similarity, the cosine similarity is computed one by one between the shot feature vectors of the video clip to be retrieved and the shot feature vectors of the videos in the library, and the similarities are ranked so that the complete video with the most similar shots is taken as the retrieval result.
5. The video retrieval method based on shot features according to claim 1, wherein the videos within the retrieval scope are first added to a video library for preprocessing, the processing method being the same as that for the video clip to be retrieved; the result of the processing is an index library of video name-shot feature vectors; video retrieval is performed after the generation of the index library is complete, and the retrieval result is determined by comparing the shot feature vectors of the video clip to be retrieved one by one against the index library.
CN202311815386.7A 2023-12-27 2023-12-27 Video retrieval method based on shot features Pending CN117473120A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311815386.7A CN117473120A (en) 2023-12-27 2023-12-27 Video retrieval method based on shot features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311815386.7A CN117473120A (en) 2023-12-27 2023-12-27 Video retrieval method based on shot features

Publications (1)

Publication Number Publication Date
CN117473120A true CN117473120A (en) 2024-01-30

Family

ID=89633334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311815386.7A Pending CN117473120A (en) 2023-12-27 2023-12-27 Video retrieval method based on shot features

Country Status (1)

Country Link
CN (1) CN117473120A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110427517A (en) * 2019-07-18 2019-11-08 华戎信息产业有限公司 Method, apparatus and computer-readable storage medium for searching video by image based on a scene dictionary tree
CN114724218A (en) * 2022-04-08 2022-07-08 北京中科闻歌科技股份有限公司 Video detection method, device, equipment and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Bruno Korbar et al.: "SCSampler: Sampling Salient Clips from Video for Efficient Action Recognition", https://arxiv.org/pdf/1904.04289.pdf, 30 August 2019 (2019-08-30), pages 1-14 *
Wentao Zhu et al.: "AutoShot: A Short Video Dataset and State-of-the-Art Shot Boundary Detection", CVPR 2023 Open Access, 22 June 2023 (2023-06-22), pages 2238-2247, XP034397375, DOI: 10.1109/CVPRW59228.2023.00218 *

Similar Documents

Publication Publication Date Title
CN108319938B (en) High-quality training data preparation system for high-performance face recognition system
CN108898620B (en) Target tracking method based on multiple twin neural networks and regional neural network
CN107239565B (en) Image retrieval method based on saliency region
CN107315795B (en) The instance of video search method and system of joint particular persons and scene
Tan et al. Selective dependency aggregation for action classification
CN111104555A (en) Video hash retrieval method based on attention mechanism
Chung et al. Hand gesture recognition via image processing techniques and deep CNN
CN115035418A (en) Remote sensing image semantic segmentation method and system based on improved deep LabV3+ network
Markatopoulou et al. Cascade of classifiers based on binary, non-binary and deep convolutional network descriptors for video concept detection
Bora et al. A review on video summarization approcahes: recent advances and directions
Sreeja et al. A unified model for egocentric video summarization: an instance-based approach
CN114333062A (en) Pedestrian re-recognition model training method based on heterogeneous dual networks and feature consistency
WO2024032177A1 (en) Data processing method and apparatus, electronic device, storage medium, and program product
CN117312681A (en) Meta universe oriented user preference product recommendation method and system
CN117473120A Video retrieval method based on shot features
CN115100694A (en) Fingerprint quick retrieval method based on self-supervision neural network
CN110942463A (en) Video target segmentation method based on generation countermeasure network
Zhu et al. AutoShot: A short video dataset and state-of-the-art shot boundary detection
CN111079527A (en) Shot boundary detection method based on 3D residual error network
CN116229323A (en) Human body behavior recognition method based on improved depth residual error network
Wang et al. Multi-scale spatial-temporal network for person re-identification
Sudha et al. Reducing semantic gap in video retrieval with fusion: A survey
Chatur et al. A simple review on content based video images retrieval
CN116680435B (en) Similar image retrieval matching method based on multi-layer feature extraction
Shrestha et al. Temporal querying of faces in videos using bitmap index

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination