CN117473120A - Video retrieval method based on shot features - Google Patents

Video retrieval method based on shot features

Info

Publication number
CN117473120A
Authority
CN
China
Prior art keywords
video
shot
retrieval
similarity
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311815386.7A
Other languages
Chinese (zh)
Inventor
陈丹伟
陶玉翰
纪翀
罗圣美
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202311815386.7A priority Critical patent/CN117473120A/en
Publication of CN117473120A publication Critical patent/CN117473120A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/71 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Library & Information Science (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a video retrieval method based on shot features, belonging to the field of video retrieval. The method comprises three steps: shot segmentation, feature extraction, and similarity calculation and ranking. Shot segmentation uses the AutoShot model to divide the video into shots; feature extraction uses the MC3-18 model to extract features from the segmented shots; and similarity calculation and ranking computes the cosine similarity between shot feature vectors and ranks the output results. This video-to-video retrieval method effectively solves the problem of retrieving a complete video from a video clip, and changes in picture order caused by later video editing have little influence on the retrieval result.

Description

Video retrieval method based on shot features
Technical Field
The invention relates to the field of video retrieval, and in particular to a video retrieval method based on shot features.
Background
The implementation of the invention relies on several other technologies, which are briefly described below as background:
Video retrieval is an important research area in computer vision and information retrieval. Its main objective is to efficiently retrieve specific video clips or information from large-scale video data. With the popularity of the internet and mobile devices, the amount of video data generated and shared has grown exponentially: a great deal of video content is produced by social media, online education, monitoring systems, and other fields. Efficient retrieval therefore depends on being able to process and manage these large amounts of video data. The invention discloses a video retrieval method based on shot features and involves shot segmentation, video retrieval, and video similarity comparison technologies.
The document Zhu, Wentao et al., "AutoShot: A Short Video Dataset and State-of-the-Art Shot Boundary Detection," 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2023): 2238-2247 discloses a shot boundary detection method using a new neural network architecture, AutoShot, a model improved on the basis of the TransNet V2 model. Its main principle is to process the input video frame by frame with the model and compare the similarity of the output features of adjacent frames; if the similarity falls below a certain threshold, the position is regarded as a shot boundary, and the shot boundaries then divide the whole video into different shots. This method is the technology used in the shot segmentation stage of the invention.
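For illustration, the adjacent-frame decision rule can be sketched in Python as follows. This is a minimal sketch that assumes per-frame features are already available from some boundary-detection network; AutoShot's actual architecture, features, and threshold value are not reproduced here.

import numpy as np

def detect_shot_boundaries(frame_features, threshold=0.5):
    """Return the frame numbers assumed to open a new shot.

    frame_features: (num_frames, dim) array of per-frame features from a
    boundary-detection network (a stand-in for AutoShot's output here).
    A boundary is declared between frames i and i+1 when the cosine
    similarity of their features falls below the threshold.
    """
    boundaries = []
    for i in range(len(frame_features) - 1):
        a, b = frame_features[i], frame_features[i + 1]
        sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        if sim < threshold:
            boundaries.append(i + 1)  # frame i+1 opens the next shot
    return boundaries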
A video feature extraction method using the neural network architecture MC3 is disclosed in Tran, Du et al., "A Closer Look at Spatiotemporal Convolutions for Action Recognition," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018): 6450-6459. The architecture is obtained from an 18-layer 3D residual network by keeping 3D convolutions only in the first three convolution groups and replacing the subsequent 3D convolutions with 2D convolutions. In this way, the parameter count is reduced while the network's ability to extract temporal features is retained, which effectively improves training efficiency and helps prevent vanishing gradients. This method is the technique used in the feature extraction stage of the invention.
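The mixed-convolution layout can be illustrated with a small PyTorch sketch: full 3D kernels in the early layers, (1, 3, 3) kernels (i.e., per-frame 2D convolutions) afterwards. This sketch only demonstrates the layout; the real MC3-18 is an 18-layer residual network that torchvision ships as torchvision.models.video.mc3_18.

import torch
import torch.nn as nn

# Early layers use full 3D kernels (time x height x width); later layers use
# (1, 3, 3) kernels, i.e. 2D convolutions applied frame by frame.
mixed_conv = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=(3, 7, 7), padding=(1, 3, 3)),    # 3D: temporal + spatial
    nn.ReLU(inplace=True),
    nn.Conv3d(64, 64, kernel_size=(3, 3, 3), padding=1),           # 3D
    nn.ReLU(inplace=True),
    nn.Conv3d(64, 128, kernel_size=(1, 3, 3), padding=(0, 1, 1)),  # "2D": spatial only
    nn.ReLU(inplace=True),
)

clip = torch.randn(1, 3, 16, 112, 112)  # (batch, channels, frames, height, width)
print(mixed_conv(clip).shape)           # torch.Size([1, 128, 16, 112, 112])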
Chinese patent publication No. CN116226442A, published 2023.06.06, is entitled "A similar video search method, apparatus, system, and storage medium". The patent discloses a retrieval method for similar videos that adopts an image-pair training scheme and learns various editing deformations in a self-supervised manner. The method can retrieve visually similar videos, but it cannot solve the problem of retrieving a complete video from a video clip.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art and provides a video retrieval method based on shot features, which can effectively solve the problem of retrieving a complete video from a video clip.
The invention adopts the following technical scheme to solve the technical problems:
a video retrieval method based on lens features comprises the following steps:
step 1: performing shot segmentation, namely performing shot segmentation on the video by using an Autoshot model;
step 2: feature extraction, namely performing feature extraction on the divided lenses by using a MC 3-18 model;
step 3: and (5) calculating and sequencing the similarity, and calculating cosine similarity among shot feature vectors and arranging the output results.
Further, in the shot segmentation of step 1, the AutoShot model is used to process the input video frame by frame and output the probability that each frame is a shot boundary; the frame numbers of the shot boundaries are determined according to a set threshold, and the video name-shot boundary frame number correspondence is stored as an index file.
Further, the feature extraction of step 2 uses the MC3-18 model to extract features from the input video shots. The shot division of the video is obtained by reading the video name-shot boundary frame number index file, the number of frames of each shot that participate in the calculation is determined by the num_frame parameter value, and feature extraction is performed shot by shot to obtain the feature vector of each shot.
Further, the similarity calculation and ranking of step 3 uses cosine similarity: the cosine similarity is computed one by one between the shot feature vectors of the video clip to be retrieved and the shot feature vectors of the videos in the library, and the similarities are ranked so that the complete video with the most similar shots is taken as the retrieval result.
Further, the videos within the retrieval scope must first be added to a video library for preprocessing. The processing method is the same as that for the video clip to be retrieved, and the result is an index library of video name-shot feature vectors. Video retrieval can only be performed after the index library has been generated; the retrieval result is determined by comparing the shot feature vectors of the video clip to be retrieved one by one against the index library.
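One plausible way to persist such an index library is sketched below. The JSON and NumPy .npz formats, file names, and function names are illustrative assumptions; the patent does not fix any storage format.

import json
import numpy as np

def save_index_library(boundaries, features,
                       boundary_path="shot_boundaries.json",
                       feature_path="shot_features.npz"):
    """boundaries: {video name: list of shot boundary frame numbers}
    features: {video name: (num_shots, dim) array of shot feature vectors}"""
    with open(boundary_path, "w", encoding="utf-8") as f:
        json.dump(boundaries, f, ensure_ascii=False)
    np.savez(feature_path, **features)

def load_feature_index(feature_path="shot_features.npz"):
    data = np.load(feature_path)
    return {name: data[name] for name in data.files}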
Compared with the prior art, the invention has the following beneficial effects:
(1) It is good at finding the complete video from an incomplete video clip.
(2) It improves the success rate of video retrieval when shots switch frequently.
(3) Clipping and re-splicing of the video has little influence on the retrieval result.
Drawings
Fig. 1 is the complete flowchart of the video retrieval method based on shot features according to the present invention.
Fig. 2 is the basic schematic diagram of the video retrieval method based on shot features according to the present invention.
Fig. 3 is the AutoShot model architecture diagram.
Fig. 4 is the structure diagram of the MC3-18 model.
Detailed Description
The present invention is described below by way of examples, but it is not limited to these examples.
The complete flowchart of the invention is shown in Fig. 1 and the basic schematic diagram in Fig. 2. The system flow comprises establishing a video retrieval index library, extracting features from the video clip to be retrieved, calculating similarity, and ranking the output results. The invention comprises three steps, shot segmentation, feature extraction, and similarity calculation and ranking, which are introduced one by one below:
Step 1 is shot segmentation. Its main principle is to process the input complete video with the deep learning model AutoShot, whose architecture diagram is shown in Fig. 3. The model extracts high-level semantic features frame by frame and distinguishes shot boundaries according to the feature differences between adjacent frames. In general, the variation between frames within a single shot is not significant; if the magnitude of the feature change between adjacent frames exceeds a certain threshold, a shot transition is considered to occur, i.e., a shot boundary lies between those adjacent frames. After traversing all frames of the video, the model returns the frame numbers of all shot boundaries, and shot segmentation is complete.
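The post-processing of this stage can be sketched as follows: the boundary frame numbers returned by the model are turned into per-shot frame ranges, and the video name-shot boundary correspondence is written to an index file. The JSON format and file name are assumptions for illustration.

import json

def boundaries_to_shots(num_frames, boundary_frames):
    """Turn boundary frame numbers into (start, end) ranges, one per shot;
    each boundary frame is taken as the first frame of the next shot."""
    starts = [0] + list(boundary_frames)
    ends = list(boundary_frames) + [num_frames]
    return list(zip(starts, ends))

# Store the video name -> shot boundary frame numbers correspondence as an
# index file (format and name are illustrative assumptions).
index = {"example.mp4": [120, 347, 589]}
with open("video_shot_index.json", "w", encoding="utf-8") as f:
    json.dump(index, f)

print(boundaries_to_shots(700, index["example.mp4"]))
# [(0, 120), (120, 347), (347, 589), (589, 700)]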
Step 2 is feature extraction. Its main principle is to process each shot separately with the deep learning model MC3-18 and extract the shot's high-level semantic features. The structure of the MC3-18 model is shown in Fig. 4. MC3-18 is an 18-layer residual network containing both 3D and 2D convolution layers, so it can capture the spatial semantic information and the temporal logic of a video at the same time, and it expresses overall video features well. In this stage, the different shots are distinguished by the shot boundary frame numbers output in the first stage, so feature extraction is performed within each shot. Considering computational complexity, not all frames of a shot participate in the calculation; instead a num_frame value is set and num_frame frames are selected from the shot at equal intervals, which speeds up shot feature extraction. The num_frame value is adjustable and is generally set to 16. This stage finally outputs the high-dimensional feature vectors of all shots of the video, completing feature extraction.
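A sketch of this stage using torchvision's published MC3-18 model follows. The classification head is replaced by an identity so that one pooled 512-dimensional feature vector is returned per shot; frame decoding, resizing, and normalization are omitted, and the weight name "KINETICS400_V1" is torchvision's, not something the patent specifies.

import torch
import torch.nn as nn
from torchvision.models.video import mc3_18

model = mc3_18(weights="KINETICS400_V1")
model.fc = nn.Identity()  # keep the pooled 512-d feature instead of class logits
model.eval()

def shot_feature(shot_frames, num_frame=16):
    """shot_frames: (T, 3, H, W) float tensor holding one shot's frames."""
    t = shot_frames.shape[0]
    idx = torch.linspace(0, t - 1, steps=num_frame).long()    # equally spaced frames
    clip = shot_frames[idx].permute(1, 0, 2, 3).unsqueeze(0)  # (1, 3, num_frame, H, W)
    with torch.no_grad():
        return model(clip).squeeze(0)                         # (512,)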
Step 3 is similarity calculation and ranking, the last stage of the method. Its main principle is cosine similarity, which measures the cosine of the angle between two vectors: the closer the value is to 1, the more similar the two vectors are; the closer to 0, the less similar. Concretely, the cosine similarity is computed between the feature vector of each shot of the video clip to be retrieved and the feature vector of each shot of every original video in the library, and similarity scores are obtained from these comparisons. If the most similar shots for the different shots of the query clip belong to one original video, the similarity score of that original video is taken as the average of the scores of those shots. Finally, the similarity scores of the original videos in the library are ranked, and the original video with the highest score is obtained. That original video is regarded as the complete video to which the video clip belongs, and the video retrieval is complete.
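A sketch of one plausible reading of this scoring scheme follows: each query shot contributes the cosine similarity of its best-matching shot within a library video, and the video's overall score is the mean over the query shots. The function and variable names are illustrative.

import torch
import torch.nn.functional as F

def rank_library(query_feats, library):
    """query_feats: (num_query_shots, dim) tensor of the query clip's shot
    vectors; library: {video name: (num_shots, dim) tensor}. Returns video
    names with scores, most similar first."""
    q = F.normalize(query_feats, dim=1)
    scores = {}
    for name, feats in library.items():
        v = F.normalize(feats, dim=1)
        sim = q @ v.T  # pairwise cosine similarities, query shots x video shots
        scores[name] = sim.max(dim=1).values.mean().item()
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)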
The present invention is not limited to the above embodiments. Any equivalent structure or equivalent process transformation made using the content of the present description, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of the present invention.

Claims (5)

1. A video retrieval method based on shot features, characterized by comprising the following steps:
step 1: shot segmentation, namely segmenting the video into shots using the AutoShot model;
step 2: feature extraction, namely extracting features from the segmented shots using the MC3-18 model;
step 3: similarity calculation and ranking, namely computing the cosine similarity between shot feature vectors and ranking the output results.
2. The video retrieval method based on shot features according to claim 1, wherein in the shot segmentation of step 1, the AutoShot model is used to process the input video frame by frame and output the probability that each frame is a shot boundary, the frame numbers of the shot boundaries are determined according to a set threshold, and the video name-shot boundary frame number correspondence is stored as an index file.
3. The video retrieval method based on shot features according to claim 1, wherein the feature extraction of step 2 uses the MC3-18 model to extract features from the input video shots, obtains the shot division of the video by reading the video name-shot boundary frame number index file, determines the number of frames of each shot that participate in the calculation through the num_frame parameter value, and performs feature extraction shot by shot to obtain the feature vector of each shot.
4. The video retrieval method based on shot features according to claim 1, wherein the similarity calculation and ranking of step 3 uses cosine similarity, the cosine similarity is computed one by one between the shot feature vectors of the video clip to be retrieved and the shot feature vectors of the videos in the library, and the similarities are ranked so that the complete video with the most similar shots is taken as the retrieval result.
5. The video retrieval method based on shot features according to claim 1, wherein the videos within the retrieval scope are first added to a video library for preprocessing, the processing method being the same as that for the video clip to be retrieved; the result of the processing is an index library of video name-shot feature vectors; video retrieval is performed after the generation of the index library is complete, and the retrieval result is determined by comparing the shot feature vectors of the video clip to be retrieved one by one against the index library.
CN202311815386.7A 2023-12-27 2023-12-27 Video retrieval method based on shot features Pending CN117473120A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311815386.7A CN117473120A (en) 2023-12-27 2023-12-27 Video retrieval method based on shot features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311815386.7A CN117473120A (en) 2023-12-27 2023-12-27 Video retrieval method based on shot features

Publications (1)

Publication Number Publication Date
CN117473120A true CN117473120A (en) 2024-01-30

Family

ID=89633334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311815386.7A Pending CN117473120A (en) 2023-12-27 2023-12-27 Video retrieval method based on shot features

Country Status (1)

Country Link
CN (1) CN117473120A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110427517A (en) * 2019-07-18 2019-11-08 华戎信息产业有限公司 Method, apparatus and computer-readable storage medium for searching video by image based on a scene dictionary tree
CN114724218A (en) * 2022-04-08 2022-07-08 北京中科闻歌科技股份有限公司 Video detection method, device, equipment and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Bruno Korbar et al.: "SCSampler: Sampling Salient Clips from Video for Efficient Action Recognition", https://arxiv.org/pdf/1904.04289.pdf, 30 August 2019 (2019-08-30), pages 1-14 *
Wentao Zhu et al.: "AutoShot: A Short Video Dataset and State-of-the-Art Shot Boundary Detection", CVPR 2023 Open Access, 22 June 2023 (2023-06-22), pages 2238-2247, XP034397375, DOI: 10.1109/CVPRW59228.2023.00218 *

Similar Documents

Publication Publication Date Title
CN108319938B (en) High-quality training data preparation system for high-performance face recognition system
CN108898620B (en) Target tracking method based on multiple twin neural networks and regional neural network
CN107239565B (en) Image retrieval method based on saliency region
CN107315795B (en) The instance of video search method and system of joint particular persons and scene
Tan et al. Selective dependency aggregation for action classification
CN111104555A (en) Video hash retrieval method based on attention mechanism
Chung et al. Hand gesture recognition via image processing techniques and deep CNN
CN115035418A (en) Remote sensing image semantic segmentation method and system based on improved deep LabV3+ network
Markatopoulou et al. Cascade of classifiers based on binary, non-binary and deep convolutional network descriptors for video concept detection
Bora et al. A review on video summarization approcahes: recent advances and directions
Sreeja et al. A unified model for egocentric video summarization: an instance-based approach
CN114333062A (en) Pedestrian re-recognition model training method based on heterogeneous dual networks and feature consistency
WO2024032177A1 (en) Data processing method and apparatus, electronic device, storage medium, and program product
CN117312681A (en) Meta universe oriented user preference product recommendation method and system
CN117473120A Video retrieval method based on shot features
CN115100694A (en) Fingerprint quick retrieval method based on self-supervision neural network
CN110942463A (en) Video target segmentation method based on generation countermeasure network
Zhu et al. AutoShot: A short video dataset and state-of-the-art shot boundary detection
CN111079527A (en) Shot boundary detection method based on 3D residual error network
CN116229323A (en) Human body behavior recognition method based on improved depth residual error network
Wang et al. Multi-scale spatial-temporal network for person re-identification
Sudha et al. Reducing semantic gap in video retrieval with fusion: A survey
Chatur et al. A simple review on content based video images retrieval
CN116680435B (en) Similar image retrieval matching method based on multi-layer feature extraction
Shrestha et al. Temporal querying of faces in videos using bitmap index

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination