CN117076713A

CN117076713A - Video fingerprint extraction and retrieval method

Info

Publication number: CN117076713A
Application number: CN202311346143.3A
Authority: CN
Inventors: 张兰; 罗湛
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2023-10-18
Filing date: 2023-10-18
Publication date: 2023-11-17
Anticipated expiration: 2043-10-18
Also published as: CN117076713B

Abstract

The invention provides a video fingerprint extraction and retrieval method. The method comprises the following steps: extracting feature vectors of an image training set by using the trained image feature fingerprint extraction model, and processing the image feature vector set by using a K-means clustering algorithm to obtain a training codebook; performing various operations on the videos in the video base to be retrieved by using the trained image feature fingerprint extraction model to obtain an image feature set of the video base to be retrieved; preprocessing the query video, and calculating the similarity between the query video and the video in the video base to be searched under each cluster center of the training codebook by utilizing the image characteristics of the query video and the image characteristic set of the video base to be searched obtained by preprocessing; and adding the similarity obtained under each cluster center of the training codebook to obtain the overall similarity of the query video and the video in the video base to be searched, and obtaining the search result of the query video in the video base to be searched based on the overall similarity.

Description

Video fingerprint extraction and retrieval method

Technical Field

The present invention relates to the field of image processing and video processing, and in particular, to a video fingerprint extraction and retrieval method, a training method for an image feature fingerprint extraction model, an electronic device, and a storage medium.

Background

Because of the popularity of video sharing services over the internet, the number of videos on a network has reached an unprecedented scale. These videos typically have a large amount of near-duplicate content, which is an important requirement for video sharing platforms to track and filter. At the same time, the vast amount of videos on the internet is essentially that of stealing others and performing tamper transformation, which also makes video copyright protection a focus problem. Approximately repeated video retrieval is taken as an indispensable component in applications such as video filtering, tracking, copyright protection and the like, is increasingly receiving research attention, and extracting video features capable of being robust to various tampering attacks as a ring of video detection keys is also a current urgent difficult problem to be overcome.

However, in the prior art, the problems of high storage cost, low retrieval efficiency, low retrieval accuracy and the like exist for the retrieval of the approximate repeated video.

Disclosure of Invention

In view of the above, the present invention provides a video fingerprint extraction and retrieval method, a training method of an image feature fingerprint extraction model, an electronic device, and a storage medium, in order to solve at least one of the above problems.

According to a first aspect of the present invention, there is provided a video fingerprint extraction and retrieval method, comprising:

extracting feature vectors of an image training set by using a trained image feature fingerprint extraction model to obtain an image feature vector set, processing the image feature set by using a K-means clustering algorithm, and taking the obtained image feature set with K cluster centers as a training codebook;

the method comprises the steps that a training image feature fingerprint extraction model is utilized to sequentially perform feature extraction operation, quantization operation, aggregation operation, binarization operation and inverted index operation on videos in a video base to be retrieved, and an image feature set of the video base to be retrieved is obtained;

preprocessing the query video, and calculating the similarity between the query video and the video in the video base to be searched under each cluster center of the training codebook by utilizing the image characteristics of the query video and the image characteristic set of the video base to be searched obtained by preprocessing;

and adding the similarity obtained under each cluster center of the training codebook to obtain the overall similarity of the query video and the video in the video base to be searched, and obtaining the search result of the query video in the video base to be searched based on the overall similarity.

According to an embodiment of the present invention, the obtaining an image feature set of a video base to be retrieved by performing feature extraction, quantization, aggregation, binarization and inverted indexing on a video in the video base to be retrieved using a trained image feature fingerprint extraction model includes:

uniformly extracting frames from videos in a video base to be retrieved, extracting feature vectors of each frame by using a trained image feature fingerprint extraction model based on a frame extraction result, and obtaining frame-level feature vectors;

the method comprises the steps of completing quantization operation by distributing frame-level feature vectors to K cluster centers of a training codebook according to a predefined distribution standard, and obtaining K feature vector clusters formed by the frame-level feature vectors;

the aggregation operation is completed by carrying out addition operation on vectors in each feature vector cluster and then carrying out L2 normalization processing, and the value of the cluster center corresponding to the aggregated feature vector is subtracted from the aggregated feature vector to obtain a processed aggregated feature vector;

performing binarization operation on the processed aggregate feature vector through a symbol function to obtain a binarized aggregate feature vector;

and taking the cluster center of the training codebook as an index, quantizing the binarized aggregate feature vector of each cluster center, and taking the quantized result as an inverted index to obtain an image feature set of the video base to be searched.

According to an embodiment of the present invention, preprocessing a query video, and calculating a similarity between the query video and a video in a video base to be retrieved under each cluster center of a training codebook by using an image feature of the query video obtained by preprocessing and an image feature set of the video base to be retrieved includes:

performing uniform frame extraction operation, frame-level feature extraction operation, quantization operation, aggregation operation and binarization operation on the query video to obtain image features of the query vector;

acquiring image features of a current video to be retrieved which participate in similarity calculation from an image feature set of a video base to be retrieved;

and calculating the similarity between the query video and the current video to be searched under each cluster center of the training codebook by utilizing the characteristic features of the query video and the image features of the current video to be searched.

According to an embodiment of the present invention, adding the similarities obtained at each cluster center of the training codebook to obtain the overall similarity between the query video and the video in the video base to be searched, and obtaining the search result of the query video in the video base to be searched based on the overall similarity includes:

adding the similarity obtained under each cluster center of the training codebook to obtain the overall similarity of the query video and each video in the video base to be retrieved;

and taking N videos with the maximum overall similarity with the video to be queried in the video base to be searched as search results of the query video.

According to a second aspect of the present invention, there is provided a training method of an image feature fingerprint extraction model, applied to a video fingerprint extraction and retrieval method, characterized by comprising:

tampering transformation is carried out on the image data of the open source by utilizing predefined automation tools, so that a training set with self-supervision annotation information is obtained;

constructing an image feature fingerprint extraction model based on a twin neural network architecture and initializing model parameters;

processing the training set by utilizing the image feature fingerprint advanced model to obtain an image feature fingerprint extraction result;

processing the image characteristic fingerprint extraction result and the labeling information corresponding to the image characteristic fingerprint extraction result by utilizing a predefined loss function to obtain a loss value;

according to the loss value, carrying out parameter updating and optimization on the image feature fingerprint extraction model to obtain an image feature fingerprint extraction model after parameter optimization;

and (3) carrying out feature extraction operation, loss value calculation operation, parameter and optimization operation iteratively until a preset training condition is met, and obtaining a trained image feature fingerprint extraction model.

According to an embodiment of the present invention, the performing tamper transformation on the open-source image data by using the predefined automation tool to obtain the training set with the self-supervision labeling information includes:

performing space transformation operation and/or color transformation operation and/or pixel level transformation operation on the image data of the open source through predefined automation tools to finish tampering transformation operation on the image data of the open source, so as to obtain a training set with self-supervision annotation information;

under the condition that the images in the training set belong to images derived from the same image in the image data of the open source, the self-supervision marking information is a first preset value;

and under the condition that the images in the training set do not belong to the images derived from the same image in the image data of the open source, the self-supervision marking information is a second preset value.

According to an embodiment of the invention, the predefined loss function comprises a contrast loss function.

According to an embodiment of the present invention, the image feature fingerprint extraction model includes a plurality of backbone networks sharing parameters with each other, wherein the backbone networks are constructed based on the EfficientNet V2.

According to a third aspect of the present invention, there is provided an electronic device comprising:

one or more processors;

storage means for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform a video fingerprint extraction and retrieval method and a training method of an image feature fingerprint extraction model.

According to a fourth aspect of the present invention, there is provided a computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform a video fingerprint extraction and retrieval method and a training method for an image feature fingerprint extraction model.

The video fingerprint extraction and retrieval method provided by the invention can extract the characteristic information of the video, which is robust to various tampering attacks, to form video fingerprints, and rapidly retrieve approximately repeated videos in a characteristic library of massive videos according to the fingerprints of the query videos; the video fingerprint extraction and retrieval method provided by the invention is applied to extracting the frame-level features of the video, and the frame-level features of each video are obviously reduced by aggregating the frame-level features with similar semantics in the video; by binarizing the aggregated features, the video fingerprint storage space overhead is significantly reduced with little loss in performance, while the retrieval time overhead is further reduced by using the inverted index structure.

Drawings

FIG. 1 is a flow chart of a video fingerprint extraction and retrieval method according to an embodiment of the invention;

FIG. 2 is a flow chart of acquiring a set of image features of a video base to be retrieved according to an embodiment of the invention;

FIG. 3 is a graph of computing cluster similarity of a query video with each video in a video base to be retrieved at a single cluster center, in accordance with an embodiment of the invention;

FIG. 4 is a flow chart of a training method of an image feature fingerprint extraction model according to an embodiment of the invention;

FIG. 5 is a schematic diagram of a twin neural network architecture of an image feature fingerprint extraction model according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a backbone network structure of an image feature fingerprint extraction model according to an embodiment of the present invention;

fig. 7 schematically shows a block diagram of an electronic device adapted to implement a video fingerprint extraction and retrieval method and a training method of an image feature fingerprint extraction model according to an embodiment of the invention.

Detailed Description

The present invention will be further described in detail below with reference to specific embodiments and with reference to the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent.

In the technical scheme disclosed by the invention, the related video data obtain the authorization of the related party, and the video data are processed, applied and stored under the permission of the related party, so that the related process accords with the rules of laws and regulations, necessary and reliable security measures are adopted, and the requirements of popular regulations are met.

The core of approximately repeated video retrieval is the extraction of video features. Some work characterizes each video by extracting a set of frame-level features and then temporally aligning the similarity matrix from frame to frame as the similarity of the two videos, but due to the large number of frame-level features, the storage cost and the time cost of retrieval are very high. Other works are to aggregate a set of frame-level features of a video into video-level features to characterize the video, and determine the similarity of the two videos through the distance between the video-level features, but the video-level features often cannot accurately capture the space-time structure in the video, and much detail information is lost, so that the retrieval effect is often unsatisfactory. Meanwhile, most of the existing works currently adopt a supervised deep learning method to extract the features of the video, a large amount of manually marked video is needed in the process of training a model, and a large amount of manpower and material resources are generally consumed by data marking, so that the model performance is often subject to the pull of limited marked data.

In view of the problems in the prior art, the invention aims to train an image fingerprint extraction model in a self-supervision mode and apply the image fingerprint extraction model to extract video fingerprints, and the quick retrieval of the approximate repeated video is completed with extremely low space storage and low price through aggregation of frame-level features with similar semantics and binarization operation.

Fig. 1 is a flowchart of a video fingerprint extraction and retrieval method according to an embodiment of the present invention.

As shown in FIG. 1, the video fingerprint extraction and retrieval method includes operations S110-S140.

In operation S110, feature vectors of the image training set are extracted by using the trained image feature fingerprint extraction model, an image feature vector set is obtained, the image feature set is processed by using a K-means clustering algorithm, and the obtained image feature set with K cluster centers is used as a training codebook.

In operation S120, the feature extraction operation, the quantization operation, the aggregation operation, the binarization operation and the inverted index operation are sequentially performed on the video in the video base to be retrieved by using the trained image feature fingerprint extraction model, so as to obtain an image feature set of the video base to be retrieved.

In operation S130, the query video is preprocessed, and the similarity between the query video and the video in the video base to be retrieved under each cluster center of the training codebook is calculated by using the image features of the query video and the image feature set of the video base to be retrieved obtained by the preprocessing.

In operation S130, a cluster similarity of the query video and each video in the video base to be retrieved under the center of each cluster is calculated, for example, in the kth cluster, a cluster similarity of the query video and the jth video in the video base to be retrieved is calculated.

In operation S140, the similarities obtained under each cluster center of the training codebook are added as the overall similarity between the query video and the video in the video base to be retrieved, and the retrieval result of the query video in the video base to be retrieved is obtained based on the overall similarity.

And after the overall similarity is obtained, taking N videos with the maximum overall similarity with the query video in the video base to be searched as final search results, wherein N is a positive integer.

Fig. 2 is a flowchart of acquiring an image feature set of a video base to be retrieved according to an embodiment of the present invention.

As shown in fig. 2, the feature extraction operation, the quantization operation, the aggregation operation, the binarization operation and the inverted indexing operation are sequentially performed on the video in the video base to be retrieved by using the trained image feature fingerprint extraction model, so that an image feature set of the video base to be retrieved is obtained, and the image feature set comprises operations S210-S250.

In operation S210, the videos in the video base to be retrieved are uniformly extracted, and based on the frame extraction result, the feature vector of each frame is extracted by using the trained image feature fingerprint extraction model, so as to obtain a frame-level feature vector.

The operation S210 is used for extracting frame-level image features, uniformly extracting frames from the video in the video base to be retrieved, and extracting feature vectors of each frame by using the image feature extraction model.

In operation S220, a quantization operation is performed by assigning frame-level feature vectors to K cluster centers of the training codebook according to a predefined assignment criterion, resulting in K feature vector clusters composed of frame-level feature vectors.

Operation S220 is directed to performing a quantization operation of assigning a frame-level feature vector of a video to a cluster center closest thereto in a codebook, whereby the frame-level feature constitutes K clusters.

In operation S230, the aggregation operation is completed by performing an L2 normalization process after performing an addition operation on the vectors in each feature vector cluster, and subtracting the value of the cluster center corresponding to the aggregated feature vector from the aggregated feature vector to obtain a processed aggregated feature vector.

In operation S240, the processed aggregate feature vector is binarized by a sign function to obtain a binarized aggregate feature vector.

In operation S250, the cluster center of the training codebook is used as an index, the binarized aggregate feature vector of each cluster center is quantized, and the image feature set of the video base to be retrieved is obtained based on the quantization result as an inverted index.

In the process of constructing the inverted index in operation S250, all the aggregated features quantized to the current cluster center are constructed as values with the cluster center of the training codebook as an index and stored locally.

Through the operations S210-S250, the video in the frequency base to be searched is uniformly extracted, the feature of each frame is extracted by using the image feature extraction model, and operations such as quantization, aggregation, binarization, inverted indexing and the like are performed to reduce the space and time cost of searching.

FIG. 3 is a graph of computing cluster similarity of a query video with each video in a video base to be retrieved at a single cluster center, according to an embodiment of the invention.

As shown in fig. 3, preprocessing the query video, and calculating the similarity between the query video and the video in the video base to be searched under each cluster center of the training codebook by using the image features of the query video and the image feature set of the video base to be searched obtained by preprocessing includes operations S310 to S330.

In operation S310, a uniform frame extraction operation, a frame-level feature extraction operation, a quantization operation, an aggregation operation, and a binarization operation are performed on the query video, so as to obtain image features of the query vector.

Operation S310 aims at acquiring image features of the query video to facilitate calculation of the subsequent cluster similarity.

In operation S320, image features of the current video to be retrieved that participate in the similarity calculation are acquired from the image feature set of the video base to be retrieved.

In operation S330, similarity between the query video and the current video to be retrieved under each cluster center of the training codebook is calculated using the characteristic features of the query video and the image features of the current video to be retrieved.

Operations S310 to S330 aim to calculate the cluster similarity between the query video and each video in the video base to be retrieved at a certain cluster center. Through operations S310 to S330, each video in the video base to be retrieved has K (i.e., the number of cluster centers of the training codebook) cluster similarities with the query video.

According to an embodiment of the present invention, adding the similarities obtained at each cluster center of the training codebook to obtain the overall similarity between the query video and the video in the video base to be searched, and obtaining the search result of the query video in the video base to be searched based on the overall similarity includes: adding the similarity obtained under each cluster center of the training codebook to obtain the overall similarity of the query video and each video in the video base to be retrieved; and taking N videos with the maximum overall similarity with the video to be queried in the video base to be searched as search results of the query video.

After the cluster similarity of each video in the video base to be searched and the query video under the specific cluster center is obtained, adding the K cluster similarities of each video in the video base to be searched and the query video to obtain the overall similarity of each video in the video base to be searched and the query video, ranking based on the overall similarity (for example, descending order according to the overall similarity), and taking N (N is a positive integer) top ranked videos out of the overall similarity as search results of the query video in the video base to be searched.

Fig. 4 is a flowchart of a training method of an image feature fingerprint extraction model according to an embodiment of the present invention.

As shown in fig. 4, the training method of the image feature fingerprint extraction model is applied to a video fingerprint extraction and retrieval method, and includes operations S410-S460.

In operation S410, tamper transformation is performed on the image data of the open source using a predefined automation tool, resulting in a training set with self-supervising annotation information.

According to an embodiment of the present invention, the performing tamper transformation on the open-source image data by using the predefined automation tool to obtain the training set with the self-supervision labeling information includes: performing space transformation operation and/or color transformation operation and/or pixel level transformation operation on the image data of the open source through predefined automation tools to finish tampering transformation operation on the image data of the open source, so as to obtain a training set with self-supervision annotation information; under the condition that the images in the training set belong to images derived from the same image in the image data of the open source, the self-supervision marking information is a first preset value; and under the condition that the images in the training set do not belong to the images derived from the same image in the image data of the open source, the self-supervision marking information is a second preset value.

In operation S420, an image feature fingerprint extraction model is constructed based on the twin neural network architecture and model parameter initialization is performed.

In operation S430, the training set is processed using the image feature fingerprint advanced model to obtain an image feature fingerprint extraction result.

In operation S440, the image feature fingerprint extraction result and the annotation information corresponding to the image feature fingerprint extraction result are processed using a predefined loss function, resulting in a loss value.

In operation S450, according to the loss value, the image feature fingerprint extraction model is updated and optimized to obtain the image feature fingerprint extraction model after parameter optimization.

In operation S460, the feature extraction operation, the loss value calculation operation, and the parameter and optimization operation are iterated until a preset training condition is satisfied, so as to obtain a trained image feature fingerprint extraction model.

The above operations S410 to S460 aim to train the image fingerprint extraction model: the method comprises the steps of performing various tampering transformation on original image data through an automation tool to construct a training set containing self-supervision annotation information, training an image feature extraction model through a contrast loss function by adopting a twin network architecture, and extracting image feature fingerprints.

The above-mentioned video fingerprint extraction and search method and the training method of the image feature fingerprint extraction model provided by the present invention are described in further detail below with reference to specific embodiments and fig. 5 and 6.

Fig. 5 is a schematic diagram of a twin neural network architecture of an image feature fingerprint extraction model according to an embodiment of the present invention.

Fig. 6 is a schematic diagram of a backbone network structure of an image feature fingerprint extraction model according to an embodiment of the present invention.

Firstly, in order to realize the extraction of the image characteristic fingerprints, an image characteristic fingerprint extraction model is firstly constructed and trained so as to provide a tool for the extraction of the subsequent video fingerprint characteristics.

In the process of training an image fingerprint extraction model, a training set containing self-supervision annotation information is constructed by performing various tampering transformation on original image data through an automatic tool, a twin network architecture is adopted, and an image feature extraction model is trained through a contrast loss function so as to extract image feature fingerprints.

Using open source image data, constructing a dataset by using an automatic tool to carry out various tamper transformations (including space transformation, color transformation, pixel level transformation and the like and combinations thereof), wherein images derived from the same image are regarded as the same category, otherwise, are regarded as different categories, so that a training set containing self-supervision annotation information is obtained for training an image feature extraction model. The image feature extraction model adopts a twin neural network architecture shown in fig. 5, and the network architecture adopts an EfficientNet V2 shown in fig. 6, including Conv33 (convolutional network 3->3) Fused-MBConv1 (Fused depth separable convolutional network 1), multiple Fused-MBConv4 (Fused depth separable convolutional network 4), MBConv4 (depth separable convolutional network 4), multiple MBConv6 (depth separable convolutional network 4), conv 1->1 (convolutional network 1->1) Pooling (pooling layer) and FC (fully connected layer); training a deep learning model for extracting image feature vectors robust to various tampering attacks by contrast learning ideas enabling similar image feature distances to be as close as possible and heterogeneous image feature distances to be as far as possible, wherein a loss function adopts contrast loss, as shown in a formula (1):

（1），

wherein,、/>for two images +.>Representing a feature extraction model, ++>Representing that two images belong to the same category, +.>Representing that two images belong to different categories, +.>Representing the interval parameter.

And secondly, extracting video fingerprints by using the trained image feature fingerprint extraction model.

In the actual video fingerprint extraction process, extracting feature vectors of an image training set through a trained image feature extraction model, and then clustering the feature vectors into K mean values, wherein the obtained set formed by K cluster centers is used as a training codebook.

Then, feature extraction needs to be carried out on each video in the video base to be retrieved. In the process of extracting the features of the video base, the image feature extraction model is used for extracting the features of the video in the video base to be retrieved, and a series of processing is performed to save space and time, and the specific steps are as follows.

Extracting frame-level features: and uniformly extracting frames from the videos in the video base, and extracting the feature vector of each frame by using an image feature extraction model.

Quantification: assigning frame-level feature vectors of the video to cluster centers closest thereto in the codebook, whereby the frame-level features constituteAnd a cluster.

Polymerization: the feature vectors of each cluster belonging to the same video are aggregated in such a way that the feature vectors are addedNormalizing, subtracting the value of the corresponding cluster center from the aggregated features, and finally passing through a sign function shown in the formula (2)(2) All the aggregate features are binarized.

Constructing an inverted index: and taking the cluster center of the codebook as an index, and constructing an inverted index by taking all the aggregation features quantized to the current cluster center as values and storing the inverted index locally.

Finally, for a given query videoSearching for the +.f. with the maximum similarity to the query video in the video base>Video. The specific search steps are described below.

For query videoAnd performing the operations of uniform frame extraction, frame level feature extraction, quantization, aggregation and binarization.

Loading an inverted index, in the firstIn each cluster, query video +.>And +.>Cluster similarity calculation for each video is described by equation (3):

（3），

wherein,for querying video +.>In->Aggregate feature vector of individual clusters,>is the +.>The video is at->The aggregate feature vector of the individual clusters,/>for feature vector dimension, < >>To select coefficients.

Querying videoAnd video base->The overall similarity calculation of the individual videos is shown in formula (4):

（4），

after the similarity calculation of the query video and all videos in the video base is completed, the similarity between the query video and the query video in the video base is the largestThe individual videos serve as the final search result.

As shown in fig. 7, an electronic device 700 according to an embodiment of the present invention includes a processor 701 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. The processor 701 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 701 may also include on-board memory for caching purposes. The processor 701 may comprise a single processing unit or a plurality of processing units for performing different actions of the method flow according to an embodiment of the invention.

In the RAM 703, various programs and data necessary for the operation of the electronic apparatus 700 are stored. The processor 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. The processor 701 performs various operations of the method flow according to an embodiment of the present invention by executing programs in the ROM 702 and/or the RAM 703. Note that the program may be stored in one or more memories other than the ROM 702 and the RAM 703. The processor 701 may also perform various operations of the method flow according to embodiments of the present invention by executing programs stored in one or more memories.

According to an embodiment of the invention, the electronic device 700 may further comprise an input/output (I/O) interface 705, the input/output (I/O) interface 705 also being connected to the bus 704. The electronic device 700 may also include one or more of the following components connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output portion 707 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, a speaker, and the like; a storage section 708 including a hard disk or the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. The drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read therefrom is mounted into the storage section 708 as necessary.

The present invention also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present invention.

According to embodiments of the present invention, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to an embodiment of the invention, the computer-readable storage medium may include ROM 702 and/or RAM 703 and/or one or more memories other than ROM 702 and RAM 703 described above.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The foregoing embodiments have been provided for the purpose of illustrating the general principles of the present invention, and are not meant to limit the scope of the invention, but to limit the invention thereto.

Claims

1. A method for video fingerprint extraction and retrieval, comprising:

the training-completed image feature fingerprint extraction model is utilized to sequentially perform feature extraction operation, quantization operation, aggregation operation, binarization operation and inverted index operation on the video in the video base to be searched, so that an image feature set of the video base to be searched is obtained;

preprocessing query videos, and calculating the similarity between the query videos and the videos in the video base to be searched under each cluster center of the training codebook by utilizing the image characteristics of the query videos and the image characteristic set of the video base to be searched obtained by preprocessing;

2. The method of claim 1, wherein performing feature extraction, quantization, aggregation, binarization, and inverted indexing on the video in the video base to be retrieved using the trained image feature fingerprint extraction model sequentially, and obtaining the image feature set of the video base to be retrieved comprises:

uniformly extracting frames from the videos in the video base to be retrieved, and extracting the feature vector of each frame by using the trained image feature fingerprint extraction model based on the frame extraction result to obtain a frame-level feature vector;

the frame-level feature vectors are distributed to K cluster centers of the training codebook according to a predefined distribution standard to complete quantization operation, so that K feature vector clusters formed by the frame-level feature vectors are obtained;

the method comprises the steps of performing addition operation on vectors in each feature vector cluster, performing L2 normalization processing to complete aggregation operation, and subtracting a value of a cluster center corresponding to the aggregated feature vector from the aggregated feature vector to obtain a processed aggregated feature vector;

and taking the cluster center of the training codebook as an index, quantizing the binarized aggregate feature vector of each cluster center, and taking a quantization result as an inverted index to obtain the image feature set of the video base to be searched.

3. The method of claim 1, wherein preprocessing the query video and calculating the similarity of the query video and the video in the video base to be retrieved at each cluster center of the training codebook using the preprocessed image features of the query video and the image feature set of the video base to be retrieved comprises:

performing uniform frame extraction operation, frame-level feature extraction operation, quantization operation, aggregation operation and binarization operation on the query video to obtain image features of query vectors;

acquiring image features of the current video to be retrieved which participate in similarity calculation from the image feature set of the video base to be retrieved;

4. The method of claim 1, wherein adding the similarities obtained under each cluster center of the training codebook as an overall similarity between the query video and the videos in the video base to be retrieved, and obtaining the retrieval result of the query video in the video base to be retrieved based on the overall similarity comprises:

and taking N videos with the maximum overall similarity with the video to be queried in the video base to be retrieved as retrieval results of the query video.

5. A training method of an image feature fingerprint extraction model, applied to the method of any one of claims 1 to 4, comprising:

constructing the image characteristic fingerprint extraction model based on a twin neural network architecture and initializing model parameters;

according to the loss value, carrying out parameter updating and optimization on the image characteristic fingerprint extraction model to obtain an image characteristic fingerprint extraction model after parameter optimization;

6. The method of claim 5, wherein tamper transforming the open source image data using a predefined automation tool to obtain a training set with self-supervising annotation information comprises:

performing spatial transformation operation and/or color transformation operation and/or pixel level transformation operation on the open-source image data through the predefined automation tool to finish tampering transformation operation on the open-source image data, so as to obtain the training set with self-supervision annotation information;

wherein, the self-supervision labeling information is a first preset value under the condition that the images in the training set belong to the images derived from the same image in the open-source image data;

7. The method of claim 5, wherein the predefined loss function comprises a contrast loss function.

8. The method of claim 5, wherein the image feature fingerprint extraction model comprises a plurality of backbone networks sharing parameters with each other, wherein the backbone networks are constructed based on EfficientNet V2.

9. An electronic device, comprising:

one or more processors;

storage means for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-8.

10. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method according to any of claims 1-8.