WO2021007999A1 - Video frame processing method and device - Google Patents

Video frame processing method and device

Info

Publication number
WO2021007999A1
Authority
WO
WIPO (PCT)
Prior art keywords
video frame
feature
video
cnn
target
Prior art date
Application number
PCT/CN2019/114271
Other languages
English (en)
French (fr)
Inventor
吕孟叶
董治
李深远
Original Assignee
Tencent Music Entertainment Technology (Shenzhen) Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology (Shenzhen) Co., Ltd.
Priority to KR1020227005421A (KR20220032627A)
Publication of WO2021007999A1
Priority to US17/575,140 (US20220139085A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/48 Matching video sequences

Definitions

  • the present invention relates to the field of Internet technology, in particular to a video frame processing method and device.
  • the inventor found that a large number of videos are stored in the video libraries of existing video websites. In order to avoid storing duplicate videos in a video library, duplicate videos are usually detected, and in the process of detecting duplicate videos, the detection of duplicate video frames is particularly important.
  • the embodiments of the present invention provide a video frame processing method and device, which can detect repeated video frames with high accuracy.
  • an embodiment of the present invention provides a video frame processing method, including:
  • the local feature of the target video frame includes a first key point of the target video frame and a feature descriptor corresponding to the first key point;
  • the local feature of the first video frame includes a second key point in the first video frame and a feature descriptor corresponding to the second key point;
  • the acquiring the convolutional neural network CNN feature of the target video frame includes:
  • the target video frame is input to a CNN network for processing, and the CNN feature of the target video frame is obtained.
  • the acquiring the first video frame from multiple sample video frames includes:
  • Obtain a CNN feature index of a sample video, the sample video including the plurality of sample video frames, where the CNN feature index is used to represent a plurality of clusters formed by clustering the reduced-dimensional CNN features of the plurality of sample video frames, and each of the clusters includes a cluster center and the reduced-dimensional CNN feature of at least one sample video frame in the cluster;
  • the local features of the target video frame include m first key points and m first feature descriptors corresponding to the m first key points, where one first key point corresponds to one first feature descriptor; the local features of the first video frame include n second key points and n second feature descriptors corresponding to the n second key points, where one second key point corresponds to one second feature descriptor; m is a natural number greater than or equal to 2, and n is a natural number greater than or equal to 2;
  • the calculating the matching degree between the local feature of the first video frame and the local feature of the target video frame includes:
  • the first feature descriptor is a valid descriptor matching the first video frame
  • the degree of matching between the local features of the first video frame and the local features of the target video frame is determined.
  • before acquiring the convolutional neural network CNN feature and the local feature of the target video frame, the method further includes:
  • Dimensionality reduction processing is performed on the CNN features of the multiple sample video frames by using a principal component analysis (PCA) matrix to obtain the reduced-dimensional CNN features of the multiple sample video frames;
  • a CNN feature index of the sample video is generated according to the multiple clusters, the compressed CNN feature corresponding to each cluster, and the cluster center of each cluster.
  • the target video frame belongs to the video to be queried, the playback time point of the target video frame in the video to be queried is a first playback time point, and the first video frame belongs to a target video among the sample videos; the using the first video frame as the repeated video frame of the target video frame includes:
  • the method further includes:
  • if the number of repeated video frames of the video to be queried and the target video is greater than a second threshold, and the repeated video frames meeting a continuous distribution condition among all repeated video frames are respectively distributed in a first time period of the video to be queried and a second time period of the target video, the video in the first time period of the video to be queried and the video in the second time period of the target video are determined as repeated video segments, where
  • the first time period includes the first play time point
  • the second time period includes the second play time point
  • the continuous distribution condition includes that the time difference between adjacent repeated video frames is less than a third threshold.
  • an embodiment of the present invention provides a video frame processing device, including:
  • the first acquisition module is used to acquire the convolutional neural network CNN feature of the target video frame and the local feature of the target video frame.
  • the local feature of the target video frame includes the first key point of the target video frame and the feature descriptor corresponding to the first key point;
  • a dimensionality reduction processing module configured to perform dimensionality reduction processing on the CNN feature of the target video frame, and obtain the dimensionality reduction CNN feature of the target video frame;
  • the second acquisition module is configured to acquire a first video frame from a plurality of sample video frames, where the distance between the reduced-dimensional CNN feature of the first video frame and the reduced-dimensional CNN feature of the target video frame meets a first preset condition;
  • the third acquisition module is configured to acquire the local features of the first video frame, where the local features of the first video frame include a second key point in the first video frame and a feature descriptor corresponding to the second key point;
  • a calculation module configured to calculate the degree of matching between the local feature of the first video frame and the local feature of the target video frame
  • the first determining module is configured to use the first video frame as a repeated video frame of the target video frame if the matching degree meets a second preset condition.
  • the first acquisition module is specifically configured to acquire the video to be queried; take screenshots of the video to be queried at equal intervals to obtain a target video frame, where the target video frame is any one of the multiple video frames obtained by taking screenshots of the video to be queried at equal intervals; and input the target video frame into a CNN network for processing to obtain the CNN feature of the target video frame.
  • the second acquisition module includes:
  • the first acquiring unit is configured to acquire a CNN feature index of a sample video, where the sample video includes the plurality of sample video frames, and the CNN feature index is used to represent a plurality of clusters formed by clustering the reduced-dimensional CNN features of the plurality of sample video frames;
  • the first calculation unit is used to calculate the distance between the reduced-dimensional CNN feature of the target video frame and the cluster center of each cluster in the plurality of clusters, and to use the cluster corresponding to the closest cluster center as the target cluster;
  • the second calculation unit is used to calculate the distance between the reduced-dimensional CNN feature of the target video frame and the reduced-dimensional CNN feature of each sample video frame in the at least one sample video frame included in the target cluster, and to take the sample video frame corresponding to the closest reduced-dimensional CNN feature as the first video frame.
  • the local features of the target video frame include m first key points and m first feature descriptors corresponding to the m first key points, where one first key point corresponds to one first feature descriptor; the local features of the first video frame include n second key points and n second feature descriptors corresponding to the n second key points, where one second key point corresponds to one second feature descriptor; m is a natural number greater than or equal to 2, and n is a natural number greater than or equal to 2.
  • the calculation module includes:
  • the second acquiring unit is configured to acquire, for each of the first feature descriptors, n distances between each second feature descriptor of the n second feature descriptors and the first feature descriptor ;
  • a sorting unit configured to sort the n distances in a descending order to form a sorting queue
  • the third acquiring unit is configured to acquire the last k distances sorted in the sorting queue, where k is a natural number greater than or equal to 2;
  • a first determining unit configured to determine, according to the k distances, that the first feature descriptor is a valid descriptor matching the first video frame
  • the second determining unit is configured to determine the degree of matching between the local feature of the first video frame and the local feature of the target video frame according to the number of valid descriptors in the m first feature descriptors.
  • the device further includes:
  • the fourth acquisition module is used to acquire the CNN features of the multiple sample video frames
  • a dimensionality reduction processing module configured to perform dimensionality reduction processing on the CNN features of the multiple sample video frames by using the PCA matrix of principal component analysis to obtain the dimensionality reduction CNN features of the multiple sample video frames;
  • a clustering module configured to perform k-means clustering on the reduced-dimensional CNN features of the multiple sample video frames to form multiple clusters, each of the clusters containing at least one reduced-dimensional CNN feature of the sample video frames;
  • a quantization compression module configured to quantify and compress the dimensionality reduction CNN features of at least one sample video frame included in each cluster to obtain compressed CNN features corresponding to the cluster;
  • the generating module is configured to generate the CNN feature index of the sample video according to the multiple clusters, the compressed CNN feature corresponding to each cluster, and the cluster center of each cluster.
  • the target video frame belongs to the video to be queried
  • the playback time point of the target video frame in the video to be queried is the first playback time point
  • the first video frame belongs to a target video among the sample videos;
  • the first determining module includes:
  • the fourth acquiring unit is configured to acquire the frame identifier of the first video frame
  • the third determining unit is configured to find the second play time point of the target video corresponding to the frame identifier of the first video frame, and to use the video frame at the second play time point in the target video as the repeated video frame of the video frame at the first playback time point in the video to be queried.
  • the device further includes:
  • the second determining module is configured to: if the number of repeated video frames of the video to be queried and the target video is greater than a second threshold, and the repeated video frames that meet the continuous distribution condition among all repeated video frames are respectively distributed in the first time period of the video to be queried and the second time period of the target video, determine the video of the first time period in the video to be queried and the video of the second time period in the target video as repeated video segments, where
  • the first time period includes the first play time point
  • the second time period includes the second play time point
  • the continuous distribution condition includes that the time difference between adjacent repeated video frames is less than a third threshold.
  • an embodiment of the present invention provides a video frame processing device, where the video frame processing device includes a processor and a memory;
  • the processor is connected to a memory, wherein the memory is used to store program code, and the processor is used to call the program code to execute the method described in the first aspect.
  • an embodiment of the present invention provides a computer storage medium, wherein the computer storage medium stores a computer program, the computer program includes program instructions, and the program instructions, when executed by a processor, perform the method described in the first aspect.
  • In the embodiments of the present invention, the distance between the reduced-dimensional CNN feature of the target video frame and the reduced-dimensional CNN features of the sample video frames is used as a first layer of filtering, and the degree of matching between the local features of the target video frame and the local features of the first video frame is then used as a second layer of screening, so that the repeated video frames of the target video frame can be detected with high accuracy.
  • FIG. 1 is a flowchart of a method for processing video frames according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of a generation process of a CNN feature index provided by an embodiment of the present invention
  • FIG. 3 is a flowchart of extracting dual features of a video frame according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of retrieving duplicate videos provided by an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a video frame processing apparatus provided by an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of another video frame processing apparatus provided by an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of another video frame processing device provided by an embodiment of the present invention.
  • FIG. 1 is a schematic flowchart of a method for processing video frames according to an embodiment of the present invention.
  • the video frame processing method of the embodiment of the present invention may include the following steps S101 to S106.
  • S101: Acquire a convolutional neural network CNN feature and a local feature of a target video frame;
  • S102: Perform dimensionality reduction processing on the CNN feature of the target video frame to obtain the reduced-dimensional CNN feature of the target video frame;
  • the target video frame may be any video frame in the video to be queried, or the target video frame may be a single picture frame that needs to be compared. If the target video frame is a video frame in the video to be queried, the video to be queried is captured at equal intervals to generate multiple video frames, and the target video frame can be any one of the multiple video frames.
  • the screenshot interval of the video to be queried should be smaller than the screenshot interval of the sample videos in the database to be compared; that is, screenshots of the video to be queried are taken at a higher frequency, such as 5 frames per second, to ensure that they can match the sample video frames in the library. More preferably, the time points of the screenshots can be randomly jittered to avoid the extreme situation where every video frame to be queried happens to have a large time offset from the sample video frames in the library.
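  • For illustration, the following is a minimal sketch (Python with OpenCV) of taking equal-interval screenshots with randomly jittered time points; the 5 frames-per-second rate and the jitter range are illustrative assumptions, not values fixed by the embodiment:

```python
import random
import cv2  # OpenCV

def sample_frames(video_path, rate_hz=5.0, jitter_s=0.05):
    """Take screenshots at roughly `rate_hz` frames per second, randomly
    jittering each capture time so the query frames do not all fall at a
    fixed offset from the sample video frames already in the library."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    duration = cap.get(cv2.CAP_PROP_FRAME_COUNT) / fps
    t, frames = 0.0, []
    while t < duration:
        t_capture = max(0.0, t + random.uniform(-jitter_s, jitter_s))
        cap.set(cv2.CAP_PROP_POS_MSEC, t_capture * 1000.0)
        ok, frame = cap.read()
        if ok:
            frames.append((t_capture, frame))  # (time point, BGR image)
        t += 1.0 / rate_hz
    cap.release()
    return frames
```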
  • the local feature of the target video frame is the feature descriptor of the key points extracted in the target video frame.
  • the extracted key points are pixels in the target video frame that have a large difference in pixel values from adjacent pixels, for example, the edges and corners of the image of the target video frame.
  • the CNN feature of the target video frame can be obtained by selecting a CNN network pre-trained on a large-scale general image data set (such as the ImageNet, Open Images, or ML-Images data set), inputting the target video frame into the CNN network, and pooling the feature map output by the last one or more convolutional layers to obtain the original CNN feature of the target video frame.
  • the CNN feature is a floating-point vector with a fixed length and a high dimensionality (for example, 2048 dimensions).
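  • As a hedged sketch of this step: the embodiment only requires a CNN pre-trained on a general image data set, with the last convolutional feature map pooled into a fixed-length vector; the choice of a torchvision ResNet-50 and global average pooling below is an assumption for illustration:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Assumption: ResNet-50 pre-trained on ImageNet; any CNN pre-trained on a
# large general image data set would serve the same role here.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.eval()
# Keep everything up to (and including) the last convolutional stage.
conv_trunk = torch.nn.Sequential(*list(backbone.children())[:-2])

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def cnn_feature(frame_bgr):
    """Pool the last convolutional feature map into a 2048-dim float vector."""
    rgb = frame_bgr[:, :, ::-1].copy()            # BGR (OpenCV) -> RGB
    x = preprocess(rgb).unsqueeze(0)              # 1 x 3 x 224 x 224
    with torch.no_grad():
        fmap = conv_trunk(x)                      # 1 x 2048 x 7 x 7
        feat = fmap.mean(dim=(2, 3)).squeeze(0)   # global average pooling
    return feat.numpy()                           # fixed-length 2048-dim vector
```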
  • a principal component analysis (PCA) matrix can then be used to perform dimensionality reduction processing on the CNN feature to obtain the reduced-dimensional CNN feature of the target video frame.
  • the local feature of the target video frame can be extracted with any one of the local feature extractors such as SIFT, SURF, ORB, AKAZE, or BRISK.
  • among these, BRISK offers relatively high accuracy and speed.
  • the local features extracted using default parameters contain a large number of key points and corresponding feature descriptors.
  • the key points can be sorted by response, and only the dozens of key points with the highest response and the corresponding feature descriptors are retained, and one key point corresponds to a feature descriptor.
  • for some video frames with relatively smooth texture, too few key points may be detected; in that case, the detection threshold can be lowered one or more times until enough key points are detected.
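  • A minimal sketch of this local feature step using OpenCV's BRISK; the number of retained key points and the threshold schedule are illustrative assumptions (the text only says that the dozens of highest-response points are kept and that the threshold may be lowered if too few are found):

```python
import cv2
import numpy as np

def brisk_local_features(frame_bgr, keep=50, thresholds=(30, 20, 10, 5)):
    """Extract BRISK key points and descriptors, keeping only the `keep`
    strongest responses; if too few key points are detected (e.g. on smooth
    textures), retry with progressively lower detection thresholds."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    kps, descs = [], None
    for thresh in thresholds:
        brisk = cv2.BRISK_create(thresh)
        kps, descs = brisk.detectAndCompute(gray, None)
        if kps and len(kps) >= keep:
            break
    if not kps:
        return [], np.empty((0, 64), dtype=np.uint8)
    order = np.argsort([-kp.response for kp in kps])[:keep]  # strongest first
    kps = [kps[i] for i in order]
    descs = descs[order]
    return kps, descs  # one 64-byte binary descriptor per kept key point
```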
  • S103 Acquire a first video frame from a plurality of sample video frames, and the distance between the reduced-dimensional CNN feature of the first video frame and the reduced-dimensional CNN feature of the target video frame meets a first preset condition;
  • the K sample video frames whose reduced-dimensional CNN features are closest to the reduced-dimensional CNN feature of the target video frame are selected from the sample video frames already stored in the database, and the first video frame may be any one of the K sample video frames.
  • the sample video frames stored in the database are sample video frames of multiple sample videos.
  • the reduced-dimensional CNN features of the K sample video frames are the closest to the reduced-dimensional CNN feature of the target video frame; that is, the K distances between the reduced-dimensional CNN features of the K sample video frames and the reduced-dimensional CNN feature of the target video frame rank in the top K positions when the distances between the reduced-dimensional CNN features of all sample video frames and the reduced-dimensional CNN feature of the target video frame are sorted in ascending order.
  • the selection method of selecting the closest K sample video frames may be to first obtain the CNN feature index of the sample video.
  • the sample video includes all the sample videos in the library; that is, the CNN feature index can be generated from the reduced-dimensional CNN features of the sample video frames of all the sample videos in the library.
  • the CNN feature index is a structure that represents multiple clusters formed by clustering the reduced-dimensional CNN features of the sample video frames of all sample videos in the library; each cluster contains the cluster center and the reduced-dimensional CNN feature of at least one sample video frame in the cluster. The distance between the reduced-dimensional CNN feature of the target video frame and the cluster center of each of these clusters is then calculated, and the cluster corresponding to the closest cluster center is used as the target cluster.
  • the target cluster can be one cluster or multiple clusters; if multiple clusters are used, they can be the clusters whose cluster centers are closest to the reduced-dimensional CNN feature of the target video frame. Finally, the distance between the reduced-dimensional CNN feature of the target video frame and the reduced-dimensional CNN feature of each sample video frame contained in the target cluster is calculated, and any one of the closest K sample video frames is taken as the first video frame; that is, there are K first video frames.
  • alternatively, the stored target cluster can be dynamically read from the database, the distances to the original CNN features of the at least one sample video frame it contains are calculated and sorted, and the closest K sample video frames are obtained as the K first video frames.
  • the K first video frames can also be filtered, for example, if the distance exceeds a certain threshold, they are directly eliminated.
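  • The two-layer lookup described above (nearest cluster centers first, then the frames inside the target cluster, with an optional distance threshold) might look roughly like the following numpy sketch; the index layout (`centers`, per-cluster `feats` and `ids`) and parameter names are hypothetical:

```python
import numpy as np

def search_k_nearest(query_feat, index, k=10, n_probe=1, max_dist=None):
    """Stage 1: pick the n_probe clusters whose centers are closest to the
    query's reduced-dimensional CNN feature. Stage 2: rank the sample frames
    stored in those clusters by distance and keep the K closest."""
    d_centers = np.linalg.norm(index["centers"] - query_feat, axis=1)
    probe = np.argsort(d_centers)[:n_probe]            # target cluster(s)
    cand_ids, cand_dists = [], []
    for c in probe:
        feats = index["feats"][c]                      # reduced CNN features
        dists = np.linalg.norm(feats - query_feat, axis=1)
        cand_ids.extend(index["ids"][c])               # (video id, frame id)
        cand_dists.extend(dists.tolist())
    order = np.argsort(cand_dists)[:k]                 # K closest sample frames
    results = [(cand_ids[i], cand_dists[i]) for i in order]
    if max_dist is not None:                           # optional filtering step
        results = [r for r in results if r[1] <= max_dist]
    return results
```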
  • the CNN feature index is generated based on the sample video frames contained in all sample videos in the library, where screenshots of the videos in the library are taken at equal intervals to generate the corresponding sample video frames.
  • the interval of the screenshots depends on the desired time resolution. For example, if you need to detect repeats of 5s and above, the interval between screenshots should be less than 5s.
  • a CNN network pre-trained on a large-scale general image data set (the ImageNet, Open Images, or ML-Images data set) is selected, each sample video frame is input into the CNN network, and the feature map output by the last one or more convolutional layers is pooled to obtain the original CNN feature of the video frame. The CNN feature is a fixed-length floating-point vector with a high dimensionality (for example, 2048 dimensions).
  • the local features of the sample video frames of all the sample videos in the library can be extracted separately, and one of the local feature extractors such as SIFT, SURF, ORB, AKAZE, BRISK, etc. can be selected for local feature extraction.
  • BRISK is relatively high in accuracy and speed.
  • the local features extracted by default parameters contain a large number of key points and corresponding feature descriptors.
  • the key points can be sorted by response, and only the dozens of key points with the highest response and the corresponding feature descriptors are kept. More preferably, for some video frames with relatively smooth texture, too few key points may be detected, and the detection threshold can be lowered one or more times until sufficient key points and corresponding feature descriptors are detected.
  • a database can be established according to the video id and video frame id of each sample video, and an information tuple (video id, video frame id, video frame time point, CNN feature, local feature) is uploaded to the database for each sample video frame.
  • for example, if the CNN feature is a 2048-dimensional single-precision floating-point vector (occupying 8 KB), the database size is about 1.07 TB and can be deployed on a single machine.
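  • As a rough consistency check of these figures (illustrative arithmetic, not part of the original disclosure): a 2048-dimensional single-precision feature occupies 2048 × 4 B = 8192 B, i.e. 8 KB, so a feature store of about 1.07 TB corresponds to roughly 1.07 × 10^12 B ÷ 8192 B ≈ 1.3 × 10^8 sample video frames; the exact size also depends on how the video ids, time points, and local features are stored alongside the CNN features.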
  • FIG. 3 is a schematic diagram of CNN feature and local feature extraction provided by an embodiment of the present invention.
  • a screenshot is taken of the video in the library to obtain a sample video frame, and the CNN feature map is further calculated through the CNN network.
  • the CNN feature map is further pooled to obtain the CNN features of the sample video frame.
  • local feature extraction is performed on the sample video frame to obtain the local feature of the sample video frame.
  • the local features are screened to obtain the local features of the sample video frame after screening. Upload the CNN features and local features of the sample video frames to the database for storage.
  • the original CNN features of all or part of the sample video frames are read from the database and loaded into memory, and the original CNN features loaded into memory are used to train the PCA matrix, which reduces the CNN feature dimensionality while retaining the original information as much as possible.
  • the PCA matrix has a data whitening effect.
  • the original CNN features of all sample video frames in the database are loaded into memory at one time or in batches, and the trained PCA matrix is used to reduce the dimensionality of the original CNN features of all sample video frames, obtaining the reduced-dimensional CNN features of all sample video frames.
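  • A minimal sketch of training and applying the PCA matrix (with whitening, as mentioned above) using scikit-learn; the 256-dimension target and the file name are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

# Assumption: the original 2048-dim CNN features of the sample frames have
# already been read from the database into a (num_frames, 2048) array.
original_feats = np.load("sample_cnn_features.npy")   # hypothetical file

# Train a whitening PCA matrix; 256 output dimensions is an illustrative choice.
pca = PCA(n_components=256, whiten=True)
pca.fit(original_feats)

# Reduce the dimensionality of all sample features; the same transform is
# applied later to each query frame's original CNN feature.
reduced_feats = pca.transform(original_feats).astype(np.float32)
```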
  • k-means clustering is performed on the reduced-dimensional CNN features of all sample video frames to obtain N clusters and corresponding cluster centers.
  • each cluster contains the reduced-dimensional CNN features of at least one sample video frame.
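  • For example, the k-means step could be run with scikit-learn as below; the number of clusters N is an illustrative assumption, and `reduced_feats` stands for the reduced-dimensional CNN features produced by the PCA step above:

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumption: reduced-dimensional CNN features from the PCA step, shape (num_frames, 256).
reduced_feats = np.load("reduced_cnn_features.npy")   # hypothetical file

N = 1024  # illustrative number of clusters
kmeans = KMeans(n_clusters=N, n_init=10, random_state=0).fit(reduced_feats)
centers = kmeans.cluster_centers_           # one center per cluster
assignments = kmeans.labels_                # cluster id of each sample frame

# Group the row indices of the sample frames by the cluster they belong to.
clusters = {c: np.where(assignments == c)[0] for c in range(N)}
```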
  • the dimensionality reduction CNN features of the sample video frames included in each cluster are quantized and compressed.
  • scalar quantization or product quantization can be used to quantize and further compress the reduced-dimensional CNN features of the sample video frames contained in each cluster. For example, if scalar quantization is used, each dimension of the reduced-dimensional CNN feature of a sample video frame can be compressed from a 4-byte floating-point number to a 1-byte integer.
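  • One possible scalar quantization scheme matching the 4-byte-to-1-byte example above; the per-cluster min/max mapping is an illustrative choice, and product quantization would be an alternative:

```python
import numpy as np

def scalar_quantize(cluster_feats):
    """Compress each dimension of a cluster's reduced-dimensional CNN features
    from a 4-byte float to a 1-byte integer by mapping the per-dimension
    [min, max] range onto 0..255."""
    lo = cluster_feats.min(axis=0)
    hi = cluster_feats.max(axis=0)
    scale = np.where(hi > lo, 255.0 / (hi - lo), 1.0)
    codes = np.round((cluster_feats - lo) * scale).astype(np.uint8)
    return codes, lo, scale                  # lo and scale are kept in the index

def scalar_dequantize(codes, lo, scale):
    """Approximate reconstruction used when computing distances at query time."""
    return codes.astype(np.float32) / scale + lo
```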
  • S25 Generate a CNN feature index of the sample video according to the multiple clusters, the compressed CNN feature corresponding to each of the clusters, and the cluster center of each cluster.
  • in this way, the CNN feature index of the sample videos in the library can be generated.
  • the CNN feature index is a structure from which the above N clusters, the cluster centers of the N clusters, and the compressed CNN features corresponding to each cluster can be obtained.
  • S104 Acquire a local feature of the first video frame, where the local feature of the first video frame includes a second key point in the first video frame and a feature descriptor corresponding to the second key point;
  • the method for acquiring the local features of the first video frame may be to extract the local features of the first video frame from the database established in step S21.
  • the corresponding local feature may be searched from the database according to the video id to which the first video frame belongs and the id of the first video frame.
  • the local features of the first video frame include the second key point extracted in the first video frame and the feature descriptor corresponding to the second key point.
  • the second key points extracted in the first video frame may be pixels in the first video frame whose pixel values differ greatly from those of adjacent pixels, such as corners in the first video frame.
  • S105 Calculate the degree of matching between the local feature of the first video frame and the local feature of the target video frame
  • if the degree of matching between the local feature of the first video frame and the local feature of the target video frame is greater than a first threshold, it is determined that the first video frame and the target video frame are repeated video frames.
  • the matching degree between the local feature of the first video frame and the local feature of the target video frame can be calculated as follows. For example, the local feature of the target video frame includes m first feature descriptors corresponding to m first key points, with one first key point corresponding to one first feature descriptor.
  • for each first feature descriptor Ai, the n distances between Ai and each of the n second feature descriptors Bj are obtained. The n distances are sorted from largest to smallest to form a sorting queue, and the last k distances in the sorting queue, that is, the k smallest distances, are obtained.
  • according to the k distances, it is determined whether the first feature descriptor Ai is a valid descriptor matching the first video frame.
  • according to the number of valid descriptors among the m first feature descriptors, the degree of matching between the local features of the first video frame and the local features of the target video frame is determined.
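  • A hedged sketch of this matching-degree computation for binary (BRISK) descriptors; note that the text only says validity is decided "according to the k distances", so the ratio test used below (closest distance clearly smaller than the second closest) is an illustrative choice of that rule:

```python
import numpy as np

def matching_degree(first_descs, second_descs, k=2, ratio=0.75):
    """For each first feature descriptor A_i, compute its n distances to the
    second feature descriptors B_j, take the k smallest, and decide whether
    A_i is a valid descriptor; the matching degree is the fraction of valid
    descriptors among the m first descriptors."""
    valid = 0
    for a in first_descs:
        # Hamming distance for binary descriptors such as BRISK
        xor = np.bitwise_xor(a, second_descs)                 # n x 64 bytes
        dists = np.unpackbits(xor, axis=1).sum(axis=1)        # n distances
        nearest = np.sort(dists)[:k]                          # k smallest
        if len(nearest) >= 2 and nearest[0] < ratio * nearest[1]:
            valid += 1                                        # A_i is valid
    return valid / max(len(first_descs), 1)
```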
  • since there are K first video frames, that is, the target video frame to be queried may be a repeated video frame of multiple sample video frames, the playback time point at which each repeated video frame is located can be further determined, as well as which sample video the repeated video frame comes from.
  • if the target video frame is the video frame at the first playback time point of the video to be queried, the frame identifier of the first video frame can be obtained, and the second playback time point of the target video corresponding to the frame identifier of the first video frame can be found from the database.
  • the K first video frames may correspond to different playback time points of different videos.
  • if the repeated video frames that meet the continuous distribution condition among all the repeated video frames are respectively distributed in the first time period of the video to be queried and the second time period of the target video, the video in the first time period of the video to be queried and the video in the second time period of the target video are determined as repeated video segments, where the first time period includes the first play time point, the second time period includes the second play time point, and the continuous distribution condition includes that the time difference between adjacent repeated video frames is less than a third threshold.
  • the target video frame is any video frame in the video to be queried; all video frames included in the video to be queried are compared with the sample video frames of the sample videos in the library to determine whether they are repeated video frames. Whether the number of repeated video frames of the video to be queried and the target video is greater than the second threshold can then be determined from the number of video frames of the video to be queried that are repeated in the target video.
  • the continuous distribution condition may be that the time difference between adjacent repeated video frames is less than the third threshold, that is, the repeated video frames are basically continuously concentrated in the first time period of the video to be queried and the second time period of the target video.
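  • A minimal sketch of grouping repeated frames into repeated segments under the continuous distribution condition; `min_frames` plays the role of the second threshold and `max_gap_s` the third threshold, and both values are illustrative:

```python
def repeated_segments(matches, min_frames=10, max_gap_s=2.0):
    """`matches` is a list of (query_time, target_time) pairs for one target
    video. Adjacent repeated frames whose query-time gap is below `max_gap_s`
    are grouped; a group with at least `min_frames` pairs is reported as a
    repeated segment (first pair, last pair)."""
    segments, current = [], []
    for pair in sorted(matches):
        if current and pair[0] - current[-1][0] > max_gap_s:
            if len(current) >= min_frames:
                segments.append((current[0], current[-1]))
            current = []
        current.append(pair)
    if len(current) >= min_frames:
        segments.append((current[0], current[-1]))
    return segments  # each entry: ((q_start, t_start), (q_end, t_end))
```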
  • the specific method is as follows: the sample video frames of the sample videos in the library are evenly distributed to P computers, the CNN features and local features of the sample video frames allocated to each computer are extracted in parallel, and each computer uploads the CNN features and local features it obtains to the database.
  • the database should adopt a solution that can support massive amounts of data, such as cloud storage.
  • each of the P computers reads (total number of CNN features / P) CNN features, with no omission and no overlap. Clustering is performed according to shared parameters and the CNN features each computer has read, and a CNN feature index is established in the memory of each computer.
  • the CNN feature index on each computer is different.
  • when querying repeated videos, the CNN features and local features of the video frame to be queried are calculated on one or more computers and then sent to all computers. Each computer calculates, in parallel, the distances between the CNN feature of the video frame to be queried and the CNN features in each cluster indicated by its own CNN feature index, and sends the calculated distances to one computer. That computer re-sorts by distance, takes the K results with the shortest distances to determine the K first video frames, and further determines whether they are repeated video frames through matching of local features.
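  • The scatter-gather step on the collecting computer might look like the following sketch; the tuple layout of each candidate result is a hypothetical choice:

```python
import heapq

def merge_top_k(per_machine_results, k=10):
    """Each of the P computers returns its own candidates as
    (distance, video_id, frame_id) tuples computed against its local CNN
    feature index; one computer merges them and keeps the K closest."""
    all_candidates = (cand for results in per_machine_results for cand in results)
    return heapq.nsmallest(k, all_candidates)   # the K first video frames
```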
  • In the embodiments of the present invention, the distance between the reduced-dimensional CNN feature of the target video frame and the reduced-dimensional CNN features of the sample video frames is used as a first layer of filtering, and the degree of matching between the local features of the target video frame and the local features of the first video frame is then used as a second layer of screening, so that the repeated video frames of the target video frame can be detected with high accuracy.
  • FIG. 4 is a flow chart of detecting repeated video frames according to an embodiment of the present invention.
  • a screenshot of the video to be queried is taken to obtain a video frame, which is then processed by the CNN network, pooling, and dimensionality reduction to obtain the CNN feature of the video frame.
  • the nearest neighbor search and filtering operations are performed to obtain the sample video frame closest to the CNN feature of the video frame as the first video frame.
  • verification is then further performed with local features; that is, the local features of the first video frame are read from the database.
  • local feature extraction is performed on the video frame, and the local features of the video frame are obtained by sorting and filtering according to the responsivity. Calculate the matching degree between the local feature of the video frame and the local feature of the first video frame in the library. If the matching degree is greater than the threshold, further determine the video id and time point from which the first video frame comes.
  • FIG. 5 is a schematic structural diagram of a video frame processing apparatus provided in an embodiment of the present invention.
  • the video frame processing apparatus of the embodiment of the present invention may include:
  • the first acquisition module 11 is used to acquire the convolutional neural network CNN feature of the target video frame and the local feature of the target video frame.
  • the local feature of the target video frame includes the first key point of the target video frame and the feature descriptor corresponding to the first key point;
  • the dimensionality reduction processing module 12 is configured to perform dimensionality reduction processing on the CNN feature of the target video frame, and obtain the dimensionality reduction CNN feature of the target video frame;
  • the target video frame may be any video frame in the video to be queried, or the target video frame may be a single picture frame that needs to be compared. If the target video frame is a video frame in the video to be queried, the video to be queried is captured at equal intervals to generate multiple video frames, and the target video frame can be any one of the multiple video frames.
  • the screenshot interval of the video to be queried should be smaller than the screenshot interval of the sample video in the database to be compared, that is, the query video is taken at a higher frequency, such as 5 frames per second, to ensure that it can match the sample video frames in the library. More preferably, the time point of the screenshot can be randomly jittered to avoid the extreme situation where all the video frames to be queried happen to have a large interval with the sample video frames in the library.
  • the local feature of the target video frame is the feature descriptor of the key points extracted in the target video frame.
  • the extracted key points are pixels that have a large difference in pixel value from adjacent pixels, for example, the edges and corners of the image of the target video frame.
  • the CNN feature of the target video frame can be obtained by selecting a CNN network pre-trained on a large-scale general image data set (such as the ImageNet, Open Images, or ML-Images data set), inputting the target video frame into the CNN network, and pooling the feature map output by the last one or more convolutional layers to obtain the CNN feature of the target video frame.
  • the CNN feature is a floating-point vector with a fixed length and a high dimensionality (for example, 2048 dimensions).
  • a principal component analysis (PCA) matrix can be used to reduce the dimensionality of the original CNN features.
  • the method for obtaining the local feature of the target video frame may be that any one of the local feature extractors such as SIFT, SURF, ORB, AKAZE, BRISK, etc. can be used to extract the local features.
  • BRISK has relatively high accuracy and speed.
  • the local features extracted using default parameters contain a large number of key points and corresponding feature descriptors.
  • the key points can be sorted by response, and only the dozens of key points with the highest response and the corresponding feature descriptors are retained, and one key point corresponds to a feature descriptor.
  • the detection threshold can be lowered one or more times until enough key points are detected.
  • the second acquisition module 13 is configured to acquire a first video frame from a plurality of sample video frames, where the distance between the reduced-dimensional CNN feature of the first video frame and the reduced-dimensional CNN feature of the target video frame meets a first preset condition;
  • the second acquisition module may include a first acquisition unit, a first calculation unit, and a second calculation unit;
  • the first acquiring unit is configured to acquire a CNN feature index of a sample video, where the sample video includes the plurality of sample video frames, and the CNN feature index is used to represent a plurality of clusters formed by clustering the reduced-dimensional CNN features of the plurality of sample video frames;
  • the first calculation unit is used to calculate the distance between the reduced-dimensional CNN feature of the target video frame and the cluster center of each cluster in the plurality of clusters, and to use the cluster corresponding to the closest cluster center as the target cluster;
  • the second calculation unit is used to calculate the distance between the reduced-dimensional CNN feature of the target video frame and the reduced-dimensional CNN feature of each sample video frame in the at least one sample video frame included in the target cluster, and to take the sample video frame corresponding to the closest reduced-dimensional CNN feature as the first video frame.
  • the K sample video frames whose reduced-dimensional CNN features are closest to the reduced-dimensional CNN feature of the target video frame are selected from the sample video frames already stored in the database, and the first video frame may be any one of the K sample video frames.
  • the sample video frames stored in the database are sample video frames of multiple sample videos.
  • the reduced-dimensional CNN features of the K sample video frames are the closest to the reduced-dimensional CNN feature of the target video frame; that is, the K distances between the reduced-dimensional CNN features of the K sample video frames and the reduced-dimensional CNN feature of the target video frame rank in the top K positions when the distances between the reduced-dimensional CNN features of all sample video frames and the reduced-dimensional CNN feature of the target video frame are sorted in ascending order.
  • the selection method of selecting the closest K sample video frames may be to first obtain the CNN feature index of the sample video.
  • the sample video includes all the sample videos in the library; that is, the CNN feature index can be generated from the reduced-dimensional CNN features of the sample video frames of all the sample videos in the library.
  • the CNN feature index is a structure that represents multiple clusters formed by clustering the reduced-dimensional CNN features of the sample video frames of all sample videos in the library; each cluster contains the cluster center and the reduced-dimensional CNN feature of at least one sample video frame in the cluster. The distance between the reduced-dimensional CNN feature of the target video frame and the cluster center of each of these clusters is then calculated, and the cluster corresponding to the closest cluster center is used as the target cluster.
  • the target cluster can be one cluster or multiple clusters; if multiple clusters are used, they can be the clusters whose cluster centers are closest to the reduced-dimensional CNN feature of the target video frame. Finally, the distance between the reduced-dimensional CNN feature of the target video frame and the reduced-dimensional CNN feature of each sample video frame contained in the target cluster is calculated, and any one of the closest K sample video frames is taken as the first video frame; that is, there are K first video frames.
  • alternatively, the stored target cluster can be dynamically read from the database, the distances to the original CNN features of the at least one sample video frame it contains are calculated and sorted, and the closest K sample video frames are obtained as the K first video frames.
  • the K first video frames can also be filtered, for example, if the distance exceeds a certain threshold, they are directly eliminated.
  • the third acquisition module 14 is configured to acquire the local features of the first video frame, and the local features of the first video frame include a second key point in the first video frame and a second key point corresponding to the second key point Feature descriptor;
  • the method for acquiring the local features of the first video frame may be to extract the local features of the first video frame from the database established in step S21.
  • the corresponding local feature may be searched from the database according to the video id to which the first video frame belongs and the id of the first video frame.
  • the local features of the first video frame include the second key point extracted in the first video frame and the second feature descriptor corresponding to the second key point.
  • the second key point extracted in the first video frame may be a pixel in the first video frame whose pixel value differs greatly from those of adjacent pixels, such as a corner in the first video frame.
  • the calculation module 15 is configured to calculate the degree of matching between the local features of the first video frame and the local features of the target video frame;
  • the calculation module 15 may include a second acquiring unit, a sorting unit, a third acquiring unit, a first determining unit, and a second determining unit;
  • the second acquiring unit is configured to acquire, for each of the first feature descriptors, n distances between each second feature descriptor of the n second feature descriptors and the first feature descriptor ;
  • a sorting unit configured to sort the n distances in a descending order to form a sorting queue
  • the third acquiring unit is configured to acquire the last k distances sorted in the sorting queue, where k is a natural number greater than or equal to 2;
  • a first determining unit configured to determine, according to the k distances, that the first feature descriptor is a valid descriptor matching the first video frame
  • the second determining unit is configured to determine the degree of matching between the local feature of the first video frame and the local feature of the target video frame according to the number of valid descriptors in the m first feature descriptors.
  • the first determining module 16 is configured to use the first video frame as a repeated video frame of the target video frame if the matching degree meets a second preset condition.
  • the first determining module may include a fourth acquiring unit and a third determining unit;
  • the fourth acquiring unit is configured to acquire the frame identifier of the first video frame
  • the third determining unit is configured to find the target video, among the plurality of videos, to which the frame identifier of the first video frame corresponds and the second play time point of the target video corresponding to that frame identifier, and to determine the video frame at the first play time point in the video to be queried and the video frame at the second play time point in the target video as repeated video frames.
  • if the degree of matching between the local feature of the first video frame and the local feature of the target video frame is greater than a first threshold, it is determined that the first video frame and the target video frame are repeated video frames.
  • the matching degree between the local feature of the first video frame and the local feature of the target video frame can be calculated as follows. For example, the local feature of the target video frame includes m first feature descriptors corresponding to m first key points, with one first key point corresponding to one first feature descriptor.
  • for each first feature descriptor Ai, the n distances between Ai and each of the n second feature descriptors Bj are obtained. The n distances are sorted from largest to smallest to form a sorting queue, and the last k distances in the sorting queue, that is, the k smallest distances, are obtained.
  • according to the k distances, it is determined whether the first feature descriptor Ai is a valid descriptor matching the first video frame.
  • according to the number of valid descriptors among the m first feature descriptors, the degree of matching between the local features of the first video frame and the local features of the target video frame is determined.
  • since there are K first video frames, that is, the target video frame to be queried may be a repeated video frame of multiple sample video frames, the playback time point at which each repeated video frame is located can be further determined, as well as which sample video the repeated video frame comes from.
  • if the target video frame is the video frame at the first playback time point of the video to be queried, the frame identifier of the first video frame can be obtained, and the second playback time point of the target video corresponding to the frame identifier of the first video frame can be found from the database.
  • the K first video frames may correspond to different playback time points of different videos.
  • the CNN feature and local feature of the target video frame to be queried are first acquired; then the sample video frame whose CNN feature is closest to the CNN feature of the target video frame is selected from the plurality of sample video frames as the first video frame; the local features of the first video frame are then obtained; and finally the matching degree between the local features of the first video frame and the local features of the target video frame is calculated. If the matching degree is greater than the first threshold, it is determined that the first video frame and the target video frame are repeated video frames. In this way, a first level of screening is performed through CNN features and a second level of screening is performed through local feature matching, so that whether the target video frame and a sample video frame are duplicate video frames can be determined with high accuracy.
  • FIG. 6 is a schematic structural diagram of another video frame processing apparatus provided by an embodiment of the present invention.
  • the video frame processing apparatus provided by this embodiment of the present invention includes: a first acquisition module 21, a dimensionality reduction processing module 22, a second acquisition module 23, a third acquisition module 24, a calculation module 25, a first determination module 26, a fourth acquisition module 27, a dimensionality reduction processing module 28, a clustering module 29, a quantization compression module 30, a generation module 31, and a second determination module 32. For the first acquisition module 21, the dimensionality reduction processing module 22, the second acquisition module 23, the third acquisition module 24, the calculation module 25, and the first determination module 26, please refer to the description of the embodiment in FIG. 5; the details are not repeated here.
  • the fourth acquisition module 27 is configured to acquire the CNN features of the multiple sample video frames
  • the CNN feature index is generated based on the sample video frames contained in all sample videos in the library, where screenshots of the videos in the library are taken at equal intervals to generate the corresponding sample video frames.
  • the interval of the screenshots depends on the desired time resolution. For example, if you need to detect repeats of 5s and above, the interval between screenshots should be less than 5s.
  • a CNN network pre-trained on a large-scale general image data set (the ImageNet, Open Images, or ML-Images data set) is selected.
  • each extracted sample video frame is input into the CNN network, and the feature map output by the last one or more convolutional layers is pooled to obtain the original CNN feature of the video frame.
  • the original CNN feature is a fixed-length floating-point vector with a high dimensionality (for example, 2048 dimensions).
  • the local features of the sample video frames of all the sample videos in the library can be extracted separately, and one of the local feature extractors such as SIFT, SURF, ORB, AKAZE, BRISK, etc. can be selected for local feature extraction.
  • BRISK is relatively high in accuracy and speed.
  • the local features extracted with default parameters contain a large number of key points and corresponding feature descriptors.
  • the key points can be sorted by response, and only the dozens of key points with the highest response and the corresponding feature descriptors are kept. More preferably, for some video frames with relatively smooth texture, too few key points may be detected, and the detection threshold can be lowered one or more times until sufficient key points and corresponding feature descriptors are detected.
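A minimal sketch of this keypoint selection, assuming OpenCV's BRISK implementation; the starting threshold, the step size used when lowering it, and the number of key points kept are illustrative values rather than figures from the text.

```python
import cv2

def brisk_local_features(frame_bgr, max_keypoints=64, min_keypoints=8):
    """Detect BRISK key points, keeping only the strongest responses.

    The detection threshold is lowered step by step for smooth-textured frames.
    """
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    thresh = 30                      # OpenCV's default BRISK threshold
    keypoints, descriptors = [], None
    while thresh >= 5:
        brisk = cv2.BRISK_create(thresh=thresh)
        keypoints, descriptors = brisk.detectAndCompute(gray, None)
        if len(keypoints) >= min_keypoints:
            break
        thresh -= 5                  # too few key points: lower threshold, retry
    # Sort by response and keep only the strongest few dozen key points.
    order = sorted(range(len(keypoints)),
                   key=lambda i: keypoints[i].response, reverse=True)[:max_keypoints]
    kept_descriptors = descriptors[order] if descriptors is not None else None
    return [keypoints[i] for i in order], kept_descriptors
```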
  • a database can be established according to the video id and video frame id of the sample videos, and the (video id, video frame id, video frame time point, CNN feature, local feature) information tuple of each sample video frame is uploaded to the database.
  • the CNN feature is a 2048-dimensional single-precision floating-point vector (occupying 8 KB), and the local feature consists of 128 BRISK feature descriptors (about 128 × 64 bytes = 8 KB).
  • assuming a total video length of 100,000 hours with one screenshot every 5 seconds, the database size is about 1.07 TB and can be deployed on a single machine.
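For illustration only, the sketch below stores the (video id, video frame id, time point, CNN feature, local feature) tuples in SQLite; the storage backend, table name, and column types are assumptions, since the text does not specify the database used.

```python
import sqlite3
import numpy as np

conn = sqlite3.connect("frame_features.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS frame_features (
        video_id      TEXT,
        frame_id      INTEGER,
        time_point    REAL,
        cnn_feature   BLOB,   -- 2048 float32 values, ~8 KB
        local_feature BLOB,   -- packed BRISK descriptors, ~8 KB
        PRIMARY KEY (video_id, frame_id)
    )
""")

def upload_tuple(video_id, frame_id, time_point, cnn_feature, descriptors):
    """Store one (video id, frame id, time point, CNN feature, local feature) tuple."""
    conn.execute(
        "INSERT OR REPLACE INTO frame_features VALUES (?, ?, ?, ?, ?)",
        (video_id, frame_id, time_point,
         np.asarray(cnn_feature, dtype=np.float32).tobytes(),
         np.asarray(descriptors, dtype=np.uint8).tobytes()),
    )
    conn.commit()
```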
  • FIG. 3 is a schematic diagram of CNN feature and local feature extraction provided by an embodiment of the present invention.
  • a screenshot is taken of the video in the library to obtain a sample video frame, and the CNN feature map is further calculated through the CNN network.
  • the CNN feature map is further pooled to obtain the CNN features of the sample video frame.
  • local feature extraction is performed on the sample video frame to obtain the local feature of the sample video frame.
  • the local features are screened to obtain the screened local features of the sample video frame, and the CNN features and local features of the sample video frames are uploaded to the database for storage.
  • the dimensionality reduction processing module 28 is configured to perform dimensionality reduction processing on the CNN features of the multiple sample video frames by using the PCA matrix of principal component analysis to obtain the low-dimensional CNN features of the multiple sample video frames;
  • the original CNN features of all or part of the sample video frames are read from the database and loaded into memory, and the original CNN features loaded into memory are used to train the PCA matrix, which reduces the CNN feature dimensionality while retaining the original information as much as possible.
  • optionally, by taking the square root of the eigenvalues of the PCA matrix and then their reciprocals, the PCA matrix also carries a data whitening effect.
  • after the PCA matrix is trained, the original CNN features of all sample video frames in the database are loaded into memory at one time or in batches, and the trained PCA matrix is used to reduce the dimensionality of the original CNN features of all sample video frames, obtaining the reduced-dimension CNN features of all sample video frames.
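A minimal sketch of this dimensionality reduction step, assuming scikit-learn's PCA with whitening as a stand-in for the whitened PCA matrix described above; the target dimensionality of 256 matches the example given later in the text.

```python
import numpy as np
from sklearn.decomposition import PCA

def train_pca(original_features: np.ndarray, dim: int = 256) -> PCA:
    """Train a whitening PCA on original CNN features loaded into memory.

    original_features: array of shape (num_frames, 2048).
    """
    return PCA(n_components=dim, whiten=True).fit(original_features)

def reduce_features(pca: PCA, original_features: np.ndarray) -> np.ndarray:
    """Apply the trained PCA matrix to obtain reduced-dimension CNN features."""
    return pca.transform(original_features).astype(np.float32)
```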
  • the clustering module 29 is configured to perform k-means clustering on the reduced-dimensional CNN features of the multiple sample video frames to form multiple clusters, each of the clusters containing at least one reduced-dimensional CNN feature of the sample video frames ;
  • k-means clustering is performed on the reduced-dimension CNN features of all sample video frames, based on Euclidean distance or cosine distance, to obtain N clusters and the corresponding cluster centers. Each cluster contains the reduced-dimension CNN features of at least one sample video frame.
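The clustering step could look like the following sketch, assuming scikit-learn's KMeans with Euclidean distance; the number of clusters N is a tuning parameter and the value used here is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_features(reduced_features: np.ndarray, n_clusters: int = 1024):
    """k-means over reduced-dimension CNN features; returns labels and centers."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=4, random_state=0)
    labels = kmeans.fit_predict(reduced_features)
    return labels, kmeans.cluster_centers_
```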
  • the quantization compression module 30 is configured to quantize and compress the dimensionality reduction CNN features of at least one sample video frame included in each of the clusters, to obtain compressed CNN features corresponding to the clusters;
  • the dimensionality reduction CNN features of the sample video frames included in each cluster are quantized and compressed.
  • scalar quantization or product quantization can be used to further compress the reduced-dimension CNN features of the sample video frames contained in each cluster. For example, if scalar quantization is used, each dimension of the reduced-dimension CNN feature of a sample video frame can be compressed from a 4-byte floating-point number to a 1-byte integer.
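A minimal sketch of per-dimension scalar quantization, compressing each 4-byte float to a 1-byte integer; the min/max-based quantizer is one common choice and is an assumption, since the text does not fix the exact scheme.

```python
import numpy as np

def scalar_quantize(features: np.ndarray):
    """Compress each dimension from a 4-byte float to a 1-byte integer."""
    lo, hi = features.min(axis=0), features.max(axis=0)
    scale = np.where(hi > lo, (hi - lo) / 255.0, 1.0)
    codes = np.round((features - lo) / scale).astype(np.uint8)
    return codes, lo, scale            # keep lo/scale so features can be restored

def dequantize(codes: np.ndarray, lo: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Approximately reconstruct the reduced-dimension CNN features."""
    return codes.astype(np.float32) * scale + lo
```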
  • the generating module 31 is configured to generate a CNN feature index of the sample video according to the multiple clusters, the compressed CNN feature corresponding to each cluster, and the cluster center of each cluster.
  • a CNN feature index of the sample videos in the library can be generated from the N clusters formed by k-means clustering, the cluster centers of the N clusters, and the compressed CNN features corresponding to each cluster.
  • the CNN feature index is a structure through which the above N clusters, the cluster centers of the N clusters, and the compressed CNN features corresponding to each cluster can be obtained. Assuming a total sample video length of 100,000 hours with one screenshot every 5 seconds, PCA reduction to 256 dimensions, and scalar quantization, the final index size is about 100,000 × 60 × 12 × 256 bytes ≈ 17 GB, which can be built in the memory of a single machine.
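A possible in-memory layout of such an index, together with the corresponding coarse-to-fine lookup, is sketched below, reusing the `dequantize` helper from the quantization sketch above; all field names are illustrative assumptions rather than a structure defined by the text.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CnnFeatureIndex:
    centers: np.ndarray          # (N, dim) cluster centers
    cluster_codes: list          # per cluster: uint8 codes, one row per member frame
    cluster_members: list        # per cluster: list of (video_id, frame_id)
    lo: np.ndarray               # scalar-quantization offsets
    scale: np.ndarray            # scalar-quantization scales

def search_index(index: CnnFeatureIndex, query: np.ndarray, top_k: int = 10):
    """Pick the nearest cluster center, then rank that cluster's member frames."""
    cluster = int(np.argmin(np.linalg.norm(index.centers - query, axis=1)))
    # dequantize() is defined in the scalar-quantization sketch above.
    feats = dequantize(index.cluster_codes[cluster], index.lo, index.scale)
    dists = np.linalg.norm(feats - query, axis=1)
    order = np.argsort(dists)[:top_k]
    members = index.cluster_members[cluster]
    return [(members[i], float(dists[i])) for i in order]
```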
  • the second determining module 32 is configured to: if the number of repeated video frames between the video to be queried and the target video is greater than a second threshold, and the repeated video frames that meet a continuous distribution condition among all the repeated video frames are respectively distributed in a first time period of the video to be queried and a second time period of the target video, determine the video in the first time period of the video to be queried and the video in the second time period of the target video to be repeated video segments, where the first time period includes the first playback time point, the second time period includes the second playback time point, and the continuous distribution condition includes that the time difference between adjacent repeated video frames is less than a third threshold.
  • the target video frame is any video frame in the video to be queried; all video frames contained in the video to be queried are compared with the sample video frames of the sample videos in the library to determine whether they are repeated video frames. Whether the number of repeated video frames between the video to be queried and the target video is greater than the second threshold can be determined from the number of video frames in the video to be queried that are identified as repeated video frames; for example, if 100 video frames of the video to be queried are repeated with sample video frames in the library, the number of repeated video frames is 100.
  • the continuous distribution condition may be that the time difference between adjacent repeated video frames is less than the third threshold, that is, the repeated video frames are basically continuously concentrated in the first time period of the video to be queried and the second time period of the target video.
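The segment-level decision could be sketched as follows; `min_frames` plays the role of the second threshold and `max_gap` the role of the third threshold, and both values, as well as the input format, are assumptions made for illustration.

```python
def repeated_segments(matched_pairs, min_frames=10, max_gap=6.0):
    """Group repeated frames of one target video into repeated video segments.

    matched_pairs: list of (query_time, target_time) for repeated frames,
    sorted by query_time. min_frames acts as the second threshold, max_gap
    as the third threshold (seconds between adjacent repeated frames).
    """
    if len(matched_pairs) <= min_frames:
        return []
    segments, start = [], 0
    for i in range(1, len(matched_pairs) + 1):
        run_ends = (i == len(matched_pairs) or
                    matched_pairs[i][0] - matched_pairs[i - 1][0] >= max_gap)
        if run_ends:
            run = matched_pairs[start:i]
            if len(run) > 1:
                # ((first, last) query time, (first, last) target time)
                segments.append(((run[0][0], run[-1][0]), (run[0][1], run[-1][1])))
            start = i
    return segments
```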
  • the distance between the reduced-dimension CNN feature of the target video frame and the reduced-dimension CNN features of the sample video frames is used as the first layer of screening, and the matching degree between the local features of the target video frame and the local features of the first video frame is then used as the second layer of screening, so that repeated video frames of the target video frame can be detected accurately, with high accuracy.
  • the embodiment of the present invention also provides a computer storage medium.
  • the computer storage medium may store multiple instructions, and the instructions are suitable for being loaded by a processor to execute the method steps of the embodiment shown in FIG. 1; for the specific execution process, reference may be made to the detailed description of the embodiment shown in FIG. 1, which is not repeated here.
  • the video frame processing apparatus 1000 may include: at least one processor 1001 (such as a CPU), at least one communication interface 1003, a memory 1004, and at least one communication bus 1002.
  • the communication bus 1002 is used to implement connection and communication between these components.
  • the communication interface 1003 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface).
  • the memory 1004 may be a high-speed RAM memory, or a non-volatile memory, such as at least one disk memory.
  • the memory 1004 may also be at least one storage device located far away from the foregoing processor 1001.
  • the memory 1004 as a computer storage medium may include an operating system, a network communication module, and program instructions.
  • the processor 1001 may be used to load program instructions stored in the memory 1004, and specifically perform the following operations:
  • the local feature of the target video frame includes a first key point of the target video frame and a feature descriptor corresponding to the first key point;
  • the local feature of the first video frame includes a second key point in the first video frame and a feature descriptor corresponding to the second key point;
  • the acquiring the convolutional neural network CNN feature of the target video frame includes:
  • the target video frame is input to a CNN network for processing, and the CNN feature of the target video frame is obtained.
  • the acquiring the first video frame from multiple sample video frames includes:
  • obtain a CNN feature index of a sample video, the sample video including the plurality of sample video frames, where the CNN feature index is used to represent a plurality of clusters formed by clustering the reduced-dimension CNN features of the plurality of sample video frames, each of the clusters including a cluster center and the reduced-dimension CNN feature of at least one sample video frame in the cluster;
  • the local features of the target video frame include m first key points and m first feature descriptors corresponding to the m first key points, one first key point corresponding to one first feature descriptor;
  • the local features of the first video frame include n second key points and n second feature descriptors corresponding to the n second key points, one second key point corresponding to one second feature descriptor;
  • the m is a natural number greater than or equal to 2, and the n is a natural number greater than or equal to 2;
  • the calculating the matching degree between the local feature of the first video frame and the local feature of the target video frame includes: for each first feature descriptor, acquiring the n distances between that first feature descriptor and each of the n second feature descriptors; sorting the n distances in descending order to form a sorted queue; taking the last k distances in the sorted queue, k being a natural number greater than or equal to 2; and determining, according to the k distances, whether the first feature descriptor is a valid descriptor matching the first video frame;
  • the degree of matching between the local features of the first video frame and the local features of the target video frame is determined according to the number of valid descriptors among the m first feature descriptors.
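A minimal sketch of this matching-degree computation, following the ratio test given in the detailed description (a first descriptor is valid when its nearest distance D1 satisfies D1 < r·D2 for the second-nearest distance D2, with r between 0.6 and 0.9, and matching degree = number of valid descriptors / max(m, n)); the use of Hamming distance for binary BRISK descriptors is an assumption.

```python
import numpy as np

def matching_degree(desc_a: np.ndarray, desc_b: np.ndarray, r: float = 0.8) -> float:
    """desc_a: m descriptors of the target frame, desc_b: n descriptors of the
    first video frame (both uint8 BRISK descriptors)."""
    m, n = len(desc_a), len(desc_b)
    if m < 2 or n < 2:
        return 0.0
    # Pairwise Hamming distances between the two binary descriptor sets.
    a_bits = np.unpackbits(desc_a[:, None, :], axis=2)
    b_bits = np.unpackbits(desc_b[None, :, :], axis=2)
    dists = (a_bits != b_bits).sum(axis=2)           # shape (m, n)
    valid = 0
    for row in dists:
        d1, d2 = np.partition(row, 1)[:2]            # nearest and second-nearest
        if d1 < r * d2:
            valid += 1                               # A_i is a valid descriptor
    return valid / max(m, n)
```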
  • before acquiring the convolutional neural network CNN features and local features of the target video frame, the method further includes:
  • Dimensionality reduction processing is performed on the CNN features of the multiple sample video frames by using the PCA matrix of principal component analysis to obtain the dimensionality reduction CNN features of the multiple sample video frames;
  • a CNN feature index of the sample video is generated according to the multiple clusters, the compressed CNN feature corresponding to each cluster, and the cluster center of each cluster.
  • the target video frame belongs to the video to be queried
  • the playback time point of the target video frame in the video to be queried is the first playback time point
  • the first video frame belongs to the target video among the sample videos
  • the using the first video frame as the repeated video frame of the target video frame includes:
  • processor 1001 may also be used to load program instructions stored in the memory 1004 to perform the following operations:
  • if the number of repeated video frames of the video to be queried and the target video is greater than a second threshold, and the repeated video frames that meet the continuous distribution condition are respectively distributed in the first time period of the video to be queried and the second time period of the target video, the video in the first time period of the video to be queried and the video in the second time period of the target video are determined as repeated video segments, and
  • the first time period includes the first play time point
  • the second time period includes the second play time point
  • the continuous distribution condition includes that the time difference between adjacent repeated video frames is less than a third threshold.
  • a person of ordinary skill in the art can understand that all or part of the processes of the above method embodiments can be implemented by a computer program instructing related hardware; the program can be stored in a computer-readable storage medium and, when executed, includes the processes of the above-mentioned method embodiments.
  • the storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Library & Information Science (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Image Analysis (AREA)

Abstract

A video frame processing method and apparatus. The video frame processing method includes: acquiring a CNN feature of a target video frame and a local feature of the target video frame (S101); performing dimensionality reduction on the CNN feature of the target video frame to obtain a reduced-dimension CNN feature of the target video frame (S102); acquiring a first video frame from multiple sample video frames (S103), the distance between the reduced-dimension CNN feature of the first video frame and the reduced-dimension CNN feature of the target video frame meeting a first preset condition; acquiring a local feature of the first video frame (S104); calculating a matching degree between the local feature of the first video frame and the local feature of the target video frame (S105); and, if the matching degree meets a second preset condition, using the first video frame as a repeated video frame of the target video frame (S106). The method and apparatus can improve the accuracy of repeated video frame detection.

Description

视频帧处理方法及装置 技术领域
本发明涉及互联网技术领域,尤其涉及一种视频帧处理方法及装置。
背景技术
随着信息技术的发展,多媒体技术也应运而生,用户可以通过互联网观看各种视频网站发布的视频,同时用户也可以向视频网站上传视频。
在实现本发明过程中,发明人发现现有视频网站的视频库中存储了海量的视频。为了避免重复视频存储在视频库中,通常会对重复视频进行检测,而对重复视频的检测过程中,重复视频帧的检测就显得尤其重要。
发明内容
本发明实施例提供一种视频帧处理方法及装置,可以实现重复视频帧的检测,并且准确度高。
第一方面,本发明实施例提供了一种视频帧处理方法,包括:
获取目标视频帧的卷积神经网络CNN特征和所述目标视频帧的局部特征,所述目标视频帧的局部特征包括所述目标视频帧的第一关键点和与所述第一关键点对应的特征描述子;
对所述目标视频帧的CNN特征进行降维处理,获取所述目标视频帧的降维CNN特征;
从多个样本视频帧中获取第一视频帧,所述第一视频帧的降维CNN特征与所述目标视频帧的降维CNN特征之间的距离符合第一预设条件;
获取所述第一视频帧的局部特征,所述第一视频帧的局部特征包括所述第一视频帧中的第二关键点和与所述第二关键点对应的特征描述子;
计算所述第一视频帧的局部特征和所述目标视频帧的局部特征之间的匹配度;
若所述匹配度符合第二预设条件,将所述第一视频帧作为所述目标视频帧的重复视频帧。
在一种可能的设计中,所述获取目标视频帧的卷积神经网络CNN特征,包括:
获取待查询视频;
对所述待查询视频进行等间隔视频截图,获得目标视频帧,所述目标视频帧为对所述待查询视频进行等间隔视频截图得到的多个视频帧中的任意一个视频帧;
将所述目标视频帧输入CNN网络进行处理,获得所述目标视频帧的CNN特征。
在一种可能的实现方式中,所述从多个样本视频帧中获取第一视频帧,包括:
获取样本视频的CNN特征索引,所述样本视频包括所述多个样本视频帧,所述CNN特征索引用于表示根据所述多个样本视频帧的降维CNN特征聚类形成的多个聚类,每个所述聚类包含聚类中心和所述聚类中的至少一个样本视频帧的降维CNN特征;
计算所述目标视频帧的降维CNN特征和所述多个聚类中每个聚类的聚类中心之间的距离,并将距离最近的聚类中心所对应的聚类作为目标聚类;
计算所述目标视频帧的降维CNN特征和所述目标聚类包含的至少一个样本视频帧中每个样本视频帧的降维CNN特征之间的距离,并将距离最近的降维CNN特征所对应的样本视频帧作为第一视频帧。
在一种可能的实现方式中,所述目标视频帧的局部特征包括m个第一关键点以及和所述m个关键点对应的m个第一特征描述子,一个所述第一关键点对应一个所述第一特征描述子,所述第一视频帧的局部特征包含n个第二关键点以及和所述n个第二关键点对应的n个第二特征描述子,一个所述第二关键点对应一个所述第二特征描述子,所述m为大于或者等于2的自然数,所述n为大于或者等于2的自然数;
所述计算所述第一视频帧的局部特征和所述目标视频帧的局部特征之间的匹配度,包括:
针对每个所述第一特征描述子,获取所述n个第二特征描述子中每个第二特征描述子与所述第一特征描述子之间的n个距离;
按照从大到小的先后顺序,将所述n个距离排序形成排序队列;
获取所述排序队列中排序在最后的k个距离,所述k为大于或者等于2的自然数;
根据所述k个距离,确定所述第一特征描述子是与所述第一视频帧匹配的有效描述子;
根据所述m个第一特征描述子中是有效描述子的数量,确定所述第一视频帧的局部特征和所述目标视频帧的局部特征之间的匹配度。
在一种可能的实现方式中,所述获取目标视频帧的卷积神经网络CNN特征和局部特征之前,还包括:
获取所述多个样本视频帧的CNN特征;
采用主成分分析PCA矩阵对所述多个样本视频帧的CNN特征进行降维处理,获得所述多个样本视频帧的降维CNN特征;
对所述多个样本视频帧的降维CNN特征进行k-means聚类,形成多个聚类,每个所述聚类包含至少一个样本视频帧的降维CNN特征;
对每个所述聚类包含的至少一个样本视频帧的降维CNN特征进行量化压缩,获得所述聚类对应的压缩后的CNN特征;
根据所述多个聚类、每个所述聚类对应的压缩后的CNN特征以及每个所述聚类的聚类中心,生成样本视频的CNN特征索引。
在一种可能的实现方式中,所述目标视频帧属于待查询视频,且所述目标视频帧在所述待查询视频中的播放时间点为第一播放时间点,所述第一视频帧属于样本视频中的目标视频,所述将所述第一视频帧作为所述目标视频帧的重复视频帧,包括:
获取所述第一视频帧的帧标识;
查找所述第一视频帧的帧标识对应于所述目标视频的第二播放时间点,并将所述目标视频中第二播放时间点的视频帧作为所述待查询视频中第一播放时间点的视频帧的重复视频帧。
在一种可能的实现方式中,所述方法还包括,
若所述待查询视频与所述目标视频所有重复视频帧的帧数量大于第二阈值,且所述所有重复视频帧中满足连续分布条件的重复视频帧分别分布于所述待查询视频的第一时间段 和所述目标视频的第二时间段,将所述待查询视频中所述第一时间段的视频和所述目标视频中所述第二时间段的视频确定为重复视频段,所述第一时间段包括所述第一播放时间点,所述第二时间段包括所述第二播放时间点,所述连续分布条件包括相邻重复视频帧的时间差小于第三阈值。
第二方面,本发明实施例提供一种视频帧处理装置,包括:
第一获取模块,用于获取目标视频帧的卷积神经网络CNN特征和所述目标视频帧的局部特征,所述目标视频帧的局部特征包括所述目标视频帧的第一关键点和与所述第一关键点对应的特征描述子;
降维处理模块,用于对所述目标视频帧的CNN特征进行降维处理,获取所述目标视频帧的降维CNN特征;
第二获取模块,用于从多个样本视频帧中获取第一视频帧,所述第一视频帧的降维CNN特征与所述目标视频帧的降维CNN特征之间的距离符合第一预设条件;
第三获取模块,用于获取所述第一视频帧的局部特征,所述第一视频帧的局部特征包括所述第一视频帧中的第二关键点和与所述第二关键点对应的特征描述子;
计算模块,用于计算所述第一视频帧的局部特征和所述目标视频帧的局部特征之间的匹配度;
第一确定模块,用于若所述匹配度符合第二预设条件,将所述第一视频帧作为所述目标视频帧的重复视频帧。
在一种可能的实现方式中,所述第一获取模块具体用于获取待查询视频;对所述待查询视频进行等间隔视频截图,获得目标视频帧,所述目标视频帧为对所述待查询视频进行等间隔视频截图得到的多个视频帧中的任意一个视频帧;将所述目标视频帧输入CNN网络进行处理,获得所述目标视频帧的CNN特征。
在一种可能的设计中,所述第二获取模块包括:
第一获取单元,用于获取样本视频的CNN特征索引,所述样本视频包括所述多个样本视频帧,所述CNN特征索引用于表示根据所述多个样本视频帧的降维CNN特征聚类形成的多个聚类,每个所述聚类包含聚类中心和所述聚类中的至少一个样本视频帧的降维CNN特征;
第一计算单元,用于计算所述目标视频帧的降维CNN特征和所述多个聚类中每个聚类的聚类中心之间的距离,并将距离最近的聚类中心所对应的聚类作为目标聚类;
第二计算单元,用于计算所述目标视频帧的降维CNN特征和所述目标聚类包含的至少一个样本视频帧中每个样本视频帧的降维CNN特征之间的距离,并将距离最近的降维CNN特征所对应的样本视频帧作为第一视频帧。
在一种可能的实现方式中,所述目标视频帧的局部特征包括m个第一关键点以及和所述m个关键点对应的m个第一特征描述子,一个所述第一关键点对应一个所述第一特征描述子,所述第一视频帧的局部特征包含n个第二关键点以及和所述n个第二关键点对应的n个第二特征描述子,一个所述第二关键点对应一个所述第二特征描述子,所述m为大于或者等于2的自然数,所述n为大于或者等于2的自然数;所述计算模块包括:
第二获取单元,用于针对每个所述第一特征描述子,获取所述n个第二特征描述子中 每个第二特征描述子与所述第一特征描述子之间的n个距离;
排序单元,用于按照从大到小的先后顺序,将所述n个距离排序形成排序队列;
第三获取单元,用于获取所述排序队列中排序在最后的k个距离,所述k为大于或者等于2的自然数;
第一确定单元,用于根据所述k个距离,确定所述第一特征描述子是与所述第一视频帧匹配的有效描述子;
第二确定单元,用于根据所述m个第一特征描述子中是有效描述子的数量,确定所述第一视频帧的局部特征和所述目标视频帧的局部特征之间的匹配度。
在一种可能的实现方式中,所述装置还包括:
第四获取模块,用于获取所述多个样本视频帧的CNN特征;
降维处理模块,用于采用主成分分析PCA矩阵对所述多个样本视频帧的CNN特征进行降维处理,获得所述多个样本视频帧的降维CNN特征;
聚类模块,用于对所述多个样本视频帧的降维CNN特征进行k-means聚类,形成多个聚类,每个所述聚类包含至少一个样本视频帧的降维CNN特征;
量化压缩模块,用于对每个所述聚类包含的至少一个样本视频帧的降维CNN特征进行量化压缩,获得所述聚类对应的压缩后的CNN特征;
生成模块,用于根据所述多个聚类、每个所述聚类对应的压缩后的CNN特征以及每个所述聚类的聚类中心,生成样本视频的CNN特征索引。
在一种可能的实现方式中,所述目标视频帧属于待查询视频,且所述目标视频帧在所述待查询视频中的播放时间点为第一播放时间点,所述第一视频帧属于样本视频中的目标视频,所述第一确定模块包括:
第四获取单元,用于获取所述第一视频帧的帧标识;
第三确定单元,用于查找所述第一视频帧的帧标识对应于所述目标视频的第二播放时间点,并将所述目标视频中第二播放时间点的视频帧作为所述待查询视频中第一播放时间点的视频帧的重复视频帧。
在一种可能的实现方式中,所述装置还包括:
第二确定模块,用于若所述待查询视频与所述目标视频所有重复视频帧的帧数量大于第二阈值,且所述所有重复视频帧中满足连续分布条件的重复视频帧分别分布于所述待查询视频的第一时间段和所述目标视频的第二时间段,将所述待查询视频中所述第一时间段的视频和所述目标视频中所述第二时间段的视频确定为重复视频段,所述第一时间段包括所述第一播放时间点,所述第二时间段包括所述第二播放时间点,所述连续分布条件包括相邻重复视频帧的时间差小于第三阈值。
第三方面,本发明实施例提供一种视频帧处理装置,所述视频帧处理装置包括处理器和存储器;
所述处理器和存储器相连,其中,所述存储器用于存储程序代码,所述处理器用于调用所述程序代码,以执行第一方面所述的方法。
第四方面,本发明实施例提供一种计算机存储介质,其特征在于,所述计算机存储介质存储有计算机程序,所述计算机程序包括程序指令,所述程序指令当被处理器执行时, 执行第一方面所述的方法。
本发明实施例中,通过目标视频帧的降维CNN特征与样本视频帧之间的降维CNN特征之间的距离作第一层筛选,再进一步通过目标视频帧的局部特征与第一视频帧的局部特征之间的匹配度作第二层筛选,从而准确的检测到目标视频帧的重复视频帧,准确度高。
附图说明
为了说明本发明实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍。
图1为本发明实施例提供的一种视频帧处理方法的流程图;
图2为本发明实施例提供的一种CNN特征索引的生成流程示意图;
图3为本发明实施例提供的一种提取视频帧双重特征的流程图;
图4为本发明实施例提供的一种检索重复视频的示意图;
图5为本发明实施例提供的一种视频帧处理装置的结构示意图;
图6为本发明实施例提供的另一种视频帧处理装置的结构示意图;
图7为本发明实施例提供的又一种视频帧处理装置的结构示意图。
具体实施方式
下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行描述。
下面将结合附图1-附图4,对本发明实施例提供的视频帧处理方法进行详细介绍。
请参见图1,为本发明实施例提供了一种视频帧处理方法的流程示意图。如图1所示,本发明实施例的所述视频帧处理方法可以包括以下步骤S101-步骤S106。
S101,获取目标视频帧的卷积神经网络CNN特征和所述目标视频帧的局部特征,所述目标视频帧的局部特征包括所述目标视频帧的第一关键点和与所述第一关键点对应的特征描述子;
S102,对所述目标视频帧的CNN特征进行降维处理,获取所述目标视频帧的降维CNN特征;
在一个实施例中,目标视频帧可以是待查询视频中的任意一个视频帧,或者,目标视频帧可以是需要进行比较的一个单独的图片帧。若目标视频帧是待查询视频中的视频帧,则是将待查询视频等间隔视频截图,生成多个视频帧,目标视频帧可以是该多个视频帧中的任意一个。通常对待查询视频的截图间隔应该小于需要比较的数据库中的样本视频的截图间隔,即对待查询视频采用较高的频率截图,例如每秒5帧,以确保可以与库内的样本视频帧匹配。更优的,截图的时间点可以随机抖动,以避免出现所有待查询视频帧都恰巧与库内样本视频帧间隔较大的极端情况。
获取目标视频帧的卷积神经网络(Convolutional Neural Network,CNN)特征和目标视频帧的局部特征,目标视频帧的局部特征即是目标视频帧中提取的关键点的特征描述子,目标视频帧中提取的关键点即是目标视频帧中与邻近像素点的像素值差异较大的像素点,比如,目标视频帧的图像中的棱角。
可选的,获取目标视频帧的CNN特征的获取方式可以是,选取一个在大规模通用图片 数据集(imagenet,open-images,或ml-images数据集)上预训练得到的CNN网络,将目标视频帧输入该CNN网络,将最后一个或多个卷积层输出的特征图做池化处理(pooling),得到目标视频帧的原始CNN特征。该CNN特征是一个具有固定长度的浮点型向量,具有较高的维数(例如2048维)。在获取目标视频帧的CNN特征后,可以利用主成分分析(Principal components analysis,PCA)矩阵对该CNN特征进行降维处理,获得该目标视频帧的降维CNN特征。
可选的,获取目标视频帧的局部特征的获取方式可以是,可选用SIFT、SURF、ORB、AKAZE、BRISK等局部特征提取器中的任意一种提取器进行局部特征的提取。根据测试,BRISK相对而言准确率和速度都高。通常,若使用默认参数提取的局部特征会包含大量的关键点和相应的特征描述子。为节约存储空间,可将关键点按响应度(response)进行排序,只保留响应度最高的数十个关键点和相应的特征描述子,其中一个关键点对应一个特征描述子。
更优的,对于某些纹理较为平滑的视频帧,可能检测出的关键点过少,可以一次或多次降低检测阈值,直到检测出足够的关键点。
S103,从多个样本视频帧中获取第一视频帧,所述第一视频帧的降维CNN特征与所述目标视频帧的降维CNN特征之间的距离符合第一预设条件;
在一个实施例中,从数据库中已经存储的样本视频帧中选择与目标视频帧的降维CNN特征之间距离最近的K个样本视频帧,第一视频帧可以是该K个样本视频帧中的任意一个。数据库中已经存储的样本视频帧为多个样本视频的样本视频帧,可选的,将每个样本视频等间隔截图,可以得到多个样本视频帧,截图的间隔取决于期望达到的时间分辨率。例如,如果需要检出5s及以上的片段重复,那么截图的间隔应小于5s。
从该多个样本视频帧中选择K个样本视频帧,该K个样本视频帧的降维CNN特征与目标视频帧的降维CNN特征之间距离最近,即该K个样本视频帧的降维CNN特征与目标视频帧的降维CNN特征之间的K个距离排序在所有样本视频帧的降维CNN特征与目标视频帧的降维CNN特征之间的距离中前K位,其中,该所有样本视频帧的降维CNN特征与目标视频帧的降维CNN特征之间的距离按照距离从小到大的顺序进行排序。
其中,选择距离最近的K个样本视频帧的选择方式可以是,首先获取样本视频的CNN特征索引,该样本视频包括库内所有的样本视频,即根据库内所有样本视频的样本视频帧的降维CNN特征可以生成一个CNN特征索引。该CNN特征索引是一个结构体,它用于表示库内所有样本视频的样本视频帧的降维CNN特征聚类形成的多个聚类,每个聚类包含聚类中心和该聚类中的至少一个样本视频帧的降维CNN特征。然后计算目标视频帧的降维CNN特征和上述多个聚类帧每个聚类的聚类中心之间的距离,并将距离最近的聚类中心所对应的聚类作为目标聚类,可选的,目标聚类可以包括一个聚类或者多个聚类,若是包括多个聚类,则可以是该多个聚类的聚类中心与目标视频帧的降维CNN特征之间的距离排序在最前。最后计算目标视频帧的降维CNN特征与目标聚类所包含的至少一个样本视频帧中每个样本视频帧的降维CNN特征之间的距离,并将距离最近的K个样本视频帧中任意一个视频帧作为第一视频帧,即存在K个第一视频帧。
可选的,由于CNN特征索引内的降维CNN特征已经经过降维和量化压缩处理,因此 计算出的距离一般是近似的,因此,可以从数据库中动态读取所存储的目标聚类所包含的至少一个样本视频帧的原始的CNN特征进行计算排序,得到距离最近的K个样本视频帧,作为K个第一视频帧。
进一步可选的,还可以对该K个第一视频帧进行筛选,比如,距离超过一定阈值,则直接剔除掉。
对于上述实施例所提及的CNN特征索引生成方法可以参照图2,如图所示,CNN特征索引生成流程如图所示,包括步骤S21-S25;
S21,获取所述多个样本视频帧的CNN特征;
在一个实施例中,根据库内所有样本视频所包含的样本视频帧生成CNN特征索引,将库内视频等间隔截图,生成相应的样本视频帧,截图的间隔取决于期望达到的时间分辨率。例如,如果需要检出5s及以上的片段重复,那么截图的间隔应小于5s。
提取所划分的多个样本视频帧中每个样本视频帧的CNN特征。具体可选的,选取一个在大规模通用图片数据集(imagenet,open-images,或ml-images数据集)上预训练得到的CNN网络。将提取的每个样本视频帧分别输入该CNN网络,将最后一个或多个卷积层输出的特征图做池化处理(pooling),得到视频帧的CNN特征,该CNN特征是一个具有固定长度的浮点型向量,具有较高的维数(例如2048维)。
进一步可选的,还可以分别提取库内所有样本视频的样本视频帧的局部特征,可选用SIFT、SURF、ORB、AKAZE、BRISK等局部特征提取器中的一种进行局部特征的提取。根据测试,BRISK相对而言准确率和速度都高,通常使用默认参数提取的局部特征包含大量的关键点和相应的特征描述子。为节约存储空间,可将关键点按响应度(response)进行排序,只保留响应度最高的数十个关键点和相应的特征描述子。更优的,对于某些纹理较为平滑的视频帧,可能检测出的关键点过少,可以一次或多次降低检测阈值,直到检出足够的关键点和相应的特征描述子。
当提取了所有样本视频的样本视频帧的CNN特征和局部特征后,可以根据样本视频的视频id、视频帧id建立数据库,将(视频id,视频帧id,视频帧时间点,CNN特征,局部特征)信息元组上传数据库。假设视频总长度为10万小时,5秒一帧截图,CNN特征为2048维单精度浮点数(占用8KB空间),局部特征为128个BRISK特征描述子(约占用128x64Byte=8KB空间),数据库大小约为1.07TB,可以在单机上部署。
如图3所示,即是本发明实施例提供的一种CNN特征和局部特征提取示意图,如图所示,对库内视频进行截图,得到样本视频帧,进一步通过CNN网络计算得到CNN特征图,进一步对CNN特征图进行池化处理,即得到样本视频帧的CNN特征,同时,对样本视频帧进行局部特征提取,得到样本视频帧的局部特征,进一步通过响应度排序和过滤的方式,对局部特征进行筛选,得到筛选后样本视频帧的局部特征。将样本视频帧的CNN特征、局部特征上传数据库进行存储。
S22,采用主成分分析PCA矩阵对所述多个样本视频帧的CNN特征进行降维处理,获得所述多个样本视频帧的降维CNN特征;
在一个实施例中,从数据库中读出全部或部分样本视频帧的原始CNN特征加载到内存中,并利用加载到内存中的原始CNN特征训练PCA矩阵,该PCA矩阵可以在降低CNN 特征维数的同时,尽可能保留原有信息。可选的,通过对PCA矩阵的特征值(eigenvalue)开平方再取倒数,该PCA矩阵附带数据白化效果。
当训练得到PCA矩阵后,将数据库中的全部样本视频帧的原始CNN特征一次性或者分批加载到内存中,并利用训练得到的PCA矩阵对全部样本视频帧的原始CNN特征进行降维处理,获得该全部样本视频帧的降维CNN特征。
S23,对所述多个样本视频帧的降维CNN特征进行k-means聚类,形成多个聚类,每个所述聚类包含至少一个样本视频帧的降维CNN特征;
在一个实施例中,依据欧氏距离或余弦距离,对全部样本视频帧的降维CNN特征进行k-means聚类,得到N个聚类和对应的聚类中心,每个聚类包含至少一个样本视频帧的降维CNN特征。
S24,对每个所述聚类包含的至少一个样本视频帧的降维CNN特征进行量化压缩,获得所述聚类对应的压缩后的CNN特征;
在一个实施例中,对每个聚类包含的样本视频帧的降维CNN特征进行量化压缩,可选的,可以利用标量量化(scalar quantization)或者乘积量化(product quantization),将每个聚类包含的样本视频帧的降维CNN特征进一步压缩。例如如果使用标量量化,可以把样本视频帧的降维CNN特征的每个维度从4字节的浮点数压缩到1字节的整数。
S25,根据所述多个聚类、每个所述聚类对应的压缩后的CNN特征以及每个所述聚类的聚类中心,生成样本视频的CNN特征索引。
在一个实施例中,根据k-means聚类形成的N个聚类、该N个聚类的聚类中心、每个聚类对应的压缩后的CNN特征,可以生成库内样本视频的CNN特征索引,该CNN特征索引是一个结构体,通过该CNN特征索引可以得到上述N个聚类、该N个聚类的聚类中心以及每个聚类对应的压缩后的CNN特征。
假设样本视频总长度为10万小时,5秒一帧截图,PCA降维到256,且使用标量量化,CNN特征索引最终大小约100,000x60x12x256Byte=17GB,可以在单机的内存上生成。
S104,获取所述第一视频帧的局部特征,所述第一视频帧的局部特征包括所述第一视频帧中的第二关键点和与所述第二关键点对应的特征描述子;
在一个实施例中,当从所有样本视频帧中,获取第一视频帧后,需要进一步通过局部特征进行验证。获取第一视频帧的局部特征的获取方式可以是,从上述步骤S21中所建立的数据库中提取第一视频帧的局部特征。可选的,可以根据第一视频帧所属视频id、该第一视频帧id从数据库帧查找对应的局部特征。第一视频帧的局部特征包括在该第一视频帧中提取的第二关键点和第二关键点对应的特征描述子,第一视频帧中提取的关键点可以是第一视频帧中与邻近像素点的像素值差异较大的像素点,比如第一视频帧中的棱角部分。
可选的,若包括K个第一视频帧,则需要获取该K个第一视频帧中每个第一视频帧的局部特征。
S105,计算所述第一视频帧的局部特征和所述目标视频帧的局部特征之间的匹配度;
S106,若所述匹配度符合第二预设条件,将所述第一视频帧作为所述目标视频帧的重复视频帧。
在一个实施例中,若存在K个第一视频帧,需要计算该K个第一视频帧中每个第一视 频帧的局部特征和目标视频帧的局部特征之间的匹配度,若第一视频帧的局部特征和目标视频帧的局部特征之间的匹配度大于第一阈值,则确定该第一视频帧与目标视频帧为重复视频帧。
计算第一视频帧的局部特征与目标视频帧的局部特征之间的匹配度的计算方式可以是,比如目标视频帧的局部特征包含m个第一关键点对应的m个第一特征描述子,一个第一关键点对应一个第一特征描述子,比如第一特征描述子可以用A i(i=1,2,…m)表示,第一视频帧的局部特征包含n个第二关键点对应的n个第二特征描述子,一个第二关键点对应一个第二特征描述子,比如第二特征描述子可以用B j(j=1,2,…n)表示,所述m为大于或者等于2的自然数,所述n为大于或者等于2的自然数。
针对每个第一特征描述子A i,获取n个第二特征描述子B j中每个第二特征描述子与第一特征描述子之间的n个距离。按照从大到小的先后顺序,将该n个距离排序形成排序队列,获取该排序队列帧排序在最后的k个距离,即距离最近的k个距离,根据该k个距离,确定第一特征描述子是与第一视频帧匹配的有效描述子。比如k=2,即对每个Ai找到与其距离最近的B j1(距离为D i,j,1)和距离第二近的B j2(距离为D i,j,2),若D i,j,1<r*D i,j,2(r可在0.6-0.9间取值),则认为第一特征描述子Ai是与第一视频帧匹配的有效匹配子。
进一步根据目标视频帧中m个第一特征描述子中是有效描述子的数量,确定所述第一视频帧的局部特征和所述目标视频帧的局部特征之间的匹配度。比如最终的匹配度可以是,匹配度=有效描述子的数量/max(m,n)。
若存在K个第一视频帧,即待查询目标视频帧与多个样本视频帧均可能是重复视频帧,则需要验证目标视频帧与该K个第一视频帧中每个第一视频帧之间的匹配度,以确定目标视频帧与第一视频帧是否为重复视频帧。
当确定了目标视频帧与第一视频帧为重复视频帧后,可以进一步定位重复视频帧所在的播放时间点,以及该重复视频帧来自于样本视频中的哪一个视频。比如目标视频帧为待查询视频的第一播放时间点的视频帧,可以获取该第一视频帧的帧标识,从数据库中查找该第一视频帧的帧标识对应于目标视频的第二播放时间点,需要说明的是,若存在K个第一视频帧,则该K个第一视频帧可能对应于不同的视频的不同播放时间点。
进一步可选的,若所述待查询视频与所述目标视频所有重复视频帧的帧数量大于第二阈值,且所述所有重复视频帧中满足连续分布条件的重复视频帧分别分布于所述待查询视频的第一时间段和所述目标视频的第二时间段,将所述待查询视频中所述第一时间段的视频和所述目标视频中所述第二时间段的视频确定为重复视频段,所述第一时间段包括所述第一播放时间点,所述第二时间段包括所述第二播放时间点,所述连续分布条件包括相邻重复视频帧的时间差小于第三阈值。
在一个实施例帧,目标视频帧是待查询视频中的任意一个视频帧,在对待查询视频所包含的所有视频帧均与库内样本视频的样本视频帧进行比较,确定是否为重复视频帧。待查询视频与目标视频所有重复视频帧的帧数量大于第二阈值,该帧数量可以由待查询视频中确定为重复视频帧的帧数量决定,比如,待查询视频中100个视频帧与库内的样本视频帧为重复视频帧,则重复视频帧的帧数量为100,并且重复视频帧中满足连续分布条件的重复视频帧分别分布于待查询视频的第一时间段和目标视频的第二时间段,将待查询视频 中第一时间段的视频和目标视频中第二时间段的视频确定为重复视频段。其中,连续分布条件可以是相邻重复视频帧的时间差小于第三阈值,即重复视频帧基本连续集中分布在待查询视频的第一时间段和目标视频的第二时间段。
当样本视频的数量进一步增大,如超过百万小时,受制于内存大小和查询速度的要求,在单机上实现重复视频帧的确定将比较困难,则可以拓展至多机分布式部署。假设有P台计算机可供部署,具体方法如下:将库内样本视频的样本视频帧平均分配到在P台计算机上,并行提取各自分配到的样本视频帧的CNN特征和局部特征,各个计算机将所得到的CNN特征和局部特征上传数据库。该数据库应当采用可支持海量数据的方案,例如云端存储。
从数据库中读出部分CNN特征到一台计算机的内存中,并根据读出的该部分CNN特征训练得到PCA矩阵、k-means聚类和量化压缩的参数,将所得到的PCA矩阵、k-means聚类参数和量化压缩的参数共享到所有的计算机上。
P台计算机上各自读取总量/P个CNN特征,既不遗漏也不重叠。根据共享的参数和各自读取的CNN特征进行聚类,分别在各自内存上建立CNN特征索引,每个计算机上的CNN特征索引不同。查询重复视频时,在一台或多台计算机上计算待查询视频帧的CNN特征和局部特征,然后将得到的CNN特征和局部特征发送到所有计算机,各个计算机并行计算该待查询视频帧的CNN特征与各自计算机的CNN特征索引所指示的各个聚类中CNN特征之间的距离,并将各自计算的距离发送给一台计算机,该计算机根据距离进行重排序,取距离最短的K个结果,确定K个第一视频帧,进一步通过局部特征的匹配确定是否为重复视频帧。
本发明实施例中,通过目标视频帧的降维CNN特征与样本视频帧之间的降维CNN特征之间的距离作第一层筛选,再进一步通过目标视频帧的局部特征与第一视频帧的局部特征之间的匹配度作第二层筛选,从而准确的检测到目标视频帧的重复视频帧,准确度高。
请参照图4,为本发明实施例提供的一种重复视频帧的检测流程图,如图所示,首先对待查询视频进行截图,得到视频帧,进一步通过CNN网络计算、池化以及降维的处理过程,得到视频帧的CNN特征。根据库内视频CNN特征索引,进行最邻近搜索以及过滤操作,得到距离视频帧的CNN特征距离最近的样本视频帧,作为第一视频帧,为了确定第一视频帧是否为重复视频帧,需要进一步通过局部特征进行验证。即从数据库内读取第一视频帧的局部特征。
同时,对视频帧进行局部特征提取,根据响应度排序和过滤,得到视频帧的局部特征。计算视频帧的局部特征和库内第一视频帧的局部特征之间的匹配度,若匹配度大于阈值,则进一步确定第一视频帧来自的视频id和时间点。
请参见图5,为本发明实施例提供了一种视频帧处理装置的结构示意图。如图5所示,本发明实施例的所述视频帧处理装置可以包括:
第一获取模块11,用于获取目标视频帧的卷积神经网络CNN特征和所述目标视频帧的局部特征,所述目标视频帧的局部特征包括所述目标视频帧的第一关键点和与所述第一 关键点对应的特征描述子;
降维处理模块12,用于对所述目标视频帧的CNN特征进行降维处理,获取所述目标视频帧的降维CNN特征;
在一个实施例中,目标视频帧可以是待查询视频中的任意一个视频帧,或者,目标视频帧可以是需要进行比较的一个单独的图片帧。若目标视频帧是待查询视频中的视频帧,则是将待查询视频等间隔视频截图,生成多个视频帧,目标视频帧可以是该多个视频帧中的任意一个。通常对待查询视频的截图间隔应该小于需要比较的数据库中的样本视频的截图间隔,即对待查询视频采用较高的频率截图,例如每秒5帧,以确保可以与库内的样本视频帧匹配。更优的,截图的时间点可以随机抖动,以避免出现所有待查询视频帧都恰巧与库内样本视频帧间隔较大的极端情况。
获取目标视频帧的卷积神经网络(Convolutional Neural Network,CNN)特征和目标视频帧的局部特征,目标视频帧的局部特征即是目标视频帧中提取的关键点的特征描述子,目标视频帧中提取的关键点即是与邻近像素点的像素值差异较大的像素点,比如,目标视频帧的图像中的棱角。
可选的,获取目标视频帧的CNN特征的获取方式可以是,选取一个在大规模通用图片数据集(imagenet,open-images,或ml-images数据集)上预训练得到的CNN网络,将目标视频帧输入该CNN网络,将最后一个或多个卷积层输出的特征图做池化处理(pooling),得到目标视频帧的CNN特征。该CNN特征是一个具有固定长度的浮点型向量,具有较高的维数(例如2048维)。可选的,在获取目标视频帧的原始CNN特征后,可以利用主成分分析(Principal components analysis,PCA)矩阵对该原始CNN特征进行降维。
可选的,获取目标视频帧的局部特征的获取方式可以是,可选用SIFT、SURF、ORB、AKAZE、BRISK等局部特征提取器中的任意一种提取器进行局部特征的提取。根据测试,BRISK相对而言准确率和速度都高。通常,若使用默认参数提取的局部特征会包含大量的关键点和相应的特征描述子。为节约存储空间,可将关键点按响应度(response)进行排序,只保留响应度最高的数十个关键点和相应的特征描述子,其中一个关键点对应一个特征描述子。
更优的,对于某些纹理较为平滑的视频帧,可能检测出的关键点过少,可以一次或多次降低检测阈值,直到检测出足够的关键点。
第二获取模块13,用于从多个样本视频帧中获取第一视频帧,所述第一视频帧的降维CNN特征与所述目标视频帧的降维CNN特征之间的距离符合第一预设条件;
可选的,第二获取模块可包括第一获取单元、第一计算单元以及第二计算单元;
第一获取单元,用于获取样本视频的CNN特征索引,所述样本视频包括所述多个样本视频帧,所述CNN特征索引用于表示根据所述多个样本视频帧的降维CNN特征聚类形成的多个聚类,每个所述聚类包含聚类中心和所述聚类中的至少一个样本视频帧的降维CNN特征;
第一计算单元,用于计算所述目标视频帧的降维CNN特征和所述多个聚类中每个聚类的聚类中心之间的距离,并将距离最近的聚类中心所对应的聚类作为目标聚类;
第二计算单元,用于计算所述目标视频帧的降维CNN特征和所述目标聚类包含的至少 一个样本视频帧中每个样本视频帧的降维CNN特征之间的距离,并将距离最近的降维CNN特征所对应的样本视频帧作为第一视频帧。
在一个实施例中,从数据库中已经存储的样本视频帧中选择与目标视频帧的降维CNN特征之间距离最近的K个样本视频帧,第一视频帧可以是该K个样本视频帧中的任意一个。数据库中已经存储的样本视频帧为多个样本视频的样本视频帧,可选的,将每个样本视频等间隔截图,可以得到多个样本视频帧,截图的间隔取决于期望达到的时间分辨率。例如,如果需要检出5s及以上的片段重复,那么截图的间隔应小于5s。
从该多个样本视频帧中选择K个样本视频帧,该K个样本视频帧的降维CNN特征与目标视频帧的降维CNN特征之间距离最近,即该K个样本视频帧的降维CNN特征与目标视频帧的降维CNN特征之间的K个距离排序在所有样本视频帧的降维CNN特征与目标视频帧的降维CNN特征之间的距离中前K位,其中,该所有样本视频帧的降维CNN特征与目标视频帧的降维CNN特征之间的距离按照距离从小到大的顺序进行排序。
其中,选择距离最近的K个样本视频帧的选择方式可以是,首先获取样本视频的CNN特征索引,该样本视频包括库内所有的样本视频,即根据库内所有样本视频的样本视频帧的降维CNN特征可以生成一个降维CNN特征索引。该CNN特征索引是一个结构体,它用于表示库内所有样本视频的样本视频帧的降维CNN特征聚类形成的多个聚类,每个聚类包含聚类中心和该聚类中的至少一个样本视频帧的降维CNN特征。然后计算目标视频帧的降维CNN特征和上述多个聚类帧每个聚类的聚类中心之间的距离,并将距离最近的聚类中心所对应的聚类作为目标聚类,可选的,目标聚类可以包括一个聚类或者多个聚类,若是包括多个聚类,则可以是该多个聚类的聚类中心与目标视频帧的降维CNN特征之间的距离排序在最前。最后计算目标视频帧的降维CNN特征与目标聚类所包含的至少一个样本视频帧中每个样本视频帧的降维CNN特征之间的距离,并将距离最近的K个样本视频帧中任意一个视频帧作为第一视频帧,即存在K个第一视频帧。
可选的,由于CNN特征索引内的降维CNN特征已经经过降维和量化压缩处理,因此计算出的距离一般是近似的,因此,可以从数据库中动态读取所存储的目标聚类所包含的至少一个样本视频帧的原始的CNN特征进行计算排序,得到距离最近的K个样本视频帧,作为K个第一视频帧。
进一步可选的,还可以对该K个第一视频帧进行筛选,比如,距离超过一定阈值,则直接剔除掉。
第三获取模块14,用于获取所述第一视频帧的局部特征,所述第一视频帧的局部特征包括所述第一视频帧中的第二关键点和与所述第二关键点对应的特征描述子;
在一个实施例中,当从所有样本视频帧中,确定了与待查询目标视频帧为第一视频帧后,需要进一步通过局部特征进行验证。获取第一视频帧的局部特征的获取方式可以是,从上述步骤S21中所建立的数据库中提取第一视频帧的局部特征。可选的,可以根据第一视频帧所属视频id、该第一视频帧id从数据库帧查找对应的局部特征。第一视频帧的局部特征包括在该第一视频帧中提取的第二关键点以及和第二关键点对应的第二特征描述子,第一视频帧中提取的第二关键点可以是第一视频帧中与邻近像素点的像素值差异较大的像素点,比如第一视频帧中的棱角部分。
可选的,若包括K个第一视频帧,则需要获取该K个第一视频帧中每个样本视频帧的局部特征。
计算模块15,用于计算所述第一视频帧的局部特征和所述目标视频帧的局部特征之间的匹配度;
可选的,计算模块15可包括第二获取单元、排序单元、第三获取单元、第一确定单元以及第二确定单元;
第二获取单元,用于针对每个所述第一特征描述子,获取所述n个第二特征描述子中每个第二特征描述子与所述第一特征描述子之间的n个距离;
排序单元,用于按照从大到小的先后顺序,将所述n个距离排序形成排序队列;
第三获取单元,用于获取所述排序队列中排序在最后的k个距离,所述k为大于或者等于2的自然数;
第一确定单元,用于根据所述k个距离,确定所述第一特征描述子是与所述第一视频帧匹配的有效描述子;
第二确定单元,用于根据所述m个第一特征描述子中是有效描述子的数量,确定所述第一视频帧的局部特征和所述目标视频帧的局部特征之间的匹配度。
第一确定模块16,用于若所述匹配度符合第二预设条件,将所述第一视频帧作为所述目标视频帧的重复视频帧。
可选的,第一确定模块可包括第四获取单元和第三确定单元;
第四获取单元,用于获取所述第一视频帧的帧标识;
第三确定单元,用于查找所述第一视频帧的帧标识对应于所述多个视频中的目标视频,以及所述第一视频帧的帧标识对应于所述目标视频的第二播放时间点,并将所述待查询视频中第一播放时间点的视频帧与所述目标视频中第二播放时间点的视频帧确定为重复视频帧。
在一个实施例中,若存在K个第一视频帧,需要计算该K个第一视频帧中每个第一视频帧的局部特征和目标视频帧的局部特征之间的匹配度,若第一视频帧的局部特征和目标视频帧的局部特征之间的匹配度大于第一阈值,则确定该第一视频帧与目标视频帧为重复视频帧。
计算第一视频帧的局部特征与目标视频帧的局部特征之间的匹配度的计算方式可以是,比如目标视频帧的局部特征包含m个第一关键点对应的m个第一特征描述子,一个第一关键点对应一个第一特征描述子,比如第一特征描述子可以用A i(i=1,2,…m)表示,第一视频帧的局部特征包含n个第二关键点对应的n个第二特征描述子,一个第二关键点对应一个第二特征描述子,比如第二特征描述子可以用B j(j=1,2,…n)表示,所述m为大于或者等于2的自然数,所述n为大于或者等于2的自然数。
针对每个第一特征描述子A i,获取n个第二特征描述子B j中每个第二特征描述子与第一特征描述子之间的n个距离。按照从大到小的先后顺序,将该n个距离排序形成排序队列,获取该排序队列帧排序在最后的k个距离,即距离最近的k个距离,根据该k个距离,确定第一特征描述子是与第一视频帧匹配的有效描述子。比如k=2,即对每个Ai找到与其距离最近的B j1(距离为D i,j,1)和距离第二近的B j2(距离为D i,j,2),若D i,j,1<r*D i,j,2(r可 在0.6-0.9间取值),则认为第一特征描述子Ai是与第一视频帧匹配的有效匹配子。
进一步根据目标视频帧中m个第一特征描述子中是有效描述子的数量,确定所述第一视频帧的局部特征和所述目标视频帧的局部特征之间的匹配度。比如最终的匹配度可以是,匹配度=有效描述子的数量/max(m,n)。
若存在K个第一视频帧,即待查询目标视频帧与多个样本视频帧均可能是重复视频帧,则需要验证目标视频帧与该K个第一视频帧中每个第一视频帧之间的匹配度,以确定目标视频帧与第一视频帧是否为重复视频帧。
当确定了目标视频帧与第一视频帧为重复视频帧后,可以进一步定位重复视频帧所在的播放时间点,以及该重复视频帧来自于样本视频中的哪一个视频。比如目标视频帧为待查询视频的第一播放时间点的视频帧,可以获取该第一视频帧的帧标识,从数据库中查找该第一视频帧的帧标识对应于目标视频的第二播放时间点,需要说明的是,若存在K个第一视频帧,则该K个第一视频帧可能对应于不同的视频的不同播放时间点。
本发明实施例中,首先获取待查询目标视频帧的CNN特征和局部特征,再从多个样本视频帧帧选择与目标视频帧的CNN特征之间距离最近的样本视频帧,作为第一视频帧,然后获取第一视频帧的局部特征,最后计算第一视频帧的局部特征和目标视频帧的局部特征之间的匹配度,若匹配度大于第一阈值,确定该第一视频帧与目标视频帧为重复视频帧。这种方式先通过CNN特征作第一层筛选,再进一步通过局部特征的匹配作第二层筛选,从而准确的确定目标视频帧与样本视频帧是否为重复视频帧,准确度高。
具体执行步骤可以参见前述图1方法实施例的描述,此处不在赘述。
如图6所示,为本发明实施例提供的另一种视频帧处理装置的结构示意图,如图所示,本发明实施例提供的视频帧处理装置包括:第一获取模块21、降维处理模块22、第二获取模块23、第三获取模块24、计算模块25、第一确定模块26、第四获取模块27、降维处理模块28、聚类模块29、量化压缩模块30、生成模块31以及第二确定模块32;其中,第一获取模块21、降维处理模块22、第二获取模块23、第三获取模块24、计算模块25、第一确定模块26请参照图5实施例的描述,在此不再赘述。
第四获取模块27,用于获取所述多个样本视频帧的CNN特征;
在一个实施例中,根据库内所有样本视频所包含的样本视频帧生成CNN特征索引,将库内视频等间隔截图,生成相应的样本视频帧,截图的间隔取决于期望达到的时间分辨率。例如,如果需要检出5s及以上的片段重复,那么截图的间隔应小于5s。
提取所划分的多个样本视频帧中每个样本视频帧的CNN特征。具体可选的,选取一个在大规模通用图片数据集(imagenet,open-images,或ml-images数据集)上预训练得到的CNN网络。将提取的每个样本视频帧分别输入该CNN网络,将最后一个或多个卷积层输出的特征图做池化处理(pooling),得到视频帧的原始CNN特征,该原始CNN特征是一个具有固定长度的浮点型向量,具有较高的维数(例如2048维)。
进一步可选的,还可以分别提取库内所有样本视频的样本视频帧的局部特征,可选用SIFT、SURF、ORB、AKAZE、BRISK等局部特征提取器中的一种进行局部特征的提取。根据测试,BRISK相对而言准确率和速度都高,通常使用默认参数提取的局部特征包含大 量的关键点和相应的特征描述子。为节约存储空间,可将关键点按响应度(response)进行排序,只保留响应度最高的数十个关键点和相应的特征描述子。更优的,对于某些纹理较为平滑的视频帧,可能检测出的关键点过少,可以一次或多次降低检测阈值,直到检出足够的关键点和相应的特征描述子。
当提取了所有样本视频的样本视频帧的CNN特征和局部特征后,可以根据样本视频的视频id、视频帧id建立数据库,将(视频id,视频帧id,视频帧时间点,CNN特征,局部特征)信息元组上传数据库。假设视频总长度为10万小时,5秒一帧截图,CNN特征为2048维单精度浮点数(占用8KB空间),局部特征为128个BRISK特征描述子(约占用128x64Byte=8KB空间),数据库大小约为1.07TB,可以在单机上部署。
如图3所示,即是本发明实施例提供的一种CNN特征和局部特征提取示意图,如图所示,对库内视频进行截图,得到样本视频帧,进一步通过CNN网络计算得到CNN特征图,进一步对CNN特征图进行池化处理,即得到样本视频帧的CNN特征,同时,对样本视频帧进行局部特征提取,得到样本视频帧的局部特征,进一步通过响应度排序和过滤的方式,对局部特征进行筛选,得到筛选后样本视频帧的局部特征。将样本视频帧的CNN特征、局部特征上传数据库进行存储。
降维处理模块28,用于采用主成分分析PCA矩阵对所述多个样本视频帧的CNN特征进行降维处理,获得所述多个样本视频帧的低维度CNN特征;
在一个实施例中,从数据库中读出全部或部分样本视频帧的原始CNN特征加载到内存中,并利用加载到内存中的原始CNN特征训练PCA矩阵,该PCA矩阵可以在降低CNN特征维数的同时,尽可能保留原有信息。可选的,通过对PCA矩阵的特征值(eigenvalue)开平方再取倒数,该PCA矩阵附带数据白化效果。
当训练得到PCA矩阵后,将数据库中的全部样本视频帧的原始CNN特征一次性或者分批加载到内存中,并利用训练得到的PCA矩阵对全部样本视频帧的原始CNN特征进行降维处理,获得该全部样本视频帧的降维CNN特征。
聚类模块29,用于对所述多个样本视频帧的降维CNN特征进行k-means聚类,形成多个聚类,每个所述聚类包含至少一个样本视频帧的降维CNN特征;
在一个实施例中,依据欧氏距离或余弦距离,对全部样本视频帧的低维度CNN特征进行k-means聚类,得到N个聚类和对应的聚类中心,每个聚类包含至少一个样本视频帧的降维CNN特征。
量化压缩模块30,用于对每个所述聚类包含的至少一个样本视频帧的降维CNN特征进行量化压缩,获得所述聚类对应的压缩后的CNN特征;
在一个实施例中,对每个聚类包含的样本视频帧的降维CNN特征进行量化压缩,可选的,可以利用标量量化(scalar quantization)或者乘积量化(product quantization),将每个聚类包含的样本视频帧的降维CNN特征进一步压缩。例如如果使用标量量化,可以把样本视频帧的降维CNN特征的每个维度从4字节的浮点数压缩到1字节的整数。
生成模块31,用于根据所述多个聚类、每个所述聚类对应的压缩后的CNN特征以及每个所述聚类的聚类中心,生成样本视频的CNN特征索引。
在一个实施例中,根据k-means聚类形成的N个聚类、该N个聚类的聚类中心、每个 聚类对应的压缩后的CNN特征,可以生成库内样本视频的CNN特征索引,该CNN特征索引是一个结构体,通过该CNN特征索引可以得到上述N个聚类、该N个聚类的聚类中心以及每个聚类对应的压缩后的CNN特征。
假设样本视频总长度为10万小时,5秒一帧截图,PCA降维到256,且使用标量量化,CNN特征索引最终大小约100,000x60x12x256Byte=17GB,可以在单机的内存上生成。
第二确定模块32,用于若所述待查询视频与所述目标视频所有重复视频帧的帧数量大于第二阈值,且所述所有重复视频帧中满足连续分布条件的重复视频帧分别分布于所述待查询视频的第一时间段和所述目标视频的第二时间段,将所述待查询视频中所述第一时间段的视频和所述目标视频中所述第二时间段的视频确定为重复视频段,所述第一时间段包括所述第一播放时间点,所述第二时间段包括所述第二播放时间点,所述连续分布条件包括相邻重复视频帧的时间差小于第三阈值。
在一个实施例帧,目标视频帧是待查询视频中的任意一个视频帧,在对待查询视频所包含的所有视频帧均与库内样本视频的样本视频帧进行比较,确定是否为重复视频帧。待查询视频与目标视频所有重复视频帧的帧数量大于第二阈值,该帧数量可以由待查询视频中确定为重复视频帧的帧数量决定,比如,待查询视频中100个视频帧与库内的样本视频帧为重复视频帧,则重复视频帧的帧数量为100,并且重复视频帧中满足连续分布条件的重复视频帧分别分布于待查询视频的第一时间段和目标视频的第二时间段,将待查询视频中第一时间段的视频和目标视频中第二时间段的视频确定为重复视频段。其中,连续分布条件可以是相邻重复视频帧的时间差小于第三阈值,即重复视频帧基本连续集中分布在待查询视频的第一时间段和目标视频的第二时间段。
本发明实施例中,通过目标视频帧的降维CNN特征与样本视频帧之间的降维CNN特征之间的距离作第一层筛选,再进一步通过目标视频帧的局部特征与第一视频帧的局部特征之间的匹配度作第二层筛选,从而准确的检测到目标视频帧的重复视频帧,准确度高。
具体执行步骤可以参见前述图1方法实施例的描述,此处不在赘述。
本发明实施例还提供了一种计算机存储介质,所述计算机存储介质可以存储有多条指令,所述指令适于由处理器加载并执行如上述图1所示实施例的方法步骤,具体执行过程可以参见图1所示实施例的具体说明,在此不进行赘述。
请参照图7,为本发明实施例提供的另一种视频帧处理装置的结构示意图,如图7所示,所述视频帧处理装置1000可以包括:至少一个处理器1001,例如CPU,至少一个通信接口1003,存储器1004,至少一个通信总线1002。其中,通信总线1002用于实现这些组件之间的连接通信。通信接口1003可选的可以包括标准的有线接口、无线接口(如WI-FI接口)。存储器1004可以是高速RAM存储器,也可以是非不稳定的存储器(non-volatile memory),例如至少一个磁盘存储器。存储器1004可选的还可以是至少一个位于远离前述处理器1001的存储装置。如图7所示,作为一种计算机存储介质的存储器1004中可以包括操作系统、网络通信模块以及程序指令。
在图7所示的视频帧处理装置1000中,处理器1001可以用于加载存储器1004中存储 的程序指令,并具体执行以下操作:
获取目标视频帧的卷积神经网络CNN特征和所述目标视频帧的局部特征,所述目标视频帧的局部特征包括所述目标视频帧的第一关键点和与所述第一关键点对应的特征描述子;
对所述目标视频帧的CNN特征进行降维处理,获取所述目标视频帧的降维CNN特征;
从多个样本视频帧中获取第一视频帧,所述第一视频帧的降维CNN特征与所述目标视频帧的降维CNN特征之间的距离符合第一预设条件;
获取所述第一视频帧的局部特征,所述第一视频帧的局部特征包括所述第一视频帧中的第二关键点和与所述第二关键点对应的特征描述子;
计算所述第一视频帧的局部特征和所述目标视频帧的局部特征之间的匹配度;
若所述匹配度符合第二预设条件,将所述第一视频帧作为所述目标视频帧的重复视频帧。
可选的,所述获取目标视频帧的卷积神经网络CNN特征,包括:
获取待查询视频;
对所述待查询视频进行等间隔视频截图,获得目标视频帧,所述目标视频帧为对所述待查询视频进行等间隔视频截图得到的多个视频帧中的任意一个视频帧;
将所述目标视频帧输入CNN网络进行处理,获得所述目标视频帧的CNN特征。
可选的,所述从多个样本视频帧中获取第一视频帧,包括:
获取样本视频的CNN特征索引,所述样本视频包括所述多个样本视频帧,所述CNN特征索引用于表示根据所述多个样本视频帧的降维CNN特征聚类形成的多个聚类,每个所述聚类包含聚类中心和所述聚类中的至少一个样本视频帧的降维CNN特征;
计算所述目标视频帧的降维CNN特征和所述多个聚类中每个聚类的聚类中心之间的距离,并将距离最近的聚类中心所对应的聚类作为目标聚类;
计算所述目标视频帧的降维CNN特征和所述目标聚类包含的至少一个样本视频帧中每个样本视频帧的降维CNN特征之间的距离,并将距离最近的降维CNN特征所对应的样本视频帧作为第一视频帧。
可选的,所述目标视频帧的局部特征包括m个第一关键点以及和所述m个关键点对应的m个第一特征描述子,一个所述第一关键点对应一个所述第一特征描述子,所述第一视频帧的局部特征包含n个第二关键点以及和所述n个第二关键点对应的n个第二特征描述子,一个所述第二关键点对应一个所述第二特征描述子,所述m为大于或者等于2的自然数,所述n为大于或者等于2的自然数;
所述计算所述第一视频帧的局部特征和所述目标视频帧的局部特征之间的匹配度,包括:
针对每个所述第一特征描述子,获取所述n个第二特征描述子中每个第二特征描述子与所述第一特征描述子之间的n个距离;
按照从大到小的先后顺序,将所述n个距离排序形成排序队列;
获取所述排序队列中排序在最后的k个距离,所述k为大于或者等于2的自然数;
根据所述k个距离,确定所述第一特征描述子是与所述第一视频帧匹配的有效描述子;
根据所述m个第一特征描述子中是有效描述子的数量,确定所述第一视频帧的局部特征和所述目标视频帧的局部特征之间的匹配度。
可选的,所述获取目标视频帧的卷积神经网络CNN特征和局部特征之前,还包括:
获取所述多个样本视频帧的CNN特征;
采用主成分分析PCA矩阵对所述多个样本视频帧的CNN特征进行降维处理,获得所述多个样本视频帧的降维CNN特征;
对所述多个样本视频帧的降维CNN特征进行k-means聚类,形成多个聚类,每个所述聚类包含至少一个样本视频帧的降维CNN特征;
对每个所述聚类包含的至少一个样本视频帧的降维CNN特征进行量化压缩,获得所述聚类对应的压缩后的CNN特征;
根据所述多个聚类、每个所述聚类对应的压缩后的CNN特征以及每个所述聚类的聚类中心,生成样本视频的CNN特征索引。
可选的,所述目标视频帧属于待查询视频,且所述目标视频帧在所述待查询视频中的播放时间点为第一播放时间点,所述第一视频帧属于样本视频中的目标视频,所述将所述第一视频帧作为所述目标视频帧的重复视频帧,包括:
获取所述第一视频帧的帧标识;
查找所述第一视频帧的帧标识对应于所述目标视频的第二播放时间点,并将所述目标视频中第二播放时间点的视频帧作为所述待查询视频中第一播放时间点的视频帧的重复视频帧。
可选的,处理器1001还可以用于加载存储器1004中存储的程序指令,用于执行以下操作:
若所述待查询视频与所述目标视频所有重复视频帧的帧数量大于第二阈值,且所述所有重复视频帧中满足连续分布条件的重复视频帧分别分布于所述待查询视频的第一时间段和所述目标视频的第二时间段,将所述待查询视频中所述第一时间段的视频和所述目标视频中所述第二时间段的视频确定为重复视频段,所述第一时间段包括所述第一播放时间点,所述第二时间段包括所述第二播放时间点,所述连续分布条件包括相邻重复视频帧的时间差小于第三阈值。
需要说明的是,具体执行过程可以参见图1所示方法实施例的具体说明,在此不进行赘述。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成,所述的程序可存储于计算机可读取存储介质中,该程序在执行时,包括如上述各方法的实施例的流程。其中,所述的存储介质可为磁碟、光盘、只读存储记忆体(Read-Only Memory,ROM)或随机存储记忆体(Random Access Memory,RAM)等。

Claims (14)

  1. 一种视频帧处理方法,其特征在于,包括:
    获取目标视频帧的卷积神经网络CNN特征和所述目标视频帧的局部特征,所述目标视频帧的局部特征包括所述目标视频帧的第一关键点和与所述第一关键点对应的特征描述子;
    对所述目标视频帧的CNN特征进行降维处理,获取所述目标视频帧的降维CNN特征;
    从多个样本视频帧中获取第一视频帧,所述第一视频帧的降维CNN特征与所述目标视频帧的降维CNN特征之间的距离符合第一预设条件;
    获取所述第一视频帧的局部特征,所述第一视频帧的局部特征包括所述第一视频帧中的第二关键点和与所述第二关键点对应的特征描述子;
    计算所述第一视频帧的局部特征和所述目标视频帧的局部特征之间的匹配度;
    若所述匹配度符合第二预设条件,将所述第一视频帧作为所述目标视频帧的重复视频帧。
  2. 如权利要求1所述的方法,其特征在于,所述获取目标视频帧的卷积神经网络CNN特征,包括:
    获取待查询视频;
    对所述待查询视频进行等间隔视频截图,获得目标视频帧,所述目标视频帧为对所述待查询视频进行等间隔视频截图得到的多个视频帧中的任意一个视频帧;
    将所述目标视频帧输入CNN网络进行处理,获得所述目标视频帧的CNN特征。
  3. 如权利要求1所述的方法,其特征在于,所述从多个样本视频帧中获取第一视频帧,包括:
    获取样本视频的CNN特征索引,所述样本视频包括所述多个样本视频帧,所述CNN特征索引用于表示根据所述多个样本视频帧的降维CNN特征聚类形成的多个聚类,每个所述聚类包含聚类中心和所述聚类中的至少一个样本视频帧的降维CNN特征;
    计算所述目标视频帧的降维CNN特征和所述多个聚类中每个聚类的聚类中心之间的距离,并将距离最近的聚类中心所对应的聚类作为目标聚类;
    计算所述目标视频帧的降维CNN特征和所述目标聚类包含的至少一个样本视频帧中每个样本视频帧的降维CNN特征之间的距离,并将距离最近的降维CNN特征所对应的样本视频帧作为第一视频帧。
  4. 如权利要求3所述的方法,其特征在于,所述目标视频帧的局部特征包括m个第一关键点以及和所述m个关键点对应的m个第一特征描述子,一个所述第一关键点对应一个所述第一特征描述子,所述第一视频帧的局部特征包含n个第二关键点以及和所述n个第二关键点对应的n个第二特征描述子,一个所述第二关键点对应一个所述第二特征描述子,所述m为大于或者等于2的自然数,所述n为大于或者等于2的自然数;
    所述计算所述第一视频帧的局部特征和所述目标视频帧的局部特征之间的匹配度,包括:
    针对每个所述第一特征描述子,获取所述n个第二特征描述子中每个第二特征描述子与所述第一特征描述子之间的n个距离;
    按照从大到小的先后顺序,将所述n个距离排序形成排序队列;
    获取所述排序队列中排序在最后的k个距离,所述k为大于或者等于2的自然数;
    根据所述k个距离,确定所述第一特征描述子是与所述第一视频帧匹配的有效描述子;
    根据所述m个第一特征描述子中是有效描述子的数量,确定所述第一视频帧的局部特征和所述目标视频帧的局部特征之间的匹配度。
  5. 如权利要求3或4所述的方法,其特征在于,所述获取目标视频帧的卷积神经网络CNN特征和局部特征之前,还包括:
    获取所述多个样本视频帧的CNN特征;
    采用主成分分析PCA矩阵对所述多个样本视频帧的CNN特征进行降维处理,获得所述多个样本视频帧的降维CNN特征;
    对所述多个样本视频帧的降维CNN特征进行k-means聚类,形成多个聚类,每个所述聚类包含至少一个样本视频帧的降维CNN特征;
    对每个所述聚类包含的至少一个样本视频帧的降维CNN特征进行量化压缩,获得所述聚类对应的压缩后的CNN特征;
    根据所述多个聚类、每个所述聚类对应的压缩后的CNN特征以及每个所述聚类的聚类中心,生成样本视频的CNN特征索引。
  6. 如权利要求1所述的方法,其特征在于,所述目标视频帧属于待查询视频,且所述目标视频帧在所述待查询视频中的播放时间点为第一播放时间点,所述第一视频帧属于样本视频中的目标视频,所述将所述第一视频帧作为所述目标视频帧的重复视频帧,包括:
    获取所述第一视频帧的帧标识;
    查找所述第一视频帧的帧标识对应于所述目标视频的第二播放时间点,并将所述目标视频中第二播放时间点的视频帧作为所述待查询视频中第一播放时间点的视频帧的重复视频帧。
  7. 如权利要求6所述的方法,其特征在于,所述方法还包括,
    若所述待查询视频与所述目标视频所有重复视频帧的帧数量大于第二阈值,且所述所有重复视频帧中满足连续分布条件的重复视频帧分别分布于所述待查询视频的第一时间段和所述目标视频的第二时间段,将所述待查询视频中所述第一时间段的视频和所述目标视频中所述第二时间段的视频确定为重复视频段,所述第一时间段包括所述第一播放时间点,所述第二时间段包括所述第二播放时间点,所述连续分布条件包括相邻重复视频帧的时间差小于第三阈值。
  8. 一种视频帧处理装置,其特征在于,包括:
    第一获取模块,用于获取目标视频帧的卷积神经网络CNN特征和所述目标视频帧的局部特征,所述目标视频帧的局部特征包括所述目标视频帧的第一关键点和与所述第一关键点对应的特征描述子;
    降维处理模块,用于对所述目标视频帧的CNN特征进行降维处理,获取所述目标视频帧的降维CNN特征;
    第二获取模块,用于从多个样本视频帧中获取第一视频帧,所述第一视频帧的降维CNN特征与所述目标视频帧的降维CNN特征之间的距离符合第一预设条件;
    第三获取模块,用于获取所述第一视频帧的局部特征,所述第一视频帧的局部特征包括所述第一视频帧中的第二关键点和与所述第二关键点对应的特征描述子;
    计算模块,用于计算所述第一视频帧的局部特征和所述目标视频帧的局部特征之间的匹配度;
    第一确定模块,用于若所述匹配度符合第二预设条件,将所述第一视频帧作为所述目标视频帧的重复视频帧。
  9. 如权利要求8所述的装置,其特征在于,所述第一获取模块具体用于获取待查询视频;对所述待查询视频进行等间隔视频截图,获得目标视频帧,所述目标视频帧为对所述待查询视频进行等间隔视频截图得到的多个视频帧中的任意一个视频帧;将所述目标视频帧输入CNN网络进行处理,获得所述目标视频帧的CNN特征。
  10. 如权利要求8所述的装置,其特征在于,所述第二获取模块包括:
    第一获取单元,用于获取样本视频的CNN特征索引,所述样本视频包括所述多个样本视频帧,所述CNN特征索引用于表示根据所述多个样本视频帧的降维CNN特征聚类形成的多个聚类,每个所述聚类包含聚类中心和所述聚类中的至少一个样本视频帧的降维CNN特征;
    第一计算单元,用于计算所述目标视频帧的降维CNN特征和所述多个聚类中每个聚类的聚类中心之间的距离,并将距离最近的聚类中心所对应的聚类作为目标聚类;
    第二计算单元,用于计算所述目标视频帧的降维CNN特征和所述目标聚类包含的至少一个样本视频帧中每个样本视频帧的降维CNN特征之间的距离,并将距离最近的降维CNN特征所对应的样本视频帧作为第一视频帧。
  11. 如权利要求10所述的装置,其特征在于,所述目标视频帧的局部特征包括m个第一关键点以及和所述m个关键点对应的m个第一特征描述子,一个所述第一关键点对应一个所述第一特征描述子,所述第一视频帧的局部特征包含n个第二关键点以及和所述n个第二关键点对应的n个第二特征描述子,一个所述第二关键点对应一个所述第二特征描述子,所述m为大于或者等于2的自然数,所述n为大于或者等于2的自然数;所述计算模块包括:
    第二获取单元,用于针对每个所述第一特征描述子,获取所述n个第二特征描述子中 每个第二特征描述子与所述第一特征描述子之间的n个距离;
    排序单元,用于按照从大到小的先后顺序,将所述n个距离排序形成排序队列;
    第三获取单元,用于获取所述排序队列中排序在最后的k个距离,所述k为大于或者等于2的自然数;
    第一确定单元,用于根据所述k个距离,确定所述第一特征描述子是与所述第一视频帧匹配的有效描述子;
    第二确定单元,用于根据所述m个第一特征描述子中是有效描述子的数量,确定所述第一视频帧的局部特征和所述目标视频帧的局部特征之间的匹配度。
  12. 如权利要求10或11所述的装置,其特征在于,所述装置还包括:
    第四获取模块,用于获取所述多个样本视频帧的CNN特征;
    降维处理模块,用于采用主成分分析PCA矩阵对所述多个样本视频帧的CNN特征进行降维处理,获得所述多个样本视频帧的降维CNN特征;
    聚类模块,用于对所述多个样本视频帧的降维CNN特征进行k-means聚类,形成多个聚类,每个所述聚类包含至少一个样本视频帧的降维CNN特征;
    量化压缩模块,用于对每个所述聚类包含的至少一个样本视频帧的降维CNN特征进行量化压缩,获得所述聚类对应的压缩后的CNN特征;
    生成模块,用于根据所述多个聚类、每个所述聚类对应的压缩后的CNN特征以及每个所述聚类的聚类中心,生成样本视频的CNN特征索引。
  13. 如权利要求8所述的装置,其特征在于,所述目标视频帧属于待查询视频,且所述目标视频帧在所述待查询视频中的播放时间点为第一播放时间点,所述第一视频帧属于样本视频中的目标视频,所述第一确定模块包括:
    第四获取单元,用于获取所述第一视频帧的帧标识;
    第三确定单元,用于查找所述第一视频帧的帧标识对应于所述目标视频的第二播放时间点,并将所述目标视频中第二播放时间点的视频帧作为所述待查询视频中第一播放时间点的视频帧的重复视频帧。
  14. 如权利要求13所述的装置,其特征在于,所述装置还包括:
    第二确定模块,用于若所述待查询视频与所述目标视频所有重复视频帧的帧数量大于第二阈值,且所述所有重复视频帧中满足连续分布条件的重复视频帧分别分布于所述待查询视频的第一时间段和所述目标视频的第二时间段,将所述待查询视频中所述第一时间段的视频和所述目标视频中所述第二时间段的视频确定为重复视频段,所述第一时间段包括所述第一播放时间点,所述第二时间段包括所述第二播放时间点,所述连续分布条件包括相邻重复视频帧的时间差小于第三阈值。
PCT/CN2019/114271 2019-07-18 2019-10-30 视频帧处理方法及装置 WO2021007999A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
KR1020227005421A KR20220032627A (ko) 2019-07-18 2019-10-30 프레임 처리방법 및 장치
US17/575,140 US20220139085A1 (en) 2019-07-18 2022-01-13 Method and apparatus for video frame processing

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910651319.3A CN110442749B (zh) 2019-07-18 2019-07-18 视频帧处理方法及装置
CN201910651319.3 2019-07-18

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/575,140 Continuation US20220139085A1 (en) 2019-07-18 2022-01-13 Method and apparatus for video frame processing

Publications (1)

Publication Number Publication Date
WO2021007999A1 true WO2021007999A1 (zh) 2021-01-21

Family

ID=68430885

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/114271 WO2021007999A1 (zh) 2019-07-18 2019-10-30 视频帧处理方法及装置

Country Status (4)

Country Link
US (1) US20220139085A1 (zh)
KR (1) KR20220032627A (zh)
CN (1) CN110442749B (zh)
WO (1) WO2021007999A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11868441B2 (en) * 2020-09-22 2024-01-09 Nbcuniversal Media, Llc Duplicate frames detection
CN113780319A (zh) * 2020-09-27 2021-12-10 北京沃东天骏信息技术有限公司 闭环检测方法及装置、计算机可存储介质
CN112507875A (zh) * 2020-12-10 2021-03-16 上海连尚网络科技有限公司 一种用于检测视频重复度的方法与设备

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107229710A (zh) * 2017-05-27 2017-10-03 深圳市唯特视科技有限公司 一种基于局部特征描述符的视频分析方法
CN109543735A (zh) * 2018-11-14 2019-03-29 北京工商大学 视频拷贝检测方法及其系统

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ITTO20120986A1 (it) * 2012-11-14 2014-05-15 St Microelectronics Srl Procedimento per l'estrazione di informazioni distintive da un flusso di frame video digitali, sistema e prodotto informatico relativi
CN103631932B (zh) * 2013-12-06 2017-03-01 中国科学院自动化研究所 一种对重复视频进行检测的方法
CN106021575A (zh) * 2016-05-31 2016-10-12 北京奇艺世纪科技有限公司 一种视频中同款商品检索方法及装置
CN108363771B (zh) * 2018-02-08 2020-05-01 杭州电子科技大学 一种面向公安侦查应用的图像检索方法
CN109871490B (zh) * 2019-03-08 2021-03-09 腾讯科技(深圳)有限公司 媒体资源匹配方法、装置、存储介质和计算机设备

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107229710A (zh) * 2017-05-27 2017-10-03 深圳市唯特视科技有限公司 一种基于局部特征描述符的视频分析方法
CN109543735A (zh) * 2018-11-14 2019-03-29 北京工商大学 视频拷贝检测方法及其系统

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG, LING: "Content-Based Video Copy Detection", CHINESE MASTER’S THESES FULL-TEXT DATABASE, INFORMATION SCIENCE AND TECHNOLOGY, 15 April 2018 (2018-04-15), DOI: 20200119212411X *

Also Published As

Publication number Publication date
CN110442749A (zh) 2019-11-12
US20220139085A1 (en) 2022-05-05
KR20220032627A (ko) 2022-03-15
CN110442749B (zh) 2023-05-23

Similar Documents

Publication Publication Date Title
JP5926291B2 (ja) 類似画像を識別する方法および装置
CN106445939B (zh) 图像检索、获取图像信息及图像识别方法、装置及系统
JP6333190B2 (ja) ビデオのデータベースをクエリ実行する方法
US20130254191A1 (en) Systems and methods for mobile search using bag of hash bits and boundary reranking
WO2021007999A1 (zh) 视频帧处理方法及装置
US10796196B2 (en) Large scale image recognition using global signatures and local feature information
CN108881947A (zh) 一种直播流的侵权检测方法及装置
CN104160409A (zh) 用于图像分析的方法和系统
KR20200011988A (ko) 이미지 검색 방법, 장치, 기기 및 판독 가능 저장 매체
CN110825894A (zh) 数据索引建立、数据检索方法、装置、设备和存储介质
WO2023108995A1 (zh) 向量相似度计算方法、装置、设备及存储介质
CN111090763A (zh) 一种图片自动标签方法及装置
KR20150013572A (ko) 이미지 분석 방법 및 시스템
EP3115908A1 (en) Method and apparatus for multimedia content indexing and retrieval based on product quantization
TW201828109A (zh) 圖像檢索、獲取圖像資訊及圖像識別方法、裝置及系統
CN113657504A (zh) 图像检索方法、装置、计算机设备和存储介质
CN115878824B (zh) 图像检索系统、方法和装置
JP2020013272A (ja) 特徴量生成方法、特徴量生成装置、及び特徴量生成プログラム
CN110765291A (zh) 检索方法、装置及电子设备
JP5923744B2 (ja) 画像検索システム、画像検索方法及び検索装置
CN114595350B (zh) 一种百亿级图像快速搜索的方法
JP6283308B2 (ja) 画像辞書構成方法、画像表現方法、装置、及びプログラム
Liu et al. Boosting VLAD with weighted fusion of local descriptors for image retrieval
JP6368688B2 (ja) 画像認識装置、画像認識方法、及び画像認識プログラム
Amato et al. Indexing vectors of locally aggregated descriptors using inverted files

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19938083

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 20227005421

Country of ref document: KR

Kind code of ref document: A

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 17.05.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 19938083

Country of ref document: EP

Kind code of ref document: A1