WO2021007999A1 - Video frame processing method and device - Google Patents
Video frame processing method and device
- Publication number
- WO2021007999A1 · PCT/CN2019/114271 (CN2019114271W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- video frame
- feature
- video
- cnn
- target
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/7715—Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/75—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/762—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/48—Matching video sequences
Definitions
- the present invention relates to the field of Internet technology, in particular to a video frame processing method and device.
- the inventor found that existing video websites store a large number of videos in their video libraries. To avoid storing duplicate videos in a video library, duplicate-video detection is usually performed, and in that process the detection of repeated video frames is particularly important.
- the embodiments of the present invention provide a video frame processing method and device, which can detect repeated video frames with high accuracy.
- an embodiment of the present invention provides a video frame processing method, including:
- the local feature of the target video frame includes a first key point of the target video frame and a feature descriptor corresponding to the first key point;
- the local feature of the first video frame includes a second key point in the first video frame and a feature descriptor corresponding to the second key point;
- the acquiring the convolutional neural network CNN feature of the target video frame includes:
- the target video frame is input to a CNN network for processing, and the CNN feature of the target video frame is obtained.
- the acquiring the first video frame from multiple sample video frames includes:
- Obtain a CNN feature index of a sample video, the sample video including the plurality of sample video frames, where the CNN feature index is used to represent a plurality of clusters formed by clustering the reduced-dimensional CNN features of the plurality of sample video frames, and each of the clusters includes a cluster center and the reduced-dimensional CNN feature of at least one sample video frame in the cluster;
- the local features of the target video frame include m first key points and m first feature descriptors corresponding to the m first key points, one first key point corresponding to one first feature descriptor; the local features of the first video frame include n second key points and n second feature descriptors corresponding to the n second key points, one second key point corresponding to one second feature descriptor; m is a natural number greater than or equal to 2, and n is a natural number greater than or equal to 2;
- the calculating the matching degree between the local feature of the first video frame and the local feature of the target video frame includes:
- determining, according to the k distances, that the first feature descriptor is a valid descriptor matching the first video frame;
- the degree of matching between the local features of the first video frame and the local features of the target video frame is determined.
- before the acquiring of the convolutional neural network CNN feature and the local feature of the target video frame, the method further includes:
- Dimensionality reduction processing is performed on the CNN features of the multiple sample video frames by using the PCA matrix of principal component analysis to obtain the dimensionality reduction CNN features of the multiple sample video frames;
- according to the multiple clusters, the compressed CNN feature corresponding to each cluster, and the cluster center of each cluster, a CNN feature index of the sample video is generated.
- the target video frame belongs to the video to be queried, the playback time point of the target video frame in the video to be queried is a first playback time point, the first video frame belongs to a target video among the sample videos, and the using of the first video frame as the repeated video frame of the target video frame includes:
- the method further includes:
- if the number of repeated video frames between the video to be queried and the target video is greater than a second threshold, and the repeated video frames that meet a continuous distribution condition are respectively distributed in a first time period of the video to be queried and a second time period of the target video, the video in the first time period in the video to be queried and the video in the second time period in the target video are determined as repeated video segments, where
- the first time period includes the first play time point
- the second time period includes the second play time point
- the continuous distribution condition includes that the time difference between adjacent repeated video frames is less than a third threshold.
- an embodiment of the present invention provides a video frame processing device, including:
- the first acquisition module is used to acquire the convolutional neural network CNN feature of the target video frame and the local feature of the target video frame.
- the local feature of the target video frame includes the first key point of the target video frame and the feature descriptor corresponding to the first key point;
- a dimensionality reduction processing module configured to perform dimensionality reduction processing on the CNN feature of the target video frame, and obtain the dimensionality reduction CNN feature of the target video frame;
- the second acquisition module is configured to acquire a first video frame from a plurality of sample video frames, where the distance between the reduced-dimensional CNN feature of the first video frame and the reduced-dimensional CNN feature of the target video frame meets a first preset condition;
- the third acquisition module is configured to acquire the local features of the first video frame, where the local features of the first video frame include a second key point in the first video frame and the feature descriptor corresponding to the second key point;
- a calculation module configured to calculate the degree of matching between the local feature of the first video frame and the local feature of the target video frame
- the first determining module is configured to use the first video frame as a repeated video frame of the target video frame if the matching degree meets a second preset condition.
- the first acquisition module is specifically configured to acquire the video to be queried; take screenshots of the video to be queried at equal intervals to obtain a target video frame, the target video frame being any one of the multiple video frames obtained by taking screenshots of the video to be queried at equal intervals; and input the target video frame into a CNN network for processing to obtain the CNN feature of the target video frame.
- the second acquisition module includes:
- the first acquiring unit is configured to acquire a CNN feature index of a sample video, where the sample video includes the plurality of sample video frames, and the CNN feature index is used to represent a plurality of clusters formed by clustering the reduced-dimensional CNN features of the plurality of sample video frames;
- the first calculation unit is used to calculate the distance between the reduced-dimensional CNN feature of the target video frame and the cluster center of each cluster in the plurality of clusters, and to take the cluster corresponding to the closest cluster center as the target cluster;
- the second calculation unit is used to calculate the distance between the reduced-dimensional CNN feature of the target video frame and the reduced-dimensional CNN feature of each sample video frame in the at least one sample video frame included in the target cluster, and to take the sample video frame corresponding to the closest reduced-dimensional CNN feature as the first video frame.
- the local features of the target video frame include m first key points and m first feature descriptors corresponding to the m first key points, one first key point corresponding to one first feature descriptor; the local features of the first video frame include n second key points and n second feature descriptors corresponding to the n second key points, one second key point corresponding to one second feature descriptor; m is a natural number greater than or equal to 2, and n is a natural number greater than or equal to 2.
- the calculation module includes:
- the second acquiring unit is configured to acquire, for each first feature descriptor, the n distances between the first feature descriptor and each of the n second feature descriptors;
- a sorting unit configured to sort the n distances in descending order to form a sorting queue;
- the third acquiring unit is configured to acquire the last k distances sorted in the sorting queue, where k is a natural number greater than or equal to 2;
- a first determining unit configured to determine, according to the k distances, that the first feature descriptor is a valid descriptor matching the first video frame
- the second determining unit is configured to determine the degree of matching between the local feature of the first video frame and the local feature of the target video frame according to the number of valid descriptors in the m first feature descriptors.
- the device further includes:
- the fourth acquisition module is used to acquire the CNN features of the multiple sample video frames
- a dimensionality reduction processing module configured to perform dimensionality reduction processing on the CNN features of the multiple sample video frames by using the PCA matrix of principal component analysis to obtain the dimensionality reduction CNN features of the multiple sample video frames;
- a clustering module configured to perform k-means clustering on the reduced-dimensional CNN features of the multiple sample video frames to form multiple clusters, each of the clusters containing at least one reduced-dimensional CNN feature of the sample video frames;
- a quantization compression module configured to quantify and compress the dimensionality reduction CNN features of at least one sample video frame included in each cluster to obtain compressed CNN features corresponding to the cluster;
- the generating module is configured to generate the CNN feature index of the sample video according to the multiple clusters, the compressed CNN feature corresponding to each cluster, and the cluster center of each cluster.
- the target video frame belongs to the video to be queried
- the playback time point of the target video frame in the video to be queried is the first playback time point
- the first video frame belongs to a target video among the sample videos;
- the first determining module includes:
- the fourth acquiring unit is configured to acquire the frame identifier of the first video frame
- the third determining unit is configured to find the second playback time point of the target video to which the frame identifier of the first video frame corresponds, and to use the video frame at the second playback time point in the target video as the repeated video frame of the video frame at the first playback time point in the video to be queried.
- the device further includes:
- the second determining module is configured to: if the number of repeated video frames between the video to be queried and the target video is greater than a second threshold, and the repeated video frames that meet the continuous distribution condition among all repeated video frames are respectively distributed in the first time period of the video to be queried and the second time period of the target video, determine the video of the first time period in the video to be queried and the video of the second time period in the target video as repeated video segments, where
- the first time period includes the first play time point
- the second time period includes the second play time point
- the continuous distribution condition includes that the time difference between adjacent repeated video frames is less than a third threshold.
- an embodiment of the present invention provides a video frame processing device, where the video frame processing device includes a processor and a memory;
- the processor is connected to a memory, wherein the memory is used to store program code, and the processor is used to call the program code to execute the method described in the first aspect.
- an embodiment of the present invention provides a computer storage medium, wherein the computer storage medium stores a computer program, the computer program includes program instructions, and the program instructions, when executed by a processor, execute the method described in the first aspect.
- in the embodiments of the present invention, the distance between the reduced-dimensional CNN feature of the target video frame and the reduced-dimensional CNN features of the sample video frames is used as a first layer of filtering, and the degree of matching between the local features of the target video frame and the local features of the first video frame is used as a second layer of screening, so that the repeated video frames of the target video frame can be detected with high accuracy.
- FIG. 1 is a flowchart of a method for processing video frames according to an embodiment of the present invention
- FIG. 2 is a schematic diagram of a generation process of a CNN feature index provided by an embodiment of the present invention
- FIG. 3 is a flowchart of extracting dual features of a video frame according to an embodiment of the present invention.
- FIG. 4 is a schematic diagram of retrieving duplicate videos provided by an embodiment of the present invention.
- FIG. 5 is a schematic structural diagram of a video frame processing apparatus provided by an embodiment of the present invention.
- FIG. 6 is a schematic structural diagram of another video frame processing apparatus provided by an embodiment of the present invention.
- FIG. 7 is a schematic structural diagram of another video frame processing device provided by an embodiment of the present invention.
- FIG. 1 is a schematic flowchart of a method for processing video frames according to an embodiment of the present invention.
- the video frame processing method of the embodiment of the present invention may include the following steps S101 to S106.
- S101: Acquire the convolutional neural network CNN feature of the target video frame and the local feature of the target video frame;
- S102: Perform dimensionality reduction processing on the CNN feature of the target video frame, and obtain the reduced-dimensional CNN feature of the target video frame;
- the target video frame may be any video frame in the video to be queried, or the target video frame may be a single picture frame that needs to be compared. If the target video frame is a video frame in the video to be queried, the video to be queried is captured at equal intervals to generate multiple video frames, and the target video frame can be any one of the multiple video frames.
- the screenshot interval of the video to be queried should be smaller than the screenshot interval of the sample video in the database to be compared, that is, the query video is taken at a higher frequency, such as 5 frames per second, to ensure that it can match the sample video frames in the library. More preferably, the time point of the screenshot can be randomly jittered to avoid the extreme situation where all the video frames to be queried happen to have a large interval with the sample video frames in the library.
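- A minimal sketch of that jittered sampling schedule follows; the 5 fps rate echoes the example above, while the jitter amplitude is an assumed parameter, since the text only asks for some random jitter on each time point:

```python
import random

def sample_timestamps(duration_s, fps=5.0, jitter_frac=0.3):
    """Evenly spaced screenshot time points with random jitter.

    fps=5.0 echoes the example in the text; jitter_frac is an assumed
    amplitude for the random jitter that avoids consistently bad
    alignment with the library's sample frames.
    """
    step = 1.0 / fps
    times, t = [], 0.0
    while t < duration_s:
        jittered = t + random.uniform(-jitter_frac, jitter_frac) * step
        times.append(min(max(jittered, 0.0), duration_s))  # clamp into the video
        t += step
    return times
```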
- the local feature of the target video frame is the feature descriptor of the key points extracted in the target video frame.
- the extracted key points are pixels in the target video frame that have a large difference in pixel values from adjacent pixels, for example, the edges and corners of the image of the target video frame.
- the way to obtain the CNN feature of the target video frame can be to select a CNN network pre-trained on a large-scale general image data set (the imagenet, open-images, or ml-images data set), input the target video frame into the CNN network, and pool the feature map output by the last one or more convolutional layers to obtain the original CNN feature of the target video frame.
- the CNN feature is a floating-point vector with a fixed length and a high dimensionality (for example, 2048 dimensions).
- the Principal Component Analysis (PCA) matrix can be used to perform dimensionality reduction processing on the CNN feature to obtain the dimensionality reduction CNN feature of the target video frame.
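- As a concrete illustration, the following sketch extracts a pooled CNN feature and reduces its dimensionality with a whitening PCA. The ResNet-50 backbone and the 256-dimensional PCA target are assumptions for the example; the patent only requires a CNN pre-trained on a general image data set, a pooled fixed-length feature (e.g., 2048 dimensions), and a PCA matrix:

```python
import torch
import torchvision.models as models
from sklearn.decomposition import PCA

# Assumed backbone: ResNet-50 pretrained on ImageNet, whose global average
# pool yields the fixed-length 2048-d floating-point vector described above.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()          # keep the pooled 2048-d feature
backbone.eval()

@torch.no_grad()
def cnn_feature(frame_tensor):
    """frame_tensor: (1, 3, H, W), normalized; returns a (2048,) vector."""
    return backbone(frame_tensor).squeeze(0).numpy()

# PCA with whitening, trained on CNN features of library sample frames;
# the 256-d target dimensionality is an assumption for this sketch.
pca = PCA(n_components=256, whiten=True)
# pca.fit(sample_frame_features)               # (num_frames, 2048) array
# reduced = pca.transform(feature[None, :])    # reduced-dimensional CNN feature
```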
- the local feature of the target video frame can be extracted with any one of the local feature extractors such as SIFT, SURF, ORB, AKAZE, or BRISK; among these, BRISK offers relatively high accuracy and speed.
- the local features extracted using default parameters contain a large number of key points and corresponding feature descriptors.
- the key points can be sorted by response, and only the dozens of key points with the highest response and the corresponding feature descriptors are retained, and one key point corresponds to a feature descriptor.
- for some video frames with relatively smooth texture, too few key points may be detected; in that case the detection threshold can be lowered one or more times until enough key points are detected, as in the sketch below.
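- A minimal OpenCV sketch of this keypoint screening, assuming BRISK as the extractor; top_n, min_kp, and the threshold-halving schedule are illustrative values, since the text only asks to keep the few dozen highest-response keypoints and to lower the threshold until enough are found:

```python
import cv2
import numpy as np

def brisk_local_features(gray, top_n=60, thresh=30, min_kp=20, max_tries=3):
    """Detect BRISK keypoints and keep only the highest-response ones."""
    kps, descs = [], None
    for _ in range(max_tries):
        brisk = cv2.BRISK_create(thresh=thresh)
        kps, descs = brisk.detectAndCompute(gray, None)
        if descs is not None and len(kps) >= min_kp:
            break
        thresh = max(1, thresh // 2)   # smooth frame: lower threshold, retry
    if descs is None:
        return [], None
    order = np.argsort([-kp.response for kp in kps])[:top_n]  # strongest first
    return [kps[i] for i in order], descs[order]
```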
- S103 Acquire a first video frame from a plurality of sample video frames, and the distance between the reduced-dimensional CNN feature of the first video frame and the reduced-dimensional CNN feature of the target video frame meets a first preset condition;
- the K sample video frames that are closest to the reduced-dimensional CNN feature of the target video frame are selected from the sample video frames already stored in the database, and the first video frame may be any one of the K sample video frames.
- the sample video frames stored in the database are sample video frames of multiple sample videos.
- the reduced-dimensional CNN features of the K sample video frames are the closest to the reduced-dimensional CNN feature of the target video frame; that is, when the distances between the reduced-dimensional CNN features of all sample video frames and the reduced-dimensional CNN feature of the target video frame are sorted in ascending order, the K distances corresponding to the K sample video frames occupy the top K positions.
- the selection method of selecting the closest K sample video frames may be to first obtain the CNN feature index of the sample video.
- the sample video covers all the sample videos in the library; that is, the CNN feature index is generated from the reduced-dimensional CNN features of the sample video frames of all sample videos in the library. The CNN feature index is a structure used to represent the multiple clusters formed by clustering those reduced-dimensional CNN features, where each cluster contains a cluster center and the reduced-dimensional CNN feature of at least one sample video frame. The distance between the reduced-dimensional CNN feature of the target video frame and the cluster center of each of the multiple clusters is then calculated, and the cluster corresponding to the closest cluster center is used as the target cluster.
- the target cluster can include one cluster or multiple clusters; if it includes multiple clusters, these can be the clusters whose cluster centers are closest to the reduced-dimensional CNN feature of the target video frame. Finally, the distance between the reduced-dimensional CNN feature of the target video frame and the reduced-dimensional CNN feature of each sample video frame contained in the target cluster is calculated, and any one of the closest K sample video frames is taken as the first video frame; that is, there are K first video frames.
- alternatively, the stored target cluster can be dynamically read from the database, the distances to the original CNN features of its at least one sample video frame calculated and sorted, and the closest K sample video frames obtained as the K first video frames.
- the K first video frames can also be filtered; for example, a first video frame whose distance exceeds a certain threshold is directly eliminated. A sketch of this two-stage search follows.
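- The search just described amounts to a two-stage nearest-neighbour lookup over the index: first against the cluster centers, then within the selected cluster. A minimal sketch under assumed in-memory arrays (probing a single cluster here, though the text allows several):

```python
import numpy as np

def search_index(query, centers, clusters, K=10, max_dist=None):
    """Two-stage search: nearest cluster center, then frames in that cluster.

    centers:  (N, d) array of cluster centers
    clusters: list of (frame_ids, feats) pairs, feats of shape (n_i, d)
    """
    target = int(np.argmin(np.linalg.norm(centers - query, axis=1)))
    frame_ids, feats = clusters[target]
    dists = np.linalg.norm(feats - query, axis=1)      # exact distances inside
    order = np.argsort(dists)[:K]                      # K closest sample frames
    hits = [(frame_ids[i], float(dists[i])) for i in order]
    if max_dist is not None:                           # optional filtering step
        hits = [h for h in hits if h[1] <= max_dist]
    return hits
```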
- the CNN feature index is generated based on the sample video frames contained in all sample videos in the library; screenshots of the videos in the library are taken at equal intervals to generate the corresponding sample video frames.
- the interval of the screenshots depends on the desired time resolution. For example, if you need to detect repeats of 5s and above, the interval between screenshots should be less than 5s.
- each sample video frame is input into a CNN network pre-trained on a large-scale general image data set (imagenet, open-images, or ml-images), and the pooled output is the original CNN feature: a fixed-length floating-point vector of high dimensionality (for example, 2048 dimensions).
- the local features of the sample video frames of all the sample videos in the library can be extracted separately, selecting one of the local feature extractors such as SIFT, SURF, ORB, AKAZE, or BRISK; among these, BRISK offers relatively high accuracy and speed.
- the local features extracted by default parameters contain a large number of key points and corresponding feature descriptors.
- the key points can be sorted by response, and only the dozens of key points with the highest response and the corresponding feature descriptors are kept. More preferably, for some video frames with relatively smooth texture, too few key points may be detected, and the detection threshold can be lowered one or more times until sufficient key points and corresponding feature descriptors are detected.
- a database can be established according to the video id and video frame id of each sample video, and an information tuple (video id, video frame id, video frame time point, CNN feature, local feature) is uploaded to the database for each frame. When the CNN feature is a 2048-dimensional single-precision floating-point vector (occupying 8KB of space), the database size is about 1.07TB and can be deployed on a single machine.
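- As a consistency check on those figures (arithmetic added here, not stated in the patent): 2048 dimensions × 4 bytes = 8192 bytes = 8KB per frame, and 1.07TB ÷ 8KB ≈ 1.4 × 10^8 sample video frames, which indicates the library scale the stated database size corresponds to before local features and metadata are counted.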
- FIG. 3 is a schematic diagram of CNN feature and local feature extraction provided by an embodiment of the present invention.
- a screenshot is taken of the video in the library to obtain a sample video frame, and the CNN feature map is further calculated through the CNN network.
- the CNN feature map is further pooled to obtain the CNN features of the sample video frame.
- local feature extraction is performed on the sample video frame to obtain the local feature of the sample video frame.
- the local features are screened to obtain the screened local features of the sample video frame, and the CNN features and local features of the sample video frames are uploaded to the database for storage.
- the original CNN features of all or part of the sample video frames are read from the database and loaded into memory, and the original CNN features loaded into memory are used to train the PCA matrix, which reduces the CNN feature dimensionality while keeping the original information as much as possible.
- the PCA matrix has a data whitening effect.
- the original CNN features of all sample video frames in the database are then loaded into memory at one time or in batches, and the trained PCA matrix is used to reduce the dimensionality of the original CNN features of all sample video frames to obtain the reduced-dimensional CNN features of all sample video frames.
- k-means clustering is performed on the reduced-dimensional CNN features of all sample video frames to obtain N clusters and their corresponding cluster centers, each cluster containing the reduced-dimensional CNN features of at least one sample video frame.
- the reduced-dimensional CNN features of the sample video frames included in each cluster are then quantized and compressed; scalar quantization or product quantization can be used to further compress the reduced-dimensional CNN features contained in each cluster. For example, with scalar quantization, each dimension of the reduced-dimensional CNN feature of a sample video frame can be compressed from a 4-byte floating-point number to a 1-byte integer, as in the sketch below.
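- A minimal sketch of that scalar-quantization option; per-dimension min/max calibration is an assumed design choice, since the patent does not specify how the 1-byte range is calibrated:

```python
import numpy as np

def scalar_quantize(feats):
    """Compress float32 features to uint8, one code per dimension (4x smaller)."""
    lo, hi = feats.min(axis=0), feats.max(axis=0)
    scale = np.where(hi > lo, (hi - lo) / 255.0, 1.0)   # avoid zero ranges
    codes = np.round((feats - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def scalar_dequantize(codes, lo, scale):
    """Approximate reconstruction used when computing distances later."""
    return codes.astype(np.float32) * scale + lo
```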
- S25 Generate a CNN feature index of the sample video according to the multiple clusters, the compressed CNN feature corresponding to each of the clusters, and the cluster center of each cluster.
- in this way, the CNN feature index of the sample videos in the library can be generated. The CNN feature index is a structure through which the above N clusters, the cluster centers of the N clusters, and the compressed CNN features corresponding to each cluster can be obtained.
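- One possible in-memory layout for that structure, sketched below; the field names are illustrative rather than taken from the patent:

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class CnnFeatureIndex:
    centers: np.ndarray        # (N, d) cluster centers from k-means
    frame_ids: List[list]      # frame_ids[i]: ids of the frames in cluster i
    codes: List[np.ndarray]    # codes[i]: (n_i, d) uint8 compressed features
    lo: np.ndarray             # scalar-quantization calibration, per dimension
    scale: np.ndarray
```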
- S104 Acquire a local feature of the first video frame, where the local feature of the first video frame includes a second key point in the first video frame and a feature descriptor corresponding to the second key point;
- the method for acquiring the local features of the first video frame may be to extract the local features of the first video frame from the database established in step S21.
- the corresponding local feature may be looked up in the database according to the video id to which the first video frame belongs and the frame id of the first video frame.
- the local features of the first video frame include the second key point extracted in the first video frame and the feature descriptor corresponding to the second key point.
- the key points extracted in the first video frame may be pixels in the first video frame whose pixel values differ greatly from those of adjacent pixels, such as corners in the first video frame.
- S105 Calculate the degree of matching between the local feature of the first video frame and the local feature of the target video frame
- if the degree of matching between the local feature of the first video frame and the local feature of the target video frame is greater than a first threshold, it is determined that the first video frame and the target video frame are repeated video frames.
- the matching degree between the local feature of the first video frame and the local feature of the target video frame can be calculated as follows. Suppose the local features of the target video frame include m first feature descriptors corresponding to m first key points, one first key point corresponding to one first feature descriptor, and the local features of the first video frame include n second feature descriptors. For each first feature descriptor A_i, obtain the n distances between A_i and each of the n second feature descriptors B_j, sort the n distances from largest to smallest to form a sorting queue, and take the last k distances of the queue, that is, the k closest distances. According to the k distances, it is determined whether A_i is a valid descriptor matching the first video frame. According to the number of valid descriptors among the m first feature descriptors, the degree of matching between the local features of the first video frame and the local features of the target video frame is determined.
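- A sketch of this per-descriptor check. The patent leaves the decision rule over the k closest distances unspecified, so a Lowe-style ratio test is assumed here; Euclidean distance likewise assumes float descriptors, whereas binary BRISK descriptors would normally be compared with Hamming distance:

```python
import numpy as np

def matching_degree(desc_a, desc_b, k=2, ratio=0.75):
    """Fraction of the target frame's descriptors that are valid matches."""
    a = np.asarray(desc_a, dtype=np.float32)    # (m, d) first descriptors A_i
    b = np.asarray(desc_b, dtype=np.float32)    # (n, d) second descriptors B_j
    valid = 0
    for d_a in a:
        dists = np.sort(np.linalg.norm(b - d_a, axis=1))[:k]  # k closest of n
        if len(dists) >= 2 and dists[0] < ratio * dists[1]:   # assumed rule
            valid += 1                   # A_i counted as a valid descriptor
    return valid / len(a)                # matching degree from the valid count
```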
- in this way there are K first video frames; that is, among the multiple sample video frames there may be repeated video frames of the target video frame to be queried. The playback time point at which each repeated video frame is located, and which sample video the repeated video frame comes from, can be further determined.
- if the target video frame is the video frame at the first playback time point of the video to be queried, the frame identifier of the first video frame can be obtained, and the second playback time point of the target video to which the frame identifier of the first video frame corresponds can be found from the database. The K first video frames may correspond to different playback time points of different videos.
- if the repeated video frames that meet the continuous distribution condition among all the repeated video frames are respectively distributed in the first time period of the video to be queried and the second time period of the target video, the video in the first time period in the video to be queried and the video in the second time period in the target video are determined to be repeated video segments, where the first time period includes the first playback time point, the second time period includes the second playback time point, and the continuous distribution condition includes that the time difference between adjacent repeated video frames is less than a third threshold.
- the target video frame is any video frame in the video to be queried; all video frames included in the video to be queried are compared with the sample video frames of the sample videos in the library to determine whether they are repeated video frames. Whether the number of repeated video frames between the video to be queried and the target video is greater than the second threshold can be determined by counting the repeated video frames in the video to be queried. The continuous distribution condition may be that the time difference between adjacent repeated video frames is less than the third threshold; that is, the repeated video frames are essentially continuously concentrated in the first time period of the video to be queried and the second time period of the target video, as in the sketch below.
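- A minimal sketch of grouping per-frame matches into repeated segments under these two thresholds; min_count stands in for the second threshold and max_gap for the third, with values assumed, and gaps are checked on the query-side timestamps for simplicity:

```python
def repeated_segments(pairs, min_count=5, max_gap=5.0):
    """Group matched frame pairs (t_query, t_target) into repeated segments."""
    pairs = sorted(pairs)
    segments, start = [], 0
    for i in range(1, len(pairs) + 1):
        # close the run when adjacent repeated frames are too far apart
        if i == len(pairs) or pairs[i][0] - pairs[i - 1][0] >= max_gap:
            run = pairs[start:i]
            if len(run) >= min_count:    # enough repeats to call it a segment
                segments.append(((run[0][0], run[-1][0]),    # query period
                                 (run[0][1], run[-1][1])))   # target period
            start = i
    return segments
```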
- the specific method is as follows: the sample video frames of the sample videos in the library are evenly distributed to P computers, the CNN features and local features of the sample video frames allocated to each computer are extracted in parallel, and each computer uploads the obtained CNN features and local features to the database.
- the database should adopt a solution that can support massive amounts of data, such as cloud storage.
- each of the P computers reads (total number of CNN features)/P of the CNN features, so that no feature is omitted and none is read twice.
- Clustering is performed according to the shared parameters and the CNN features read by each, and the CNN feature index is established in each memory.
- the CNN feature index on each computer is different.
- When querying repeated videos, the CNN features and local features of the video frame to be queried are calculated on one or more computers and then sent to all the computers. Each computer calculates, in parallel, the distances between the CNN feature of the video frame to be queried and the CNN features in each cluster indicated by its own CNN feature index, and sends the distances it calculated to one computer. That computer re-sorts by distance and takes the K results with the shortest distances to determine the K first video frames, which are then further checked for repeated video frames through local feature matching, as sketched below.
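- A minimal sketch of that final merge step on the collecting computer, assuming each worker returns (distance, frame id) candidate pairs from its local index:

```python
import heapq

def merge_topk(per_machine_results, K=10):
    """Merge (distance, frame_id) candidates from the P computers globally."""
    all_hits = [hit for results in per_machine_results for hit in results]
    return heapq.nsmallest(K, all_hits)   # the K closest frames overall
```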
- in the embodiments of the present invention, the distance between the reduced-dimensional CNN feature of the target video frame and the reduced-dimensional CNN features of the sample video frames is used as a first layer of filtering, and the degree of matching between the local features of the target video frame and the local features of the first video frame is used as a second layer of screening, so that the repeated video frames of the target video frame can be detected with high accuracy.
- FIG 4 is a flow chart for detecting repeated video frames according to an embodiment of the present invention.
- a screenshot of the video to be queried is taken to obtain a video frame, which is then processed by the CNN network, pooling, and dimensionality reduction to obtain the CNN feature of the video frame.
- the nearest neighbor search and filtering operations are performed to obtain the sample video frame closest to the CNN feature of the video frame as the first video frame.
- verification is then performed using local features; that is, the local features of the first video frame are read from the database. Local feature extraction is performed on the video frame, and the local features of the video frame are obtained by sorting and filtering according to response. The matching degree between the local features of the video frame and the local features of the first video frame in the library is calculated; if the matching degree is greater than the threshold, the video id and time point from which the first video frame comes are further determined.
- FIG. 5 is a schematic structural diagram of a video frame processing apparatus provided in an embodiment of the present invention.
- the video frame processing apparatus of the embodiment of the present invention may include:
- the first acquisition module 11 is used to acquire the convolutional neural network CNN feature of the target video frame and the local feature of the target video frame, where the local feature of the target video frame includes the first key point of the target video frame and the feature descriptor corresponding to the first key point;
- the dimensionality reduction processing module 12 is configured to perform dimensionality reduction processing on the CNN feature of the target video frame, and obtain the dimensionality reduction CNN feature of the target video frame;
- the target video frame may be any video frame in the video to be queried, or the target video frame may be a single picture frame that needs to be compared. If the target video frame is a video frame in the video to be queried, the video to be queried is captured at equal intervals to generate multiple video frames, and the target video frame can be any one of the multiple video frames.
- the screenshot interval of the video to be queried should be smaller than the screenshot interval of the sample video in the database to be compared, that is, the query video is taken at a higher frequency, such as 5 frames per second, to ensure that it can match the sample video frames in the library. More preferably, the time point of the screenshot can be randomly jittered to avoid the extreme situation where all the video frames to be queried happen to have a large interval with the sample video frames in the library.
- the local feature of the target video frame is the feature descriptor of the key points extracted in the target video frame.
- the extracted key points are pixels that have a large difference in pixel value from adjacent pixels, for example, the edges and corners of the image of the target video frame.
- the way to obtain the CNN feature of the target video frame can be to select a CNN network pre-trained on a large-scale general image data set (the imagenet, open-images, or ml-images data set), input the target video frame into the CNN network, and pool the feature map output by the last one or more convolutional layers to obtain the CNN feature of the target video frame.
- the CNN feature is a floating-point vector with a fixed length and a high dimensionality (for example, 2048 dimensions).
- a Principal Components Analysis (PCA) matrix can be used to reduce the dimensionality of the original CNN features.
- the local feature of the target video frame can be extracted with any one of the local feature extractors such as SIFT, SURF, ORB, AKAZE, or BRISK; among these, BRISK offers relatively high accuracy and speed.
- the local features extracted using default parameters contain a large number of key points and corresponding feature descriptors.
- the key points can be sorted by response, and only the dozens of key points with the highest response and the corresponding feature descriptors are retained, and one key point corresponds to a feature descriptor.
- for some video frames with relatively smooth texture, too few key points may be detected; in that case the detection threshold can be lowered one or more times until enough key points are detected.
- the second acquisition module 13 is configured to acquire a first video frame from a plurality of sample video frames, where the distance between the reduced-dimensional CNN feature of the first video frame and the reduced-dimensional CNN feature of the target video frame meets a first preset condition;
- the second acquisition module may include a first acquisition unit, a first calculation unit, and a second calculation unit;
- the first acquiring unit is configured to acquire a CNN feature index of a sample video, where the sample video includes the plurality of sample video frames, and the CNN feature index is used to represent a plurality of clusters formed by clustering the reduced-dimensional CNN features of the plurality of sample video frames;
- the first calculation unit is used to calculate the distance between the reduced-dimensional CNN feature of the target video frame and the cluster center of each cluster in the plurality of clusters, and to take the cluster corresponding to the closest cluster center as the target cluster;
- the second calculation unit is used to calculate the distance between the reduced-dimensional CNN feature of the target video frame and the reduced-dimensional CNN feature of each sample video frame in the at least one sample video frame included in the target cluster, and to take the sample video frame corresponding to the closest reduced-dimensional CNN feature as the first video frame.
- the K sample video frames that are closest to the reduced-dimensional CNN feature of the target video frame are selected from the sample video frames already stored in the database, and the first video frame may be any one of the K sample video frames.
- the sample video frames stored in the database are sample video frames of multiple sample videos.
- the reduced-dimensional CNN features of the K sample video frames are the closest to the reduced-dimensional CNN feature of the target video frame; that is, when the distances between the reduced-dimensional CNN features of all sample video frames and the reduced-dimensional CNN feature of the target video frame are sorted in ascending order, the K distances corresponding to the K sample video frames occupy the top K positions.
- the selection method of selecting the closest K sample video frames may be to first obtain the CNN feature index of the sample video.
- the sample video covers all the sample videos in the library; that is, the reduced-dimensional CNN feature index is generated from the reduced-dimensional CNN features of the sample video frames of all sample videos in the library. The CNN feature index is a structure used to represent the multiple clusters formed by clustering those reduced-dimensional CNN features, where each cluster contains a cluster center and the reduced-dimensional CNN feature of at least one sample video frame. The distance between the reduced-dimensional CNN feature of the target video frame and the cluster center of each of the multiple clusters is then calculated, and the cluster corresponding to the closest cluster center is used as the target cluster.
- the target cluster can include one cluster or multiple clusters; if it includes multiple clusters, these can be the clusters whose cluster centers are closest to the reduced-dimensional CNN feature of the target video frame. Finally, the distance between the reduced-dimensional CNN feature of the target video frame and the reduced-dimensional CNN feature of each sample video frame contained in the target cluster is calculated, and any one of the closest K sample video frames is taken as the first video frame; that is, there are K first video frames.
- alternatively, the stored target cluster can be dynamically read from the database, the distances to the original CNN features of its at least one sample video frame calculated and sorted, and the closest K sample video frames obtained as the K first video frames.
- the K first video frames can also be filtered; for example, a first video frame whose distance exceeds a certain threshold is directly eliminated.
- the third acquisition module 14 is configured to acquire the local features of the first video frame, where the local features of the first video frame include a second key point in the first video frame and the feature descriptor corresponding to the second key point;
- the method for acquiring the local features of the first video frame may be to extract the local features of the first video frame from the database established in step S21.
- the corresponding local feature may be looked up in the database according to the video id to which the first video frame belongs and the frame id of the first video frame.
- the local features of the first video frame include the second key point extracted in the first video frame and the second feature descriptor corresponding to the second key point. The second key point extracted in the first video frame may be a pixel in the first video frame whose pixel value differs greatly from those of adjacent pixels, such as a corner in the first video frame.
- the calculation module 15 is configured to calculate the degree of matching between the local features of the first video frame and the local features of the target video frame;
- the calculation module 15 may include a second acquiring unit, a sorting unit, a third acquiring unit, a first determining unit, and a second determining unit;
- the second acquiring unit is configured to acquire, for each first feature descriptor, the n distances between the first feature descriptor and each of the n second feature descriptors;
- a sorting unit configured to sort the n distances in descending order to form a sorting queue;
- the third acquiring unit is configured to acquire the last k distances sorted in the sorting queue, where k is a natural number greater than or equal to 2;
- a first determining unit configured to determine, according to the k distances, that the first feature descriptor is a valid descriptor matching the first video frame
- the second determining unit is configured to determine the degree of matching between the local feature of the first video frame and the local feature of the target video frame according to the number of valid descriptors in the m first feature descriptors.
- the first determining module 16 is configured to use the first video frame as a repeated video frame of the target video frame if the matching degree meets a second preset condition.
- the first determining module may include a fourth acquiring unit and a third determining unit;
- the fourth acquiring unit is configured to acquire the frame identifier of the first video frame
- the third determining unit is configured to find the target video, among the plurality of videos, to which the frame identifier of the first video frame corresponds and the second playback time point of the target video to which that frame identifier corresponds, and to determine the video frame at the first playback time point in the video to be queried and the video frame at the second playback time point in the target video as repeated video frames.
- if the degree of matching between the local feature of the first video frame and the local feature of the target video frame is greater than a first threshold, it is determined that the first video frame and the target video frame are repeated video frames.
- the matching degree between the local feature of the first video frame and the local feature of the target video frame can be calculated as follows. Suppose the local features of the target video frame include m first feature descriptors corresponding to m first key points, one first key point corresponding to one first feature descriptor, and the local features of the first video frame include n second feature descriptors. For each first feature descriptor A_i, obtain the n distances between A_i and each of the n second feature descriptors B_j, sort the n distances from largest to smallest to form a sorting queue, and take the last k distances of the queue, that is, the k closest distances. According to the k distances, it is determined whether A_i is a valid descriptor matching the first video frame. According to the number of valid descriptors among the m first feature descriptors, the degree of matching between the local features of the first video frame and the local features of the target video frame is determined.
- in this way there are K first video frames; that is, among the multiple sample video frames there may be repeated video frames of the target video frame to be queried. The playback time point at which each repeated video frame is located, and which sample video the repeated video frame comes from, can be further determined.
- if the target video frame is the video frame at the first playback time point of the video to be queried, the frame identifier of the first video frame can be obtained, and the second playback time point of the target video to which the frame identifier of the first video frame corresponds can be found from the database. The K first video frames may correspond to different playback time points of different videos.
- in the embodiments of the present invention, the CNN feature and local feature of the target video frame to be queried are first acquired; the sample video frame whose CNN feature is closest to that of the target video frame is then selected from a plurality of sample video frames as the first video frame; the local features of the first video frame are obtained; and finally the matching degree between the local features of the first video frame and the local features of the target video frame is calculated. If the matching degree is greater than the first threshold, the first video frame and the target video frame are determined to be repeated video frames. In this way, a first level of screening is performed through CNN features and a second level of screening is performed through local feature matching, so that whether the target video frame and a sample video frame are duplicate video frames can be determined with high accuracy.
- FIG. 6 is a schematic structural diagram of another video frame processing apparatus provided by an embodiment of the present invention.
- the video frame processing apparatus provided by this embodiment of the present invention includes: a first acquisition module 21, a dimensionality reduction processing module 22, a second acquisition module 23, a third acquisition module 24, a calculation module 25, a first determination module 26, a fourth acquisition module 27, a dimensionality reduction processing module 28, a clustering module 29, a quantization compression module 30, a generation module 31, and a second determination module 32. For the first acquisition module 21, the dimensionality reduction processing module 22, the second acquisition module 23, the third acquisition module 24, the calculation module 25, and the first determination module 26, please refer to the description of the embodiment in FIG. 5; details are not repeated here.
- the fourth acquisition module 27 is configured to acquire the CNN features of the multiple sample video frames
- the CNN feature index is generated based on the sample video frames contained in all sample videos in the library; screenshots of the videos in the library are taken at equal intervals to generate the corresponding sample video frames.
- the interval of the screenshots depends on the desired time resolution. For example, if you need to detect repeats of 5s and above, the interval between screenshots should be less than 5s.
- A CNN network pre-trained on a large-scale general image data set such as ImageNet, Open Images, or ML-Images can be used.
- Each extracted sample video frame is input into the CNN network, and the feature map output by the last one or more convolutional layers is pooled to obtain the original CNN feature of the video frame.
- The original CNN feature is a fixed-length floating-point vector of relatively high dimensionality (for example, 2048 dimensions).
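- The disclosure does not name a specific backbone; the sketch below assumes a ResNet-50 pretrained on ImageNet (recent torchvision API), whose last convolutional stage, globally average-pooled, yields exactly such a 2048-dimensional vector.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Assumed backbone: ResNet-50 pretrained on ImageNet; dropping the classifier
# leaves the globally average-pooled output of the last conv stage (2048-dim).
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(), T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_cnn_feature(frame_bgr):
    rgb = frame_bgr[:, :, ::-1].copy()        # OpenCV BGR -> RGB
    x = preprocess(rgb).unsqueeze(0)
    return backbone(x).squeeze(0).numpy()     # original 2048-dim CNN feature
```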
- The local features of the sample video frames of all the sample videos in the library can be extracted separately; one of the local feature extractors such as SIFT, SURF, ORB, AKAZE, or BRISK can be selected for local feature extraction.
- BRISK offers a good balance of accuracy and speed.
- The local features extracted with default parameters contain a large number of key points and corresponding feature descriptors.
- Preferably, the key points can be sorted by response, keeping only the dozens of key points with the highest response and their corresponding feature descriptors. More preferably, for video frames with relatively smooth texture, too few key points may be detected; in that case the detection threshold can be lowered one or more times until sufficient key points and corresponding feature descriptors are detected. A sketch of this procedure follows.
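- The sketch below implements this keep-top-responses / lower-threshold-and-retry loop, assuming OpenCV's BRISK implementation; the cut-off values (top 50 key points, minimum of 20, floor threshold of 5) are illustrative assumptions.

```python
import cv2
import numpy as np

def extract_local_features(gray, top_n=50, min_kp=20, thresh=30):
    """gray: single-channel uint8 image. Detect BRISK key points, keep the
    top_n by response; if a smooth frame yields too few key points, lower
    the detection threshold and retry."""
    while True:
        brisk = cv2.BRISK_create(thresh=thresh)
        kps, descs = brisk.detectAndCompute(gray, None)
        if (kps and len(kps) >= min_kp) or thresh <= 5:
            break
        thresh //= 2                       # lower the detection threshold
    if not kps:
        return [], None
    order = np.argsort([-kp.response for kp in kps])[:top_n]
    kps = [kps[i] for i in order]
    descs = descs[order]                   # BRISK descriptors are binary uint8
    return kps, descs
```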
- A database can be established keyed by the video id and video frame id of the sample video, and the (video id, video frame id, video frame time point, CNN feature, local feature) information tuple is uploaded to the database.
- The CNN feature is a 2048-dimensional single-precision floating-point vector, occupying 8 KB of space (2048 × 4 bytes).
- The resulting database size is about 1.07 TB and can be deployed on a single machine.
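- For illustration only, one such information tuple could be represented as follows; the field names are hypothetical, not part of the disclosure.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FrameRecord:
    """One row of the (video id, video frame id, time point, CNN feature,
    local feature) information tuple; field names are illustrative only."""
    video_id: int
    frame_id: int
    time_point_s: float          # playback time point of the screenshot
    cnn_feature: np.ndarray      # 2048-dim float32 -> 2048 * 4 bytes = 8 KB
    keypoints: list              # (x, y, response) triples after screening
    descriptors: np.ndarray      # e.g. top-50 BRISK descriptors (uint8)
```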
- Referring to FIG. 3, it is a schematic diagram of CNN feature and local feature extraction provided by the embodiment of the present invention.
- A screenshot is taken of the video in the library to obtain a sample video frame, and the CNN feature map is calculated through the CNN network.
- The CNN feature map is then pooled to obtain the CNN feature of the sample video frame.
- Local feature extraction is performed on the sample video frame to obtain its local features.
- The local features are screened to obtain the screened local features of the sample video frame, and the CNN features and local features of the sample video frames are uploaded to the database for storage.
- The dimensionality reduction processing module 28 is configured to perform dimensionality reduction processing on the CNN features of the multiple sample video frames by using a principal component analysis (PCA) matrix, to obtain the reduced-dimension CNN features of the multiple sample video frames.
- Specifically, the original CNN features of all or some sample video frames are read from the database and loaded into memory, and the loaded features are used to train the PCA matrix, which reduces the CNN feature dimensionality while preserving as much of the original information as possible.
- The PCA matrix also has a data whitening effect.
- The original CNN features of all sample video frames in the database are then loaded into memory at one time or in batches, and the trained PCA matrix is used to reduce the dimensionality of the original CNN features of all sample video frames, obtaining the reduced-dimension CNN features of all sample video frames.
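- A minimal sketch of this train-then-transform flow using scikit-learn's PCA; the target dimensionality of 256 is an assumed value, since the disclosure only requires it to be lower than the original 2048.

```python
import numpy as np
from sklearn.decomposition import PCA

# Assumed reduced dimension of 256; whiten=True gives the whitening effect
# noted above.
pca = PCA(n_components=256, whiten=True)

def train_pca(original_features):
    """original_features: (num_frames, 2048) array sampled from the database."""
    pca.fit(original_features)
    return pca

def reduce_features(original_features):
    """Apply the trained PCA matrix, batch by batch if memory is tight."""
    return pca.transform(original_features).astype(np.float32)
```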
- The clustering module 29 is configured to perform k-means clustering on the reduced-dimension CNN features of the multiple sample video frames to form multiple clusters, each cluster containing the reduced-dimension CNN feature of at least one sample video frame.
- Specifically, k-means clustering is performed on the reduced-dimension CNN features of all sample video frames to obtain N clusters and the corresponding cluster centers; each cluster contains the reduced-dimension CNN features of at least one sample video frame.
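- A sketch of the clustering step with scikit-learn; N = 1024 clusters is an assumed design parameter, since the disclosure leaves N open.

```python
from sklearn.cluster import KMeans

def build_clusters(reduced_features, n_clusters=1024):
    """reduced_features: (num_frames, d) reduced-dimension CNN features.
    Returns the N cluster centers and the cluster id of each frame."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = km.fit_predict(reduced_features)
    return km.cluster_centers_, labels
```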
- The quantization compression module 30 is configured to quantize and compress the reduced-dimension CNN features of the at least one sample video frame contained in each cluster, to obtain the compressed CNN features corresponding to the cluster.
- Specifically, the reduced-dimension CNN features of the sample video frames contained in each cluster are quantized and compressed; scalar quantization or product quantization can be used to further compress the reduced-dimension CNN features of the sample video frames contained in each cluster. For example, with scalar quantization, each dimension of the reduced-dimension CNN feature of a sample video frame can be compressed from a 4-byte floating-point number to a 1-byte integer.
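- A minimal scalar-quantization sketch in the 4-byte-float-to-1-byte-integer spirit described above; the per-dimension min/max codebook is an assumption, as the exact codebook construction is not specified in the disclosure.

```python
import numpy as np

def scalar_quantize(features):
    """Per-dimension scalar quantization: map each 4-byte float dimension
    to a 1-byte integer using the per-dimension min/max over the cluster."""
    lo = features.min(axis=0)
    hi = features.max(axis=0)
    scale = np.where(hi > lo, (hi - lo) / 255.0, 1.0)
    codes = np.round((features - lo) / scale).astype(np.uint8)
    return codes, lo, scale            # lo/scale are needed to dequantize

def dequantize(codes, lo, scale):
    """Approximate reconstruction of the reduced-dimension CNN features."""
    return codes.astype(np.float32) * scale + lo
```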
- The generation module 31 is configured to generate the CNN feature index of the sample videos according to the multiple clusters, the compressed CNN feature corresponding to each cluster, and the cluster center of each cluster.
- In this way, the CNN feature index of the sample videos in the library can be generated.
- The CNN feature index is a data structure through which the above N clusters, the cluster centers of the N clusters, and the compressed CNN features corresponding to each cluster can be obtained.
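- For illustration, the index could be laid out as the following structure; all field names are assumptions that simply tie together the clusters, centers, and compressed features described above.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CnnFeatureIndex:
    """Illustrative index layout; names are assumptions, not the patent's."""
    centers: np.ndarray   # (N, d) cluster centers
    codes: list           # per cluster: (num_frames_i, d) uint8 codes
    lo: list              # per cluster: scalar-quantization offsets
    scale: list           # per cluster: scalar-quantization scales
    frame_ids: list       # per cluster: (video_id, frame_id) for each row
```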
- The second determination module 32 is configured to: if the number of repeated video frames between the video to be queried and the target video is greater than a second threshold, and the repeated video frames that meet the continuous distribution condition among all the repeated video frames are respectively distributed in a first time period of the video to be queried and a second time period of the target video, determine the video of the first time period in the video to be queried and the video of the second time period in the target video as repeated video segments; the first time period includes the first playback time point, the second time period includes the second playback time point, and the continuous distribution condition includes that the time difference between adjacent repeated video frames is less than a third threshold.
- The target video frame is any video frame in the video to be queried; all video frames of the video to be queried are compared with the sample video frames of the sample videos in the library to determine whether each is a repeated video frame, and the total number of repeated video frames between the video to be queried and the target video can then be counted and compared against the second threshold.
- The continuous distribution condition may be that the time difference between adjacent repeated video frames is less than the third threshold, that is, the repeated video frames are essentially continuously concentrated in the first time period of the video to be queried and the second time period of the target video.
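- The sketch below shows how repeated video segments might be assembled from matched frame pairs under these conditions; the exact run-building rule is not spelled out in the disclosure, so the grouping logic here is an assumption.

```python
def repeated_segments(matches, second_threshold=10, third_threshold=5.0):
    """matches: list of (t_query, t_target) playback time points of repeated
    frames between the video to be queried and one target video, sorted by
    t_query. Returns (query_span, target_span) pairs of repeated segments."""
    if len(matches) <= second_threshold:
        return []
    segments, run = [], [matches[0]]
    for prev, cur in zip(matches, matches[1:]):
        # continuous distribution: adjacent repeated frames are close in time
        if cur[0] - prev[0] < third_threshold and cur[1] - prev[1] < third_threshold:
            run.append(cur)
        else:
            if len(run) > 1:
                segments.append(((run[0][0], run[-1][0]), (run[0][1], run[-1][1])))
            run = [cur]
    if len(run) > 1:
        segments.append(((run[0][0], run[-1][0]), (run[0][1], run[-1][1])))
    return segments
```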
- In the embodiment of the present invention, the distance between the reduced-dimension CNN feature of the target video frame and that of the sample video frame is used as the first layer of filtering, and the degree of matching between the local features of the target video frame and those of the first video frame is used as the second layer of filtering, so that repeated video frames of the target video frame can be detected with high accuracy.
- The embodiment of the present invention also provides a computer storage medium.
- The computer storage medium may store multiple instructions, and the instructions are suitable for being loaded by a processor to execute the method steps of the foregoing method embodiment; for the specific process, reference may be made to the description of the embodiment shown in FIG. 1, which will not be repeated here.
- The video frame processing apparatus 1000 may include: at least one processor 1001 (such as a CPU), at least one communication interface 1003, a memory 1004, and at least one communication bus 1002.
- The communication bus 1002 is used to implement connection and communication between these components.
- The communication interface 1003 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface).
- The memory 1004 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory.
- The memory 1004 may also be at least one storage device located remotely from the foregoing processor 1001.
- The memory 1004, as a computer storage medium, may include an operating system, a network communication module, and program instructions.
- The processor 1001 may be used to load the program instructions stored in the memory 1004 and specifically perform the following operations:
- The local features of the target video frame include the first key points of the target video frame and the feature descriptors corresponding to the first key points;
- the local features of the first video frame include the second key points in the first video frame and the feature descriptors corresponding to the second key points;
- the acquiring the convolutional neural network CNN feature of the target video frame includes:
- inputting the target video frame into a CNN network for processing to obtain the CNN feature of the target video frame.
- The acquiring the first video frame from the multiple sample video frames includes:
- obtaining a CNN feature index of a sample video, the sample video including the multiple sample video frames, the CNN feature index being used to represent multiple clusters formed by clustering the reduced-dimension CNN features of the multiple sample video frames, each of the clusters including a cluster center and the reduced-dimension CNN feature of at least one sample video frame in the cluster;
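- Building on the illustrative CnnFeatureIndex layout above, a coarse-to-fine query could proceed as in this sketch (an assumption about the lookup, not the disclosed implementation): the query feature is first matched against cluster centers, then against the dequantized features within the winning cluster.

```python
import numpy as np

def search_index(index, query_feat):
    """Coarse-to-fine search over the illustrative CnnFeatureIndex:
    nearest cluster center first, then nearest frame within that cluster."""
    # coarse step: cluster whose center is nearest to the query feature
    c = int(np.argmin(np.linalg.norm(index.centers - query_feat, axis=1)))
    # fine step: dequantize that cluster's compressed features and compare
    feats = index.codes[c].astype(np.float32) * index.scale[c] + index.lo[c]
    best = int(np.argmin(np.linalg.norm(feats - query_feat, axis=1)))
    return index.frame_ids[c][best]   # (video_id, frame_id) of first video frame
```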
- The local features of the target video frame include m first key points and m first feature descriptors corresponding to the m first key points, one first key point corresponding to one first feature descriptor;
- the local features of the first video frame include n second key points and n second feature descriptors corresponding to the n second key points, one second key point corresponding to one second feature descriptor;
- m is a natural number greater than or equal to 2, and n is a natural number greater than or equal to 2;
- the calculating the matching degree between the local features of the first video frame and the local features of the target video frame includes:
- determining, according to the k distances, that the first feature descriptor is a valid descriptor matching the first video frame; and
- determining the matching degree between the local features of the first video frame and the local features of the target video frame according to the number of valid descriptors among the m first feature descriptors.
- Before the acquiring the convolutional neural network CNN features and local features of the target video frame, the method further includes:
- performing dimensionality reduction processing on the CNN features of the multiple sample video frames by using a principal component analysis (PCA) matrix to obtain the reduced-dimension CNN features of the multiple sample video frames; and
- generating the CNN feature index of the sample video according to the multiple clusters, the compressed CNN feature corresponding to each cluster, and the cluster center of each cluster.
- The target video frame belongs to the video to be queried;
- the playback time point of the target video frame in the video to be queried is the first playback time point;
- the first video frame belongs to the target video among the sample videos;
- the taking the first video frame as the repeated video frame of the target video frame includes:
- The processor 1001 may also be used to load the program instructions stored in the memory 1004 to perform the following operations:
- if the repeated video frames that meet the continuous distribution condition are respectively distributed in the first time period of the video to be queried and the second time period of the target video, determining the video in the first time period of the video to be queried and the video in the second time period of the target video as repeated video segments, where
- the first time period includes the first playback time point,
- the second time period includes the second playback time point, and
- the continuous distribution condition includes that the time difference between adjacent repeated video frames is less than a third threshold.
- The program can be stored in a computer-readable storage medium, and when executed, it includes the processes of the above-mentioned method embodiments.
- The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Library & Information Science (AREA)
- Medical Informatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Biodiversity & Conservation Biology (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Mathematical Physics (AREA)
- Biophysics (AREA)
- Image Analysis (AREA)
Abstract
Description
Claims (14)
- A video frame processing method, comprising: acquiring a convolutional neural network (CNN) feature of a target video frame and a local feature of the target video frame, the local feature of the target video frame comprising first key points of the target video frame and feature descriptors corresponding to the first key points; performing dimensionality reduction processing on the CNN feature of the target video frame to obtain a reduced-dimension CNN feature of the target video frame; acquiring a first video frame from a plurality of sample video frames, a distance between the reduced-dimension CNN feature of the first video frame and the reduced-dimension CNN feature of the target video frame satisfying a first preset condition; acquiring a local feature of the first video frame, the local feature of the first video frame comprising second key points in the first video frame and feature descriptors corresponding to the second key points; calculating a matching degree between the local feature of the first video frame and the local feature of the target video frame; and if the matching degree satisfies a second preset condition, taking the first video frame as a repeated video frame of the target video frame.
- The method according to claim 1, wherein the acquiring the convolutional neural network CNN feature of the target video frame comprises: acquiring a video to be queried; taking video screenshots of the video to be queried at equal intervals to obtain the target video frame, the target video frame being any one of a plurality of video frames obtained by taking the equal-interval video screenshots of the video to be queried; and inputting the target video frame into a CNN network for processing to obtain the CNN feature of the target video frame.
- The method according to claim 1, wherein the acquiring the first video frame from the plurality of sample video frames comprises: acquiring a CNN feature index of a sample video, the sample video comprising the plurality of sample video frames, the CNN feature index being used to represent a plurality of clusters formed by clustering the reduced-dimension CNN features of the plurality of sample video frames, each of the clusters comprising a cluster center and the reduced-dimension CNN feature of at least one sample video frame in the cluster; calculating distances between the reduced-dimension CNN feature of the target video frame and the cluster center of each of the plurality of clusters, and taking the cluster corresponding to the nearest cluster center as a target cluster; and calculating distances between the reduced-dimension CNN feature of the target video frame and the reduced-dimension CNN feature of each sample video frame in the at least one sample video frame contained in the target cluster, and taking the sample video frame corresponding to the nearest reduced-dimension CNN feature as the first video frame.
- The method according to claim 3, wherein the local feature of the target video frame comprises m first key points and m first feature descriptors corresponding to the m first key points, one first key point corresponding to one first feature descriptor; the local feature of the first video frame comprises n second key points and n second feature descriptors corresponding to the n second key points, one second key point corresponding to one second feature descriptor; m is a natural number greater than or equal to 2, and n is a natural number greater than or equal to 2; and the calculating the matching degree between the local feature of the first video frame and the local feature of the target video frame comprises: for each first feature descriptor, obtaining n distances between each of the n second feature descriptors and the first feature descriptor; sorting the n distances in order from largest to smallest to form a sorting queue; obtaining the last k distances in the sorting queue, k being a natural number greater than or equal to 2; determining, according to the k distances, that the first feature descriptor is a valid descriptor matching the first video frame; and determining the matching degree between the local feature of the first video frame and the local feature of the target video frame according to the number of valid descriptors among the m first feature descriptors.
- The method according to claim 3 or 4, wherein before the acquiring the convolutional neural network CNN feature and local feature of the target video frame, the method further comprises: acquiring CNN features of the plurality of sample video frames; performing dimensionality reduction processing on the CNN features of the plurality of sample video frames by using a principal component analysis (PCA) matrix to obtain the reduced-dimension CNN features of the plurality of sample video frames; performing k-means clustering on the reduced-dimension CNN features of the plurality of sample video frames to form a plurality of clusters, each of the clusters containing the reduced-dimension CNN feature of at least one sample video frame; performing quantization compression on the reduced-dimension CNN feature of the at least one sample video frame contained in each of the clusters to obtain a compressed CNN feature corresponding to the cluster; and generating the CNN feature index of the sample video according to the plurality of clusters, the compressed CNN feature corresponding to each of the clusters, and the cluster center of each of the clusters.
- The method according to claim 1, wherein the target video frame belongs to a video to be queried, the playback time point of the target video frame in the video to be queried is a first playback time point, and the first video frame belongs to a target video among the sample videos; and the taking the first video frame as the repeated video frame of the target video frame comprises: acquiring a frame identifier of the first video frame; and finding the second playback time point of the target video corresponding to the frame identifier of the first video frame, and taking the video frame at the second playback time point in the target video as the repeated video frame of the video frame at the first playback time point in the video to be queried.
- The method according to claim 6, wherein the method further comprises: if the number of all repeated video frames of the video to be queried and the target video is greater than a second threshold, and the repeated video frames satisfying a continuous distribution condition among all the repeated video frames are respectively distributed in a first time period of the video to be queried and a second time period of the target video, determining the video of the first time period in the video to be queried and the video of the second time period in the target video as repeated video segments, the first time period including the first playback time point, the second time period including the second playback time point, and the continuous distribution condition including that the time difference between adjacent repeated video frames is less than a third threshold.
- A video frame processing apparatus, comprising: a first acquisition module configured to acquire a convolutional neural network CNN feature of a target video frame and a local feature of the target video frame, the local feature of the target video frame comprising first key points of the target video frame and feature descriptors corresponding to the first key points; a dimensionality reduction processing module configured to perform dimensionality reduction processing on the CNN feature of the target video frame to obtain a reduced-dimension CNN feature of the target video frame; a second acquisition module configured to acquire a first video frame from a plurality of sample video frames, a distance between the reduced-dimension CNN feature of the first video frame and the reduced-dimension CNN feature of the target video frame satisfying a first preset condition; a third acquisition module configured to acquire a local feature of the first video frame, the local feature of the first video frame comprising second key points in the first video frame and feature descriptors corresponding to the second key points; a calculation module configured to calculate a matching degree between the local feature of the first video frame and the local feature of the target video frame; and a first determination module configured to, if the matching degree satisfies a second preset condition, take the first video frame as a repeated video frame of the target video frame.
- The apparatus according to claim 8, wherein the first acquisition module is specifically configured to: acquire a video to be queried; take video screenshots of the video to be queried at equal intervals to obtain the target video frame, the target video frame being any one of a plurality of video frames obtained by taking the equal-interval video screenshots of the video to be queried; and input the target video frame into a CNN network for processing to obtain the CNN feature of the target video frame.
- The apparatus according to claim 8, wherein the second acquisition module comprises: a first acquisition unit configured to acquire a CNN feature index of a sample video, the sample video comprising the plurality of sample video frames, the CNN feature index being used to represent a plurality of clusters formed by clustering the reduced-dimension CNN features of the plurality of sample video frames, each of the clusters comprising a cluster center and the reduced-dimension CNN feature of at least one sample video frame in the cluster; a first calculation unit configured to calculate distances between the reduced-dimension CNN feature of the target video frame and the cluster center of each of the plurality of clusters, and take the cluster corresponding to the nearest cluster center as a target cluster; and a second calculation unit configured to calculate distances between the reduced-dimension CNN feature of the target video frame and the reduced-dimension CNN feature of each sample video frame in the at least one sample video frame contained in the target cluster, and take the sample video frame corresponding to the nearest reduced-dimension CNN feature as the first video frame.
- The apparatus according to claim 10, wherein the local feature of the target video frame comprises m first key points and m first feature descriptors corresponding to the m first key points, one first key point corresponding to one first feature descriptor; the local feature of the first video frame comprises n second key points and n second feature descriptors corresponding to the n second key points, one second key point corresponding to one second feature descriptor; m is a natural number greater than or equal to 2, and n is a natural number greater than or equal to 2; and the calculation module comprises: a second acquisition unit configured to, for each first feature descriptor, obtain n distances between each of the n second feature descriptors and the first feature descriptor; a sorting unit configured to sort the n distances in order from largest to smallest to form a sorting queue; a third acquisition unit configured to obtain the last k distances in the sorting queue, k being a natural number greater than or equal to 2; a first determination unit configured to determine, according to the k distances, that the first feature descriptor is a valid descriptor matching the first video frame; and a second determination unit configured to determine the matching degree between the local feature of the first video frame and the local feature of the target video frame according to the number of valid descriptors among the m first feature descriptors.
- The apparatus according to claim 10 or 11, wherein the apparatus further comprises: a fourth acquisition module configured to acquire CNN features of the plurality of sample video frames; a dimensionality reduction processing module configured to perform dimensionality reduction processing on the CNN features of the plurality of sample video frames by using a principal component analysis (PCA) matrix to obtain the reduced-dimension CNN features of the plurality of sample video frames; a clustering module configured to perform k-means clustering on the reduced-dimension CNN features of the plurality of sample video frames to form a plurality of clusters, each of the clusters containing the reduced-dimension CNN feature of at least one sample video frame; a quantization compression module configured to perform quantization compression on the reduced-dimension CNN feature of the at least one sample video frame contained in each of the clusters to obtain a compressed CNN feature corresponding to the cluster; and a generation module configured to generate the CNN feature index of the sample video according to the plurality of clusters, the compressed CNN feature corresponding to each of the clusters, and the cluster center of each of the clusters.
- The apparatus according to claim 8, wherein the target video frame belongs to a video to be queried, the playback time point of the target video frame in the video to be queried is a first playback time point, and the first video frame belongs to a target video among the sample videos; and the first determination module comprises: a fourth acquisition unit configured to acquire a frame identifier of the first video frame; and a third determination unit configured to find the second playback time point of the target video corresponding to the frame identifier of the first video frame, and take the video frame at the second playback time point in the target video as the repeated video frame of the video frame at the first playback time point in the video to be queried.
- The apparatus according to claim 13, wherein the apparatus further comprises: a second determination module configured to, if the number of all repeated video frames of the video to be queried and the target video is greater than a second threshold, and the repeated video frames satisfying a continuous distribution condition among all the repeated video frames are respectively distributed in a first time period of the video to be queried and a second time period of the target video, determine the video of the first time period in the video to be queried and the video of the second time period in the target video as repeated video segments, the first time period including the first playback time point, the second time period including the second playback time point, and the continuous distribution condition including that the time difference between adjacent repeated video frames is less than a third threshold.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020227005421A KR20220032627A (ko) | 2019-07-18 | 2019-10-30 | Frame processing method and apparatus |
US17/575,140 US20220139085A1 (en) | 2019-07-18 | 2022-01-13 | Method and apparatus for video frame processing |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910651319.3A CN110442749B (zh) | 2019-07-18 | 2019-07-18 | Video frame processing method and apparatus |
CN201910651319.3 | 2019-07-18 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/575,140 Continuation US20220139085A1 (en) | 2019-07-18 | 2022-01-13 | Method and apparatus for video frame processing |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021007999A1 true WO2021007999A1 (zh) | 2021-01-21 |
Family
ID=68430885
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/114271 WO2021007999A1 (zh) | 2019-07-18 | 2019-10-30 | Video frame processing method and apparatus |
Country Status (4)
Country | Link |
---|---|
US (1) | US20220139085A1 (zh) |
KR (1) | KR20220032627A (zh) |
CN (1) | CN110442749B (zh) |
WO (1) | WO2021007999A1 (zh) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11868441B2 (en) * | 2020-09-22 | 2024-01-09 | Nbcuniversal Media, Llc | Duplicate frames detection |
CN113780319A (zh) * | 2020-09-27 | 2021-12-10 | 北京沃东天骏信息技术有限公司 | Loop closure detection method and apparatus, and computer storage medium |
CN112507875A (zh) * | 2020-12-10 | 2021-03-16 | 上海连尚网络科技有限公司 | Method and device for detecting video repetition degree |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107229710A (zh) * | 2017-05-27 | 2017-10-03 | 深圳市唯特视科技有限公司 | Video analysis method based on local feature descriptors |
CN109543735A (zh) * | 2018-11-14 | 2019-03-29 | 北京工商大学 | Video copy detection method and system |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
ITTO20120986A1 (it) * | 2012-11-14 | 2014-05-15 | St Microelectronics Srl | Method for extracting distinctive information from a stream of digital video frames, and related system and computer product |
CN103631932B (zh) * | 2013-12-06 | 2017-03-01 | 中国科学院自动化研究所 | Method for detecting repeated videos |
CN106021575A (zh) * | 2016-05-31 | 2016-10-12 | 北京奇艺世纪科技有限公司 | Method and apparatus for retrieving identical products in videos |
CN108363771B (zh) * | 2018-02-08 | 2020-05-01 | 杭州电子科技大学 | Image retrieval method for public security investigation applications |
CN109871490B (zh) * | 2019-03-08 | 2021-03-09 | 腾讯科技(深圳)有限公司 | Media resource matching method and apparatus, storage medium, and computer device |
-
2019
- 2019-07-18 CN CN201910651319.3A patent/CN110442749B/zh active Active
- 2019-10-30 WO PCT/CN2019/114271 patent/WO2021007999A1/zh active Application Filing
- 2019-10-30 KR KR1020227005421A patent/KR20220032627A/ko unknown
-
2022
- 2022-01-13 US US17/575,140 patent/US20220139085A1/en active Pending
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107229710A (zh) * | 2017-05-27 | 2017-10-03 | 深圳市唯特视科技有限公司 | 一种基于局部特征描述符的视频分析方法 |
CN109543735A (zh) * | 2018-11-14 | 2019-03-29 | 北京工商大学 | 视频拷贝检测方法及其系统 |
Non-Patent Citations (1)
Title |
---|
WANG, LING: "Content-Based Video Copy Detection", CHINESE MASTER’S THESES FULL-TEXT DATABASE, INFORMATION SCIENCE AND TECHNOLOGY, 15 April 2018 (2018-04-15), DOI: 20200119212411X * |
Also Published As
Publication number | Publication date |
---|---|
CN110442749A (zh) | 2019-11-12 |
US20220139085A1 (en) | 2022-05-05 |
KR20220032627A (ko) | 2022-03-15 |
CN110442749B (zh) | 2023-05-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19938083 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
ENP | Entry into the national phase |
Ref document number: 20227005421 Country of ref document: KR Kind code of ref document: A |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 17.05.2022) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19938083 Country of ref document: EP Kind code of ref document: A1 |