CN111897995A - Video feature extraction method and video quantization method applying same - Google Patents

Video feature extraction method and video quantization method applying same

Info

Publication number
CN111897995A
Authority
CN
China
Prior art keywords
feature
quantization
matrix
video
heat map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010771697.8A
Other languages
Chinese (zh)
Inventor
宋井宽 (Song Jingkuan)
郎睿敏 (Lang Ruimin)
朱筱苏 (Zhu Xiaosu)
高联丽 (Gao Lianli)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Jingzhili Technology Co ltd
University of Electronic Science and Technology of China
Original Assignee
Chengdu Jingzhili Technology Co ltd
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Jingzhili Technology Co ltd, University of Electronic Science and Technology of China filed Critical Chengdu Jingzhili Technology Co ltd
Priority to CN202010771697.8A priority Critical patent/CN111897995A/en
Publication of CN111897995A publication Critical patent/CN111897995A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/73 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks


Abstract

The invention relates to the technical field of computer vision, and in particular to a video feature extraction method and a video quantization method applying the same. The feature extraction method aims to solve the technical problem of effectively obtaining video features that contain rich context information, and the quantization method applies it to compress such features. The video feature extraction method comprises the following steps: extracting original visual features from a target video and constructing an original feature matrix, wherein the original feature matrix contains the spatial information of each sampled image and the time-sequence information between the sampled images; generating a sampled-image spatial attention heat map and a sampled-image time-sequence attention heat map from the original feature matrix; and adding and fusing the original feature matrix, the spatial attention heat map and the time-sequence attention heat map to obtain a target feature matrix.

Description

Video feature extraction method and video quantization method applying same
Technical Field
The invention relates to the technical field of computer vision, in particular to a video feature extraction method and a video quantization method applying the same.
Background
Video retrieval is a fundamental and challenging problem in computer vision: it aims to retrieve, from a vast video library, the videos that are most similar to an input video. Unsupervised quantization-based video retrieval achieves fast retrieval by compressing the visual features of original, unlabeled videos into compact binary codes.
At present, a known unsupervised quantization-based video retrieval method extracts the visual feature information of each video frame with a convolutional neural network, processes the per-frame features with a recurrent neural network to obtain video features, and compresses the feature information into a very short binary code with a hashing algorithm, thereby reducing the size of the database and increasing retrieval speed.
The above method has two problems. First, convolutional and recurrent neural networks have difficulty capturing information over long time ranges, so the context information of the video is poorly preserved and better video features cannot be obtained. Second, with a large-scale video library the video features become very complex, and a hashing algorithm struggles to achieve good accuracy.
Summary of the invention
The technical problem to be solved by the invention is as follows: a video feature extraction method is provided to effectively obtain video features containing rich context information, and a video quantization method applying the video feature extraction method is also provided.
The technical scheme adopted by the invention to solve the technical problem is as follows: a video feature extraction method comprising the following steps: extracting original visual features from a target video and constructing an original feature matrix, wherein the original feature matrix contains the spatial information of each sampled image and the time-sequence information between the sampled images; generating a sampled-image spatial attention heat map and a sampled-image time-sequence attention heat map from the original feature matrix; and adding and fusing the original feature matrix, the spatial attention heat map and the time-sequence attention heat map to obtain a target feature matrix.
According to an embodiment provided by the present specification, generating the sampled-image spatial attention heat map from the original feature matrix includes: generating, from the original feature matrix, a row-dimension attention heat map representing the information dependency between each pixel in each sampled image and all other pixels in the same row as that pixel; and generating, from the original feature matrix, a column-dimension attention heat map representing the information dependency between each pixel in each sampled image and all other pixels in the same column as that pixel.
According to an embodiment provided by the present specification, generating the sampled-image time-sequence attention heat map from the original feature matrix includes: generating, from the original feature matrix, a time-sequence-dimension attention heat map representing the information dependency between each pixel in each sampled image and all other pixels at the same spatial position across the time sequence.
According to the embodiments provided in the present specification, let the original feature matrix of the target video be $O_i \in \mathbb{R}^{T' \times h \times w \times c}$, where h is the height of each video frame, w is the width of each video frame, c is the number of channels of each video frame, and T' is the number of sampled frames. Generating, according to the original feature matrix, the row-dimension attention heat map representing the information dependency between each pixel in each sampled image and all other pixels in the same row as that pixel then comprises the following steps: reshaping the original feature matrix to $\{T' \times h\} \times w \times c$; performing convolution on the reshaped matrix with three convolution kernels of size $c \times 1 \times 1$ (channels × height × width) to obtain three feature matrices $r_\theta$, $r_\rho$, $r_\gamma$, each of dimension $\{T' \times h\} \times w \times c$; and combining the three feature matrices according to the formula

$$r = \mathrm{softmax}\!\left(r_\rho \cdot r_\gamma^{\mathsf{T}}\right) \cdot r_\theta$$

to obtain the row-dimension attention heat map r, where $r_\gamma^{\mathsf{T}}$ is the transpose of the feature matrix $r_\gamma$.
According to the embodiments provided in the present specification, with the original feature matrix of the target video $O_i \in \mathbb{R}^{T' \times h \times w \times c}$ defined as above, generating, according to the original feature matrix, the column-dimension attention heat map representing the information dependency between each pixel in each sampled image and all other pixels in the same column as that pixel comprises the following steps: reshaping the original feature matrix to $\{T' \times w\} \times h \times c$; performing convolution on the reshaped matrix with three convolution kernels of size $c \times 1 \times 1$ (channels × height × width) to obtain three feature matrices $c_\theta$, $c_\rho$, $c_\gamma$, each of dimension $\{T' \times w\} \times h \times c$; and combining the three feature matrices according to the formula

$$c = \mathrm{softmax}\!\left(c_\rho \cdot c_\gamma^{\mathsf{T}}\right) \cdot c_\theta$$

to obtain the column-dimension attention heat map c, where $c_\gamma^{\mathsf{T}}$ is the transpose of the feature matrix $c_\gamma$.
According to the embodiments provided in the present specification, with the original feature matrix of the target video $O_i \in \mathbb{R}^{T' \times h \times w \times c}$ defined as above, generating, according to the original feature matrix, the time-sequence-dimension attention heat map representing the information dependency between each pixel in each sampled image and all other pixels at the same spatial position across the time sequence comprises the following steps: reshaping the original feature matrix to $\{w \times h\} \times T' \times c$; performing convolution on the reshaped matrix with three convolution kernels of size $c \times 1 \times 1$ to obtain three feature matrices $t_\theta$, $t_\rho$, $t_\gamma$, each of dimension $\{w \times h\} \times T' \times c$; and combining the three feature matrices according to the formula

$$t = \mathrm{softmax}\!\left(t_\rho \cdot t_\gamma^{\mathsf{T}}\right) \cdot t_\theta$$

to obtain the time-sequence-dimension attention heat map t, where $t_\gamma^{\mathsf{T}}$ is the transpose of the feature matrix $t_\gamma$.
In order to achieve the above object, according to another aspect of the embodiments provided in the present specification, a video quantization method is provided. The method comprises the following steps: obtaining a target feature matrix according to any one of the above video feature extraction methods; converting the target feature matrix into a feature vector representing the target video; and compressing the feature vector into a binary code to achieve video quantization.
According to an embodiment provided by the present specification, converting the target feature matrix into the feature vector representing the target video includes: reshaping the row-dimension attention heat map r, the column-dimension attention heat map c and the time-sequence-dimension attention heat map t back to $T' \times h \times w \times c$; adding the reshaped heat maps r, c and t to the original feature matrix $O_i$ to obtain a feature matrix $O'_i$ fused with three-dimensional attention, whose dimension is the same as that of the original feature matrix $O_i$; feeding $O'_i$ into the three-dimensional attention module again and repeating the above computation to obtain a feature matrix $O''_i$ fused with three-dimensional attention twice, whose dimension is again $T' \times h \times w \times c$; and finally performing global average pooling over the T', h and w dimensions of $O''_i$ to obtain a final feature matrix of dimension $1 \times 1 \times 1 \times c$, i.e., a c-dimensional feature vector. Taking D = c gives the subsequent feature vector x of length D.
According to an embodiment provided by the present specification, compressing the feature vector into a binary code to achieve video quantization includes feeding the feature vector into a progressive feature quantization network and obtaining the binary code from its output. The progressive feature quantization network comprises a plurality of quantization layers. Assuming the feature vector is a vector x of length D, each quantization layer holds a codebook of M D-dimensional codewords, and each codeword in the codebook corresponds to an index. When any quantization layer of the network receives an input vector, it computes the distance between the input vector and each codeword in its codebook, yielding a distance vector D consisting of the M distances; the distance vector is passed through a normalized exponential function (softmax) to obtain a normalized distance vector P; the index of the codeword corresponding to the maximum value in P is taken as the layer's first output; and the difference between the input vector and its approximation (obtained by weighting and summing the codewords of the codebook with P), that is, the layer's quantization error, is taken as the layer's second output. The binary code is obtained by concatenating the first outputs of all quantization layers in the network; the second output of each quantization layer serves as the input vector of the next quantization layer, and the feature vector x serves as the input vector of the first quantization layer.
According to an embodiment created by an embodiment provided in the present specification, the codebook of each quantization layer of the progressive feature quantization network includes 256 codewords, and the first output of each quantization layer is an 8-bit binary code.
According to an embodiment provided by the present specification, the progressive feature quantization network comprises four quantization layers, and concatenating the first outputs of the four quantization layers yields a 32-bit binary code.
The video feature extraction method can effectively obtain video features containing rich context information. On this basis, efficient and accurate quantization of the video can be achieved through the designed progressive feature quantization network, thereby enabling fast video retrieval.
The embodiments provided in the present specification will be further described with reference to the drawings and detailed description. Additional aspects and advantages of the embodiments provided herein will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the embodiments provided herein.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to assist in understanding the embodiments provided herein; together with the description, they serve to explain, and not to limit, the embodiments provided herein. In the drawings:
fig. 1 is an overall framework diagram of an embodiment of a video quantization method provided in this specification.
Fig. 2 is a three-dimensional self-attention module structure diagram of an embodiment of a video quantization method provided in the present specification.
Detailed Description
The embodiments provided in the present specification will be described in detail and fully with reference to the accompanying drawings. Those skilled in the art will be able to implement the embodiments provided in this specification based on these descriptions. Before the embodiments provided in this specification are explained with reference to the drawings, it should be particularly pointed out that:
the technical solutions and the technical features provided in the embodiments provided in the present specification in the following portions including the following description may be combined with each other without conflict.
In addition, the embodiments of the present invention provided in the present specification and referred to in the following description are generally only a part of the embodiments of the present invention and not all of the embodiments, and therefore, all other embodiments obtained by a person of ordinary skill in the art without any creative effort based on the embodiments of the present invention provided in the present specification should fall within the scope of the protection of the embodiments provided in the present specification.
The terms "comprising", "including", "having" and any variations thereof in the description, the claims and the related portions of the embodiments provided herein are intended to cover non-exclusive inclusions. Other terms and units in the embodiments provided by this specification can be reasonably interpreted based on the related contents of this specification.
Fig. 1 is an overall framework diagram of an embodiment of the video quantization method provided in this specification. Fig. 2 is a structure diagram of the three-dimensional self-attention module of an embodiment of the video quantization method provided in this specification. As shown in figs. 1-2, the video quantization method includes two parts: video feature extraction and video quantization. In the video feature extraction part, a video feature extraction module based on a three-dimensional self-attention mechanism is adopted to simultaneously acquire the temporal and spatial information of the video. In the video quantization part, a gradient-descent-based progressive quantization algorithm is adopted to quantize the visual features of the whole video.
First, video feature extraction
The video feature extraction comprises the following steps: extracting original visual features from a target video and constructing an original feature matrix, wherein the original feature matrix contains the spatial information of each sampled image and the time-sequence information between the sampled images; generating a sampled-image spatial attention heat map and a sampled-image time-sequence attention heat map from the original feature matrix; and adding and fusing the original feature matrix, the spatial attention heat map and the time-sequence attention heat map to obtain a target feature matrix.
In one embodiment, a deep convolutional neural network is used to extract the original features $V \in \{0, 1, \dots, 255\}^{N \times T \times H \times W \times C}$ of the entire video library, which contains N videos, each with T frames of height H, width W and C channels. A uniform sampling strategy is applied to each video; in this embodiment, T' = 25 frames are extracted from each video at equal intervals, giving a reduced feature set $F \in \{0, 1, \dots, 255\}^{N \times T' \times H \times W \times C}$. These feature matrices mainly contain two kinds of information: 1) the spatial information of each frame, such as shape, position or even semantic information; 2) the time-sequence information between frames, such as motion information. The two kinds of information are highly correlated. Therefore, this embodiment designs a feature module based on a three-dimensional self-attention mechanism that obtains, for each pixel, attention heat maps along both the time-sequence and spatial dimensions. This process can be interpreted as computing the influence of every other neighboring pixel on the pixel currently considered. For a given pixel, each pass of the three-dimensional self-attention mechanism computes the information dependencies with all other pixels in the same row, in the same column, and at the same spatial position across the time sequence. The embodiment therefore cycles this three-dimensional self-attention mechanism to acquire global information: for one pixel, the relationships with all other pixels are obtained after two three-dimensional self-attention iterations.
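As a minimal illustration of the uniform sampling step described above, the following Python sketch selects T' equally spaced frames from a video tensor (PyTorch is an assumed framework choice; the function name is hypothetical, and only the default T' = 25 comes from this embodiment):

```python
import torch

def sample_frames(video: torch.Tensor, t_prime: int = 25) -> torch.Tensor:
    """Uniformly sample T' frames from a video tensor of shape (T, H, W, C)."""
    t = video.shape[0]
    idx = torch.linspace(0, t - 1, steps=t_prime).long()  # equally spaced frame indices
    return video[idx]                                      # reduced tensor of shape (T', H, W, C)
```

Each sampled frame would then be passed through the convolutional backbone to obtain the h × w × c feature map that forms one slice of the original feature matrix $O_i$.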
In implementing this three-dimensional self-attention module, the embodiment uses three independent attention operations along three directions: row, column and time sequence. Taking the row direction as an example, the original feature matrix $O_i \in \mathbb{R}^{T' \times h \times w \times c}$ of each video is taken as input, where h is the height of each video frame, w is the width, c is the number of channels, and T' is the number of sampled frames. First, the feature matrix is reshaped to $\{T' \times h\} \times w \times c$ and convolved with three convolution kernels of size $c \times 1 \times 1$ (channels × height × width), giving three feature matrices $r_\theta$, $r_\rho$, $r_\gamma$ of identical dimension. Then $r_\rho$ is multiplied by the transpose $r_\gamma^{\mathsf{T}}$, the result is passed through a softmax function, and the result is finally multiplied with $r_\theta$ to obtain the row-dimension attention heat map r. The above operations can be summarized by the formula:

$$r = \mathrm{softmax}\!\left(r_\rho \cdot r_\gamma^{\mathsf{T}}\right) \cdot r_\theta$$
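The row-direction branch just described can be sketched in Python as follows (PyTorch is an assumed framework choice, the class and variable names are hypothetical, and the 1 × 1 convolutions realise the c × 1 × 1 kernels of the text; this is an illustrative sketch, not the patent's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RowAttention(nn.Module):
    """Row-direction branch of the three-dimensional self-attention module (sketch)."""
    def __init__(self, c: int):
        super().__init__()
        # three c x 1 x 1 convolutions producing r_theta, r_rho, r_gamma
        self.theta = nn.Conv2d(c, c, kernel_size=1)
        self.rho = nn.Conv2d(c, c, kernel_size=1)
        self.gamma = nn.Conv2d(c, c, kernel_size=1)

    def forward(self, o: torch.Tensor) -> torch.Tensor:
        t, c, h, w = o.shape                   # one video: (T', c, h, w)
        r_theta, r_rho, r_gamma = self.theta(o), self.rho(o), self.gamma(o)

        def rows(x):
            # reshape to {T'*h} x w x c so that every image row is one attention group
            return x.permute(0, 2, 3, 1).reshape(t * h, w, c)

        r_theta, r_rho, r_gamma = rows(r_theta), rows(r_rho), rows(r_gamma)
        # r = softmax(r_rho . r_gamma^T) . r_theta, attention over the w pixels of each row
        attn = F.softmax(torch.bmm(r_rho, r_gamma.transpose(1, 2)), dim=-1)  # ({T'*h}, w, w)
        r = torch.bmm(attn, r_theta)                                         # ({T'*h}, w, c)
        return r.reshape(t, h, w, c)           # row-dimension attention heat map
```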
Attention in the column direction performs the analogous operations on a feature matrix of dimension $\{T' \times w\} \times h \times c$: the original feature matrix is reshaped to $\{T' \times w\} \times h \times c$; three convolution kernels of size $c \times 1 \times 1$ (channels × height × width) are applied to the reshaped matrix, giving three feature matrices $c_\theta$, $c_\rho$, $c_\gamma$ of dimension $\{T' \times w\} \times h \times c$; and the three feature matrices are combined according to the formula

$$c = \mathrm{softmax}\!\left(c_\rho \cdot c_\gamma^{\mathsf{T}}\right) \cdot c_\theta$$

to obtain the column-dimension attention heat map c, where $c_\gamma^{\mathsf{T}}$ is the transpose of the feature matrix $c_\gamma$.
Attention in the time-sequence direction performs the analogous operations on a feature matrix of dimension $\{w \times h\} \times T' \times c$: the original feature matrix is reshaped to $\{w \times h\} \times T' \times c$; three convolution kernels of size $c \times 1 \times 1$ are applied to the reshaped matrix, giving three feature matrices $t_\theta$, $t_\rho$, $t_\gamma$ of dimension $\{w \times h\} \times T' \times c$; and the three feature matrices are combined according to the formula

$$t = \mathrm{softmax}\!\left(t_\rho \cdot t_\gamma^{\mathsf{T}}\right) \cdot t_\theta$$

to obtain the time-sequence-dimension attention heat map t, where $t_\gamma^{\mathsf{T}}$ is the transpose of the feature matrix $t_\gamma$. The three branches differ only in how the pixels are grouped by the reshaping, as the sketch below illustrates.
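The column and time-sequence branches reuse the same attention computation; only the grouping of pixels produced by the reshaping changes, as the following sketch shows (the 'row'/'col'/'time' labels are illustrative, not the patent's terminology):

```python
import torch

def group_pixels(x: torch.Tensor, mode: str) -> torch.Tensor:
    """Reshape a (T', h, w, c) feature map so that pixels that attend to each other
    share the same group (the second-to-last dimension)."""
    t, h, w, c = x.shape
    if mode == 'row':    # pixels in the same row:               {T'*h} x w x c
        return x.reshape(t * h, w, c)
    if mode == 'col':    # pixels in the same column:            {T'*w} x h x c
        return x.permute(0, 2, 1, 3).reshape(t * w, h, c)
    if mode == 'time':   # same spatial position across frames:  {w*h} x T' x c
        return x.permute(2, 1, 0, 3).reshape(w * h, t, c)
    raise ValueError(f"unknown mode: {mode}")
```

Applying the softmax attention of the formulas above within each group then yields the column-dimension and time-sequence-dimension heat maps.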
Finally, the row-dimension attention heat map r, the column-dimension attention heat map c, the time-sequence-dimension attention heat map t and the original feature matrix are added and fused to obtain the final feature matrix.
After two passes through the three-dimensional self-attention module, the resulting feature matrix fused with three-dimensional attention is globally average-pooled to obtain, for each video, a feature vector x of length D, which serves as the input of the quantization module.
Specifically, as shown in fig. 2, the row-dimension attention heat map r, the column-dimension attention heat map c and the time-sequence-dimension attention heat map t are each reshaped back to $T' \times h \times w \times c$; the reshaped heat maps r, c and t are then added to the original feature matrix $O_i$ to obtain a feature matrix $O'_i$ fused with three-dimensional attention, whose dimension is the same as that of the original feature matrix $O_i$; afterwards, $O'_i$ is fed into the three-dimensional attention module again and the above computation is repeated, yielding a feature matrix $O''_i$ fused with three-dimensional attention twice, whose dimension is again $T' \times h \times w \times c$; finally, global average pooling is performed over the T', h and w dimensions of $O''_i$, giving a final feature matrix of dimension $1 \times 1 \times 1 \times c$, i.e., a c-dimensional feature vector. Taking D = c, this is the feature vector x of length D used subsequently.
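The two-pass fusion and pooling just described can be sketched as follows (a hypothetical helper; attn_block stands for any implementation of the three branches that returns the heat maps already reshaped back to (T', h, w, c)):

```python
def extract_video_feature(o_i, attn_block, num_passes: int = 2):
    """o_i: original feature matrix of one video as a torch tensor of shape (T', h, w, c).
    attn_block(x) is assumed to return the row, column and time-sequence heat maps,
    each reshaped back to (T', h, w, c)."""
    x = o_i
    for _ in range(num_passes):      # two passes give every pixel a global receptive field
        r, c, t = attn_block(x)
        x = x + r + c + t            # additive fusion with the input features
    # global average pooling over the T', h and w dimensions -> c-dimensional vector (D = c)
    return x.mean(dim=(0, 1, 2))
```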
Second, video quantization
Video quantization compresses the feature vector into a binary code. In one embodiment, this is done by feeding the feature vector into a progressive feature quantization network and obtaining the binary code from its output. The progressive feature quantization network comprises a plurality of quantization layers. Assuming the feature vector is a vector x of length D, each quantization layer holds a codebook of M D-dimensional codewords, and each codeword in the codebook corresponds to an index. When any quantization layer of the network receives an input vector, it computes the distance between the input vector and each codeword in its codebook, yielding a distance vector D consisting of the M distances; the distance vector is passed through a normalized exponential function (softmax) to obtain a normalized distance vector P; the index of the codeword corresponding to the maximum value in P is taken as the layer's first output; and the difference between the input vector and its approximation (obtained by weighting and summing the codewords of the codebook with P), that is, the layer's quantization error, is taken as the layer's second output. The binary code is obtained by concatenating the first outputs of all quantization layers in the network; the second output of each quantization layer serves as the input vector of the next quantization layer, and the feature vector x serves as the input vector of the first quantization layer.
In this way, each quantization layer passes its own quantization error on to the next layer as that layer's input, so the error can be further reduced by another round of quantization, and the quantized outputs of the layers progressively approximate the feature vector x. When the required quantization precision is low, only the quantization code of the first layer is used; as the precision requirement increases, the codes of the first and second layers are used, and so on. As the number of quantization layers grows, the quantization precision improves step by step, which is the progressive behaviour of the network.
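A single quantization layer as described above might look like the following sketch (PyTorch is an assumed framework choice and the names are hypothetical; one reading is made explicit in the comments: the softmax is taken over negative distances so that the nearest codeword receives the largest weight in P):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantizationLayer(nn.Module):
    """One layer of a progressive feature quantization network (illustrative sketch)."""
    def __init__(self, m: int = 256, d: int = 64):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(m, d))   # M codewords of dimension D

    def forward(self, x: torch.Tensor):
        # distance vector D: distance from the input vector to every codeword
        dist = torch.cdist(x.unsqueeze(0), self.codebook).squeeze(0)       # (M,)
        p = F.softmax(-dist, dim=0)     # normalized distance vector P (assumed sign convention)
        index = int(torch.argmax(p))    # first output: index of the selected codeword
        approx = p @ self.codebook      # approximation of the input: P-weighted sum of codewords
        residual = x - approx           # second output: the layer's quantization error
        return index, residual
```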
In one embodiment, the codebook of each quantization layer of the progressive feature quantization network contains 256 codewords, so the first output of each quantization layer is an 8-bit binary code. The progressive feature quantization network comprises four quantization layers, and concatenating the first outputs of the four layers yields a 32-bit binary code.
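Stacking four such layers with 256-codeword codebooks reproduces the 32-bit configuration of this embodiment (again a sketch; the wrapper class and its names are hypothetical):

```python
class ProgressiveQuantizer(nn.Module):
    """Four quantization layers, each emitting one 8-bit index, i.e. a 32-bit code in total."""
    def __init__(self, num_layers: int = 4, m: int = 256, d: int = 64):
        super().__init__()
        self.layers = nn.ModuleList(QuantizationLayer(m, d) for _ in range(num_layers))

    def forward(self, x: torch.Tensor) -> bytes:
        indices = []
        for layer in self.layers:
            idx, x = layer(x)           # the residual of one layer feeds the next layer
            indices.append(idx)
        return bytes(indices)           # concatenated indices: 4 x 8 bits = 32 bits
```

Using only the first one or two indices of the returned code corresponds to the lower-precision, fewer-layer settings described above.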
The technical effects of the above embodiment are as follows:
1) This embodiment designs a novel three-dimensional self-attention module that acquires temporal and spatial context information simultaneously. After the original feature matrix passes through the three-dimensional self-attention module once, for each pixel only the attention heat between that pixel and the other pixels in the same row, the other pixels in the same column, and the other pixels at the same spatial position across the time sequence has been computed. Feeding the resulting feature matrix fused with three-dimensional self-attention into the module a second time then yields, for a given pixel, the relationship with all other pixels, i.e., more global information.
2) A quantization algorithm is introduced to the video retrieval task for the first time: a carefully designed gradient-descent-based deep quantization algorithm quantizes the video features into very short binary codes and realizes a progressive quantization method.
3) A large number of experimental results show that the video quantization method based on the three-dimensional self-attention mechanism proposed in this embodiment outperforms the latest video hashing algorithms, especially on the challenging FCVID data set. On FCVID, with 64-bit codes, the mAP@5 of this embodiment reaches 51.1%, which is 6.1 points higher than the same metric of the best existing method (45%).
The contents of the embodiments provided in the present specification are explained above. Those skilled in the art will be able to implement the embodiments provided in this specification based on these descriptions. Based on the above description of the embodiments provided in the present specification, all other embodiments obtained by those skilled in the art without any creative effort should fall within the protection scope of the embodiments provided in the present specification.

Claims (8)

1. A video feature extraction method, characterized by comprising the following steps:
extracting original visual features from a target video and constructing an original feature matrix, wherein the original feature matrix comprises the spatial information of each sampled image and the time-sequence information between the sampled images;
generating a sampled-image spatial attention heat map and a sampled-image time-sequence attention heat map according to the original feature matrix; and
adding and fusing the original feature matrix, the sampled-image spatial attention heat map and the sampled-image time-sequence attention heat map to obtain a target feature matrix.
2. The video feature extraction method of claim 1, wherein:
A) generating the sampled-image spatial attention heat map according to the original feature matrix comprises the following steps:
generating, according to the original feature matrix, a row-dimension attention heat map representing the information dependency between each pixel in each sampled image and all other pixels in the same row as that pixel; and
generating, according to the original feature matrix, a column-dimension attention heat map representing the information dependency between each pixel in each sampled image and all other pixels in the same column as that pixel;
and/or,
B) generating the sampled-image time-sequence attention heat map according to the original feature matrix comprises the following steps:
generating, according to the original feature matrix, a time-sequence-dimension attention heat map representing the information dependency between each pixel in each sampled image and all other pixels at the same spatial position across the time sequence.
3. The video feature extraction method of claim 2, wherein:
given the original feature matrix of the target video $O_i \in \mathbb{R}^{T' \times h \times w \times c}$, where h is the height of each video frame, w is the width of each video frame, c is the number of channels of each video frame, and T' is the number of sampled frames, then
A) generating, according to the original feature matrix, the row-dimension attention heat map representing the information dependency between each pixel in each sampled image and all other pixels in the same row as that pixel comprises the following steps:
reshaping the original feature matrix to $\{T' \times h\} \times w \times c$; performing convolution on the reshaped matrix with three convolution kernels of size $c \times 1 \times 1$ (channels × height × width) to obtain three feature matrices $r_\theta$, $r_\rho$, $r_\gamma$ of dimension $\{T' \times h\} \times w \times c$; and combining the three feature matrices according to the formula
$$r = \mathrm{softmax}\!\left(r_\rho \cdot r_\gamma^{\mathsf{T}}\right) \cdot r_\theta$$
to obtain the row-dimension attention heat map r, where $r_\gamma^{\mathsf{T}}$ is the transpose of the feature matrix $r_\gamma$;
and/or,
B) generating, according to the original feature matrix, the column-dimension attention heat map representing the information dependency between each pixel in each sampled image and all other pixels in the same column as that pixel comprises the following steps:
reshaping the original feature matrix to $\{T' \times w\} \times h \times c$; performing convolution on the reshaped matrix with three convolution kernels of size $c \times 1 \times 1$ (channels × height × width) to obtain three feature matrices $c_\theta$, $c_\rho$, $c_\gamma$ of dimension $\{T' \times w\} \times h \times c$; and combining the three feature matrices according to the formula
$$c = \mathrm{softmax}\!\left(c_\rho \cdot c_\gamma^{\mathsf{T}}\right) \cdot c_\theta$$
to obtain the column-dimension attention heat map c, where $c_\gamma^{\mathsf{T}}$ is the transpose of the feature matrix $c_\gamma$;
and/or,
C) generating, according to the original feature matrix, the time-sequence-dimension attention heat map representing the information dependency between each pixel in each sampled image and all other pixels at the same spatial position across the time sequence comprises the following steps:
reshaping the original feature matrix to $\{w \times h\} \times T' \times c$; performing convolution on the reshaped matrix with three convolution kernels of size $c \times 1 \times 1$ to obtain three feature matrices $t_\theta$, $t_\rho$, $t_\gamma$ of dimension $\{w \times h\} \times T' \times c$; and combining the three feature matrices according to the formula
$$t = \mathrm{softmax}\!\left(t_\rho \cdot t_\gamma^{\mathsf{T}}\right) \cdot t_\theta$$
to obtain the time-sequence-dimension attention heat map t, where $t_\gamma^{\mathsf{T}}$ is the transpose of the feature matrix $t_\gamma$.
4. A video quantization method, comprising:
obtaining a target feature matrix by the video feature extraction method according to claim 1, 2 or 3;
converting the target feature matrix into a feature vector representing the target video; and
compressing the feature vector into a binary code to achieve video quantization.
5. The video quantization method of claim 4, wherein converting the target feature matrix into the feature vector representing the target video comprises:
reshaping the row-dimension attention heat map r, the column-dimension attention heat map c and the time-sequence-dimension attention heat map t back to $T' \times h \times w \times c$;
then adding the reshaped heat maps r, c and t to the original feature matrix $O_i$ to obtain a feature matrix $O'_i$ fused with three-dimensional attention, whose dimension is the same as that of the original feature matrix $O_i$;
thereafter feeding $O'_i$ into the three-dimensional attention module again and repeating the computation to obtain a feature matrix $O''_i$ fused with three-dimensional attention twice, whose dimension is again $T' \times h \times w \times c$;
finally performing global average pooling over the T', h and w dimensions of $O''_i$ to obtain a final feature matrix of dimension $1 \times 1 \times 1 \times c$, i.e., a c-dimensional feature vector; and taking D = c to obtain the subsequent feature vector x of length D.
6. The video quantization method of claim 4, wherein compressing the feature vector into a binary code to achieve video quantization comprises feeding the feature vector into a progressive feature quantization network and obtaining the binary code from the output of the progressive feature quantization network, wherein
the progressive feature quantization network comprises a plurality of quantization layers; assuming the feature vector is a vector x of length D, each quantization layer holds a codebook of M D-dimensional codewords, and each codeword in the codebook corresponds to an index;
after any quantization layer in the progressive feature quantization network receives an input vector, the quantization layer computes the distance between the input vector and each codeword in its codebook to obtain a distance vector D consisting of the M distances; the distance vector D is passed through a normalized exponential function to obtain a normalized distance vector P; the index of the codeword corresponding to the maximum value in P is extracted as a first output; and the difference between the input vector and an approximation of the input vector obtained by weighting and summing the codewords of the codebook with P, i.e., the quantization error of the quantization layer, is taken as a second output;
the binary code is obtained by concatenating the first outputs of the quantization layers in the progressive feature quantization network, the second output of each quantization layer serves as the input vector of the next quantization layer, and the feature vector x serves as the input vector of the first quantization layer in the progressive feature quantization network.
7. The video quantization method of claim 6, wherein: the codebook of each quantization layer of the progressive feature quantization network contains 256 codewords, and the first output of each quantization layer is an 8-bit binary code.
8. The video quantization method of claim 7, wherein: the progressive feature quantization network comprises four quantization layers, and the binary code, obtained by concatenating the first outputs of the four quantization layers, is a 32-bit binary code.
CN202010771697.8A 2020-08-04 2020-08-04 Video feature extraction method and video quantization method applying same Pending CN111897995A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010771697.8A CN111897995A (en) 2020-08-04 2020-08-04 Video feature extraction method and video quantization method applying same

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010771697.8A CN111897995A (en) 2020-08-04 2020-08-04 Video feature extraction method and video quantization method applying same

Publications (1)

Publication Number Publication Date
CN111897995A true CN111897995A (en) 2020-11-06

Family

ID=73184158

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010771697.8A Pending CN111897995A (en) 2020-08-04 2020-08-04 Video feature extraction method and video quantization method applying same

Country Status (1)

Country Link
CN (1) CN111897995A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112883227A (en) * 2021-01-07 2021-06-01 北京邮电大学 Video abstract generation method and device based on multi-scale time sequence characteristics


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229338A (en) * 2017-12-14 2018-06-29 华南理工大学 A kind of video behavior recognition methods based on depth convolution feature
CN110110601A (en) * 2019-04-04 2019-08-09 深圳久凌软件技术有限公司 Video pedestrian weight recognizer and device based on multi-space attention model
CN111241996A (en) * 2020-01-09 2020-06-05 桂林电子科技大学 Method for identifying human motion in video

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jingkuan Song, Ruimin Lang, Xiaosu Zhu: "3D Self-Attention for Unsupervised Video Quantization", SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112883227A (en) * 2021-01-07 2021-06-01 北京邮电大学 Video abstract generation method and device based on multi-scale time sequence characteristics
CN112883227B (en) * 2021-01-07 2022-08-09 北京邮电大学 Video abstract generation method and device based on multi-scale time sequence characteristics

Similar Documents

Publication Publication Date Title
Song et al. Monocular depth estimation using laplacian pyramid-based depth residuals
Kossaifi et al. Factorized higher-order cnns with an application to spatio-temporal emotion estimation
US10834415B2 (en) Devices for compression/decompression, system, chip, and electronic device
KR20220050758A (en) Multi-directional scene text recognition method and system based on multidimensional attention mechanism
CN109711422B (en) Image data processing method, image data processing device, image data model building method, image data model building device, computer equipment and storage medium
CN113673594B (en) Defect point identification method based on deep learning network
US9330332B2 (en) Fast computation of kernel descriptors
CN107463932B (en) Method for extracting picture features by using binary bottleneck neural network
Islam et al. Image compression with recurrent neural network and generalized divisive normalization
CN108631787B (en) Data encoding method, data encoding device, computer equipment and storage medium
CN114048818A (en) Video classification method based on accelerated transform model
CN112288690B (en) Satellite image dense matching method integrating multi-scale multi-level features
CN116469100A (en) Dual-band image semantic segmentation method based on Transformer
CN116978011B (en) Image semantic communication method and system for intelligent target recognition
CN114677412A (en) Method, device and equipment for estimating optical flow
CN106997381B (en) Method and device for recommending movies to target user
CN113240683A (en) Attention mechanism-based lightweight semantic segmentation model construction method
CN116204694A (en) Multi-mode retrieval method based on deep learning and hash algorithm
CN110083734B (en) Semi-supervised image retrieval method based on self-coding network and robust kernel hash
CN111897995A (en) Video feature extraction method and video quantization method applying same
CN114818889A (en) Image classification method based on linear self-attention transducer
CN114897711A (en) Method, device and equipment for processing images in video and storage medium
CN111882028B (en) Convolution operation device for convolution neural network
CN112528077B (en) Video face retrieval method and system based on video embedding
WO2023051335A1 (en) Data encoding method, data decoding method, and data processing apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201106