CN110069666B - Hash learning method and device based on neighbor structure keeping - Google Patents
- Publication number
- CN110069666B (application CN201910264740.9A)
- Authority
- CN
- China
- Prior art keywords
- video
- training
- training video
- neighbor
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/75—Clustering; Classification
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
Abstract
The invention provides a hash learning method and device based on neighbor structure preservation. The method comprises the following steps: acquiring a video training set and extracting M frame-level features of each training video; extracting the time domain appearance feature of each training video and clustering the time domain appearance features to obtain an anchor feature set; acquiring the time domain appearance neighbor features corresponding to each training video from the anchor feature set; encoding each training video into a corresponding depth expression with an encoding network according to the time domain appearance neighbor features; converting the depth expression corresponding to each training video into a binary code; reconstructing the M reconstructed frame-level features corresponding to each training video according to the binary code; generating a reconstruction error function and a neighbor similarity error function; and training the network to minimize the reconstruction error function and the neighbor similarity error function. The method can ensure the perfect preservation of the neighbor structure in Hamming space and improve retrieval precision on large-scale unsupervised video databases.
Description
Technical Field
The invention relates to the technical field of video processing, and in particular to a hash learning method and device based on neighbor structure preservation.
Background
Large-scale video retrieval aims to retrieve, from a huge database, videos similar to a given query video. A video is generally represented by a series of sampled frames, each of which can be represented by a feature. During retrieval, relevant videos can be determined from the feature sets corresponding to the videos.
In the presence of high-dimensional features and massive data, hashing methods have achieved great success in large-scale visual retrieval tasks. Video hashing encodes a video into a compact binary code that preserves the similarity structure of the video space, and stores the video in Hamming space. Learning-based video hashing methods exploit the characteristics of the data and achieve better performance than manually designed hashing methods. Because it avoids the burden of manual labeling, unsupervised hashing is more practical than supervised hashing for large-scale video retrieval tasks.
At present, most unsupervised hashing methods focus on exploiting the representation and temporal information of the video but neglect the neighbor structure. As a result, the encoding network absorbs the content of the input video indiscriminately, regardless of whether that content is similar to the neighbors' content, which is unfavorable for preserving neighbor similarity. Consequently, retrieval precision cannot be guaranteed when performing video retrieval on a large-scale unsupervised video database.
Disclosure of Invention
The invention provides a hash learning method and device based on neighbor structure preservation, which ensure the perfect preservation of the neighbor structure in Hamming space and improve retrieval precision on large-scale unsupervised video databases. They address the technical problem that, in the prior art, unsupervised hashing focuses on exploiting the representation and temporal information of videos but neglects the neighbor structure, so the precision of video retrieval cannot be guaranteed.
An embodiment of one aspect of the present invention provides a hash learning method based on neighbor structure preservation, including:
s1, acquiring a video training set, and extracting M frame-level features of each training video in the video training set;
s2, extracting the time domain appearance feature of each training video with an autoencoder, and clustering the time domain appearance features to obtain an anchor feature set;
s3, acquiring time domain appearance neighbor characteristics corresponding to each training video from the anchor point characteristic set aiming at each training video;
s4, coding each training video into a corresponding depth expression according to the time domain appearance neighbor characteristic by adopting a coding network;
s5, converting the depth expression corresponding to each training video into a binary code through a full link layer using an activation function;
s6, reconstructing M reconstructed frame-level features corresponding to each training video according to the binary code by adopting a decoding network;
s7, generating a reconstruction error function according to the frame level characteristics and the reconstruction frame level characteristics corresponding to each training video, and generating a neighbor similarity error function according to the time domain appearance characteristics and the binary codes;
s8, training the network to minimize the reconstruction error function and the neighbor similarity error function; wherein the network comprises the encoding network, the full link layer, and the decoding network.
In the hash learning method based on neighbor structure preservation of the embodiment of the invention, a video training set is acquired and M frame-level features are extracted for each training video in the set. An autoencoder then extracts the time domain appearance feature of each training video, and these features are clustered to obtain an anchor feature set. For each training video, the corresponding time domain appearance neighbor features are acquired from the anchor feature set, and an encoding network encodes each training video into a corresponding depth expression according to those neighbor features. A full link layer using an activation function converts the depth expression into a binary code, from which a decoding network reconstructs the M reconstructed frame-level features of the training video. A reconstruction error function is generated from the frame-level features and reconstructed frame-level features of each training video, and a neighbor similarity error function is generated from the time domain appearance features and the binary codes. Finally, the network, which comprises the encoding network, the full link layer, and the decoding network, is trained to minimize both the reconstruction error function and the neighbor similarity error function. Because the neighbors of a video are embedded into the encoding network, content similar to the neighbors receives more attention while the frame-level features of the video are encoded, which can further improve retrieval precision on large-scale unsupervised video databases.
In addition, by minimizing the reconstruction error and the neighbor similarity error, the perfect preservation of the neighbor structure in the Hamming space can be ensured, and the retrieval precision on the video database is further improved.
Another embodiment of the present invention provides a hash learning apparatus based on neighbor structure preservation, including:
the acquisition module is used for acquiring a video training set and extracting M frame-level features of each training video in the video training set;
the extraction module is used for extracting the time domain appearance feature of each training video with an autoencoder, and clustering the time domain appearance features to obtain an anchor feature set;
the acquisition module is further configured to acquire, for each training video, a time-domain appearance neighbor feature corresponding to each training video from the anchor point feature set;
the coding module is used for coding each training video into a corresponding depth expression according to the time domain appearance neighbor characteristic by adopting a coding network;
the conversion module is used for converting the depth expression corresponding to each training video into a binary code through a full link layer using an activation function;
the reconstruction module is used for reconstructing M reconstruction frame level characteristics corresponding to each training video according to the binary code by adopting a decoding network;
the generating module is used for generating a reconstruction error function according to the frame level characteristics and the reconstruction frame level characteristics corresponding to each training video and generating a neighbor similarity error function according to the time domain appearance characteristics and the binary codes;
a training module, used for training the network to minimize the reconstruction error function and the neighbor similarity error function; wherein the network comprises the encoding network, the full link layer, and the decoding network.
The hash learning device based on neighbor structure preservation of the embodiment of the invention acquires a video training set and extracts M frame-level features for each training video in the set. An autoencoder extracts the time domain appearance feature of each training video, and these features are clustered to obtain an anchor feature set. For each training video, the corresponding time domain appearance neighbor features are acquired from the anchor feature set, and an encoding network encodes each training video into a corresponding depth expression according to those neighbor features. A full link layer using an activation function converts the depth expression into a binary code, from which a decoding network reconstructs the M reconstructed frame-level features of the training video. A reconstruction error function is generated from the frame-level features and reconstructed frame-level features of each training video, a neighbor similarity error function is generated from the time domain appearance features and the binary codes, and the network, which comprises the encoding network, the full link layer, and the decoding network, is trained to minimize both error functions. Because the neighbors of a video are embedded into the encoding network, content similar to the neighbors receives more attention while the frame-level features of the video are encoded, which can further improve retrieval precision on large-scale unsupervised video databases.
In addition, by minimizing the reconstruction error and the neighbor similarity error, the perfect preservation of the neighbor structure in the Hamming space can be ensured, and the retrieval precision on the video database is further improved.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart of a hash learning method based on neighbor structure preservation according to an embodiment of the present invention;
FIG. 2 is a first diagram illustrating a hash learning process according to an embodiment of the present invention;
FIG. 3 is a second diagram illustrating a hash learning process according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a hash learning apparatus based on neighbor structure preservation according to a second embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
Currently, some video hashing methods integrate the hash function into a deep neural network: features of video frames are extracted by a deep convolutional network and further encoded into a binary code through a temporal pooling operation or a deep recurrent network. Compared with supervised hashing, unsupervised hashing is more practical for large-scale video retrieval tasks because it avoids the burden of manual labeling.
However, most unsupervised hashing methods aim to exploit the representation and temporal information of the video but neglect the neighbor structure. Although some hashing methods design neighbor similarity cost functions to train the network, the neighbor structure is only used to guide the generation of the binary code and is not exploited during video feature encoding. In this way, the designed encoding network absorbs the content of the input video indiscriminately, regardless of whether that content is similar to the neighbors' content, which is unfavorable for preserving neighbor similarity; retrieval precision therefore cannot be guaranteed when performing video retrieval on a large-scale unsupervised video database.
Therefore, the invention provides a hash learning method based on neighbor structure preservation, mainly aimed at the technical problem that, in the prior art, unsupervised hashing focuses on the representation and temporal information of a video but neglects the neighbor structure, so the precision of video retrieval cannot be guaranteed.
According to the hash learning method based on neighbor structure preservation of the embodiment of the invention, the neighbors of a video are embedded into the encoding network, so that content similar to the neighbors receives more attention while the video's frame-level features are encoded, which can further improve retrieval precision on large-scale unsupervised video databases. In addition, when training the network, minimizing the reconstruction error and the neighbor similarity error ensures the perfect preservation of the neighbor structure in Hamming space, further improving retrieval precision on the video database.
The following describes a hash learning method and apparatus based on neighbor structure keeping according to an embodiment of the present invention with reference to the drawings.
Fig. 1 is a schematic flowchart of a hash learning method based on neighbor structure preservation according to an embodiment of the present invention.
In the embodiment of the invention, the hash learning method based on neighbor structure preservation is described as configured in a hash learning apparatus based on neighbor structure preservation. This apparatus can be applied to any computer device, so that the computer device can perform the hash learning function based on neighbor structure preservation.
The computer device may be a Personal Computer (PC), a cloud device, a mobile device, and the like, and the mobile device may be a hardware device having various operating systems, touch screens, and/or display screens, such as a mobile phone, a tablet computer, a personal digital assistant, a wearable device, and an in-vehicle device.
As shown in fig. 1, the neighbor structure preserving-based hash learning method may include the following steps:
s1, acquiring a video training set, and extracting M frame-level features of each training video aiming at each training video in the video training set.
In the embodiment of the present invention, the video training set includes N training videos. The N training videos may be videos stored locally on the computer device or videos downloaded online, which is not limited herein. The sizes of N and M are both preset.
In the embodiment of the invention, for each training video in the video training set, M frames can be uniformly sampled, and the M frame-level features of dimension l corresponding to each training video are extracted by a deep convolutional network, so that each training video can be converted into a frame-level feature set.
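As a rough illustration of step S1, uniform frame sampling and frame-level feature extraction can be sketched as follows. The function names are assumptions for illustration only, and `feature_fn` is a hypothetical stand-in for the deep convolutional network of the patent:

```python
import numpy as np

def uniform_sample_frames(num_frames, M):
    """Pick M frame indices spread evenly over a video of num_frames frames."""
    return np.linspace(0, num_frames - 1, M).astype(int)

def extract_frame_features(video_frames, M, feature_fn):
    """Sample M frames and map each through a frame-level feature extractor.

    feature_fn stands in for the deep convolutional network; it maps one
    frame to an l-dimensional feature vector.
    """
    idx = uniform_sample_frames(len(video_frames), M)
    return np.stack([feature_fn(video_frames[i]) for i in idx])  # shape (M, l)
```

Each training video is thereby converted into an (M, l) feature set, one row per sampled frame.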
And S2, extracting the time domain appearance feature of each training video with an autoencoder, and clustering the time domain appearance features to obtain an anchor feature set.
In the embodiment of the invention, for each training video, a d-dimensional time domain appearance feature can be obtained through an autoencoder. In principle, for each training video, the distances to the other videos in the video library could be calculated and ranked to determine the a time domain appearance neighbor features corresponding to that video; for example, the distance between videos can be calculated with the two-norm. However, this calculation would also be needed at the testing stage, and a neighbor search over the whole video training set consumes too much time to be practical. Therefore, in the invention, K-means clustering can be performed on the time domain appearance features of the training videos to obtain n cluster centers. For each cluster center, the time domain appearance feature closest to it (i.e., at the smallest distance) is selected, yielding n time domain appearance features. These n features are then used as anchors and collected into the anchor feature set.
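A minimal sketch of the anchor selection in step S2, under the assumption of a plain k-means loop (the patent does not specify the clustering implementation; `select_anchors` and its parameters are illustrative names):

```python
import numpy as np

def select_anchors(temporal_features, n, iters=20, seed=0):
    """Run plain k-means over the time domain appearance features, then
    return, for each cluster center, the actual feature closest to it
    as an anchor (the anchors are real features, not the centers)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(temporal_features, dtype=float)
    centers = X[rng.choice(len(X), size=n, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(n):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    # anchor = real feature nearest to each converged center
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return X[d.argmin(axis=0)]  # shape (n, d)
```

The returned (n, d) array is the anchor feature set used in the later neighbor lookup.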
And S3, for each training video, acquiring the corresponding time domain appearance neighbor features from the anchor feature set.
In the embodiment of the invention, for each training video, the a time domain appearance neighbor features corresponding to that video can be obtained from the anchor feature set. Because a is far smaller than N, obtaining the a time domain appearance neighbor features consumes very little time, which can greatly improve the efficiency of video retrieval.
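The anchor lookup of step S3 can be sketched as a simple nearest-neighbor query with the two-norm, matching the distance measure mentioned above (the function name is an assumption):

```python
import numpy as np

def nearest_anchors(video_feature, anchors, a):
    """Return the a anchors closest to the video's time domain appearance
    feature, measured with the two-norm."""
    anchors = np.asarray(anchors, dtype=float)
    d = np.linalg.norm(anchors - np.asarray(video_feature, dtype=float), axis=1)
    order = np.argsort(d)[:a]       # indices of the a nearest anchors
    return anchors[order]
```

Because the search runs over the n anchors rather than all N training videos, the cost per query is small.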
And S4, coding each training video into a corresponding depth expression according to the time domain appearance neighbor characteristics by adopting a coding network.
In the embodiment of the invention, the encoding network learns the correspondence between the time domain appearance neighbor features and the depth expressions of the videos. After the time domain appearance neighbor features corresponding to each training video are determined, they can be input to the encoding network to obtain the depth expression corresponding to each training video.
As a possible implementation, in the neighbor attention learning mechanism, the neighbor structure expression n_i needs to be obtained. Specifically, for each training video, the a time domain appearance neighbor features corresponding to the training video may be merged column-wise to obtain a first vector, which is mapped to a b-dimensional neighbor structure expression n_i as follows:
where FC denotes a full link layer map.
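A hedged sketch of this mapping: the a neighbor features are concatenated into the first vector and passed through one affine full link layer. `W` and `bias` are illustrative learned parameters (shapes (b, a*d) and (b,)), not values given by the patent:

```python
import numpy as np

def neighbor_structure_expression(neighbor_feats, W, bias):
    """Column-wise merge the a time domain appearance neighbor features into
    one vector, then map it through a full link (affine) layer to the
    b-dimensional neighbor structure expression n_i."""
    first_vector = np.concatenate(neighbor_feats)   # shape (a*d,)
    return W @ first_vector + bias                  # n_i, shape (b,)
```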
For each training video, at the first time step, the first frame-level feature of the training video is input into the encoding network, and the neighbor structure expression n_i is embedded into the b-dimensional memory state as follows:
where d is a fixed value, W_q, W_k, and W_v are parameter values of the encoding network, [·;·] denotes column-wise merging, v_{i,1} denotes the frame-level feature input at the first time step of the training video, and m_{i,1} denotes the memory state corresponding to the first time step.
Through formula (2), the information of the neighbor structure is present in the memory state at every time step. When 1 < t ≤ M, as each new video frame-level feature is input to the encoding network, the memory state can be updated as follows:
where v_{i,t} denotes the video frame-level feature input at the t-th time step, m_{i,t} denotes the memory state corresponding to the t-th time step, and m_{i,t-1} denotes the memory state corresponding to the (t-1)-th time step.
Through formulas (2) and (3), at each time step the memory state selects the useful information in the input features to write into the new memory state, according to the neighbor structure information it contains. The neighbor attention learning mechanism is embedded into the encoding network, and the operation units are obtained as follows:
where MLP denotes a multi-layer mapping, BN denotes batch normalization, W_iv, W_ih, W_fv, W_fh, W_ov, and W_oh denote parameter values of the encoding network, ⊙ denotes the inner product, the σ function is calculated as σ(x) = 1/(1 + e^(-x)), and the tanh function is calculated as tanh(x) = (e^x - e^(-x))/(e^x + e^(-x)). The hidden-layer output h_{i,M} obtained at the last time step is the depth expression of the training video. Specifically, for each training video, the depth expression of the training video is:
where v_{i,1}, …, v_{i,M} denote the frame-level features corresponding to the training video, and θ denotes the parameters of the encoding network.
In the embodiment of the invention, the neighbors of the video are embedded into the coding network, so that in the process of coding the frame-level features of the video, the content similar to the neighbors in the video is paid more attention, and the retrieval precision on a large-scale unsupervised video database can be further improved.
And S5, converting the depth expression corresponding to each training video into a binary code through the full link layer using an activation function.
In the embodiment of the invention, for each training video, the depth expression corresponding to the training video is converted into a binary code through the full link layer using the activation function, as follows:
bi=sgn(ti);(6)
where t_i = FC(h_{i,M}) is the relaxed code produced by the full link layer mapping FC, and sgn denotes the sign function: when t_i > 0, sgn(t_i) is 1, and when t_i ≤ 0, sgn(t_i) is -1; k denotes the length of the binary code.
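The binarization of formulas (6) can be sketched directly; `W` and `bias` stand in for the learned full link layer parameters (illustrative names, not from the patent):

```python
import numpy as np

def binarize(h_last, W, bias):
    """Map the final hidden state h_{i,M} through a full link layer to the
    relaxed code t_i, then take sgn to obtain the k-bit binary code b_i.
    Following the patent's convention, sgn(t) is 1 when t > 0 and -1
    otherwise."""
    t = W @ h_last + bias            # relaxed code t_i, length k
    b = np.where(t > 0, 1, -1)       # binary code b_i in {-1, +1}^k
    return t, b
```

Keeping t_i alongside b_i matters later: the relaxed code is what the similarity loss is trained through.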
And S6, reconstructing M reconstructed frame-level features corresponding to each training video according to the binary code by adopting a decoding network.
In the embodiment of the present invention, a Long Short-Term Memory (LSTM) network may be used as the decoding network. Specifically, the binary code corresponding to each training video may first be mapped to an l-dimensional vector. At the first time step, this vector is input to the decoding network to obtain the first reconstructed video frame-level feature, referred to in the embodiments of the invention as a reconstructed frame-level feature. At the second time step, the first reconstructed frame-level feature is input to the decoding network to obtain the second reconstructed frame-level feature; the second is then input to obtain the third, and so on, until the decoding network outputs the M-th reconstructed frame-level feature, at which point decoding is complete.
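The autoregressive decoding loop above can be sketched generically; `step_fn` is a hypothetical stand-in for one step of the LSTM decoding network (the recurrent state is folded into it here for brevity):

```python
import numpy as np

def decode_frames(code_vec, M, step_fn):
    """Decoding loop of step S6: the mapped binary code is fed in at the
    first step, and each reconstructed frame-level feature is fed back as
    the next step's input."""
    outputs = []
    x = np.asarray(code_vec, dtype=float)
    for _ in range(M):
        x = step_fn(x)              # next reconstructed frame-level feature
        outputs.append(x)
    return np.stack(outputs)        # shape (M, l)
```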
As an example, referring to fig. 2, fig. 2 is a schematic diagram of a hash learning process in an embodiment of the present invention. In the figure, after the M frame-level features v_1, v_2, …, v_M corresponding to the training video are obtained, the corresponding depth expression can be output through the encoding network; after the corresponding binary code is obtained through the full link layer using the activation function, the corresponding M reconstructed frame-level features can be output through the decoding network.
And S7, generating a reconstruction error function according to the frame-level features and reconstructed frame-level features corresponding to each training video, and generating a neighbor similarity error function according to the time domain appearance features and the binary codes.
In the embodiment of the invention, two loss functions are designed to train the network: the reconstruction error function L_r and the neighbor similarity error function L_s.
The reconstruction error function L_r represents the difference between the frame-level features of the input training video and the decoded reconstructed frame-level features; the mean square error may be used to represent the reconstruction error function L_r:
where v_{i,m} denotes the m-th frame-level feature of the i-th training video, and v̂_{i,m} denotes the m-th reconstructed frame-level feature of the i-th training video.
In the embodiment of the invention, the neighbor similarity error function represents the difference between the similarity structures in the original video space and in Hamming space; the neighbor similarity error function L_s is given by formula (8):
where s_ij denotes the similarity between the time domain appearance features of the i-th and j-th training videos, and the other similarity term in formula (8) denotes the similarity between the binary code b_i corresponding to the i-th video and the binary code b_j corresponding to the j-th video.
To calculate s_ij in formula (8), the approximate similarity matrix A may be established as follows. First, the corresponding frame-level features x_i of each training video and its a time domain appearance neighbor features can be obtained, and a truncated similarity matrix Y is defined; each element Y_ij of Y can be expressed by formula (9):
where ⟨i⟩ denotes the positions of the a time domain appearance neighbor features within the anchor feature set, Dist denotes a distance function (the two-norm may be used), and t denotes a bandwidth parameter.
The approximate similarity matrix a may be calculated according to equation (10):
A = YΛ⁻¹Yᵀ; (10)
A calculated according to formula (10) is a sparse non-negative matrix in which each row and each column sums to 1. When A_ij > 0, s_ij can be set to 1; when A_ij ≤ 0, s_ij can be set to 0.
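A sketch of this anchor-graph-style construction. The Gaussian kernel and row normalization of Y are assumptions standing in for the patent's formula (9), which is not reproduced in the text; Λ is taken as the diagonal matrix of Y's column sums so that A = YΛ⁻¹Yᵀ has the row-sum-1 property described above:

```python
import numpy as np

def approximate_similarity(features, anchors, a, bandwidth=1.0):
    """Build the approximate similarity matrix A = Y Λ^{-1} Y^T and the
    binarized similarities s_ij = 1 iff A_ij > 0.

    Row i of the truncated matrix Y is nonzero only at the a anchors
    nearest to video i, weighted by a Gaussian kernel of the two-norm
    distance (an assumed concrete form) and normalized to sum to 1.
    """
    X = np.asarray(features, dtype=float)
    C = np.asarray(anchors, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    Y = np.zeros_like(dist)
    for i in range(len(X)):
        nn = np.argsort(dist[i])[:a]                  # a nearest anchors
        w = np.exp(-dist[i, nn] ** 2 / bandwidth)
        Y[i, nn] = w / w.sum()                        # rows of Y sum to 1
    Lam = np.diag(Y.sum(axis=0))                      # column sums of Y
    A = Y @ np.linalg.inv(Lam) @ Y.T
    s = (A > 0).astype(int)
    return A, s
```

With rows of Y summing to 1, A·1 = YΛ⁻¹Yᵀ1 = Y·1 = 1, which matches the row/column-sum property stated in the text.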
In formula (8), the similarity between the binary codes b_i and b_j is defined directly through the codes. To avoid oscillation during network training, this similarity can be approximated using the relaxed codes, where t_i is the relaxed representation of the binary code b_i.
To reduce the approximation error between the relaxed similarity and the binary similarity, an auxiliary loss term relating t_i and b_i can be introduced, and formula (8) can be converted into:
s8, training the network to minimize the reconstruction error function and minimize the neighbor similarity error function; the network comprises an encoding network, a full link layer and a decoding network.
In the embodiment of the invention, the network can be divided into three parts: the first part is an encoding network with a neighbor attention learning mechanism, through which the depth expression of the training video can be learned; the second part is a full link layer with a nonlinear activation function, used to convert the depth expression into a k-dimensional binary code; the third part is a decoding network, which decodes the reconstructed frame-level features of each frame of the training video from the binary code obtained by encoding.
In the embodiment of the invention, the network can be trained according to the reconstruction error function and the neighbor similarity error function: minimizing the reconstruction error function makes better use of the information contained in the input training video, while minimizing the neighbor similarity error function maximally preserves the neighbor similarity. When training the network, the training loss function is a weighted combination of the reconstruction error function and the neighbor similarity error function:
L=αLs+(1-α)Lr;(12)
where α represents a hyperparameter balancing the reconstruction error function and the neighbor similarity error function.
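The weighting of formula (12) is a plain convex combination, sketched below (the function name is illustrative):

```python
def total_loss(L_s, L_r, alpha):
    """Training loss of formula (12): a combination of the neighbor
    similarity error L_s and the reconstruction error L_r, balanced by the
    hyperparameter alpha."""
    return alpha * L_s + (1 - alpha) * L_r
```

Setting alpha near 1 emphasizes neighbor-structure preservation; setting it near 0 emphasizes reconstruction fidelity.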
In the embodiment of the invention, when the network is trained end to end, the network parameters can be optimized by back-propagating gradients.
As an example, referring to fig. 3, fig. 3 is a schematic diagram of a hash learning process according to an embodiment of the present invention. When the network is trained, upon input of a training video, the time domain appearance neighbor features can be embedded into the hash coding network for hash learning; a hash code is generated through the encoding network, and the perfect preservation of the neighbor structure in Hamming space is ensured by minimizing the reconstruction error and the neighbor similarity error.
With the hash learning method based on neighbor structure preservation of the embodiment of the invention, a video training set is acquired, and M frame-level features are extracted for each training video in the video training set. An automatic encoder extracts the time domain appearance feature of each training video, and the time domain appearance features are clustered to obtain an anchor point feature set. For each training video, the time domain appearance neighbor features corresponding to that training video are acquired from the anchor point feature set, and a coding network codes each training video into a corresponding depth expression according to the time domain appearance neighbor features. A full link layer using an activation function then converts the depth expression corresponding to each training video into a column of binary codes, and a decoding network reconstructs the M reconstructed frame-level features corresponding to each training video from the binary codes. A reconstruction error function is generated according to the frame-level features and the reconstructed frame-level features corresponding to each training video, a neighbor similarity error function is generated according to the time domain appearance features and the binary codes, and finally the network is trained to minimize the reconstruction error function and the neighbor similarity error function; the network comprises the encoding network, the full link layer and the decoding network. In the invention, the neighbors of a video are embedded into the coding network, so that while the frame-level features of the video are encoded, more attention is paid to the content of the video that is similar to its neighbors, which further improves the retrieval precision on large-scale unsupervised video databases.
In addition, minimizing the reconstruction error and the neighbor similarity error ensures that the neighbor structure is preserved in the Hamming space, further improving the retrieval precision on the video database.
In order to implement the above embodiment, the present invention further provides a hash learning apparatus based on neighbor structure preservation.
Fig. 4 is a schematic structural diagram of a hash learning apparatus based on neighbor structure preservation according to a second embodiment of the present invention.
As shown in fig. 4, the hash learning apparatus based on neighbor structure preservation includes: an acquisition module 101, an extraction module 102, an encoding module 103, a conversion module 104, a reconstruction module 105, a generation module 106 and a training module 107.
The obtaining module 101 is configured to obtain a video training set, and extract, for each training video in the video training set, M frame-level features of each training video.
The extracting module 102 is configured to extract a time domain appearance feature of each training video by using an automatic encoder, and cluster the time domain appearance features to obtain an anchor point feature set.
The obtaining module 101 is further configured to, for each training video, obtain, from the anchor point feature set, a time-domain appearance neighbor feature corresponding to each training video.
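As an illustrative sketch only (not the patent's exact procedure), the anchor point feature set can be built by clustering the time domain appearance features, and the a anchors nearest to a video's own appearance feature can then serve as its neighbor features. The k-means details and the function names below are assumptions:

```python
import numpy as np

def kmeans_anchors(features, n_anchors, iters=20, seed=0):
    """Cluster temporal appearance features; the centroids form the anchor set."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(features), n_anchors, replace=False)
    anchors = features[idx].copy()
    for _ in range(iters):
        # assign each feature to its nearest anchor, then re-center
        d = np.linalg.norm(features[:, None] - anchors[None], axis=2)
        labels = d.argmin(axis=1)
        for k in range(n_anchors):
            members = features[labels == k]
            if len(members):
                anchors[k] = members.mean(axis=0)
    return anchors

def neighbor_features(video_feat, anchors, a=2):
    """Return the a anchors nearest to one video's temporal appearance feature."""
    d = np.linalg.norm(anchors - video_feat, axis=1)
    return anchors[np.argsort(d)[:a]]
```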
And the coding module 103 is configured to code each training video into a corresponding depth expression according to the time domain appearance neighbor feature by using a coding network.
As a possible implementation manner, each training video has a time domain appearance neighbor features. The encoding module 103 is specifically configured to:
combining, along the column direction, the a time domain appearance neighbor features corresponding to each training video to obtain a first vector;
mapping the first vector to a b-dimensional neighbor structure expression n_i, wherein FC denotes the full link layer mapping;
for each training video, at the first moment, inputting the first frame-level feature of each training video into the coding network, and embedding the neighbor structure expression n_i into the b-dimensional memory state as follows:
wherein d is a fixed value; W_q, W_k and W_v are parameter values of the coding network; the concatenation operator denotes merging along the column direction; the input term denotes the frame-level feature fed at the first moment of the training video; and m_{i,1} denotes the memory state at the first moment;
when new frame-level features are input into the coding network, the memory state is updated as follows:
wherein 1 < t ≤ M; the input term denotes the frame-level feature fed at the t-th moment; m_{i,t} denotes the memory state at the t-th moment; and m_{i,t-1} denotes the memory state at the (t-1)-th moment;
the coding network is an LSTM network, and each operation unit in the coding network is as follows:
where MLP denotes a multi-layer perceptron mapping; BN denotes batch normalization; W_{iv}, W_{ih}, W_{fv}, W_{fh}, W_{ov} and W_{oh} are parameter values of the coding network; and the product operator denotes the inner product;
outputting the hidden-layer output h_{i,M} obtained at the last moment as the depth expression of the corresponding training video; in the formula, the input sequence denotes the frame-level features of the corresponding training video, and θ denotes the parameters of the coding network.
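For illustration only, the step-by-step encoding above can be sketched with a plain LSTM in which the neighbor structure expression n_i seeds the memory state. The patent's actual operation unit additionally uses the attention parameters W_q, W_k, W_v, MLP mappings and batch normalization; those are omitted here, so this is a simplified stand-in, not the claimed unit:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class NeighborLSTMEncoder:
    """Simplified stand-in for the coding network: the b-dimensional neighbor
    structure expression n_i initializes the memory state, then the M
    frame-level features are folded in one time step at a time."""

    def __init__(self, feat_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        z = feat_dim + hidden_dim          # [frame feature ; previous hidden]
        self.Wi = 0.1 * rng.standard_normal((hidden_dim, z))  # input gate
        self.Wf = 0.1 * rng.standard_normal((hidden_dim, z))  # forget gate
        self.Wo = 0.1 * rng.standard_normal((hidden_dim, z))  # output gate
        self.Wc = 0.1 * rng.standard_normal((hidden_dim, z))  # candidate memory
        self.hidden_dim = hidden_dim

    def encode(self, frames, n_i):
        m = n_i.astype(float).copy()       # memory state seeded with n_i
        h = np.zeros(self.hidden_dim)      # hidden state before the first frame
        for v in frames:                   # v: frame-level feature at step t
            x = np.concatenate([v, h])
            i = sigmoid(self.Wi @ x)
            f = sigmoid(self.Wf @ x)
            o = sigmoid(self.Wo @ x)
            m = f * m + i * np.tanh(self.Wc @ x)
            h = o * np.tanh(m)
        return h                           # h_{i,M}: the depth expression
```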
And the conversion module 104 is configured to convert the depth expression corresponding to each training video into a list of binary codes according to the full link layer using the activation function.
As a possible implementation, according to the full link layer using an activation function, the depth expression corresponding to each training video is converted into a column of binary codes: b_i = sgn(t_i), where t_i = FC(h_{i,M}; K); FC denotes the full link layer mapping; sgn denotes the sign function: sgn(t_i) is 1 when t_i is greater than 0, and -1 when t_i is less than or equal to 0; K denotes the length of the column of binary codes.
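A minimal sketch of the binarization step b_i = sgn(t_i). The random projection standing in for the full link layer FC and the tanh activation are assumptions for illustration:

```python
import numpy as np

def binarize(t):
    """b = sgn(t): +1 where t > 0, -1 where t <= 0 (the text maps sgn(0)
    to -1, unlike np.sign, which returns 0 at 0)."""
    return np.where(t > 0, 1, -1)

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))   # hypothetical stand-in for the FC layer
h_iM = rng.standard_normal(16)     # depth expression h_{i,M}
t_i = np.tanh(W @ h_iM)            # relaxed code t_i (tanh as the activation)
b_i = binarize(t_i)                # a K = 8 bit binary code
```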
And the reconstructing module 105 is configured to reconstruct M reconstructed frame-level features corresponding to each training video according to the binary code by using a decoding network.
And the generating module 106 is configured to generate a reconstruction error function according to the frame level feature and the reconstruction frame level feature corresponding to each training video, and generate a neighbor similarity error function according to the time domain appearance feature and the binary code.
As a possible implementation, the video training set includes N training videos, and the reconstruction error function accumulates, over all videos and frames, the error between the m-th frame-level feature in the i-th training video and the m-th reconstructed frame-level feature in the i-th training video.
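A sketch of a sum-of-squared-errors reconstruction loss consistent with the description above; the patent's exact formula and normalization are not reproduced in this text, so none is applied here:

```python
import numpy as np

def reconstruction_error(frames, recon):
    """Squared L2 distance between each frame-level feature and its
    reconstruction, summed over all N videos and M frames.

    frames, recon: arrays of shape (N, M, D).
    """
    return float(((frames - recon) ** 2).sum())
```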
As a possible implementation, in the neighbor similarity error function, s_ij denotes the similarity between the time domain appearance features of the i-th training video and the time domain appearance features of the j-th training video, and t_i is the relaxation of the binary code b_i.
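Since the exact formula is not reproduced in this text, the sketch below assumes a common form in which the appearance similarity s_ij is matched against the scaled inner product of the relaxed codes t_i and t_j; this pairing rule is an assumption, not the patent's stated equation:

```python
import numpy as np

def neighbor_similarity_error(S, T):
    """Sum over video pairs (i, j) of (s_ij - t_i . t_j / K)^2.

    S: (N, N) similarity matrix of the time domain appearance features.
    T: (N, K) relaxed codes t_i (real-valued relaxations of b_i).
    """
    K = T.shape[1]
    code_sim = (T @ T.T) / K   # scaled inner products, in [-1, 1] for +/-1 codes
    return float(((S - code_sim) ** 2).sum())
```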
A training module 107 for training the network to minimize a reconstruction error function and to minimize a neighbor similarity error function; the network comprises an encoding network, a full link layer and a decoding network.
It should be noted that the foregoing explanation on the embodiment of the hash learning method based on neighbor structure keeping is also applicable to the hash learning apparatus based on neighbor structure keeping in this embodiment, and details are not repeated here.
With the hash learning apparatus based on neighbor structure preservation of the embodiment of the invention, a video training set is acquired, and M frame-level features are extracted for each training video in the video training set. An automatic encoder extracts the time domain appearance feature of each training video, and the time domain appearance features are clustered to obtain an anchor point feature set. For each training video, the time domain appearance neighbor features corresponding to that training video are acquired from the anchor point feature set, and a coding network codes each training video into a corresponding depth expression according to the time domain appearance neighbor features. A full link layer using an activation function then converts the depth expression corresponding to each training video into a column of binary codes, and a decoding network reconstructs the M reconstructed frame-level features corresponding to each training video from the binary codes. A reconstruction error function is generated according to the frame-level features and the reconstructed frame-level features corresponding to each training video, a neighbor similarity error function is generated according to the time domain appearance features and the binary codes, and finally the network is trained to minimize the reconstruction error function and the neighbor similarity error function; the network comprises the encoding network, the full link layer and the decoding network. In the invention, the neighbors of a video are embedded into the coding network, so that while the frame-level features of the video are encoded, more attention is paid to the content of the video that is similar to its neighbors, which further improves the retrieval precision on large-scale unsupervised video databases.
In addition, minimizing the reconstruction error and the neighbor similarity error ensures that the neighbor structure is preserved in the Hamming space, further improving the retrieval precision on the video database.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.
Claims (10)
1. A hash learning method based on neighbor structure preservation, characterized by comprising the following steps:
S1, acquiring a video training set, and extracting M frame-level features of each training video in the video training set;
S2, extracting the time domain appearance features of each training video by adopting an automatic encoder, and clustering the time domain appearance features to obtain an anchor point feature set;
S3, acquiring, for each training video, the time domain appearance neighbor features corresponding to each training video from the anchor point feature set;
S4, coding each training video into a corresponding depth expression according to the time domain appearance neighbor features by adopting a coding network;
S5, converting the depth expression corresponding to each training video into a column of binary codes according to the full link layer using the activation function;
S6, reconstructing M reconstructed frame-level features corresponding to each training video according to the binary codes by adopting a decoding network;
S7, generating a reconstruction error function according to the frame-level features and the reconstructed frame-level features corresponding to each training video, and generating a neighbor similarity error function according to the time domain appearance features and the binary codes;
S8, training a network to minimize the reconstruction error function and minimize the neighbor similarity error function; wherein the network comprises the encoding network, the full link layer, and the decoding network.
2. The method of claim 1, wherein each training video has a time domain appearance neighbor features, where i is 1, 2, 3, …, N, and N is the number of training videos in the video training set; step S4 specifically includes:
S41, combining, along the column direction, the a time domain appearance neighbor features corresponding to each training video to obtain a first vector;
S42, mapping the first vector to a b-dimensional neighbor structure expression n_i, wherein FC denotes the full link layer mapping;
S43, at the first moment, inputting the first frame-level feature of each training video into the coding network, and embedding the neighbor structure expression n_i into the b-dimensional memory state as follows:
wherein d is a fixed value; W_q, W_k and W_v are parameter values of the coding network; the concatenation operator denotes merging along the column direction; the input term denotes the frame-level feature fed at the first moment of the training video; and m_{i,1} denotes the memory state at the first moment;
S44, when new frame-level features are input into the coding network, the memory state is updated as follows:
wherein 1 < t ≤ M; the input term denotes the frame-level feature fed at the t-th moment; m_{i,t} denotes the memory state at the t-th moment; and m_{i,t-1} denotes the memory state at the (t-1)-th moment;
the coding network is an LSTM network, and each operation unit in the coding network is as follows:
where MLP denotes a multi-layer perceptron mapping; BN denotes batch normalization; W_{iv}, W_{ih}, W_{fv}, W_{fh}, W_{ov} and W_{oh} are parameter values of the coding network; the product operator denotes the inner product; the σ function is computed as σ(x) = 1/(1 + e^(-x)); h_{i,t-1} denotes the output of the hidden layer at the (t-1)-th moment; and h_{i,t} denotes the output of the hidden layer at the t-th moment;
S45, outputting the hidden-layer output h_{i,M} obtained at the last moment as the depth expression of the corresponding training video;
3. The method of claim 2, wherein the deep representation of the corresponding training video is transformed according to a full link layer using an activation function, and the obtained list of binary codes is:
bi=sgn(ti);
wherein t_i = FC(h_{i,M}; K); FC denotes the full link layer mapping; sgn denotes the sign function: sgn(t_i) is 1 when t_i is greater than 0, and -1 when t_i is less than or equal to 0; and K denotes the length of the column of binary codes.
4. The method of claim 1, wherein the video training set includes N training videos,
the reconstruction error function is:
5. The method of claim 4, wherein the neighbor similarity error function is:
wherein s_ij denotes the similarity between the time domain appearance features of the i-th training video and the time domain appearance features of the j-th training video; t_i and t_j are respectively the relaxations of the binary codes b_i and b_j; K denotes the length of the binary code; and j is a positive integer not greater than N.
6. A hash learning apparatus based on neighbor structure preservation, the apparatus comprising:
the acquisition module, used for acquiring a video training set and extracting M frame-level features of each training video in the video training set;
the extraction module, used for extracting the time domain appearance features of each training video by adopting an automatic encoder, and clustering the time domain appearance features to obtain an anchor point feature set;
the acquisition module being further configured to acquire, for each training video, the time domain appearance neighbor features corresponding to each training video from the anchor point feature set;
the coding module, used for coding each training video into a corresponding depth expression according to the time domain appearance neighbor features by adopting a coding network;
the conversion module, used for converting the depth expression corresponding to each training video into a column of binary codes according to the full link layer using the activation function;
the reconstruction module, used for reconstructing M reconstructed frame-level features corresponding to each training video according to the binary codes by adopting a decoding network;
the generating module, used for generating a reconstruction error function according to the frame-level features and the reconstructed frame-level features corresponding to each training video, and generating a neighbor similarity error function according to the time domain appearance features and the binary codes;
the training module, used for training a network to minimize the reconstruction error function and minimize the neighbor similarity error function; wherein the network comprises the encoding network, the full link layer, and the decoding network.
7. The apparatus of claim 6, wherein each training video has a time domain appearance neighbor features, where i is 1, 2, 3, …, N, and N is the number of training videos in the video training set; the encoding module is specifically configured to:
combining, along the column direction, the a time domain appearance neighbor features corresponding to each training video to obtain a first vector;
mapping the first vector to a b-dimensional neighbor structure expression n_i, wherein FC denotes the full link layer mapping;
for each training video, at the first moment, inputting the first frame-level feature of each training video into the coding network, and embedding the neighbor structure expression n_i into the b-dimensional memory state as follows:
wherein d is a fixed value; W_q, W_k and W_v are parameter values of the coding network; the concatenation operator denotes merging along the column direction; the input term denotes the frame-level feature fed at the first moment of the training video; and m_{i,1} denotes the memory state at the first moment;
when new frame-level features are input into the coding network, the memory state is updated as follows:
wherein 1 < t ≤ M; the input term denotes the frame-level feature fed at the t-th moment; m_{i,t} denotes the memory state at the t-th moment; and m_{i,t-1} denotes the memory state at the (t-1)-th moment;
the coding network is an LSTM network, and each operation unit in the coding network is as follows:
where MLP denotes a multi-layer perceptron mapping; BN denotes batch normalization; W_{iv}, W_{ih}, W_{fv}, W_{fh}, W_{ov} and W_{oh} are parameter values of the coding network; the product operator denotes the inner product; the σ function is computed as σ(x) = 1/(1 + e^(-x)); h_{i,t-1} denotes the output of the hidden layer at the (t-1)-th moment; and h_{i,t} denotes the output of the hidden layer at the t-th moment;
outputting the hidden-layer output h_{i,M} obtained at the last moment as the depth expression of the corresponding training video; in the formula, the input sequence denotes the frame-level features of the corresponding training video, and θ denotes the parameters of the coding network.
8. The apparatus of claim 7, wherein the deep representation of the corresponding training video is transformed according to a full link layer using an activation function, and a list of binary codes is obtained by:
bi=sgn(ti);
wherein t_i = FC(h_{i,M}; K); FC denotes the full link layer mapping; sgn denotes the sign function: sgn(t_i) is 1 when t_i is greater than 0, and -1 when t_i is less than or equal to 0; and K denotes the length of the column of binary codes.
9. The apparatus of claim 6, wherein the video training set comprises N training videos,
the reconstruction error function is:
10. The apparatus of claim 8, wherein the neighbor similarity error function is:
wherein s_ij denotes the similarity between the time domain appearance features of the i-th training video and the time domain appearance features of the j-th training video; t_i and t_j are respectively the relaxations of the binary codes b_i and b_j; K denotes the length of the binary code; and j is a positive integer not greater than N.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910264740.9A CN110069666B (en) | 2019-04-03 | 2019-04-03 | Hash learning method and device based on neighbor structure keeping |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110069666A CN110069666A (en) | 2019-07-30 |
CN110069666B true CN110069666B (en) | 2021-04-06 |
Family
ID=67366914
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910264740.9A Active CN110069666B (en) | 2019-04-03 | 2019-04-03 | Hash learning method and device based on neighbor structure keeping |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110069666B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112199520B (en) * | 2020-09-19 | 2022-07-22 | 复旦大学 | Cross-modal Hash retrieval algorithm based on fine-grained similarity matrix |
CN113111836B (en) * | 2021-04-25 | 2022-08-19 | 山东省人工智能研究院 | Video analysis method based on cross-modal Hash learning |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012077818A1 (en) * | 2010-12-10 | 2012-06-14 | 国立大学法人豊橋技術科学大学 | Method for determining conversion matrix for hash function, hash-type approximation nearest neighbour search method using said hash function, and device and computer program therefor |
CN103744973A (en) * | 2014-01-11 | 2014-04-23 | 西安电子科技大学 | Video copy detection method based on multi-feature Hash |
CN107229757A (en) * | 2017-06-30 | 2017-10-03 | 中国科学院计算技术研究所 | The video retrieval method encoded based on deep learning and Hash |
CN108304808A (en) * | 2018-02-06 | 2018-07-20 | 广东顺德西安交通大学研究院 | A kind of monitor video method for checking object based on space time information Yu depth network |
CN108763481A (en) * | 2018-05-29 | 2018-11-06 | 清华大学深圳研究生院 | A kind of picture geographic positioning and system based on extensive streetscape data |
CN109151501A (en) * | 2018-10-09 | 2019-01-04 | 北京周同科技有限公司 | A kind of video key frame extracting method, device, terminal device and storage medium |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106777130B (en) * | 2016-12-16 | 2020-05-12 | 西安电子科技大学 | Index generation method, data retrieval method and device |
CN109409208A (en) * | 2018-09-10 | 2019-03-01 | 东南大学 | A kind of vehicle characteristics extraction and matching process based on video |
CN109299097B (en) * | 2018-09-27 | 2022-06-21 | 宁波大学 | Online high-dimensional data nearest neighbor query method based on Hash learning |
Non-Patent Citations (1)
Title |
---|
"二值表示学习及其应用";鲁继文;《模式识别与人工智能》;20180131;第31卷(第1期);第12-21页 * |
Also Published As
Publication number | Publication date |
---|---|
CN110069666A (en) | 2019-07-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
AU2019360080B2 (en) | Image captioning with weakly-supervised attention penalty | |
US20200104640A1 (en) | Committed information rate variational autoencoders | |
EP3298576A1 (en) | Generative methods of super resolution | |
CN112509555B (en) | Dialect voice recognition method, device, medium and electronic equipment | |
CN112418292B (en) | Image quality evaluation method, device, computer equipment and storage medium | |
Cascianelli et al. | Full-GRU natural language video description for service robotics applications | |
CN111932546A (en) | Image segmentation model training method, image segmentation method, device, equipment and medium | |
CN110990596B (en) | Multi-mode hash retrieval method and system based on self-adaptive quantization | |
CN110069666B (en) | Hash learning method and device based on neighbor structure keeping | |
CN115687571B (en) | Depth unsupervised cross-modal retrieval method based on modal fusion reconstruction hash | |
US20220309292A1 (en) | Growing labels from semi-supervised learning | |
CN110188827A (en) | A kind of scene recognition method based on convolutional neural networks and recurrence autocoder model | |
CN115083435B (en) | Audio data processing method and device, computer equipment and storage medium | |
CN114596456B (en) | Image set classification method based on aggregated hash learning | |
CN116543351A (en) | Self-supervision group behavior identification method based on space-time serial-parallel relation coding | |
US20230252993A1 (en) | Visual speech recognition for digital videos utilizing generative adversarial learning | |
CN115775350A (en) | Image enhancement method and device and computing equipment | |
CN115880556B (en) | Multi-mode data fusion processing method, device, equipment and storage medium | |
Ma et al. | Partial hash update via hamming subspace learning | |
CN116168394A (en) | Image text recognition method and device | |
CN115965833A (en) | Point cloud sequence recognition model training and recognition method, device, equipment and medium | |
CN113704466B (en) | Text multi-label classification method and device based on iterative network and electronic equipment | |
US20220058842A1 (en) | Generating handwriting via decoupled style descriptors | |
CN114913358B (en) | Medical hyperspectral foreign matter detection method based on automatic encoder | |
CN115129713A (en) | Data retrieval method, data retrieval device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||