CN110069666B - Hash learning method and device based on neighbor structure keeping - Google Patents
- Publication number
- CN110069666B (application CN201910264740.9A)
- Authority
- CN
- China
- Prior art keywords
- video
- training
- training video
- neighbor
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/75—Clustering; Classification
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
Abstract
The invention provides a hash learning method and device based on neighbor structure preservation. The method comprises the following steps: acquiring a video training set and extracting M frame-level features of each training video; extracting the time domain appearance feature of each training video and clustering the time domain appearance features to obtain an anchor feature set; acquiring the time domain appearance neighbor features corresponding to each training video from the anchor feature set; encoding each training video into a corresponding depth expression with an encoding network according to the time domain appearance neighbor features; converting the depth expression corresponding to each training video into a binary code; reconstructing the M reconstructed frame-level features corresponding to each training video according to the binary code; generating a reconstruction error function and a neighbor similarity error function; and training the network to minimize the reconstruction error function and the neighbor similarity error function. The method can ensure the perfect preservation of the neighbor structure in Hamming space and improve retrieval precision on large-scale unsupervised video databases.
Description
Technical Field
The invention relates to the technical field of video processing, and in particular to a hash learning method and device based on neighbor structure preservation.
Background
Large-scale video retrieval aims to retrieve, from a huge database, videos similar to a given query video. A video is generally represented by a series of sampled frames, each of which can be represented by a feature. During retrieval, relevant videos can be determined from the feature sets corresponding to the videos.
In the presence of high-dimensional features and massive data, hashing methods have achieved great success in large-scale visual retrieval tasks. Video hashing encodes a video into a compact binary code that preserves the similarity structure of the video space, and stores the video in Hamming space. Learning-based video hashing methods exploit the characteristics of the data and achieve better performance than manually designed hashing methods. Because it avoids the burden of manual labeling, unsupervised hashing is more practical than supervised hashing for large-scale video retrieval tasks.
At present, most unsupervised hashing methods focus on exploiting the representation and temporal information of the video but neglect the neighbor structure. As a result, the encoding network absorbs the content of the input video indiscriminately, regardless of whether that content is similar to the neighbors' content, which is unfavorable for preserving neighbor similarity. Consequently, retrieval precision cannot be guaranteed when performing video retrieval on a large-scale unsupervised video database.
Disclosure of Invention
The invention provides a hash learning method and device based on neighbor structure preservation, which ensure the perfect preservation of the neighbor structure in Hamming space and improve retrieval precision on large-scale unsupervised video databases. They address the technical problem that, in the prior art, unsupervised hashing focuses on exploiting the representation and temporal information of videos but neglects the neighbor structure, so the precision of video retrieval cannot be guaranteed.
An embodiment of one aspect of the present invention provides a hash learning method based on neighbor structure preservation, including:
s1, acquiring a video training set, and extracting M frame-level features of each training video in the video training set;
s2, extracting the time domain appearance feature of each training video with an autoencoder, and clustering the time domain appearance features to obtain an anchor feature set;
s3, acquiring time domain appearance neighbor characteristics corresponding to each training video from the anchor point characteristic set aiming at each training video;
s4, coding each training video into a corresponding depth expression according to the time domain appearance neighbor characteristic by adopting a coding network;
s5, converting the depth expression corresponding to each training video into a binary code through a full link layer using an activation function;
s6, reconstructing M reconstructed frame-level features corresponding to each training video according to the binary code by adopting a decoding network;
s7, generating a reconstruction error function according to the frame level characteristics and the reconstruction frame level characteristics corresponding to each training video, and generating a neighbor similarity error function according to the time domain appearance characteristics and the binary codes;
s8, training the network to minimize the reconstruction error function and the neighbor similarity error function; wherein the network comprises the encoding network, the full link layer, and the decoding network.
In the hash learning method based on neighbor structure preservation of the embodiment of the invention, a video training set is acquired and M frame-level features are extracted for each training video in the set. An autoencoder then extracts the time domain appearance feature of each training video, and these features are clustered to obtain an anchor feature set. For each training video, the corresponding time domain appearance neighbor features are acquired from the anchor feature set, and an encoding network encodes each training video into a corresponding depth expression according to those neighbor features. A full link layer using an activation function converts the depth expression into a binary code, from which a decoding network reconstructs the M reconstructed frame-level features of the training video. A reconstruction error function is generated from the frame-level features and reconstructed frame-level features of each training video, and a neighbor similarity error function is generated from the time domain appearance features and the binary codes. Finally, the network, which comprises the encoding network, the full link layer, and the decoding network, is trained to minimize both the reconstruction error function and the neighbor similarity error function. Because the neighbors of a video are embedded into the encoding network, content similar to the neighbors receives more attention while the frame-level features of the video are encoded, which can further improve retrieval precision on large-scale unsupervised video databases.
In addition, by minimizing the reconstruction error and the neighbor similarity error, the perfect preservation of the neighbor structure in the Hamming space can be ensured, and the retrieval precision on the video database is further improved.
Another embodiment of the present invention provides a hash learning apparatus based on neighbor structure preservation, including:
the acquisition module is used for acquiring a video training set and extracting M frame-level features of each training video in the video training set;
the extraction module is used for extracting the time domain appearance feature of each training video with an autoencoder, and clustering the time domain appearance features to obtain an anchor feature set;
the acquisition module is further configured to acquire, for each training video, a time-domain appearance neighbor feature corresponding to each training video from the anchor point feature set;
the coding module is used for coding each training video into a corresponding depth expression according to the time domain appearance neighbor characteristic by adopting a coding network;
the conversion module is used for converting the depth expression corresponding to each training video into a binary code through a full link layer using an activation function;
the reconstruction module is used for reconstructing M reconstruction frame level characteristics corresponding to each training video according to the binary code by adopting a decoding network;
the generating module is used for generating a reconstruction error function according to the frame level characteristics and the reconstruction frame level characteristics corresponding to each training video and generating a neighbor similarity error function according to the time domain appearance characteristics and the binary codes;
a training module, used for training the network to minimize the reconstruction error function and the neighbor similarity error function; wherein the network comprises the encoding network, the full link layer, and the decoding network.
The hash learning device based on neighbor structure preservation of the embodiment of the invention acquires a video training set and extracts M frame-level features for each training video in the set. An autoencoder extracts the time domain appearance feature of each training video, and these features are clustered to obtain an anchor feature set. For each training video, the corresponding time domain appearance neighbor features are acquired from the anchor feature set, and an encoding network encodes each training video into a corresponding depth expression according to those neighbor features. A full link layer using an activation function converts the depth expression into a binary code, from which a decoding network reconstructs the M reconstructed frame-level features of the training video. A reconstruction error function is generated from the frame-level features and reconstructed frame-level features of each training video, a neighbor similarity error function is generated from the time domain appearance features and the binary codes, and the network, which comprises the encoding network, the full link layer, and the decoding network, is trained to minimize both error functions. Because the neighbors of a video are embedded into the encoding network, content similar to the neighbors receives more attention while the frame-level features of the video are encoded, which can further improve retrieval precision on large-scale unsupervised video databases.
In addition, by minimizing the reconstruction error and the neighbor similarity error, the perfect preservation of the neighbor structure in the Hamming space can be ensured, and the retrieval precision on the video database is further improved.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart of a hash learning method based on neighbor structure preservation according to an embodiment of the present invention;
FIG. 2 is a first diagram illustrating a hash learning process according to an embodiment of the present invention;
FIG. 3 is a second diagram illustrating a hash learning process according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a hash learning apparatus based on neighbor structure preservation according to a second embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
Currently, some video hashing methods integrate the hash function into a deep neural network: features of video frames are extracted by a deep convolutional network and further encoded into a binary code through a temporal pooling operation or a deep recurrent network. Compared with supervised hashing, unsupervised hashing is more practical for large-scale video retrieval tasks because it avoids the burden of manual labeling.
However, most unsupervised hashing methods aim to exploit the representation and temporal information of the video but neglect the neighbor structure. Although some hashing methods design neighbor similarity cost functions to train the network, the neighbor structure is only used to guide the generation of the binary code and is not exploited during video feature encoding. In this way, the designed encoding network absorbs the content of the input video indiscriminately, regardless of whether that content is similar to the neighbors' content, which is unfavorable for preserving neighbor similarity; retrieval precision therefore cannot be guaranteed when performing video retrieval on a large-scale unsupervised video database.
Therefore, the invention provides a hash learning method based on neighbor structure preservation, mainly aimed at the technical problem that, in the prior art, unsupervised hashing focuses on the representation and temporal information of a video but neglects the neighbor structure, so the precision of video retrieval cannot be guaranteed.
According to the hash learning method based on neighbor structure preservation of the embodiment of the invention, the neighbors of a video are embedded into the encoding network, so that content similar to the neighbors receives more attention while the video's frame-level features are encoded, which can further improve retrieval precision on large-scale unsupervised video databases. In addition, when training the network, minimizing the reconstruction error and the neighbor similarity error ensures the perfect preservation of the neighbor structure in Hamming space, further improving retrieval precision on the video database.
The following describes a hash learning method and apparatus based on neighbor structure keeping according to an embodiment of the present invention with reference to the drawings.
Fig. 1 is a schematic flowchart of a hash learning method based on neighbor structure preservation according to an embodiment of the present invention.
In the embodiment of the invention, the hash learning method based on neighbor structure preservation is described as configured in a hash learning apparatus based on neighbor structure preservation. This apparatus can be applied to any computer device, so that the computer device can perform the hash learning function based on neighbor structure preservation.
The computer device may be a Personal Computer (PC), a cloud device, a mobile device, and the like, and the mobile device may be a hardware device having various operating systems, touch screens, and/or display screens, such as a mobile phone, a tablet computer, a personal digital assistant, a wearable device, and an in-vehicle device.
As shown in fig. 1, the neighbor structure preserving-based hash learning method may include the following steps:
s1, acquiring a video training set, and extracting M frame-level features of each training video aiming at each training video in the video training set.
In the embodiment of the present invention, the video training set includes N training videos. The N training videos may be videos stored locally on the computer device or videos downloaded online, which is not limited herein. The sizes of N and M are both preset.
In the embodiment of the invention, for each training video in the video training set, M frames can be uniformly sampled, and the M frame-level features of dimension l corresponding to each training video are extracted by a deep convolutional network, so that each training video can be converted into a frame-level feature set.
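As a rough illustration of step S1, uniform frame sampling and frame-level feature extraction can be sketched as follows. The function names are assumptions for illustration only, and `feature_fn` is a hypothetical stand-in for the deep convolutional network of the patent:

```python
import numpy as np

def uniform_sample_frames(num_frames, M):
    """Pick M frame indices spread evenly over a video of num_frames frames."""
    return np.linspace(0, num_frames - 1, M).astype(int)

def extract_frame_features(video_frames, M, feature_fn):
    """Sample M frames and map each through a frame-level feature extractor.

    feature_fn stands in for the deep convolutional network; it maps one
    frame to an l-dimensional feature vector.
    """
    idx = uniform_sample_frames(len(video_frames), M)
    return np.stack([feature_fn(video_frames[i]) for i in idx])  # shape (M, l)
```

Each training video is thereby converted into an (M, l) feature set, one row per sampled frame.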
And S2, extracting the time domain appearance feature of each training video with an autoencoder, and clustering the time domain appearance features to obtain an anchor feature set.
In the embodiment of the invention, for each training video, a d-dimensional time domain appearance feature can be obtained through an autoencoder. In principle, for each training video, the distances to the other videos in the video library could be calculated and ranked to determine the a time domain appearance neighbor features corresponding to that video; for example, the distance between videos can be calculated with the two-norm. However, this calculation would also be needed at the testing stage, and a neighbor search over the whole video training set consumes too much time to be practical. Therefore, in the invention, K-means clustering can be performed on the time domain appearance features of the training videos to obtain n cluster centers. For each cluster center, the time domain appearance feature closest to it (i.e., at the smallest distance) is selected, yielding n time domain appearance features. These n features are then used as anchors and collected into the anchor feature set.
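A minimal sketch of the anchor selection in step S2, under the assumption of a plain k-means loop (the patent does not specify the clustering implementation; `select_anchors` and its parameters are illustrative names):

```python
import numpy as np

def select_anchors(temporal_features, n, iters=20, seed=0):
    """Run plain k-means over the time domain appearance features, then
    return, for each cluster center, the actual feature closest to it
    as an anchor (the anchors are real features, not the centers)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(temporal_features, dtype=float)
    centers = X[rng.choice(len(X), size=n, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(n):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    # anchor = real feature nearest to each converged center
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return X[d.argmin(axis=0)]  # shape (n, d)
```

The returned (n, d) array is the anchor feature set used in the later neighbor lookup.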
And S3, for each training video, acquiring the corresponding time domain appearance neighbor features from the anchor feature set.
In the embodiment of the invention, for each training video, the a time domain appearance neighbor features corresponding to that video can be obtained from the anchor feature set. Because a is far smaller than N, obtaining the a time domain appearance neighbor features consumes very little time, which can greatly improve the efficiency of video retrieval.
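The anchor lookup of step S3 can be sketched as a simple nearest-neighbor query with the two-norm, matching the distance measure mentioned above (the function name is an assumption):

```python
import numpy as np

def nearest_anchors(video_feature, anchors, a):
    """Return the a anchors closest to the video's time domain appearance
    feature, measured with the two-norm."""
    anchors = np.asarray(anchors, dtype=float)
    d = np.linalg.norm(anchors - np.asarray(video_feature, dtype=float), axis=1)
    order = np.argsort(d)[:a]       # indices of the a nearest anchors
    return anchors[order]
```

Because the search runs over the n anchors rather than all N training videos, the cost per query is small.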
And S4, coding each training video into a corresponding depth expression according to the time domain appearance neighbor characteristics by adopting a coding network.
In the embodiment of the invention, the encoding network learns the correspondence between the time domain appearance neighbor features and the depth expressions of the videos. After the time domain appearance neighbor features corresponding to each training video are determined, they can be input to the encoding network to obtain the depth expression corresponding to each training video.
As a possible implementation, in the neighbor attention learning mechanism, the neighbor structure expression n_i needs to be obtained. Specifically, for each training video, the a time domain appearance neighbor features corresponding to the training video may be merged column-wise to obtain a first vector, which is mapped to a b-dimensional neighbor structure expression n_i as follows:
where FC denotes a full link layer map.
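A hedged sketch of this mapping: the a neighbor features are concatenated into the first vector and passed through one affine full link layer. `W` and `bias` are illustrative learned parameters (shapes (b, a*d) and (b,)), not values given by the patent:

```python
import numpy as np

def neighbor_structure_expression(neighbor_feats, W, bias):
    """Column-wise merge the a time domain appearance neighbor features into
    one vector, then map it through a full link (affine) layer to the
    b-dimensional neighbor structure expression n_i."""
    first_vector = np.concatenate(neighbor_feats)   # shape (a*d,)
    return W @ first_vector + bias                  # n_i, shape (b,)
```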
For each training video, at the first time step, the first frame-level feature of the training video is input into the encoding network, and the neighbor structure expression n_i is embedded into the b-dimensional memory state as follows:
where d is a fixed value, W_q, W_k, and W_v are parameter values of the encoding network, [·;·] denotes column-wise merging, v_{i,1} denotes the frame-level feature input at the first time step of the training video, and m_{i,1} denotes the memory state corresponding to the first time step.
Through formula (2), the information of the neighbor structure is present in the memory state at every time step. When 1 < t ≤ M, as each new video frame-level feature is input to the encoding network, the memory state can be updated as follows:
where v_{i,t} denotes the video frame-level feature input at the t-th time step, m_{i,t} denotes the memory state corresponding to the t-th time step, and m_{i,t-1} denotes the memory state corresponding to the (t-1)-th time step.
Through formulas (2) and (3), at each time step the memory state selects the useful information in the input features to write into the new memory state, according to the neighbor structure information it contains. The neighbor attention learning mechanism is embedded into the encoding network, and the operation units are obtained as follows:
where MLP denotes a multi-layer mapping, BN denotes batch normalization, W_iv, W_ih, W_fv, W_fh, W_ov, and W_oh denote parameter values of the encoding network, ⊙ denotes the inner product, the σ function is calculated as σ(x) = 1/(1 + e^(-x)), and the tanh function is calculated as tanh(x) = (e^x - e^(-x))/(e^x + e^(-x)). The hidden-layer output h_{i,M} obtained at the last time step is the depth expression of the training video. Specifically, for each training video, the depth expression of the training video is:
where v_{i,1}, …, v_{i,M} denote the frame-level features corresponding to the training video, and θ denotes the parameters of the encoding network.
In the embodiment of the invention, the neighbors of the video are embedded into the coding network, so that in the process of coding the frame-level features of the video, the content similar to the neighbors in the video is paid more attention, and the retrieval precision on a large-scale unsupervised video database can be further improved.
And S5, converting the depth expression corresponding to each training video into a binary code through the full link layer using an activation function.
In the embodiment of the invention, for each training video, the depth expression corresponding to the training video is converted into a binary code through the full link layer using the activation function, as follows:
bi=sgn(ti);(6)
where t_i = FC(h_{i,M}) is the relaxed code produced by the full link layer mapping FC, and sgn denotes the sign function: when t_i > 0, sgn(t_i) is 1, and when t_i ≤ 0, sgn(t_i) is -1; k denotes the length of the binary code.
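The binarization of formulas (6) can be sketched directly; `W` and `bias` stand in for the learned full link layer parameters (illustrative names, not from the patent):

```python
import numpy as np

def binarize(h_last, W, bias):
    """Map the final hidden state h_{i,M} through a full link layer to the
    relaxed code t_i, then take sgn to obtain the k-bit binary code b_i.
    Following the patent's convention, sgn(t) is 1 when t > 0 and -1
    otherwise."""
    t = W @ h_last + bias            # relaxed code t_i, length k
    b = np.where(t > 0, 1, -1)       # binary code b_i in {-1, +1}^k
    return t, b
```

Keeping t_i alongside b_i matters later: the relaxed code is what the similarity loss is trained through.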
And S6, reconstructing M reconstructed frame-level features corresponding to each training video according to the binary code by adopting a decoding network.
In the embodiment of the present invention, a Long Short-Term Memory (LSTM) network may be used as the decoding network. Specifically, the binary code corresponding to each training video may first be mapped to an l-dimensional vector. At the first time step, this vector is input to the decoding network to obtain the first reconstructed video frame-level feature, referred to in the embodiments of the invention as a reconstructed frame-level feature. At the second time step, the first reconstructed frame-level feature is input to the decoding network to obtain the second reconstructed frame-level feature; the second is then input to obtain the third, and so on, until the decoding network outputs the M-th reconstructed frame-level feature, at which point decoding is complete.
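The autoregressive decoding loop above can be sketched generically; `step_fn` is a hypothetical stand-in for one step of the LSTM decoding network (the recurrent state is folded into it here for brevity):

```python
import numpy as np

def decode_frames(code_vec, M, step_fn):
    """Decoding loop of step S6: the mapped binary code is fed in at the
    first step, and each reconstructed frame-level feature is fed back as
    the next step's input."""
    outputs = []
    x = np.asarray(code_vec, dtype=float)
    for _ in range(M):
        x = step_fn(x)              # next reconstructed frame-level feature
        outputs.append(x)
    return np.stack(outputs)        # shape (M, l)
```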
As an example, referring to fig. 2, fig. 2 is a schematic diagram of a hash learning process in an embodiment of the present invention. In the figure, after the M frame-level features v_1, v_2, …, v_M corresponding to the training video are obtained, the corresponding depth expression can be output through the encoding network; after the corresponding binary code is obtained through the full link layer using the activation function, the corresponding M reconstructed frame-level features can be output through the decoding network.
And S7, generating a reconstruction error function according to the frame-level features and reconstructed frame-level features corresponding to each training video, and generating a neighbor similarity error function according to the time domain appearance features and the binary codes.
In the embodiment of the invention, two loss functions are designed to train the network: the reconstruction error function L_r and the neighbor similarity error function L_s.
The reconstruction error function L_r represents the difference between the frame-level features of the input training video and the decoded reconstructed frame-level features; the mean square error may be used to represent the reconstruction error function L_r:
where v_{i,m} denotes the m-th frame-level feature of the i-th training video, and v̂_{i,m} denotes the m-th reconstructed frame-level feature of the i-th training video.
In the embodiment of the invention, the neighbor similarity error function represents the difference between the similarity structures in the original video space and in Hamming space; the neighbor similarity error function L_s is given by formula (8):
where s_ij denotes the similarity between the time domain appearance features of the i-th and j-th training videos, and the other similarity term in formula (8) denotes the similarity between the binary code b_i corresponding to the i-th video and the binary code b_j corresponding to the j-th video.
To calculate s_ij in formula (8), the approximate similarity matrix A may be established as follows. First, the corresponding frame-level features x_i of each training video and its a time domain appearance neighbor features can be obtained, and a truncated similarity matrix Y is defined; each element Y_ij of Y can be expressed by formula (9):
where ⟨i⟩ denotes the positions of the a time domain appearance neighbor features within the anchor feature set, Dist denotes a distance function (the two-norm may be used), and t denotes a bandwidth parameter.
The approximate similarity matrix a may be calculated according to equation (10):
A = YΛ⁻¹Yᵀ; (10)
A calculated according to formula (10) is a sparse non-negative matrix in which each row and each column sums to 1. When A_ij > 0, s_ij can be set to 1; when A_ij ≤ 0, s_ij can be set to 0.
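A sketch of this anchor-graph-style construction. The Gaussian kernel and row normalization of Y are assumptions standing in for the patent's formula (9), which is not reproduced in the text; Λ is taken as the diagonal matrix of Y's column sums so that A = YΛ⁻¹Yᵀ has the row-sum-1 property described above:

```python
import numpy as np

def approximate_similarity(features, anchors, a, bandwidth=1.0):
    """Build the approximate similarity matrix A = Y Λ^{-1} Y^T and the
    binarized similarities s_ij = 1 iff A_ij > 0.

    Row i of the truncated matrix Y is nonzero only at the a anchors
    nearest to video i, weighted by a Gaussian kernel of the two-norm
    distance (an assumed concrete form) and normalized to sum to 1.
    """
    X = np.asarray(features, dtype=float)
    C = np.asarray(anchors, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    Y = np.zeros_like(dist)
    for i in range(len(X)):
        nn = np.argsort(dist[i])[:a]                  # a nearest anchors
        w = np.exp(-dist[i, nn] ** 2 / bandwidth)
        Y[i, nn] = w / w.sum()                        # rows of Y sum to 1
    Lam = np.diag(Y.sum(axis=0))                      # column sums of Y
    A = Y @ np.linalg.inv(Lam) @ Y.T
    s = (A > 0).astype(int)
    return A, s
```

With rows of Y summing to 1, A·1 = YΛ⁻¹Yᵀ1 = Y·1 = 1, which matches the row/column-sum property stated in the text.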
In formula (8), the similarity between the binary codes b_i and b_j is defined directly through the codes. To avoid oscillation during network training, this similarity can be approximated using the relaxed codes, where t_i is the relaxed representation of the binary code b_i.
To reduce the approximation error between the relaxed similarity and the binary similarity, an auxiliary loss term relating t_i and b_i can be introduced, and formula (8) can be converted into:
s8, training the network to minimize the reconstruction error function and minimize the neighbor similarity error function; the network comprises an encoding network, a full link layer and a decoding network.
In the embodiment of the invention, the network can be divided into three parts: the first part is an encoding network with a neighbor attention learning mechanism, through which the depth expression of the training video can be learned; the second part is a full link layer with a nonlinear activation function, used to convert the depth expression into a k-dimensional binary code; the third part is a decoding network, which decodes the reconstructed frame-level features of each frame of the training video from the binary code obtained by encoding.
In the embodiment of the invention, the network can be trained according to the reconstruction error function and the neighbor similarity error function: minimizing the reconstruction error function makes better use of the information contained in the input training video, while minimizing the neighbor similarity error function maximally preserves the neighbor similarity. When training the network, the training loss function is a weighted combination of the reconstruction error function and the neighbor similarity error function:
L=αLs+(1-α)Lr;(12)
where α represents a hyperparameter balancing the reconstruction error function and the neighbor similarity error function.
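The weighting of formula (12) is a plain convex combination, sketched below (the function name is illustrative):

```python
def total_loss(L_s, L_r, alpha):
    """Training loss of formula (12): a combination of the neighbor
    similarity error L_s and the reconstruction error L_r, balanced by the
    hyperparameter alpha."""
    return alpha * L_s + (1 - alpha) * L_r
```

Setting alpha near 1 emphasizes neighbor-structure preservation; setting it near 0 emphasizes reconstruction fidelity.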
In the embodiment of the invention, when the network is trained end to end, the network parameters can be optimized by back-propagating gradients.
As an example, referring to fig. 3, fig. 3 is a schematic diagram of a hash learning process according to an embodiment of the present invention. When the network is trained, upon input of a training video, the time domain appearance neighbor features can be embedded into the hash coding network for hash learning; a hash code is generated through the encoding network, and the perfect preservation of the neighbor structure in Hamming space is ensured by minimizing the reconstruction error and the neighbor similarity error.
With the hash learning method based on neighbor structure preservation of the embodiment of the invention, a video training set is acquired, and M frame-level features are extracted for each training video in the video training set. An automatic encoder extracts the time domain appearance feature of each training video, and the time domain appearance features are clustered to obtain an anchor point feature set. For each training video, the time domain appearance neighbor features corresponding to that training video are acquired from the anchor point feature set, and a coding network codes each training video into a corresponding depth expression according to the time domain appearance neighbor features. A full link layer using an activation function then converts the depth expression corresponding to each training video into a column of binary codes, and a decoding network reconstructs the M reconstructed frame-level features corresponding to each training video from the binary codes. A reconstruction error function is generated according to the frame-level features and the reconstructed frame-level features corresponding to each training video, a neighbor similarity error function is generated according to the time domain appearance features and the binary codes, and finally the network is trained to minimize the reconstruction error function and the neighbor similarity error function; the network comprises the encoding network, the full link layer and the decoding network. In the invention, the neighbors of a video are embedded into the coding network, so that while the frame-level features of the video are encoded, more attention is paid to the content of the video that is similar to its neighbors, which further improves the retrieval precision on large-scale unsupervised video databases.
In addition, minimizing the reconstruction error and the neighbor similarity error ensures that the neighbor structure is preserved in the Hamming space, further improving the retrieval precision on the video database.
In order to implement the above embodiment, the present invention further provides a hash learning apparatus based on neighbor structure preservation.
Fig. 4 is a schematic structural diagram of a hash learning apparatus based on neighbor structure preservation according to a second embodiment of the present invention.
As shown in fig. 4, the hash learning apparatus based on neighbor structure preservation includes: an acquisition module 101, an extraction module 102, an encoding module 103, a conversion module 104, a reconstruction module 105, a generation module 106 and a training module 107.
The obtaining module 101 is configured to obtain a video training set, and extract, for each training video in the video training set, M frame-level features of each training video.
The extracting module 102 is configured to extract a time domain appearance feature of each training video by using an automatic encoder, and cluster the time domain appearance features to obtain an anchor point feature set.
The obtaining module 101 is further configured to, for each training video, obtain, from the anchor point feature set, a time-domain appearance neighbor feature corresponding to each training video.
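As an illustrative sketch only (not the patent's exact procedure), the anchor point feature set can be built by clustering the time domain appearance features, and the a anchors nearest to a video's own appearance feature can then serve as its neighbor features. The k-means details and the function names below are assumptions:

```python
import numpy as np

def kmeans_anchors(features, n_anchors, iters=20, seed=0):
    """Cluster temporal appearance features; the centroids form the anchor set."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(features), n_anchors, replace=False)
    anchors = features[idx].copy()
    for _ in range(iters):
        # assign each feature to its nearest anchor, then re-center
        d = np.linalg.norm(features[:, None] - anchors[None], axis=2)
        labels = d.argmin(axis=1)
        for k in range(n_anchors):
            members = features[labels == k]
            if len(members):
                anchors[k] = members.mean(axis=0)
    return anchors

def neighbor_features(video_feat, anchors, a=2):
    """Return the a anchors nearest to one video's temporal appearance feature."""
    d = np.linalg.norm(anchors - video_feat, axis=1)
    return anchors[np.argsort(d)[:a]]
```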
And the coding module 103 is configured to code each training video into a corresponding depth expression according to the time domain appearance neighbor feature by using a coding network.
As a possible implementation manner, each training video has a time domain appearance neighbor features. The encoding module 103 is specifically configured to:
combining, along the column direction, the a time domain appearance neighbor features corresponding to each training video to obtain a first vector;
mapping the first vector to a b-dimensional neighbor structure expression n_i, wherein FC denotes the full link layer mapping;
for each training video, at the first moment, inputting the first frame-level feature of each training video into the coding network, and embedding the neighbor structure expression n_i into the b-dimensional memory state as follows:
wherein d is a fixed value; W_q, W_k and W_v are parameter values of the coding network; the concatenation operator denotes merging along the column direction; the input term denotes the frame-level feature fed at the first moment of the training video; and m_{i,1} denotes the memory state at the first moment;
when new frame-level features are input into the coding network, the memory state is updated as follows:
wherein 1 < t ≤ M; the input term denotes the frame-level feature fed at the t-th moment; m_{i,t} denotes the memory state at the t-th moment; and m_{i,t-1} denotes the memory state at the (t-1)-th moment;
the coding network is an LSTM network, and each operation unit in the coding network is as follows:
where MLP denotes a multi-layer perceptron mapping; BN denotes batch normalization; W_{iv}, W_{ih}, W_{fv}, W_{fh}, W_{ov} and W_{oh} are parameter values of the coding network; and the product operator denotes the inner product;
outputting the hidden-layer output h_{i,M} obtained at the last moment as the depth expression of the corresponding training video; in the formula, the input sequence denotes the frame-level features of the corresponding training video, and θ denotes the parameters of the coding network.
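For illustration only, the step-by-step encoding above can be sketched with a plain LSTM in which the neighbor structure expression n_i seeds the memory state. The patent's actual operation unit additionally uses the attention parameters W_q, W_k, W_v, MLP mappings and batch normalization; those are omitted here, so this is a simplified stand-in, not the claimed unit:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class NeighborLSTMEncoder:
    """Simplified stand-in for the coding network: the b-dimensional neighbor
    structure expression n_i initializes the memory state, then the M
    frame-level features are folded in one time step at a time."""

    def __init__(self, feat_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        z = feat_dim + hidden_dim          # [frame feature ; previous hidden]
        self.Wi = 0.1 * rng.standard_normal((hidden_dim, z))  # input gate
        self.Wf = 0.1 * rng.standard_normal((hidden_dim, z))  # forget gate
        self.Wo = 0.1 * rng.standard_normal((hidden_dim, z))  # output gate
        self.Wc = 0.1 * rng.standard_normal((hidden_dim, z))  # candidate memory
        self.hidden_dim = hidden_dim

    def encode(self, frames, n_i):
        m = n_i.astype(float).copy()       # memory state seeded with n_i
        h = np.zeros(self.hidden_dim)      # hidden state before the first frame
        for v in frames:                   # v: frame-level feature at step t
            x = np.concatenate([v, h])
            i = sigmoid(self.Wi @ x)
            f = sigmoid(self.Wf @ x)
            o = sigmoid(self.Wo @ x)
            m = f * m + i * np.tanh(self.Wc @ x)
            h = o * np.tanh(m)
        return h                           # h_{i,M}: the depth expression
```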
And the conversion module 104 is configured to convert the depth expression corresponding to each training video into a list of binary codes according to the full link layer using the activation function.
As a possible implementation, according to the full link layer using an activation function, the depth expression corresponding to each training video is converted into a column of binary codes: b_i = sgn(t_i), where t_i = FC(h_{i,M}; K); FC denotes the full link layer mapping; sgn denotes the sign function: sgn(t_i) is 1 when t_i is greater than 0, and -1 when t_i is less than or equal to 0; K denotes the length of the column of binary codes.
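A minimal sketch of the binarization step b_i = sgn(t_i). The random projection standing in for the full link layer FC and the tanh activation are assumptions for illustration:

```python
import numpy as np

def binarize(t):
    """b = sgn(t): +1 where t > 0, -1 where t <= 0 (the text maps sgn(0)
    to -1, unlike np.sign, which returns 0 at 0)."""
    return np.where(t > 0, 1, -1)

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))   # hypothetical stand-in for the FC layer
h_iM = rng.standard_normal(16)     # depth expression h_{i,M}
t_i = np.tanh(W @ h_iM)            # relaxed code t_i (tanh as the activation)
b_i = binarize(t_i)                # a K = 8 bit binary code
```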
And the reconstructing module 105 is configured to reconstruct M reconstructed frame-level features corresponding to each training video according to the binary code by using a decoding network.
And the generating module 106 is configured to generate a reconstruction error function according to the frame level feature and the reconstruction frame level feature corresponding to each training video, and generate a neighbor similarity error function according to the time domain appearance feature and the binary code.
As a possible implementation, the video training set includes N training videos, and the reconstruction error function accumulates, over all videos and frames, the error between the m-th frame-level feature in the i-th training video and the m-th reconstructed frame-level feature in the i-th training video.
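A sketch of a sum-of-squared-errors reconstruction loss consistent with the description above; the patent's exact formula and normalization are not reproduced in this text, so none is applied here:

```python
import numpy as np

def reconstruction_error(frames, recon):
    """Squared L2 distance between each frame-level feature and its
    reconstruction, summed over all N videos and M frames.

    frames, recon: arrays of shape (N, M, D).
    """
    return float(((frames - recon) ** 2).sum())
```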
As a possible implementation, in the neighbor similarity error function, s_ij denotes the similarity between the time domain appearance features of the i-th training video and the time domain appearance features of the j-th training video, and t_i is the relaxation of the binary code b_i.
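Since the exact formula is not reproduced in this text, the sketch below assumes a common form in which the appearance similarity s_ij is matched against the scaled inner product of the relaxed codes t_i and t_j; this pairing rule is an assumption, not the patent's stated equation:

```python
import numpy as np

def neighbor_similarity_error(S, T):
    """Sum over video pairs (i, j) of (s_ij - t_i . t_j / K)^2.

    S: (N, N) similarity matrix of the time domain appearance features.
    T: (N, K) relaxed codes t_i (real-valued relaxations of b_i).
    """
    K = T.shape[1]
    code_sim = (T @ T.T) / K   # scaled inner products, in [-1, 1] for +/-1 codes
    return float(((S - code_sim) ** 2).sum())
```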
A training module 107 for training the network to minimize a reconstruction error function and to minimize a neighbor similarity error function; the network comprises an encoding network, a full link layer and a decoding network.
It should be noted that the foregoing explanation on the embodiment of the hash learning method based on neighbor structure keeping is also applicable to the hash learning apparatus based on neighbor structure keeping in this embodiment, and details are not repeated here.
With the hash learning apparatus based on neighbor structure preservation of the embodiment of the invention, a video training set is acquired, and M frame-level features are extracted for each training video in the video training set. An automatic encoder extracts the time domain appearance feature of each training video, and the time domain appearance features are clustered to obtain an anchor point feature set. For each training video, the time domain appearance neighbor features corresponding to that training video are acquired from the anchor point feature set, and a coding network codes each training video into a corresponding depth expression according to the time domain appearance neighbor features. A full link layer using an activation function then converts the depth expression corresponding to each training video into a column of binary codes, and a decoding network reconstructs the M reconstructed frame-level features corresponding to each training video from the binary codes. A reconstruction error function is generated according to the frame-level features and the reconstructed frame-level features corresponding to each training video, a neighbor similarity error function is generated according to the time domain appearance features and the binary codes, and finally the network is trained to minimize the reconstruction error function and the neighbor similarity error function; the network comprises the encoding network, the full link layer and the decoding network. In the invention, the neighbors of a video are embedded into the coding network, so that while the frame-level features of the video are encoded, more attention is paid to the content of the video that is similar to its neighbors, which further improves the retrieval precision on large-scale unsupervised video databases.
In addition, minimizing the reconstruction error and the neighbor similarity error ensures that the neighbor structure is preserved in the Hamming space, further improving the retrieval precision on the video database.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.
Claims (10)
1. A hash learning method based on neighbor structure preservation, characterized by comprising the following steps:
S1, acquiring a video training set, and extracting M frame-level features of each training video in the video training set;
S2, extracting the time domain appearance features of each training video by adopting an automatic encoder, and clustering the time domain appearance features to obtain an anchor point feature set;
S3, acquiring, for each training video, the time domain appearance neighbor features corresponding to each training video from the anchor point feature set;
S4, coding each training video into a corresponding depth expression according to the time domain appearance neighbor features by adopting a coding network;
S5, converting the depth expression corresponding to each training video into a column of binary codes according to the full link layer using the activation function;
S6, reconstructing M reconstructed frame-level features corresponding to each training video according to the binary codes by adopting a decoding network;
S7, generating a reconstruction error function according to the frame-level features and the reconstructed frame-level features corresponding to each training video, and generating a neighbor similarity error function according to the time domain appearance features and the binary codes;
S8, training a network to minimize the reconstruction error function and minimize the neighbor similarity error function; wherein the network comprises the encoding network, the full link layer, and the decoding network.
2. The method of claim 1, wherein each training video has a time domain appearance neighbor features, where i is 1, 2, 3, …, N, and N is the number of training videos in the video training set; step S4 specifically includes:
S41, combining, along the column direction, the a time domain appearance neighbor features corresponding to each training video to obtain a first vector;
S42, mapping the first vector to a b-dimensional neighbor structure expression n_i, wherein FC denotes the full link layer mapping;
S43, at the first moment, inputting the first frame-level feature of each training video into the coding network, and embedding the neighbor structure expression n_i into the b-dimensional memory state as follows:
wherein d is a fixed value; W_q, W_k and W_v are parameter values of the coding network; the concatenation operator denotes merging along the column direction; the input term denotes the frame-level feature fed at the first moment of the training video; and m_{i,1} denotes the memory state at the first moment;
S44, when new frame-level features are input into the coding network, the memory state is updated as follows:
wherein 1 < t ≤ M; the input term denotes the frame-level feature fed at the t-th moment; m_{i,t} denotes the memory state at the t-th moment; and m_{i,t-1} denotes the memory state at the (t-1)-th moment;
the coding network is an LSTM network, and each operation unit in the coding network is as follows:
where MLP denotes a multi-layer perceptron mapping; BN denotes batch normalization; W_{iv}, W_{ih}, W_{fv}, W_{fh}, W_{ov} and W_{oh} are parameter values of the coding network; the product operator denotes the inner product; the σ function is computed as σ(x) = 1/(1 + e^(-x)); h_{i,t-1} denotes the output of the hidden layer at the (t-1)-th moment; and h_{i,t} denotes the output of the hidden layer at the t-th moment;
S45, outputting the hidden-layer output h_{i,M} obtained at the last moment as the depth expression of the corresponding training video;
3. The method of claim 2, wherein the deep representation of the corresponding training video is transformed according to a full link layer using an activation function, and the obtained list of binary codes is:
bi=sgn(ti);
wherein t_i = FC(h_{i,M}; K); FC denotes the full link layer mapping; sgn denotes the sign function: sgn(t_i) is 1 when t_i is greater than 0, and -1 when t_i is less than or equal to 0; and K denotes the length of the column of binary codes.
4. The method of claim 1, wherein the video training set includes N training videos,
the reconstruction error function is:
5. The method of claim 4, wherein the neighbor similarity error function is:
wherein s_ij denotes the similarity between the time domain appearance features of the i-th training video and the time domain appearance features of the j-th training video; t_i and t_j are respectively the relaxations of the binary codes b_i and b_j; K denotes the length of the binary code; and j is a positive integer not greater than N.
6. A hash learning apparatus based on neighbor structure preservation, the apparatus comprising:
the acquisition module, used for acquiring a video training set and extracting M frame-level features of each training video in the video training set;
the extraction module, used for extracting the time domain appearance features of each training video by adopting an automatic encoder, and clustering the time domain appearance features to obtain an anchor point feature set;
the acquisition module being further configured to acquire, for each training video, the time domain appearance neighbor features corresponding to each training video from the anchor point feature set;
the coding module, used for coding each training video into a corresponding depth expression according to the time domain appearance neighbor features by adopting a coding network;
the conversion module, used for converting the depth expression corresponding to each training video into a column of binary codes according to the full link layer using the activation function;
the reconstruction module, used for reconstructing M reconstructed frame-level features corresponding to each training video according to the binary codes by adopting a decoding network;
the generating module, used for generating a reconstruction error function according to the frame-level features and the reconstructed frame-level features corresponding to each training video, and generating a neighbor similarity error function according to the time domain appearance features and the binary codes;
the training module, used for training a network to minimize the reconstruction error function and minimize the neighbor similarity error function; wherein the network comprises the encoding network, the full link layer, and the decoding network.
7. The apparatus of claim 6, wherein each training video has a time domain appearance neighbor features, where i is 1, 2, 3, …, N, and N is the number of training videos in the video training set; the encoding module is specifically configured to:
combining, along the column direction, the a time domain appearance neighbor features corresponding to each training video to obtain a first vector;
mapping the first vector to a b-dimensional neighbor structure expression n_i, wherein FC denotes the full link layer mapping;
for each training video, at the first moment, inputting the first frame-level feature of each training video into the coding network, and embedding the neighbor structure expression n_i into the b-dimensional memory state as follows:
wherein d is a fixed value; W_q, W_k and W_v are parameter values of the coding network; the concatenation operator denotes merging along the column direction; the input term denotes the frame-level feature fed at the first moment of the training video; and m_{i,1} denotes the memory state at the first moment;
when new frame-level features are input into the coding network, the memory state is updated as follows:
wherein 1 < t ≤ M; the input term denotes the frame-level feature fed at the t-th moment; m_{i,t} denotes the memory state at the t-th moment; and m_{i,t-1} denotes the memory state at the (t-1)-th moment;
the coding network is an LSTM network, and each operation unit in the coding network is as follows:
where MLP denotes a multi-layer perceptron mapping; BN denotes batch normalization; W_{iv}, W_{ih}, W_{fv}, W_{fh}, W_{ov} and W_{oh} are parameter values of the coding network; the product operator denotes the inner product; the σ function is computed as σ(x) = 1/(1 + e^(-x)); h_{i,t-1} denotes the output of the hidden layer at the (t-1)-th moment; and h_{i,t} denotes the output of the hidden layer at the t-th moment;
outputting the hidden-layer output h_{i,M} obtained at the last moment as the depth expression of the corresponding training video; in the formula, the input sequence denotes the frame-level features of the corresponding training video, and θ denotes the parameters of the coding network.
8. The apparatus of claim 7, wherein the deep representation of the corresponding training video is transformed according to a full link layer using an activation function, and a list of binary codes is obtained by:
bi=sgn(ti);
wherein t_i = FC(h_{i,M}; K); FC denotes the full link layer mapping; sgn denotes the sign function: sgn(t_i) is 1 when t_i is greater than 0, and -1 when t_i is less than or equal to 0; and K denotes the length of the column of binary codes.
9. The apparatus of claim 6, wherein the video training set comprises N training videos,
the reconstruction error function is:
10. The apparatus of claim 8, wherein the neighbor similarity error function is:
wherein s_ij denotes the similarity between the time domain appearance features of the i-th training video and the time domain appearance features of the j-th training video; t_i and t_j are respectively the relaxations of the binary codes b_i and b_j; K denotes the length of the binary code; and j is a positive integer not greater than N.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910264740.9A CN110069666B (en) | 2019-04-03 | 2019-04-03 | Hash learning method and device based on neighbor structure keeping |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110069666A CN110069666A (en) | 2019-07-30 |
CN110069666B true CN110069666B (en) | 2021-04-06 |
Family
ID=67366914
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910264740.9A Active CN110069666B (en) | 2019-04-03 | 2019-04-03 | Hash learning method and device based on neighbor structure keeping |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110069666B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112199520B (en) * | 2020-09-19 | 2022-07-22 | 复旦大学 | Cross-modal Hash retrieval algorithm based on fine-grained similarity matrix |
CN113111836B (en) * | 2021-04-25 | 2022-08-19 | 山东省人工智能研究院 | Video analysis method based on cross-modal Hash learning |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012077818A1 (en) * | 2010-12-10 | 2012-06-14 | 国立大学法人豊橋技術科学大学 | Method for determining conversion matrix for hash function, hash-type approximation nearest neighbour search method using said hash function, and device and computer program therefor |
CN103744973A (en) * | 2014-01-11 | 2014-04-23 | 西安电子科技大学 | Video copy detection method based on multi-feature Hash |
CN107229757A (en) * | 2017-06-30 | 2017-10-03 | 中国科学院计算技术研究所 | The video retrieval method encoded based on deep learning and Hash |
CN108304808A (en) * | 2018-02-06 | 2018-07-20 | 广东顺德西安交通大学研究院 | A kind of monitor video method for checking object based on space time information Yu depth network |
CN108763481A (en) * | 2018-05-29 | 2018-11-06 | 清华大学深圳研究生院 | A kind of picture geographic positioning and system based on extensive streetscape data |
CN109151501A (en) * | 2018-10-09 | 2019-01-04 | 北京周同科技有限公司 | A kind of video key frame extracting method, device, terminal device and storage medium |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106777130B (en) * | 2016-12-16 | 2020-05-12 | 西安电子科技大学 | Index generation method, data retrieval method and device |
CN109409208A (en) * | 2018-09-10 | 2019-03-01 | 东南大学 | A kind of vehicle characteristics extraction and matching process based on video |
CN109299097B (en) * | 2018-09-27 | 2022-06-21 | 宁波大学 | Online high-dimensional data nearest neighbor query method based on Hash learning |
Non-Patent Citations (1)
Title |
---|
"二值表示学习及其应用";鲁继文;《模式识别与人工智能》;20180131;第31卷(第1期);第12-21页 * |
Also Published As
Publication number | Publication date |
---|---|
CN110069666A (en) | 2019-07-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
AU2019360080B2 (en) | Image captioning with weakly-supervised attention penalty | |
US20200104640A1 (en) | Committed information rate variational autoencoders | |
EP3298576A1 (en) | Generative methods of super resolution | |
CN112509555B (en) | Dialect voice recognition method, device, medium and electronic equipment | |
CN112418292B (en) | Image quality evaluation method, device, computer equipment and storage medium | |
Cascianelli et al. | Full-GRU natural language video description for service robotics applications | |
CN111932546A (en) | Image segmentation model training method, image segmentation method, device, equipment and medium | |
CN110990596B (en) | Multi-mode hash retrieval method and system based on self-adaptive quantization | |
CN110069666B (en) | Hash learning method and device based on neighbor structure keeping | |
CN115687571B (en) | Depth unsupervised cross-modal retrieval method based on modal fusion reconstruction hash | |
US20220309292A1 (en) | Growing labels from semi-supervised learning | |
CN110188827A (en) | A kind of scene recognition method based on convolutional neural networks and recurrence autocoder model | |
CN115083435B (en) | Audio data processing method and device, computer equipment and storage medium | |
CN114596456B (en) | Image set classification method based on aggregated hash learning | |
CN116543351A (en) | Self-supervision group behavior identification method based on space-time serial-parallel relation coding | |
US20230252993A1 (en) | Visual speech recognition for digital videos utilizing generative adversarial learning | |
CN115775350A (en) | Image enhancement method and device and computing equipment | |
CN115880556B (en) | Multi-mode data fusion processing method, device, equipment and storage medium | |
Ma et al. | Partial hash update via hamming subspace learning | |
CN116168394A (en) | Image text recognition method and device | |
CN115965833A (en) | Point cloud sequence recognition model training and recognition method, device, equipment and medium | |
CN113704466B (en) | Text multi-label classification method and device based on iterative network and electronic equipment | |
US20220058842A1 (en) | Generating handwriting via decoupled style descriptors | |
CN114913358B (en) | Medical hyperspectral foreign matter detection method based on automatic encoder | |
CN115129713A (en) | Data retrieval method, data retrieval device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||