CN110069666A - Hash learning method and device based on neighbor structure preservation - Google Patents

Hash learning method and device based on neighbor structure preservation

Info

Publication number: CN110069666A
Application number: CN201910264740.9A
Authority: CN (China)
Original language: Chinese (zh)
Other versions: CN110069666B (granted)
Prior art keywords: video, training video, training, neighbor, network
Inventors: 鲁继文 (Jiwen Lu), 周杰 (Jie Zhou), 李舒燕 (Shuyan Li)
Applicant and current assignee: Tsinghua University
Application filed by Tsinghua University; priority to CN201910264740.9A; publication of CN110069666A; application granted and published as CN110069666B
Legal status: Active (granted)


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 — Information retrieval of video data
    • G06F 16/75 — Clustering; Classification
    • G06F 16/78 — Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 — Retrieval using metadata automatically derived from the content


Abstract

The present invention proposes a hash learning method and device based on neighbor structure preservation. The method includes: obtaining a video training set and extracting M frame-level features of each training video; extracting the temporal appearance feature of each training video, clustering the temporal appearance features, and obtaining an anchor feature set; obtaining the temporal appearance neighbor features of each training video from the anchor feature set; encoding each training video into a corresponding deep representation according to its temporal appearance neighbor features, using an encoding network; converting the deep representation of each training video into a binary code; reconstructing the M frame-level features of each training video from the binary code; generating a reconstruction error function and a neighbor similarity error function; and training the network so that the reconstruction error function and the neighbor similarity error function are minimized. The method preserves the neighbor structure intact in Hamming space and improves retrieval precision on large-scale unsupervised video databases.

Description

Hash learning method and device based on neighbor structure preservation
Technical field
The present invention relates to the technical field of video processing, and in particular to a hash learning method and device based on neighbor structure preservation.
Background
Large-scale video retrieval aims to retrieve, from a huge database, videos similar to a query video. Normally, a video is represented by a series of sampled video frames, and each frame is represented by a feature. In video retrieval, relevant videos can be determined from the feature set corresponding to each video.
Faced with high-dimensional features and massive data, hashing methods have achieved great success in large-scale visual retrieval tasks. Video hashing encodes a video into a compact binary code while preserving the similarity structure of the video data in Hamming space. Learning-based video hashing exploits the characteristics of the data and achieves better performance than hand-designed hashing methods. Because it avoids the cost of manual labeling, unsupervised hashing is more feasible than supervised hashing for large-scale video retrieval tasks.
Currently, most unsupervised hashing methods focus on exploiting the appearance and temporal information of videos but ignore the neighbor structure. As a result, the encoding network absorbs the content of the input video indiscriminately, without distinguishing whether that content is similar to its neighbors' content. This is unfavorable for preserving neighbor similarity, so retrieval precision cannot be guaranteed when performing video retrieval on large-scale unsupervised video databases.
Summary of the invention
The present invention proposes a hash learning method and device based on neighbor structure preservation, so as to preserve the neighbor structure intact in Hamming space and improve retrieval precision on large-scale unsupervised video databases. This solves the technical problem in the prior art that unsupervised hashing focuses on the appearance and temporal information of videos but ignores the neighbor structure, so the precision of video retrieval cannot be guaranteed.
An embodiment of one aspect of the present invention proposes a hash learning method based on neighbor structure preservation, comprising:
S1, obtaining a video training set and, for each training video in the video training set, extracting M frame-level features of the training video;
S2, extracting the temporal appearance feature of each training video using an autoencoder, clustering the temporal appearance features, and obtaining an anchor feature set;
S3, for each training video, obtaining the corresponding temporal appearance neighbor features from the anchor feature set;
S4, encoding each training video into a corresponding deep representation according to its temporal appearance neighbor features, using an encoding network;
S5, converting the deep representation of each training video into a binary code through a fully connected layer with an activation function;
S6, reconstructing the M frame-level features of each training video from its binary code, using a decoding network;
S7, generating a reconstruction error function from the frame-level features and the reconstructed frame-level features of each training video, and generating a neighbor similarity error function from the temporal appearance features and the binary codes;
S8, training the network so that the reconstruction error function and the neighbor similarity error function are minimized, wherein the network comprises the encoding network, the fully connected layer, and the decoding network.
In the hash learning method based on neighbor structure preservation of the embodiment of the present invention, a video training set is obtained and M frame-level features are extracted from each training video; the temporal appearance feature of each training video is extracted with an autoencoder and the temporal appearance features are clustered to obtain an anchor feature set; for each training video, the corresponding temporal appearance neighbor features are obtained from the anchor feature set, and the training video is encoded into a deep representation by the encoding network according to those neighbor features; the deep representation is converted into a binary code by a fully connected layer with an activation function; the M reconstructed frame-level features are decoded from the binary code by the decoding network; a reconstruction error function is generated from the frame-level features and the reconstructed frame-level features, and a neighbor similarity error function is generated from the temporal appearance features and the binary codes; finally, the network, comprising the encoding network, the fully connected layer, and the decoding network, is trained so that both error functions are minimized. In the present invention, the neighbors of a video are embedded into the encoding network, so that when the frame-level features of the video are encoded, the content similar to its neighbors receives more attention, which improves retrieval precision on large-scale unsupervised video databases. Moreover, minimizing the reconstruction error and the neighbor similarity error ensures that the neighbor structure is preserved intact in Hamming space, further improving retrieval precision on the video database.
An embodiment of another aspect of the present invention proposes a hash learning device based on neighbor structure preservation, comprising:
an obtaining module, configured to obtain a video training set and, for each training video in the video training set, extract M frame-level features of the training video;
an extraction module, configured to extract the temporal appearance feature of each training video using an autoencoder, cluster the temporal appearance features, and obtain an anchor feature set;
the obtaining module being further configured to obtain, for each training video, the corresponding temporal appearance neighbor features from the anchor feature set;
an encoding module, configured to encode each training video into a corresponding deep representation according to its temporal appearance neighbor features, using an encoding network;
a conversion module, configured to convert the deep representation of each training video into a binary code through a fully connected layer with an activation function;
a reconstruction module, configured to reconstruct the M frame-level features of each training video from its binary code, using a decoding network;
a generation module, configured to generate a reconstruction error function from the frame-level features and the reconstructed frame-level features of each training video, and to generate a neighbor similarity error function from the temporal appearance features and the binary codes;
a training module, configured to train the network so that the reconstruction error function and the neighbor similarity error function are minimized, wherein the network comprises the encoding network, the fully connected layer, and the decoding network.
The hash learning device based on neighbor structure preservation of the embodiment of the present invention performs the method described above. The neighbors of a video are embedded into the encoding network, so that when the frame-level features of the video are encoded, the content similar to its neighbors receives more attention, which improves retrieval precision on large-scale unsupervised video databases. Moreover, minimizing the reconstruction error and the neighbor similarity error ensures that the neighbor structure is preserved intact in Hamming space, further improving retrieval precision on the video database.
Additional aspects and advantages of the present invention will be set forth in part in the following description, will become apparent in part from the following description, or will be learned by practice of the present invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments with reference to the accompanying drawings, in which:
Fig. 1 is a schematic flowchart of the hash learning method based on neighbor structure preservation provided by Embodiment 1 of the present invention;
Fig. 2 is a first schematic diagram of the hash learning process in an embodiment of the present invention;
Fig. 3 is a second schematic diagram of the hash learning process in an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of the hash learning device based on neighbor structure preservation provided by Embodiment 2 of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present invention and shall not be construed as limiting the present invention.
At present, some video hashing methods have integrated the hash function into a deep neural network. Specifically, features of the video frames are extracted by a deep convolutional network, processed by temporal pooling or a deep recurrent network, and further encoded into binary codes. Because it avoids the cost of manual labeling, unsupervised hashing is more feasible than supervised hashing for large-scale video retrieval tasks.
However, most unsupervised hashing methods focus on exploiting the appearance and temporal information of videos but ignore the neighbor structure. Although some hashing methods design a neighbor similarity cost function to train the network, the neighbor structure is only used to guide the generation of the binary codes and is not exploited when encoding the video features. In this mode, the encoding network absorbs the content of the input video indiscriminately, without distinguishing whether that content is similar to its neighbors' content. This is unfavorable for preserving neighbor similarity, so retrieval precision cannot be guaranteed when performing video retrieval on large-scale unsupervised video databases.
Therefore, the present invention proposes a hash learning method based on neighbor structure preservation, mainly to solve the technical problem in the prior art that unsupervised hashing focuses on the appearance and temporal information of videos but ignores the neighbor structure, so the precision of video retrieval cannot be guaranteed.
In the hash learning method based on neighbor structure preservation of the embodiment of the present invention, the neighbors of a video are embedded into the encoding network, so that when the frame-level features of the video are encoded, the content similar to its neighbors receives more attention, which improves retrieval precision on large-scale unsupervised video databases. Moreover, when the network is trained, minimizing the reconstruction error and the neighbor similarity error ensures that the neighbor structure is preserved intact in Hamming space, further improving retrieval precision on the video database.
The hash learning method and device based on neighbor structure preservation of the embodiments of the present invention are described below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of the hash learning method based on neighbor structure preservation provided by Embodiment 1 of the present invention.
In the embodiment of the present invention, the hash learning method based on neighbor structure preservation is described as configured in a hash learning device based on neighbor structure preservation, which can be applied to any computer device, so that the computer device can perform the hash learning function based on neighbor structure preservation.
The computer device may be a personal computer (PC), a cloud device, a mobile device, or the like. The mobile device may be, for example, a hardware device with an operating system, a touch screen, and/or a display screen, such as a mobile phone, a tablet computer, a personal digital assistant, a wearable device, or an in-vehicle device.
As shown in Fig. 1, the hash learning method based on neighbor structure preservation may comprise the following steps.
S1, obtaining a video training set and, for each training video in the video training set, extracting M frame-level features of the training video.
In the embodiment of the present invention, the video training set contains N training videos, which may be videos stored locally on the computer device or videos downloaded online by the computer device; this is not limited here. The sizes of N and M are preset.
In the embodiment of the present invention, the video training set is denoted as X = {x_i, i = 1, …, N}. For each training video in the video training set, M frames are uniformly sampled, and M frame-level features of dimension l are extracted by a deep convolutional network, so that each training video is converted into a frame-level feature set x_i = {v_{i,m}, m = 1, …, M}.
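As an illustrative sketch of the sampling in S1 (the function name, the random stand-in frames, and the evenly spaced index scheme are assumptions, not the patent's implementation; the patent extracts the actual features with a deep convolutional network), the M frame indices might be chosen as follows:

```python
import numpy as np

def uniform_sample_indices(num_frames: int, M: int) -> np.ndarray:
    """Pick M frame indices spread uniformly over a video of num_frames frames."""
    # Evenly spaced positions over [0, num_frames - 1], rounded to valid indices.
    return np.linspace(0, num_frames - 1, M).round().astype(int)

# Example: a "video" of 100 frames, M = 8 sampled frames.
idx = uniform_sample_indices(100, 8)
frames = np.random.rand(100, 3, 32, 32)   # stand-in for decoded frames
sampled = frames[idx]                     # the M frames a CNN would featurize
print(idx)
```

The sampled frames would then be passed through the convolutional network to obtain the M frame-level features v_{i,1}, …, v_{i,M}.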
S2, extracting the temporal appearance feature of each training video using an autoencoder, clustering the temporal appearance features, and obtaining an anchor feature set.
In the embodiment of the present invention, a d-dimensional temporal appearance feature of each training video is obtained by an autoencoder. For each training video, its a temporal appearance neighbor features could in principle be determined by computing and sorting the distances between the training video and the other videos in the database, for example using the two-norm. However, since this computation is also needed at the test stage, nearest-neighbor retrieval over the entire video training set would consume a great deal of time and is impractical. Therefore, in the present invention, K-means clustering is performed on the training videos in the video training set, for example on their temporal appearance features, to obtain n cluster centers. For each cluster center, the temporal appearance feature nearest to the cluster center (i.e., with the smallest distance) is determined, yielding n temporal appearance features. These n temporal appearance features are taken as anchors and collected into the anchor feature set, denoted Z = {z_j, j = 1, …, n}.
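The anchor construction in S2 — K-means clustering of the temporal appearance features, then picking for each cluster center the nearest actual feature — can be sketched as below. All sizes, the plain NumPy K-means loop, and the random stand-in features are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 16))   # stand-in temporal appearance features (N=200, d=16)
n_anchors = 10

# Plain K-means (a stand-in for any K-means implementation).
centers = feats[rng.choice(len(feats), n_anchors, replace=False)]
for _ in range(20):
    d2 = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (N, n) squared distances
    assign = d2.argmin(1)
    for j in range(n_anchors):
        members = feats[assign == j]
        if len(members):
            centers[j] = members.mean(0)

# Anchor = the actual feature closest to each center, as described above.
d2 = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
anchor_idx = d2.argmin(0)              # one training feature per cluster center
anchors = feats[anchor_idx]            # the anchor feature set Z
print(anchors.shape)
```

Taking the nearest real feature (rather than the center itself) guarantees that every anchor is an actual temporal appearance feature from the training set.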
S3, for each training video, obtaining the corresponding temporal appearance neighbor features from the anchor feature set.
In the embodiment of the present invention, for each training video, its a corresponding temporal appearance neighbor features, denoted z_{i,1}, z_{i,2}, …, z_{i,a}, are obtained from the anchor feature set. Since a << n << N, obtaining the a temporal appearance neighbor features consumes only negligible time, which greatly improves the efficiency of video retrieval.
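Looking up the a nearest anchors for one video can be sketched as follows. The two-norm distance comes from the description above, while the names and sizes are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
anchors = rng.normal(size=(50, 16))    # anchor feature set Z, n = 50
x = rng.normal(size=(16,))             # one video's temporal appearance feature
a = 5                                  # number of neighbors, a << n

dists = np.linalg.norm(anchors - x, axis=1)   # two-norm distance to every anchor
nbr_idx = np.argsort(dists)[:a]               # positions <i> in the anchor set
nbr_feats = anchors[nbr_idx]                  # z_{i,1}, ..., z_{i,a}
print(nbr_idx)
```

Because only n anchor distances are computed instead of N video distances, the same lookup is cheap at test time.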
S4, encoding each training video into a corresponding deep representation according to its temporal appearance neighbor features, using an encoding network.
In the embodiment of the present invention, the encoding network has learned the correspondence between the temporal appearance neighbor features of each video and its deep representation. After the temporal appearance neighbor features of each training video are determined, they are input into the encoding network to obtain the deep representation of the training video.
As a possible implementation, the neighbor attention learning mechanism needs a neighbor structure representation n_i. Specifically, for each training video, its a temporal appearance neighbor features are concatenated into a first vector, and the first vector is mapped to the b-dimensional neighbor structure representation n_i:
n_i = FC([z_{i,1}; z_{i,2}; …; z_{i,a}]); (1)
where FC denotes a fully connected layer mapping.
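Formula (1) — concatenating the a neighbor features and mapping them through a fully connected layer — can be sketched as below; the random weights stand in for learned FC parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
a, d, b = 5, 16, 8
nbr_feats = rng.normal(size=(a, d))   # z_{i,1}, ..., z_{i,a}

# Fully connected layer; random weights stand in for learned parameters.
W = rng.normal(size=(b, a * d)) * 0.1
bias = np.zeros(b)

first_vec = nbr_feats.reshape(-1)     # concatenation [z_{i,1}; ...; z_{i,a}]
n_i = W @ first_vec + bias            # formula (1): b-dimensional neighbor representation
print(n_i.shape)
```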
For each training video, at the first time step, the first frame-level feature of the training video is input into the encoding network, and the neighbor structure representation n_i is embedded into the b-dimensional memory state according to formula (2), where d is a fixed value, W_q, W_k, and W_v are parameters of the encoding network, [·;·] denotes concatenation, v_{i,1} denotes the frame-level feature input at the first time step, and m_{i,1} denotes the memory state at the first time step.
In the manner of formula (2), the information of the neighbor structure is present in the memory state at every time step. For 1 < t ≤ M, whenever a new frame-level feature is input into the encoding network, the memory state is updated according to formula (3), where v_{i,t} denotes the frame-level feature input at the t-th time step, m_{i,t} denotes the memory state at the t-th time step, and m_{i,t-1} denotes the memory state at the (t-1)-th time step.
In the manner of formulas (2) and (3), at every time step the memory state uses the neighbor structure information it contains to select the useful information in the input feature and write it into the new memory state. By embedding the above neighbor attention learning mechanism into the encoding network, the arithmetic units of formula (4) are obtained, where MLP denotes a multi-layer mapping, BN denotes batch normalization, W_iv, W_ih, W_fv, W_fh, W_ov, and W_oh are parameters of the encoding network, ⊙ denotes the element-wise product, the sigmoid function is σ(x) = 1/(1 + e^{-x}), and the tanh function is tanh(x) = (e^x - e^{-x})/(e^x + e^{-x}). The hidden-layer output at the last time step, h_{i,M}, is the deep representation of the training video. Specifically, for each training video, the deep representation is
h_{i,M} = E(x_i; θ); (5)
where x_i denotes the frame-level features of the training video and θ denotes the parameters of the encoding network.
In the embodiment of the present invention, the neighbors of a video are thus embedded into the encoding network, so that when the frame-level features of the video are encoded, the content similar to its neighbors receives more attention, which improves retrieval precision on large-scale unsupervised video databases.
S5, converting the deep representation of each training video into a binary code through a fully connected layer with an activation function.
In the embodiment of the present invention, for each training video, the deep representation of the training video is converted into a binary code through the fully connected layer with an activation function:
b_i = sgn(t_i); (6)
where t_i = FC(h_{i,M}, k); FC denotes a fully connected layer mapping; sgn denotes the sign function, whose value is 1 where t_i is greater than 0 and -1 where t_i is less than or equal to 0; and k denotes the length of the binary code.
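The binarization of formula (6) can be sketched as follows. The tanh nonlinearity is an assumed choice for the activation function of the fully connected layer (the text only states that an activation function is used), and the weights are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(3)
h = rng.normal(size=(32,))          # deep representation h_{i,M}
k = 16                              # binary code length

W = rng.normal(size=(k, 32)) * 0.1  # stand-in for the learned FC layer
t_i = np.tanh(W @ h)                # relaxed code t_i = FC(h_{i,M}, k)
b_i = np.where(t_i > 0, 1, -1)      # formula (6): b_i = sgn(t_i)
print(b_i)
```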
S6, reconstructing the M frame-level features of each training video from its binary code, using a decoding network.
In the embodiment of the present invention, a long short-term memory network (LSTM) may be used as the decoding network. Specifically, the binary code of each training video is first mapped to an l-dimensional vector, which is input into the decoding network at the first time step to obtain the first reconstructed video frame-level feature, denoted v̂_{i,1} in the embodiment of the present invention. At the second time step, v̂_{i,1} is input into the decoding network to obtain the second reconstructed frame-level feature v̂_{i,2}; v̂_{i,2} is then input into the decoding network to obtain the third reconstructed frame-level feature v̂_{i,3}; and so on, until the decoding network outputs the M-th reconstructed frame-level feature v̂_{i,M}, at which point decoding is complete.
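The autoregressive decoding loop can be sketched as below. A plain tanh recurrence stands in for the LSTM cell, and all weights are random stand-ins for learned parameters; only the feed-the-output-back-in structure is taken from the description above:

```python
import numpy as np

rng = np.random.default_rng(4)
k, l, M = 16, 12, 6
b_i = rng.choice([-1, 1], size=k)

# Stand-ins for learned parameters: code-to-input map and a toy recurrent step.
W_map = rng.normal(size=(l, k)) * 0.1
W_in, W_h, W_out = (rng.normal(size=s) * 0.1 for s in [(l, l), (l, l), (l, l)])

x = W_map @ b_i                  # binary code mapped to an l-dimensional vector
h = np.zeros(l)
recon = []
for _ in range(M):               # each output is fed back in as the next input
    h = np.tanh(W_in @ x + W_h @ h)
    x = W_out @ h                # reconstructed frame-level feature v̂_{i,m}
    recon.append(x)
recon = np.stack(recon)          # (M, l) reconstructed frame-level features
print(recon.shape)
```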
As an example, referring to Fig. 2, Fig. 2 is the first schematic diagram of the hash learning process in the embodiment of the present invention. After the M frame-level features v_1, v_2, …, v_M of a training video are obtained, the corresponding deep representation is output by the encoding network; after the corresponding binary code is obtained through the fully connected layer with an activation function, the corresponding M reconstructed frame-level features v̂_1, v̂_2, …, v̂_M are output by the decoding network.
S7, generating a reconstruction error function from the frame-level features and the reconstructed frame-level features of each training video, and generating a neighbor similarity error function from the temporal appearance features and the binary codes.
In the embodiment of the present invention, two loss functions are designed to train the network: the reconstruction error function L_r and the neighbor similarity error function L_s.
The reconstruction error function L_r represents the difference between the input frame-level features of a training video and the reconstructed frame-level features obtained by decoding. The mean squared error may be used as the reconstruction error function L_r:
L_r = (1/(N·M)) Σ_{i=1}^{N} Σ_{m=1}^{M} ||v_{i,m} - v̂_{i,m}||²; (7)
where v_{i,m} denotes the m-th frame-level feature of the i-th training video and v̂_{i,m} denotes the m-th reconstructed frame-level feature of the i-th training video.
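The reconstruction error of formula (7) can be computed as below on random stand-in features (averaging over all videos and frames is an assumed normalization):

```python
import numpy as np

rng = np.random.default_rng(5)
N, M, l = 4, 6, 12
v = rng.normal(size=(N, M, l))                  # input frame-level features
v_hat = v + 0.1 * rng.normal(size=(N, M, l))    # reconstructed features

# Formula (7): mean squared error over all videos and frames.
L_r = ((v - v_hat) ** 2).sum(-1).mean()
print(float(L_r) >= 0.0)  # -> True
```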
In the embodiment of the present invention, the neighbor similarity error function represents the difference between the similarity structures in the original video space and in Hamming space. The neighbor similarity error function L_s is given by formula (8):
L_s = Σ_{i,j} (s_ij - Θ_ij)²; (8)
where s_ij denotes the similarity between the temporal appearance feature of the i-th training video and that of the j-th training video, and Θ_ij denotes the similarity between the binary code b_i of the i-th training video and the binary code b_j of the j-th training video.
To compute s_ij in formula (8), an approximate similarity matrix A is established as follows. First, according to the feature x_i of each training video and its a corresponding temporal appearance neighbor features z_{i,1}, z_{i,2}, …, z_{i,a}, a truncated similarity matrix Y is defined, each element Y_ij of which is given by formula (9):
Y_ij = exp(-Dist(x_i, z_j)²/t) / Σ_{j'∈⟨i⟩} exp(-Dist(x_i, z_{j'})²/t) if j ∈ ⟨i⟩, and 0 otherwise; (9)
where ⟨i⟩ denotes the positions of the a temporal appearance neighbor features in the anchor feature set, Dist denotes a distance function (the two-norm distance may be used), and t denotes a bandwidth parameter.
The approximate similarity matrix A is computed according to formula (10):
A = Y Λ^{-1} Y^T; (10)
where Λ = diag(Y^T 1), i.e., Λ_jj = Σ_{i=1}^{N} Y_ij.
The matrix A computed according to formula (10) is a sparse non-negative matrix in which the elements of each row sum to 1. When A_ij > 0, s_ij is set to 1, and when A_ij ≤ 0, s_ij is set to 0.
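The construction of the truncated similarity matrix Y and the approximate similarity matrix A in formulas (9) and (10) can be sketched as follows. Sizes and features are illustrative, and the row normalization of Y is an assumption consistent with the statement that each row of A sums to 1:

```python
import numpy as np

rng = np.random.default_rng(6)
N, n, a, t = 20, 8, 3, 1.0
X = rng.normal(size=(N, 4))    # training video features
Z = rng.normal(size=(n, 4))    # anchor feature set

# Formula (9): truncated similarity matrix Y (N x n), row-normalized over
# each video's a nearest anchors <i>.
D2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # squared two-norm distances
Y = np.zeros((N, n))
for i in range(N):
    nbrs = np.argsort(D2[i])[:a]
    w = np.exp(-D2[i, nbrs] / t)
    Y[i, nbrs] = w / w.sum()

# Formula (10): A = Y Λ^{-1} Y^T with Λ_jj = Σ_i Y_ij (zero-safe inverse
# in case some anchor is never selected).
col = Y.sum(axis=0)
Lam_inv = np.diag(np.where(col > 0, 1.0 / np.maximum(col, 1e-12), 0.0))
A = Y @ Lam_inv @ Y.T
s = (A > 0).astype(int)                  # s_ij = 1 where A_ij > 0, else 0
print(np.allclose(A.sum(axis=1), 1.0))   # -> True: each row of A sums to 1
```

Because each row of Y sums to 1, each row of A sums to 1 as well, matching the property stated above.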
The similarity Θ_ij between binary codes b_i and b_j in formula (8) can be defined as Θ_ij = (1/k) b_i^T b_j. To avoid oscillation in the network training process, Θ_ij may be approximated by Θ̂_ij = (1/k) t_i^T t_j, where t_i is the relaxed representation of the binary code b_i.
To reduce the approximation error between Θ_ij and Θ̂_ij, an auxiliary loss term on t_i and b_i may be introduced, and formula (8) is then converted into:
L_s = Σ_{i,j} (s_ij - (1/k) t_i^T t_j)² + η Σ_i ||b_i - t_i||²; (11)
where η is a weight for the auxiliary loss term.
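The relaxed neighbor similarity loss of formula (11) can be sketched as below; the 1/k scaling, the tanh relaxation, and the weight eta on the auxiliary term are assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
N, k, eta = 6, 16, 0.5
t = np.tanh(rng.normal(size=(N, k)))        # relaxed codes t_i
b = np.where(t > 0, 1, -1)                  # binary codes b_i = sgn(t_i)
s = rng.integers(0, 2, size=(N, N)).astype(float)   # stand-in similarities s_ij
s = np.maximum(s, s.T)                      # symmetrized for illustration

Theta_hat = (t @ t.T) / k                   # (1/k) t_i^T t_j, relaxed code similarity
L_s = ((s - Theta_hat) ** 2).sum() + eta * ((b - t) ** 2).sum()
print(float(L_s) >= 0.0)  # -> True
```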
S8, training the network so that the reconstruction error function and the neighbor similarity error function are minimized, wherein the network comprises the encoding network, the fully connected layer, and the decoding network.
In the embodiment of the present invention, the network may be divided into three parts. The first part is an encoding network with the neighbor attention learning mechanism, through which the deep representation of a training video is learned. The second part is a fully connected layer with a nonlinear activation function, which converts the deep representation into a k-dimensional binary code. The third part is a decoding network, which decodes the reconstructed frame-level features of each frame of the training video from the binary code obtained by encoding.
In an embodiment of the present invention, the network can be trained according to the reconstruction error function and the neighbor similarity error function. By minimizing the reconstruction error function, the information contained in the input training videos can be better utilized; by minimizing the neighbor similarity error function, neighbor similarity can be preserved to the greatest extent. When training the network, the training loss function can be a weighted combination of the reconstruction error function and the neighbor similarity error function:
L = αL_s + (1 − α)L_r;  (12)
wherein α denotes a hyperparameter balancing the neighbor similarity error function and the reconstruction error function.
In an embodiment of the present invention, when training the network end to end, backward gradient propagation can be used to optimize the network parameters.
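The weighted training loss of formula (12) can be illustrated as follows. This is a hedged NumPy sketch: the concrete forms chosen for L_s (agreement between inner products of relaxed codes and the similarities s_ij, mapped to ±1) and L_r (squared reconstruction error) are simplified stand-ins for the patent's formulas, and all names are illustrative:

```python
import numpy as np

def neighbor_similarity_loss(T, S):
    # L_s (stand-in): penalize disagreement between normalized inner products
    # of relaxed codes t_i and the binary similarities s_ij (mapped to ±1).
    k = T.shape[1]
    inner = T @ T.T / k            # code similarity in [-1, 1]
    target = 2.0 * S - 1.0         # map s_ij in {0, 1} to {-1, +1}
    return np.mean((inner - target) ** 2)

def reconstruction_loss(X, X_hat):
    # L_r (stand-in): squared error between frame-level features
    # and their reconstructions.
    return np.mean(np.sum((X - X_hat) ** 2, axis=-1))

def total_loss(T, S, X, X_hat, alpha=0.5):
    # Formula (12): L = α·L_s + (1 − α)·L_r
    return alpha * neighbor_similarity_loss(T, S) + (1.0 - alpha) * reconstruction_loss(X, X_hat)
```

Both terms vanish only when the relaxed codes reproduce the neighbor similarities exactly and the reconstruction is perfect, which is the joint objective described above.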
As an example, referring to Fig. 3, Fig. 3 is a second schematic diagram of the hash learning process of an embodiment of the present invention. When training the network with an input training video, the temporal-appearance neighbor features can be embedded into the hash coding network for hash learning; a hash code is generated by the coding network, and the intact preservation of the neighbor structure in Hamming space is guaranteed by minimizing the reconstruction error and the neighbor similarity error.
In the hash learning method based on neighbor-structure preservation of the embodiment of the present invention, a video training set is obtained, and for each training video in the set, M frame-level features are extracted. Then, an autoencoder is used to extract the temporal-appearance feature of each training video, and the temporal-appearance features are clustered to obtain an anchor feature set. Next, for each training video, the corresponding temporal-appearance neighbor features are obtained from the anchor feature set, and the coding network encodes each training video into a corresponding deep representation according to the temporal-appearance neighbor features. The fully connected layer with an activation function then converts each deep representation into a binary code, from which the decoding network reconstructs the M reconstructed frame-level features of the training video. A reconstruction error function is generated from the frame-level features and the reconstructed frame-level features, and a neighbor similarity error function is generated from the temporal-appearance features and the binary codes. Finally, the network, which includes the coding network, the fully connected layer, and the decoding network, is trained so that both the reconstruction error function and the neighbor similarity error function are minimized. In the present invention, the neighbors of a video are embedded into the coding network, so that during the encoding of frame-level features, content similar to the video's neighbors receives more attention, which improves retrieval precision on large-scale unsupervised video databases. Moreover, minimizing the reconstruction error and the neighbor similarity error ensures the intact preservation of the neighbor structure in Hamming space, further improving retrieval precision on video databases.
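The overall encode-binarize-decode flow summarized above can be sketched end to end in miniature. This NumPy sketch replaces the LSTM encoder and decoder with simple linear stand-ins and keeps only the structural flow (frame features → deep representation → relaxed code → binary code → reconstructed features); every name and operation here is an illustrative assumption, not the patent's architecture:

```python
import numpy as np

def toy_pipeline(frames, W_enc, W_fc, W_dec):
    # frames: (M, d) frame-level features of one training video.
    h = np.tanh(frames.mean(axis=0) @ W_enc)   # stand-in encoder: mean-pool + projection
    t = np.tanh(h @ W_fc)                      # relaxed code t_i from the FC layer
    b = np.where(t > 0, 1, -1)                 # binary code b_i = sgn(t_i)
    recon = b @ W_dec                          # stand-in decoder: linear map from the code
    X_hat = np.repeat(recon[None, :], frames.shape[0], axis=0)  # M reconstructed features
    return b, X_hat
```

Training would then compare `frames` against `X_hat` (reconstruction error) and the codes of different videos against their similarities (neighbor similarity error), as described above.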
In order to implement the above embodiments, the present invention further proposes a hash learning device based on neighbor-structure preservation.
Fig. 4 is a schematic structural diagram of the hash learning device based on neighbor-structure preservation provided by the second embodiment of the present invention.
As shown in Fig. 4, the hash learning device based on neighbor-structure preservation includes: an acquisition module 101, an extraction module 102, a coding module 103, a conversion module 104, a reconstruction module 105, a generation module 106, and a training module 107.
The acquisition module 101 is configured to obtain a video training set and, for each training video in the video training set, extract M frame-level features of the training video.
The extraction module 102 is configured to extract the temporal-appearance feature of each training video using an autoencoder, and to cluster the temporal-appearance features to obtain an anchor feature set.
The acquisition module 101 is further configured to obtain, for each training video, the corresponding temporal-appearance neighbor features from the anchor feature set.
The coding module 103 is configured to encode each training video into a corresponding deep representation according to the temporal-appearance neighbor features using the coding network.
As a possible implementation, each training video has a temporal-appearance neighbor features, and the coding module 103 is specifically configured to:
merge the a temporal-appearance neighbor features corresponding to each training video column-wise to obtain a first vector;
map the first vector into a b-dimensional neighbor-structure representation n_i, wherein FC denotes the fully connected layer mapping;
for each training video, input the first frame-level feature of the training video into the coding network at the first time step, and embed the neighbor-structure representation n_i into the b-dimensional memory state as follows:
wherein d is a fixed value; W_q, W_k, and W_v are parameter values of the coding network; the concatenation symbol denotes column-wise merging; the input symbol denotes the frame-level feature input at the first time step of the corresponding training video; and m_i,1 denotes the memory state at the first time step;
when a new frame-level feature is input into the coding network, update the memory state as follows:
wherein 1 < t ≤ M, the input symbol denotes the frame-level feature input at the t-th time step, m_i,t denotes the memory state at the t-th time step, and m_i,t-1 denotes the memory state at the (t−1)-th time step;
the coding network is an LSTM network, and each operation unit in the coding network is:
wherein MLP denotes multilayer mapping, BN denotes batch normalization, W_iv, W_ih, W_fv, W_fh, W_ov, and W_oh denote parameter values of the coding network, and ⊙ denotes the inner product;
output the hidden layer h_i,M obtained at the last time step as the deep representation of the corresponding training video; wherein the first symbol denotes the frame-level features of the corresponding training video and θ denotes the parameters of the coding network.
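Since the embedding formula itself is not reproduced here, the following is only a plausible sketch of the described neighbor-attention step, assuming a standard scaled dot-product attention with parameters W_q, W_k, W_v and scaling factor √d, and assuming the frame feature and the neighbor representation are both projected to a common dimension d; the patent's exact formula may differ:

```python
import numpy as np

def softmax(z):
    # Numerically stable row-wise softmax.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def embed_neighbors(x1, n_i, Wq, Wk, Wv, d):
    # x1: first frame-level feature (d,); n_i: neighbor-structure representation (d,).
    Z = np.stack([x1, n_i])                 # (2, d): frame token and neighbor token
    Q, K, V = Z @ Wq, Z @ Wk, Z @ Wv        # (2, d) query/key/value projections
    attn = softmax(Q @ K.T / np.sqrt(d))    # (2, 2) scaled dot-product weights
    M = attn @ V                            # attended values
    m_i1 = M[0]                             # first-step memory state m_i,1
    return m_i1
```

The attention weights let the frame token draw on the neighbor token, which matches the described effect of embedding n_i into the initial memory state.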
The conversion module 104 is configured to convert the deep representation of each training video into a binary code according to the fully connected layer with an activation function.
As a possible implementation, the deep representation of the corresponding training video is converted according to the fully connected layer with an activation function, and the resulting binary code is: b_i = sgn(t_i); wherein t_i = FC(h_i,M, k); FC denotes the fully connected layer mapping; sgn denotes the sign function, sgn(t_i) being 1 when t_i is greater than 0 and −1 when t_i is less than or equal to 0; and k denotes the length of the binary code.
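The binarization b_i = sgn(t_i), with sgn mapping non-positive values to −1, can be written directly (a minimal sketch; `binarize` is an illustrative name for the operation):

```python
import numpy as np

def binarize(t):
    # b = sgn(t): 1 where t > 0, -1 where t <= 0 (including t == 0)
    return np.where(np.asarray(t) > 0, 1, -1)
```

For example, binarize([0.7, -0.2, 0.0]) yields [1, -1, -1], since zero is mapped to −1.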
The reconstruction module 105 is configured to reconstruct the M reconstructed frame-level features of each training video from the binary code using the decoding network.
The generation module 106 is configured to generate the reconstruction error function from the frame-level features and the reconstructed frame-level features of each training video, and to generate the neighbor similarity error function from the temporal-appearance features and the binary codes.
As a possible implementation, the video training set includes N training videos, and the reconstruction error function is as follows: wherein the former symbol denotes the m-th frame-level feature in the i-th training video and the latter denotes the m-th reconstructed frame-level feature in the i-th training video.
As a possible implementation, the neighbor similarity error function is as follows:
wherein s_ij denotes the similarity between the temporal-appearance feature of the i-th training video and the temporal-appearance feature of the j-th training video, and t_i is the relaxed representation of the binary code b_i.
The training module 107 is configured to train the network so that the reconstruction error function is minimized and the neighbor similarity error function is minimized; the network includes the coding network, the fully connected layer, and the decoding network.
It should be noted that the foregoing explanation of the embodiment of the hash learning method based on neighbor-structure preservation also applies to the hash learning device based on neighbor-structure preservation of this embodiment, and details are not repeated here.
The hash learning device based on neighbor-structure preservation of the embodiment of the present invention obtains a video training set and, for each training video in the set, extracts M frame-level features. Then, an autoencoder is used to extract the temporal-appearance feature of each training video, and the temporal-appearance features are clustered to obtain an anchor feature set. Next, for each training video, the corresponding temporal-appearance neighbor features are obtained from the anchor feature set, and the coding network encodes each training video into a corresponding deep representation according to the temporal-appearance neighbor features. The fully connected layer with an activation function then converts each deep representation into a binary code, from which the decoding network reconstructs the M reconstructed frame-level features of the training video. A reconstruction error function is generated from the frame-level features and the reconstructed frame-level features, and a neighbor similarity error function is generated from the temporal-appearance features and the binary codes. Finally, the network, which includes the coding network, the fully connected layer, and the decoding network, is trained so that both the reconstruction error function and the neighbor similarity error function are minimized. In the present invention, the neighbors of a video are embedded into the coding network, so that during the encoding of frame-level features, content similar to the video's neighbors receives more attention, which improves retrieval precision on large-scale unsupervised video databases. Moreover, minimizing the reconstruction error and the neighbor similarity error ensures the intact preservation of the neighbor structure in Hamming space, further improving retrieval precision on video databases.
In the description of this specification, descriptions referring to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" mean that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, without mutual contradiction, those skilled in the art may combine different embodiments or examples, and the features of different embodiments or examples, described in this specification.
In addition, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "plurality" means at least two, such as two or three, unless otherwise specifically defined.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that comprises one or more executable instructions for implementing custom logic functions or steps of the process; and the scope of the preferred embodiments of the present invention includes other implementations in which functions may be executed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order according to the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in the flowcharts or otherwise described herein may, for example, be considered an ordered list of executable instructions for implementing logic functions, and may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch and execute instructions from an instruction execution system, apparatus, or device). For the purposes of this specification, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transmit a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include: an electrical connection portion (an electronic device) having one or more wirings, a portable computer disk cartridge (a magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a fiber optic device, and a portable compact disc read-only memory (CDROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, because the program can be obtained electronically, for example, by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented by software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques well known in the art may be used: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gate circuits, a programmable gate array (PGA), a field-programmable gate array (FPGA), and the like.
Those of ordinary skill in the art can understand that all or part of the steps carried by the method of the above embodiments can be completed by instructing relevant hardware through a program; the program can be stored in a computer-readable storage medium and, when executed, includes one of or a combination of the steps of the method embodiments.
In addition, each functional unit in each embodiment of the present invention may be integrated into one processing module, each unit may exist alone physically, or two or more units may be integrated into one module. The above integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and is sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like. Although the embodiments of the present invention have been shown and described above, it can be understood that the above embodiments are exemplary and shall not be construed as limiting the present invention; those of ordinary skill in the art may make changes, modifications, replacements, and variations to the above embodiments within the scope of the present invention.

Claims (10)

1. A hash learning method based on neighbor-structure preservation, characterized in that the method comprises:
S1, obtaining a video training set, and for each training video in the video training set, extracting M frame-level features of the training video;
S2, extracting a temporal-appearance feature of each training video using an autoencoder, and clustering the temporal-appearance features to obtain an anchor feature set;
S3, for each training video, obtaining the corresponding temporal-appearance neighbor features from the anchor feature set;
S4, encoding each training video into a corresponding deep representation according to the temporal-appearance neighbor features using a coding network;
S5, converting the deep representation of each training video into a binary code according to a fully connected layer with an activation function;
S6, reconstructing M reconstructed frame-level features of each training video from the binary code using a decoding network;
S7, generating a reconstruction error function from the frame-level features and the reconstructed frame-level features of each training video, and generating a neighbor similarity error function from the temporal-appearance features and the binary codes;
S8, training a network so that the reconstruction error function is minimized and the neighbor similarity error function is minimized; wherein the network comprises the coding network, the fully connected layer, and the decoding network.
2. The method according to claim 1, characterized in that each training video has a temporal-appearance neighbor features, and step S4 specifically comprises:
S41, merging the a temporal-appearance neighbor features corresponding to each training video column-wise to obtain a first vector;
S42, mapping the first vector into a b-dimensional neighbor-structure representation n_i, wherein FC denotes the fully connected layer mapping;
S43, for each training video, inputting the first frame-level feature of the training video into the coding network at the first time step, and embedding the neighbor-structure representation n_i into the b-dimensional memory state as follows:
wherein d is a fixed value; W_q, W_k, and W_v are parameter values of the coding network; the concatenation symbol denotes column-wise merging; the input symbol denotes the frame-level feature input at the first time step of the corresponding training video; and m_i,1 denotes the memory state at the first time step;
S44, when a new frame-level feature is input into the coding network, updating the memory state as follows:
wherein 1 < t ≤ M, the input symbol denotes the frame-level feature input at the t-th time step, m_i,t denotes the memory state at the t-th time step, and m_i,t-1 denotes the memory state at the (t−1)-th time step;
the coding network being an LSTM network, each operation unit in the coding network being:
wherein MLP denotes multilayer mapping, BN denotes batch normalization, W_iv, W_ih, W_fv, W_fh, W_ov, and W_oh denote parameter values of the coding network, and ⊙ denotes the inner product;
S45, outputting the hidden layer h_i,M obtained at the last time step as the deep representation of the corresponding training video;
wherein the first symbol denotes the frame-level features of the corresponding training video and θ denotes the parameters of the coding network.
3. The method according to claim 2, characterized in that the deep representation of the corresponding training video is converted according to the fully connected layer with an activation function, and the resulting binary code is:
b_i = sgn(t_i);
wherein t_i = FC(h_i,M, k); FC denotes the fully connected layer mapping; sgn denotes the sign function, sgn(t_i) being 1 when t_i is greater than 0 and −1 when t_i is less than or equal to 0; and k denotes the length of the binary code.
4. The method according to claim 1, characterized in that the video training set includes N training videos, and
the reconstruction error function is as follows:
wherein the former symbol denotes the m-th frame-level feature in the i-th training video and the latter denotes the m-th reconstructed frame-level feature in the i-th training video.
5. The method according to claim 4, characterized in that the neighbor similarity error function is as follows:
wherein s_ij denotes the similarity between the temporal-appearance feature of the i-th training video and the temporal-appearance feature of the j-th training video, and t_i is the relaxed representation of the binary code b_i.
6. A hash learning device based on neighbor-structure preservation, characterized in that the device comprises:
an acquisition module, configured to obtain a video training set and, for each training video in the video training set, extract M frame-level features of the training video;
an extraction module, configured to extract a temporal-appearance feature of each training video using an autoencoder, and to cluster the temporal-appearance features to obtain an anchor feature set;
the acquisition module being further configured to obtain, for each training video, the corresponding temporal-appearance neighbor features from the anchor feature set;
a coding module, configured to encode each training video into a corresponding deep representation according to the temporal-appearance neighbor features using a coding network;
a conversion module, configured to convert the deep representation of each training video into a binary code according to a fully connected layer with an activation function;
a reconstruction module, configured to reconstruct M reconstructed frame-level features of each training video from the binary code using a decoding network;
a generation module, configured to generate a reconstruction error function from the frame-level features and the reconstructed frame-level features of each training video, and to generate a neighbor similarity error function from the temporal-appearance features and the binary codes;
a training module, configured to train a network so that the reconstruction error function is minimized and the neighbor similarity error function is minimized; wherein the network comprises the coding network, the fully connected layer, and the decoding network.
7. The device according to claim 6, characterized in that each training video has a temporal-appearance neighbor features, and the coding module is specifically configured to:
merge the a temporal-appearance neighbor features corresponding to each training video column-wise to obtain a first vector;
map the first vector into a b-dimensional neighbor-structure representation n_i, wherein FC denotes the fully connected layer mapping;
for each training video, input the first frame-level feature of the training video into the coding network at the first time step, and embed the neighbor-structure representation n_i into the b-dimensional memory state as follows:
wherein d is a fixed value; W_q, W_k, and W_v are parameter values of the coding network; the concatenation symbol denotes column-wise merging; the input symbol denotes the frame-level feature input at the first time step of the corresponding training video; and m_i,1 denotes the memory state at the first time step;
when a new frame-level feature is input into the coding network, update the memory state as follows:
wherein 1 < t ≤ M, the input symbol denotes the frame-level feature input at the t-th time step, m_i,t denotes the memory state at the t-th time step, and m_i,t-1 denotes the memory state at the (t−1)-th time step;
the coding network being an LSTM network, each operation unit in the coding network being:
wherein MLP denotes multilayer mapping, BN denotes batch normalization, W_iv, W_ih, W_fv, W_fh, W_ov, and W_oh denote parameter values of the coding network, and ⊙ denotes the inner product;
output the hidden layer h_i,M obtained at the last time step as the deep representation of the corresponding training video; wherein the first symbol denotes the frame-level features of the corresponding training video and θ denotes the parameters of the coding network.
8. The device according to claim 7, characterized in that the deep representation of the corresponding training video is converted according to the fully connected layer with an activation function, and the resulting binary code is:
b_i = sgn(t_i);
wherein t_i = FC(h_i,M, k); FC denotes the fully connected layer mapping; sgn denotes the sign function, sgn(t_i) being 1 when t_i is greater than 0 and −1 when t_i is less than or equal to 0; and k denotes the length of the binary code.
9. The device according to claim 6, characterized in that the video training set includes N training videos, and
the reconstruction error function is as follows:
wherein the former symbol denotes the m-th frame-level feature in the i-th training video and the latter denotes the m-th reconstructed frame-level feature in the i-th training video.
10. The device according to claim 8, characterized in that the neighbor similarity error function is as follows:
wherein s_ij denotes the similarity between the temporal-appearance feature of the i-th training video and the temporal-appearance feature of the j-th training video, and t_i is the relaxed representation of the binary code b_i.
CN201910264740.9A 2019-04-03 2019-04-03 Hash learning method and device based on neighbor structure keeping Active CN110069666B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910264740.9A CN110069666B (en) 2019-04-03 2019-04-03 Hash learning method and device based on neighbor structure keeping


Publications (2)

Publication Number Publication Date
CN110069666A true CN110069666A (en) 2019-07-30
CN110069666B CN110069666B (en) 2021-04-06

Family

ID=67366914



Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112199520A (en) * 2020-09-19 2021-01-08 复旦大学 Cross-modal Hash retrieval algorithm based on fine-grained similarity matrix
CN113111836A (en) * 2021-04-25 2021-07-13 山东省人工智能研究院 Video analysis method based on cross-modal Hash learning

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012077818A1 (en) * 2010-12-10 2012-06-14 国立大学法人豊橋技術科学大学 Method for determining conversion matrix for hash function, hash-type approximation nearest neighbour search method using said hash function, and device and computer program therefor
CN103744973A (en) * 2014-01-11 2014-04-23 西安电子科技大学 Video copy detection method based on multi-feature Hash
CN106777130A (en) * 2016-12-16 2017-05-31 西安电子科技大学 A kind of index generation method, data retrieval method and device
CN107229757A (en) * 2017-06-30 2017-10-03 中国科学院计算技术研究所 The video retrieval method encoded based on deep learning and Hash
CN108304808A (en) * 2018-02-06 2018-07-20 广东顺德西安交通大学研究院 A kind of monitor video method for checking object based on space time information Yu depth network
CN108763481A (en) * 2018-05-29 2018-11-06 清华大学深圳研究生院 A kind of picture geographic positioning and system based on extensive streetscape data
CN109151501A (en) * 2018-10-09 2019-01-04 北京周同科技有限公司 A kind of video key frame extracting method, device, terminal device and storage medium
CN109299097A (en) * 2018-09-27 2019-02-01 宁波大学 A kind of online high dimensional data K-NN search method based on Hash study
CN109409208A (en) * 2018-09-10 2019-03-01 东南大学 A kind of vehicle characteristics extraction and matching process based on video


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHEN, ZHIXIANG; LU, JIWEN: "Nonlinear Structural Hashing for Scalable Video Search", IEEE Transactions on Circuits and Systems for Video Technology *
LU, JIWEN: "Binary Representation Learning and Its Applications", Pattern Recognition and Artificial Intelligence *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112199520A (en) * 2020-09-19 2021-01-08 复旦大学 Cross-modal Hash retrieval algorithm based on fine-grained similarity matrix
CN112199520B (en) * 2020-09-19 2022-07-22 复旦大学 Cross-modal Hash retrieval algorithm based on fine-grained similarity matrix
CN113111836A (en) * 2021-04-25 2021-07-13 山东省人工智能研究院 Video analysis method based on cross-modal Hash learning

Also Published As

Publication number Publication date
CN110069666B (en) 2021-04-06

Similar Documents

Publication Publication Date Title
Lin et al. TuiGAN: Learning versatile image-to-image translation with two unpaired images
Cappello et al. Use cases of lossy compression for floating-point data in scientific data sets
JP5235666B2 (en) Associative matrix method, system and computer program product using bit-plane representation of selected segments
CN110782395B (en) Image processing method and device, electronic equipment and computer readable storage medium
CN110019793A (en) A text semantic coding method and device
CN109478204A (en) Machine comprehension of unstructured text
CN105144157B (en) System and method for the data in compressed data library
Zhang et al. An improved YOLOv3 model based on skipping connections and spatial pyramid pooling
CN110069666A (en) The Hash learning method and device kept based on Near-neighbor Structure
KR20230072454A (en) Apparatus, method and program for bidirectional generation between image and text
Zhou et al. Using Siamese capsule networks for remote sensing scene classification
Zheng et al. Trading positional complexity vs deepness in coordinate networks
Wang et al. STCD: efficient Siamese transformers-based change detection method for remote sensing images
Xie et al. GAGCN: Generative adversarial graph convolutional network for non‐homogeneous texture extension synthesis
Li et al. Deep unsupervised hashing for large-scale cross-modal retrieval using knowledge distillation model
Debattista Application‐Specific Tone Mapping Via Genetic Programming
CN116561272A (en) Open domain visual language question-answering method and device, electronic equipment and storage medium
US11373274B1 (en) Method for super resolution imaging based on deep learning
Varshney et al. On palimpsests in neural memory: An information theory viewpoint
CN115272082A (en) Model training method, video quality improving method, device and computer equipment
Boyd et al. Vectors, matrices, and least squares
Imanpour et al. Memory‐and time‐efficient dense network for single‐image super‐resolution
He et al. Towards lifelong scene graph generation with knowledge-ware in-context prompt learning
Wang A Markov Model‐Based Fusion Algorithm for Distorted Electronic Technology Archives
CN109800359A (en) Information recommendation processing method and device, electronic device, and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant