CN110069666B - Hash learning method and device based on neighbor structure keeping - Google Patents

Hash learning method and device based on neighbor structure keeping Download PDF

Info

Publication number
CN110069666B
CN110069666B CN201910264740.9A CN201910264740A CN110069666B CN 110069666 B CN110069666 B CN 110069666B CN 201910264740 A CN201910264740 A CN 201910264740A CN 110069666 B CN110069666 B CN 110069666B
Authority
CN
China
Prior art keywords
video
training
training video
neighbor
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910264740.9A
Other languages
Chinese (zh)
Other versions
CN110069666A (en
Inventor
鲁继文
周杰
李舒燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201910264740.9A priority Critical patent/CN110069666B/en
Publication of CN110069666A publication Critical patent/CN110069666A/en
Application granted granted Critical
Publication of CN110069666B publication Critical patent/CN110069666B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Image Analysis (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention provides a Hash learning method and a Hash learning device based on neighbor structure keeping, wherein the method comprises the following steps: acquiring a video training set, and extracting M frame-level features of each training video; extracting the time domain appearance characteristics of each training video, and clustering the time domain appearance characteristics to obtain an anchor point characteristic set; acquiring a time domain appearance neighbor characteristic corresponding to each training video from the anchor point characteristic set; coding each training video into a corresponding depth expression by adopting a coding network according to the time domain appearance neighbor characteristic; converting the depth expression corresponding to each training video into a list of binary codes; reconstructing M reconstructed frame level characteristics corresponding to each training video according to the binary code; generating a reconstruction error function and a neighbor similarity error function; the network is trained to minimize a reconstruction error function and a neighbor similarity error function. The method can ensure the perfect preservation of the neighbor structure in the Hamming space and improve the retrieval precision on a large-scale unsupervised video database.

Description

Hash learning method and device based on neighbor structure keeping
Technical Field
The invention relates to the technical field of video processing, in particular to a hash learning method and device based on neighbor structure keeping.
Background
Large-scale video retrieval, which aims to retrieve videos similar to a given query video from a huge database, is generally performed by representing the videos by a series of sampled video frames, and each frame of the videos can be represented by a feature. During video retrieval, relevant videos can be determined according to the feature sets corresponding to the videos.
In the presence of high-dimensional characteristics and mass data, the Hash method obtains great achievement in large-scale visual retrieval tasks, video Hash codes a video into a compact binary code, guarantees the similarity structure of a video space, and saves the video in a Hamming space. The learning-based video hashing method explores data characteristics and achieves better performance than a manually designed hashing method, and because the trouble of manual labeling is avoided, unsupervised hashing is more feasible in a large-scale video retrieval task than supervised hashing.
At present, most of the unsupervised hashes are focused on utilizing the representation and the time sequence information of the video, but neglect the utilization of the neighbor structure, so that the coding network can absorb the content of the input video without distinction whether the content is similar to the neighbor content, which is not beneficial to the storage of the neighbor similarity, and thus, when the video retrieval is performed on a large-scale unsupervised video database, the retrieval accuracy cannot be ensured.
Disclosure of Invention
The invention provides a Hash learning method and a Hash learning device based on neighbor structure preservation, which are used for guaranteeing the perfect preservation of neighbor structures in a Hamming space and improving the retrieval precision on a large-scale unsupervised video database and are used for solving the technical problems that in the prior art, unsupervised Hash is focused on utilizing the representation and time sequence information of videos, but neglects the utilization of the neighbor structures and cannot guarantee the precision of video retrieval.
An embodiment of one aspect of the present invention provides a hash learning method based on neighbor structure preservation, including:
s1, acquiring a video training set, and extracting M frame-level features of each training video in the video training set;
s2, extracting the time domain appearance characteristics of each training video by adopting an automatic encoder, and clustering the time domain appearance characteristics to obtain an anchor point characteristic set;
s3, acquiring time domain appearance neighbor characteristics corresponding to each training video from the anchor point characteristic set aiming at each training video;
s4, coding each training video into a corresponding depth expression according to the time domain appearance neighbor characteristic by adopting a coding network;
s5, converting the depth expression corresponding to each training video into a list of binary codes according to the full link layer using the activation function;
s6, reconstructing M reconstructed frame-level features corresponding to each training video according to the binary code by adopting a decoding network;
s7, generating a reconstruction error function according to the frame level characteristics and the reconstruction frame level characteristics corresponding to each training video, and generating a neighbor similarity error function according to the time domain appearance characteristics and the binary codes;
s8, training a network to minimize the reconstruction error function and minimize the neighbor similarity error function; wherein the network comprises the encoding network, the full link layer, and the transcoding network.
The Hash learning method based on neighbor structure keeping of the embodiment of the invention extracts M frame-level features of each training video aiming at each training video in a video training set by acquiring the video training set, then extracts time domain appearance features of each training video by adopting an automatic encoder and clusters the time domain appearance features to obtain an anchor point feature set, then acquires time domain appearance neighbor features corresponding to each training video from the anchor point feature set aiming at each training video, and codes each training video into corresponding depth expression according to the time domain appearance neighbor features by adopting a coding network, then converts the depth expression corresponding to each training video into a column of binary codes according to a full link layer using an activation function, and then reconstructs M reconstructed frame-level features corresponding to each training video according to the binary codes by adopting a decoding network, then, generating a reconstruction error function according to the frame level characteristics and the reconstruction frame level characteristics corresponding to each training video, generating a neighbor similarity error function according to the time domain appearance characteristics and the binary codes, and finally training the network to minimize the reconstruction error function and the neighbor similarity error function; the network comprises an encoding neural network, a full link layer and a decoding network. In the invention, the neighbors of the video are embedded into the coding network, so that in the process of coding the frame-level characteristics of the video, the content similar to the neighbors in the video is paid more attention, and the retrieval precision on a large-scale unsupervised video database can be further improved. In addition, by minimizing the reconstruction error and the neighbor similarity error, the perfect preservation of the neighbor structure in the Hamming space can be ensured, and the retrieval precision on the video database is further improved.
Another embodiment of the present invention provides a hash learning apparatus based on neighbor structure preservation, including:
the acquisition module is used for acquiring a video training set and extracting M frame-level features of each training video in the video training set;
the extraction module is used for extracting the time domain appearance characteristics of each training video by adopting an automatic encoder, and clustering the time domain appearance characteristics to obtain an anchor point characteristic set;
the acquisition module is further configured to acquire, for each training video, a time-domain appearance neighbor feature corresponding to each training video from the anchor point feature set;
the coding module is used for coding each training video into a corresponding depth expression according to the time domain appearance neighbor characteristic by adopting a coding network;
the conversion module is used for converting the depth expression corresponding to each training video into a row of binary codes according to the full link layer using the activation function;
the reconstruction module is used for reconstructing M reconstruction frame level characteristics corresponding to each training video according to the binary code by adopting a decoding network;
the generating module is used for generating a reconstruction error function according to the frame level characteristics and the reconstruction frame level characteristics corresponding to each training video and generating a neighbor similarity error function according to the time domain appearance characteristics and the binary codes;
a training module to train a network to minimize the reconstruction error function and to minimize the neighbor similarity error function; wherein the network comprises the encoding network, the full link layer, and the transcoding network.
The Hash learning device based on neighbor structure keeping of the embodiment of the invention extracts M frame-level features of each training video aiming at each training video in a video training set by acquiring the video training set, then extracts time domain appearance features of each training video by adopting an automatic encoder and clusters the time domain appearance features to obtain an anchor point feature set, then acquires time domain appearance neighbor features corresponding to each training video from the anchor point feature set aiming at each training video, codes each training video into corresponding depth expression according to the time domain appearance neighbor features by adopting a coding network, then converts the depth expression corresponding to each training video into a column of binary codes according to a full link layer using an activation function, and then reconstructs M reconstructed frame-level features corresponding to each training video according to the binary codes by adopting a decoding network, then, generating a reconstruction error function according to the frame level characteristics and the reconstruction frame level characteristics corresponding to each training video, generating a neighbor similarity error function according to the time domain appearance characteristics and the binary codes, and finally training the network to minimize the reconstruction error function and the neighbor similarity error function; the network comprises an encoding network, a full link layer and a decoding network. In the invention, the neighbors of the video are embedded into the coding network, so that in the process of coding the frame-level characteristics of the video, the content similar to the neighbors in the video is paid more attention, and the retrieval precision on a large-scale unsupervised video database can be further improved. In addition, by minimizing the reconstruction error and the neighbor similarity error, the perfect preservation of the neighbor structure in the Hamming space can be ensured, and the retrieval precision on the video database is further improved.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart of a hash learning method based on neighbor structure preservation according to an embodiment of the present invention;
FIG. 2 is a first diagram illustrating a hash learning process according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a second hash learning process according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a hash learning apparatus based on neighbor structure preservation according to a second embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
Currently, some video hashing methods integrate a hash function into a deep neural network, specifically, extract features of a video frame through a deep convolutional network, and the features are further encoded into a binary code through a time sequence pooling operation or a deep cyclic network. Compared with the supervised hash, the unsupervised hash is more feasible in a large-scale video retrieval task because the trouble of manual labeling is avoided.
However, most of the unsupervised hashing aims at utilizing the representation and timing information of the video, but neglects the utilization of a neighbor structure, and although some hashing methods design some neighbor similarity cost functions to train the network, the neighbor structure is only used for guiding the generation of the binary code and is not utilized in the video feature coding. In this way, the designed coding network will absorb the content of the input video without distinction whether the content is similar to the neighboring content, and is not favorable for storing the neighboring similarity, so that the precision of the retrieval cannot be guaranteed when the video retrieval is performed on a large-scale unsupervised video database.
Therefore, the invention provides a hash learning method based on neighbor structure retention, which mainly aims at the technical problems that in the prior art, unsupervised hash focuses on utilizing the representation and time sequence information of a video, but neglects the utilization of a neighbor structure and cannot ensure the precision of video retrieval.
According to the Hash learning method based on neighbor structure keeping, disclosed by the embodiment of the invention, the neighbors of the video are embedded into the coding network, so that the content similar to the neighbors in the video is paid more attention in the process of coding the video frame-level features, and the retrieval precision on a large-scale unsupervised video database can be further improved. In addition, when the network is trained, the perfect preservation of the neighbor structure in the Hamming space can be ensured by minimizing the reconstruction error and the neighbor similarity error, and the retrieval precision on the video database is further improved.
The following describes a hash learning method and apparatus based on neighbor structure keeping according to an embodiment of the present invention with reference to the drawings.
Fig. 1 is a schematic flowchart of a hash learning method based on neighbor structure preservation according to an embodiment of the present invention.
The embodiment of the invention is exemplified in that the hash learning method based on neighbor structure keeping is configured in the hash learning apparatus based on neighbor structure keeping, and the hash learning apparatus based on neighbor structure keeping can be applied to any computer device, so that the computer device can execute the hash learning function based on neighbor structure keeping.
The computer device may be a Personal Computer (PC), a cloud device, a mobile device, and the like, and the mobile device may be a hardware device having various operating systems, touch screens, and/or display screens, such as a mobile phone, a tablet computer, a personal digital assistant, a wearable device, and an in-vehicle device.
As shown in fig. 1, the neighbor structure preserving-based hash learning method may include the following steps:
s1, acquiring a video training set, and extracting M frame-level features of each training video aiming at each training video in the video training set.
In the embodiment of the present invention, the video training set includes N training videos, where the N training videos may be videos stored locally by the computer device, or may also be videos downloaded online by the computer device, which is not limited herein. Wherein, the size of N is preset, and the size of M is also preset.
In the embodiment of the invention, the video training set is marked as
Figure BDA0002016433290000051
For each training video in the video training set, M frames can be uniformly sampled, and M frame-level features with dimension l corresponding to each training video are extracted by the deep convolutional network, so that each training video can be converted into a frame-level feature set
Figure BDA0002016433290000052
And S2, extracting the time domain appearance characteristics of each training video by adopting an automatic encoder, and clustering the time domain appearance characteristics to obtain an anchor point characteristic set.
In the embodiment of the invention, for each training video, d-dimensional time domain appearance characteristics can be obtained through an automatic encoder. For each training video, the distance between the training video and other videos in the video library can be calculated and ranked to determine a time-domain appearance neighbor features corresponding to the training video, for example, the distance between different videos can be calculated by using a two-norm, since this calculation process also needs to be calculated in the testing stage, and neighbor search facing the whole video training set consumes a lot of time, which is not practical. Therefore, in the invention, K-means clustering can be performed on the training videos in the video training set to obtain n clustering centers, for example, K-means clustering can be performed on the time domain appearance characteristics to obtain n clustering centers. For each cluster center, the temporal appearance feature closest (or least distant) to the cluster center may be determined, resulting in n temporal appearance features. Then, n time domain appearance features can be used as anchor points and are listed into an anchor point feature set, and the anchor point feature set is marked as
Figure BDA0002016433290000053
And S3, acquiring the time domain appearance neighbor feature corresponding to each training video from the anchor point feature set aiming at each training video.
In the embodiment of the invention, for each training video, a time domain appearance neighbor features corresponding to the training video can be obtained from the anchor point feature setAre respectively as
Figure BDA0002016433290000054
Because a is less than N, only a little time is consumed for obtaining a time domain appearance neighbor characteristics, and the efficiency of video retrieval can be greatly improved.
And S4, coding each training video into a corresponding depth expression according to the time domain appearance neighbor characteristics by adopting a coding network.
In the embodiment of the invention, the coding network learns the corresponding relation between the time domain appearance neighbor features and the depth expressions corresponding to the videos, and after the time domain appearance neighbor features corresponding to each training video are determined, the time domain appearance neighbor features corresponding to each training video can be input to the coding network to obtain the depth expressions corresponding to each training video.
As a possible implementation manner, in the neighbor attention learning mechanism, the neighbor structure expression n needs to be obtainedi. Specifically, for each training video, a time domain appearance neighbor features corresponding to the training video may be column-wise combined to obtain a first vector
Figure BDA0002016433290000055
And mapping the first vector to a b-dimensional neighbor structure expression niThen the neighbor structure expresses niComprises the following steps:
Figure BDA0002016433290000061
where FC denotes a full link layer map.
For each training video, at a first time instant, inputting a first frame-level feature of the training video into the coding network, and expressing n adjacent structuresiEmbedding into the b-dimensional memory state as follows:
Figure BDA0002016433290000062
wherein d is a fixed value, Wq、Wk、WvIn order to encode the parameter values of the network,
Figure BDA0002016433290000063
it is shown that the column direction is merged,
Figure BDA0002016433290000064
representing the frame-level features, m, input at the first moment of the training videoi,1Indicating the memory state corresponding to the first moment.
By means of the formula (2), the information of the neighbor structure will exist in the memory state at each moment, and when 1< t ≦ M, when there is a new video frame level feature input to the coding network, the memory state can be updated as follows:
Figure BDA0002016433290000065
wherein the content of the first and second substances,
Figure BDA0002016433290000066
representing the video frame-level features, m, input at the t-th instanti,tRepresents the memory state corresponding to the t-th time, mi,t-1The memory state corresponding to the t-1 th time is shown.
By means of the formulas (2) and (3), at each moment, the memory state will select the useful information in the input features to write into the new memory state according to the neighbor structure information contained in the memory state. The neighbor attention learning mechanism is embedded into the coding network, and the operation units are obtained as follows:
Figure BDA0002016433290000067
where MLP denotes multi-level mapping, BN denotes batch normalization, Wiv、Wih、Wfv、Wfh、Wov、WohA parameter value representing the coding network,
Figure BDA0002016433290000068
expressing the inner product, the sigma function is calculated in a manner that sigma is 1/(1+ e)-x) The tan h function is calculated by σ ═ ex-e-x)/(ex+e-x). Hidden layer output h obtained at last momenti,MI.e. a deep representation of the training video. Specifically, for each training video, the depth of the training video is expressed as:
Figure BDA0002016433290000069
wherein the content of the first and second substances,
Figure BDA00020164332900000610
representing the corresponding frame-level features of the training video, and theta represents the parameters of the coding network.
In the embodiment of the invention, the neighbors of the video are embedded into the coding network, so that in the process of coding the frame-level features of the video, the content similar to the neighbors in the video is paid more attention, and the retrieval precision on a large-scale unsupervised video database can be further improved.
And S5, converting the depth expression corresponding to each training video into a list of binary codes according to the full link layer using the activation function.
In the embodiment of the invention, for each training video, according to the full link layer using the activation function, the depth corresponding to the training video is expressed, and a list of binary codes obtained by conversion is as follows:
bi=sgn(ti);(6)
wherein, ti=FC(hi,MK); FC denotes the full link layer map, sgn denotes the sign function, when tiWhen greater than 0, sgn (t)i) Is 1 when t isiWhen the value is less than or equal to 0, sgn (t)i) To-1, k represents the length of a column of binary codes.
And S6, reconstructing M reconstructed frame-level features corresponding to each training video according to the binary code by adopting a decoding network.
In the embodiment of the present invention, a Long Short Term Memory (LSTM) network may be used as the decoding network. Specifically, a column of binary codes corresponding to each training video may be mapped to an l-dimensional vector
Figure BDA0002016433290000071
At the first moment, will
Figure BDA0002016433290000072
Input to a decoding network to obtain a first reconstructed video frame-level feature
Figure BDA0002016433290000073
Features noted in the embodiments of the present invention as reconstructed frame level
Figure BDA0002016433290000074
At the second moment, will
Figure BDA00020164332900000711
Inputting the data into a decoding network to obtain a second reconstructed frame level feature
Figure BDA0002016433290000075
Will be provided with
Figure BDA0002016433290000076
Inputting the data into a decoding network to obtain a third reconstructed frame level feature
Figure BDA0002016433290000077
The above steps are circulated until the Mth reconstructed frame level characteristic is output by the decoding network
Figure BDA0002016433290000078
When so, decoding is completed.
As an example, referring to fig. 2, fig. 2 is a schematic diagram of a hash learning process in an embodiment of the present invention. It is composed ofIn the method, M frame-level features v corresponding to the training video are obtained1、v2、…、vMThen, corresponding depth expressions can be output through the coding network, and after corresponding binary codes are obtained through a full link layer using an activation function, corresponding M reconstructed frame-level features can be output through the decoding network
Figure BDA0002016433290000079
And S7, generating a reconstruction error function according to the frame level characteristics and the reconstruction frame level characteristics corresponding to each training video, and generating a neighbor similarity error function according to the time domain appearance neighbor characteristics and the binary code.
In the embodiment of the invention, two loss functions are designed to train the network, namely a reconstruction error function LrAnd neighbor similarity error function Ls
Wherein an error function L is reconstructedrRepresenting the difference between the corresponding frame-level features of the input training video and the decoded reconstructed frame-level features, the mean square error may be used to represent the reconstruction error function Lr
Figure BDA00020164332900000710
Wherein the content of the first and second substances,
Figure BDA0002016433290000081
representing the mth frame-level feature in the ith training video,
Figure BDA0002016433290000082
representing the mth reconstructed frame-level feature in the ith training video.
In the embodiment of the invention, the neighbor similarity error function represents the difference of the similarity structure in the original video space and the Hamming space, and the neighbor similarity error function L can be obtained according to a formula (8)s
Figure BDA0002016433290000083
Wherein s isijRepresenting the similarity between the temporal appearance features of the ith training video and the temporal appearance features of the jth training video,
Figure BDA0002016433290000084
represents the ith corresponding binary code biCorresponding to the jth binary code bjThe similarity between them.
To calculate s in equation (8)ijThe approximate similarity matrix a may be established as follows. First, the corresponding frame-level features x of the training video can be obtainediAnd corresponding a time domain appearance neighbor features
Figure BDA0002016433290000085
Defining a truncated similarity matrix
Figure BDA0002016433290000086
Each element Y in Y can be represented by formula (9)ij
Figure BDA0002016433290000087
Wherein, < i > represents the positions of a time domain appearance neighbor features in the anchor point feature set, Dist represents a distance calculation function, the distance can be calculated by adopting a two-norm, and t represents a bandwidth parameter.
The approximate similarity matrix a may be calculated according to equation (10):
A=YΛ-1YT;(10)
wherein the content of the first and second substances,
Figure BDA0002016433290000088
a calculated according to the formula (10) is a sparse non-negative matrix, the sum of each row and each column of the matrix is 1, and when A isij>When 0, s can be substitutedijIs set to 1, and when AijWhen s is less than or equal to 0, s can be addedijIs set to 0.
The binary code b is represented in formula (8)iAnd binary code bjSimilarity between them
Figure BDA0002016433290000089
Can be defined as
Figure BDA00020164332900000810
To avoid oscillation during network training, the method can be used
Figure BDA00020164332900000811
Approximate representation
Figure BDA00020164332900000812
Wherein, tiIs a binary code biThe slack of (a) indicates.
To reduce
Figure BDA00020164332900000813
And
Figure BDA00020164332900000814
with respect to t, an approximation error can be introducediAnd biThe auxiliary loss term of (c), equation (8) can be converted into:
Figure BDA00020164332900000815
s8, training the network to minimize the reconstruction error function and minimize the neighbor similarity error function; the network comprises an encoding network, a full link layer and a decoding network.
In the embodiment of the invention, the network can be divided into three parts: the first part is a coding network with a neighbor attention learning mechanism, and the deep expression of the training video can be learned through the coding network; the second part is a full link layer with a nonlinear activation function, and is used for converting the depth expression into a binary code with a K dimension; the third part is a decoding network, and the reconstructed frame-level features of each frame of the training video are decoded from the binary code obtained by encoding.
In the embodiment of the invention, the network can be trained according to the reconstruction error function and the neighbor similarity error function, the information contained in the input training video can be better utilized by minimizing the reconstruction error function, and the neighbor similarity can be maximally stored by minimizing the neighbor similarity error function. When training the network, the used training loss function can be weighted by a reconstruction error function and a neighbor similarity error function:
L=αLs+(1-α)Lr;(12)
where α represents a hyperparameter balancing the reconstruction error function and the neighbor similarity error function.
In the embodiment of the invention, when the network is trained end to end, a reverse gradient conduction mode can be adopted to optimize the network parameters.
As an example, referring to fig. 3, fig. 3 is a schematic diagram of a hash learning process according to an embodiment of the present invention. When the network is trained, when a training video is input, the time domain appearance neighbor characteristic can be embedded into a Hash coding network for Hash learning, a Hash code is generated through a coding network, and the perfect preservation of a neighbor structure in a Hamming space is ensured by minimizing a reconstruction error and a neighbor similarity error.
The Hash learning method based on neighbor structure keeping of the embodiment of the invention extracts M frame-level features of each training video aiming at each training video in a video training set by acquiring the video training set, then extracts time domain appearance features of each training video by adopting an automatic encoder and clusters the time domain appearance features to obtain an anchor point feature set, then acquires time domain appearance neighbor features corresponding to each training video from the anchor point feature set aiming at each training video, and codes each training video into corresponding depth expression according to the time domain appearance neighbor features by adopting a coding network, then converts the depth expression corresponding to each training video into a column of binary codes according to a full link layer using an activation function, and then reconstructs M reconstructed frame-level features corresponding to each training video according to the binary codes by adopting a decoding network, then, generating a reconstruction error function according to the frame level characteristics and the reconstruction frame level characteristics corresponding to each training video, generating a neighbor similarity error function according to the time domain appearance characteristics and the binary codes, and finally training the network to minimize the reconstruction error function and the neighbor similarity error function; the network comprises an encoding network, a full link layer and a decoding network. In the invention, the neighbors of the video are embedded into the coding network, so that in the process of coding the frame-level characteristics of the video, the content similar to the neighbors in the video is paid more attention, and the retrieval precision on a large-scale unsupervised video database can be further improved. In addition, by minimizing the reconstruction error and the neighbor similarity error, the perfect preservation of the neighbor structure in the Hamming space can be ensured, and the retrieval precision on the video database is further improved.
In order to implement the above embodiment, the present invention further provides a hash learning apparatus based on neighbor structure preservation.
Fig. 4 is a schematic structural diagram of a hash learning apparatus based on neighbor structure preservation according to a second embodiment of the present invention.
As shown in fig. 4, the neighbor structure preserving-based hash learning apparatus includes: the system comprises an acquisition module 101, an extraction module 102, an encoding module 103, a transformation module 104, a reconstruction module 105, a generation module 106 and a training module 107.
The obtaining module 101 is configured to obtain a video training set, and extract, for each training video in the video training set, M frame-level features of each training video.
The extracting module 102 is configured to extract a time domain appearance feature of each training video by using an automatic encoder, and cluster the time domain appearance features to obtain an anchor point feature set.
The obtaining module 101 is further configured to, for each training video, obtain, from the anchor point feature set, a time-domain appearance neighbor feature corresponding to each training video.
And the coding module 103 is configured to code each training video into a corresponding depth expression according to the time domain appearance neighbor feature by using a coding network.
As a possible implementation manner, each training video has a time domain appearance neighbor features, which are respectively
Figure BDA0002016433290000101
Figure BDA0002016433290000102
The encoding module 103 is specifically configured to:
combining the column direction of a time domain appearance neighbor characteristics corresponding to each training video to obtain a first vector
Figure BDA0002016433290000103
Mapping a first vector to a b-dimensional neighbor structure expression niWherein, in the step (A),
Figure BDA0002016433290000104
FC denotes full link layer mapping;
for each training video, at a first moment, inputting a first frame-level feature of each training video into the coding network, and expressing n adjacent structuresiEmbedding into the b-dimensional memory state as follows:
Figure BDA0002016433290000105
wherein d is a fixed value, Wq、Wk、WvIn order to encode the parameter values of the network,
Figure BDA0002016433290000106
it is shown that the column direction is merged,
Figure BDA0002016433290000107
representing the frame-level feature, m, input corresponding to the first moment of the training videoi,1Representing the memory state corresponding to the first moment;
when new frame-level features are input into the coding network, the memory state is updated as follows:
Figure BDA0002016433290000108
wherein 1 is<t≤M,
Figure BDA0002016433290000109
Representing the frame-level features of the input at the t-th instant, mi,tRepresents the memory state corresponding to the t-th time, mi,t-1The memory state corresponding to the t-1 th moment is shown;
the coding network is an LSTM network, and each operation unit in the coding network is as follows:
Figure BDA0002016433290000111
where MLP denotes multi-level mapping, BN denotes batch normalization, Wiv、Wih、Wfv、Wfh、Wov、WohA parameter value representing the coding network,
Figure BDA0002016433290000112
represents the inner product;
outputting the hidden layer obtained at the last moment hi,MAs a depth representation of the corresponding training video; wherein the content of the first and second substances,
Figure BDA0002016433290000113
Figure BDA0002016433290000114
representing frame-level features of the corresponding training video and theta representing a parameter of the coding network.
And the conversion module 104 is configured to convert the depth expression corresponding to each training video into a list of binary codes according to the full link layer using the activation function.
As a possible implementation, according to a full link layer using an activation function, a deep representation corresponding to a training video is transformed, and a list of binary codes obtained is: bi=sgn(ti) (ii) a Wherein, ti=FC(hi,MK); FC denotes the full link layer map, sgn denotes the sign function, when tiWhen greater than 0, sgn (t)i) Is 1 when t isiWhen the value is less than or equal to 0, sgn (t)i) To-1, k represents the length of a column of binary codes.
And the reconstructing module 105 is configured to reconstruct M reconstructed frame-level features corresponding to each training video according to the binary code by using a decoding network.
And the generating module 106 is configured to generate a reconstruction error function according to the frame level feature and the reconstruction frame level feature corresponding to each training video, and generate a neighbor similarity error function according to the time domain appearance feature and the binary code.
As a possible implementation, the video training set includes N training videos, and the reconstruction error function is:
Figure BDA0002016433290000115
wherein the content of the first and second substances,
Figure BDA0002016433290000116
representing the mth frame-level feature in the ith training video,
Figure BDA0002016433290000117
representing the mth reconstructed frame-level feature in the ith training video.
As a possible implementation, the neighbor similarity error function is:
Figure BDA0002016433290000118
wherein s isijRepresents the ith trainingSimilarity between the time-domain appearance features of the training video and the time-domain appearance features of the jth training video, tiIs a binary code biThe slack of (a) indicates.
A training module 107 for training the network to minimize a reconstruction error function and to minimize a neighbor similarity error function; the network comprises an encoding network, a full link layer and a decoding network.
It should be noted that the foregoing explanation on the embodiment of the hash learning method based on neighbor structure keeping is also applicable to the hash learning apparatus based on neighbor structure keeping in this embodiment, and details are not repeated here.
The Hash learning device based on neighbor structure keeping of the embodiment of the invention extracts M frame-level features of each training video aiming at each training video in a video training set by acquiring the video training set, then extracts time domain appearance features of each training video by adopting an automatic encoder and clusters the time domain appearance features to obtain an anchor point feature set, then acquires time domain appearance neighbor features corresponding to each training video from the anchor point feature set aiming at each training video, codes each training video into corresponding depth expression according to the time domain appearance neighbor features by adopting a coding network, then converts the depth expression corresponding to each training video into a column of binary codes according to a full link layer using an activation function, and then reconstructs M reconstructed frame-level features corresponding to each training video according to the binary codes by adopting a decoding network, then, generating a reconstruction error function according to the frame level characteristics and the reconstruction frame level characteristics corresponding to each training video, generating a neighbor similarity error function according to the time domain appearance characteristics and the binary codes, and finally training the network to minimize the reconstruction error function and the neighbor similarity error function; the network comprises an encoding network, a full link layer and a decoding network. In the invention, the neighbors of the video are embedded into the coding network, so that in the process of coding the frame-level characteristics of the video, the content similar to the neighbors in the video is paid more attention, and the retrieval precision on a large-scale unsupervised video database can be further improved. In addition, by minimizing the reconstruction error and the neighbor similarity error, the perfect preservation of the neighbor structure in the Hamming space can be ensured, and the retrieval precision on the video database is further improved.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A hash learning method based on neighbor structure preservation is characterized by comprising the following steps:
s1, acquiring a video training set, and extracting M frame-level features of each training video in the video training set;
s2, extracting the time domain appearance characteristics of each training video by adopting an automatic encoder, and clustering the time domain appearance characteristics to obtain an anchor point characteristic set;
s3, acquiring time domain appearance neighbor characteristics corresponding to each training video from the anchor point characteristic set aiming at each training video;
s4, coding each training video into a corresponding depth expression according to the time domain appearance neighbor characteristic by adopting a coding network;
s5, converting the depth expression corresponding to each training video into a list of binary codes according to the full link layer using the activation function;
s6, reconstructing M reconstructed frame-level features corresponding to each training video according to the binary code by adopting a decoding network;
s7, generating a reconstruction error function according to the frame level characteristics and the reconstruction frame level characteristics corresponding to each training video, and generating a neighbor similarity error function according to the time domain appearance characteristics and the binary codes;
s8, training a network to minimize the reconstruction error function and minimize the neighbor similarity error function; wherein the network comprises the encoding network, the full link layer, and the transcoding network.
2. The method of claim 1, wherein each training video has a temporal appearance neighbor features, respectively
Figure FDA0002951578810000011
Wherein i is 1, 2, 3, …, and N is the number of training videos in the video training set; step S4 specifically includes:
s41, combining the column direction of a time domain appearance neighbor characteristics corresponding to each training video to obtain a first vector
Figure FDA0002951578810000012
S42, mapping the first vector into b-dimensional neighbor structure expression niWherein, in the step (A),
Figure FDA0002951578810000013
FC denotes full link layer mapping;
s43, inputting the first frame-level feature of each training video to the coding network at the first moment, and expressing n adjacent structureiEmbedded into b-dimensional memory shape in the following mannerIn the state:
Figure FDA0002951578810000014
wherein d is a fixed value, Wq、Wk、WvIn order to encode the parameter values of the network,
Figure FDA0002951578810000015
it is shown that the column direction is merged,
Figure FDA0002951578810000016
representing the frame-level feature, m, input corresponding to the first moment of the training videoi,1Representing the memory state corresponding to the first moment;
s44, when new frame-level characteristics are input into the coding network, the memory state is updated according to the following mode:
Figure FDA0002951578810000021
wherein 1 is<t≤M,
Figure FDA0002951578810000022
Representing the frame-level features of the input at the t-th instant, mi,tRepresents the memory state corresponding to the t-th time, mi,t-1The memory state corresponding to the t-1 th moment is shown;
the coding network is an LSTM network, and each operation unit in the coding network is as follows:
Figure FDA0002951578810000023
where MLP denotes multi-level mapping, BN denotes batch normalization, Wiv、Wih、Wfv、Wfh、Wov、WohTo express said compilationA parameter value of the code network, which indicates an inner product; wherein, the calculation mode of the sigma function is that sigma (x) is 1/(1+ e)-x);hi,t-1H represents the output of the hidden layer at the t-1 th timeitAn output representing the hidden layer at time t;
s45, outputting h from the hidden layer obtained at the last momenti,MAs a depth representation of the corresponding training video;
wherein the content of the first and second substances,
Figure FDA0002951578810000024
Figure FDA0002951578810000025
representing frame-level features of the corresponding training video and theta representing a parameter of the coding network.
3. The method of claim 2, wherein the deep representation of the corresponding training video is transformed according to a full link layer using an activation function, and the obtained list of binary codes is:
bi=sgn(ti);
wherein, ti=FC(hi,MK); FC denotes the full link layer map, sgn denotes the sign function, when tiWhen greater than 0, sgn (t)i) Is 1 when t isiWhen the value is less than or equal to 0, sgn (t)i) K represents the length of the sequence of binary codes for-1.
4. The method of claim 1, wherein the video training set includes N training videos,
the reconstruction error function is:
Figure FDA0002951578810000026
wherein the content of the first and second substances,
Figure FDA0002951578810000027
representing the mth frame-level feature in the ith training video,
Figure FDA0002951578810000028
representing the mth reconstructed frame-level feature in the ith training video, and l representing the dimension of the frame-level feature.
5. The method of claim 4, wherein the neighbor similarity error function is:
Figure FDA0002951578810000031
wherein s isijRepresenting the similarity between the temporal appearance features of the ith training video and the temporal appearance features of the jth training video, ti,tjAre respectively binary codes bi,bjK denotes the length of the binary code, and j is a positive integer not greater than N.
6. A neighbor structure preserving-based hash learning apparatus, the apparatus comprising:
the acquisition module is used for acquiring a video training set and extracting M frame-level features of each training video in the video training set;
the extraction module is used for extracting the time domain appearance characteristics of each training video by adopting an automatic encoder, and clustering the time domain appearance characteristics to obtain an anchor point characteristic set;
the acquisition module is further configured to acquire, for each training video, a time-domain appearance neighbor feature corresponding to each training video from the anchor point feature set;
the coding module is used for coding each training video into a corresponding depth expression according to the time domain appearance neighbor characteristic by adopting a coding network;
the conversion module is used for converting the depth expression corresponding to each training video into a row of binary codes according to the full link layer using the activation function;
the reconstruction module is used for reconstructing M reconstruction frame level characteristics corresponding to each training video according to the binary code by adopting a decoding network;
the generating module is used for generating a reconstruction error function according to the frame level characteristics and the reconstruction frame level characteristics corresponding to each training video and generating a neighbor similarity error function according to the time domain appearance characteristics and the binary codes;
a training module to train a network to minimize the reconstruction error function and to minimize the neighbor similarity error function; wherein the network comprises the encoding network, the full link layer, and the transcoding network.
7. The apparatus of claim 6, wherein each training video has a temporal appearance neighbor features, respectively
Figure FDA0002951578810000032
Wherein i is 1, 2, 3, …, and N is the number of training videos in the video training set; the encoding module is specifically configured to:
combining the column direction of a time domain appearance neighbor characteristics corresponding to each training video to obtain a first vector
Figure FDA0002951578810000033
Mapping the first vector to a b-dimensional neighbor structure expression niWherein, in the step (A),
Figure FDA0002951578810000034
FC denotes full link layer mapping;
for each training video, at a first time instant, inputting a first frame-level feature of each training video into the coding network, and expressing n adjacent structuresiEmbedding into the b-dimensional memory state as follows:
Figure FDA0002951578810000041
wherein d is a fixed value, Wq、Wk、WvIn order to encode the parameter values of the network,
Figure FDA0002951578810000042
it is shown that the column direction is merged,
Figure FDA0002951578810000043
representing the frame-level feature, m, input corresponding to the first moment of the training videoi,1Representing the memory state corresponding to the first moment;
when new frame-level features are input into the coding network, the memory state is updated as follows:
Figure FDA0002951578810000044
wherein 1 is<t≤M,
Figure FDA0002951578810000045
Representing the frame-level features of the input at the t-th instant, mi,tRepresents the memory state corresponding to the t-th time, mi,t-1The memory state corresponding to the t-1 th moment is shown;
the coding network is an LSTM network, and each operation unit in the coding network is as follows:
Figure FDA0002951578810000046
where MLP denotes multi-level mapping, BN denotes batch normalization, Wiv、Wih、Wfv、Wfh、Wov、WohParameter values indicating the coding network, <' > indicating inner products(ii) a Wherein, the calculation mode of the sigma function is that sigma (x) is 1/(1+ e)-x);hi,t-1H represents the output of the hidden layer at the t-1 th timeitAn output representing the hidden layer at time t;
outputting the hidden layer obtained at the last moment hi,MAs a depth representation of the corresponding training video; wherein the content of the first and second substances,
Figure FDA0002951578810000047
Figure FDA0002951578810000048
representing frame-level features of the corresponding training video and theta representing a parameter of the coding network.
8. The apparatus of claim 7, wherein the deep representation of the corresponding training video is transformed according to a full link layer using an activation function, and a list of binary codes is obtained by:
bi=sgn(ti);
wherein, ti=FC(hi,MK); FC denotes the full link layer map, sgn denotes the sign function, when tiWhen greater than 0, sgn (t)i) Is 1 when t isiWhen the value is less than or equal to 0, sgn (t)i) K represents the length of the sequence of binary codes for-1.
9. The apparatus of claim 6, wherein the video training set comprises N training videos,
the reconstruction error function is:
Figure FDA0002951578810000051
wherein the content of the first and second substances,
Figure FDA0002951578810000052
representing the mth frame-level feature in the ith training video,
Figure FDA0002951578810000053
representing the mth reconstructed frame-level feature in the ith training video, and l representing the dimension of the frame-level feature.
10. The apparatus of claim 8, wherein the neighbor similarity error function is:
Figure FDA0002951578810000054
wherein s isijRepresenting the similarity between the temporal appearance features of the ith training video and the temporal appearance features of the jth training video, ti,tjAre respectively binary codes bi,bjK denotes the length of the binary code, and j is a positive integer not greater than N.
CN201910264740.9A 2019-04-03 2019-04-03 Hash learning method and device based on neighbor structure keeping Active CN110069666B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910264740.9A CN110069666B (en) 2019-04-03 2019-04-03 Hash learning method and device based on neighbor structure keeping

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910264740.9A CN110069666B (en) 2019-04-03 2019-04-03 Hash learning method and device based on neighbor structure keeping

Publications (2)

Publication Number Publication Date
CN110069666A CN110069666A (en) 2019-07-30
CN110069666B true CN110069666B (en) 2021-04-06

Family

ID=67366914

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910264740.9A Active CN110069666B (en) 2019-04-03 2019-04-03 Hash learning method and device based on neighbor structure keeping

Country Status (1)

Country Link
CN (1) CN110069666B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112199520B (en) * 2020-09-19 2022-07-22 复旦大学 Cross-modal Hash retrieval algorithm based on fine-grained similarity matrix
CN113111836B (en) * 2021-04-25 2022-08-19 山东省人工智能研究院 Video analysis method based on cross-modal Hash learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012077818A1 (en) * 2010-12-10 2012-06-14 国立大学法人豊橋技術科学大学 Method for determining conversion matrix for hash function, hash-type approximation nearest neighbour search method using said hash function, and device and computer program therefor
CN103744973A (en) * 2014-01-11 2014-04-23 西安电子科技大学 Video copy detection method based on multi-feature Hash
CN107229757A (en) * 2017-06-30 2017-10-03 中国科学院计算技术研究所 The video retrieval method encoded based on deep learning and Hash
CN108304808A (en) * 2018-02-06 2018-07-20 广东顺德西安交通大学研究院 A kind of monitor video method for checking object based on space time information Yu depth network
CN108763481A (en) * 2018-05-29 2018-11-06 清华大学深圳研究生院 A kind of picture geographic positioning and system based on extensive streetscape data
CN109151501A (en) * 2018-10-09 2019-01-04 北京周同科技有限公司 A kind of video key frame extracting method, device, terminal device and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106777130B (en) * 2016-12-16 2020-05-12 西安电子科技大学 Index generation method, data retrieval method and device
CN109409208A (en) * 2018-09-10 2019-03-01 东南大学 A kind of vehicle characteristics extraction and matching process based on video
CN109299097B (en) * 2018-09-27 2022-06-21 宁波大学 Online high-dimensional data nearest neighbor query method based on Hash learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012077818A1 (en) * 2010-12-10 2012-06-14 国立大学法人豊橋技術科学大学 Method for determining conversion matrix for hash function, hash-type approximation nearest neighbour search method using said hash function, and device and computer program therefor
CN103744973A (en) * 2014-01-11 2014-04-23 西安电子科技大学 Video copy detection method based on multi-feature Hash
CN107229757A (en) * 2017-06-30 2017-10-03 中国科学院计算技术研究所 The video retrieval method encoded based on deep learning and Hash
CN108304808A (en) * 2018-02-06 2018-07-20 广东顺德西安交通大学研究院 A kind of monitor video method for checking object based on space time information Yu depth network
CN108763481A (en) * 2018-05-29 2018-11-06 清华大学深圳研究生院 A kind of picture geographic positioning and system based on extensive streetscape data
CN109151501A (en) * 2018-10-09 2019-01-04 北京周同科技有限公司 A kind of video key frame extracting method, device, terminal device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"二值表示学习及其应用";鲁继文;《模式识别与人工智能》;20180131;第31卷(第1期);第12-21页 *

Also Published As

Publication number Publication date
CN110069666A (en) 2019-07-30

Similar Documents

Publication Publication Date Title
AU2019360080B2 (en) Image captioning with weakly-supervised attention penalty
US20200104640A1 (en) Committed information rate variational autoencoders
EP3298576A1 (en) Generative methods of super resolution
CN112509555B (en) Dialect voice recognition method, device, medium and electronic equipment
CN112418292B (en) Image quality evaluation method, device, computer equipment and storage medium
Cascianelli et al. Full-GRU natural language video description for service robotics applications
CN111932546A (en) Image segmentation model training method, image segmentation method, device, equipment and medium
CN110990596B (en) Multi-mode hash retrieval method and system based on self-adaptive quantization
CN110069666B (en) Hash learning method and device based on neighbor structure keeping
CN115687571B (en) Depth unsupervised cross-modal retrieval method based on modal fusion reconstruction hash
US20220309292A1 (en) Growing labels from semi-supervised learning
CN110188827A (en) A kind of scene recognition method based on convolutional neural networks and recurrence autocoder model
CN115083435B (en) Audio data processing method and device, computer equipment and storage medium
CN114596456B (en) Image set classification method based on aggregated hash learning
CN116543351A (en) Self-supervision group behavior identification method based on space-time serial-parallel relation coding
US20230252993A1 (en) Visual speech recognition for digital videos utilizing generative adversarial learning
CN115775350A (en) Image enhancement method and device and computing equipment
CN115880556B (en) Multi-mode data fusion processing method, device, equipment and storage medium
Ma et al. Partial hash update via hamming subspace learning
CN116168394A (en) Image text recognition method and device
CN115965833A (en) Point cloud sequence recognition model training and recognition method, device, equipment and medium
CN113704466B (en) Text multi-label classification method and device based on iterative network and electronic equipment
US20220058842A1 (en) Generating handwriting via decoupled style descriptors
CN114913358B (en) Medical hyperspectral foreign matter detection method based on automatic encoder
CN115129713A (en) Data retrieval method, data retrieval device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant