CN111930972B - Cross-modal retrieval method and system for multimedia data by using label level information - Google Patents


Info

Publication number
CN111930972B
CN111930972B (application CN202010771701.0A)
Authority
CN
China
Prior art keywords
layer
objective function
multimedia data
hash
hash code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010771701.0A
Other languages
Chinese (zh)
Other versions
CN111930972A (en)
Inventor
罗昕
詹雨薇
许信顺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University
Priority to CN202010771701.0A
Publication of CN111930972A
Application granted
Publication of CN111930972B
Legal status: Active

Classifications

    • G: Physics
    • G06: Computing; calculating or counting
    • G06F: Electric digital data processing
    • G06F 16/00: Information retrieval; database structures therefor; file system structures therefor
    • G06F 16/40: Information retrieval of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/43: Querying
    • G06F 16/432: Query formulation
    • G: Physics
    • G06: Computing; calculating or counting
    • G06N: Computing arrangements based on specific computational models
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: Physics
    • G06: Computing; calculating or counting
    • G06N: Computing arrangements based on specific computational models
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Abstract

The invention discloses a method and a system for cross-modal retrieval of multimedia data using label hierarchy information. The method comprises: acquiring first-modality multimedia data to be retrieved; performing feature extraction on the first-modality multimedia data to obtain a first hash code; computing distances between the first hash code and the pre-stored, known hash codes of all multimedia data in a second modality; and selecting the second-modality multimedia data corresponding to the several nearest hash codes and outputting it as the retrieval result.

Description

Cross-modal retrieval method and system for multimedia data by using label level information
Technical Field
The present application relates to the field of cross-media retrieval technologies, and in particular, to a method and a system for cross-modal retrieval of multimedia data using tag hierarchy information.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
With the explosive growth of multimedia data, data are often represented in multiple modalities, such as images and text. Managing and using such massive data requires fast similarity comparison as a fundamental operation, so the demand for fast cross-modal retrieval keeps increasing. To meet this need, cross-modal hashing methods, which use data of one modality to retrieve similar samples in another modality, have been proposed in succession.
Cross-modal hash learning is a form of hash learning and inherits its advantages. Hash learning is currently one of the most popular approaches to storing and retrieving large-scale data. Its main idea is to learn a hash function that transforms data from the original representation in a high-dimensional feature space into binary codes in a low-dimensional Hamming space. This transformation reduces dimensionality and thus effectively reduces space consumption, while the hash function preserves the similarity between data items. In addition, representing data as binary codes makes retrieval fast, because computers can compare pairs of binary codes very efficiently, essentially with an XOR followed by a popcount.
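For intuition only, the following minimal Python sketch (our illustration, not part of the patent) packs ±1 codes into integers so that one Hamming distance costs one XOR plus a bit count:

import numpy as np

# Illustrative only: pack a ±1 hash code into an integer bit string.
def pack_signs(code):
    bits = (np.asarray(code) > 0).astype(np.uint8)
    return int("".join(map(str, bits)), 2)

# Hamming distance between two packed codes: XOR, then count set bits.
def hamming(a, b):
    return bin(a ^ b).count("1")

rng = np.random.default_rng(0)
query = pack_signs(rng.choice([-1, 1], size=8))
database = [pack_signs(rng.choice([-1, 1], size=8)) for _ in range(5)]
print([hamming(query, c) for c in database])

This is exactly why binary codes allow fast pairwise comparison at scale.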
Existing cross-modal hash learning methods can be divided along several lines. According to whether supervised information is used, they split into unsupervised and supervised cross-modal hashing. Compared with unsupervised methods, supervised cross-modal hashing models can exploit semantic information, so the learned hash functions produce higher-quality hash code representations. According to whether deep learning is used for feature extraction and hash function learning, they split into deep and non-deep cross-modal hashing. Non-deep methods learn hash functions and hash code representations from hand-crafted features, separating the feature extraction stage from the hash learning stage, which can lead to sub-optimal results. Deep cross-modal hashing integrates feature extraction and hash learning into one framework, so the two stages reinforce each other and learning quality improves.
In many real datasets, such as CIFAR-100 and ImageNet, the semantic labels attached to the data usually carry a hierarchical structure. As a simple example of a three-layer hierarchy: both "motorcycle" and "truck" belong to the parent label "vehicle", while "vehicle" and "boat" in turn belong to a common higher-level parent. Such hierarchies contain a great deal of useful information; if it is fully mined during learning, the learning effect improves markedly and retrieval accuracy rises. However, most of the current prior art ignores the hierarchical information of labels. The inventors found that, although a very few methods attempt to use this hierarchical information during learning, these cross-modal hashing methods share the following shortcoming: they generate hierarchical hash codes for each level of the label hierarchy without considering the correlation information between cross-level labels.
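As a purely illustrative sketch (the label names below are invented for this example, not taken from CIFAR-100 or ImageNet), a three-layer hierarchy and the cross-level ancestor relation it induces can be modeled as:

# Hypothetical three-layer label hierarchy: child -> parent.
parent = {
    "motorcycle": "vehicle", "truck": "vehicle", "boat": "watercraft",  # leaf layer
    "vehicle": "transportation", "watercraft": "transportation",       # middle layer
}

# Collect every cross-level ancestor of a leaf label.
def ancestors(label):
    chain = []
    while label in parent:
        label = parent[label]
        chain.append(label)
    return chain

print(ancestors("motorcycle"))  # ['vehicle', 'transportation']

It is precisely this ancestor relation across levels that the method below embeds as supervision.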
Disclosure of Invention
In order to overcome the deficiencies of the prior art, the present application provides a multimedia data cross-modal retrieval method and system using label hierarchy information.
in a first aspect, the present application provides a cross-modal retrieval method for multimedia data using tag level information;
the cross-modal retrieval method of the multimedia data by using the label hierarchy information comprises the following steps:
acquiring first-modality multimedia data to be retrieved;
performing feature extraction on the first-modality multimedia data to be retrieved to obtain a first hash code;
computing distances between the first hash code and the pre-stored, known hash codes of all multimedia data in the second modality; then selecting the second-modality multimedia data corresponding to the several nearest hash codes and outputting it as the retrieval result.
In a second aspect, the present application provides a cross-modality retrieval system for multimedia data using tag level information;
a multimedia data cross-modal retrieval system using tag hierarchy information, comprising:
an acquisition module configured to: acquire first-modality multimedia data to be retrieved;
a feature extraction module configured to: perform feature extraction on the first-modality multimedia data to be retrieved to obtain a first hash code;
a retrieval output module configured to: compute distances between the first hash code and the pre-stored, known hash codes of all multimedia data in the second modality; then select the second-modality multimedia data corresponding to the several nearest hash codes and output it as the retrieval result.
In a third aspect, the present application further provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs are stored in the memory, and when the electronic device is running, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first aspect.
In a fourth aspect, the present application also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the method of the first aspect.
In a fifth aspect, the present application also provides a computer program product comprising a computer program which, when run on one or more processors, implements the method of any one of the preceding first aspects.
Compared with the prior art, the beneficial effects of this application are:
1. The application provides a novel deep-learning-based cross-modal hash retrieval method that exploits label hierarchy information; the model performs feature learning and hash learning simultaneously with two parallel neural networks, one per modality.
2. The method makes full use of the hierarchical information in the labels and generates hash codes under the joint supervision of per-layer similarity and cross-layer correlation.
3. An iterative optimization algorithm that learns the discrete hash codes directly is used, which effectively guarantees the quality of the learned hash codes.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a flow chart of a method of the first embodiment;
FIG. 2 is an overall block diagram of the first embodiment.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit example embodiments according to the present application. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and "comprising", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the embodiment of the present application, "and/or" is only one kind of association relation describing an association object, and means that three kinds of relations may exist. For example, a and/or B, may represent: a exists alone, A and B exist simultaneously, and B exists alone. In addition, in the description of the present application, "a plurality" means two or more than two.
In addition, to facilitate a clear description of the technical solutions, the terms "first" and "second" are used in the embodiments of the present application to distinguish identical or similar items having substantially the same functions and actions. Those skilled in the art will appreciate that these words do not limit quantity or order of execution and do not imply that the items necessarily differ.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
In order to overcome the deficiencies of existing methods and to effectively use the hierarchical information of labels for efficient and accurate similarity retrieval, the invention provides a deep-learning-based cross-modal hash retrieval method and system that exploit label hierarchy information. The method makes full use of the supervised information of the data and applies deep learning to feature extraction and hash learning, focusing on two questions: 1) how to fully utilize the semantic information of the data, i.e., the labels of each layer in the supervised information, to improve the quality of hash learning; 2) how to fully mine the hierarchical structure within the semantic information, i.e., the association information between different layers of labels, to further improve learning quality. Specifically, a CNN and an MLP are used to obtain semantic features of the image and text modalities respectively, and both the per-layer label information and the cross-layer information serve as supervision to guide hash learning.
Example one
The embodiment provides a multimedia data cross-modal retrieval method using label hierarchy information;
as shown in fig. 1, the method for cross-modal retrieval of multimedia data using tag hierarchy information includes:
s101: acquiring first modal multimedia data to be retrieved;
s102: performing feature extraction on first modal multimedia data to be retrieved to obtain a first hash code;
s103: distance calculation is carried out on the first hash code and known hash codes corresponding to all pre-stored multimedia data in the second mode; and selecting the multimedia data of the second modality corresponding to the plurality of hash codes with the closest distance, and outputting the multimedia data as a retrieval result.
As one or more embodiments, in S101, the first-modality multimedia data to be retrieved includes, but is not limited to: one or more of text data, image data, audio data, or video data.
As one or more embodiments, in S103, the multimedia data of the second modality includes, but is not limited to: one or more of text data, image data, audio data, or video data.
As one or more embodiments, in S102, feature extraction is performed on the first-modality multimedia data to be retrieved to obtain the first hash code; the specific steps are as follows:
when the first-modality multimedia data to be retrieved is image data, performing feature extraction on it with a pre-trained convolutional neural network (CNN) to obtain the first hash code;
alternatively,
when the first-modality multimedia data to be retrieved is text data, performing feature extraction on it with a pre-trained multi-layer perceptron (MLP) model to obtain the first hash code.
Further, the convolutional neural network CNN comprises: a first input layer, first to fifth convolutional layers, and first to third fully connected layers, connected in sequence; the number of neurons of the third fully connected layer is equal to the length of the first hash code.
Further, the multi-layer perceptron MLP model comprises: a second input layer, a fourth fully connected layer and a fifth fully connected layer, connected in sequence; the number of nodes of the fifth fully connected layer is equal to the length of the first hash code.
Further, the training of the pre-trained convolutional neural network CNN specifically comprises:
constructing a first training set, the first training set being image data with known image category labels whose corresponding hash codes are known; inputting the first training set into the CNN and training it to obtain the pre-trained CNN.
Further, the training of the pre-trained MLP model specifically comprises:
constructing a second training set, the second training set being text data with known text category labels whose corresponding hash codes are known; inputting the second training set into the multi-layer perceptron MLP model and training it to obtain the pre-trained MLP model.
Further, the steps of obtaining the hash codes corresponding to the image category labels and/or the text category labels comprise:
embedding the similarity between instances and the category labels of each layer in the label hierarchy into a first objective function, which is used to preserve the semantic similarity of each layer;
embedding the similarity between cross-layer category labels in the label hierarchy into a second objective function, which is used to preserve cross-layer association information;
integrating the first and second objective functions into a final objective function, which enables the hash codes learned from the known category labels to preserve both the per-layer similarity and the cross-layer association degree;
optimizing and solving the final objective function to obtain the hash codes corresponding to the image category labels and to the text category labels.
Illustratively, the first objective function is:

min_{B, Y^k} Σ_{k=1}^{K} α_k ‖ rL^k − (Y^k)^T B ‖_F^2,
s.t. B ∈ {−1,1}^{r×n}, Y^k ∈ {−1,1}^{r×c_k},

where B is the hash code matrix of all samples, Y^k is the class hash code matrix of the k-th layer, L^k is the label matrix of the k-th layer, r is the hash code length, and α_k is the confidence weight of the k-th layer.
Illustratively, the second objective function is:

min_{Y^1,…,Y^K} Σ_{k=1}^{K−1} β_k ‖ rS^{kK} − (Y^k)^T Y^K ‖_F^2,
s.t. Y^k ∈ {−1,1}^{r×c_k}, k ∈ {1,…,K},

where Y^k is the class hash code matrix of the k-th layer, S^{kK} is the cross-layer similarity between layer k and the finest layer K, and β_k balances the importance of the k-th layer, analogously to α_k.
Illustratively, the final objective function is:

min_{B, Y^k} Σ_{k=1}^{K} α_k ‖ rL^k − (Y^k)^T B ‖_F^2 + η Σ_{k=1}^{K−1} β_k ‖ rS^{kK} − (Y^k)^T Y^K ‖_F^2,
s.t. B ∈ {−1,1}^{r×n}, Y^k ∈ {−1,1}^{r×c_k},

where η is a hyperparameter and k ranges over {1,…,K−1} in the second sum.
Illustratively, the network parameters in the final objective function are optimized using the error back-propagation (BP) algorithm.
The work of the present application is described using data of two modalities, images and text; the model extends readily to other modality types. The overall framework is shown in fig. 2. The designed model consists of two main components: a feature extraction module and a hash code learning module. In the feature extraction part, a CNN extracts the features of the image modality and an MLP obtains the features of the text modality. In the hash code learning part, the application first introduces the definitions of the per-layer instance-to-class similarity and the cross-layer class-to-class similarity. This information is then embedded into the objective function as supervision, yielding an intermediate product: the hash codes corresponding to the class labels. Finally, these intermediate products are used to generate hash codes that preserve both defined similarities simultaneously. The technical content is described in detail below in four parts.
a) Feature extraction module. The application employs two deep neural networks to learn features, one for the image modality and the other for the text modality.
For the image modality, a CNN-F model is adopted as the feature learning model and initialized with weights pre-trained on ImageNet. Specifically, CNN-F consists of five convolutional layers, "conv1"-"conv5", and three fully connected layers, "fc6"-"fc8". The application modifies the number of neurons in the fc8 layer to the length of the hash code, and the activation function of the fc8 layer is set to tanh. Let f(x_i; θ_x) denote the output of the CNN-F network, where x_i is the input of the network and θ_x the parameters of the CNN.
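A rough PyTorch stand-in for this image branch is sketched below; the layer sizes are illustrative guesses rather than the exact CNN-F configuration, and no ImageNet weights are loaded here:

import torch
import torch.nn as nn

class ImageHashNet(nn.Module):
    # Five conv layers + three FC layers; fc8 is resized to the hash
    # length r and activated with tanh, as described above.
    def __init__(self, hash_len=32):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(3, 64, 11, stride=4), nn.ReLU(), nn.MaxPool2d(2),    # conv1
            nn.Conv2d(64, 256, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),  # conv2
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),                  # conv3
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),                  # conv4
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # conv5
        )
        self.fcs = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(),       # fc6
            nn.Linear(4096, 4096), nn.ReLU(),     # fc7
            nn.Linear(4096, hash_len), nn.Tanh(), # fc8: r outputs in (-1, 1)
        )

    def forward(self, x):                 # x: (batch, 3, 224, 224)
        return self.fcs(self.convs(x))    # f(x; θ_x)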
For the text modality, the application designs a multi-layer perceptron (MLP) model consisting of two fully connected layers. Again, the number of nodes of the second layer is set to the length of the hash code, and the activation functions of the first and second layers are ReLU and tanh, respectively. The input to the MLP network is t_i, the bag-of-words representation of the i-th text; the output is g(t_i; θ_t), where θ_t are the parameters of this deep neural network.
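A minimal sketch of this text branch follows; the hidden width (8192) is an assumption for illustration, while the BoW input, the ReLU/tanh activations, and the r-node output follow the description above:

import torch.nn as nn

class TextHashNet(nn.Module):
    def __init__(self, vocab_size, hash_len=32, hidden=8192):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vocab_size, hidden), nn.ReLU(),  # layer 1: ReLU
            nn.Linear(hidden, hash_len), nn.Tanh(),    # layer 2: r nodes, tanh
        )

    def forward(self, t):    # t: (batch, vocab_size) bag-of-words vectors
        return self.net(t)   # g(t; θ_t)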
b) Hash code learning module. The module first defines the per-layer instance-to-class similarity between instances and class labels, and the cross-layer class-wise similarity between class labels of different layers. This information is then embedded into the objective function as supervision, yielding class-wise hash codes corresponding to the class labels. Finally, the class hash codes are used to generate hash codes that preserve both defined similarities simultaneously.
① Per-layer similarity in the label hierarchy. The class labels have a hierarchical structure; each layer (say, the k-th) has a layer-level label matrix L^k ∈ {0,1}^{c_k×n}, where c_k is the number of labels at the k-th layer. For hashing methods that use supervised information, the most important goal is to preserve semantic similarity in Hamming space; in other words, instances of the same class should have similar hash codes. The application generates a hash code for each class as global information to guide hash code learning. Under the assumption that the hash code of an instance in a class should be similar to that class's hash code, the label matrix L^k can be regarded as an instance-to-class similarity, approximated by the inner product between the layer's class hash codes and the hash codes of all samples. Let Y^k ∈ {−1,1}^{r×c_k} denote the class hash codes of the k-th layer, whose columns are the hash codes of the individual classes at that layer. To preserve the semantic similarity of each layer, the application defines the following objective function:

min_{B, Y^k} Σ_{k=1}^{K} α_k ‖ rL^k − (Y^k)^T B ‖_F^2,
s.t. B ∈ {−1,1}^{r×n}, Y^k ∈ {−1,1}^{r×c_k},

where B is the hash code matrix of all samples, Y^k is the class hash code matrix of the k-th layer, and α_k is the confidence weight of the k-th layer.
In order to fully couple feature learning and hash code learning, the application introduces the outputs of the two networks into the above formula, giving the optimization problem:

min_{B, Y^k, θ_x, θ_t} Σ_{k=1}^{K} α_k ( ‖ rL^k − (Y^k)^T F ‖_F^2 + ‖ rL^k − (Y^k)^T G ‖_F^2 ) + μ ( ‖ B − F ‖_F^2 + ‖ B − G ‖_F^2 ),
s.t. B ∈ {−1,1}^{r×n}, Y^k ∈ {−1,1}^{r×c_k},

where F = f(X; θ_x), G = g(T; θ_t), and μ is a trade-off parameter. In particular, the second term treats F and G as real-valued relaxations of B.
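Under the reconstruction above (the patent's exact weight settings are only given as formula images, so α_k and μ here are assumed inputs), this loss can be evaluated in NumPy as:

import numpy as np

def per_layer_loss(B, F, G, Y, L, alpha, mu, r):
    # Y[k]: r x c_k class hash codes; L[k]: c_k x n label matrix of layer k;
    # F, G: r x n real-valued network outputs; B: r x n binary codes.
    loss = 0.0
    for k in range(len(Y)):
        loss += alpha[k] * (np.linalg.norm(r * L[k] - Y[k].T @ F) ** 2
                            + np.linalg.norm(r * L[k] - Y[k].T @ G) ** 2)
    loss += mu * (np.linalg.norm(B - F) ** 2 + np.linalg.norm(B - G) ** 2)
    return loss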
② Cross-layer similarity in the label hierarchy. For the hierarchy in the label information, the correlation information between different layers is also valuable. To capture cross-layer dependencies, a cross-layer class-wise similarity is defined, which measures the similarity between two classes from different layers:

S^{kK}_{ij} = 1 if i_k ∈ Ancestor(j_K), and S^{kK}_{ij} = −1 otherwise,

where k ∈ {1,…,K−1}, i_k denotes the i-th class of the k-th layer, and Ancestor(j_K) denotes the set of ancestor nodes of the j-th category at the K-th layer. S^{kK}_{ij} = 1 thus means that the i-th class at the k-th layer is an ancestor node of the j-th class at the K-th layer.
From this definition, the cross-layer association information is anchored at the finest-granularity layer, i.e., the K-th layer, because this layer contains the most categories, and its labels describe samples more precisely than those of the other layers.
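A small sketch of building S^{kK} from an ancestor relation, following the +1/−1 definition above (is_ancestor is an assumed helper, not a name from the patent):

import numpy as np

def build_cross_layer_similarity(c_k, c_K, is_ancestor):
    # S[i, j] = +1 if class i at layer k is an ancestor of class j at layer K.
    S = -np.ones((c_k, c_K))
    for i in range(c_k):
        for j in range(c_K):
            if is_ancestor(i, j):
                S[i, j] = 1.0
    return S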
Then, the application defines the following objective function to preserve the cross-layer association information:

min_{Y^1,…,Y^K} Σ_{k=1}^{K−1} β_k ‖ rS^{kK} − (Y^k)^T Y^K ‖_F^2,
s.t. Y^k ∈ {−1,1}^{r×c_k}, k ∈ {1,…,K},

where Y^k is the class hash code matrix of the k-th layer and β_k balances the importance of the k-th layer, analogously to α_k.
③ Merging ① and ②. So that the learned hash codes preserve both the per-layer similarity and the cross-layer association degree, the objective functions of the two preceding parts are integrated into the final objective function:

min_{B, Y^k, θ_x, θ_t} Σ_{k=1}^{K} α_k ( ‖ rL^k − (Y^k)^T F ‖_F^2 + ‖ rL^k − (Y^k)^T G ‖_F^2 ) + η Σ_{k=1}^{K−1} β_k ‖ rS^{kK} − (Y^k)^T Y^K ‖_F^2 + μ ( ‖ B − F ‖_F^2 + ‖ B − G ‖_F^2 ),
s.t. B ∈ {−1,1}^{r×n}, Y^k ∈ {−1,1}^{r×c_k},

where η is a hyperparameter and k ranges over {1,…,K−1} in the second sum.
c) Optimization process.
As can be seen from the objective function, there are five groups of variables to optimize: B, Y^k (k ∈ {1,…,K−1}), Y^K, θ_x, and θ_t. Like most deep cross-modal hash retrieval methods, the application minimizes the loss function by iterative optimization: only one variable is optimized at a time while the others are held fixed. The specific strategy is as follows:
Step 1: fix the variables B, Y^k, and Y^K, and update θ_x and θ_t. The application uses the BP algorithm to update the parameters θ_x and θ_t.
Step 2: fix the variables Y^k, Y^K, θ_x, and θ_t, and update B. With the other variables fixed, the objective function can be rewritten as:

min_B ‖ B − F ‖_F^2 + ‖ B − G ‖_F^2, s.t. B ∈ {−1,1}^{r×n}.

This formula has the closed-form solution

B = sign(F + G).
Step 3: fix the variables B, Y^k (k ∈ {1,…,K−1}), θ_x, and θ_t, and update Y^K. The overall objective function is rewritten as:

min_{Y^K} α_K ( ‖ rL^K − (Y^K)^T F ‖_F^2 + ‖ rL^K − (Y^K)^T G ‖_F^2 ) + η Σ_{k=1}^{K−1} β_k ‖ rS^{kK} − (Y^k)^T Y^K ‖_F^2,
s.t. Y^K ∈ {−1,1}^{r×c_K}.

Expanding the above formula and omitting the constant terms gives:

min_{Y^K} α_K ( ‖ (Y^K)^T F ‖_F^2 + ‖ (Y^K)^T G ‖_F^2 ) + η Σ_{k=1}^{K−1} β_k ‖ (Y^k)^T Y^K ‖_F^2 − 2 Tr((Y^K)^T Q^K),

where Q^K = r ( α_K (F + G)(L^K)^T + η Σ_{k=1}^{K−1} β_k Y^k S^{kK} ).

Thereafter, the application uses a discrete cyclic coordinate descent algorithm to obtain Y^K: one row is solved in closed form while the other rows are fixed. In other words, the class hash codes of the K-th layer can be generated discretely, bit by bit. Let (y^k)^T denote the l-th row of Y^k, with l ∈ {1,…,r} and k ∈ {1,…,K}, and let (Y^k)′ denote the matrix Y^k with the row (y^k)^T removed; y^k thus collects one bit of all class hash codes of the k-th layer. Similarly, f^T is the l-th row of F and F′ is F without f^T; g^T is the l-th row of G and G′ is G without g^T.
Then the first term of the above formula is taken as an example:
Figure BDA0002616878540000134
wherein the content of the first and second substances,
Figure BDA0002616878540000135
similarly, other items may be written as,
Figure BDA0002616878540000136
Figure BDA0002616878540000137
Tr((YK)TQK)=(qK)TyK+const.
The objective function can then be rewritten as:

min_{y^K} (y^K)^T ( α_K (Y^K)′^T (F′ f + G′ g) + η Σ_{k=1}^{K−1} β_k (Y^K)′^T (Y^k)′ y^k ) − (q^K)^T y^K,
s.t. y^K ∈ {−1,1}^{c_K}.

This problem has the closed-form solution:

y^K = sign( q^K − α_K (Y^K)′^T (F′ f + G′ g) − η Σ_{k=1}^{K−1} β_k (Y^K)′^T (Y^k)′ y^k ).
Evidently, each row y^K is computed from the remaining rows (Y^K)′, so the application can update the bits iteratively until convergence and thereby obtain a better Y^K.
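A NumPy sketch of this row-wise discrete update is given below; it assumes the closed form reconstructed above and treats Q^K, the weights, and the fixed matrices as given:

import numpy as np

def update_YK(YK, Q_K, F, G, Ys, alpha_K, beta, eta, n_sweeps=5):
    # YK: r x c_K; F, G: r x n; Ys: list of fixed Y^k (r x c_k), k = 1..K-1.
    r = YK.shape[0]
    for _ in range(n_sweeps):                 # iterate until (near) convergence
        for l in range(r):
            mask = np.arange(r) != l
            YKp, Fp, Gp = YK[mask], F[mask], G[mask]   # (Y^K)', F', G'
            f, g = F[l], G[l]                          # rows f^T, g^T
            rhs = Q_K[l] - alpha_K * (YKp.T @ (Fp @ f + Gp @ g))
            for k, Yk in enumerate(Ys):
                rhs -= eta * beta[k] * (YKp.T @ (Yk[mask] @ Yk[l]))
            YK[l] = np.where(rhs >= 0, 1.0, -1.0)      # y^K = sign(...)
    return YK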
Step 4: fix B, Y^K, θ_x, and θ_t, and update Y^k. The objective function is rewritten as:

min_{Y^k} α_k ( ‖ rL^k − (Y^k)^T F ‖_F^2 + ‖ rL^k − (Y^k)^T G ‖_F^2 ) + η β_k ‖ rS^{kK} − (Y^k)^T Y^K ‖_F^2,
s.t. Y^k ∈ {−1,1}^{r×c_k}.

Since the optimization process is analogous to solving for Y^K, the solution for Y^k is given directly:

y^k = sign( q^k − α_k (Y^k)′^T (F′ f + G′ g) − η β_k (Y^k)′^T (Y^K)′ y^K ),

where k ∈ {1,…,K−1}, (q^k)^T is the l-th row of Q^k, and Q^k = r ( α_k (F + G)(L^k)^T + η β_k Y^K (S^{kK})^T ).
d) Processing of query data: generating the hash representation of a query sample.
When a sample outside the training set serves as query data, a corresponding hash code b_q must be generated. Since the problem is cross-modal retrieval and the two modalities image and text are taken as examples, the two cases are analyzed separately.
If the query sample is an image, the application feeds it into the CNN-F network and generates the hash code as:

b_q = h_x(x_q) = sign(f(x_q; θ_x)).

If the query sample is the bag-of-words feature vector of a text, the application feeds it into the MLP network and generates the hash code as:

b_q = h_t(t_q) = sign(g(t_q; θ_t)).
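Both cases can be wrapped in one hedged helper (image_net and text_net stand for the trained f(·; θ_x) and g(·; θ_t); the function name is ours, not the patent's):

import numpy as np

def hash_query(query, modality, image_net, text_net):
    if modality == "image":
        return np.sign(image_net(query))   # b_q = sign(f(x_q; θ_x))
    if modality == "text":
        return np.sign(text_net(query))    # b_q = sign(g(t_q; θ_t))
    raise ValueError("unsupported modality")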
the specific implementation steps are as follows:
First, obtain the original pictures and original texts of the training set, and represent each text with a BoW vector.
Second, input the training samples into the feature extraction module to extract picture and text features respectively, and then learn the hash codes of the training-set samples in the hash code learning stage. The model parameters are obtained by learning, i.e., by minimizing the loss function.
Third, fix the model parameters, use the model to obtain the hash codes of all samples, and store them in a database for later use.
Fourth, at retrieval time, first generate the hash code of a query sample (of one modality), then compare it with the hash codes of all samples stored in the database, find the N samples (N is user-defined) with the smallest Hamming distance, and return the data of the other modality corresponding to those samples.
Example two
The embodiment provides a multimedia data cross-modal retrieval system using tag hierarchy information;
a multimedia data cross-modal retrieval system using tag hierarchy information, comprising:
an acquisition module configured to: acquire first-modality multimedia data to be retrieved;
a feature extraction module configured to: perform feature extraction on the first-modality multimedia data to be retrieved to obtain a first hash code;
a retrieval output module configured to: compute distances between the first hash code and the pre-stored, known hash codes of all multimedia data in the second modality; then select the second-modality multimedia data corresponding to the several nearest hash codes and output it as the retrieval result.
It should be noted here that the acquisition module, feature extraction module and retrieval output module correspond to steps S101 to S103 of the first embodiment; the examples and application scenarios realized by these modules are the same as those of the corresponding steps, but are not limited to the disclosure of the first embodiment. It should also be noted that the above modules, as part of a system, may be implemented in a computer system such as a set of computer-executable instructions.
In the foregoing embodiments, the descriptions of the embodiments have different emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The proposed system may be implemented in other ways. For example, the above-described system embodiments are merely illustrative; the division into modules is merely a logical functional division, and other divisions are possible in actual implementation: multiple modules may be combined or integrated into another system, or some features may be omitted or not executed.
EXAMPLE III
The present embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein, a processor is connected with the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first embodiment.
It should be understood that in this embodiment the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor or any conventional processor.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software.
The method of the first embodiment may be implemented directly by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may reside in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, details are not repeated here.
Those of ordinary skill in the art will appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or as a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints of the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as departing from the scope of the present application.
Example four
The present embodiments also provide a computer-readable storage medium for storing computer instructions, which when executed by a processor, perform the method of the first embodiment.
The above description is only a preferred embodiment of the present application and is not intended to limit it; various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present application shall be included in its scope of protection.

Claims (9)

1. A cross-modal retrieval method for multimedia data using label level information, characterized by comprising:
acquiring first-modality multimedia data to be retrieved;
performing feature extraction on the first-modality multimedia data to be retrieved to obtain a first hash code; when the first-modality multimedia data to be retrieved is image data, performing feature extraction on it with a pre-trained convolutional neural network (CNN) to obtain the first hash code; the training of the pre-trained CNN specifically comprising:
constructing a first training set, the first training set being image data with known image category labels whose corresponding hash codes are known; inputting the first training set into the CNN and training it to obtain the pre-trained CNN;
the step of obtaining the hash codes corresponding to the image category labels comprising:
embedding the similarity between instances and the category labels of each layer in the label hierarchy into a first objective function, the first objective function being used to preserve the semantic similarity of each layer;
embedding the similarity between cross-layer category labels in the label hierarchy into a second objective function, the second objective function being used to preserve cross-layer association information;
integrating the first objective function and the second objective function into a final objective function, the final objective function enabling the hash codes learned from the known category labels to preserve both the per-layer similarity and the cross-layer association degree;
optimizing and solving the final objective function to obtain the hash codes corresponding to the image category labels;
computing distances between the first hash code and the pre-stored, known hash codes of all multimedia data in a second modality; and selecting the second-modality multimedia data corresponding to the several nearest hash codes and outputting it as the retrieval result.
2. The method according to claim 1, wherein performing feature extraction on the first-modality multimedia data to be retrieved to obtain the first hash code specifically comprises:
when the first-modality multimedia data to be retrieved is text data, performing feature extraction on it with a pre-trained multi-layer perceptron (MLP) model to obtain the first hash code.
3. The method of claim 1, wherein the convolutional neural network CNN comprises: a first input layer, first to fifth convolutional layers, and first to third fully connected layers, connected in sequence, wherein the number of neurons of the third fully connected layer is equal to the length of the first hash code.
4. The method of claim 2, wherein the multi-layer perceptron MLP model comprises: a second input layer, a fourth fully connected layer and a fifth fully connected layer, connected in sequence, wherein the number of nodes of the fifth fully connected layer is equal to the length of the first hash code.
5. The method as claimed in claim 2, wherein the training of the pre-trained MLP model comprises:
constructing a second training set, the second training set being text data with known text category labels whose corresponding hash codes are known; inputting the second training set into the multi-layer perceptron MLP model and training it to obtain the pre-trained MLP model.
6. The method as claimed in claim 5, wherein the step of obtaining the hash codes corresponding to the text category labels comprises:
embedding the similarity between instances and the category labels of each layer in the label hierarchy into a first objective function, the first objective function being used to preserve the semantic similarity of each layer;
embedding the similarity between cross-layer category labels in the label hierarchy into a second objective function, the second objective function being used to preserve cross-layer association information;
integrating the first objective function and the second objective function into a final objective function, the final objective function enabling the hash codes learned from the known category labels to preserve both the per-layer similarity and the cross-layer association degree;
optimizing and solving the final objective function to obtain the hash codes corresponding to the text category labels.
7. A multimedia data cross-modal retrieval system using label hierarchy information, characterized by comprising:
an acquisition module configured to: acquire first-modality multimedia data to be retrieved;
a feature extraction module configured to: perform feature extraction on the first-modality multimedia data to be retrieved to obtain a first hash code; when the first-modality multimedia data to be retrieved is image data, perform feature extraction on it with a pre-trained convolutional neural network (CNN) to obtain the first hash code; the training of the pre-trained CNN specifically comprising:
constructing a first training set, the first training set being image data with known image category labels whose corresponding hash codes are known; inputting the first training set into the CNN and training it to obtain the pre-trained CNN;
the step of obtaining the hash codes corresponding to the image category labels comprising:
embedding the similarity between instances and the category labels of each layer in the label hierarchy into a first objective function, the first objective function being used to preserve the semantic similarity of each layer;
embedding the similarity between cross-layer category labels in the label hierarchy into a second objective function, the second objective function being used to preserve cross-layer association information;
integrating the first objective function and the second objective function into a final objective function, the final objective function enabling the hash codes learned from the known category labels to preserve both the per-layer similarity and the cross-layer association degree;
optimizing and solving the final objective function to obtain the hash codes corresponding to the image category labels;
a retrieval output module configured to: compute distances between the first hash code and the pre-stored, known hash codes of all multimedia data in a second modality; then select the second-modality multimedia data corresponding to the several nearest hash codes and output it as the retrieval result.
8. An electronic device, comprising: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs being stored in the memory, the processor executing the one or more computer programs stored in the memory when the electronic device is running, to cause the electronic device to perform the method of any of the preceding claims 1-7.
9. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of any one of claims 1 to 6.
CN202010771701.0A 2020-08-04 2020-08-04 Cross-modal retrieval method and system for multimedia data by using label level information Active CN111930972B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010771701.0A CN111930972B (en) 2020-08-04 2020-08-04 Cross-modal retrieval method and system for multimedia data by using label level information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010771701.0A CN111930972B (en) 2020-08-04 2020-08-04 Cross-modal retrieval method and system for multimedia data by using label level information

Publications (2)

Publication Number Publication Date
CN111930972A (en) 2020-11-13
CN111930972B (en) 2021-04-27

Family

ID=73307193

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010771701.0A Active CN111930972B (en) 2020-08-04 2020-08-04 Cross-modal retrieval method and system for multimedia data by using label level information

Country Status (1)

Country Link
CN (1) CN111930972B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113159071B (en) * 2021-04-20 2022-06-21 复旦大学 Cross-modal image-text association anomaly detection method
CN116244483B (en) * 2023-05-12 2023-07-28 山东建筑大学 Large-scale zero sample data retrieval method and system based on data synthesis

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103488713A (en) * 2013-09-10 2014-01-01 浙江大学 Cross-modal search method capable of directly measuring similarity of different modal data
CN105205096A (en) * 2015-08-18 2015-12-30 天津中科智能识别产业技术研究院有限公司 Text modal and image modal crossing type data retrieval method
CN106951509A (en) * 2017-03-17 2017-07-14 中国人民解放军国防科学技术大学 Multi-tag coring canonical correlation analysis search method
CN110188209A (en) * 2019-05-13 2019-08-30 山东大学 Cross-module state Hash model building method, searching method and device based on level label
CN110569387A (en) * 2019-08-20 2019-12-13 清华大学 radar-image cross-modal retrieval method based on depth hash algorithm

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8204869B2 (en) * 2008-09-30 2012-06-19 International Business Machines Corporation Method and apparatus to define and justify policy requirements using a legal reference library
US10387386B2 (en) * 2015-08-11 2019-08-20 International Business Machines Corporation Automatic attribute structural variation detection for not only structured query language database
US11372909B2 (en) * 2018-08-30 2022-06-28 Kavita Ramnik Shah Mehta System and method for recommending business schools based on assessing profiles of applicants and business schools
CN110188414B (en) * 2019-05-13 2020-12-29 山东大学 Personalized capsule wardrobe creating method and device and capsule wardrobe

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103488713A (en) * 2013-09-10 2014-01-01 浙江大学 Cross-modal search method capable of directly measuring similarity of different modal data
CN105205096A (en) * 2015-08-18 2015-12-30 天津中科智能识别产业技术研究院有限公司 Text modal and image modal crossing type data retrieval method
CN106951509A (en) * 2017-03-17 2017-07-14 中国人民解放军国防科学技术大学 Multi-tag coring canonical correlation analysis search method
CN110188209A (en) * 2019-05-13 2019-08-30 山东大学 Cross-module state Hash model building method, searching method and device based on level label
CN110569387A (en) * 2019-08-20 2019-12-13 清华大学 radar-image cross-modal retrieval method based on depth hash algorithm

Also Published As

Publication number Publication date
CN111930972A (en) 2020-11-13

Similar Documents

Publication Publication Date Title
Bai et al. Boundary content graph neural network for temporal action proposal generation
CN108984724B (en) Method for improving emotion classification accuracy of specific attributes by using high-dimensional representation
CN111639240B (en) Cross-modal Hash retrieval method and system based on attention awareness mechanism
Najafabadi et al. Deep learning applications and challenges in big data analytics
Liu et al. Ranking-based deep cross-modal hashing
CN111639197B (en) Cross-modal multimedia data retrieval method and system with label embedded online hash
CA3076638A1 (en) Systems and methods for learning user representations for open vocabulary data sets
Yu et al. Multi-label classification by exploiting label correlations
CN111914085B (en) Text fine granularity emotion classification method, system, device and storage medium
Tian et al. Meta-learning approaches for learning-to-learn in deep learning: A survey
Jin et al. Deep saliency hashing for fine-grained retrieval
CN113326289B (en) Rapid cross-modal retrieval method and system for incremental data carrying new categories
Plested et al. Deep transfer learning for image classification: a survey
CN113688878B (en) Small sample image classification method based on memory mechanism and graph neural network
CN111930972B (en) Cross-modal retrieval method and system for multimedia data by using label level information
US20220375090A1 (en) Generating improved panoptic segmented digital images based on panoptic segmentation neural networks that utilize exemplar unknown object classes
CN112639769A (en) Accelerating machine learning reasoning with probabilistic predicates
Do et al. Attentional multilabel learning over graphs: a message passing approach
CN111461175A (en) Label recommendation model construction method and device of self-attention and cooperative attention mechanism
CN111080551A (en) Multi-label image completion method based on depth convolution characteristics and semantic neighbor
Furht et al. Deep learning techniques in big data analytics
CN116594748A (en) Model customization processing method, device, equipment and medium for task
Xing et al. Few-shot single-view 3d reconstruction with memory prior contrastive network
CN115438160A (en) Question and answer method and device based on deep learning and electronic equipment
Widhianingsih et al. Augmented domain agreement for adaptable Meta-Learner on Few-Shot classification

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant