CN111930972B - Cross-modal retrieval method and system for multimedia data by using label level information
- Publication number
- CN111930972B (application number CN202010771701.0A)
- Authority
- CN
- China
- Prior art keywords
- layer
- objective function
- multimedia data
- hash
- hash code
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/43—Querying
- G06F16/432—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a method and a system for cross-modal retrieval of multimedia data using label hierarchy information. The method comprises the following steps: acquiring first-modality multimedia data to be retrieved; performing feature extraction on the first-modality multimedia data to obtain a first hash code; computing distances between the first hash code and the pre-stored known hash codes of all multimedia data in the second modality; and selecting the second-modality multimedia data corresponding to the several closest hash codes and outputting them as the retrieval result.
Description
Technical Field
The present application relates to the field of cross-media retrieval technologies, and in particular, to a method and a system for cross-modal retrieval of multimedia data using tag hierarchy information.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
With the explosive growth of multimedia data, data are often represented in multiple modalities, such as images and text. In the face of massive data, rapid similarity comparison is usually required; it is the fundamental operation for managing and using data, so the demand for fast cross-modal retrieval keeps increasing. To meet this need, cross-modal hashing methods, which use data of one modality to retrieve similar samples in another modality, have been proposed in succession.
Cross-modal hash learning belongs to hash learning and shares its advantages. Hash learning is currently one of the most popular approaches for storing and retrieving large-scale data. Its main idea is to learn a hash function that transforms data from the original representation in a high-dimensional feature space to a binary code representation in a low-dimensional Hamming space. This transformation achieves dimensionality reduction and thereby effectively reduces space consumption, while the hash function preserves the similarity between data items. In addition, representing data as binary codes enables fast retrieval, because computers perform pairwise comparisons between binary codes with high efficiency.
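The efficiency of pairwise binary-code comparison mentioned above comes down to a single XOR followed by a population count. A minimal sketch (the 8-bit codes are made-up values for illustration, not codes produced by the patented method):

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two binary hash codes.

    XOR leaves a 1 exactly where the two codes disagree;
    counting those 1-bits gives the Hamming distance.
    """
    return bin(a ^ b).count("1")

# Two hypothetical 8-bit hash codes.
print(hamming_distance(0b10110100, 0b10011100))  # → 2
```

This is why Hamming-space retrieval scales: one machine-word XOR plus a popcount per database entry, versus a floating-point distance over high-dimensional features.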
Existing cross-modal hash learning methods can be divided along several axes. For example, according to whether supervised information is utilized, they can be classified into unsupervised and supervised cross-modal hashing. Compared with unsupervised methods, supervised cross-modal hashing models can exploit semantic information, so the learned hash function generates higher-quality hash code representations. Likewise, according to whether deep learning is used to extract features and learn the hash function, methods can be divided into deep and non-deep cross-modal hashing. Non-deep methods learn hash functions and hash code representations from manually designed features, which separates the feature extraction stage from the hash learning stage and can lead to sub-optimal results. Deep cross-modal hashing integrates feature extraction and hash learning into one framework, so the two stages promote each other and the learning quality improves.
In many real data sets, such as CIFAR-100 and ImageNet, the semantic labels attached to the data carry a hierarchical structure. For example, in a simple three-layer hierarchy, "motorcycle" and "truck" both belong to the parent class "vehicle", while "vehicle" and "boat" in turn belong to a common higher-level parent class. Such hierarchies contain much useful information; if it is sufficiently mined during learning, the learning effect and thus the retrieval accuracy improve markedly. However, most of the current prior art ignores the hierarchical information of labels. The inventors have found that, although a very few methods attempt to use this hierarchical information in the learning process, those cross-modal hashing methods suffer from the following disadvantage: they generate hierarchical hash codes for each level of the label hierarchy without considering the correlation information between cross-level labels.
Disclosure of Invention
In order to solve the defects of the prior art, the application provides a multimedia data cross-modal retrieval method and a multimedia data cross-modal retrieval system by utilizing label level information;
in a first aspect, the present application provides a cross-modal retrieval method for multimedia data using label hierarchy information;
the cross-modal retrieval method of the multimedia data by using the label hierarchy information comprises the following steps:
acquiring first modal multimedia data to be retrieved;
performing feature extraction on first modal multimedia data to be retrieved to obtain a first hash code;
computing distances between the first hash code and the pre-stored known hash codes of all multimedia data in the second modality; and selecting the second-modality multimedia data corresponding to the several closest hash codes and outputting them as the retrieval result.
In a second aspect, the present application provides a cross-modal retrieval system for multimedia data using label hierarchy information;
a multimedia data cross-modal retrieval system using tag hierarchy information, comprising:
an acquisition module configured to: acquiring first modal multimedia data to be retrieved;
a feature extraction module configured to: performing feature extraction on first modal multimedia data to be retrieved to obtain a first hash code;
a retrieval output module configured to: compute distances between the first hash code and the pre-stored known hash codes of all multimedia data in the second modality; and select the second-modality multimedia data corresponding to the several closest hash codes and output them as the retrieval result.
In a third aspect, the present application further provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs are stored in the memory, and when the electronic device is running, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first aspect.
In a fourth aspect, the present application also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the method of the first aspect.
In a fifth aspect, the present application also provides a computer program (product) comprising a computer program for implementing the method of any of the preceding first aspects when run on one or more processors.
Compared with the prior art, the beneficial effects of this application are:
1. The application provides a novel deep-learning-based cross-modal hash retrieval method utilizing label hierarchy information; the model performs feature learning and hash learning simultaneously using two parallel neural networks (one for each of the two modalities).
2. The method and system make full use of the hierarchical information in the labels and generate hash codes under the supervision of both the per-layer similarity and the cross-layer correlation.
3. The iterative optimization algorithm for directly learning the discrete hash code is used, so that the quality of the learned hash code can be effectively guaranteed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a flow chart of a method of the first embodiment;
fig. 2 is an overall block diagram of the first embodiment.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit example embodiments according to the present application. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and "comprising", and any variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the embodiment of the present application, "and/or" is only one kind of association relation describing an association object, and means that three kinds of relations may exist. For example, a and/or B, may represent: a exists alone, A and B exist simultaneously, and B exists alone. In addition, in the description of the present application, "a plurality" means two or more than two.
In addition, to facilitate a clear description of the technical solutions of the embodiments of the present application, terms such as "first" and "second" are used to distinguish between identical or similar items having substantially the same functions and effects. Those skilled in the art will appreciate that such words do not limit quantity or order of execution, nor do they imply that the items necessarily differ.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
In order to overcome the defects of the existing methods and to effectively utilize the hierarchical information of labels for efficient and accurate similarity retrieval, the invention provides a deep-learning-based cross-modal hash retrieval method and system utilizing label hierarchy information. The method makes full use of the supervision information of the data and employs deep learning for feature extraction and hash learning, focusing on two aspects: 1) how to fully utilize the semantic information of the data, i.e., the labels of each layer in the supervision information, to improve the quality of hash learning; 2) how to fully mine the hierarchical structure in the semantic information, i.e., the association information between different layers of labels, so that the model further improves the learning quality. Specifically, a CNN network and an MLP network are adopted to obtain the semantic features of the image and text modalities respectively, and both the per-layer label information and the cross-layer information are used as supervision to guide hash learning.
Example one
The embodiment provides a multimedia data cross-modal retrieval method using label hierarchy information;
as shown in fig. 1, the method for cross-modal retrieval of multimedia data using tag hierarchy information includes:
s101: acquiring first modal multimedia data to be retrieved;
s102: performing feature extraction on first modal multimedia data to be retrieved to obtain a first hash code;
s103: computing distances between the first hash code and the pre-stored known hash codes of all multimedia data in the second modality; and selecting the second-modality multimedia data corresponding to the several closest hash codes and outputting them as the retrieval result.
As one or more embodiments, in S101, the first-modality multimedia data to be retrieved includes, but is not limited to: one or more of text data, image data, audio data, or video data.
As one or more embodiments, in S103, the multimedia data of the second modality includes, but is not limited to: one or more of text data, image data, audio data, or video data.
As one or more embodiments, in S102, the feature extraction is performed on the first-modality multimedia data to be retrieved to obtain a first hash code; the method comprises the following specific steps:
when the first-mode multimedia data to be retrieved is image data, performing feature extraction on the first-mode multimedia data to be retrieved by using a pre-trained Convolutional Neural Network (CNN) to obtain a first hash code;
alternatively,
and when the first-mode multimedia data to be retrieved is text data, performing feature extraction on the first-mode multimedia data to be retrieved by using a pre-trained multi-layer perceptron MLP model to obtain a first hash code.
Further, the convolutional neural network CNN comprises: a first input layer, first to fifth convolutional layers, and first to third fully-connected layers, connected in sequence, wherein the number of neurons in the third fully-connected layer is equal to the length of the first hash code.
Further, the multi-layer perceptron MLP model comprises: a second input layer, a fourth fully-connected layer and a fifth fully-connected layer, connected in sequence, wherein the number of neurons in the fifth fully-connected layer is equal to the length of the first hash code.
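As a sketch, the text-branch forward pass described above might look as follows in numpy (the layer widths, random untrained weights, and function name are illustrative assumptions, not the patent's trained model):

```python
import numpy as np

def mlp_text_hash(t, W1, b1, W2, b2):
    """Fourth (ReLU) and fifth (tanh) fully-connected layers,
    followed by sign quantization to a {-1,+1} hash code."""
    h = np.maximum(0.0, W1 @ t + b1)          # fourth fully-connected layer
    out = np.tanh(W2 @ h + b2)                # fifth layer, width = code length
    return np.where(out >= 0, 1.0, -1.0)      # binarize (ties at 0 map to +1)

rng = np.random.default_rng(0)
d_bow, d_hidden, r = 1000, 512, 32            # assumed BoW, hidden, code sizes
W1 = rng.standard_normal((d_hidden, d_bow)) * 0.01
W2 = rng.standard_normal((r, d_hidden)) * 0.01
code = mlp_text_hash(rng.random(d_bow), W1, np.zeros(d_hidden), W2, np.zeros(r))
assert code.shape == (r,)
```

In training, the tanh output would be kept real-valued (the matrix G of the later sections); the hard sign is applied only when emitting final codes.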
Further, the pre-trained convolutional neural network CNN specifically comprises the following training steps:
constructing a first training set, wherein the first training set consists of image data with known image class labels and the hash codes corresponding to those labels; and inputting the first training set into the convolutional neural network CNN and training it to obtain the pre-trained convolutional neural network CNN.
Further, the pre-trained MLP model comprises the following specific training steps:
constructing a second training set, wherein the second training set consists of text data with known text class labels and the hash codes corresponding to those labels; and inputting the second training set into the multi-layer perceptron MLP model and training it to obtain the pre-trained multi-layer perceptron MLP model.
Further, the step of obtaining the hash code corresponding to the image category label and/or the hash code corresponding to the text category label includes:
embedding the similarity of the instance and the category label of each layer in the label hierarchical structure into a first objective function; the first objective function is used for keeping semantic similarity of each layer;
embedding the similarity between the cross-layer category labels in the label hierarchical structure and the category labels into a second objective function; the second objective function is used for maintaining cross-layer association information;
integrating the first objective function and the second objective function to obtain a final objective function, wherein the final objective function can enable hash codes learned from known class labels to keep the similarity of each layer and the cross-layer association degree;
and carrying out optimization solution on the final objective function to obtain the hash codes corresponding to the image class labels and the hash codes corresponding to the text class labels.
Illustratively, the first objective function is:

$\min_{B,\{Y^k\}} \sum_{k=1}^{K} \alpha_k \left\| rL^k - (Y^k)^{\top} B \right\|_F^2, \quad \text{s.t. } B \in \{-1,1\}^{r \times n},\ Y^k \in \{-1,1\}^{r \times c_k},$

where $B$ is the hash code matrix of all samples, $Y^k$ is the class hash code matrix of the k-th layer, and $\alpha_k$ is a preset confidence weight for the k-th layer.
Illustratively, the second objective function is:

$\min_{\{Y^k\}} \sum_{k=1}^{K-1} \beta_k \left\| rS^{kK} - (Y^k)^{\top} Y^K \right\|_F^2,$

where $Y^k$ is the class hash code matrix of the k-th layer and $\beta_k$ is a preset weight balancing the importance of the k-th layer, analogous to $\alpha_k$.
Illustratively, the final objective function is:

$\min_{B,\{Y^k\},\theta_x,\theta_t} \sum_{k=1}^{K} \alpha_k \left\| rL^k - (Y^k)^{\top} B \right\|_F^2 + \mu \left( \| B - F \|_F^2 + \| B - G \|_F^2 \right) + \eta \sum_{k=1}^{K-1} \beta_k \left\| rS^{kK} - (Y^k)^{\top} Y^K \right\|_F^2,$

where $\eta$ is a hyperparameter and the last sum runs over $k \in \{1, \dots, K-1\}$.
Illustratively, the final objective function is solved by an iterative optimization procedure in which the network parameters are updated using the error back-propagation (BP) algorithm.
The work of the present application is described by taking data containing two modalities, namely images and texts, as an example; the model can be conveniently extended to other types of modality data. The overall framework is shown in fig. 2. The model designed by the application consists of two main components: a feature extraction module and a hash code learning module. In the feature extraction part, a CNN network is adopted to extract the features of the image modality and an MLP network to obtain the features of the text modality. In the hash code learning part, the application first introduces the definitions of the per-layer instance-class similarity and of the cross-layer class-class similarity. Then, this information is embedded into the objective function as supervision, and an intermediate product, namely the hash codes corresponding to the class labels, is obtained. Finally, the intermediate product is used to generate hash codes that retain both defined similarities simultaneously. The technical content is described in detail below in four parts.
a) And a feature extraction module. The present application employs two deep neural networks to learn features, one for the image modality and the other for the text modality.
For the image modality, a CNN-F model is adopted as the feature learning model and initialized with weights pre-trained on ImageNet. Specifically, CNN-F consists of five convolutional layers, "conv1"–"conv5", and three fully-connected layers, "fc6"–"fc8". The application modifies the number of neurons of the fc8 layer to the length of the hash code, and the activation function of the fc8 layer is set to the tanh function. Let $f(x_i; \theta_x)$ denote the output of the CNN-F network, where $x_i$ is the input of the network and $\theta_x$ denotes the parameters of the CNN.
For the text modality, the present application designs a multi-layer perceptron (MLP) model consisting of two fully-connected layers. Likewise, the number of nodes in the second layer is set to the length of the hash code, and the activation functions of the first and second layers are ReLU and tanh, respectively. The input of the MLP network is $t_i$, the bag-of-words representation of the i-th text; the output is $g(t_i; \theta_t)$, where $\theta_t$ denotes the parameters of this deep neural network.
b) Hash code learning module. This module first defines the per-layer instance-class similarity between instances and class labels, and the cross-layer class-wise similarity between class labels of different layers. Then, this information is embedded into the objective function as supervision, yielding an intermediate product: a class-wise hash code for each class label. Finally, the class hash codes are used to generate sample hash codes that retain both defined similarities simultaneously.
① Similarity of each layer in the label hierarchy. The class labels have a hierarchical structure; each layer (say the k-th) has a layer-level label matrix $L^k \in \{0,1\}^{c_k \times n}$, where $c_k$ denotes the number of labels at the k-th layer. For a hashing method using supervised information, it is most important to preserve semantic similarity in Hamming space; in other words, instances in the same class should have similar hash codes. The present application generates a hash code for each class as global information to guide the learning of the sample hash codes. Based on the assumption that the hash code of an instance should be similar to the class hash code of its class, the label matrix $L^k$ can be regarded as an instance-class similarity, approximated by the inner product of the class hash codes of the layer and the hash codes of all samples. Let $Y^k \in \{-1,1\}^{r \times c_k}$ denote the class hash code matrix of the k-th layer, each of whose columns is the class hash code of one class. In order to maintain the semantic similarity of each layer, the present application defines the objective function as:

$\min_{B,\{Y^k\}} \sum_{k=1}^{K} \alpha_k \left\| rL^k - (Y^k)^{\top} B \right\|_F^2, \quad \text{s.t. } B \in \{-1,1\}^{r \times n},$

where $B$ is the hash code matrix of all $n$ samples, $r$ is the hash code length, and $\alpha_k$ is a preset confidence weight for the k-th layer.
In order to fully combine feature learning and hash code learning, the application introduces the outputs of the two networks into the above formula, and the optimization problem becomes:

$\min \sum_{k=1}^{K} \alpha_k \left\| rL^k - (Y^k)^{\top} B \right\|_F^2 + \mu \left( \| B - F \|_F^2 + \| B - G \|_F^2 \right),$

where $F = f(X; \theta_x)$, $G = g(T; \theta_t)$, and $\mu$ is a trade-off parameter. In particular, the second term treats $F$ and $G$ as real-valued surrogates of $B$.
② Cross-layer similarity in the label hierarchy. For the hierarchy in the label information, the correlation information between different layers is also valuable. To capture cross-layer dependencies, a cross-layer class-wise similarity $S^{kK} \in \{-1,1\}^{c_k \times c_K}$ is defined here, which measures the similarity between classes from different layers:

$S^{kK}_{ij} = 1$ if the i-th class of the k-th layer belongs to $\mathrm{Ancestor}(j, K)$, the set of ancestor nodes of the j-th class of the K-th layer; otherwise $S^{kK}_{ij} = -1$,

where $k \in \{1, \dots, K-1\}$. From this definition, the cross-layer association information is anchored at the finest-granularity layer, i.e. the K-th layer, because that layer contains the most categories and its labels describe the samples more precisely than those of the other layers.
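A sketch of how such a cross-layer similarity matrix could be built (the toy hierarchy, the {+1, −1} encoding, and the function names are illustrative assumptions, not the patent's exact construction):

```python
import numpy as np

def cross_layer_similarity(parents, coarse, fine):
    """S[i, j] = +1 if coarse class i is an ancestor of fine class j,
    else -1, matching the range of inner products of {-1,+1} hash bits."""
    def ancestors(c):
        out = set()
        while c in parents:
            c = parents[c]
            out.add(c)
        return out
    S = -np.ones((len(coarse), len(fine)))
    for j, f in enumerate(fine):
        anc = ancestors(f)
        for i, c in enumerate(coarse):
            if c in anc:
                S[i, j] = 1.0
    return S

# A toy three-level hierarchy (illustrative).
parents = {"motorcycle": "vehicle", "truck": "vehicle",
           "vehicle": "transport", "boat": "transport"}
S = cross_layer_similarity(parents, coarse=["vehicle"],
                           fine=["motorcycle", "truck", "boat"])
print(S)  # [[ 1.  1. -1.]]
```

Here "vehicle" is an ancestor of "motorcycle" and "truck" but not of "boat", so only the last entry is −1.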
Then, the present application defines the following objective function to maintain the cross-layer association information:

$\min_{\{Y^k\}} \sum_{k=1}^{K-1} \beta_k \left\| rS^{kK} - (Y^k)^{\top} Y^K \right\|_F^2,$

where $Y^k$ is the class hash code matrix of the k-th layer and $\beta_k$ is a preset weight balancing the importance of the k-th layer, analogous to $\alpha_k$.
③ Merging parts ① and ②. In order for the learned hash codes to preserve both the per-layer similarity and the cross-layer association, the objective functions of the two preceding parts are integrated into the final objective function:

$\min_{B,\{Y^k\},\theta_x,\theta_t} \sum_{k=1}^{K} \alpha_k \left\| rL^k - (Y^k)^{\top} B \right\|_F^2 + \mu \left( \| B - F \|_F^2 + \| B - G \|_F^2 \right) + \eta \sum_{k=1}^{K-1} \beta_k \left\| rS^{kK} - (Y^k)^{\top} Y^K \right\|_F^2, \quad \text{s.t. } B \in \{-1,1\}^{r \times n},$

where $\eta$ is a hyperparameter and the last sum runs over $k \in \{1, \dots, K-1\}$.
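A numpy sketch of evaluating an objective of this form for given variables (all dimensions and weight values are illustrative assumptions, and the formula itself is reconstructed from the surrounding text, since the original equation images were not preserved):

```python
import numpy as np

def final_objective(B, Ys, Ls, S_list, F, G, alphas, betas, mu, eta, r):
    """Ys[k]: r x c_k class codes; Ls[k]: c_k x n label matrices;
    S_list[k]: c_k x c_K cross-layer similarity, for k < K-1."""
    K = len(Ys)
    loss = sum(alphas[k] * np.linalg.norm(r * Ls[k] - Ys[k].T @ B) ** 2
               for k in range(K))                              # per-layer term
    loss += mu * (np.linalg.norm(B - F) ** 2 +
                  np.linalg.norm(B - G) ** 2)                  # quantization term
    loss += eta * sum(betas[k] *
                      np.linalg.norm(r * S_list[k] - Ys[k].T @ Ys[-1]) ** 2
                      for k in range(K - 1))                   # cross-layer term
    return loss

rng = np.random.default_rng(1)
r_bits, n, c = 8, 5, (2, 4)                  # code length, samples, classes/layer
B = np.sign(rng.standard_normal((r_bits, n)))
Ys = [np.sign(rng.standard_normal((r_bits, ck))) for ck in c]
Ls = [rng.integers(0, 2, (ck, n)).astype(float) for ck in c]
S_list = [np.where(rng.random((c[0], c[1])) > 0.5, 1.0, -1.0)]
F, G = rng.standard_normal((r_bits, n)), rng.standard_normal((r_bits, n))
val = final_objective(B, Ys, Ls, S_list, F, G,
                      alphas=[0.5, 0.5], betas=[1.0], mu=1.0, eta=0.5, r=r_bits)
assert val >= 0.0
```

Being a sum of squared Frobenius norms, the value is always non-negative; the optimization sections below alternate over the variables to drive it down.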
c) And (5) optimizing the process.
As can be seen from the objective function, there are five groups of variables to be optimized: $B$, $Y^k$ ($k < K$), $Y^K$, $\theta_x$ and $\theta_t$. As in most deep cross-modal hash retrieval methods, the loss function is minimized by iterative optimization: only one variable is optimized at a time while the other variables are kept fixed. The specific optimization strategy is as follows:
The first step: fix the variables $B$, $Y^k$, $Y^K$ and update $\theta_x$, $\theta_t$. The application uses the BP algorithm to update the parameters $\theta_x$ and $\theta_t$.
The second step: fix the variables $Y^k$, $Y^K$, $\theta_x$, $\theta_t$ and update $B$. When the other variables are fixed, the subproblem in $B$ can be rewritten (up to constant terms) as:

$\max_{B} \mathrm{Tr}\left(B^{\top}(F+G)\right), \quad \text{s.t. } B \in \{-1,1\}^{r \times n}.$

This formula has the advantage of a closed-form solution:

$B = \mathrm{sign}(F + G).$
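The B-update step is particularly simple; as a sketch (toy dimensions and random network outputs assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
F = rng.standard_normal((16, 10))   # real-valued image-branch outputs, r x n
G = rng.standard_normal((16, 10))   # real-valued text-branch outputs,  r x n

# Closed-form update: each bit takes the sign of the summed network outputs.
# np.where resolves the sign(0) ambiguity by mapping 0 to +1.
B = np.where(F + G >= 0, 1.0, -1.0)
assert B.shape == (16, 10)
```

Because each entry of B can be chosen independently, the elementwise sign maximizes the trace objective exactly, with no relaxation.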
the third step: fix the variables $B$, $Y^k$ ($k < K$), $\theta_x$, $\theta_t$ and update $Y^K$. The overall objective function restricted to $Y^K$ is rewritten as:

$\min_{Y^K} \alpha_K \left\| rL^K - (Y^K)^{\top} B \right\|_F^2 + \eta \sum_{k=1}^{K-1} \beta_k \left\| rS^{kK} - (Y^k)^{\top} Y^K \right\|_F^2, \quad \text{s.t. } Y^K \in \{-1,1\}^{r \times c_K}.$

Expanding the above formula and omitting the constant terms yields a quadratic problem whose linear term is $\mathrm{Tr}\left((Y^K)^{\top} Q^K\right)$, with $Q^K = r\left(\alpha_K (F+G)(L^K)^{\top} + \eta \sum_{k=1}^{K-1} \beta_k Y^k S^{kK}\right)$.

Thereafter, the present application uses a discrete cyclic coordinate descent (DCC) algorithm to obtain $Y^K$: one row is solved in closed form while the other rows are fixed. In other words, the class hash codes of the K-th layer are generated discretely, one bit at a time. Let $(y^k)^{\top}$ denote the l-th row of $Y^k$, $l \in \{1, 2, \dots, r\}$, $k \in \{1, 2, \dots, K\}$, and let $(Y^k)'$ be the matrix $Y^k$ with the row $(y^k)^{\top}$ removed; $y^k$ thus collects one bit of every class hash code of the k-th layer. Similarly, $f^{\top}$ is the l-th row of $F$ and $F'$ is $F$ without $f^{\top}$; $g^{\top}$ is the l-th row of $G$ and $G'$ is $G$ without $g^{\top}$.

Taking the first (linear) term of the above formula as an example:

$\mathrm{Tr}\left((Y^K)^{\top} Q^K\right) = (q^K)^{\top} y^K + \text{const},$

where $(q^K)^{\top}$ is the l-th row of $Q^K$. The objective function for the single row $y^K$ can then be rewritten as a subproblem with the closed-form solution:

$y^K = \mathrm{sign}\left(q^K - \alpha_K (Y^K)'^{\top}(F'f + G'g) - \eta \sum_{k=1}^{K-1} \beta_k (Y^K)'^{\top} (Y^k)' y^k\right).$

Obviously, each bit row $y^K$ is computed based on the remaining rows $(Y^K)'$, so the present application can iteratively update each bit until convergence, resulting in a better $Y^K$.
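The row-by-row structure of this update can be sketched on a simplified subproblem, $\min_Y \|rL - Y^{\top}B\|_F^2$ for a single layer with no cross-layer term (a simplified illustration of discrete cyclic coordinate descent, not the patent's full update):

```python
import numpy as np

def dcc_update(Y, B, L, r, n_sweeps=3):
    """Discrete cyclic coordinate descent for min ||r*L - Y.T @ B||^2
    over Y in {-1,+1}^(r x c), updating one row (one bit of every
    class code) at a time while the other rows stay fixed."""
    Q = r * (B @ L.T)                       # linear-term coefficients, r x c
    for _ in range(n_sweeps):
        for l in range(Y.shape[0]):
            mask = np.arange(Y.shape[0]) != l
            Yp, Bp = Y[mask], B[mask]       # Y', B': all rows except row l
            # Row update: y_l = sign(q_l - Y'.T @ (B' @ b_l))
            y = Q[l] - Yp.T @ (Bp @ B[l])
            Y[l] = np.where(y >= 0, 1.0, -1.0)
    return Y

rng = np.random.default_rng(2)
r_bits, c, n = 6, 3, 8
B = np.where(rng.standard_normal((r_bits, n)) >= 0, 1.0, -1.0)
L = rng.integers(0, 2, (c, n)).astype(float)
Y0 = np.where(rng.standard_normal((r_bits, c)) >= 0, 1.0, -1.0)
before = np.linalg.norm(r_bits * L - Y0.T @ B) ** 2
Y = dcc_update(Y0.copy(), B, L, r_bits)
after = np.linalg.norm(r_bits * L - Y.T @ B) ** 2
assert after <= before
```

Each row update minimizes the subproblem for that row exactly (the current row is always a candidate), so the objective is non-increasing across sweeps.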
The fourth step: fix $B$, $Y^K$, $\theta_x$, $\theta_t$ and update $Y^k$ ($k < K$). The objective function is rewritten as:

$\min_{Y^k} \alpha_k \left\| rL^k - (Y^k)^{\top} B \right\|_F^2 + \eta \beta_k \left\| rS^{kK} - (Y^k)^{\top} Y^K \right\|_F^2, \quad \text{s.t. } Y^k \in \{-1,1\}^{r \times c_k}.$

Since the optimization process is similar to the solution of $Y^K$, the solution of $Y^k$ is given here directly:

$y^k = \mathrm{sign}\left(q^k - \alpha_k (Y^k)'^{\top}(F'f + G'g) - \eta \beta_k (Y^k)'^{\top} (Y^K)' y^K\right),$

where $k \in \{1, \dots, K-1\}$, $(q^k)^{\top}$ is the l-th row of $Q^k$, and $Q^k = r\left(\alpha_k (F+G)(L^k)^{\top} + \eta \beta_k Y^K (S^{kK})^{\top}\right)$.
d) Processing of query data: generating the hash representation of a query sample.
When a sample not belonging to the training set is used as query data, a corresponding hash code $b_q$ needs to be generated. Because the problem is cross-modal retrieval, with the image and text modalities taken as examples, the two cases are analyzed separately.
If the query sample is an image, the application takes it as the input of the CNN-F network and generates the hash code as:

$b_q = h_x(x_q) = \mathrm{sign}\left(f(x_q; \theta_x)\right).$

If the query sample is the bag-of-words feature vector of a text, the application takes it as the input of the MLP network and generates the hash code as:

$b_q = h_t(t_q) = \mathrm{sign}\left(g(t_q; \theta_t)\right).$
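The quantization that turns a real-valued network output into a query hash code can be sketched as follows (the output values are made up for illustration):

```python
import numpy as np

def quantize(real_output):
    """b_q = sign(network output): map a real-valued vector to a
    {-1,+1} hash code, with ties at 0 mapped to +1."""
    return np.where(np.asarray(real_output) >= 0, 1.0, -1.0)

# Hypothetical network output f(x_q; theta_x) for one query sample.
f_xq = np.array([0.73, -0.12, 0.05, -0.98])
b_q = quantize(f_xq)
print(b_q)  # [ 1. -1.  1. -1.]
```

The same function serves both branches: only the network producing the real-valued output differs between image and text queries.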
the specific implementation steps are as follows:
firstly, acquiring original pictures and original texts of a training set, and representing the texts by using BoW.
And secondly, inputting the training samples into a feature extraction module, respectively extracting features of the pictures and the texts, and then learning to obtain the hash codes of the samples in the training set through a hash code learning stage. The model parameters are obtained by learning, i.e. minimizing the loss function.
And thirdly, fixing model parameters. And obtaining the hash codes corresponding to all samples by using the model, and storing the hash codes into a database for use.
And fourthly, when the model is used for retrieval, a hash code is first generated for the query sample (of one modality); it is then compared with the hash codes of all samples stored in the database, the N samples (N configurable as needed) with the smallest Hamming distance are found, and the data of the other modality corresponding to those samples are returned.
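The retrieval step above can be sketched as follows (toy database of {−1,+1} codes; for such codes the Hamming distance equals (r − inner product)/2, and N is assumed small):

```python
import numpy as np

def retrieve_top_n(query_code, db_codes, n):
    """Return indices of the n database codes closest in Hamming distance.
    For {-1,+1} codes: hamming = (r - <b_q, b_i>) / 2."""
    r = len(query_code)
    dists = (r - db_codes @ query_code) / 2
    return np.argsort(dists, kind="stable")[:n]

db = np.array([[ 1,  1, -1, -1],
               [ 1, -1, -1, -1],
               [-1, -1,  1,  1]])
q = np.array([1, 1, -1, -1])
print(retrieve_top_n(q, db, 2))  # → [0 1]
```

Entry 0 matches the query exactly (distance 0), entry 1 differs in one bit, and entry 2 is the complement, so the two nearest neighbours are indices 0 and 1.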
Example two
The embodiment provides a multimedia data cross-modal retrieval system using tag hierarchy information;
a multimedia data cross-modal retrieval system using tag hierarchy information, comprising:
an acquisition module configured to: acquiring first modal multimedia data to be retrieved;
a feature extraction module configured to: performing feature extraction on first modal multimedia data to be retrieved to obtain a first hash code;
a retrieval output module configured to: compute distances between the first hash code and the pre-stored known hash codes of all multimedia data in the second modality; and select the second-modality multimedia data corresponding to the several closest hash codes and output them as the retrieval result.
It should be noted here that the above acquisition module, feature extraction module and retrieval output module correspond to steps S101 to S103 of the first embodiment; the modules realize the same examples and application scenarios as the corresponding steps, but are not limited to the disclosure of the first embodiment. It should also be noted that the modules described above, as part of a system, may be implemented in a computer system such as a set of computer-executable instructions.
In the foregoing embodiments, the descriptions of the embodiments have different emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The proposed system can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the above-described modules is merely a logical functional division, and in actual implementation, there may be other divisions, for example, multiple modules may be combined or integrated into another system, or some features may be omitted, or not executed.
EXAMPLE III
The present embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein, a processor is connected with the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first embodiment.
It should be understood that in this embodiment, the processor may be a central processing unit CPU, and the processor may also be other general purpose processors, digital signal processors DSP, application specific integrated circuits ASIC, off-the-shelf programmable gate arrays FPGA or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and so on. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software.
The method of the first embodiment may be executed directly by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may reside in RAM, flash memory, ROM, PROM, EPROM, registers, or another storage medium well known in the art. The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, this is not described in detail here.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented as electronic hardware or as a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation should not be considered beyond the scope of the present application.
Example four
This embodiment also provides a computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of the first embodiment.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (9)
1. A cross-modal retrieval method for multimedia data using label hierarchy information, characterized by comprising the following steps:
acquiring first-modality multimedia data to be retrieved;
performing feature extraction on the first-modality multimedia data to be retrieved to obtain a first hash code; wherein, when the first-modality multimedia data to be retrieved is image data, feature extraction is performed on it with a pre-trained convolutional neural network (CNN) to obtain the first hash code, and the pre-trained CNN is obtained by the following training steps:
constructing a first training set, wherein the first training set is image data whose image class labels have known corresponding hash codes; inputting the first training set into the convolutional neural network CNN and training it, to obtain the pre-trained CNN;
wherein the step of obtaining the hash codes corresponding to the image class labels comprises:
embedding the similarity between instances and the class labels of each layer in the label hierarchy into a first objective function, the first objective function being used to preserve the semantic similarity of each layer;
embedding the similarity between cross-layer class labels in the label hierarchy into a second objective function, the second objective function being used to preserve cross-layer association information;
integrating the first objective function and the second objective function to obtain a final objective function, wherein the final objective function enables the hash codes learned from the known class labels to preserve both the per-layer similarity and the cross-layer association degree;
optimizing and solving the final objective function to obtain the hash codes corresponding to the image class labels; and
calculating the distance between the first hash code and the known hash codes corresponding to all pre-stored multimedia data of a second modality, selecting the second-modality multimedia data corresponding to the several hash codes with the smallest distance, and outputting it as the retrieval result.
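The retrieval step of claim 1 (computing distances between the first hash code and the stored second-modality hash codes, then returning the closest items) can be sketched as follows. This is an illustrative sketch only: the function names and the toy 4-bit codes are assumptions, not part of the patent.

```python
import numpy as np

def hamming_distance(a, b):
    # Hamming distance between a query code a (shape [K]) and a
    # database of codes b (shape [N, K]); entries may be {0, 1} or {-1, +1}.
    return np.count_nonzero(a != b, axis=1)

def retrieve_top_k(query_code, db_codes, k=5):
    # Rank all second-modality items by Hamming distance to the query
    # hash code and return the indices of the k closest items.
    dists = hamming_distance(query_code, db_codes)
    return np.argsort(dists, kind="stable")[:k]

# Toy example: a 4-bit query code and six stored database codes.
query = np.array([1, 0, 1, 1])
db = np.array([
    [1, 0, 1, 1],   # distance 0
    [1, 1, 1, 1],   # distance 1
    [0, 0, 0, 0],   # distance 3
    [1, 0, 1, 0],   # distance 1
    [0, 1, 0, 0],   # distance 4
    [1, 0, 0, 1],   # distance 1
])
print(retrieve_top_k(query, db, k=3))  # → [0 1 3]
```

In practice the binary codes would be packed into machine words so the distance reduces to an XOR followed by a popcount, which is what makes hash-based retrieval fast at scale.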
2. The method according to claim 1, wherein performing feature extraction on the first-modality multimedia data to be retrieved to obtain a first hash code specifically comprises:
when the first-modality multimedia data to be retrieved is text data, performing feature extraction on it with a pre-trained multi-layer perceptron (MLP) model to obtain the first hash code.
3. The method of claim 1, wherein the convolutional neural network CNN comprises a first input layer, first to fifth convolutional layers, and first to third fully connected layers, connected in sequence, wherein the number of neurons of the third fully connected layer is equal to the length of the first hash code.
4. The method of claim 2, wherein the multi-layer perceptron MLP model comprises a second input layer, a fourth fully connected layer, and a fifth fully connected layer, connected in sequence, wherein the number of nodes of the fifth fully connected layer is equal to the length of the first hash code.
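The text branch of claim 4 (an input layer followed by two fully connected layers, the last one as wide as the hash code) can be sketched as below; the image branch of claim 3 is analogous, ending in a fully connected layer of hash-code width on top of five convolutional layers. This is a hedged sketch only: the class name, dimensions, and random placeholder weights are assumptions, and in the patent the weights would be learned from the training sets.

```python
import numpy as np

rng = np.random.default_rng(0)

class MLPHasher:
    # Minimal sketch of the MLP in claim 4: two fully connected layers,
    # the last layer's width equal to the hash-code length. Weights are
    # random placeholders standing in for learned parameters.
    def __init__(self, in_dim, hidden_dim, code_len):
        self.w1 = rng.standard_normal((in_dim, hidden_dim)) * 0.1
        self.w2 = rng.standard_normal((hidden_dim, code_len)) * 0.1

    def forward(self, x):
        h = np.tanh(x @ self.w1)      # "fourth fully connected layer"
        return np.tanh(h @ self.w2)   # "fifth fully connected layer"

    def hash_code(self, x):
        # Binarise the relaxed real-valued output into a {-1, +1} code.
        return np.sign(self.forward(x))

hasher = MLPHasher(in_dim=300, hidden_dim=128, code_len=32)
text_feature = rng.standard_normal(300)   # e.g. a bag-of-words vector
code = hasher.hash_code(text_feature)
print(code.shape)  # → (32,)
```

The tanh non-linearity keeps the pre-binarisation outputs in (-1, 1), which is a common relaxation in deep hashing so that the sign step loses as little information as possible.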
5. The method as claimed in claim 2, wherein the pre-trained multi-layer perceptron MLP model is obtained by the following training steps:
constructing a second training set, wherein the second training set is text data whose text class labels have known corresponding hash codes; inputting the second training set into the multi-layer perceptron MLP model and training it, to obtain the pre-trained MLP model.
6. The method as claimed in claim 5, wherein the step of obtaining the hash codes corresponding to the text class labels comprises:
embedding the similarity between instances and the class labels of each layer in the label hierarchy into a first objective function, the first objective function being used to preserve the semantic similarity of each layer;
embedding the similarity between cross-layer class labels in the label hierarchy into a second objective function, the second objective function being used to preserve cross-layer association information;
integrating the first objective function and the second objective function to obtain a final objective function, wherein the final objective function enables the hash codes learned from the known class labels to preserve both the per-layer similarity and the cross-layer association degree; and
optimizing and solving the final objective function to obtain the hash codes corresponding to the text class labels.
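Claims 1 and 6 name the two objective functions but do not print them. One plausible way to write them, purely as a hedged sketch (the symbols $B$, $C^{(m)}$, $S^{(m)}$, $A^{(m,n)}$, and the weight $\lambda$ are illustrative assumptions, not the patent's actual notation), is:

```latex
% Hedged sketch: B holds instance hash codes (N x k), C^{(m)} the label
% codes of hierarchy layer m, S^{(m)} the instance-label similarity at
% layer m, and A^{(m,n)} the label-label similarity across layers m, n.
\begin{align}
  J_1 &= \sum_{m=1}^{M}
    \big\| S^{(m)} - \tfrac{1}{k}\, B\, C^{(m)\top} \big\|_F^2
    && \text{per-layer semantic similarity} \\
  J_2 &= \sum_{m \neq n}
    \big\| A^{(m,n)} - \tfrac{1}{k}\, C^{(m)} C^{(n)\top} \big\|_F^2
    && \text{cross-layer association} \\
  J   &= J_1 + \lambda\, J_2,
    \qquad B \in \{-1,+1\}^{N \times k}
    && \text{integrated final objective}
\end{align}
```

Under this reading, minimising $J$ over the discrete codes (typically by alternating or relaxed optimisation) yields hash codes that simultaneously respect each layer's similarity and the cross-layer association degree, matching the integration step of the claims.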
7. A cross-modal retrieval system for multimedia data using label hierarchy information, characterized by comprising:
an acquisition module configured to acquire first-modality multimedia data to be retrieved;
a feature extraction module configured to perform feature extraction on the first-modality multimedia data to be retrieved to obtain a first hash code; wherein, when the first-modality multimedia data to be retrieved is image data, feature extraction is performed on it with a pre-trained convolutional neural network (CNN) to obtain the first hash code, and the pre-trained CNN is obtained by the following training steps:
constructing a first training set, wherein the first training set is image data whose image class labels have known corresponding hash codes; inputting the first training set into the convolutional neural network CNN and training it, to obtain the pre-trained CNN;
wherein the step of obtaining the hash codes corresponding to the image class labels comprises:
embedding the similarity between instances and the class labels of each layer in the label hierarchy into a first objective function, the first objective function being used to preserve the semantic similarity of each layer;
embedding the similarity between cross-layer class labels in the label hierarchy into a second objective function, the second objective function being used to preserve cross-layer association information;
integrating the first objective function and the second objective function to obtain a final objective function, wherein the final objective function enables the hash codes learned from the known class labels to preserve both the per-layer similarity and the cross-layer association degree;
optimizing and solving the final objective function to obtain the hash codes corresponding to the image class labels; and
a retrieval output module configured to calculate the distance between the first hash code and the known hash codes corresponding to all pre-stored multimedia data of a second modality, select the second-modality multimedia data corresponding to the several hash codes with the smallest distance, and output it as the retrieval result.
8. An electronic device, comprising: one or more processors, one or more memories, and one or more computer programs, wherein the processor is connected to the memory and the one or more computer programs are stored in the memory; when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so as to cause the electronic device to perform the method of any one of claims 1 to 6.
9. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010771701.0A CN111930972B (en) | 2020-08-04 | 2020-08-04 | Cross-modal retrieval method and system for multimedia data by using label level information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111930972A CN111930972A (en) | 2020-11-13 |
CN111930972B true CN111930972B (en) | 2021-04-27 |
Family
ID=73307193
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010771701.0A Active CN111930972B (en) | 2020-08-04 | 2020-08-04 | Cross-modal retrieval method and system for multimedia data by using label level information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111930972B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113159071B (en) * | 2021-04-20 | 2022-06-21 | 复旦大学 | Cross-modal image-text association anomaly detection method |
CN116244483B (en) * | 2023-05-12 | 2023-07-28 | 山东建筑大学 | Large-scale zero sample data retrieval method and system based on data synthesis |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103488713A (en) * | 2013-09-10 | 2014-01-01 | 浙江大学 | Cross-modal search method capable of directly measuring similarity of different modal data |
CN105205096A (en) * | 2015-08-18 | 2015-12-30 | 天津中科智能识别产业技术研究院有限公司 | Text modal and image modal crossing type data retrieval method |
CN106951509A (en) * | 2017-03-17 | 2017-07-14 | 中国人民解放军国防科学技术大学 | Multi-tag coring canonical correlation analysis search method |
CN110188209A (en) * | 2019-05-13 | 2019-08-30 | 山东大学 | Cross-module state Hash model building method, searching method and device based on level label |
CN110569387A (en) * | 2019-08-20 | 2019-12-13 | 清华大学 | radar-image cross-modal retrieval method based on depth hash algorithm |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8204869B2 (en) * | 2008-09-30 | 2012-06-19 | International Business Machines Corporation | Method and apparatus to define and justify policy requirements using a legal reference library |
US10387386B2 (en) * | 2015-08-11 | 2019-08-20 | International Business Machines Corporation | Automatic attribute structural variation detection for not only structured query language database |
US11372909B2 (en) * | 2018-08-30 | 2022-06-28 | Kavita Ramnik Shah Mehta | System and method for recommending business schools based on assessing profiles of applicants and business schools |
CN110188414B (en) * | 2019-05-13 | 2020-12-29 | 山东大学 | Personalized capsule wardrobe creating method and device and capsule wardrobe |
2020-08-04: CN202010771701.0A filed; patent CN111930972B granted (status: Active)
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||