CN116304135A - Cross-modal retrieval method, device and medium based on discriminant hidden space learning - Google Patents

Cross-modal retrieval method, device and medium based on discriminant hidden space learning

Info

Publication number
CN116304135A
CN116304135A
Authority
CN
China
Prior art keywords
data
hidden space
modal
feature
training set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310594729.5A
Other languages
Chinese (zh)
Other versions
CN116304135B (en)
Inventor
郑敏
吴春鹏
林龙
刘卫卫
张国梁
初宗博
周飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Smart Grid Research Institute Co ltd
State Grid Corp of China SGCC
Original Assignee
State Grid Smart Grid Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Smart Grid Research Institute Co ltd
Priority to CN202310594729.5A
Publication of CN116304135A
Application granted
Publication of CN116304135B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/483 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a cross-modal retrieval method, device and medium based on discriminant hidden space learning. The method comprises the following steps: extracting first features of first modal data and second features of second modal data, and constructing a first training set and a second training set; training a dual dictionary model with a discriminant attribute added, using the first training set and the second training set, to obtain a hidden space feature model; based on the hidden space feature model, respectively projecting the features of the modal data to be retrieved and the features of the modal data in the retrieval database into a hidden space, to obtain the corresponding hidden space feature representations; and performing similarity calculation on the hidden space feature representations to obtain a retrieval result. By implementing the invention, the hidden space is constructed using the dual dictionary learning technique, so that multi-modal data can be aligned in the hidden space. Discriminant attributes are added to the dual dictionary model, so that distances within a subclass become more compact and distances between subclasses become more dispersed, which better fits the fine-grained multi-modal scenarios of the power grid.

Description

Cross-modal retrieval method, device and medium based on discriminant hidden space learning
Technical Field
The invention relates to the technical field of cross-modal retrieval, in particular to a cross-modal retrieval method, device and medium based on discriminant hidden space learning.
Background
With the rapid development of internet technology, multimodal data (such as text, images, audio and video) is emerging in an endless stream, and conventional single-modality retrieval can no longer meet users' demands. Cross-modal retrieval is gradually becoming the mainstream of information retrieval because it integrates and complements information from multiple modalities.
Multimodal retrieval in a power scenario means that a user enters a query object (e.g., an image) into a computer, and the computer returns a retrieval result (e.g., a text) that belongs to the same subcategory as the query object, where the query object and the retrieval result belong to different modalities. With the rapid growth of big data and artificial intelligence technology, multi-modal retrieval for the power grid has broad development and application prospects. For example, in an electric power scenario, different operation violations can be distinguished by means of fine-grained feature learning, thereby ensuring the safety of the grid operation environment. Currently, fine-grained multi-modal retrieval for power grids faces the following challenges: the discriminability of intra-modal features is weak, and the inter-modal semantic relevance is weak. This is exactly contrary to what is desired, namely that "the discriminability of different subclasses within the same modality is strong, and the semantic relevance of data between different modalities is strong".
Disclosure of Invention
In view of the above, the embodiments of the invention provide a cross-modal retrieval method, device and medium based on discriminant hidden space learning, so as to solve the technical problems in the prior art of weak discriminability of intra-modal features and weak inter-modal semantic relevance.
The technical scheme provided by the invention is as follows:
the first aspect of the embodiment of the invention provides a cross-modal retrieval method based on discriminant hidden space learning, which comprises the following steps: extracting first features of the first modal data and second features of the second modal data, and constructing a first training set and a second training set; training the dual-dictionary model with the discriminant attribute added by adopting the first training set and the second training set to obtain a hidden space feature model; based on the hidden space feature model, respectively projecting the feature of the mode data to be searched and the feature of the mode data in the search database into a hidden space to obtain a hidden space feature representation of the mode data to be searched and a hidden space feature representation of the mode data in the search database; and carrying out similarity calculation on the hidden space feature representation of the modal data to be searched and the hidden space feature representation of the modal data in the search database to obtain a search result of the modal data to be searched.
Optionally, extracting the first feature of the first modality data and the second feature of the second modality data, and constructing the first training set and the second training set includes: acquiring first modality data and second modality data, wherein the first modality data is image data, and the second modality data is text data; extracting first features of the first modal data by using a first feature extractor, and constructing a first training set based on the first features; and extracting second features of the second modal data by adopting a second feature extractor, and constructing a second training set based on the second features.
Optionally, the discriminant attribute is an attribute indicating whether the first modality data and the second modality data belong to the same sub-category.
Optionally, the objective function and the constraint conditions of the dual dictionary model with the discriminant attribute added are expressed as:

$$\min_{D_1,D_2,Z,U}\;\alpha\lVert V-D_1Z\rVert_F^2+(1-\alpha)\lVert T-D_2Z\rVert_F^2+\beta\lVert H-UZ\rVert_F^2$$
$$\text{s.t.}\quad\lVert d_{1,i}\rVert_2^2\le 1,\;\lVert d_{2,i}\rVert_2^2\le 1,\;\lVert u_i\rVert_2^2\le 1,\;\forall i$$

wherein V represents the first training set, T represents the second training set, $D_1$ and $D_2$ respectively represent the dictionaries in the dual dictionary model, Z represents the hidden space feature representation onto which the first features in the first training set and the second features in the second training set are projected, H represents the class labels of the first features in the first training set and the second features in the second training set, U represents the weights of the classifier in the hidden space, $\alpha$ and $\beta$ are hyperparameters; $d_{1,i}$ represents the i-th column of $D_1$, $d_{2,i}$ represents the i-th column of $D_2$, and $u_i$ represents the i-th column of U.
Optionally, based on the hidden space feature model, projecting the feature of the mode data to be searched and the feature of the mode data in the search database into a hidden space respectively, and before obtaining the hidden space feature representation of the mode data to be searched and the hidden space feature representation of the mode data in the search database, further including:
and respectively carrying out feature extraction on the mode data to be searched and the mode data in the search database to obtain the mode data features to be searched and the mode data features in the search database, wherein the mode data to be searched and the mode data in the search database are different in mode.
Optionally, the hidden space feature representation of the modal data to be retrieved is calculated using the following formula:

$$z_q=\arg\min_{z}\;\lVert x_q-D_qz\rVert_2^2+\gamma\lVert z\rVert_2^2$$

where $x_q$ represents the feature of the data to be retrieved, $z_q$ represents the corresponding feature obtained by projecting the data feature to be retrieved into the hidden space, i.e. the optimal solution of the optimization problem on the right, $D_q$ is the learned dictionary corresponding to the modality of the data to be retrieved, and $\gamma$ represents a hyperparameter.

The hidden space feature representation of the modal data in the retrieval database is calculated using the following formula:

$$z_r=\arg\min_{z}\;\lVert x_r-D_rz\rVert_2^2+\gamma\lVert z\rVert_2^2$$

where $x_r$ represents the feature of the modal data in the retrieval database, $z_r$ represents the corresponding feature obtained by projecting the modal data feature in the retrieval database into the hidden space, i.e. the optimal solution of the optimization problem on the right, and $D_r$ is the learned dictionary corresponding to the modality of the modal data in the retrieval database.
Optionally, performing similarity calculation on the hidden space feature representation of the to-be-searched modal data and the hidden space feature representation of the modal data in the search database to obtain a search result of the to-be-searched modal data, including: calculating the cosine similarity of the hidden space feature representation of the modal data to be searched and the hidden space feature representation of each modal data in the search database to obtain a plurality of similarity results; and selecting the mode data in the retrieval database corresponding to the maximum value in the similarity results as the retrieval result of the mode data to be retrieved.
The second aspect of the embodiment of the invention provides a cross-modal retrieval device based on discriminant hidden space learning, which comprises: the feature extraction module is used for extracting first features of the first modal data and second features of the second modal data and constructing a first training set and a second training set; the training module is used for training the double dictionary model with the discriminant attribute added by adopting the first training set and the second training set to obtain a hidden space feature model; the projection module is used for respectively projecting the mode data features to be searched and the mode data features in the search database into the hidden space based on the hidden space feature model to obtain hidden space feature representation of the mode data to be searched and hidden space feature representation of the mode data in the search database; and the retrieval module is used for carrying out similarity calculation on the hidden space feature representation of the modal data to be retrieved and the hidden space feature representation of the modal data in the retrieval database to obtain a retrieval result of the modal data to be retrieved.
A third aspect of the embodiments of the present invention provides a computer-readable storage medium storing computer instructions, where the computer instructions are configured to cause a computer to perform the cross-modal retrieval method based on discriminant hidden space learning according to the first aspect of the embodiments of the present invention or any implementation of the first aspect.
A fourth aspect of an embodiment of the present invention provides an electronic device, including: the system comprises a memory and a processor, wherein the memory and the processor are in communication connection, the memory stores computer instructions, and the processor executes the computer instructions so as to execute the cross-modal retrieval method based on discriminant hidden space learning according to the first aspect of the embodiment of the invention.
The technical scheme provided by the invention has the following effects:
according to the cross-modal retrieval method, device and medium based on discriminant hidden space learning, the hidden space is constructed through the double dictionary learning technology, and the multi-modal data are aligned in the hidden space. Aiming at the problem of weaker cross-modal semantic association, discriminant attributes are added into the double-dictionary model, so that the distance in the subclasses is more compact and the distance between the subclasses is more sparse. The method is more in line with the fine-grained multi-mode scene setting of the power grid. The method solves the technical problems of weak intra-modal feature discrimination and weak inter-modal semantic relevance in the prior art.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a cross-modal retrieval method based on discriminative hidden space learning, in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram of a structure of a dual dictionary model with discriminant attributes added in a cross-modal retrieval method based on discriminant hidden space learning according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of multi-modal alignment by a cross-modal retrieval method based on discriminant hidden space learning according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a computer-readable storage medium provided according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
The terms first, second, third, fourth and the like in the description and in the claims and in the above drawings are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an embodiment of the present invention, there is provided a cross-modal retrieval method based on discriminant hidden space learning, it being noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different from that herein.
In this embodiment, a cross-modal searching method based on discriminant hidden space learning is provided, which may be used in an electronic device, and fig. 1 is a flowchart of a cross-modal searching method based on discriminant hidden space learning according to an embodiment of the present invention, as shown in fig. 1, and the method includes the following steps:
step S101: and extracting first features of the first modal data and second features of the second modal data, and constructing a first training set and a second training set. Specifically, the first modality data and the second modality data belong to data of different modalities, for example, the first modality data is image data, the second modality data is text data, or the first modality data is text data, the second modality data is image data, and the specific modality of the first modality data and the second modality data is not limited in the embodiment of the present invention.
In an embodiment, extracting a first feature of the first modality data and a second feature of the second modality data, constructing a first training set and a second training set, includes: acquiring first modality data and second modality data, wherein the first modality data is image data, and the second modality data is text data; extracting first features of the first modal data by using a first feature extractor, and constructing a first training set based on the first features; and extracting second features of the second modal data by adopting a second feature extractor, and constructing a second training set based on the second features.
Specifically, when the first modality data is image data and the second modality data is text data, the first features of the first modality data are extracted with a ResNet50 feature extractor, i.e. the first feature extractor, and the second features of the second modality data are extracted with a Bag-of-Words (BOW) feature extractor, i.e. the second feature extractor; that is, feature extraction is performed on the image data and the text data with the ResNet50 feature extractor and the BOW feature extractor respectively. ResNet50 is a residual network and needs to be pre-trained before use; its training process follows the related art and is not repeated here. BOW feature extraction mainly involves the steps of text tokenization, vocabulary construction, word-vector representation and frequency counting.
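As an illustrative sketch only (not part of the claimed method — the pretrained weights, preprocessing steps and vocabulary size are assumptions), the two feature extractors could be set up as follows:

    import torch
    from torchvision import models, transforms
    from sklearn.feature_extraction.text import CountVectorizer

    # Image branch: ResNet50 pretrained on ImageNet (assumed weights), with the
    # final classification layer replaced by an identity so the network outputs
    # a 2048-dimensional feature vector per image.
    resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    resnet.fc = torch.nn.Identity()
    resnet.eval()

    preprocess = transforms.Compose([
        transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    def image_feature(pil_image):
        """Return one first-modality feature vector for a PIL image."""
        with torch.no_grad():
            return resnet(preprocess(pil_image).unsqueeze(0)).squeeze(0).numpy()

    # Text branch: bag-of-words — tokenization, vocabulary construction and
    # term-frequency counting are all handled by CountVectorizer.
    bow = CountVectorizer(max_features=5000)

    def text_features(corpus):
        """Return second-modality features as columns of a matrix (one per document)."""
        return bow.fit_transform(corpus).toarray().T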
The constructed first training set is expressed as $V=[v_1,v_2,\dots,v_n]$, where $v_i$ represents the i-th first feature and n represents the total number of first features in the first training set. The constructed second training set is expressed as $T=[t_1,t_2,\dots,t_n]$, where $t_i$ represents the i-th second feature.
Step S102: training the dual dictionary model with the discriminant attribute added, using the first training set and the second training set, to obtain a hidden space feature model. Specifically, the precondition for realizing cross-modal retrieval is that the features of different modalities are aligned so that they can be associated and matched. Therefore, features of different modalities are mapped into a common hidden space by setting up a dual dictionary model, which gives the approach scalability. If the first modality and the second modality are images and texts respectively, the dual dictionary model comprises an image dictionary and a text dictionary. The goal of dictionary learning is to extract the most essential features of things (analogous to the words or terms in a dictionary). Once such a dictionary is obtained, the most essential characteristics of the data are captured, because the dictionary contains the most intrinsic features. In other words, the dimensionality of the information is reduced, and the interference of insignificant information in characterizing the data is reduced. The sparse model removes a large number of redundant variables and retains only the explanatory variables most relevant to the response variable; this simplifies the model while preserving the most important information in the dataset and effectively alleviates many problems in modeling high-dimensional datasets. Therefore, dictionary learning can also simply be called sparse coding. Viewed from the perspective of matrix decomposition, the dictionary learning process is as follows: given a sample dataset Y in which each column represents a sample, the goal of dictionary learning is to decompose the matrix Y into the product of a dictionary matrix D and a coefficient matrix X. The dual dictionary model adopted in this step decomposes the two sample datasets, i.e. the first training set and the second training set, in this way, as sketched below.
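For intuition only, a minimal sketch of this decomposition using an off-the-shelf sparse coder (note that scikit-learn's convention is transposed relative to the text: samples are rows, and Y ≈ codes · dictionary; the shapes below are assumptions):

    import numpy as np
    from sklearn.decomposition import DictionaryLearning

    rng = np.random.default_rng(0)
    Y = rng.normal(size=(200, 64))        # 200 samples (rows) with 64-d features

    # Learn a 32-atom dictionary and sparse codes such that Y ~ codes @ dictionary.
    dl = DictionaryLearning(n_components=32, alpha=1.0, max_iter=50, random_state=0)
    codes = dl.fit_transform(Y)           # sparse coefficient matrix X (200 x 32)
    dictionary = dl.components_           # dictionary D (32 atoms x 64 dims)

    print("reconstruction error:", np.linalg.norm(Y - codes @ dictionary))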
In order to make the hidden space more discriminative, discriminant attributes are introduced into the hidden space to enhance its discriminability, where a discriminant attribute is an attribute indicating whether the first modal data and the second modal data belong to the same subcategory. By adding discriminant attributes, modal data from the same subcategory are kept as close together as possible, while modal data from different subcategories are pushed as far apart as possible, so that different fine-grained subcategories can be better distinguished in the hidden space; the label encoding this relies on is sketched below.
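For concreteness, a minimal sketch of building the class-label matrix H used by the discriminant term (the one-hot convention follows the definition of H in the embodiment below; the function name is illustrative):

    import numpy as np

    def one_hot_labels(labels, num_classes):
        """Build H (num_classes x n): column i is the one-hot label of sample i;
        its non-zero entry marks the fine-grained subcategory of that sample."""
        H = np.zeros((num_classes, len(labels)))
        H[labels, np.arange(len(labels))] = 1.0
        return H

    # e.g. three samples from subcategories 0, 2 and 1:
    # one_hot_labels(np.array([0, 2, 1]), 3)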
Step S103: based on the hidden space feature model, respectively projecting the features of the modal data to be retrieved and the features of the modal data in the retrieval database into the hidden space, to obtain the hidden space feature representation of the modal data to be retrieved and the hidden space feature representation of the modal data in the retrieval database. During retrieval, the features of the modal data to be retrieved are extracted first. Specifically, if its modality is the same as that of the first modal data, the first feature extractor is used for feature extraction; if its modality is the same as that of the second modal data, the second feature extractor is used. For the retrieval database, feature extraction is performed on the modal data in the database in the same way to obtain the modal data features. The modality of the data to be retrieved differs from that of the modal data in the retrieval database.
Step S104: performing similarity calculation on the hidden space feature representation of the modal data to be retrieved and the hidden space feature representation of the modal data in the retrieval database, to obtain a retrieval result for the modal data to be retrieved. Specifically, the similarity between the hidden space feature representation of the modal data to be retrieved and the hidden space feature representation of each modal data item in the retrieval database is calculated, and the most similar modal data item is selected as the retrieval result.
According to the cross-modal retrieval method based on discriminant hidden space learning provided by this embodiment, the hidden space is constructed through the dual dictionary learning technique, and the multi-modal data are aligned in the hidden space. To address the weak cross-modal semantic association, discriminant attributes are added to the dual dictionary model, so that distances within subclasses become more compact and distances between subclasses become more dispersed, which better fits the fine-grained multi-modal scenarios of the power grid. This solves the technical problems in the prior art of weak intra-modal feature discriminability and weak inter-modal semantic relevance.
In one embodiment, the objective function and constraint of the dual dictionary model with added discriminant attributes are expressed as:

$$\min_{D_1,D_2,Z,U}\;\alpha\lVert V-D_1Z\rVert_F^2+(1-\alpha)\lVert T-D_2Z\rVert_F^2+\beta\lVert H-UZ\rVert_F^2$$
$$\text{s.t.}\quad\lVert d_{1,i}\rVert_2^2\le 1,\;\lVert d_{2,i}\rVert_2^2\le 1,\;\lVert u_i\rVert_2^2\le 1,\;\forall i$$

wherein V represents the first training set, T represents the second training set, $D_1$ and $D_2$ respectively represent the dictionaries in the dual dictionary model, Z represents the hidden space feature representation onto which the first features in the first training set and the second features in the second training set are projected, H represents the class labels of the first features in the first training set and the second features in the second training set, U represents the weights of the classifier in the hidden space, $\alpha$ and $\beta$ are hyperparameters; $d_{1,i}$ represents the i-th column of $D_1$, $d_{2,i}$ represents the i-th column of $D_2$, and $u_i$ represents the i-th column of U. Here $V\in\mathbb{R}^{j\times n}$ and $T\in\mathbb{R}^{p\times n}$, where j and p represent the dimensions of the first feature and the second feature respectively, and n represents the number of feature samples in the first training set or the second training set. $D_1\in\mathbb{R}^{j\times k}$ and $D_2\in\mathbb{R}^{p\times k}$, where k represents the dimension of the hidden space. $Z\in\mathbb{R}^{k\times n}$ represents the hidden space feature representation onto which V and T are projected together; by constraining the hidden space features mapped from V and T to coincide in the representation Z, the two heterogeneous spaces are aligned. $\alpha$ is the weight that adjusts the relative importance of the first modality space and the second modality space. $H\in\mathbb{R}^{c\times n}$, where c represents the number of classes; the i-th column $h_i$ of H is the one-hot label corresponding to the i-th sample, whose non-zero entry indicates the class of the i-th first feature. $U\in\mathbb{R}^{c\times k}$ can be regarded as the weights of a classifier in the hidden space. It should be noted that the third term in the objective function is intended to make the hidden space sufficiently discriminative: specifically, samples of the same subclass are kept as close together as possible and samples of different subclasses as far apart as possible, so as to classify different fine-grained subclasses. Specifically, when the first modality and the second modality are images and texts respectively, a schematic diagram of multi-modal alignment using this objective function is shown in fig. 2.
Specifically, for the objective function, an alternating optimization method is employed to solve for closed-form solutions. All variables are first initialized:

(1) Fix $D_1$, $D_2$ and $U$, and update $Z$ according to the objective function. Setting the derivative of the objective function with respect to Z to 0, the closed-form solution of Z is:

$$Z=\left(\alpha D_1^{\top}D_1+(1-\alpha)D_2^{\top}D_2+\beta U^{\top}U\right)^{-1}\left(\alpha D_1^{\top}V+(1-\alpha)D_2^{\top}T+\beta U^{\top}H\right)$$

(2) Fix $Z$ and update $D_1$. The sub-problem is formalized as:

$$\min_{D_1}\;\lVert V-D_1Z\rVert_F^2\quad\text{s.t.}\;\lVert d_{1,i}\rVert_2^2\le 1,\;\forall i$$

This problem can be optimized according to the Lagrangian dual rule, with the solution:

$$D_1=VZ^{\top}\left(ZZ^{\top}+\Lambda\right)^{-1}$$

where $\Lambda$ is a diagonal matrix made up of all the Lagrangian dual variables.

(3) Fix $Z$ and update $D_2$. The sub-problem is formalized as:

$$\min_{D_2}\;\lVert T-D_2Z\rVert_F^2\quad\text{s.t.}\;\lVert d_{2,i}\rVert_2^2\le 1,\;\forall i$$

This problem can be optimized according to the Lagrangian dual rule, with the solution:

$$D_2=TZ^{\top}\left(ZZ^{\top}+\Lambda\right)^{-1}$$

where $\Lambda$ is a diagonal matrix made up of all the Lagrangian dual variables.

(4) Fix $Z$ and update $U$. The sub-problem is formalized as:

$$\min_{U}\;\lVert H-UZ\rVert_F^2\quad\text{s.t.}\;\lVert u_i\rVert_2^2\le 1,\;\forall i$$

This problem can be optimized according to the Lagrangian dual rule, with the solution:

$$U=HZ^{\top}\left(ZZ^{\top}+\Lambda\right)^{-1}$$
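The following is a minimal NumPy sketch of this alternating scheme under the updates reconstructed above. As a stated simplification, the diagonal Lagrangian dual matrix Λ is replaced by a fixed ridge term εI plus rescaling of over-norm columns, rather than the exact Lagrangian dual; function and variable names are illustrative, not from the patent:

    import numpy as np

    def train_dual_dictionary(V, T, H, k, alpha=0.5, beta=1.0, eps=1e-3, iters=50):
        """Alternately minimize a*||V-D1 Z||^2 + (1-a)*||T-D2 Z||^2 + b*||H-U Z||^2."""
        rng = np.random.default_rng(0)
        D1 = rng.normal(size=(V.shape[0], k))
        D2 = rng.normal(size=(T.shape[0], k))
        U = rng.normal(size=(H.shape[0], k))
        for _ in range(iters):
            # (1) closed-form update of the shared hidden representation Z
            A = alpha * D1.T @ D1 + (1 - alpha) * D2.T @ D2 + beta * U.T @ U
            B = alpha * D1.T @ V + (1 - alpha) * D2.T @ T + beta * U.T @ H
            Z = np.linalg.solve(A + eps * np.eye(k), B)
            # (2)-(4) least-squares updates of D1, D2 and U, with a ridge term
            # standing in for the diagonal Lagrangian dual matrix
            G = Z @ Z.T + eps * np.eye(k)
            D1 = np.linalg.solve(G, (V @ Z.T).T).T
            D2 = np.linalg.solve(G, (T @ Z.T).T).T
            U = np.linalg.solve(G, (H @ Z.T).T).T
            # enforce the unit-norm column constraints by rescaling
            for M in (D1, D2, U):
                M /= np.maximum(np.linalg.norm(M, axis=0, keepdims=True), 1.0)
        return D1, D2, U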
in one embodiment, the implicit spatial signature representation of the modal data to be retrieved is calculated using the following formula:
Figure SMS_61
in the method, in the process of the invention,
Figure SMS_62
representing the characteristics of the data to be retrieved->
Figure SMS_63
Representing the corresponding feature of the projection of the data feature to be retrieved into the hidden space, < >>
Figure SMS_64
Representation->
Figure SMS_65
Is the optimal solution of (a)Gamma represents a hyper-parameter;
the hidden space feature representation of the model data in the retrieval database is calculated by adopting the following formula:
Figure SMS_66
in the method, in the process of the invention,
Figure SMS_67
representing the characteristics of the model data in the search database, +.>
Figure SMS_68
Representing the corresponding feature of the projection of the model data feature in the search database into the hidden space, +.>
Figure SMS_69
Representation->
Figure SMS_70
Is a solution to the optimization of (3).
It should be noted that, in the above formula, agr is an abbreviation of term, and a parameter corresponding to an optimal solution of an optimization problem is taken, where the right side of the formula is an optimization problem, and the left side represents an optimal condition on the right side. The formula for calculating the hidden space feature representation is obtained by solving the objective function by adopting an alternate optimization method.
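A one-function sketch of this projection step, using the closed form stated above (D stands for whichever learned dictionary matches the data's modality; names are illustrative):

    import numpy as np

    def project(D, X, gamma=0.1):
        """Project feature column(s) X into the hidden space:
        z = argmin_z ||x - Dz||^2 + gamma*||z||^2 = (D^T D + gamma*I)^-1 D^T x."""
        k = D.shape[1]
        return np.linalg.solve(D.T @ D + gamma * np.eye(k), D.T @ X)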
In one embodiment, performing similarity calculation on the hidden space feature representation of the to-be-searched modal data and the hidden space feature representation of the modal data in the search database to obtain a search result of the to-be-searched modal data, including: calculating the cosine similarity of the hidden space feature representation of the modal data to be searched and the hidden space feature representation of each modal data in the search database to obtain a plurality of similarity results; and selecting the mode data in the retrieval database corresponding to the maximum value in the similarity results as the retrieval result of the mode data to be retrieved.
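For illustration, the similarity step could be implemented as follows (cosine similarity against every database column, then argmax; names are illustrative):

    import numpy as np

    def retrieve(z_query, Z_db):
        """Return (best index, all similarities): cosine similarity between the
        query's hidden-space feature and each hidden-space feature (column)
        of the retrieval database."""
        Zn = Z_db / np.linalg.norm(Z_db, axis=0, keepdims=True)
        qn = z_query / np.linalg.norm(z_query)
        sims = qn @ Zn                       # one similarity per database item
        return int(np.argmax(sims)), sims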
In an embodiment, taking the image modality and the text modality as examples, the cross-modal retrieval method based on discriminant hidden space learning is implemented by the following procedure (an end-to-end sketch using the illustrative helpers above follows this list):
1. Feature extraction is performed on the images and the texts using a ResNet50 feature extractor and a Bag-of-Words (BOW) feature extractor respectively, to obtain an image feature set and a text feature set.
2. The dual dictionary model with the discriminant attribute added is trained using the image feature set and the text feature set, and the model parameters are determined to obtain the hidden space feature model.
3. The features of the query image are extracted using ResNet50, and the BOW features of all texts in the text retrieval database are extracted.
4. The extracted features of the query image and the features of all texts are respectively projected into the hidden space according to the hidden space feature model, to obtain the corresponding hidden space feature representations.
5. The cosine similarity between the hidden space feature representation of the image and that of each text is calculated, and the text corresponding to the highest similarity is selected as the retrieval result.
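Tying the sketches above together, a hypothetical end-to-end flow for this image-to-text example (random stand-in data; all helper names and shapes are the illustrative ones defined in the earlier sketches, not the patent's reference implementation):

    import numpy as np
    rng = np.random.default_rng(1)
    V = rng.normal(size=(2048, 100))                    # image features (columns are samples)
    T = rng.normal(size=(5000, 100))                    # text BOW features
    H = one_hot_labels(rng.integers(0, 10, 100), 10)    # fine-grained subcategory labels

    D1, D2, U = train_dual_dictionary(V, T, H, k=64)    # step 2: train the model
    z_img = project(D1, rng.normal(size=2048))          # steps 3-4: embed a query image feature
    Z_txt = project(D2, T)                              # step 4: embed all database texts
    best, sims = retrieve(z_img, Z_txt)                 # step 5: cosine ranking
    print("retrieved text index:", best)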
The embodiment of the invention also provides a cross-modal retrieval device based on discriminant hidden space learning, as shown in fig. 3, the device comprises:
the feature extraction module is used for extracting first features of the first modal data and second features of the second modal data and constructing a first training set and a second training set; the specific content refers to the corresponding parts of the above method embodiments, and will not be described herein.
The training module is used for training the double dictionary model with the discriminant attribute added by adopting the first training set and the second training set to obtain a hidden space feature model; the specific content refers to the corresponding parts of the above method embodiments, and will not be described herein.
The projection module is used for respectively projecting the mode data features to be searched and the mode data features in the search database into the hidden space based on the hidden space feature model to obtain hidden space feature representation of the mode data to be searched and hidden space feature representation of the mode data in the search database; the specific content refers to the corresponding parts of the above method embodiments, and will not be described herein.
And the retrieval module is used for carrying out similarity calculation on the hidden space feature representation of the modal data to be retrieved and the hidden space feature representation of the modal data in the retrieval database to obtain a retrieval result of the modal data to be retrieved. The specific content refers to the corresponding parts of the above method embodiments, and will not be described herein.
According to the cross-modal retrieval device based on discriminant hidden space learning, the hidden space is constructed through the dual dictionary learning technique, so that the multi-modal data are aligned in the hidden space. To address the weak cross-modal semantic association, discriminant attributes are added to the dual dictionary model, so that distances within subclasses become more compact and distances between subclasses become more dispersed, which better fits the fine-grained multi-modal scenarios of the power grid. This solves the technical problems in the prior art of weak intra-modal feature discriminability and weak inter-modal semantic relevance.
For a detailed functional description of the cross-modal retrieval device based on discriminant hidden space learning provided by the embodiment of the invention, refer to the description of the cross-modal retrieval method based on discriminant hidden space learning in the above embodiments.
The embodiment of the present invention further provides a storage medium, as shown in fig. 4, on which a computer program 601 is stored, which when executed by a processor, implements the steps of the cross-modal retrieval method based on discriminative hidden space learning in the above embodiment. The storage medium also stores audio and video stream data, characteristic frame data, interactive request signaling, encrypted data, preset data size and the like. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a Flash Memory (Flash Memory), a Hard Disk (HDD), or a Solid State Drive (SSD); the storage medium may also comprise a combination of memories of the kind described above.
It will be appreciated by those skilled in the art that implementing all or part of the above-described embodiment method may be implemented by a computer program to instruct related hardware, where the program may be stored in a computer readable storage medium, and the program may include the above-described embodiment method when executed. Wherein the storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a Flash Memory (Flash Memory), a Hard Disk (HDD), or a Solid State Drive (SSD); the storage medium may also comprise a combination of memories of the kind described above.
The embodiment of the present invention further provides an electronic device, as shown in fig. 5, where the electronic device may include a processor 51 and a memory 52, where the processor 51 and the memory 52 may be connected by a bus or other means, and in fig. 5, the connection is exemplified by a bus.
The processor 51 may be a central processing unit (Central Processing Unit, CPU). The processor 51 may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or combinations thereof.
The memory 52 serves as a non-transitory computer readable storage medium that may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as corresponding program instructions/modules in embodiments of the present invention. The processor 51 executes various functional applications of the processor and data processing by running non-transitory software programs, instructions and modules stored in the memory 52, i.e., implements the cross-modal retrieval method based on discriminant hidden space learning in the above-described method embodiment.
The memory 52 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application program required for at least one function, and the data storage area may store data created by the processor 51, etc. In addition, memory 52 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 52 may optionally include memory located remotely from processor 51, which may be connected to processor 51 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The one or more modules are stored in the memory 52, which when executed by the processor 51, perform a cross-modal retrieval method based on discriminative implicit space learning as in the embodiment shown in fig. 1-2.
The specific details of the electronic device may be understood correspondingly with reference to the corresponding related descriptions and effects in the embodiments shown in fig. 1 to 2, which are not repeated here.
Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.

Claims (10)

1. A cross-modal retrieval method based on discriminant hidden space learning is characterized by comprising the following steps:
extracting first features of the first modal data and second features of the second modal data, and constructing a first training set and a second training set;
training the dual-dictionary model with the discriminant attribute added by adopting the first training set and the second training set to obtain a hidden space feature model;
based on the hidden space feature model, respectively projecting the feature of the mode data to be searched and the feature of the mode data in the search database into a hidden space to obtain a hidden space feature representation of the mode data to be searched and a hidden space feature representation of the mode data in the search database;
and carrying out similarity calculation on the hidden space feature representation of the modal data to be searched and the hidden space feature representation of the modal data in the search database to obtain a search result of the modal data to be searched.
2. The cross-modal retrieval method based on discriminant hidden space learning of claim 1, wherein extracting first features of first modal data and second features of second modal data to construct a first training set and a second training set comprises:
acquiring first modality data and second modality data, wherein the first modality data is image data, and the second modality data is text data;
extracting first features of the first modal data by using a first feature extractor, and constructing a first training set based on the first features;
and extracting second features of the second modal data by adopting a second feature extractor, and constructing a second training set based on the second features.
3. The cross-modal retrieval method based on discriminant hidden space learning of claim 1, wherein said discriminant attributes are attributes representing whether the first modal data and the second modal data belong to the same sub-category.
4. The cross-modal retrieval method based on discriminant hidden space learning of claim 1, wherein the objective function and constraint conditions of the dual dictionary model with added discriminant attributes are expressed as:
$$\min_{D_1,D_2,Z,U}\;\alpha\lVert V-D_1Z\rVert_F^2+(1-\alpha)\lVert T-D_2Z\rVert_F^2+\beta\lVert H-UZ\rVert_F^2$$
$$\text{s.t.}\quad\lVert d_{1,i}\rVert_2^2\le 1,\;\lVert d_{2,i}\rVert_2^2\le 1,\;\lVert u_i\rVert_2^2\le 1,\;\forall i$$

wherein V represents the first training set, T represents the second training set, $D_1$ and $D_2$ respectively represent the dictionaries in the dual dictionary model, Z represents the hidden space feature representation onto which the first features in the first training set and the second features in the second training set are projected, H represents the class labels of the first features in the first training set and the second features in the second training set, U represents the weights of the classifier in the hidden space, $\alpha$ and $\beta$ are hyperparameters; $d_{1,i}$ represents the i-th column of $D_1$, $d_{2,i}$ represents the i-th column of $D_2$, and $u_i$ represents the i-th column of U.
5. The cross-modal retrieval method based on discriminant hidden space learning of claim 1, wherein based on the hidden space feature model, projecting the features of the modal data to be retrieved and the features of the modal data in the retrieval database into the hidden space respectively, before obtaining the hidden space feature representation of the modal data to be retrieved and the hidden space feature representation of the modal data in the retrieval database, further comprises:
and respectively carrying out feature extraction on the mode data to be searched and the mode data in the search database to obtain the mode data features to be searched and the mode data features in the search database, wherein the mode data to be searched and the mode data in the search database are different in mode.
6. The cross-modal retrieval method based on discriminant hidden space learning of claim 4, wherein the hidden space feature representation of the modal data to be retrieved is calculated using the following formula:
$$z_q=\arg\min_{z}\;\lVert x_q-D_qz\rVert_2^2+\gamma\lVert z\rVert_2^2$$

where $x_q$ represents the feature of the data to be retrieved, $z_q$ represents the corresponding feature obtained by projecting the data feature to be retrieved into the hidden space, i.e. the optimal solution of the optimization problem on the right, $D_q$ is the learned dictionary corresponding to the modality of the data to be retrieved, and $\gamma$ represents a hyperparameter;

the hidden space feature representation of the modal data in the retrieval database is calculated using the following formula:

$$z_r=\arg\min_{z}\;\lVert x_r-D_rz\rVert_2^2+\gamma\lVert z\rVert_2^2$$

where $x_r$ represents the feature of the modal data in the retrieval database, $z_r$ represents the corresponding feature obtained by projecting the modal data feature in the retrieval database into the hidden space, i.e. the optimal solution of the optimization problem on the right, and $D_r$ is the learned dictionary corresponding to the modality of the modal data in the retrieval database.
7. The cross-modal retrieval method based on discriminant hidden space learning of claim 1, wherein performing similarity calculation on the hidden space feature representation of the modal data to be retrieved and the hidden space feature representation of the modal data in the retrieval database to obtain a retrieval result of the modal data to be retrieved, comprises:
calculating the cosine similarity of the hidden space feature representation of the modal data to be searched and the hidden space feature representation of each modal data in the search database to obtain a plurality of similarity results;
and selecting the mode data in the retrieval database corresponding to the maximum value in the similarity results as the retrieval result of the mode data to be retrieved.
8. Cross-modal retrieval device based on discriminant hidden space learning is characterized by comprising:
the feature extraction module is used for extracting first features of the first modal data and second features of the second modal data and constructing a first training set and a second training set;
the training module is used for training the double dictionary model with the discriminant attribute added by adopting the first training set and the second training set to obtain a hidden space feature model;
the projection module is used for respectively projecting the mode data features to be searched and the mode data features in the search database into the hidden space based on the hidden space feature model to obtain hidden space feature representation of the mode data to be searched and hidden space feature representation of the mode data in the search database;
and the retrieval module is used for carrying out similarity calculation on the hidden space feature representation of the modal data to be retrieved and the hidden space feature representation of the modal data in the retrieval database to obtain a retrieval result of the modal data to be retrieved.
9. A computer-readable storage medium storing computer instructions for causing the computer to perform the cross-modal retrieval method based on discriminative hidden space learning as claimed in any one of claims 1 to 7.
10. An electronic device, comprising: a memory and a processor, the memory and the processor being communicatively connected to each other, the memory storing computer instructions, the processor executing the computer instructions to perform the cross-modal retrieval method based on discriminative implicit space learning as claimed in any of claims 1 to 7.
CN202310594729.5A 2023-05-25 2023-05-25 Cross-modal retrieval method, device and medium based on discriminant hidden space learning Active CN116304135B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310594729.5A CN116304135B (en) 2023-05-25 2023-05-25 Cross-modal retrieval method, device and medium based on discriminant hidden space learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310594729.5A CN116304135B (en) 2023-05-25 2023-05-25 Cross-modal retrieval method, device and medium based on discriminant hidden space learning

Publications (2)

Publication Number Publication Date
CN116304135A 2023-06-23
CN116304135B (en) 2023-08-08

Family

ID=86832717

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310594729.5A Active CN116304135B (en) 2023-05-25 2023-05-25 Cross-modal retrieval method, device and medium based on discriminant hidden space learning

Country Status (1)

Country Link
CN (1) CN116304135B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491788A (en) * 2017-08-21 2017-12-19 Tianjin University A zero-shot classification method based on dictionary learning
CN109299341A (en) * 2018-10-29 2019-02-01 Shandong Normal University An adversarial cross-modal retrieval method and system based on dictionary learning
CN114691986A (en) * 2022-03-21 2022-07-01 Hefei University of Technology Cross-modal retrieval method based on subspace adaptive spacing and storage medium
US20230154159A1 (en) * 2021-11-08 2023-05-18 Samsung Electronics Co., Ltd. Method and apparatus for real-world cross-modal retrieval problems

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491788A (en) * 2017-08-21 2017-12-19 Tianjin University A zero-shot classification method based on dictionary learning
CN109299341A (en) * 2018-10-29 2019-02-01 Shandong Normal University An adversarial cross-modal retrieval method and system based on dictionary learning
US20230154159A1 (en) * 2021-11-08 2023-05-18 Samsung Electronics Co., Ltd. Method and apparatus for real-world cross-modal retrieval problems
CN114691986A (en) * 2022-03-21 2022-07-01 Hefei University of Technology Cross-modal retrieval method based on subspace adaptive spacing and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BOKUN WANG et al.: "Adversarial Cross-Modal Retrieval", Proceedings of the 25th ACM International Conference on Multimedia, pages 154-162 *

Also Published As

Publication number Publication date
CN116304135B (en) 2023-08-08

Similar Documents

Publication Publication Date Title
US10691899B2 (en) Captioning a region of an image
CN111753060B (en) Information retrieval method, apparatus, device and computer readable storage medium
US11373390B2 (en) Generating scene graphs from digital images using external knowledge and image reconstruction
CN107066464B (en) Semantic natural language vector space
GB2547068B (en) Semantic natural language vector space
KR101754473B1 (en) Method and system for automatically summarizing documents to images and providing the image-based contents
US10713298B2 (en) Video retrieval methods and apparatuses
WO2019052403A1 (en) Training method for image-text matching model, bidirectional search method, and related apparatus
WO2021139191A1 (en) Method for data labeling and apparatus for data labeling
Wu et al. Learning of multimodal representations with random walks on the click graph
US10482146B2 (en) Systems and methods for automatic customization of content filtering
CN112204575A (en) Multi-modal image classifier using text and visual embedding
Sahbi et al. Context-based support vector machines for interconnected image annotation
CN110968725B (en) Image content description information generation method, electronic device and storage medium
WO2023179429A1 (en) Video data processing method and apparatus, electronic device, and storage medium
CN113886571A (en) Entity identification method, entity identification device, electronic equipment and computer readable storage medium
CN111985228A (en) Text keyword extraction method and device, computer equipment and storage medium
Moumtzidou et al. ITI-CERTH participation to TRECVID 2012.
CN116150698B (en) Automatic DRG grouping method and system based on semantic information fusion
CN113297410A (en) Image retrieval method and device, computer equipment and storage medium
US11250299B2 (en) Learning representations of generalized cross-modal entailment tasks
WO2024093578A1 (en) Voice recognition method and apparatus, and electronic device, storage medium and computer program product
CN115204318B (en) Event automatic hierarchical classification method and electronic equipment
CN116304135B (en) Cross-modal retrieval method, device and medium based on discriminant hidden space learning
WO2023173552A1 (en) Establishment method for target detection model, application method for target detection model, and device, apparatus and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231102

Address after: 102209 18 Riverside Avenue, Changping District science and Technology City, Beijing

Patentee after: State Grid Smart Grid Research Institute Co.,Ltd.

Patentee after: STATE GRID CORPORATION OF CHINA

Address before: 102209 18 Riverside Avenue, Changping District science and Technology City, Beijing

Patentee before: State Grid Smart Grid Research Institute Co.,Ltd.