CN116304135A - Cross-modal retrieval method, device and medium based on discriminant hidden space learning - Google Patents

Cross-modal retrieval method, device and medium based on discriminant hidden space learning

Info

Publication number
CN116304135A
CN116304135A
Authority
CN
China
Prior art keywords
data
hidden space
modal
feature
training set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310594729.5A
Other languages
Chinese (zh)
Other versions
CN116304135B (en)
Inventor
郑敏
吴春鹏
林龙
刘卫卫
张国梁
初宗博
周飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Smart Grid Research Institute Co ltd
State Grid Corp of China SGCC
Original Assignee
State Grid Smart Grid Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Smart Grid Research Institute Co ltd
Priority to CN202310594729.5A
Publication of CN116304135A
Application granted
Publication of CN116304135B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/483 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a cross-modal retrieval method, device and medium based on discriminant hidden space learning. The method comprises the following steps: extracting first features of first modal data and second features of second modal data, and constructing a first training set and a second training set; training a dual dictionary model with a discriminant attribute added, using the first training set and the second training set, to obtain a hidden space feature model; based on the hidden space feature model, respectively projecting the features of the modal data to be retrieved and the features of the modal data in the retrieval database into a hidden space, to obtain the corresponding hidden space feature representations; and performing similarity calculation on the hidden space feature representations to obtain a retrieval result. By implementing the invention, the hidden space is constructed using the dual dictionary learning technique, so that multi-modal data can be aligned in the hidden space. Discriminant attributes are added to the dual dictionary model, so that distances within a subclass become more compact and distances between subclasses become more dispersed, which better fits the fine-grained multi-modal scenarios of the power grid.

Description

Cross-modal retrieval method, device and medium based on discriminant hidden space learning
Technical Field
The invention relates to the technical field of cross-modal retrieval, in particular to a cross-modal retrieval method, device and medium based on discriminant hidden space learning.
Background
With the rapid development of internet technology, multimodal data (such as text, images, audio and video) is emerging in an endless stream, and conventional single-modality retrieval can no longer meet users' demands. Cross-modal retrieval is gradually becoming the mainstream of information retrieval because it integrates and complements information from multiple modalities.
Multimodal retrieval in a power scenario means that a user enters a query object (e.g., an image) into a computer, and the computer returns a retrieval result (e.g., a text) that belongs to the same subcategory as the query object, where the query object and the retrieval result belong to different modalities. With the rapid growth of big data and artificial intelligence technology, multi-modal retrieval for the power grid has broad development and application prospects. For example, in an electric power scenario, different operation violations can be distinguished by means of fine-grained feature learning, thereby ensuring the safety of the grid operation environment. Currently, fine-grained multi-modal retrieval for power grids faces the following challenges: the discriminability of intra-modal features is weak, and the inter-modal semantic relevance is weak. This is exactly contrary to what is desired, namely that "the discriminability of different subclasses within the same modality is strong, and the semantic relevance of data between different modalities is strong".
Disclosure of Invention
In view of the above, the embodiments of the invention provide a cross-modal retrieval method, device and medium based on discriminant hidden space learning, so as to solve the technical problems in the prior art of weak discriminability of intra-modal features and weak inter-modal semantic relevance.
The technical scheme provided by the invention is as follows:
the first aspect of the embodiment of the invention provides a cross-modal retrieval method based on discriminant hidden space learning, which comprises the following steps: extracting first features of the first modal data and second features of the second modal data, and constructing a first training set and a second training set; training the dual-dictionary model with the discriminant attribute added by adopting the first training set and the second training set to obtain a hidden space feature model; based on the hidden space feature model, respectively projecting the feature of the mode data to be searched and the feature of the mode data in the search database into a hidden space to obtain a hidden space feature representation of the mode data to be searched and a hidden space feature representation of the mode data in the search database; and carrying out similarity calculation on the hidden space feature representation of the modal data to be searched and the hidden space feature representation of the modal data in the search database to obtain a search result of the modal data to be searched.
Optionally, extracting the first feature of the first modality data and the second feature of the second modality data, and constructing the first training set and the second training set includes: acquiring first modality data and second modality data, wherein the first modality data is image data, and the second modality data is text data; extracting first features of the first modal data by using a first feature extractor, and constructing a first training set based on the first features; and extracting second features of the second modal data by adopting a second feature extractor, and constructing a second training set based on the second features.
Optionally, the discriminant attribute is an attribute indicating whether the first modality data and the second modality data belong to the same sub-category.
Optionally, the objective function and the constraint conditions of the dual dictionary model with the discriminant attribute added are expressed as:

$$\min_{D_1,D_2,Z,U}\;\alpha\lVert V-D_1Z\rVert_F^2+(1-\alpha)\lVert T-D_2Z\rVert_F^2+\beta\lVert H-UZ\rVert_F^2$$
$$\text{s.t.}\quad\lVert d_{1,i}\rVert_2^2\le 1,\;\lVert d_{2,i}\rVert_2^2\le 1,\;\lVert u_i\rVert_2^2\le 1,\;\forall i$$

wherein V represents the first training set, T represents the second training set, $D_1$ and $D_2$ respectively represent the dictionaries in the dual dictionary model, Z represents the hidden space feature representation onto which the first features in the first training set and the second features in the second training set are projected, H represents the class labels of the first features in the first training set and the second features in the second training set, U represents the weights of the classifier in the hidden space, $\alpha$ and $\beta$ are hyperparameters; $d_{1,i}$ represents the i-th column of $D_1$, $d_{2,i}$ represents the i-th column of $D_2$, and $u_i$ represents the i-th column of U.
Optionally, based on the hidden space feature model, projecting the feature of the mode data to be searched and the feature of the mode data in the search database into a hidden space respectively, and before obtaining the hidden space feature representation of the mode data to be searched and the hidden space feature representation of the mode data in the search database, further including:
and respectively carrying out feature extraction on the mode data to be searched and the mode data in the search database to obtain the mode data features to be searched and the mode data features in the search database, wherein the mode data to be searched and the mode data in the search database are different in mode.
Optionally, the hidden space feature representation of the modal data to be retrieved is calculated using the following formula:

$$z_q=\arg\min_{z}\;\lVert x_q-D_qz\rVert_2^2+\gamma\lVert z\rVert_2^2$$

where $x_q$ represents the feature of the data to be retrieved, $z_q$ represents the corresponding feature obtained by projecting the data feature to be retrieved into the hidden space, i.e. the optimal solution of the optimization problem on the right, $D_q$ is the learned dictionary corresponding to the modality of the data to be retrieved, and $\gamma$ represents a hyperparameter.

The hidden space feature representation of the modal data in the retrieval database is calculated using the following formula:

$$z_r=\arg\min_{z}\;\lVert x_r-D_rz\rVert_2^2+\gamma\lVert z\rVert_2^2$$

where $x_r$ represents the feature of the modal data in the retrieval database, $z_r$ represents the corresponding feature obtained by projecting the modal data feature in the retrieval database into the hidden space, i.e. the optimal solution of the optimization problem on the right, and $D_r$ is the learned dictionary corresponding to the modality of the modal data in the retrieval database.
Optionally, performing similarity calculation on the hidden space feature representation of the to-be-searched modal data and the hidden space feature representation of the modal data in the search database to obtain a search result of the to-be-searched modal data, including: calculating the cosine similarity of the hidden space feature representation of the modal data to be searched and the hidden space feature representation of each modal data in the search database to obtain a plurality of similarity results; and selecting the mode data in the retrieval database corresponding to the maximum value in the similarity results as the retrieval result of the mode data to be retrieved.
The second aspect of the embodiment of the invention provides a cross-modal retrieval device based on discriminant hidden space learning, which comprises: the feature extraction module is used for extracting first features of the first modal data and second features of the second modal data and constructing a first training set and a second training set; the training module is used for training the double dictionary model with the discriminant attribute added by adopting the first training set and the second training set to obtain a hidden space feature model; the projection module is used for respectively projecting the mode data features to be searched and the mode data features in the search database into the hidden space based on the hidden space feature model to obtain hidden space feature representation of the mode data to be searched and hidden space feature representation of the mode data in the search database; and the retrieval module is used for carrying out similarity calculation on the hidden space feature representation of the modal data to be retrieved and the hidden space feature representation of the modal data in the retrieval database to obtain a retrieval result of the modal data to be retrieved.
A third aspect of the embodiments of the present invention provides a computer-readable storage medium storing computer instructions, where the computer instructions are configured to cause a computer to perform the cross-modal retrieval method based on discriminant hidden space learning according to the first aspect of the embodiments of the present invention or any implementation of the first aspect.
A fourth aspect of an embodiment of the present invention provides an electronic device, including: the system comprises a memory and a processor, wherein the memory and the processor are in communication connection, the memory stores computer instructions, and the processor executes the computer instructions so as to execute the cross-modal retrieval method based on discriminant hidden space learning according to the first aspect of the embodiment of the invention.
The technical scheme provided by the invention has the following effects:
according to the cross-modal retrieval method, device and medium based on discriminant hidden space learning, the hidden space is constructed through the double dictionary learning technology, and the multi-modal data are aligned in the hidden space. Aiming at the problem of weaker cross-modal semantic association, discriminant attributes are added into the double-dictionary model, so that the distance in the subclasses is more compact and the distance between the subclasses is more sparse. The method is more in line with the fine-grained multi-mode scene setting of the power grid. The method solves the technical problems of weak intra-modal feature discrimination and weak inter-modal semantic relevance in the prior art.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a cross-modal retrieval method based on discriminative hidden space learning, in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram of a structure of a dual dictionary model with discriminant attributes added in a cross-modal retrieval method based on discriminant hidden space learning according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of multi-modal alignment by a cross-modal retrieval method based on discriminant hidden space learning according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a computer-readable storage medium provided according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
The terms first, second, third, fourth and the like in the description and in the claims and in the above drawings are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an embodiment of the present invention, there is provided a cross-modal retrieval method based on discriminant hidden space learning, it being noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different from that herein.
In this embodiment, a cross-modal searching method based on discriminant hidden space learning is provided, which may be used in an electronic device, and fig. 1 is a flowchart of a cross-modal searching method based on discriminant hidden space learning according to an embodiment of the present invention, as shown in fig. 1, and the method includes the following steps:
step S101: and extracting first features of the first modal data and second features of the second modal data, and constructing a first training set and a second training set. Specifically, the first modality data and the second modality data belong to data of different modalities, for example, the first modality data is image data, the second modality data is text data, or the first modality data is text data, the second modality data is image data, and the specific modality of the first modality data and the second modality data is not limited in the embodiment of the present invention.
In an embodiment, extracting a first feature of the first modality data and a second feature of the second modality data, constructing a first training set and a second training set, includes: acquiring first modality data and second modality data, wherein the first modality data is image data, and the second modality data is text data; extracting first features of the first modal data by using a first feature extractor, and constructing a first training set based on the first features; and extracting second features of the second modal data by adopting a second feature extractor, and constructing a second training set based on the second features.
Specifically, when the first modality data is image data and the second modality data is text data, the first features of the first modality data are extracted with a ResNet50 feature extractor, i.e. the first feature extractor, and the second features of the second modality data are extracted with a Bag-of-Words (BOW) feature extractor, i.e. the second feature extractor; that is, feature extraction is performed on the image data and the text data with the ResNet50 feature extractor and the BOW feature extractor respectively. ResNet50 is a residual network and needs to be pre-trained before use; its training process follows the related art and is not repeated here. BOW feature extraction mainly involves the steps of text tokenization, vocabulary construction, word-vector representation and frequency counting.
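As an illustrative sketch only (not part of the claimed method — the pretrained weights, preprocessing steps and vocabulary size are assumptions), the two feature extractors could be set up as follows:

    import torch
    from torchvision import models, transforms
    from sklearn.feature_extraction.text import CountVectorizer

    # Image branch: ResNet50 pretrained on ImageNet (assumed weights), with the
    # final classification layer replaced by an identity so the network outputs
    # a 2048-dimensional feature vector per image.
    resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    resnet.fc = torch.nn.Identity()
    resnet.eval()

    preprocess = transforms.Compose([
        transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    def image_feature(pil_image):
        """Return one first-modality feature vector for a PIL image."""
        with torch.no_grad():
            return resnet(preprocess(pil_image).unsqueeze(0)).squeeze(0).numpy()

    # Text branch: bag-of-words — tokenization, vocabulary construction and
    # term-frequency counting are all handled by CountVectorizer.
    bow = CountVectorizer(max_features=5000)

    def text_features(corpus):
        """Return second-modality features as columns of a matrix (one per document)."""
        return bow.fit_transform(corpus).toarray().T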
The constructed first training set is expressed as $V=[v_1,v_2,\dots,v_n]$, where $v_i$ represents the i-th first feature and n represents the total number of first features in the first training set. The constructed second training set is expressed as $T=[t_1,t_2,\dots,t_n]$, where $t_i$ represents the i-th second feature.
Step S102: training the dual dictionary model with the discriminant attribute added, using the first training set and the second training set, to obtain a hidden space feature model. Specifically, the precondition for realizing cross-modal retrieval is that the features of different modalities are aligned so that they can be associated and matched. Therefore, features of different modalities are mapped into a common hidden space by setting up a dual dictionary model, which gives the approach scalability. If the first modality and the second modality are images and texts respectively, the dual dictionary model comprises an image dictionary and a text dictionary. The goal of dictionary learning is to extract the most essential features of things (analogous to the words or terms in a dictionary). Once such a dictionary is obtained, the most essential characteristics of the data are captured, because the dictionary contains the most intrinsic features. In other words, the dimensionality of the information is reduced, and the interference of insignificant information in characterizing the data is reduced. The sparse model removes a large number of redundant variables and retains only the explanatory variables most relevant to the response variable; this simplifies the model while preserving the most important information in the dataset and effectively alleviates many problems in modeling high-dimensional datasets. Therefore, dictionary learning can also simply be called sparse coding. Viewed from the perspective of matrix decomposition, the dictionary learning process is as follows: given a sample dataset Y in which each column represents a sample, the goal of dictionary learning is to decompose the matrix Y into the product of a dictionary matrix D and a coefficient matrix X. The dual dictionary model adopted in this step decomposes the two sample datasets, i.e. the first training set and the second training set, in this way, as sketched below.
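For intuition only, a minimal sketch of this decomposition using an off-the-shelf sparse coder (note that scikit-learn's convention is transposed relative to the text: samples are rows, and Y ≈ codes · dictionary; the shapes below are assumptions):

    import numpy as np
    from sklearn.decomposition import DictionaryLearning

    rng = np.random.default_rng(0)
    Y = rng.normal(size=(200, 64))        # 200 samples (rows) with 64-d features

    # Learn a 32-atom dictionary and sparse codes such that Y ~ codes @ dictionary.
    dl = DictionaryLearning(n_components=32, alpha=1.0, max_iter=50, random_state=0)
    codes = dl.fit_transform(Y)           # sparse coefficient matrix X (200 x 32)
    dictionary = dl.components_           # dictionary D (32 atoms x 64 dims)

    print("reconstruction error:", np.linalg.norm(Y - codes @ dictionary))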
In order to make the hidden space more discriminative, discriminant attributes are introduced into the hidden space to enhance its discriminability, where a discriminant attribute is an attribute indicating whether the first modal data and the second modal data belong to the same subcategory. By adding discriminant attributes, modal data from the same subcategory are kept as close together as possible, while modal data from different subcategories are pushed as far apart as possible, so that different fine-grained subcategories can be better distinguished in the hidden space; the label encoding this relies on is sketched below.
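For concreteness, a minimal sketch of building the class-label matrix H used by the discriminant term (the one-hot convention follows the definition of H in the embodiment below; the function name is illustrative):

    import numpy as np

    def one_hot_labels(labels, num_classes):
        """Build H (num_classes x n): column i is the one-hot label of sample i;
        its non-zero entry marks the fine-grained subcategory of that sample."""
        H = np.zeros((num_classes, len(labels)))
        H[labels, np.arange(len(labels))] = 1.0
        return H

    # e.g. three samples from subcategories 0, 2 and 1:
    # one_hot_labels(np.array([0, 2, 1]), 3)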
Step S103: based on the hidden space feature model, respectively projecting the features of the modal data to be retrieved and the features of the modal data in the retrieval database into the hidden space, to obtain the hidden space feature representation of the modal data to be retrieved and the hidden space feature representation of the modal data in the retrieval database. During retrieval, the features of the modal data to be retrieved are extracted first. Specifically, if its modality is the same as that of the first modal data, the first feature extractor is used for feature extraction; if its modality is the same as that of the second modal data, the second feature extractor is used. For the retrieval database, feature extraction is performed on the modal data in the database in the same way to obtain the modal data features. The modality of the data to be retrieved differs from that of the modal data in the retrieval database.
Step S104: performing similarity calculation on the hidden space feature representation of the modal data to be retrieved and the hidden space feature representation of the modal data in the retrieval database, to obtain a retrieval result for the modal data to be retrieved. Specifically, the similarity between the hidden space feature representation of the modal data to be retrieved and the hidden space feature representation of each modal data item in the retrieval database is calculated, and the most similar modal data item is selected as the retrieval result.
According to the cross-modal retrieval method based on discriminant hidden space learning provided by this embodiment, the hidden space is constructed through the dual dictionary learning technique, and the multi-modal data are aligned in the hidden space. To address the weak cross-modal semantic association, discriminant attributes are added to the dual dictionary model, so that distances within subclasses become more compact and distances between subclasses become more dispersed, which better fits the fine-grained multi-modal scenarios of the power grid. This solves the technical problems in the prior art of weak intra-modal feature discriminability and weak inter-modal semantic relevance.
In one embodiment, the objective function and constraint of the dual dictionary model with added discriminant attributes are expressed as:

$$\min_{D_1,D_2,Z,U}\;\alpha\lVert V-D_1Z\rVert_F^2+(1-\alpha)\lVert T-D_2Z\rVert_F^2+\beta\lVert H-UZ\rVert_F^2$$
$$\text{s.t.}\quad\lVert d_{1,i}\rVert_2^2\le 1,\;\lVert d_{2,i}\rVert_2^2\le 1,\;\lVert u_i\rVert_2^2\le 1,\;\forall i$$

wherein V represents the first training set, T represents the second training set, $D_1$ and $D_2$ respectively represent the dictionaries in the dual dictionary model, Z represents the hidden space feature representation onto which the first features in the first training set and the second features in the second training set are projected, H represents the class labels of the first features in the first training set and the second features in the second training set, U represents the weights of the classifier in the hidden space, $\alpha$ and $\beta$ are hyperparameters; $d_{1,i}$ represents the i-th column of $D_1$, $d_{2,i}$ represents the i-th column of $D_2$, and $u_i$ represents the i-th column of U. Here $V\in\mathbb{R}^{j\times n}$ and $T\in\mathbb{R}^{p\times n}$, where j and p represent the dimensions of the first feature and the second feature respectively, and n represents the number of feature samples in the first training set or the second training set. $D_1\in\mathbb{R}^{j\times k}$ and $D_2\in\mathbb{R}^{p\times k}$, where k represents the dimension of the hidden space. $Z\in\mathbb{R}^{k\times n}$ represents the hidden space feature representation onto which V and T are projected together; by constraining the hidden space features mapped from V and T to coincide in the representation Z, the two heterogeneous spaces are aligned. $\alpha$ is the weight that adjusts the relative importance of the first modality space and the second modality space. $H\in\mathbb{R}^{c\times n}$, where c represents the number of classes; the i-th column $h_i$ of H is the one-hot label corresponding to the i-th sample, whose non-zero entry indicates the class of the i-th first feature. $U\in\mathbb{R}^{c\times k}$ can be regarded as the weights of a classifier in the hidden space. It should be noted that the third term in the objective function is intended to make the hidden space sufficiently discriminative: specifically, samples of the same subclass are kept as close together as possible and samples of different subclasses as far apart as possible, so as to classify different fine-grained subclasses. Specifically, when the first modality and the second modality are images and texts respectively, a schematic diagram of multi-modal alignment using this objective function is shown in fig. 2.
Specifically, for the objective function, an alternating optimization method is employed to solve for closed-form solutions. All variables are first initialized:

(1) Fix $D_1$, $D_2$ and $U$, and update $Z$ according to the objective function. Setting the derivative of the objective function with respect to Z to 0, the closed-form solution of Z is:

$$Z=\left(\alpha D_1^{\top}D_1+(1-\alpha)D_2^{\top}D_2+\beta U^{\top}U\right)^{-1}\left(\alpha D_1^{\top}V+(1-\alpha)D_2^{\top}T+\beta U^{\top}H\right)$$

(2) Fix $Z$ and update $D_1$. The sub-problem is formalized as:

$$\min_{D_1}\;\lVert V-D_1Z\rVert_F^2\quad\text{s.t.}\;\lVert d_{1,i}\rVert_2^2\le 1,\;\forall i$$

This problem can be optimized according to the Lagrangian dual rule, with the solution:

$$D_1=VZ^{\top}\left(ZZ^{\top}+\Lambda\right)^{-1}$$

where $\Lambda$ is a diagonal matrix made up of all the Lagrangian dual variables.

(3) Fix $Z$ and update $D_2$. The sub-problem is formalized as:

$$\min_{D_2}\;\lVert T-D_2Z\rVert_F^2\quad\text{s.t.}\;\lVert d_{2,i}\rVert_2^2\le 1,\;\forall i$$

This problem can be optimized according to the Lagrangian dual rule, with the solution:

$$D_2=TZ^{\top}\left(ZZ^{\top}+\Lambda\right)^{-1}$$

where $\Lambda$ is a diagonal matrix made up of all the Lagrangian dual variables.

(4) Fix $Z$ and update $U$. The sub-problem is formalized as:

$$\min_{U}\;\lVert H-UZ\rVert_F^2\quad\text{s.t.}\;\lVert u_i\rVert_2^2\le 1,\;\forall i$$

This problem can be optimized according to the Lagrangian dual rule, with the solution:

$$U=HZ^{\top}\left(ZZ^{\top}+\Lambda\right)^{-1}$$
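The following is a minimal NumPy sketch of this alternating scheme under the updates reconstructed above. As a stated simplification, the diagonal Lagrangian dual matrix Λ is replaced by a fixed ridge term εI plus rescaling of over-norm columns, rather than the exact Lagrangian dual; function and variable names are illustrative, not from the patent:

    import numpy as np

    def train_dual_dictionary(V, T, H, k, alpha=0.5, beta=1.0, eps=1e-3, iters=50):
        """Alternately minimize a*||V-D1 Z||^2 + (1-a)*||T-D2 Z||^2 + b*||H-U Z||^2."""
        rng = np.random.default_rng(0)
        D1 = rng.normal(size=(V.shape[0], k))
        D2 = rng.normal(size=(T.shape[0], k))
        U = rng.normal(size=(H.shape[0], k))
        for _ in range(iters):
            # (1) closed-form update of the shared hidden representation Z
            A = alpha * D1.T @ D1 + (1 - alpha) * D2.T @ D2 + beta * U.T @ U
            B = alpha * D1.T @ V + (1 - alpha) * D2.T @ T + beta * U.T @ H
            Z = np.linalg.solve(A + eps * np.eye(k), B)
            # (2)-(4) least-squares updates of D1, D2 and U, with a ridge term
            # standing in for the diagonal Lagrangian dual matrix
            G = Z @ Z.T + eps * np.eye(k)
            D1 = np.linalg.solve(G, (V @ Z.T).T).T
            D2 = np.linalg.solve(G, (T @ Z.T).T).T
            U = np.linalg.solve(G, (H @ Z.T).T).T
            # enforce the unit-norm column constraints by rescaling
            for M in (D1, D2, U):
                M /= np.maximum(np.linalg.norm(M, axis=0, keepdims=True), 1.0)
        return D1, D2, U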
in one embodiment, the implicit spatial signature representation of the modal data to be retrieved is calculated using the following formula:
Figure SMS_61
in the method, in the process of the invention,
Figure SMS_62
representing the characteristics of the data to be retrieved->
Figure SMS_63
Representing the corresponding feature of the projection of the data feature to be retrieved into the hidden space, < >>
Figure SMS_64
Representation->
Figure SMS_65
Is the optimal solution of (a)Gamma represents a hyper-parameter;
the hidden space feature representation of the model data in the retrieval database is calculated by adopting the following formula:
Figure SMS_66
in the method, in the process of the invention,
Figure SMS_67
representing the characteristics of the model data in the search database, +.>
Figure SMS_68
Representing the corresponding feature of the projection of the model data feature in the search database into the hidden space, +.>
Figure SMS_69
Representation->
Figure SMS_70
Is a solution to the optimization of (3).
It should be noted that, in the above formula, agr is an abbreviation of term, and a parameter corresponding to an optimal solution of an optimization problem is taken, where the right side of the formula is an optimization problem, and the left side represents an optimal condition on the right side. The formula for calculating the hidden space feature representation is obtained by solving the objective function by adopting an alternate optimization method.
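A one-function sketch of this projection step, using the closed form stated above (D stands for whichever learned dictionary matches the data's modality; names are illustrative):

    import numpy as np

    def project(D, X, gamma=0.1):
        """Project feature column(s) X into the hidden space:
        z = argmin_z ||x - Dz||^2 + gamma*||z||^2 = (D^T D + gamma*I)^-1 D^T x."""
        k = D.shape[1]
        return np.linalg.solve(D.T @ D + gamma * np.eye(k), D.T @ X)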
In one embodiment, performing similarity calculation on the hidden space feature representation of the to-be-searched modal data and the hidden space feature representation of the modal data in the search database to obtain a search result of the to-be-searched modal data, including: calculating the cosine similarity of the hidden space feature representation of the modal data to be searched and the hidden space feature representation of each modal data in the search database to obtain a plurality of similarity results; and selecting the mode data in the retrieval database corresponding to the maximum value in the similarity results as the retrieval result of the mode data to be retrieved.
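For illustration, the similarity step could be implemented as follows (cosine similarity against every database column, then argmax; names are illustrative):

    import numpy as np

    def retrieve(z_query, Z_db):
        """Return (best index, all similarities): cosine similarity between the
        query's hidden-space feature and each hidden-space feature (column)
        of the retrieval database."""
        Zn = Z_db / np.linalg.norm(Z_db, axis=0, keepdims=True)
        qn = z_query / np.linalg.norm(z_query)
        sims = qn @ Zn                       # one similarity per database item
        return int(np.argmax(sims)), sims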
In an embodiment, taking the image modality and the text modality as examples, the cross-modal retrieval method based on discriminant hidden space learning is implemented by the following procedure (an end-to-end sketch using the illustrative helpers above follows this list):
1. Feature extraction is performed on the images and the texts using a ResNet50 feature extractor and a Bag-of-Words (BOW) feature extractor respectively, to obtain an image feature set and a text feature set.
2. The dual dictionary model with the discriminant attribute added is trained using the image feature set and the text feature set, and the model parameters are determined to obtain the hidden space feature model.
3. The features of the query image are extracted using ResNet50, and the BOW features of all texts in the text retrieval database are extracted.
4. The extracted features of the query image and the features of all texts are respectively projected into the hidden space according to the hidden space feature model, to obtain the corresponding hidden space feature representations.
5. The cosine similarity between the hidden space feature representation of the image and that of each text is calculated, and the text corresponding to the highest similarity is selected as the retrieval result.
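Tying the sketches above together, a hypothetical end-to-end flow for this image-to-text example (random stand-in data; all helper names and shapes are the illustrative ones defined in the earlier sketches, not the patent's reference implementation):

    import numpy as np
    rng = np.random.default_rng(1)
    V = rng.normal(size=(2048, 100))                    # image features (columns are samples)
    T = rng.normal(size=(5000, 100))                    # text BOW features
    H = one_hot_labels(rng.integers(0, 10, 100), 10)    # fine-grained subcategory labels

    D1, D2, U = train_dual_dictionary(V, T, H, k=64)    # step 2: train the model
    z_img = project(D1, rng.normal(size=2048))          # steps 3-4: embed a query image feature
    Z_txt = project(D2, T)                              # step 4: embed all database texts
    best, sims = retrieve(z_img, Z_txt)                 # step 5: cosine ranking
    print("retrieved text index:", best)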
The embodiment of the invention also provides a cross-modal retrieval device based on discriminant hidden space learning, as shown in fig. 3, the device comprises:
the feature extraction module is used for extracting first features of the first modal data and second features of the second modal data and constructing a first training set and a second training set; the specific content refers to the corresponding parts of the above method embodiments, and will not be described herein.
The training module is used for training the double dictionary model with the discriminant attribute added by adopting the first training set and the second training set to obtain a hidden space feature model; the specific content refers to the corresponding parts of the above method embodiments, and will not be described herein.
The projection module is used for respectively projecting the mode data features to be searched and the mode data features in the search database into the hidden space based on the hidden space feature model to obtain hidden space feature representation of the mode data to be searched and hidden space feature representation of the mode data in the search database; the specific content refers to the corresponding parts of the above method embodiments, and will not be described herein.
And the retrieval module is used for carrying out similarity calculation on the hidden space feature representation of the modal data to be retrieved and the hidden space feature representation of the modal data in the retrieval database to obtain a retrieval result of the modal data to be retrieved. The specific content refers to the corresponding parts of the above method embodiments, and will not be described herein.
According to the cross-modal retrieval device based on discriminant hidden space learning, the hidden space is constructed through the dual dictionary learning technique, so that the multi-modal data are aligned in the hidden space. To address the weak cross-modal semantic association, discriminant attributes are added to the dual dictionary model, so that distances within subclasses become more compact and distances between subclasses become more dispersed, which better fits the fine-grained multi-modal scenarios of the power grid. This solves the technical problems in the prior art of weak intra-modal feature discriminability and weak inter-modal semantic relevance.
For a detailed functional description of the cross-modal retrieval device based on discriminant hidden space learning provided by the embodiment of the invention, refer to the description of the cross-modal retrieval method based on discriminant hidden space learning in the above embodiments.
The embodiment of the present invention further provides a storage medium, as shown in fig. 4, on which a computer program 601 is stored, which when executed by a processor, implements the steps of the cross-modal retrieval method based on discriminative hidden space learning in the above embodiment. The storage medium also stores audio and video stream data, characteristic frame data, interactive request signaling, encrypted data, preset data size and the like. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a Flash Memory (Flash Memory), a Hard Disk (HDD), or a Solid State Drive (SSD); the storage medium may also comprise a combination of memories of the kind described above.
It will be appreciated by those skilled in the art that implementing all or part of the above-described embodiment method may be implemented by a computer program to instruct related hardware, where the program may be stored in a computer readable storage medium, and the program may include the above-described embodiment method when executed. Wherein the storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a Flash Memory (Flash Memory), a Hard Disk (HDD), or a Solid State Drive (SSD); the storage medium may also comprise a combination of memories of the kind described above.
The embodiment of the present invention further provides an electronic device, as shown in fig. 5, where the electronic device may include a processor 51 and a memory 52, where the processor 51 and the memory 52 may be connected by a bus or other means, and in fig. 5, the connection is exemplified by a bus.
The processor 51 may be a central processing unit (Central Processing Unit, CPU). The processor 51 may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or combinations thereof.
The memory 52 serves as a non-transitory computer readable storage medium that may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as corresponding program instructions/modules in embodiments of the present invention. The processor 51 executes various functional applications of the processor and data processing by running non-transitory software programs, instructions and modules stored in the memory 52, i.e., implements the cross-modal retrieval method based on discriminant hidden space learning in the above-described method embodiment.
The memory 52 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application program required for at least one function, and the data storage area may store data created by the processor 51, etc. In addition, memory 52 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 52 may optionally include memory located remotely from processor 51, which may be connected to processor 51 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The one or more modules are stored in the memory 52, which when executed by the processor 51, perform a cross-modal retrieval method based on discriminative implicit space learning as in the embodiment shown in fig. 1-2.
The specific details of the electronic device may be understood correspondingly with reference to the corresponding related descriptions and effects in the embodiments shown in fig. 1 to 2, which are not repeated here.
Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.

Claims (10)

1. A cross-modal retrieval method based on discriminant hidden space learning is characterized by comprising the following steps:
extracting first features of the first modal data and second features of the second modal data, and constructing a first training set and a second training set;
training the dual-dictionary model with the discriminant attribute added by adopting the first training set and the second training set to obtain a hidden space feature model;
based on the hidden space feature model, respectively projecting the feature of the mode data to be searched and the feature of the mode data in the search database into a hidden space to obtain a hidden space feature representation of the mode data to be searched and a hidden space feature representation of the mode data in the search database;
and carrying out similarity calculation on the hidden space feature representation of the modal data to be searched and the hidden space feature representation of the modal data in the search database to obtain a search result of the modal data to be searched.
2. The cross-modal retrieval method based on discriminant hidden space learning of claim 1, wherein extracting first features of first modal data and second features of second modal data to construct a first training set and a second training set comprises:
acquiring first modality data and second modality data, wherein the first modality data is image data, and the second modality data is text data;
extracting first features of the first modal data by using a first feature extractor, and constructing a first training set based on the first features;
and extracting second features of the second modal data by adopting a second feature extractor, and constructing a second training set based on the second features.
3. The cross-modal retrieval method based on discriminant hidden space learning of claim 1, wherein said discriminant attributes are attributes representing whether the first modal data and the second modal data belong to the same sub-category.
4. The cross-modal retrieval method based on discriminant hidden space learning of claim 1, wherein the objective function and constraint conditions of the dual dictionary model with added discriminant attributes are expressed as:
$$\min_{D_1,D_2,Z,U}\;\alpha\lVert V-D_1Z\rVert_F^2+(1-\alpha)\lVert T-D_2Z\rVert_F^2+\beta\lVert H-UZ\rVert_F^2$$
$$\text{s.t.}\quad\lVert d_{1,i}\rVert_2^2\le 1,\;\lVert d_{2,i}\rVert_2^2\le 1,\;\lVert u_i\rVert_2^2\le 1,\;\forall i$$

wherein V represents the first training set, T represents the second training set, $D_1$ and $D_2$ respectively represent the dictionaries in the dual dictionary model, Z represents the hidden space feature representation onto which the first features in the first training set and the second features in the second training set are projected, H represents the class labels of the first features in the first training set and the second features in the second training set, U represents the weights of the classifier in the hidden space, $\alpha$ and $\beta$ are hyperparameters; $d_{1,i}$ represents the i-th column of $D_1$, $d_{2,i}$ represents the i-th column of $D_2$, and $u_i$ represents the i-th column of U.
5. The cross-modal retrieval method based on discriminant hidden space learning of claim 1, wherein based on the hidden space feature model, projecting the features of the modal data to be retrieved and the features of the modal data in the retrieval database into the hidden space respectively, before obtaining the hidden space feature representation of the modal data to be retrieved and the hidden space feature representation of the modal data in the retrieval database, further comprises:
and respectively carrying out feature extraction on the mode data to be searched and the mode data in the search database to obtain the mode data features to be searched and the mode data features in the search database, wherein the mode data to be searched and the mode data in the search database are different in mode.
6. The cross-modal retrieval method based on discriminant hidden space learning of claim 4, wherein the hidden space feature representation of the modal data to be retrieved is calculated using the following formula:
$$z_q=\arg\min_{z}\;\lVert x_q-D_qz\rVert_2^2+\gamma\lVert z\rVert_2^2$$

where $x_q$ represents the feature of the data to be retrieved, $z_q$ represents the corresponding feature obtained by projecting the data feature to be retrieved into the hidden space, i.e. the optimal solution of the optimization problem on the right, $D_q$ is the learned dictionary corresponding to the modality of the data to be retrieved, and $\gamma$ represents a hyperparameter;

the hidden space feature representation of the modal data in the retrieval database is calculated using the following formula:

$$z_r=\arg\min_{z}\;\lVert x_r-D_rz\rVert_2^2+\gamma\lVert z\rVert_2^2$$

where $x_r$ represents the feature of the modal data in the retrieval database, $z_r$ represents the corresponding feature obtained by projecting the modal data feature in the retrieval database into the hidden space, i.e. the optimal solution of the optimization problem on the right, and $D_r$ is the learned dictionary corresponding to the modality of the modal data in the retrieval database.
7. The cross-modal retrieval method based on discriminant hidden space learning of claim 1, wherein performing similarity calculation on the hidden space feature representation of the modal data to be retrieved and the hidden space feature representation of the modal data in the retrieval database to obtain a retrieval result of the modal data to be retrieved, comprises:
calculating the cosine similarity of the hidden space feature representation of the modal data to be searched and the hidden space feature representation of each modal data in the search database to obtain a plurality of similarity results;
and selecting the mode data in the retrieval database corresponding to the maximum value in the similarity results as the retrieval result of the mode data to be retrieved.
8. Cross-modal retrieval device based on discriminant hidden space learning is characterized by comprising:
the feature extraction module is used for extracting first features of the first modal data and second features of the second modal data and constructing a first training set and a second training set;
the training module is used for training the double dictionary model with the discriminant attribute added by adopting the first training set and the second training set to obtain a hidden space feature model;
the projection module is used for respectively projecting the mode data features to be searched and the mode data features in the search database into the hidden space based on the hidden space feature model to obtain hidden space feature representation of the mode data to be searched and hidden space feature representation of the mode data in the search database;
and the retrieval module is used for carrying out similarity calculation on the hidden space feature representation of the modal data to be retrieved and the hidden space feature representation of the modal data in the retrieval database to obtain a retrieval result of the modal data to be retrieved.
9. A computer-readable storage medium storing computer instructions for causing the computer to perform the cross-modal retrieval method based on discriminative hidden space learning as claimed in any one of claims 1 to 7.
10. An electronic device, comprising: a memory and a processor, the memory and the processor being communicatively connected to each other, the memory storing computer instructions, the processor executing the computer instructions to perform the cross-modal retrieval method based on discriminative implicit space learning as claimed in any of claims 1 to 7.
CN202310594729.5A 2023-05-25 2023-05-25 Cross-modal retrieval method, device and medium based on discriminant hidden space learning Active CN116304135B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310594729.5A CN116304135B (en) 2023-05-25 2023-05-25 Cross-modal retrieval method, device and medium based on discriminant hidden space learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310594729.5A CN116304135B (en) 2023-05-25 2023-05-25 Cross-modal retrieval method, device and medium based on discriminant hidden space learning

Publications (2)

Publication Number Publication Date
CN116304135A 2023-06-23
CN116304135B (en) 2023-08-08

Family

ID=86832717

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310594729.5A Active CN116304135B (en) 2023-05-25 2023-05-25 Cross-modal retrieval method, device and medium based on discriminant hidden space learning

Country Status (1)

Country Link
CN (1) CN116304135B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491788A (en) * 2017-08-21 2017-12-19 Tianjin University A zero-shot classification method based on dictionary learning
CN109299341A (en) * 2018-10-29 2019-02-01 Shandong Normal University An adversarial cross-modal retrieval method and system based on dictionary learning
CN114691986A (en) * 2022-03-21 2022-07-01 Hefei University of Technology Cross-modal retrieval method based on subspace adaptive spacing and storage medium
US20230154159A1 (en) * 2021-11-08 2023-05-18 Samsung Electronics Co., Ltd. Method and apparatus for real-world cross-modal retrieval problems

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491788A (en) * 2017-08-21 2017-12-19 Tianjin University A zero-shot classification method based on dictionary learning
CN109299341A (en) * 2018-10-29 2019-02-01 Shandong Normal University An adversarial cross-modal retrieval method and system based on dictionary learning
US20230154159A1 (en) * 2021-11-08 2023-05-18 Samsung Electronics Co., Ltd. Method and apparatus for real-world cross-modal retrieval problems
CN114691986A (en) * 2022-03-21 2022-07-01 Hefei University of Technology Cross-modal retrieval method based on subspace adaptive spacing and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BOKUN WANG et al.: "Adversarial Cross-Modal Retrieval", Proceedings of the 25th ACM International Conference on Multimedia, pages 154-162 *

Also Published As

Publication number Publication date
CN116304135B (en) 2023-08-08

Similar Documents

Publication Publication Date Title
US10691899B2 (en) Captioning a region of an image
CN111753060B (en) Information retrieval method, apparatus, device and computer readable storage medium
US11373390B2 (en) Generating scene graphs from digital images using external knowledge and image reconstruction
CN107066464B (en) Semantic natural language vector space
GB2547068B (en) Semantic natural language vector space
KR101754473B1 (en) Method and system for automatically summarizing documents to images and providing the image-based contents
US10713298B2 (en) Video retrieval methods and apparatuses
WO2019052403A1 (en) Training method for image-text matching model, bidirectional search method, and related apparatus
WO2021139191A1 (en) Method for data labeling and apparatus for data labeling
Wu et al. Learning of multimodal representations with random walks on the click graph
US10482146B2 (en) Systems and methods for automatic customization of content filtering
CN112204575A (en) Multi-modal image classifier using text and visual embedding
Sahbi et al. Context-based support vector machines for interconnected image annotation
CN110968725B (en) Image content description information generation method, electronic device and storage medium
WO2023179429A1 (en) Video data processing method and apparatus, electronic device, and storage medium
CN113886571A (en) Entity identification method, entity identification device, electronic equipment and computer readable storage medium
CN111985228A (en) Text keyword extraction method and device, computer equipment and storage medium
Moumtzidou et al. ITI-CERTH participation to TRECVID 2012.
CN116150698B (en) Automatic DRG grouping method and system based on semantic information fusion
CN113297410A (en) Image retrieval method and device, computer equipment and storage medium
US11250299B2 (en) Learning representations of generalized cross-modal entailment tasks
WO2024093578A1 (en) Voice recognition method and apparatus, and electronic device, storage medium and computer program product
CN115204318B (en) Event automatic hierarchical classification method and electronic equipment
CN116304135B (en) Cross-modal retrieval method, device and medium based on discriminant hidden space learning
WO2023173552A1 (en) Establishment method for target detection model, application method for target detection model, and device, apparatus and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231102

Address after: 102209 18 Riverside Avenue, Changping District science and Technology City, Beijing

Patentee after: State Grid Smart Grid Research Institute Co.,Ltd.

Patentee after: STATE GRID CORPORATION OF CHINA

Address before: 102209 18 Riverside Avenue, Changping District science and Technology City, Beijing

Patentee before: State Grid Smart Grid Research Institute Co.,Ltd.