CN112861944B - Image retrieval method and device based on mixed modal input - Google Patents


Info

Publication number
CN112861944B
Authority
CN
China
Prior art keywords
image
mixed
modal
retrieval
retrieved
Prior art date
Legal status
Active
Application number
CN202110118166.3A
Other languages
Chinese (zh)
Other versions
CN112861944A (en)
Inventor
高成英
陈俊霖
Current Assignee
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202110118166.3A priority Critical patent/CN112861944B/en
Publication of CN112861944A publication Critical patent/CN112861944A/en
Application granted granted Critical
Publication of CN112861944B publication Critical patent/CN112861944B/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411: Classification techniques relating to the classification model based on the proximity to a decision surface, e.g. support vector machines
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features

Abstract

The invention discloses an image retrieval method and device based on mixed-modal input. In the method, a mixed-modal image to be retrieved is input into a trained mixed-modal retrieval model. The model first extracts the original feature vectors of the image elements of the various modalities in the mixed-modal image to be retrieved, and then converts the original feature vectors into a mixed-modal relationship graph by means of a modal information attention sub-model; next, the mixed-modal relationship graph is projected into a comparable feature space through a graph convolution network sub-model to obtain comparable feature vectors; the similarity scores between the mixed-modal image to be retrieved and each image to be compared are then calculated from the comparable feature vectors; finally, the corresponding target image is obtained as the retrieval result. By implementing the invention, a user can conveniently construct retrieval conditions, and the accuracy of image retrieval is improved.

Description

Image retrieval method and device based on mixed modal input
Technical Field
The invention relates to the technical field of image recognition, in particular to an image retrieval method and device based on mixed modal input.
Background
For a scene, there are many different ways of describing it, such as natural images (photographs), hand-drawn sketches (the item drawn with black lines on a white background), and text with spatial information (words describing the item placed at the corresponding positions on the canvas). We refer to these different ways of describing a scene as different modalities. With the continuous development of the internet and the continuous growth of data scale, the multi-modal retrieval task of retrieving corresponding natural images from data of different modalities has huge application demand and commercial value.
Adopting data of different modalities as the retrieval condition has different characteristics:
1. With a natural image as the retrieval condition, an accurate retrieval result can be obtained, because the natural image provides rich visual information such as color, pose, texture, and article type.
2. With a hand-drawn sketch as the retrieval condition, the retrieval result is less accurate than with a natural image, because the sketch lacks texture and color information. However, a sketch can be drawn by the user with nothing more than a mobile phone, so it is easier to obtain than a natural image and more convenient to acquire.
3. With spatial text as the retrieval condition, only the type information of the article is available and no visual information, so results with more diverse article poses can be obtained.
However, existing multi-modal retrieval tasks require the input to belong to a single modality: for example, a whole natural image or a whole sketch is input as the retrieval condition, and image retrieval cannot be performed from a mixed-modal scene input (i.e., an image containing multiple modalities as the retrieval condition, for example an image that simultaneously contains hand-drawn, textual, and natural-image elements, as shown in fig. 1). This greatly limits how users can express retrieval conditions: first, natural images of some items may be difficult to obtain, and users may lack the drawing skill to produce a suitable sketch. Second, the user may have specific requirements for certain items in the scene (e.g., wanting a front-facing doll at the bottom left of the scene) while only requiring the remaining items to be of the correct type.
Disclosure of Invention
The embodiment of the invention provides an image retrieval method and device based on mixed modal input, which can be used for performing image retrieval based on mixed modal images, are convenient for a user to construct retrieval conditions and can more accurately capture retrieval requirements of the user.
An embodiment of the present invention provides an image retrieval method based on mixed modality input, including: acquiring a mixed mode image to be retrieved; the mixed modality image to be retrieved comprises image elements of at least two different modalities;
inputting the mixed modal image to be retrieved into a preset mixed modal retrieval model, so that the mixed modal retrieval model selects a target image corresponding to the mixed modal image to be retrieved from preset images to be compared according to the mixed modal image to be retrieved, and a retrieval result is obtained;
selecting a target image corresponding to the mixed modality image to be retrieved from preset images to be compared according to the mixed modality image to be retrieved, wherein the obtaining of the retrieval result specifically comprises:
extracting an original feature vector of each image element in the mixed modal image to be retrieved;
inputting the original feature vector of each image element into a modal information attention sub-model, so that the modal information attention sub-model generates a mixed modal relationship graph according to the original feature vector of each image element;
projecting the mixed mode relation graph to a feature space to generate a comparable feature vector of each image element through a preset graph convolution network sub-model;
and calculating the similarity scores of the mixed modal image to be retrieved and the images to be compared according to the comparable feature vectors of the image elements, and then selecting a target image corresponding to the mixed modal image to be retrieved from the images to be compared according to the similarity scores to obtain a retrieval result.
Further, the mixed modality image to be retrieved includes at least image elements of any two modalities: a natural image modality, a hand-drawn sketch modality, and a descriptive text modality.
Further, the modal information attention sub-model generates a mixed mode relation graph according to the original feature vectors of the image elements, and specifically includes:
calculating a supply vector and a demand vector of each original feature vector according to the supply projection matrix and the demand projection matrix of each image element;
calculating a feature correlation coefficient between every two image elements according to the supply vector and the demand vector of each image element;
and constructing an adjacency matrix according to all the characteristic correlation coefficients, and taking the adjacency matrix as the mixed mode relation graph.
Further, the method for generating the comparable feature vector of each image element includes:
Step 1: projecting the input feature vector of the graph convolution layer of the current level to generate the graph convolution feature vector of the current level for processing by the graph convolution network, wherein the input feature vector of the graph convolution layer of the first level is the original feature vector of the image element;
Step 2: fusing the adjacency matrix and the graph convolution feature vector of the current level to generate the fusion feature of the current level;
Step 3: combining the fusion feature of the current level, the fusion features of all levels before the current level, and the original feature vectors of the image elements to generate the input feature vector of the next level;
Step 4: repeating steps 1 to 3 until the fusion feature of the last level is generated;
Step 5: combining the fusion features of all levels and the original feature vectors of the image elements to generate the comparable feature vector.
Further, when the mixed modal retrieval model is constructed, a mixed modal image and each image to be matched are used as input, a target image corresponding to the mixed modal image is used as output, and the mixed modal retrieval model is built through a neural network; when the mixed modal retrieval model is trained, a loss function is constructed according to the similarity scores of the mixed modal image and each image to be matched, and the network parameters of the mixed modal retrieval model are then updated according to the loss function.
Further, constructing a loss function according to the similarity scores of the mixed modal image and each image to be matched, and then updating the network parameters of the mixed modal retrieval model according to the loss function, specifically includes:
dividing the images to be matched into matched images and unmatched images according to matching results;
calculating a modal alignment loss function from the comparable feature vectors of the mixed modal image and the comparable feature vectors of the matched images;
calculating a triplet loss function according to the similarity scores of the mixed modal image and the matched image and the similarity scores of the mixed modal image and the unmatched image;
calculating a regression loss function according to the comparable feature vectors of the mixed modal image, the comparable feature vectors of the matched images, the original feature vectors of the matched images, the comparable feature vectors of the unmatched images, and the original feature vectors of the unmatched images;
and generating a total loss function according to the modal alignment loss function, the triplet loss function, and the regression loss function, and taking the total loss function as the loss function of the mixed modal retrieval model.
On the basis of the above method item embodiment, the present invention correspondingly provides an apparatus item embodiment:
another embodiment of the present invention provides an image retrieval apparatus based on a hybrid modality input, comprising an image acquisition module and an image retrieval module; the image retrieval module includes: an original characteristic vector extraction sub-module, a mixed modal relationship diagram generation sub-module, a comparable characteristic vector extraction sub-module and a retrieval result selection sub-module;
the image acquisition module is used for acquiring a mixed modal image to be retrieved; the mixed modal image to be retrieved comprises image elements of at least two different modalities;
the image retrieval module is used for inputting the mixed modal image to be retrieved into a preset mixed modal retrieval model so that the mixed modal retrieval model selects a target image corresponding to the mixed modal image to be retrieved from preset images to be compared according to the mixed modal image to be retrieved to obtain a retrieval result;
the original feature vector extraction submodule is used for extracting an original feature vector of each image element in the mixed modal image to be retrieved;
the mixed mode relation graph generating submodule is used for inputting the original characteristic vector of each image element into a modal information attention submodel so that the modal information attention submodel generates a mixed mode relation graph according to the original characteristic vector of each image element;
the comparable characteristic vector extraction submodule is used for projecting the mixed modal relationship diagram to a characteristic space to generate a comparable characteristic vector of each image element through a preset diagram convolution network submodel;
the retrieval result selection submodule is configured to calculate a similarity score between the mixed-mode image to be retrieved and each image to be compared according to the comparable feature vector of each image element, and then select a target image corresponding to the mixed-mode image to be retrieved from each image to be compared according to the similarity score to obtain a retrieval result.
The embodiment of the invention has the following beneficial effects:
the embodiment of the invention provides an image retrieval method and device based on mixed modal input. Specifically, the mixed modal retrieval model firstly extracts original feature vectors of image elements of various modalities in a mixed modal image to be retrieved, and then converts the original feature vectors into a mixed modal relational graph based on a modal information attention sub-model; then, projecting the mixed mode relation graph to a comparable feature space through a graph convolution network submodel to obtain a comparable feature vector; finally, calculating similarity scores of the mixed modal image to be retrieved and each image to be compared according to the comparable feature vectors; and finally, obtaining a corresponding target image and obtaining a retrieval result. By implementing the invention, the image retrieval can be carried out on a mixed modality object with image elements in different modalities, and the matching can ensure that a user can more conveniently construct retrieval conditions when carrying out the image retrieval, more accurately capture the retrieval requirements of the user and improve the accuracy of the image retrieval.
Drawings
FIG. 1 is a schematic illustration of a mixed modality image.
Fig. 2 is a flowchart illustrating an image retrieval method based on mixed modality input according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of a comparable feature vector according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of a retrieval result of an image retrieval method based on mixed mode input according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of an image retrieval apparatus based on mixed modality input according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
First, the image elements of different modalities are explained: in the present invention, the modality refers to an expression for describing scene contents in an image, and as shown in fig. 1, in the image shown in fig. 1, an image element 1 (an image captured by an image capturing device, such as a camera) of a natural image modality, an image element 2 of a hand-drawn sketch modality, and an image element 3 describing a text modality are included. Image elements refer to individual items in an image.
The scheme of the present application is explained below:
as shown in fig. 2, an embodiment of the present invention provides an image retrieval method based on mixed modality input, including:
s101, acquiring a mixed mode image to be retrieved; the mixed modal image to be retrieved comprises image elements of at least two different modalities;
step S102, inputting the mixed modal image to be retrieved into a preset mixed modal retrieval model, so that the mixed modal retrieval model selects a target image corresponding to the mixed modal image to be retrieved from preset images to be compared according to the mixed modal image to be retrieved, and a retrieval result is obtained;
selecting a target image corresponding to the mixed modality image to be retrieved from preset images to be compared according to the mixed modality image to be retrieved, wherein the step of obtaining a retrieval result specifically comprises the following steps: extracting an original feature vector of each image element in the mixed modal image to be retrieved; inputting the original feature vector of each image element into a modal information attention sub-model, so that the modal information attention sub-model generates a mixed modal relationship graph according to the original feature vector of each image element; projecting the mixed mode relation graph to a feature space to generate a comparable feature vector of each image element through a preset graph convolution network sub-model; and calculating the similarity scores of the mixed modal image to be retrieved and the images to be compared according to the comparable feature vectors of the image elements, and then selecting a target image corresponding to the mixed modal image to be retrieved from the images to be compared according to the similarity scores to obtain a retrieval result.
For step S101, a mixed modality image to be retrieved is first obtained, where the mixed modality image includes image elements of at least any two of the following modalities: a natural image modality, a hand-drawn sketch modality, and a descriptive text modality. For example, the picture shown in fig. 1 is selected as the mixed modality image to be retrieved; it includes an image element of the natural image modality (airplane 1 in the figure), an image element of the hand-drawn sketch modality (airplane 2 in the figure), and 5 image elements of the descriptive text modality ("cloud", 3 in the figure).
For step S102, the processing flow of the mixed modality image to be retrieved within the mixed modality retrieval model is described below.
the first step is as follows: extracting an original feature vector: in the invention, the image elements of different modal types are subjected to data preprocessing by adopting the existing neural network model to extract the original characteristic feature vector of each image element; the original feature vector n of each image element can be represented as n ═ v, c, x, y ], where v is a visual information vector of 2208 dimensions, generated using the existing densnet-161 network for image elements of natural image modalities and image elements of hand-drawn sketch modalities, and using the full 0 vector as the visual information vector for image elements describing text modalities. c is a 300-dimensional class vector and is generated by adopting the existing GloVE network. (x, y) are the position coordinates of the image element in the image.
The second step: generating the mixed mode relational graph. In a preferred embodiment, the modal information attention sub-model generates a mixed mode relationship diagram according to the original feature vector of each image element, and specifically includes: calculating a supply vector and a demand vector of each original feature vector according to the supply projection matrix and the demand projection matrix of each image element; calculating a feature correlation coefficient between every two image elements according to the supply vector and the demand vector of each image element; and constructing an adjacency matrix according to all the feature correlation coefficients, and taking the adjacency matrix as the mixed mode relation graph.
This step comprises the following sub-steps:
S21: in the modal information attention sub-model, first, according to the preset supply projection matrix S_m and demand projection matrix D_m, the supply vector s and the demand vector d corresponding to the original feature vector of each image element are calculated. The calculation formulas are:
s = S_m · n;  d = D_m · n
where n is the original feature vector. It should be noted that in the present invention, a corresponding pair of supply projection matrix S_m and demand projection matrix D_m is provided for each modality type of image element, and the different projection matrices used for the feature projections are trained as neural networks. The image shown in fig. 1 contains image elements of three different modality types, so in this case there are three pairs of supply and demand projection matrices trained by different neural networks.
S22: according to the demand vector d_i of image element i and the supply vector s_j of image element j, the feature correlation coefficient e_ij is calculated; the adjacency matrix A is then constructed from the feature correlation coefficients and used as the mixed mode relation graph. The calculation formulas are:
e_ij = (d_i · s_j) / √K
A_ij = exp(e_ij) / Σ_{k=1..|N|} exp(e_ik)
where |N| is the number of image elements in the input feature set N, exp() is the exponential function with base e, and K = 768 is the dimension of the vectors d_i and s_j.
The third step: generating comparable feature vectors; in a preferred embodiment, the method for generating comparable feature vectors of image elements comprises:
Step 1: projecting the input feature vector of the graph convolution layer of the current level to generate the graph convolution feature vector of the current level for processing by the graph convolution network, wherein the input feature vector of the graph convolution layer of the first level is the original feature vector of the image element; Step 2: fusing the adjacency matrix and the graph convolution feature vector of the current level to generate the fusion feature of the current level; Step 3: combining the fusion feature of the current level, the fusion features of all levels before the current level, and the original feature vectors of the image elements to generate the input feature vector of the next level; Step 4: repeating steps 1 to 3 until the fusion feature of the last level is generated; Step 5: combining the fusion features of all levels and the original feature vectors of the image elements to generate the comparable feature vector.
The graph convolution network sub-model is built from the densely connected graph convolution network shown in fig. 3, and projects the mixed mode relation graph into a comparable feature space to obtain the comparable feature vectors. Specifically, this step includes the following sub-steps:
s31 inputting feature vectors x of different modes l Using projection layers Proj l Projecting to obtain the characteristic p for graph convolution network processing l Wherein l is the hierarchical index of the convolutional layer; specifically, the input feature vector of the first layer is the original feature vector n of each image element, and the specific configuration of the input feature vector after the input of the rest of the layers is shown in step 3-3. The formula is expressed as follows:
p l =Proj l (x l-1 )
x 1 =n;
s32 wrapping the layer Graph with the Graph l For the adjacent matrix A and the projected eigenvector matrix
Figure BDA0002921082050000101
Carrying out feature fusion to obtain a feature matrix after feature fusion
Figure BDA0002921082050000102
Wherein
Figure BDA0002921082050000103
Is the fusion feature of the ith image element. The formula is expressed as follows:
G l =Graph l (AP l );
s33, combining the original feature vector n and the fusion feature of the graph convolution layer from the 1 st layer to the l layer as the input x of the l +1 st layer l+1 The formula is as follows:
x l+1 =[n,g 1 ,g 2 ,...,g l ];
s34, repeating the steps S31 to S33 until the output g of the graph winding layer of the last layer l The calculation is completed. Combining the original features n with the fusion features of all graph convolution layers to obtain a comparable feature vector, wherein the formula is as follows:
h=Linear([n,g 1 ,g 2 ,...,g L ]);
where Linear is a Linear layer in a deep neural network.
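A minimal sketch of the densely connected graph convolution of S31 to S34; the hidden width, the number of levels, and the use of plain linear layers for Proj_l and Graph_l are assumptions:

```python
import torch

class DenseGCN(torch.nn.Module):
    def __init__(self, feat_dim, hidden=512, levels=3, out_dim=512):
        super().__init__()
        self.proj, self.graph = torch.nn.ModuleList(), torch.nn.ModuleList()
        in_dim = feat_dim
        for _ in range(levels):
            self.proj.append(torch.nn.Linear(in_dim, hidden))   # Proj_l
            self.graph.append(torch.nn.Linear(hidden, hidden))  # Graph_l
            in_dim += hidden  # the next level also sees this level's g_l
        self.linear = torch.nn.Linear(in_dim, out_dim)  # final Linear(...)

    def forward(self, n, A):
        # n: |N| x feat_dim original features; A: |N| x |N| adjacency matrix
        x, parts = n, [n]
        for proj, graph in zip(self.proj, self.graph):
            p = proj(x)                  # p_l = Proj_l(x_l)
            g = graph(A @ p)             # G_l = Graph_l(A P_l)
            parts.append(g)
            x = torch.cat(parts, dim=1)  # x_{l+1} = [n, g_1, ..., g_l]
        return self.linear(x)            # h = Linear([n, g_1, ..., g_L])
```

Each row of the output is the comparable feature vector h of one image element.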
The fourth step: calculating the similarity. Specifically, this can be divided into the following sub-steps:
s41 comparable feature vectors { h } for all background image elements of the same class i Integrating to obtain integrated background feature vectors
Figure BDA0002921082050000104
The formula is defined as:
Figure BDA0002921082050000105
c i =c;
where c is the article class, k is the number of image elements belonging to class c, and c_i is the class of the article corresponding to the image element with comparable feature vector h_i. In the present invention, all image elements are divided in advance into foreground image elements and background image elements; for example, the airplane in the hand-drawn sketch modality and the airplane in the natural image modality in fig. 1 are both foreground elements, while the "cloud" elements of the descriptive text modality serve as background elements. The foreground and background of each image element in the mixed modality image to be retrieved are identified, and then category recognition is performed to determine the article category corresponding to each image element. This step mainly serves to fuse background elements of the same kind: for example, the 5 "cloud" elements shown in fig. 1 all belong to the class cloud and are of the same kind. It should be noted that "the same kind" here means that the articles corresponding to the image elements belong to the same category, not that the modalities are the same; for example, the sketched airplane and the natural-image airplane of fig. 1 may be considered the same kind. In the present invention, existing background segmentation methods and article identification methods are used to identify the article class of each image element in the mixed modality image to be retrieved and to determine whether each image element is a background element. The background feature vectors are integrated so that the number of image elements of the mixed-mode image W used in training stays consistent with the number of image elements of the matched image I+, which improves the training effect. When constructing a mixed-modal input and drawing the background of a sketch, users usually like to represent the background with several articles, for example drawing several clouds to represent the sky or several patches of grass to represent a lawn.
S42: for the mixed modal input M to be retrieved and a preset natural image I to be compared (in the invention, each image to be compared is a natural image), the comparable feature vector sets {h_i^M} and {h_j^I} are computed respectively. The features of the two modalities are then compared pairwise using cosine similarity Cos(h_i^M, h_j^I), and the similarity score S_{M,I} is computed as a normalized weighted sum of these pairwise similarities, where N_M and N_I denote the numbers of image elements in the mixed modality image to be retrieved and in the image to be compared, Cos() is the cosine similarity function, h_i^M is the comparable feature vector of image element i, F is the foreground set, and B is the background set. When the similarity of the graph is calculated, different weights are applied to foreground and background image elements; this calculation yields a better retrieval effect.
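A minimal sketch of S41 and S42. The mean over same-class background vectors and the per-element foreground/background weighting with a best-match cosine score are assumptions; the patent states only that same-class background elements are integrated and that foreground and background elements receive different weights:

```python
import torch
import torch.nn.functional as F

def integrate_background(h, classes, is_background):
    """Average the comparable vectors of background elements sharing a class."""
    out, is_fg, seen = [], [], set()
    for i, (c, bg) in enumerate(zip(classes, is_background)):
        if not bg:
            out.append(h[i]); is_fg.append(True)
        elif c not in seen:
            idx = [j for j, (cj, bj) in enumerate(zip(classes, is_background))
                   if bj and cj == c]
            out.append(h[idx].mean(dim=0)); is_fg.append(False); seen.add(c)
    return torch.stack(out), is_fg

def similarity(h_m, fg_m, h_i, w_fg=1.0, w_bg=0.5):
    """Weighted similarity score between query elements h_m and image elements h_i."""
    sim = F.cosine_similarity(h_m.unsqueeze(1), h_i.unsqueeze(0), dim=2)
    best = sim.max(dim=1).values  # best-matching element of I for each element of M
    w = torch.tensor([w_fg if f else w_bg for f in fg_m])
    return (w * best).sum() / w.sum()
```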
The similarity scores between the mixed modality image to be retrieved and each image to be compared are calculated as above, and the image with the highest similarity is finally output as the target image to obtain the retrieval result; the target image obtained for fig. 1 is, for example, the image shown in fig. 4.
In a preferred embodiment, when the mixed modal retrieval model is constructed, a mixed modal image and each image to be matched are used as input, a target image corresponding to the mixed modal image is used as output, and the mixed modal retrieval model is built through a neural network; when the mixed modal retrieval model is trained, a loss function is constructed according to the similarity scores of the mixed modal image and each image to be matched, and the network parameters of the mixed modal retrieval model are then updated according to the loss function.
Constructing a loss function according to the similarity scores of the mixed modal image and the images to be matched, and then updating the network parameters of the mixed modal retrieval model according to the loss function, specifically includes:
dividing the images to be matched into matched images and unmatched images according to matching results;
calculating a modal alignment loss function from the comparable feature vectors of the mixed modal image and the comparable feature vectors of the matched images;
calculating a triplet loss function according to the similarity scores of the mixed modal image and the matched image and the similarity scores of the mixed modal image and the unmatched image;
calculating a regression loss function according to the comparable feature vectors of the mixed modal image, the comparable feature vectors of the matched image, the original feature vectors of the matched image, the comparable feature vectors of the unmatched image, and the original feature vectors of the unmatched image;
and generating a total loss function according to the modal alignment loss function, the triplet loss function, and the regression loss function, and taking the total loss function as the loss function of the mixed modal retrieval model.
The following mainly explains the loss function and the updating of the network parameters of the model; the specific data processing principle during model training is consistent with that during model application and is not repeated here.
Updating the network parameters according to the similarity-score-based loss function mainly comprises the following sub-steps:
s51 comparable feature vector set from mixed modality image W
Figure BDA0002921082050000131
Matched image I + Of comparable feature vector sets
Figure BDA0002921082050000132
Computing a modal alignment loss function L ca :
Figure BDA0002921082050000133
Wherein E P ,E q Respectively sets of characteristic vectors
Figure BDA0002921082050000134
And
Figure BDA0002921082050000135
y () is a feature projection function, H k To regenerate hilbert space, k is the gaussian kernel function.
S52: according to the mixed modality image W, the matched image I+, and the unmatched image I−, the triplet loss L_tri is calculated:
L_tri = max(0, m + S_{W,I−} − S_{W,I+})
where S_{W,I−} is the similarity score of the mixed modality image W and the unmatched image I−, S_{W,I+} is the similarity score of W and the matched image I+, and m is the margin.
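A sketch of the triplet loss over similarity scores, assuming the standard margin-based hinge form; the margin value is an assumption:

```python
import torch

def triplet_loss(score_pos, score_neg, margin=0.3):
    """Push S(W, I+) above S(W, I-) by at least `margin`.

    score_pos, score_neg: 0-dim tensors holding the two similarity scores.
    """
    return torch.clamp(margin + score_neg - score_pos, min=0.0)
```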
S53: from the comparable feature vector set {h_i^W} of the mixed modality image W, the comparable feature vector set {h_j^{I+}} and the original feature vector set of the matched image I+, and the comparable feature vector set {h_j^{I−}} and the original feature vector set of the unmatched image I−, the regression loss function L_cyc is calculated. Each comparable feature vector h is passed through a regressor,
r = Regressor(h),
and L_cyc penalizes, in expectation, the distance between the regressed vector r and the corresponding original feature vector: the comparable features of W are regressed to the original features of I+, and the comparable features of I+ and I− are regressed to their own original features. Here, Regressor() is a regression matrix and E[] is the expectation. The reason the comparable feature vector set of the mixed modality image W is regressed to the original feature vector set of the matched image I+ is that the invention aims to correctly retrieve the matched natural image I+ from the mixed modality image W, so it should be ensured that the comparable feature vectors contain information that can be regressed onto the original feature vectors of the matched image.
S54: the modal alignment loss function L_ca, the triplet loss L_tri, and the regression loss function L_cyc are added to obtain the loss function of the mixed mode retrieval model, and the network parameters are updated by minimizing this loss function. The loss function of the mixed mode retrieval model is L = L_tri + 0.1·L_ca + 0.05·L_cyc.
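A sketch of S53 and S54 under the reading above: each set of comparable vectors is regressed back to the original feature vectors of the matched or unmatched image, and the three losses are combined with the stated weights. The linear regressor and the exact pairing of element counts (kept consistent by the background integration of S41) are assumptions:

```python
import torch

def regression_loss(regressor, h_w, h_pos, n_pos, h_neg, n_neg):
    """L_cyc: r = Regressor(h) should recover the original feature vectors.

    regressor: e.g. torch.nn.Linear(comparable_dim, original_dim)
    """
    return (torch.norm(regressor(h_w) - n_pos, dim=1).mean()      # W  -> I+ originals
            + torch.norm(regressor(h_pos) - n_pos, dim=1).mean()  # I+ -> I+ originals
            + torch.norm(regressor(h_neg) - n_neg, dim=1).mean()) # I- -> I- originals

def total_loss(l_tri, l_ca, l_cyc):
    return l_tri + 0.1 * l_ca + 0.05 * l_cyc  # L = L_tri + 0.1 L_ca + 0.05 L_cyc
```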
On the basis of the above method item embodiments, the present invention correspondingly provides apparatus item embodiments.
As shown in fig. 5, another embodiment of the present invention provides an image retrieval apparatus based on mixed modality input, including an image acquisition module and an image retrieval module; the image retrieval module includes: an original characteristic vector extraction sub-module, a mixed modal relationship diagram generation sub-module, a comparable characteristic vector extraction sub-module and a retrieval result selection sub-module;
the image acquisition module is used for acquiring a mixed modal image to be retrieved; the mixed modal image to be retrieved comprises image elements of at least two different modalities;
the image retrieval module is used for inputting the mixed modal image to be retrieved into a preset mixed modal retrieval model so that the mixed modal retrieval model selects a target image corresponding to the mixed modal image to be retrieved from preset images to be compared according to the mixed modal image to be retrieved to obtain a retrieval result;
the original feature vector extraction submodule is used for extracting an original feature vector of each image element in the mixed modal image to be retrieved;
the mixed mode relation graph generating submodule is used for inputting the original characteristic vector of each image element into a modal information attention submodel so that the modal information attention submodel generates a mixed mode relation graph according to the original characteristic vector of each image element;
the comparable characteristic vector extraction submodule is used for projecting the mixed modal relationship diagram to a characteristic space to generate a comparable characteristic vector of each image element through a preset diagram convolution network submodel;
the retrieval result selection submodule is configured to calculate a similarity score between the mixed-mode image to be retrieved and each image to be compared according to the comparable feature vector of each image element, and then select a target image corresponding to the mixed-mode image to be retrieved from each image to be compared according to the similarity score to obtain a retrieval result.
In a preferred embodiment, the system further includes a model construction module, where the model construction module is configured to construct the mixed modality retrieval model through a neural network by taking the mixed modality image and each image to be matched as input and taking a target image corresponding to the mixed modality image as output, and when the mixed modality retrieval model is trained, construct a loss function according to a similarity score between the mixed modality image and each image to be matched, and then update a network parameter of the mixed modality retrieval model according to the loss function.
It should be noted that the above-described device embodiments are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiment of the apparatus provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement it without inventive effort.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (6)

1. An image retrieval method based on mixed modality input, comprising:
acquiring a mixed mode image to be retrieved; the mixed modal image to be retrieved comprises image elements of at least two different modalities;
inputting the mixed modal image to be retrieved into a preset mixed modal retrieval model, so that the mixed modal retrieval model selects a target image corresponding to the mixed modal image to be retrieved from preset images to be compared according to the mixed modal image to be retrieved, and a retrieval result is obtained;
selecting a target image corresponding to the mixed modality image to be retrieved from preset images to be compared according to the mixed modality image to be retrieved, wherein the step of obtaining a retrieval result specifically comprises the following steps:
extracting an original feature vector of each image element in the mixed modal image to be retrieved;
inputting the original feature vector of each image element into a modal information attention sub-model, so that the modal information attention sub-model generates a mixed modal relationship graph according to the original feature vector of each image element; the modality information attention sub-model generates a mixed modality relationship diagram according to the original feature vectors of the image elements, and specifically includes: calculating a supply vector and a demand vector of each original feature vector according to the supply projection matrix and the demand projection matrix of each image element; calculating a feature correlation coefficient between every two image elements according to the supply vector and the demand vector of each image element; constructing an adjacency matrix according to all the characteristic correlation coefficients, and taking the adjacency matrix as the mixed mode relation graph;
projecting the mixed mode relation graph to a feature space to generate a comparable feature vector of each image element through a preset graph convolution network sub-model;
and calculating the similarity scores of the mixed modal image to be retrieved and the images to be compared according to the comparable feature vectors of the image elements, and then selecting a target image corresponding to the mixed modal image to be retrieved from the images to be compared according to the similarity scores to obtain a retrieval result.
2. The method for image retrieval based on mixed modality input according to claim 1, wherein the mixed modality image to be retrieved includes at least image elements of any two of the following modalities: a natural image modality, a hand-drawn sketch modality, and a descriptive text modality.
3. The method for image retrieval based on mixed modality input according to claim 1, wherein the method for generating comparable feature vectors for image elements comprises:
step 1: projecting the input feature vector of the graph convolution layer of the current level to generate the graph convolution feature vector of the current level for processing by the graph convolution network, wherein the input feature vector of the graph convolution layer of the first level is the original feature vector of the image element;
step 2: fusing the adjacency matrix and the graph convolution feature vector of the current level to generate the fusion feature of the current level;
step 3: combining the fusion feature of the current level, the fusion features of all levels before the current level, and the original feature vectors of the image elements to generate the input feature vector of the next level;
step 4: repeating steps 1 to 3 until the fusion feature of the last level is generated;
step 5: combining the fusion features of all levels and the original feature vectors of the image elements to generate the comparable feature vector.
4. The image retrieval method based on mixed modality input according to claim 3, wherein when constructing the mixed modality retrieval model, the mixed modality retrieval model is constructed through a neural network by taking a mixed modality image and each image to be matched as input and a target image corresponding to the mixed modality image as output, and when training the mixed modality retrieval model, a loss function is constructed according to the similarity score between the mixed modality image and each image to be matched, and then network parameters of the mixed modality retrieval model are updated according to the loss function.
5. The image retrieval method based on mixed modality input according to claim 4, wherein the constructing a loss function according to the similarity scores of the mixed modality image and each image to be matched, and then updating the network parameters of the mixed modality retrieval model according to the loss function specifically comprises:
dividing the images to be matched into matched images and unmatched images according to matching results;
calculating a modal alignment loss function from the comparable feature vectors of the mixed modal image and the comparable feature vectors of the matched images;
calculating a triplet loss function according to the similarity scores of the mixed modal image and the matched image and the similarity scores of the mixed modal image and the unmatched image;
calculating a regression loss function according to the comparable feature vectors of the mixed modal images, the comparable feature vectors of the matched images, the original feature vectors of the matched images, the comparable feature vectors of the unmatched images, and the original feature vectors of the unmatched images;
and generating a total loss function according to the modal alignment loss function, the triplet loss function, and the regression loss function, and taking the total loss function as a loss function of the mixed modal retrieval model.
6. An image retrieval device based on mixed modal input is characterized by comprising an image acquisition module and an image retrieval module; the image retrieval module includes: an original characteristic vector extraction sub-module, a mixed modal relationship diagram generation sub-module, a comparable characteristic vector extraction sub-module and a retrieval result selection sub-module;
the image acquisition module is used for acquiring a mixed modal image to be retrieved; the mixed modal image to be retrieved comprises image elements of at least two different modalities;
the image retrieval module is used for inputting the mixed modal image to be retrieved into a preset mixed modal retrieval model so that the mixed modal retrieval model selects a target image corresponding to the mixed modal image to be retrieved from preset images to be compared according to the mixed modal image to be retrieved to obtain a retrieval result;
the original feature vector extraction submodule is used for extracting an original feature vector of each image element in the mixed modal image to be retrieved;
the mixed mode relation graph generating submodule is used for inputting the original characteristic vector of each image element into a modal information attention sub-model so that the modal information attention sub-model generates a mixed mode relation graph according to the original characteristic vector of each image element; the modal information attention sub-model generates a mixed modality relationship diagram according to the original feature vectors of the image elements, and specifically includes: calculating a supply vector and a demand vector of each original feature vector according to the supply projection matrix and the demand projection matrix of each image element; calculating a feature correlation coefficient between every two image elements according to the supply vector and the demand vector of each image element; constructing an adjacency matrix according to all the feature correlation coefficients, and taking the adjacency matrix as the mixed mode relation graph;
the comparable characteristic vector extraction submodule is used for projecting the mixed modal relationship diagram to a characteristic space to generate a comparable characteristic vector of each image element through a preset diagram convolution network submodel;
the retrieval result selection submodule is configured to calculate a similarity score between the mixed-mode image to be retrieved and each image to be compared according to the comparable feature vector of each image element, and then select a target image corresponding to the mixed-mode image to be retrieved from each image to be compared according to the similarity score to obtain a retrieval result.
CN202110118166.3A 2021-01-28 2021-01-28 Image retrieval method and device based on mixed modal input Active CN112861944B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110118166.3A CN112861944B (en) 2021-01-28 2021-01-28 Image retrieval method and device based on mixed modal input

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110118166.3A CN112861944B (en) 2021-01-28 2021-01-28 Image retrieval method and device based on mixed modal input

Publications (2)

Publication Number Publication Date
CN112861944A CN112861944A (en) 2021-05-28
CN112861944B (en) 2022-09-23

Family

ID=75987649

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110118166.3A Active CN112861944B (en) 2021-01-28 2021-01-28 Image retrieval method and device based on mixed modal input

Country Status (1)

Country Link
CN (1) CN112861944B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6728752B1 (en) * 1999-01-26 2004-04-27 Xerox Corporation System and method for information browsing using multi-modal features
CN102262670A (en) * 2011-07-29 2011-11-30 中山大学 Cross-media information retrieval system and method based on mobile visual equipment
CN102521368A (en) * 2011-12-16 2012-06-27 武汉科技大学 Similarity matrix iteration based cross-media semantic digesting and optimizing method
CN103473327A (en) * 2013-09-13 2013-12-25 广东图图搜网络科技有限公司 Image retrieval method and image retrieval system
CN105701173A (en) * 2016-01-05 2016-06-22 中国电影科学技术研究所 Multi-mode image retrieving method based appearance design patent
CN105930440A (en) * 2016-04-19 2016-09-07 中山大学 Large-scale quick retrieval method of pedestrian image on the basis of cross-horizon information and quantization error encoding
CN106095893A (en) * 2016-06-06 2016-11-09 北京大学深圳研究生院 A kind of cross-media retrieval method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11244205B2 (en) * 2019-03-29 2022-02-08 Microsoft Technology Licensing, Llc Generating multi modal image representation for an image


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Speech-Image Cross-Modal Retrieval Based on Deep Neural Networks; Guo Mao; China Master's Theses Full-text Database; 2020-05-15 (No. 5); I136-12 *

Also Published As

Publication number Publication date
CN112861944A (en) 2021-05-28

Similar Documents

Publication Publication Date Title
CN111354079B (en) Three-dimensional face reconstruction network training and virtual face image generation method and device
CN110717977B (en) Method, device, computer equipment and storage medium for processing game character face
CN108230240B (en) Method for obtaining position and posture in image city range based on deep learning
CN106096542B (en) Image video scene recognition method based on distance prediction information
WO2021143264A1 (en) Image processing method and apparatus, server and storage medium
EP3876110A1 (en) Method, device and apparatus for recognizing, categorizing and searching for garment, and storage medium
CN105117399B (en) Image searching method and device
CN104616247B (en) A kind of method for map splicing of being taken photo by plane based on super-pixel SIFT
CN112085835B (en) Three-dimensional cartoon face generation method and device, electronic equipment and storage medium
CN108537887A (en) Sketch based on 3D printing and model library 3-D view matching process
CN111127309A (en) Portrait style transfer model training method, portrait style transfer method and device
US11308313B2 (en) Hybrid deep learning method for recognizing facial expressions
CN111108508A (en) Facial emotion recognition method, intelligent device and computer-readable storage medium
CN110874575A (en) Face image processing method and related equipment
CN113361387A (en) Face image fusion method and device, storage medium and electronic equipment
CN112200844A (en) Method, device, electronic equipment and medium for generating image
CN109857895B (en) Stereo vision retrieval method and system based on multi-loop view convolutional neural network
CN114743162A (en) Cross-modal pedestrian re-identification method based on generation of countermeasure network
CN113408590B (en) Scene recognition method, training method, device, electronic equipment and program product
CN111754622B (en) Face three-dimensional image generation method and related equipment
CN117115917A (en) Teacher behavior recognition method, device and medium based on multi-modal feature fusion
CN112861944B (en) Image retrieval method and device based on mixed modal input
CN115239857B (en) Image generation method and electronic device
CN113362455B (en) Video conference background virtualization processing method and device
CN114707055A (en) Photographing posture recommendation method integrating image content and feature extraction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant