CN112861944B

CN112861944B - Image retrieval method and device based on mixed modal input

Info

Publication number: CN112861944B
Application number: CN202110118166.3A
Authority: CN
Inventors: 高成英; 陈俊霖
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2021-01-28
Filing date: 2021-01-28
Publication date: 2022-09-23
Anticipated expiration: 2041-01-28
Also published as: CN112861944A

Abstract

The invention discloses an image retrieval method and device based on mixed modal input, wherein the method comprises the steps of firstly inputting a mixed modal image to be retrieved into a trained mixed modal retrieval model, firstly extracting original feature vectors of image elements of various modes in the mixed modal image to be retrieved by the mixed modal retrieval model, and then converting the original feature vectors into a mixed modal relational graph based on a modal information attention sub-model; then, projecting the mixed mode relation graph to a comparable feature space through a graph convolution network submodel to obtain a comparable feature vector; finally, calculating similarity scores of the mixed modal image to be retrieved and each image to be compared according to the comparable feature vectors; and finally, obtaining a corresponding target image and obtaining a retrieval result. By implementing the invention, a user can conveniently construct the retrieval condition, and the accuracy of image retrieval is improved.

Description

Image retrieval method and device based on mixed modal input

Technical Field

The invention relates to the technical field of image recognition, in particular to an image retrieval method and device based on mixed modal input.

Background

For a scene, there are many different ways of describing: such as natural images (photographs), hand-drawn sketches (drawing out the item with black lines on a white background), and text with spatial information (putting the words describing the item on the canvas in the corresponding positions), etc. We refer to the different ways of describing the scene as different modalities. With the continuous development of the internet and the continuous increase of the data scale, the multi-modal retrieval task of retrieving the corresponding natural images according to the data of different modalities has huge application requirements and commercial values.

The data of different modes are adopted as retrieval conditions, and the method has different characteristics:

1. the natural image is used as a retrieval condition, and because the natural image provides rich visual information such as color, posture, texture, article type and the like, an accurate retrieval result can be obtained.

2. The hand-drawn sketch is used as a retrieval condition, and the retrieval result is not accurate enough to adopt a natural image due to the fact that the sketch lacks texture and color information. However, the sketch can be drawn only by one mobile phone of the user, and is easier to obtain than a natural image, and the convenience of obtaining is better.

3. The spatial text is used as the search condition, and only the type information of the article is contained, and the visual information of the article is not contained, so that the result that the postures of the article are more diversified can be obtained.

However, the existing multi-modal search task requires that the input belongs to the same modality: for example, a whole natural image and a whole sketch are input as search conditions, and image search cannot be realized according to mixed-modality scene input (i.e., an image with multiple modalities is input as a search condition, for example, three image elements of a hand-drawing, a text and a natural image are simultaneously provided in one image, as shown in fig. 1). This greatly limits the expression of the search condition by the user: first, natural images of parts of the item may be difficult to obtain, and users may be limited by their cultural level to draw a suitable sketch. Second, the user may have specific requirements for certain items in the scene (e.g., desire a front-facing doll at the bottom left of the scene) and only require the correct type of items in the scene.

Disclosure of Invention

The embodiment of the invention provides an image retrieval method and device based on mixed modal input, which can be used for performing image retrieval based on mixed modal images, are convenient for a user to construct retrieval conditions and can more accurately capture retrieval requirements of the user.

An embodiment of the present invention provides an image retrieval method based on mixed modality input, including: acquiring a mixed mode image to be retrieved; the mixed modality image to be retrieved comprises image elements of at least two different modalities;

inputting the mixed modal image to be retrieved into a preset mixed modal retrieval model, so that the mixed modal retrieval model selects a target image corresponding to the mixed modal image to be retrieved from preset images to be compared according to the mixed modal image to be retrieved, and a retrieval result is obtained;

selecting a target image corresponding to the mixed modality image to be retrieved from preset images to be compared according to the mixed modality image to be retrieved, wherein the obtaining of the retrieval result specifically comprises:

extracting an original feature vector of each image element in the mixed modal image to be retrieved;

inputting the original feature vector of each image element into a modal information attention sub-model, so that the model information attention sub-model generates a mixed modal relationship graph according to the original feature vector of each image element;

projecting the mixed mode relation graph to a feature space to generate a comparable feature vector of each image element through a preset graph convolution network sub-model;

and calculating the similarity scores of the mixed modal image to be retrieved and the images to be compared according to the comparable feature vectors of the image elements, and then selecting a target image corresponding to the mixed modal image to be retrieved from the images to be compared according to the similarity scores to obtain a retrieval result.

Further, the mixed modality image to be retrieved includes at least image elements of any two modalities: a natural image modality, a hand-drawn sketch modality, and a descriptive text modality.

Further, the model information attention sub-model generates a mixed mode relation graph according to the original feature vectors of the image elements, and specifically includes:

calculating a supply vector and a demand vector of each original feature vector according to the supply projection matrix and the demand projection matrix of each image element;

calculating a feature correlation coefficient between every two image elements according to the supply vector and the demand vector of each image element;

and constructing an adjacency matrix according to all the characteristic correlation coefficients, and taking the adjacency matrix as the mixed mode relation graph.

Further, the method for generating comparable feature vectors of image elements includes:

step 1: projecting the input characteristic vector of the graph convolution layer of the current level to generate a graph convolution characteristic vector of the current level for processing by a graph convolution network; wherein, the input feature vector of the graph volume layer of the first level is the original feature vector of the image element;

step 2: fusing the adjacency matrix and the graph convolution feature vector of the current level to generate a fusion feature of the current level;

step 3, combining the fusion features of the current level, the fusion features of all levels before the current level and the original feature vectors of the image elements to generate an input feature vector of a first next level;

step 4, repeating the step 1 to the step 3 until the fusion feature of the last level is generated;

and 5: and combining the fusion features of all levels and the feature vectors of the image elements to generate the comparable feature vector.

Further, when the mixed modal retrieval model is constructed, a mixed modal image and each image to be matched are used as input, a target image corresponding to the mixed modal image is used as output, the mixed modal retrieval model is constructed through a neural network, when the mixed modal retrieval model is trained, a loss function is constructed according to the similarity scores of the mixed modal image and each image to be matched, and then network parameters of the mixed modal retrieval model are updated according to the loss function.

Further, the constructing a loss function according to the similarity scores of the mixed modality image and each image to be matched, and then updating the network parameters of the mixed modality retrieval model according to the loss function specifically includes:

dividing the images to be matched into matched images and unmatched images according to matching results;

calculating a modal alignment loss function from the comparable feature vectors of the mixed modal image and the feature vectors of the matched images;

calculating a triple connection loss function according to the similarity scores of the mixed modal image and the matched image and the similarity scores of the mixed modal image and the unmatched image;

calculating a regression loss function according to comparable feature vectors of the mixed mode images, feature vectors of the matched images, original feature vectors of the matched images, feature vectors of the unmatched images and original feature vectors of the unmatched images;

and generating a total loss function according to the modal alignment loss function, the triple-link loss function and the regression loss function, and taking the total loss function as a loss function of the mixed modal retrieval model.

On the basis of the above method item embodiment, the present invention correspondingly provides an apparatus item embodiment:

another embodiment of the present invention provides an image retrieval apparatus based on a hybrid modality input, comprising an image acquisition module and an image retrieval module; the image retrieval module includes: an original characteristic vector extraction sub-module, a mixed modal relationship diagram generation sub-module, a comparable characteristic vector extraction sub-module and a retrieval result selection sub-module;

the image acquisition module is used for acquiring a mixed modal image to be retrieved; the mixed modal image to be retrieved comprises image elements of at least two different modalities;

the image retrieval module is used for inputting the mixed modal image to be retrieved into a preset mixed modal retrieval model so that the mixed modal retrieval model selects a target image corresponding to the mixed modal image to be retrieved from preset images to be compared according to the mixed modal image to be retrieved to obtain a retrieval result;

the original feature vector extraction submodule is used for extracting an original feature vector of each image element in the mixed modal image to be retrieved;

the mixed mode relation graph generating submodule is used for inputting the original characteristic vector of each image element into a mode information attention submodel so that the model information attention submodel generates a mixed mode relation graph according to the original characteristic vector of each image element;

the comparable characteristic vector extraction submodule is used for projecting the mixed modal relationship diagram to a characteristic space to generate a comparable characteristic vector of each image element through a preset diagram convolution network submodel;

the retrieval result selection submodule is configured to calculate a similarity score between the mixed-mode image to be retrieved and each image to be compared according to the comparable feature vector of each image element, and then select a target image corresponding to the mixed-mode image to be retrieved from each image to be compared according to the similarity score to obtain a retrieval result.

The embodiment of the invention has the following beneficial effects:

the embodiment of the invention provides an image retrieval method and device based on mixed modal input. Specifically, the mixed modal retrieval model firstly extracts original feature vectors of image elements of various modalities in a mixed modal image to be retrieved, and then converts the original feature vectors into a mixed modal relational graph based on a modal information attention sub-model; then, projecting the mixed mode relation graph to a comparable feature space through a graph convolution network submodel to obtain a comparable feature vector; finally, calculating similarity scores of the mixed modal image to be retrieved and each image to be compared according to the comparable feature vectors; and finally, obtaining a corresponding target image and obtaining a retrieval result. By implementing the invention, the image retrieval can be carried out on a mixed modality object with image elements in different modalities, and the matching can ensure that a user can more conveniently construct retrieval conditions when carrying out the image retrieval, more accurately capture the retrieval requirements of the user and improve the accuracy of the image retrieval.

Drawings

FIG. 1 is a schematic illustration of a mixed modality image.

Fig. 2 is a flowchart illustrating an image retrieval method based on mixed modality input according to an embodiment of the present invention.

FIG. 3 is a schematic diagram of a comparable feature vector according to an embodiment of the present invention.

Fig. 4 is a schematic diagram of a retrieval result of an image retrieval method based on mixed mode input according to an embodiment of the present invention.

Fig. 5 is a schematic structural diagram of an image retrieval apparatus based on mixed modality input according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

First, the image elements of different modalities are explained: in the present invention, the modality refers to an expression for describing scene contents in an image, and as shown in fig. 1, in the image shown in fig. 1, an image element 1 (an image captured by an image capturing device, such as a camera) of a natural image modality, an image element 2 of a hand-drawn sketch modality, and an image element 3 describing a text modality are included. Image elements refer to individual items in an image.

The scheme of the present application is explained below:

as shown in fig. 2, an embodiment of the present invention provides an image retrieval method based on mixed modality input, including:

s101, acquiring a mixed mode image to be retrieved; the mixed modal image to be retrieved comprises image elements of at least two different modalities;

step S102, inputting the mixed modal image to be retrieved into a preset mixed modal retrieval model, so that the mixed modal retrieval model selects a target image corresponding to the mixed modal image to be retrieved from preset images to be compared according to the mixed modal image to be retrieved, and a retrieval result is obtained;

selecting a target image corresponding to the mixed modality image to be retrieved from preset images to be compared according to the mixed modality image to be retrieved, wherein the step of obtaining a retrieval result specifically comprises the following steps: extracting an original feature vector of each image element in the mixed modal image to be retrieved; inputting the original feature vector of each image element into a modal information attention sub-model, so that the model information attention sub-model generates a mixed modal relationship graph according to the original feature vector of each image element; projecting the mixed mode relation graph to a feature space to generate a comparable feature vector of each image element through a preset graph convolution network sub-model; and calculating the similarity scores of the mixed modal image to be retrieved and the images to be compared according to the comparable feature vectors of the image elements, and then selecting a target image corresponding to the mixed modal image to be retrieved from the images to be compared according to the similarity scores to obtain a retrieval result.

For step S101, a mixed modality image to be retrieved is first obtained, where the mixed modality image at least includes image elements of any two modalities: a natural image modality, a hand-drawn sketch modality, and a descriptive text modality. For example, the picture shown in fig. 1 is selected as the above-mentioned mixed modality image to be retrieved, and the image includes image elements of a natural image modality (airplane 1 in the figure), image elements of a sketch modality (airplane 2 in the figure) and 5 image elements describing a text modality ("cluud" 3 in the figure).

For step S102, the following description is made of a processing flow involved in the mixed modality retrieval model of the mixed modality image to be retrieved;

the first step is as follows: extracting an original feature vector: in the invention, the image elements of different modal types are subjected to data preprocessing by adopting the existing neural network model to extract the original characteristic feature vector of each image element; the original feature vector n of each image element can be represented as n ═ v, c, x, y ], where v is a visual information vector of 2208 dimensions, generated using the existing densnet-161 network for image elements of natural image modalities and image elements of hand-drawn sketch modalities, and using the full 0 vector as the visual information vector for image elements describing text modalities. c is a 300-dimensional class vector and is generated by adopting the existing GloVE network. (x, y) are the position coordinates of the image element in the image.

The second step is that: generating a mixed mode relational graph: in a preferred embodiment, the model information attention sub-model generates a mixed mode relationship diagram according to the original feature vector of each image element, and specifically includes: calculating a supply vector and a demand vector of each original feature vector according to the supply projection matrix and the demand projection matrix of each image element; calculating a feature correlation coefficient between every two image elements according to the supply vector and the demand vector of each image element; and constructing an adjacency matrix according to all the characteristic correlation coefficients, and taking the adjacency matrix as the mixed mode relation graph.

The specific step comprises the following small steps:

s21 in the modal information attention submodel, first, the projection matrix S is given according to the preset _m And a demand projection matrix D _m Calculating a supply vector s and a demand vector d corresponding to the original feature vector of each image element; the calculation formula is as follows:

s＝S _m *n；d＝D _m n; n is an original feature vector; it should be noted that in the present invention, a corresponding pair of supply projection matrices S is provided for each type of modality image element _m And a demand projection matrix D _m And training different projection matrixes to perform projection of the characteristics by adopting a neural network. The image shown in fig. 1 has three image elements of different modality types, and thus there are three pairs of image elements trained by different neural networks at this timeThe supply projection matrix and the demand projection matrix.

S22 demand vector d according to image element i _i With supply vector s of picture element j _i Calculating a characteristic correlation coefficient e _ij And then constructing an adjacency matrix A according to the characteristic correlation coefficient, and using the adjacency matrix as a mixed mode relation graph. The calculation formula is as follows:

in the above equation, | N | is the number of image elements in the feature set inputted by N. exp () is an exponential function with e as base, K768 is a vector d _i And s _i Of (c) is calculated.

The third step: generating comparable feature vectors; in a preferred embodiment, the method for generating comparable feature vectors of image elements comprises:

step 1: projecting the input characteristic vector of the graph convolution layer of the current level to generate a graph convolution characteristic vector of the current level for processing by a graph convolution network; wherein, the input feature vector of the graph volume layer of the first level is the original feature vector of the image element; step 2: fusing the adjacency matrix and the graph convolution feature vector of the current level to generate a fusion feature of the current level; combining the fusion features of the current level, the fusion features of all levels before the current level and the original feature vectors of the image elements to generate an input feature vector of a first next level; step 4, repeating the step 1 to the step 3 until the fusion characteristics of the last level are generated; and 5: and combining the fused features of all the levels and the feature vectors of the image elements to generate the comparable feature vector.

The graph convolution network submodel is constructed by the graph convolution network of the dense projection shown in fig. 3, and the mixed mode relation graph is projected into a comparable feature space to obtain a comparable feature vector. Specifically, this step includes the following small steps:

s31 inputting feature vectors x of different modes ^l Using projection layers Proj ^l Projecting to obtain the characteristic p for graph convolution network processing ^l Wherein l is the hierarchical index of the convolutional layer; specifically, the input feature vector of the first layer is the original feature vector n of each image element, and the specific configuration of the input feature vector after the input of the rest of the layers is shown in step 3-3. The formula is expressed as follows:

p ^l ＝Proj ^l (x ^l-1 )

x ¹ ＝n；

s32 wrapping the layer Graph with the Graph ^l For the adjacent matrix A and the projected eigenvector matrix

Carrying out feature fusion to obtain a feature matrix after feature fusion

Wherein

Is the fusion feature of the ith image element. The formula is expressed as follows:

G ^l ＝Graph ^l (AP ^l )；

s33, combining the original feature vector n and the fusion feature of the graph convolution layer from the 1 st layer to the l layer as the input x of the l +1 st layer ^l+1 The formula is as follows:

x ^l+1 ＝[n,g ¹ ,g ² ,...,g ^l ]；

s34, repeating the steps S31 to S33 until the output g of the graph winding layer of the last layer ^l The calculation is completed. Combining the original features n with the fusion features of all graph convolution layers to obtain a comparable feature vector, wherein the formula is as follows:

h＝Linear([n,g ¹ ,g ² ,...,g ^L ])；

where Linear is a Linear layer in a deep neural network.

The fourth step: calculating the similarity; specifically, the method can be divided into the following small steps:

s41 comparable feature vectors { h } for all background image elements of the same class _i Integrating to obtain integrated background feature vectors

The formula is defined as:

c _i ＝c；

where c is the type of article, k is the number of image elements belonging to type c, c _i As comparable feature vectors h for image elements _i The species to which it belongs. In the present invention, all image elements are divided into foreground image elements and background image elements in advance, for example, the airplane in the hand-drawing sketch mode and the airplane in the natural image mode in fig. 1 are both foreground image elements, and "cloud" describing the text mode is used as a background element, in the present invention, the foreground image and the background image of each image element in the hybrid mode image to be retrieved are identified, then category identification is performed, and the item category corresponding to each image element is identified, in this step, mainly in order to fuse the background elements of the same kind, for example, 5 "cloud" shown in fig. 1 all belong to a cloud and belong to the same kind, it should be noted that the same kind here refers to the same category of the items corresponding to the image elements, and does not refer to the same category of the modes. For example, the sketched aircraft of fig. 1 and the natural image modality aircraft may be considered to be of the same kind. In the present invention there is another prior art backgroundThe segmentation method and the article identification method are used for identifying the article type of each image element in the mixed modal image to be retrieved and distinguishing whether each image element belongs to a background element. In the present invention, the background feature vector is integrated to match the number of image elements of the mixed-mode image W used in training with the matched image I ⁺ The number of the image elements of the object is kept consistent, and the training effect is improved. When constructing mixed modal input and drawing a sketch of a background, a user usually likes to use a plurality of articles to represent the background, for example, drawing a plurality of clouds to represent the sky and drawing a plurality of grasses to replace a grassland.

S42, respectively calculating a comparable characteristic vector set for the mixed modal input M to be retrieved and a preset natural image input l to be compared (in the invention, each image to be compared is a natural image)

And

then, the cosine similarity is adopted to compare every two characteristics of different modes, and the similarity score is calculated

Wherein N is _M And N _I Respectively representing the number of image elements in the mixed mode image to be retrieved and the image to be compared, Cos () is a cosine similarity calculation function,

as comparable feature vectors for image elements

And F is a foreground set, and B is a background set. When the similarity of the graph is calculated, different weights are adopted for foreground and background image elements, and the calculation mode can obtain a better retrieval effect.

Calculating the similarity scores between the mixed modal image to be retrieved and each image to be compared according to the above example, and finally outputting the image with the highest similarity as a target image to obtain a retrieval result; the target image obtained according to fig. 1 is, for example, the image shown in fig. 4.

In a preferred embodiment, when the mixed modal retrieval model is constructed, a mixed modal image and each image to be matched are used as input, a target image corresponding to the mixed modal image is used as output, the mixed modal retrieval model is constructed through a neural network, when the mixed modal retrieval model is trained, a loss function is constructed according to similarity scores of the mixed modal image and each image to be matched, and then network parameters of the mixed modal retrieval model are updated according to the loss function.

The constructing a loss function according to the similarity scores of the mixed modal image and the images to be matched, and then updating the network parameters of the mixed modal retrieval model according to the loss function specifically includes:

calculating a regression loss function according to the comparable feature vector of the mixed mode image, the original feature vector of the matched image, the feature vector of the unmatched image and the original feature vector of the unmatched image;

The following mainly explains the updating of the loss function and the network parameters of the model, and the specific data processing principle in the model training process is consistent with the data processing principle in the model application process, which is not repeated herein;

updating network parameters according to the similarity degree score loss function, which mainly comprises the following small steps:

s51 comparable feature vector set from mixed modality image W

Matched image I ⁺ Of comparable feature vector sets

Computing a modal alignment loss function L _ca :

Wherein E _P ，E _q Respectively sets of characteristic vectors

And

y () is a feature projection function, H _k To regenerate hilbert space, k is the gaussian kernel function.

S52 matching the image I according to the mixed mode image W ⁺ And unmatched images I ^- Calculating the triple loss L _tri (triplet loss)：

Wherein the content of the first and second substances,

for mixed modal image W and unmatched image I ^- The similarity score of (a) is calculated,

for mixing the modal image W with the matched image I ⁺ The similarity score of (a).

S53 comparable feature vector set from mixed modality image W

Matched image I ⁺ Set of comparable feature vectors

Matched image I ⁺ Of the original feature vector set, unmatched image I ^- Set of comparable feature vectors

And calculating a regression loss function L by using the set of original characteristic vectors of the unmatched images _cyc ：

r＝Regressor(h)；

Wherein Regressor () is a regression matrix, E2]Is a desired function. Here, the set of comparable feature vectors for the mixed modality image W is regressed to the matched image I ⁺ The reason for the set of raw feature vectors of (1) is that the present invention aims to correctly retrieve a matched natural image I from a mixed-mode image W ⁺ It should be ensured that the comparable feature vectors contain information that can be regressed onto the original feature vectors of the matched images.

S54 aligning the modes to a loss function L _ca IIIContinuous loss L _tri And a regression loss function L _cyc And adding to obtain the loss function of the mixed mode retrieval model, and updating the network parameters by minimizing the loss function of the mixed mode retrieval model. Loss function L ═ L of mixed mode search model _tri +0.1L _ca +0.05L _cyc 。

On the basis of the above method item embodiments, the present invention correspondingly provides apparatus item embodiments.

As shown in fig. 5, another embodiment of the present invention provides an image retrieval apparatus based on mixed modality input, including an image acquisition module and an image retrieval module; the image retrieval module includes: an original characteristic vector extraction sub-module, a mixed modal relationship diagram generation sub-module, a comparable characteristic vector extraction sub-module and a retrieval result selection sub-module;

In a preferred embodiment, the system further includes a model construction module, where the model construction module is configured to construct the mixed modality retrieval model through a neural network by taking the mixed modality image and each image to be matched as input and taking a target image corresponding to the mixed modality image as output, and when the mixed modality retrieval model is trained, construct a loss function according to a similarity score between the mixed modality image and each image to be matched, and then update a network parameter of the mixed modality retrieval model according to the loss function.

It should be noted that the above-described device embodiments are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiment of the apparatus provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement it without inventive effort.

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims

1. An image retrieval method based on mixed modality input, comprising:

acquiring a mixed mode image to be retrieved; the mixed modal image to be retrieved comprises image elements of at least two different modalities;

selecting a target image corresponding to the mixed modality image to be retrieved from preset images to be compared according to the mixed modality image to be retrieved, wherein the step of obtaining a retrieval result specifically comprises the following steps:

inputting the original feature vector of each image element into a modal information attention sub-model, so that the modal information attention sub-model generates a mixed modal relationship graph according to the original feature vector of each image element; the modality information attention sub-model generates a mixed modality relationship diagram according to the original feature vectors of the image elements, and specifically includes: calculating a supply vector and a demand vector of each original feature vector according to the supply projection matrix and the demand projection matrix of each image element; calculating a feature correlation coefficient between every two image elements according to the supply vector and the demand vector of each image element; constructing an adjacency matrix according to all the characteristic correlation coefficients, and taking the adjacency matrix as the mixed mode relation graph;

2. The method for image retrieval based on mixed modality input according to claim 1, wherein the mixed modality image to be retrieved includes at least image elements of any two of the following modalities: a natural image modality, a hand-drawn sketch modality, and a descriptive text modality.

3. The method for image retrieval based on mixed modality input according to claim 1, wherein the method for generating comparable feature vectors for image elements comprises:

combining the fusion features of the current level, the fusion features of all levels before the current level and the original feature vectors of the image elements to generate an input feature vector of a first next level;

step 4, repeating the step 1 to the step 3 until the fusion characteristics of the last level are generated;

and 5: and combining the fused features of all the levels and the feature vectors of the image elements to generate the comparable feature vector.

4. The image retrieval method based on mixed modality input according to claim 3, wherein when constructing the mixed modality retrieval model, the mixed modality retrieval model is constructed through a neural network by taking a mixed modality image and each image to be matched as input and a target image corresponding to the mixed modality image as output, and when training the mixed modality retrieval model, a loss function is constructed according to the similarity score between the mixed modality image and each image to be matched, and then network parameters of the mixed modality retrieval model are updated according to the loss function.

5. The image retrieval method based on mixed modality input according to claim 4, wherein the constructing a loss function according to the similarity scores of the mixed modality image and each image to be matched, and then updating the network parameters of the mixed modality retrieval model according to the loss function specifically comprises:

6. An image retrieval device based on mixed modal input is characterized by comprising an image acquisition module and an image retrieval module; the image retrieval module includes: an original characteristic vector extraction sub-module, a mixed modal relationship diagram generation sub-module, a comparable characteristic vector extraction sub-module and a retrieval result selection sub-module;

the mixed mode relation graph generating submodule is used for inputting the original characteristic vector of each image element into a mode information attention sub-model so that the mode information attention sub-model generates a mixed mode relation graph according to the original characteristic vector of each image element; the modality information attention sub-model generates a mixed modality relationship diagram according to the original feature vectors of the image elements, and specifically includes: calculating a supply vector and a demand vector of each original feature vector according to the supply projection matrix and the demand projection matrix of each image element; calculating a feature correlation coefficient between every two image elements according to the supply vector and the demand vector of each image element; constructing an adjacency matrix according to all the characteristic correlation coefficients, and taking the adjacency matrix as the mixed mode relation graph;