CN112925935A - Image menu retrieval method based on intra-modality and inter-modality mixed fusion - Google Patents

Image menu retrieval method based on intra-modality and inter-modality mixed fusion

Info

Publication number
CN112925935A
Authority
CN
China
Prior art keywords
image
food material
fusion
modality
weighted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110397679.2A
Other languages
Chinese (zh)
Other versions
CN112925935B (en)
Inventor
徐行
李娇
沈复民
邵杰
申恒涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202110397679.2A priority Critical patent/CN112925935B/en
Publication of CN112925935A publication Critical patent/CN112925935A/en
Application granted granted Critical
Publication of CN112925935B publication Critical patent/CN112925935B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/532Query formulation, e.g. graphical querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to the field of cross-modal retrieval, in particular to an image menu retrieval method based on intra-modality and inter-modality mixed fusion, which comprises the following steps: step 1, preparing image data and menu data; step 2, constructing an integral network; step 3, training the whole network in the step 2, and setting a loss function; step 4, performing cross-modal retrieval on the food and the menu by using the trained integral network. The method solves the problem of poor cross-modal retrieval performance.

Description

Image menu retrieval method based on intra-modality and inter-modality mixed fusion
Technical Field
The invention relates to the field of cross-modal retrieval, in particular to an image menu retrieval method based on intra-modal and inter-modal mixed fusion.
Background
With the rapid growth of multimedia data such as text, images and videos on the Internet, retrieval across different modalities has become a new trend in information retrieval; cross-media retrieval refers to taking data of any media type as the input query and retrieving semantically related data from all media types; cross-media image-text retrieval is a retrieval task between images and text in the broad sense; in the invention, the application scenario of cross-media image-text retrieval is the mutual retrieval of food images and menu texts: for any food image, the menu description text most relevant to its content is retrieved, and for any menu description text, the food image most relevant to its description is retrieved; typically, an image and a corresponding recipe are provided in the data set, wherein each recipe contains food materials and cooking steps; the food materials are varied, some of which (e.g., beef, egg) are directly visible in the dish while others (e.g., salt, honey) are not; most food materials are sensitive to high temperatures, cutting and other cooking operations, and their original appearance is easily altered; the cooking steps contain complex logical information about the cooking process; meanwhile, food objects in the image typically exhibit stacked, staggered poses; therefore, for the cross-modal retrieval of images and recipes, the difficulty is how to obtain more effective modality features that emphasize the salient information of the image or the recipe, so as to comprehensively measure the similarity between images and recipes.
Most existing methods encode the food materials and the cooking steps in a menu with two independent recurrent neural networks and extract global image features with a convolutional neural network; a retrieval loss function is then used to pull matched pairs closer and push dissimilar pairs apart; although these methods have achieved some success in cross-modal image menu retrieval, they still have the following disadvantages:
1) existing methods pay little attention to the potential interaction between the food materials and the cooking steps in a recipe; key information in the food materials and the cooking steps often occurs jointly, and their separately extracted features do not represent the relationship between them well;
2) existing methods usually rely on global image features and ignore fine-grained food image regions, so food materials occupying only a few pixels are easily missed during feature extraction, and the size, shape and primary-secondary relations of the food materials are difficult to capture with global image features, resulting in unsatisfactory retrieval performance;
3) most methods project images and recipes into a shared subspace to learn their feature distributions and measure the distance between them, which lacks interaction and fusion between the data of the two modalities, making subspace learning inefficient.
Disclosure of Invention
Based on the above problems, the invention provides an image menu retrieval method based on intra-modality and inter-modality mixed fusion, which solves the problem of poor cross-modal retrieval performance.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
the image menu retrieval method based on mixed fusion in and among the modes comprises the following steps:
step 1, preparing image data and menu data, wherein the image data comprises food images, and the menu data comprises food materials and cooking steps;
step 2, constructing an integral network;
the step 2 specifically comprises the following steps:
step 21, extracting an image characteristic sequence, a food material characteristic sequence and a cooking step characteristic sequence;
step 22, performing feature fusion on the food material feature sequence and the cooking step feature sequence by using a cross attention mechanism in a menu mode to obtain food material fusion features and cooking step fusion features;
step 23, performing modal fusion on the food material fusion characteristics and the image characteristic sequence under a gating mechanism by using a multi-head cross attention mechanism to obtain secondary food material fusion characteristics and image fusion characteristics;
step 24, obtaining local similarity of the image fusion characteristics and the secondary fusion characteristics of the food materials through a multilayer perceptron;
step 25, calculating the cosine similarity between the image feature sequence and the cooking step fusion feature to obtain the global similarity;
step 26, linearly combining the local similarity and the global similarity to jointly measure the similarity between the image data and the menu data;
step 3, training the whole network in the step 2, and setting a loss function;
and 4, performing cross-modal retrieval on the food and the menu by using the trained integral network.
Further, the step 21 specifically includes the following steps:
step 211, performing feature extraction on the food image with the convolutional neural network ResNet50 to extract the image feature sequence;
step 212, performing feature representation on the food materials with the word2vec model, and extracting the food material feature sequence with a single-layer bidirectional gated recurrent unit (GRU);
and step 213, performing feature representation on the cooking steps with the sentence2vector model, and extracting the cooking step feature sequence.
Further, the step 22 specifically includes the following steps:
step 221, calculating an affinity matrix of the food material and the cooking step;
step 222, calculating the food material characteristic sequence and the affinity matrix to obtain weighted food material characteristics weighted by the characteristic sequence of the cooking step;
step 223, calculating the cooking step characteristic sequence and the affinity matrix to obtain weighted cooking step characteristics weighted by the food material characteristic sequence;
step 224, performing matrix splicing on the food material characteristic sequence and the characteristics of the weighted cooking step to obtain food material fusion characteristics;
and step 225, performing matrix splicing on the cooking step characteristic sequence and the weighted food material characteristics to obtain the cooking step fusion characteristics.
Further, the step 23 specifically includes the following steps:
step 231, calculating an image information vector weighted by the food material fusion characteristics for the image characteristic sequence, dividing the food material fusion characteristics and the image characteristic sequence into h vectors in the same-dimension subspace by using a multi-head cross attention mechanism, and performing matrix splicing on the image information sub-vectors respectively obtained by the h vectors to obtain a final weighted image information vector;
step 232, calculating the food material information vectors weighted by the image feature sequences for the food material fusion features, dividing the image feature sequences and the food material fusion features into h vectors in the same-dimension subspace by the multi-head cross attention mechanism, and performing matrix splicing on the food material information sub-vectors respectively obtained by the h vectors to obtain final weighted food material information vectors;
step 233, performing dot product calculation on the image characteristic sequence and the weighted food material information vector to obtain correlation, and further expressing the correlation as a gating matrix of fusion degree;
step 234, performing point multiplication on the image characteristic sequence and the weighted food material information vector element-by-element summation and the gating matrix, and performing residual connection with the image characteristic sequence to finally obtain an image fusion characteristic;
step 235, performing point multiplication calculation on the food material fusion characteristics and the weighted image information vector to obtain correlation, and further expressing the correlation as a gating matrix of the fusion degree;
and 236, performing point multiplication on the element-by-element summation of the food material fusion characteristics and the weighted image characteristics and the gating matrix, and performing residual connection on the food material fusion characteristics to finally obtain the secondary food material fusion characteristics.
Further, the step 24 specifically includes the following steps:
241, performing matrix splicing on the image fusion characteristics obtained in the step 23 and the secondary food material fusion characteristics to obtain a 2048-dimensional splicing vector;
and 242, inputting the splicing vector into a two-layer multilayer perceptron and obtaining a value between 0 and 1 after the activation function sigmoid, wherein the value is expressed as the local similarity.
Further, in the step 25, the image feature sequence and the cooking step fusion feature are respectively subjected to average pooling operation to obtain features with the same dimensionality, and the cosine similarity calculated by the image feature sequence and the cooking step fusion feature is expressed as global similarity.
Further, in the step 26, the local similarity and the global similarity are linearly combined, wherein the ratio of the local similarity to the global similarity is between 0 and 1, and the sum of the ratios is 1, which is expressed as the matching degree of the image data and the recipe data.
Further, in the step 3, a contrastive ranking loss is adopted as the loss function to train the whole network in the step 2.
Compared with the prior art, the invention has the following beneficial effects: through intra-modality fusion within the menu, the interaction between the food materials and the cooking steps is absorbed, further enriching the information expressed by the two otherwise independent embedded features; at the same time, inter-modality fusion between the images and the menu explores the potential relation between fine-grained image regions and food materials; the final image menu similarity is therefore formed jointly from the local and the global aspects, and a better cross-modal retrieval effect is obtained.
Drawings
FIG. 1 is a flow chart of the present embodiment;
FIG. 2 is a table showing the results of the experiment in this example.
Detailed Description
The invention will be further described with reference to the accompanying drawings. Embodiments of the present invention include, but are not limited to, the following examples.
Example 1
The image menu retrieval method based on intra-modality and inter-modality mixed fusion as shown in fig. 1 includes the following steps:
step 1, preparing image data and menu data;
step 2, constructing an integral network;
step 3, training the whole network in the step 2, and setting a loss function;
and 4, performing cross-modal retrieval on the food and the menu by using the trained integral network.
In the method, the image data prepared in the step 1 comprises food images, and the menu data comprises food materials and cooking steps;
in this embodiment, step 2 specifically includes the following steps:
step 21, extracting an image characteristic sequence, a food material characteristic sequence and a cooking step characteristic sequence;
step 22, performing feature fusion on the food material feature sequence and the cooking step feature sequence by using a cross attention mechanism in a menu mode to obtain food material fusion features and cooking step fusion features;
step 23, performing modal fusion on the food material fusion characteristics and the image characteristic sequence under a gating mechanism by using a multi-head cross attention mechanism to obtain secondary food material fusion characteristics and image fusion characteristics;
step 24, obtaining local similarity of the image fusion characteristics and the secondary fusion characteristics of the food materials through a multilayer perceptron;
step 25, calculating the cosine similarity between the image feature sequence and the cooking step fusion feature to obtain the global similarity;
step 26, linearly combining the local similarity and the global similarity to jointly measure the similarity between the image data and the menu data;
in the above method, step 21 specifically includes the following steps:
and step 211, performing feature extraction on the food image by using a convolutional neural network ResNet50, and taking the output of the last layer of residual error block of the convolutional neural network, wherein the output comprises 7 × 7-49 columns 2048-dimensional convolutional output, and the output result is expressed as an image feature sequence
Figure BDA0003018736350000041
Wherein s denotes the number of image areas, EuA sequence of features of the image is represented,
Figure BDA0003018736350000042
a food image feature representing an image area of the ith row;
step 212, performing characteristic representation on the food material by using the word2vec model, extracting a food material characteristic sequence by using a single-layer bidirectional gating circulation unit GRU, and representing the food material characteristic sequence as
Figure BDA0003018736350000043
Wherein m represents the length of the food material sequence, RgRepresenting a characteristic sequence of food material ri gA food material characteristic representing an ith food material sequence;
step 213, using the sensor 2vector model to perform characteristic representation on the cooking steps, and extracting the characteristic sequence of the cooking steps to represent the characteristic sequence as
Figure BDA0003018736350000051
Wherein n represents the length of the sequence of cooking steps, RsCharacteristic sequence of cooking steps, ri sA cooking step characteristic representing an ith sequence of cooking steps.
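For illustration, a minimal sketch of the feature extraction in step 21 is given below (in Python with PyTorch; the embedding dimensions, the GRU hidden size and the random placeholders standing in for the word2vec and sentence2vector embeddings are assumptions, not the exact implementation of the invention):

```python
import torch
import torch.nn as nn
from torchvision import models

# Image branch: ResNet50 without its pooling/classification head, so the last
# residual block yields a 7x7 grid of 2048-dimensional region features.
resnet = models.resnet50(weights=None)              # pretrained weights would be loaded in practice
backbone = nn.Sequential(*list(resnet.children())[:-2])
img = torch.randn(1, 3, 224, 224)                   # one food image
E_u = backbone(img).flatten(2).transpose(1, 2)      # (1, 49, 2048): s = 49 image regions

# Food-material branch: word2vec vectors fed to a single-layer bidirectional GRU.
m, d_w = 8, 300                                     # m food materials, 300-d word2vec (assumed)
word_vecs = torch.randn(1, m, d_w)                  # placeholder for the word2vec embeddings
gru = nn.GRU(d_w, 512, num_layers=1, bidirectional=True, batch_first=True)
R_g, _ = gru(word_vecs)                             # (1, m, 1024) food-material feature sequence

# Cooking-step branch: sentence2vector embeddings used as the step feature sequence.
n, d_s = 6, 1024                                    # n cooking steps (dimension assumed)
R_s = torch.randn(1, n, d_s)                        # placeholder for the sentence2vector features
```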
In the above method, step 22 specifically includes the following steps:
step 221, calculating the affinity matrix A of the food materials and the cooking steps, the formula of the affinity matrix A being:
A = (R^g W_g)(R^s W_s)^T,
wherein W_g and W_s are weight parameters to be learned, and T denotes the matrix transpose operation;
step 222, computing the food material feature sequence with the affinity matrix to obtain the weighted food material features R̂^g weighted by the cooking step feature sequence, specifically:
A^g = softmax(A^T / √d),
R̂^g = A^g R^g,
wherein A^g denotes the weight matrix obtained after the affinity matrix A is normalized along the food material dimension, the softmax() function maps the weight values between 0 and 1, A^T denotes the transpose of the affinity matrix A, and √d denotes the square root of the feature dimension of the food material or cooking step features;
step 223, computing the cooking step feature sequence with the affinity matrix to obtain the weighted cooking step features R̂^s weighted by the food material feature sequence, specifically:
A^s = softmax(A / √d),
R̂^s = A^s R^s,
wherein A^s denotes the weight matrix obtained after the affinity matrix A is normalized along the cooking step dimension;
step 224, performing matrix splicing of the food material feature sequence R^g and the weighted cooking step features R̂^s to obtain the food material fusion features E^g, specifically:
E^g = [R^g ‖ R̂^s],
wherein [· ‖ ·] denotes the matrix splicing operation;
step 225, performing matrix splicing of the cooking step feature sequence R^s and the weighted food material features R̂^g to obtain the cooking step fusion features E^s, specifically:
E^s = [R^s ‖ R̂^g].
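A minimal sketch of the intra-menu cross attention of step 22, written in Python with PyTorch under the shapes reconstructed above (the projection matrices are passed in as plain tensors purely for illustration, and the toy sizes are assumptions):

```python
import torch
import torch.nn.functional as F

def intra_recipe_fusion(R_g, R_s, W_g, W_s):
    """Step 22 sketch: cross attention between food materials and cooking steps.

    R_g: (m, d) food-material features, R_s: (n, d) cooking-step features,
    W_g, W_s: (d, d) learnable projections (shapes are assumptions).
    """
    d = R_g.size(-1)
    A = (R_g @ W_g) @ (R_s @ W_s).t()            # (m, n) affinity matrix
    A_s = F.softmax(A / d ** 0.5, dim=-1)        # normalised along the cooking-step dimension
    A_g = F.softmax(A.t() / d ** 0.5, dim=-1)    # normalised along the food-material dimension
    R_s_hat = A_s @ R_s                          # (m, d) weighted cooking-step features
    R_g_hat = A_g @ R_g                          # (n, d) weighted food-material features
    E_g = torch.cat([R_g, R_s_hat], dim=-1)      # (m, 2d) food-material fusion features
    E_s = torch.cat([R_s, R_g_hat], dim=-1)      # (n, 2d) cooking-step fusion features
    return E_g, E_s

# toy usage with assumed sizes
E_g, E_s = intra_recipe_fusion(torch.randn(8, 512), torch.randn(6, 512),
                               torch.randn(512, 512), torch.randn(512, 512))
```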
In the above method, step 23 specifically includes the following steps:
The input of the multi-head cross attention mechanism is expressed as three groups of vectors Q (query), K (key) and V (value); the attention value is calculated from Q and K, and the attention process is implemented as:
Attn(Q, K, V) = softmax(QK^T / √d) V,
wherein Attn() denotes the attention function, the softmax() function maps the values between 0 and 1, K^T denotes the transpose matrix of K, and √d denotes the square root of the dimension of Q or K;
the multi-head cross attention divides the features into h heads, each head performs the above attention process, and the outputs of the h heads are spliced to obtain the output of the multi-head cross attention mechanism; specifically, using the image feature sequence E^u obtained in step 211 as Q and the food material fusion features E^g obtained in step 224 as K and V, the weighted food material features Ê^g weighted by the image feature sequence are obtained as follows:
head_i^g = Attn(E^u W_i^u, E^g W_i^k, E^g W_i^g), i = 1, …, h,
Multihead^g = [head_1^g ‖ … ‖ head_h^g],
Ê^g = Multihead^g W^g,
wherein W_i^g, W_i^k, W_i^u and W^g are parameter matrices to be learned, W^g maps the output dimension of the multi-head cross attention mechanism back to the input dimension, and head_h^g denotes the output of the h-th attention head;
similarly, using the food material fusion features E^g as Q and the image feature sequence E^u obtained in step 211 as K and V, with a corresponding set of parameter matrices to be learned, the weighted image feature sequence Ê^u weighted by the food material fusion features is obtained as follows:
head_i^u = Attn(E^g W_i^g, E^u W_i^k, E^u W_i^u), i = 1, …, h,
Multihead^u = [head_1^u ‖ … ‖ head_h^u],
Ê^u = Multihead^u W^u,
wherein W^u maps the output dimension of the multi-head cross attention mechanism back to the input dimension and head_h^u denotes the output of the h-th attention head;
in addition, for the i-th row feature e_i^u of the image feature sequence E^u and the i-th row feature ê_i^g of the weighted food material features Ê^g weighted by the image feature sequence, their dot product is calculated and passed through the activation function sigmoid to obtain the fusion degree, where 0 indicates no fusion and 1 indicates complete fusion, as follows:
g_i^u = sigmoid(e_i^u · ê_i^g),
wherein g_i^u denotes the fusion degree of the i-th row feature;
for the image feature sequence E^u, G^u = [g_1^u, …, g_s^u] indicates the fusion degree of all image regions; similarly, for the food material fusion features E^g, G^g = [g_1^g, …, g_m^g] indicates the fusion degree of all food materials;
in addition, the fusion operation between the image feature sequence E^u and the weighted food material features Ê^g weighted by the image feature sequence adopts element-by-element summation, and a residual connection is added in order to avoid losing original image region information that is not well captured by the multi-head cross attention, giving the final image fusion features Ẽ^u as follows:
Ẽ^u = E^u + G^u ⊙ (E^u ⊕ Ê^g),
wherein ⊙ denotes the element-by-element product and ⊕ denotes the element-by-element summation; similarly, the final secondary food material fusion features Ẽ^g are obtained as follows:
Ẽ^g = E^g + G^g ⊙ (E^g ⊕ Ê^u).
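A minimal sketch of the gated inter-modality fusion of step 23 follows; it uses PyTorch's nn.MultiheadAttention as a stand-in for the multi-head cross attention described above and assumes that the image features and the food material fusion features have already been projected to a common dimension (the head count and dimension below are assumptions):

```python
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    """Step 23 sketch: multi-head cross attention between image regions and food
    material fusion features, followed by a sigmoid gate and a residual connection."""

    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.attn_img2food = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_food2img = nn.MultiheadAttention(dim, heads, batch_first=True)

    @staticmethod
    def gated_merge(X, X_hat):
        gate = torch.sigmoid((X * X_hat).sum(-1, keepdim=True))  # fusion degree in [0, 1]
        return X + gate * (X + X_hat)                            # gated element-wise sum + residual

    def forward(self, E_u, E_g):
        # E_u: (b, s, dim) image regions; E_g: (b, m, dim) food-material fusion features
        E_g_hat, _ = self.attn_img2food(E_u, E_g, E_g)  # food info attended for each image region
        E_u_hat, _ = self.attn_food2img(E_g, E_u, E_u)  # image info attended for each food material
        E_u_fused = self.gated_merge(E_u, E_g_hat)      # image fusion features
        E_g_fused = self.gated_merge(E_g, E_u_hat)      # secondary food-material fusion features
        return E_u_fused, E_g_fused
```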
in the above method, step 24 specifically includes the following steps:
fusing features to images
Figure BDA0003018736350000076
Secondary blending feature with food material
Figure BDA0003018736350000077
Performing average pooling operation to reduce dimension, inputting a two-layer multilayer sensor, and outputting after activating a function sigmoid to obtain a local similarity Slocal as follows:
Figure BDA0003018736350000078
in the above method, step 25 specifically includes the following steps:
for image characteristic sequence EuMerging features E with cooking stepssCalculating the cosine similarity to obtain a global similarity Sglobal as follows:
Sglobal(I,R)=cosine(pool(Eu),pool(Es))。
in the above method, step 26 specifically includes the following steps:
the local similarity Slocal and the global similarity Sglobal are linearly combined to obtain the final similarity S between the image data and the menu data, and the similarity S is as follows:
S(I,R)=ω1Slocal(I,R)+ω2Sglobal(I,R),
wherein, ω is1,ω2Is a weight parameter, and ω12=1。
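Steps 24 to 26 can be sketched together as a small similarity head; the MLP width, the pooling choice and the value of ω1 below are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityHead(nn.Module):
    """Steps 24-26 sketch: local similarity from a two-layer MLP over the pooled
    fused features, global similarity from cosine similarity, linear combination."""

    def __init__(self, dim=1024, omega1=0.5):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.omega1 = omega1                                      # omega2 = 1 - omega1

    def forward(self, E_u_fused, E_g_fused, E_u, E_s):
        # average pooling over the sequence dimension
        local_in = torch.cat([E_u_fused.mean(1), E_g_fused.mean(1)], dim=-1)
        s_local = torch.sigmoid(self.mlp(local_in)).squeeze(-1)           # Slocal in (0, 1)
        s_global = F.cosine_similarity(E_u.mean(1), E_s.mean(1), dim=-1)  # Sglobal
        return self.omega1 * s_local + (1 - self.omega1) * s_global
```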
In the above method, step 3 specifically includes the following steps:
it is assumed that the similarity S(I, R) assigns a high value to a positive sample pair (Ip, Rp) of image data and menu data and a low value to a negative sample pair (Ip, Rn), i.e. S(Ip, Rp) > S(Ip, Rn) for n ≠ p; the menu that best matches a query image can then be found by ranking the similarity scores between the query image and all menus, and vice versa;
in addition, during training, the contrastive ranking loss is used as the loss function; for a sampled positive pair (Ip, Rp) of image data and menu data, the hardest negative pairs (In, Rp) and (Ip, Rn) within the mini-batch are found and separated from the positive pair by a predefined margin Δ, as follows:
Loss(I, R) = [Δ + S(Ip, Rn) − S(Ip, Rp)]+ + [Δ + S(In, Rp) − S(Ip, Rp)]+,
wherein [x]+ = max(0, x).
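A sketch of this contrastive ranking loss over a mini-batch, assuming the pairwise similarities have been arranged into a square matrix whose diagonal holds the positive pairs (the margin value is an assumption):

```python
import torch

def contrastive_ranking_loss(S, margin=0.3):
    """S: (B, B) similarity matrix for a batch, S[i, i] being the positive
    image-recipe pairs; the hardest in-batch negative is used per anchor."""
    B = S.size(0)
    pos = S.diag()                                                    # S(Ip, Rp)
    mask = torch.eye(B, dtype=torch.bool, device=S.device)
    neg_r = S.masked_fill(mask, float('-inf')).max(dim=1).values      # hardest S(Ip, Rn)
    neg_i = S.masked_fill(mask, float('-inf')).max(dim=0).values      # hardest S(In, Rp)
    loss = torch.clamp(margin + neg_r - pos, min=0) + torch.clamp(margin + neg_i - pos, min=0)
    return loss.mean()
```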
After the integral model is constructed and trained, step 4 can be performed to perform cross-modal retrieval on food and recipes by using the trained integral network, specifically as follows:
step 41, extracting the feature vectors of the data of the given modality;
step 42, inputting the extracted feature vectors into the trained integral network;
step 43, using the trained integral network to compute the similarity (the linear combination of the local similarity and the global similarity) between the given modality data and each of the candidate data items of the other modality;
and step 44, sorting the similarity results; the original modality data corresponding to the candidate with the maximum similarity is the retrieval result.
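The retrieval procedure of step 4 reduces to scoring and ranking, as in the following sketch (the similarity_fn name is hypothetical and is assumed to return the combined local + global similarity produced by the trained network as a scalar tensor):

```python
import torch

def retrieve(query_feats, candidate_feats, similarity_fn, topk=10):
    """Score one query (e.g. a food image) against all candidates of the other
    modality and return the indices and scores of the top-k most similar candidates."""
    scores = torch.stack([similarity_fn(query_feats, c) for c in candidate_feats])
    ranked = torch.argsort(scores, descending=True)
    return ranked[:topk], scores[ranked[:topk]]
```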
The overall model of the application is evaluated with the most common retrieval evaluation metrics top-k and MedR: top-k refers to the proportion of queries for which the target positive image or menu sample is ranked among the first k results returned by the model (the higher the better), with k taken as 1, 5 and 10 in this embodiment; MedR denotes the median rank of the target positive samples over all queries (the lower the better).
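A sketch of how the top-k and MedR metrics can be computed from a query-candidate similarity matrix (the convention that candidate j == i is the target positive for query i is an assumption of this sketch):

```python
import numpy as np

def topk_and_medr(sim_matrix, ks=(1, 5, 10)):
    """sim_matrix[i, j]: similarity of query i to candidate j; j == i is the target."""
    order = np.argsort(-sim_matrix, axis=1)                    # candidates by descending similarity
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(len(order))])
    recalls = {f"top-{k}": float((ranks <= k).mean()) for k in ks}
    medr = float(np.median(ranks))                             # lower is better
    return recalls, medr
```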
The invention is tested on the large-scale image recipe retrieval dataset Recipe1M; the Recipe1M dataset collects over 1 million cooking recipes and about 800,000 food images from 24 popular cooking websites, with 238,999 image samples and their corresponding recipe texts used as the training set, 51,119 image samples and their corresponding recipe texts as the validation set, and 51,303 image samples and their corresponding recipe texts as the test set.
In the testing stage, 1,000 pairs (1k) and 10,000 pairs (10k) are sampled respectively, and the sampling is repeated 10 times to report the average result, as shown in fig. 2; it can be seen that the highest retrieval accuracy is obtained in the image-recipe retrieval scenario, and on the Recipe1M dataset the top-k metric is significantly improved over the prior art on both the 1k and 10k test sets, improving the effectiveness of image-recipe cross-modal retrieval.
The above is an embodiment of the present invention. The specific parameters in the above embodiments and examples are only for the purpose of clearly illustrating the invention verification process of the inventor and are not intended to limit the scope of the invention, which is defined by the claims, and all equivalent structural changes made by using the contents of the specification and the drawings of the present invention should be covered by the scope of the present invention.

Claims (8)

1. The image menu retrieval method based on intra-modality and inter-modality mixed fusion is characterized by comprising the following steps:
step 1, preparing image data and menu data, wherein the image data comprises food images, and the menu data comprises food materials and cooking steps;
step 2, constructing an integral network;
the step 2 specifically comprises the following steps:
step 21, extracting an image characteristic sequence, a food material characteristic sequence and a cooking step characteristic sequence;
step 22, performing feature fusion on the food material feature sequence and the cooking step feature sequence by using a cross attention mechanism in a menu mode to obtain food material fusion features and cooking step fusion features;
step 23, performing modal fusion on the food material fusion characteristics and the image characteristic sequence under a gating mechanism by using a multi-head cross attention mechanism to obtain secondary food material fusion characteristics and image fusion characteristics;
step 24, obtaining local similarity of the image fusion characteristics and the secondary fusion characteristics of the food materials through a multilayer perceptron;
step 25, calculating the cosine similarity between the image feature sequence and the cooking step fusion feature to obtain the global similarity;
step 26, linearly combining the local similarity and the global similarity to jointly measure the matching degree of the image data and the menu data;
step 3, training the whole network in the step 2, and setting a loss function;
and 4, performing cross-modal retrieval on the food and the menu by using the trained integral network.
2. The method for image menu search based on intra-modality and inter-modality hybrid fusion according to claim 1, wherein: the step 21 specifically includes the following steps:
step 211, performing feature extraction on the food image with the convolutional neural network ResNet50 to extract the image feature sequence;
step 212, performing feature representation on the food materials with the word2vec model, and extracting the food material feature sequence with a single-layer bidirectional gated recurrent unit (GRU);
and step 213, performing feature representation on the cooking steps with the sentence2vector model, and extracting the cooking step feature sequence.
3. The method for image menu search based on intra-modality and inter-modality hybrid fusion according to claim 1, wherein: the step 22 specifically includes the following steps:
step 221, calculating an affinity matrix of the food material and the cooking step;
step 222, calculating the food material characteristic sequence and the affinity matrix to obtain weighted food material characteristics weighted by the characteristic sequence of the cooking step;
step 223, calculating the cooking step characteristic sequence and the affinity matrix to obtain weighted cooking step characteristics weighted by the food material characteristic sequence;
step 224, performing matrix splicing on the food material characteristic sequence and the characteristics of the weighted cooking step to obtain food material fusion characteristics;
and step 225, performing matrix splicing on the cooking step characteristic sequence and the weighted food material characteristics to obtain the cooking step fusion characteristics.
4. The method for image menu search based on intra-modality and inter-modality hybrid fusion according to claim 1, wherein: the step 23 specifically includes the following steps:
step 231, calculating an image information vector weighted by the food material fusion characteristics for the image characteristic sequence, dividing the food material fusion characteristics and the image characteristic sequence into h vectors in the same-dimension subspace by using a multi-head cross attention mechanism, and performing matrix splicing on the image information sub-vectors respectively obtained by the h vectors to obtain a final weighted image information vector;
step 232, calculating the food material information vectors weighted by the image feature sequences for the food material fusion features, dividing the image feature sequences and the food material fusion features into h vectors in the same-dimension subspace by the multi-head cross attention mechanism, and performing matrix splicing on the food material information sub-vectors respectively obtained by the h vectors to obtain final weighted food material information vectors;
step 233, performing dot product calculation on the image characteristic sequence and the weighted food material information vector to obtain correlation, and further expressing the correlation as a gating matrix of fusion degree;
step 234, performing point multiplication on the image characteristic sequence and the weighted food material information vector element-by-element summation and the gating matrix, and performing residual connection with the image characteristic sequence to finally obtain an image fusion characteristic;
step 235, performing point multiplication calculation on the food material fusion characteristics and the weighted image information vector to obtain correlation, and further expressing the correlation as a gating matrix of the fusion degree;
and 236, performing point multiplication on the element-by-element summation of the food material fusion characteristics and the weighted image characteristics and the gating matrix, and performing residual connection on the food material fusion characteristics to finally obtain the secondary food material fusion characteristics.
5. The method for image menu search based on intra-modality and inter-modality hybrid fusion according to claim 1, wherein: the step 24 specifically includes the following steps:
241, performing matrix splicing on the image fusion characteristics obtained in the step 23 and the secondary food material fusion characteristics to obtain a 2048-dimensional splicing vector;
and 242, inputting the splicing vector into a two-layer multilayer perceptron and obtaining a value between 0 and 1 after the activation function sigmoid, wherein the value is expressed as the local similarity.
6. The method for image menu search based on intra-modality and inter-modality hybrid fusion according to claim 1, wherein: in the step 25, average pooling operation is respectively performed on the image feature sequence and the cooking step fusion feature to obtain features with the same dimensionality, and the cosine similarity calculated by the image feature sequence and the cooking step fusion feature is expressed as global similarity.
7. The method for image menu search based on intra-modality and inter-modality hybrid fusion according to claim 1, wherein: in the step 26, the local similarity and the global similarity are linearly combined, wherein the ratio of the local similarity to the global similarity is between 0 and 1, and the sum of the ratios is 1, which is expressed as the matching degree of the image data and the menu data.
8. The method for image menu search based on intra-modality and inter-modality hybrid fusion according to claim 1, wherein: in the step 3, the contrastive ranking loss is adopted as the loss function to train the whole network in the step 2.
CN202110397679.2A 2021-04-13 2021-04-13 Image menu retrieval method based on intra-modality and inter-modality mixed fusion Active CN112925935B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110397679.2A CN112925935B (en) 2021-04-13 2021-04-13 Image menu retrieval method based on intra-modality and inter-modality mixed fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110397679.2A CN112925935B (en) 2021-04-13 2021-04-13 Image menu retrieval method based on intra-modality and inter-modality mixed fusion

Publications (2)

Publication Number Publication Date
CN112925935A true CN112925935A (en) 2021-06-08
CN112925935B CN112925935B (en) 2022-05-06

Family

ID=76174378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110397679.2A Active CN112925935B (en) 2021-04-13 2021-04-13 Image menu retrieval method based on intra-modality and inter-modality mixed fusion

Country Status (1)

Country Link
CN (1) CN112925935B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109189968A (en) * 2018-08-31 2019-01-11 深圳大学 A kind of cross-module state search method and system
CN110147457A (en) * 2019-02-28 2019-08-20 腾讯科技(深圳)有限公司 Picture and text matching process, device, storage medium and equipment
CN110059157A (en) * 2019-03-18 2019-07-26 华南师范大学 A kind of picture and text cross-module state search method, system, device and storage medium
CN111026894A (en) * 2019-12-12 2020-04-17 清华大学 Cross-modal image text retrieval method based on credibility self-adaptive matching network
CN111598214A (en) * 2020-04-02 2020-08-28 浙江工业大学 Cross-modal retrieval method based on graph convolution neural network
CN112241468A (en) * 2020-07-23 2021-01-19 哈尔滨工业大学(深圳) Cross-modal video retrieval method and system based on multi-head self-attention mechanism and storage medium
CN111985369A (en) * 2020-08-07 2020-11-24 西北工业大学 Course field multi-modal document classification method based on cross-modal attention convolution neural network
CN112164011A (en) * 2020-10-12 2021-01-01 桂林电子科技大学 Motion image deblurring method based on self-adaptive residual error and recursive cross attention

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
CHEN H et al.: "Imram: Iterative matching with recurrent attention memory for cross-modal image-text retrieval", 《PROCEEDINGS OF THE IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》 *
FU H et al.: "Mcen: Bridging cross-modal gap between cooking recipes and dish images with latent variable model", 《PROCEEDINGS OF THE IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》 *
LI J et al.: "Hybrid Fusion with Intra- and Cross-Modality Attention for Image-Recipe Retrieval", 《PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL》 *
CHU JINGJING: "Research on cross-modal retrieval methods for the recipe domain" (面向菜谱领域的跨模态检索方法研究), 《HUNAN UNIVERSITY》 *
LIN YANG et al.: "Cross-modal recipe retrieval method incorporating a self-attention mechanism" (融合自注意力机制的跨模态食谱检索方法), 《JOURNAL OF FRONTIERS OF COMPUTER SCIENCE AND TECHNOLOGY》 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024098763A1 (en) * 2022-11-08 2024-05-16 苏州元脑智能科技有限公司 Text operation diagram mutual-retrieval method and apparatus, text operation diagram mutual-retrieval model training method and apparatus, and device and medium

Also Published As

Publication number Publication date
CN112925935B (en) 2022-05-06

Similar Documents

Publication Publication Date Title
CN110516160B (en) Knowledge graph-based user modeling method and sequence recommendation method
CN107766873A (en) The sample classification method of multi-tag zero based on sequence study
CN112417306B (en) Method for optimizing performance of recommendation algorithm based on knowledge graph
CN113657450B (en) Attention mechanism-based land battlefield image-text cross-modal retrieval method and system
CN113297370B (en) End-to-end multi-modal question-answering method and system based on multi-interaction attention
CN107545276A (en) The various visual angles learning method of joint low-rank representation and sparse regression
CN107590505B (en) Learning method combining low-rank representation and sparse regression
CN112784782B (en) Three-dimensional object identification method based on multi-view double-attention network
CN114782694A (en) Unsupervised anomaly detection method, system, device and storage medium
CN114693397A (en) Multi-view multi-modal commodity recommendation method based on attention neural network
CN114970517A (en) Visual question and answer oriented method based on multi-modal interaction context perception
CN114241191A (en) Cross-modal self-attention-based non-candidate-box expression understanding method
CN112925935B (en) Image menu retrieval method based on intra-modality and inter-modality mixed fusion
CN110569761B (en) Method for retrieving remote sensing image by hand-drawn sketch based on counterstudy
CN112182275A (en) Trademark approximate retrieval system and method based on multi-dimensional feature fusion
CN115222954A (en) Weak perception target detection method and related equipment
Liu et al. Audiovisual cross-modal material surface retrieval
CN116189800B (en) Pattern recognition method, device, equipment and storage medium based on gas detection
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
CN116955650A (en) Information retrieval optimization method and system based on small sample knowledge graph completion
CN115758159B (en) Zero sample text position detection method based on mixed contrast learning and generation type data enhancement
Zhang et al. Multiscale visual-attribute co-attention for zero-shot image recognition
CN114882409A (en) Intelligent violent behavior detection method and device based on multi-mode feature fusion
Sassi et al. Neural approach for context scene image classification based on geometric, texture and color information
CN115098646A (en) Multilevel relation analysis and mining method for image-text data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant