CN112925935A - Image menu retrieval method based on intra-modality and inter-modality mixed fusion - Google Patents

Image menu retrieval method based on intra-modality and inter-modality mixed fusion

Info

Publication number
CN112925935A
Authority
CN
China
Prior art keywords
image
food material
fusion
modality
weighted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110397679.2A
Other languages
Chinese (zh)
Other versions
CN112925935B (en)
Inventor
徐行
李娇
沈复民
邵杰
申恒涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202110397679.2A priority Critical patent/CN112925935B/en
Publication of CN112925935A publication Critical patent/CN112925935A/en
Application granted granted Critical
Publication of CN112925935B publication Critical patent/CN112925935B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/532Query formulation, e.g. graphical querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to the field of cross-modal retrieval, in particular to an image menu retrieval method based on intra-modality and inter-modality mixed fusion, which comprises the following steps: step 1, preparing image data and menu data; step 2, constructing an integral network; step 3, training the whole network in the step 2, and setting a loss function; step 4, performing cross-modal retrieval on the food and the menu by using the trained integral network. The method solves the problem of poor cross-modal retrieval performance.

Description

Image menu retrieval method based on intra-modality and inter-modality mixed fusion
Technical Field
The invention relates to the field of cross-modal retrieval, in particular to an image menu retrieval method based on intra-modal and inter-modal mixed fusion.
Background
With the rapid growth of multimedia data such as text, images and videos on the Internet, retrieval across different modalities has become a new trend in information retrieval; cross-media retrieval refers to taking data of any media type as the input query and retrieving semantically related data from all media types; cross-media image-text retrieval is a retrieval task between images and text in the broad sense; in the invention, the application scenario of cross-media image-text retrieval is the mutual retrieval of food images and menu texts: for any food image, the menu description text most relevant to its content is retrieved, and for any menu description text, the food image most relevant to its description is retrieved; typically, an image and a corresponding recipe are provided in the data set, wherein each recipe contains food materials and cooking steps; the food materials are varied, some of which (e.g., beef, egg) are directly visible in the dish while others (e.g., salt, honey) are not; most food materials are sensitive to high temperatures, cutting and other cooking operations, and their original appearance is easily altered; the cooking steps contain complex logical information about the cooking process; meanwhile, food objects in the image typically exhibit stacked, staggered poses; therefore, for the cross-modal retrieval of images and recipes, the difficulty is how to obtain more effective modality features that emphasize the salient information of the image or the recipe, so as to comprehensively measure the similarity between images and recipes.
Most existing methods encode the food materials and the cooking steps in a menu with two independent recurrent neural networks and extract global image features with a convolutional neural network; a retrieval loss function is then used to pull matched pairs closer and push dissimilar pairs apart; although these methods have achieved some success in cross-modal image menu retrieval, they still have the following disadvantages:
1) existing methods pay little attention to the potential interaction between the food materials and the cooking steps in a recipe; key information in the food materials and the cooking steps often occurs jointly, and their separately extracted features do not represent the relationship between them well;
2) existing methods usually rely on global image features and ignore fine-grained food image regions, so food materials occupying only a few pixels are easily missed during feature extraction, and the size, shape and primary-secondary relations of the food materials are difficult to capture with global image features, resulting in unsatisfactory retrieval performance;
3) most methods project images and recipes into a shared subspace to learn their feature distributions and measure the distance between them, which lacks interaction and fusion between the data of the two modalities, making subspace learning inefficient.
Disclosure of Invention
Based on the above problems, the invention provides an image menu retrieval method based on intra-modality and inter-modality mixed fusion, which solves the problem of poor cross-modal retrieval performance.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
the image menu retrieval method based on mixed fusion in and among the modes comprises the following steps:
step 1, preparing image data and menu data, wherein the image data comprises food images, and the menu data comprises food materials and cooking steps;
step 2, constructing an integral network;
the step 2 specifically comprises the following steps:
step 21, extracting an image characteristic sequence, a food material characteristic sequence and a cooking step characteristic sequence;
step 22, performing feature fusion on the food material feature sequence and the cooking step feature sequence by using a cross attention mechanism in a menu mode to obtain food material fusion features and cooking step fusion features;
step 23, performing modal fusion on the food material fusion characteristics and the image characteristic sequence under a gating mechanism by using a multi-head cross attention mechanism to obtain secondary food material fusion characteristics and image fusion characteristics;
step 24, obtaining local similarity of the image fusion characteristics and the secondary fusion characteristics of the food materials through a multilayer perceptron;
step 25, calculating the cosine similarity between the image feature sequence and the cooking step fusion feature to obtain the global similarity;
step 26, linearly combining the local similarity and the global similarity to jointly measure the similarity between the image data and the menu data;
step 3, training the whole network in the step 2, and setting a loss function;
and 4, performing cross-modal retrieval on the food and the menu by using the trained integral network.
Further, the step 21 specifically includes the following steps:
step 211, performing feature extraction on the food image with the convolutional neural network ResNet50 to extract the image feature sequence;
step 212, performing feature representation on the food materials with the word2vec model, and extracting the food material feature sequence with a single-layer bidirectional gated recurrent unit (GRU);
and step 213, performing feature representation on the cooking steps with the sentence2vector model, and extracting the cooking step feature sequence.
Further, the step 22 specifically includes the following steps:
step 221, calculating an affinity matrix of the food material and the cooking step;
step 222, calculating the food material characteristic sequence and the affinity matrix to obtain weighted food material characteristics weighted by the characteristic sequence of the cooking step;
step 223, calculating the cooking step characteristic sequence and the affinity matrix to obtain weighted cooking step characteristics weighted by the food material characteristic sequence;
step 224, performing matrix splicing on the food material characteristic sequence and the characteristics of the weighted cooking step to obtain food material fusion characteristics;
and step 225, performing matrix splicing on the cooking step characteristic sequence and the weighted food material characteristics to obtain the cooking step fusion characteristics.
Further, the step 23 specifically includes the following steps:
step 231, calculating an image information vector weighted by the food material fusion characteristics for the image characteristic sequence, dividing the food material fusion characteristics and the image characteristic sequence into h vectors in the same-dimension subspace by using a multi-head cross attention mechanism, and performing matrix splicing on the image information sub-vectors respectively obtained by the h vectors to obtain a final weighted image information vector;
step 232, calculating the food material information vectors weighted by the image feature sequences for the food material fusion features, dividing the image feature sequences and the food material fusion features into h vectors in the same-dimension subspace by the multi-head cross attention mechanism, and performing matrix splicing on the food material information sub-vectors respectively obtained by the h vectors to obtain final weighted food material information vectors;
step 233, performing dot product calculation on the image characteristic sequence and the weighted food material information vector to obtain correlation, and further expressing the correlation as a gating matrix of fusion degree;
step 234, performing point multiplication on the image characteristic sequence and the weighted food material information vector element-by-element summation and the gating matrix, and performing residual connection with the image characteristic sequence to finally obtain an image fusion characteristic;
step 235, performing point multiplication calculation on the food material fusion characteristics and the weighted image information vector to obtain correlation, and further expressing the correlation as a gating matrix of the fusion degree;
and 236, performing point multiplication on the element-by-element summation of the food material fusion characteristics and the weighted image characteristics and the gating matrix, and performing residual connection on the food material fusion characteristics to finally obtain the secondary food material fusion characteristics.
Further, the step 24 specifically includes the following steps:
241, performing matrix splicing on the image fusion characteristics obtained in the step 23 and the secondary food material fusion characteristics to obtain a 2048-dimensional splicing vector;
and 242, inputting the splicing vector into a two-layer multilayer perceptron and obtaining a value between 0 and 1 after the activation function sigmoid, wherein the value is expressed as the local similarity.
Further, in the step 25, the image feature sequence and the cooking step fusion feature are respectively subjected to average pooling operation to obtain features with the same dimensionality, and the cosine similarity calculated by the image feature sequence and the cooking step fusion feature is expressed as global similarity.
Further, in the step 26, the local similarity and the global similarity are linearly combined, wherein the ratio of the local similarity to the global similarity is between 0 and 1, and the sum of the ratios is 1, which is expressed as the matching degree of the image data and the recipe data.
Further, in the step 3, a contrastive ranking loss is adopted as the loss function to train the whole network in the step 2.
Compared with the prior art, the invention has the following beneficial effects: through intra-modality fusion within the menu, the interaction between the food materials and the cooking steps is absorbed, further enriching the information expressed by the two otherwise independent embedded features; at the same time, inter-modality fusion between the images and the menu explores the potential relation between fine-grained image regions and food materials; the final image menu similarity is therefore formed jointly from the local and the global aspects, and a better cross-modal retrieval effect is obtained.
Drawings
FIG. 1 is a flow chart of the present embodiment;
FIG. 2 is a table showing the results of the experiment in this example.
Detailed Description
The invention will be further described with reference to the accompanying drawings. Embodiments of the present invention include, but are not limited to, the following examples.
Example 1
The image menu retrieval method based on intra-modality and inter-modality mixed fusion as shown in fig. 1 includes the following steps:
step 1, preparing image data and menu data;
step 2, constructing an integral network;
step 3, training the whole network in the step 2, and setting a loss function;
and 4, performing cross-modal retrieval on the food and the menu by using the trained integral network.
In the method, the image data prepared in the step 1 comprises food images, and the menu data comprises food materials and cooking steps;
in this embodiment, step 2 specifically includes the following steps:
step 21, extracting an image characteristic sequence, a food material characteristic sequence and a cooking step characteristic sequence;
step 22, performing feature fusion on the food material feature sequence and the cooking step feature sequence by using a cross attention mechanism in a menu mode to obtain food material fusion features and cooking step fusion features;
step 23, performing modal fusion on the food material fusion characteristics and the image characteristic sequence under a gating mechanism by using a multi-head cross attention mechanism to obtain secondary food material fusion characteristics and image fusion characteristics;
step 24, obtaining local similarity of the image fusion characteristics and the secondary fusion characteristics of the food materials through a multilayer perceptron;
step 25, calculating the cosine similarity between the image feature sequence and the cooking step fusion feature to obtain the global similarity;
step 26, linearly combining the local similarity and the global similarity to jointly measure the similarity between the image data and the menu data;
in the above method, step 21 specifically includes the following steps:
and step 211, performing feature extraction on the food image by using a convolutional neural network ResNet50, and taking the output of the last layer of residual error block of the convolutional neural network, wherein the output comprises 7 × 7-49 columns 2048-dimensional convolutional output, and the output result is expressed as an image feature sequence
Figure BDA0003018736350000041
Wherein s denotes the number of image areas, EuA sequence of features of the image is represented,
Figure BDA0003018736350000042
a food image feature representing an image area of the ith row;
step 212, performing characteristic representation on the food material by using the word2vec model, extracting a food material characteristic sequence by using a single-layer bidirectional gating circulation unit GRU, and representing the food material characteristic sequence as
Figure BDA0003018736350000043
Wherein m represents the length of the food material sequence, RgRepresenting a characteristic sequence of food material ri gA food material characteristic representing an ith food material sequence;
step 213, using the sensor 2vector model to perform characteristic representation on the cooking steps, and extracting the characteristic sequence of the cooking steps to represent the characteristic sequence as
Figure BDA0003018736350000051
Wherein n represents the length of the sequence of cooking steps, RsCharacteristic sequence of cooking steps, ri sA cooking step characteristic representing an ith sequence of cooking steps.
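For illustration, a minimal sketch of the feature extraction in step 21 is given below (in Python with PyTorch; the embedding dimensions, the GRU hidden size and the random placeholders standing in for the word2vec and sentence2vector embeddings are assumptions, not the exact implementation of the invention):

```python
import torch
import torch.nn as nn
from torchvision import models

# Image branch: ResNet50 without its pooling/classification head, so the last
# residual block yields a 7x7 grid of 2048-dimensional region features.
resnet = models.resnet50(weights=None)              # pretrained weights would be loaded in practice
backbone = nn.Sequential(*list(resnet.children())[:-2])
img = torch.randn(1, 3, 224, 224)                   # one food image
E_u = backbone(img).flatten(2).transpose(1, 2)      # (1, 49, 2048): s = 49 image regions

# Food-material branch: word2vec vectors fed to a single-layer bidirectional GRU.
m, d_w = 8, 300                                     # m food materials, 300-d word2vec (assumed)
word_vecs = torch.randn(1, m, d_w)                  # placeholder for the word2vec embeddings
gru = nn.GRU(d_w, 512, num_layers=1, bidirectional=True, batch_first=True)
R_g, _ = gru(word_vecs)                             # (1, m, 1024) food-material feature sequence

# Cooking-step branch: sentence2vector embeddings used as the step feature sequence.
n, d_s = 6, 1024                                    # n cooking steps (dimension assumed)
R_s = torch.randn(1, n, d_s)                        # placeholder for the sentence2vector features
```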
In the above method, step 22 specifically includes the following steps:
step 221, calculating the affinity matrix A of the food materials and the cooking steps, the formula of the affinity matrix A being:
A = (R^g W_g)(R^s W_s)^T,
wherein W_g and W_s are weight parameters to be learned, and T denotes the matrix transpose operation;
step 222, computing the food material feature sequence with the affinity matrix to obtain the weighted food material features R̂^g weighted by the cooking step feature sequence, specifically:
A^g = softmax(A^T / √d),
R̂^g = A^g R^g,
wherein A^g denotes the weight matrix obtained after the affinity matrix A is normalized along the food material dimension, the softmax() function maps the weight values between 0 and 1, A^T denotes the transpose of the affinity matrix A, and √d denotes the square root of the feature dimension of the food material or cooking step features;
step 223, computing the cooking step feature sequence with the affinity matrix to obtain the weighted cooking step features R̂^s weighted by the food material feature sequence, specifically:
A^s = softmax(A / √d),
R̂^s = A^s R^s,
wherein A^s denotes the weight matrix obtained after the affinity matrix A is normalized along the cooking step dimension;
step 224, performing matrix splicing of the food material feature sequence R^g and the weighted cooking step features R̂^s to obtain the food material fusion features E^g, specifically:
E^g = [R^g ‖ R̂^s],
wherein [· ‖ ·] denotes the matrix splicing operation;
step 225, performing matrix splicing of the cooking step feature sequence R^s and the weighted food material features R̂^g to obtain the cooking step fusion features E^s, specifically:
E^s = [R^s ‖ R̂^g].
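A minimal sketch of the intra-menu cross attention of step 22, written in Python with PyTorch under the shapes reconstructed above (the projection matrices are passed in as plain tensors purely for illustration, and the toy sizes are assumptions):

```python
import torch
import torch.nn.functional as F

def intra_recipe_fusion(R_g, R_s, W_g, W_s):
    """Step 22 sketch: cross attention between food materials and cooking steps.

    R_g: (m, d) food-material features, R_s: (n, d) cooking-step features,
    W_g, W_s: (d, d) learnable projections (shapes are assumptions).
    """
    d = R_g.size(-1)
    A = (R_g @ W_g) @ (R_s @ W_s).t()            # (m, n) affinity matrix
    A_s = F.softmax(A / d ** 0.5, dim=-1)        # normalised along the cooking-step dimension
    A_g = F.softmax(A.t() / d ** 0.5, dim=-1)    # normalised along the food-material dimension
    R_s_hat = A_s @ R_s                          # (m, d) weighted cooking-step features
    R_g_hat = A_g @ R_g                          # (n, d) weighted food-material features
    E_g = torch.cat([R_g, R_s_hat], dim=-1)      # (m, 2d) food-material fusion features
    E_s = torch.cat([R_s, R_g_hat], dim=-1)      # (n, 2d) cooking-step fusion features
    return E_g, E_s

# toy usage with assumed sizes
E_g, E_s = intra_recipe_fusion(torch.randn(8, 512), torch.randn(6, 512),
                               torch.randn(512, 512), torch.randn(512, 512))
```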
In the above method, step 23 specifically includes the following steps:
The input of the multi-head cross attention mechanism is expressed as three groups of vectors Q (query), K (key) and V (value); the attention value is calculated from Q and K, and the attention process is implemented as:
Attn(Q, K, V) = softmax(QK^T / √d) V,
wherein Attn() denotes the attention function, the softmax() function maps the values between 0 and 1, K^T denotes the transpose matrix of K, and √d denotes the square root of the dimension of Q or K;
the multi-head cross attention divides the features into h heads, each head performs the above attention process, and the outputs of the h heads are spliced to obtain the output of the multi-head cross attention mechanism; specifically, using the image feature sequence E^u obtained in step 211 as Q and the food material fusion features E^g obtained in step 224 as K and V, the weighted food material features Ê^g weighted by the image feature sequence are obtained as follows:
head_i^g = Attn(E^u W_i^u, E^g W_i^k, E^g W_i^g), i = 1, …, h,
Multihead^g = [head_1^g ‖ … ‖ head_h^g],
Ê^g = Multihead^g W^g,
wherein W_i^g, W_i^k, W_i^u and W^g are parameter matrices to be learned, W^g maps the output dimension of the multi-head cross attention mechanism back to the input dimension, and head_h^g denotes the output of the h-th attention head;
similarly, using the food material fusion features E^g as Q and the image feature sequence E^u obtained in step 211 as K and V, with a corresponding set of parameter matrices to be learned, the weighted image feature sequence Ê^u weighted by the food material fusion features is obtained as follows:
head_i^u = Attn(E^g W_i^g, E^u W_i^k, E^u W_i^u), i = 1, …, h,
Multihead^u = [head_1^u ‖ … ‖ head_h^u],
Ê^u = Multihead^u W^u,
wherein W^u maps the output dimension of the multi-head cross attention mechanism back to the input dimension and head_h^u denotes the output of the h-th attention head;
in addition, for the i-th row feature e_i^u of the image feature sequence E^u and the i-th row feature ê_i^g of the weighted food material features Ê^g weighted by the image feature sequence, their dot product is calculated and passed through the activation function sigmoid to obtain the fusion degree, where 0 indicates no fusion and 1 indicates complete fusion, as follows:
g_i^u = sigmoid(e_i^u · ê_i^g),
wherein g_i^u denotes the fusion degree of the i-th row feature;
for the image feature sequence E^u, G^u = [g_1^u, …, g_s^u] indicates the fusion degree of all image regions; similarly, for the food material fusion features E^g, G^g = [g_1^g, …, g_m^g] indicates the fusion degree of all food materials;
in addition, the fusion operation between the image feature sequence E^u and the weighted food material features Ê^g weighted by the image feature sequence adopts element-by-element summation, and a residual connection is added in order to avoid losing original image region information that is not well captured by the multi-head cross attention, giving the final image fusion features Ẽ^u as follows:
Ẽ^u = E^u + G^u ⊙ (E^u ⊕ Ê^g),
wherein ⊙ denotes the element-by-element product and ⊕ denotes the element-by-element summation; similarly, the final secondary food material fusion features Ẽ^g are obtained as follows:
Ẽ^g = E^g + G^g ⊙ (E^g ⊕ Ê^u).
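A minimal sketch of the gated inter-modality fusion of step 23 follows; it uses PyTorch's nn.MultiheadAttention as a stand-in for the multi-head cross attention described above and assumes that the image features and the food material fusion features have already been projected to a common dimension (the head count and dimension below are assumptions):

```python
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    """Step 23 sketch: multi-head cross attention between image regions and food
    material fusion features, followed by a sigmoid gate and a residual connection."""

    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.attn_img2food = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_food2img = nn.MultiheadAttention(dim, heads, batch_first=True)

    @staticmethod
    def gated_merge(X, X_hat):
        gate = torch.sigmoid((X * X_hat).sum(-1, keepdim=True))  # fusion degree in [0, 1]
        return X + gate * (X + X_hat)                            # gated element-wise sum + residual

    def forward(self, E_u, E_g):
        # E_u: (b, s, dim) image regions; E_g: (b, m, dim) food-material fusion features
        E_g_hat, _ = self.attn_img2food(E_u, E_g, E_g)  # food info attended for each image region
        E_u_hat, _ = self.attn_food2img(E_g, E_u, E_u)  # image info attended for each food material
        E_u_fused = self.gated_merge(E_u, E_g_hat)      # image fusion features
        E_g_fused = self.gated_merge(E_g, E_u_hat)      # secondary food-material fusion features
        return E_u_fused, E_g_fused
```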
in the above method, step 24 specifically includes the following steps:
fusing features to images
Figure BDA0003018736350000076
Secondary blending feature with food material
Figure BDA0003018736350000077
Performing average pooling operation to reduce dimension, inputting a two-layer multilayer sensor, and outputting after activating a function sigmoid to obtain a local similarity Slocal as follows:
Figure BDA0003018736350000078
in the above method, step 25 specifically includes the following steps:
for image characteristic sequence EuMerging features E with cooking stepssCalculating the cosine similarity to obtain a global similarity Sglobal as follows:
Sglobal(I,R)=cosine(pool(Eu),pool(Es))。
in the above method, step 26 specifically includes the following steps:
the local similarity Slocal and the global similarity Sglobal are linearly combined to obtain the final similarity S between the image data and the menu data, and the similarity S is as follows:
S(I,R)=ω1Slocal(I,R)+ω2Sglobal(I,R),
wherein, ω is1,ω2Is a weight parameter, and ω12=1。
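Steps 24 to 26 can be sketched together as a small similarity head; the MLP width, the pooling choice and the value of ω1 below are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityHead(nn.Module):
    """Steps 24-26 sketch: local similarity from a two-layer MLP over the pooled
    fused features, global similarity from cosine similarity, linear combination."""

    def __init__(self, dim=1024, omega1=0.5):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.omega1 = omega1                                      # omega2 = 1 - omega1

    def forward(self, E_u_fused, E_g_fused, E_u, E_s):
        # average pooling over the sequence dimension
        local_in = torch.cat([E_u_fused.mean(1), E_g_fused.mean(1)], dim=-1)
        s_local = torch.sigmoid(self.mlp(local_in)).squeeze(-1)           # Slocal in (0, 1)
        s_global = F.cosine_similarity(E_u.mean(1), E_s.mean(1), dim=-1)  # Sglobal
        return self.omega1 * s_local + (1 - self.omega1) * s_global
```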
In the above method, step 3 specifically includes the following steps:
it is assumed that the similarity S(I, R) assigns a high value to a positive sample pair (Ip, Rp) of image data and menu data and a low value to a negative sample pair (Ip, Rn), i.e. S(Ip, Rp) > S(Ip, Rn) for n ≠ p; the menu that best matches a query image can then be found by ranking the similarity scores between the query image and all menus, and vice versa;
in addition, during training, the contrastive ranking loss is used as the loss function; for a sampled positive pair (Ip, Rp) of image data and menu data, the hardest negative pairs (In, Rp) and (Ip, Rn) within the mini-batch are found and separated from the positive pair by a predefined margin Δ, as follows:
Loss(I, R) = [Δ + S(Ip, Rn) − S(Ip, Rp)]+ + [Δ + S(In, Rp) − S(Ip, Rp)]+,
wherein [x]+ = max(0, x).
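A sketch of this contrastive ranking loss over a mini-batch, assuming the pairwise similarities have been arranged into a square matrix whose diagonal holds the positive pairs (the margin value is an assumption):

```python
import torch

def contrastive_ranking_loss(S, margin=0.3):
    """S: (B, B) similarity matrix for a batch, S[i, i] being the positive
    image-recipe pairs; the hardest in-batch negative is used per anchor."""
    B = S.size(0)
    pos = S.diag()                                                    # S(Ip, Rp)
    mask = torch.eye(B, dtype=torch.bool, device=S.device)
    neg_r = S.masked_fill(mask, float('-inf')).max(dim=1).values      # hardest S(Ip, Rn)
    neg_i = S.masked_fill(mask, float('-inf')).max(dim=0).values      # hardest S(In, Rp)
    loss = torch.clamp(margin + neg_r - pos, min=0) + torch.clamp(margin + neg_i - pos, min=0)
    return loss.mean()
```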
After the integral model is constructed and trained, step 4 can be performed to perform cross-modal retrieval on food and recipes by using the trained integral network, specifically as follows:
step 41, extracting the feature vectors of the data of the given modality;
step 42, inputting the extracted feature vectors into the trained integral network;
step 43, using the trained integral network to compute the similarity (the linear combination of the local similarity and the global similarity) between the given modality data and each of the candidate data items of the other modality;
and step 44, sorting the similarity results; the original modality data corresponding to the candidate with the maximum similarity is the retrieval result.
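The retrieval procedure of step 4 reduces to scoring and ranking, as in the following sketch (the similarity_fn name is hypothetical and is assumed to return the combined local + global similarity produced by the trained network as a scalar tensor):

```python
import torch

def retrieve(query_feats, candidate_feats, similarity_fn, topk=10):
    """Score one query (e.g. a food image) against all candidates of the other
    modality and return the indices and scores of the top-k most similar candidates."""
    scores = torch.stack([similarity_fn(query_feats, c) for c in candidate_feats])
    ranked = torch.argsort(scores, descending=True)
    return ranked[:topk], scores[ranked[:topk]]
```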
The overall model of the application is evaluated with the most common retrieval evaluation metrics top-k and MedR: top-k refers to the proportion of queries for which the target positive image or menu sample is ranked among the first k results returned by the model (the higher the better), with k taken as 1, 5 and 10 in this embodiment; MedR denotes the median rank of the target positive samples over all queries (the lower the better).
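A sketch of how the top-k and MedR metrics can be computed from a query-candidate similarity matrix (the convention that candidate j == i is the target positive for query i is an assumption of this sketch):

```python
import numpy as np

def topk_and_medr(sim_matrix, ks=(1, 5, 10)):
    """sim_matrix[i, j]: similarity of query i to candidate j; j == i is the target."""
    order = np.argsort(-sim_matrix, axis=1)                    # candidates by descending similarity
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(len(order))])
    recalls = {f"top-{k}": float((ranks <= k).mean()) for k in ks}
    medr = float(np.median(ranks))                             # lower is better
    return recalls, medr
```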
The invention is tested on the large-scale image recipe retrieval dataset Recipe1M; the Recipe1M dataset collects over 1 million cooking recipes and about 800,000 food images from 24 popular cooking websites, with 238,999 image samples and their corresponding recipe texts used as the training set, 51,119 image samples and their corresponding recipe texts as the validation set, and 51,303 image samples and their corresponding recipe texts as the test set.
In the testing stage, 1,000 pairs (1k) and 10,000 pairs (10k) are sampled respectively, and the sampling is repeated 10 times to report the average result, as shown in fig. 2; it can be seen that the highest retrieval accuracy is obtained in the image-recipe retrieval scenario, and on the Recipe1M dataset the top-k metric is significantly improved over the prior art on both the 1k and 10k test sets, improving the effectiveness of image-recipe cross-modal retrieval.
The above is an embodiment of the present invention. The specific parameters in the above embodiments and examples are only for the purpose of clearly illustrating the invention verification process of the inventor and are not intended to limit the scope of the invention, which is defined by the claims, and all equivalent structural changes made by using the contents of the specification and the drawings of the present invention should be covered by the scope of the present invention.

Claims (8)

1. The image menu retrieval method based on intra-modality and inter-modality mixed fusion is characterized by comprising the following steps:
step 1, preparing image data and menu data, wherein the image data comprises food images, and the menu data comprises food materials and cooking steps;
step 2, constructing an integral network;
the step 2 specifically comprises the following steps:
step 21, extracting an image characteristic sequence, a food material characteristic sequence and a cooking step characteristic sequence;
step 22, performing feature fusion on the food material feature sequence and the cooking step feature sequence by using a cross attention mechanism in a menu mode to obtain food material fusion features and cooking step fusion features;
step 23, performing modal fusion on the food material fusion characteristics and the image characteristic sequence under a gating mechanism by using a multi-head cross attention mechanism to obtain secondary food material fusion characteristics and image fusion characteristics;
step 24, obtaining local similarity of the image fusion characteristics and the secondary fusion characteristics of the food materials through a multilayer perceptron;
step 25, calculating the cosine similarity between the image feature sequence and the cooking step fusion feature to obtain the global similarity;
step 26, linearly combining the local similarity and the global similarity to jointly measure the matching degree of the image data and the menu data;
step 3, training the whole network in the step 2, and setting a loss function;
and 4, performing cross-modal retrieval on the food and the menu by using the trained integral network.
2. The method for image menu search based on intra-modality and inter-modality hybrid fusion according to claim 1, wherein: the step 21 specifically includes the following steps:
step 211, performing feature extraction on the food image with the convolutional neural network ResNet50 to extract the image feature sequence;
step 212, performing feature representation on the food materials with the word2vec model, and extracting the food material feature sequence with a single-layer bidirectional gated recurrent unit (GRU);
and step 213, performing feature representation on the cooking steps with the sentence2vector model, and extracting the cooking step feature sequence.
3. The method for image menu search based on intra-modality and inter-modality hybrid fusion according to claim 1, wherein: the step 22 specifically includes the following steps:
step 221, calculating an affinity matrix of the food material and the cooking step;
step 222, calculating the food material characteristic sequence and the affinity matrix to obtain weighted food material characteristics weighted by the characteristic sequence of the cooking step;
step 223, calculating the cooking step characteristic sequence and the affinity matrix to obtain weighted cooking step characteristics weighted by the food material characteristic sequence;
step 224, performing matrix splicing on the food material characteristic sequence and the characteristics of the weighted cooking step to obtain food material fusion characteristics;
and step 225, performing matrix splicing on the cooking step characteristic sequence and the weighted food material characteristics to obtain the cooking step fusion characteristics.
4. The method for image menu search based on intra-modality and inter-modality hybrid fusion according to claim 1, wherein: the step 23 specifically includes the following steps:
step 231, calculating an image information vector weighted by the food material fusion characteristics for the image characteristic sequence, dividing the food material fusion characteristics and the image characteristic sequence into h vectors in the same-dimension subspace by using a multi-head cross attention mechanism, and performing matrix splicing on the image information sub-vectors respectively obtained by the h vectors to obtain a final weighted image information vector;
step 232, calculating the food material information vectors weighted by the image feature sequences for the food material fusion features, dividing the image feature sequences and the food material fusion features into h vectors in the same-dimension subspace by the multi-head cross attention mechanism, and performing matrix splicing on the food material information sub-vectors respectively obtained by the h vectors to obtain final weighted food material information vectors;
step 233, performing dot product calculation on the image characteristic sequence and the weighted food material information vector to obtain correlation, and further expressing the correlation as a gating matrix of fusion degree;
step 234, performing point multiplication on the image characteristic sequence and the weighted food material information vector element-by-element summation and the gating matrix, and performing residual connection with the image characteristic sequence to finally obtain an image fusion characteristic;
step 235, performing point multiplication calculation on the food material fusion characteristics and the weighted image information vector to obtain correlation, and further expressing the correlation as a gating matrix of the fusion degree;
and 236, performing point multiplication on the element-by-element summation of the food material fusion characteristics and the weighted image characteristics and the gating matrix, and performing residual connection on the food material fusion characteristics to finally obtain the secondary food material fusion characteristics.
5. The method for image menu search based on intra-modality and inter-modality hybrid fusion according to claim 1, wherein: the step 24 specifically includes the following steps:
241, performing matrix splicing on the image fusion characteristics obtained in the step 23 and the secondary food material fusion characteristics to obtain a 2048-dimensional splicing vector;
and 242, inputting the splicing vector into a two-layer multilayer perceptron and obtaining a value between 0 and 1 after the activation function sigmoid, wherein the value is expressed as the local similarity.
6. The method for image menu search based on intra-modality and inter-modality hybrid fusion according to claim 1, wherein: in the step 25, average pooling operation is respectively performed on the image feature sequence and the cooking step fusion feature to obtain features with the same dimensionality, and the cosine similarity calculated by the image feature sequence and the cooking step fusion feature is expressed as global similarity.
7. The method for image menu search based on intra-modality and inter-modality hybrid fusion according to claim 1, wherein: in the step 26, the local similarity and the global similarity are linearly combined, wherein the ratio of the local similarity to the global similarity is between 0 and 1, and the sum of the ratios is 1, which is expressed as the matching degree of the image data and the menu data.
8. The method for image menu search based on intra-modality and inter-modality hybrid fusion according to claim 1, wherein: in the step 3, the contrastive ranking loss is adopted as the loss function to train the whole network in the step 2.
CN202110397679.2A 2021-04-13 2021-04-13 Image menu retrieval method based on intra-modality and inter-modality mixed fusion Active CN112925935B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110397679.2A CN112925935B (en) 2021-04-13 2021-04-13 Image menu retrieval method based on intra-modality and inter-modality mixed fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110397679.2A CN112925935B (en) 2021-04-13 2021-04-13 Image menu retrieval method based on intra-modality and inter-modality mixed fusion

Publications (2)

Publication Number Publication Date
CN112925935A true CN112925935A (en) 2021-06-08
CN112925935B CN112925935B (en) 2022-05-06

Family

ID=76174378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110397679.2A Active CN112925935B (en) 2021-04-13 2021-04-13 Image menu retrieval method based on intra-modality and inter-modality mixed fusion

Country Status (1)

Country Link
CN (1) CN112925935B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109189968A (en) * 2018-08-31 2019-01-11 深圳大学 A kind of cross-module state search method and system
CN110147457A (en) * 2019-02-28 2019-08-20 腾讯科技(深圳)有限公司 Picture and text matching process, device, storage medium and equipment
CN110059157A (en) * 2019-03-18 2019-07-26 华南师范大学 A kind of picture and text cross-module state search method, system, device and storage medium
CN111026894A (en) * 2019-12-12 2020-04-17 清华大学 Cross-modal image text retrieval method based on credibility self-adaptive matching network
CN111598214A (en) * 2020-04-02 2020-08-28 浙江工业大学 Cross-modal retrieval method based on graph convolution neural network
CN112241468A (en) * 2020-07-23 2021-01-19 哈尔滨工业大学(深圳) Cross-modal video retrieval method and system based on multi-head self-attention mechanism and storage medium
CN111985369A (en) * 2020-08-07 2020-11-24 西北工业大学 Course field multi-modal document classification method based on cross-modal attention convolution neural network
CN112164011A (en) * 2020-10-12 2021-01-01 桂林电子科技大学 Motion image deblurring method based on self-adaptive residual error and recursive cross attention

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
CHEN H et al.: "Imram: Iterative matching with recurrent attention memory for cross-modal image-text retrieval", 《PROCEEDINGS OF THE IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》 *
FU H et al.: "Mcen: Bridging cross-modal gap between cooking recipes and dish images with latent variable model", 《PROCEEDINGS OF THE IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》 *
LI J et al.: "Hybrid Fusion with Intra- and Cross-Modality Attention for Image-Recipe Retrieval", 《PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL》 *
CHU JINGJING: "Research on cross-modal retrieval methods for the recipe domain" (面向菜谱领域的跨模态检索方法研究), 《HUNAN UNIVERSITY》 *
LIN YANG et al.: "Cross-modal recipe retrieval method incorporating a self-attention mechanism" (融合自注意力机制的跨模态食谱检索方法), 《JOURNAL OF FRONTIERS OF COMPUTER SCIENCE AND TECHNOLOGY》 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024098763A1 (en) * 2022-11-08 2024-05-16 苏州元脑智能科技有限公司 Text operation diagram mutual-retrieval method and apparatus, text operation diagram mutual-retrieval model training method and apparatus, and device and medium

Also Published As

Publication number Publication date
CN112925935B (en) 2022-05-06

Similar Documents

Publication Publication Date Title
CN110516160B (en) Knowledge graph-based user modeling method and sequence recommendation method
CN107766873A (en) The sample classification method of multi-tag zero based on sequence study
CN112417306B (en) Method for optimizing performance of recommendation algorithm based on knowledge graph
CN113657450B (en) Attention mechanism-based land battlefield image-text cross-modal retrieval method and system
CN113297370B (en) End-to-end multi-modal question-answering method and system based on multi-interaction attention
CN107545276A (en) The various visual angles learning method of joint low-rank representation and sparse regression
CN107590505B (en) Learning method combining low-rank representation and sparse regression
CN112784782B (en) Three-dimensional object identification method based on multi-view double-attention network
CN114782694A (en) Unsupervised anomaly detection method, system, device and storage medium
CN114693397A (en) Multi-view multi-modal commodity recommendation method based on attention neural network
CN114970517A (en) Visual question and answer oriented method based on multi-modal interaction context perception
CN114241191A (en) Cross-modal self-attention-based non-candidate-box expression understanding method
CN112925935B (en) Image menu retrieval method based on intra-modality and inter-modality mixed fusion
CN110569761B (en) Method for retrieving remote sensing image by hand-drawn sketch based on counterstudy
CN112182275A (en) Trademark approximate retrieval system and method based on multi-dimensional feature fusion
CN115222954A (en) Weak perception target detection method and related equipment
Liu et al. Audiovisual cross-modal material surface retrieval
CN116189800B (en) Pattern recognition method, device, equipment and storage medium based on gas detection
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
CN116955650A (en) Information retrieval optimization method and system based on small sample knowledge graph completion
CN115758159B (en) Zero sample text position detection method based on mixed contrast learning and generation type data enhancement
Zhang et al. Multiscale visual-attribute co-attention for zero-shot image recognition
CN114882409A (en) Intelligent violent behavior detection method and device based on multi-mode feature fusion
Sassi et al. Neural approach for context scene image classification based on geometric, texture and color information
CN115098646A (en) Multilevel relation analysis and mining method for image-text data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant