CN112925935A - Image menu retrieval method based on intra-modality and inter-modality mixed fusion - Google Patents
- Publication number: CN112925935A
- Application number: CN202110397679.2A
- Authority: CN (China)
- Prior art keywords: image, food material, fusion, modality, weighted
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/532 — Query formulation, e.g. graphical querying (information retrieval of still image data)
- G06F16/583 — Retrieval characterised by using metadata automatically derived from the content
- G06F18/2415 — Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06N3/045 — Neural network architectures; combinations of networks
- G06N3/048 — Activation functions
- G06N3/08 — Learning methods
- G06V10/44 — Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections
Abstract
The invention relates to the field of cross-modal retrieval, and in particular to an image recipe retrieval method based on intra-modality and inter-modality hybrid fusion, comprising the following steps: step 1, preparing image data and recipe data; step 2, constructing an overall network; step 3, training the overall network of step 2 and setting a loss function; step 4, performing cross-modal retrieval of food images and recipes using the trained overall network. The method solves the problem of poor cross-modal retrieval performance.
Description
Technical Field
The invention relates to the field of cross-modal retrieval, and in particular to an image recipe retrieval method based on intra-modality and inter-modality hybrid fusion.
Background
With the rapid growth of multimedia data such as text, images and video on the internet, retrieval across different modalities has become a new trend in information retrieval. Cross-media retrieval takes data of any media type as the input query and retrieves semantically related data of all media types; cross-media image-text retrieval is a retrieval task broadly directed between images and text. In the present invention, the application scenario of cross-media image-text retrieval is the mutual retrieval of food images and recipe texts: for any food image, retrieve the recipe text most relevant to its content, or, for any recipe text, retrieve the food image most relevant to its description. Typically, a dataset provides an image and a corresponding recipe, where each recipe contains food materials and cooking steps. Some food materials (e.g., beef, egg) are directly visible in the dish, while others (e.g., salt, honey) are not; most food materials are sensitive to high temperature, cutting and other cooking operations, so their original appearance is easily altered. The cooking steps contain complex cooking logic, and the food objects in an image typically appear stacked and interleaved. The difficulty of cross-modal image-recipe retrieval is therefore how to obtain more effective modality features that emphasize the salient information of the image or the recipe, so as to comprehensively measure the similarity between them.
Most existing methods encode the food materials and the cooking steps of a recipe with two independent recurrent neural networks and extract global image features with a convolutional neural network; a retrieval loss function then pulls matched pairs closer together and pushes dissimilar pairs apart. Although these methods have achieved some success in cross-modal image-recipe retrieval, they still have the following disadvantages:
1) existing methods pay little attention to the potential interaction between food materials and cooking steps in a recipe; key information in the food materials and the cooking steps often occurs simultaneously, and their separately extracted features do not represent the relationship between them well;
2) existing methods usually rely on a global image representation and ignore fine-grained food image regions, so food materials occupying few pixels are easily missed during feature extraction, and the size, shape and primary/secondary relationships of the food materials are difficult to capture with global image features, making retrieval performance unsatisfactory;
3) most methods project images and recipes into a shared subspace to learn their feature distributions and measure the distance between the two, which lacks interaction and fusion between the two modalities and makes subspace learning inefficient.
Disclosure of Invention
Based on the above problems, the invention provides an image recipe retrieval method based on intra-modality and inter-modality hybrid fusion, which solves the problem of poor cross-modal retrieval performance.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
the image recipe retrieval method based on intra-modality and inter-modality hybrid fusion comprises the following steps:
step 1, preparing image data and recipe data;
step 2, constructing the overall network;
the step 2 specifically comprises the following steps:
step 21, extracting an image feature sequence, a food material feature sequence and a cooking step feature sequence;
step 22, performing feature fusion on the food material feature sequence and the cooking step feature sequence within the recipe modality using a cross attention mechanism, to obtain food material fusion features and cooking step fusion features;
step 23, performing modality fusion on the food material fusion features and the image feature sequence using a multi-head cross attention mechanism under a gating mechanism, to obtain secondary food material fusion features and image fusion features;
step 24, obtaining the local similarity of the image fusion features and the secondary food material fusion features through a multilayer perceptron;
step 25, calculating the cosine similarity of the image feature sequence and the cooking step fusion features to obtain the global similarity;
step 26, linearly combining the local similarity and the global similarity to jointly measure the similarity between the image data and the recipe data;
step 3, training the overall network of step 2 and setting a loss function;
step 4, performing cross-modal retrieval of food images and recipes using the trained overall network.
Further, the step 21 specifically includes the following steps:
step 211, performing feature extraction on the food image with the convolutional neural network ResNet50 to extract the image feature sequence;
step 212, representing the food materials with a word2vec model and extracting the food material feature sequence with a single-layer bidirectional gated recurrent unit (GRU);
step 213, representing the cooking steps with a sentence2vector model and extracting the cooking step feature sequence.
Further, the step 22 specifically includes the following steps:
step 221, calculating an affinity matrix of the food material and the cooking step;
step 222, calculating the food material characteristic sequence and the affinity matrix to obtain weighted food material characteristics weighted by the characteristic sequence of the cooking step;
step 223, calculating the cooking step characteristic sequence and the affinity matrix to obtain weighted cooking step characteristics weighted by the food material characteristic sequence;
step 224, performing matrix splicing on the food material characteristic sequence and the characteristics of the weighted cooking step to obtain food material fusion characteristics;
and step 225, performing matrix splicing on the cooking step characteristic sequence and the weighted food material characteristics to obtain the cooking step fusion characteristics.
Further, the step 23 specifically includes the following steps:
step 231, calculating for the image feature sequence an image information vector weighted by the food material fusion features: the multi-head cross attention mechanism divides the food material fusion features and the image feature sequence into h vectors in same-dimension subspaces, and the h resulting image information sub-vectors are matrix-spliced to obtain the final weighted image information vector;
step 232, calculating for the food material fusion features a food material information vector weighted by the image feature sequence: the multi-head cross attention mechanism divides the image feature sequence and the food material fusion features into h vectors in same-dimension subspaces, and the h resulting food material information sub-vectors are matrix-spliced to obtain the final weighted food material information vector;
step 233, computing the dot product of the image feature sequence and the weighted food material information vector to obtain their correlation, which is expressed as a gating matrix of fusion degrees;
step 234, multiplying the element-wise sum of the image feature sequence and the weighted food material information vector by the gating matrix, and adding a residual connection to the image feature sequence, to finally obtain the image fusion features;
step 235, computing the dot product of the food material fusion features and the weighted image information vector to obtain their correlation, which is expressed as a gating matrix of fusion degrees;
step 236, multiplying the element-wise sum of the food material fusion features and the weighted image information vector by the gating matrix, and adding a residual connection to the food material fusion features, to finally obtain the secondary food material fusion features.
Further, the step 24 specifically includes the following steps:
step 241, matrix-splicing the image fusion features obtained in step 23 and the secondary food material fusion features to obtain a 2048-dimensional spliced vector;
step 242, inputting the spliced vector into a two-layer multilayer perceptron; after a sigmoid activation function, this yields a value between 0 and 1, which is expressed as the local similarity.
Further, in step 25, the image feature sequence and the cooking step fusion features are each average-pooled to obtain features of the same dimensionality, and their cosine similarity is expressed as the global similarity.
Further, in step 26, the local similarity and the global similarity are linearly combined with weights between 0 and 1 that sum to 1; the result expresses the matching degree of the image data and the recipe data.
Further, in step 3, a contrastive ranking loss is adopted as the loss function to train the overall network of step 2.
Compared with the prior art, the invention has the following beneficial effects: fusion within the recipe modality absorbs the interaction between food materials and cooking steps, enriching the information expressed by the two otherwise independent embeddings; at the same time, fusion between the image and recipe modalities explores the latent relationship between fine-grained image regions and food materials. The final image-recipe similarity is thus formed jointly from local and global aspects, yielding a better cross-modal retrieval effect.
Drawings
FIG. 1 is a flow chart of the present embodiment;
FIG. 2 is a table showing the results of the experiment in this example.
Detailed Description
The invention will be further described with reference to the accompanying drawings. Embodiments of the present invention include, but are not limited to, the following examples.
Example 1
The image recipe retrieval method based on intra-modality and inter-modality hybrid fusion shown in fig. 1 includes the following steps:
step 1, preparing image data and recipe data;
step 2, constructing the overall network;
step 3, training the overall network of step 2 and setting a loss function;
step 4, performing cross-modal retrieval of food images and recipes using the trained overall network.
In this method, the image data prepared in step 1 comprises food images, and the recipe data comprises food materials and cooking steps;
in this embodiment, step 2 specifically includes the following steps:
step 21, extracting an image feature sequence, a food material feature sequence and a cooking step feature sequence;
step 22, performing feature fusion on the food material feature sequence and the cooking step feature sequence within the recipe modality using a cross attention mechanism, to obtain food material fusion features and cooking step fusion features;
step 23, performing modality fusion on the food material fusion features and the image feature sequence using a multi-head cross attention mechanism under a gating mechanism, to obtain secondary food material fusion features and image fusion features;
step 24, obtaining the local similarity of the image fusion features and the secondary food material fusion features through a multilayer perceptron;
step 25, calculating the cosine similarity of the image feature sequence and the cooking step fusion features to obtain the global similarity;
step 26, linearly combining the local similarity and the global similarity to jointly measure the similarity between the image data and the recipe data;
in the above method, step 21 specifically includes the following steps:
step 211, performing feature extraction on the food image with the convolutional neural network ResNet50, taking the output of the last residual block, which comprises 7 × 7 = 49 columns of 2048-dimensional convolutional outputs; the result is expressed as the image feature sequence E_u = {e_1^u, ..., e_s^u}, where s denotes the number of image regions, E_u denotes the image feature sequence, and e_i^u denotes the food image feature of the i-th image region;
step 212, representing the food materials with a word2vec model and extracting the food material feature sequence with a single-layer bidirectional gated recurrent unit (GRU), expressed as R_g = {r_1^g, ..., r_m^g}, where m denotes the length of the food material sequence, R_g denotes the food material feature sequence, and r_i^g denotes the feature of the i-th food material;
step 213, representing the cooking steps with a sentence2vector model and extracting the cooking step feature sequence, expressed as R_s = {r_1^s, ..., r_n^s}, where n denotes the length of the cooking step sequence, R_s denotes the cooking step feature sequence, and r_i^s denotes the feature of the i-th cooking step.
In the above method, step 22 specifically includes the following steps:
step 221, calculating the affinity matrix of the food materials and the cooking steps, where the affinity matrix A is given by:
A = (R_g W_g)(R_s W_s)^T,
where W_g and W_s are weight parameters to be learned and T denotes the matrix transposition operation;
step 222, computing the weighted food material features R̂_g, weighted by the cooking step feature sequence, from the food material feature sequence and the affinity matrix, specifically:
R̂_g = A_g R_g, A_g = softmax(A^T / √d),
where A_g denotes the weight matrix obtained by normalizing the affinity matrix A along the food material dimension, the softmax() function maps the weights to between 0 and 1, A^T denotes the transpose of the affinity matrix A, and √d denotes the square root of the feature dimension of the food material or cooking step features;
step 223, computing the weighted cooking step features R̂_s, weighted by the food material feature sequence, from the cooking step feature sequence and the affinity matrix, specifically:
R̂_s = A_s R_s, A_s = softmax(A / √d),
where A_s denotes the weight matrix obtained by normalizing the affinity matrix A along the cooking step dimension;
step 224, matrix-splicing the food material feature sequence R_g and the weighted cooking step features R̂_s to obtain the food material fusion features E_g, specifically:
E_g = [R_g || R̂_s],
where [· || ·] denotes the matrix splicing operation;
step 225, matrix-splicing the cooking step feature sequence R_s and the weighted food material features R̂_g to obtain the cooking step fusion features E_s, specifically:
E_s = [R_s || R̂_g].
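The intra-recipe cross attention of steps 221-225 can be sketched as follows. The function mirrors the formulas above, with (m, d) food material features and (n, d) cooking step features; the square (d, d) shapes of the learned projections are an assumption.

```python
import torch
import torch.nn.functional as F

def recipe_fusion(R_g, R_s, W_g, W_s):
    """Steps 221-225 sketch.  R_g: (m, d) food material features,
    R_s: (n, d) cooking step features, W_g / W_s: (d, d) projections."""
    d = R_g.size(-1)
    # step 221: affinity matrix A = (R_g W_g)(R_s W_s)^T, shape (m, n)
    A = (R_g @ W_g) @ (R_s @ W_s).T
    # step 222: food material features weighted by the cooking steps
    A_g = F.softmax(A.T / d ** 0.5, dim=-1)      # (n, m)
    R_g_hat = A_g @ R_g                          # (n, d)
    # step 223: cooking step features weighted by the food materials
    A_s = F.softmax(A / d ** 0.5, dim=-1)        # (m, n)
    R_s_hat = A_s @ R_s                          # (m, d)
    # steps 224-225: matrix splicing along the feature dimension
    E_g = torch.cat([R_g, R_s_hat], dim=-1)      # (m, 2d)
    E_s = torch.cat([R_s, R_g_hat], dim=-1)      # (n, 2d)
    return E_g, E_s
```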
In the above method, step 23 specifically includes the following steps:
the inputs of the multi-head cross attention mechanism are three groups of vectors, Q (query), K (key) and V (value); the attention value is computed from Q and K, and the attention process is realised as follows:
Attn(Q, K, V) = softmax(QK^T / √d) V,
where Attn() denotes the attention function, the softmax() function maps values to between 0 and 1, K^T denotes the transpose of K, and √d denotes the square root of the Q or K dimension;
multi-head cross attention divides the features into h heads, performs the attention process in each head, and splices the outputs of the heads to obtain the output of the multi-head cross attention mechanism. Specifically, taking the image feature sequence E_u obtained in step 211 as Q and the food material fusion features E_g obtained in step 224 as K and V, the weighted food material features Ê_g weighted by the image feature sequence are obtained as follows:
Ê_g = Multihead_g(E_u, E_g, E_g) = [head_1 || ... || head_h] W^g, head_i = Attn(E_u W_i^u, E_g W_i^k, E_g W_i^g),
where W_i^g, W_i^k, W_i^u and W^g are parameter matrices to be learned, W^g maps the output dimension of the multi-head cross attention mechanism back to the input dimension, and head_h denotes the output of the h-th attention head;
taking the food material fusion features E_g as Q and the image feature sequence E_u obtained in step 211 as K and V, the weighted image feature sequence Ê_u weighted by the food material fusion features is obtained as follows:
Ê_u = Multihead_u(E_g, E_u, E_u),
where W^u maps the output dimension of the multi-head cross attention mechanism back to the input dimension;
in addition, for the i-th row e_i^u of the image feature sequence E_u and the i-th row ê_i^g of the weighted food material features Ê_g, the dot product is computed and passed through a sigmoid activation function to obtain the fusion degree, where 0 denotes no fusion and 1 denotes complete fusion:
g_i^u = sigmoid(e_i^u · ê_i^g),
where g_i^u denotes the fusion degree of the i-th row of features; for the image feature sequence E_u, G_u = [g_1^u, ..., g_s^u] denotes the fusion degrees of all regions, and similarly, for the food material fusion features E_g, G_g denotes the fusion degrees of all food materials;
in addition, the fusion of the image feature sequence E_u and the weighted food material features Ê_g adopts element-wise summation, and a residual connection is added to avoid losing original image region information that the multi-head cross attention cannot capture well, giving the final image fusion features Ẽ_u:
Ẽ_u = E_u + G_u ⊙ (E_u ⊕ Ê_g),
where ⊙ denotes the element-wise product and ⊕ denotes the element-wise sum; the final secondary food material fusion features Ẽ_g are obtained similarly:
Ẽ_g = E_g + G_g ⊙ (E_g ⊕ Ê_u).
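A minimal sketch of the gated cross-modal fusion of step 23, using PyTorch's nn.MultiheadAttention in place of the hand-written multi-head formulas (the same computation up to parameterisation). Only the image-side direction is shown; the food-material side is obtained symmetrically by swapping the roles of the two inputs.

```python
import torch
import torch.nn as nn

class GatedCrossFusion(nn.Module):
    """Step 23 sketch for the image side: multi-head cross attention,
    a row-wise sigmoid fusion gate, and a residual connection."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, E_u, E_g):
        # weighted food material information for each image region
        Eg_hat, _ = self.attn(query=E_u, key=E_g, value=E_g)
        # gate: row-wise dot product -> sigmoid (0 = no fusion, 1 = full)
        G_u = torch.sigmoid((E_u * Eg_hat).sum(dim=-1, keepdim=True))
        # gated element-wise sum plus residual connection
        return E_u + G_u * (E_u + Eg_hat)
```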
in the above method, step 24 specifically includes the following steps:
the image fusion features Ẽ_u and the secondary food material fusion features Ẽ_g are average-pooled to reduce dimension, spliced, and input to a two-layer multilayer perceptron; after a sigmoid activation function, the output gives the local similarity S_local as follows:
S_local(I, R) = sigmoid(MLP([pool(Ẽ_u) || pool(Ẽ_g)])).
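Step 24 can be sketched as follows. With 1024-d features per modality the spliced vector is 2048-d as in the description; the hidden width of the perceptron is an illustrative assumption.

```python
import torch
import torch.nn as nn

class LocalSimilarity(nn.Module):
    """Step 24 sketch: pool the image fusion features and the secondary
    food material fusion features, splice them, and score the result
    with a two-layer MLP followed by a sigmoid."""
    def __init__(self, dim=1024, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, img_fused, ingr_fused):
        # average-pool each (B, len, dim) sequence, then concatenate
        v = torch.cat([img_fused.mean(dim=1), ingr_fused.mean(dim=1)],
                      dim=-1)
        return self.mlp(v).squeeze(-1)   # local similarity in (0, 1)
```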
in the above method, step 25 specifically includes the following steps:
the cosine similarity of the average-pooled image feature sequence E_u and cooking step fusion features E_s gives the global similarity S_global as follows:
S_global(I, R) = cosine(pool(E_u), pool(E_s)).
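Steps 25 and 26 can be sketched as follows, pooling both sequences to single vectors; the two sequences are assumed to share one feature dimensionality after fusion, and the weight value is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def global_similarity(E_u, E_s):
    """Step 25 sketch: average-pool the image feature sequence and the
    cooking step fusion features, then take the cosine similarity."""
    # E_u: (B, s, d) image features, E_s: (B, n, d) step fusion features
    return F.cosine_similarity(E_u.mean(dim=1), E_s.mean(dim=1), dim=-1)

def combined_similarity(s_local, s_global, w1=0.5):
    """Step 26 sketch: linear combination S = w1*S_local + w2*S_global
    with w1 + w2 = 1 (the value of w1 here is illustrative)."""
    return w1 * s_local + (1.0 - w1) * s_global
```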
in the above method, step 26 specifically includes the following steps:
the local similarity S_local and the global similarity S_global are linearly combined to obtain the final similarity S between the image data and the recipe data, as follows:
S(I, R) = ω_1 S_local(I, R) + ω_2 S_global(I, R),
where ω_1 and ω_2 are weight parameters with ω_1 + ω_2 = 1.
In the above method, step 3 specifically includes the following steps:
the similarity S(I, R) should assign a high value to a positive image-recipe sample pair (I_p, R_p) and a low value to a negative sample pair (I_p, R_n), i.e. S(I_p, R_p) > S(I_p, R_n) for n ≠ p; the recipe that best matches a query image can then be found by ranking the similarity scores between the query image and all recipes, and vice versa;
in addition, during training, a contrastive ranking loss is used as the loss function: for a sampled positive image-recipe pair (I_p, R_p), the hardest-to-distinguish negative pairs (I_n, R_p) and (I_p, R_n) within a mini-batch are found and separated from the positive pair by a predefined margin Δ, as follows:
Loss(I, R) = [Δ + S(I_p, R_n) − S(I_p, R_p)]_+ + [Δ + S(I_n, R_p) − S(I_p, R_p)]_+,
where [x]_+ = max(0, x).
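The loss above can be sketched with in-batch hardest negatives. The margin value is an illustrative assumption, and the diagonal of the similarity matrix is taken to hold the positive pairs.

```python
import torch

def ranking_loss(S, margin=0.3):
    """Step 3 sketch: contrastive ranking loss with the hardest in-batch
    negatives.  S is a (B, B) image-recipe similarity matrix whose
    diagonal holds the positive pairs."""
    B = S.size(0)
    pos = S.diag()                                # S(I_p, R_p)
    # mask the diagonal so positives cannot be picked as negatives
    off_diag = torch.where(torch.eye(B, dtype=torch.bool),
                           torch.full_like(S, float('-inf')), S)
    hardest_recipe = off_diag.max(dim=1).values   # hardest R_n per image
    hardest_image = off_diag.max(dim=0).values    # hardest I_n per recipe
    # [x]_+ = max(0, x) applied to both hinge terms
    loss = (torch.clamp(margin + hardest_recipe - pos, min=0) +
            torch.clamp(margin + hardest_image - pos, min=0))
    return loss.mean()
```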
After the overall model is constructed and trained, step 4 can be performed to carry out cross-modal retrieval of food images and recipes with the trained overall network, specifically as follows:
step 41, extracting the feature vector of the data of a given modality;
step 42, inputting the extracted feature vector into the trained overall network;
step 43, the trained overall network computes, for the given modality data, the similarity (the linear combination of local and global similarity) with each of a number of candidate data items in the other modality;
step 44, ranking the similarity results; the original modality data corresponding to the highest similarity is the retrieval result.
The overall model of the application is evaluated with the most common retrieval metrics, top-k and MedR. top-k is the fraction of queries for which the target positive image or recipe ranks within the first k results among the candidate scores returned by the model (higher is better); in this embodiment k is 1, 5 and 10. MedR is the median rank of the target positive sample over all candidate queries (lower is better).
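The two evaluation metrics can be sketched as follows, assuming an (N, N) query-vs-candidate similarity matrix with the positives on the diagonal.

```python
import numpy as np

def topk_and_medr(sim, ks=(1, 5, 10)):
    """Evaluation sketch: sim is an (N, N) similarity matrix whose
    diagonal holds the positive pairs.  Returns top-k (recall@k, higher
    is better) and MedR (median rank of the positive, lower is better,
    best possible value 1)."""
    order = np.argsort(-sim, axis=1)               # best candidate first
    # 1-based rank of the true positive for each query
    ranks = 1 + np.array([int(np.where(order[i] == i)[0][0])
                          for i in range(sim.shape[0])])
    topk = {k: float(np.mean(ranks <= k)) for k in ks}
    return topk, float(np.median(ranks))
```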
The invention is tested on the large-scale image-recipe retrieval dataset Recipe1M, which collects over 1 million cooking recipes and 800 thousand food images from 24 popular cooking websites; 238,999 image samples with corresponding recipe texts form the training set, 51,119 image samples with corresponding recipe texts form the validation set, and 51,303 image samples with corresponding recipe texts form the test set.
In the testing stage, 1,000 pairs (1k) and 10,000 pairs (10k) are sampled, the sampling is repeated 10 times, and the average results are reported. As shown in fig. 2, the method obtains the highest retrieval accuracy in the image-recipe retrieval scenario; on the Recipe1M dataset, the top-k metrics improve markedly over the prior art on both the 1k and 10k test sets, demonstrating the effectiveness of the image-recipe cross-modal retrieval.
The above is an embodiment of the present invention. The specific parameters in the above embodiments and examples serve only to clearly illustrate the inventors' verification of the invention and are not intended to limit the scope of protection, which is defined by the claims; all equivalent structural changes made using the contents of the description and drawings of the present invention shall likewise fall within the scope of protection of the present invention.
Claims (8)
1. An image recipe retrieval method based on intra-modality and inter-modality hybrid fusion, characterized by comprising the following steps:
step 1, preparing image data and recipe data, wherein the image data comprises food images and the recipe data comprises food materials and cooking steps;
step 2, constructing an integral network;
the step 2 specifically comprises the following steps:
step 21, extracting an image characteristic sequence, a food material characteristic sequence and a cooking step characteristic sequence;
step 22, performing feature fusion on the food material feature sequence and the cooking step feature sequence by using a cross attention mechanism in a menu mode to obtain food material fusion features and cooking step fusion features;
step 23, performing modal fusion on the food material fusion characteristics and the image characteristic sequence under a gating mechanism by using a multi-head cross attention mechanism to obtain secondary food material fusion characteristics and image fusion characteristics;
step 24, obtaining local similarity of the image fusion characteristics and the secondary fusion characteristics of the food materials through a multilayer perceptron;
step 25, calculating the cosine similarity of the image feature sequence and the cooking step fusion features to obtain the global similarity;
step 26, linearly combining the local similarity and the global similarity to jointly measure the matching degree of the image data and the menu data;
step 3, training the whole network in the step 2, and setting a loss function;
and 4, performing cross-modal retrieval on the food and the menu by using the trained integral network.
2. The method for image menu search based on intra-modality and inter-modality hybrid fusion according to claim 1, wherein: the step 21 specifically includes the following steps:
step 211, extracting the features of the food image by using a convolutional neural network ResNet50 to extract an image feature sequence;
step 212, representing the food materials with a word2vec model and extracting the food material feature sequence with a single-layer bidirectional gated recurrent unit (GRU);
step 213, representing the cooking steps with a sentence2vector model and extracting the cooking step feature sequence.
3. The method for image menu search based on intra-modality and inter-modality hybrid fusion according to claim 1, wherein: the step 22 specifically includes the following steps:
step 221, calculating an affinity matrix of the food material and the cooking step;
step 222, calculating the food material characteristic sequence and the affinity matrix to obtain weighted food material characteristics weighted by the characteristic sequence of the cooking step;
step 223, calculating the cooking step characteristic sequence and the affinity matrix to obtain weighted cooking step characteristics weighted by the food material characteristic sequence;
step 224, performing matrix splicing on the food material characteristic sequence and the characteristics of the weighted cooking step to obtain food material fusion characteristics;
and step 225, performing matrix splicing on the cooking step characteristic sequence and the weighted food material characteristics to obtain the cooking step fusion characteristics.
4. The image-recipe retrieval method based on intra-modality and inter-modality hybrid fusion according to claim 1, wherein step 23 specifically comprises the following steps:
step 231, calculating, for the image feature sequence, an image information vector weighted by the food material fusion features: a multi-head cross-attention mechanism divides the food material fusion features and the image feature sequence into h vectors in same-dimension subspaces, and the image information sub-vectors obtained from the h heads are concatenated to form the final weighted image information vector;
step 232, calculating, for the food material fusion features, a food material information vector weighted by the image feature sequence: the multi-head cross-attention mechanism divides the image feature sequence and the food material fusion features into h vectors in same-dimension subspaces, and the food material information sub-vectors obtained from the h heads are concatenated to form the final weighted food material information vector;
step 233, performing a dot-product calculation between the image feature sequence and the weighted food material information vector to obtain their correlation, which is further expressed as a gating matrix of the fusion degree;
step 234, summing the image feature sequence and the weighted food material information vector element by element, multiplying the sum element-wise by the gating matrix, and applying a residual connection to the image feature sequence to finally obtain the image fusion features;
step 235, performing a dot-product calculation between the food material fusion features and the weighted image information vector to obtain their correlation, which is further expressed as a gating matrix of the fusion degree;
and step 236, summing the food material fusion features and the weighted image information vector element by element, multiplying the sum element-wise by the gating matrix, and applying a residual connection to the food material fusion features to finally obtain the secondary food material fusion features.
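A minimal NumPy sketch of the inter-modality fusion in steps 231-236, under two assumptions: the "dot product" of steps 233/235 is read as an element-wise product passed through a sigmoid to form the gate, and scaled dot-product attention is used inside each of the h subspaces:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_cross_attn(Q, K, h):
    """Steps 231-232: split queries Q (n, d) and keys/values K (m, d) into h
    same-dimension subspaces, attend in each, concatenate the sub-vectors."""
    n, d = Q.shape
    dk = d // h
    heads = []
    for i in range(h):
        q, k = Q[:, i*dk:(i+1)*dk], K[:, i*dk:(i+1)*dk]
        heads.append(softmax(q @ k.T / np.sqrt(dk), axis=-1) @ k)
    return np.concatenate(heads, axis=-1)       # (n, d) weighted information vectors

def gated_residual_fusion(X, W):
    """Steps 233-236: a gate G from the correlation of X with the weighted
    vectors W, applied to their element-wise sum, plus a residual to X."""
    G = sigmoid(X * W)           # correlation -> gating matrix of fusion degree
    return X + G * (X + W)       # gated fusion with residual connection

rng = np.random.default_rng(2)
V = rng.normal(size=(6, 32))     # toy image feature sequence
R = rng.normal(size=(5, 32))     # toy food material fusion features

W_food = multi_head_cross_attn(V, R, h=4)     # food material info per image position
W_img = multi_head_cross_attn(R, V, h=4)      # image info per food material
img_fused = gated_residual_fusion(V, W_food)  # steps 233-234
food_fused = gated_residual_fusion(R, W_img)  # steps 235-236
print(img_fused.shape, food_fused.shape)  # (6, 32) (5, 32)
```

The gate lets the network modulate, position by position, how much cross-modal information is mixed in, while the residual connection preserves the original features.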
5. The image-recipe retrieval method based on intra-modality and inter-modality hybrid fusion according to claim 1, wherein step 24 specifically comprises the following steps:
step 241, concatenating the image fusion features obtained in step 23 with the secondary food material fusion features to obtain a 2048-dimensional concatenated vector;
and step 242, feeding the concatenated vector into a two-layer multilayer perceptron followed by a sigmoid activation function to obtain a value between 0 and 1, which is expressed as the local similarity.
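Steps 241-242 can be sketched as below; the mean pooling of the two feature sequences to fixed-size vectors, the hidden width of 128, and the ReLU activation of the first layer are assumptions made for the sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_similarity(img_fused, food_fused, params):
    """Steps 241-242: concatenate the two fusion features into a 2048-d vector,
    pass it through a two-layer MLP, and squash the output with sigmoid."""
    # mean-pooling the sequences to fixed-size vectors is an assumption
    v = np.concatenate([img_fused.mean(axis=0), food_fused.mean(axis=0)])  # (2048,)
    W1, b1, W2, b2 = params
    hidden = np.maximum(0.0, W1 @ v + b1)       # first layer (ReLU assumed)
    return float(sigmoid(W2 @ hidden + b2)[0])  # value in (0, 1): local similarity

rng = np.random.default_rng(3)
img_fused = rng.normal(size=(6, 1024))    # toy image fusion features
food_fused = rng.normal(size=(5, 1024))   # toy secondary food material fusion features
params = (rng.normal(scale=0.05, size=(128, 2048)), np.zeros(128),
          rng.normal(scale=0.05, size=(1, 128)), np.zeros(1))
s_local = local_similarity(img_fused, food_fused, params)
print(0.0 < s_local < 1.0)  # True: sigmoid output is the local similarity
```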
6. The image-recipe retrieval method based on intra-modality and inter-modality hybrid fusion according to claim 1, wherein in step 25, average pooling is applied to the image feature sequence and to the cooking step fusion features respectively to obtain features of the same dimension, and the cosine similarity between the two pooled features is expressed as the global similarity.
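The global branch of step 25 can be sketched directly, assuming the two sequences already share a feature width so that average pooling yields vectors of the same dimension:

```python
import numpy as np

def global_similarity(img_seq, step_fused):
    """Step 25: average-pool each sequence to a single vector, then take the
    cosine similarity of the two pooled vectors as the global similarity."""
    v = img_seq.mean(axis=0)
    s = step_fused.mean(axis=0)
    return float(v @ s / (np.linalg.norm(v) * np.linalg.norm(s)))

rng = np.random.default_rng(4)
img_seq = rng.normal(size=(6, 32))      # toy image feature sequence
step_fused = rng.normal(size=(7, 32))   # toy cooking step fusion features
g = global_similarity(img_seq, step_fused)
print(-1.0 <= g <= 1.0)  # True: cosine similarity is bounded
```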
7. The image-recipe retrieval method based on intra-modality and inter-modality hybrid fusion according to claim 1, wherein in step 26, the local similarity and the global similarity are linearly combined, the two weights each lying between 0 and 1 and summing to 1, and the result is expressed as the matching degree between the image data and the recipe data.
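The combination of step 26 is a convex combination of the two similarities; in the sketch below the weight lam=0.5 is an assumed default, not a value from the patent:

```python
def match_score(s_local, s_global, lam=0.5):
    """Step 26: matching degree as a convex combination of the local and
    global similarities; lam lies in (0, 1) and the two weights sum to 1."""
    return lam * s_local + (1.0 - lam) * s_global

print(abs(match_score(0.8, 0.4) - 0.6) < 1e-9)  # True
```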
8. The image-recipe retrieval method based on intra-modality and inter-modality hybrid fusion according to claim 1, wherein in step 3, a comprehensive ranking loss is adopted as the loss function for training the overall network of step 2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110397679.2A CN112925935B (en) | 2021-04-13 | 2021-04-13 | Image menu retrieval method based on intra-modality and inter-modality mixed fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112925935A true CN112925935A (en) | 2021-06-08 |
CN112925935B CN112925935B (en) | 2022-05-06 |
Family
ID=76174378
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110397679.2A Active CN112925935B (en) | 2021-04-13 | 2021-04-13 | Image menu retrieval method based on intra-modality and inter-modality mixed fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112925935B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2024098763A1 (en) * | 2022-11-08 | 2024-05-16 | 苏州元脑智能科技有限公司 | Text operation diagram mutual-retrieval method and apparatus, text operation diagram mutual-retrieval model training method and apparatus, and device and medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109189968A (en) * | 2018-08-31 | 2019-01-11 | 深圳大学 | Cross-modal retrieval method and system
CN110059157A (en) * | 2019-03-18 | 2019-07-26 | 华南师范大学 | Cross-modal image-text retrieval method, system, device and storage medium
CN110147457A (en) * | 2019-02-28 | 2019-08-20 | 腾讯科技(深圳)有限公司 | Image-text matching method, device, storage medium and equipment
CN111026894A (en) * | 2019-12-12 | 2020-04-17 | 清华大学 | Cross-modal image-text retrieval method based on a confidence-adaptive matching network
CN111598214A (en) * | 2020-04-02 | 2020-08-28 | 浙江工业大学 | Cross-modal retrieval method based on graph convolution neural network |
CN111985369A (en) * | 2020-08-07 | 2020-11-24 | 西北工业大学 | Course field multi-modal document classification method based on cross-modal attention convolution neural network |
CN112164011A (en) * | 2020-10-12 | 2021-01-01 | 桂林电子科技大学 | Motion image deblurring method based on self-adaptive residual error and recursive cross attention |
CN112241468A (en) * | 2020-07-23 | 2021-01-19 | 哈尔滨工业大学(深圳) | Cross-modal video retrieval method and system based on multi-head self-attention mechanism and storage medium |
Non-Patent Citations (5)
Title |
---|
CHEN H et al.: "IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition * |
FU H et al.: "MCEN: Bridging Cross-Modal Gap between Cooking Recipes and Dish Images with Latent Variable Model", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition * |
LI J et al.: "Hybrid Fusion with Intra- and Cross-Modality Attention for Image-Recipe Retrieval", Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval * |
储晶晶: "Research on Cross-Modal Retrieval Methods for the Recipe Domain", Hunan University * |
林阳 et al.: "Cross-Modal Recipe Retrieval Method Incorporating a Self-Attention Mechanism", Journal of Frontiers of Computer Science and Technology * |
Also Published As
Publication number | Publication date |
---|---|
CN112925935B (en) | 2022-05-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110516160B (en) | Knowledge graph-based user modeling method and sequence recommendation method | |
CN107766873A (en) | Multi-label zero-shot sample classification method based on sequence learning | |
CN112417306B (en) | Method for optimizing performance of recommendation algorithm based on knowledge graph | |
CN113657450B (en) | Attention mechanism-based land battlefield image-text cross-modal retrieval method and system | |
CN113297370B (en) | End-to-end multi-modal question-answering method and system based on multi-interaction attention | |
CN107545276A (en) | Multi-view learning method combining low-rank representation and sparse regression | |
CN107590505B (en) | Learning method combining low-rank representation and sparse regression | |
CN112784782B (en) | Three-dimensional object identification method based on multi-view double-attention network | |
CN114782694A (en) | Unsupervised anomaly detection method, system, device and storage medium | |
CN114693397A (en) | Multi-view multi-modal commodity recommendation method based on attention neural network | |
CN114970517A (en) | Visual question answering method based on multi-modal interaction and context awareness | |
CN114241191A (en) | Proposal-free referring expression comprehension method based on cross-modal self-attention | |
CN112925935B (en) | Image menu retrieval method based on intra-modality and inter-modality mixed fusion | |
CN110569761B (en) | Method for retrieving remote sensing images from hand-drawn sketches based on adversarial learning | |
CN112182275A (en) | Trademark approximate retrieval system and method based on multi-dimensional feature fusion | |
CN115222954A (en) | Weak perception target detection method and related equipment | |
Liu et al. | Audiovisual cross-modal material surface retrieval | |
CN116189800B (en) | Pattern recognition method, device, equipment and storage medium based on gas detection | |
CN116933051A (en) | Multi-mode emotion recognition method and system for modal missing scene | |
CN116955650A (en) | Information retrieval optimization method and system based on small sample knowledge graph completion | |
CN115758159B (en) | Zero sample text position detection method based on mixed contrast learning and generation type data enhancement | |
Zhang et al. | Multiscale visual-attribute co-attention for zero-shot image recognition | |
CN114882409A (en) | Intelligent violent behavior detection method and device based on multi-mode feature fusion | |
Sassi et al. | Neural approach for context scene image classification based on geometric, texture and color information | |
CN115098646A (en) | Multilevel relation analysis and mining method for image-text data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||