CN111324752B - Image and text retrieval method based on graph neural network structure modeling - Google Patents
Image and text retrieval method based on graph neural network structure modeling
- Publication number
- CN111324752B (application CN202010104275.5A)
- Authority
- CN
- China
- Prior art keywords
- text
- picture
- elements
- visual
- similarity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS; G06—COMPUTING; G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/432—Query formulation
- G06F16/434—Query formulation using image data, e.g. images, photos, pictures taken by a user
- G06F16/438—Presentation of query results
- G06F16/483—Retrieval characterised by using metadata automatically derived from the content
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses an image and text retrieval method based on graph neural network structure modeling. An attention mechanism is applied to represent the fine-grained visual and text elements extracted from pictures and texts, so that picture-text similarity can be better calculated. A graph structure is adaptively constructed from the visual and text elements, and their features are updated by a graph convolution method, so that intra-modal and inter-modal relationships of the visual and text elements are better taken into account. A constraint mechanism is introduced between different picture and text pairs during the alignment of visual and text elements, which helps fine-grained text elements correspond to the matching picture regions, improves the reliability of picture-text similarity calculation, and thereby improves the accuracy of picture and text retrieval.
Description
Technical Field
The invention relates to the technical field of multimedia retrieval, and in particular to an image and text retrieval method based on graph neural network structure modeling.
Background
As massive amounts of multimedia data flow into the internet, multimedia retrieval techniques across a variety of different modalities of data (visual, text, speech, etc.) play an increasingly important role.
Conventional image retrieval techniques often use labels to retrieve pictures. This process tends to be unidirectional and uses only discrete tag data. Bidirectional retrieval between images and texts carries richer semantics and accords with the human habit of using natural language. However, there are large differences between data of the two modalities, vision and text. To achieve cross-modal retrieval of images and text, computer vision must be well integrated with natural language understanding.
Recent deep-learning-based cross-modal retrieval methods for images and texts mainly map the images and texts into a unified embedding space, compare the global similarity between the visual and language data, and output a retrieval result. However, these methods rarely consider the alignment between fine-grained visual elements and text elements, which limits the overall similarity calculation between image and text and affects the final retrieval accuracy.
Disclosure of Invention
The invention aims to provide an image and text retrieval method based on graph neural network structure modeling that achieves higher image and text retrieval accuracy.
The aim of the invention is achieved by the following technical scheme:
an image and text retrieval method based on graph neural network structure modeling comprises the following steps:
training phase: visual elements and initial text elements of a single picture and text pair are extracted, and an attention mechanism is introduced to re-represent each text element; with the visual elements and the re-represented text elements of the single picture and text pair as nodes, a graph structure is adaptively constructed, and each node is updated by a graph convolution method; the autocorrelation of each initial text element is calculated from the updated text elements and used as an alignment constraint between visual and text elements across different pictures and texts; meanwhile, the autocorrelations of the initial text elements are pooled to measure the similarity of the whole text and the whole picture, from which a retrieval ranking result is generated; a total loss function is constructed from the loss of the element alignment process and the loss function of the retrieval ranking;
testing phase: for an input picture to be retrieved, the corresponding visual elements are extracted and, combined with each text in the database, the autocorrelation of the initial text elements of each text is calculated in the same way as in the training stage, from which the similarity between each text and the picture is computed; for an input text to be retrieved, the corresponding initial text elements are extracted and, combined with each picture in the database, their autocorrelations are calculated in the same way as in the training stage, from which the similarity between the text and each picture is computed; sorting by similarity yields the retrieval result.
According to the technical scheme provided by the invention, pictures and texts are represented as fine-grained visual and text elements under the attention mechanism, so that all potential alignments between visual and text elements can be found; the graph structure is adaptively constructed, better accounting for relationships both within and across the modalities of the visual and text elements; and between different picture and text pairs, constraints are added to the text elements so that they align better with their corresponding visual elements. Accurate and comprehensive fine-grained alignment of visual and text elements enables the method to better measure picture-text similarity and obtain higher retrieval accuracy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of an image and text retrieval method based on a graph neural network structure modeling according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
The embodiment of the invention provides an image and text retrieval method based on graph neural network structure modeling. Fig. 1 shows the flow of the whole method; the main processes of the training and testing stages are as follows:
training phase: visual elements and initial text elements of a single picture and text pair are extracted, and an attention mechanism is introduced to re-represent each text element; with the visual elements and the re-represented text elements of the single picture and text pair as nodes, a graph structure is adaptively constructed, and each node is updated by a graph convolution method; the autocorrelation of each initial text element is calculated from the updated text elements and used as an alignment constraint between visual and text elements across different pictures and texts; meanwhile, the autocorrelations of the initial text elements are pooled to measure the similarity of the whole text and the whole picture, from which a retrieval ranking result is generated; a total loss function is constructed from the loss of the element alignment process and the loss function of the retrieval ranking;
testing phase: for an input picture to be retrieved, the corresponding visual elements are extracted and, combined with each text in the database, the autocorrelation of the initial text elements of each text is calculated in the same way as in the training stage, from which the similarity between each text and the picture is computed; for an input text to be retrieved, the corresponding initial text elements are extracted and, combined with each picture in the database, their autocorrelations are calculated in the same way as in the training stage, from which the similarity between the text and each picture is computed; sorting by similarity yields the corresponding retrieval result.
The method provided by the embodiment of the invention is a fine-grained image and text retrieval method: pictures and texts are represented as fine-grained visual and text elements; within a single picture and text pair, the relationships between fine-grained visual and text elements are extracted and aligned; and between different picture and text pairs, a constraint mechanism on the alignment of visual and text elements is applied. The fine-grained alignment between all regions in the picture and all words in the text can thus be fully considered, the similarity of a given picture and text pair can be well calculated, and a retrieval result is given. The method can be applied to databases of internet multimedia applications to answer users' picture/text retrieval requests. In implementation, it can be installed as software on a company's backend server, perform similarity calculation over large amounts of picture and text data, and return the most similar results for a picture or text query.
The training phase and the testing phase are described in detail below.
1. Training stage.
1. The visual features and text features of a single picture and text pair are mapped to the same space to obtain visual elements and initial text elements, and an attention mechanism is introduced to re-represent each text element.
This step mainly achieves fine-grained representation and alignment of visual and text elements (picture regions/objects, text characters/words) by means of an attention mechanism.
For a given picture I, the visual features F = {f_1, f_2, ..., f_n} of multiple regions of picture I are extracted using Fast R-CNN (a general object detection algorithm based on convolutional neural networks, CNN). Feature set F is then mapped to the embedding space with a fully connected layer, denoted V = {v_1, v_2, ..., v_n}, where n is the number of regions (objects) in the picture with explicit semantic information. As will be appreciated by those skilled in the art, a region with explicit semantic information is a region whose semantic content is known and definite, e.g. cat, house, etc.; that is, such a region may be one enclosing a cat, one enclosing a house, and so on.
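As an illustrative sketch of this mapping step (the feature and embedding dimensions below are assumptions chosen for illustration, and the weights would be learned during training rather than drawn at random), the fully connected projection of region features F into the embedding space V can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

n, feat_dim, embed_dim = 5, 2048, 1024       # illustrative sizes (assumptions)
F = rng.standard_normal((n, feat_dim))       # region features f_1..f_n from the detector

# fully connected layer V = F W + b; in practice W and b are learned parameters
W = rng.standard_normal((feat_dim, embed_dim)) * 0.01
b = np.zeros(embed_dim)
V = F @ W + b                                # visual elements v_1..v_n

print(V.shape)  # prints (5, 1024)
```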
For the text T, each word in the sentence is first represented by an embedding vector and then mapped to the embedding space by a bi-GRU (bidirectional gated recurrent unit network, a general natural language processing network based on recurrent neural networks, RNN), denoted E = {e_1, e_2, ..., e_k}, where k is the number of words.
Thereafter, each initial text element e_j is re-represented as: a_j = Σ_{i=1}^{n} α_ij · v_i, where j = 1, ..., k, and α_ij is the attention coefficient between text element e_j and visual element v_i, calculated from the similarity matrix of V = {v_1, v_2, ..., v_n} and E = {e_1, e_2, ..., e_k}.
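A minimal numpy sketch of this re-representation (the softmax normalisation of the similarity matrix is an assumption; the source only states that the attention coefficients come from the similarity matrix of V and E):

```python
import numpy as np

def re_represent(V, E):
    """Re-represent each text element e_j as an attention-weighted sum
    a_j = sum_i alpha_ij * v_i of the visual elements.
    V: (n, d) visual elements; E: (k, d) initial text elements."""
    sim = E @ V.T                               # (k, n) similarity matrix
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    alpha = np.exp(sim)
    alpha /= alpha.sum(axis=1, keepdims=True)   # attention coefficients alpha_ij
    return alpha @ V                            # (k, d) re-represented elements

# toy usage: one text element most similar to the first visual element
V = np.eye(3)                    # 3 visual elements in a 3-d space
E = np.array([[5.0, 0.0, 0.0]])  # one text element close to v_1
A = re_represent(V, E)
```

The attention weights concentrate the re-represented element a_1 on the visual element it is most similar to.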
2. The visual elements and the re-represented text elements of the single picture and text pair are taken as input; a graph structure is adaptively constructed with the elements as nodes and the cosine similarity between nodes as their connecting edges, only the M edges with the largest similarity are retained, and then each node is updated by a graph convolution method.
In the embodiment of the invention, the visual elements of the single picture and text pair and the re-represented text elements {v_1 ... v_n, a_1 ... a_k} are taken as input to adaptively construct a graph structure (graph structure modeling), taking intra-modal and inter-modal relationships of the visual and text elements into account; where {v_1 ... v_n} are the visual elements of the picture, {a_1 ... a_k} are the re-represented text elements, and the subscript is the element number.
Each element {v_1 ... v_n, a_1 ... a_k} serves as a node of the graph, and the cosine similarity b_pq between nodes (p, q being the serial numbers of two connected nodes) serves as the connecting edge. For any node t_p (i.e. t_p = v_i or t_p = a_j), only the M edges with the largest cosine similarity are retained, namely M(t_p) = topM(b_pq); the specific value of M can be set according to the actual situation.
The features of each node {v_1 ... v_n, a_1 ... a_k} are updated by graph convolution, namely: t'_p = t_p + β Σ_{t_q ∈ M(t_p)} b_pq · t_q, where the parameter β is the update strength and can be adjusted according to the actual situation; the updated element is denoted t'_p (i.e. v'_i or a'_j). The element types of t_p and t_q are not limited: they may be elements of the same type (both visual or both text elements) or of different types (one visual element and one text element), so that the feature of each node simultaneously takes intra-modal and inter-modal relationships of the visual and text elements into account.
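The graph construction and update described above can be sketched as follows; the additive update rule t'_p = t_p + β Σ b_pq · t_q over the retained edges is an assumption standing in for the graph-convolution formula, which is not reproduced in this text:

```python
import numpy as np

def graph_update(nodes, M=2, beta=0.5):
    """Adaptively build the graph over the stacked visual and text element
    nodes and update each node from its M most-similar neighbours.
    The update t'_p = t_p + beta * b_pq * t_q summed over the kept edges
    is an assumed stand-in for the patent's graph-convolution formula.
    nodes: (N, d) float array of elements {v_1..v_n, a_1..a_k}."""
    unit = nodes / np.linalg.norm(nodes, axis=1, keepdims=True)
    B = unit @ unit.T                    # cosine similarities b_pq
    np.fill_diagonal(B, -np.inf)         # exclude self-edges
    updated = nodes.copy()
    for p in range(len(nodes)):
        for q in np.argsort(B[p])[-M:]:  # keep only the top-M edges M(t_p)
            updated[p] += beta * B[p, q] * nodes[q]
    return updated
```

Because the nodes are stacked without regard to modality, a node's strongest neighbours may be visual or textual, which is how the update mixes intra-modal and inter-modal relations.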
3. The updated text elements are extracted and the autocorrelation of each initial text element is calculated, which serves as a constraint on the alignment of visual and initial text elements between different picture and text pairs; meanwhile, the autocorrelations of the initial text elements are used to measure the similarity between the text and the whole picture, and a retrieval ranking result is generated according to the similarity.
The previous step yields the updated visual and text elements {v'_1, ..., v'_n} and {a'_1, ..., a'_k}, the subscripts being feature serial numbers. The updated text elements {a'_1, ..., a'_k} are extracted to calculate the autocorrelation of each initial text element e_j: r_j = e_j^T a'_j.
Because the above a'_j is found based on the elements V = {v_1, v_2, ..., v_n} of the whole picture, the autocorrelation r_j(Î) of an initial text element calculated based on a picture Î that does not match the text is smaller than the autocorrelation r_j(I) calculated based on the matching picture I, i.e. r_j(Î) < r_j(I).
Based on this inequality, the patent applies graph structure modeling between different picture and text pairs so that text element e_j aligns better with the visual elements in the truly matching picture I.
In the embodiment of the invention, triplets are used to reflect this inequality. For each sentence t in a training mini-batch, there is the loss: l_t^R = Σ_j [θ - r_j(t, v_pos) + r_j(t, v_neg)]_+, where θ is the triplet loss margin, [x]_+ = max{0, x}, and v_pos, v_neg denote the visual elements of the picture I matching sentence t and of a mismatched picture, respectively.
In the embodiment of the invention, the autocorrelations r_j of the text elements e_j are used to measure the similarity between the text T and the whole picture I; average pooling is adopted in the calculation, expressed as: S(I, T) = (1/k) Σ_{j=1}^{k} r_j.
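A sketch of this pooling step, assuming the per-element autocorrelations r_1, ..., r_k have already been computed:

```python
import numpy as np

def picture_text_similarity(r):
    """Average-pool the per-element autocorrelations r_1..r_k into the
    overall similarity score between a picture and a text."""
    return float(np.mean(r))

# hypothetical autocorrelations of three text elements
s = picture_text_similarity([0.8, 0.6, 0.7])
print(round(s, 6))  # prints 0.7
```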
Likewise, the retrieval ranking objective is represented using a triplet loss: l_IT = [γ - S(I, T) + S(I, T̂)]_+ + [γ - S(I, T) + S(Î, T)]_+, where (I, T) represents the matching picture and text pair, Î and T̂ respectively represent the hardest non-matching samples for picture I and text T, i.e. Î = argmax_{Î ≠ I} S(Î, T) and T̂ = argmax_{T̂ ≠ T} S(I, T̂), and the parameter γ is the triplet loss margin.
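A numpy sketch of this bidirectional triplet objective with hardest in-batch negatives (the batch-matrix formulation is an assumption; the source defines the hardest negatives per matching pair):

```python
import numpy as np

def retrieval_triplet_loss(S, gamma=0.2):
    """Bidirectional triplet ranking loss with hardest in-batch negatives,
    a sketch of the l_IT objective.  S is a (B, B) similarity matrix whose
    diagonal holds the matching picture-text pairs; gamma is the margin."""
    B = S.shape[0]
    pos = np.diag(S)
    masked = np.where(np.eye(B, dtype=bool), -np.inf, S)
    hard_text = masked.max(axis=1)  # hardest non-matching text for each picture
    hard_img = masked.max(axis=0)   # hardest non-matching picture for each text
    loss = np.maximum(0.0, gamma - pos + hard_text) \
         + np.maximum(0.0, gamma - pos + hard_img)
    return float(loss.sum())
```

When matching pairs score clearly above all mismatches by the margin, the loss is zero; when mismatches score as high as matches, each pair contributes twice the margin.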
Finally, the total loss function of the whole model in the training stage is: L = l_IT + η Σ_t l_t^R, where the hyper-parameter η is a factor regulating the weight between the two losses l_IT and Σ_t l_t^R.
In Fig. 1, pictures that actually match the input text and pictures that do not are denoted by "+" and "-", respectively. v_i^+ and v_i^- are visual elements of the corresponding pictures; a_i^+ and a_i^- are text elements calculated with the attention mechanism in combination with the visual features of the corresponding pictures; r_i^+ and r_i^- are the autocorrelations of the text elements described herein.
2. Testing stage.
During testing, a user inputs a picture or a text to be retrieved. The invention calculates the similarity between the query content (picture or text) and all candidate contents (texts or pictures) in the database, and generates the final retrieval ranking result according to the similarity.
If the input is a text to be retrieved, its text elements E = {e_1, e_2, ..., e_k} are extracted. Then, separately for each picture in the database with visual elements V = {v_1, v_2, ..., v_n}, the corresponding a_1, ..., a_k are computed, the graph structure is constructed and updated to obtain {a'_1, ..., a'_k}, and the autocorrelations r_j are calculated, from which the similarity is computed; sorting by similarity yields the retrieval result (a ranked series of picture data).
If the input is a picture to be retrieved, its visual elements V = {v_1, v_2, ..., v_n} are extracted. Then, separately for the text elements E = {e_1, e_2, ..., e_k} of each text in the database, a_1, ..., a_k are computed, the graph structure is constructed and updated to obtain {a'_1, ..., a'_k}, and the autocorrelations r_j are calculated, from which the similarity is computed; sorting by similarity yields the retrieval result (a ranked series of text data).
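The final ranking step of the test phase can be sketched as follows; the similarity scores are hypothetical:

```python
import numpy as np

def rank_by_similarity(scores):
    """Return database indices sorted by descending similarity, i.e. the
    order in which retrieval results are presented to the user."""
    return np.argsort(scores)[::-1].tolist()

# hypothetical similarity scores of four database items against one query
order = rank_by_similarity([0.12, 0.87, 0.45, 0.60])
print(order)  # prints [1, 3, 2, 0]
```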
According to the scheme provided by the embodiment of the invention, an attention mechanism is applied to represent the fine-grained visual and text elements extracted from pictures and texts, so that picture-text similarity can be better calculated; a graph structure is adaptively constructed from the visual and text elements and their features are updated by a graph convolution method, so that intra-modal and inter-modal relationships of the visual and text elements are better taken into account; and a constraint mechanism is introduced between different picture and text pairs during the alignment of visual and text elements, which helps fine-grained text elements correspond to the matching picture regions, improves the reliability of picture-text similarity calculation, and thereby improves the accuracy of picture and text retrieval.
From the description of the above embodiments, it will be apparent to those skilled in the art that the above embodiments may be implemented in software, or may be implemented by means of software plus a necessary general hardware platform. With such understanding, the technical solutions of the foregoing embodiments may be embodied in a software product, where the software product may be stored in a nonvolatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.), and include several instructions for causing a computer device (may be a personal computer, a server, or a network device, etc.) to perform the methods of the embodiments of the present invention.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.
Claims (4)
1. An image and text retrieval method based on graph neural network structure modeling is characterized by comprising the following steps:
training phase: visual elements and initial text elements of a single picture and text pair are extracted, and an attention mechanism is introduced to re-represent each text element; with the visual elements and the re-represented text elements of the single picture and text pair as nodes, a graph structure is adaptively constructed, and each node is updated by a graph convolution method; the autocorrelation of each initial text element is calculated from the updated text elements and used as an alignment constraint between visual and text elements across different pictures and texts; meanwhile, the autocorrelations of the initial text elements are pooled to measure the similarity of the whole text and the whole picture, from which a retrieval ranking result is generated; a total loss function is constructed from the loss of the element alignment process and the loss function of the retrieval ranking;
testing phase: for an input picture to be retrieved, the corresponding visual elements are extracted and, combined with each text in the database, the autocorrelation of the initial text elements of each text is calculated in the same way as in the training stage, from which the similarity between each text and the picture is computed; for an input text to be retrieved, the corresponding initial text elements are extracted and, combined with each picture in the database, their autocorrelations are calculated in the same way as in the training stage, from which the similarity between the text and each picture is computed; sorting by similarity yields the retrieval result;
extracting visual elements and initial text elements of a single picture and text pair includes:
for a given picture I, the visual features F = {f_1, f_2, ..., f_n} of multiple regions of picture I are extracted using Fast R-CNN; feature set F is then mapped to the embedding space with a fully connected layer, denoted V = {v_1, v_2, ..., v_n}, where n is the number of regions or targets with definite semantic information in the picture;
for the text T, each word in the sentence is first represented by an embedding vector and then mapped to the embedding space by a bi-GRU, denoted E = {e_1, e_2, ..., e_k}, where k is the number of words;
re-representing each text element with the attention mechanism includes:
for the initial text element e_j, the attention mechanism re-expresses it as: a_j = Σ_{i=1}^{n} α_ij · v_i, where j = 1, ..., k, and α_ij is the attention coefficient between the initial text element e_j and visual element v_i, calculated from the similarity matrix of V = {v_1, v_2, ..., v_n} and E = {e_1, e_2, ..., e_k};
the visual elements and re-represented text elements {v_1 ... v_n, a_1 ... a_k} of the single picture and text pair are taken as input to adaptively build the graph structure; where {v_1 ... v_n} are the visual elements of the picture, {a_1 ... a_k} are the re-represented text elements, and the subscript is the element number;
each element {v_1 ... v_n, a_1 ... a_k} serves as a node, the cosine similarity b_pq between nodes serves as the connecting edge, and for any node t_p only the M edges with the largest cosine similarity are retained, namely M(t_p) = topM(b_pq), where t_p = v_i or t_p = a_j, j = 1, ..., k, i = 1, ..., n.
2. The image and text retrieval method based on graph neural network structure modeling of claim 1, wherein
the updated text elements are denoted a'_j, the subscript being the serial number of the text element, and the autocorrelation of each initial text element e_j is calculated as: r_j = e_j^T a'_j;
wherein, because a'_j is found based on the features V = {v_1, v_2, ..., v_n} of the whole picture, the autocorrelation r_j(Î) of an initial text element calculated based on a picture Î that does not match the text is smaller than the autocorrelation r_j(I) calculated based on the matching picture I, i.e. r_j(Î) < r_j(I);
triplets are used to reflect the above inequality, and for each sentence t in a training mini-batch there is the loss: l_t^R = Σ_j [θ - r_j(t, v_pos) + r_j(t, v_neg)]_+, where θ is the triplet loss margin, [x]_+ = max{0, x}, and v_pos, v_neg denote the visual elements of the picture matching sentence t and of a mismatched picture, respectively.
3. The image and text retrieval method based on graph neural network structure modeling of claim 2, wherein the autocorrelations r_j of the initial text elements are used to measure the similarity between the text T and the whole picture I, and average pooling is adopted in the calculation, expressed as: S(I, T) = (1/k) Σ_{j=1}^{k} r_j, where I represents the picture.
4. The image and text retrieval method based on graph neural network structure modeling of claim 3, wherein the retrieval ranking objective is represented using a triplet loss: l_IT = [γ - S(I, T) + S(I, T̂)]_+ + [γ - S(I, T) + S(Î, T)]_+, where (I, T) represents the matching picture and text pair, Î and T̂ respectively represent the hardest unmatched samples for picture I and text T, and the parameter γ is the triplet loss margin;
the total loss function for the training phase is: L = l_IT + η Σ_t l_t^R, where η is a weight factor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010104275.5A CN111324752B (en) | 2020-02-20 | 2020-02-20 | Image and text retrieval method based on graphic neural network structure modeling |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010104275.5A CN111324752B (en) | 2020-02-20 | 2020-02-20 | Image and text retrieval method based on graphic neural network structure modeling |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111324752A CN111324752A (en) | 2020-06-23 |
CN111324752B true CN111324752B (en) | 2023-06-16 |
Family
ID=71165326
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010104275.5A Active CN111324752B (en) | 2020-02-20 | 2020-02-20 | Image and text retrieval method based on graphic neural network structure modeling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111324752B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11475059B2 (en) * | 2019-08-16 | 2022-10-18 | The Toronto-Dominion Bank | Automated image retrieval with graph neural network |
CN111914113B (en) * | 2020-08-07 | 2024-06-28 | 大连理工大学 | Image retrieval method and related device |
CN112417097B (en) * | 2020-11-19 | 2022-09-16 | 中国电子科技集团公司电子科学研究院 | Multi-modal data feature extraction and association method for public opinion analysis |
CN113377990B (en) * | 2021-06-09 | 2022-06-14 | 电子科技大学 | Video/picture-text cross-modal matching training method based on meta-self learning |
CN113469197B (en) * | 2021-06-29 | 2024-03-22 | 北京达佳互联信息技术有限公司 | Image-text matching method, device, equipment and storage medium |
CN114067215B (en) * | 2022-01-17 | 2022-04-15 | 东华理工大学南昌校区 | Remote sensing image retrieval method based on node attention machine mapping neural network |
CN114443904B (en) * | 2022-01-20 | 2024-02-02 | 腾讯科技(深圳)有限公司 | Video query method, device, computer equipment and computer readable storage medium |
CN115730878B (en) * | 2022-12-15 | 2024-01-12 | 广东省电子口岸管理有限公司 | Cargo import and export checking management method based on data identification |
CN116226434B (en) * | 2023-05-04 | 2023-07-21 | 浪潮电子信息产业股份有限公司 | Multi-element heterogeneous model training and application method, equipment and readable storage medium |
CN116361502B (en) * | 2023-05-31 | 2023-08-01 | 深圳兔展智能科技有限公司 | Image retrieval method, device, computer equipment and storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2013004093A (en) * | 2011-06-16 | 2013-01-07 | Fujitsu Ltd | Search method and system using multi-instance learning |
CN107330100A (en) * | 2017-07-06 | 2017-11-07 | 北京大学深圳研究生院 | Image-text bidirectional retrieval method based on multi-view joint embedding space |
WO2017210949A1 (en) * | 2016-06-06 | 2017-12-14 | 北京大学深圳研究生院 | Cross-media retrieval method |
CN108132968A (en) * | 2017-12-01 | 2018-06-08 | 西安交通大学 | Weakly supervised learning method for associating web text with image semantic units |
CN109255047A (en) * | 2018-07-18 | 2019-01-22 | 西安电子科技大学 | Image-text mutual retrieval method based on complementary semantic alignment and symmetric retrieval |
CN109992686A (en) * | 2019-02-24 | 2019-07-09 | 复旦大学 | Image-text retrieval system and method based on multi-angle self-attention mechanism |
CN110555121A (en) * | 2019-08-27 | 2019-12-10 | 清华大学 | Image hash generation method and device based on graph neural network |
CN110717498A (en) * | 2019-09-16 | 2020-01-21 | 腾讯科技(深圳)有限公司 | Image description generation method and device and electronic equipment |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104572651B (en) * | 2013-10-11 | 2017-09-29 | 华为技术有限公司 | Picture sort method and device |
- 2020-02-20 CN CN202010104275.5A patent/CN111324752B/en active Active
Non-Patent Citations (2)
Title |
---|
Liu Chang; Zhou Xiangdong; Shi Bole. Text description method based on image semantic similarity networks. Computer Applications and Software. 2018, (01), full text. *
Qi Jinwei; Peng Yuxin; Yuan Yuxin. Hierarchical recurrent attention network model for cross-media retrieval. Journal of Image and Graphics. 2018, (11), full text. *
Also Published As
Publication number | Publication date |
---|---|
CN111324752A (en) | 2020-06-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111324752B (en) | Image and text retrieval method based on graph neural network structure modeling | |
KR101778679B1 (en) | Method and system for classifying data consisting of multiple attributes represented by sequences of text words or symbols using deep learning | |
US8027977B2 (en) | Recommending content using discriminatively trained document similarity | |
US9846836B2 (en) | Modeling interestingness with deep neural networks | |
US8787683B1 (en) | Image classification | |
US9009134B2 (en) | Named entity recognition in query | |
US8538898B2 (en) | Interactive framework for name disambiguation | |
CN111190997B (en) | Question-answering system implementation method using neural network and machine learning ranking algorithm | |
CN112819023B (en) | Sample set acquisition method, device, computer equipment and storage medium | |
CN111858940B (en) | Multi-head attention-based legal case similarity calculation method and system | |
CN111539197A (en) | Text matching method and device, computer system and readable storage medium | |
WO2023134082A1 (en) | Training method and apparatus for image caption statement generation module, and electronic device | |
US20120158716A1 (en) | Image object retrieval based on aggregation of visual annotations | |
CN111753167A (en) | Search processing method, search processing device, computer equipment and medium | |
US20100138414A1 (en) | Methods and systems for associative search | |
CN113360646A (en) | Text generation method and equipment based on dynamic weight and storage medium | |
CN112347758A (en) | Text abstract generation method and device, terminal equipment and storage medium | |
CN112805715A (en) | Identifying entity attribute relationships | |
CN113761890A (en) | Multi-level semantic information retrieval method based on BERT context awareness | |
CN111666376A (en) | Answer generation method and device based on paragraph boundary scan prediction and word mover's distance cluster matching | |
CN110569355B (en) | Word-chunk-based joint method and system for opinion target extraction and target sentiment classification | |
CN116975271A (en) | Text relevance determining method, device, computer equipment and storage medium | |
Vaissnave et al. | Modeling of automated glowworm swarm optimization based deep learning model for legal text summarization | |
CN113516094A (en) | System and method for matching document with review experts | |
CN113535928A (en) | Service discovery method and system based on attention-mechanism long short-term memory network | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||