CN111324752B - Image and text retrieval method based on graph neural network structure modeling - Google Patents


Info

Publication number
CN111324752B
Authority
CN
China
Prior art keywords: text, picture, elements, visual, similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010104275.5A
Other languages
Chinese (zh)
Other versions
CN111324752A (en)
Inventor
张勇东
张天柱
魏曦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202010104275.5A priority Critical patent/CN111324752B/en
Publication of CN111324752A publication Critical patent/CN111324752A/en
Application granted granted Critical
Publication of CN111324752B publication Critical patent/CN111324752B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43 Querying
    • G06F16/432 Query formulation
    • G06F16/434 Query formulation using image data, e.g. images, photos, pictures taken by a user
    • G06F16/438 Presentation of query results
    • G06F16/48 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/483 Retrieval characterised by using metadata automatically derived from the content
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses an image and text retrieval method based on graph neural network structure modeling, which applies an attention mechanism to represent the fine-grained visual and text elements extracted from pictures and texts, so that the similarity of pictures and texts can be better calculated; a graph structure is adaptively constructed from the visual and text elements and the features are updated by graph convolution, so that the intra-modal and inter-modal relations of visual and text elements can be better considered; and a constraint mechanism is introduced into the alignment process of visual and text elements between different picture and text pairs, which helps fine-grained text elements correspond to their matching picture regions, further improving the reliability of picture-text similarity calculation and thus the accuracy of picture and text retrieval.

Description

Image and text retrieval method based on graph neural network structure modeling
Technical Field
The invention relates to the technical field of multimedia retrieval, in particular to an image and text retrieval method based on graph neural network structure modeling.
Background
As massive amounts of multimedia data flow into the internet, multimedia retrieval techniques across a variety of different modalities of data (visual, text, speech, etc.) play an increasingly important role.
Conventional image retrieval techniques often use labels to retrieve pictures. This process tends to be unidirectional and uses only discrete tag data. Bidirectional retrieval of images and texts carries richer semantics and accords with the human habit of using natural language. However, there is a large gap between data of the two different modalities, vision and text. To achieve cross-modal retrieval of images and text, computer vision must be well integrated with natural language understanding.
Recent deep learning-based cross-modal retrieval methods for images and texts mainly map the images and texts into a unified embedding space, compare the global similarity between visual and language data, and output the retrieval result. However, these methods rarely consider the alignment between fine-grained visual elements and text elements. This limits the overall similarity calculation between image and text and affects the final retrieval accuracy.
Disclosure of Invention
The invention aims to provide an image and text retrieval method based on graph neural network structure modeling, which can achieve higher image and text retrieval accuracy.
This aim is realized by the following technical scheme:
an image and text retrieval method based on graph neural network structure modeling comprises the following steps:
training phase: extracting visual elements and initial text elements of a single picture and text pair, and introducing an attention mechanism to re-represent each text element; taking visual elements of single picture and text pairs and re-represented text elements as nodes, adaptively constructing a graph structure, and updating each node by utilizing a graph convolution method; calculating the autocorrelation of each initial text element by combining the updated text elements, and making alignment constraints of visual elements and text elements in different pictures and texts; meanwhile, the autocorrelation of the initial text elements is converged to measure the similarity of the whole text and the whole picture, so that a retrieval ordering result is generated according to the similarity; constructing a total loss function by using the loss of the element alignment process and the loss function of the retrieval ordering;
testing: for an input picture to be detected, extracting a corresponding visual element, combining text data in a database, and calculating the autocorrelation of an initial text element of each text data in the same way as a training stage, so as to calculate the similarity of each text data and the picture to be detected; extracting corresponding initial text elements for an input text to be detected, combining picture data in a database, and calculating the autocorrelation of the initial text elements of the text to be detected in the same way as a training stage, so as to calculate the similarity of the text to be detected and the picture; and sorting according to the similarity to obtain a retrieval result.
According to the technical scheme provided by the invention, pictures and texts are represented as fine-grained visual and text elements under the attention mechanism, so that all potential visual-text element alignments can be found; the graph structure is adaptively constructed, better considering the relations within the same modality and between different modalities of visual and text elements; between different picture and text pairs, constraints are added to the text elements so that they can be better aligned to the corresponding visual elements. Accurate and comprehensive fine-grained alignment of visual and text elements enables the method to better measure picture-text similarity and achieve higher retrieval accuracy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of an image and text retrieval method based on a graph neural network structure modeling according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
The embodiment of the invention provides an image and text retrieval method based on graph neural network structure modeling. Fig. 1 shows the flow of the whole method; the main processes of the training and testing stages are the same, specifically:
training phase: extracting visual elements and initial text elements of a single picture and text pair, and introducing an attention mechanism to re-represent each text element; taking visual elements of single picture and text pairs and re-represented text elements as nodes, adaptively constructing a graph structure, and updating each node by utilizing a graph convolution method; calculating the autocorrelation of each initial text element by combining the updated text elements, and making alignment constraints of visual elements and text elements in different pictures and texts; meanwhile, the autocorrelation of the initial text elements is converged to measure the similarity of the whole text and the whole picture, so that a retrieval ordering result is generated according to the similarity; constructing a total loss function by using the loss of the element alignment process and the loss function of the retrieval ordering;
testing: for an input picture to be detected, extracting a corresponding visual element, combining text data in a database, and calculating the autocorrelation of an initial text element of each text data in the same way as a training stage, so as to calculate the similarity of each text data and the picture to be detected; extracting corresponding initial text elements for an input text to be detected, combining picture data in a database, and calculating the autocorrelation of the initial text elements of the text to be detected in the same way as a training stage, so as to calculate the similarity of the text to be detected and the picture; and sorting according to the similarity to obtain a corresponding retrieval result.
The method provided by the embodiment of the invention is a fine-grained image and text retrieval method: it performs fine-grained visual and text element representation of the images and texts; within a single picture and text pair, it extracts and aligns fine-grained visual and text element relations; between different picture and text pairs, it applies a constraint mechanism on visual-text element alignment. Therefore, the fine-grained alignment relations between all regions in the picture and all words in the text can be fully considered, the similarity of a given picture and text pair can be well calculated, and a retrieval result returned. The method can be applied to databases of internet multimedia applications to answer users' picture/text retrieval requests. In implementation, it can be installed as software on a backend server, perform similarity calculation over large amounts of picture and text data, and return the most similar results for a picture or text query.
The training phase and the testing phase are described in detail below.
1. Training stage.
1. The visual features and the text features of a single picture and text pair are mapped into the same space to obtain the visual elements and the initial text elements, and an attention mechanism is introduced to re-represent each text element.
This step mainly achieves fine-grained representation and alignment of visual and text elements (picture regions/objects, text words) by means of an attention mechanism.
For a given picture I, the visual features F = {f_1, f_2, ..., f_n} of multiple regions of the picture I are extracted using Fast R-CNN (a general object detection algorithm based on the convolutional neural network, CNN). The features F are then mapped to the embedding space with a fully connected layer, denoted V = {v_1, v_2, ..., v_n}, where n is the number of regions (objects) in the picture that have explicit semantic information. As will be appreciated by those skilled in the art, a region with explicit semantic information is one whose semantic information is known and unambiguous; for example, the semantic information may be "cat" or "house", i.e., such a region may be one enclosing a cat, one enclosing a house, and so on.
For the text T, each word in the sentence is represented by an embedding vector and then mapped to the embedding space by a bi-GRU (bidirectional gated recurrent unit network, a general natural language processing network based on the recurrent neural network, RNN), denoted E = {e_1, e_2, ..., e_k}, where k is the number of words.
Thereafter, each initial text element e_j is re-represented as:

a_j = Σ_{i=1}^{n} α_ij v_i,  j = 1, ..., k

where α_ij is the attention coefficient between the text element e_j and the visual element v_i, calculated from the similarity matrix between V = {v_1, v_2, ..., v_n} and E = {e_1, e_2, ..., e_k}.
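As a minimal NumPy sketch of this re-representation (the softmax normalization of the attention coefficients over the picture regions is an assumption; the patent only states that they come from the similarity matrix between V and E):

```python
import numpy as np

def re_represent_text(V, E):
    """Re-represent each initial text element as an attention-weighted
    sum of visual elements: a_j = sum_i alpha_ij * v_i.

    V: (n, d) visual elements of the picture.
    E: (k, d) initial text elements of the sentence.
    The softmax over regions is an assumed normalization.
    """
    S = E @ V.T                               # (k, n) similarity matrix
    S = S - S.max(axis=1, keepdims=True)      # numerical stability
    alpha = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)
    return alpha @ V                          # (k, d) elements a_1..a_k

rng = np.random.default_rng(0)
V = rng.normal(size=(5, 8))   # n = 5 picture regions
E = rng.normal(size=(3, 8))   # k = 3 words
A = re_represent_text(V, E)
```

Each row of `A` is a convex combination of the region features, so a word attends most to the regions it resembles.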
2. The visual elements and the re-represented text elements of a single picture and text pair are used as input to adaptively construct a graph structure: the elements serve as nodes, the cosine similarity between nodes serves as the edge weight, only the top-M edges with the largest similarity are retained, and then each node is updated by graph convolution.
In the embodiment of the invention, the visual elements and the re-represented text elements {v_1, ..., v_n, a_1, ..., a_k} of a single picture and text pair are taken as input, and a graph structure is adaptively constructed (graph structure modeling), taking into account the intra-modal and inter-modal relations of visual and text elements; here {v_1, ..., v_n} are the visual elements of the picture, {a_1, ..., a_k} are the re-represented text elements, and the subscript is the element number.
Each element {v_1, ..., v_n, a_1, ..., a_k} serves as a node of the graph, and the cosine similarity b_pq between nodes (p, q denote the numbers of the two connected nodes) serves as the edge weight. For any node t_p (i.e., t_p = v_i or t_p = a_j), only the M edges with the largest cosine similarity are retained, namely M(t_p) = top-M(b_pq); the specific value of M can be set according to the actual situation.
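A sketch of the adaptive graph construction, assuming the stacked visual and text elements as input (NumPy; representing the retained top-M edge set as a dense weight matrix is an implementation choice, not from the patent):

```python
import numpy as np

def build_graph(nodes, M=2):
    """Build the element graph: edge weights are cosine similarities
    b_pq, and every node t_p keeps only its M strongest edges,
    M(t_p) = top-M(b_pq). Self-loops are excluded.

    nodes: (N, d) array stacking v_1..v_n and a_1..a_k.
    Returns an (N, N) weight matrix with only the kept edges non-zero.
    """
    unit = nodes / np.linalg.norm(nodes, axis=1, keepdims=True)
    b = unit @ unit.T                  # cosine similarity b_pq
    np.fill_diagonal(b, -np.inf)       # a node never links to itself
    kept = np.zeros_like(b)
    for p in range(len(nodes)):
        top = np.argsort(b[p])[-M:]    # indices of the M largest b_pq
        kept[p, top] = b[p, top]
    return kept

rng = np.random.default_rng(0)
nodes = rng.normal(size=(8, 16))       # e.g. n = 5 visual + k = 3 text
B = build_graph(nodes, M=2)
```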
The features of each node {v_1, ..., v_n, a_1, ..., a_k} are then updated by graph convolution, namely:

t̂_p = t_p + β Σ_{t_q ∈ M(t_p)} b_pq t_q

where the parameter β is the update strength and can be adjusted according to the actual situation. The updated elements are denoted {v̂_1, ..., v̂_n, â_1, ..., â_k}, i.e., t̂_p = v̂_i or t̂_p = â_j.
The node types of t_p and t_q are not restricted: the two connected nodes may be elements of the same type (both visual or both text) or of different types (one visual and one text), so that the updated feature of each node simultaneously takes the intra-modal and inter-modal relations of visual and text elements into account.
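The node update is rendered only as an equation image in the source; a residual weighted-aggregation form consistent with the surrounding description (β as the update strength, b_pq as the retained edge weights) might be sketched as:

```python
import numpy as np

def graph_conv_update(nodes, B, beta=0.5):
    """One graph-convolution update of every node feature.
    The residual form t_p + beta * sum_q b_pq t_q is an assumption:
    the exact formula appears only as an image in the patent text.

    nodes: (N, d) stacked elements; B: (N, N) kept edge weights.
    """
    return nodes + beta * (B @ nodes)

rng = np.random.default_rng(1)
nodes = rng.normal(size=(4, 6))   # v_1..v_n and a_1..a_k stacked
B = np.eye(4, k=1) * 0.8          # toy edge weights for the demo
updated = graph_conv_update(nodes, B, beta=0.3)
```

With β = 0 the nodes are left unchanged, which makes the role of the update strength easy to check.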
3. The updated text elements are extracted, and the autocorrelation of each initial text element is calculated and used as the constraint of the visual and initial text element alignment process between different picture and text pairs; meanwhile, the autocorrelation of the initial text elements is used to measure the similarity between the whole text and the whole picture, and a retrieval ranking result is generated according to the similarity.
The previous step yields the updated visual and text elements {v̂_1, ..., v̂_n} and {â_1, ..., â_k}, with subscripts as feature numbers. The updated text elements â_j are extracted to calculate the autocorrelation of each initial text element e_j:

r_j = (e_j · â_j) / (‖e_j‖ ‖â_j‖)
Since â_j is computed from the elements V = {v_1, v_2, ..., v_n} of the whole picture, the autocorrelation r_j^- of an initial text element calculated based on a picture that does not match the text is smaller than the autocorrelation r_j^+ calculated based on the matching picture I, i.e., r_j^- < r_j^+. Based on this inequality, the present invention considers graph structure modeling between different picture and text pairs, so that the text element e_j is better aligned with the visual elements in the truly matching picture I.
In the embodiment of the invention, a triplet is used to reflect the above inequality; each sentence t in a training mini-batch incurs the following loss:

l_t^R = Σ_{j=1}^{k} [θ + r_j(v_neg) − r_j(v_pos)]_+

where θ is the triplet loss margin, [x]_+ = max{0, x}, and v_pos and v_neg denote the visual elements of the picture I matching the sentence t and of a non-matching picture, respectively.
In the embodiment of the invention, the autocorrelations r_j of the text elements e_j are used to measure the similarity between the whole text T and the whole picture I, computed by average pooling:

S(I, T) = (1/k) Σ_{j=1}^{k} r_j
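Assuming a cosine form for the autocorrelation (the exact expression is only an equation image in the source), the average-pooled picture-text similarity can be sketched as:

```python
import numpy as np

def sentence_image_similarity(E, A_hat):
    """Autocorrelation r_j of each initial text element e_j with its
    updated counterpart a_hat_j (cosine form is an assumption),
    average-pooled into S(I, T) = (1/k) * sum_j r_j.

    E: (k, d) initial text elements; A_hat: (k, d) updated elements.
    """
    r = (E * A_hat).sum(axis=1) / (
        np.linalg.norm(E, axis=1) * np.linalg.norm(A_hat, axis=1))
    return r.mean()

rng = np.random.default_rng(2)
E = rng.normal(size=(3, 8))
s_same = sentence_image_similarity(E, E)   # perfectly aligned elements
```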
Likewise, the retrieval ranking objective is expressed using a triplet loss:

L_IT = Σ_{(I,T)} ( [γ + S(I, T̂) − S(I, T)]_+ + [γ + S(Î, T) − S(I, T)]_+ )

where (I, T) denotes a matching picture and text pair, and Î and T̂ denote the hardest non-matching samples for the picture I and the text T, respectively, i.e., Î = argmax_{I'≠I} S(I', T) and T̂ = argmax_{T'≠T} S(I, T'). The parameter γ is the triplet loss margin.
Finally, the total loss function of the whole model in the training stage is L = L_IT + η Σ_t l_t^R, where the hyper-parameter η regulates the weight between the two losses L_IT and Σ_t l_t^R.
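The two losses can be sketched as plain functions (names are illustrative; S(.) stands for the pooled similarity described above, and the hardest negatives are assumed to have been pre-selected):

```python
def triplet_ranking_loss(s_pos, s_hard_img, s_hard_txt, gamma=0.2):
    """Retrieval ranking loss for one matching pair (I, T):
    [gamma + S(I, T_hat) - S(I, T)]_+ + [gamma + S(I_hat, T) - S(I, T)]_+
    with the hardest non-matching picture/text similarities given."""
    return (max(0.0, gamma + s_hard_txt - s_pos)
            + max(0.0, gamma + s_hard_img - s_pos))

def total_loss(l_IT, l_tR_sum, eta=0.1):
    """Total training loss L = L_IT + eta * sum_t l_t^R."""
    return l_IT + eta * l_tR_sum
```

When the matching pair is already well separated from both hard negatives by the margin, the ranking loss is zero.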
In Fig. 1, the pictures that truly match the input text and those that do not are denoted by "+" and "−", respectively; v_i^+ and v_i^− are the visual elements of the corresponding pictures, a_i^+ and a_i^− are the text elements computed by the attention mechanism from the visual features of the corresponding pictures, and r_i^+ and r_i^− are the autocorrelations of the text elements described herein.
2. Testing stage.
In the test process, a user inputs a picture or a text to be retrieved; the invention calculates the similarity between the query content (picture or text) and all candidate contents (texts or pictures) in the database, and generates the final retrieval ranking according to the similarity.
If the input is a text to be retrieved, its text elements E = {e_1, e_2, ..., e_k} are extracted; then, separately for each picture in the database with visual elements V = {v_1, v_2, ..., v_n}, the corresponding a_1, ..., a_k are computed, the graph structure is constructed and updated to obtain â_1, ..., â_k, and finally the autocorrelations r_j are calculated, from which the similarity follows. Ranking by similarity gives the retrieval result (a ranked list of pictures).
If the input is a picture to be retrieved, its visual elements V = {v_1, v_2, ..., v_n} are extracted; then, separately for each text in the database with text elements E = {e_1, e_2, ..., e_k}, the a_1, ..., a_k are computed, the graph structure is constructed and updated to obtain â_1, ..., â_k, and finally the autocorrelations are calculated; the similarity is computed and ranked to obtain the retrieval result (a ranked list of texts).
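The test-stage procedure, scoring every database candidate against the query and sorting, can be sketched as (`query_sim_fn` is an illustrative stand-in for the full attention / graph-update / pooling pipeline):

```python
import numpy as np

def rank_candidates(query_sim_fn, candidates):
    """Rank database candidates by similarity to the query, descending,
    as in the test stage; similarities are computed one candidate at a
    time. query_sim_fn is an illustrative stand-in for the pipeline."""
    scores = [query_sim_fn(c) for c in candidates]
    order = np.argsort(scores)[::-1]
    return [(candidates[i], scores[i]) for i in order]

# toy demo: the "most similar" candidate is the one closest to 5
ranked = rank_candidates(lambda c: -abs(c - 5), [1, 5, 9])
```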
According to the scheme provided by the embodiment of the invention, an attention mechanism is applied to represent the fine-grained visual and text elements extracted from pictures and texts, so that picture-text similarity can be better calculated; the graph structure is adaptively constructed from the visual and text elements and the features are updated by graph convolution, so that the intra-modal and inter-modal relations of visual and text elements can be better considered; and a constraint mechanism is introduced into the alignment process of visual and text elements between different picture and text pairs, which helps fine-grained text elements correspond to their matching picture regions, further improving the reliability of picture-text similarity calculation and thus the accuracy of picture and text retrieval.
From the description of the above embodiments, it will be apparent to those skilled in the art that the above embodiments may be implemented in software, or may be implemented by means of software plus a necessary general hardware platform. With such understanding, the technical solutions of the foregoing embodiments may be embodied in a software product, where the software product may be stored in a nonvolatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.), and include several instructions for causing a computer device (may be a personal computer, a server, or a network device, etc.) to perform the methods of the embodiments of the present invention.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (4)

1. An image and text retrieval method based on graph neural network structure modeling is characterized by comprising the following steps:
training phase: extracting visual elements and initial text elements of a single picture and text pair, and introducing an attention mechanism to re-represent each text element; taking visual elements of single picture and text pairs and re-represented text elements as nodes, adaptively constructing a graph structure, and updating each node by utilizing a graph convolution method; calculating the autocorrelation of each initial text element by combining the updated text elements, and making alignment constraints of visual elements and text elements in different pictures and texts; meanwhile, the autocorrelation of the initial text elements is converged to measure the similarity of the whole text and the whole picture, so that a retrieval ordering result is generated according to the similarity; constructing a total loss function by using the loss of the element alignment process and the loss function of the retrieval ordering;
testing: for an input picture to be detected, extracting a corresponding visual element, combining text data in a database, and calculating the autocorrelation of an initial text element of each text data in the same way as a training stage, so as to calculate the similarity of each text data and the picture to be detected; extracting corresponding initial text elements for an input text to be detected, combining picture data in a database, and calculating the autocorrelation of the initial text elements of the text to be detected in the same way as a training stage, so as to calculate the similarity of the text to be detected and the picture; sorting according to the similarity to obtain a retrieval result;
extracting visual elements and initial text elements of a single picture and text pair includes:
for a given picture I, extracting the visual features F = {f_1, f_2, ..., f_n} of multiple regions of the picture I using Fast R-CNN, and then mapping the features F to the embedding space with a fully connected layer, denoted V = {v_1, v_2, ..., v_n}, where n is the number of regions or targets with explicit semantic information in the picture;
for the text T, first representing each word in the sentence by an embedding vector and then mapping it to the embedding space by a bi-GRU, denoted E = {e_1, e_2, ..., e_k}, where k is the number of words;
introducing the attention mechanism to re-represent each text element includes:
re-representing the initial text element e_j as:

a_j = Σ_{i=1}^{n} α_ij v_i,  j = 1, ..., k

where α_ij is the attention coefficient between the initial text element e_j and the visual element v_i, calculated from the similarity matrix between V = {v_1, v_2, ..., v_n} and E = {e_1, e_2, ..., e_k};
taking the visual elements and the re-represented text elements {v_1, ..., v_n, a_1, ..., a_k} of a single picture and text pair as input to adaptively construct the graph structure, where {v_1, ..., v_n} are the visual elements of the picture, {a_1, ..., a_k} are the re-represented text elements, and the subscript is the element number;
each element {v_1, ..., v_n, a_1, ..., a_k} serves as a node, the cosine similarity b_pq between nodes serves as the edge weight, and for any node t_p only the M edges with the largest cosine similarity are retained, namely M(t_p) = top-M(b_pq), with t_p = v_i or t_p = a_j, j = 1, ..., k, i = 1, ..., n;
updating the features of each node by graph convolution, namely:

t̂_p = t_p + β Σ_{t_q ∈ M(t_p)} b_pq t_q

where the parameter β is the update strength, and the updated elements are denoted {v̂_1, ..., v̂_n, â_1, ..., â_k}.
2. The image and text retrieval method based on graph neural network structure modeling of claim 1, wherein
the updated text elements are denoted {â_1, ..., â_k}, the subscript being the text element number, and the autocorrelation of each initial text element e_j is calculated as:

r_j = (e_j · â_j) / (‖e_j‖ ‖â_j‖)

wherein, since â_j is computed from the features V = {v_1, v_2, ..., v_n} of the whole picture, the autocorrelation r_j^- of an initial text element calculated based on a picture that does not match the text is smaller than the autocorrelation r_j^+ calculated based on the matching picture I, i.e., r_j^- < r_j^+;
a triplet is used to reflect the above inequality, and each sentence t in each training minimum batch incurs the following loss:

l_t^R = Σ_{j=1}^{k} [θ + r_j(v_neg) − r_j(v_pos)]_+

where θ is the triplet loss margin, [x]_+ = max{0, x}, and v_pos and v_neg denote the visual elements of the picture I matching the sentence t and of a non-matching picture, respectively.
3. The image and text retrieval method based on graph neural network structure modeling of claim 2, wherein the autocorrelations r_j of the initial text elements are used to measure the similarity between the whole text T and the whole picture I, calculated by average pooling:

S(I, T) = (1/k) Σ_{j=1}^{k} r_j

wherein I denotes the picture and T denotes the text.
4. The image and text retrieval method based on graph neural network structure modeling of claim 3, wherein the retrieval ranking objective is expressed using a triplet loss:

L_IT = Σ_{(I,T)} ( [γ + S(I, T̂) − S(I, T)]_+ + [γ + S(Î, T) − S(I, T)]_+ )

wherein (I, T) denotes a matching picture and text pair, Î and T̂ denote the hardest non-matching samples for the picture I and the text T, respectively, and the parameter γ is the triplet loss margin;
the total loss function for the training phase is: L = L_IT + η Σ_t l_t^R, where η is a weight factor.
CN202010104275.5A 2020-02-20 2020-02-20 Image and text retrieval method based on graph neural network structure modeling Active CN111324752B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010104275.5A CN111324752B (en) 2020-02-20 2020-02-20 Image and text retrieval method based on graph neural network structure modeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010104275.5A CN111324752B (en) 2020-02-20 2020-02-20 Image and text retrieval method based on graph neural network structure modeling

Publications (2)

Publication Number Publication Date
CN111324752A CN111324752A (en) 2020-06-23
CN111324752B true CN111324752B (en) 2023-06-16

Family

ID=71165326

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010104275.5A Active CN111324752B (en) 2020-02-20 2020-02-20 Image and text retrieval method based on graphic neural network structure modeling

Country Status (1)

Country Link
CN (1) CN111324752B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11475059B2 (en) * 2019-08-16 2022-10-18 The Toronto-Dominion Bank Automated image retrieval with graph neural network
CN111914113B (en) * 2020-08-07 2024-06-28 大连理工大学 Image retrieval method and related device
CN112417097B (en) * 2020-11-19 2022-09-16 中国电子科技集团公司电子科学研究院 Multi-modal data feature extraction and association method for public opinion analysis
CN113377990B (en) * 2021-06-09 2022-06-14 电子科技大学 Video/picture-text cross-modal matching training method based on meta-self learning
CN113469197B (en) * 2021-06-29 2024-03-22 北京达佳互联信息技术有限公司 Image-text matching method, device, equipment and storage medium
CN114067215B (en) * 2022-01-17 2022-04-15 东华理工大学南昌校区 Remote sensing image retrieval method based on node attention machine mapping neural network
CN114443904B (en) * 2022-01-20 2024-02-02 腾讯科技(深圳)有限公司 Video query method, device, computer equipment and computer readable storage medium
CN115730878B (en) * 2022-12-15 2024-01-12 广东省电子口岸管理有限公司 Cargo import and export checking management method based on data identification
CN116226434B (en) * 2023-05-04 2023-07-21 浪潮电子信息产业股份有限公司 Multi-element heterogeneous model training and application method, equipment and readable storage medium
CN116361502B (en) * 2023-05-31 2023-08-01 深圳兔展智能科技有限公司 Image retrieval method, device, computer equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013004093A (en) * 2011-06-16 2013-01-07 Fujitsu Ltd Search method and system by multi-instance learning
CN107330100A (en) * 2017-07-06 2017-11-07 北京大学深圳研究生院 Combine the two-way search method of image text of embedded space based on multi views
WO2017210949A1 (en) * 2016-06-06 2017-12-14 北京大学深圳研究生院 Cross-media retrieval method
CN108132968A (en) * 2017-12-01 2018-06-08 西安交通大学 Network text is associated with the Weakly supervised learning method of Semantic unit with image
CN109255047A (en) * 2018-07-18 2019-01-22 西安电子科技大学 Based on the complementary semantic mutual search method of image-text being aligned and symmetrically retrieve
CN109992686A (en) * 2019-02-24 2019-07-09 复旦大学 Based on multi-angle from the image-text retrieval system and method for attention mechanism
CN110555121A (en) * 2019-08-27 2019-12-10 清华大学 Image hash generation method and device based on graph neural network
CN110717498A (en) * 2019-09-16 2020-01-21 腾讯科技(深圳)有限公司 Image description generation method and device and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572651B (en) * 2013-10-11 2017-09-29 华为技术有限公司 Picture sort method and device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Liu Chang; Zhou Xiangdong; Shi Bole. Text description method based on an image semantic similarity network. Computer Applications and Software. 2018, (01), full text. *
Qi Jinwei; Peng Yuxin; Yuan Yuxin. Hierarchical recurrent attention network model for cross-media retrieval. Journal of Image and Graphics. 2018, (11), full text. *

Also Published As

Publication number Publication date
CN111324752A (en) 2020-06-23

Similar Documents

Publication Publication Date Title
CN111324752B (en) Image and text retrieval method based on graphic neural network structure modeling
KR101778679B1 (en) Method and system for classifying data consisting of multiple attribues represented by sequences of text words or symbols using deep learning
US8027977B2 (en) Recommending content using discriminatively trained document similarity
US9846836B2 (en) Modeling interestingness with deep neural networks
US8787683B1 (en) Image classification
US9009134B2 (en) Named entity recognition in query
US8538898B2 (en) Interactive framework for name disambiguation
CN111190997B (en) Question-answering system implementation method using neural network and machine learning ordering algorithm
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
CN111858940B (en) Multi-head attention-based legal case similarity calculation method and system
CN111539197A (en) Text matching method and device, computer system and readable storage medium
WO2023134082A1 (en) Training method and apparatus for image caption statement generation module, and electronic device
US20120158716A1 (en) Image object retrieval based on aggregation of visual annotations
CN111753167A (en) Search processing method, search processing device, computer equipment and medium
US20100138414A1 (en) Methods and systems for associative search
CN113360646A (en) Text generation method and equipment based on dynamic weight and storage medium
CN112347758A (en) Text abstract generation method and device, terminal equipment and storage medium
CN112805715A (en) Identifying entity attribute relationships
CN113761890A (en) BERT context sensing-based multi-level semantic information retrieval method
CN111666376A (en) Answer generation method and device based on paragraph boundary scan prediction and word shift distance cluster matching
CN110569355B (en) Viewpoint target extraction and target emotion classification combined method and system based on word blocks
CN116975271A (en) Text relevance determining method, device, computer equipment and storage medium
Vaissnave et al. Modeling of automated glowworm swarm optimization based deep learning model for legal text summarization
CN113516094A (en) System and method for matching document with review experts
CN113535928A (en) Service discovery method and system of long-term and short-term memory network based on attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant