CN112256904A - Image retrieval method based on visual description sentences - Google Patents

Image retrieval method based on visual description sentences

Info

Publication number
CN112256904A
CN112256904A (application CN202010998165.8A)
Authority
CN
China
Prior art keywords
image
visual
map
statement
reward
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010998165.8A
Other languages
Chinese (zh)
Inventor
聂为之
李杰思
刘安安
徐宁
张勇东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202010998165.8A priority Critical patent/CN112256904A/en
Publication of CN112256904A publication Critical patent/CN112256904A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The invention discloses an image retrieval method based on visual description sentences, which comprises the following steps: based on a graph convolutional deep learning network, an information transfer mode for the nodes and edges of the visual knowledge graph representation is constructed to aggregate and update the features of each semantic unit; the aggregated and updated semantic unit features in the graph are encoded by a multi-layer long short-term memory network combined with an attention mechanism to generate image description sentences; under a reinforcement learning framework, reward-penalty functions based on the image description sentence are designed using the CIDEr score and graph similarity, and are used for feedback regulation and optimization of the "image-graph", "graph-sentence", and "image-graph-sentence" processes, so that finer-grained visual description sentences of the image are obtained and used for retrieval, and the target retrieval image corresponding to a query image is output. The invention improves the feasibility of text-based image retrieval on large-scale data sets.

Description

Image retrieval method based on visual description sentences
Technical Field
The invention relates to the field of image retrieval, in particular to an image retrieval method based on visual description sentences.
Background
In recent years, image retrieval has been a research focus in the field of computer vision, among which content-based image retrieval methods are the most popular [1][2]. However, content-based image retrieval methods mainly rely on low-level visual features of images, such as color, shape, and texture [3][4][5]; they cannot capture the high-level semantic information of the image, which conflicts with the way people usually judge image similarity, namely according to its semantic content. As a result, most content-based image retrieval systems cannot fully reflect or match the query intent, and a semantic gap exists between low-level features and high-level understanding. To reduce this semantic gap, handling content-based image retrieval by matching the visual elements of an image in natural-language form has attracted the attention of researchers [6][7], in which the visual description sentences of an image play a key role. A visual description sentence expands the image representation from a small set of category labels or keywords to a detailed sentence; it contains richer high-level semantic features of the image and can improve retrieval precision through longer and more targeted queries. However, efficiently and accurately generating multiple visual description sentences for an image remains a major challenge: it requires not only deep understanding and modeling of the image content, but also processing of the natural description language (including words, phrases, and sentences) to match the corresponding image content. Recently, many researchers have applied visual knowledge-graph theory to the generation of visual description sentences for images [8][9]. However, most of this work stays at the feed-forward deep modeling stage, i.e., a unidirectional modeling process of first encoding the "image" into a "graph" and then decoding the "graph" into a "sentence". Because image description is subjective, a visual knowledge graph can often be parsed into a variety of description sentences, so a feed-forward deep modeling algorithm alone cannot satisfy the requirement of diverse visual parsing [10].
Disclosure of Invention
The invention provides an image retrieval method based on visual description sentences. Unlike existing methods that perform image retrieval relying only on low-level visual features of the image, the invention performs retrieval using generated visual description sentences of the image. This can effectively reduce the semantic gap in image retrieval, better support refined and complex image queries, and automatically generate the corresponding image description, avoiding the drawbacks of manual labeling and improving the feasibility of text-based image retrieval on large-scale data sets. The method is described in detail as follows:
an image retrieval method based on visual descriptive sentences, the method comprising the steps of:
constructing, based on a graph convolutional deep learning network, an information transfer mode for the nodes and edges of the visual knowledge graph representation, so as to aggregate and update the features of each semantic unit;
encoding the aggregated and updated semantic unit features in the graph with a multi-layer long short-term memory network combined with an attention mechanism, so as to generate an image description sentence;
under a reinforcement learning framework, designing reward-penalty functions based on the image description sentence from the CIDEr score and the graph similarity, which are used for feedback regulation and optimization of the "image-graph" process, the "graph-sentence" process, and the "image-graph-sentence" process; obtaining finer-grained visual description sentences of the image, performing retrieval with them, and outputting the target retrieval image corresponding to the query image.
The reward-penalty functions are specifically as follows:

R_i(ĉ) = ω_c · CIDEr(ĉ) + ω_s · s_i,   i = 1, 2, 3

wherein ω_c and ω_s are trainable fusion weights; ĉ denotes the predicted description sentence of the image and CIDEr(ĉ) is the CIDEr score of the predicted description sentence; s_i ∈ {s_1, s_2, s_3} are the 3 similarity scores characterizing graph similarity; R_1, R_2 and R_3 are the 3 reward-penalty functions, where R_1 is used for optimization of the "image-graph-sentence" process, R_2 for individual optimization of the "image-graph" process, and R_3 for individual optimization of the "graph-sentence" process.
Further, the method further comprises:
taking the reward-penalty function as the reward mechanism and updating the network parameters with the policy gradient in reinforcement learning, the gradient of the loss function L_RL(θ) with respect to the parameter θ is calculated as follows:

∇_θ L_RL(θ) = -E_{ŷ~p_θ}[ R(ŷ) · ∇_θ log p_θ(ŷ) ]

wherein E_{ŷ~p_θ}[·] denotes the expectation of the reward objective function, R(ŷ) is the reward-penalty function defined above, and p_θ(ŷ) is the score (probability) assigned by the model to the sampled sentence ŷ.
The method further comprises: detecting visual entities and visual relations from the input image, and constructing the visual knowledge graph representation corresponding to the image.
The technical solution provided by the invention has the following beneficial effects:
1. The invention performs retrieval with the visual description sentences of images, uses a visual knowledge graph to better represent image semantics, introduces a feedback regulation mechanism, and uses reinforcement learning to improve the accuracy and diversity of the description sentences, thereby effectively reducing the semantic gap in image retrieval and improving retrieval efficiency and precision;
2. The method automatically generates the corresponding image descriptions without manual annotation, avoiding the drawbacks of manual labeling and improving the feasibility of text-based image retrieval on large-scale data sets.
Drawings
FIG. 1 is a flow chart of the image retrieval method based on visual description sentences;
FIG. 2 is a schematic diagram of the visual knowledge graph of an image;
FIG. 3 is a schematic diagram of the two-layer long short-term memory (LSTM) network encoding.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
To solve the problems described in the background art: cognitive neuroscience shows that the human visual system is formed by a large number of feed-forward and feedback connections, whereas content-based image retrieval is usually based on low-level visual feature representations, which cannot fully reflect or match the query intent and leave a semantic gap in retrieval. Therefore, the invention introduces a feedback regulation mechanism into the visual-graph-based description model, jointly explores feedback-type deep modeling optimization of the sentence-to-graph process, the graph-to-image process, and the overall process, and performs complex image retrieval based on the generated visual description sentences of the image. This effectively reduces the semantic gap in image retrieval, realizes accurate and diverse generation of visual description sentences for an image, and the sentences are further used for image retrieval, improving retrieval efficiency and accuracy and better supporting refined and complex image queries.
Example 1
An image retrieval method based on visual descriptive sentences, referring to fig. 1, the method comprises the following steps:
step 101: detecting visual entities and visual relations from the input image, and constructing the visual knowledge graph representation corresponding to the image;
step 102: based on a graph convolutional deep learning network, constructing an information transfer mode for the nodes and edges of the visual knowledge graph representation, so as to aggregate and update the features of each semantic unit;
step 103: encoding the aggregated and updated semantic unit features in the graph with a multi-layer long short-term memory network combined with an attention mechanism, so as to generate an image description sentence;
step 104: under a reinforcement learning framework, designing 3 reward-penalty functions based on the image description sentence from the CIDEr score and the graph similarity, which are respectively used for feedback regulation and optimization of the "image-graph" process, the "graph-sentence" process, and the overall "image-graph-sentence" process, obtaining finer-grained visual description sentences of the image;
step 105: performing retrieval based on the fine-grained visual description sentences to obtain the target retrieval image corresponding to the query image.
Example 2
The scheme of example 1 is further described below with reference to specific calculation formulas and examples, which are described in detail below:
201: the knowledge graph is represented as a tuple G = (N, E), where N and E are the sets of nodes and edges, respectively;
N contains three types of nodes: entity nodes o, attribute nodes a, and relation nodes r. To obtain the node representations, entities are first detected and classified with the Faster R-CNN (faster region-based convolutional neural network) object detector. Faster R-CNN consists of convolutional layers, an RPN (region proposal network) layer, a RoI Pooling (region-of-interest pooling) layer, and a classification-and-regression layer: the features of the input image are extracted by several convolutional layers and fed into the RPN, which is trained to generate a number of candidate region boxes; the region boxes are combined with the image features extracted by the convolutional layers, and after RoI Pooling, region-box classification and position regression are performed in the classification-and-regression layer to obtain the final object class and the precise position of the detection box.
Using a pre-trained Faster R-CNN, the invention selects at least 10 and at most 100 entities for each image and extracts their features with RoI Pooling; the attribute classification result of each visual entity is computed with a fully connected layer and a Softmax function, and the relation classification result between visual entities is computed with the MOTIFS model [11], yielding the feature representations u_o, u_a and u_r of the entity, attribute and relation nodes. Further, with the information of each node known, let o_i denote the i-th entity, r_ij the relation between entities o_i and o_j, and a_{i,l} the l-th attribute of entity o_i. The edges in E are defined as follows: if object o_i possesses attribute a_{i,l}, a directed edge from o_i to a_{i,l} is established; if a relation triple <o_i - r_ij - o_j> exists, two directed edges are established, from o_i to r_ij and from r_ij to o_j.
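For illustration only, the following is a minimal sketch (not part of the original disclosure) of how the graph tuple G = (N, E) could be assembled once the entity, attribute and relation features have been produced by the detection pipeline described above; the class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class VisualKnowledgeGraph:
    entity_feats: Dict[int, list] = field(default_factory=dict)            # u_{o_i}
    attr_feats: Dict[Tuple[int, int], list] = field(default_factory=dict)  # u_{a_{i,l}}
    rel_feats: Dict[Tuple[int, int], list] = field(default_factory=dict)   # u_{r_ij}
    edges: List[Tuple[str, str]] = field(default_factory=list)             # directed edges in E

def build_graph(entities, attributes, relations):
    """entities: {i: feature}, attributes: {(i, l): feature},
    relations: {(i, j): feature} for detected triples <o_i - r_ij - o_j>."""
    g = VisualKnowledgeGraph(dict(entities), dict(attributes), dict(relations))
    for (i, l) in attributes:                     # directed edge o_i -> a_{i,l}
        g.edges.append((f"o{i}", f"a{i}_{l}"))
    for (i, j) in relations:                      # directed edges o_i -> r_ij -> o_j
        g.edges.append((f"o{i}", f"r{i}_{j}"))
        g.edges.append((f"r{i}_{j}", f"o{j}"))
    return g
```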
202: the proposed multi-modal graph convolution network specifically uses a four-function spatial graph convolution function fr、fs、faAnd foAnd calculating the complex semantic association among the nodes in the graph, wherein the four functions are all two-layer fc-ReLU structures with independent parameters, and carrying out context coding on the node characteristics of the graph to generate a semantic context-aware aggregation characteristic representation V.
The aggregated feature representation V contains three types of embeddings (embedded vectors): the relation embedding v_{r_ij} of relation node r_ij, the attribute embedding v_{a_i} of entity node o_i, and the entity embedding v_{o_i}.
Specifically, given a relation triple <o_i - r_ij - o_j> in the graph, the relation embedding v_{r_ij} is calculated as:

v_{r_ij} = f_r(u_{o_i}, u_{r_ij}, u_{o_j})   (1)

wherein u_{o_i} and u_{o_j} are the feature vector representations of entity nodes o_i and o_j respectively, and u_{r_ij} is the feature vector representation of the relation node r_ij between o_i and o_j.
Given all the attributes {a_{i,1}, ..., a_{i,N_{a_i}}} of entity node o_i in the graph, where N_{a_i} denotes the number of attributes possessed by o_i, the attribute embedding v_{a_i} is calculated as:

v_{a_i} = (1 / N_{a_i}) · Σ_{l=1}^{N_{a_i}} f_a(u_{o_i}, u_{a_{i,l}})   (2)

wherein u_{a_{i,l}} is the feature vector representation of the attribute node corresponding to o_i.
In the graph, because node o_i can act in a relation tuple either as the head entity ("subject") or as the tail entity ("object"), different functions are used to integrate the knowledge of all relation tuples containing o_i, and the entity embedding v_{o_i} is calculated as:

v_{o_i} = (1 / N_{r_i}) · [ Σ_{o_j ∈ sbj(o_i)} f_s(u_{o_i}, u_{r_ij}, u_{o_j}) + Σ_{o_k ∈ obj(o_i)} f_o(u_{o_k}, u_{r_ki}, u_{o_i}) ]   (3)

wherein, when node o_j ∈ sbj(o_i), o_j is the object and o_i is the subject; when node o_k ∈ obj(o_i), o_k is the subject and o_i is the object; N_{r_i} is the number of all relation triples in which o_i appears; sbj(o_i) is the node set for which o_i acts as the head entity, and obj(o_i) the node set for which o_i acts as the tail entity; f_s and f_o are the feature convolution transfer functions of the head entity and the tail entity respectively; u_{o_i}, u_{o_j} and u_{o_k} are the feature vector representations of entity nodes o_i, o_j and o_k, and u_{r_ij} and u_{r_ki} are the feature vector representations of the relation nodes r_ij (between o_i and o_j) and r_ki (between o_k and o_i).
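As an illustrative sketch only (the patent specifies the four functions only as two-layer fc-ReLU structures; the PyTorch modules, feature dimension d and argument ordering below are assumptions), equations (1)-(3) could be realized as follows:

```python
import torch
import torch.nn as nn

def fc_relu(in_dim, out_dim):
    """Two-layer fully connected block with ReLU; independent parameters per function."""
    return nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(),
                         nn.Linear(out_dim, out_dim), nn.ReLU())

class MultiModalGCN(nn.Module):
    def __init__(self, d=1024):
        super().__init__()
        self.f_r = fc_relu(3 * d, d)   # relation embedding, eq. (1)
        self.f_a = fc_relu(2 * d, d)   # attribute embedding, eq. (2)
        self.f_s = fc_relu(3 * d, d)   # o_i acting as subject, eq. (3)
        self.f_o = fc_relu(3 * d, d)   # o_i acting as object, eq. (3)

    def relation_embed(self, u_oi, u_rij, u_oj):                    # eq. (1)
        return self.f_r(torch.cat([u_oi, u_rij, u_oj], dim=-1))

    def attribute_embed(self, u_oi, u_attrs):                       # eq. (2); u_attrs: [N_a, d]
        pairs = torch.cat([u_oi.expand_as(u_attrs), u_attrs], dim=-1)
        return self.f_a(pairs).mean(dim=0)

    def entity_embed(self, u_oi, sbj_triples, obj_triples):         # eq. (3)
        """sbj_triples: list of (u_rij, u_oj) where o_i is the subject;
        obj_triples: list of (u_ok, u_rki) where o_i is the object."""
        msgs = [self.f_s(torch.cat([u_oi, r, o], dim=-1)) for r, o in sbj_triples]
        msgs += [self.f_o(torch.cat([o, r, u_oi], dim=-1)) for o, r in obj_triples]
        return torch.stack(msgs).mean(dim=0) if msgs else u_oi
```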
203: the following notation is defined to represent the operation of a long short-term memory (LSTM) network over a single time step:

h_t = LSTM(x_t, h_{t-1})   (4)

wherein x_t is the input vector of the LSTM and h_t is its output vector.
Specifically, first, at each time step t, the input word ω_t, the average-pooled feature v̄, and the previous output h_{t-1}^2 of the second-layer LSTM are concatenated so that maximum context information is collected as the input of the first-layer LSTM, as follows:

x_t^1 = [W_Σ ω_t ; v̄ ; h_{t-1}^2]   (5)

wherein W_Σ is the word embedding matrix for a vocabulary of size Σ, and v̄ is the result of average pooling over the aggregated feature representation V output by the multi-modal graph convolutional network in the previous step.
Further, at each time step t, a normalized attention distribution over the aggregated features V is generated according to the output h_t^1 of the first-layer LSTM, as follows:

a_{t,m} = ω_a^T tanh(W_v v_m + W_h h_t^1)
α_t = softmax(a_t)   (6)

wherein v_m is any one of the three types of embeddings in V, W_v, W_h and ω_a are learned parameters, and α_t is the resulting normalized attention distribution.
Based on the attention distribution, an attention-weighted sum over all embeddings in V is computed to obtain the new feature v̂_t = Σ_m α_{t,m} v_m. Then v̂_t and h_t^1 are combined as the input of the second-layer LSTM, as follows:

x_t^2 = [v̂_t ; h_t^1]   (7)
further, using the symbol y1:TTo represent a word sequence (y)1,...,yT) At each time step t, using the output of the second layer LSTM
Figure BDA0002693336820000065
The conditional distribution of possible output words is given by:
Figure BDA0002693336820000066
wherein, WpAnd bpAre the learning weights and biases. Finally, the distribution of the complete output sequence is calculated as the product of the conditional distributions:
Figure BDA0002693336820000067
wherein T is the total number of time steps.
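For illustration, the following is a minimal sketch of one decoding time step of this two-layer attention LSTM (equations (5)-(8)); it is not the original implementation, and the hidden sizes, module names and single-sample batch handling are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerAttnDecoder(nn.Module):
    def __init__(self, vocab_size, d=1024, h=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)     # word embedding matrix W_Sigma
        self.lstm1 = nn.LSTMCell(d + d + h, h)       # takes the input of eq. (5)
        self.lstm2 = nn.LSTMCell(d + h, h)           # takes the input of eq. (7)
        self.W_v = nn.Linear(d, h, bias=False)
        self.W_h = nn.Linear(h, h, bias=False)
        self.w_a = nn.Linear(h, 1, bias=False)
        self.W_p = nn.Linear(h, vocab_size)          # output projection of eq. (8)

    def step(self, word_t, V, state1, state2):
        """word_t: LongTensor [1]; V: [M, d] aggregated embeddings;
        state1/state2: (h, c) tuples of shape [1, h] for the two LSTM layers."""
        v_bar = V.mean(dim=0, keepdim=True)                          # average pooling of V
        x1 = torch.cat([self.embed(word_t), v_bar, state2[0]], dim=-1)
        h1, c1 = self.lstm1(x1, state1)                              # eq. (5)
        a = self.w_a(torch.tanh(self.W_v(V) + self.W_h(h1))).squeeze(-1)
        alpha = F.softmax(a, dim=0)                                  # eq. (6)
        v_hat = (alpha.unsqueeze(-1) * V).sum(dim=0, keepdim=True)   # attention-weighted sum
        h2, c2 = self.lstm2(torch.cat([v_hat, h1], dim=-1), state2)  # eq. (7)
        log_p = F.log_softmax(self.W_p(h2), dim=-1)                  # eq. (8)
        return log_p, (h1, c1), (h2, c2)
```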
204: first, the CIDEr score CIDEr(c_i, S_i) of the sentence predicted by the model is calculated, where c_i and S_i = {s_i1, ..., s_im} are respectively the candidate description sentence and the reference description sentences of image I_i;
Let h_k(s_ij) denote the number of times the n-gram w_k appears in the reference sentence s_ij, and h_k(c_i) the number of times it appears in the candidate sentence c_i. The TF-IDF weight g_k(s_ij) of each n-gram w_k is calculated as:

g_k(s_ij) = [ h_k(s_ij) / Σ_{w_l ∈ Ω} h_l(s_ij) ] · log( |I| / Σ_{I_p ∈ I} min(1, Σ_q h_k(s_pq)) )   (10)

wherein Ω is the set of all n-grams, I is the set of all images in the dataset, w_l is any n-gram in Ω, h_l(s_ij) is the number of times w_l appears in the reference sentence s_ij, I_p is an arbitrary image in the dataset, and h_k(s_pq) is the number of times w_k appears in the reference sentence s_pq corresponding to I_p.
For n-grams of length n, the CIDEr_n score is calculated from the average cosine similarity between the candidate sentence c_i and the reference sentences S_i:

CIDEr_n(c_i, S_i) = (1/m) · Σ_j [ g^n(c_i) · g^n(s_ij) ] / ( ||g^n(c_i)|| · ||g^n(s_ij)|| )   (11)

wherein m is the number of sentences contained in S_i, g^n(c_i) is the vector of TF-IDF weights of all n-grams of length n in the candidate sentence c_i, and g^n(s_ij) is the vector of TF-IDF weights of all n-grams of length n in the reference sentence s_ij.
The scores for n-grams of different lengths are weighted and summed to calculate the total CIDEr score, where ω_n is a trade-off parameter:

CIDEr(c_i, S_i) = Σ_{n=1}^{N} ω_n · CIDEr_n(c_i, S_i)   (12)
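A minimal, simplified sketch of this CIDEr-style scoring (equations (10)-(12)) is given below for illustration; it omits the stemming and length penalty of the full metric, and the function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counter of n-gram tuples in a token list."""
    return Counter(tuple(tokens[k:k + n]) for k in range(len(tokens) - n + 1))

def cider(candidate, references, all_refs, N=4, weights=None):
    """candidate: token list; references: list of token lists for this image;
    all_refs: list over the whole dataset, one list of reference token lists per image."""
    weights = weights or [1.0 / N] * N          # the trade-off parameters omega_n
    log_I = math.log(len(all_refs))             # log |I| term of eq. (10)
    total = 0.0
    for n in range(1, N + 1):
        # document frequency of each n-gram over all images (IDF denominator in eq. (10))
        df = Counter()
        for refs in all_refs:
            seen = set()
            for r in refs:
                seen.update(ngrams(r, n).keys())
            df.update(seen)

        def tfidf(counts):
            norm = sum(counts.values()) or 1
            return {g: (c / norm) * (log_I - math.log(max(df[g], 1)))
                    for g, c in counts.items()}

        g_c = tfidf(ngrams(candidate, n))
        sims = []
        for ref in references:                  # eq. (11): average cosine similarity
            g_r = tfidf(ngrams(ref, n))
            dot = sum(g_c.get(k, 0.0) * v for k, v in g_r.items())
            denom = (math.sqrt(sum(v * v for v in g_c.values()))
                     * math.sqrt(sum(v * v for v in g_r.values()))) or 1.0
            sims.append(dot / denom)
        total += weights[n - 1] * (sum(sims) / max(len(sims), 1))   # eq. (12)
    return total
```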
further, for matching image-sentence pairs, the features of two different modalities of the image and the sentence are mapped to the same space by using a knowledge graph for comparison. Knowledge-graph characterization G of known images1For model prediction description sentences and real description sentences of the images, through natural language processing, a sentence parser of Stanford is used for constructing corresponding knowledge graph representation G2And G3. Performing a map G1、G2And G3Comparing the nodes between every two, and calculating the node similarity to represent the similarity between the maps according to the following formula:
Figure BDA0002693336820000072
wherein the content of the first and second substances,
Figure BDA0002693336820000073
for node comparison function, sigma is sigmoid activation function, s1、s2And s3Are each G1And G2、G1And G3、G2And G3Normalized similarity score of (a).
The similarity scores are fused with the CIDEr score as follows:

R_i(ĉ) = ω_c · CIDEr(ĉ) + ω_s · s_i,   i = 1, 2, 3   (14)

wherein ω_c and ω_s are trainable fusion weights; ĉ denotes the predicted description sentence of the image and CIDEr(ĉ) is the CIDEr score of the predicted description sentence; s_i ∈ {s_1, s_2, s_3} are the 3 similarity scores defined above; R_1, R_2 and R_3 are the 3 reward-penalty functions, where R_1 is used for optimization of the "image-graph-sentence" process, R_2 for individual optimization of the "image-graph" process, and R_3 for individual optimization of the "graph-sentence" process.
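A minimal sketch of this fusion (equation (14)) is shown below for illustration; the weights omega_c and omega_s are the trainable fusion weights, passed in here as plain numbers, and the function name is hypothetical.

```python
def fused_rewards(cider_score, s1, s2, s3, omega_c=1.0, omega_s=1.0):
    """cider_score: CIDEr score of the predicted sentence; s1, s2, s3: graph
    similarity scores for (G1, G2), (G1, G3), (G2, G3); omega_c, omega_s: fusion weights.
    Returns (R1, R2, R3), the rewards used for the image-graph-sentence,
    image-graph and graph-sentence optimizations respectively."""
    return tuple(omega_c * cider_score + omega_s * s for s in (s1, s2, s3))
```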
In the optimization process, the reward-penalty function is used as the reward mechanism, and the policy gradient in reinforcement learning is used to update the network parameters; the gradient of the loss function L_RL(θ) with respect to the parameter θ is calculated as follows:

∇_θ L_RL(θ) = -E_{ŷ~p_θ}[ R(ŷ) · ∇_θ log p_θ(ŷ) ]   (15)

wherein E_{ŷ~p_θ}[·] denotes the expectation of the reward objective function, R(ŷ) is the reward-penalty function defined above, and p_θ(ŷ) is the score (probability) assigned by the model to the sampled sentence ŷ.
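For illustration, a minimal sketch of a policy-gradient loss consistent with equation (15) follows (a single sampled sentence, no reward baseline; whether the original training subtracts a baseline, as in self-critical sequence training, is not stated in the text).

```python
import torch

def rl_loss(log_probs, reward):
    """log_probs: tensor [T] of log p_theta(y_t | y_<t) for one sampled sentence;
    reward: scalar reward-penalty value R(y_hat) for that sentence.
    Minimizing -R * sum(log p) yields the gradient -R * grad log p_theta(y_hat),
    matching equation (15) with the expectation approximated by a single sample."""
    return -reward * log_probs.sum()
```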
205: let I = {i_1, i_2, ..., i_m} denote the dataset of the current image retrieval, and let c_q and c_{i_l} denote the visual description sentences of the query image q and of any candidate image i_l ∈ I. First, Word2vec is used to encode c_q and c_{i_l} into the corresponding vector representations V_q = {e_q1, e_q2, ..., e_qN} and V_{i_l} = {e_l1, e_l2, ..., e_lM}, where e_qn and e_lm are the word embeddings in c_q and c_{i_l} respectively. Further, using these encoded vector representations, the semantic similarity between images is measured by calculating the distance between the vectors; the smaller the distance, the more similar the images. Image retrieval thus becomes a distance metric over the embedding vector representations of the images' visual description sentences, defined as follows:
sim(q, i_l) = (1 / (N·M)) · Σ_{n=1}^{N} Σ_{m=1}^{M} ( e_qn · e_lm ) / ( ||e_qn|| · ||e_lm|| )   (16)

wherein N and M are the numbers of elements in the vector representations V_q and V_{i_l} respectively, e_qn ∈ V_q and e_lm ∈ V_{i_l}, and sim(q, i_l) scores the similarity between the query image and the candidate image.
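As an illustrative sketch (the averaged cosine form follows equation (16) as reconstructed above; the use of a Word2vec-style word-to-vector mapping and all function names are assumptions), retrieval can be ranked as follows:

```python
import numpy as np

def sentence_vectors(tokens, w2v):
    """tokens: word list; w2v: mapping word -> embedding vector.
    Assumes at least one token is in the Word2vec vocabulary."""
    return np.stack([w2v[t] for t in tokens if t in w2v])

def sentence_similarity(Vq, Vl):
    """Averaged pairwise cosine similarity between two sets of word embeddings, eq. (16)."""
    Vq = Vq / np.linalg.norm(Vq, axis=1, keepdims=True)
    Vl = Vl / np.linalg.norm(Vl, axis=1, keepdims=True)
    return float((Vq @ Vl.T).mean())

def retrieve(query_tokens, candidates, w2v, top_k=10):
    """candidates: list of (image_id, token_list) built from the dataset's generated
    visual description sentences; returns the top_k most similar images."""
    Vq = sentence_vectors(query_tokens, w2v)
    scored = [(img_id, sentence_similarity(Vq, sentence_vectors(toks, w2v)))
              for img_id, toks in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]
```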
After the similarity between the query image and all candidate images in the dataset has been calculated with the above formula, the candidate images are ranked according to their similarity scores, and the required number of target images with the highest similarity to the query image are retrieved.

References:
[1] Patel T, Gandhi S. A survey on context based similarity techniques for image retrieval[C]//2017 International Conference on Innovative Mechanisms for Industry Applications (ICIMIA). IEEE, 2017: 219-223.
[2] Khawandi S, Abdallah F, Ismail A. A survey on Image Indexing and Retrieval based on Content Based Image[C]//2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon). IEEE, 2019: 222-225.
[3] Mahamuni C V, Wagh N B. Study of CBIR methods for retrieval of digital images based on colour and texture extraction[C]//2017 International Conference on Computer Communication and Informatics (ICCCI). IEEE, 2017: 1-7.
[4] Zhou W, Li H, Tian Q. Recent advance in content-based image retrieval: A literature survey[J]. arXiv preprint arXiv:1706.06064, 2017.
[5] Narayan R, Reddy S C, Narayan L, et al. The Study of Approaches of Content Based Image Retrieval[J]. 2019.
[6] Wei X, Qi Y, Liu J, et al. Image retrieval by dense caption reasoning[C]//2017 IEEE Visual Communications and Image Processing (VCIP). IEEE, 2017: 1-4.
[7] Hoxha G, Melgani F, Demir B. Retrieving Images with Generated Textual Descriptions[C]//IGARSS 2019 - 2019 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 2019: 5812-5815.
[8] Li X, Jiang S. Know more say less: Image captioning based on scene graphs[J]. IEEE Transactions on Multimedia, 2019, 21(8): 2117-2130.
[9] Yang X, Tang K, Zhang H, et al. Auto-encoding scene graphs for image captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 10685-10694.
[10] Cao C. Computational modeling and application research of the feedback mechanism in deep convolutional neural networks[D]. University of Science and Technology of China, 2018.
[11] Zellers R, Yatskar M, Thomson S, et al. Neural motifs: Scene graph parsing with global context[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 5831-5840.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (4)

1. An image retrieval method based on visual descriptive sentences, characterized in that the method comprises the following steps:
constructing, based on a graph convolutional deep learning network, an information transfer mode for the nodes and edges of the visual knowledge graph representation, so as to aggregate and update the features of each semantic unit;
encoding the aggregated and updated semantic unit features in the graph with a multi-layer long short-term memory network combined with an attention mechanism, so as to generate an image description sentence;
under a reinforcement learning framework, designing reward-penalty functions based on the image description sentence from the CIDEr score and the graph similarity, which are used for feedback regulation and optimization of the "image-graph" process, the "graph-sentence" process, and the "image-graph-sentence" process; obtaining finer-grained visual description sentences of the image, performing retrieval with them, and outputting the target retrieval image corresponding to the query image.
2. The image retrieval method based on visual description sentences according to claim 1, wherein the reward-penalty functions are specifically:

R_i(ĉ) = ω_c · CIDEr(ĉ) + ω_s · s_i,   i = 1, 2, 3

wherein ω_c and ω_s are trainable fusion weights; ĉ denotes the predicted description sentence of the image and CIDEr(ĉ) is the CIDEr score of the predicted description sentence; s_i ∈ {s_1, s_2, s_3} are the 3 similarity scores characterizing graph similarity; R_1, R_2 and R_3 are the 3 reward-penalty functions, where R_1 is used for optimization of the "image-graph-sentence" process, R_2 for individual optimization of the "image-graph" process, and R_3 for individual optimization of the "graph-sentence" process.
3. An image retrieval method based on visual descriptive sentences according to claim 1 or 2, characterized in that the method further comprises:
taking the reward-penalty function as the reward mechanism and updating the network parameters with the policy gradient in reinforcement learning, the gradient of the loss function L_RL(θ) with respect to the parameter θ is calculated as follows:

∇_θ L_RL(θ) = -E_{ŷ~p_θ}[ R(ŷ) · ∇_θ log p_θ(ŷ) ]

wherein E_{ŷ~p_θ}[·] denotes the expectation of the reward objective function, R(ŷ) is the reward-penalty function defined above, and p_θ(ŷ) is the score (probability) assigned by the model to the sampled sentence ŷ.
4. The image retrieval method based on visual description sentences according to claim 1 or 2, characterized in that the method further comprises: detecting visual entities and visual relations from the input image, and constructing the visual knowledge graph representation corresponding to the image.
CN202010998165.8A 2020-09-21 2020-09-21 Image retrieval method based on visual description sentences Pending CN112256904A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010998165.8A CN112256904A (en) 2020-09-21 2020-09-21 Image retrieval method based on visual description sentences

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010998165.8A CN112256904A (en) 2020-09-21 2020-09-21 Image retrieval method based on visual description sentences

Publications (1)

Publication Number Publication Date
CN112256904A true CN112256904A (en) 2021-01-22

Family

ID=74231454

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010998165.8A Pending CN112256904A (en) 2020-09-21 2020-09-21 Image retrieval method based on visual description sentences

Country Status (1)

Country Link
CN (1) CN112256904A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108171283A (en) * 2017-12-31 2018-06-15 厦门大学 A kind of picture material automatic describing method based on structuring semantic embedding
CN111259724A (en) * 2018-11-30 2020-06-09 塔塔顾问服务有限公司 Method and system for extracting relevant information from image and computer program product
CN110674850A (en) * 2019-09-03 2020-01-10 武汉大学 Image description generation method based on attention mechanism
CN110991515A (en) * 2019-11-28 2020-04-10 广西师范大学 Image description method fusing visual context
CN111612103A (en) * 2020-06-23 2020-09-01 中国人民解放军国防科技大学 Image description generation method, system and medium combined with abstract semantic representation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Sun Xiaoling: "Research on Semantic Knowledge Extraction Methods for Domain-Specific Images", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112989088A (en) * 2021-02-04 2021-06-18 西安交通大学 Visual relation example learning method based on reinforcement learning
CN112989088B (en) * 2021-02-04 2023-03-21 西安交通大学 Visual relation example learning method based on reinforcement learning
CN114020779A (en) * 2021-10-22 2022-02-08 上海卓辰信息科技有限公司 Self-adaptive optimization retrieval performance database and data query method
CN114020779B (en) * 2021-10-22 2022-07-22 上海卓辰信息科技有限公司 Self-adaptive optimization retrieval performance database and data query method
CN114677580A (en) * 2022-05-27 2022-06-28 中国科学技术大学 Image description method based on self-adaptive enhanced self-attention network
CN114677580B (en) * 2022-05-27 2022-09-30 中国科学技术大学 Image description method based on self-adaptive enhanced self-attention network
CN117648444A (en) * 2024-01-30 2024-03-05 广东省华南技术转移中心有限公司 Patent clustering method and system based on graph convolution attribute aggregation
CN117648444B (en) * 2024-01-30 2024-04-30 广东省华南技术转移中心有限公司 Patent clustering method and system based on graph convolution attribute aggregation

Similar Documents

Publication Publication Date Title
Wang et al. Self-constraining and attention-based hashing network for bit-scalable cross-modal retrieval
CN110969020B (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN110598005B (en) Public safety event-oriented multi-source heterogeneous data knowledge graph construction method
CN108038122B (en) Trademark image retrieval method
CN104899253B (en) Towards the society image across modality images-label degree of correlation learning method
CN106845411B (en) Video description generation method based on deep learning and probability map model
Yin et al. Region search based on hybrid convolutional neural network in optical remote sensing images
CN112256904A (en) Image retrieval method based on visual description sentences
CN111881677A (en) Address matching algorithm based on deep learning model
CN111291556A (en) Chinese entity relation extraction method based on character and word feature fusion of entity meaning item
CN109271539A (en) A kind of image automatic annotation method and device based on deep learning
CN114398491A (en) Semantic segmentation image entity relation reasoning method based on knowledge graph
CN111324765A (en) Fine-grained sketch image retrieval method based on depth cascade cross-modal correlation
CN108170823B (en) Hand-drawn interactive three-dimensional model retrieval method based on high-level semantic attribute understanding
Zhang et al. Hierarchical scene parsing by weakly supervised learning with image descriptions
CN114461890A (en) Hierarchical multi-modal intellectual property search engine method and system
CN114612767A (en) Scene graph-based image understanding and expressing method, system and storage medium
CN113868448A (en) Fine-grained scene level sketch-based image retrieval method and system
Barman et al. A graph-based approach for making consensus-based decisions in image search and person re-identification
Astolfi et al. Syntactic pattern recognition in computer vision: A systematic review
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
Li et al. Caption generation from road images for traffic scene modeling
CN112699685A (en) Named entity recognition method based on label-guided word fusion
CN114969343B (en) Weak supervision text classification method combined with relative position information
Tian et al. Scene graph generation by multi-level semantic tasks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20210122