CN112989088A - Visual relation example learning method based on reinforcement learning - Google Patents

Visual relation example learning method based on reinforcement learning Download PDF

Info

Publication number
CN112989088A
CN112989088A
Authority
CN
China
Prior art keywords
visual
agent
search
action
instance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110152379.8A
Other languages
Chinese (zh)
Other versions
CN112989088B (en)
Inventor
杜友田
王航
王雪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202110152379.8A priority Critical patent/CN112989088B/en
Publication of CN112989088A publication Critical patent/CN112989088A/en
Application granted granted Critical
Publication of CN112989088B publication Critical patent/CN112989088B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

A visual relationship is usually represented as a triple <subject, predicate, object>, which contains two objects, subject and object, and the interaction predicate between them. Visual relationship learning is a bridge between low-level image perception tasks and high-level image cognition tasks, and belongs to the intermediate-level image understanding tasks. Visual relationship instance learning is the problem of determining the two object instances involved in each visual relationship, given an image and a corresponding set of visual relationships. This problem is modeled as a sequential decision process in which two agents search over the image with the subject and object instance search boxes, yielding a visual relationship instance learning method based on deep reinforcement learning. For a given test image and an associated visual relationship set, the instance boxes corresponding to the subject and the object in each visual relationship can be found quickly and accurately.

Description

Visual relation example learning method based on reinforcement learning
Technical Field
The invention belongs to the technical field of computer application, relates to deep learning, visual relation and reinforcement learning, and particularly relates to a visual relation example learning method based on reinforcement learning.
Background
A long-standing goal in the field of computer vision is to have an agent understand human natural language well enough to perform specific tasks in a visual environment. In current computer vision tasks, understanding of image content can be divided into the perception and cognition levels. The object detection task belongs to the perception level and learns the correspondence between low-level visual appearance and high-level text semantics in an image. However, to understand the content expressed by an image more completely, the interactive relationships between the objects in the image must also be learned, i.e., visual relationship learning, which belongs to understanding image content at the cognitive level.
There are a large number of images and associated texts on the Internet, from which sets of visual relationships describing image content can be extracted, and learning these visual relationships is essential for a thorough understanding of image content. In recent years, we have witnessed the widespread application of visual relationship learning in a series of image understanding tasks, including image description generation, image retrieval, image synthesis, scene graph generation, visual reasoning, and visual question answering. A visual relationship typically consists of two objects, a subject and an object, and an interaction between them, and is usually expressed as a triple <subject, predicate, object>, e.g., <person-ride-bike>. Visual relationship learning requires not only identifying the class and bounding box of each object in a given image, but also indicating the interaction between each pair of objects. That is, visual relationship learning is a bridge connecting low-level image perception tasks (object detection, image classification, etc.) and high-level image cognition tasks (image description generation, visual question answering, etc.), and belongs to the intermediate-level image understanding tasks. Visual relationship instance learning is the problem of determining the two object instances involved in each visual relationship, given an image and a corresponding set of visual relationships.
Existing visual relationship learning models can be divided into two categories: (1) joint models and (2) separation models. A joint model treats a visual relationship triple as one category and then learns a classifier. For example, Plummer et al. learn a CCA model based on features of different combinations of subjects, objects, and the joint regions between them, and then classify each visual relationship using a ranking SVM. However, since visual relationships usually exhibit a long-tailed distribution, joint models suffer from large scale and weak generalization. In addition, when the number of object classes is N and the number of interaction relation classes is K, the learning complexity is O(N²K). A separation model trains a separate classifier for each component of the visual relationship triple, reducing the learning complexity to O(N + K). Lu et al. predict the interaction between pairs of objects using the visual features of the object pairs and linguistic prior knowledge. Zhang et al. propose to treat the predicate as a translation vector between subject and object, i.e., s + p ≈ o, and then map the visual features of paired objects into a low-dimensional relationship space to construct the visual relationship classification model VTransE. Based on the spatial features and statistical dependencies among subjects, predicates, and objects, Dai et al. use a deep relational network to predict the visual relationships between pairs of objects. In addition, Xu et al. capture context information with a graph neural network and classify visual relationships by establishing an iterative message-passing model.
However, the above methods do not solve the problem of instance confusion in visual relationship learning: as shown in fig. 1, given an image and its associated set of visual relationships, how to correctly find and output the instance boxes of the two objects, subject and object, in each visual relationship. Since an image often contains multiple object instances belonging to the same specified class, an instance confusion problem arises in visual relationship learning.
Disclosure of Invention
In order to solve the problem of instance confusion in visual relationship learning, the invention provides a visual relationship instance learning method based on reinforcement learning. Based on a deep reinforcement learning framework, visual relationship instance learning is modeled as a sequential decision problem in which two agents, S-agent and O-agent, search the image for the two objects, subject and object, involved in each visual relationship; the states, actions, and rewards of this problem are defined. For a given test image and an associated visual relationship set, the model learned by this method can quickly and accurately find the instance boxes corresponding to the subject and the object in each visual relationship, thereby greatly improving the cognitive-level understanding of image content.
In order to achieve the purpose, the invention adopts the following technical means:
a visual relation example learning method based on reinforcement learning comprises the following steps:
step 1, inputting training set data, obtaining each image and its corresponding visual relationship set, and concatenating the following vectors to form a state vector: the visual features of the whole image, the visual features of the two object instance search boxes, the historical action vectors of the two agents, the spatial relationship features between the object instance search boxes, and the text features obtained by encoding the current visual relationship with the Skip-Thought language model, where the two agents are the S-agent used for searching the subject instance search box and the O-agent used for searching the object instance search box, and each historical action vector is formed by concatenating the action vectors executed at the past 10 moments;
step 2, at each moment, the S-agent and the O-agent respectively execute the transformation action aiming at the subject and the object instance search boxes, so as to generate the search box at the next moment, then obtain the corresponding reward, and judge whether the search is terminated;
step 3, storing the state at the current moment, the action taken at the current moment, the reward obtained, the state at the next moment, and a flag indicating whether the search has terminated into the experience replay pool;
and step 4, repeating steps 1-3 until the experience replay pool reaches the minimum samplable number, then randomly sampling a batch of samples from the pool, training the current Q networks of the S-agent and the O-agent and their parameters respectively, and at regular intervals updating the parameters of the target Q networks of the S-agent and the O-agent with the parameters of the corresponding current Q networks.
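The experience replay pool used in steps 3 and 4 can be sketched as follows. This is an illustrative sketch with hypothetical names (`ReplayPool`, `store`, `ready`, `sample`), not the patent's implementation:

```python
import random
from collections import deque

class ReplayPool:
    """Sketch of the experience replay pool: stores transitions
    (state, action, reward, next_state, done) and only allows sampling
    once a minimum number of transitions has accumulated."""
    def __init__(self, capacity, min_samples):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted
        self.min_samples = min_samples

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def ready(self):
        # Sampling is allowed only after the minimum samplable number is reached.
        return len(self.buffer) >= self.min_samples

    def sample(self, n):
        # Random mini-batch, as in step 4.
        return random.sample(self.buffer, n)

pool = ReplayPool(capacity=100, min_samples=5)
for t in range(6):
    pool.store(t, 0, 1.0, t + 1, False)
print(pool.ready())  # True: 6 transitions >= minimum of 5
```

Both agents draw their mini-batches from this shared pool; each trains its own current Q network.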
The visual relationship instance learning method based on reinforcement learning can be used to find each instance of a specified visual relationship in a visual environment, and the learned visual relationships and the corresponding instance boxes can also be applied to image-text tasks such as image-text retrieval, visual reasoning, and visual question answering.
Existing methods need to use the results of an object detection algorithm as candidate instance boxes for the two objects in each visual relationship, but object detection is usually error-prone, which can affect the performance of visual relationship learning to some extent. In the invention, the two agents search for the instance boxes of the two objects in each visual relationship starting from the upper-left corner of the image, and no erroneously labeled candidate instance boxes are used.
Existing methods must evaluate a large number of candidate instance boxes when generating object instance boxes with an object detection algorithm, causing unnecessary computational overhead. The invention aims to learn the optimal policy for searching for the correct instance box, i.e., the shortest path, and the resulting model can accurately find the two object instance boxes involved in each visual relationship of an image while evaluating the minimum number of candidate instance boxes.
The method is based on deep reinforcement learning and models the instance localization problem of the objects involved in each visual relationship of a given image as a sequential decision problem in which two agents search over the image: the S-agent searching for the subject and the O-agent searching for the object both start from the upper-left corner of the image and, at each moment, execute a specific action according to the current state until the search terminates and the instance boxes corresponding to the two objects, subject and object, are output.
When an image and a visual relation set are given, the method can quickly and accurately find two object example frames in each visual relation by the shortest search path.
Drawings
FIG. 1 gives an image and a corresponding set of visual relationships.
FIG. 2 is a block diagram of example learning of visual relationships, with solid black and white boxes representing example search boxes for S-agent and O-agent, respectively, at the current time, and dashed boxes representing the next search box to jump to after performing the action at the current time.
FIG. 3 is a diagram of the 9 predefined transformation actions; the solid box represents the search box at the current time and the dashed box the search box at the next time. Left and Right denote horizontal left and right movement, Up and Down denote vertical up and down movement, Bigger and Smaller are scaling operations, Taller and Fatter change the aspect ratio, and Terminate ends the search.
FIG. 4 shows instance boxes learned by visual relationship instance learning: given an image and its corresponding set of visual relationships, the instance boxes of the two objects in each visual relationship are eventually learned.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the drawings and examples.
The invention relates to a visual relation example learning method based on reinforcement learning, which comprises the following steps:
step 1, inputting training set data, obtaining each image and its corresponding visual relationship set, and concatenating the following vectors to form a state vector: the visual features of the whole image, the visual features of the two object instance search boxes, the historical action vectors of the two agents, the spatial relationship features between the object instance search boxes, and the text features obtained by encoding the current visual relationship with the Skip-Thought language model, where the two agents are the S-agent used for searching the subject instance search box and the O-agent used for searching the object instance search box, and each historical action vector is formed by concatenating the action vectors executed at the past 10 moments.
In the invention, S-agent and O-agent respectively start searching from the upper left corner of the current image.
Specifically, the image represents the interactive environment; the two agents continuously interact with this environment to form an episode, and each episode processes a different image. The state vector S_t is defined as:

S_t = [ v(I_t), v(b_t^s), h_t^s, v(b_t^o), h_t^o, w(e_t), f(b_t^s, b_t^o) ]

where v(I_t) is the visual feature vector of the image I_t being processed at the current time t; b_t^s = (x_t^s, y_t^s, w_t^s, h_t^s) is the instance search box of the subject at the current time, (x_t^s, y_t^s) being the coordinates of its upper-left corner and (w_t^s, h_t^s) its width and height; v(b_t^s) is the visual feature vector of the search box b_t^s; h_t^s is the historical action vector formed by concatenating the action vectors of the agent S-agent at the past 10 moments; b_t^o = (x_t^o, y_t^o, w_t^o, h_t^o) is the instance search box of the object at the current time, defined analogously; v(b_t^o) is the visual feature vector of the search box b_t^o; h_t^o is the historical action vector formed by concatenating the action vectors of the agent O-agent at the past 10 moments; w(e_t) is the semantic embedding vector, generated by the Skip-Thought language model, of the visual relationship e_t being processed at time t; and f(b_t^s, b_t^o) is the spatial-relationship feature vector between the two search boxes. This feature is built from the intersection b_t^s ∩ b_t^o and the union b_t^s ∪ b_t^o of the two instance search boxes as a 6-dimensional vector; because 6 dimensions are too few to adequately capture the slight differences between different spatial relationships, a GMM model is used to discretize the 6-dimensional vector into a 400-dimensional vector, which serves as the final spatial-relationship feature vector between the two instance search boxes.
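The concatenation of the state-vector components can be sketched as follows. The 400-dimensional discretized spatial feature and the 9-dimensional-action-by-10-step histories follow the text; the 4096-dimensional visual features and the 4800-dimensional text embedding are assumptions for illustration (4800 matches the public combine-skip Skip-Thought model, but the patent does not state the sizes):

```python
import numpy as np

def build_state(img_feat, subj_feat, subj_hist, obj_feat, obj_hist,
                text_feat, spatial_feat):
    """Concatenate the seven state components into one state vector s_t."""
    return np.concatenate([img_feat, subj_feat, subj_hist,
                           obj_feat, obj_hist, text_feat, spatial_feat])

img_feat  = np.zeros(4096)    # v(I_t): whole-image visual feature (size assumed)
subj_feat = np.zeros(4096)    # v(b_t^s): subject search-box feature (size assumed)
obj_feat  = np.zeros(4096)    # v(b_t^o): object search-box feature (size assumed)
subj_hist = np.zeros(9 * 10)  # h_t^s: last 10 nine-dimensional actions of S-agent
obj_hist  = np.zeros(9 * 10)  # h_t^o: last 10 nine-dimensional actions of O-agent
text_feat = np.zeros(4800)    # w(e_t): Skip-Thought embedding (size assumed)
spatial   = np.zeros(400)     # f(b_t^s, b_t^o): GMM-discretized spatial feature

s_t = build_state(img_feat, subj_feat, subj_hist, obj_feat, obj_hist,
                  text_feat, spatial)
print(s_t.shape)
```

Whatever the true component sizes, the state fed to each DQN is this single flat concatenation.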
And step 2, at each moment, the S-agent and the O-agent execute transformation actions for the subject and object instance search boxes respectively, each selecting and executing a specific action according to its current state, so that the instance search box at the current moment jumps to the instance search box at the next moment and the corresponding reward is obtained; the process then moves to the next state until the search terminates.
FIG. 2 is a framework diagram of visual relationship instance learning; the input is an image and a corresponding set of visual relationships, and the two agents S-agent and O-agent are two separate DQN networks.
Specifically, a transformation action is defined as a 9-dimensional vector in which an element of 1 means the corresponding action is performed and 0 means it is not; the 9 dimensions correspond to the following 9 actions: horizontal right movement (Right), horizontal left movement (Left), vertical up movement (Up), vertical down movement (Down), zoom in (Bigger), zoom out (Smaller), change the height ratio (Taller), change the width ratio (Fatter), and terminate the search (Terminate), as shown in fig. 3.
At each moment, the S-agent performs the selected transformation action, which changes the subject instance search box in steps of

Δw_t^s = α · w_t^s, Δh_t^s = α · h_t^s

Similarly, at each moment, the O-agent performs the selected transformation action, which changes the object instance search box in steps of

Δw_t^o = α · w_t^o, Δh_t^o = α · h_t^o

where α ∈ [0, 1] is a change parameter. For example, when the agent S-agent performs a Right action, the subject instance search box is transformed from (x_t^s, y_t^s, w_t^s, h_t^s) to (x_t^s + Δw_t^s, y_t^s, w_t^s, h_t^s). Here Δw_t^s and Δh_t^s are the width and height change steps of the subject instance search box, and Δw_t^o and Δh_t^o are the width and height change steps of the object instance search box.
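The nine transformation actions with α-proportional step sizes can be sketched as follows. The per-action update rules beyond the Right example are assumptions (the patent's formula images are not recoverable), following the common active-localization convention:

```python
# Apply one of the 9 predefined actions to a search box (x, y, w, h),
# where (x, y) is the upper-left corner. Step sizes are proportional to
# the box size via the change parameter alpha, as in the text.
def apply_action(box, action, alpha=0.2):
    x, y, w, h = box
    dw, dh = alpha * w, alpha * h          # Δw = α·w, Δh = α·h
    if action == "Right":
        x += dw
    elif action == "Left":
        x -= dw
    elif action == "Up":
        y -= dh
    elif action == "Down":
        y += dh
    elif action == "Bigger":               # grow around the center (assumed)
        x -= dw / 2; y -= dh / 2; w += dw; h += dh
    elif action == "Smaller":              # shrink around the center (assumed)
        x += dw / 2; y += dh / 2; w -= dw; h -= dh
    elif action == "Taller":               # change height ratio (assumed)
        y -= dh / 2; h += dh
    elif action == "Fatter":               # change width ratio (assumed)
        x -= dw / 2; w += dw
    elif action == "Terminate":
        pass                               # search ends; box is unchanged
    return (x, y, w, h)

print(apply_action((10, 10, 100, 50), "Right", alpha=0.2))  # (30.0, 10, 100, 50)
```

With α = 0.2 a Right action on a 100-pixel-wide box shifts it 20 pixels, matching the (x + Δw, y, w, h) example in the text.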
When the S-agent performs an action that moves the subject instance search box from the current box b_t^s to the next box b_{t+1}^s, the obtained reward r_t^s is defined as:

r_t^s = sign( IoU(b_{t+1}^s, g_s) − IoU(b_t^s, g_s) )

Similarly, when the O-agent performs an action that moves the object instance search box from the current box b_t^o to the next box b_{t+1}^o, the reward r_t^o is defined as:

r_t^o = sign( IoU(b_{t+1}^o, g_o) − IoU(b_t^o, g_o) )

where g_s denotes the ground-truth box of the subject instance, g_o denotes the ground-truth box of the object instance, sign(·) is the sign function, and IoU(·,·) denotes the intersection-over-union between two regions, i.e.

IoU(b, g) = area(b ∩ g) / area(b ∪ g)

In particular, the rewards r_t^s and r_t^o obtained by the S-agent and the O-agent after performing the terminate-search action are defined respectively as:

r_t^s = +η if IoU(b_t^s, g_s) ≥ τ, otherwise −η
r_t^o = +η if IoU(b_t^o, g_o) ≥ τ, otherwise −η

where η is the reward for terminating the action and τ is the IoU threshold for terminating the action.
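The sign reward and the termination reward above can be illustrated numerically. The function names and the values η = 3.0, τ = 0.5 are assumptions for the example; boxes use the (x, y, w, h) format:

```python
# Intersection-over-union of two axis-aligned boxes (x, y, w, h).
def iou(box_a, box_b):
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def sign(x):
    return (x > 0) - (x < 0)

def move_reward(prev_box, next_box, gt_box):
    """r_t = sign(IoU(b_{t+1}, g) - IoU(b_t, g)): +1 if the move improves
    overlap with the ground-truth box, -1 if it worsens it, 0 if unchanged."""
    return sign(iou(next_box, gt_box) - iou(prev_box, gt_box))

def terminate_reward(box, gt_box, eta=3.0, tau=0.5):
    """+eta if the final box overlaps ground truth by at least tau, else -eta.
    (eta and tau values here are assumed, not from the patent.)"""
    return eta if iou(box, gt_box) >= tau else -eta

gt = (0, 0, 10, 10)
print(move_reward((20, 20, 10, 10), (5, 5, 10, 10), gt))   # 1: moved closer
print(terminate_reward((1, 1, 10, 10), gt))                # 3.0: IoU above tau
```

The sign reward keeps the per-step signal bounded, so the shortest improving path accumulates the most reward.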
Step 3, storing the state transition at each moment, i.e., the state at the current moment, the action taken at the current moment, the reward obtained, the state at the next moment, and a flag indicating whether the search has terminated, into the experience replay pool. In the invention, the S-agent maintains two Q networks: the current Q network Q_s for action selection and the target Q network Q'_s for target-value calculation; the O-agent likewise maintains two Q networks: the current Q network Q_o for action selection and the target Q network Q'_o for target-value calculation.
And step 4, the experience replay pool stores historical data. Steps 1-3 are repeated until the amount of data in the replay pool reaches the minimum samplable number; a batch of samples is then randomly drawn from the pool to train the S-agent's current Q network Q_s and its parameters and the O-agent's current Q network Q_o and its parameters. At regular intervals, the parameters of the current Q network Q_s are used to update the S-agent's target Q network Q'_s, and the parameters of the current Q network Q_o are used to update the O-agent's target Q network Q'_o.
The results of visual relationship instance learning are shown in fig. 4; the many learned visual relationships and their two corresponding object instances can be further applied to high-level image understanding tasks such as visual question answering, visual reasoning, and image description generation.
In one embodiment of the invention, the training set consists of N samples {(I_i, E_i)}, i = 1, …, N, where E_i = {e_i^1, …, e_i^{m_i}} is the visual relationship set of the i-th image I_i and contains m_i visual relationships; e_i^j = <s_i^j, p_i^j, o_i^j> denotes the j-th visual relationship of image I_i, where s_i^j and o_i^j are the categories of the two interacting objects and p_i^j is the category of the interaction between them. The specific steps of this embodiment are as follows:
Step 1):
Initialization: the experience replay pool D with capacity M and minimum samplable number Z; the current Q networks Q_s and Q_o of the two agents S-agent and O-agent and their parameters θ_s and θ_o; the target Q networks Q'_s and Q'_o of the two agents and their parameters θ'_s and θ'_o; the number of iteration rounds T; the IoU threshold τ; the terminate-action reward η; the exploration rate ε; the change parameter α; the learning rate β; the discount factor γ; the number of samples n for batch gradient descent; and the target Q network parameter update frequency C.
Step 2):
At t = 1, starting from sample X_1, initialize the state S_1. With probability ε, the two agents randomly select actions a_t^s and a_t^o for the subject and object instance search boxes respectively; otherwise, each agent selects the current optimal action according to its model:

a_t^s = argmax_{a ∈ A_s} Q_s(S_t, a; θ_s)
a_t^o = argmax_{a ∈ A_o} Q_o(S_t, a; θ_o)

where A_s and A_o are the action sets available to the S-agent and the O-agent, respectively. The two agents each perform the selected actions a_t^s and a_t^o, thereby jumping to a new state S_{t+1}, obtaining the respective rewards r_t^s and r_t^o and the termination flag is_end_t; the transition (S_t, a_t^s, a_t^o, r_t^s, r_t^o, S_{t+1}, is_end_t) is stored into the experience replay pool D.
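The ε-greedy action selection in step 2) can be sketched as follows, with a plain dict standing in for the current Q network of one agent (names hypothetical):

```python
import random

# The 9 predefined transformation actions from the description.
ACTIONS = ["Right", "Left", "Up", "Down", "Bigger", "Smaller",
           "Taller", "Fatter", "Terminate"]

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action (exploration),
    otherwise the action with the highest current-Q value (exploitation).
    q_values: mapping from action name to estimated Q value."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(q_values, key=q_values.get)

q = {a: 0.0 for a in ACTIONS}
q["Bigger"] = 1.5
print(epsilon_greedy(q, epsilon=0.0))  # "Bigger": purely greedy choice
```

Each agent runs this selection independently over its own action set at every time step.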
Step 3):
Perform steps 1)-2) until the number of samples stored in the experience replay pool reaches the minimum samplable number Z, then sample n transitions (S_j, a_j^s, a_j^o, r_j^s, r_j^o, S_{j+1}, is_end_j), j = 1, 2, …, n, from the replay pool D. Compute the target value for the action performed by the S-agent:

y_j^s = r_j^s, if is_end_j is true
y_j^s = r_j^s + γ max_{a'} Q'_s(S_{j+1}, a'; θ'_s), otherwise

and update the parameters θ_s by back-propagating the gradient of the mean-squared-error loss

L_s = (1/n) Σ_{j=1}^n ( y_j^s − Q_s(S_j, a_j^s; θ_s) )²

through the neural network. Compute the target value for the action performed by the O-agent:

y_j^o = r_j^o, if is_end_j is true
y_j^o = r_j^o + γ max_{a'} Q'_o(S_{j+1}, a'; θ'_o), otherwise

and update the parameters θ_o with the mean-squared-error loss

L_o = (1/n) Σ_{j=1}^n ( y_j^o − Q_o(S_j, a_j^o; θ_o) )²

Every C rounds, update the parameters of the target Q networks: θ'_s ← θ_s, θ'_o ← θ_o.
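The target value and mean-squared-error loss in step 3) can be illustrated numerically. The toy rewards, Q values, and γ = 0.9 below are assumptions, with plain lists standing in for the target Q network's outputs:

```python
# y_j = r_j for terminal transitions, otherwise
# y_j = r_j + gamma * max_a Q'(S_{j+1}, a), with Q' the target network.
def target_value(reward, next_q_values, done, gamma=0.9):
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def mse_loss(targets, predictions):
    """Mean-squared error between targets y_j and current-Q predictions."""
    n = len(targets)
    return sum((y - q) ** 2 for y, q in zip(targets, predictions)) / n

# One sampled mini-batch of two transitions (toy numbers):
batch = [
    # (reward, target-net Q values at next state, done flag)
    (1.0, [0.2, 0.5, 0.1], False),
    (-3.0, [0.0, 0.0, 0.0], True),   # terminal: target is just the reward
]
targets = [target_value(r, nq, d) for r, nq, d in batch]
preds = [1.0, -2.0]                  # current-Q estimates Q(S_j, a_j)
print(targets)                       # approximately [1.45, -3.0]
print(mse_loss(targets, preds))
```

In the full method this loss is back-propagated through the current Q network, and every C rounds the target network parameters are overwritten with the current ones (θ' ← θ), which keeps the targets stable between syncs.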
step 4):
when the test data set is input, the two agents s-agent and o-agent output an example frame of subject and object corresponding to each visual relation in the image when the search is terminated.
The instance learning of given image-text relationships in images can be applied to many vision-related tasks. For example, a robot performing a navigation task in a visual environment that receives the instruction "send this cup to the woman standing in front of a car and holding an umbrella" must first find, in the visual environment, an instance of each of the visual relationships <woman-hold-umbrella> and <woman-stand in front of-car> before it can deliver the cup. In addition, the learned visual relationships and the corresponding instance boxes can be applied to image-text tasks such as image-text retrieval, visual reasoning, and visual question answering. For example, in image-text retrieval with the search text <person-playing-football>, images containing the two objects "person" and "football" with the interaction "playing" between them can be retrieved directly and quickly, instead of first searching for the two objects "person" and "football" separately and then filtering the results for the interaction.

Claims (5)

1. A visual relation example learning method based on reinforcement learning is characterized by comprising the following steps:
step 1, inputting training set data, obtaining each image and its corresponding visual relationship set, and concatenating the following vectors to form a state vector: the visual features of the whole image, the visual features of the two object instance search boxes, the historical action vectors of the two agents, the spatial relationship features between the object instance search boxes, and the text features obtained by encoding the current visual relationship with the Skip-Thought language model, where the two agents are the S-agent used for searching the subject instance search box and the O-agent used for searching the object instance search box, and each historical action vector is formed by concatenating the action vectors executed at the past 10 moments;
step 2, at each moment, the S-agent and the O-agent respectively execute the transformation action aiming at the subject and the object instance search boxes, so as to generate the search box at the next moment, then obtain the corresponding reward, and judge whether the search is terminated;
step 3, storing the state at the current moment, the action taken at the current moment, the reward obtained, the state at the next moment, and a flag indicating whether the search has terminated into the experience replay pool;
and step 4, repeating steps 1-3 until the experience replay pool reaches the minimum samplable number, randomly sampling a batch of samples from the pool, training the current Q networks of the S-agent and the O-agent and their parameters respectively, and at regular intervals updating the parameters of the target Q networks of the S-agent and the O-agent with the parameters of the corresponding current Q networks.
2. The reinforcement-learning-based visual relationship instance learning method according to claim 1, wherein in step 1 the state vector S_t is defined as:

S_t = [ v(I_t), v(b_t^s), h_t^s, v(b_t^o), h_t^o, w(e_t), f(b_t^s, b_t^o) ]

wherein v(I_t) is the visual feature vector of the image I_t being processed at the current time t; b_t^s = (x_t^s, y_t^s, w_t^s, h_t^s) is the instance search box of the subject at the current time, (x_t^s, y_t^s) being the coordinates of its upper-left corner and (w_t^s, h_t^s) its width and height; v(b_t^s) is the visual feature vector of the search box b_t^s; h_t^s is the historical action vector formed by concatenating the action vectors of the agent S-agent at the past 10 moments; b_t^o = (x_t^o, y_t^o, w_t^o, h_t^o) is the instance search box of the object at the current time, defined analogously; v(b_t^o) is the visual feature vector of the search box b_t^o; h_t^o is the historical action vector formed by concatenating the action vectors of the agent O-agent at the past 10 moments; w(e_t) is the semantic embedding vector, generated by the Skip-Thought language model, of the visual relationship e_t being processed at time t; and f(b_t^s, b_t^o) is the spatial-relationship feature vector between the two search boxes, a 6-dimensional vector built from the intersection b_t^s ∩ b_t^o and the union b_t^s ∪ b_t^o of the two instance search boxes.
3. The reinforcement-learning-based visual relationship instance learning method according to claim 2, wherein a GMM model is used to discretize the 6-dimensional vector f(b_t^s, b_t^o) into a 400-dimensional vector, which serves as the final spatial-relationship feature vector between the two instance search boxes.
4. The method according to claim 3, wherein in step 2 the transformation action is defined as a 9-dimensional vector: an element equal to 1 indicates that the corresponding action is performed, and 0 that it is not. The 9 dimensions correspond to the following 9 actions: horizontal movement right and left, vertical movement up and down, zooming in and out, changing the height ratio, changing the width ratio, and terminating the search.
At each time step, the S-agent performs the selected transformation action, which changes the width and height of the instance search box of the subject by
Δw_t^s = α · w_t^s,  Δh_t^s = α · h_t^s;
at each time step, the O-agent performs the selected transformation action, which changes the width and height of the instance search box of the object by
Δw_t^o = α · w_t^o,  Δh_t^o = α · h_t^o;
where α ∈ [0, 1] is the change parameter, Δw_t^s and Δh_t^s are the width and height changes of the subject's instance search box, and Δw_t^o and Δh_t^o are the width and height changes of the object's instance search box.
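A minimal sketch of the 9-action search-box transformation described in claim 4. The per-step changes Δw = α·w and Δh = α·h follow the claim; the action names and the convention of keeping the box centre fixed while resizing are illustrative, not from the patent.

```python
# The 9 actions of the transformation vector.
ACTIONS = (
    "move_right", "move_left", "move_up", "move_down",
    "zoom_in", "zoom_out", "change_height", "change_width", "terminate",
)

def apply_action(box, action, alpha=0.2):
    """Apply one transformation action to a box (x, y, w, h),
    with (x, y) the upper-left corner."""
    x, y, w, h = box
    dw, dh = alpha * w, alpha * h  # Δw = α·w, Δh = α·h
    if action == "move_right":
        x += dw
    elif action == "move_left":
        x -= dw
    elif action == "move_down":
        y += dh
    elif action == "move_up":
        y -= dh
    elif action == "zoom_in":          # grow, keeping the centre fixed
        x -= dw / 2; y -= dh / 2; w += dw; h += dh
    elif action == "zoom_out":         # shrink, keeping the centre fixed
        x += dw / 2; y += dh / 2; w -= dw; h -= dh
    elif action == "change_height":    # change the height ratio only
        y -= dh / 2; h += dh
    elif action == "change_width":     # change the width ratio only
        x -= dw / 2; w += dw
    # "terminate" leaves the box unchanged and ends the search
    return (x, y, w, h)
```

Both agents use the same action space; each step perturbs the box proportionally to its current size, so the search naturally takes coarse steps for large boxes and fine steps for small ones.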
5. The reinforcement-learning-based visual relationship instance learning method according to claim 4, wherein in step 2, when the S-agent performs an action that moves the instance search box of the subject from the current search box b_t^s to the next search box b_{t+1}^s, the reward obtained, r_t^s, is defined as:
r_t^s = sign( IOU(b_{t+1}^s, g_s) − IOU(b_t^s, g_s) );
when the O-agent performs an action that moves the instance search box of the object from the current search box b_t^o to the next search box b_{t+1}^o, the reward obtained, r_t^o, is defined as:
r_t^o = sign( IOU(b_{t+1}^o, g_o) − IOU(b_t^o, g_o) );
where g_s denotes the ground-truth box of the subject instance, g_o denotes the ground-truth box of the object instance, sign(·) is the sign function, and IOU(·,·) is the intersection-over-union of two regions, i.e.
IOU(b, g) = area(b ∩ g) / area(b ∪ g).
The rewards obtained by the S-agent and the O-agent after executing the terminate-search action, r_T^s and r_T^o, are defined respectively as:
r_T^s = +η if IOU(b_T^s, g_s) ≥ τ, and −η otherwise;
r_T^o = +η if IOU(b_T^o, g_o) ≥ τ, and −η otherwise;
where η is the reward for the termination action and τ is the IoU threshold of the termination action.
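The step and termination rewards of claim 5 can be sketched as follows. η = 3 and τ = 0.5 are illustrative defaults; the patent leaves both as parameters.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, w, h),
    with (x, y) the upper-left corner."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def step_reward(box_next, box_cur, gt):
    """Reward for a move/scale action: the sign of the change in IoU
    with the ground-truth box, sign(IOU(b_{t+1}, g) - IOU(b_t, g))."""
    return float(np.sign(iou(box_next, gt) - iou(box_cur, gt)))

def terminal_reward(box, gt, eta=3.0, tau=0.5):
    """+eta if the final box overlaps the ground truth by at least tau,
    -eta otherwise."""
    return eta if iou(box, gt) >= tau else -eta
```

The same functions serve both agents: the S-agent evaluates its boxes against g_s, the O-agent against g_o.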
CN202110152379.8A 2021-02-04 2021-02-04 Visual relation example learning method based on reinforcement learning Active CN112989088B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110152379.8A CN112989088B (en) 2021-02-04 2021-02-04 Visual relation example learning method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110152379.8A CN112989088B (en) 2021-02-04 2021-02-04 Visual relation example learning method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN112989088A true CN112989088A (en) 2021-06-18
CN112989088B CN112989088B (en) 2023-03-21

Family

ID=76346704

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110152379.8A Active CN112989088B (en) 2021-02-04 2021-02-04 Visual relation example learning method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN112989088B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113554129A * 2021-09-22 2021-10-26 Aerospace Hongkang Intelligent Technology (Beijing) Co., Ltd. Scene graph generation method and generation device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463881A * 2017-07-07 2017-12-12 Sun Yat-sen University Person image retrieval method based on deep reinforcement learning
WO2019035771A1 * 2017-08-17 2019-02-21 National University Of Singapore Video visual relation detection methods and systems
CN111783852A * 2020-06-16 2020-10-16 Beijing University of Technology Adaptive image description generation method based on deep reinforcement learning
US20200334545A1 * 2019-04-19 2020-10-22 Adobe Inc. Facilitating changes to online computing environment by assessing impacts of actions using a knowledge base representation
CN112256904A * 2020-09-21 2021-01-22 Tianjin University Image retrieval method based on visual description sentences

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CAO, QIANWEN ET AL.: "3-D Relation Network for visual relation recognition in videos", Neurocomputing *
DING, WENBO ET AL.: "Research progress of deep-learning-based visual relation detection methods", Science and Technology Innovation Herald *

Also Published As

Publication number Publication date
CN112989088B (en) 2023-03-21

Similar Documents

Publication Publication Date Title
CN110609891B (en) Visual dialog generation method based on context awareness graph neural network
CN109934261B (en) Knowledge-driven parameter propagation model and few-sample learning method thereof
CN109993102B (en) Similar face retrieval method, device and storage medium
US20160350653A1 (en) Dynamic Memory Network
Kaluri et al. An enhanced framework for sign gesture recognition using hidden Markov model and adaptive histogram technique.
US20200065560A1 (en) Signal retrieval apparatus, method, and program
CN110377707B (en) Cognitive diagnosis method based on depth item reaction theory
CN114495129B (en) Character detection model pre-training method and device
CN112527993A (en) Cross-media hierarchical deep video question-answer reasoning framework
US20200218932A1 (en) Method and system for classification of data
CN112668608A (en) Image identification method and device, electronic equipment and storage medium
CN115130591A (en) Cross supervision-based multi-mode data classification method and device
CN115270752A (en) Template sentence evaluation method based on multilevel comparison learning
CN112989088B (en) Visual relation example learning method based on reinforcement learning
CN113420552B (en) Biomedical multi-event extraction method based on reinforcement learning
CN113240033B (en) Visual relation detection method and device based on scene graph high-order semantic structure
CN111914949B (en) Zero sample learning model training method and device based on reinforcement learning
CN116452895B (en) Small sample image classification method, device and medium based on multi-mode symmetrical enhancement
CN114565804A (en) NLP model training and recognizing system
Wu et al. Question-driven multiple attention (dqma) model for visual question answer
CN113821610A (en) Information matching method, device, equipment and storage medium
CN115066690A (en) Search normalization-activation layer architecture
Voruganti Visual question answering with external knowledge
CN116129333B (en) Open set action recognition method based on semantic exploration
CN114936297B (en) Video question-answering method based on priori knowledge and object sensitivity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant