CN112989088A - Visual relation example learning method based on reinforcement learning - Google Patents

Visual relation example learning method based on reinforcement learning Download PDF

Info

Publication number
CN112989088A
CN112989088A
Authority
CN
China
Prior art keywords
visual
agent
search
action
instance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110152379.8A
Other languages
Chinese (zh)
Other versions
CN112989088B (en)
Inventor
杜友田
王航
王雪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202110152379.8A priority Critical patent/CN112989088B/en
Publication of CN112989088A publication Critical patent/CN112989088A/en
Application granted granted Critical
Publication of CN112989088B publication Critical patent/CN112989088B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

A visual relationship is usually represented as a triple <subject, predicate, object>, which contains two objects, subject and object, and the interaction predicate between them. Visual relationship learning is a bridge between low-level image perception tasks and high-level image cognition tasks, and belongs to the intermediate-level image understanding tasks. Visual relationship instance learning is the problem of determining the two object instances involved in each visual relationship, given an image and a corresponding set of visual relationships. This problem is modeled as a sequential decision process in which two agents search over the image with the subject and object instance search boxes, yielding a visual relationship instance learning method based on deep reinforcement learning. For a given test image and an associated visual relationship set, the instance boxes corresponding to the subject and the object in each visual relationship can be found quickly and accurately.

Description

Visual relation example learning method based on reinforcement learning
Technical Field
The invention belongs to the technical field of computer application, relates to deep learning, visual relation and reinforcement learning, and particularly relates to a visual relation example learning method based on reinforcement learning.
Background
A long-standing goal in the field of computer vision is to have an agent understand human natural language well enough to perform specific tasks in a visual environment. In current computer vision tasks, understanding of image content can be divided into the perception and cognition levels. The object detection task belongs to the perception level and learns the correspondence between low-level visual appearance and high-level text semantics in an image. However, to understand the content expressed by an image more completely, the interactive relationships between the objects in the image must also be learned, i.e., visual relationship learning, which belongs to understanding image content at the cognitive level.
There are a large number of images and associated texts on the Internet, from which sets of visual relationships describing image content can be extracted, and learning these visual relationships is essential for a thorough understanding of image content. In recent years, we have witnessed the widespread application of visual relationship learning in a series of image understanding tasks, including image description generation, image retrieval, image synthesis, scene graph generation, visual reasoning, and visual question answering. A visual relationship typically consists of two objects, a subject and an object, and an interaction between them, and is usually expressed as a triple <subject, predicate, object>, e.g., <person-ride-bike>. Visual relationship learning requires not only identifying the class and bounding box of each object in a given image, but also indicating the interaction between each pair of objects. That is, visual relationship learning is a bridge connecting low-level image perception tasks (object detection, image classification, etc.) and high-level image cognition tasks (image description generation, visual question answering, etc.), and belongs to the intermediate-level image understanding tasks. Visual relationship instance learning is the problem of determining the two object instances involved in each visual relationship, given an image and a corresponding set of visual relationships.
Existing visual relationship learning models can be divided into two categories: (1) joint models and (2) separation models. A joint model treats a visual relationship triple as one category and then learns a classifier. For example, Plummer et al. learn a CCA model based on features of different combinations of subjects, objects, and the joint regions between them, and then classify each visual relationship using a ranking SVM. However, since visual relationships usually exhibit a long-tailed distribution, joint models suffer from large scale and weak generalization. In addition, when the number of object classes is N and the number of interaction relation classes is K, the learning complexity is O(N²K). A separation model trains a separate classifier for each component of the visual relationship triple, reducing the learning complexity to O(N + K). Lu et al. predict the interaction between pairs of objects using the visual features of the object pairs and linguistic prior knowledge. Zhang et al. propose to treat the predicate as a translation vector between subject and object, i.e., s + p ≈ o, and then map the visual features of paired objects into a low-dimensional relationship space to construct the visual relationship classification model VTransE. Based on the spatial features and statistical dependencies among subjects, predicates, and objects, Dai et al. use a deep relational network to predict the visual relationships between pairs of objects. In addition, Xu et al. capture context information with a graph neural network and classify visual relationships by establishing an iterative message-passing model.
However, the above methods do not solve the problem of instance confusion in visual relationship learning: as shown in fig. 1, given an image and its associated set of visual relationships, how to correctly find and output the instance boxes of the two objects, subject and object, in each visual relationship. Since an image often contains multiple object instances belonging to the same specified class, an instance confusion problem arises in visual relationship learning.
Disclosure of Invention
In order to solve the problem of instance confusion in visual relationship learning, the invention provides a visual relationship instance learning method based on reinforcement learning. Based on a deep reinforcement learning framework, visual relationship instance learning is modeled as a sequential decision problem in which two agents, S-agent and O-agent, search the image for the two objects, subject and object, involved in each visual relationship; the states, actions, and rewards of this problem are defined. For a given test image and an associated visual relationship set, the model learned by this method can quickly and accurately find the instance boxes corresponding to the subject and the object in each visual relationship, thereby greatly improving the cognitive-level understanding of image content.
In order to achieve the purpose, the invention adopts the following technical means:
a visual relation example learning method based on reinforcement learning comprises the following steps:
step 1, inputting training set data, obtaining each image and its corresponding visual relationship set, and concatenating the following vectors to form a state vector: the visual features of the whole image, the visual features of the two object instance search boxes, the historical action vectors of the two agents, the spatial relationship features between the object instance search boxes, and the text features obtained by encoding the current visual relationship with the Skip-Thought language model, where the two agents are the S-agent used for searching the subject instance search box and the O-agent used for searching the object instance search box, and each historical action vector is formed by concatenating the action vectors executed at the past 10 moments;
step 2, at each moment, the S-agent and the O-agent respectively execute the transformation action aiming at the subject and the object instance search boxes, so as to generate the search box at the next moment, then obtain the corresponding reward, and judge whether the search is terminated;
step 3, storing the state at the current moment, the action taken at the current moment, the reward obtained, the state at the next moment, and a flag indicating whether the search has terminated into the experience replay pool;
and step 4, repeating steps 1-3 until the experience replay pool reaches the minimum samplable number, then randomly sampling a batch of samples from the pool, training the current Q networks of the S-agent and the O-agent and their parameters respectively, and at regular intervals updating the parameters of the target Q networks of the S-agent and the O-agent with the parameters of the corresponding current Q networks.
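The experience replay pool used in steps 3 and 4 can be sketched as follows. This is an illustrative sketch with hypothetical names (`ReplayPool`, `store`, `ready`, `sample`), not the patent's implementation:

```python
import random
from collections import deque

class ReplayPool:
    """Sketch of the experience replay pool: stores transitions
    (state, action, reward, next_state, done) and only allows sampling
    once a minimum number of transitions has accumulated."""
    def __init__(self, capacity, min_samples):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted
        self.min_samples = min_samples

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def ready(self):
        # Sampling is allowed only after the minimum samplable number is reached.
        return len(self.buffer) >= self.min_samples

    def sample(self, n):
        # Random mini-batch, as in step 4.
        return random.sample(self.buffer, n)

pool = ReplayPool(capacity=100, min_samples=5)
for t in range(6):
    pool.store(t, 0, 1.0, t + 1, False)
print(pool.ready())  # True: 6 transitions >= minimum of 5
```

Both agents draw their mini-batches from this shared pool; each trains its own current Q network.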
The visual relationship instance learning method based on reinforcement learning can be used to find each instance of a specified visual relationship in a visual environment, and the learned visual relationships and the corresponding instance boxes can also be applied to image-text tasks such as image-text retrieval, visual reasoning, and visual question answering.
Existing methods need to use the results of an object detection algorithm as candidate instance boxes for the two objects in each visual relationship, but object detection is usually error-prone, which can affect the performance of visual relationship learning to some extent. In the invention, the two agents search for the instance boxes of the two objects in each visual relationship starting from the upper-left corner of the image, and no erroneously labeled candidate instance boxes are used.
Existing methods must evaluate a large number of candidate instance boxes when generating object instance boxes with an object detection algorithm, causing unnecessary computational overhead. The invention aims to learn the optimal policy for searching for the correct instance box, i.e., the shortest path, and the resulting model can accurately find the two object instance boxes involved in each visual relationship of an image while evaluating the minimum number of candidate instance boxes.
The method is based on deep reinforcement learning and models the instance localization problem of the objects involved in each visual relationship of a given image as a sequential decision problem in which two agents search over the image: the S-agent searching for the subject and the O-agent searching for the object both start from the upper-left corner of the image and, at each moment, execute a specific action according to the current state until the search terminates and the instance boxes corresponding to the two objects, subject and object, are output.
When an image and a visual relation set are given, the method can quickly and accurately find two object example frames in each visual relation by the shortest search path.
Drawings
FIG. 1 gives an image and a corresponding set of visual relationships.
FIG. 2 is a block diagram of example learning of visual relationships, with solid black and white boxes representing example search boxes for S-agent and O-agent, respectively, at the current time, and dashed boxes representing the next search box to jump to after performing the action at the current time.
FIG. 3 is a diagram of the 9 predefined transformation actions; the solid box represents the search box at the current time and the dashed box the search box at the next time. Left and Right denote horizontal left and right movement, Up and Down denote vertical up and down movement, Bigger and Smaller are scaling operations, Taller and Fatter change the aspect ratio, and Terminate ends the search.
FIG. 4 shows instance boxes learned by visual relationship instance learning: given an image and its corresponding set of visual relationships, the instance boxes of the two objects in each visual relationship are eventually learned.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the drawings and examples.
The invention relates to a visual relation example learning method based on reinforcement learning, which comprises the following steps:
step 1, inputting training set data, obtaining each image and its corresponding visual relationship set, and concatenating the following vectors to form a state vector: the visual features of the whole image, the visual features of the two object instance search boxes, the historical action vectors of the two agents, the spatial relationship features between the object instance search boxes, and the text features obtained by encoding the current visual relationship with the Skip-Thought language model, where the two agents are the S-agent used for searching the subject instance search box and the O-agent used for searching the object instance search box, and each historical action vector is formed by concatenating the action vectors executed at the past 10 moments.
In the invention, S-agent and O-agent respectively start searching from the upper left corner of the current image.
Specifically, the image represents the interactive environment; the two agents continuously interact with this environment to form an episode, and each episode processes a different image. The state vector S_t is defined as:

S_t = [ v(I_t), v(b_t^s), h_t^s, v(b_t^o), h_t^o, w(e_t), f(b_t^s, b_t^o) ]

where v(I_t) is the visual feature vector of the image I_t being processed at the current time t; b_t^s = (x_t^s, y_t^s, w_t^s, h_t^s) is the instance search box of the subject at the current time, (x_t^s, y_t^s) being the coordinates of its upper-left corner and (w_t^s, h_t^s) its width and height; v(b_t^s) is the visual feature vector of the search box b_t^s; h_t^s is the historical action vector formed by concatenating the action vectors of the agent S-agent at the past 10 moments; b_t^o = (x_t^o, y_t^o, w_t^o, h_t^o) is the instance search box of the object at the current time, defined analogously; v(b_t^o) is the visual feature vector of the search box b_t^o; h_t^o is the historical action vector formed by concatenating the action vectors of the agent O-agent at the past 10 moments; w(e_t) is the semantic embedding vector, generated by the Skip-Thought language model, of the visual relationship e_t being processed at time t; and f(b_t^s, b_t^o) is the spatial-relationship feature vector between the two search boxes. This feature is built from the intersection b_t^s ∩ b_t^o and the union b_t^s ∪ b_t^o of the two instance search boxes as a 6-dimensional vector; because 6 dimensions are too few to adequately capture the slight differences between different spatial relationships, a GMM model is used to discretize the 6-dimensional vector into a 400-dimensional vector, which serves as the final spatial-relationship feature vector between the two instance search boxes.
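The concatenation of the state-vector components can be sketched as follows. The 400-dimensional discretized spatial feature and the 9-dimensional-action-by-10-step histories follow the text; the 4096-dimensional visual features and the 4800-dimensional text embedding are assumptions for illustration (4800 matches the public combine-skip Skip-Thought model, but the patent does not state the sizes):

```python
import numpy as np

def build_state(img_feat, subj_feat, subj_hist, obj_feat, obj_hist,
                text_feat, spatial_feat):
    """Concatenate the seven state components into one state vector s_t."""
    return np.concatenate([img_feat, subj_feat, subj_hist,
                           obj_feat, obj_hist, text_feat, spatial_feat])

img_feat  = np.zeros(4096)    # v(I_t): whole-image visual feature (size assumed)
subj_feat = np.zeros(4096)    # v(b_t^s): subject search-box feature (size assumed)
obj_feat  = np.zeros(4096)    # v(b_t^o): object search-box feature (size assumed)
subj_hist = np.zeros(9 * 10)  # h_t^s: last 10 nine-dimensional actions of S-agent
obj_hist  = np.zeros(9 * 10)  # h_t^o: last 10 nine-dimensional actions of O-agent
text_feat = np.zeros(4800)    # w(e_t): Skip-Thought embedding (size assumed)
spatial   = np.zeros(400)     # f(b_t^s, b_t^o): GMM-discretized spatial feature

s_t = build_state(img_feat, subj_feat, subj_hist, obj_feat, obj_hist,
                  text_feat, spatial)
print(s_t.shape)
```

Whatever the true component sizes, the state fed to each DQN is this single flat concatenation.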
And step 2, at each moment, the S-agent and the O-agent execute transformation actions for the subject and object instance search boxes respectively, each selecting and executing a specific action according to its current state, so that the instance search box at the current moment jumps to the instance search box at the next moment and the corresponding reward is obtained; the process then moves to the next state until the search terminates.
FIG. 2 is a framework diagram of visual relationship instance learning; the input is an image and a corresponding set of visual relationships, and the two agents S-agent and O-agent are two separate DQN networks.
Specifically, a transformation action is defined as a 9-dimensional vector in which an element of 1 means the corresponding action is performed and 0 means it is not; the 9 dimensions correspond to the following 9 actions: horizontal right movement (Right), horizontal left movement (Left), vertical up movement (Up), vertical down movement (Down), zoom in (Bigger), zoom out (Smaller), change the height ratio (Taller), change the width ratio (Fatter), and terminate the search (Terminate), as shown in fig. 3.
At each moment, the S-agent performs the selected transformation action, which changes the subject instance search box in steps of

Δw_t^s = α · w_t^s, Δh_t^s = α · h_t^s

Similarly, at each moment, the O-agent performs the selected transformation action, which changes the object instance search box in steps of

Δw_t^o = α · w_t^o, Δh_t^o = α · h_t^o

where α ∈ [0, 1] is a change parameter. For example, when the agent S-agent performs a Right action, the subject instance search box is transformed from (x_t^s, y_t^s, w_t^s, h_t^s) to (x_t^s + Δw_t^s, y_t^s, w_t^s, h_t^s). Here Δw_t^s and Δh_t^s are the width and height change steps of the subject instance search box, and Δw_t^o and Δh_t^o are the width and height change steps of the object instance search box.
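The nine transformation actions with α-proportional step sizes can be sketched as follows. The per-action update rules beyond the Right example are assumptions (the patent's formula images are not recoverable), following the common active-localization convention:

```python
# Apply one of the 9 predefined actions to a search box (x, y, w, h),
# where (x, y) is the upper-left corner. Step sizes are proportional to
# the box size via the change parameter alpha, as in the text.
def apply_action(box, action, alpha=0.2):
    x, y, w, h = box
    dw, dh = alpha * w, alpha * h          # Δw = α·w, Δh = α·h
    if action == "Right":
        x += dw
    elif action == "Left":
        x -= dw
    elif action == "Up":
        y -= dh
    elif action == "Down":
        y += dh
    elif action == "Bigger":               # grow around the center (assumed)
        x -= dw / 2; y -= dh / 2; w += dw; h += dh
    elif action == "Smaller":              # shrink around the center (assumed)
        x += dw / 2; y += dh / 2; w -= dw; h -= dh
    elif action == "Taller":               # change height ratio (assumed)
        y -= dh / 2; h += dh
    elif action == "Fatter":               # change width ratio (assumed)
        x -= dw / 2; w += dw
    elif action == "Terminate":
        pass                               # search ends; box is unchanged
    return (x, y, w, h)

print(apply_action((10, 10, 100, 50), "Right", alpha=0.2))  # (30.0, 10, 100, 50)
```

With α = 0.2 a Right action on a 100-pixel-wide box shifts it 20 pixels, matching the (x + Δw, y, w, h) example in the text.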
When the S-agent performs an action that moves the subject instance search box from the current box b_t^s to the next box b_{t+1}^s, the obtained reward r_t^s is defined as:

r_t^s = sign( IoU(b_{t+1}^s, g_s) − IoU(b_t^s, g_s) )

Similarly, when the O-agent performs an action that moves the object instance search box from the current box b_t^o to the next box b_{t+1}^o, the reward r_t^o is defined as:

r_t^o = sign( IoU(b_{t+1}^o, g_o) − IoU(b_t^o, g_o) )

where g_s denotes the ground-truth box of the subject instance, g_o denotes the ground-truth box of the object instance, sign(·) is the sign function, and IoU(·,·) denotes the intersection-over-union between two regions, i.e.

IoU(b, g) = area(b ∩ g) / area(b ∪ g)

In particular, the rewards r_t^s and r_t^o obtained by the S-agent and the O-agent after performing the terminate-search action are defined respectively as:

r_t^s = +η if IoU(b_t^s, g_s) ≥ τ, otherwise −η
r_t^o = +η if IoU(b_t^o, g_o) ≥ τ, otherwise −η

where η is the reward for terminating the action and τ is the IoU threshold for terminating the action.
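The sign reward and the termination reward above can be illustrated numerically. The function names and the values η = 3.0, τ = 0.5 are assumptions for the example; boxes use the (x, y, w, h) format:

```python
# Intersection-over-union of two axis-aligned boxes (x, y, w, h).
def iou(box_a, box_b):
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def sign(x):
    return (x > 0) - (x < 0)

def move_reward(prev_box, next_box, gt_box):
    """r_t = sign(IoU(b_{t+1}, g) - IoU(b_t, g)): +1 if the move improves
    overlap with the ground-truth box, -1 if it worsens it, 0 if unchanged."""
    return sign(iou(next_box, gt_box) - iou(prev_box, gt_box))

def terminate_reward(box, gt_box, eta=3.0, tau=0.5):
    """+eta if the final box overlaps ground truth by at least tau, else -eta.
    (eta and tau values here are assumed, not from the patent.)"""
    return eta if iou(box, gt_box) >= tau else -eta

gt = (0, 0, 10, 10)
print(move_reward((20, 20, 10, 10), (5, 5, 10, 10), gt))   # 1: moved closer
print(terminate_reward((1, 1, 10, 10), gt))                # 3.0: IoU above tau
```

The sign reward keeps the per-step signal bounded, so the shortest improving path accumulates the most reward.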
Step 3, storing the state transition at each moment, i.e., the state at the current moment, the action taken at the current moment, the reward obtained, the state at the next moment, and a flag indicating whether the search has terminated, into the experience replay pool. In the invention, the S-agent maintains two Q networks: the current Q network Q_s for action selection and the target Q network Q'_s for target-value calculation; the O-agent likewise maintains two Q networks: the current Q network Q_o for action selection and the target Q network Q'_o for target-value calculation.
And step 4, the experience replay pool stores historical data. Steps 1-3 are repeated until the amount of data in the replay pool reaches the minimum samplable number; a batch of samples is then randomly drawn from the pool to train the S-agent's current Q network Q_s and its parameters and the O-agent's current Q network Q_o and its parameters. At regular intervals, the parameters of the current Q network Q_s are used to update the S-agent's target Q network Q'_s, and the parameters of the current Q network Q_o are used to update the O-agent's target Q network Q'_o.
The results of visual relationship instance learning are shown in fig. 4; the many learned visual relationships and their two corresponding object instances can be further applied to high-level image understanding tasks such as visual question answering, visual reasoning, and image description generation.
In one embodiment of the invention, the training set consists of N samples {(I_i, E_i)}, i = 1, …, N, where E_i = {e_i^1, …, e_i^{m_i}} is the visual relationship set of the i-th image I_i and contains m_i visual relationships; e_i^j = <s_i^j, p_i^j, o_i^j> denotes the j-th visual relationship of image I_i, where s_i^j and o_i^j are the categories of the two interacting objects and p_i^j is the category of the interaction between them. The specific steps of this embodiment are as follows:
Step 1):
Initialization: the experience replay pool D with capacity M and minimum samplable number Z; the current Q networks Q_s and Q_o of the two agents S-agent and O-agent and their parameters θ_s and θ_o; the target Q networks Q'_s and Q'_o of the two agents and their parameters θ'_s and θ'_o; the number of iteration rounds T; the IoU threshold τ; the terminate-action reward η; the exploration rate ε; the change parameter α; the learning rate β; the discount factor γ; the number of samples n for batch gradient descent; and the target Q network parameter update frequency C.
Step 2):
At t = 1, starting from sample X_1, initialize the state S_1. With probability ε, the two agents randomly select actions a_t^s and a_t^o for the subject and object instance search boxes respectively; otherwise, each agent selects the current optimal action according to its model:

a_t^s = argmax_{a ∈ A_s} Q_s(S_t, a; θ_s)
a_t^o = argmax_{a ∈ A_o} Q_o(S_t, a; θ_o)

where A_s and A_o are the action sets available to the S-agent and the O-agent, respectively. The two agents each perform the selected actions a_t^s and a_t^o, thereby jumping to a new state S_{t+1}, obtaining the respective rewards r_t^s and r_t^o and the termination flag is_end_t; the transition (S_t, a_t^s, a_t^o, r_t^s, r_t^o, S_{t+1}, is_end_t) is stored into the experience replay pool D.
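The ε-greedy action selection in step 2) can be sketched as follows, with a plain dict standing in for the current Q network of one agent (names hypothetical):

```python
import random

# The 9 predefined transformation actions from the description.
ACTIONS = ["Right", "Left", "Up", "Down", "Bigger", "Smaller",
           "Taller", "Fatter", "Terminate"]

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action (exploration),
    otherwise the action with the highest current-Q value (exploitation).
    q_values: mapping from action name to estimated Q value."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(q_values, key=q_values.get)

q = {a: 0.0 for a in ACTIONS}
q["Bigger"] = 1.5
print(epsilon_greedy(q, epsilon=0.0))  # "Bigger": purely greedy choice
```

Each agent runs this selection independently over its own action set at every time step.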
Step 3):
Perform steps 1)-2) until the number of samples stored in the experience replay pool reaches the minimum samplable number Z, then sample n transitions (S_j, a_j^s, a_j^o, r_j^s, r_j^o, S_{j+1}, is_end_j), j = 1, 2, …, n, from the replay pool D. Compute the target value for the action performed by the S-agent:

y_j^s = r_j^s, if is_end_j is true
y_j^s = r_j^s + γ max_{a'} Q'_s(S_{j+1}, a'; θ'_s), otherwise

and update the parameters θ_s by back-propagating the gradient of the mean-squared-error loss

L_s = (1/n) Σ_{j=1}^n ( y_j^s − Q_s(S_j, a_j^s; θ_s) )²

through the neural network. Compute the target value for the action performed by the O-agent:

y_j^o = r_j^o, if is_end_j is true
y_j^o = r_j^o + γ max_{a'} Q'_o(S_{j+1}, a'; θ'_o), otherwise

and update the parameters θ_o with the mean-squared-error loss

L_o = (1/n) Σ_{j=1}^n ( y_j^o − Q_o(S_j, a_j^o; θ_o) )²

Every C rounds, update the parameters of the target Q networks: θ'_s ← θ_s, θ'_o ← θ_o.
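The target value and mean-squared-error loss in step 3) can be illustrated numerically. The toy rewards, Q values, and γ = 0.9 below are assumptions, with plain lists standing in for the target Q network's outputs:

```python
# y_j = r_j for terminal transitions, otherwise
# y_j = r_j + gamma * max_a Q'(S_{j+1}, a), with Q' the target network.
def target_value(reward, next_q_values, done, gamma=0.9):
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def mse_loss(targets, predictions):
    """Mean-squared error between targets y_j and current-Q predictions."""
    n = len(targets)
    return sum((y - q) ** 2 for y, q in zip(targets, predictions)) / n

# One sampled mini-batch of two transitions (toy numbers):
batch = [
    # (reward, target-net Q values at next state, done flag)
    (1.0, [0.2, 0.5, 0.1], False),
    (-3.0, [0.0, 0.0, 0.0], True),   # terminal: target is just the reward
]
targets = [target_value(r, nq, d) for r, nq, d in batch]
preds = [1.0, -2.0]                  # current-Q estimates Q(S_j, a_j)
print(targets)                       # approximately [1.45, -3.0]
print(mse_loss(targets, preds))
```

In the full method this loss is back-propagated through the current Q network, and every C rounds the target network parameters are overwritten with the current ones (θ' ← θ), which keeps the targets stable between syncs.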
step 4):
when the test data set is input, the two agents s-agent and o-agent output an example frame of subject and object corresponding to each visual relation in the image when the search is terminated.
The instance learning of given image-text relationships in images can be applied to many vision-related tasks. For example, a robot performing a navigation task in a visual environment that receives the instruction "send this cup to the woman standing in front of a car and holding an umbrella" must first find, in the visual environment, an instance of each of the visual relationships <woman-hold-umbrella> and <woman-stand in front of-car> before it can deliver the cup. In addition, the learned visual relationships and the corresponding instance boxes can be applied to image-text tasks such as image-text retrieval, visual reasoning, and visual question answering. For example, in image-text retrieval with the search text <person-playing-football>, images containing the two objects "person" and "football" with the interaction "playing" between them can be retrieved directly and quickly, instead of first searching for the two objects "person" and "football" separately and then filtering the results for the interaction.

Claims (5)

1. A visual relation example learning method based on reinforcement learning is characterized by comprising the following steps:
step 1, inputting training set data, obtaining each image and its corresponding visual relationship set, and concatenating the following vectors to form a state vector: the visual features of the whole image, the visual features of the two object instance search boxes, the historical action vectors of the two agents, the spatial relationship features between the object instance search boxes, and the text features obtained by encoding the current visual relationship with the Skip-Thought language model, where the two agents are the S-agent used for searching the subject instance search box and the O-agent used for searching the object instance search box, and each historical action vector is formed by concatenating the action vectors executed at the past 10 moments;
step 2, at each moment, the S-agent and the O-agent respectively execute the transformation action aiming at the subject and the object instance search boxes, so as to generate the search box at the next moment, then obtain the corresponding reward, and judge whether the search is terminated;
step 3, storing the state at the current moment, the action taken at the current moment, the reward obtained, the state at the next moment, and a flag indicating whether the search has terminated into the experience replay pool;
and step 4, repeating steps 1-3 until the experience replay pool reaches the minimum samplable number, randomly sampling a batch of samples from the pool, training the current Q networks of the S-agent and the O-agent and their parameters respectively, and at regular intervals updating the parameters of the target Q networks of the S-agent and the O-agent with the parameters of the corresponding current Q networks.
2. The reinforcement-learning-based visual relationship instance learning method according to claim 1, wherein in step 1 the state vector S_t is defined as:

S_t = [ v(I_t), v(b_t^s), h_t^s, v(b_t^o), h_t^o, w(e_t), f(b_t^s, b_t^o) ]

wherein v(I_t) is the visual feature vector of the image I_t being processed at the current time t; b_t^s = (x_t^s, y_t^s, w_t^s, h_t^s) is the instance search box of the subject at the current time, (x_t^s, y_t^s) being the coordinates of its upper-left corner and (w_t^s, h_t^s) its width and height; v(b_t^s) is the visual feature vector of the search box b_t^s; h_t^s is the historical action vector formed by concatenating the action vectors of the agent S-agent at the past 10 moments; b_t^o = (x_t^o, y_t^o, w_t^o, h_t^o) is the instance search box of the object at the current time, defined analogously; v(b_t^o) is the visual feature vector of the search box b_t^o; h_t^o is the historical action vector formed by concatenating the action vectors of the agent O-agent at the past 10 moments; w(e_t) is the semantic embedding vector, generated by the Skip-Thought language model, of the visual relationship e_t being processed at time t; and f(b_t^s, b_t^o) is the spatial-relationship feature vector between the two search boxes, a 6-dimensional vector built from the intersection b_t^s ∩ b_t^o and the union b_t^s ∪ b_t^o of the two instance search boxes.
3. The reinforcement-learning-based visual relationship instance learning method according to claim 2, wherein a GMM model is used to discretize the 6-dimensional vector f(b_t^s, b_t^o) into a 400-dimensional vector, which serves as the final spatial-relationship feature vector between the two instance search boxes.
4. The method according to claim 3, wherein in step 2 the transformation action is defined as a 9-dimensional vector: an element equal to 1 indicates that the corresponding action is performed, and 0 that it is not. The 9 dimensions correspond to the following 9 actions: horizontal movement right and left, vertical movement up and down, zooming in and out, changing the height ratio, changing the width ratio, and terminating the search.
At each time step, the S-agent performs the selected transformation action, which changes the width and height of the instance search box of the subject by
Δw_t^s = α · w_t^s,  Δh_t^s = α · h_t^s;
at each time step, the O-agent performs the selected transformation action, which changes the width and height of the instance search box of the object by
Δw_t^o = α · w_t^o,  Δh_t^o = α · h_t^o;
where α ∈ [0, 1] is the change parameter, Δw_t^s and Δh_t^s are the width and height changes of the subject's instance search box, and Δw_t^o and Δh_t^o are the width and height changes of the object's instance search box.
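A minimal sketch of the 9-action search-box transformation described in claim 4. The per-step changes Δw = α·w and Δh = α·h follow the claim; the action names and the convention of keeping the box centre fixed while resizing are illustrative, not from the patent.

```python
# The 9 actions of the transformation vector.
ACTIONS = (
    "move_right", "move_left", "move_up", "move_down",
    "zoom_in", "zoom_out", "change_height", "change_width", "terminate",
)

def apply_action(box, action, alpha=0.2):
    """Apply one transformation action to a box (x, y, w, h),
    with (x, y) the upper-left corner."""
    x, y, w, h = box
    dw, dh = alpha * w, alpha * h  # Δw = α·w, Δh = α·h
    if action == "move_right":
        x += dw
    elif action == "move_left":
        x -= dw
    elif action == "move_down":
        y += dh
    elif action == "move_up":
        y -= dh
    elif action == "zoom_in":          # grow, keeping the centre fixed
        x -= dw / 2; y -= dh / 2; w += dw; h += dh
    elif action == "zoom_out":         # shrink, keeping the centre fixed
        x += dw / 2; y += dh / 2; w -= dw; h -= dh
    elif action == "change_height":    # change the height ratio only
        y -= dh / 2; h += dh
    elif action == "change_width":     # change the width ratio only
        x -= dw / 2; w += dw
    # "terminate" leaves the box unchanged and ends the search
    return (x, y, w, h)
```

Both agents use the same action space; each step perturbs the box proportionally to its current size, so the search naturally takes coarse steps for large boxes and fine steps for small ones.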
5. The reinforcement-learning-based visual relationship instance learning method according to claim 4, wherein in step 2, when the S-agent performs an action that moves the instance search box of the subject from the current search box b_t^s to the next search box b_{t+1}^s, the reward obtained, r_t^s, is defined as:
r_t^s = sign( IOU(b_{t+1}^s, g_s) − IOU(b_t^s, g_s) );
when the O-agent performs an action that moves the instance search box of the object from the current search box b_t^o to the next search box b_{t+1}^o, the reward obtained, r_t^o, is defined as:
r_t^o = sign( IOU(b_{t+1}^o, g_o) − IOU(b_t^o, g_o) );
where g_s denotes the ground-truth box of the subject instance, g_o denotes the ground-truth box of the object instance, sign(·) is the sign function, and IOU(·,·) is the intersection-over-union of two regions, i.e.
IOU(b, g) = area(b ∩ g) / area(b ∪ g).
The rewards obtained by the S-agent and the O-agent after executing the terminate-search action, r_T^s and r_T^o, are defined respectively as:
r_T^s = +η if IOU(b_T^s, g_s) ≥ τ, and −η otherwise;
r_T^o = +η if IOU(b_T^o, g_o) ≥ τ, and −η otherwise;
where η is the reward for the termination action and τ is the IoU threshold of the termination action.
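The step and termination rewards of claim 5 can be sketched as follows. η = 3 and τ = 0.5 are illustrative defaults; the patent leaves both as parameters.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, w, h),
    with (x, y) the upper-left corner."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def step_reward(box_next, box_cur, gt):
    """Reward for a move/scale action: the sign of the change in IoU
    with the ground-truth box, sign(IOU(b_{t+1}, g) - IOU(b_t, g))."""
    return float(np.sign(iou(box_next, gt) - iou(box_cur, gt)))

def terminal_reward(box, gt, eta=3.0, tau=0.5):
    """+eta if the final box overlaps the ground truth by at least tau,
    -eta otherwise."""
    return eta if iou(box, gt) >= tau else -eta
```

The same functions serve both agents: the S-agent evaluates its boxes against g_s, the O-agent against g_o.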
CN202110152379.8A 2021-02-04 2021-02-04 Visual relation example learning method based on reinforcement learning Active CN112989088B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110152379.8A CN112989088B (en) 2021-02-04 2021-02-04 Visual relation example learning method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110152379.8A CN112989088B (en) 2021-02-04 2021-02-04 Visual relation example learning method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN112989088A true CN112989088A (en) 2021-06-18
CN112989088B CN112989088B (en) 2023-03-21

Family

ID=76346704

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110152379.8A Active CN112989088B (en) 2021-02-04 2021-02-04 Visual relation example learning method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN112989088B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113554129A * 2021-09-22 2021-10-26 Aerospace Hongkang Intelligent Technology (Beijing) Co., Ltd. Scene graph generation method and generation device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463881A * 2017-07-07 2017-12-12 Sun Yat-sen University Person image retrieval method based on deep reinforcement learning
WO2019035771A1 * 2017-08-17 2019-02-21 National University Of Singapore Video visual relation detection methods and systems
CN111783852A * 2020-06-16 2020-10-16 Beijing University of Technology Adaptive image description generation method based on deep reinforcement learning
US20200334545A1 * 2019-04-19 2020-10-22 Adobe Inc. Facilitating changes to online computing environment by assessing impacts of actions using a knowledge base representation
CN112256904A * 2020-09-21 2021-01-22 Tianjin University Image retrieval method based on visual description sentences

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CAO, QIANWEN ET AL.: "3-D Relation Network for visual relation recognition in videos", Neurocomputing *
DING, WENBO ET AL.: "Research progress of deep-learning-based visual relation detection methods", Science and Technology Innovation Herald *

Also Published As

Publication number Publication date
CN112989088B (en) 2023-03-21

Similar Documents

Publication Publication Date Title
CN110609891B (en) Visual dialog generation method based on context awareness graph neural network
CN109934261B (en) Knowledge-driven parameter propagation model and few-sample learning method thereof
CN109993102B (en) Similar face retrieval method, device and storage medium
US20160350653A1 (en) Dynamic Memory Network
Kaluri et al. An enhanced framework for sign gesture recognition using hidden Markov model and adaptive histogram technique.
US20200065560A1 (en) Signal retrieval apparatus, method, and program
CN110377707B (en) Cognitive diagnosis method based on depth item reaction theory
CN114495129B (en) Character detection model pre-training method and device
CN112527993A (en) Cross-media hierarchical deep video question-answer reasoning framework
US20200218932A1 (en) Method and system for classification of data
CN112668608A (en) Image identification method and device, electronic equipment and storage medium
CN115130591A (en) Cross supervision-based multi-mode data classification method and device
CN115270752A (en) Template sentence evaluation method based on multilevel comparison learning
CN112989088B (en) Visual relation example learning method based on reinforcement learning
CN113420552B (en) Biomedical multi-event extraction method based on reinforcement learning
CN113240033B (en) Visual relation detection method and device based on scene graph high-order semantic structure
CN111914949B (en) Zero sample learning model training method and device based on reinforcement learning
CN116452895B (en) Small sample image classification method, device and medium based on multi-mode symmetrical enhancement
CN114565804A (en) NLP model training and recognizing system
Wu et al. Question-driven multiple attention (dqma) model for visual question answer
CN113821610A (en) Information matching method, device, equipment and storage medium
CN115066690A (en) Search normalization-activation layer architecture
Voruganti Visual question answering with external knowledge
CN116129333B (en) Open set action recognition method based on semantic exploration
CN114936297B (en) Video question-answering method based on priori knowledge and object sensitivity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant