CN111985505A - Interest visual relationship detection method and device based on interest propagation network - Google Patents

Interest visual relationship detection method and device based on interest propagation network

Info

Publication number
CN111985505A
Authority
CN
China
Prior art keywords
interest
visual
relation
predicate
pair
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010848981.0A
Other languages
Chinese (zh)
Other versions
CN111985505B (en)
Inventor
Ren Tongwei (任桐炜)
Wu Gangshan (武港山)
Wang Haonan (王浩楠)
Yu Fan (于凡)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202010848981.0A priority Critical patent/CN111985505B/en
Publication of CN111985505A publication Critical patent/CN111985505A/en
Application granted granted Critical
Publication of CN111985505B publication Critical patent/CN111985505B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 - Administration; Management
    • G06Q 10/04 - Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/70 - Determining position or orientation of objects or cameras
    • G06T 7/73 - Determining position or orientation of objects or cameras using feature-based methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Strategic Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Game Theory and Decision Science (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Development Economics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)

Abstract

Objects are extracted from an input image and combined pairwise into object pairs, and the corresponding object features and joint features are computed. Visual, semantic, and position features are generated for the objects and object pairs, and their interest features are obtained through linear transformation, from which the interestingness of each object pair is predicted. Likewise, the interest feature of a relation predicate is obtained by linearly transforming the visual, semantic, and position features of the object pair's relation predicate, and the interestingness of the relation predicates between objects is predicted. Finally, the object-pair interestingness and the relation-predicate interestingness are combined into a visual-relation interestingness; the visual relations with high interestingness are the finally detected visual relations of interest. By taking semantic importance as the criterion during visual relationship detection, the invention predicts relation interestingness more reasonably, finds the visual relations of interest that accurately convey the main content of an image, and has good generality and practicability.

Description

Interest visual relationship detection method and device based on interest propagation network
Technical Field
The invention belongs to the technical field of computer vision and relates to visual relationship detection in images, in particular to an interest visual relationship detection method based on an interest propagation network.
Background Art
As a bridge between vision and natural language, visual relationship detection aims to describe the objects in an image and the interactions between them in the form of relation triples <subject, relation predicate, object>. The subject and object are generally represented by an object's bounding box and category, while relation predicates are typically verbs (e.g., "raise", "ride", "watch"), orientation words (e.g., "beside", "in front of", "above"), or verb phrases (e.g., "stand beside", "sit on", "walk through"). Visual relationship detection helps a machine understand and analyze the content of an image or video and can be widely applied in scenarios such as image retrieval and video analysis.
Conventional visual relationship detection methods aim to detect all visual relationships in an image. In practice, owing to the combinatorial explosion of subjects, relation predicates, and objects, conventional methods typically detect a very large number of visual relationships, as shown in Fig. 2. Although this describes the image content more fully, the excessive detail may mislead the machine's understanding of the image's main content, causing a loss of precision in scenarios such as image retrieval and hindering accurate analysis of the image or video by the machine.
Intuitively, not all detected visual relationships are semantically "interesting"; that is, not every visual relationship expresses the main content of the image. Often only a small fraction of the relationships matter for conveying the main content, and these are the visual relationships of interest. The goal of interest visual relationship detection is to detect the visual relationships that are truly important for conveying the main content of the image, i.e., the "interesting" visual relationships.
At present, no existing work attempts to detect visual relationships of interest; some related works merely measure the visual saliency of relationships through an attention module, determine saliency weights, and pick out the salient visual relationships. However, such methods account only for the visual saliency of a relationship, not its semantic importance, so the relationships they find are not necessarily "interesting".
Disclosure of Invention
The problem the invention aims to solve is: overly rich visual relationships in an image easily bias a machine's understanding, so the visual relationships of interest that accurately convey the main content of the image need to be detected in order to help a machine understand and analyze images or videos more accurately.
The technical scheme of the invention is as follows: an interest visual relationship detection method based on an interest propagation network, which builds an interest propagation network that takes an image as input and outputs the visual relationships of interest in the image. The interest propagation network comprises a panoptic object detection module, an object interest prediction module, and a relation predicate interest prediction module. First, the panoptic object detection module extracts objects from the input image and combines them pairwise into object pairs, computing the object features of the objects and the joint features of the object pairs. The object interest prediction module then generates the visual, semantic, and position features of the objects and object pairs and obtains their interest features, from which the interestingness of the object pairs is predicted. Meanwhile, the relation predicate interest prediction module obtains the interest features of the relation predicates from the visual, semantic, and position features of the object pairs' relation predicates and predicts the interestingness of the relation predicates between objects using semi-supervised learning. Finally, the object-pair interestingness and the relation-predicate interestingness are combined into a visual-relation interestingness; the visual relations with high interestingness are the finally detected visual relations of interest.
Further, the invention comprises the following steps:
1) extracting the bounding boxes and categories of all objects from the input image, computing the features within the n object boxes as object features, combining the n objects pairwise into n(n-1) object pairs, and computing the features within the union box of the subject and object of each pair as the joint feature;
2) for each object, obtaining the word-embedding feature of its class name from a pre-trained GloVe model, taking the object feature as its visual feature, the word-embedding feature of the class name as its semantic feature, and the position of the object relative to the whole image as its position feature, and combining the three features into the interest feature of the object; for each object pair, computing the three features of the subject and the object in the same way, then computing the three features of the pair itself, and combining them into the interest feature of the object pair; inputting the interest features of the objects and object pairs into a graph convolutional neural network to predict the interestingness of the object pairs;
3) for each object pair, computing the visual, semantic, and position features of its relation predicates to obtain the interest features of the relation predicates, and for each relation predicate, using semi-supervised learning to predict the probability that the relation predicate is interesting given that the object pair is interesting, i.e., the relation-predicate interestingness;
4) adding the object-category prediction loss of step 1), the object and object-pair interestingness prediction losses of step 2), and the relation-predicate interestingness prediction loss of step 3) into a total loss; combining the object-pair interestingness and the relation-predicate interestingness obtained by minimizing the total loss into the visual-relation interestingness; and ranking all visual relations by interestingness, the visual relations with high interestingness being the finally detected visual relations of interest.
The invention also provides an interest visual relationship detection device based on an interest propagation network. The device is configured with a computer program that implements the interest propagation network and, when executed, realizes the above interest visual relationship detection method.
The beneficial effects of the invention are: it addresses the machine-understanding bias caused by overly rich visual relationships in images. By considering the semantic, position, and visual features of objects and object pairs, and by taking semantic importance as the criterion during visual relationship detection, it predicts relation interestingness more reasonably and finds the visual relations of interest that accurately convey the main content of the image. The method has good generality and practicability.
Drawings
FIG. 1 shows the architecture of the interest propagation network of the present invention and the flow of interest visual relationship detection.
FIG. 2 shows an example of the overly rich visual relationships detected by a conventional method.
FIG. 3 illustrates the results of the interest visual relationship detection method of the present invention.
Detailed Description
The interest visual relationship detection method based on an interest propagation network provides a solution to the machine-understanding bias caused by overly rich visual relationships in an image. From the input image, it derives the interest features of objects, object pairs, and relation predicates by linearly transforming their semantic, position, and visual features; predicts the interestingness of visual relations reasonably by taking semantic importance as the criterion; and produces visual-relation-of-interest results that accurately convey the main content of the image.
The practice of the present invention is described in detail below.
As shown in FIG. 1, the invention builds an interest propagation network that takes an image as input and outputs the visual relationships of interest in the image. The interest propagation network comprises a panoptic object detection module, an object interest prediction module, and a relation predicate interest prediction module. The panoptic object detection module performs panoptic segmentation on the image; the content of the image is divided into the "things" and "stuff" categories according to whether it has a fixed shape. Objects with a fixed shape, such as people and cars, belong to the things category (countable nouns are usually things); objects without a fixed shape, such as sky and grass, belong to the stuff category (uncountable nouns are stuff). Panoptic segmentation yields the objects in the image; an instance encoder then extracts the object features of the objects and the joint features of the object pairs, which are fed into the object interest prediction module and the relation predicate interest prediction module respectively. A semantic encoder, a visual encoder, and a position encoder produce the semantic, visual, and position features of the objects, object pairs, and the object pairs' relation predicates; these three features are linearly transformed into the interest features of the objects, object pairs, and relation predicates. The interestingness of the objects, object pairs, and relation predicates is predicted from the interest features by supervised and semi-supervised learning; the two interestingness scores are combined into the visual-relation interestingness, and the visual relations of interest are ranked and output accordingly.
On the basis of this interest propagation network, the panoptic object detection module first extracts objects from the input image and combines them pairwise into object pairs, computing the object features of the objects and the joint features of the object pairs and generating the visual, semantic, and position features of the objects and object pairs; the object interest prediction module then obtains the interest features of the objects and object pairs and predicts their interestingness. Meanwhile, the relation predicate interest prediction module obtains the interest features of the relation predicates from the visual, semantic, and position features of the object pairs' relation predicates and predicts the interestingness of the relation predicates between objects using semi-supervised learning. Finally, the object-pair interestingness and the relation-predicate interestingness are combined into a visual-relation interestingness; the visual relations with high interestingness are the finally detected visual relations of interest.
The following describes the implementation of the present invention in detail. The invention comprises the following steps:
1) For the input image, the panoptic object detection module of the interest propagation network computes the object features and joint features:
1.1) extracting the bounding boxes and categories of all objects in the image;
1.2) computing the features within the n object boxes of step 1.1) as the object features;
1.3) combining the n objects of step 1.1) pairwise into n(n-1) object pairs (see the sketch after this list), and computing the features within the union box of the subject and object of each pair as the joint feature.
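For illustration only, the pairwise combination of step 1.3) can be sketched in Python as follows; the helper name is hypothetical, and the objects can be any detection records:

```python
from itertools import permutations

def make_object_pairs(objects):
    """n detected objects yield n(n-1) ordered (subject, object) pairs."""
    return list(permutations(objects, 2))
```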
2) For the objects extracted in step 1) and the object pairs formed from them, the object interest prediction module of the interest propagation network computes the interestingness:
2.1) For each object extracted in step 1), take the object feature as the visual feature, obtain the word-embedding feature of the class name from a pre-trained GloVe model as the semantic feature, take the position of the object relative to the whole image as the position feature, and combine the three features into the interest feature of the object. The position feature of an object is calculated as:
Loc_i = (x_i^l / w) ⊕ (y_i^t / h) ⊕ (x_i^r / w) ⊕ (y_i^b / h)
where Loc_i is the position feature of object i, ⊕ denotes the concatenation operation, x_i^l, y_i^t, x_i^r, y_i^b are the coordinates of the left, top, right, and bottom boundaries of the object, and w and h are the width and height of the input image.
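A minimal Python sketch of this normalized position feature; the function name and array layout are assumptions for illustration, not the patent's specification:

```python
import numpy as np

def position_feature(box, img_w, img_h):
    """Loc_i = (x_l/w) ⊕ (y_t/h) ⊕ (x_r/w) ⊕ (y_b/h) for a box in pixel coordinates."""
    x_l, y_t, x_r, y_b = box  # left, top, right, bottom boundary coordinates
    return np.array([x_l / img_w, y_t / img_h, x_r / img_w, y_b / img_h])
```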
2.2) For each object pair formed in step 1), compute the three features of the subject and the object in the same way, then compute the three features of the pair itself, and combine them into the interest feature of the object pair. The position feature of an object pair is calculated as:
Loc_p = Loc_{s_p} ⊕ Loc_{o_p} ⊕ Loc_{s_p∪o_p}
where Loc_p is the position feature of object pair p, Loc_i is the position feature of object i, s_p and o_p denote the subject and object of the pair, and ∪ denotes the object-level union (the union box of the subject and object).
The visual feature of an object pair is calculated as:
Figure BDA0002644087460000045
wherein FpIs the view of an object on pThe characteristics of the sense of sight,
Figure BDA0002644087460000046
respectively representing the subject and object characteristics of the object pair,
Figure BDA0002644087460000047
representing the combined characteristics of the subject and object of the object pair.
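A minimal sketch of these pair-level concatenations, assuming the per-object features and the union-box (joint) feature have already been extracted; the function name is hypothetical:

```python
import numpy as np

def pair_features(F_s, F_o, F_union, Loc_s, Loc_o, Loc_union):
    # F_p = F_s ⊕ F_o ⊕ F_{s∪o};  Loc_p = Loc_s ⊕ Loc_o ⊕ Loc_{s∪o}
    F_p = np.concatenate([F_s, F_o, F_union])
    Loc_p = np.concatenate([Loc_s, Loc_o, Loc_union])
    return F_p, Loc_p
```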
2.3) Input the interest features of steps 2.1) and 2.2) into a graph convolutional neural network to predict the interestingness of the objects and object pairs; a sketch of such a scoring head follows.
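For illustration only, a toy graph-convolution scoring head is sketched below; the layer sizes, adjacency scheme, and class name are assumptions rather than the patent's specification:

```python
import torch
import torch.nn as nn

class InterestGCN(nn.Module):
    """Toy graph-convolution head that scores node interestingness in [0, 1]."""

    def __init__(self, dim):
        super().__init__()
        self.gc = nn.Linear(dim, dim)   # one graph-convolution layer
        self.score = nn.Linear(dim, 1)  # interestingness scorer

    def forward(self, x, adj):
        # x: (num_nodes, dim) interest features of objects and object pairs
        # adj: (num_nodes, num_nodes) normalized adjacency over the scene graph
        h = torch.relu(self.gc(adj @ x))  # propagate interest along the graph
        return torch.sigmoid(self.score(h)).squeeze(-1)
```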
3) For the object pairs formed in step 1), the relation predicate interest prediction module of the interest propagation network computes the relation-predicate interestingness:
3.1) For each object pair formed in step 1), compute the visual, semantic, and position features of its relation predicates, and combine them into the interest features of the relation predicates. The relation-predicate position feature of an object pair is calculated as:
Loc'_p = Loc'_{s_p} ⊕ Loc'_{o_p}, with Loc'_i = (x_i^l / w') ⊕ (y_i^t / h') ⊕ (x_i^r / w') ⊕ (y_i^b / h')
where Loc'_p is the relation-predicate position feature of object pair p, and w' and h' are the width and height of the union box of the subject and object of the pair. The visual feature is calculated in the same way as for an object pair.
3.2) For each relation predicate, semi-supervised learning is used to predict the probability that the relation predicate is interesting given that the object pair is interesting, i.e., the relation-predicate interestingness. The semi-supervised learning loss is calculated as follows:
L_rela = l_rela(ŷ^l, y^l) + β · l_rela(ŷ^u, y^u)
where L_rela is the loss of the relation predicate interest prediction module, l_rela is the loss function, ŷ^l and ŷ^u denote the predictions on labeled and unlabeled data respectively, y^l and y^u denote the ground truths of labeled and unlabeled data respectively, and β is the loss weight of the unlabeled data.
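A minimal sketch of this combined loss, assuming a binary cross-entropy instance of l_rela and pseudo-labels standing in for y^u; the function name and the β value are illustrative:

```python
import torch.nn.functional as F

def semi_supervised_rela_loss(pred_l, y_l, pred_u, y_u, beta=0.5):
    # L_rela = l_rela(ŷ^l, y^l) + β · l_rela(ŷ^u, y^u)
    loss_labeled = F.binary_cross_entropy(pred_l, y_l)
    loss_unlabeled = F.binary_cross_entropy(pred_u, y_u)
    return loss_labeled + beta * loss_unlabeled
```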
4) Minimize the total loss of the interest propagation network and predict the visual relations of interest:
4.1) Add the object-category prediction loss of step 1), the object and object-pair interestingness prediction losses of step 2), and the relation-predicate interestingness prediction loss of step 3) to obtain the total loss of the interest propagation network; combine the object-pair interestingness and relation-predicate interestingness obtained by minimizing the total loss into the visual-relation interestingness. The total loss of the interest propagation network is calculated as follows:
L_pos = -(1 - p_pos)^2 · log(p_pos)
L_neg = -p_neg · log(1 - p_neg)
L_total = L_class + L_pos^obj + L_neg^obj + L_pos^pair + L_neg^pair + L_pos^rela + L_neg^rela
where L_pos and L_neg denote the losses of positive and negative samples respectively, p_pos and p_neg denote the probability scores of positive and negative samples, L_total is the total loss of the interest propagation network, L_class is the object-category prediction loss, L_pos^obj and L_neg^obj are the positive and negative losses of the object interestingness prediction, L_pos^pair and L_neg^pair are the positive and negative losses of the object-pair interestingness prediction, and L_pos^rela and L_neg^rela are the positive and negative losses of the relation-predicate interestingness prediction.
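A sketch of the per-sample interestingness losses above; the ε guard and the function name are additions for numerical safety and illustration, not part of the patent:

```python
import torch

def interestingness_loss(p, positive, eps=1e-8):
    # L_pos = -(1 - p)^2 · log(p);  L_neg = -p · log(1 - p)
    # p: probability-score tensor for one sample
    if positive:
        return -(1 - p) ** 2 * torch.log(p + eps)
    return -p * torch.log(1 - p + eps)
```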
4.2) Rank all visual relations by interestingness; the visual relations with high interestingness are the finally detected visual relations of interest. The interestingness of a visual relation is calculated as follows:
I_spo = E_so · I_so · P_spo
where I_spo is the interestingness of the visual relation, I_so and P_spo denote the object-pair interestingness and the relation-predicate interestingness respectively, and E_so is a binary parameter: E_so = 0 when the subject and object of the pair are the same object, and E_so = 1 otherwise.
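A sketch of combining the two scores and ranking, with a hypothetical record layout for each candidate relation:

```python
def rank_visual_relations(candidates):
    """candidates: dicts with subject/object ids, I_so, and P_spo."""
    for r in candidates:
        E_so = 0.0 if r["subject_id"] == r["object_id"] else 1.0
        r["I_spo"] = E_so * r["I_so"] * r["P_spo"]  # I_spo = E_so · I_so · P_spo
    return sorted(candidates, key=lambda r: r["I_spo"], reverse=True)
```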
The method of the present invention can be implemented as a computer program; accordingly, an interest visual relationship detection device based on an interest propagation network is also provided, the device being configured with a computer program that, when executed, implements the interest visual relationship detection method of the present invention.
The method was implemented on the MS COCO image dataset and compared with the results of a conventional visual relationship detection method. Figs. 2 and 3 compare the results of conventional visual relationship detection with those of the present invention. Figs. 2(a) and 3(a) are the input images, with the objects involved in the detection results marked. Fig. 2(b) shows the result of conventional visual relationship detection, which contains as many as 24 visual relationships, most of them only weakly associated with the main content of the input image. Fig. 3(b) shows the result of the interest visual relationship detection of the present invention, which contains only 5 visual relationships, all strongly associated with the main content of the input image.

Claims (9)

1. An interest visual relationship detection method based on an interest propagation network, characterized in that an interest propagation network is built that takes an image as input and outputs the visual relationships of interest in the image, the interest propagation network comprising a panoptic object detection module, an object interest prediction module, and a relation predicate interest prediction module; first, the panoptic object detection module extracts objects from the input image and combines them pairwise into object pairs, computing the object features of the objects and the joint features of the object pairs; the object interest prediction module generates the visual, semantic, and position features of the objects and object pairs and obtains their interest features through linear transformation, from which the interestingness of the object pairs is predicted; meanwhile, the relation predicate interest prediction module obtains the interest features of the relation predicates by linearly transforming the visual, semantic, and position features of the object pairs' relation predicates and predicts the interestingness of the relation predicates between objects using semi-supervised learning; finally, the object-pair interestingness and the relation-predicate interestingness are combined into a visual-relation interestingness, the visual relations with high interestingness being the finally detected visual relations of interest.
2. The interest visual relationship detection method based on an interest propagation network according to claim 1, characterized by comprising the following steps:
1) extracting the bounding boxes and categories of all objects from the input image, computing the features within the n object boxes as object features, combining the n objects pairwise into n(n-1) object pairs, and computing the features within the union box of the subject and object of each pair as the joint feature;
2) for each object, obtaining the word-embedding feature of its class name from a pre-trained GloVe model, taking the object feature as its visual feature, the word-embedding feature of the class name as its semantic feature, and the position of the object relative to the whole image as its position feature, and combining the three features into the interest feature of the object; for each object pair, computing the three features of the subject and the object in the same way, then computing the three features of the pair itself, and combining them into the interest feature of the object pair; inputting the interest features of the objects and object pairs into a graph convolutional neural network to predict the interestingness of the object pairs;
3) for each object pair, computing the visual, semantic, and position features of its relation predicates to obtain the interest features of the relation predicates, and for each relation predicate, using semi-supervised learning to predict the probability that the relation predicate is interesting given that the object pair is interesting, i.e., the relation-predicate interestingness;
4) adding the object-category prediction loss of step 1), the object and object-pair interestingness prediction losses of step 2), and the relation-predicate interestingness prediction loss of step 3) into a total loss; combining the object-pair interestingness and the relation-predicate interestingness obtained by minimizing the total loss into the visual-relation interestingness; and ranking all visual relations by interestingness, the visual relations with high interestingness being the finally detected visual relations of interest.
3. The interest visual relationship detection method based on an interest propagation network according to claim 2, wherein in step 2) the position feature of an object is calculated as:
Loc_i = (x_i^l / w) ⊕ (y_i^t / h) ⊕ (x_i^r / w) ⊕ (y_i^b / h)
where Loc_i is the position feature of object i, ⊕ denotes the concatenation operation, x_i^l, y_i^t, x_i^r, y_i^b are the coordinates of the left, top, right, and bottom boundaries of object i, and w and h are the width and height of the input image.
4. The interest visual relationship detection method based on an interest propagation network according to claim 2, wherein in step 2) the position feature of an object pair is calculated as:
Loc_p = Loc_{s_p} ⊕ Loc_{o_p} ⊕ Loc_{s_p∪o_p}
where Loc_p is the position feature of object pair p, Loc_i is the position feature of object i, s_p and o_p denote the subject and object of the pair, and ∪ denotes the object-level union.
5. The interest visual relationship detection method based on an interest propagation network according to claim 2, wherein the visual feature of an object pair is calculated as:
F_p = F_{s_p} ⊕ F_{o_p} ⊕ F_{s_p∪o_p}
where F_p is the visual feature of object pair p, F_{s_p} and F_{o_p} are the subject and object features of the pair, and F_{s_p∪o_p} is the joint feature of the subject and object.
6. The interest visual relationship detection method based on an interest propagation network according to claim 2, wherein in step 3) the relation-predicate position feature of an object pair is calculated as:
Loc'_p = Loc'_{s_p} ⊕ Loc'_{o_p}, with Loc'_i = (x_i^l / w') ⊕ (y_i^t / h') ⊕ (x_i^r / w') ⊕ (y_i^b / h')
where Loc'_p is the relation-predicate position feature of object pair p, ⊕ denotes the concatenation operation, x_i^l, y_i^t, x_i^r, y_i^b are the coordinates of the left, top, right, and bottom boundaries of object i, s_p and o_p denote the subject and object of the pair, ∪ denotes the object-level union, and w' and h' are the width and height of the union box s_p ∪ o_p of the subject and object.
7. The interest visual relationship detection method based on an interest propagation network according to claim 2, wherein in the semi-supervised prediction of relation-predicate interestingness of step 3) the prediction loss is calculated as:
L_rela = l_rela(ŷ^l, y^l) + β · l_rela(ŷ^u, y^u)
where L_rela is the loss of the relation-predicate interestingness prediction, l_rela is the loss function, ŷ^l and ŷ^u denote the predictions on labeled and unlabeled data respectively, y^l and y^u denote the ground truths of labeled and unlabeled data respectively, and β is the loss weight of the unlabeled data.
8. The interest visual relationship detection method based on an interest propagation network according to claim 2, wherein the total loss of step 4) is calculated as:
L_pos = -(1 - p_pos)^2 · log(p_pos)
L_neg = -p_neg · log(1 - p_neg)
L_total = L_class + L_pos^obj + L_neg^obj + L_pos^pair + L_neg^pair + L_pos^rela + L_neg^rela
where L_pos and L_neg denote the losses of positive and negative samples respectively, p_pos and p_neg denote the probability scores of positive and negative samples, L_total is the total loss, L_class is the object-category prediction loss, L_pos^obj and L_neg^obj are the positive and negative losses of the object interestingness prediction, L_pos^pair and L_neg^pair are the positive and negative losses of the object-pair interestingness prediction, and L_pos^rela and L_neg^rela are the positive and negative losses of the relation-predicate interestingness prediction.
9. An interest visual relationship detection device based on an interest propagation network, characterized in that the device is configured with a computer program that implements the interest propagation network of claim 1 and, when executed, realizes the interest visual relationship detection method of claim 1.
CN202010848981.0A 2020-08-21 2020-08-21 Interest visual relation detection method and device based on interest propagation network Active CN111985505B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010848981.0A CN111985505B (en) 2020-08-21 2020-08-21 Interest visual relation detection method and device based on interest propagation network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010848981.0A CN111985505B (en) 2020-08-21 2020-08-21 Interest visual relation detection method and device based on interest propagation network

Publications (2)

Publication Number Publication Date
CN111985505A true CN111985505A (en) 2020-11-24
CN111985505B CN111985505B (en) 2024-02-13

Family

ID=73442732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010848981.0A Active CN111985505B (en) 2020-08-21 2020-08-21 Interest visual relation detection method and device based on interest propagation network

Country Status (1)

Country Link
CN (1) CN111985505B (en)

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100278420A1 (en) * 2009-04-02 2010-11-04 Siemens Corporation Predicate Logic based Image Grammars for Complex Visual Pattern Recognition
CN105045907A (en) * 2015-08-10 2015-11-11 北京工业大学 Method for constructing visual attention-label-user interest tree for personalized social image recommendation
US20160314597A1 (en) * 2007-07-03 2016-10-27 Shoppertrak Rct Corporation System and process for detecting, tracking and counting human objects of interest
CN108229272A (en) * 2017-02-23 2018-06-29 北京市商汤科技开发有限公司 Vision relationship detection method and device and vision relationship detection training method and device
CN108229491A (en) * 2017-02-28 2018-06-29 北京市商汤科技开发有限公司 The method, apparatus and equipment of detection object relationship from picture
CN108229477A (en) * 2018-01-25 2018-06-29 深圳市商汤科技有限公司 For visual correlation recognition methods, device, equipment and the storage medium of image
WO2019035771A1 (en) * 2017-08-17 2019-02-21 National University Of Singapore Video visual relation detection methods and systems
CN110796472A (en) * 2019-09-02 2020-02-14 腾讯科技(深圳)有限公司 Information pushing method and device, computer readable storage medium and computer equipment
CN110889397A (en) * 2018-12-28 2020-03-17 南京大学 Visual relation segmentation method taking human as main body
CN111125406A (en) * 2019-12-23 2020-05-08 天津大学 Visual relation detection method based on self-adaptive cluster learning
CN111325279A (en) * 2020-02-26 2020-06-23 福州大学 Pedestrian and personal sensitive article tracking method fusing visual relationship
CN111325243A (en) * 2020-02-03 2020-06-23 天津大学 Visual relation detection method based on regional attention learning mechanism
CN111368829A (en) * 2020-02-28 2020-07-03 北京理工大学 Visual semantic relation detection method based on RGB-D image
CN116089732A (en) * 2023-04-11 2023-05-09 江西时刻互动科技股份有限公司 User preference identification method and system based on advertisement click data
CN116628052A (en) * 2022-02-18 2023-08-22 罗伯特·博世有限公司 Apparatus and computer-implemented method for adding quantity facts to a knowledge base

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160314597A1 (en) * 2007-07-03 2016-10-27 Shoppertrak Rct Corporation System and process for detecting, tracking and counting human objects of interest
US20100278420A1 (en) * 2009-04-02 2010-11-04 Siemens Corporation Predicate Logic based Image Grammars for Complex Visual Pattern Recognition
CN105045907A (en) * 2015-08-10 2015-11-11 北京工业大学 Method for constructing visual attention-label-user interest tree for personalized social image recommendation
CN108229272A (en) * 2017-02-23 2018-06-29 北京市商汤科技开发有限公司 Vision relationship detection method and device and vision relationship detection training method and device
CN108229491A (en) * 2017-02-28 2018-06-29 北京市商汤科技开发有限公司 The method, apparatus and equipment of detection object relationship from picture
WO2019035771A1 (en) * 2017-08-17 2019-02-21 National University Of Singapore Video visual relation detection methods and systems
CN108229477A (en) * 2018-01-25 2018-06-29 深圳市商汤科技有限公司 For visual correlation recognition methods, device, equipment and the storage medium of image
CN110889397A (en) * 2018-12-28 2020-03-17 南京大学 Visual relation segmentation method taking human as main body
CN110796472A (en) * 2019-09-02 2020-02-14 腾讯科技(深圳)有限公司 Information pushing method and device, computer readable storage medium and computer equipment
CN111125406A (en) * 2019-12-23 2020-05-08 天津大学 Visual relation detection method based on self-adaptive cluster learning
CN111325243A (en) * 2020-02-03 2020-06-23 天津大学 Visual relation detection method based on regional attention learning mechanism
CN111325279A (en) * 2020-02-26 2020-06-23 福州大学 Pedestrian and personal sensitive article tracking method fusing visual relationship
CN111368829A (en) * 2020-02-28 2020-07-03 北京理工大学 Visual semantic relation detection method based on RGB-D image
CN116628052A (en) * 2022-02-18 2023-08-22 罗伯特·博世有限公司 Apparatus and computer-implemented method for adding quantity facts to a knowledge base
CN116089732A (en) * 2023-04-11 2023-05-09 江西时刻互动科技股份有限公司 User preference identification method and system based on advertisement click data

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YU, FAN, et al.: "Visual Relation of Interest Detection", MM '20: Proceedings of the 28th ACM International Conference on Multimedia, pages 1386-1394 *
ZHOU, HAO, et al.: "Visual Relationship Detection with Relative Location Mining", Proceedings of the 27th ACM International Conference on Multimedia (MM '19), pages 30-38 *
WU, JIANCHAO, et al.: "A Survey on Group Activity Recognition in Videos" (视频群体行为识别综述), Journal of Software (软件学报), vol. 34, no. 2, pages 964-984 *
CHEN, FANGFANG: "Visual Relationship Detection Based on Object Pair Screening and Joint Predicate Recognition" (基于目标对筛选和联合谓语识别的视觉关系检测), China Master's Theses Full-text Database, Information Science and Technology (中国优秀硕士学位论文全文数据库 信息科技辑), no. 8, pages 138-657 *

Also Published As

Publication number Publication date
CN111985505B (en) 2024-02-13

Similar Documents

Publication Publication Date Title
CN110516067B (en) Public opinion monitoring method, system and storage medium based on topic detection
CN112182166B (en) Text matching method and device, electronic equipment and storage medium
US11514244B2 (en) Structured knowledge modeling and extraction from images
US9183467B2 (en) Sketch segmentation
WO2020248391A1 (en) Case brief classification method and apparatus, computer device, and storage medium
CN110598005A (en) Public safety event-oriented multi-source heterogeneous data knowledge graph construction method
CN111259940A (en) Target detection method based on space attention map
CN113822224A (en) Rumor detection method and device integrating multi-modal learning and multi-granularity structure learning
CN114663915A (en) Image human-object interaction positioning method and system based on Transformer model
CN114429566A (en) Image semantic understanding method, device, equipment and storage medium
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
US20200364259A1 (en) Image retrieval
CN113902764A (en) Semantic-based image-text cross-modal retrieval method
CN111159411B (en) Knowledge graph fused text position analysis method, system and storage medium
US20230290118A1 (en) Automatic classification method and system of teaching videos based on different presentation forms
CN111985505B (en) Interest visual relation detection method and device based on interest propagation network
CN112069898A (en) Method and device for recognizing human face group attribute based on transfer learning
Shf et al. Review on deep based object detection
Liu et al. RDBN: Visual relationship detection with inaccurate RGB-D images
CN111368829A (en) Visual semantic relation detection method based on RGB-D image
WO2023173552A1 (en) Establishment method for target detection model, application method for target detection model, and device, apparatus and medium
CN110750673A (en) Image processing method, device, equipment and storage medium
CN115292533A (en) Cross-modal pedestrian retrieval method driven by visual positioning
CN113159071B (en) Cross-modal image-text association anomaly detection method
He et al. Investigating YOLO Models Towards Outdoor Obstacle Detection For Visually Impaired People

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant