CN111985505B - Interest visual relation detection method and device based on interest propagation network

Interest visual relation detection method and device based on interest propagation network

Info

Publication number
CN111985505B
CN111985505B
Authority
CN
China
Prior art keywords
interest
visual
relation
objects
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010848981.0A
Other languages
Chinese (zh)
Other versions
CN111985505A (en)
Inventor
任桐炜
武港山
王浩楠
于凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN202010848981.0A
Publication of CN111985505A
Application granted
Publication of CN111985505B
Legal status: Active
Anticipated expiration

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods

Abstract

An interest visual relationship detection method and device based on an interest propagation network extracts objects from an input image, combines the objects pairwise into object pairs, and computes the corresponding object features and joint features; it generates the visual, semantic and position features of the objects and object pairs and obtains their interest features through linear transformation, thereby predicting the object-pair interestingness; the interest features of the relation predicates are likewise obtained by linear transformation of the visual, semantic and position features of the relation predicates, and the interestingness of the relation predicates between objects is predicted; finally, the object-pair interestingness and the relation-predicate interestingness are combined into the visual-relationship interestingness, and the visual relationships with high interestingness are the finally detected interest visual relationships. By taking semantic importance as the criterion during visual relationship detection, the invention predicts relationship interestingness more reasonably, finds the interest visual relationships that accurately convey the main content of the image, and has good universality and practicability.

Description

Interest visual relation detection method and device based on interest propagation network
Technical Field
The invention belongs to the technical field of computer vision, relates to visual relationship detection in images, and particularly relates to an interest visual relationship detection method based on an interest propagation network.
Background Art
As a bridge between vision and natural language, visual relationship detection aims at describing the objects in an image and the interactions between them in the form of a relationship triplet <subject, relation predicate, object>. The subject and object are typically represented by an object's bounding box and category; relation predicates are typically verbs (e.g., "lift", "ride", "see"), directional words (e.g., "beside", "in front of", "above") or verb phrases (e.g., "stand beside", "sit on", "walk over"). Visual relationship detection helps machines understand and analyze the content of images or videos and can be widely applied in scenarios such as image retrieval and video analysis.
Conventional visual relationship detection methods aim to detect all visual relationships in an image. In practice, because of the combinatorial explosion of subjects, relation predicates and objects, conventional approaches typically detect an excessively rich set of visual relationships, as shown in FIG. 2. Although this describes the image content more completely, the excess of detail can mislead a machine's understanding of the main content of the image, reducing accuracy in scenarios such as image retrieval and hindering accurate analysis of images or videos.
Intuitively, not all detected visual relationships are truly "interesting" in the semantic sense: not every visual relationship expresses the main content of an image, and often only a small portion of them is important for conveying it; such relationships are the interest visual relationships. The goal of interest visual relationship detection is to detect the visual relationships that are truly important for conveying the main content of the image, i.e., the visual relationships that are "interesting".
At present, no research work has attempted to detect interest visual relationships; only some related works measure the visual saliency of relationships through an attention module, determining saliency weights to find salient visual relationships. However, such approaches consider only the visual saliency of a relationship and ignore its semantic importance, so the resulting relationships are not necessarily truly "interesting".
Disclosure of Invention
The invention aims to solve the following problem: an excess of visual relationships in an image easily biases machine understanding, so the interest visual relationships that accurately convey the main content of the image need to be detected, in order to help machines understand and analyze images or videos more accurately.
The technical scheme of the invention is as follows: an interest visual relationship detection method based on an interest propagation network establishes an interest propagation network that takes an image as input and outputs the interest visual relationships in the image; the interest propagation network comprises a panoramic object detection module, an object-pair interest prediction module and a relation-predicate interest prediction module. Firstly, objects are extracted from the input image by the panoramic object detection module and combined pairwise into object pairs, and the object features of the objects and the joint features of the object pairs are computed; the object-pair interest prediction module generates the visual, semantic and position features of the objects and object pairs and obtains their respective interest features, thereby predicting the object-pair interestingness. Meanwhile, the relation-predicate interest prediction module obtains the interest features of the relation predicates from the visual, semantic and position features of the relation predicates of the object pairs, and uses semi-supervised learning to predict the interestingness of the relation predicates between objects. Finally, the object-pair interestingness and the relation-predicate interestingness are combined into the visual-relationship interestingness, and the visual relationships with high interestingness are the finally detected interest visual relationships.
Further, the invention comprises the following steps:
1) For an input image, extracting the bounding boxes and categories of all objects, computing the features within the bounding boxes of the n objects as the object features, combining the n objects pairwise into n(n-1) object pairs, and computing the features within the union bounding box of the subject and object of each pair as the joint features;
2) For each object, obtaining the word embedding features of its category name from a pre-trained GloVe model, taking the object features as its visual features, the word embedding features of the category name as its semantic features, and the position of the object relative to the whole image as its position features, and combining the three features to obtain the interest features of the object; for each object pair, computing the three features of the subject and the object in the same way, then computing the three features of the pair, and combining them to obtain the interest features of the pair; inputting the interest features of the objects and object pairs into a graph convolutional neural network to predict the object-pair interestingness;
3) For each object pair, computing the visual, semantic and position features of its relation predicate to obtain the interest features of the relation predicate, and, for each relation predicate, using semi-supervised learning to predict the probability that the relation predicate is interesting given that the object pair is interesting, i.e., the relation-predicate interestingness;
4) Adding the loss of object category prediction in step 1), the loss of object and object-pair interestingness prediction in step 2) and the loss of relation-predicate interestingness prediction in step 3) to obtain the total loss; combining the object-pair interestingness and relation-predicate interestingness obtained by minimizing the total loss into the visual-relationship interestingness; and sorting all visual relationships by interestingness, the visual relationships with high interestingness being the finally detected interest visual relationships.
The invention also provides an interest visual relationship detection device based on an interest propagation network; the device is configured with a computer program that implements the interest propagation network and, when executed, carries out the above interest visual relationship detection method.
The invention has the following beneficial effects: it provides a scheme for solving the problem of biased machine understanding caused by excessively rich visual relationships in images; by considering the semantic, position and visual features of objects and object pairs, it predicts relationship interestingness reasonably with semantic importance as the criterion during visual relationship detection, and finds the interest visual relationships that accurately convey the main content of the image. The method has good universality and practicability.
Drawings
FIG. 1 shows the architecture of the interest propagation network of the present invention and the flow of interest visual relationship detection.
FIG. 2 shows an example of the excessively rich visual relationships detected by a conventional method.
FIG. 3 shows an example of the results of the interest visual relationship detection method of the present invention.
Detailed Description
The interest visual relationship detection method based on an interest propagation network provides a solution to the problem of biased machine understanding caused by excessively rich visual relationships in an image. For an input image, the interest features of the objects, object pairs and relation predicates are obtained by linear transformation of the combined semantic, position and visual features of the objects and object pairs; the interestingness of the visual relationships is predicted reasonably with semantic importance as the criterion; and interest visual relationship results that accurately convey the main content of the image are produced.
The practice of the invention is specifically described below.
As shown in FIG. 1, the invention establishes an interest propagation network that takes an image as input and outputs the interest visual relationships in the image; it comprises a panoramic object detection module, an object-pair interest prediction module and a relation-predicate interest prediction module. The panoramic object detection module performs panoptic segmentation on the image, in which the content of the image is divided into the "things" and "stuff" categories according to whether it has a fixed shape: objects with fixed shapes, such as people and vehicles, belong to things (roughly, countable nouns are things), while objects without a fixed shape, such as sky and grass, belong to stuff (roughly, uncountable nouns are stuff). The objects in the image are extracted through panoptic segmentation, and an instance encoder obtains the object features and the joint features of the object pairs, which are fed into the object-pair interest prediction module and the relation-predicate interest prediction module respectively. A semantic encoder, a visual encoder and a position encoder obtain the semantic, visual and position features of the objects, object pairs and relation predicates; linear transformation yields their interest features; from the interest features, the object-pair interestingness and the relation-predicate interestingness are predicted by supervised and semi-supervised learning, respectively; the two interestingness scores are combined into the visual-relationship interestingness, and the interest visual relationships are sorted and output according to it.
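For illustration only, the following is a minimal sketch of how one of these encoders might fuse visual, semantic and position features into an interest feature by a single linear transformation, assuming PyTorch; the class name, feature dimensions and ReLU activation are hypothetical choices, not the network's actual implementation.

```python
import torch
import torch.nn as nn

class InterestFeatureEncoder(nn.Module):
    """Fuses visual, semantic and position features into one interest feature."""
    def __init__(self, vis_dim, sem_dim, loc_dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(vis_dim + sem_dim + loc_dim, out_dim)

    def forward(self, vis, sem, loc):
        # Concatenate the three feature types, then apply the linear transform.
        return torch.relu(self.fc(torch.cat([vis, sem, loc], dim=-1)))

n, vis_dim, sem_dim = 4, 1024, 300           # hypothetical sizes
encoder = InterestFeatureEncoder(vis_dim, sem_dim, loc_dim=4, out_dim=512)
vis = torch.randn(n, vis_dim)                # object features from the detector
sem = torch.randn(n, sem_dim)                # GloVe embeddings of category names
loc = torch.rand(n, 4)                       # normalized box coordinates
interest = encoder(vis, sem, loc)            # (n, 512) interest features
```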
On the basis of the interest propagation network, objects are first extracted from the input image by the panoramic object detection module and combined pairwise into object pairs, and the object features of the objects and the joint features of the object pairs are computed; the object-pair interest prediction module generates the visual, semantic and position features of the objects and object pairs and obtains their respective interest features, thereby predicting the object-pair interestingness. Meanwhile, the relation-predicate interest prediction module obtains the interest features of the relation predicates from the visual, semantic and position features of the relation predicates of the object pairs, and uses semi-supervised learning to predict the interestingness of the relation predicates between objects. Finally, the object-pair interestingness and the relation-predicate interestingness are combined into the visual-relationship interestingness, and the visual relationships with high interestingness are the finally detected interest visual relationships.
The implementation of the present invention is described in detail below. The invention comprises the following steps:
1) For an input image, computing the object features and joint features with the panoramic object detection module of the interest propagation network:
1.1) Extracting the bounding boxes and categories of all objects in the image;
1.2) Computing the features within the bounding boxes of the n objects from step 1.1) as the object features;
1.3) Combining the n objects from step 1.1) pairwise into n(n-1) object pairs, and computing the features within the union bounding box of the subject and object of each pair as the joint features.
2) For the objects extracted in step 1) and the object pairs they form, computing the object-pair interestingness with the object-pair interest prediction module of the interest propagation network:
2.1) For each object extracted in step 1), taking its object features as the visual features, obtaining the word embedding features of its category name from a pre-trained GloVe model as the semantic features, taking the position of the object relative to the whole image as the position features, and combining the three features to obtain the interest features of the object. The position features of an object are computed as:

Loc_i = (x_i^l / w) ‖ (y_i^t / h) ‖ (x_i^r / w) ‖ (y_i^b / h)

where Loc_i is the position feature of object i, ‖ denotes the juxtaposition (concatenation) operation, x_i^l, y_i^t, x_i^r and y_i^b are the coordinates of the left, top, right and bottom boundaries of object i, respectively, and w and h are the width and height of the input image, respectively.
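The normalization above transcribes directly into code; the helper below is a sketch with hypothetical argument names, not the patented code.

```python
def object_position_feature(box, image_w, image_h):
    """box = (x_left, y_top, x_right, y_bottom) in pixels."""
    xl, yt, xr, yb = box
    # Juxtapose the four boundary coordinates, normalized by the image size.
    return [xl / image_w, yt / image_h, xr / image_w, yb / image_h]

# Example: a box in a 640x480 image.
print(object_position_feature((50, 40, 150, 100), 640, 480))
# -> [0.078125, 0.0833..., 0.234375, 0.2083...]
```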
2.2) For each object pair formed in step 1), computing the three features of the subject and the object in the same way, then computing the three features of the pair, and combining them to obtain the interest features of the pair. The position features of an object pair are computed as:

Loc_p = Loc_{s_p} ∪ Loc_{o_p}

where Loc_p is the position feature of object pair p, Loc_i is the position feature of object i, s_p and o_p denote the subject and object of the pair, respectively, and ∪ denotes the object-level juxtaposition operation.

The visual features of an object pair are computed as:

F_p = F_{s_p} ‖ F_{o_p} ‖ F_{s_p ∪ o_p}

where F_p is the visual feature of object pair p, F_{s_p} and F_{o_p} are the object features of the subject and object of the pair, respectively, and F_{s_p ∪ o_p} is the joint feature of the subject and object of the pair.
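Both juxtaposition operations amount to concatenation; the sketch below assumes PyTorch tensors, and everything apart from the concatenations themselves is illustrative.

```python
import torch

def pair_position_feature(loc_s, loc_o):
    # Object-level juxtaposition of subject and object position features.
    return torch.cat([loc_s, loc_o], dim=-1)

def pair_visual_feature(f_s, f_o, f_joint):
    # Subject feature, object feature and their union-box joint feature.
    return torch.cat([f_s, f_o, f_joint], dim=-1)

loc_s, loc_o = torch.rand(4), torch.rand(4)
f_s, f_o, f_joint = torch.randn(1024), torch.randn(1024), torch.randn(1024)
print(pair_position_feature(loc_s, loc_o).shape)     # torch.Size([8])
print(pair_visual_feature(f_s, f_o, f_joint).shape)  # torch.Size([3072])
```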
2.3) Inputting the interest features from steps 2.1) and 2.2) into a graph convolutional neural network and predicting the object-pair interestingness.
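As a rough illustration of this step, the sketch below passes object and object-pair interest features through one simplified graph-convolution layer, assuming PyTorch; the adjacency wiring (each pair node linked to its subject and object nodes), the single layer and the sigmoid scorer are assumptions, not the patented architecture.

```python
import torch
import torch.nn as nn

class SimpleGraphConv(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # Mean-aggregate each node's neighborhood, then transform: D^-1 A X W.
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)
        return torch.relu(self.fc(adj @ x / deg))

n, dim = 3, 512
num_pairs = n * (n - 1)                      # ordered pairs, as in step 1)
x = torch.randn(n + num_pairs, dim)          # object then pair interest features
adj = torch.eye(n + num_pairs)               # self-loops
p = n
for s in range(n):
    for o in range(n):
        if s != o:
            adj[p, s] = adj[s, p] = 1.0      # pair node <-> its subject
            adj[p, o] = adj[o, p] = 1.0      # pair node <-> its object
            p += 1
gcn = SimpleGraphConv(dim)
scorer = nn.Linear(dim, 1)
pair_interest = torch.sigmoid(scorer(gcn(x, adj)[n:]))  # one score per pair
```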
3) For the object pairs formed in step 1), computing the relation-predicate interestingness with the relation-predicate interest prediction module of the interest propagation network:
3.1) For each object pair formed in step 1), computing the visual, semantic and position features of its relation predicate and combining them to obtain the interest features of the relation predicate. The position features of the relation predicate of an object pair are computed as:

Loc'_p = Loc'_{s_p} ∪ Loc'_{o_p}, with Loc'_i = (x_i^l / w') ‖ (y_i^t / h') ‖ (x_i^r / w') ‖ (y_i^b / h')

where Loc'_p is the relation-predicate position feature of object pair p, and w' and h' are the width and height of the union bounding box of the subject and object of the pair. The visual features of the relation predicate are computed in the same way as the visual features of the object pair.
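A sketch of the union-box normalization described above; since it is not stated here whether the coordinates are also shifted to the union-box origin, this version simply divides the raw boundary coordinates by w' and h' as in the formula.

```python
def union_box(box_s, box_o):
    """Boxes are (x_left, y_top, x_right, y_bottom) in pixels."""
    return (min(box_s[0], box_o[0]), min(box_s[1], box_o[1]),
            max(box_s[2], box_o[2]), max(box_s[3], box_o[3]))

def predicate_position_feature(box_s, box_o):
    xl, yt, xr, yb = union_box(box_s, box_o)
    w_u, h_u = xr - xl, yb - yt              # w' and h' of the union box
    feat = []
    for box in (box_s, box_o):               # object-level juxtaposition
        feat += [box[0] / w_u, box[1] / h_u, box[2] / w_u, box[3] / h_u]
    return feat

print(predicate_position_feature((50, 40, 150, 100), (120, 60, 300, 200)))
```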
3.2) For each relation predicate, using semi-supervised learning to predict the probability that the relation predicate is interesting given that the object pair is interesting, i.e., the relation-predicate interestingness. The semi-supervised learning loss is computed as:

L_rela = l_rela(p̂^l, p^l) + β · l_rela(p̂^u, p^u)

where L_rela is the loss of the relation-predicate interest prediction module, l_rela is the loss function, p̂^l and p̂^u denote the predictions on labeled and unlabeled data, respectively, p^l and p^u denote the ground-truth results of labeled and unlabeled data, respectively, and β is the loss weight of the unlabeled data.
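A sketch of this two-term loss, assuming PyTorch; the choice of binary cross-entropy for l_rela and of pseudo-labels as the targets for unlabeled data are assumptions.

```python
import torch
import torch.nn.functional as F

def rela_loss(pred_labeled, y_labeled, pred_unlabeled, y_pseudo, beta=0.5):
    # Supervised term on labeled pairs plus a beta-weighted term on
    # unlabeled pairs whose targets come from pseudo-labels.
    loss_l = F.binary_cross_entropy(pred_labeled, y_labeled)
    loss_u = F.binary_cross_entropy(pred_unlabeled, y_pseudo)
    return loss_l + beta * loss_u

pred_l = torch.tensor([0.9, 0.2]); y_l = torch.tensor([1.0, 0.0])
pred_u = torch.tensor([0.7]);      y_u = torch.tensor([1.0])
print(rela_loss(pred_l, y_l, pred_u, y_u))
```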
4) Minimizing the total loss of the interest propagation network, predicting the visual relationship of interest:
4.1) Adding the loss of object category prediction in step 1), the loss of object and object-pair interestingness prediction in step 2) and the loss of relation-predicate interestingness prediction in step 3) to obtain the total loss of the interest propagation network, and combining the object-pair interestingness and relation-predicate interestingness obtained by minimizing the total loss into the visual-relationship interestingness. The total loss of the interest propagation network is computed as:
L_pos = -(1 - p_pos)^2 · log(p_pos)
L_neg = -p_neg · log(1 - p_neg)
L_total = L_class + L_o^pos + L_o^neg + L_p^pos + L_p^neg + L_r^pos + L_r^neg

where L_pos and L_neg denote the loss of positive and negative samples, p_pos and p_neg denote the probability scores of positive and negative samples, respectively, L_total is the total loss of the interest propagation network, L_class is the loss of object category prediction, L_o^pos and L_o^neg denote the positive and negative losses of object interestingness prediction, L_p^pos and L_p^neg denote the positive and negative losses of object-pair interestingness prediction, and L_r^pos and L_r^neg denote the positive and negative losses of relation-predicate interestingness prediction.
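The two sample losses transcribe directly into code; the sketch below assumes PyTorch and adds a small clamp for numerical stability.

```python
import torch

def pos_loss(p_pos, eps=1e-7):
    # L_pos = -(1 - p_pos)^2 * log(p_pos): down-weights easy positives.
    return -((1 - p_pos) ** 2) * torch.log(p_pos.clamp(min=eps))

def neg_loss(p_neg, eps=1e-7):
    # L_neg = -p_neg * log(1 - p_neg): penalizes confident false positives.
    return -p_neg * torch.log((1 - p_neg).clamp(min=eps))

print(pos_loss(torch.tensor(0.9)))  # small: confident correct positive
print(neg_loss(torch.tensor(0.9)))  # large: confident false positive
```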
4.2) Sorting all visual relationships by interestingness, the visual relationships with high interestingness being the finally detected interest visual relationships. The interestingness of a visual relationship is computed as:
I_spo = E_so · I_so · P_spo

where I_spo is the interestingness of the visual relationship, I_so and P_spo denote the object-pair interestingness and the relation-predicate interestingness, respectively, and E_so is a binary parameter: E_so is 0 when the subject and object of the pair are the same object, and 1 otherwise.
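A sketch of the final combination and ranking step based on the formula above; the candidate tuple format and the equality test standing in for the same-object check are illustrative.

```python
def rank_relations(candidates):
    """candidates: list of (subject, predicate, object, I_so, P_spo)."""
    scored = []
    for s, pred, o, i_so, p_spo in candidates:
        e_so = 0.0 if s == o else 1.0        # E_so suppresses self-relations
        scored.append((e_so * i_so * p_spo, (s, pred, o)))
    return sorted(scored, reverse=True)      # highest interestingness first

cands = [("person", "ride", "horse", 0.9, 0.8),
         ("person", "beside", "tree", 0.4, 0.5),
         ("horse1", "next to", "horse1", 0.7, 0.9)]  # same object instance
for score, triplet in rank_relations(cands):
    print(f"{score:.2f}", triplet)
```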
The method of the present invention can be implemented by a computer program; accordingly, an interest visual relationship detection device based on an interest propagation network is also provided, the device being configured with a computer program that, when executed, implements the interest visual relationship detection method of the present invention.
The invention was implemented on the MS COCO image dataset and compared with the results of a conventional visual relationship detection method. Figs. 2 and 3 compare the conventional visual relationship detection results with the results of the present invention. Fig. 2(a) and Fig. 3(a) are input images with the objects involved in the visual relationship detection results marked. Fig. 2(b) shows the result of conventional visual relationship detection: it contains as many as 24 visual relationships, most of which are only weakly related to the main content of the input image. Fig. 3(b) shows the result of the interest visual relationship detection of the present invention: it contains only 5 visual relationships, and they are strongly related to the main content of the input image.

Claims (9)

1. An interest visual relationship detection method based on an interest propagation network, characterized in that an interest propagation network is established that takes an image as input and outputs the interest visual relationships in the image, the interest propagation network comprising a panoramic object detection module, an object-pair interest prediction module and a relation-predicate interest prediction module; firstly, objects are extracted from the input image by the panoramic object detection module and combined pairwise into object pairs, and the object features of the objects and the joint features of the object pairs are computed; the object-pair interest prediction module generates the visual, semantic and position features of the objects and object pairs and obtains their interest features through linear transformation, thereby predicting the object-pair interestingness; meanwhile, the relation-predicate interest prediction module obtains the interest features of the relation predicates through linear transformation of the visual, semantic and position features of the relation predicates of the object pairs, and uses semi-supervised learning to predict the interestingness of the relation predicates between objects; finally, the object-pair interestingness and the relation-predicate interestingness are combined into the visual-relationship interestingness, and the visual relationships with high interestingness are the finally detected interest visual relationships.
2. The interest visual relationship detection method based on the interest propagation network as claimed in claim 1, comprising the steps of:
1) For an input image, extracting the bounding boxes and categories of all objects, computing the features within the bounding boxes of the n objects as the object features, combining the n objects pairwise into n(n-1) object pairs, and computing the features within the union bounding box of the subject and object of each pair as the joint features;
2) For each object, obtaining the word embedding features of its category name from a pre-trained GloVe model, taking the object features as its visual features, the word embedding features of the category name as its semantic features, and the position of the object relative to the whole image as its position features, and combining the three features to obtain the interest features of the object; for each object pair, computing the three features of the subject and the object in the same way, then computing the three features of the pair, and combining them to obtain the interest features of the pair; inputting the interest features of the objects and object pairs into a graph convolutional neural network to predict the object-pair interestingness;
3) For each object pair, computing the visual, semantic and position features of its relation predicate to obtain the interest features of the relation predicate, and, for each relation predicate, using semi-supervised learning to predict the probability that the relation predicate is interesting given that the object pair is interesting, i.e., the relation-predicate interestingness;
4) Adding the loss of object category prediction in step 1), the loss of object and object-pair interestingness prediction in step 2) and the loss of relation-predicate interestingness prediction in step 3) to obtain the total loss; combining the object-pair interestingness and relation-predicate interestingness obtained by minimizing the total loss into the visual-relationship interestingness; and sorting all visual relationships by interestingness, the visual relationships with high interestingness being the finally detected interest visual relationships.
3. The interest visual relationship detection method based on an interest propagation network as claimed in claim 2, wherein in step 2) the position features of an object are computed as:

Loc_i = (x_i^l / w) ‖ (y_i^t / h) ‖ (x_i^r / w) ‖ (y_i^b / h)

where Loc_i is the position feature of object i, ‖ denotes the juxtaposition operation, x_i^l, y_i^t, x_i^r and y_i^b are the coordinates of the left, top, right and bottom boundaries of object i, respectively, and w and h are the width and height of the input image, respectively.
4. The interest visual relationship detection method based on an interest propagation network as claimed in claim 2, wherein in step 2) the position features of an object pair are computed as:

Loc_p = Loc_{s_p} ∪ Loc_{o_p}

where Loc_p is the position feature of object pair p, Loc_i is the position feature of object i, s_p and o_p denote the subject and object of the pair, respectively, and ∪ denotes the object-level juxtaposition operation.
5. The interest visual relationship detection method based on an interest propagation network as claimed in claim 2, wherein the visual features of an object pair are computed as:

F_p = F_{s_p} ‖ F_{o_p} ‖ F_{s_p ∪ o_p}

where F_p is the visual feature of object pair p, F_{s_p} and F_{o_p} are the object features of the subject and object of the pair, respectively, and F_{s_p ∪ o_p} is the joint feature of the subject and object of the pair.
6. The interest visual relationship detection method based on an interest propagation network as claimed in claim 2, wherein in step 3) the relation-predicate position features of an object pair are computed as:

Loc'_p = Loc'_{s_p} ∪ Loc'_{o_p}, with Loc'_i = (x_i^l / w') ‖ (y_i^t / h') ‖ (x_i^r / w') ‖ (y_i^b / h')

where Loc'_p is the relation-predicate position feature of object pair p, ‖ denotes the juxtaposition operation, x_i^l, y_i^t, x_i^r and y_i^b are the coordinates of the left, top, right and bottom boundaries of object i, s_p and o_p denote the subject and object of the pair, ∪ denotes the object-level juxtaposition operation, and w' and h' are the width and height of the union bounding box of the subject and object of the pair.
7. The interest visual relationship detection method based on an interest propagation network as claimed in claim 2, wherein in the semi-supervised learning prediction of the relation-predicate interestingness in step 3), the prediction loss is computed as:

L_rela = l_rela(p̂^l, p^l) + β · l_rela(p̂^u, p^u)

where L_rela is the loss of relation-predicate interestingness prediction, l_rela is the loss function, p̂^l and p̂^u denote the predictions on labeled and unlabeled data, respectively, p^l and p^u denote the ground-truth results of labeled and unlabeled data, respectively, and β is the loss weight of the unlabeled data.
8. The interest visual relationship detection method based on an interest propagation network as claimed in claim 2, wherein the total loss in step 4) is computed as:

L_pos = -(1 - p_pos)^2 · log(p_pos)
L_neg = -p_neg · log(1 - p_neg)
L_total = L_class + L_o^pos + L_o^neg + L_p^pos + L_p^neg + L_r^pos + L_r^neg

where L_pos and L_neg denote the loss of positive and negative samples, p_pos and p_neg denote the probability scores of positive and negative samples, respectively, L_total is the total loss, L_class is the loss of object category prediction, L_o^pos and L_o^neg denote the positive and negative losses of object interestingness prediction, L_p^pos and L_p^neg denote the positive and negative losses of object-pair interestingness prediction, and L_r^pos and L_r^neg denote the positive and negative losses of relation-predicate interestingness prediction.
9. An interest visual relationship detection device based on an interest propagation network, characterized in that the device is configured with a computer program corresponding to the interest propagation network of claim 1, and the computer program, when executed, implements the interest visual relationship detection method of claim 1.
CN202010848981.0A 2020-08-21 2020-08-21 Interest visual relation detection method and device based on interest propagation network Active CN111985505B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010848981.0A CN111985505B (en) 2020-08-21 2020-08-21 Interest visual relation detection method and device based on interest propagation network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010848981.0A CN111985505B (en) 2020-08-21 2020-08-21 Interest visual relation detection method and device based on interest propagation network

Publications (2)

Publication Number Publication Date
CN111985505A (en) 2020-11-24
CN111985505B (en) 2024-02-13

Family

ID=73442732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010848981.0A Active CN111985505B (en) 2020-08-21 2020-08-21 Interest visual relation detection method and device based on interest propagation network

Country Status (1)

Country Link
CN (1) CN111985505B (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7965866B2 (en) * 2007-07-03 2011-06-21 Shoppertrak Rct Corporation System and process for detecting, tracking and counting human objects of interest
US8548231B2 (en) * 2009-04-02 2013-10-01 Siemens Corporation Predicate logic based image grammars for complex visual pattern recognition

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105045907A (en) * 2015-08-10 2015-11-11 北京工业大学 Method for constructing visual attention-label-user interest tree for personalized social image recommendation
CN108229272A (en) * 2017-02-23 2018-06-29 北京市商汤科技开发有限公司 Vision relationship detection method and device and vision relationship detection training method and device
CN108229491A (en) * 2017-02-28 2018-06-29 北京市商汤科技开发有限公司 The method, apparatus and equipment of detection object relationship from picture
WO2019035771A1 (en) * 2017-08-17 2019-02-21 National University Of Singapore Video visual relation detection methods and systems
CN108229477A (en) * 2018-01-25 2018-06-29 深圳市商汤科技有限公司 For visual correlation recognition methods, device, equipment and the storage medium of image
CN110889397A (en) * 2018-12-28 2020-03-17 南京大学 Visual relation segmentation method taking human as main body
CN110796472A (en) * 2019-09-02 2020-02-14 腾讯科技(深圳)有限公司 Information pushing method and device, computer readable storage medium and computer equipment
CN111125406A (en) * 2019-12-23 2020-05-08 天津大学 Visual relation detection method based on self-adaptive cluster learning
CN111325243A (en) * 2020-02-03 2020-06-23 天津大学 Visual relation detection method based on regional attention learning mechanism
CN111325279A (en) * 2020-02-26 2020-06-23 福州大学 Pedestrian and personal sensitive article tracking method fusing visual relationship
CN111368829A (en) * 2020-02-28 2020-07-03 北京理工大学 Visual semantic relation detection method based on RGB-D image
CN116628052A (en) * 2022-02-18 2023-08-22 罗伯特·博世有限公司 Apparatus and computer-implemented method for adding quantity facts to a knowledge base
CN116089732A (en) * 2023-04-11 2023-05-09 江西时刻互动科技股份有限公司 User preference identification method and system based on advertisement click data

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Yu, Fan, et al. Visual Relation of Interest Detection. MM '20: Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 1386-1394. *
Zhou, Hao, et al. Visual Relationship Detection with Relative Location Mining. Proceedings of the 27th ACM International Conference on Multimedia (MM '19), 2019, pp. 30-38. *
Chen Fangfang. Visual Relationship Detection Based on Object Pair Screening and Joint Predicate Recognition. China Master's Theses Full-text Database, Information Science and Technology, No. 8: I138-657. *
Wu Jianchao, et al. A Survey of Video Group Behavior Recognition. Journal of Software, Vol. 34, No. 2: 964-984. *

Also Published As

Publication number Publication date
CN111985505A (en) 2020-11-24

Similar Documents

Publication Publication Date Title
CN109740148B (en) Text emotion analysis method combining BiLSTM with Attention mechanism
CN110598005B (en) Public safety event-oriented multi-source heterogeneous data knowledge graph construction method
CN110516067B (en) Public opinion monitoring method, system and storage medium based on topic detection
CN105760507B (en) Cross-module state topic relativity modeling method based on deep learning
CN106682059B (en) Modeling and extraction from structured knowledge of images
US20200097604A1 (en) Stacked cross-modal matching
CN113220919B (en) Dam defect image text cross-modal retrieval method and model
CN111259940B (en) Target detection method based on space attention map
CN111680159B (en) Data processing method and device and electronic equipment
CN114936623B (en) Aspect-level emotion analysis method integrating multi-mode data
CN112434732A (en) Deep learning classification method based on feature screening
CN103336835B (en) Image retrieval method based on weight color-sift characteristic dictionary
CN115311463B (en) Category-guided multi-scale decoupling marine remote sensing image text retrieval method and system
CN114780690A (en) Patent text retrieval method and device based on multi-mode matrix vector representation
CN114429566A (en) Image semantic understanding method, device, equipment and storage medium
CN112115253A (en) Depth text ordering method based on multi-view attention mechanism
CN107526721A (en) A kind of disambiguation method and device to electric business product review vocabulary
Gong et al. A method for wheat head detection based on yolov4
Du High-precision portrait classification based on mtcnn and its application on similarity judgement
CN115658934A (en) Image-text cross-modal retrieval method based on multi-class attention mechanism
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
CN115311465A (en) Image description method based on double attention models
US11281714B2 (en) Image retrieval
CN112711693A (en) Litigation clue mining method and system based on multi-feature fusion
CN111985505B (en) Interest visual relation detection method and device based on interest propagation network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant