CN108229491B - Method, device and equipment for detecting object relation from picture

Info

Publication number
CN108229491B
Authority
CN
China
Legal status: Active
Application number
CN201710113099.XA
Other languages: Chinese (zh)
Other versions: CN108229491A
Inventors: 汤晓鸥 (Xiaoou Tang), 戴勃 (Bo Dai), 林达华 (Dahua Lin)
Current Assignee
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co., Ltd.
Priority to CN201710113099.XA
Publication of CN108229491A
Application granted
Publication of CN108229491B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

A method, an apparatus, and a device for detecting object relationships from pictures are disclosed. According to one embodiment, the method for detecting object relationships from a picture comprises the following steps: detecting a plurality of objects in the picture to obtain a picture region feature of each object; pairing the objects detected in the picture to obtain candidate subject-object pairs, where each candidate subject-object pair comprises the picture region feature of a paired subject object and the picture region feature of a paired object; obtaining a picture feature of a relational predicate based at least on the picture region feature of the subject object and the picture region feature of the object in the candidate subject-object pair; and sequentially detecting the positional relationships of the plurality of objects through N sub-neural networks according to the feature information, taking the detection result of the Nth sub-neural network as the final detection result of the object relationships in the picture.

Description

Method, device and equipment for detecting object relation from picture
Technical Field
The present application relates to image recognition technology, and in particular, to a method, an apparatus, and a device for detecting object relationships from pictures.
Background
Identifying objects and object relationships in pictures is an important aspect in image recognition. The object relationship includes three parts, namely, (subject object, relational predicate, object), e.g., (cat, eat, fish). The extraction of the object relationship plays an important role in understanding the content and meaning of the picture.
In conventional object relationship detection, the triple (subject object, relational predicate, object) is generally treated as a whole, and a different model is trained for each distinct triple. In practical scenarios this runs into the problems of an excessive number of categories and redundant detection.
Disclosure of Invention
The embodiment of the application provides a technical scheme for detecting object relation from a picture.
According to a first aspect of the present application, a method for detecting object relationships from a picture is provided, the method comprising: detecting a plurality of objects in the picture to obtain a picture region feature of each object; pairing the objects detected in the picture to obtain candidate subject-object pairs, where each candidate subject-object pair comprises the picture region feature of a paired subject object and the picture region feature of a paired object; generating a picture feature of a relational predicate based at least on the picture region feature of the subject object and the picture region feature of the object in the candidate subject-object pair; and detecting the positional relationships of the plurality of objects sequentially through N sub-neural networks according to feature information, and taking the detection result of the Nth sub-neural network as the final detection result of the object relationships in the picture, wherein: the 1st sub-neural network detects the positional relationships of the plurality of objects according to the feature information; the nth sub-neural network detects the positional relationships of the plurality of objects again according to the feature information and the detection result of the (n-1)th sub-neural network; N is an integer greater than 1, and n is an integer greater than 1 and less than or equal to N; and the feature information comprises the picture region feature of the subject object, the picture region feature of the object, and the picture feature of the relational predicate.
In an exemplary embodiment, the feature information of the picture is acquired via a first neural network.
In an exemplary embodiment, the first neural network and the N sub-neural networks are pre-trained via a training picture set including object relationship labeling data.
In an exemplary embodiment, the picture region characteristic comprises at least information representing an appearance characteristic of the object.
In an exemplary embodiment, the method further includes: detecting the plurality of objects in the picture to obtain a position layout feature of each object; and generating the picture feature of the relational predicate based at least on the picture region feature of the subject object and the picture region feature of the object in the candidate subject-object pair comprises: generating the picture feature of the relational predicate based on the picture region features of the subject object and the object in the candidate subject-object pair and the position layout features of the subject object and the object.
In an exemplary embodiment, the positional layout features include information indicating a positional relationship of the subject object and the object in the picture.
In an exemplary embodiment, generating the picture feature of the relational predicate based on the picture region features of the subject object and the object in the candidate subject-object pair and the position layout features of the subject object and the object comprises: integrating the picture region features of the subject object and the object with the position layout features of the subject object and the object to generate the picture feature of the relational predicate.
In exemplary embodiments, the integration includes direct combination, or combination followed by compression.
According to a second aspect of the present application, a method for detecting object relationships from a picture is provided, the method comprising: detecting a plurality of objects in the picture to obtain a picture region feature and a position layout feature of each object; pairing the objects detected in the picture to obtain candidate subject-object pairs, where each candidate subject-object pair comprises the picture region feature of a paired subject object and the picture region feature of a paired object; generating a picture feature of a relational predicate based on the picture region features of the subject object and the object in the candidate subject-object pair and the position layout features of the subject object and the object; and detecting the positional relationships of the plurality of objects based on feature information including: the picture region feature of the subject object, the picture region feature of the object, and the picture feature of the relational predicate.
In an exemplary embodiment, the positional layout features include information indicating a positional relationship of the subject object and the object in the picture.
In an exemplary embodiment, detecting the positional relationships of the plurality of objects from the feature information includes: detecting the positional relationships of the plurality of objects sequentially through N sub-neural networks according to the feature information, and taking the detection result of the Nth sub-neural network as the final detection result of the object relationships in the picture, wherein: the 1st sub-neural network detects the positional relationships of the plurality of objects according to the feature information; the nth sub-neural network detects the positional relationships of the plurality of objects again according to the feature information and the detection result of the (n-1)th sub-neural network; N is an integer greater than 1, and n is an integer greater than 1 and less than or equal to N.
In an exemplary embodiment, generating the picture feature of the relational predicate based on the picture region features of the subject object and the object in the candidate subject-object pair and the position layout features of the subject object and the object comprises: integrating the picture region features of the subject object and the object with the position layout features of the subject object and the object to generate the picture feature of the relational predicate.
In exemplary embodiments, the integration includes direct combination, or combination followed by compression.
In an exemplary embodiment, the feature information of the picture is acquired via a first neural network.
In an exemplary embodiment, the first neural network and the N sub-neural networks are pre-trained via a training picture set including object relationship labeling data.
In an exemplary embodiment, the picture region characteristic comprises at least information representing an appearance characteristic of the object.
According to a third aspect of the present application, an apparatus for detecting object relationships from a picture is provided, including: an object detection module that detects a plurality of objects in the picture to obtain a picture region feature of each object; an object pairing module that pairs the detected objects to obtain candidate subject-object pairs, each candidate subject-object pair comprising the picture region feature of a paired subject object and the picture region feature of a paired object; a feature generation module that generates a picture feature of a relational predicate based on the picture region feature of the subject object and the picture region feature of the object in the candidate subject-object pair; and a relationship detection module that sequentially detects the positional relationships of the plurality of objects through N sub-neural networks according to feature information, and takes the detection result of the Nth sub-neural network as the final detection result of the object relationships in the picture, wherein: the 1st sub-neural network detects the positional relationships of the plurality of objects according to the feature information; the nth sub-neural network detects the positional relationships of the plurality of objects again according to the feature information and the detection result of the (n-1)th sub-neural network; N is an integer greater than 1, and n is an integer greater than 1 and less than or equal to N; and the feature information comprises the picture region feature of the subject object, the picture region feature of the object, and the picture feature of the relational predicate.
In an exemplary embodiment, the object detection module, the object pairing module and the feature generation module are implemented by a first neural network.
In an exemplary embodiment, the first neural network and the N sub-neural networks are pre-trained via a training picture set including object relationship labeling data.
In an exemplary embodiment, the picture region characteristic comprises at least information representing an appearance characteristic of the object.
In an exemplary embodiment, the object detection module further obtains a position layout feature of each object by detecting the plurality of objects in the picture; and the feature generation module generates the picture feature of the relational predicate based on the picture region features of the subject object and the object in the candidate subject-object pair and the position layout features of the subject object and the object.
In an exemplary embodiment, the positional layout features include information indicating a positional relationship of the subject object and the object in the picture.
In an exemplary embodiment, the feature generation module integrates the picture region features of the subject object and the object and the position layout features of the subject object and the object to generate the picture features of the relational predicate.
In exemplary embodiments, the integration includes direct combination, or combination followed by compression.
According to a fourth aspect of the present application, an apparatus for detecting object relationships from a picture is provided, including: an object detection module that detects a plurality of objects in the picture to obtain a picture region feature and a position layout feature of each object; an object pairing module that pairs the objects detected in the picture to obtain candidate subject-object pairs, each candidate subject-object pair comprising the picture region feature of a paired subject object and the picture region feature of a paired object; a feature generation module that generates a picture feature of a relational predicate based on the picture region features of the subject object and the object in the candidate subject-object pair and the position layout features of the subject object and the object; and a relationship detection module that detects the positional relationships of the plurality of objects according to feature information, the feature information including: the picture region feature of the subject object, the picture region feature of the object, and the picture feature of the relational predicate.
In an exemplary embodiment, the positional layout features include information indicating a positional relationship of the subject object and the object in the picture.
In an exemplary embodiment, the relationship detection module detects the positional relationships of the plurality of objects sequentially through N sub-neural networks according to the feature information, and takes the detection result of the Nth sub-neural network as the final detection result of the object relationships in the picture, where: the 1st sub-neural network detects the positional relationships of the plurality of objects according to the feature information; the nth sub-neural network detects the positional relationships of the plurality of objects again according to the feature information and the detection result of the (n-1)th sub-neural network; N is an integer greater than 1, and n is an integer greater than 1 and less than or equal to N.
In an exemplary embodiment, the feature generation module integrates the picture region features of the subject object and the object and the position layout features of the subject object and the object to generate the picture features of the relational predicate.
In exemplary embodiments, the integration includes direct combination, or combination followed by compression.
In an exemplary embodiment, the object detection module, the object pairing module and the feature generation module are implemented by a first neural network.
In an exemplary embodiment, the first neural network and the N sub-neural networks are pre-trained via a training picture set including object relationship labeling data.
In an exemplary embodiment, the picture region characteristic comprises at least information representing an appearance characteristic of the object.
According to a fifth aspect of the present application, there is provided an apparatus for detecting an object relationship from a picture, comprising: a processor; and a memory storing computer executable instructions, wherein, when the computer executable instructions are executed by the processor, the processor is operable to perform the method of detecting object relationships from pictures of the first and second aspects above.
According to a sixth aspect of the present application, a computer-readable medium is provided, in which computer-executable instructions are stored, and when a processor executes the computer-executable instructions stored in the computer-readable medium, the processor performs the method for detecting an object relationship from a picture according to the first and second aspects.
According to the embodiment of the application, the accuracy of detecting the object relation from the picture can be improved.
Drawings
Other features, objects, and advantages of the present application will become more apparent upon reading the following detailed description made with reference to the accompanying drawings in which:
fig. 1 shows a flow chart of a method for detecting object relations from pictures according to an exemplary embodiment of the present application;
fig. 2 shows, in connection with an example, a flow chart of a method for detecting object relations from pictures according to an exemplary embodiment of the present application;
FIG. 3 shows a block diagram of an apparatus for detecting object relationships from pictures according to an exemplary embodiment of the present application;
FIG. 4 shows a flow chart of a method for detecting object relationships from pictures according to another exemplary embodiment of the present application;
FIG. 5 shows a block diagram of an apparatus for detecting object relationships from pictures according to another exemplary embodiment of the present application; and
FIG. 6 shows a block diagram of a computer system for detecting object relationships from pictures according to an example embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the accompanying drawings and embodiments. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting of the application. It should be noted that, for convenience of description, only the portions related to the present application are shown in the drawings. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 shows a flow chart of a method 10 for detecting object relations from pictures according to an exemplary embodiment of the present application.
As shown in fig. 1, first, in step S11, a plurality of objects in a picture are detected to obtain a picture region feature of each object. In this step, the input is the original image, and the output is a series of object detection results; specifically, the output includes the picture region feature of each detected object. The picture region feature includes at least information representing the appearance of the object. Optionally, the picture region feature may also include information indicating one or more of the size, color, detail, and other characteristics of the object. The detection in step S11 may be implemented using a neural-network object detection framework, including but not limited to, for example, Faster R-CNN. The input picture may be a still image or a video frame taken from a video. For a still image, this step amounts to performing object relationship detection on the still image; for a video frame, it amounts to performing object relationship detection on the video.
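By way of a non-limiting illustration only, a sketch of this detection step is given below; it assumes a PyTorch/torchvision Faster R-CNN detector and an illustrative confidence threshold, neither of which is mandated by the embodiment.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Illustrative detector; the embodiment only requires some neural-network
# object detection framework (Faster R-CNN is one example named in the text).
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

def detect_objects(image, score_threshold=0.5):
    """image: 3xHxW float tensor in [0, 1]; returns boxes, labels and scores."""
    with torch.no_grad():
        result = detector([image])[0]          # dict with 'boxes', 'labels', 'scores'
    keep = result["scores"] > score_threshold  # illustrative confidence threshold
    return result["boxes"][keep], result["labels"][keep], result["scores"][keep]
```

The region features used later can then be taken, for example, from the backbone activations inside the detected boxes.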
In an exemplary embodiment, the picture region characteristic comprises at least information representing an appearance characteristic of the object. The picture region characteristics of the subject object and the object may be determined according to the picture regions in which the subject object and the object are located. Specifically, for example, a region where the object is located is identified from the picture, and the outline feature of the object and one or more of the features of the size, color, detail, and the like of the object are further identified from the region. The picture region characteristics can be determined from the picture region where the subject object and object are located, for example, using the ResNet101 neural network. Alternatively, the picture region characteristics may be determined using any other suitable neural network.
In step S12, the detected objects are paired to obtain candidate subject-object pairs. Each candidate subject-object pair includes the picture region feature of a paired subject object and the picture region feature of a paired object. In this step, a full permutation may be used directly, so that all object pairs are taken as candidate subject-object pairs. Alternatively, a neural network model may be used to coarsely filter all object pairs, discarding the pairs with lower relevance and retaining the pairs with higher relevance as candidate subject-object pairs.
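A minimal sketch of the full-permutation pairing follows; the function name is illustrative and not taken from the embodiment.

```python
from itertools import permutations

def candidate_pairs(num_objects):
    """Full permutation of detected objects into ordered (subject, object) index pairs.

    Returns at most n*(n-1) pairs (i, j) with i != j; a learned model could
    instead filter out low-relevance pairs at this point.
    """
    return list(permutations(range(num_objects), 2))
```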
At step S13, the picture feature of the relational predicate is obtained based at least on the picture region feature of the subject object and the picture region feature of the object in the candidate subject-object pair. The picture feature of the relational predicate is obtained, for example, by integrating the picture region feature of the subject object and the picture region feature of the object. Specifically, the picture feature of the relational predicate may be formed by directly combining the two picture region features, or by combining them and then compressing the combined feature.
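A hedged sketch of the "combine then compress" integration, assuming PyTorch and purely illustrative feature dimensions, is:

```python
import torch
import torch.nn as nn

class PredicateFeature(nn.Module):
    """Concatenate the subject and object region features (direct combination)
    and compress them into the picture feature of the relational predicate.
    Dimensions are illustrative assumptions."""

    def __init__(self, region_dim=2048, out_dim=512):
        super().__init__()
        self.compress = nn.Linear(2 * region_dim, out_dim)

    def forward(self, subject_feat, object_feat):
        combined = torch.cat([subject_feat, object_feat], dim=-1)  # direct combination
        return torch.relu(self.compress(combined))                 # compression after combining
```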
In step S14, the positional relationships of the plurality of objects are detected sequentially through N sub-neural networks according to the feature information, and the detection result of the Nth sub-neural network is taken as the final detection result of the object relationships in the picture, wherein: the 1st sub-neural network detects the positional relationships of the plurality of objects according to the feature information; the nth sub-neural network detects the positional relationships of the plurality of objects again according to the feature information and the detection result of the (n-1)th sub-neural network; N is an integer greater than 1, and n is an integer greater than 1 and less than or equal to N; and the feature information includes the picture region feature of the subject object, the picture region feature of the object, and the picture feature of the relational predicate.
In the embodiment, the accuracy of relation detection is improved through multi-stage prediction of N sub-neural networks.
In an exemplary embodiment, the feature information of the picture may be acquired via a first neural network, and the N sub-neural networks form a second neural network. The first neural network and the N sub-neural networks are pre-trained on a training picture set that includes object relationship labeling data. The parameters of the first neural network and of the second neural network comprising the N sub-neural networks may be obtained by joint training; for example, the two networks may be jointly trained by gradient-based optimization with error back-propagation.
In an alternative embodiment, a training picture set labeled with the relationships among the objects in each picture (i.e., the object relationships, each represented by a subject object, a relational predicate, and an object) may be used to train the second neural network for detecting object relationships (in this embodiment, a network composed of N sub-neural networks connected in sequence). During training, the network parameters of each sub-neural network in the second neural network are repeatedly adjusted until the difference between the object relationships predicted by the last sub-neural network and the labeled object relationships of the training picture satisfies a predetermined condition (e.g., the difference is smaller than a certain threshold).
In a picture detection application, detecting the object relationships in a picture on the basis of the trained second neural network is equivalent to detecting the object relationships of the picture according to the correlation among the subject object, the relational predicate, and the object. This correlation reflects how likely the combination of the three is to appear in daily life: if the correlation is high, the combination is likely to appear; otherwise, it is not.
The characteristic information of the picture can be obtained through the first neural network. The extracted characteristic information of the first neural network can be output to each sub-neural network in the second neural network, so that each sub-neural network can detect the object relationship respectively.
The first neural network and the second neural network may be pre-trained based on a training picture set with object relationship labeling data. In an alternative implementation, the first neural network and the second neural network are obtained by joint training on such a training picture set. An alternative joint training process is, for example, as follows: a training picture with object relationship labeling data is input into the first neural network, which extracts features and outputs feature information; the first sub-neural network predicts the object relationships of the picture according to the feature information extracted by the first neural network; each subsequent sub-neural network predicts the object relationships of the picture again according to the feature information extracted by the first neural network and the object relationships predicted by the previous sub-neural network, and so on, until the prediction result of the last sub-neural network is obtained; the prediction result of the last sub-neural network is then compared with the object relationship labeling data of the picture, the error between them is propagated back to each sub-neural network and to the first neural network, and the network parameters of each network are adjusted, which completes one training pass. This training pass is executed repeatedly until the error between the prediction result of the last sub-neural network and the object relationship labeling data of the picture satisfies a preset condition (e.g., the error is smaller than a preset threshold), at which point the joint training process is finished.
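A simplified sketch of one such joint training pass is given below; all module and function names are illustrative assumptions rather than the claimed implementation.

```python
import torch

def joint_training_step(first_net, sub_nets, optimizer, picture, relation_labels, loss_fn):
    """One pass of the joint training described above. `first_net` extracts the
    feature information; `sub_nets` are the N sub-neural networks applied in
    sequence; only the last prediction is compared with the labels, and the
    error is back-propagated into every network."""
    feats = first_net(picture)                   # feature information of the training picture
    prediction = None
    for sub_net in sub_nets:                     # sequential multi-stage prediction
        prediction = sub_net(feats, prediction)  # stage 1 receives no previous result
    loss = loss_fn(prediction, relation_labels)
    optimizer.zero_grad()
    loss.backward()                              # error propagates back through all networks
    optimizer.step()
    return loss.item()
```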
In an exemplary embodiment, detecting the positional relationships of the plurality of objects sequentially via the N sub-neural networks according to the feature information includes, within each sub-neural network: updating the current detection result of the subject object according to the picture region feature of the subject object, the current detection result of the object, and the current detection result of the relational predicate; updating the current detection result of the object according to the picture region feature of the object, the current detection result of the subject object, and the current detection result of the relational predicate; and updating the current detection result of the relational predicate according to the picture feature of the relational predicate, the current detection result of the subject object, and the current detection result of the object.
In this embodiment, the detection of the subject object uses not only the picture region characteristics of the subject object extracted from the picture, but also the object preliminarily detected from the picture region characteristics of the object and the relational predicate preliminarily detected from the picture characteristics of the relational predicate (that is, the correlation among the subject object, the object, and the relational predicate). Similarly, the detection of the object uses not only the picture region characteristics of the object extracted from the picture, but also the subject object preliminarily detected from the picture region characteristics of the subject object and the relational predicate preliminarily detected from the picture characteristics of the relational predicate. Similarly, the detection of the relational predicate uses not only the picture characteristics of the obtained relational predicate but also a subject object preliminarily detected from the picture area characteristics of the subject object and an object preliminarily detected from the picture area characteristics of the object. Since the correlations among the subject object, the object, and the relational predicate are taken into consideration, the detection results of the subject object, the object, and the relational predicate are updated using the correlations among the subject object, the object, and the relational predicate, and the accuracy of the detection results is improved.
The N sub-neural networks in the second neural network may be implemented by any suitable sub-network, and the sub-neural networks may all have the same structure. For example, each sub-neural network may be implemented by fully-connected layers, or by any other suitable sub-network. The number and/or structure of the sub-neural networks can be adjusted as needed for the actual scenario. For sub-neural networks of the same structure, the more sub-neural networks are used, the more accurate the resulting detection. In an exemplary embodiment, the number of sub-neural networks may be determined through learning; for example, during learning, the number of sub-neural networks in use when the detection result on positive samples reaches a predetermined threshold may be taken as the number of sub-neural networks actually employed. In an exemplary embodiment, the final result output by at least the last of the sub-neural networks may be normalized by, for example, a Softmax layer.
In the above embodiment, the number and structure of the sub-neural networks can be adjusted as needed, which gives the method better learning ability and more accurate detection than technical schemes based on conditional random fields.
According to an exemplary embodiment, the method for detecting object relationships from a picture further includes detecting the plurality of objects in the picture to obtain a position layout feature of each object, and generating the picture feature of the relational predicate based at least on the picture region features of the subject object and the object in the candidate subject-object pair includes: generating the picture feature of the relational predicate based on the picture region features of the subject object and the object in the candidate subject-object pair and the position layout features of the subject object and the object. The position layout features of the subject object and the object include at least information indicating the positional relationship of the subject object and the object in the picture. Further, the position layout features may also include information indicating the position of the subject object in the picture and the position of the object in the picture. By adopting two mutually complementary sources of information, namely the picture region features of the subject object and the object and the position layout features of the subject object and the object, balanced and effective prediction and detection can be achieved for different types of relational predicates.
In the above embodiments, the position layout features may be determined from the positional layout of the candidate subject-object pair formed by the subject object and the object, using, for example, a convolutional neural network; any other suitable neural network may also be used. The picture feature of the relational predicate may be determined by integrating (e.g., directly combining, or combining and then compressing) the picture region features of the subject object and the object and the position layout features of the subject object and the object. For example, the feature vector of the relational predicate can be obtained using fully-connected layers: the picture region features and the position layout features are directly combined, and optionally the combined feature is then compressed, to obtain the picture feature of the relational predicate.
In the above embodiment, the picture region feature and position layout feature of the subject object, the picture region feature and position layout feature of the object, and the picture feature of the relational predicate are obtained via the first neural network, and relationship detection is then carried out sequentially through the N sub-neural networks of the second neural network on the basis of the feature information obtained via the first neural network. Fig. 2 shows, in connection with an example, a method for detecting object relationships from pictures according to an exemplary embodiment of the present application.
As shown in fig. 2, for a given picture 101, object detection is first performed to identify and locate the objects contained in the picture 101. Object detection can be implemented, for example, using a Faster R-CNN network. For example, the object-detected picture 102 contains three objects: a. an umbrella; b. a chair; c. a table. Each object has its own picture features.
After the objects contained in the picture are identified, the identified objects are paired to form candidate subject-object pairs. Each candidate subject-object pair includes a possible subject object and a possible object. In the example shown in fig. 2, the candidate subject-object pairs include, for example but not limited to, the candidate pair 103 (umbrella, chair) and the candidate pair 104 (chair, table). When n objects are identified by object detection, there may be at most n(n-1) candidate subject-object pairs, some of which may be meaningless. Thus, in some embodiments, a preliminary screening of all n(n-1) candidate pairs can be performed using, for example, a low-cost neural network to filter out meaningless pairs and thereby reduce the amount of subsequent computation. This preliminary screening may take spatial relationships and object classes into account, since objects that are spatially far apart usually have weaker associations, and the object classes determine that certain objects are unlikely to be related.
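One possible low-cost screening heuristic is sketched below purely for illustration; the embodiment leaves the screening model unspecified, and a small learned scoring network that also uses object classes could replace this distance check.

```python
import torch

def preliminary_screen(pairs, boxes, max_center_distance=300.0):
    """Drop candidate (subject, object) index pairs whose box centers are far apart.

    boxes: (n, 4) tensor of (x1, y1, x2, y2); the distance threshold is an
    illustrative assumption, not a value from the embodiment.
    """
    centers = (boxes[:, :2] + boxes[:, 2:]) / 2.0
    kept = []
    for i, j in pairs:
        if torch.dist(centers[i], centers[j]) <= max_center_distance:
            kept.append((i, j))
    return kept
```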
For each obtained candidate subject-object pair (or, if preliminary screening is applied, for each pair retained after screening), the relational predicate between the subject object and the object is detected using a neural network, based on the picture region feature vectors of the subject object and the object and the position layout feature vectors of the subject object and the object.
In the example shown in fig. 2, for the candidate subject-object pair 103 (umbrella, chair), the position layout feature vectors of the subject object (umbrella) and the object (chair) are obtained from the position layout 201 through the processing of the position layout module 203. The picture region feature vectors of the subject object (umbrella) and the object (chair) are obtained from the picture region map 202 containing the subject object (umbrella) and the object (chair) through the processing of the picture feature module 204. The position layout feature vector and the picture region feature vector thus obtained are concatenated and passed through, for example, two fully-connected layers 205 to obtain the feature vector of the relational predicate for the subject object (umbrella) and the object (chair).
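A sketch of the two fully-connected layers 205 under assumed (illustrative) feature dimensions might look as follows:

```python
import torch
import torch.nn as nn

class PredicateFeatureWithLayout(nn.Module):
    """Sketch of the two fully-connected layers 205: the picture region feature
    vector and the position layout feature vector are concatenated and mapped
    to the feature vector of the relational predicate. All dimensions are
    illustrative assumptions."""

    def __init__(self, region_dim=2048, layout_dim=64, hidden_dim=512, out_dim=256):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(region_dim + layout_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, region_feat, layout_feat):
        return self.fc(torch.cat([region_feat, layout_feat], dim=-1))
```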
In the embodiment shown in fig. 2, the picture 101 is input to the first neural network, and after the picture features of the relational predicates are generated via object detection and pairing, the generated feature information is provided to a second neural network formed of N sub-neural networks, that is, a Deep Relational Network (DR-Net) 206. The fully-connected layers 205 and all processing before them belong to the first neural network.
As described above, each object has a respective picture characteristic. The picture features of the object may indicate a category of the object that is visually reflected in the picture. The picture characteristics of the object can be obtained by convolving the picture part of the region of the object in the picture. The positional layout features between the objects may represent relative positions and relative sizes between the objects. The position layout features between objects may be used as a supplement to the picture features of the objects. In an exemplary embodiment, the positional layout features between objects may be characterized by a dual spatial mask. Specifically, the spatial positions and sizes of the subject object and the object in the picture are shown using binary masks (i.e., each pixel in the object region in the picture is represented by one of 0 and 1, and each pixel in the region other than the object region in the picture is represented by the other of 0 and 1), thereby resulting in two binary masks. Both can be passed through three convolutional layers, for example, after appropriate downsampling, to obtain the position layout feature vectors of the subject object and the object.
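A sketch of the dual-mask construction and the three convolutional layers follows; the mask size, channel counts, and pooling are illustrative assumptions rather than values from the embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dual_spatial_masks(subject_box, object_box, size=32):
    """Two binary masks as described above: pixels inside an object's box are 1,
    all other pixels are 0. Boxes are (x1, y1, x2, y2) already scaled to the
    down-sampled mask resolution."""
    masks = torch.zeros(2, size, size)
    for k, (x1, y1, x2, y2) in enumerate([subject_box, object_box]):
        masks[k, int(y1):int(y2), int(x1):int(x2)] = 1.0
    return masks

class LayoutEncoder(nn.Module):
    """Three convolutional layers mapping the dual masks to the position layout
    feature vector of the subject object and the object."""

    def __init__(self, out_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, out_dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, masks):                               # masks: (N, 2, size, size)
        features = self.conv(masks)
        return F.adaptive_avg_pool2d(features, 1).flatten(1)  # (N, out_dim)
```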
Further, the feature vector obtained for the relational predicate may be processed by the neural network 206 to determine and output the most probable object relationships according to the correlation among the subject, predicate, and object in an object relationship. For example, in the example shown in fig. 2, the derived results may include: (umbrella, above, table), 0.85; and (chair, in front of, table), 0.90. These results indicate that the object relationships contained in the example picture include: the umbrella is above the table, with probability 0.85; and the chair is in front of the table, with probability 0.90.
The processing of the neural network 206 in the present application is described below in conjunction with the example shown in fig. 2.
As described above, the feature vectors of the objects (including the subject objects and the object objects) may be obtained after the objects are identified, for example by a convolutional neural network. The feature vector of the relational predicate is obtained from the picture region feature vectors and the position layout feature vectors of the subject object and the object.
In the neural network 206, a plurality of sub-neural networks 206-0, 206-1, 206-2 … 206-k that perform serial processing are included. Each sub-neural network comprises three parts, namely a subject part, a predicate part, and an object part (corresponding to the subscripts s, r, and o, respectively), which detect the subject object, the relational predicate, and the object and update the detection result of the previous sub-neural network to obtain a more accurate detection. In the example shown in fig. 2, the neural network 206 is formed by connecting k+1 sub-neural networks in series. Each sub-neural network includes three parts that output probability vectors q_s^i, q_r^i, and q_o^i corresponding to the subject object, the relational predicate, and the object, respectively, where i is an integer from 0 to k. The feature vector of the subject object s, the feature vector of the relational predicate r, and the feature vector of the object o are input to the three parts of each sub-neural network, respectively.
In the first sub-neural network (i = 0), the subject object, the object, and the relational predicate are preliminarily detected based on the picture region characteristics of the subject object, the picture region characteristics of the object, and the picture characteristics of the relational predicate, respectively. Specifically, the subject object is preliminarily detected according to the picture region characteristics of the subject object, the object is preliminarily detected according to the picture region characteristics of the object, and the relational predicate is preliminarily detected according to the picture characteristics of the relational predicate.
In each sub-neural network other than the first one (i = 0), the input of the subject part q_s^i further includes the outputs q_r^(i-1) and q_o^(i-1) of the previous sub-neural network; the input of the predicate part q_r^i further includes the outputs q_s^(i-1) and q_o^(i-1) of the previous sub-neural network; and the input of the object part q_o^i further includes the outputs q_s^(i-1) and q_r^(i-1) of the previous sub-neural network.
The output of each sub-neural network in the neural network 206 is as follows:

q′_s = W_a·x_s + W_sr·q_r + W_so·q_o
q′_r = W_r·x_r + W_rs·q_s + W_ro·q_o
q′_o = W_a·x_o + W_os·q_s + W_or·q_r

where q′_s, q′_r, and q′_o are the results output by the current sub-neural network; x_s, x_r, and x_o are the feature vector of the subject object s, the feature vector of the relational predicate r, and the feature vector of the object o, respectively; q_s, q_r, and q_o are the results output by the three parts of the previous sub-neural network; and W_a, W_r, W_sr, W_so, W_rs, W_ro, W_os, and W_or are network parameters of the neural network 206. The network parameters may be obtained through learning of the neural network 206.
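The per-stage update can be sketched as follows, assuming PyTorch linear layers for the W matrices and illustrative dimensions; the actual parameterization of the embodiment is not limited to this form.

```python
import torch
import torch.nn as nn

class RefineUnit(nn.Module):
    """One sub-neural network implementing the updates above:
    q'_s = W_a x_s + W_sr q_r + W_so q_o, and similarly for q'_r and q'_o.
    Dimensions are illustrative; a Softmax can be applied to the outputs of
    at least the last unit, as noted below."""

    def __init__(self, feat_dim, num_object_classes, num_predicates):
        super().__init__()
        self.W_a  = nn.Linear(feat_dim, num_object_classes, bias=False)
        self.W_r  = nn.Linear(feat_dim, num_predicates, bias=False)
        self.W_sr = nn.Linear(num_predicates, num_object_classes, bias=False)
        self.W_so = nn.Linear(num_object_classes, num_object_classes, bias=False)
        self.W_rs = nn.Linear(num_object_classes, num_predicates, bias=False)
        self.W_ro = nn.Linear(num_object_classes, num_predicates, bias=False)
        self.W_os = nn.Linear(num_object_classes, num_object_classes, bias=False)
        self.W_or = nn.Linear(num_predicates, num_object_classes, bias=False)

    def forward(self, x_s, x_r, x_o, q_s, q_r, q_o):
        q_s_new = self.W_a(x_s) + self.W_sr(q_r) + self.W_so(q_o)
        q_r_new = self.W_r(x_r) + self.W_rs(q_s) + self.W_ro(q_o)
        q_o_new = self.W_a(x_o) + self.W_os(q_s) + self.W_or(q_r)
        return q_s_new, q_r_new, q_o_new
```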
According to an exemplary embodiment, in at least the last sub-neural network, the output result may be further normalized via a Softmax layer.
Fig. 3 shows a block diagram of an apparatus 30 for detecting object relations from pictures according to an exemplary embodiment of the present application.
As shown in fig. 3, the apparatus 30 for detecting object relationship from pictures may include an object detection module 31, an object pairing module 32, a feature generation module 33, and a relationship detection module 34.
For an input picture, the object detection module 31 detects the objects in the picture to obtain the picture region feature of each object. The picture region feature includes at least information representing the appearance of the object. Optionally, the picture region feature may also include information indicating one or more of the size, color, detail, and other characteristics of the object. The object detection module 31 may be implemented using a neural-network object detection framework, including but not limited to, for example, Faster R-CNN.
The object pairing module 32 pairs the detected objects to obtain candidate subject-object pairs. Each candidate subject-object pair includes the picture region feature of a paired subject object and the picture region feature of a paired object. For example, the object pairing module 32 may directly adopt full permutation and take all object pairs as candidate subject-object pairs. Alternatively, a neural network model may be used to coarsely filter all object pairs, and the object pairs with higher relevance may be retained as candidate subject-object pairs.
The feature generation module 33 may generate the picture feature of the relational predicate based on the picture region feature vectors of the subject object and the object in the candidate subject-object pair. The picture feature of the relational predicate is obtained, for example, by integrating the picture region feature of the subject object and the picture region feature of the object. Specifically, the picture feature of the relational predicate may be formed by directly combining the two picture region features, or by combining them and then compressing the combined feature.
The relationship detection module 34 sequentially detects the positional relationships of the plurality of objects through the N sub-neural networks according to the feature information, and takes the detection result of the Nth sub-neural network as the final detection result of the object relationships in the picture, wherein: the 1st sub-neural network detects the positional relationships of the plurality of objects according to the feature information; the nth sub-neural network detects the positional relationships of the plurality of objects again according to the feature information and the detection result of the (n-1)th sub-neural network; N is an integer greater than 1, and n is an integer greater than 1 and less than or equal to N; and the feature information includes the picture region feature of the subject object, the picture region feature of the object, and the picture feature of the relational predicate.
In the embodiment, the accuracy of relation detection is improved through multi-stage prediction of N sub-neural networks.
In an exemplary embodiment, the object detection module 31, the object pairing module 32, and the feature generation module 33 constitute a first neural network, and the relationship detection module 34 constitutes a second neural network comprising a plurality of sub-neural networks. The first neural network and the N sub-neural networks are pre-trained on a training picture set that includes object relationship labeling data. The parameters of the first neural network and of the second neural network comprising the N sub-neural networks may be obtained by joint training; for example, the two networks may be jointly trained by gradient-based optimization with error back-propagation.
In an exemplary embodiment, detecting the positional relationships of the plurality of objects sequentially via the N sub-neural networks according to the feature information includes, within each sub-neural network: updating the current detection result of the subject object according to the picture region feature of the subject object, the current detection result of the object, and the current detection result of the relational predicate; updating the current detection result of the object according to the picture region feature of the object, the current detection result of the subject object, and the current detection result of the relational predicate; and updating the current detection result of the relational predicate according to the picture feature of the relational predicate, the current detection result of the subject object, and the current detection result of the object.
In this embodiment, the detection of the subject object uses not only the picture region characteristics of the subject object extracted from the picture, but also the object preliminarily detected from the picture region characteristics of the object and the relational predicate preliminarily detected from the picture characteristics of the relational predicate (that is, the correlation among the subject object, the object, and the relational predicate). Similarly, the detection of the object uses not only the picture region characteristics of the object extracted from the picture, but also the subject object preliminarily detected from the picture region characteristics of the subject object and the relational predicate preliminarily detected from the picture characteristics of the relational predicate. Similarly, the detection of the relational predicate uses not only the picture characteristics of the obtained relational predicate but also a subject object preliminarily detected from the picture area characteristics of the subject object and an object preliminarily detected from the picture area characteristics of the object. Since the correlations among the subject object, the object, and the relational predicate are taken into consideration, the detection results of the subject object, the object, and the relational predicate are updated using the correlations among the subject object, the object, and the relational predicate, and the accuracy of the detection results is improved.
The plurality of sub-neural networks in the second neural network may be implemented by any suitable sub-network, and the sub-neural networks may all have the same structure. For example, each sub-neural network may be implemented by fully-connected layers, or by any other suitable sub-network. The number and/or structure of the sub-neural networks can be adjusted as needed for the actual scenario. For sub-neural networks of the same structure, the more sub-neural networks are used, the more accurate the resulting detection. In an exemplary embodiment, the number of sub-neural networks may be determined through learning; for example, during learning, the number of sub-neural networks in use when the detection result on positive samples reaches a predetermined threshold may be taken as the number of sub-neural networks actually employed. In an exemplary embodiment, the final result output by at least the last of the sub-neural networks may be normalized by, for example, a Softmax layer.
In the above embodiment, the number and structure of the sub-neural networks can be adjusted as needed, which gives the method better learning ability and more accurate detection than technical schemes based on conditional random fields.
According to an exemplary embodiment, the object detection module 31 further obtains the position layout feature of each object by detecting the plurality of objects in the picture. The feature generation module 33 generates the picture feature of the relational predicate based on the picture region features of the subject object and the object in the candidate subject-object pair and the position layout features of the subject object and the object. In this embodiment, in addition to the picture region features of the subject object and the object in the candidate subject-object pair, the picture feature of the relational predicate is obtained further based on the position layout features of the subject object and the object. The position layout features of the subject object and the object include at least information indicating the positional relationship of the subject object and the object in the picture. Further, the position layout features may also include information indicating the position of the subject object in the picture and the position of the object in the picture. By adopting two mutually complementary sources of information, namely the picture region features of the subject object and the object and the position layout features of the subject object and the object, balanced and effective prediction and detection can be achieved for different types of relational predicates.
In the above embodiments, the position layout features may be determined from the positional layout of the candidate subject-object pair formed by the subject object and the object, using, for example, a convolutional neural network; any other suitable neural network may also be used. The picture feature of the relational predicate may be determined by integrating (e.g., directly combining, or combining and then compressing) the picture region features of the subject object and the object and the position layout features of the subject object and the object. For example, the feature vector of the relational predicate can be obtained using fully-connected layers: the picture region features and the position layout features are directly combined, and optionally the combined feature is then compressed, to obtain the picture feature of the relational predicate.
In an exemplary embodiment, the feature generation module 33 may include three sub-neural networks. The first sub-neural network may determine the picture region features of the subject object and the object from the pictures of the regions in which they are located; for example, it may be implemented by a ResNet101 neural network. The second sub-neural network may determine the position layout features of the subject object and the object according to the positional layout of the candidate subject-object pair they form; for example, it may be implemented by a convolutional network. The third sub-neural network may determine the picture feature of the relational predicate from the picture region features and the position layout features respectively determined by the first and second sub-neural networks; for example, it may be implemented by a two-layer fully-connected neural network. Specifically, the third sub-neural network may concatenate and compress the picture region feature vector and the position layout feature vector determined by the first and second sub-neural networks to obtain the output vector of the final relational predicate.
Fig. 4 shows a flowchart of a method 40 for detecting object relations from pictures according to another exemplary embodiment of the present application. As shown in fig. 4, in step S41, a plurality of objects in the picture are detected, and a picture region feature and a position layout feature are obtained for each object. In step S42, the objects detected in the picture are paired to obtain candidate subject-object pairs, each of which includes the picture region feature of the paired subject object and the picture region feature of the paired object. In step S43, the picture feature of the relational predicate is generated based on the picture region features of the subject object and the object in the candidate subject-object pair and on their position layout features. In step S44, the positional relationships of the plurality of objects are detected according to feature information including: the picture region feature of the subject object, the picture region feature of the object, and the picture feature of the relational predicate.
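A hedged, high-level orchestration of steps S41 to S44 might look like the following; the detector, feature network, and relation network are placeholder callables, and the pairing step here simply enumerates ordered object pairs rather than applying the candidate filtering described elsewhere in this application.

```python
from itertools import permutations

def detect_object_relations(picture, detector, feature_net, relation_net):
    """Hypothetical pipeline for steps S41-S44 of method 40."""
    # S41: detect objects; each entry holds a crop, a box and a region feature.
    objects = detector(picture)

    results = []
    # S42: pair the detected objects into candidate subject-object pairs.
    for subj, obj in permutations(objects, 2):
        # S43: generate the picture feature of the relational predicate.
        pred_feat = feature_net(subj["crop"], obj["crop"], subj["box"], obj["box"])
        # S44: detect the relation from subject, object and predicate features.
        relation = relation_net(subj["region_feat"], obj["region_feat"], pred_feat)
        results.append((subj, relation, obj))
    return results
```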
In the method according to the above embodiment, the picture region characteristics may include at least information indicating appearance characteristics of the object, and the position layout characteristics may include information indicating a positional relationship of the subject object and the object in the picture.
According to the above embodiment, balanced and effective prediction and detection of different types of relational predicates can be achieved by using two mutually complementary information sources, i.e., the picture region characteristics of the subject object and the object, and the position layout characteristics of the subject object and the object.
In an exemplary embodiment, step S44 may include detecting the positional relationships of the plurality of objects sequentially through N sub-neural networks according to the feature information, and taking the detection result of the Nth sub-neural network as the final detection result of the object relations in the picture, where: the 1st sub-neural network detects the positional relationships of the plurality of objects according to the feature information; the nth sub-neural network detects the positional relationships of the plurality of objects again according to the feature information and the detection result of the (n-1)th sub-neural network; N is an integer greater than 1, and n is an integer greater than 1 and less than or equal to N. The multi-stage prediction of the N sub-neural networks thus improves the accuracy of relation detection.
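A minimal sketch of such multi-stage prediction is given below, assuming the feature information has already been concatenated into a single vector and each stage outputs predicate scores; the number of stages, the hidden width, and the softmax over the previous stage's result are assumptions.

```python
import torch
import torch.nn as nn

class MultiStageRelationNet(nn.Module):
    """Sketch of N sequential sub-neural networks: stage 1 predicts from the
    features alone; each later stage re-predicts from the features plus the
    previous stage's detection result; stage N gives the final result."""

    def __init__(self, feat_dim, num_predicates, num_stages=3, hidden=256):
        super().__init__()
        self.first = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_predicates),
        )
        self.refine = nn.ModuleList([
            nn.Sequential(
                nn.Linear(feat_dim + num_predicates, hidden), nn.ReLU(),
                nn.Linear(hidden, num_predicates),
            )
            for _ in range(num_stages - 1)
        ])

    def forward(self, feats):
        # feats: (B, feat_dim) concatenation of subject, object and predicate features.
        scores = self.first(feats)                                  # stage 1
        for stage in self.refine:                                   # stages 2..N
            scores = stage(torch.cat([feats, scores.softmax(dim=1)], dim=1))
        return scores                                               # stage N
```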
In an exemplary embodiment, step S43 may include integrating the picture region features of the subject object and the object with the position layout features of the subject object and the object to generate the picture feature of the relational predicate. For example, the integration may be a direct combination, or a combination followed by compression.
In an exemplary embodiment, the picture region features and the position layout features are obtained via a first neural network, and the relation detection based on the feature information is implemented via a second neural network, which may comprise, for example, the N sub-neural networks. The first neural network and the second neural network can be pre-trained on a training picture set that includes object relation labeling data, and the parameters of the first neural network and of the second neural network comprising the N sub-neural networks may be derived by joint training. For example, the first neural network and the second neural network may be jointly trained by gradient descent with error back-propagation.
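The joint training could follow the familiar pattern sketched here, assuming a cross-entropy loss over annotated relational predicates; the optimizer choice, learning rate, and loader format are placeholders rather than values stated in this application.

```python
import torch
import torch.nn as nn

def joint_train(first_net, second_net, loader, epochs=10, lr=1e-4):
    """Hypothetical joint training of the feature network (first_net) and the
    N-stage relation network (second_net) via error back-propagation."""
    params = list(first_net.parameters()) + list(second_net.parameters())
    optimizer = torch.optim.SGD(params, lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for subj_crop, obj_crop, subj_box, obj_box, label in loader:
            feats = first_net(subj_crop, obj_crop, subj_box, obj_box)
            scores = second_net(feats)
            loss = criterion(scores, label)   # compare against relation labels
            optimizer.zero_grad()
            loss.backward()                   # back-propagate the error
            optimizer.step()                  # update both networks jointly
```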
It should be understood that features in embodiments of the method described in connection with fig. 1 may also be applied to embodiments of the method described in connection with fig. 4 without creating conflicts.
Fig. 5 shows a block diagram of an apparatus 50 for detecting object relations from pictures according to another exemplary embodiment of the present application.
As shown in fig. 5, the apparatus 50 for detecting object relationship from pictures may include an object detection module 51, an object pairing module 52, a feature generation module 53, and a relationship detection module 54.
The object detection module 51 detects a plurality of objects in the picture, and obtains a picture region feature and a position layout feature for each object. The object pairing module 52 pairs the objects detected in the picture to obtain candidate subject-object pairs, each of which includes the picture region feature of the paired subject object and the picture region feature of the paired object. The feature generation module 53 generates the picture feature of the relational predicate based on the picture region features of the subject object and the object in the candidate subject-object pair and on their position layout features. The relationship detection module 54 detects the positional relationships of the plurality of objects according to feature information including: the picture region feature of the subject object, the picture region feature of the object, and the picture feature of the relational predicate.
In the apparatus according to the above embodiment, the picture region features may include at least information indicating appearance characteristics of the objects, and the position layout features may include information indicating the positional relationship of the subject object and the object in the picture.
According to the above embodiment, balanced and effective prediction and detection of different types of relational predicates can be achieved by using two mutually complementary information sources, i.e., the picture region characteristics of the subject object and the object, and the position layout characteristics of the subject object and the object.
In an exemplary embodiment, the relationship detection module may detect the positional relationships of the plurality of objects sequentially through N sub-neural networks according to the feature information, and use the detection result of the Nth sub-neural network as the final detection result of the object relations in the picture, where: the 1st sub-neural network detects the positional relationships of the plurality of objects according to the feature information; the nth sub-neural network detects the positional relationships of the plurality of objects again according to the feature information and the detection result of the (n-1)th sub-neural network; N is an integer greater than 1, and n is an integer greater than 1 and less than or equal to N. The multi-stage prediction of the N sub-neural networks thus improves the accuracy of relation detection.
In an exemplary embodiment, the feature generation module may integrate the picture region features of the subject object and the object with the position layout features of the subject object and the object to generate the picture feature of the relational predicate. For example, the integration may be a direct combination, or a combination followed by compression.
In an exemplary embodiment, the object detection module 51, the object pairing module 52 and the feature generation module 53 are implemented by a first neural network, and the relationship detection module 54 is implemented by a second neural network, which may comprise, for example, the N sub-neural networks. The first neural network and the second neural network can be pre-trained on a training picture set that includes object relation labeling data, and the parameters of the first neural network and of the second neural network comprising the N sub-neural networks may be derived by joint training. For example, the first neural network and the second neural network may be jointly trained by gradient descent with error back-propagation.
It should be understood that features in embodiments of the apparatus described in connection with fig. 3 may also be applied to embodiments of the apparatus described in connection with fig. 5 without creating conflicts.
The method and apparatus for detecting object relationships from pictures described with reference to fig. 1 to 5 may be implemented by a computer system. The computer system may include a memory storing executable instructions and a processor in communication with the memory, wherein the processor executes the executable instructions to implement the method for detecting object relationships from pictures described with reference to fig. 1 to 5. Alternatively or additionally, the method may be implemented by a non-transitory computer storage medium storing computer readable instructions that, when executed, cause a processor to perform the method for detecting object relationships from pictures described with reference to fig. 1 to 5.
Referring now to FIG. 6, FIG. 6 shows a block diagram of a computer system 60 for detecting object relationships from pictures according to an example embodiment of the present application.
As shown in fig. 6, the computer system 60 may include a processing unit (e.g., a Central Processing Unit (CPU) 601, a Graphics Processing Unit (GPU), etc.) that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. The RAM 603 may also store various programs and data required for the operation of the system 60. The CPU 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components may be connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, and the like; an output section 607 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card (e.g., a LAN card, a modem, etc.). The communication section 609 performs communication processing through a network such as the Internet. A drive 610 may also be connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, may be mounted on the drive 610 so that a computer program read out therefrom is installed into the storage section 608 as necessary.
In particular, according to embodiments of the present disclosure, the method described above with reference to fig. 1 to 5 may be implemented as a computer software program. For example, embodiments of the disclosure may include a computer program product comprising a computer program tangibly embodied in a machine-readable medium, the computer program comprising program code for performing the method for detecting object relations from pictures described with reference to fig. 1 to 5. In such an embodiment, the computer program may be downloaded from a network and installed through the communication section 609, and/or installed from the removable medium 611.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules referred to in the embodiments of the present application may be implemented by software or hardware. The described units or modules may also be provided in a processor. The names of these units or modules should not be construed as limiting these units or modules.
The above description is only exemplary of the present application and illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the present application is not limited to embodiments with the specific combination of features described above, but also covers other embodiments formed by any combination of the above features or their equivalents without departing from the inventive concept, for example, technical solutions formed by mutually replacing the above features with technical features of similar functions disclosed in the present application.

Claims (33)

1. A method of detecting object relationships from a picture, the method comprising:
detecting a plurality of objects in the picture to obtain picture region characteristics of each object;
pairing the objects detected in the picture and filtering out meaningless pairs to obtain candidate subject-object pairs, wherein each candidate subject-object pair comprises the picture region characteristics of the paired subject object and the picture region characteristics of the paired object;
generating picture characteristics of the relational predicate at least based on the picture region characteristics of the subject object and the picture region characteristics of the object in the candidate subject-object pair;
detecting the position relations of the plurality of objects sequentially through N sub-neural networks according to the characteristic information, and taking the detection result of the Nth sub-neural network as the final detection result of the object relation in the picture, wherein: the 1st sub-neural network detects the position relations of the plurality of objects according to the characteristic information; the nth sub-neural network detects the position relations of the plurality of objects again according to the characteristic information and the detection result of the (n-1)th sub-neural network; N is an integer greater than 1, and n is an integer greater than 1 and less than or equal to N; the characteristic information includes: the picture region characteristics of the subject object, the picture region characteristics of the object, and the picture characteristics of the relational predicate.
2. The method of claim 1, wherein the feature information of the picture is obtained via a first neural network other than the N sub-neural networks.
3. The method of claim 2, wherein the first neural network and the N sub-neural networks are pre-trained via a set of training pictures including object relationship labeling data.
4. The method of claim 1, wherein the picture region characteristics comprise at least information representing appearance characteristics of the object.
5. The method of claim 1, further comprising:
detecting a plurality of objects in the picture to obtain the position layout characteristics of each object;
wherein the step of generating picture characteristics of the relational predicate at least based on the picture region characteristics of the subject object and the picture region characteristics of the object in the candidate subject-object pair comprises:
generating the picture characteristics of the relational predicate based on the picture region characteristics of the subject object and the picture region characteristics of the object in the candidate subject-object pair, and the position layout characteristics of the subject object and the object.
6. The method of claim 5, wherein the positional layout features include information indicating a positional relationship of the subject object and the object in the picture.
7. The method of claim 5, wherein generating the picture characteristics of the relational predicate based on the picture region characteristics of the subject object and the object in the candidate subject-object pair and the position layout characteristics of the subject object and the object comprises:
integrating the picture region characteristics of the subject object and the object with the position layout characteristics of the subject object and the object to generate the picture characteristics of the relational predicate.
8. The method of claim 7, wherein the integrating comprises direct combination, or combination followed by compression.
9. A method of detecting object relationships from a picture, the method comprising:
detecting a plurality of objects in the picture to obtain picture region characteristics and position layout characteristics of each object;
pairing the objects detected in the picture and filtering out meaningless pairs to obtain candidate subject-object pairs, wherein each candidate subject-object pair comprises the picture region characteristics of the paired subject object and the picture region characteristics of the paired object;
generating picture characteristics of the relational predicate based on the picture region characteristics of the subject object and the picture region characteristics of the object in the candidate subject-object pair, and the position layout characteristics of the subject object and the object; and
detecting the position relations of the plurality of objects according to characteristic information, wherein the characteristic information comprises: the picture region characteristics of the subject object, the picture region characteristics of the object, and the picture characteristics of the relational predicate.
10. The method of claim 9, wherein the positional layout features include information indicating a positional relationship of the subject object and the object in the picture.
11. The method according to claim 9, wherein detecting the positional relationship of the plurality of objects from the feature information includes:
detecting the position relations of the plurality of objects sequentially through N sub-neural networks according to the characteristic information, and taking the detection result of the Nth sub-neural network as the final detection result of the object relation in the picture, wherein: the 1st sub-neural network detects the position relations of the plurality of objects according to the characteristic information; the nth sub-neural network detects the position relations of the plurality of objects again according to the characteristic information and the detection result of the (n-1)th sub-neural network; N is an integer greater than 1, and n is an integer greater than 1 and less than or equal to N.
12. The method of claim 9, wherein generating the picture characteristics of the relational predicate based on the picture region characteristics of the subject object and the picture region characteristics of the object in the candidate subject-object pair and the position layout characteristics of the subject object and the object comprises:
integrating the picture region characteristics of the subject object and the object with the position layout characteristics of the subject object and the object to generate the picture characteristics of the relational predicate.
13. The method of claim 12, wherein the integrating comprises direct combination, or combination followed by compression.
14. The method of claim 11, wherein the feature information of the picture is obtained via a first neural network other than the N sub-neural networks.
15. The method of claim 14, wherein the first neural network and the N sub-neural networks are pre-trained via a set of training pictures including object relationship labeling data.
16. The method of claim 9, wherein the picture region characteristics comprise at least information representing appearance characteristics of the object.
17. An apparatus for detecting object relationships from a picture, comprising:
the object detection module is used for detecting a plurality of objects in the picture to obtain picture region characteristics of each object;
the object pairing module is used for pairing the detected objects and filtering out meaningless pairs to obtain candidate subject-object pairs, wherein each candidate subject-object pair comprises the picture region characteristics of the paired subject object and the picture region characteristics of the paired object;
the feature generation module is used for generating the picture characteristics of the relational predicate based on the picture region characteristics of the subject object and the picture region characteristics of the object in the candidate subject-object pair; and
the relation detection module is used for sequentially detecting the position relations of the plurality of objects through the N sub-neural networks according to the characteristic information, and taking the detection result of the Nth sub-neural network as the final detection result of the object relation in the picture, wherein: the 1st sub-neural network detects the position relations of the plurality of objects according to the characteristic information; the nth sub-neural network detects the position relations of the plurality of objects again according to the characteristic information and the detection result of the (n-1)th sub-neural network; N is an integer greater than 1, and n is an integer greater than 1 and less than or equal to N; the characteristic information includes: the picture region characteristics of the subject object, the picture region characteristics of the object, and the picture characteristics of the relational predicate.
18. The apparatus of claim 17, wherein the object detection module, the object pairing module, and the feature generation module are implemented by a first neural network other than the N sub-neural networks.
19. The apparatus of claim 18, wherein the first neural network and the N sub-neural networks are pre-trained via a set of training pictures including object relationship labeling data.
20. The apparatus of claim 17, wherein the picture region characteristic comprises at least information representative of an appearance characteristic of the object.
21. The apparatus according to claim 17, wherein the object detection module further obtains the position layout feature of each object by detecting a plurality of objects in the picture; the feature generation module generates the picture features of the relational predicate based on picture region features of the subject object and picture region features of the object in the candidate subject pair, and the position layout features of the subject object and the object.
22. The apparatus of claim 21, wherein the positional layout features include information indicating a positional relationship of the subject object and the object in the picture.
23. The apparatus of claim 21, wherein the feature generation module integrates picture region features of the subject and object objects and positional layout features of the subject and object objects to generate the picture features of the relational predicate.
24. The apparatus of claim 23, wherein the integration comprises direct combination, or combination followed by compression.
25. An apparatus for detecting object relationships from a picture, comprising:
the object detection module is used for detecting a plurality of objects in the picture to obtain picture region characteristics and position layout characteristics of each object;
the object pairing module is used for pairing the objects detected in the picture and filtering out meaningless pairs to obtain candidate subject-object pairs, wherein each candidate subject-object pair comprises the picture region characteristics of the paired subject object and the picture region characteristics of the paired object;
a feature generation module that generates the picture characteristics of the relational predicate based on the picture region characteristics of the subject object and the picture region characteristics of the object in the candidate subject-object pair, and the position layout characteristics of the subject object and the object; and
a relationship detection module that detects positional relationships of the plurality of objects according to feature information, the feature information including: the picture region characteristics of the subject object, the picture region characteristics of the object, and the picture characteristics of the relational predicate.
26. The apparatus of claim 25, wherein the positional layout features include information indicating a positional relationship of the subject object and the object in the picture.
27. The apparatus of claim 25, wherein the relationship detection module detects the position relations of the plurality of objects sequentially through N sub-neural networks according to the characteristic information, and takes the detection result of the Nth sub-neural network as the final detection result of the object relation in the picture, wherein: the 1st sub-neural network detects the position relations of the plurality of objects according to the characteristic information; the nth sub-neural network detects the position relations of the plurality of objects again according to the characteristic information and the detection result of the (n-1)th sub-neural network; N is an integer greater than 1, and n is an integer greater than 1 and less than or equal to N.
28. The apparatus of claim 25, wherein the feature generation module integrates picture region features of the subject and object objects and positional layout features of the subject and object objects to generate the picture features of the relational predicate.
29. The apparatus of claim 28, wherein the integration comprises direct combination, or combination followed by compression.
30. The apparatus of claim 27, wherein the object detection module, the object pairing module, and the feature generation module are implemented by a first neural network other than the N sub-neural networks.
31. The apparatus of claim 30, wherein the first neural network and the N sub-neural networks are pre-trained via a set of training pictures including object relationship labeling data.
32. The apparatus of claim 25, wherein the picture region characteristic comprises at least information representative of an appearance characteristic of the object.
33. An electronic device, comprising: a processor; and a memory storing computer executable instructions, wherein the processor is operable to perform the method of detecting object relationships from pictures as claimed in any of claims 1-8 or claims 9-16 when the computer executable instructions are executed by the processor.
CN201710113099.XA 2017-02-28 2017-02-28 Method, device and equipment for detecting object relation from picture Active CN108229491B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710113099.XA CN108229491B (en) 2017-02-28 2017-02-28 Method, device and equipment for detecting object relation from picture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710113099.XA CN108229491B (en) 2017-02-28 2017-02-28 Method, device and equipment for detecting object relation from picture

Publications (2)

Publication Number Publication Date
CN108229491A CN108229491A (en) 2018-06-29
CN108229491B true CN108229491B (en) 2021-04-13

Family

ID=62657261

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710113099.XA Active CN108229491B (en) 2017-02-28 2017-02-28 Method, device and equipment for detecting object relation from picture

Country Status (1)

Country Link
CN (1) CN108229491B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110889397B (en) * 2018-12-28 2023-06-20 南京大学 Visual relation segmentation method based on human body
CN111985505B (en) * 2020-08-21 2024-02-13 南京大学 Interest visual relation detection method and device based on interest propagation network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102495865A (en) * 2011-11-28 2012-06-13 南京大学 Image annotation method combined with image internal space relation and visual symbiosis relation
CN103970775A (en) * 2013-01-31 2014-08-06 山东财经大学 Object spatial position relationship-based medical image retrieval method
WO2017015947A1 (en) * 2015-07-30 2017-02-02 Xiaogang Wang A system and a method for object tracking
CN106228193A (en) * 2016-07-29 2016-12-14 北京小米移动软件有限公司 Image classification method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Show and Tell: A Neural Image Caption Generator; Oriol Vinyals et al.; IEEE; 2015-06-12; Section 1, paragraph 4, and Fig. 1 *
Generating textual descriptions of image content based on image annotation; Zhu Yan; China Master's Theses Full-text Database; 2013-02-15 (No. 02); abstract and Section 3.1.1, paragraph 1 *

Also Published As

Publication number Publication date
CN108229491A (en) 2018-06-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant