CN114067196A - Method and device for generating image scene information - Google Patents

Method and device for generating image scene information

Info

Publication number
CN114067196A
Authority
CN
China
Prior art keywords
target object
relation
object group
target
relationship
Prior art date
Legal status
Pending
Application number
CN202111323685.XA
Other languages
Chinese (zh)
Inventor
詹忆冰
陈超
Current Assignee
Jingdong Technology Information Technology Co Ltd
Original Assignee
Jingdong Technology Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Jingdong Technology Information Technology Co Ltd filed Critical Jingdong Technology Information Technology Co Ltd
Priority to CN202111323685.XA priority Critical patent/CN114067196A/en
Publication of CN114067196A publication Critical patent/CN114067196A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/24 - Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a method and a device for generating image scene information. One embodiment of the method comprises: detecting target objects in an acquired image to be processed to obtain a detection result and feature information for each target object; obtaining a context-aware feature for each target object according to the detection result and feature information of each target object; obtaining, according to the context-aware features of the target objects in each target object group in the image to be processed, a relationship feature representing the relationship between the target objects in that group; and generating image scene information corresponding to the image to be processed according to the relationship feature corresponding to each target object group. The method and the device fully model the target objects and the relationships between them, improving the accuracy of the image scene information determined from the relationship features.

Description

Method and device for generating image scene information
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a method and a device for generating image scene information.
Background
The scene graph generation technology is an important technology by which a computer understands image information, and is mainly applied to multimedia information analysis. Specifically, the scene graph generation technology parses input image data, detects the target objects in an image and analyzes the relationships between the target objects, thereby abstracting the image into a directed graph; such a graph structure describing the scene information of an image is called a scene graph. Existing scene graph generation methods do not model context information sufficiently, leaving room for improvement.
Disclosure of Invention
The embodiment of the application provides a method and a device for generating image scene information.
In a first aspect, an embodiment of the present application provides a method for generating image scene information, including: detecting target objects in the acquired image to be processed to obtain a detection result and characteristic information of each target object; obtaining the context perception characteristics of each target object according to the detection result and the characteristic information of each target object; obtaining a relation characteristic representing the relation between target objects in each target object group according to the context perception characteristic of the target objects in the target object group in the image to be processed; and generating image scene information corresponding to the image to be processed according to the corresponding relation characteristic of each target object group.
In some embodiments, the detection result includes location information and classification information; and obtaining the context-aware feature of each target object according to the detection result and the feature information of each target object, including: for each target object, the following operations are performed: splicing the position information, the classification information and the characteristic information of the target object to obtain the object characteristics of the target object; carrying out linear transformation on the object characteristics to obtain transformed object characteristics; and carrying out context coding on the transformed object characteristics through the first attention network to obtain the context perception characteristics of the target object.
In some embodiments, obtaining a relationship characteristic representing a relationship between target objects in each target object group according to the context-aware characteristics of the target objects in the target object groups in the image to be processed includes: for each target object group, the following operations are performed: splicing the context perception characteristics of the target objects in the target object group to obtain the object group characteristics of the target object group; carrying out linear transformation on the object group characteristics to obtain transformed object group characteristics; and carrying out context coding on the transformed object group characteristics through a second attention network to obtain relationship characteristics representing the relationship between the target objects in the target object group.
In some embodiments, the splicing the context-aware features of the target objects in the target object group to obtain the object group features of the target object group includes: determining bounding box information of target objects in the target object group in the image to be processed; and splicing the context perception characteristics of the target objects in the target object group and the surrounding frame information to obtain the object group characteristics of the target object group.
In some embodiments, the generating image scene information corresponding to the image to be processed according to the relationship characteristic corresponding to each target object group includes: determining the relation between the target objects in each target object group according to the relation characteristics of each target object group through a relation classification network; and generating a scene graph representing scene information in the image to be processed according to the relation between the target objects in each target object group.
In a second aspect, an embodiment of the present application provides a method for generating image scene information, including: acquiring a training sample set, wherein training samples in the training sample set comprise sample images, object labels representing target objects in the sample images and relation labels representing relations among the target objects in a target object group in the sample images; detecting target objects in the sample image to obtain a detection result and characteristic information of each target object; obtaining the context perception characteristics of each target object according to the detection result and the characteristic information of each target object through a first attention network; obtaining a relation characteristic representing the relation between the target objects in each target object group according to the context perception characteristic of the target objects in each target object group through a second attention network; and taking the relation characteristics as the input of the relation classification network, taking the object labels and the relation labels corresponding to the input training samples as the expected output of the relation classification network, and training to obtain a scene information generation network comprising the first attention network, the second attention network and the relation classification network.
In some embodiments, the training to obtain a scene information generating network including a first attention network, a second attention network, and a relationship classification network by using the relationship features as the input of the relationship classification network and using the object labels and the relationship labels corresponding to the input training samples as the expected output of the relationship classification network includes: determining resistance deviation of each relation included in the predicted training sample according to distribution information of the training samples in the training sample set; taking the relation characteristics as the input of a relation classification network to obtain a classification result; correcting the classification result corresponding to each relation through the resistance deviation corresponding to each relation to obtain a corrected classification result; and training to obtain a scene information generation network based on the loss between the object label and the relation label corresponding to the input sample image and the corrected classification result.
In some embodiments, the determining the resistance deviation of each relationship included in the predicted training sample according to the distribution information of the training samples in the training sample set includes: and determining the resistance deviation of each relation included in the predicted training sample according to the distribution information of the training samples in the training sample set by combining the preset hyper-parameter for adjusting the resistance.
In some embodiments, the determining the resistance deviation of each relationship included in the predicted training sample according to the distribution information of the training samples in the training sample set includes: determining the resistance deviation corresponding to each relation according to the proportion of the training sample corresponding to each relation in the training sample set; or determining the resistance deviation corresponding to each relation according to the normalization result of the number of the target object groups corresponding to each relation; or for each relation corresponding to each target object group, determining the resistance deviation corresponding to each relation related to the target object group according to the proportion of the training samples related to the target object group belonging to the relation in all the training samples related to the target object group; or for each relation corresponding to each target object group, determining the resistance deviation corresponding to each relation related to the target object group according to the estimated quantity of the training samples related to the target object group belonging to the relation and the proportion of the estimated quantity of the training samples related to the target object group in the total estimated quantity of the training samples related to the target object group, wherein the estimated quantity represents the general distribution information of the training samples related to the target object group belonging to the relation.
In some embodiments, the estimated number is determined by: and determining the estimated number of the training samples related to the target object group under the relationship according to the number of the training samples related to the subject object in the target object group under the relationship and the number of the training samples related to the object in the target object group under the relationship.
In a third aspect, an embodiment of the present application provides an apparatus for generating image scene information, including: the first detection unit is configured to detect target objects in the acquired image to be processed, and obtain a detection result and characteristic information of each target object; the first feature processing unit is configured to obtain context perception features of each target object according to the detection result and feature information of each target object; the second feature processing unit is configured to obtain a relationship feature representing the relationship between the target objects in each target object group according to the context perception feature of the target objects in the target object group in the image to be processed; and the generating unit is configured to generate image scene information corresponding to the image to be processed according to the corresponding relation characteristic of each target object group.
In some embodiments, the detection result includes location information and classification information; and a first feature processing unit further configured to: for each target object, the following operations are performed: splicing the position information, the classification information and the characteristic information of the target object to obtain the object characteristics of the target object; carrying out linear transformation on the object characteristics to obtain transformed object characteristics; and carrying out context coding on the transformed object characteristics through the first attention network to obtain the context perception characteristics of the target object.
In some embodiments, the second feature processing unit is further configured to: for each target object group, the following operations are performed: splicing the context perception characteristics of the target objects in the target object group to obtain the object group characteristics of the target object group; carrying out linear transformation on the object group characteristics to obtain transformed object group characteristics; and carrying out context coding on the transformed object group characteristics through a second attention network to obtain relationship characteristics representing the relationship between the target objects in the target object group.
In some embodiments, the second feature processing unit is further configured to: determining bounding box information of target objects in the target object group in the image to be processed; and splicing the context perception characteristics of the target objects in the target object group and the surrounding frame information to obtain the object group characteristics of the target object group.
In some embodiments, the generating unit is further configured to: determining the relation between the target objects in each target object group according to the relation characteristics of each target object group through a relation classification network; and generating a scene graph representing scene information in the image to be processed according to the relation between the target objects in each target object group.
In a fourth aspect, an embodiment of the present application provides an apparatus for generating image scene information, including: an obtaining unit configured to obtain a training sample set, wherein training samples in the training sample set include a sample image, an object label characterizing a target object in the sample image, and a relationship label characterizing a relationship between target objects in a target object group in the sample image; the second detection unit is configured to detect the target objects in the sample image and obtain the detection result and the characteristic information of each target object; the first attention unit is configured to obtain context perception characteristics of each target object according to the detection result and the characteristic information of each target object through a first attention network; a second attention unit configured to obtain, through a second attention network, a relationship feature representing a relationship between the target objects in each of the target object groups according to the context-aware features of the target objects in each of the target object groups; and the training unit is configured to train to obtain a scene information generation network comprising the first attention network, the second attention network and the relation classification network by taking the relation characteristics as the input of the relation classification network and taking the object labels and the relation labels corresponding to the input training samples as the expected output of the relation classification network.
In some embodiments, the training unit is further configured to: determining resistance deviation of each relation included in the predicted training sample according to distribution information of the training samples in the training sample set; taking the relation characteristics as the input of a relation classification network to obtain a classification result; correcting the classification result corresponding to each relation through the resistance deviation corresponding to each relation to obtain a corrected classification result; and training to obtain a scene information generation network based on the loss between the object label and the relation label corresponding to the input sample image and the corrected classification result.
In some embodiments, the training unit is further configured to: and determining the resistance deviation of each relation included in the predicted training sample according to the distribution information of the training samples in the training sample set by combining the preset hyper-parameter for adjusting the resistance.
In some embodiments, the training unit is further configured to: determining the resistance deviation corresponding to each relation according to the proportion of the training sample corresponding to each relation in the training sample set; or determining the resistance deviation corresponding to each relation according to the normalization result of the number of the target object groups corresponding to each relation; or for each relation corresponding to each target object group, determining the resistance deviation corresponding to each relation related to the target object group according to the proportion of the training samples related to the target object group belonging to the relation in all the training samples related to the target object group; or for each relation corresponding to each target object group, determining the resistance deviation corresponding to each relation related to the target object group according to the estimated quantity of the training samples related to the target object group belonging to the relation and the proportion of the estimated quantity of the training samples related to the target object group in the total estimated quantity of the training samples related to the target object group, wherein the estimated quantity represents the general distribution information of the training samples related to the target object group belonging to the relation.
In some embodiments, the estimated number is determined by: and determining the estimated number of the training samples related to the target object group under the relationship according to the number of the training samples related to the subject object in the target object group under the relationship and the number of the training samples related to the object in the target object group under the relationship.
In a fifth aspect, the present application provides a computer-readable medium, on which a computer program is stored, where the program, when executed by a processor, implements the method as described in any implementation manner of the first aspect and the second aspect.
In a sixth aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a storage device, on which one or more programs are stored, which, when executed by one or more processors, cause the one or more processors to implement the method as described in any of the implementations of the first and second aspects.
According to the method and the device for generating the image scene information, the detection result and the characteristic information of each target object are obtained by detecting the target object in the acquired image to be processed; obtaining the context perception characteristics of each target object according to the detection result and the characteristic information of each target object; obtaining a relation characteristic representing the relation between target objects in each target object group according to the context perception characteristic of the target objects in the target object group in the image to be processed; according to the relation characteristics corresponding to each target object group, image scene information corresponding to the image to be processed is generated, so that a method for determining the context characteristics of the target objects in the image to be processed and then determining the relation characteristics representing the relation between the target objects in the target object group to generate scene graph information is provided, the relation between the target objects and the target objects is fully modeled, and the accuracy of the image scene information determined through the relation characteristics is improved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method of generating image scene information according to the present application;
FIG. 3 is a schematic diagram of an application scenario of the method of generating image scene information according to the present embodiment;
FIG. 4 is a flow diagram for one embodiment of a method for generating image scene information according to the present application;
FIG. 5 is a schematic diagram of a structure of a scene information generation model according to the present application;
FIG. 6 is a block diagram of one embodiment of an apparatus for generating image scene information according to the present application;
FIG. 7 is a block diagram of one embodiment of an apparatus for generating image scene information according to the present application;
FIG. 8 is a block diagram of a computer system suitable for use in implementing embodiments of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows an exemplary architecture 100 to which the methods and apparatuses for generating image scene information of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The communication connections between the terminal devices 101, 102, 103 form a topological network, and the network 104 serves to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may be hardware devices or software that support network connections for data interaction and data processing. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices supporting network connection, information acquisition, interaction, display, processing, and the like, including but not limited to smart phones, tablet computers, e-book readers, laptop portable computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic apparatuses listed above. It may be implemented, for example, as multiple software or software modules to provide distributed services, or as a single software or software module. And is not particularly limited herein.
The server 105 may be a server providing various services, for example, a background processing server that determines, for the to-be-processed images provided by the terminal devices 101, 102, and 103, a context feature of a target object in the to-be-processed images, and then determines a relationship feature representing a relationship between target objects in the target object group, so as to generate image scene information. The server can also obtain a scene information generation network for generating image scene information based on training of the training sample set, wherein the scene information generation network comprises a first attention network for determining context characteristics of target objects in the image to be processed, a second attention network for determining relation characteristics representing relations between the target objects in the target object group, and a relation classification network for determining the relations according to the relation characteristics. As an example, the server 105 may be a cloud server.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be further noted that the method for generating image scene information provided by the embodiment of the present application may be executed by a server, may also be executed by a terminal device, and may also be executed by the server and the terminal device in cooperation with each other. Accordingly, the apparatus for generating image scene information and the portions (for example, the units) included in the apparatus for generating image scene information may be all disposed in the server, may be all disposed in the terminal device, and may be disposed in the server and the terminal device, respectively.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. When the electronic device on which the method of generating image scene information runs does not need to perform data transmission with other electronic devices, the system architecture may include only the electronic device (e.g., a server or terminal device) on which the method of generating image scene information runs.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method of generating image scene information is shown, comprising the steps of:
step 201, detecting the target object in the acquired image to be processed, and obtaining the detection result and the characteristic information of each target object.
In this embodiment, an execution subject (for example, a terminal device or a server in fig. 1) of the method for generating image scene information may obtain an image to be processed from a remote location or a local location based on a wired connection manner or a wireless connection manner, and detect a target object in the obtained image to be processed, so as to obtain a detection result and feature information of each target object.
The image to be processed may be an image including arbitrary contents in various scenes. For example, the image to be processed may be an image in a specific application scene of an intelligent robot, unmanned driving, visual obstacle assistance, and the like. The target object may be a person, an object, or the like included in the image to be processed.
In this embodiment, the execution subject may input the image to be processed into a pre-trained target detection network; a feature extraction network in the target detection network extracts the feature information of the target objects in the image to be processed, and a result output network determines the detection result of each target object according to the feature information.
The target detection network may be any network model having the function of detecting target objects, for example, a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network), or Faster R-CNN (a region-based convolutional neural network detector).
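As an illustrative aid (not part of the original disclosure), the following minimal sketch shows how step 201 might be realized with torchvision's pre-trained Faster R-CNN as the target detection network, assuming a recent PyTorch/torchvision. The way per-object feature vectors are pooled from the backbone here is an assumption for illustration, not the application's own feature extraction.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.ops import roi_align

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

image = torch.rand(3, 480, 640)           # stand-in for the image to be processed
with torch.no_grad():
    detection = detector([image])[0]      # detection result: boxes, labels, scores

boxes = detection["boxes"]                # position information, shape (N, 4)
labels = detection["labels"]              # classification information, shape (N,)

# Feature information g_v: one possible choice is to pool a backbone feature map over each box.
with torch.no_grad():
    feats = detector.backbone(image.unsqueeze(0))["0"]   # highest-resolution FPN level
    g = roi_align(feats, [boxes], output_size=7,
                  spatial_scale=feats.shape[-1] / image.shape[-1])
    g = g.flatten(1)                      # one feature vector per detected target object
```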
Step 202, obtaining the context perception characteristics of each target object according to the detection result and the characteristic information of each target object.
In this embodiment, the execution subject may obtain the context-aware feature of each target object according to the detection result and the feature information of each target object. The context-aware feature represents a feature of context awareness (obtaining context information of the target object) corresponding to the target object.
As an example, the execution subject may perform context coding on each target object on the basis of the detection result and the feature information of each target object, so as to obtain the context-aware feature of each target object.
In some optional implementations of this embodiment, the detection result includes position information and classification information of the target object. The execution subject may execute step 202 as follows:
for each target object, the following operations are performed:
firstly, the position information, the classification information and the feature information of the target object are spliced to obtain the object feature of the target object.
Then, the object features are linearly transformed to obtain transformed object features.
As an example, the set of all target objects detected in the image to be processed is denoted as V, and for any target object v ∈ V, the transformed object feature is obtained through the following formula:
e_v = W_o[pos(b_v), g_v, ebd(c_v)]
where pos(b_v) represents the position encoding of the target object, g_v represents the feature information of the target object, ebd(c_v) represents a word embedding vector generated from the classification result of the target object, [·, ·] represents the splicing operation, and W_o is a linear transformation layer.
And finally, carrying out context coding on the transformed object characteristics through the first attention network to obtain the context perception characteristics of the target object.
Based on the transformed object features {e_v, v ∈ V} of all target objects, a stacked Transformer (the first attention network) performs context coding on the target objects to obtain the context-aware feature of each target object, denoted x_v below.
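The following is a minimal sketch (not part of the original disclosure) of how the transformed object feature e_v and the first attention network of step 202 could be assembled in PyTorch. The feature dimensions, the use of a linear layer as the position encoder, the class-embedding size and the number of Transformer layers are all assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

NUM_CLASSES, FEAT_DIM, POS_DIM, EMB_DIM, D_MODEL = 151, 1024, 128, 200, 512

pos_encoder = nn.Linear(4, POS_DIM)                # pos(b_v): encode the box coordinates
class_embed = nn.Embedding(NUM_CLASSES, EMB_DIM)   # ebd(c_v): word embedding of the class
W_o = nn.Linear(POS_DIM + FEAT_DIM + EMB_DIM, D_MODEL)   # linear transformation layer

first_attention = nn.TransformerEncoder(           # stacked Transformer encoder
    nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True), num_layers=3)

def context_aware_features(boxes, g, labels):
    """boxes: (N, 4), g: (N, FEAT_DIM), labels: (N,) -> context-aware features x_v: (N, D_MODEL)."""
    e = W_o(torch.cat([pos_encoder(boxes), g, class_embed(labels)], dim=-1))  # splice + transform
    return first_attention(e.unsqueeze(0)).squeeze(0)   # context coding over all objects

x = context_aware_features(torch.rand(5, 4), torch.rand(5, FEAT_DIM),
                           torch.randint(0, NUM_CLASSES, (5,)))
```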
Step 203, obtaining a relation characteristic representing the relation between the target objects in each target object group according to the context perception characteristics of the target objects in the target object group in the image to be processed.
In this embodiment, the execution subject may obtain a relationship feature representing a relationship between target objects in each target object group according to a context-aware feature of the target objects in the target object group in the image to be processed.
The image to be processed comprises at least one target object group, and each target object group comprises two target objects in the image to be processed, between which a relationship exists. For example, the relationship between the two target objects in a target object group may be a coarse relationship such as "on" or "has", or a more precise relationship with higher information content, such as "parked on" or "carrying".
As an example, the execution subject may perform context coding on each target object group based on the context-aware features of two target objects in the target object group, so as to obtain the relationship feature of each target object group.
In some optional implementations of this embodiment, the executing main body may execute the step 203 by:
for each target object group, the following operations are performed:
first, the context-aware features of the target objects in the target object group are spliced to obtain the object group features of the target object group.
Then, the object group characteristics are linearly transformed to obtain transformed object group characteristics.
And finally, carrying out context coding on the transformed object group characteristics through a second attention network to obtain relationship characteristics representing the relationship between the target objects in the target object group.
In some optional implementations of this embodiment, the execution subject may obtain the object group feature by:
firstly, determining bounding box information of a target object in the target object group in an image to be processed; and then, splicing the context perception characteristics of the target objects in the target object group and the surrounding frame information to obtain the object group characteristics of the target object group. As an example, the bounding box information may be information of a smallest bounding box including the target objects in the target object group.
Specifically, for a target object group (s, o) composed of an arbitrary subject object s ∈ V and an arbitrary object o ∈ V, the corresponding transformed object group feature is:
e_(s,o) = W_r[g_U(s,o), x_s, x_o]
where g_U(s,o) represents the bounding box information, x_s represents the context-aware feature of the subject object, x_o represents the context-aware feature of the object, [·, ·] represents the splicing operation, and W_r represents a linear transformation layer.
Based on the transformed object group features {e_(s,o), s ∈ V, o ∈ V} of all target object groups, another group of stacked Transformers performs context coding on the transformed object group features to obtain the context-aware relationship feature of each target object group, denoted r_(s,o) below.
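A minimal illustrative sketch of step 203 follows, continuing the assumptions of the previous sketch. Encoding the union bounding box with a linear layer, the pair enumeration, and all dimensions are assumptions rather than elements of the application.

```python
import torch
import torch.nn as nn

D_MODEL, UNION_DIM = 512, 256

union_encoder = nn.Linear(4, UNION_DIM)   # g_U(s,o): here simply an encoding of the union box
W_r = nn.Linear(UNION_DIM + 2 * D_MODEL, D_MODEL)
second_attention = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True), num_layers=3)

def relation_features(boxes, x):
    """boxes: (N, 4), x: (N, D_MODEL) -> relation features r_(s,o) for all ordered pairs s != o."""
    pairs = [(s, o) for s in range(len(x)) for o in range(len(x)) if s != o]
    union_boxes = torch.stack([
        torch.cat([torch.minimum(boxes[s, :2], boxes[o, :2]),
                   torch.maximum(boxes[s, 2:], boxes[o, 2:])]) for s, o in pairs])
    e = W_r(torch.cat([union_encoder(union_boxes),               # splice g_U(s,o), x_s, x_o
                       torch.stack([x[s] for s, _ in pairs]),
                       torch.stack([x[o] for _, o in pairs])], dim=-1))
    return pairs, second_attention(e.unsqueeze(0)).squeeze(0)    # context coding over all groups

pairs, r = relation_features(torch.rand(4, 4) * 100, torch.rand(4, D_MODEL))
```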
And 204, generating image scene information corresponding to the image to be processed according to the corresponding relation characteristics of each target object group.
In this embodiment, image scene information corresponding to the image to be processed is generated according to the corresponding relationship characteristic of each target object group.
As an example, the execution subject may map the relationship features to relationship categories through a mapping network, so as to obtain the relationship corresponding to each target object group; the relationships corresponding to the target object groups are then aggregated to obtain the image scene information corresponding to the image to be processed.
In some optional implementations of this embodiment, the executing main body may execute the step 204 by:
firstly, determining the relationship between target objects in each target object group according to the relationship characteristics of each target object group through a relationship classification network; and then, generating a scene graph representing scene information in the image to be processed according to the relation between the target objects in each target object group.
As an example, the execution subject may determine the relationship corresponding to each target object group by the following formula:
p_(s,o) = σ(f_cls(r_(s,o)))
where f_cls represents the relationship classification network, r_(s,o) represents the relationship feature corresponding to the target object group, σ represents the softmax function, and p_(s,o) represents the predicted probability distribution of the relationship between the target objects s and o.
As an example, the execution subject may determine a relationship with the highest probability in the probability distribution as a relationship corresponding to the target object group.
In this embodiment, the execution body may construct the scene graph in a manner that nodes represent target objects in the image to be processed, and directed edges between the nodes represent relationships between the target objects.
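As an illustrative sketch of step 204 (not the application's own implementation), a linear relationship classification network maps each relation feature to a probability distribution, and the scene graph is assembled as nodes (target objects) plus directed, relation-labelled edges. The classifier name, the number of relationship classes and the confidence threshold are assumptions.

```python
import torch
import torch.nn as nn

D_MODEL, NUM_RELATIONS = 512, 51
relation_classifier = nn.Linear(D_MODEL, NUM_RELATIONS)   # relationship classification network

def build_scene_graph(labels, pairs, r, score_threshold=0.2):
    """labels: (N,) object classes; pairs: list of (s, o); r: (P, D_MODEL) relation features."""
    p = torch.softmax(relation_classifier(r), dim=-1)      # p_(s,o) = softmax(f_cls(r_(s,o)))
    graph = {"nodes": labels.tolist(), "edges": []}
    for (s, o), probs in zip(pairs, p):
        rel = int(probs.argmax())                          # relationship with the highest probability
        if probs[rel] >= score_threshold:
            graph["edges"].append((s, rel, o))             # directed edge s --rel--> o
    return graph

graph = build_scene_graph(torch.tensor([1, 5, 7]), [(0, 1), (1, 2)], torch.rand(2, D_MODEL))
```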
It should be noted that the information processing process shown in the foregoing steps 201-204 can be executed by the scene information generation model obtained through the training process of the subsequent embodiment 400.
With continued reference to fig. 3, fig. 3 is a schematic diagram 300 of an application scenario of the method of generating image scene information according to the present embodiment. In the application scenario of fig. 3, the server 301 obtains the to-be-processed image 303 from the terminal device 302. After the to-be-processed image 303 is acquired, first, the target objects in the acquired to-be-processed image are detected, and a detection result and feature information of each target object are obtained. And then, obtaining the context perception characteristics of each target object according to the detection result and the characteristic information of each target object. And then, obtaining a relation characteristic representing the relation between the target objects in each target object group according to the context perception characteristics of the target objects in the target object group in the image to be processed. And finally, generating image scene information corresponding to the image to be processed according to the corresponding relation characteristics of each target object group.
In the method provided by the above embodiment of the present application, the detection result and the feature information of each target object are obtained by detecting the target object in the acquired image to be processed; obtaining the context perception characteristics of each target object according to the detection result and the characteristic information of each target object; obtaining a relation characteristic representing the relation between target objects in each target object group according to the context perception characteristic of the target objects in the target object group in the image to be processed; according to the relation characteristics corresponding to each target object group, image scene information corresponding to the image to be processed is generated, so that a method for determining the context characteristics of the target objects in the image to be processed and then determining the relation characteristics representing the relation between the target objects in the target object group to generate scene graph information is provided, the relation between the target objects and the target objects is fully modeled, and the accuracy of the image scene information determined through the relation characteristics is improved.
And the context characteristics of the target objects in the image to be processed are determined first, and then the relationship characteristics representing the relationship between the target objects in the target object group are determined to generate the scene graph information, so that the interpretability and the robustness of the generation process of the image scene information are improved based on the operation steps.
With continuing reference to FIG. 4, a schematic flow chart 400 illustrating one embodiment of a method for generating image scene information in accordance with the present application is shown that includes the steps of:
step 401, a training sample set is obtained.
In this embodiment, an executing subject (for example, a server or a terminal device in fig. 1) of the method for generating image scene information obtains a training sample set from a remote location or from a local location based on a wired connection manner or a wireless connection manner. The training samples in the training sample set comprise sample images, object labels representing target objects in the sample images and relationship labels representing relationships among the target objects in the target object groups in the sample images.
Step 402, detecting the target objects in the sample image to obtain the detection result and the characteristic information of each target object.
In this embodiment, the execution subject may detect the target object in the sample image, and obtain a detection result and feature information of each target object.
As an example, the execution subject may input the sample image into a pre-trained target detection network; a feature extraction network in the target detection network extracts the feature information of the target objects in the sample image, and a result output network determines the detection result of each target object according to the feature information.
And step 403, obtaining the context awareness feature of each target object according to the detection result and the feature information of each target object through the first attention network.
In this embodiment, the execution subject may obtain the context awareness feature of each target object according to the detection result and the feature information of each target object through the first attention network.
As an example, the first attention network may be a multi-headed, stacked attention module.
Specifically, for each target object, the following operations are performed:
firstly, the position information, the classification information and the feature information of the target object are spliced to obtain the object feature of the target object. Then, the object features are linearly transformed to obtain transformed object features. And finally, carrying out context coding on the transformed object characteristics through the first attention network to obtain the context perception characteristics of the target object.
And step 404, obtaining a relation characteristic representing the relation between the target objects in each target object group according to the context perception characteristic of the target objects in each target object group through a second attention network.
In this embodiment, the execution subject may obtain, through the second attention network, a relationship feature representing a relationship between the target objects in each target object group according to the context awareness feature of the target objects in each target object group.
As an example, the second attention network may be a multi-headed, stacked attention module.
Specifically, for each target object group, the execution body performs the following operations: firstly, the context perception characteristics of the target objects in the target object group and the bounding box information of the target objects in the target object group in the image to be processed are spliced to obtain the object group characteristics of the target object group. Then, the object group characteristics are linearly transformed to obtain transformed object group characteristics. And finally, carrying out context coding on the transformed object group characteristics through a second attention network to obtain relationship characteristics representing the relationship between the target objects in the target object group.
Step 405, training to obtain a scene information generation network including the first attention network, the second attention network and the relationship classification network by taking the relationship features as input of the relationship classification network and taking the object labels and the relationship labels corresponding to the input training samples as expected output of the relationship classification network.
In this embodiment, the execution subject may use the relationship features as input of the relationship classification network, and use the object labels and the relationship labels corresponding to the input training samples as expected output of the relationship classification network, and train to obtain a scene information generation network including the first attention network, the second attention network, and the relationship classification network.
As an example, the execution subject may use the relationship feature as an input of a relationship classification network to obtain a classification result; and then updating the first attention network, the second attention network and the relation classification network based on the loss between the object label and the relation label corresponding to the input sample image and the corrected classification result until a preset end condition is reached, and training to obtain a scene information generation network.
The preset ending condition may be, for example, that the training duration exceeds a duration threshold, the number of training iterations exceeds an iteration threshold, or the loss tends to converge.
In some optional implementations of this embodiment, in order to make the whole training process more consistent and make the resulting model have high accuracy, the executing entity may also include a target detection network for determining the detection result and the feature information into the model updating process, so as to update the scene information generation network including the target detection network, the first attention network, the second attention network, and the relationship classification network according to the loss.
In some optional implementations of this embodiment, the executing main body may execute the step 405 by:
first, a resistance deviation of each relationship included in the predicted training samples is determined according to distribution information of the training samples in the training sample set.
Each training sample may include a plurality of target object groups, and a plurality of relationships may exist between the target objects in each target object group. For example, if a person A and a horse B are included in the sample image, the relationships between person A and horse B may include: the person is pulling the horse, and the horse is in front of the person.
In this implementation, the resistance deviation is used to provide resistance for the prediction process of each relationship included in the training sample, so as to improve the recognition accuracy of the finally obtained scene information generation model for such relationship.
As an example, the execution subject may determine the resistance deviation of each relationship included in the prediction training sample based on the principle that the distribution information is negatively correlated with the resistance deviation.
Secondly, the relational characteristics are used as the input of a relational classification network to obtain a classification result.
Thirdly, correcting the classification result corresponding to each relation through the resistance deviation corresponding to each relation to obtain the corrected classification result.
As an example, for each relationship, the classification result of that relationship is subtracted by the resistance deviation corresponding to that relationship to obtain a corrected classification result.
And fourthly, training to obtain a scene information generation network based on the loss among the object labels and the relation labels corresponding to the input sample images and the corrected classification results.
Specifically, the execution subject may obtain gradient information according to the loss, and further update the scene information generation network according to the gradient information.
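The following minimal sketch (an illustration, not the application's implementation) shows the corrected classification described above: the resistance deviation of each relationship class is subtracted from the corresponding classification result (logit) before the loss is computed. How the per-class deviations are obtained is covered later; here they are simply passed in as a tensor, and all shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def resistance_corrected_loss(logits, relation_labels, resistance_bias):
    """logits: (P, NUM_RELATIONS); relation_labels: (P,); resistance_bias: (NUM_RELATIONS,)."""
    corrected = logits - resistance_bias          # corrected classification result
    return F.cross_entropy(corrected, relation_labels)

# Usage: loss = resistance_corrected_loss(relation_classifier(r), labels, bias); loss.backward()
```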
In the prior art, a labeled training sample set is mainly used to train the scene information generation network, but such labeled training sample sets suffer from unbalanced sample distribution, which manifests most prominently as the long-tail distribution of relationships between target objects. Specifically, owing to differences in the difficulty of data collection and to the labeling tendencies of annotators, a small number of common or coarse-grained relationship descriptions appear in large quantities in the data set; these are called head relationships and dominate the sample distribution of the whole training sample set, while the remaining relationships have relatively little data and are called tail relationships.
For example, in the annotation data of the widely used Visual Genome dataset (VG), the relationships "on" and "has" appear in large numbers, whereas more precise relationships with higher information content, such as "parked on" and "carrying", appear relatively rarely. Such unbalanced data distribution causes a serious bias problem in models trained on the data, which reduces the quality of the scene graphs generated by the model and fails to meet the requirements of practical applications.
In this implementation, the resistance deviation is adopted to avoid the bias that the unbalanced data distribution of the training sample set would otherwise introduce into the trained scene information generation model, thereby improving the accuracy of the model.
In some optional implementations of this embodiment, the executing body may execute the first step by:
and determining the resistance deviation of each relation included in the predicted training sample according to the distribution information of the training samples in the training sample set by combining the preset hyper-parameter for adjusting the resistance.
By presetting the hyper-parameters for adjusting the resistance, the resistance deviation of each relationship can be adjusted more flexibly based on actual requirements, improving the flexibility of the training process.
In some optional implementations of the present embodiment, the execution body may determine the resistance deviation of each relationship by any one of the following:
the first method is as follows: and determining the resistance deviation corresponding to each relation according to the proportion of the training sample corresponding to each relation in the training sample set.
The second method comprises the following steps: and determining the resistance deviation corresponding to each relation according to the normalization result of the number of the target object groups corresponding to each relation.
The third method comprises the following steps: and for each relation corresponding to each target object group, determining the resistance deviation corresponding to each relation related to the target object group according to the proportion of the training samples related to the target object group belonging to the relation in all the training samples related to the target object group.
The fourth method is as follows: for each relationship corresponding to each target object group, determining the resistance deviation corresponding to each relationship related to the target object group according to the proportion of the estimated number of training samples that involve the target object group and belong to that relationship in the total estimated number of training samples involving the target object group, wherein the estimated number characterizes the general distribution information of the training samples involving the target object group that belong to the relationship.
Specifically, the training samples in the training sample set are counted, and the resistance deviation is generated from the statistics. In its basic form, the resistance deviation of each relationship i in the relationship class set C_r is computed from ω_i, the weight of relationship i in the training sample set, together with α and ε, preset hyper-parameters for adjusting the resistance deviation.
By adjusting α and ε, the de-biasing effect of the resistance training can be adjusted conveniently. For example, α = 1 and ε = 0.001 may be taken by default. When α is reduced from 1 to 0, or ε is increased from 0.001 to 1, the strength of the resistance is gradually reduced, and the de-biasing effect of the resistance training is gradually weakened.
In order to describe the distribution of the training sample set from different angles, various ways are designed to describe the weight of each relationship in the training sample set:
in the first mode, the executing entity considers the influence of the proportion of each relation in the data on the model training, and the relation with the smaller proportion is harder to recognize. And initializing the weight of each relation by using the proportion of the number of samples of each relation category to the total training sample set to obtain the resistance deviation based on the number of samples.
Corresponding to the second mode, besides the differences in sample numbers caused by differences in the difficulty of obtaining samples, the long-tail distribution is also reflected in the fact that coarse-grained description labels far outnumber finer-grained descriptions. To reflect the difference between labels of different granularities in the long-tail distribution, a resistance deviation based on the target object groups corresponding to each relationship is designed. If relationship i exists between the target objects of a target object group (s, o), then (s, o) is a target object group corresponding to relationship i. The number of target object groups of each relationship is normalized to obtain the weight of each relationship, yielding the resistance deviation based on the number of valid combinations.
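A minimal sketch of these first two weighting schemes follows, assuming (purely for illustration) that the statistics of the training sample set are available as simple counts; the function names are assumptions.

```python
from collections import Counter

def weights_by_sample_proportion(relation_labels):
    """Mode one: weight of each relationship = its share of all relationship samples."""
    counts = Counter(relation_labels)
    total = sum(counts.values())
    return {rel: n / total for rel, n in counts.items()}

def weights_by_valid_combinations(groups_per_relation):
    """Mode two: normalize the number of target object groups observed for each relationship."""
    total = sum(groups_per_relation.values())
    return {rel: n / total for rel, n in groups_per_relation.items()}

# Usage: weights_by_sample_proportion(["on", "on", "has", "parked on"])
# -> {'on': 0.5, 'has': 0.25, 'parked on': 0.25}
```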
Considering that the relationship prediction is not only related to the relationship class but also closely related to the target object group, the resistance deviation is further refined here, and different resistance deviations are given to different relationships of different target object groups, and the resistance deviation can be calculated by the following formula:
[Formula image BDA0003342274340000181, not reproduced here: the refined resistance deviation for each triplet (s, o, i), expressed as a function of the weight ω_{s,o,i} and the hyper-parameters α and ε]
where ω_{s,o,i} is the weight of the triplet (s, o, i) in the training sample set, and s, o, and i correspond to the subject object label, the object label, and the relation label, respectively. The resistance deviation calculations of the third and fourth ways are based on this formula.
Corresponding to the third way described above, for each given target object group (s, o), the proportion of the number of samples of each relation class i involving (s, o) in the total number of samples of the target object group is used as the weight of the corresponding triplet (s, o, i), yielding the resistance deviation based on the sample number of the target object group.
Corresponding to the fourth way, since many target object groups have only a few samples in the training sample set, the relation distribution observed for these target object groups cannot reflect the general distribution of the relations. For each given target object group (s, o), the ratio of the estimated number of samples of each relation class i to the total estimated number for the target object group is used as the weight of the corresponding triplet (s, o, i), yielding the resistance deviation based on the estimated distribution of the target object group.
As an example, the estimated number may be derived statistically from a large amount of data so as to reflect the general distribution of relations, making the resulting weight conform better to actual scenes.
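Continuing the same toy setup (an illustration only, not the filing's exact computation), the third way conditions the weight on the target object group itself, so each pair (s, o) gets its own distribution over relations:

```python
from collections import Counter, defaultdict

samples = [("person", "horse", "riding"), ("person", "horse", "riding"),
           ("person", "horse", "near"), ("person", "hat", "wearing")]

# Third way: for each target object group (s, o), the weight of (s, o, i) is the
# proportion of that group's samples carrying relation i.
pair_relation_counts = defaultdict(Counter)
for s, o, i in samples:
    pair_relation_counts[(s, o)][i] += 1

triplet_weights = {}
for (s, o), rel_counts in pair_relation_counts.items():
    pair_total = sum(rel_counts.values())
    for rel, cnt in rel_counts.items():
        triplet_weights[(s, o, rel)] = cnt / pair_total

print(triplet_weights[("person", "horse", "riding")])  # 2/3
print(triplet_weights[("person", "hat", "wearing")])   # 1.0
```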
In some optional implementations of this embodiment, the estimated number is determined as follows: the estimated number of training samples involving the target object group under a relation is determined according to the number of training samples involving the subject object of the target object group under the relation and the number of training samples involving the object of the target object group under the relation.
Specifically, the general distribution of relations between target objects is estimated by the following formula:
[Formula image BDA0003342274340000191, not reproduced here: the estimated number of samples for the triplet (s, o, i), computed from the subject-side count n_{s,o′,i} and the object-side count n_{s′,o,i} over the target object set C_e]
where C_e represents the set of target objects, and n_{s,o,i} is the number of samples of the triplet (s, o, i) in the training sample set. n_{s,o′,i} represents the number of training samples involving the subject object s of the target object group under relation i; in n_{s,o′,i} only the subject object is fixed and the object is not. n_{s′,o,i} represents the number of training samples involving the object o of the target object group under relation i; in n_{s′,o,i} only the object is fixed and the subject object is not.
Further, the weight of each triplet is determined by the following formula to obtain the resistance deviation based on the estimated-number distribution of the target object group:
[Formula image BDA0003342274340000192, not reproduced here: the triplet weight ω_{s,o,i} obtained from the estimated numbers of the relations involving the target object group (s, o)]
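The two formulas referenced above are likewise reproduced only as images in the original filing. The sketch below combines the described ingredients — the subject-side count n_{s,o′,i} and the object-side count n_{s′,o,i} — under a simple independence-style assumption to estimate n_{s,o,i}, and then normalizes the estimates per target object group; the exact combination rule is an assumption, not necessarily the filing's formula.

```python
from collections import Counter

samples = [("person", "horse", "riding"), ("person", "bike", "riding"),
           ("dog", "horse", "near"), ("person", "horse", "near"),
           ("child", "bike", "riding")]

subject_side = Counter((s, i) for s, _, i in samples)   # n_{s,o',i}: object left free
object_side = Counter((o, i) for _, o, i in samples)    # n_{s',o,i}: subject left free
relation_total = Counter(i for _, _, i in samples)      # total samples of relation i

def estimated_count(s, o, i):
    """Assumed independence-style estimate of n_{s,o,i}."""
    if relation_total[i] == 0:
        return 0.0
    return subject_side[(s, i)] * object_side[(o, i)] / relation_total[i]

def estimated_weight(s, o, i, relations):
    """Fourth way: estimated count of (s, o, i) over the total estimated
    count of the target object group (s, o)."""
    estimates = {r: estimated_count(s, o, r) for r in relations}
    denom = sum(estimates.values())
    return estimates[i] / denom if denom > 0 else 0.0

print(estimated_weight("person", "horse", "riding", relation_total))  # ~0.4
```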
In this implementation, various ways of determining the resistance deviation are provided. In a specific training process, the way can be selected according to the actual situation, which further improves the flexibility and accuracy of the training process.
With continued reference to FIG. 5, a particular context information generation model 500 is illustrated. The context information generation model 500 includes an object detection network 501, a first attention network 502, a second attention network 503, and a relationship classification network 504.
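For readers who want a structural picture of model 500, here is a minimal compositional sketch. PyTorch, the layer sizes, the number of relation classes, and the use of multi-head self-attention for the two attention networks are all assumptions for illustration; the object detection network 501 is represented only by the pre-extracted object features it would produce.

```python
import torch
import torch.nn as nn

class SceneInfoModel(nn.Module):
    """Sketch of model 500: detector features -> first attention (per-object
    context) -> second attention (per-pair relation features) -> relation classifier."""
    def __init__(self, obj_feat_dim=256, hidden=256, num_relations=51):
        super().__init__()
        self.obj_proj = nn.Linear(obj_feat_dim, hidden)             # linear transform of object features
        self.first_attention = nn.MultiheadAttention(hidden, 4, batch_first=True)
        self.pair_proj = nn.Linear(2 * hidden, hidden)              # linear transform of concatenated pair features
        self.second_attention = nn.MultiheadAttention(hidden, 4, batch_first=True)
        self.relation_classifier = nn.Linear(hidden, num_relations)

    def forward(self, object_features, pair_indices):
        # object_features: (1, N, obj_feat_dim) for N detected objects in one image
        x = self.obj_proj(object_features)
        context_aware, _ = self.first_attention(x, x, x)            # context-aware object features
        subj = context_aware[:, pair_indices[:, 0], :]
        obj = context_aware[:, pair_indices[:, 1], :]
        pair = self.pair_proj(torch.cat([subj, obj], dim=-1))       # per target-object-group features
        relation_features, _ = self.second_attention(pair, pair, pair)
        return self.relation_classifier(relation_features)          # relation logits per pair

# Toy usage: 3 detected objects, 2 candidate target object groups.
model = SceneInfoModel()
feats = torch.randn(1, 3, 256)
pairs = torch.tensor([[0, 1], [1, 2]])
print(model(feats, pairs).shape)   # torch.Size([1, 2, 51])
```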
In the method provided by the above embodiment of the present application, a training sample set is obtained, where a training sample in the training sample set includes a sample image, an object label representing a target object in the sample image, and a relation label representing a relation between the target objects in a target object group in the sample image; the target objects in the sample image are detected to obtain the detection result and feature information of each target object; the context-aware features of the target objects are obtained from the detection results and feature information through the first attention network; the relation features representing the relations between the target objects in each target object group are obtained from the context-aware features of the target objects in each target object group through the second attention network; and the relation features are used as the input of the relation classification network, with the object labels and relation labels corresponding to the input training samples as the expected output of the relation classification network, so that a scene information generation network including the first attention network, the second attention network, and the relation classification network is obtained through training. This provides a training method for the scene information generation model in which the target objects and the relations between them are fully modeled, improving the accuracy of the model.
Moreover, the context features of the target objects in the image are determined first, and the relation features representing the relations among the target objects in each target object group are then determined to generate the scene graph information; these operation steps improve the interpretability and robustness of the model in the process of generating the image scene information.
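As a hedged illustration of how the resistance deviation from the earlier subsections could enter this training procedure, the sketch below corrects the relation logits with a per-relation bias before computing the classification loss; treating the correction as a simple subtraction from the logits, and the helper names, are assumptions rather than the filing's exact procedure. It reuses the SceneInfoModel sketch above.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, object_features, pair_indices,
                  relation_labels, resistance_bias):
    """One sketched optimisation step: the relation logits produced by the
    scene information generation network are corrected with the resistance
    deviation before the classification loss is computed."""
    optimizer.zero_grad()
    logits = model(object_features, pair_indices)       # (1, P, num_relations)
    corrected = logits - resistance_bias                 # assumed additive correction per relation
    loss = F.cross_entropy(corrected.squeeze(0), relation_labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with the SceneInfoModel sketch above.
model = SceneInfoModel()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
bias = torch.zeros(51)                                   # placeholder resistance deviation per relation
loss = training_step(model, opt, torch.randn(1, 3, 256),
                     torch.tensor([[0, 1], [1, 2]]),
                     torch.tensor([3, 7]), bias)
print(loss)
```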
With continuing reference to fig. 6, as an implementation of the method shown in the above-mentioned figures, the present application provides an embodiment of an apparatus for generating image scene information, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be applied to various electronic devices.
As shown in fig. 6, the apparatus for generating image scene information includes: a first detection unit 601 configured to detect target objects in the acquired image to be processed, and obtain a detection result and feature information of each target object; a first feature processing unit 602 configured to obtain a context-aware feature of each target object according to the detection result and feature information of each target object; a second feature processing unit 603 configured to obtain a relationship feature representing a relationship between target objects in each target object group according to context-aware features of the target objects in the target object groups in the image to be processed; and the generating unit 604 is configured to generate image scene information corresponding to the image to be processed according to the corresponding relation characteristic of each target object group.
In some optional implementations of this embodiment, the detection result includes location information and classification information; and a first feature processing unit 602, further configured to: for each target object, the following operations are performed: splicing the position information, the classification information and the characteristic information of the target object to obtain the object characteristics of the target object; carrying out linear transformation on the object characteristics to obtain transformed object characteristics; and carrying out context coding on the transformed object characteristics through the first attention network to obtain the context perception characteristics of the target object.
In some optional implementations of this embodiment, the second feature processing unit 603 is further configured to: for each target object group, the following operations are performed: splicing the context perception characteristics of the target objects in the target object group to obtain the object group characteristics of the target object group; carrying out linear transformation on the object group characteristics to obtain transformed object group characteristics; and carrying out context coding on the transformed object group characteristics through a second attention network to obtain relationship characteristics representing the relationship between the target objects in the target object group.
In some optional implementations of this embodiment, the second feature processing unit 603 is further configured to: determining bounding box information of target objects in the target object group in the image to be processed; and splicing the context perception characteristics of the target objects in the target object group and the surrounding frame information to obtain the object group characteristics of the target object group.
In some optional implementations of this embodiment, the generating unit 604 is further configured to: determining the relation between the target objects in each target object group according to the relation characteristics of each target object group through a relation classification network; and generating a scene graph representing scene information in the image to be processed according to the relation between the target objects in each target object group.
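To make the generating unit's two steps concrete — first predicting a relation for each target object group through the relation classification network, then assembling the scene graph — here is a small sketch; the triple format, the thresholding, and all names are assumptions for illustration:

```python
import torch

def build_scene_graph(pair_indices, relation_logits, object_labels, relation_names,
                      score_threshold=0.2):
    """Turn per-pair relation scores into scene-graph triples
    (subject_label, relation_name, object_label)."""
    probs = torch.softmax(relation_logits, dim=-1)
    triples = []
    for (s_idx, o_idx), p in zip(pair_indices.tolist(), probs):
        score, rel = p.max(dim=-1)
        if score.item() >= score_threshold:
            triples.append((object_labels[s_idx], relation_names[rel.item()],
                            object_labels[o_idx]))
    return triples

# Toy usage: two candidate pairs over three detected objects.
logits = torch.tensor([[0.1, 2.0, 0.3], [1.5, 0.2, 0.1]])
print(build_scene_graph(torch.tensor([[0, 1], [1, 2]]), logits,
                        ["person", "horse", "grass"], ["on", "riding", "near"]))
# -> [('person', 'riding', 'horse'), ('horse', 'on', 'grass')]
```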
In this embodiment, the first detection unit of the apparatus for generating image scene information detects the target objects in the acquired image to be processed and obtains the detection result and feature information of each target object; the first feature processing unit obtains the context-aware feature of each target object from the detection result and feature information of each target object; the second feature processing unit obtains the relation feature representing the relation between the target objects in each target object group from the context-aware features of the target objects in the target object groups in the image to be processed; and the generating unit generates the image scene information corresponding to the image to be processed from the relation feature corresponding to each target object group. An apparatus is thereby provided that first determines the context features of the target objects in the image to be processed and then determines the relation features representing the relations between the target objects in each target object group to generate the scene information, so that the target objects and the relations between them are fully modeled and the accuracy of the image scene information determined through the relation features is improved.
With continuing reference to fig. 7, as an implementation of the method shown in the above-mentioned figures, the present application provides an embodiment of an apparatus for generating image scene information, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 4, and the apparatus may be applied to various electronic devices.
As shown in fig. 7, the apparatus for generating image scene information includes: an obtaining unit 701 configured to obtain a training sample set, where training samples in the training sample set include a sample image, an object label representing a target object in the sample image, and a relationship label representing a relationship between target objects in a target object group in the sample image; a second detection unit 702 configured to detect target objects in the sample image, and obtain a detection result and feature information of each target object; a first attention unit 703 configured to obtain a context awareness feature of each target object according to the detection result and the feature information of each target object through a first attention network; a second attention unit 704 configured to obtain, through a second attention network, a relationship feature characterizing a relationship between the target objects in each target object group according to the context-aware features of the target objects in each target object group; the training unit 705 is configured to train a scene information generation network including the first attention network, the second attention network, and the relationship classification network, with the relationship features as inputs of the relationship classification network, and with the object labels and the relationship labels corresponding to the input training samples as expected outputs of the relationship classification network.
In some optional implementations of this embodiment, the training unit 705 is further configured to: determining resistance deviation of each relation included in the predicted training sample according to distribution information of the training samples in the training sample set; taking the relation characteristics as the input of a relation classification network to obtain a classification result; correcting the classification result corresponding to each relation through the resistance deviation corresponding to each relation to obtain a corrected classification result; and training to obtain a scene information generation network based on the loss between the object label and the relation label corresponding to the input sample image and the corrected classification result.
In some optional implementations of this embodiment, the training unit 705 is further configured to: and determining the resistance deviation of each relation included in the predicted training sample according to the distribution information of the training samples in the training sample set by combining the preset hyper-parameter for adjusting the resistance.
In some optional implementations of this embodiment, the training unit 705 is further configured to: determining the resistance deviation corresponding to each relation according to the proportion of the training sample corresponding to each relation in the training sample set; or determining the resistance deviation corresponding to each relation according to the normalization result of the number of the target object groups corresponding to each relation; or for each relation corresponding to each target object group, determining the resistance deviation corresponding to each relation related to the target object group according to the proportion of the training samples related to the target object group belonging to the relation in all the training samples related to the target object group; or for each relation corresponding to each target object group, determining the resistance deviation corresponding to each relation related to the target object group according to the estimated quantity of the training samples related to the target object group belonging to the relation and the proportion of the estimated quantity of the training samples related to the target object group in the total estimated quantity of the training samples related to the target object group, wherein the estimated quantity represents the general distribution information of the training samples related to the target object group belonging to the relation.
In some optional implementations of this embodiment, the estimated number is determined by: and determining the estimated number of the training samples related to the target object group under the relationship according to the number of the training samples related to the subject object in the target object group under the relationship and the number of the training samples related to the object in the target object group under the relationship.
In this embodiment, an obtaining unit in a device for generating image scene information obtains a training sample set, where a training sample in the training sample set includes a sample image, an object label representing a target object in the sample image, and a relationship label representing a relationship between target objects in a target object group in the sample image; the second detection unit detects the target objects in the sample image to obtain the detection result and the characteristic information of each target object; the first attention unit obtains the context perception characteristics of each target object according to the detection result and the characteristic information of each target object through a first attention network; the second attention unit obtains a relation characteristic representing the relation between the target objects in each target object group according to the context perception characteristic of the target objects in each target object group through a second attention network; the training unit takes the relation characteristics as the input of the relation classification network, takes the object labels and the relation labels corresponding to the input training samples as the expected output of the relation classification network, trains and obtains a scene information generation network comprising a first attention network, a second attention network and the relation classification network.
Referring now to FIG. 8, shown is a block diagram of a computer system 800 suitable for use in implementing devices of embodiments of the present application (e.g., devices 101, 102, 103, 105 shown in FIG. 1). The apparatus shown in fig. 8 is only an example, and should not bring any limitation to the function and the scope of use of the embodiments of the present application.
As shown in fig. 8, a computer system 800 includes a processor (e.g., CPU, central processing unit) 801 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the system 800 are also stored. The processor 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as necessary, so that a computer program read out therefrom is installed into the storage section 808 as necessary.
In particular, according to embodiments of the application, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. The computer program, when executed by the processor 801, performs the above-described functions defined in the methods of the present application.
It should be noted that the computer readable medium of the present application can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the client computer, partly on the client computer, as a stand-alone software package, partly on the client computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the client computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor comprising a first detection unit, a first feature processing unit, a second feature processing unit, and a generation unit, further describable as: a processor includes an acquisition unit, a second detection unit, a first attention unit, a second attention unit, and a training unit. For example, the training unit may be further described as a unit that takes the relationship features as the input of the relationship classification network, takes the object labels and the relationship labels corresponding to the input training samples as the expected output of the relationship classification network, and trains the scene information generation network including the first attention network, the second attention network and the relationship classification network.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by the apparatus, cause the computer device to: detecting target objects in the acquired image to be processed to obtain a detection result and characteristic information of each target object; obtaining the context perception characteristics of each target object according to the detection result and the characteristic information of each target object; obtaining a relation characteristic representing the relation between target objects in each target object group according to the context perception characteristic of the target objects in the target object group in the image to be processed; and generating image scene information corresponding to the image to be processed according to the corresponding relation characteristic of each target object group. The computer device is also caused to: acquiring a training sample set, wherein training samples in the training sample set comprise sample images, object labels representing target objects in the sample images and relation labels representing relations among the target objects in a target object group in the sample images; detecting target objects in the sample image to obtain a detection result and characteristic information of each target object; obtaining the context perception characteristics of each target object according to the detection result and the characteristic information of each target object through a first attention network; obtaining a relation characteristic representing the relation between the target objects in each target object group according to the context perception characteristic of the target objects in each target object group through a second attention network; and taking the relation characteristics as the input of the relation classification network, taking the object labels and the relation labels corresponding to the input training samples as the expected output of the relation classification network, and training to obtain a scene information generation network comprising the first attention network, the second attention network and the relation classification network.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (14)

1. A method of generating image scene information, comprising:
detecting target objects in the acquired image to be processed to obtain a detection result and characteristic information of each target object;
obtaining the context perception characteristics of each target object according to the detection result and the characteristic information of each target object;
obtaining a relation characteristic representing the relation between target objects in each target object group according to the context perception characteristic of the target objects in the target object group in the image to be processed;
and generating image scene information corresponding to the image to be processed according to the corresponding relation characteristic of each target object group.
2. The method of claim 1, wherein the detection result includes position information and classification information; and
the obtaining of the context-aware feature of each target object according to the detection result and the feature information of each target object includes:
for each target object, the following operations are performed:
splicing the position information, the classification information and the characteristic information of the target object to obtain the object characteristics of the target object;
performing linear transformation on the object characteristics to obtain transformed object characteristics;
and carrying out context coding on the transformed object characteristics through a first attention network to obtain the context perception characteristics of the target object.
3. The method according to claim 1, wherein obtaining a relationship feature characterizing a relationship between target objects in each target object group according to context-aware features of target objects in the target object groups in the image to be processed comprises:
for each target object group, the following operations are performed:
splicing the context perception characteristics of the target objects in the target object group to obtain the object group characteristics of the target object group;
performing linear transformation on the object group characteristics to obtain transformed object group characteristics;
and carrying out context coding on the transformed object group characteristics through a second attention network to obtain relationship characteristics representing the relationship between the target objects in the target object group.
4. The method of claim 3, wherein the splicing the context-aware features of the target objects in the target object group to obtain the object group features of the target object group comprises:
determining bounding box information of the target objects in the target object group in the image to be processed;
and splicing the context perception characteristics of the target objects in the target object group and the surrounding frame information to obtain the object group characteristics of the target object group.
5. The method according to claim 1, wherein the generating image scene information corresponding to the image to be processed according to the relationship feature corresponding to each target object group includes:
determining the relation between the target objects in each target object group according to the relation characteristics of each target object group through a relation classification network;
and generating a scene graph representing scene information in the image to be processed according to the relation between the target objects in each target object group.
6. A method for generating image scene information, comprising:
acquiring a training sample set, wherein training samples in the training sample set comprise sample images, object labels representing target objects in the sample images and relation labels representing relations among the target objects in a target object group in the sample images;
detecting target objects in the sample image to obtain a detection result and characteristic information of each target object;
obtaining the context perception characteristics of each target object according to the detection result and the characteristic information of each target object through a first attention network;
obtaining a relation characteristic representing the relation between the target objects in each target object group according to the context perception characteristic of the target objects in each target object group through a second attention network;
and training to obtain a scene information generation network comprising the first attention network, the second attention network and the relation classification network by taking the relation characteristics as the input of the relation classification network and taking the object labels and the relation labels corresponding to the input training samples as the expected output of the relation classification network.
7. The method of claim 6, wherein the training a scene information generation network including the first attention network, the second attention network and the relationship classification network with the relationship features as inputs of the relationship classification network and object labels and relationship labels corresponding to the input training samples as expected outputs of the relationship classification network comprises:
determining resistance deviation of each relation included in the predicted training sample according to distribution information of the training samples in the training sample set;
taking the relation characteristics as the input of the relation classification network to obtain a classification result;
correcting the classification result corresponding to each relation through the resistance deviation corresponding to each relation to obtain a corrected classification result;
and training to obtain the scene information generation network based on the loss between the object label and the relation label corresponding to the input sample image and the corrected classification result.
8. The method of claim 7, wherein the determining a resistance bias for each relationship included in the predicted training samples from the distribution information of the training samples in the set of training samples comprises:
and determining the resistance deviation of each relation included in the predicted training sample according to the distribution information of the training samples in the training sample set by combining the preset hyper-parameter for adjusting the resistance.
9. The method according to claim 7 or 8, wherein the determining a resistance deviation for each relation included in the predicted training samples from the distribution information of the training samples in the set of training samples comprises:
determining the resistance deviation corresponding to each relation according to the proportion of the training sample corresponding to each relation in the training sample set; or
Determining the resistance deviation corresponding to each relation according to the normalization result of the number of the target object groups corresponding to each relation; or
For each relation corresponding to each target object group, determining the resistance deviation corresponding to each relation related to the target object group according to the proportion of the training samples related to the target object group belonging to the relation in all the training samples related to the target object group; or
And for each relation corresponding to each target object group, determining the resistance deviation corresponding to each relation related to the target object group according to the estimated quantity of the training samples related to the target object group belonging to the relation and the proportion of the estimated quantity in the total estimated quantity of the training samples related to the target object group, wherein the estimated quantity represents the general distribution information of the training samples related to the target object group belonging to the relation.
10. The method of claim 9, wherein the estimated number is determined by:
and determining the estimated number of the training samples related to the target object group under the relationship according to the number of the training samples related to the subject object in the target object group under the relationship and the number of the training samples related to the object in the target object group under the relationship.
11. An apparatus for generating image scene information, comprising:
the first detection unit is configured to detect target objects in the acquired image to be processed, and obtain a detection result and characteristic information of each target object;
the first feature processing unit is configured to obtain context perception features of each target object according to the detection result and feature information of each target object;
a second feature processing unit configured to obtain a relationship feature representing a relationship between target objects in each target object group according to a context-aware feature of the target objects in the target object group in the image to be processed;
and the generating unit is configured to generate image scene information corresponding to the image to be processed according to the corresponding relation characteristic of each target object group.
12. An apparatus for generating image scene information, comprising:
an obtaining unit configured to obtain a training sample set, wherein training samples in the training sample set include a sample image, an object label characterizing a target object in the sample image, and a relationship label characterizing a relationship between target objects in a target object group in the sample image;
the second detection unit is configured to detect the target objects in the sample image and obtain the detection result and the characteristic information of each target object;
the first attention unit is configured to obtain context perception characteristics of each target object according to the detection result and the characteristic information of each target object through a first attention network;
a second attention unit configured to obtain, through a second attention network, a relationship feature representing a relationship between the target objects in each of the target object groups according to the context-aware features of the target objects in each of the target object groups;
and the training unit is configured to train to obtain a scene information generation network including the first attention network, the second attention network and the relation classification network by taking the relation features as input of the relation classification network and taking object labels and relation labels corresponding to the input training samples as expected output of the relation classification network.
13. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-10.
14. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-10.
CN202111323685.XA 2021-11-08 2021-11-08 Method and device for generating image scene information Pending CN114067196A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111323685.XA CN114067196A (en) 2021-11-08 2021-11-08 Method and device for generating image scene information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111323685.XA CN114067196A (en) 2021-11-08 2021-11-08 Method and device for generating image scene information

Publications (1)

Publication Number Publication Date
CN114067196A true CN114067196A (en) 2022-02-18

Family

ID=80274558

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111323685.XA Pending CN114067196A (en) 2021-11-08 2021-11-08 Method and device for generating image scene information

Country Status (1)

Country Link
CN (1) CN114067196A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115131698A (en) * 2022-05-25 2022-09-30 腾讯科技(深圳)有限公司 Video attribute determination method, device, equipment and storage medium
CN115131698B (en) * 2022-05-25 2024-04-12 腾讯科技(深圳)有限公司 Video attribute determining method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN113326764B (en) Method and device for training image recognition model and image recognition
CN109117831B (en) Training method and device of object detection network
CN108520220B (en) Model generation method and device
US11392792B2 (en) Method and apparatus for generating vehicle damage information
CN108229419B (en) Method and apparatus for clustering images
CN108960090B (en) Video image processing method and device, computer readable medium and electronic equipment
US11768876B2 (en) Method and device for visual question answering, computer apparatus and medium
CN108416310B (en) Method and apparatus for generating information
CN108197652B (en) Method and apparatus for generating information
CN111523640B (en) Training method and device for neural network model
CN108564102A (en) Image clustering evaluation of result method and apparatus
US11544498B2 (en) Training neural networks using consistency measures
CN110659657B (en) Method and device for training model
CN110929802A (en) Information entropy-based subdivision identification model training and image identification method and device
CN113434683B (en) Text classification method, device, medium and electronic equipment
CN113128419A (en) Obstacle identification method and device, electronic equipment and storage medium
CN113780270A (en) Target detection method and device
CN113408507B (en) Named entity identification method and device based on resume file and electronic equipment
CN117290561B (en) Service state information feedback method, device, equipment and computer readable medium
CN111950647A (en) Classification model training method and device
CN112966701A (en) Method and device for classifying objects
CN112766284A (en) Image recognition method and device, storage medium and electronic equipment
CN115393592A (en) Target segmentation model generation method and device, and target segmentation method and device
CN114067196A (en) Method and device for generating image scene information
CN110059743B (en) Method, apparatus and storage medium for determining a predicted reliability metric

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination