CN116524513A

CN116524513A - Open vocabulary scene graph generation method, system, equipment and storage medium

Info

Publication number: CN116524513A
Application number: CN202310801730.0A
Authority: CN
Inventors: 徐童; 陈恩红; 冯长凯; 吴世伟; 许德容
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2023-07-03
Filing date: 2023-07-03
Publication date: 2023-08-01
Anticipated expiration: 2043-07-03
Also published as: CN116524513B

Abstract

The invention discloses a method, a system, a device and a storage medium for generating an open vocabulary scene graph, which are one-to-one schemes, wherein: extracting the representation of the input image by using a visual feature extraction method, and then acting on entity inquiry and relation inquiry to generate entity and relation visual features containing context information of different areas; generating entity and relationship text tokens based on the single prompt and based on the adaptive hierarchical prompt; an open vocabulary scene graph is generated based on the entity and relationship visual features, the entity and relationship text characterizations. According to the scheme, the visual representation and the text representation are aligned to perform entity recognition and relationship recognition, meanwhile, rich context information contained in the image is fully utilized to recognize the relationship of the long-distance entity, and the confusable categories are more easily distinguished by the visual representation by utilizing the hierarchical structure of the relationship categories, so that a good effect is achieved on the precision of generating the open vocabulary scene graph.

Description

Open vocabulary scene graph generation method, system, equipment and storage medium

Technical Field

The present invention relates to the field of computer vision and natural language processing, and in particular, to a method, a system, an apparatus, and a storage medium for generating an open vocabulary scene graph.

Background

The goal of scene graph generation is to detect visual relationships between objects in an image and generate triples. This ability to summarize scenes into structured visual semantics is the basis for many visual applications, such as image description, visual questions and answers, and visual reasoning. In order to make the generated scenegraph more practical, a more challenging open vocabulary scenegraph generation task is presented that trains using only a subset of entity and relationship categories and predicts using all the categories in the dataset.

Existing open vocabulary scene graph generation uses visual language models to identify categories that are not visible to the training set, but they do not address the following two challenges: 1) Identifying relationships of remote entity pairs is difficult because it requires context information from multiple regions; 2) Text inserts generated from relationship categories with similar spatial locations or semantics should be far enough from each other to achieve accurate differentiation through visual representation. Therefore, the prior art tends to identify the relationship between the subject and the object with a smaller distance and predict the relationship category visible in the training set, neglecting the identification of the remote relationship and the unknown relationship, and the generated scene graph is often imperfect in information and has error triples, so that a satisfactory effect is difficult to obtain.

Disclosure of Invention

The invention aims to provide a method, a system, equipment and a storage medium for generating an open vocabulary scene graph, which can generate the open vocabulary scene graph with higher precision.

The invention aims at realizing the following technical scheme:

a method for generating an open vocabulary scene graph comprises the following steps:

step 1, extracting a representation of an input image by using a visual feature extraction method, and then acting the extracted representation on entity inquiry and relation inquiry to generate entity visual features and relation visual features containing context information of different areas;

step 2, generating entity text tokens corresponding to all entity categories based on the single prompt, and generating relation text tokens corresponding to all relation categories based on the self-adaptive hierarchical prompt;

and 3, constructing a triplet visual feature by utilizing the relation visual feature of each pair of entity visual features and the corresponding position, and generating an open vocabulary scene graph based on the entity visual feature, the interaction result of the entity visual feature and the entity text representation and the interaction result of the triplet visual feature and the relation text representation.

An open vocabulary scene graph generation system comprising: generating a model by using the open vocabulary scene graph; the open vocabulary scene graph generation model comprises:

the feature generation module is used for extracting the representation of the input image by utilizing a visual feature extraction method, and then, the extracted representation is used for entity inquiry and relation inquiry to generate entity visual features and relation visual features containing different area context information;

the text representation generation module is used for generating entity text representations corresponding to each entity category based on the single prompt and generating relation text representations corresponding to each relation category based on the self-adaptive level prompt;

the scene graph construction module is used for constructing a triplet visual feature by utilizing the relation visual feature of each pair of entity visual features and the corresponding position, and generating an open vocabulary scene graph based on the entity visual feature, the interaction result of the entity visual feature and the entity text representation and the interaction result of the triplet visual feature and the relation text representation.

A processing apparatus, comprising: one or more processors; a memory for storing one or more programs;

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the aforementioned methods.

A readable storage medium storing a computer program which, when executed by a processor, implements the method described above.

According to the technical scheme provided by the invention, the entity identification and the relation identification are carried out by utilizing the visual representation and text representation alignment mode, meanwhile, the relation of the long-distance entity is identified by fully utilizing the rich context information contained in the image, and the confusing category is more easily distinguished by the visual representation by utilizing the hierarchical structure of the relation category, so that a good effect is achieved on the precision of generating the open vocabulary scene graph.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart of an open vocabulary scene graph generating method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of an open vocabulary scene graph generation model provided by an embodiment of the present invention;

fig. 3 is a schematic diagram of a processing apparatus according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.

The terms that may be used herein will first be described as follows:

the terms "comprises," "comprising," "includes," "including," "has," "having" or other similar referents are to be construed to cover a non-exclusive inclusion. For example: including a particular feature (e.g., a starting material, component, ingredient, carrier, formulation, material, dimension, part, means, mechanism, apparatus, step, procedure, method, reaction condition, processing condition, parameter, algorithm, signal, data, product or article of manufacture, etc.), should be construed as including not only a particular feature but also other features known in the art that are not explicitly recited.

The term "consisting of … …" is meant to exclude any technical feature element not explicitly listed. If such term is used in a claim, the term will cause the claim to be closed, such that it does not include technical features other than those specifically listed, except for conventional impurities associated therewith. If the term is intended to appear in only a clause of a claim, it is intended to limit only the elements explicitly recited in that clause, and the elements recited in other clauses are not excluded from the overall claim.

The method, the system, the equipment and the storage medium for generating the open vocabulary scene graph are described in detail below. What is not described in detail in the embodiments of the present invention belongs to the prior art known to those skilled in the art. The specific conditions are not noted in the examples of the present invention and are carried out according to the conditions conventional in the art or suggested by the manufacturer.

Example 1

The embodiment of the invention provides a method for generating an open vocabulary scene graph, which mainly comprises the following steps as shown in fig. 1:

and step 1, extracting the representation of the input image by using a visual feature extraction method, and then, applying the extracted representation to entity inquiry and relation inquiry to generate entity visual features and relation visual features containing context information of different areas.

In the embodiment of the invention, the representation of the input image is extracted through a convolution network, the extracted representation is serialized and is processed through a transducer encoder after being added with position codes, and the image representation after self-attention mechanism optimization is obtained; wherein, the Transformer represents a Transformer; and inputting the image representation optimized by the self-attention mechanism to a transducer decoder, respectively searching for an entity reference point and a relationship reference point by combining the entity query and the relationship query through the deformable attention mechanism, and then sampling around the searched entity reference point and relationship reference point to generate entity visual features and relationship visual features containing context information of different areas.

Preferably, a visual language model may be introduced, where the input is data obtained by preprocessing an image, and the visual embedding is output, and the massive knowledge contained in the visual language model is obtained by distilling the physical visual features and the relational visual features with the visual embedding output by the visual language model, which belongs to a scheme of the model training stage, and is described later.

And 2, generating entity text tokens corresponding to the entity categories based on the single prompt, and generating relation text tokens corresponding to the relation categories based on the adaptive level prompt.

In the embodiment of the invention, a single prompt is used as the prefix of each entity category and is transmitted to a text encoder to generate a corresponding entity text representation.

For a given entity classIts vector form is->Taking the single prompt as prefix, obtaining the entity category learnable single prompt expression vector +.>, wherein ,/>For L vectors of single hint, +.>Is->Vector of single hint->L is a set integer; generating corresponding entity text representation by means of a pre-trained text encoder>Expressed as:

；

wherein ,representing the text encoder.

In the embodiment of the invention, a hierarchical structure of relation categories is constructed in a bottom-up mode and is respectively marked as a bottom layer, a middle layer and a top layer; the method comprises the steps that a bottom layer is composed of nodes corresponding to known relation categories, virtual nodes of a plurality of middle layers are obtained through abstracting the relation categories corresponding to the bottom layer nodes, and virtual nodes of a top layer are obtained through abstracting the relation categories corresponding to the virtual nodes of the middle layers; nodes of the bottom layer and the middle layer self-adaptively select father nodes of the nodes to form a hierarchical prompt representation vector corresponding to each bottom layer node, and corresponding relation text representation is generated through a text encoder.

For relationship categoriesCorresponding bottom layer node, which adaptively selects middle layer node +.>As parent node, while middle layer node->Adaptively select top level node->Relationship category ∈as parent node>Hierarchical hint representation vector of corresponding bottom level node +.>The following are provided:

；

wherein the symbol j refers to any one of the top layer t, the middle layer m and the bottom layer b,、/>、/>hint vectors at top layer t, middle layer m, bottom layer b, respectively, each of which contains a vector representation of the relevant layer node +.>(it is a vector representation of the correlation class of nodes of the correlation layer, belonging to a non-learnable vector),>vectors for L single cues of the relevant layer, < ->Is the%>Vector of single hint->L is a set integer; />Is top level node->Is a learner-driven prompt vector of->Is->The%>A learnable hint vector->Is a middle layer nodeIs a learner-driven prompt vector of->Is->The%>The hint vector may be learned.

And then classifying the relationshipHierarchical hint representation vector of corresponding bottom level node +.>Inputting the text into a text encoder to generate corresponding relation text representation +.>。

In the embodiment of the invention, the physical visual characteristics are processed through the boundary frame head to obtain the boundary frame of the entity; the interaction result of the entity visual characteristics and the entity text characterization refers to entity category determined according to the similarity between the entity visual characteristics and each entity text characterization. The interactive result of the triple visual feature and the relation text representation refers to the relation category between two entities in the triple visual feature determined according to the similarity of the triple visual feature and each relation text representation. Finally, an open vocabulary scene graph is combined using bounding boxes of all entities, all entity categories, and relationship categories between all two entities.

Exemplary: assuming that 200 physical visual features and relational visual features are obtained through the step 1, respectively, the 2 nd physical visual featureVisual characteristics of the 5 th entity>Relation visual characteristics ∈2>Visual characteristics associated with the 5 th relationshipForm a triplet visual characteristic +.>The relationship category of the 2 nd entity and the 5 th entity can be calculated through the four visual features and the relationship text characterization.

In the scheme of the embodiment of the invention, the long-distance relation is more convenient to identify by the three-dimensional visual features consisting of the two entity visual features and the two relation visual features, because for the long-distance entity, if only one relation visual feature is used, the identification of the long-distance relation cannot be supported by the contained context information, and the identification of the long-distance relation can be optimized by utilizing the two relation visual features by fully utilizing the context near the subject and the context near the object.

In the scheme provided by the implementation of the invention, an open vocabulary scene graph generating module based on a visual language model is designed by combining the context information and the hierarchical prompt of different areas, and a large amount of experiments are carried out on a widely-used scene graph data set. The module achieves outstanding effect on objective evaluation indexes, and exceeds the previous optimal model. Specific: and (3) executing the step (1) by using a feature generation module, the step (2) by using a text characterization generation module, and the step (3) by using a scene graph construction module, wherein an open vocabulary scene graph generation model is formed by the feature generation module, the text characterization generation module and the scene graph construction module.

Training the open vocabulary scene graph generation model in advance before use; the training process comprises the following steps:

(1) Introducing a visual language model, cutting a training image into a plurality of entity images according to the marked entity boundary frames, and cutting by using the union of the entity boundary frames of each related entity pair as the boundary frame of the corresponding relation to obtain a triplet image; and taking the entity image and the triplet image as the input of the visual language model to obtain entity visual embedding and triplet visual embedding.

(2) The physical visual embedding and the triplet visual embedding are utilized, and the physical visual characteristics and the triplet visual characteristics of the corresponding training images output by the characteristic generating module passThe loss function calculates the distillation loss.

(3) And calculating a cross entropy loss function by using the interaction result of the entity visual features and the entity text representation and the interaction result of the triplet visual features and the relation text representation, which are obtained by the scene graph construction module.

(4) And calculating a target positioning loss function by using the open vocabulary scene graph generated by the scene graph construction module.

(5) And training the open vocabulary scene graph generation model by combining the distillation loss function, the cross entropy loss function and the target positioning loss function.

According to the scheme provided by the embodiment of the invention, the entity identification and the relation identification (the entity category and the relation category between the two entities are obtained through interaction of the visual characteristics and the text characteristics in the step 3) are carried out by utilizing the visual characteristics and the text characteristics, meanwhile, the relation of the long-distance entities is identified by fully utilizing rich context information contained in the image, and the confusing category is more easily distinguished by the visual characteristics by utilizing the hierarchical structure of the relation category, so that a good effect is obtained on the precision of generating the open vocabulary scene graph.

In order to more clearly demonstrate the technical scheme and the technical effects provided by the invention, the method provided by the embodiment of the invention is described in detail below by using specific embodiments.

1. And constructing an open vocabulary scene graph generation model.

As described above, the open vocabulary scene graph generation model in the embodiment of the present invention mainly includes: the feature generation module, the text characterization generation module and the scene graph construction module are shown in fig. 2.

Preferably, in order for the physical visual features and the relational visual features generated by the feature generation module to include strong zero sample recognition capability and a great deal of knowledge of the pre-trained visual language model, the subject model (i.e., the feature generation module) of the scene graph generation is distilled from the visual language model. Thus, a visual language model is also set up during the training phase.

2. And (5) model training.

1. Visual features are generated.

This portion is implemented by the feature generation module, which first extracts advanced and multi-scale characterizations using a convolutional network (e.g., resNet50 network) for an original image. The extracted tokens are then serialized and combined with sinusoidal position coding to have position information. The serialized representation with position coding is processed by a transducer encoder (e.g., a transducer encoder consisting of a 4-layer encoder) to yield a self-attention mechanism optimized image representation. The characterization is then passed through cross-attention, where the entity query and the relationship query are processed in a transducer decoder (e.g., a transducer decoder consisting of a 4-layer decoder) to obtain the entity visual features and the relationship visual features of the image.

Specifically, the image representation output by the transducer encoder is input as keys (K) and values (V), entity queries, and relationship queries as queries (Q) to the transducer decoder. The transform decoder is a deformable attention module, finds corresponding entity reference points and corresponding relation reference points for entity query and relation query respectively, and samples surrounding positions of the reference points to enable the query to focus on different context information, so that entity visual features and relation visual features containing context information of different areas are output.

In order to combine the physical visual features with the relational visual features of the corresponding positions in the relational category prediction, the number of physical queries and relational queries should be kept consistent.

By way of example, the present section may generate 200 physical visual features and 200 relational visual features, each of which absorbs contextual information at a different location.

As described earlier, the training phase is provided with a visual language model (pre-trained visual encoder), as shown in fig. 2, which is part of the visual embedding generation module, mainly responsible for outputting the visual embedding, and the visual embedding generation module is applied only to the training phase, to be removed in the testing phase. The training phase images need to be correspondingly preprocessed, namely: for an image, it is first cropped into a plurality of physical images according to the annotated physical bounding box. Since the relational bounding boxes are not annotated in the dataset, each relational entity pair is selected and the union of their bounding boxes is taken as the relational bounding box and cropped. These cropped entity images and triplet images will serve as inputs to the visual language model.

Visual embedding of the output of the visual language model may guide the entity visual features and the relationship visual features to obtain massive knowledge. Specifically, the entity image and the triplet image which are obtained by preprocessing the data and cut are transmitted into a visual encoder to obtain entity visual embedding and triplet visual embedding. The physical visual features from the transducer decoder then pass through the multi-layer perceptronThe loss function is aligned with the physical visual embedding:

；

wherein MLP represents a multi-layer perceptron, N represents the number of entities, i represents the ith entity,representing the visual characteristics of the ith entity, +.>Indicating that the i-th entity is visually embedded, +.>Representing physical visual features and physical visual embedment +.>A loss function; />Is->Loss letterA sign of a number.

By a pair of physical featuresAnd a pair of relational features->The three-group visual characteristics are formed by passing through the multi-layer perceptron and then by +.>The loss function is aligned with the triplet visual embedding:

；

wherein ,representing a join operation, K represents the number of triples, K represents the kth triplet,/->Representing the kth triplet visual feature, +.>Representing a kth triplet visual embedding; />And->Representing two entities in the kth triplet,and->A mark symbol for the entity characteristic and the relation characteristic; />A method for representing a visual feature of a triplet with visual embedding of the triplet>A loss function.

The two parts areLoss function->And->Constitutes a distillation loss during training.

After MLP mapping, the physical visual features and the combined triplet features acquire knowledge of the pre-trained visual encoder, thereby improving the open vocabulary classification capability.

2. A text token is generated.

For previously generated entity visual features and relationship visual features, it is necessary to generate corresponding all-category text tokens so that they interact at a later stage, implemented by a text token generation module. Specifically, since each entity class has no obvious hierarchical association, a single hint is first prefixed to all entity classes and passed to a pre-trained text encoder to generate text representations for all entity classes. And constructing a hierarchical prompt of each relation category according to the self-adaptive hierarchical structure which can be dynamically changed, so that the relation categories at different positions have different hierarchical prompts as prefixes, and finally generating text characterization of the relation categories through a pre-trained text encoder.

(1) An entity text representation is generated.

A single hint may be learned to use the same hint for all entity categories. For a given entity classIts vector form is->Taking the single hint as a prefix to obtain an entity class representation vector of the single hint, wherein ,/>Vectors of L single cues (which are learnable cue vectors),>is->Vector of single hint->L is a set integer; generating corresponding entity text representation by means of a pre-trained text encoder>Expressed as:

；

wherein ,representing the text encoder.

In fig. 2, in the single hint dashed box, the triangle symbol represents the learnable hint vector and the rectangle symbol represents the entity class.

(2) A relational text representation is generated.

And for the prefix of the relation category, an adaptive hierarchical prompt is used, an adaptive hierarchical structure of the relation category is firstly constructed by the adaptive hierarchical prompt, the structure can be dynamically changed in a training stage, and then the hierarchical prompt of each relation category is generated according to the structure, so that the relation category at different positions has different hierarchical prompts. Finally, text encoder with all prompt categories trained in advance obtains text representation of entity category and relation category.

Specifically: unlike a learnable single hint, an adaptive hierarchical hint employs different hints for the relationship categories of different hierarchical locations. As shown in FIG. 2, a hierarchy of relationship categories is built in a bottom-up fashion and these layers are labeled bottom, middle and top, respectively. The bottom layer is composed of nodes corresponding to known relation categories, and other layers are composed of virtual nodes abstracted from the bottom layer nodes, so that all categories including unseen types can be adaptively found out to be suitable prompts through dynamically constructing the structure.

In the embodiment of the invention, in the training stage, the nodes in the bottom layer are the nodes corresponding to the known relation categories in the training data, and in the testing stage, the nodes in the bottom layer are all the relation categories in the training data, namely the known category and the unknown category. Virtual nodes refer to nodes that do not correspond to classes in the dataset, for example, both the front and back belong to the abstract relationship class of azimuth, which is a virtual node when building the hierarchy.

To enable the pre-trained text encoder to distinguish between the top level virtual nodes and the middle level virtual nodes, naming is done sequentially; by way of example, the number of top-level virtual nodes may be set to 4, the number of middle-level virtual nodes to 10, the middle-level virtual nodes to be regarded as abstract categories of the bottom-level nodes, and the top-level virtual nodes to be regarded as abstract categories of the middle-level virtual nodes. Nodes of both the bottom layer and the middle layer may dynamically select their parent nodes. The learner cues are input as prefixes of the relationship classes to the text encoder in the bottom layer to obtain a representation of the relationship text, thereby facilitating expansion of the relationship classes. To obtain middle tier text representations while taking into account memory limitations, middle tier nodes without prefixes are directly input into the text encoder. Cosine similarity between the relational text representation and the middle tier text representation is then calculated to facilitate the adaptive selection of parent nodes by the relational category. Although there is no need to extend middle tier nodes, a leachable hint is still added in front of them to ensure that their adaptive parent searches match the relationship categories.

After the hierarchical structure of relationship categories is established, a hierarchical hint is constructed for each relationship category. Specifically, for relationship categoriesCorresponding bottom layer node, which adaptively selects middle layer node +.>As a parent node, while an intermediate layer nodeAdaptively select top level node->Relationship category ∈as parent node>Hierarchical hint representation vector for corresponding underlying nodeThe following are provided:

；

wherein the symbol j refers to any one of the top layer t, the middle layer m and the bottom layer b,、/>、/>hint vectors at top layer t, middle layer m, bottom layer b, respectively, each of which contains a vector representation of the relevant layer node +.>(it is a vector representation of the correlation class of nodes of the correlation layer, belonging to a non-learnable vector),>vectors for L single cues of the relevant layer, < ->Is the%>Vector of single hint->L is a set integer; />Is top level node->Is a learner-driven prompt vector of->Is->The%>A learnable hint vector->For intermediate layer node->Is a learner-driven prompt vector of->Is->The%>The hint vector may be learned.

In fig. 2, in the dashed box of the adaptive level hinting, the triangle symbols represent the learnable hinting vectors, the rectangle symbols represent the relationship classes, and the circles in the middle and top layers represent the unknown relationship classes.

In the embodiment of the invention, the text encoders involved in the text characterization generation can be the same text encoder.

3. An open vocabulary scene graph is generated.

The method is realized by a scene graph construction module, and the visual characteristics and the relation visual characteristics are utilized to respectively interact with the entity text characterization and the relation text characterization so as to classify the visual characteristics. Specifically, the visual features of the entity pass through the boundary frame head to obtain the boundary frame of the entity, and then calculate the similarity with the entity text representation to determine the entity category. And simultaneously, fusing each pair of entity visual features with the relation visual features of the corresponding positions to form triplet visual features, namely, each triplet visual feature comprises the entity visual features and the relation visual features of the two entities, calculating the similarity by taking the triplet visual features as a whole and the relation text representation, and determining the relation category between the two entities. And finally, combining the entity bounding box, the entity category and the relation category into a complete scene graph.

Here in triad visual featuresFor example, it is classified as the relation category +.>Probability of->The method comprises the following steps:

；

wherein ,representing cosine similarity calculation,/->Representing a relational base class, < >>Is an arbitrary relationship category of a relationship base class, +.>Is a temperature parameter.

In the embodiment of the invention, the relationship base class mainly refers to the relationship class of the underlying node, and all relationship classes are divided into a relationship base class (for example, 70%) and a relationship new class (for example, 30%) in the training stage. In the training stage, the relationship class of the bottom node is a known class (namely a relationship base class), and in the testing stage, the relationship class of the bottom node is a relationship base class and a relationship new class.

In fig. 2, three outputs of the three inner left three parts of the open vocabulary scene graph generating module are sequentially: real worldA volume bounding box, an entity class, a relationship class. Meanwhile, the diagram of the upper area in the middle of the open vocabulary scene graph generating module comprises a plurality of rows of circular symbols, and four circular symbols in each row refer to a triplet visual characteristic as described aboveThe four circular symbols of each row may refer to a first physical visual feature, a first relational visual feature, a second relational visual feature, and a second physical visual feature in the triplet visual feature in sequence.

In the above process, the manner of determining the entity class (relationship class) according to the similarity may be: and calculating the similarity with all the entity text tokens (relation text tokens), and selecting the entity text token (relation text token) with the maximum similarity as the corresponding entity category (relation category).

In the training of the model, the cross entropy loss, distillation loss, gIoU loss function (target localization loss function) are used to construct the total loss function, and a random gradient descent algorithm is used to update the model parameters. Specifically:

(1) The physical visual embedding and the triplet visual embedding are utilized, and the physical visual characteristics and the triplet visual characteristics of the corresponding training images output by the characteristic generating module passThe loss function calculates the distillation loss.

(2) And calculating a cross entropy loss function by using the interaction result of the entity visual features and the entity text representation and the interaction result of the triplet visual features and the relation text representation, which are obtained by the scene graph construction module.

(3) And calculating a target positioning loss function by using the open vocabulary scene graph generated by the scene graph construction module.

(4) And training the open vocabulary scene graph generation model by combining the distillation loss function, the cross entropy loss function and the target positioning loss function.

Illustratively, the as-trained optimizer may be a momentum optimizer (momentum optimizer) back-propagation optimization parameter. Each batch was 7 in size, the initial learning rate was set to 0.0001, and dropped to 0.00001 on round 15.

3. And (5) model testing.

Similar to the model training process, the data preprocessing and the processing links of the visual language model are eliminated, and the specific process can be seen in the steps 1-3, namely, the physical visual features and the relational visual features containing the context information of different areas are generated through the trained feature generation module, the physical text representation and the relational text representation are generated through the trained text representation generation module, and the open word list scene graph is generated through the trained scene graph construction module and the results generated by the two modules.

Note that the contents of various entity names (nose, person, etc.), relationship names (back, wearing, etc.), and open vocabulary scene graphs provided in the lower right of fig. 2 are only examples, and are not limiting.

From the description of the above embodiments, it will be apparent to those skilled in the art that the above embodiments may be implemented in software, or may be implemented by means of software plus a necessary general hardware platform. With such understanding, the technical solutions of the foregoing embodiments may be embodied in a software product, where the software product may be stored in a nonvolatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.), and include several instructions for causing a computer device (may be a personal computer, a server, or a network device, etc.) to perform the methods of the embodiments of the present invention.

Example two

The invention also provides an open vocabulary scene graph generating system which is mainly used for realizing the method provided by the previous embodiment, and the system mainly comprises the following steps: generating a model by using the open vocabulary scene graph; referring to fig. 2, the open vocabulary scene graph generating model includes:

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the system is divided into different functional modules to perform all or part of the functions described above.

Example III

The present invention also provides a processing apparatus, as shown in fig. 3, which mainly includes: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods provided by the foregoing embodiments.

Further, the processing device further comprises at least one input device and at least one output device; in the processing device, the processor, the memory, the input device and the output device are connected through buses.

In the embodiment of the invention, the specific types of the memory, the input device and the output device are not limited; for example:

the input device can be a touch screen, an image acquisition device, a physical key or a mouse and the like;

the output device may be a display terminal;

the memory may be random access memory (Random Access Memory, RAM) or non-volatile memory (non-volatile memory), such as disk memory.

Example IV

The invention also provides a readable storage medium storing a computer program which, when executed by a processor, implements the method provided by the foregoing embodiments.

The readable storage medium according to the embodiment of the present invention may be provided as a computer readable storage medium in the aforementioned processing apparatus, for example, as a memory in the processing apparatus. The readable storage medium may be any of various media capable of storing a program code, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, and an optical disk.

The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims

1. The open vocabulary scene graph generation method is characterized by comprising the following steps:

2. The method of claim 1, wherein the extracting the representation of the input image by using the visual feature extraction method, and then applying the extracted representation to the entity query and the relationship query to generate the entity visual feature and the relationship visual feature including the context information of the different regions comprises:

extracting the representation of the input image through a convolution network, serializing the extracted representation, adding position codes, and processing through a transducer encoder to obtain the image representation after self-attention mechanism optimization; wherein, the Transformer represents a Transformer;

and inputting the image representation optimized by the self-attention mechanism to a transducer decoder, respectively searching for an entity reference point and a relationship reference point by combining the entity query and the relationship query through the deformable attention mechanism, and then sampling around the searched entity reference point and relationship reference point to generate entity visual features and relationship visual features containing context information of different areas.

3. The method for generating an open vocabulary scene graph according to claim 1, wherein generating entity text representations corresponding to each entity category based on a single hint comprises: the single prompt is used as the prefix of each entity category and is transmitted into a text encoder to generate corresponding entity text representation;

；

wherein ,representing the text encoder.

4. The method for generating an open vocabulary scene graph according to claim 1, wherein generating a relational text representation corresponding to each relational category based on the adaptive level hints comprises:

constructing a hierarchical structure of relation categories in a bottom-up mode, wherein the hierarchical structure is marked as a bottom layer, a middle layer and a top layer respectively; the method comprises the steps that a bottom layer is composed of nodes corresponding to known relation categories, virtual nodes of a plurality of middle layers are obtained through abstracting the relation categories corresponding to the bottom layer nodes, and virtual nodes of a top layer are obtained through abstracting the relation categories corresponding to the virtual nodes of the middle layers;

nodes of the bottom layer and the middle layer self-adaptively select father nodes of the nodes to form a hierarchical prompt representation vector corresponding to each bottom layer node, and corresponding relation text representation is generated through a text encoder.

5. The method of claim 4, wherein for a relationship categoryCorresponding bottom layer node, which adaptively selects middle layer node +.>As parent node, while middle layer node->Adaptively select top level node->Relationship category ∈as parent node>Hierarchical hint representation vector of corresponding bottom level node +.>The following are provided:

；

wherein the symbol j refers to any one of the top layer t, the middle layer m and the bottom layer b,、/>、/>hint vectors at top layer t, middle layer m, bottom layer b, respectively, each of which contains a vector representation of the relevant layer node +.>，/>Vectors for L single cues of the relevant layer, < ->Is the%>Vector of single hint->L is a set integer; />Is top level node->Is a learner-driven prompt vector of->Is->The%>The number of the prompt vectors that can be learned,for intermediate layer node->Is a learner-driven prompt vector of->Is->The%>A learnable hint vector;

6. The method of claim 1, wherein generating the open vocabulary scene graph based on the physical visual features, the interactive results of the physical visual features and the physical text representations, and the interactive results of the triplet visual features and the relational text representations comprises:

processing the visual features of the entity through a boundary frame head to obtain a boundary frame of the entity;

the interaction result of the entity visual characteristics and the entity text characterization refers to entity category determined according to the similarity between the entity visual characteristics and each entity text characterization;

the interactive result of the triple visual feature and the relation text representation refers to the relation category between two entities in the triple visual feature determined according to the similarity of the triple visual feature and each relation text representation;

the open vocabulary scene graph is combined using bounding boxes of all entities, all entity categories, and relationship categories between all two entities.

7. The method of generating an open vocabulary scene graph according to claim 1, further comprising: executing step 1 by using a feature generation module, executing step 2 by using a text characterization generation module, executing step 3 by using a scene graph construction module, forming an open vocabulary scene graph generation model by using the feature generation module, the text characterization generation module and the scene graph construction module, and training the open vocabulary scene graph generation model in advance;

during training, a visual language model is introduced, a training image is cut into a plurality of entity images according to the marked entity boundary frames, and the union of the entity boundary frames of each related entity pair is utilized as the boundary frame of the corresponding relationship to cut, so that a triplet image is obtained; taking the entity image and the triplet image as the input of a visual language model to obtain entity visual embedding and triplet visual embedding;

the physical visual embedding and the triplet visual embedding are utilized, and the physical visual characteristics and the triplet visual characteristics of the corresponding training images output by the characteristic generating module passCalculating distillation loss according to the loss function;

calculating a cross entropy loss function by utilizing the interaction result of the entity visual features and the entity text representation and the interaction result of the triplet visual features and the relation text representation, which are obtained by the scene graph construction module;

calculating a target positioning loss function by using the open vocabulary scene graph generated by the scene graph construction module;

and training the open vocabulary scene graph generation model by combining the distillation loss function, the cross entropy loss function and the target positioning loss function.

8. An open vocabulary scene graph generation system, comprising: generating a model by using the open vocabulary scene graph; the open vocabulary scene graph generation model comprises:

9. A processing apparatus, comprising: one or more processors; a memory for storing one or more programs;

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-7.

10. A readable storage medium storing a computer program, which when executed by a processor implements the method of any one of claims 1-7.