CN114708472B - AI (Artificial intelligence) training-oriented multi-modal data set labeling method and device and electronic equipment
AI (Artificial intelligence) training-oriented multi-modal data set labeling method and device and electronic equipment
- Publication number
- CN114708472B CN202210629969.XA
- Authority
- CN
- China
- Prior art keywords
- image
- scene graph
- relation
- relationship
- triple
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2155—Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/251—Fusion techniques of input or preprocessed data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/40—Software arrangements specially adapted for pattern recognition, e.g. user interfaces or toolboxes therefor
- G06F18/41—Interactive pattern learning with a human teacher
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a multi-modal dataset labeling method and device for AI (artificial intelligence) training, and electronic equipment, belonging to the field of computer vision. In the method, a scene graph generation algorithm based on deep learning and graph alignment fusion is adopted: a first scene graph is generated from the weak supervision information carried by the image description, this scene graph is then aligned and fused with a second scene graph generated from the image itself, and the resulting candidate initial scene graph serves as a reference for manual annotation, so that wrong and missed annotations are avoided. The method provides intelligent labeling prompts for manual labeling of multi-modal datasets, so that only the candidate scene graph needs to be refined during manual labeling; the labeling scale and difficulty are greatly reduced, and the labeling efficiency of multi-modal data is effectively improved.
Description
Technical Field
The invention belongs to the field of computer vision, and in particular relates to an AI (artificial intelligence) training-oriented multi-modal dataset labeling method and device, and to electronic equipment.
Background
AI practical training is widely applied in online education, for example in artificial intelligence courses and task-specific training. AI practical training requires tutorials and data tailored to the needs of the user, but as AI technology develops and task complexity grows, the requirements on the quality and quantity of multi-modal data keep rising. The models used in AI practical training rely on high-quality labeled data for training, whereas traditional multi-modal dataset construction relies on manual labeling, whose efficiency and quality are both deficient.
In addition, in the prior art, the invention patent with application number CN202010131160.5 provides a multi-modal data annotation method, system and related device. In that solution, the labeling process for an image segmentation dataset is divided into a detection stage and a segmentation stage: ImageNet image data are first detected and localized with a detection model, and small-range masks are then labeled with an image segmentation method to obtain the annotation information, thereby completing batch automatic labeling of the dataset. However, that scheme depends on the performance of the detection model and the segmentation algorithm, and the completeness and reliability of the annotation data cannot be guaranteed.
Therefore, a set of efficient and reliable labeling tools is needed to greatly improve the labeling quality and efficiency, so as to improve the AI training efficiency and the model performance.
Disclosure of Invention
The invention aims to solve the problem of low manual labeling efficiency in constructing multi-modal datasets, and provides a multi-modal dataset labeling method and device, and electronic equipment, for AI (artificial intelligence) practical training. The invention can efficiently produce high-quality structured multi-modal data from raw unstructured data, which is used to improve the performance of algorithm models in AI practical training.
The invention adopts the following specific technical scheme:
in a first aspect, the invention provides an AI-training-oriented multi-modal dataset labeling method, which includes:
s1, obtaining a sample to be annotated, wherein the sample to be annotated comprises an original image and a corresponding image description;
s2, aiming at the original image, obtaining a plurality of targets with category and frame information through target detection, and sampling all obtained target pairs to form a target pair set consisting of target pairs, wherein the target pairs comprise a target as a subject and a target as an object; extracting semantic information of two targets in each target pair and surrounding semantic information to form context features of the target pairs, taking respective visual features and category labels of the two targets in each target pair and the context features of the target pairs as input of a trained depth self-attention network, predicting the relationship of the two targets in the target pairs to obtain a first relationship triple set consisting of relationship triples existing in the original image, and converting the relationship triples in the first relationship triple set into a graph structure so as to obtain an image-based scene graph;
s3, aiming at the image description, identifying a first entity set from the image description text through an entity extraction rule, then screening entities in the first entity set by using a dictionary, and forming a second entity set by the reserved entities; identifying and obtaining the relation existing between the entities in the second entity set from the image description text by using a relation extraction rule to obtain a second relation triple set consisting of the relation triples existing in the image description; filtering the relation triples in the second relation triple set according to the relation filtering rules among the entities, wherein the reserved relation triples form a third relation triple set; converting the relation triples in the third relation triple set into a graph structure, thereby obtaining a scene graph based on image description;
s4, aligning and fusing the scene graph based on the image and the scene graph based on the image description at the graph level to obtain a fused scene graph;
and S5, sending the fusion scene graph as initial marking information to a manual proofreading terminal, generating a final marking result according to a proofreading result returned by the manual proofreading terminal, associating the final marking result with the to-be-marked sample, and adding the to-be-marked sample into a multi-modal data set.
As a preferred aspect of the first aspect, the target detection method includes: inputting an original image into a regional recommendation network to obtain a candidate frame and an image feature map of a target in the image, screening the candidate frame through non-maximum suppression, extracting pooling features of a region corresponding to each candidate frame from the image feature map according to the reserved candidate frame, and using the pooling features as feature vectors of the corresponding candidate frames; and respectively inputting the feature vector of each candidate frame into a classification network and a position regression network to obtain the category and the position of each candidate frame, thereby obtaining a plurality of targets with category and frame information.
As a preference of the first aspect described above, the deep self-attention network is composed of a plurality of stacked blocks and a classification network; each block is formed by cascading a multi-head attention module, a multi-layer perceptron module and a layer normalization module, the input of the block and the output of the multi-layer perceptron are combined through a residual connection and then input into the layer normalization module, and the output of the layer normalization module is the output of the whole block; the output of each preceding block is used as the input of the next block, the input of the first block is provided with a learnable position code, and the output of the last block is used as the input of the classification network; the classification network only comprises a multi-layer perceptron module, the result of the multi-layer perceptron is converted into a probability distribution over the relation categories by using a softmax function, and the category with the maximum probability is taken as the prediction result of the relation between the two targets in a target pair.
As a preferred aspect of the first aspect, the deep self-attention network is trained in advance through semi-supervised learning, and the data set during training includes an original image data set and an enhanced image data set, the original image data set is composed of labeled original images, and the enhanced image data set is composed of unlabeled enhanced images obtained by performing data enhancement on all original images; the total loss function in training is a weighted sum of the cross-entropy loss of the deep self-attention network over the original image dataset and the KL divergence loss over the enhanced image dataset.
As a preference of the first aspect, after the second relationship triple set is obtained, it is determined, according to prior knowledge and context information and constraints of the target pair, whether a relationship triple that is ignored in the rule extraction process exists for each target pair in the target pair set, and if so, the relationship triple that is ignored in the rule extraction process is supplemented into the second relationship triple set, and then filtering is performed according to the relationship filtering rule.
Preferably, when the image-based scene graph and the image-description-based scene graph are subjected to graph-level alignment and fusion, each relationship triple of the image-based scene graph is traversed, whether two entities serving as a subject and an object in the relationship triple exist in the second entity set is judged, if yes, the relationship triple is added into the fusion scene graph, if not, the relationship triple is not added into the fusion scene graph, and a final fusion scene graph is obtained after the traversal is finished.
As a further preferred aspect of the first aspect, when the image-based scene graph and the image-description-based scene graph are aligned and fused in a graph hierarchy, each relationship triple of the image-based scene graph is traversed, whether the relationship triple exists in the third relationship triple set is determined, if yes, the relationship triple is added into the fused scene graph, if not, the relationship triple is not added into the fused scene graph, and a final fused scene graph is obtained after the traversal is completed.
As a preferred aspect of the first aspect, the manual proofreading terminal displays the current sample to be labeled and the initial labeling information through a UI interface, and provides a functional component for modifying the initial labeling information on the UI interface; and if the initial labeling information is modified at the manual proofreading end, taking the modified labeling information as the final labeling result, otherwise, taking the initial labeling information as the final labeling result.
In a second aspect, the invention provides an AI-training-oriented multi-modal dataset labeling device, which includes:
a sample acquisition module, used for acquiring a sample to be labeled, wherein the sample to be labeled comprises an original image and a corresponding image description;
the first scene graph generation module is used for obtaining a plurality of targets with category and frame information through target detection aiming at the original image, and matching and sampling all the obtained targets to form a target pair set consisting of target pairs, wherein the target pairs comprise a target serving as a subject and a target serving as an object; extracting semantic information of two targets in each target pair and surrounding semantic information to form context features of the target pairs, taking respective visual features and category labels of the two targets in each target pair and the context features of the target pairs as input of a trained depth self-attention network, predicting the relationship of the two targets in the target pairs to obtain a first relationship triple set consisting of relationship triples existing in the original image, and converting the relationship triples in the first relationship triple set into a graph structure so as to obtain an image-based scene graph;
the second scene graph generation module is used for identifying a first entity set from the image description text through an entity extraction rule aiming at the image description, then screening entities in the first entity set by utilizing a dictionary, and forming a second entity set by the reserved entities; identifying and obtaining the relation existing between the entities in the second entity set from the image description text by using a relation extraction rule to obtain a second relation triple set consisting of the relation triples existing in the image description; filtering the relation triples in the second relation triple set according to the relation filtering rules among the entities, wherein the reserved relation triples form a third relation triple set; converting the relation triples in the third relation triple set into a graph structure, thereby obtaining a scene graph based on image description;
the scene graph alignment and fusion module is used for aligning and fusing the scene graph based on the image and the scene graph based on the image description through graph hierarchy to obtain a fused scene graph;
and the manual proofreading module is used for sending the fusion scene graph as initial marking information to a manual proofreading terminal, generating a final marking result according to a proofreading result returned by the manual proofreading terminal, and adding the final marking result into the multi-modal data set after associating the final marking result with the to-be-marked sample.
In a third aspect, the invention provides a computer electronic device comprising a memory and a processor;
the memory for storing a computer program;
the processor is configured to, when executing the computer program, implement the AI-training-oriented multimodal dataset labeling method according to any of the above first aspects.
Compared with the prior art, the invention has the following beneficial effects:
according to the method, a scene graph generation algorithm based on deep learning technology and graph alignment fusion is adopted, the weak supervision information described by the images is used for generating the candidate scene graphs, the candidate scene graphs are further aligned and fused with the scene graphs generated based on the images to avoid wrong labeling and label missing, and finally the candidate initial scene graphs are generated to serve as references of manual labeling. The method can provide intelligent labeling prompt for manual labeling of the multi-modal data set, so that only a candidate scene graph needs to be optimized during manual labeling, the labeling scale and the labeling difficulty are greatly reduced, the labeling efficiency of the multi-modal data can be effectively improved, the labeling process is simple, and the external interference is reduced. The invention can efficiently output high-quality structured multi-mode data from the original unstructured data, and is used for improving the performance of the algorithm model in AI practical training.
Drawings
FIG. 1 is a schematic step diagram of a multi-modal dataset labeling method for AI training;
FIG. 2 is a schematic diagram illustrating a flow of a multi-modal dataset labeling method for AI training;
FIG. 3 is a schematic flow chart of semi-supervised learning;
FIG. 4 is a diagram of annotation results in one example;
fig. 5 is a schematic diagram of a module composition of the AI practical training oriented multi-modal dataset labeling apparatus.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. The technical characteristics in the embodiments of the present invention can be combined correspondingly without mutual conflict.
In the description of the present invention, it is to be understood that the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature.
A multi-modal dataset provides data support for steps such as data augmentation and network training, and is an important cornerstone of AI practical training. An indispensable part of a multi-modal dataset is the scene graph data. A scene graph is structured data consisting of a plurality of (subject, predicate, object) triples and expresses the high-level semantics of an image. Because the traditional labeling mode is time-consuming and inefficient, the present invention designs an intelligent labeling assistance mode: a candidate initial scene graph is first generated through a scene graph generation algorithm based on deep learning and graph alignment fusion and used as a reference for manual labeling, so that only the candidate scene graph needs to be refined during manual labeling; the labeling scale and difficulty are greatly reduced, and the labeling efficiency of multi-modal data is effectively improved.
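As an informal illustration only (not part of the claimed method), the following Python sketch shows one possible way to represent such subject-predicate-object triples and convert them into a node/edge graph structure; the entity and relation names used are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RelationTriple:
    subject: str    # e.g. "dog"
    predicate: str  # e.g. "sitting on"
    obj: str        # e.g. "lawn"

def triples_to_graph(triples):
    """Convert relationship triples into a graph structure of nodes (entities) and edges."""
    nodes, edges = set(), []
    for t in triples:
        nodes.update([t.subject, t.obj])
        edges.append((t.subject, t.predicate, t.obj))
    return {"nodes": sorted(nodes), "edges": edges}

scene = [RelationTriple("dog", "sitting on", "lawn"), RelationTriple("dog", "has", "eye")]
print(triples_to_graph(scene))
# {'nodes': ['dog', 'eye', 'lawn'], 'edges': [('dog', 'sitting on', 'lawn'), ('dog', 'has', 'eye')]}
```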
As shown in FIG. 1, in a preferred embodiment of the present invention, an AI-training-oriented multi-modal dataset labeling method is provided, which includes steps S1-S5. The principle of the whole labeling method flow is shown in fig. 2. The specific implementation forms of the steps S1-S5 are described in detail below.
And S1, obtaining a sample to be annotated, wherein the sample to be annotated comprises an original image and a corresponding image description.
It should be noted that the samples to be labeled may be input by the user one by one, or may be input in the form of an unlabeled multi-modal dataset from which samples are then extracted one by one to generate the labeling information. Therefore, the specific acquisition form of the sample to be labeled is not limited.
S2, aiming at the original image, obtaining a plurality of targets with category and frame information through target detection, and pairing and sampling all the obtained targets to form a target pair set consisting of target pairs, wherein each target pair comprises a target as the subject and a target as the object; extracting semantic information of the two targets in each target pair and of their surroundings to form the context features of the target pair, taking the respective visual features of the two targets in each target pair, their respective category labels and the context features of the target pair as the input of a trained deep self-attention network, predicting the relationship between the two targets in the target pair to obtain a first relationship triple set consisting of the relationship triples existing in the original image, and converting the relationship triples in the first relationship triple set into a graph structure, thereby obtaining the image-based scene graph.
It should be noted that, any existing technology may be adopted in the method for detecting an object with respect to an original image, so as to meet the requirement of detection accuracy.
As a preferred mode of the embodiment of the present invention, the adopted target detection method proceeds as follows: the original image is input into a Region Proposal Network (RPN) to obtain candidate frames for the targets in the image and an image feature map; the candidate frames are screened through non-maximum suppression, and for each retained candidate frame, pooled features of the corresponding region are extracted from the image feature map and used as the feature vector of that candidate frame; the feature vector of each candidate frame is then input into a classification network and a position regression network, respectively, to obtain the category and position of each candidate frame, thereby yielding a plurality of targets with category and frame information (i.e., the corresponding candidate frames). Specifically, each target carries the coordinates of the top-left and bottom-right vertices of its candidate frame and the category to which the target belongs. Obtaining the candidate frames and the image feature map from the RPN belongs to the prior art; in the present invention, ROI Align is preferably adopted to perform the pooling. The feature vector of the frame corresponding to each target is recorded as the visual feature of that target.
It should be noted that Non-Maximum Suppression (NMS) is used to screen duplicated, similar candidate frames, retaining the frame with the highest confidence and suppressing the remaining redundant similar frames. Suppressing the redundant frames is an iterate-traverse-eliminate process: in each iteration, the frame with the highest confidence among the remaining candidate frames is selected and kept, all candidate frames whose Intersection over Union (IoU) with it exceeds a set threshold (0.7) are removed, and the step is repeated until no candidate frames remain. The specific implementation can be found in the prior art and is not described in detail here (an illustrative sketch is given below). The candidate frames retained after non-maximum suppression serve as the final candidate frames participating in classification and position regression; the feature vectors they require are obtained by extracting pooled features from the corresponding regions of the image feature map according to the positions of the candidate frames. The feature vectors of the candidate frames are input into the trained classification network and position regression network to obtain the category and position refinement parameters of each candidate frame, thereby yielding a plurality of targets with category and frame information.
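For illustration only, a minimal Python/NumPy sketch of the iterate-traverse-eliminate NMS procedure described above, assuming frames are given as [x1, y1, x2, y2] arrays with confidence scores; the 0.7 IoU threshold follows the text, and the function names are hypothetical.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one frame and an array of frames, all given as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def non_maximum_suppression(boxes, scores, iou_threshold=0.7):
    """Keep the highest-confidence frame, drop frames whose IoU with it exceeds the threshold,
    and repeat until no candidate frames remain; returns the indices of the kept frames."""
    order = np.argsort(scores)[::-1]          # candidate frames sorted by confidence
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        if order.size == 1:
            break
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep
```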
It should be noted that, in the present invention, relationship prediction is carried out on selected target pairs. A target pair refers to two targets (i.e., a subject and an object); a target pair that has a relationship is a positive example, and one without a relationship is a negative example. The relationship of a target pair constitutes the predicate of a relationship triple in the scene graph. When the deep self-attention network predicts the relationship between the two targets in a target pair, the target pair must be combined with its context features, which can be obtained by extracting semantic information from the two target frames and their surroundings in the image. In a specific implementation, an enclosing (union) frame surrounding the two targets can be formed from the frames of the two targets in the pair; this union frame is then mapped to the corresponding position of the image feature map, and the pooled feature within the union frame is extracted as the context feature, which therefore contains semantic information of the target pair and its surroundings.
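The following sketch illustrates, under the assumption that a backbone feature map and torchvision's roi_align are available, how the union (enclosing) frame of a target pair could be pooled into a context feature; it is an informal example rather than the patented implementation, and the spatial scale and output size are arbitrary choices.

```python
import torch
from torchvision.ops import roi_align

def union_box(box_subj, box_obj):
    """Smallest frame enclosing both the subject frame and the object frame (x1, y1, x2, y2)."""
    return torch.stack([torch.minimum(box_subj[0], box_obj[0]),
                        torch.minimum(box_subj[1], box_obj[1]),
                        torch.maximum(box_subj[2], box_obj[2]),
                        torch.maximum(box_subj[3], box_obj[3])])

def context_feature(feature_map, box_subj, box_obj, spatial_scale=1.0 / 16, output_size=7):
    """Pool the feature-map region covered by the union frame of a target pair.

    feature_map: (1, C, H, W) backbone feature map for one image.
    box_subj, box_obj: (4,) float tensors in image coordinates.
    """
    ubox = union_box(box_subj, box_obj).unsqueeze(0)             # (1, 4) union frame
    pooled = roi_align(feature_map, [ubox], output_size=output_size,
                       spatial_scale=spatial_scale)               # (1, C, 7, 7)
    return pooled.flatten(1)                                      # (1, C * 49) context feature
```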
As a preferred mode of the embodiment of the present invention, the deep self-attention network is composed of a plurality of stacked blocks and a classification network. Each block is formed by cascading a multi-head attention module, a multi-layer perceptron module and a layer normalization module: the input of the block and the output of the multi-layer perceptron are combined through a residual connection and then fed into the layer normalization module, and the output of the layer normalization module is the output of the whole block. The output of each preceding block is used as the input of the next block, the input of the first block is provided with a learnable position code, and the output of the last block is used as the input of the classification network. The classification network comprises only a multi-layer perceptron module; its result is converted into a probability distribution over the relation categories using a softmax function, and the category with the maximum probability is taken as the predicted relation between the two targets in the target pair. In addition, as in conventional self-attention-based deep learning models, an additional learnable position code is added to the original input of the first block according to the position of each input, so that the network can perceive the order of the inputs.
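A simplified PyTorch sketch of the block structure just described (multi-head attention, multi-layer perceptron, residual connection, layer normalization, a learnable position code, and an MLP-plus-softmax classification head) is given below; the dimensions, depth, sequence length and number of relation categories are illustrative assumptions, not values specified by the invention.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One block: multi-head attention -> MLP, residual connection, then layer normalization."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        h, _ = self.attn(x, x, x)
        h = self.mlp(h)
        return self.norm(x + h)      # residual connection with the block input, then layer norm

class RelationPredictor(nn.Module):
    """Stacked blocks with a learnable position code and an MLP + softmax classification head."""
    def __init__(self, dim=256, depth=4, seq_len=5, num_relations=51):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, seq_len, dim))   # learnable position code
        self.blocks = nn.Sequential(*[Block(dim) for _ in range(depth)])
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_relations))

    def forward(self, tokens):
        # tokens: (B, seq_len, dim), e.g. subject/object labels, visual and context features
        x = self.blocks(tokens + self.pos)
        logits = self.head(x.mean(dim=1))
        return torch.softmax(logits, dim=-1)   # probability distribution over relation categories
```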
It should be noted that the deep self-attention network needs to be trained on a dataset in advance and is put into practical use only after it meets the performance requirements. During training, the predicate is masked, and the class label information of the subject and the object, the mask, the visual features of the subject, the visual features of the object, and the context features of the subject-object pair are input to obtain the prediction of the predicate. The training procedure of the deep self-attention network can follow the prior art and is not limited here. The trained deep self-attention network then encodes each sampled target pair together with its context features and predicts the relationship between the two targets (the two targets may also have no relationship). After predicting whether every target pair has a relationship, the target pairs that do have relationships are analyzed to construct relationship triples in subject-predicate-object form, and all relationship triples are converted into a graph structure of nodes and edges, yielding the image-based scene graph.
It should be noted that, in practical application, all targets obtained through target detection are exhaustively paired to obtain every possible pairing; all target pairs are combined into a target pair set, and the possible relationship of each pair is predicted by the deep self-attention network. During training, however, not all targets detected in the training images need to be paired exhaustively, which improves training efficiency. In that case, to keep the numbers of positive and negative samples balanced, a maximum number of positive samples (lower than the total number of positive samples) and a maximum number of negative samples (lower than the total number of negative samples) are set empirically for the target pairs and labels of each picture, and random sampling is performed according to these maxima in each training round to obtain a series of target pairs for training, thereby reducing the number of target pairs needed per round.
As a preferred mode of the embodiment of the present invention, the deep self-attention network may be trained in advance through semi-supervised learning, and the data set during training includes an original image data set and an enhanced image data set, where the original image data set is composed of labeled original images, and the enhanced image data set is composed of unlabeled enhanced images obtained by performing data enhancement on all original images. The total loss function in training is a weighted sum of the cross-entropy loss of the deep self-attention network over the original image dataset and the KL divergence loss over the enhanced image dataset. Specifically, as shown in fig. 3, an operation of performing scene graph generation on the enhanced image needs to be additionally set in the training process.
Specifically, after predicting whether all target pairs have a relationship, the image-based scene graph can be obtained by analyzing the target pairs having the relationship. Assume the original image dataset is $D=\{(x_i, y_i)\}_{i=1}^{N}$. A first partial relationship loss may be defined as the cross-entropy loss of the deep self-attention network over the original image dataset:

$$L_{1}=\frac{1}{N}\sum_{i=1}^{N}\mathrm{CE}\big(f(x_{i};\theta),\,y_{i}\big)$$

where $x_i$ is the input, $y_i$ is the corresponding label, $\mathrm{CE}(\cdot)$ is the cross-entropy loss, $\theta$ denotes the model parameters, and $N$ represents the number of original images in the dataset $D$.

Based on semi-supervised learning, each original image in the original image dataset is subjected to data enhancement to obtain an enhanced (augmented) image, denoted $\tilde{x}_{i}$. All enhanced images are then encoded and predicted by the same deep self-attention network to obtain their prediction results. The second partial relationship loss is obtained by minimizing the distance between the prediction on the original image and the prediction on the enhanced image, with the distance measured by the KL (Kullback-Leibler) divergence:

$$L_{2}=\frac{1}{N}\sum_{i=1}^{N}\mathrm{KL}\big(f(x_{i};\theta)\,\|\,f(\tilde{x}_{i};\theta)\big)$$

The total loss function employed for deep self-attention network training can thus be defined as follows:

$$L=\lambda_{1}L_{1}+\lambda_{2}L_{2}$$

where $\lambda_{1}$ and $\lambda_{2}$ are the loss weights, which can be adjusted empirically; in the present invention, $\lambda_{1}$ is preferably set to 1 and $\lambda_{2}$ to 0.5.
S3, aiming at the image description, identifying a first entity set from the image description text through an entity extraction rule, then screening entities in the first entity set by using a dictionary, and forming a second entity set by the reserved entities; identifying and obtaining the relation existing between the entities in the second entity set from the image description text by using a relation extraction rule to obtain a second relation triple set consisting of the relation triples existing in the image description; filtering the relation triples in the second relation triple set according to the relation filtering rules among the entities, wherein the reserved relation triples form a third relation triple set; and converting the relation triples in the third relation triple set into a graph structure, thereby obtaining the scene graph based on the image description.
It should be noted that rule-based entity and relationship extraction belongs to the prior art; both the entity extraction rules and the relationship extraction rules need to be designed for the specific multi-modal dataset so that entities and relationships can be extracted accurately from the text. The dictionary used to screen entities is a vocabulary library of entities that may appear in the multi-modal dataset: after the first entity set is identified, only entities that are similar in meaning to, or in a membership relation with, entities in the vocabulary library are retained, while the remaining entities are unlikely to appear in the dataset and are therefore filtered out, yielding the second entity set. When relation extraction is subsequently performed with the relation extraction rules, all entities considered come from the second entity set. Because the image description text is unstructured, rule-based entity and relationship extraction is not necessarily completely accurate. The relationship filtering rules therefore filter the relationship triples in the second relationship triple set, with the goal of removing relations that should not exist between two entities of known classes; the triple relations that cannot occur in the multi-modal dataset should be encoded in the relationship filtering rules according to the actual dataset, and when such relationship triples appear in the second relationship triple set they are deleted, finally forming the third relationship triple set. All subject-predicate-object relation triples in the third relation triple set are then converted into a graph structure of nodes and edges, yielding the scene graph based on the image description.
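The toy sketch below illustrates the text-side pipeline (rule-based entity extraction, dictionary screening, rule-based relation extraction, and relation filtering); the vocabularies, rules and caption are invented stand-ins for the dataset-specific ones the text refers to.

```python
# Toy vocabularies standing in for the dataset-specific dictionary and rules.
ENTITY_VOCAB = {"dog", "person", "bicycle", "lawn", "eye", "nose"}
RELATION_WORDS = {"sitting on", "riding", "has"}
FORBIDDEN_TRIPLES = {("lawn", "riding", "dog")}  # relations that cannot exist in the dataset

def extract_entities(text):
    """Rule-based extraction of the first entity set (here: every token is a candidate)."""
    return set(text.lower().replace(".", "").split())

def filter_entities(first_set, vocab=ENTITY_VOCAB):
    """Dictionary screening: keep only entities that may appear in the multi-modal dataset."""
    return {e for e in first_set if e in vocab}

def extract_relations(text, entities, relation_words=RELATION_WORDS):
    """Rule-based relation extraction: subject, relation phrase and object appearing in order."""
    text = text.lower()
    triples = set()
    for rel in relation_words:
        for subj in entities:
            for obj in entities:
                if subj == obj:
                    continue
                if f"{subj} {rel} {obj}" in text or f"{subj} {rel} the {obj}" in text:
                    triples.add((subj, rel, obj))
    return triples

def filter_relations(triples, forbidden=FORBIDDEN_TRIPLES):
    """Relation filtering rules: drop triples that cannot exist between the two classes."""
    return {t for t in triples if t not in forbidden}

caption = "A dog sitting on the lawn."
second_entity_set = filter_entities(extract_entities(caption))
third_triple_set = filter_relations(extract_relations(caption, second_entity_set))
print(third_triple_set)  # {('dog', 'sitting on', 'lawn')}
```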
And S4, aligning and fusing the image-based scene graph and the image-description-based scene graph at the graph level to obtain a fused scene graph.
It should be noted that there are many implementation forms for graph-level alignment and fusion; for example, the problem can be regarded as small-scale knowledge fusion (fusing two knowledge graphs) and solved by means of ontology alignment and entity matching, and in theory any feasible alignment-fusion method can be applied to the present invention.
As a preferred mode of the embodiment of the present invention, it is observed that image-based triple recognition can mine as many relationship triples as possible from the information in an image but is prone to misjudgment, whereas the scene graph based on the image description is extracted from the description text by rules and therefore has high accuracy, but a relationship cannot be extracted if it is not mentioned in the description text. Therefore, to organically combine the respective advantages of these two heterogeneous scene graphs from different sources, two methods for graph-level alignment and fusion between the image-based scene graph and the image-description-based scene graph are provided and described in detail below:
the first method of layer hierarchy alignment and fusion is based on the entity correspondence between relationship triples in two scene graphs for matching. The specific method comprises the following steps: and when the scene graph based on the image and the scene graph based on the image description are subjected to graph level alignment and fusion, traversing each relationship triple of the scene graph based on the image, judging whether two entities serving as a subject and an object in the relationship triple exist in the second entity set, if so, adding the two entities into the fusion scene graph, otherwise, not adding the two entities into the fusion scene graph, and obtaining the final fusion scene graph after the traversing is finished.
For example, if the image-based scene graph contains a person riding a bicycle but no bicycle appears at all in the entities extracted from the image description, this suggests that the prediction in the image scene graph is incorrect, and the riding-a-bicycle relationship triple should be deleted.
It should be noted that determining whether the two entities serving as subject and object exist in the second entity set is done through entity matching, and that, in addition to the entities themselves, their equivalence classes and equivalence subclasses should also be matched. An equivalence class is a category whose members can be treated as equivalent to the entity, and an equivalence subclass is a category whose concept is subordinate to that of the entity. Equivalence classes and subclasses can apply to relations (such as riding and sitting on) or to target classes (such as people and bicycles). For example, if the entity person occurs in a relationship triple of the image-based scene graph, then finding person in the second entity set counts as the entity itself being present; people is an equivalence class of person and man is an equivalence subclass of person, so finding man or people in the second entity set should likewise be regarded as the entity being present.
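A minimal sketch of the first fusion method, under the assumption that an equivalence table maps each entity to the terms that count as matching it; the table entries and example triples are hypothetical.

```python
# Hypothetical equivalence table: each entity maps to the terms treated as matching it
# (its own name, an equivalence class, and equivalence subclasses).
EQUIVALENTS = {
    "person": {"person", "people", "man", "woman"},
    "bicycle": {"bicycle", "bike"},
    "dog": {"dog", "puppy"},
}

def entity_in_set(entity, entity_set, equivalents=EQUIVALENTS):
    """True if the entity, or any equivalent / subclass term, appears in the entity set."""
    return bool(equivalents.get(entity, {entity}) & entity_set)

def fuse_by_entities(image_triples, second_entity_set):
    """Method one: keep an image-based triple only when both its subject and object are
    grounded in the entities extracted from the image description."""
    return [(s, r, o) for (s, r, o) in image_triples
            if entity_in_set(s, second_entity_set) and entity_in_set(o, second_entity_set)]

image_triples = [("person", "riding", "bicycle"), ("person", "wearing", "hat")]
second_entity_set = {"man", "bicycle"}
print(fuse_by_entities(image_triples, second_entity_set))
# [('person', 'riding', 'bicycle')] -- the hat triple is dropped because "hat" is not grounded
```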
The second method for graph-level alignment and fusion matches on the correspondence between the relationship triples themselves in the two scene graphs. Specifically, when the image-based scene graph and the image-description-based scene graph are aligned and fused at the graph level, each relationship triple of the image-based scene graph is traversed, and it is judged whether that triple exists in the third relationship triple set; if so, it is added to the fused scene graph, otherwise it is not, and the final fused scene graph is obtained when the traversal finishes.
Similarly, determining whether a relationship triple exists in the third relationship triple set is done through triple matching, and when matching, the relation and the entities should be compared not only with themselves but also with their respective equivalence classes and equivalence subclasses.
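Similarly, a minimal sketch of the second fusion method, in which an image-based triple is kept only if it matches a triple in the third relationship triple set; the match predicate is left pluggable so that equivalence classes and subclasses could be taken into account, and the example triples are hypothetical.

```python
def fuse_by_triples(image_triples, third_triple_set, match=lambda a, b: a == b):
    """Method two: keep an image-based triple only when it (or an equivalent form, via a
    custom match predicate) also appears in the third relationship-triple set."""
    return [t for t in image_triples if any(match(t, ref) for ref in third_triple_set)]

image_triples = [("dog", "sitting on", "lawn"), ("dog", "chasing", "ball")]
third_triple_set = {("dog", "sitting on", "lawn")}
print(fuse_by_triples(image_triples, third_triple_set))  # [('dog', 'sitting on', 'lawn')]
```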
Further, the second method of graph-level alignment and fusion imposes a stricter matching rule than the first. Considering that an image description usually describes the visually salient region and may not mention visually non-salient regions, relying only on the relation triples extracted from the description text would ignore many possible relations. Therefore, as a further preferred mode of the embodiment of the present invention, when the second method of graph-level alignment and fusion is adopted, the second relationship triple set may be supplemented in advance. Specifically, after the second relation triple set is obtained, it is first judged, according to prior knowledge and the context information and constraints of each target pair in the target pair set obtained by target detection and pairing-sampling on the original image, whether a relation triple exists that was missed by the rule-based extraction; if so, it is supplemented into the second relation triple set, and filtering by the relation filtering rules is then performed. The specific form of the prior knowledge and of the context information and constraints of the target pairs can be designed according to the actual dataset. One feasible method is to build a prior knowledge base in advance from expert experience, storing all entity pairs for which, with high probability, a specified relationship holds between the two entities whenever both appear in an image; the context information and constraints of a target pair mainly consider the spatial position relationship of the pair in the image. During the judgment, if a target pair from the target pair set appears in the prior knowledge base and the spatial positions of the two targets in the image satisfy the condition required by the corresponding specified relationship, the target pair and that relationship are regarded as a relation triple missed by the rule-based extraction and are supplemented into the second relation triple set, which facilitates graph-level alignment and fusion and prevents omissions. An illustrative sketch of such a supplementation step is given below.
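As referenced above, the following sketch illustrates one hypothetical form such a prior-knowledge-based supplementation step might take: a knowledge base maps a (subject class, object class) pair to a specified relation plus a spatial constraint, and a detected pair satisfying both is added to the second relationship triple set; the knowledge base entry and spatial check are invented examples.

```python
def subject_above_object(box_subj, box_obj):
    """Toy spatial constraint: the subject frame lies (roughly) above the object frame."""
    return box_subj[3] <= box_obj[1] + 0.5 * (box_obj[3] - box_obj[1])

# Hypothetical prior knowledge base: (subject class, object class) -> (relation, spatial check).
PRIOR_KB = {("cup", "table"): ("on", subject_above_object)}

def supplement_triples(target_pairs, second_triple_set, prior_kb=PRIOR_KB):
    """Add triples missed by the text rules when a detected pair matches the prior knowledge
    base and its spatial layout satisfies the associated constraint."""
    supplemented = set(second_triple_set)
    for (subj, box_s), (obj, box_o) in target_pairs:
        entry = prior_kb.get((subj, obj))
        if entry is not None:
            relation, spatial_ok = entry
            if spatial_ok(box_s, box_o):
                supplemented.add((subj, relation, obj))
    return supplemented

# Example: a cup detected at the top of a table region triggers the (cup, on, table) triple.
pairs = [(("cup", [120, 80, 160, 120]), ("table", [60, 110, 300, 260]))]
print(supplement_triples(pairs, set()))  # {('cup', 'on', 'table')}
```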
And S5, sending the fusion scene graph as initial marking information to a manual proofreading terminal, generating a final marking result according to a proofreading result returned by the manual proofreading terminal, associating the final marking result with the to-be-marked sample, and adding the to-be-marked sample into a multi-modal data set.
It should be noted that the specific form of the manual verification terminal is not limited, as long as the functions of auditing the annotation information and modifying the annotation information when the annotation information is wrong can be provided for the verification personnel.
As a preferred mode of the embodiment of the present invention, the manual proofreading end displays the current sample to be labeled and the initial labeling information through the UI interface, and provides a functional component for modifying the initial labeling information on the UI interface; and if the initial labeling information is modified at the manual proofreading end, taking the modified labeling information as the final labeling result, otherwise, taking the initial labeling information as the final labeling result.
In a preferred embodiment, the functional components may be arranged on the UI interface as labeled buttons. Preferably, besides the area displaying the current sample to be labeled and the initial labeling information, the UI interface may also provide a button for loading the multi-modal dataset samples to be labeled, as well as buttons for confirming or modifying the initial labeling information. During manual review, the original image and the image description of the current sample are both displayed on the UI, together with the graph structure of all triple relations in the fused scene graph. The reviewer only needs to judge whether the initial annotation information is correct: if there is a deviation or error, the initial annotation is adjusted through the modification button and then confirmed through the confirmation button; if not, it is confirmed directly through the confirmation button. After confirmation, any modifications made at the manual review end are returned as the proofreading result, from which the final labeling result is generated: if the initial labeling information was modified, the modified labeling information is taken as the final labeling result; otherwise the initial labeling information is used directly.
In addition, when a series of samples are loaded at one time, a button for switching the next sample can be further arranged on the UI, and after the proofreading of one sample is finished, the next sample can be switched to through the switching button, and the examination and verification can be continued.
Further, the specific carrier form of the manual proofreading terminal is not limited, and may be a local server, a cloud platform, or may be carried in the mobile terminal, which is not limited as long as the implementation requirement on performance can be met. The steps S1-S5 can be realized by writing a software program on a corresponding running platform, and the running platform for executing the steps S1-S5 can be the same as or different from a platform where the manual proofreading terminal is located.
Therefore, through steps S1-S5, when labeling the multi-modal dataset, the annotator only needs to correct the fused scene graph, deleting wrong entities and relations and supplementing missing entities and relations, to obtain the final labeling result. Fig. 4 shows an example of a multi-modal dataset annotation sample: the original image shows a dog sitting on a lawn, and the fused scene graph obtained after the above steps S1 to S4 contains three relationship triples, including a dog-eye relation and a dog-nose relation. After the fused scene graph is sent as initial labeling information to the UI interface of the manual checking end, the initial labeling information is checked against the original image; if, according to the labeling rules, a triple describing the dog-ear relation has been omitted, it needs to be supplemented into the final scene graph.
Actual experimental results show that, taking the Visual Genome dataset as an example, each image contains on average 35 entities and 21 relationship triples. The AI-practical-training-oriented multi-modal dataset labeling method can effectively reduce the number of entities and relation triples that need to be labeled manually; it is estimated that the labeling time per image can be reduced by about 50%.
Based on the same inventive concept, another preferred embodiment of the present invention further provides an AI practical training oriented multi-modal data set tagging apparatus corresponding to the AI practical training oriented multi-modal data set tagging method provided in the foregoing embodiment. As shown in fig. 5, the multi-modal dataset labeling device for AI training includes five basic modules, which are:
a sample acquisition module, used for acquiring a sample to be annotated, wherein the sample to be annotated comprises an original image and a corresponding image description;
the first scene graph generation module is used for obtaining a plurality of targets with category and frame information through target detection aiming at the original image, and matching and sampling all the obtained targets to form a target pair set consisting of target pairs, wherein the target pairs comprise a target serving as a subject and a target serving as an object; extracting semantic information of two targets in each target pair and surrounding semantic information to form context features of the target pairs, taking respective visual features of the two targets in each target pair, respective category labels of the two targets and the context features of the target pairs as input of a trained deep self-attention network, predicting the relationship of the two targets in the target pairs to obtain a first relationship triple set consisting of relationship triples existing in the original image, and converting the relationship triples in the first relationship triple set into a graph structure to obtain a scene graph based on the image;
the second scene graph generation module is used for identifying a first entity set from the image description text through an entity extraction rule aiming at the image description, then screening entities in the first entity set by utilizing a dictionary, and forming a second entity set by the reserved entities; identifying and obtaining the relation existing between the entities in the second entity set from the image description text by using a relation extraction rule to obtain a second relation triple set consisting of the relation triples existing in the image description; filtering the relation triples in the second relation triple set according to the relation filtering rules among the entities, wherein the reserved relation triples form a third relation triple set; converting the relation triples in the third relation triple set into a graph structure, thereby obtaining a scene graph based on image description;
the scene graph alignment and fusion module is used for aligning and fusing the scene graph based on the image and the scene graph based on the image description through graph hierarchy to obtain a fused scene graph;
and the manual proofreading module is used for sending the fusion scene graph as initial marking information to a manual proofreading terminal, generating a final marking result according to a proofreading result returned by the manual proofreading terminal, and adding the final marking result into the multi-modal data set after associating the final marking result with the to-be-marked sample.
Because the problem-solving principle of the AI-training-oriented multi-modal dataset labeling device is similar to that of the AI-training-oriented multi-modal dataset labeling method of the above embodiment of the present invention, for the specific implementation forms of the modules of the device in this embodiment, reference may be made to the specific implementation forms of the above method, and the repeated parts are not described again.
Similarly, based on the same inventive concept, another preferred embodiment of the present invention further provides a computer electronic device corresponding to the AI-training-oriented multi-modal dataset labeling method provided in the foregoing embodiment, which includes a memory and a processor;
the memory for storing a computer program;
the processor is configured to implement the AI-training-oriented multi-modal dataset labeling method when executing the computer program.
In addition, the logic instructions in the memory may be implemented in the form of software functional units and may be stored in a computer readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention.
Therefore, based on the same inventive concept, another preferred embodiment of the present invention further provides a computer-readable storage medium corresponding to the AI practical training oriented multi-modal dataset labeling method provided in the foregoing embodiment, where the storage medium stores a computer program, and when the computer program is executed by a processor, the AI practical training oriented multi-modal dataset labeling method can be implemented.
Specifically, in the memories or computer-readable storage media of the above two embodiments, the stored computer program is executed by the processor and can execute the following steps of S1-S5:
s1, obtaining a sample to be annotated, wherein the sample to be annotated comprises an original image and a corresponding image description;
s2, aiming at the original image, obtaining a plurality of targets with category and frame information through target detection, and pairing and sampling all the obtained targets to form a target pair set consisting of target pairs, wherein the target pairs comprise a target serving as a subject and a target serving as an object; extracting semantic information of two targets in each target pair and surrounding semantic information to form context features of the target pairs, taking respective visual features and category labels of the two targets in each target pair and the context features of the target pairs as input of a trained depth self-attention network, predicting the relationship of the two targets in the target pairs to obtain a first relationship triple set consisting of relationship triples existing in the original image, and converting the relationship triples in the first relationship triple set into a graph structure so as to obtain an image-based scene graph;
s3, aiming at the image description, identifying a first entity set from the image description text through an entity extraction rule, then screening entities in the first entity set by using a dictionary, and forming a second entity set by the reserved entities; identifying and obtaining the relation existing between the entities in the second entity set from the image description text by using a relation extraction rule to obtain a second relation triple set consisting of the relation triples existing in the image description; filtering the relation triples in the second relation triple set according to the relation filtering rules among the entities, wherein the reserved relation triples form a third relation triple set; converting the relation triples in the third relation triple set into a graph structure, thereby obtaining a scene graph based on image description;
S4, aligning and fusing the image-based scene graph and the image-description-based scene graph at the graph level to obtain a fused scene graph;
and S5, sending the fused scene graph as initial labeling information to a manual proofreading terminal, generating a final labeling result according to the proofreading result returned by the manual proofreading terminal, associating the final labeling result with the sample to be annotated, and adding the sample to be annotated to a multi-modal dataset.
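To make step S3 concrete, the following minimal sketch shows one possible realization. It assumes spaCy dependency parsing stands in for the entity and relationship extraction rules, a plain Python set serves as the dictionary, and networkx provides the graph structure; the function name, the simplistic filtering rule, and the example call are illustrative assumptions rather than the patented implementation.

```python
# Illustrative sketch only: one possible realization of step S3, assuming
# spaCy dependency parsing as the "entity extraction rule" and the
# "relationship extraction rule", and networkx for the graph structure.
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")

def extract_text_scene_graph(caption, entity_dictionary):
    doc = nlp(caption)

    # First entity set: noun-chunk heads found by the extraction rule.
    first_entities = {chunk.root.lemma_ for chunk in doc.noun_chunks}
    # Second entity set: keep only entities present in the dictionary.
    second_entities = {e for e in first_entities if e in entity_dictionary}

    # Second relationship triple set: verb-mediated (subject, relation, object)
    # triples between retained entities.
    triples = []
    for token in doc:
        if token.pos_ == "VERB":
            subjects = [c.lemma_ for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c.lemma_ for c in token.children if c.dep_ in ("dobj", "obj")]
            for s in subjects:
                for o in objects:
                    if s in second_entities and o in second_entities:
                        triples.append((s, token.lemma_, o))

    # Third relationship triple set: apply an inter-entity filtering rule
    # (here reduced to a simple deduplication placeholder).
    third_triples = list(dict.fromkeys(triples))

    # Convert the retained triples into a graph structure (the text scene graph).
    graph = nx.MultiDiGraph()
    for subj, rel, obj in third_triples:
        graph.add_edge(subj, obj, relation=rel)
    return graph

# Example call: extract_text_scene_graph("A man rides a horse", {"man", "horse"})
```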
It is understood that the storage medium and the memory may be a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), for example at least one disk memory. Meanwhile, the storage medium may be any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk.
It is understood that the processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
It should be further noted that, as will be clearly understood by those skilled in the art, for convenience and brevity of description, reference may be made to the corresponding process in the foregoing method embodiment for the specific working process of the apparatus described above, which is not repeated here. In the embodiments provided in the present application, the division of steps or modules in the apparatus and method is only one kind of logical functional division; in actual implementation there may be other division manners, for example, multiple modules or steps may be combined or integrated together, and one module or step may also be split.
The above-described embodiments are merely preferred embodiments of the present invention and should not be construed as limiting the invention. Various changes and modifications may be made by those of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, technical solutions obtained by means of equivalent replacement or equivalent transformation fall within the protection scope of the present invention.
Claims (7)
1. An AI-training-oriented multi-modal dataset labeling method, characterized by comprising the following steps:
S1, obtaining a sample to be annotated, wherein the sample to be annotated comprises an original image and a corresponding image description;
S2, for the original image, obtaining a plurality of targets with category and bounding-box information through target detection, and pairing and sampling all the obtained targets to form a target pair set consisting of target pairs, wherein each target pair comprises a target serving as a subject and a target serving as an object; extracting the semantic information of the two targets in each target pair and the surrounding semantic information to form the context feature of the target pair, taking the respective visual features and category labels of the two targets in each target pair and the context feature of the target pair as the input of a trained deep self-attention network, and predicting the relationship between the two targets in the target pair to obtain a first relationship triple set consisting of the relationship triples present in the original image; and converting the relationship triples in the first relationship triple set into a graph structure, thereby obtaining an image-based scene graph;
S3, for the image description, identifying a first entity set from the image description text through entity extraction rules, then screening the entities in the first entity set with a dictionary, the retained entities forming a second entity set; identifying, with relationship extraction rules, the relationships existing between the entities in the second entity set from the image description text to obtain a second relationship triple set consisting of the relationship triples present in the image description; filtering the relationship triples in the second relationship triple set according to inter-entity relationship filtering rules, the retained relationship triples forming a third relationship triple set; and converting the relationship triples in the third relationship triple set into a graph structure, thereby obtaining an image-description-based scene graph;
S4, according to a first fusion mode or a second fusion mode, aligning and fusing the image-based scene graph and the image-description-based scene graph at the graph level to obtain a fused scene graph;
in the first fusion mode, after the second relationship triple set is obtained, it is judged, according to prior knowledge, context information and the constraints of each target pair in the target pair set, whether there is a relationship triple that was missed during rule-based extraction; if so, the missed relationship triple is supplemented into the second relationship triple set, and the filtering is then performed according to the relationship filtering rules; when the image-based scene graph and the image-description-based scene graph are aligned and fused at the graph level, each relationship triple of the image-based scene graph is traversed and it is judged whether the relationship triple exists in the third relationship triple set; if so, the relationship triple is added to the fused scene graph, and if not, it is not added; the final fused scene graph is obtained after the traversal is finished;
in the second fusion mode, when the image-based scene graph and the image-description-based scene graph are aligned and fused at the graph level, each relationship triple of the image-based scene graph is traversed and it is judged whether the two entities serving as the subject and the object in the relationship triple both exist in the second entity set; if so, the relationship triple is added to the fused scene graph, and if not, it is not added; the final fused scene graph is obtained after the traversal is finished;
and S5, sending the fused scene graph as initial labeling information to a manual proofreading terminal, generating a final labeling result according to the proofreading result returned by the manual proofreading terminal, associating the final labeling result with the sample to be annotated, and adding the sample to be annotated to a multi-modal dataset.
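For illustration only (this is not claim language), the graph-level traversal logic of the two fusion modes in claim 1 can be sketched as follows, assuming each scene graph is reduced to a collection of (subject, relation, object) triples and that the knowledge-based completion of the second relationship triple set has already been applied upstream.

```python
# Illustrative sketch of the two fusion modes described in claim 1.
# Scene graphs are assumed to be iterables of (subject, relation, object) triples.

def fuse_mode_one(image_triples, third_triple_set):
    # First fusion mode: keep an image-graph triple only if the same triple
    # also appears in the third (text-derived, completed and filtered) set.
    fused = []
    for triple in image_triples:
        if triple in third_triple_set:
            fused.append(triple)
    return fused

def fuse_mode_two(image_triples, second_entity_set):
    # Second fusion mode: keep an image-graph triple if both its subject and
    # object entities appear in the second (dictionary-screened) entity set.
    fused = []
    for subj, rel, obj in image_triples:
        if subj in second_entity_set and obj in second_entity_set:
            fused.append((subj, rel, obj))
    return fused
```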
2. The AI-training-oriented multi-modal dataset labeling method according to claim 1, wherein the target detection method comprises: inputting the original image into a region proposal network to obtain candidate boxes of the targets in the image and an image feature map; screening the candidate boxes through non-maximum suppression; extracting, from the image feature map according to each retained candidate box, the pooled feature of the region corresponding to that candidate box as its feature vector; and inputting the feature vector of each candidate box into a classification network and a position regression network respectively to obtain the category and position of each candidate box, thereby obtaining a plurality of targets with category and bounding-box information.
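The detection pipeline recited in claim 2 (region proposal network, non-maximum suppression, per-box pooled features, and classification plus box-regression heads) closely matches a standard Faster R-CNN. The following hedged sketch uses torchvision's reference implementation as a stand-in for the patented detector; the pairing helper, the score threshold and the input conventions are illustrative assumptions.

```python
# Illustrative sketch only: a Faster R-CNN stand-in for the claim 2 pipeline,
# plus ordered (subject, object) pairing of the detected targets.
import itertools
import torch
import torchvision

# Older torchvision versions use pretrained=True instead of weights="DEFAULT".
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

def detect_and_pair(image_tensor, score_threshold=0.5):
    """image_tensor: float tensor of shape [3, H, W] with values in [0, 1]."""
    with torch.no_grad():
        output = detector([image_tensor])[0]  # dict with boxes, labels, scores
    keep = output["scores"] >= score_threshold
    targets = [
        {"box": box.tolist(), "label": int(label)}
        for box, label in zip(output["boxes"][keep], output["labels"][keep])
    ]
    # Pair every ordered (subject, object) combination; any subsampling of
    # pairs mentioned in the claim is omitted here.
    target_pairs = list(itertools.permutations(targets, 2))
    return targets, target_pairs
```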
3. The AI-training-oriented multi-modal dataset labeling method according to claim 1, wherein the deep self-attention network consists of a plurality of stacked blocks and a classification network; each block is formed by cascading a multi-head attention module, a multi-layer perceptron module and a layer normalization module, the input of the block and the output of the multi-layer perceptron module are connected by a residual connection and then fed into the layer normalization module, and the output of the layer normalization module is the output of the whole block; the output of each block serves as the input of the next block, a learnable position encoding is added to the input of the first block, and the output of the last block serves as the input of the classification network; the classification network comprises only a multi-layer perceptron module, the output of the multi-layer perceptron is converted into a probability distribution over the relationship categories using a softmax function, and the category with the largest probability is taken as the predicted relationship between the two targets in a target pair.
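A minimal PyTorch sketch of the block structure recited in claim 3 follows; the hidden sizes, the mean pooling over the input tokens before classification, and the way the visual features, category labels and context feature are packed into a token sequence are assumptions, not details fixed by the claim.

```python
# Illustrative sketch of the claim 3 block: attention -> MLP, with a residual
# connection from the block input followed by layer normalization.
import torch
import torch.nn as nn

class RelationBlock(nn.Module):
    def __init__(self, dim, num_heads=8, mlp_ratio=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        y, _ = self.attn(x, x, x)           # multi-head self-attention
        return self.norm(x + self.mlp(y))   # residual from block input, then LayerNorm

class RelationClassifier(nn.Module):
    def __init__(self, dim, seq_len, num_relations, depth=6):
        super().__init__()
        # Learnable position encoding added to the input of the first block.
        self.pos = nn.Parameter(torch.zeros(1, seq_len, dim))
        self.blocks = nn.ModuleList(RelationBlock(dim) for _ in range(depth))
        # Classification network: a multi-layer perceptron only.
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, num_relations))

    def forward(self, tokens):
        x = tokens + self.pos
        for block in self.blocks:
            x = block(x)                     # output of each block feeds the next
        logits = self.head(x.mean(dim=1))    # pooling choice is an assumption
        probs = logits.softmax(dim=-1)       # softmax over relationship categories
        return probs.argmax(dim=-1)          # most probable relation class
```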
4. The AI-training-oriented multi-modal dataset labeling method according to claim 1, wherein the deep self-attention network is trained in advance by semi-supervised learning; the dataset used during training comprises an original image dataset and an enhanced image dataset, the original image dataset consisting of labeled original images and the enhanced image dataset consisting of unlabeled enhanced images obtained by applying data augmentation to each original image; the total loss function during training is a weighted sum of the cross-entropy loss of the deep self-attention network on the original image dataset and the KL-divergence loss on the enhanced image dataset.
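The total loss of claim 4 can be sketched as follows. The pairing of each augmented image with its original and the direction of the KL term are assumptions, since the claim only fixes the weighted sum of a cross-entropy term on the labeled originals and a KL-divergence term on the augmented images.

```python
# Illustrative sketch: total loss = CE on labeled originals + w * KL on augmented copies.
import torch
import torch.nn.functional as F

def total_loss(model, labeled_x, labels, orig_x, aug_x, kl_weight=1.0):
    # Supervised cross-entropy on the labeled original-image relation samples.
    ce = F.cross_entropy(model(labeled_x), labels)
    # Consistency term on the augmented images: KL divergence between the
    # distribution predicted on the original view (treated as the detached
    # target) and the distribution predicted on the augmented view.
    with torch.no_grad():
        p_orig = F.softmax(model(orig_x), dim=-1)
    log_p_aug = F.log_softmax(model(aug_x), dim=-1)
    kl = F.kl_div(log_p_aug, p_orig, reduction="batchmean")
    return ce + kl_weight * kl
```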
5. The AI-training-oriented multi-modal dataset labeling method according to claim 1, wherein the manual proofreading terminal displays the current sample to be annotated and the initial labeling information through a UI interface, and provides, on the UI interface, functional components for modifying the initial labeling information; if the initial labeling information is modified at the manual proofreading terminal, the modified labeling information is taken as the final labeling result; otherwise, the initial labeling information is taken as the final labeling result.
6. An AI-training-oriented multi-modal dataset labeling device, characterized by comprising:
a sample acquisition module, configured to acquire a sample to be annotated, wherein the sample to be annotated comprises an original image and a corresponding image description;
a first scene graph generation module, configured to, for the original image, obtain a plurality of targets with category and bounding-box information through target detection, and pair and sample all the obtained targets to form a target pair set consisting of target pairs, wherein each target pair comprises a target serving as a subject and a target serving as an object; extract the semantic information of the two targets in each target pair and the surrounding semantic information to form the context feature of the target pair, take the respective visual features and category labels of the two targets in each target pair and the context feature of the target pair as the input of a trained deep self-attention network, and predict the relationship between the two targets in the target pair to obtain a first relationship triple set consisting of the relationship triples present in the original image; and convert the relationship triples in the first relationship triple set into a graph structure, thereby obtaining an image-based scene graph;
a second scene graph generation module, configured to, for the image description, identify a first entity set from the image description text through entity extraction rules, then screen the entities in the first entity set with a dictionary, the retained entities forming a second entity set; identify, with relationship extraction rules, the relationships existing between the entities in the second entity set from the image description text to obtain a second relationship triple set consisting of the relationship triples present in the image description; filter the relationship triples in the second relationship triple set according to inter-entity relationship filtering rules, the retained relationship triples forming a third relationship triple set; and convert the relationship triples in the third relationship triple set into a graph structure, thereby obtaining an image-description-based scene graph;
a scene graph alignment and fusion module, configured to align and fuse the image-based scene graph and the image-description-based scene graph at the graph level according to a first fusion mode or a second fusion mode to obtain a fused scene graph;
in the first fusion mode, after the second relationship triple set is obtained, it is judged, according to prior knowledge, context information and the constraints of each target pair in the target pair set, whether there is a relationship triple that was missed during rule-based extraction; if so, the missed relationship triple is supplemented into the second relationship triple set, and the filtering is then performed according to the relationship filtering rules; when the image-based scene graph and the image-description-based scene graph are aligned and fused at the graph level, each relationship triple of the image-based scene graph is traversed and it is judged whether the relationship triple exists in the third relationship triple set; if so, the relationship triple is added to the fused scene graph, and if not, it is not added; the final fused scene graph is obtained after the traversal is finished;
in the second fusion mode, when the image-based scene graph and the image-description-based scene graph are aligned and fused at the graph level, each relationship triple of the image-based scene graph is traversed and it is judged whether the two entities serving as the subject and the object in the relationship triple both exist in the second entity set; if so, the relationship triple is added to the fused scene graph, and if not, it is not added; the final fused scene graph is obtained after the traversal is finished;
and a manual proofreading module, configured to send the fused scene graph as initial labeling information to a manual proofreading terminal, generate a final labeling result according to the proofreading result returned by the manual proofreading terminal, associate the final labeling result with the sample to be annotated, and add the sample to be annotated to the multi-modal dataset.
7. A computer electronic device comprising a memory and a processor;
the memory is configured to store a computer program;
the processor is configured to, when executing the computer program, implement the AI-training-oriented multi-modal dataset labeling method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210629969.XA CN114708472B (en) | 2022-06-06 | 2022-06-06 | AI (Artificial intelligence) training-oriented multi-modal data set labeling method and device and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114708472A CN114708472A (en) | 2022-07-05 |
CN114708472B true CN114708472B (en) | 2022-09-09 |
Family
ID=82177817
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210629969.XA Active CN114708472B (en) | 2022-06-06 | 2022-06-06 | AI (Artificial intelligence) training-oriented multi-modal data set labeling method and device and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114708472B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116524513B (en) * | 2023-07-03 | 2023-10-20 | 中国科学技术大学 | Open vocabulary scene graph generation method, system, equipment and storage medium |
CN118379737B (en) * | 2024-06-20 | 2024-08-27 | 清华大学 | Multi-mode general perception model training and labeling method and device and electronic equipment |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8379979B2 (en) * | 2011-02-25 | 2013-02-19 | Sony Corporation | System and method for effectively performing a scene rectification procedure |
- 2022-06-06: CN CN202210629969.XA patent/CN114708472B/en, status: Active
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110399799A (en) * | 2019-06-26 | 2019-11-01 | 北京迈格威科技有限公司 | Image recognition and the training method of neural network model, device and system |
CN111931928A (en) * | 2020-07-16 | 2020-11-13 | 成都井之丽科技有限公司 | Scene graph generation method, device and equipment |
KR102254768B1 (en) * | 2020-08-28 | 2021-05-24 | 광주과학기술원 | Scene Graph Generation apparatus |
CN112734881A (en) * | 2020-12-01 | 2021-04-30 | 北京交通大学 | Text synthesis image method and system based on significance scene graph analysis |
CN112464016A (en) * | 2020-12-17 | 2021-03-09 | 杭州电子科技大学 | Scene graph generation method based on depth relation self-attention network |
CN112784092A (en) * | 2021-01-28 | 2021-05-11 | 电子科技大学 | Cross-modal image text retrieval method of hybrid fusion model |
CN112989927A (en) * | 2021-02-03 | 2021-06-18 | 杭州电子科技大学 | Scene graph generation method based on self-supervision pre-training |
CN113689514A (en) * | 2021-06-29 | 2021-11-23 | 杭州电子科技大学 | Theme-oriented image scene graph generation method |
CN113554129A (en) * | 2021-09-22 | 2021-10-26 | 航天宏康智能科技(北京)有限公司 | Scene graph generation method and generation device |
CN114359768A (en) * | 2021-09-30 | 2022-04-15 | 中远海运科技股份有限公司 | Video dense event description method based on multi-mode heterogeneous feature fusion |
CN114332519A (en) * | 2021-12-29 | 2022-04-12 | 杭州电子科技大学 | Image description generation method based on external triple and abstract relation |
Non-Patent Citations (4)
Title |
---|
Lightweight Self-Attention Residual Network for Hyperspectral Classification; Xia, Jinbiao; IEEE Geoscience and Remote Sensing Letters; 2022-05-26; full text *
Research on Scene Graph Generation Methods Based on Deep Learning; Duan Jingwen; China Masters' Theses Full-text Database, Information Science and Technology; 2022-03-15; full text *
A Survey of Multi-modal Knowledge Graph Construction and Application; Chen Ye; Application Research of Computers; 2021-12-31; full text *
Research on Background Modeling Algorithms Fusing Color and Depth Data; Wu Chao; China Masters' Theses Full-text Database, Information Science and Technology; 2019-01-15; full text *
Also Published As
Publication number | Publication date |
---|---|
CN114708472A (en) | 2022-07-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP3928168B1 (en) | Automatic extraction of assets data from engineering data sources | |
CN112084331B (en) | Text processing and model training method and device, computer equipment and storage medium | |
CN114708472B (en) | AI (Artificial intelligence) training-oriented multi-modal data set labeling method and device and electronic equipment | |
Rukhovich et al. | Iterdet: iterative scheme for object detection in crowded environments | |
CN116049397B (en) | Sensitive information discovery and automatic classification method based on multi-mode fusion | |
CN112579477A (en) | Defect detection method, device and storage medium | |
Zhang et al. | SliceTeller: A data slice-driven approach for machine learning model validation | |
CN113255714A (en) | Image clustering method and device, electronic equipment and computer readable storage medium | |
CN112069884A (en) | Violent video classification method, system and storage medium | |
CN113887211A (en) | Entity relation joint extraction method and system based on relation guidance | |
CN111241326B (en) | Image visual relationship indication positioning method based on attention pyramid graph network | |
CN114550223A (en) | Person interaction detection method and device and electronic equipment | |
CN117151222B (en) | Domain knowledge guided emergency case entity attribute and relation extraction method thereof, electronic equipment and storage medium | |
CN114331122A (en) | Key person risk level assessment method and related equipment | |
CN114612921A (en) | Form recognition method and device, electronic equipment and computer readable medium | |
CN110659572B (en) | Video motion detection method based on bidirectional feature pyramid | |
CN115984894A (en) | 2D drawing feature identification method, system, device and medium | |
Eiter et al. | A Logic-based Approach to Contrastive Explainability for Neurosymbolic Visual Question Answering. | |
CN115062783B (en) | Entity alignment method and related device, electronic equipment and storage medium | |
CN116206201A (en) | Monitoring target detection and identification method, device, equipment and storage medium | |
CN113869518A (en) | Visual common sense reasoning method and device, electronic equipment and storage medium | |
Yao et al. | A unified neural network for panoptic segmentation | |
CN112861689A (en) | Searching method and device of coordinate recognition model based on NAS technology | |
Staron | Machine Learning Infrastructure and Best Practices for Software Engineers: Take your machine learning software from a prototype to a fully fledged software system | |
CN117173530B (en) | Target abnormality detection method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |