CN115761036A - Education field image scene graph generation method based on multi-view information fusion - Google Patents

Education field image scene graph generation method based on multi-view information fusion

Info

Publication number
CN115761036A
Authority
CN
China
Prior art keywords
semantic
view
visual
objects
scene graph
Prior art date
Legal status
Pending
Application number
CN202211156523.6A
Other languages
Chinese (zh)
Inventor
宋凌云
伍智广
张炀
尚学群
张弛
李战怀
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date: 2022-09-22
Filing date: 2022-09-22
Publication date: 2023-03-07
Application filed by Northwestern Polytechnical University
Priority to CN202211156523.6A
Publication of CN115761036A

Classifications

    • Y: General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02: Technologies or applications for mitigation or adaptation against climate change
    • Y02D: Climate change mitigation technologies in information and communication technologies [ICT], i.e. information and communication technologies aiming at the reduction of their own energy use
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a method for generating scene graphs from education-field images based on multi-view information fusion, which captures the semantics of an image from multiple aspects, such as the visual information of the objects in the image and the various kinds of interaction among those objects. The contribution of the invention is that it can exploit several different views of an image to mine the semantics of the objects it contains and the different types of association between them, thereby generating a heterogeneous scene graph that contains multiple types of nodes and edges. This scene graph represents an education-field image through the semantics of its visual objects and the complex semantic information extracted from the different associations between those objects.

Description

Education field image scene graph generation method based on multi-view information fusion
Technical Field
The invention belongs to the study of visual scene graph generation methods in the fields of computer applications, multi-view information fusion and education. In particular, it relates to a method that enables a computer, based on a convolutional neural network and a graph neural network, to generate a structured graph that represents the characteristics and information of an image through the relationships among the targets it contains, using a network topology structure.
Background
A scene graph can represent the effective information of an original image in a concise and structured way. This property gives the scene graph high application value, and the information extracted by the scene graph generation task can serve as input to other tasks. Scene-graph-structured information has been shown to improve various kinds of computer vision tasks: the image generation task can use the information provided by a scene graph to generate high-quality images effectively, and Visual Question Answering (VQA) can answer the questions in a VQA task directly from the semantic information contained in a scene graph, which strengthens the reasoning ability of the VQA model.
In the prior art, many tasks are performed in a single step, which means that each system must extract information from the original image, process that information, and predict an output. Such a network needs at least two modules, information feature extraction and information feature processing, so the network is relatively complex and expensive to train; moreover, because key and non-key regions of an image are difficult to distinguish, the effect of the network cannot be guaranteed. With the scene graph as the carrier of information transfer, downstream tasks can concentrate on improving their own networks, while the problem of extracting effective structured information from the image is handed over to the scene graph generation task; this reduces the complexity of the network, and the separation of functions ensures that each part of the network performs well.
Disclosure of Invention
Technical problem to be solved
The low-level visual information (color, texture, etc.) of the visual objects in education-field images is sparse, and understanding the image semantics requires combining information such as the numbers appearing in the image and the interactions between objects, so traditional image features based on low-level visual information have difficulty representing the image semantics accurately. Aiming at these characteristics of education-field images, the invention proposes a visual scene graph generation method based on multi-view information fusion, which captures the image semantics from multiple aspects such as the visual appearance of the objects in the image and the interaction information among them.
Technical scheme
A method for generating an image scene graph in an education field based on multi-view information fusion is characterized by comprising the following steps:
Step 1: multi-view construction
Step 1.1: object view construction: identify the category and position coordinates of each object in the image with a Faster R-CNN object detector, construct a visual information encoding of each identified object with a convolutional neural network, combine the encoding with the object, and build a fully connected object view whose nodes are the objects of the visual graph;
Step 1.2: semantic view construction: first recognize the text information in the image at fine granularity based on OCR and unsupervised semantic segmentation, then perform weighted fusion of the context information of different categories to increase the reasoning ability of the network, and finally form a fully connected semantic view;
Step 2: construction of the multi-view information fusion module
Step 2.1: fusion of the semantic view and the object view: using the node information in the semantic view, build a fully connected network between the semantic view and the object view with a graph convolutional network and update the node information in the object view, so as to obtain the semantic information between objects;
Step 2.2: self-fusion of the object view: after the semantic view has been fused, build a fully connected network within the object view with the fusion method of step 2.1 and let the object view update itself;
Step 3: construction of the visual scene graph generation module
Step 3.1: generation of the visual scene graph based on semantic relations: after the node features of the object view have been updated, generate the visual scene graph based on semantic relations by computing the probability distribution of semantic interactions between nodes; the nodes of this scene graph represent object regions and their class labels, and the edges represent the semantic interaction classes between visual objects;
Step 3.2: generation of the visual scene graph based on positional relations: first use the Intersection over Union (IoU) of the bounding boxes of the visual objects to judge whether two objects have a containment or overlap relationship; if the IoU is less than 0.5, judge the fine-grained positional interaction category between the bounding boxes by computing the distance and the angle between their center points; the fine-grained positional interaction categories comprise eight classes: above, below, left, right, upper-left, lower-left, upper-right and lower-right.
A further technical scheme of the invention is as follows: the fusion method of the semantic view and the object view in step 2.1 specifically comprises:
using the node information in the semantic view, a fully connected network between the semantic view and the object view is constructed with a graph convolutional network, and the node information in the object view is updated; for a node o_i in the object view, the feature vector update formula is:
o_i^(l+1) = σ( b^(l) + Σ_{j∈N(i)} (e_ji / c_ji) · W^(l) s_j^(l) ),  (2-1)
where N(i) is the set of neighbors of node o_i, b denotes the bias of the model, W denotes the parameters of the model, l denotes the number of update iterations of the node information in the object view, and c_ji, computed by the following formula, is the normalization term based on the node degrees:
c_ji = sqrt(|N(j)|) · sqrt(|N(i)|),  (2-2)
σ denotes the ReLU activation function, i.e.
f(x) = max(0, x),  (2-3)
e_ji denotes the weight from node o_j in N(i) to node o_i, which can be calculated by the following formula:
[Formula (2-4), which combines the similarity ρ[s_i, o_j] with the distance between the two objects, appears as an image in the original publication.]
ρ[s_i, o_j] denotes the similarity between s_i and o_j, determined by the categories and visual features of s_i and o_j; it is calculated by the following formula:
ρ[s_i, o_j] = f_s( (f_l(s_i) · f_l(o_j)) || (f_fe(s_i) · f_fe(o_j)) ),  (2-5)
f_l(·) is a class encoder that obtains the word vector of each class via FastText; f_fe(·) is a visual feature mapper that maps the high-dimensional visual features into a low-dimensional space; f_s(·) is a similarity encoder that computes the degree of similarity between two nodes;
distance(s_i, o_j) denotes the distance between the two objects; the farther apart two objects are, the weaker their mutual influence; "||" denotes the concatenation operation, which concatenates the distance between the objects with their similarity;
[Formula (2-6), which normalizes the weight from s_i to o_j at node o_j, appears as an image in the original publication.]
A further technical scheme of the invention is as follows: the self-fusion method of the object view in step 2.2 specifically comprises:
after the features of all nodes in the object view have been updated, a fully connected network within the object view is constructed with the fusion method of step 2.1, and the object view updates itself; the feature vector update formula is:
[The feature-vector update formula for the object-view self-fusion appears as an image in the original publication; it follows the form of formula (2-1), with object-view nodes o_j in place of the semantic-view nodes.]
Except for e_ji, everything follows step 2.1; e_ji is obtained from a formula that appears as an image in the original publication.
distance(o_i, o_j) and ρ[o_i, o_j] are calculated exactly as in step 2.1.
A further technical scheme of the invention is as follows: the method for generating the visual scene graph based on semantic relations described in step 3.1 specifically comprises:
the probability distribution of semantic interactions between nodes in the visual scene graph can be calculated by:
p(e_ij) = f_c( v̂_i || v̂_j || v_{i,j} ),  (4-1)
wherein "|" represents a splicing operation, will
Figure BDA0003858988520000045
And
Figure BDA0003858988520000046
respectively represent nodes v in the object view i And v j Updated feature vector, v i,j Corresponding to regions covering the ith and jth objects simultaneouslyVisual characteristics, f c Is a semantic interaction classifier, and the interaction categories comprise contact, of, leftchild, rightchild, next and hold;
after the currently predicted edge type is selected, a fully connected positive-sample subgraph is constructed over the object types at the two ends of the edge, a negative-sample subgraph with the same number of associations is constructed over the whole graph, and score prediction is performed on the relations in both subgraphs; the score is obtained by the following formula:
score(o_i, e_ij, o_j) = f_sc(o_i || o_j * e_ij),  (4-2)
where f_sc is the score predictor and o_i || o_j denotes concatenating o_i and o_j as the input of f_sc;
the loss function of the score evaluation network for semantic relation prediction between scene graph nodes is defined by formula (4-3), which appears as an image in the original publication;
because of the characteristics of education-field images, the number of associations is related to the connection types between nodes and to the number of nodes, so a top-k algorithm is designed that adaptively adjusts the number of selected top-scoring relations according to the node connections and the number of nodes; the corresponding formula (4-4), which appears as an image in the original publication, maps nums(o_i, o_j), the number of node pairs whose relation class is e_ij, to the number of top-scoring relations that are selected;
after all kinds of relations are obtained, pruning is performed per relation type using existing prior knowledge, and the visual scene graph based on semantic relations is finally obtained.
A further technical scheme of the invention is as follows: the method for generating the visual scene graph based on positional relations described in step 3.2 specifically comprises:
taking all relations of the visual scene graph that has been given semantic relations as the basis of position classification; for the positional interaction between objects, first use the Intersection over Union (IoU) of the bounding boxes of the visual objects to judge whether the two objects have a containment or overlap relationship; if the IoU is less than 0.5, judge the fine-grained positional interaction category between the bounding boxes by computing the distance and the angle between their center points, the fine-grained positional interaction categories comprising eight classes: above, below, left, right, upper-left, lower-left, upper-right and lower-right; the calculation process is as follows:
[The calculation formula appears as an image in the original publication.]
where f_p and f_l are the positional interaction category classifiers.
A computer system, comprising: one or more processors and a computer-readable storage medium storing one or more programs, wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the above method.
A computer-readable storage medium storing computer-executable instructions which, when executed, implement the above method.
Advantageous effects
The invention provides a method for generating an image scene graph in an education field based on multi-view information fusion.
The scene graph generated by the invention represents the effective information of the original image in a concise, structured way, achieving the goal of extracting information features; the result is passed to downstream tasks, which can then concentrate on improving their own processing of these features. This separation of functions reduces the complexity of the network and ensures the effect of each part of the network.
Secondly, the invention captures the semantics of the image from multiple aspects, such as the visual appearance of the objects in the image and the interaction information among the objects, and fuses semantic and visual information into scene graphs expressing different relations, which greatly improves the effect of information extraction for education-field images. By contrast, traditional methods usually extract image features from low-level visual information, and since the low-level visual information (color, texture, etc.) of the visual objects in such images is sparse, the image features extracted by those methods have difficulty representing the image semantics accurately.
Drawings
The drawings, in which like reference numerals refer to like parts throughout, are for the purpose of illustrating particular embodiments only and are not to be considered limiting of the invention.
FIG. 1 is an overall task framework diagram;
FIG. 2 is a view showing the structure of the Faster R-CNN model;
FIG. 3 illustrates the types of triplets and their numbers in a dataset;
FIG. 4 shows an original Array-list image and the visual scene graphs based on semantic and positional relations;
FIG. 5 shows an original FlowChart image and the visual scene graphs based on semantic and positional relations;
FIG. 6 is a comparison of FlowChart and Deadlock.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention. In addition, the technical features involved in the respective embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The method is an education-field image scene graph generation method based on multi-view information fusion and comprises three parts: multi-view construction, construction of the multi-view information fusion module, and construction of the visual scene graph generation module. The overall architecture is shown in Fig. 1 and is described in detail as follows:
1. multi-view construction
1.1 construction of object views
The invention uses Faster R-CNN to extract visual information in images. The model structure of Faster R-CNN is shown in FIG. 2.
A feature map is first extracted from the image with the ResNet50 backbone network and saved as the shared input of the RPN and ROI pooling. The RPN takes the feature sub-maps enclosed by anchor boxes on the global feature map as input, uses a classifier to judge whether a sub-map contains a target, and uses a regression function to output the bounding-box coordinates of that target; ROI pooling then produces a feature vector of uniform dimension for every target on the feature map, and the object view is finally obtained.
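As an illustration of this step (not part of the patent itself), the following minimal sketch uses torchvision's Faster R-CNN with a ResNet50 backbone to obtain the detected objects that become the nodes of the object view; the confidence threshold and file name are assumed values.

```python
# Minimal sketch: detect the objects that become nodes of the object view.
# Uses torchvision's pretrained Faster R-CNN; the 0.5 score threshold is an assumption.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
detector.eval()

image = to_tensor(Image.open("diagram.png").convert("RGB"))
with torch.no_grad():
    pred = detector([image])[0]      # dict with 'boxes', 'labels', 'scores'

keep = pred["scores"] > 0.5          # assumed confidence threshold
boxes = pred["boxes"][keep]          # (N, 4) bounding boxes in xyxy format
labels = pred["labels"][keep]        # (N,) category ids
# Each (box, label) pair becomes one node of the fully connected object view;
# its visual feature is the ROI-pooled vector for that box.
```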
1.2 construction of semantic views
Compared with natural images, the semantic information of education-field images follows relatively fixed sentence structures, and inferring this semantic information in advance can effectively improve the reasoning ability of the scene graph.
To make the word vectors and sentence vectors contain as much grammatical information, sentence-structure information and relational-reasoning information as possible, the method collects the descriptions of the images and questions in the dataset; the large number of data-structure descriptions forms an information-rich corpus, on which a FastText model is trained to obtain word vectors. At inference time, the word vector of each word is first obtained, then the relations between the word vectors of a complete sentence are extracted with a bidirectional long short-term memory network (BiLSTM) to construct sentence vectors with generalization ability and complete semantic relations, and finally the semantic view is constructed.
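To make the sentence encoding concrete, the sketch below (an illustrative assumption, not the patent's code) feeds 50-dimensional FastText word vectors through a single-layer BiLSTM and mean-pools the outputs into a 512-dimensional sentence vector, matching the dimensions given in section 3.3; the class name and the pooling choice are assumptions.

```python
# Sketch of the semantic-view sentence encoder: FastText word vectors -> BiLSTM -> sentence vector.
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    def __init__(self, word_dim: int = 50, sent_dim: int = 512):
        super().__init__()
        # Each direction outputs sent_dim // 2, so the concatenated output is sent_dim wide.
        self.bilstm = nn.LSTM(word_dim, sent_dim // 2, batch_first=True, bidirectional=True)

    def forward(self, word_vectors: torch.Tensor) -> torch.Tensor:
        # word_vectors: (batch, seq_len, word_dim) FastText vectors of one sentence
        outputs, _ = self.bilstm(word_vectors)   # (batch, seq_len, sent_dim)
        return outputs.mean(dim=1)               # mean-pool into one sentence vector

encoder = SentenceEncoder()
words = torch.randn(1, 6, 50)                    # 6 illustrative 50-d word vectors
sentence_vector = encoder(words)                 # (1, 512) semantic-view node feature
```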
2. Construction of multi-view information fusion module
Traditional scene graph generation methods that reason only over low-level visual information have difficulty accurately representing the semantics of education-field images, so a multi-view information fusion method is designed that fuses the targets in the image with the interaction information between the numbers and objects appearing in it, capturing the image semantics from multiple aspects.
2.1 fusion of semantic views and object views
Using the node information in the semantic view, a fully connected network between the semantic view and the object view is constructed with a graph convolutional network, and the node information in the object view is updated. For a node o_i in the object view, the feature vector update formula is:
o_i^(l+1) = σ( b^(l) + Σ_{j∈N(i)} (e_ji / c_ji) · W^(l) s_j^(l) ),  (2-1)
where N(i) is the set of neighbors of node o_i, b denotes the bias of the model, W denotes the parameters of the model, and c_ji, the normalization term based on the node degrees, is calculated by the following formula.
c_ji = sqrt(|N(j)|) · sqrt(|N(i)|),  (2-2)
σ denotes the ReLU activation function, i.e.
f(x) = max(0, x),  (2-3)
e_ji denotes the weight from node o_j in N(i) to node o_i and can be calculated by the following formula:
[Formula (2-4), which combines the similarity ρ[s_i, o_j] with the distance between the two objects, appears as an image in the original publication.]
ρ[s_i, o_j] denotes the similarity between s_i and o_j, determined by the categories and visual features of s_i and o_j; it is calculated by the following formula:
ρ[s_i, o_j] = f_s( (f_l(s_i) · f_l(o_j)) || (f_fe(s_i) · f_fe(o_j)) ),  (2-5)
f_l(·) is a class encoder that obtains the word vector of each class via FastText; f_fe(·) is a visual feature mapper that maps the high-dimensional visual features into a low-dimensional space; f_s(·) is a similarity encoder that computes the degree of similarity between two nodes.
distance(s_i, o_j) denotes the distance between the two objects; the farther apart two objects are, the less they influence each other. "||" denotes the concatenation operation, which concatenates the distance between the objects with their similarity.
[Formula (2-6), which normalizes the weight from s_i to o_j at node o_j, appears as an image in the original publication.]
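The fusion step can be pictured with the short sketch below. It is a plain PyTorch rendering of a weighted graph-convolution update using the edge weights e_ji and the degree normalization c_ji described above, written under the assumption that formula (2-1) has the standard weighted GCN form; the exact formula is only available as an image in the original.

```python
# Sketch of the semantic-view -> object-view fusion as a weighted graph convolution.
# The update rule is an assumed standard form; dimensions are illustrative.
import torch
import torch.nn as nn

class ViewFusionLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim)   # model parameters W and bias b

    def forward(self, sem_feats, edge_weight):
        # sem_feats:   (N_s, in_dim) semantic-view node features s_j
        # edge_weight: (N_o, N_s) weights e_ji from every semantic node to every object node
        deg_o = edge_weight.sum(dim=1, keepdim=True)       # degree of each object node o_i
        deg_s = edge_weight.sum(dim=0, keepdim=True)       # degree of each semantic node s_j
        c = torch.sqrt(deg_o) * torch.sqrt(deg_s)          # degree normalization c_ji
        messages = (edge_weight / c.clamp(min=1e-6)) @ self.W(sem_feats)
        return torch.relu(messages)                        # sigma = ReLU, new object-view features

fusion = ViewFusionLayer(in_dim=512, out_dim=512)
sem_feats = torch.randn(4, 512)                            # 4 semantic-view nodes
e = torch.rand(7, 4)                                       # weights e_ji for 7 object-view nodes
updated_obj_feats = fusion(sem_feats, e)                   # (7, 512)
```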
2.2 self-fusion of object views
After the features of all nodes in the object view have been updated, a fully connected network within the object view is constructed with the fusion method of step 2.1, and the object view updates itself. The feature vector update formula is:
[The feature-vector update formula for the object-view self-fusion appears as an image in the original publication; it follows the form of formula (2-1), with object-view nodes o_j in place of the semantic-view nodes.]
Except for e_ji, everything follows step 2.1; e_ji is obtained from a formula that appears as an image in the original publication.
distance(o_i, o_j) and ρ[o_i, o_j] are calculated exactly as in step 2.1.
3. Construction of visual scene graph generation module
3.1 semantic relationship-based Generation of visual scene graphs
Based on the object view with updated node features, the invention generates a visual scene graph based on semantic information. The nodes of this scene graph represent object regions and their class labels, and the edges represent the semantic interactions between visual objects. Specifically, the probability distribution of semantic interactions between nodes in the visual scene graph can be calculated by:
p(e_ij) = f_c( v̂_i || v̂_j || v_{i,j} ),  (3-1)
wherein "|" represents a splicing operation, will
Figure BDA0003858988520000102
And
Figure BDA0003858988520000103
respectively represent nodes v in object views i And v j Updated feature vector, v i,j Is a visual feature corresponding to a region covering the ith and jth objects simultaneously, f c Is a semantic interaction classifier and interaction categories include content, of, leftchild, rightchild, next, and hold.
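The probability distribution of formula (3-1) can be pictured with the following sketch, which assumes the classifier f_c is a small MLP over the concatenation of the two updated node features and the union-region feature; the layer sizes and feature dimensions are assumptions.

```python
# Sketch of the semantic interaction classifier f_c over concatenated features.
import torch
import torch.nn as nn

RELATIONS = ["contain", "of", "leftchild", "rightchild", "next", "hold"]

class InteractionClassifier(nn.Module):
    def __init__(self, node_dim: int = 512, region_dim: int = 2048, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * node_dim + region_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, len(RELATIONS)),
        )

    def forward(self, v_i, v_j, v_ij):
        # v_i, v_j: updated feature vectors of the two object-view nodes
        # v_ij: visual feature of the region covering both objects
        logits = self.mlp(torch.cat([v_i, v_j, v_ij], dim=-1))   # "||" = concatenation
        return torch.softmax(logits, dim=-1)                      # distribution over relations
```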
For education-field images, multi-view fusion adds semantic information to the object view and thereby effectively reinforces the semantic relations implied by the objects, which improves model performance and largely resolves the insufficient reasoning ability caused by the sparse visual information of visual objects in education-field images.
Since the objects that stand in a relationship are relatively fixed in education-field images, the triples <subject, predicate, object> can be set in advance as prior knowledge and fed into the scene graph generation model. When predicting and classifying relations, the object view can be divided into subgraphs according to the node types produced by object detection; one edge type is selected at a time and prediction is performed within the corresponding subgraph, which avoids as much as possible the redundant computation over all nodes of the object view and greatly improves efficiency. The types of triples in the dataset and their numbers are shown in Fig. 3.
After the currently predicted edge type is selected, a fully connected positive-sample subgraph is constructed over the object types at the two ends of the edge, a negative-sample subgraph with the same number of associations is constructed over the whole graph, and score prediction is performed on the relations in both subgraphs; the score is obtained by the following formula:
score(o_i, e_ij, o_j) = f_sc(o_i || o_j * e_ij),  (3-2)
where f_sc is the score predictor and o_i || o_j denotes concatenating o_i and o_j as the input of f_sc.
The loss function of the score evaluation network for semantic relation prediction between scene graph nodes is defined by formula (3-3), which appears as an image in the original publication.
Because of the characteristics of education-field images, the number of associations is related to the connection types between nodes and to the number of nodes, so a top-k algorithm is designed that adaptively adjusts the number of selected top-scoring relations according to the node connections and the number of nodes; the corresponding formula (3-4), which appears as an image in the original publication, maps nums(o_i, o_j), the number of node pairs whose relation class is e_ij, to the number of top-scoring relations that are selected.
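To make the selection step concrete, the sketch below scores candidate (subject, edge, object) triples as in formula (3-2) and keeps an adaptively chosen number of them; since formulas (3-3) and (3-4) are only available as images, the adaptive rule shown here (k grows with the number of candidate pairs) is an assumption.

```python
# Sketch of score prediction and adaptive top-k selection for one edge type.
import torch
import torch.nn as nn

class ScorePredictor(nn.Module):
    def __init__(self, node_dim: int = 512, num_relations: int = 6):
        super().__init__()
        self.edge_embed = nn.Embedding(num_relations, node_dim)
        self.f_sc = nn.Linear(2 * node_dim, 1)   # score predictor f_sc

    def forward(self, o_i, o_j, edge_type):
        # score(o_i, e_ij, o_j) = f_sc(o_i || o_j * e_ij), as in formula (3-2)
        e_ij = self.edge_embed(edge_type)
        return self.f_sc(torch.cat([o_i, o_j * e_ij], dim=-1)).squeeze(-1)

def adaptive_topk(scores: torch.Tensor, num_pairs: int) -> torch.Tensor:
    """Keep the k best-scoring candidate relations; k scaling with num_pairs is an assumed rule."""
    k = max(1, min(scores.numel(), num_pairs // 2))
    return torch.topk(scores, k).indices
```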
After all kinds of relations are obtained, pruning is performed per relation type using existing prior knowledge (the number of occurrences of each relation in the samples and its semantic importance are analyzed jointly, and relations that occur rarely and have low importance are pruned, with the existing prior knowledge as the criterion), and the visual scene graph based on semantic relations is finally obtained.
3.2 Generation of visual scene graphs based on positional relationships
Based on the visual scene graph with semantic relations, the invention provides a method for generating a visual scene graph based on positional relations. The method takes all relations of the semantic-relation scene graph as the basis of position classification. For the positional interaction between objects, the Intersection over Union (IoU) of the bounding boxes of the visual objects is first used to judge whether the two objects have a containment or overlap relationship. If the IoU is less than 0.5, the fine-grained positional interaction category between the bounding boxes is judged by computing the distance and the angle between their center points; the fine-grained positional interaction categories comprise eight classes: above, below, left, right, upper-left, lower-left, upper-right and lower-right. The calculation process is as follows:
[The calculation formula appears as an image in the original publication.]
where f_p and f_l are the positional interaction category classifiers.
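A rule-based sketch of the positional classification follows; the IoU threshold of 0.5 comes from the text, while the exact mapping from angle to the eight direction classes is an assumption, since the calculation formula is only available as an image.

```python
# Sketch of the rule-based positional relation between two bounding boxes (xyxy format).
import math

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def position_relation(a, b):
    """Return the position of box a relative to box b (image y axis points down)."""
    if iou(a, b) >= 0.5:
        return "contain_or_overlap"
    cax, cay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    cbx, cby = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    angle = math.degrees(math.atan2(cby - cay, cax - cbx)) % 360  # flip y so "up" is positive
    directions = ["right", "upper-right", "above", "upper-left",
                  "left", "lower-left", "below", "lower-right"]
    return directions[int(((angle + 22.5) % 360) // 45)]
```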
3.3 implementation details
The invention adopts Faster R-CNN with ResNet50 as the backbone network. The visual information obtained from Faster R-CNN is a 2048-dimensional vector. The word vectors are 50-dimensional and the sentence vectors are 512-dimensional. When information is propagated, the bounding box, center point, length, width and aspect ratio of an object are assembled into a 9-dimensional normalized vector, which is concatenated with the visual information of the object. Multi-view fusion is performed with a two-layer graph convolutional network; the updated node representations are then fed into a two-layer MLP for prediction, which outputs a one-dimensional vector.
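The 9-dimensional geometry vector mentioned above could be assembled as in the sketch below; the exact choice and ordering of the nine components is an assumption, since the text only lists the quantities (bounding box, center point, length, width, aspect ratio) without giving their order.

```python
# Sketch: a normalized 9-d geometry vector, concatenated with the 2048-d visual feature.
import torch

def geometry_vector(box, img_w, img_h):
    x1, y1, x2, y2 = box
    w, h = (x2 - x1) / img_w, (y2 - y1) / img_h
    cx, cy = (x1 + x2) / (2 * img_w), (y1 + y2) / (2 * img_h)
    return torch.tensor([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h,
                         cx, cy, w, h, w / (h + 1e-6)])

visual_feature = torch.randn(2048)                       # stand-in for the Faster R-CNN feature
box, img_w, img_h = (40.0, 30.0, 120.0, 90.0), 640, 480  # illustrative values
node_input = torch.cat([visual_feature, geometry_vector(box, img_w, img_h)])  # 2057-d node input
```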
Training is performed in two separately trained stages. First the Faster R-CNN object detection network and the OCR network are trained on the dataset, and then the scene graph generation model is trained with the visual and semantic information output by these two networks. SGD is used as the optimizer with the learning rate set to 0.003.
3.4 quantitative assessment and analysis
The model is evaluated quantitatively with the recall rate. Because the model is designed only for education-field images and is not compared with other methods, the quantitative evaluation is only used to judge the feasibility of the model.
The recall rate (Recall@K, R@K) is the proportion of the relation ground truth found among the K classification results with the highest confidence over the whole graph. K is set to 30 in the quantitative experiments, and the recall is computed separately for each image category; the results are shown in Table 1 (a minimal sketch of the computation follows the table).
Table 1. Recall (R@30) for different image categories
Categories        R@30
Array-list        77.8
Binary-tree       64.6
Deadlock          53.0
Directed-graph    60.1
Flowchart         51.7
Linked-List       73.8
Logic-circuit     44.8
Network-topology  48.5
Non-binary-tree   68.0
Queue             69.9
Stack             57.9
Undirected-graph  65.3
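The R@K values in Table 1 can be computed as in the sketch below; this is an illustrative implementation, not code from the patent.

```python
# Sketch of Recall@K over predicted relation triples ranked by confidence.
def recall_at_k(predicted, ground_truth, k=30):
    """predicted: list of (triple, confidence) pairs; ground_truth: set of true triples."""
    top_k = {t for t, _ in sorted(predicted, key=lambda p: p[1], reverse=True)[:k]}
    return len(top_k & ground_truth) / len(ground_truth) if ground_truth else 0.0
```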
According to Table 1, the recall of Array-list, Linked-List and Queue is relatively high, while the recall of Logic-circuit and Network-topology is relatively low. The lowest, Logic-circuit, has a recall of 44.8%, which is presumably related to the sparseness and clarity of the image information in the dataset and to the prior knowledge fed into the model in advance. The prior knowledge also limits the generalization of the model to some extent: in Network-topology, for example, the target shapes in the image are very diverse and prone to detection errors, and when an object class in the image does not match the expectation of the prior knowledge, generation of the corresponding relation is suppressed.
3.5 visualization result evaluation and analysis
Array-list (Fig. 4), which performs well, and FlowChart (Fig. 5), which performs poorly, are selected as examples for analyzing the visualized results of the two generated scene graphs.
As can be seen from Fig. 5, FlowChart produces a large object-detection error: the oval-shaped terminal is identified as a process of the Deadlock category. Fig. 6 compares the two types of pictures.
This phenomenon is not limited to these two categories: the process of the Deadlock category and the node of the Binary-tree category, the element of Array-list and the node of several other categories, and the NAND gate and the AND gate of Network-topology, among others, are easily misidentified, because the differences in the low-level visual features of the various node types in the dataset are not significant. The influence on the scene graph generation task is nevertheless acceptable: in the generated visual scene graphs based on semantic relations and on positional relations, the relations are relatively complete and their accuracy is also guaranteed.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims (7)

1. A method for generating an image scene graph in an education field based on multi-view information fusion is characterized by comprising the following steps:
Step 1: multi-view construction
Step 1.1: object view construction: identify the category and position coordinates of each object in the image with a Faster R-CNN object detector, construct a visual information encoding of each identified object with a convolutional neural network, combine the encoding with the object, and build a fully connected object view whose nodes are the objects of the visual graph;
Step 1.2: semantic view construction: first recognize the text information in the image at fine granularity based on OCR and unsupervised semantic segmentation, then perform weighted fusion of the context information of different categories to increase the reasoning ability of the network, and finally form a fully connected semantic view;
Step 2: construction of the multi-view information fusion module
Step 2.1: fusion of the semantic view and the object view: using the node information in the semantic view, build a fully connected network between the semantic view and the object view with a graph convolutional network and update the node information in the object view, so as to obtain the semantic information between objects;
Step 2.2: self-fusion of the object view: after the semantic view has been fused, build a fully connected network within the object view with the fusion method of step 2.1 and let the object view update itself;
Step 3: construction of the visual scene graph generation module
Step 3.1: generation of the visual scene graph based on semantic relations: after the node features of the object view have been updated, generate the visual scene graph based on semantic relations by computing the probability distribution of semantic interactions between nodes; the nodes of this scene graph represent object regions and their class labels, and the edges represent the semantic interaction classes between visual objects;
Step 3.2: generation of the visual scene graph based on positional relations: first use the Intersection over Union (IoU) of the bounding boxes of the visual objects to judge whether two objects have a containment or overlap relationship; if the IoU is less than 0.5, judge the fine-grained positional interaction category between the bounding boxes by computing the distance and the angle between their center points; the fine-grained positional interaction categories comprise eight classes: above, below, left, right, upper-left, lower-left, upper-right and lower-right.
2. The method for generating an image scene graph of an educational field based on multi-view information fusion according to claim 1, wherein: the fusion method of the semantic view and the object view in step 2.1 specifically comprises the following steps:
using the node information in the semantic view, a fully connected network between the semantic view and the object view is constructed with a graph convolutional network, and the node information in the object view is updated; for a node o_i in the object view, the feature vector update formula is:
o_i^(l+1) = σ( b^(l) + Σ_{j∈N(i)} (e_ji / c_ji) · W^(l) s_j^(l) ),  (2-1)
where N(i) is the set of neighbors of node o_i, b denotes the bias of the model, W denotes the parameters of the model, l denotes the number of update iterations of the node information in the object view, and c_ji denotes the normalization term based on the node degrees;
c_ji = sqrt(|N(j)|) · sqrt(|N(i)|),  (2-2)
σ denotes the ReLU activation function, i.e.
f(x) = max(0, x),  (2-3)
e_ji denotes the weight from node o_j in N(i) to node o_i, which can be calculated by the following formula:
[Formula (2-4), which combines the similarity ρ[s_i, o_j] with the distance between the two objects, appears as an image in the original publication.]
ρ[s_i, o_j] denotes the similarity between s_i and o_j, determined by the categories and visual features of s_i and o_j; it is calculated by the following formula:
ρ[s_i, o_j] = f_s( (f_l(s_i) · f_l(o_j)) || (f_fe(s_i) · f_fe(o_j)) ),  (2-5)
f_l(·) is a class encoder that obtains the word vector of each class via FastText; f_fe(·) is a visual feature mapper that maps the high-dimensional visual features into a low-dimensional space; f_s(·) is a similarity encoder that computes the degree of similarity between two nodes;
distance(s_i, o_j) denotes the distance between the two objects; the farther apart two objects are, the weaker their mutual influence; "||" denotes the concatenation operation, which concatenates the distance between the objects with their similarity;
[Formula (2-6), which normalizes the weight from s_i to o_j at node o_j, appears as an image in the original publication.]
3. The method for generating an image scene graph of an educational field based on multi-view information fusion according to claim 2, wherein: the self-fusion method of the object view in the step 2.2 specifically comprises the following steps:
after the features of all nodes in the object view have been updated, a fully connected network within the object view is constructed with the fusion method of step 2.1, and the object view updates itself; the feature vector update formula is:
[The feature-vector update formula for the object-view self-fusion appears as an image in the original publication; it follows the form of formula (2-1), with object-view nodes o_j in place of the semantic-view nodes.]
except for e_ji, everything follows step 2.1; e_ji is obtained from a formula that appears as an image in the original publication;
distance(o_i, o_j) and ρ[o_i, o_j] are calculated exactly as in step 2.1.
4. The method for generating an image scene graph of an educational field based on multi-view information fusion according to claim 3, wherein: the method for generating the visual scene graph based on the semantic relationship, which is described in the step 3.1, specifically comprises the following steps:
the probability distribution of semantic interactions between nodes in a visual scene graph can be calculated by:
p(e_ij) = f_c( v̂_i || v̂_j || v_{i,j} ),  (4-1)
wherein "|" represents a splicing operation, will
Figure FDA0003858988510000034
And
Figure FDA0003858988510000035
respectively represent nodes v in object views i And v j Updated feature vector, v i,j Is a visual feature corresponding to a region covering the ith and jth objects simultaneously, f c Is a semantic interaction classifier, and the interaction categories include continin, of, leftchild, rightchild, next and hold;
after the currently predicted edge type is selected, a fully connected positive-sample subgraph is constructed over the object types at the two ends of the edge, a negative-sample subgraph with the same number of associations is constructed over the whole graph, and score prediction is performed on the relations in both subgraphs; the score is obtained by the following formula:
score(o_i, e_ij, o_j) = f_sc(o_i || o_j * e_ij),  (4-2)
where f_sc is the score predictor and o_i || o_j denotes concatenating o_i and o_j as the input of f_sc;
the loss function of the score evaluation network for semantic relation prediction between scene graph nodes is defined by formula (4-3), which appears as an image in the original publication;
because of the characteristics of education-field images, the number of associations is related to the connection types between nodes and to the number of nodes, so a top-k algorithm is designed that adaptively adjusts the number of selected top-scoring relations according to the node connections and the number of nodes; the corresponding formula (4-4), which appears as an image in the original publication, maps nums(o_i, o_j), the number of node pairs whose relation class is e_ij, to the number of top-scoring relations that are selected;
after all kinds of relations are obtained, pruning is performed per relation type using existing prior knowledge, and the visual scene graph based on semantic relations is finally obtained.
5. The method for generating the image scene graph of the education domain based on the multi-view information fusion as claimed in claim 4, wherein: the method for generating the visual scene graph based on the position relationship, which is described in the step 3.2, specifically comprises the following steps:
taking all relations of the visual scene graph that has been given semantic relations as the basis of position classification; for the positional interaction between objects, first use the Intersection over Union (IoU) of the bounding boxes of the visual objects to judge whether the two objects have a containment or overlap relationship; if the IoU is less than 0.5, judge the fine-grained positional interaction category between the bounding boxes by computing the distance and the angle between their center points, the fine-grained positional interaction categories comprising eight classes: above, below, left, right, upper-left, lower-left, upper-right and lower-right; the calculation process is as follows:
[The calculation formula appears as an image in the original publication.]
where f_p and f_l are the positional interaction category classifiers.
6. A computer system, comprising: one or more processors and a computer-readable storage medium storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of claim 1.
7. A computer-readable storage medium storing computer-executable instructions which, when executed, implement the method of claim 1.
CN202211156523.6A 2022-09-22 2022-09-22 Education field image scene graph generation method based on multi-view information fusion Pending CN115761036A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211156523.6A CN115761036A (en) 2022-09-22 2022-09-22 Education field image scene graph generation method based on multi-view information fusion


Publications (1)

Publication Number Publication Date
CN115761036A 2023-03-07

Family

ID=85351822



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination