CN115761036A - Education field image scene graph generation method based on multi-view information fusion - Google Patents

Education field image scene graph generation method based on multi-view information fusion

Info

Publication number
CN115761036A
Authority
CN
China
Prior art keywords
semantic
view
visual
objects
scene graph
Prior art date
Legal status
Pending
Application number
CN202211156523.6A
Other languages
Chinese (zh)
Inventor
宋凌云
伍智广
张炀
尚学群
张弛
李战怀
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date: 2022-09-22
Filing date: 2022-09-22
Publication date: 2023-03-07
Application filed by Northwestern Polytechnical University
Priority to CN202211156523.6A
Publication of CN115761036A

Classifications

    • Y: General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02: Technologies or applications for mitigation or adaptation against climate change
    • Y02D: Climate change mitigation technologies in information and communication technologies [ICT], i.e. information and communication technologies aiming at the reduction of their own energy use
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a method for generating scene graphs from education-field images based on multi-view information fusion, which captures the semantics of an image from multiple aspects, such as the visual information of the objects in the image and the various kinds of interaction among those objects. The contribution of the invention is that it can exploit several different views of an image to mine the semantics of the objects it contains and the different types of association between them, thereby generating a heterogeneous scene graph that contains multiple types of nodes and edges. This scene graph represents an education-field image through the semantics of its visual objects and the complex semantic information extracted from the different associations between those objects.

Description

Education field image scene graph generation method based on multi-view information fusion
Technical Field
The invention belongs to the study of visual scene graph generation methods in the fields of computer applications, multi-view information fusion and education. In particular, it relates to a method that enables a computer, based on a convolutional neural network and a graph neural network, to generate a structured graph that represents the characteristics and information of an image through the relationships among the targets it contains, using a network topology structure.
Background
A scene graph can represent the effective information of an original image in a concise and structured way. This property gives the scene graph high application value, and the information extracted by the scene graph generation task can serve as input to other tasks. Scene-graph-structured information has been shown to improve various kinds of computer vision tasks: the image generation task can use the information provided by a scene graph to generate high-quality images effectively, and Visual Question Answering (VQA) can answer the questions in a VQA task directly from the semantic information contained in a scene graph, which strengthens the reasoning ability of the VQA model.
In the prior art, many tasks are performed in a single step, which means that each system must extract information from the original image, process that information, and predict an output. Such a network needs at least two modules, information feature extraction and information feature processing, so the network is relatively complex and expensive to train; moreover, because key and non-key regions of an image are difficult to distinguish, the effect of the network cannot be guaranteed. With the scene graph as the carrier of information transfer, downstream tasks can concentrate on improving their own networks, while the problem of extracting effective structured information from the image is handed over to the scene graph generation task; this reduces the complexity of the network, and the separation of functions ensures that each part of the network performs well.
Disclosure of Invention
Technical problem to be solved
The low-level visual information (color, texture, etc.) of the visual objects in education-field images is sparse, and understanding the image semantics requires combining information such as the numbers appearing in the image and the interactions between objects, so traditional image features based on low-level visual information have difficulty representing the image semantics accurately. Aiming at these characteristics of education-field images, the invention proposes a visual scene graph generation method based on multi-view information fusion, which captures the image semantics from multiple aspects such as the visual appearance of the objects in the image and the interaction information among them.
Technical scheme
A method for generating an image scene graph in an education field based on multi-view information fusion is characterized by comprising the following steps:
Step 1: multi-view construction
Step 1.1: object view construction: identify the category and position coordinates of each object in the image with a Faster R-CNN object detector, construct a visual information encoding of each identified object with a convolutional neural network, combine the encoding with the object, and build a fully connected object view whose nodes are the objects of the visual graph;
Step 1.2: semantic view construction: first recognize the text information in the image at fine granularity based on OCR and unsupervised semantic segmentation, then perform weighted fusion of the context information of different categories to increase the reasoning ability of the network, and finally form a fully connected semantic view;
Step 2: construction of the multi-view information fusion module
Step 2.1: fusion of the semantic view and the object view: using the node information in the semantic view, build a fully connected network between the semantic view and the object view with a graph convolutional network and update the node information in the object view, so as to obtain the semantic information between objects;
Step 2.2: self-fusion of the object view: after the semantic view has been fused, build a fully connected network within the object view with the fusion method of step 2.1 and let the object view update itself;
Step 3: construction of the visual scene graph generation module
Step 3.1: generation of the visual scene graph based on semantic relations: after the node features of the object view have been updated, generate the visual scene graph based on semantic relations by computing the probability distribution of semantic interactions between nodes; the nodes of this scene graph represent object regions and their class labels, and the edges represent the semantic interaction classes between visual objects;
Step 3.2: generation of the visual scene graph based on positional relations: first use the Intersection over Union (IoU) of the bounding boxes of the visual objects to judge whether two objects have a containment or overlap relationship; if the IoU is less than 0.5, judge the fine-grained positional interaction category between the bounding boxes by computing the distance and the angle between their center points; the fine-grained positional interaction categories comprise eight classes: above, below, left, right, upper-left, lower-left, upper-right and lower-right.
A further technical scheme of the invention is as follows: the fusion method of the semantic view and the object view in step 2.1 specifically comprises:
using the node information in the semantic view, a fully connected network between the semantic view and the object view is constructed with a graph convolutional network, and the node information in the object view is updated; for a node o_i in the object view, the feature vector update formula is:
o_i^(l+1) = σ( b^(l) + Σ_{j∈N(i)} (e_ji / c_ji) · W^(l) s_j^(l) ),  (2-1)
where N(i) is the set of neighbors of node o_i, b denotes the bias of the model, W denotes the parameters of the model, l denotes the number of update iterations of the node information in the object view, and c_ji, computed by the following formula, is the normalization term based on the node degrees:
c_ji = sqrt(|N(j)|) · sqrt(|N(i)|),  (2-2)
σ denotes the ReLU activation function, i.e.
f(x) = max(0, x),  (2-3)
e_ji denotes the weight from node o_j in N(i) to node o_i, which can be calculated by the following formula:
[Formula (2-4), which combines the similarity ρ[s_i, o_j] with the distance between the two objects, appears as an image in the original publication.]
ρ[s_i, o_j] denotes the similarity between s_i and o_j, determined by the categories and visual features of s_i and o_j; it is calculated by the following formula:
ρ[s_i, o_j] = f_s( (f_l(s_i) · f_l(o_j)) || (f_fe(s_i) · f_fe(o_j)) ),  (2-5)
f_l(·) is a class encoder that obtains the word vector of each class via FastText; f_fe(·) is a visual feature mapper that maps the high-dimensional visual features into a low-dimensional space; f_s(·) is a similarity encoder that computes the degree of similarity between two nodes;
distance(s_i, o_j) denotes the distance between the two objects; the farther apart two objects are, the weaker their mutual influence; "||" denotes the concatenation operation, which concatenates the distance between the objects with their similarity;
[Formula (2-6), which normalizes the weight from s_i to o_j at node o_j, appears as an image in the original publication.]
A further technical scheme of the invention is as follows: the self-fusion method of the object view in step 2.2 specifically comprises:
after the features of all nodes in the object view have been updated, a fully connected network within the object view is constructed with the fusion method of step 2.1, and the object view updates itself; the feature vector update formula is:
[The feature-vector update formula for the object-view self-fusion appears as an image in the original publication; it follows the form of formula (2-1), with object-view nodes o_j in place of the semantic-view nodes.]
Except for e_ji, everything follows step 2.1; e_ji is obtained from a formula that appears as an image in the original publication.
distance(o_i, o_j) and ρ[o_i, o_j] are calculated exactly as in step 2.1.
A further technical scheme of the invention is as follows: the method for generating the visual scene graph based on semantic relations described in step 3.1 specifically comprises:
the probability distribution of semantic interactions between nodes in the visual scene graph can be calculated by:
p(e_ij) = f_c( v̂_i || v̂_j || v_{i,j} ),  (4-1)
wherein "|" represents a splicing operation, will
Figure BDA0003858988520000045
And
Figure BDA0003858988520000046
respectively represent nodes v in the object view i And v j Updated feature vector, v i,j Corresponding to regions covering the ith and jth objects simultaneouslyVisual characteristics, f c Is a semantic interaction classifier, and the interaction categories comprise contact, of, leftchild, rightchild, next and hold;
after the currently predicted edge type is selected, a fully connected positive-sample subgraph is constructed over the object types at the two ends of the edge, a negative-sample subgraph with the same number of associations is constructed over the whole graph, and score prediction is performed on the relations in both subgraphs; the score is obtained by the following formula:
score(o_i, e_ij, o_j) = f_sc(o_i || o_j * e_ij),  (4-2)
where f_sc is the score predictor and o_i || o_j denotes concatenating o_i and o_j as the input of f_sc;
the loss function of the score evaluation network for semantic relation prediction between scene graph nodes is defined by formula (4-3), which appears as an image in the original publication;
because of the characteristics of education-field images, the number of associations is related to the connection types between nodes and to the number of nodes, so a top-k algorithm is designed that adaptively adjusts the number of selected top-scoring relations according to the node connections and the number of nodes; the corresponding formula (4-4), which appears as an image in the original publication, maps nums(o_i, o_j), the number of node pairs whose relation class is e_ij, to the number of top-scoring relations that are selected;
after all kinds of relations are obtained, pruning is performed per relation type using existing prior knowledge, and the visual scene graph based on semantic relations is finally obtained.
A further technical scheme of the invention is as follows: the method for generating the visual scene graph based on positional relations described in step 3.2 specifically comprises:
taking all relations of the visual scene graph that has been given semantic relations as the basis of position classification; for the positional interaction between objects, first use the Intersection over Union (IoU) of the bounding boxes of the visual objects to judge whether the two objects have a containment or overlap relationship; if the IoU is less than 0.5, judge the fine-grained positional interaction category between the bounding boxes by computing the distance and the angle between their center points, the fine-grained positional interaction categories comprising eight classes: above, below, left, right, upper-left, lower-left, upper-right and lower-right; the calculation process is as follows:
[The calculation formula appears as an image in the original publication.]
where f_p and f_l are the positional interaction category classifiers.
A computer system, comprising: one or more processors and a computer-readable storage medium storing one or more programs, wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the above method.
A computer-readable storage medium storing computer-executable instructions which, when executed, implement the above method.
Advantageous effects
The invention provides a method for generating an image scene graph in an education field based on multi-view information fusion.
The scene graph generated by the invention represents the effective information of the original image in a concise, structured way, achieving the goal of extracting information features; the result is passed to downstream tasks, which can then concentrate on improving their own processing of these features. This separation of functions reduces the complexity of the network and ensures the effect of each part of the network.
Secondly, the invention captures the semantics of the image from multiple aspects, such as the visual appearance of the objects in the image and the interaction information among the objects, and fuses semantic and visual information into scene graphs expressing different relations, which greatly improves the effect of information extraction for education-field images. By contrast, traditional methods usually extract image features from low-level visual information, and since the low-level visual information (color, texture, etc.) of the visual objects in such images is sparse, the image features extracted by those methods have difficulty representing the image semantics accurately.
Drawings
The drawings, in which like reference numerals refer to like parts throughout, are for the purpose of illustrating particular embodiments only and are not to be considered limiting of the invention.
FIG. 1 is an overall task framework diagram;
FIG. 2 is a view showing the structure of the Faster R-CNN model;
FIG. 3 illustrates the types of triplets and their numbers in a dataset;
FIG. 4 shows an original Array-list image and the visual scene graphs based on semantic and positional relations;
FIG. 5 shows an original FlowChart image and the visual scene graphs based on semantic and positional relations;
FIG. 6 is a comparison of FlowChart and Deadlock.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention. In addition, the technical features involved in the respective embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The method is an education-field image scene graph generation method based on multi-view information fusion and comprises three parts: multi-view construction, construction of the multi-view information fusion module, and construction of the visual scene graph generation module. The overall architecture is shown in Fig. 1 and is described in detail as follows:
1. multi-view construction
1.1 construction of object views
The invention uses Faster R-CNN to extract visual information in images. The model structure of Faster R-CNN is shown in FIG. 2.
A feature map is first extracted from the image with the ResNet50 backbone network and saved as the shared input of the RPN and ROI pooling. The RPN takes the feature sub-maps enclosed by anchor boxes on the global feature map as input, uses a classifier to judge whether a sub-map contains a target, and uses a regression function to output the bounding-box coordinates of that target; ROI pooling then produces a feature vector of uniform dimension for every target on the feature map, and the object view is finally obtained.
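As an illustration of this step (not part of the patent itself), the following minimal sketch uses torchvision's Faster R-CNN with a ResNet50 backbone to obtain the detected objects that become the nodes of the object view; the confidence threshold and file name are assumed values.

```python
# Minimal sketch: detect the objects that become nodes of the object view.
# Uses torchvision's pretrained Faster R-CNN; the 0.5 score threshold is an assumption.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
detector.eval()

image = to_tensor(Image.open("diagram.png").convert("RGB"))
with torch.no_grad():
    pred = detector([image])[0]      # dict with 'boxes', 'labels', 'scores'

keep = pred["scores"] > 0.5          # assumed confidence threshold
boxes = pred["boxes"][keep]          # (N, 4) bounding boxes in xyxy format
labels = pred["labels"][keep]        # (N,) category ids
# Each (box, label) pair becomes one node of the fully connected object view;
# its visual feature is the ROI-pooled vector for that box.
```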
1.2 construction of semantic views
Compared with natural images, the semantic information of education-field images follows relatively fixed sentence structures, and inferring this semantic information in advance can effectively improve the reasoning ability of the scene graph.
To make the word vectors and sentence vectors contain as much grammatical information, sentence-structure information and relational-reasoning information as possible, the method collects the descriptions of the images and questions in the dataset; the large number of data-structure descriptions forms an information-rich corpus, on which a FastText model is trained to obtain word vectors. At inference time, the word vector of each word is first obtained, then the relations between the word vectors of a complete sentence are extracted with a bidirectional long short-term memory network (BiLSTM) to construct sentence vectors with generalization ability and complete semantic relations, and finally the semantic view is constructed.
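To make the sentence encoding concrete, the sketch below (an illustrative assumption, not the patent's code) feeds 50-dimensional FastText word vectors through a single-layer BiLSTM and mean-pools the outputs into a 512-dimensional sentence vector, matching the dimensions given in section 3.3; the class name and the pooling choice are assumptions.

```python
# Sketch of the semantic-view sentence encoder: FastText word vectors -> BiLSTM -> sentence vector.
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    def __init__(self, word_dim: int = 50, sent_dim: int = 512):
        super().__init__()
        # Each direction outputs sent_dim // 2, so the concatenated output is sent_dim wide.
        self.bilstm = nn.LSTM(word_dim, sent_dim // 2, batch_first=True, bidirectional=True)

    def forward(self, word_vectors: torch.Tensor) -> torch.Tensor:
        # word_vectors: (batch, seq_len, word_dim) FastText vectors of one sentence
        outputs, _ = self.bilstm(word_vectors)   # (batch, seq_len, sent_dim)
        return outputs.mean(dim=1)               # mean-pool into one sentence vector

encoder = SentenceEncoder()
words = torch.randn(1, 6, 50)                    # 6 illustrative 50-d word vectors
sentence_vector = encoder(words)                 # (1, 512) semantic-view node feature
```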
2. Construction of multi-view information fusion module
Traditional scene graph generation methods that reason only over low-level visual information have difficulty accurately representing the semantics of education-field images, so a multi-view information fusion method is designed that fuses the targets in the image with the interaction information between the numbers and objects appearing in it, capturing the image semantics from multiple aspects.
2.1 fusion of semantic views and object views
Using the node information in the semantic view, a fully connected network between the semantic view and the object view is constructed with a graph convolutional network, and the node information in the object view is updated. For a node o_i in the object view, the feature vector update formula is:
o_i^(l+1) = σ( b^(l) + Σ_{j∈N(i)} (e_ji / c_ji) · W^(l) s_j^(l) ),  (2-1)
where N(i) is the set of neighbors of node o_i, b denotes the bias of the model, W denotes the parameters of the model, and c_ji, the normalization term based on the node degrees, is calculated by the following formula.
c_ji = sqrt(|N(j)|) · sqrt(|N(i)|),  (2-2)
σ denotes the ReLU activation function, i.e.
f(x) = max(0, x),  (2-3)
e_ji denotes the weight from node o_j in N(i) to node o_i and can be calculated by the following formula:
[Formula (2-4), which combines the similarity ρ[s_i, o_j] with the distance between the two objects, appears as an image in the original publication.]
ρ[s_i, o_j] denotes the similarity between s_i and o_j, determined by the categories and visual features of s_i and o_j; it is calculated by the following formula:
ρ[s_i, o_j] = f_s( (f_l(s_i) · f_l(o_j)) || (f_fe(s_i) · f_fe(o_j)) ),  (2-5)
f_l(·) is a class encoder that obtains the word vector of each class via FastText; f_fe(·) is a visual feature mapper that maps the high-dimensional visual features into a low-dimensional space; f_s(·) is a similarity encoder that computes the degree of similarity between two nodes.
distance(s_i, o_j) denotes the distance between the two objects; the farther apart two objects are, the less they influence each other. "||" denotes the concatenation operation, which concatenates the distance between the objects with their similarity.
[Formula (2-6), which normalizes the weight from s_i to o_j at node o_j, appears as an image in the original publication.]
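The fusion step can be pictured with the short sketch below. It is a plain PyTorch rendering of a weighted graph-convolution update using the edge weights e_ji and the degree normalization c_ji described above, written under the assumption that formula (2-1) has the standard weighted GCN form; the exact formula is only available as an image in the original.

```python
# Sketch of the semantic-view -> object-view fusion as a weighted graph convolution.
# The update rule is an assumed standard form; dimensions are illustrative.
import torch
import torch.nn as nn

class ViewFusionLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim)   # model parameters W and bias b

    def forward(self, sem_feats, edge_weight):
        # sem_feats:   (N_s, in_dim) semantic-view node features s_j
        # edge_weight: (N_o, N_s) weights e_ji from every semantic node to every object node
        deg_o = edge_weight.sum(dim=1, keepdim=True)       # degree of each object node o_i
        deg_s = edge_weight.sum(dim=0, keepdim=True)       # degree of each semantic node s_j
        c = torch.sqrt(deg_o) * torch.sqrt(deg_s)          # degree normalization c_ji
        messages = (edge_weight / c.clamp(min=1e-6)) @ self.W(sem_feats)
        return torch.relu(messages)                        # sigma = ReLU, new object-view features

fusion = ViewFusionLayer(in_dim=512, out_dim=512)
sem_feats = torch.randn(4, 512)                            # 4 semantic-view nodes
e = torch.rand(7, 4)                                       # weights e_ji for 7 object-view nodes
updated_obj_feats = fusion(sem_feats, e)                   # (7, 512)
```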
2.2 self-fusion of object views
After the features of all nodes in the object view have been updated, a fully connected network within the object view is constructed with the fusion method of step 2.1, and the object view updates itself. The feature vector update formula is:
[The feature-vector update formula for the object-view self-fusion appears as an image in the original publication; it follows the form of formula (2-1), with object-view nodes o_j in place of the semantic-view nodes.]
Except for e_ji, everything follows step 2.1; e_ji is obtained from a formula that appears as an image in the original publication.
distance(o_i, o_j) and ρ[o_i, o_j] are calculated exactly as in step 2.1.
3. Construction of visual scene graph generation module
3.1 semantic relationship-based Generation of visual scene graphs
Based on the object view with updated node features, the invention generates a visual scene graph based on semantic information. The nodes of this scene graph represent object regions and their class labels, and the edges represent the semantic interactions between visual objects. Specifically, the probability distribution of semantic interactions between nodes in the visual scene graph can be calculated by:
p(e_ij) = f_c( v̂_i || v̂_j || v_{i,j} ),  (3-1)
wherein "|" represents a splicing operation, will
Figure BDA0003858988520000102
And
Figure BDA0003858988520000103
respectively represent nodes v in object views i And v j Updated feature vector, v i,j Is a visual feature corresponding to a region covering the ith and jth objects simultaneously, f c Is a semantic interaction classifier and interaction categories include content, of, leftchild, rightchild, next, and hold.
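The probability distribution of formula (3-1) can be pictured with the following sketch, which assumes the classifier f_c is a small MLP over the concatenation of the two updated node features and the union-region feature; the layer sizes and feature dimensions are assumptions.

```python
# Sketch of the semantic interaction classifier f_c over concatenated features.
import torch
import torch.nn as nn

RELATIONS = ["contain", "of", "leftchild", "rightchild", "next", "hold"]

class InteractionClassifier(nn.Module):
    def __init__(self, node_dim: int = 512, region_dim: int = 2048, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * node_dim + region_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, len(RELATIONS)),
        )

    def forward(self, v_i, v_j, v_ij):
        # v_i, v_j: updated feature vectors of the two object-view nodes
        # v_ij: visual feature of the region covering both objects
        logits = self.mlp(torch.cat([v_i, v_j, v_ij], dim=-1))   # "||" = concatenation
        return torch.softmax(logits, dim=-1)                      # distribution over relations
```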
For education-field images, multi-view fusion adds semantic information to the object view and thereby effectively reinforces the semantic relations implied by the objects, which improves model performance and largely resolves the insufficient reasoning ability caused by the sparse visual information of visual objects in education-field images.
Since the objects that stand in a relationship are relatively fixed in education-field images, the triples <subject, predicate, object> can be set in advance as prior knowledge and fed into the scene graph generation model. When predicting and classifying relations, the object view can be divided into subgraphs according to the node types produced by object detection; one edge type is selected at a time and prediction is performed within the corresponding subgraph, which avoids as much as possible the redundant computation over all nodes of the object view and greatly improves efficiency. The types of triples in the dataset and their numbers are shown in Fig. 3.
After the currently predicted edge type is selected, a fully connected positive-sample subgraph is constructed over the object types at the two ends of the edge, a negative-sample subgraph with the same number of associations is constructed over the whole graph, and score prediction is performed on the relations in both subgraphs; the score is obtained by the following formula:
score(o_i, e_ij, o_j) = f_sc(o_i || o_j * e_ij),  (3-2)
where f_sc is the score predictor and o_i || o_j denotes concatenating o_i and o_j as the input of f_sc.
The loss function of the score evaluation network for semantic relation prediction between scene graph nodes is defined by formula (3-3), which appears as an image in the original publication.
Because of the characteristics of education-field images, the number of associations is related to the connection types between nodes and to the number of nodes, so a top-k algorithm is designed that adaptively adjusts the number of selected top-scoring relations according to the node connections and the number of nodes; the corresponding formula (3-4), which appears as an image in the original publication, maps nums(o_i, o_j), the number of node pairs whose relation class is e_ij, to the number of top-scoring relations that are selected.
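To make the selection step concrete, the sketch below scores candidate (subject, edge, object) triples as in formula (3-2) and keeps an adaptively chosen number of them; since formulas (3-3) and (3-4) are only available as images, the adaptive rule shown here (k grows with the number of candidate pairs) is an assumption.

```python
# Sketch of score prediction and adaptive top-k selection for one edge type.
import torch
import torch.nn as nn

class ScorePredictor(nn.Module):
    def __init__(self, node_dim: int = 512, num_relations: int = 6):
        super().__init__()
        self.edge_embed = nn.Embedding(num_relations, node_dim)
        self.f_sc = nn.Linear(2 * node_dim, 1)   # score predictor f_sc

    def forward(self, o_i, o_j, edge_type):
        # score(o_i, e_ij, o_j) = f_sc(o_i || o_j * e_ij), as in formula (3-2)
        e_ij = self.edge_embed(edge_type)
        return self.f_sc(torch.cat([o_i, o_j * e_ij], dim=-1)).squeeze(-1)

def adaptive_topk(scores: torch.Tensor, num_pairs: int) -> torch.Tensor:
    """Keep the k best-scoring candidate relations; k scaling with num_pairs is an assumed rule."""
    k = max(1, min(scores.numel(), num_pairs // 2))
    return torch.topk(scores, k).indices
```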
After all kinds of relations are obtained, pruning is performed per relation type using existing prior knowledge (the number of occurrences of each relation in the samples and its semantic importance are analyzed jointly, and relations that occur rarely and have low importance are pruned, with the existing prior knowledge as the criterion), and the visual scene graph based on semantic relations is finally obtained.
3.2 Generation of visual scene graphs based on positional relationships
Based on the visual scene graph with semantic relations, the invention provides a method for generating a visual scene graph based on positional relations. The method takes all relations of the semantic-relation scene graph as the basis of position classification. For the positional interaction between objects, the Intersection over Union (IoU) of the bounding boxes of the visual objects is first used to judge whether the two objects have a containment or overlap relationship. If the IoU is less than 0.5, the fine-grained positional interaction category between the bounding boxes is judged by computing the distance and the angle between their center points; the fine-grained positional interaction categories comprise eight classes: above, below, left, right, upper-left, lower-left, upper-right and lower-right. The calculation process is as follows:
[The calculation formula appears as an image in the original publication.]
where f_p and f_l are the positional interaction category classifiers.
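A rule-based sketch of the positional classification follows; the IoU threshold of 0.5 comes from the text, while the exact mapping from angle to the eight direction classes is an assumption, since the calculation formula is only available as an image.

```python
# Sketch of the rule-based positional relation between two bounding boxes (xyxy format).
import math

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def position_relation(a, b):
    """Return the position of box a relative to box b (image y axis points down)."""
    if iou(a, b) >= 0.5:
        return "contain_or_overlap"
    cax, cay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    cbx, cby = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    angle = math.degrees(math.atan2(cby - cay, cax - cbx)) % 360  # flip y so "up" is positive
    directions = ["right", "upper-right", "above", "upper-left",
                  "left", "lower-left", "below", "lower-right"]
    return directions[int(((angle + 22.5) % 360) // 45)]
```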
3.3 implementation details
The invention adopts Faster R-CNN with ResNet50 as the backbone network. The visual information obtained from Faster R-CNN is a 2048-dimensional vector. The word vectors are 50-dimensional and the sentence vectors are 512-dimensional. When information is propagated, the bounding box, center point, length, width and aspect ratio of an object are assembled into a 9-dimensional normalized vector, which is concatenated with the visual information of the object. Multi-view fusion is performed with a two-layer graph convolutional network; the updated node representations are then fed into a two-layer MLP for prediction, which outputs a one-dimensional vector.
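The 9-dimensional geometry vector mentioned above could be assembled as in the sketch below; the exact choice and ordering of the nine components is an assumption, since the text only lists the quantities (bounding box, center point, length, width, aspect ratio) without giving their order.

```python
# Sketch: a normalized 9-d geometry vector, concatenated with the 2048-d visual feature.
import torch

def geometry_vector(box, img_w, img_h):
    x1, y1, x2, y2 = box
    w, h = (x2 - x1) / img_w, (y2 - y1) / img_h
    cx, cy = (x1 + x2) / (2 * img_w), (y1 + y2) / (2 * img_h)
    return torch.tensor([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h,
                         cx, cy, w, h, w / (h + 1e-6)])

visual_feature = torch.randn(2048)                       # stand-in for the Faster R-CNN feature
box, img_w, img_h = (40.0, 30.0, 120.0, 90.0), 640, 480  # illustrative values
node_input = torch.cat([visual_feature, geometry_vector(box, img_w, img_h)])  # 2057-d node input
```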
Training is performed in two separately trained stages. First the Faster R-CNN object detection network and the OCR network are trained on the dataset, and then the scene graph generation model is trained with the visual and semantic information output by these two networks. SGD is used as the optimizer with the learning rate set to 0.003.
3.4 quantitative assessment and analysis
The model is evaluated quantitatively with the recall rate. Because the model is designed only for education-field images and is not compared with other methods, the quantitative evaluation is only used to judge the feasibility of the model.
The recall rate (Recall@K, R@K) is the proportion of the relation ground truth found among the K classification results with the highest confidence over the whole graph. K is set to 30 in the quantitative experiments, and the recall is computed separately for each image category; the results are shown in Table 1 (a minimal sketch of the computation follows the table).
Table 1. Recall (R@30) for different image categories
Categories        R@30
Array-list        77.8
Binary-tree       64.6
Deadlock          53.0
Directed-graph    60.1
Flowchart         51.7
Linked-List       73.8
Logic-circuit     44.8
Network-topology  48.5
Non-binary-tree   68.0
Queue             69.9
Stack             57.9
Undirected-graph  65.3
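The R@K values in Table 1 can be computed as in the sketch below; this is an illustrative implementation, not code from the patent.

```python
# Sketch of Recall@K over predicted relation triples ranked by confidence.
def recall_at_k(predicted, ground_truth, k=30):
    """predicted: list of (triple, confidence) pairs; ground_truth: set of true triples."""
    top_k = {t for t, _ in sorted(predicted, key=lambda p: p[1], reverse=True)[:k]}
    return len(top_k & ground_truth) / len(ground_truth) if ground_truth else 0.0
```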
According to Table 1, the recall of Array-list, Linked-List and Queue is relatively high, while the recall of Logic-circuit and Network-topology is relatively low. The lowest, Logic-circuit, has a recall of 44.8%, which is presumably related to the sparseness and clarity of the image information in the dataset and to the prior knowledge fed into the model in advance. The prior knowledge also limits the generalization of the model to some extent: in Network-topology, for example, the target shapes in the image are very diverse and prone to detection errors, and when an object class in the image does not match the expectation of the prior knowledge, generation of the corresponding relation is suppressed.
3.5 visualization result evaluation and analysis
Array-list (Fig. 4), which performs well, and FlowChart (Fig. 5), which performs poorly, are selected as examples for analyzing the visualized results of the two generated scene graphs.
As can be seen from Fig. 5, FlowChart produces a large object-detection error: the oval-shaped terminal is identified as a process of the Deadlock category. Fig. 6 compares the two types of pictures.
This phenomenon is not limited to these two categories: the process of the Deadlock category and the node of the Binary-tree category, the element of Array-list and the node of several other categories, and the NAND gate and the AND gate of Network-topology, among others, are easily misidentified, because the differences in the low-level visual features of the various node types in the dataset are not significant. The influence on the scene graph generation task is nevertheless acceptable: in the generated visual scene graphs based on semantic relations and on positional relations, the relations are relatively complete and their accuracy is also guaranteed.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims (7)

1. A method for generating an image scene graph in an education field based on multi-view information fusion is characterized by comprising the following steps:
Step 1: multi-view construction
Step 1.1: object view construction: identify the category and position coordinates of each object in the image with a Faster R-CNN object detector, construct a visual information encoding of each identified object with a convolutional neural network, combine the encoding with the object, and build a fully connected object view whose nodes are the objects of the visual graph;
Step 1.2: semantic view construction: first recognize the text information in the image at fine granularity based on OCR and unsupervised semantic segmentation, then perform weighted fusion of the context information of different categories to increase the reasoning ability of the network, and finally form a fully connected semantic view;
Step 2: construction of the multi-view information fusion module
Step 2.1: fusion of the semantic view and the object view: using the node information in the semantic view, build a fully connected network between the semantic view and the object view with a graph convolutional network and update the node information in the object view, so as to obtain the semantic information between objects;
Step 2.2: self-fusion of the object view: after the semantic view has been fused, build a fully connected network within the object view with the fusion method of step 2.1 and let the object view update itself;
Step 3: construction of the visual scene graph generation module
Step 3.1: generation of the visual scene graph based on semantic relations: after the node features of the object view have been updated, generate the visual scene graph based on semantic relations by computing the probability distribution of semantic interactions between nodes; the nodes of this scene graph represent object regions and their class labels, and the edges represent the semantic interaction classes between visual objects;
Step 3.2: generation of the visual scene graph based on positional relations: first use the Intersection over Union (IoU) of the bounding boxes of the visual objects to judge whether two objects have a containment or overlap relationship; if the IoU is less than 0.5, judge the fine-grained positional interaction category between the bounding boxes by computing the distance and the angle between their center points; the fine-grained positional interaction categories comprise eight classes: above, below, left, right, upper-left, lower-left, upper-right and lower-right.
2. The method for generating an image scene graph of an educational field based on multi-view information fusion according to claim 1, wherein: the fusion method of the semantic view and the object view in step 2.1 specifically comprises the following steps:
using the node information in the semantic view, a fully connected network between the semantic view and the object view is constructed with a graph convolutional network, and the node information in the object view is updated; for a node o_i in the object view, the feature vector update formula is:
o_i^(l+1) = σ( b^(l) + Σ_{j∈N(i)} (e_ji / c_ji) · W^(l) s_j^(l) ),  (2-1)
where N(i) is the set of neighbors of node o_i, b denotes the bias of the model, W denotes the parameters of the model, l denotes the number of update iterations of the node information in the object view, and c_ji denotes the normalization term based on the node degrees;
c_ji = sqrt(|N(j)|) · sqrt(|N(i)|),  (2-2)
σ denotes the ReLU activation function, i.e.
f(x) = max(0, x),  (2-3)
e_ji denotes the weight from node o_j in N(i) to node o_i, which can be calculated by the following formula:
[Formula (2-4), which combines the similarity ρ[s_i, o_j] with the distance between the two objects, appears as an image in the original publication.]
ρ[s_i, o_j] denotes the similarity between s_i and o_j, determined by the categories and visual features of s_i and o_j; it is calculated by the following formula:
ρ[s_i, o_j] = f_s( (f_l(s_i) · f_l(o_j)) || (f_fe(s_i) · f_fe(o_j)) ),  (2-5)
f_l(·) is a class encoder that obtains the word vector of each class via FastText; f_fe(·) is a visual feature mapper that maps the high-dimensional visual features into a low-dimensional space; f_s(·) is a similarity encoder that computes the degree of similarity between two nodes;
distance(s_i, o_j) denotes the distance between the two objects; the farther apart two objects are, the weaker their mutual influence; "||" denotes the concatenation operation, which concatenates the distance between the objects with their similarity;
[Formula (2-6), which normalizes the weight from s_i to o_j at node o_j, appears as an image in the original publication.]
3. The method for generating an image scene graph of an educational field based on multi-view information fusion according to claim 2, wherein: the self-fusion method of the object view in the step 2.2 specifically comprises the following steps:
after the features of all nodes in the object view have been updated, a fully connected network within the object view is constructed with the fusion method of step 2.1, and the object view updates itself; the feature vector update formula is:
[The feature-vector update formula for the object-view self-fusion appears as an image in the original publication; it follows the form of formula (2-1), with object-view nodes o_j in place of the semantic-view nodes.]
except for e_ji, everything follows step 2.1; e_ji is obtained from a formula that appears as an image in the original publication;
distance(o_i, o_j) and ρ[o_i, o_j] are calculated exactly as in step 2.1.
4. The method for generating an image scene graph of an educational field based on multi-view information fusion according to claim 3, wherein: the method for generating the visual scene graph based on the semantic relationship, which is described in the step 3.1, specifically comprises the following steps:
the probability distribution of semantic interactions between nodes in a visual scene graph can be calculated by:
p(e_ij) = f_c( v̂_i || v̂_j || v_{i,j} ),  (4-1)
wherein "|" represents a splicing operation, will
Figure FDA0003858988510000034
And
Figure FDA0003858988510000035
respectively represent nodes v in object views i And v j Updated feature vector, v i,j Is a visual feature corresponding to a region covering the ith and jth objects simultaneously, f c Is a semantic interaction classifier, and the interaction categories include continin, of, leftchild, rightchild, next and hold;
after the currently predicted edge type is selected, a fully connected positive-sample subgraph is constructed over the object types at the two ends of the edge, a negative-sample subgraph with the same number of associations is constructed over the whole graph, and score prediction is performed on the relations in both subgraphs; the score is obtained by the following formula:
score(o_i, e_ij, o_j) = f_sc(o_i || o_j * e_ij),  (4-2)
where f_sc is the score predictor and o_i || o_j denotes concatenating o_i and o_j as the input of f_sc;
the loss function of the score evaluation network for semantic relation prediction between scene graph nodes is defined by formula (4-3), which appears as an image in the original publication;
because of the characteristics of education-field images, the number of associations is related to the connection types between nodes and to the number of nodes, so a top-k algorithm is designed that adaptively adjusts the number of selected top-scoring relations according to the node connections and the number of nodes; the corresponding formula (4-4), which appears as an image in the original publication, maps nums(o_i, o_j), the number of node pairs whose relation class is e_ij, to the number of top-scoring relations that are selected;
after all kinds of relations are obtained, pruning is performed per relation type using existing prior knowledge, and the visual scene graph based on semantic relations is finally obtained.
5. The method for generating the image scene graph of the education domain based on the multi-view information fusion as claimed in claim 4, wherein: the method for generating the visual scene graph based on the position relationship, which is described in the step 3.2, specifically comprises the following steps:
taking all relations of the visual scene graph that has been given semantic relations as the basis of position classification; for the positional interaction between objects, first use the Intersection over Union (IoU) of the bounding boxes of the visual objects to judge whether the two objects have a containment or overlap relationship; if the IoU is less than 0.5, judge the fine-grained positional interaction category between the bounding boxes by computing the distance and the angle between their center points, the fine-grained positional interaction categories comprising eight classes: above, below, left, right, upper-left, lower-left, upper-right and lower-right; the calculation process is as follows:
[The calculation formula appears as an image in the original publication.]
where f_p and f_l are the positional interaction category classifiers.
6. A computer system, comprising: one or more processors and a computer-readable storage medium storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of claim 1.
7. A computer-readable storage medium storing computer-executable instructions which, when executed, implement the method of claim 1.
CN202211156523.6A 2022-09-22 2022-09-22 Education field image scene graph generation method based on multi-view information fusion Pending CN115761036A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211156523.6A CN115761036A (en) 2022-09-22 2022-09-22 Education field image scene graph generation method based on multi-view information fusion


Publications (1)

Publication Number Publication Date
CN115761036A 2023-03-07

Family

ID=85351822



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination