CN115952794A

CN115952794A - Chinese-Thai cross-language sensitive information recognition method fusing a bilingual sensitive dictionary and a heterogeneous graph

Info

Publication number: CN115952794A
Application number: CN202211373435.1A
Authority: CN
Inventors: 余正涛; 朱栩冉; 张亚飞
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2022-11-04
Filing date: 2022-11-04
Publication date: 2023-04-11

Abstract

The invention relates to a Chinese-Thai cross-language sensitive information recognition method fusing a bilingual sensitive dictionary and a heterogeneous graph, which addresses the difficulty of aligning bilingual sensitive words in Chinese-Thai sensitive information recognition. First, a Chinese-Thai bilingual sensitive dictionary is constructed from Wikipedia and social media sensitive data. Then, documents, the keywords they contain, and the sensitive words they contain are used as nodes, and bilingual alignment relations, similarity relations, and part-of-speech relations are used as edges to construct a Chinese-Thai cross-language heterogeneous graph, which strengthens the alignment between the sensitive features of the documents and the bilingual sensitive features; a multilingual pre-trained model is used to represent the document nodes and word nodes. Finally, the input documents are encoded by a multi-layer graph convolutional neural network, and a sensitive information classifier predicts the class of each document. Experimental results show that the proposed model outperforms general cross-language text classification methods on the Chinese-Thai cross-language sensitive information recognition task.

Description

Chinese-Thai cross-language sensitive information recognition method fusing a bilingual sensitive dictionary and a heterogeneous graph

Technical Field

The invention relates to a Chinese-Thai cross-language sensitive information recognition method fusing a bilingual sensitive dictionary and a heterogeneous graph, and belongs to the field of natural language processing.

Background

Cross-language sensitive information recognition can be viewed as a domain-specific cross-language text classification task. In social media data, however, sensitive words are highly varied and often appear as rare words or aliases, so general cross-language text classification methods perform poorly on the cross-language sensitive information recognition task.

In social media text data, one of the core problems of sensitive information recognition is how to identify the sensitive features present in the data. Traditional cross-language classification methods generally rely on bilingual aligned resources, such as bilingual dictionaries (Balamurali et al., 2012; Barnes et al., 2018) or parallel corpora (Zhou et al., 2016; Xu et al., 2017), but in low-resource languages they often face scarce labeled data and a lack of large-scale training data sets. Most currently common cross-language text classification methods learn shared representations across languages, including bilingual word embeddings (Ziser et al., 2018). Most of these studies address text classification in general domains, such as sentiment classification. In the cross-language sensitive information recognition task, however, sensitive words in Chinese-Thai social media data are expressed in diverse ways, and bilingual sensitive words are difficult to recognize and align.

To address the difficulty of recognizing and aligning bilingual sensitive words caused by the diverse expression of sensitive words in Chinese-Thai social media data, a Chinese-Thai cross-language sensitive information recognition method fusing a bilingual sensitive dictionary and a heterogeneous graph is proposed. The method builds a cross-language heterogeneous graph from the sensitive word alignment information in the bilingual sensitive dictionary, which strengthens cross-language transfer learning. First, a Chinese-Thai bilingual sensitive dictionary is constructed from Wikipedia and social media sensitive data. Then, documents, the keywords they contain, and sensitive words are used as nodes, and bilingual alignment relations, similarity relations, and part-of-speech relations are used as edges to construct a Chinese-Thai cross-language heterogeneous graph, which strengthens the alignment between the sensitive features of the documents and the bilingual sensitive features; a multilingual pre-trained model is used to represent the document nodes and word nodes. Finally, the input documents are encoded by a multi-layer graph convolutional neural network, and a sensitive information classifier predicts the class of each document.

Disclosure of Invention

The invention provides a Chinese-Thai cross-language sensitive information recognition method fusing a bilingual sensitive dictionary and a heterogeneous graph. It addresses the difficulty of recognizing and aligning bilingual sensitive words caused by their diverse expression in Chinese-Thai cross-language sensitive information recognition, overcomes the shortcomings of general methods, and improves the performance of Chinese-Thai cross-language sensitive information recognition.

The technical scheme of the invention is as follows: in the Chinese-Thai cross-language sensitive information recognition method fusing a bilingual sensitive dictionary and a heterogeneous graph, a Chinese-Thai bilingual sensitive dictionary is first constructed from Wikipedia and social media sensitive data; then documents, the keywords they contain, and the sensitive words they contain are used as nodes, and bilingual alignment relations, similarity relations, and part-of-speech relations are used as edges to construct a Chinese-Thai cross-language heterogeneous graph, which strengthens the alignment between the sensitive features of the documents and the bilingual sensitive features, with a multilingual pre-trained model used to represent the document nodes and word nodes; finally, the input documents are encoded by a multi-layer graph convolutional neural network, and a sensitive information classifier predicts the class of each document.

As a further scheme of the invention, the method comprises the following specific steps:

Step1: Write a web crawler in Python to collect and organize publicly available multilingual text data from the web, clean the data, and construct a Chinese-Thai cross-language sensitive information data set;

Step2: Compile statistics on the processed data to obtain the bilingual sensitive words in the different sensitive categories, group bilingual sensitive words with similar meanings into word groups, and construct the Chinese-Thai bilingual sensitive dictionary;

Step3: Use the sensitive words in the bilingual sensitive dictionary and the keywords in the documents as word nodes and the documents as document nodes; use the alignment and similarity relations between documents, the part-of-speech relations of the keywords and sensitive words, and the bilingual sensitive word alignment relations from the Chinese-Thai bilingual sensitive dictionary as edges, forming the Chinese-Thai cross-language heterogeneous graph;

Step4: Represent all nodes with a multilingual pre-trained model to capture the global and contextual information of the documents, pass the resulting node representations to the graph convolutional neural network, perform graph convolution on the different subgraphs built from the different edge types, and aggregate the features of the different nodes; then feed the resulting document node features into the sensitive information classifier to obtain the sensitive information prediction result.

As a further scheme of the invention, the Step1 comprises the following specific steps:

Step1.1: Using web crawler technology, 158,638 items of publicly available multilingual text data are collected and organized from the web, comprising 15,798 items of Wikipedia data, 46,119 items of microblog data, and 96,721 items of Twitter data. Text that is neither Chinese nor Thai is then filtered out by a language identification method; emoticons, symbols, hyperlinks, and similar elements are removed using an emoji package and regular expressions; and data cleaning is completed by manual screening and sorting. Finally, the Chinese and Thai text data are labeled with sensitive category labels according to their sensitive features, yielding a class-labeled Chinese-Thai cross-language sensitive information recognition data set.
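The cleaning in Step1.1 can be illustrated with a minimal Python sketch. The source does not give its regular expressions or name its language identification method, so everything below is an assumption: `URL_RE`, `EMOJI_RE`, `detect_language`, `clean_text`, and `build_dataset` are hypothetical names, the emoji ranges are a rough stand-in for the emoji package, and the script-counting heuristic stands in for a real language identifier.

```python
import re

# Assumed patterns; the source does not list the expressions it used.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Rough emoji and pictograph ranges (an approximation, not the emoji package).
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\u2600-\u27BF\U0001F1E6-\U0001F1FF\uFE0F\u200D]"
)
HAN_RE = re.compile(r"[\u4e00-\u9fff]")
THAI_RE = re.compile(r"[\u0E00-\u0E7F]")


def detect_language(text):
    """Return 'zh', 'th', or None using a simple script-count heuristic."""
    han = len(HAN_RE.findall(text))
    thai = len(THAI_RE.findall(text))
    if han == 0 and thai == 0:
        return None
    return "zh" if han >= thai else "th"


def clean_text(text):
    """Strip hyperlinks and emoji, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def build_dataset(raw_posts):
    """Keep only Chinese or Thai posts as (language, cleaned_text) pairs."""
    dataset = []
    for post in raw_posts:
        cleaned = clean_text(post)
        lang = detect_language(cleaned)
        if lang is not None and cleaned:
            dataset.append((lang, cleaned))
    return dataset
```

Manual screening and sensitive-category labeling, which the method performs after this stage, are not modeled here.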

As a further scheme of the invention, the Step2 comprises the following specific steps:

Step2.1: Based on Wikipedia and social media sensitive data, the sensitive words contained in the data are identified manually with the help of machine translation; the bilingual sensitive words in the different sensitive categories are obtained through statistics; bilingual sensitive words with similar meanings are grouped into word groups; and the Chinese-Thai bilingual sensitive word alignment relations are established, thereby constructing the Chinese-Thai bilingual sensitive dictionary.
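One possible in-memory layout for such a dictionary is sketched below in Python. The entry format, the placeholder strings, and the function names `build_sensitive_dictionary` and `aligned_words` are illustrative assumptions; the source does not describe how the dictionary is stored.

```python
# Hypothetical entry format: (category, Chinese variants, Thai variants).
# Placeholder strings stand in for real dictionary entries.
ENTRIES = [
    ("category_a", ["zh_word_1", "zh_alias_1"], ["th_word_1"]),
    ("category_b", ["zh_word_2"], ["th_word_2", "th_alias_2"]),
]


def build_sensitive_dictionary(entries):
    """Store each word group and index every word to its group id."""
    groups = []          # group id -> {"category", "zh", "th"}
    word_to_group = {}   # (language, word) -> group id
    for category, zh_words, th_words in entries:
        gid = len(groups)
        groups.append(
            {"category": category, "zh": list(zh_words), "th": list(th_words)}
        )
        for word in zh_words:
            word_to_group[("zh", word)] = gid
        for word in th_words:
            word_to_group[("th", word)] = gid
    return groups, word_to_group


def aligned_words(groups, word_to_group, lang, word):
    """Return the other-language words aligned with `word`, or []."""
    gid = word_to_group.get((lang, word))
    if gid is None:
        return []
    other = "th" if lang == "zh" else "zh"
    return list(groups[gid][other])
```

Keeping aliases and rare spellings in the same group as the common form lets a lookup on any variant return the full set of aligned words in the other language.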

As a further aspect of the present invention, step3 includes:

Step3.1: The documents of the Chinese-Thai cross-language sensitive information text data set, the words co-occurring in those documents, and the sensitive words of the bilingual sensitive dictionary are used as nodes to construct the Chinese-Thai cross-language heterogeneous graph structure. Different relation types exist between documents, between documents and words, and between sensitive words: translation and similarity relations between documents, part-of-speech relations between documents and words, and part-of-speech relations between sensitive words.

Step3.2: Document-document edges: To capture the semantic information contained in the documents and enable better cross-language transfer learning between Chinese and Thai documents, two types of document relation edges are defined. First, translation edges are constructed between corresponding Chinese and Thai documents based on the pseudo-parallel corpus produced by machine translation. Second, after the Chinese and Thai documents are given vector representations by the multilingual pre-trained model, the similarity between documents is computed from the document vectors. For a document vector $A = (x_1, x_2, \ldots, x_l)$ and a document vector $B = (y_1, y_2, \ldots, y_l)$, the similarity $S$ is obtained by cosine similarity, as shown in formula (1):

$$S = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{l} x_i y_i}{\sqrt{\sum_{i=1}^{l} x_i^{2}}\,\sqrt{\sum_{i=1}^{l} y_i^{2}}} \qquad (1)$$

For each document, the k documents with the highest similarity S are selected, and similarity edges are constructed between the corresponding document nodes; k is set to 3;
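The similarity edges described above can be sketched in Python as follows. The function names are hypothetical, document vectors are plain lists of floats rather than outputs of an actual multilingual pre-trained model, and ties are broken by document index, which the source does not specify.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_edges(doc_vectors, k=3):
    """Link each document i to its k most similar documents j (j != i)."""
    edges = []
    for i, vec in enumerate(doc_vectors):
        scored = [
            (cosine_similarity(vec, other), j)
            for j, other in enumerate(doc_vectors)
            if j != i
        ]
        # Highest similarity first; ties broken by the smaller index.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        edges.extend((i, j) for _, j in scored[:k])
    return edges
```

With k = 3 as in the text, every document node receives three outgoing similarity edges whenever at least three other documents exist.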

Step3.3: Document-word edges: The most evident relation between documents and words is co-occurrence. Words have different parts of speech and carry important grammatical information, and for sensitive information tasks, adjectives, nouns, and verbs may all carry sensitive information. The constructed bilingual sensitive dictionary is used to assist a word segmentation tool in segmenting the words in each document accurately; POS-Tagger is then used to tag the words with their parts of speech, and words with different parts of speech are connected to the documents in which they occur through part-of-speech relations, forming edges of different types.
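A minimal Python sketch of the typed document-word edges follows. It assumes the documents have already been segmented and tagged; the tag names in `KEPT_TAGS`, the choice to keep only those three tags, and the function name `document_word_edges` are illustrative assumptions, not details given in the source.

```python
from collections import defaultdict

# Assumed tag names; the source does not list the tagger's tag set.
KEPT_TAGS = {"NOUN", "VERB", "ADJ"}


def document_word_edges(tagged_docs):
    """Group (doc_id, word) edges by part-of-speech tag.

    tagged_docs maps a document id to a list of (word, tag) pairs, assumed
    to come from dictionary-assisted segmentation followed by POS tagging.
    """
    edges = defaultdict(set)
    for doc_id, tokens in tagged_docs.items():
        for word, tag in tokens:
            if tag in KEPT_TAGS:
                edges[tag].add((doc_id, word))
    # One sorted edge list per edge type (one type per part of speech).
    return {tag: sorted(pairs) for tag, pairs in edges.items()}
```

Each key of the returned mapping corresponds to one edge type, so each can later be turned into its own subgraph.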

Step3.4: Word-word edges: For cross-language sensitive information recognition, sensitive words influence the prediction more strongly than other words, but the sensitive words in social media text can take several different parts of speech, and typically only a few sensitive words appear, often in obscure forms. Based on the constructed bilingual sensitive word list, for each sensitive word segmented from a document, the bilingual sensitive words with similar meanings are added as word nodes, and a graph structure is established through edges between the word nodes. This increases the weight of the sensitive information in the document and performs Chinese-Thai cross-language word-level alignment and aggregation.
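The word-word edges can be sketched in Python under a few assumptions: each dictionary word group is a set of `(language, word)` pairs, a group is activated when any of its members was segmented from a document, and every pair of members in an activated group is connected. The function name and the fully connected layout are hypothetical choices; the source only states that words with similar meanings are linked.

```python
from itertools import combinations


def word_word_edges(groups, observed):
    """Connect all members of every group that has an observed member.

    groups   : iterable of sets of (language, word) pairs
    observed : set of (language, word) pairs segmented from the documents
    """
    edges = set()
    for group in groups:
        members = sorted(group)
        if not any(member in observed for member in members):
            continue  # no document mentions this group
        for a, b in combinations(members, 2):
            edges.add((a, b))
    return sorted(edges)
```

Because unobserved aliases in an activated group also become nodes, a document that contains only one variant is still linked to the aligned words in the other language.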

As a further scheme of the invention, the Step4 comprises the following steps:

Step4.1: A heterogeneous graph $G_F$ is constructed for the relations contained in the Chinese-Thai cross-language sensitive information text data set $F$. $G_F$ is composed of subgraphs of different types, each subgraph $G = (V, E)$, where $V$ denotes the defined nodes, including document nodes and word nodes, and $E$ denotes the edges of different types between nodes. All nodes are encoded with the multilingual pre-trained model, and the initial matrix $X \in \mathbb{R}^{n \times m}$ contains the n nodes and their features, where m is the dimension of the feature vectors and each row $X_v \in \mathbb{R}^{m}$ is the feature vector of node v. For each graph G, its adjacency matrix $A$ and degree matrix $D$ are used, where the elements of the degree matrix are $D_{ii} = \sum_{j1} A_{i\,j1}$, with $i$ and $j1$ indexing the rows and columns of the adjacency matrix $A$. With a trainable weight matrix $W^{(j)}$, the node feature matrix $H^{(1)} \in \mathbb{R}^{n \times m}$ of the first GCN layer of each subgraph is computed as:

$$H^{(1)} = \sigma\left(D^{-\frac{1}{2}}\,\tilde{A}\,D^{-\frac{1}{2}}\,X\,W^{(0)}\right)$$

where $\sigma(\cdot)$ is an activation function, $\tilde{A} = A + I_N$ is the adjacency matrix of the undirected graph G with self-connections added, and $I_N$ is the identity matrix. A multi-layer GCN structure can capture higher-order neighborhood information, as shown below:

$$H^{(j+1)} = \sigma\left(D^{-\frac{1}{2}}\,\tilde{A}\,D^{-\frac{1}{2}}\,H^{(j)}\,W^{(j)}\right)$$

where $j$ denotes the index of the graph convolution layer and $H^{(0)} = X$. After the graph convolution operation is completed, the features of all the different types of subgraphs are aggregated into a common latent space; $\tau$ denotes the different subgraphs, which are aggregated to obtain the representation of the whole heterogeneous graph, and the information of the word nodes is aggregated into the document nodes;
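A minimal pure-Python sketch of one such layer over several subgraphs is given below. Several points are assumptions rather than details confirmed by the source: the per-subgraph results are combined by summation inside the activation, ReLU is used as the activation, node degrees are taken from $A + I_N$ so that nodes isolated in a given subgraph remain well defined, and each subgraph type $\tau$ has its own weight matrix. All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, with degrees taken from A + I."""
    n = len(adj)
    a_tilde = [
        [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)
    ]
    inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in a_tilde]
    return [
        [inv_sqrt[i] * a_tilde[i][j] * inv_sqrt[j] for j in range(n)]
        for i in range(n)
    ]


def relu(matrix):
    """Element-wise ReLU (an assumed choice of activation)."""
    return [[max(0.0, v) for v in row] for row in matrix]


def hetero_gcn_layer(subgraph_adjs, features, weights):
    """Propagate on each subgraph, sum over subgraph types, then activate.

    subgraph_adjs : one n x n adjacency matrix per edge type
    features      : n x m node feature matrix H^{(j)}
    weights       : one m x m' weight matrix per edge type
    """
    n, out_dim = len(features), len(weights[0][0])
    total = [[0.0] * out_dim for _ in range(n)]
    for adj, w in zip(subgraph_adjs, weights):
        propagated = matmul(matmul(normalize_adjacency(adj), features), w)
        total = [
            [t + p for t, p in zip(t_row, p_row)]
            for t_row, p_row in zip(total, propagated)
        ]
    return relu(total)
```

Stacking calls to `hetero_gcn_layer`, with the output of one call used as `features` for the next, gives the multi-layer structure; the rows that belong to document nodes are then the document features passed to the classifier.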

Step4.2: The document features $h$ obtained from the GCN layers are then passed through the LeakyReLU activation function into a fully connected layer to obtain the output. Finally, the normalized exponential function softmax is used to predict the class of each document node, giving predicted values for the different classes; the class with the highest predicted value is the predicted classification result. Specifically:

$$p = \mathrm{LeakyReLU}(h)$$

$$q = \mathrm{Linear}(p) = W_q\,p + b$$

$$\hat{y} = \mathrm{softmax}(q)$$

$$\tilde{y} = \arg\max(\hat{y})$$

where $\alpha$, the negative slope of LeakyReLU, is 0.01; $W_q$ and $b$ are the weight and bias of the fully connected layer; $q$ is the document feature output by the last fully connected layer of the model; $\hat{y}$ represents the probabilities of the document over the M categories; and $\tilde{y}$ represents the class predicted by the model for the document.
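The classifier can be illustrated with a short Python sketch. Only the LeakyReLU slope of 0.01, the linear layer, softmax, and the choice of the highest-scoring class come from the text; the function names and the list-based weight layout (one row per class) are assumptions made for illustration.

```python
import math


def leaky_relu(vec, alpha=0.01):
    """LeakyReLU with negative slope alpha (0.01 in the text)."""
    return [v if v > 0 else alpha * v for v in vec]


def linear(vec, weight, bias):
    """Compute W p + b, with weight given as one row per class."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(h, weight, bias):
    """Return (class probabilities, index of the predicted class)."""
    p = leaky_relu(h)
    q = linear(p, weight, bias)
    probs = softmax(q)
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return probs, predicted
```

For example, `classify(h, weight, bias)` with a weight matrix of M rows returns M probabilities that sum to 1 together with the index of the largest one.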

The beneficial effects of the invention are:

1. A Chinese-Thai bilingual sensitive dictionary and a Chinese-Thai cross-language sensitive information recognition data set are constructed; the sensitive words in the input text are expanded on the basis of the bilingual sensitive dictionary, which strengthens the sensitive features of the input text when it is represented by the multilingual pre-trained model.

2. A Chinese-Thai cross-language heterogeneous graph is constructed, with documents, keywords, and sensitive words as nodes, and with the alignment and similarity relations between documents, the part-of-speech relations of the keywords and sensitive words, and the bilingual sensitive word alignment relations from the Chinese-Thai bilingual sensitive dictionary as edges. This strengthens the association between the Chinese and Thai input text and the sensitive words, and addresses the large differences between the Chinese and Thai languages and the difficulty of aligning Chinese and Thai sensitive words.

3. The method uses a multi-layer graph convolutional neural network to aggregate Chinese-Thai cross-language information based on the different relations contained in the Chinese-Thai cross-language heterogeneous graph, which strengthens Chinese-Thai cross-language transfer learning and improves the performance of Chinese-Thai cross-language sensitive information recognition.

Drawings

Fig. 1 is a flow chart of the Chinese-Thai cross-language sensitive information recognition method fusing a bilingual sensitive dictionary and a heterogeneous graph according to the present invention.

Detailed Description

Example 1: As shown in Fig. 1, in the Chinese-Thai cross-language sensitive information recognition method fusing a bilingual sensitive dictionary and a heterogeneous graph, a model is trained on the constructed Chinese-Thai cross-language sensitive information data set as an example. The method specifically includes the following steps:

Step3: Use the sensitive words in the bilingual sensitive dictionary and the keywords in the documents as word nodes and the documents as document nodes; use the alignment and similarity relations between documents, the part-of-speech relations of the keywords and sensitive words, and the bilingual sensitive word alignment relations from the Chinese-Thai bilingual sensitive dictionary as edges, forming the Chinese-Thai cross-language heterogeneous graph;

Step4: Represent all nodes with a multilingual pre-trained model to capture the global and contextual information of the documents, pass the resulting node representations to the graph convolutional neural network, perform graph convolution on the different subgraphs built from the different edge types, and aggregate the features of the different nodes;

Step5: Feed the resulting document node features into the sensitive information classifier to obtain the sensitive information prediction result.

Step1.1: Using web crawler technology, 158,638 items of publicly available multilingual text data are collected and organized from the web, comprising 15,798 items of Wikipedia data, 46,119 items of microblog data, and 96,721 items of Twitter data. Text that is neither Chinese nor Thai is then filtered out by a language identification method; emoticons, symbols, hyperlinks, and similar elements are removed using an emoji package and regular expressions; and data cleaning is completed by manual screening and sorting. Finally, the Chinese and Thai text data are labeled with sensitive category labels according to their sensitive features, yielding a class-labeled Chinese-Thai cross-language sensitive information recognition data set;

As a further aspect of the present invention, step3 includes:

Step3.2: Document-document edges: To capture the semantic information contained in the documents and enable better cross-language transfer learning between Chinese and Thai documents, two types of document relation edges are defined. First, translation edges are constructed between corresponding Chinese and Thai documents based on the pseudo-parallel corpus produced by machine translation. Second, after the Chinese and Thai documents are given vector representations by the multilingual pre-trained model, the similarity between documents is computed from the document vectors. For a document vector $A = (x_1, x_2, \ldots, x_l)$ and a document vector $B = (y_1, y_2, \ldots, y_l)$, the similarity $S$ is obtained by cosine similarity, as shown in formula (1):

$$S = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{l} x_i y_i}{\sqrt{\sum_{i=1}^{l} x_i^{2}}\,\sqrt{\sum_{i=1}^{l} y_i^{2}}} \qquad (1)$$

For each document, the k documents with the highest similarity S are selected, and similarity edges are constructed between the corresponding document nodes; k is set to 3.

Step3.3: Document-word edges: The most evident relation between documents and words is co-occurrence. Words have different parts of speech and carry important grammatical information, and for sensitive information tasks, adjectives, nouns, and verbs may all carry sensitive information. The constructed bilingual sensitive dictionary is used to assist a word segmentation tool in segmenting the words in each document accurately; POS-Tagger is then used to tag the words with their parts of speech, and words with different parts of speech are connected to the documents in which they occur through part-of-speech relations, forming edges of different types.

As a further scheme of the invention, the Step4 comprises the following steps:

Step4.1: A heterogeneous graph $G_F$ is constructed for the relations contained in the Chinese-Thai cross-language sensitive information text data set $F$. $G_F$ is composed of subgraphs of different types, each subgraph $G = (V, E)$, where $V$ denotes the defined nodes, including document nodes and word nodes, and $E$ denotes the edges of different types between nodes. All nodes are encoded with the multilingual pre-trained model, and the initial matrix $X \in \mathbb{R}^{n \times m}$ contains the n nodes and their features, where m is the dimension of the feature vectors and each row $X_v \in \mathbb{R}^{m}$ is the feature vector of node v. For each graph G, its adjacency matrix $A$ and degree matrix $D$ are used, where the elements of the degree matrix are $D_{ii} = \sum_{j1} A_{i\,j1}$, with $i$ and $j1$ indexing the rows and columns of the adjacency matrix $A$. With a trainable weight matrix $W^{(j)}$, the node feature matrix $H^{(1)} \in \mathbb{R}^{n \times m}$ of the first GCN layer of each subgraph is computed as:

$$H^{(1)} = \sigma\left(D^{-\frac{1}{2}}\,\tilde{A}\,D^{-\frac{1}{2}}\,X\,W^{(0)}\right)$$

where $\sigma(\cdot)$ is an activation function, $\tilde{A} = A + I_N$ is the adjacency matrix of the undirected graph G with self-connections added, and $I_N$ is the identity matrix. A multi-layer GCN structure can capture higher-order neighborhood information, as follows:

$$H^{(j+1)} = \sigma\left(D^{-\frac{1}{2}}\,\tilde{A}\,D^{-\frac{1}{2}}\,H^{(j)}\,W^{(j)}\right)$$

where $j$ denotes the index of the graph convolution layer and $H^{(0)} = X$. After the graph convolution operation is completed, the features of all the different types of subgraphs are aggregated into a common latent space.

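$$p = \mathrm{LeakyReLU}(h)$$

$$q = \mathrm{Linear}(p) = W_q\,p + b$$

$$\hat{y} = \mathrm{softmax}(q)$$

$$\tilde{y} = \arg\max(\hat{y})$$

where $\alpha$, the negative slope of LeakyReLU, is 0.01; $W_q$ and $b$ are the weight and bias of the fully connected layer; $q$ is the document feature output by the last fully connected layer of the model; $\hat{y}$ represents the probabilities of the document over the M categories; and $\tilde{y}$ represents the class predicted by the model for the document.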

To illustrate the effect of the invention, three groups of comparison experiments were set up. The first group is the main experiment, which compares the invention with baseline models on the cross-language sensitive information recognition data set; the second group compares the effect of the invention with different multilingual pre-trained models; the third group compares the effect of the invention with different numbers of GCN layers.

(1) Results of the Main experiment

First, the invention is compared with several state-of-the-art general cross-language text classification models, all evaluated on the constructed Chinese-Thai cross-language sensitive information recognition data set. The results in Table 1 show that, on the cross-language sensitive information recognition task, the invention outperforms the other state-of-the-art models, including the baseline CLHG. The experiments show that the method can effectively identify the sensitive features contained in social media text data and classify the sensitive information.

Table 1: Results of different cross-language text classification methods

(2) Experimental results with different multilingual pre-trained models

To explore how the choice of multilingual pre-trained model used for representation affects model performance, a comparison experiment with different multilingual pre-trained models was performed. The experimental results are shown in Table 2: among the multilingual pre-trained models used to represent the nodes, XLM-R gives the best results, while mBERT and XLM perform less well.

Table 2: Effect of different multilingual pre-trained models on classification results

(3) Effect of the number of GCN convolutional layers on model performance

To explore how the number of GCN convolutional layers affects model performance, ablation experiments with 2, 3, 4, and 5 GCN layers were performed. The experimental results are shown in Table 3. The model achieves its best results with 3 convolutional layers; with fewer than 3 layers, the convolutional network has insufficient information aggregation capability and model performance is lower; with more than 3 layers, overall performance decreases as the number of layers increases.

Table 3: Effect of the number of convolutional layers on classification results

In summary, to address the difficulty of recognizing and aligning bilingual sensitive words caused by the diverse expression of sensitive words in Chinese-Thai social media data, the invention provides a Chinese-Thai cross-language sensitive information recognition method fusing a bilingual sensitive dictionary and a heterogeneous graph. A cross-language heterogeneous graph is constructed from the sensitive word alignment information in the bilingual sensitive dictionary, which strengthens cross-language transfer learning. The experiments show that the cross-language heterogeneous graph built on the bilingual sensitive dictionary can establish the alignment relations between bilingual sensitive words, so that the target language learns the sensitive features of the source language more effectively and cross-language transfer learning is strengthened.

While the present invention has been described in detail with reference to the embodiments, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims

1. A Chinese-Thai cross-language sensitive information recognition method fusing a bilingual sensitive dictionary and a heterogeneous graph, characterized in that: first, a Chinese-Thai bilingual sensitive dictionary is constructed from Wikipedia and social media sensitive data; then documents, the keywords they contain, and the sensitive words they contain are used as nodes, and bilingual alignment relations, similarity relations, and part-of-speech relations are used as edges to construct a Chinese-Thai cross-language heterogeneous graph, which strengthens the alignment between the sensitive features of the documents and the bilingual sensitive features, with a multilingual pre-trained model used to represent the document nodes and word nodes; finally, the input documents are encoded by a multi-layer graph convolutional neural network, and a sensitive information classifier predicts the class of each document.

2. The Chinese-Thai cross-language sensitive information recognition method fusing a bilingual sensitive dictionary and a heterogeneous graph according to claim 1, characterized in that the method comprises the following specific steps:

Step2: Compile statistics on the processed data to obtain the bilingual sensitive words in the different sensitive categories, group bilingual sensitive words with similar meanings into word groups, and construct the Chinese-Thai bilingual sensitive dictionary;

Step3: Use the sensitive words in the Chinese-Thai bilingual sensitive dictionary and the keywords in the documents as word nodes and the documents as document nodes; use the alignment and similarity relations between documents, the relations of the keywords and sensitive words, and the bilingual sensitive word alignment relations as edges, forming the Chinese-Thai cross-language heterogeneous graph;

Step4: Represent all nodes with a multilingual pre-trained model to capture the global and contextual information of the documents, pass the resulting node representations to the graph convolutional neural network, perform graph convolution on the different subgraphs built from the different edge types, and aggregate the features of the different nodes; then feed the resulting document node features into the sensitive information classifier to obtain the sensitive information prediction result.

3. The Chinese-Thai cross-language sensitive information recognition method fusing a bilingual sensitive dictionary and a heterogeneous graph according to claim 2, characterized in that the specific steps of Step1 are as follows:

Step1.1: Using web crawler technology, publicly available multilingual text data are collected and organized from the web; text that is neither Chinese nor Thai is then filtered out by a language identification method, emoticons, symbols, and hyperlinks are removed using an emoji package and regular expressions, and data cleaning is completed by manual screening and sorting; finally, the Chinese and Thai text data are labeled with sensitive category labels according to their sensitive features, yielding a class-labeled Chinese-Thai cross-language sensitive information recognition data set.

4. The Chinese-Thai cross-language sensitive information recognition method fusing a bilingual sensitive dictionary and a heterogeneous graph according to claim 2, characterized in that the specific steps of Step2 are as follows:

Step2.1: Based on Wikipedia and social media sensitive data, the sensitive words contained in the data are identified manually with the help of machine translation; the bilingual sensitive words in the different sensitive categories are obtained through statistics; bilingual sensitive words with similar meanings are grouped into word groups; and the Chinese-Thai bilingual sensitive word alignment relations are established, thereby constructing the Chinese-Thai bilingual sensitive dictionary.

5. The Chinese-Thai cross-language sensitive information recognition method fusing a bilingual sensitive dictionary and a heterogeneous graph according to claim 2, characterized in that Step3 comprises the following steps:

Step3.1: The documents of the Chinese-Thai cross-language sensitive information text data set, the words co-occurring in those documents, and the sensitive words of the bilingual sensitive dictionary are used as nodes to construct the Chinese-Thai cross-language heterogeneous graph structure, wherein different relation types exist between documents, between documents and words, and between sensitive words: translation and similarity relations between documents, part-of-speech relations between documents and words, and part-of-speech relations between sensitive words;

Step3.2: Document-document edges: to capture the semantic information contained in the documents and enable better cross-language transfer learning between Chinese and Thai documents, two types of document relation edges are defined; first, translation edges are constructed between corresponding Chinese and Thai documents based on the pseudo-parallel corpus produced by machine translation; second, after the Chinese and Thai documents are given vector representations by the multilingual pre-trained model, the similarity between documents is computed from the document vectors;

Step3.3: Document-word edges: the constructed bilingual sensitive dictionary is used to assist a word segmentation tool in segmenting the words in each document accurately; POS-Tagger is used to tag the words with their parts of speech; and words with different parts of speech are connected to the documents in which they occur through part-of-speech relations, forming edges of different types;

Step3.4: Word-word edges: based on the constructed bilingual sensitive word list, for each sensitive word segmented from a document, the bilingual sensitive words with similar meanings are added as word nodes, and a graph structure is established through edges between the word nodes, which increases the weight of the sensitive information in the document and performs Chinese-Thai cross-language word-level alignment and aggregation.

6. The Chinese-Thai cross-language sensitive information recognition method fusing a bilingual sensitive dictionary and a heterogeneous graph according to claim 2, characterized in that Step4 comprises the following steps:

Step4.1: A heterogeneous graph $G_F$ is constructed for the relations contained in the Chinese-Thai cross-language sensitive information text data set $F$; $G_F$ is composed of subgraphs of different types, each subgraph $G = (V, E)$, where $V$ denotes the defined nodes, including document nodes and word nodes, and $E$ denotes the edges of different types between nodes; all nodes are encoded with the multilingual pre-trained model, and the initial matrix $X \in \mathbb{R}^{n \times m}$ contains the n nodes and their features, where m is the dimension of the feature vectors and each row $X_v \in \mathbb{R}^{m}$ is the feature vector of node v; for each graph G, its adjacency matrix $A$ and degree matrix $D$ are used, where the elements of the degree matrix are $D_{ii} = \sum_{j1} A_{i\,j1}$, with $i$ and $j1$ indexing the rows and columns of the adjacency matrix $A$; with a trainable weight matrix $W^{(j)}$, the node feature matrix $H^{(1)} \in \mathbb{R}^{n \times m}$ of the first GCN layer of each subgraph is computed as:

$$H^{(1)} = \sigma\left(D^{-\frac{1}{2}}\,\tilde{A}\,D^{-\frac{1}{2}}\,X\,W^{(0)}\right)$$

where $\sigma(\cdot)$ is an activation function, $\tilde{A} = A + I_N$ is the adjacency matrix of the undirected graph G with self-connections added, and $I_N$ is the identity matrix; a multi-layer GCN structure captures higher-order neighborhood information, specifically:

$$H^{(j+1)} = \sigma\left(D^{-\frac{1}{2}}\,\tilde{A}\,D^{-\frac{1}{2}}\,H^{(j)}\,W^{(j)}\right)$$

$\tau$ denotes the different subgraphs, which are aggregated to obtain the representation of the whole heterogeneous graph, and the information of the word nodes is aggregated into the document nodes;

$$q = \mathrm{Linear}(p)$$

$$\hat{y} = \mathrm{softmax}(q)$$

$$\tilde{y} = \arg\max(\hat{y})$$

where $\hat{y}$ represents the probabilities of the document over the M categories, and $\tilde{y}$ represents the class predicted by the model for the document.