CN115952794A - Chinese-Thai cross-language sensitive information recognition method fusing bilingual sensitive dictionary and heterogeneous graph

Publication number
CN115952794A
Authority
CN
China
Prior art keywords
sensitive
bilingual
document
language
chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211373435.1A
Other languages
Chinese (zh)
Inventor
余正涛
朱栩冉
张亚飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN202211373435.1A
Publication of CN115952794A
Current legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to a Chinese-Thai cross-language sensitive information recognition method that fuses a bilingual sensitive dictionary and a heterogeneous graph, addressing the difficulty of aligning bilingual sensitive words in Chinese-Thai sensitive information recognition. First, a Chinese-Thai bilingual sensitive dictionary is constructed from Wikipedia and social media sensitive data. Then a Chinese-Thai cross-language heterogeneous graph is constructed, with the documents, the keywords they contain and the sensitive words they contain as nodes, and with bilingual alignment, similarity relations and different parts of speech as edges; this strengthens the alignment between the sensitive features of the documents and the bilingual sensitive features. A multilingual pre-trained model is used to represent the document nodes and the word nodes. Finally, the input documents are encoded by a multi-layer graph convolutional network, and a sensitive information classifier predicts the class of each document. Experimental results show that the proposed model outperforms general cross-language text classification methods on the Chinese-Thai cross-language sensitive information recognition task.

Description

Chinese-Thai cross-language sensitive information recognition method fusing bilingual sensitive dictionary and heterogeneous graph
Technical Field
The invention relates to a Chinese-Thai cross-language sensitive information recognition method fusing a bilingual sensitive dictionary and a heterogeneous graph, and belongs to the field of natural language processing.
Background
Cross-language sensitive information recognition can be viewed as a domain-specific cross-language text classification task. In social media data, however, sensitive words take many forms and often appear as rare words or alternative names, so general cross-language text classification methods perform poorly on the cross-language sensitive information recognition task.
In social media text data, one of the core problems of sensitive information recognition is how to identify the sensitive features present in the data. Traditional cross-language classification methods generally rely on bilingually aligned resources, such as bilingual dictionaries (Balamurali et al., 2012; Barnes et al., 2018) or parallel corpora (Zhou et al., 2016; Xu et al., 2017), but low-resource languages often have little labeled data and lack large-scale training data sets. Most current cross-language text classification methods learn a shared representation across languages, including bilingual word embeddings (Ziser et al., 2018). Most of these studies address text classification in general domains, such as sentiment classification. In the cross-language sensitive information recognition task, however, the sensitive words in Chinese-Thai social media data take diverse forms, and bilingual sensitive words are difficult to recognize and align.
To solve the problem that bilingual sensitive words are difficult to recognize and align because sensitive words in Chinese-Thai social media data take diverse forms, a Chinese-Thai cross-language sensitive information recognition method fusing a bilingual sensitive dictionary and a heterogeneous graph is proposed. A cross-language heterogeneous graph is built from the sensitive word alignment information of the bilingual sensitive dictionary, which strengthens cross-language transfer learning. First, a Chinese-Thai bilingual sensitive dictionary is constructed from Wikipedia and social media sensitive data. Then a Chinese-Thai cross-language heterogeneous graph is constructed, with documents, the keywords they contain and the sensitive words they contain as nodes, and with bilingual alignment, similarity relations and different parts of speech as edges; this strengthens the alignment between the sensitive features of the documents and the bilingual sensitive features. A multilingual pre-trained model is used to represent the document nodes and word nodes. Finally, the input documents are encoded by a multi-layer graph convolutional network, and a sensitive information classifier predicts the class of each document.
Disclosure of Invention
The invention provides a Chinese-Thai cross-language sensitive information recognition method fusing a bilingual sensitive dictionary and a heterogeneous graph. It addresses the problem that, in Chinese-Thai cross-language sensitive information recognition, bilingual sensitive words are difficult to recognize and align because they are expressed in diverse ways; it overcomes the shortcomings of general methods and improves the performance of Chinese-Thai cross-language sensitive information recognition.
The technical scheme of the invention is as follows. In the Chinese-Thai cross-language sensitive information recognition method fusing a bilingual sensitive dictionary and a heterogeneous graph, a Chinese-Thai bilingual sensitive dictionary is first constructed from Wikipedia and social media sensitive data. Then a Chinese-Thai cross-language heterogeneous graph is constructed, with documents, the keywords they contain and the sensitive words they contain as nodes, and with bilingual alignment, similarity relations and different parts of speech as edges; this strengthens the alignment between the sensitive features of the documents and the bilingual sensitive features, and a multilingual pre-trained model is used to represent the document nodes and word nodes. Finally, the input documents are encoded by a multi-layer graph convolutional network, and a sensitive information classifier predicts the class of each document.
As a further aspect of the invention, the method comprises the following specific steps:
Step1: A web crawler written in Python collects and organizes multilingual text data from the public web; the data are cleaned, and a Chinese-Thai cross-language sensitive information data set is constructed;
Step2: Statistics over the processed data yield the bilingual sensitive words in each sensitive category; bilingual sensitive words with similar senses are grouped into phrases, and a Chinese-Thai bilingual sensitive dictionary is constructed;
Step3: Sensitive words in the bilingual sensitive dictionary and keywords in the documents are used as word nodes, and the documents are used as document nodes. The alignment and similarity relations between documents, the different part-of-speech relations of the keywords and sensitive words, and the bilingual sensitive word alignment relations from the Chinese-Thai bilingual sensitive dictionary are used as edges to form the Chinese-Thai cross-language heterogeneous graph;
Step4: A multilingual pre-trained model is used to represent all nodes and to capture the global and contextual information of the documents. The node representations are passed to a graph convolutional network, graph convolution is performed on the different subgraphs built from the different edge types, and the features of the different nodes are aggregated. The resulting document node features are fed to a sensitive information classifier, which produces the sensitive information prediction.
As a further aspect of the invention, Step1 comprises the following specific steps:
Step1.1: Using web crawler technology, 158638 items of published multilingual text data are collected from the public web and organized: 15798 items of Wikipedia data, 46119 items of microblog data and 96721 items of Twitter data. Text that is neither Chinese nor Thai is then filtered out with a language identification method; emoticons, symbols, hyperlinks and the like are removed with an emoji package and regular expressions; and data cleaning is completed by manual screening and sorting. Finally, the Chinese and Thai texts are labeled with sensitive category labels according to their sensitive features, producing a Chinese-Thai cross-language sensitive information recognition data set with category labels.
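The cleaning part of Step1.1 can be illustrated with a minimal sketch. It is not the patent's implementation: the regular expressions, the function names `clean_text`, `detect_language` and `build_dataset`, and the script-counting heuristic that stands in for a real language identification method are all assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical patterns; the patent only states that a language identification
# method, an emoji package and regular expressions are used.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Anything that is not a CJK ideograph, a Thai character, an ASCII letter or
# digit, or whitespace is treated as an emoticon or symbol and dropped.
SYMBOL_RE = re.compile(r"[^\u4e00-\u9fff\u0e00-\u0e7fA-Za-z0-9\s]")
CJK_RE = re.compile(r"[\u4e00-\u9fff]")
THAI_RE = re.compile(r"[\u0e00-\u0e7f]")


def clean_text(text: str) -> str:
    """Remove hyperlinks, emoticons and symbols, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = SYMBOL_RE.sub(" ", text)
    return " ".join(text.split())


def detect_language(text: str) -> Optional[str]:
    """Crude script-based stand-in for a real language identifier."""
    cjk = len(CJK_RE.findall(text))
    thai = len(THAI_RE.findall(text))
    if cjk == 0 and thai == 0:
        return None  # neither Chinese nor Thai: filtered out
    return "zh" if cjk >= thai else "th"


def build_dataset(raw_posts):
    """Keep only cleaned Chinese or Thai posts, tagged with their language."""
    dataset = []
    for post in raw_posts:
        text = clean_text(post)
        lang = detect_language(text)
        if lang is not None:
            dataset.append({"lang": lang, "text": text})
    return dataset
```

Manual screening and sensitive-category labeling, which the step also describes, are outside the scope of this sketch.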
As a further aspect of the invention, Step2 comprises the following specific steps:
Step2.1: Based on Wikipedia and social media sensitive data, the sensitive words contained in the data are identified manually with the help of machine translation. Statistics yield the bilingual sensitive words in each sensitive category, bilingual sensitive words with similar senses are grouped into phrases, and the Chinese-Thai bilingual sensitive word alignment relations are established, thereby constructing the Chinese-Thai bilingual sensitive dictionary.
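One possible in-memory layout for such a dictionary is sketched below. The patent does not specify a data structure, so the triple format, the placeholder entries and the helper names `build_dictionary` and `aligned_words` are hypothetical.

```python
from collections import defaultdict

# Hypothetical entries: (category, Chinese word, Thai word). The placeholder
# strings stand in for real sensitive terms, which the patent does not list.
ALIGNED_PAIRS = [
    ("category_a", "zh_word_1", "th_word_1"),
    ("category_a", "zh_word_2", "th_word_1"),  # similar sense, same Thai word
    ("category_b", "zh_word_3", "th_word_2"),
]


def build_dictionary(pairs):
    """Index aligned sensitive words in both directions and by category."""
    zh_to_th = defaultdict(set)
    th_to_zh = defaultdict(set)
    category_of = {}
    for category, zh_word, th_word in pairs:
        zh_to_th[zh_word].add(th_word)
        th_to_zh[th_word].add(zh_word)
        category_of[zh_word] = category
        category_of[th_word] = category
    return zh_to_th, th_to_zh, category_of


def aligned_words(word, zh_to_th, th_to_zh):
    """Return the words in the other language that are aligned with `word`."""
    return zh_to_th.get(word, set()) | th_to_zh.get(word, set())
```

Keeping both directions makes it cheap to look up alignment partners for a word segmented from either a Chinese or a Thai document.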
As a further aspect of the invention, Step3 comprises:
Step3.1: The documents of the Chinese-Thai cross-language sensitive information text data set, the words co-occurring in those documents, and the sensitive words of the bilingual sensitive dictionary are used as nodes to construct the Chinese-Thai cross-language heterogeneous graph structure. Different relation types exist between documents, between documents and words, and between sensitive words: translation and similarity relations between documents, part-of-speech relations between documents and words, and part-of-speech relations between sensitive words.
Step3.2: Document-document edges: to capture the semantic information contained in the documents and to let Chinese and Thai documents support cross-language transfer learning more effectively, two types of document relation edges are defined. First, translation edges are constructed between Chinese documents and Thai documents based on the pseudo-parallel corpus produced by machine translation. Second, after the Chinese and Thai documents have been represented as vectors by a multilingual pre-trained model, the similarity between documents is computed from the document vectors. For a document vector $A = (x_1, x_2, \dots, x_l)$ and a document vector $B = (y_1, y_2, \dots, y_l)$, the similarity S is the cosine similarity given by formula (1):
$$S = \frac{\sum_{i=1}^{l} x_i y_i}{\sqrt{\sum_{i=1}^{l} x_i^2}\,\sqrt{\sum_{i=1}^{l} y_i^2}} \qquad (1)$$
For each document, the k documents with the highest similarity S are selected and similarity edges are constructed between the corresponding document nodes; k is set to 3;
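The similarity-edge rule of Step3.2 can be sketched in plain Python as follows. The function names are hypothetical, the toy vectors would in practice come from the multilingual pre-trained model, and breaking ties by document index is an assumption the patent does not address.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors, as in formula (1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def similar_edges(doc_vectors, k=3):
    """Return (i, j, S) for each document i and its k most similar documents j."""
    edges = []
    for i, vec_i in enumerate(doc_vectors):
        scored = [
            (cosine_similarity(vec_i, vec_j), j)
            for j, vec_j in enumerate(doc_vectors)
            if j != i
        ]
        # Highest similarity first; ties broken by the smaller document index.
        scored.sort(key=lambda item: (-item[0], item[1]))
        edges.extend((i, j, s) for s, j in scored[:k])
    return edges
```

This brute-force version compares every pair of documents, which is quadratic in the number of documents; for a corpus of the size described in Step1.1 an approximate nearest-neighbour index would likely be needed.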
Step3.3: Document-word edges: the most obvious relation between documents and words is co-occurrence. Words carry different parts of speech and important grammatical information, and for sensitive information tasks adjectives, nouns and verbs may all carry sensitive information. The constructed bilingual sensitive dictionary is used to help a word segmentation tool segment the words in each document accurately, POS-Tagger is used to tag each word with its part of speech, and words with different parts of speech are connected to the documents they occur in through part-of-speech relations, yielding edges of different types.
Step3.4: Word-word edges: for cross-language sensitive information recognition, sensitive words influence the prediction more strongly than other words, but the sensitive words in social media text can have several different parts of speech, and individual sensitive words often appear in obscure forms. Based on the constructed bilingual sensitive word list, for each sensitive word segmented from a document, the bilingual sensitive words with similar semantics are added as word nodes, and a graph structure is built from the edges between these word nodes. This increases the weight of the sensitive information in the document and performs Chinese-Thai cross-language word-level alignment and aggregation.
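Steps 3.1 to 3.4 together describe a graph with two node kinds and several edge types. A minimal sketch of one way to hold such a graph is given below; the class `HeteroGraph`, the edge-type labels (`translation`, `similar`, `pos:<tag>`, `align`) and the input formats are illustrative assumptions rather than the patent's data model.

```python
from collections import defaultdict


class HeteroGraph:
    """Document and word nodes with typed, undirected edges."""

    def __init__(self):
        self.node_ids = {}             # (kind, key) -> integer node id
        self.edges = defaultdict(set)  # edge type -> set of (u, v) pairs

    def node(self, kind, key):
        """Return the id of a node, creating it on first use."""
        return self.node_ids.setdefault((kind, key), len(self.node_ids))

    def add_edge(self, edge_type, u, v):
        # Store each undirected edge once, in a canonical order.
        self.edges[edge_type].add((min(u, v), max(u, v)))


def build_graph(docs, translations, similar_pairs, tagged_words, aligned_pairs):
    """Assemble the graph from pre-computed relations (all inputs hypothetical).

    translations:  (zh_doc, th_doc) pairs from machine translation
    similar_pairs: (doc_a, doc_b) pairs from the top-k cosine similarity step
    tagged_words:  (doc, word, pos_tag) triples from segmentation and tagging
    aligned_pairs: (zh_word, th_word) pairs from the bilingual dictionary
    """
    g = HeteroGraph()
    for doc in docs:
        g.node("doc", doc)
    for zh_doc, th_doc in translations:
        g.add_edge("translation", g.node("doc", zh_doc), g.node("doc", th_doc))
    for doc_a, doc_b in similar_pairs:
        g.add_edge("similar", g.node("doc", doc_a), g.node("doc", doc_b))
    for doc, word, pos_tag in tagged_words:
        g.add_edge("pos:" + pos_tag, g.node("doc", doc), g.node("word", word))
    for zh_word, th_word in aligned_pairs:
        g.add_edge("align", g.node("word", zh_word), g.node("word", th_word))
    return g
```

Grouping edges by type keeps each relation available as its own subgraph, which is the form the graph convolution in Step4 operates on.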
As a further aspect of the invention, Step4 comprises the following steps:
Step4.1: For the relations contained in the Chinese-Thai cross-language sensitive information text data set F, a heterogeneous graph $G_F$ is constructed. $G_F$ is composed of subgraphs of different types; each subgraph is $G = (V, E)$, where V denotes the defined nodes, including document nodes and word nodes, and E denotes the edges of different types between nodes. All nodes are encoded with a multilingual pre-trained model. The initial $X \in \mathbb{R}^{n \times m}$ is the matrix of the n nodes and their features, where m is the dimension of the feature vectors and each row $X_v \in \mathbb{R}^{m}$ is the feature vector of node v. For each graph G we use its adjacency matrix A and its degree matrix D, whose elements are $D_{ii} = \sum_{j_1} A_{i j_1}$, where i and $j_1$ index the rows and columns of A, respectively. Using a trainable weight matrix $W^{(j)}$, the m-dimensional node feature matrix $H^{(1)} \in \mathbb{R}^{n \times m}$ of the first GCN layer of each subgraph is computed as follows:
Figure SMS_3
where $\sigma(\cdot)$ is an activation function, $\tilde{A} = A + I_N$ is the adjacency matrix of the undirected graph G with self-connections added, and $I_N$ is the identity matrix. A multi-layer GCN structure captures higher-order neighborhood information, as follows:
Figure SMS_5
where j denotes the index of the graph convolution layer and $H^{(0)} = X$. After the graph convolution operations are completed, the features of all subgraph types are aggregated into a common hidden space, as follows:
Figure SMS_6
where τ ranges over the different subgraphs. Aggregating the subgraphs yields the representation of the whole heterogeneous graph, and the information of the word nodes is aggregated into the document nodes;
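Because the layer equations survive only as image placeholders, the sketch below does not reproduce the patent's formulas. It assumes the widely used symmetric normalization $\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ (with $\tilde{D}$ the degree matrix of $\tilde{A}$), a ReLU activation, and a plain sum over subgraphs as the aggregation; all three are assumptions, as are the function names.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def normalize_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2, where D is the degree matrix of A + I."""
    n = len(adj)
    a_tilde = [
        [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)
    ]
    deg = [sum(row) for row in a_tilde]
    return [
        [a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
        for i in range(n)
    ]


def relu(matrix):
    return [[max(0.0, value) for value in row] for row in matrix]


def gcn_layer(norm_adj, h, weight):
    """One propagation step: activation(normalized adjacency @ H @ W)."""
    return relu(matmul(matmul(norm_adj, h), weight))


def hetero_gcn(subgraph_adjs, x, weights_per_subgraph):
    """Run a GCN stack on every subgraph and sum the resulting node features."""
    out_dim = len(weights_per_subgraph[0][-1][0])
    total = [[0.0] * out_dim for _ in range(len(x))]
    for adj, weights in zip(subgraph_adjs, weights_per_subgraph):
        norm_adj = normalize_adjacency(adj)
        h = x  # H^(0) = X
        for weight in weights:  # one weight matrix per layer
            h = gcn_layer(norm_adj, h, weight)
        total = [
            [t + v for t, v in zip(total_row, h_row)]
            for total_row, h_row in zip(total, h)
        ]
    return total
```

In practice the weight matrices would be trained by backpropagation in a deep learning framework; the pure-Python version only shows how features flow within each typed subgraph before being merged into a shared space.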
Step4.2: The document features h obtained from the GCN layers are then passed through the LeakyReLU activation function into a fully connected layer to obtain the output. Finally, the normalized exponential function (softmax) performs class prediction on the document nodes, giving a predicted value for each class; the class with the highest predicted value is the predicted classification result. Specifically:
Figure SMS_7
q=Linear(p)
Figure SMS_8
Figure SMS_9
where α is 0.01, $W_q$ and b are the weight and bias, respectively, and q is the document feature output by the last fully connected layer of the model;
Figure SMS_10
represents the probabilities of the document over the M categories, and
Figure SMS_11
represents the class predicted by the model for the document.
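The classification head of Step4.2 can be sketched as follows. The ordering LeakyReLU, linear layer, softmax, argmax follows the description above, with α = 0.01 as stated; the function names and the toy weight layout (one row per class) are assumptions.

```python
import math


def leaky_relu(vector, alpha=0.01):
    """LeakyReLU: keep positive values, scale negative values by alpha."""
    return [v if v > 0 else alpha * v for v in vector]


def linear(vector, weight, bias):
    """Fully connected layer; `weight` holds one row per output class."""
    return [
        sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weight, bias)
    ]


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(h, weight, bias, alpha=0.01):
    """Return (predicted class index, class probabilities) for one document."""
    p = leaky_relu(h, alpha)
    q = linear(p, weight, bias)
    probs = softmax(q)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs
```

For example, `classify(h, weight, bias)` with an M-row `weight` returns a probability for each of the M categories together with the index of the most probable one.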
The beneficial effects of the invention are:
1. A Chinese-Thai bilingual sensitive dictionary and a Chinese-Thai cross-language sensitive information recognition data set are constructed. The sensitive words in an input text are expanded on the basis of the Chinese-Thai bilingual sensitive dictionary, which strengthens the sensitive features of the input text when it is represented by the multilingual pre-trained model.
2. A Chinese-Thai cross-language heterogeneous graph is constructed, with documents, keywords and sensitive words as nodes, and with the alignment and similarity relations between documents, the different part-of-speech relations of the keywords and sensitive words, and the bilingual sensitive word alignment relations from the Chinese-Thai bilingual sensitive dictionary as edges. This strengthens the association between Chinese and Thai input texts and sensitive words, and addresses the large difference between the two languages and the difficulty of aligning Chinese and Thai sensitive words.
3. In the Chinese-Thai cross-language sensitive information recognition method fusing the bilingual sensitive dictionary and the heterogeneous graph, a multi-layer graph convolutional network aggregates Chinese-Thai cross-language information over the different relations contained in the heterogeneous graph. This strengthens Chinese-Thai cross-language transfer learning and improves the performance of Chinese-Thai cross-language sensitive information recognition.
Drawings
Fig. 1 is a flow chart of the Chinese-Thai cross-language sensitive information recognition method fusing a bilingual sensitive dictionary and a heterogeneous graph according to the invention.
Detailed Description
Example 1: As shown in Fig. 1, the Chinese-Thai cross-language sensitive information recognition method fusing a bilingual sensitive dictionary and a heterogeneous graph trains a model on the constructed Chinese-Thai cross-language sensitive information data set. The method specifically comprises the following steps:
Step1: A web crawler written in Python collects and organizes multilingual text data from the public web; the data are cleaned, and a Chinese-Thai cross-language sensitive information data set is constructed;
Step2: Statistics over the processed data yield the bilingual sensitive words in each sensitive category; bilingual sensitive words with similar senses are grouped into phrases, and a Chinese-Thai bilingual sensitive dictionary is constructed;
Step3: Sensitive words in the bilingual sensitive dictionary and keywords in the documents are used as word nodes, and the documents are used as document nodes. The alignment and similarity relations between documents, the different part-of-speech relations of the keywords and sensitive words, and the bilingual sensitive word alignment relations from the Chinese-Thai bilingual sensitive dictionary are used as edges to form the Chinese-Thai cross-language heterogeneous graph;
Step4: A multilingual pre-trained model is used to represent all nodes and to capture the global and contextual information of the documents. The node representations are passed to a graph convolutional network, graph convolution is performed on the different subgraphs built from the different edge types, and the features of the different nodes are aggregated;
Step5: The resulting document node features are fed to a sensitive information classifier, which produces the sensitive information prediction.
As a further aspect of the invention, Step1 comprises the following specific steps:
Step1.1: Using web crawler technology, 158638 items of published multilingual text data are collected from the public web and organized: 15798 items of Wikipedia data, 46119 items of microblog data and 96721 items of Twitter data. Text that is neither Chinese nor Thai is then filtered out with a language identification method; emoticons, symbols, hyperlinks and the like are removed with an emoji package and regular expressions; and data cleaning is completed by manual screening and sorting. Finally, the Chinese and Thai texts are labeled with sensitive category labels according to their sensitive features, producing a Chinese-Thai cross-language sensitive information recognition data set with category labels;
As a further aspect of the invention, Step2 comprises the following specific steps:
Step2.1: Based on Wikipedia and social media sensitive data, the sensitive words contained in the data are identified manually with the help of machine translation. Statistics yield the bilingual sensitive words in each sensitive category, bilingual sensitive words with similar senses are grouped into phrases, and the Chinese-Thai bilingual sensitive word alignment relations are established, thereby constructing the Chinese-Thai bilingual sensitive dictionary.
As a further aspect of the invention, Step3 comprises:
Step3.1: The documents of the Chinese-Thai cross-language sensitive information text data set, the words co-occurring in those documents, and the sensitive words of the bilingual sensitive dictionary are used as nodes to construct the Chinese-Thai cross-language heterogeneous graph structure. Different relation types exist between documents, between documents and words, and between sensitive words: translation and similarity relations between documents, part-of-speech relations between documents and words, and part-of-speech relations between sensitive words.
Step3.2: Document-document edges: to capture the semantic information contained in the documents and to let Chinese and Thai documents support cross-language transfer learning more effectively, two types of document relation edges are defined. First, translation edges are constructed between Chinese documents and Thai documents based on the pseudo-parallel corpus produced by machine translation. Second, after the Chinese and Thai documents have been represented as vectors by a multilingual pre-trained model, the similarity between documents is computed from the document vectors. For a document vector $A = (x_1, x_2, \dots, x_l)$ and a document vector $B = (y_1, y_2, \dots, y_l)$, the similarity S is the cosine similarity given by formula (1):
$$S = \frac{\sum_{i=1}^{l} x_i y_i}{\sqrt{\sum_{i=1}^{l} x_i^2}\,\sqrt{\sum_{i=1}^{l} y_i^2}} \qquad (1)$$
For each document, the k documents with the highest similarity S are selected and similarity edges are constructed between the corresponding document nodes; k is set to 3.
Step3.3: Document-word edges: the most obvious relation between documents and words is co-occurrence. Words carry different parts of speech and important grammatical information, and for sensitive information tasks adjectives, nouns and verbs may all carry sensitive information. The constructed bilingual sensitive dictionary is used to help a word segmentation tool segment the words in each document accurately, POS-Tagger is used to tag each word with its part of speech, and words with different parts of speech are connected to the documents they occur in through part-of-speech relations, yielding edges of different types.
Step3.4: Word-word edges: for cross-language sensitive information recognition, sensitive words influence the prediction more strongly than other words, but the sensitive words in social media text can have several different parts of speech, and individual sensitive words often appear in obscure forms. Based on the constructed bilingual sensitive word list, for each sensitive word segmented from a document, the bilingual sensitive words with similar semantics are added as word nodes, and a graph structure is built from the edges between these word nodes. This increases the weight of the sensitive information in the document and performs Chinese-Thai cross-language word-level alignment and aggregation.
As a further aspect of the invention, Step4 comprises the following steps:
Step4.1: For the relations contained in the Chinese-Thai cross-language sensitive information text data set F, a heterogeneous graph $G_F$ is constructed. $G_F$ is composed of subgraphs of different types; each subgraph is $G = (V, E)$, where V denotes the defined nodes, including document nodes and word nodes, and E denotes the edges of different types between nodes. All nodes are encoded with a multilingual pre-trained model. The initial $X \in \mathbb{R}^{n \times m}$ is the matrix of the n nodes and their features, where m is the dimension of the feature vectors and each row $X_v \in \mathbb{R}^{m}$ is the feature vector of node v. For each graph G we use its adjacency matrix A and its degree matrix D, whose elements are $D_{ii} = \sum_{j_1} A_{i j_1}$, where i and $j_1$ index the rows and columns of A, respectively. Using a trainable weight matrix $W^{(j)}$, the m-dimensional node feature matrix $H^{(1)} \in \mathbb{R}^{n \times m}$ of the first GCN layer of each subgraph is computed as follows:
Figure SMS_14
where $\sigma(\cdot)$ is an activation function, $\tilde{A} = A + I_N$ is the adjacency matrix of the undirected graph G with self-connections added, and $I_N$ is the identity matrix. A multi-layer GCN structure captures higher-order neighborhood information, as follows:
Figure SMS_16
where j denotes the index of the graph convolution layer and $H^{(0)} = X$. After the graph convolution operations are completed, the features of all subgraph types are aggregated into a common hidden space, as follows:
Figure SMS_17
where τ ranges over the different subgraphs. Aggregating the subgraphs yields the representation of the whole heterogeneous graph, and the information of the word nodes is aggregated into the document nodes;
Step4.2: The document features h obtained from the GCN layers are then passed through the LeakyReLU activation function into a fully connected layer to obtain the output. Finally, the normalized exponential function (softmax) performs class prediction on the document nodes, giving a predicted value for each class; the class with the highest predicted value is the predicted classification result. Specifically:
Figure SMS_18
q=Linear(p)
Figure SMS_19
Figure SMS_20
where α is 0.01, $W_q$ and b are the weight and bias, respectively, and q is the document feature output by the last fully connected layer of the model;
Figure SMS_21
represents the probabilities of the document over the M categories, and
Figure SMS_22
represents the class predicted by the model for the document.
To illustrate the effect of the invention, three groups of comparison experiments were set up. The first group is the main experiment, which compares the invention with baseline models on the cross-language sensitive information recognition data set. The second group compares the effect of the invention when different multilingual pre-trained models are used. The third group compares the effect of the invention with different numbers of GCN layers.
(1) Results of the main experiment
First, the invention is compared with several state-of-the-art general cross-language text classification models; all experiments use the constructed Chinese-Thai cross-language sensitive information recognition data set. The results in Table 1 show that, on the cross-language sensitive information recognition task, the invention outperforms the other state-of-the-art models, including the baseline CLHG. The experiments show that the method can effectively identify the sensitive features contained in social media text data and classify the sensitive information.
Table 1: Results of different cross-language text classification methods
Figure SMS_23
(2) Results with different multilingual pre-trained models
To explore how the multilingual pre-trained model used for node representation affects model performance, a comparison experiment with different multilingual pre-trained models was performed. The results are shown in Table 2: among the models used to represent the nodes, XLM-R gives the best results, while mBERT and XLM perform less well.
Table 2: Effect of different multilingual pre-trained models on classification results
Figure SMS_24
(3) Effect of the number of GCN layers on model performance
To explore how the number of GCN convolution layers affects model performance, ablation experiments with 2, 3, 4 and 5 GCN layers were performed. The results are shown in Table 3. The model performs best with 3 convolution layers. With fewer than 3 layers, the network's information aggregation capability is insufficient and model performance is lower; with more than 3 layers, overall performance decreases as the number of layers grows.
Table 3: Effect of the number of convolution layers on classification results
Figure SMS_25
In summary, to solve the problem that bilingual sensitive words are difficult to recognize and align because sensitive words in Chinese-Thai social media data take diverse forms, a Chinese-Thai cross-language sensitive information recognition method fusing a bilingual sensitive dictionary and a heterogeneous graph is proposed. A cross-language heterogeneous graph is constructed from the sensitive word alignment information of the bilingual sensitive dictionary, strengthening cross-language transfer learning. Extensive experiments verify that the cross-language heterogeneous graph built on the bilingual sensitive dictionary accurately captures the alignment relations between bilingual sensitive words, so that the target language learns the sensitive features of the source language more effectively and cross-language transfer learning is strengthened.
While the invention has been described in detail with reference to the embodiments, it is not limited to those embodiments; those skilled in the art can make various changes within the scope of their knowledge without departing from the spirit of the invention.

Claims (6)

1. A Chinese-Thai cross-language sensitive information recognition method fusing a bilingual sensitive dictionary and a heterogeneous graph, characterized in that: first, a Chinese-Thai bilingual sensitive dictionary is constructed based on Wikipedia and social media sensitive data; then, documents and the keywords and sensitive words they contain are used as nodes, and bilingual alignment relations, similarity relations and different parts of speech are used as edges to construct a Chinese-Thai cross-language heterogeneous graph, which strengthens the alignment between document sensitive features and bilingual sensitive features, and a multilingual pre-trained model is used to represent the document nodes and word nodes; finally, the input documents are encoded by a multi-layer graph convolutional neural network, and a sensitive information classifier performs classification prediction on the documents.
2. The Chinese-Thai cross-language sensitive information recognition method fusing a bilingual sensitive dictionary and a heterogeneous graph according to claim 1, wherein the specific steps are as follows:
Step1: a web crawler program is written in Python to collect and organize multilingual text data from the public network; the data are cleaned, and a Chinese-Thai cross-language sensitive information dataset is constructed;
Step2: statistics are computed over the processed data to obtain the bilingual sensitive words in the different sensitive categories, and bilingual sensitive words with similar meanings are grouped to construct a Chinese-Thai bilingual sensitive dictionary;
Step3: the words in the documents and the sensitive words in the Chinese-Thai bilingual sensitive dictionary serve as word nodes, the documents serve as document nodes, and the alignment and similarity relations among documents, keywords and sensitive words, including the bilingual sensitive-word alignment relation, serve as edges to form a Chinese-Thai cross-language heterogeneous graph;
Step4: all nodes are represented with a multilingual pre-trained model to capture the global and contextual information of the documents; the resulting node representations are passed to a graph convolutional neural network, graph convolution is performed on the different subgraphs built from the different edge types, and features are aggregated across the different nodes; the resulting document node features are fed into a sensitive information classifier to obtain the sensitive information prediction result.
3. The Chinese-Thai cross-language sensitive information recognition method fusing a bilingual sensitive dictionary and a heterogeneous graph according to claim 2, wherein the specific steps of Step1 are as follows:
Step1.1: multilingual text data on the public network are collected and organized by web crawler technology; text that is neither Chinese nor Thai is then filtered out by a language identification method, emoticons, symbols and hyperlinks are removed from the text using an emoji package and regular expressions, and data cleaning is completed by manually screening and organizing the data; finally, the Chinese and Thai text data are labeled with sensitive category labels according to their sensitive features, and a Chinese-Thai cross-language sensitive information recognition dataset with category labels is constructed.
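For illustration only (not part of the claims), the cleaning in Step1.1 can be sketched roughly as follows. This is a minimal sketch under stated assumptions: Unicode block ranges stand in for a real language identification method, a character whitelist stands in for the emoji package, and the names `clean_text`, `detect_language` and `build_dataset` are hypothetical.

```python
import re

# Assumption: script ranges approximate language identification.
CJK = re.compile(r"[\u4e00-\u9fff]")    # CJK Unified Ideographs (Chinese)
THAI = re.compile(r"[\u0e00-\u0e7f]")   # Thai Unicode block
URL = re.compile(r"https?://\S+|www\.\S+")
# Assumption: anything outside Chinese, Thai, Latin letters, digits and
# whitespace (including emoji and punctuation) is treated as noise.
NOISE = re.compile(r"[^\u4e00-\u9fff\u0e00-\u0e7fA-Za-z0-9\s]")


def clean_text(text):
    """Remove hyperlinks, emoji and symbols, then collapse whitespace."""
    text = URL.sub(" ", text)
    text = NOISE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def detect_language(text):
    """Return 'zh', 'th', or None by counting script characters."""
    zh = len(CJK.findall(text))
    th = len(THAI.findall(text))
    if zh == 0 and th == 0:
        return None
    return "zh" if zh >= th else "th"


def build_dataset(raw_posts):
    """raw_posts: iterable of (text, label) pairs with manually assigned labels."""
    records = []
    for text, label in raw_posts:
        cleaned = clean_text(text)
        lang = detect_language(cleaned)
        if lang is None or not cleaned:
            continue  # drop text that is neither Chinese nor Thai
        records.append({"text": cleaned, "lang": lang, "label": label})
    return records
```

The manual screening and labeling described in the claim are not automated here; the sketch only assumes that labels arrive together with the raw text.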
4. The Chinese-Thai cross-language sensitive information recognition method fusing a bilingual sensitive dictionary and a heterogeneous graph according to claim 2, wherein the specific steps of Step2 are as follows:
Step2.1: based on Wikipedia and social media sensitive data, the sensitive words contained in the data are identified manually with the aid of machine translation; statistics yield the bilingual sensitive words in the different sensitive categories, bilingual sensitive words with similar meanings are grouped, and the Chinese-Thai bilingual sensitive-word alignment relation is established, thereby constructing the Chinese-Thai bilingual sensitive dictionary.
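For illustration only, one possible in-memory layout for such a dictionary is sketched below. The class name, method names and the example category are assumptions made for this sketch and do not come from the patent; the word pair is a neutral example (both words mean "gambling").

```python
class BilingualSensitiveDictionary:
    """Groups of semantically similar Chinese and Thai sensitive words."""

    def __init__(self):
        self.groups = []   # each group: {"category", "zh": set, "th": set}
        self.index = {}    # word -> group id

    def add_group(self, category, zh_words, th_words):
        """Register one group of words with similar meaning in both languages."""
        gid = len(self.groups)
        self.groups.append(
            {"category": category, "zh": set(zh_words), "th": set(th_words)}
        )
        for word in list(zh_words) + list(th_words):
            self.index[word] = gid
        return gid

    def aligned(self, word):
        """Return the words of the other language in the same group."""
        gid = self.index.get(word)
        if gid is None:
            return set()
        group = self.groups[gid]
        return group["th"] if word in group["zh"] else group["zh"]

    def alignment_pairs(self):
        """All (zh, th) pairs, usable later as word-to-word alignment edges."""
        return sorted(
            (z, t) for g in self.groups for z in g["zh"] for t in g["th"]
        )


# Hypothetical usage with an illustrative category name.
lexicon = BilingualSensitiveDictionary()
lexicon.add_group("gambling", ["赌博"], ["การพนัน"])
```

Storing groups rather than flat word pairs keeps the "similar word senses form word groups" idea explicit while still allowing pairwise alignment edges to be derived on demand.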
5. The Chinese-Thai cross-language sensitive information recognition method fusing a bilingual sensitive dictionary and a heterogeneous graph according to claim 2, wherein Step3 comprises the following steps:
Step3.1: the documents of the Chinese-Thai cross-language sensitive information text dataset, the words co-occurring in those documents, and the sensitive words of the bilingual sensitive dictionary are used as nodes to construct the Chinese-Thai cross-language heterogeneous graph structure, in which different relation types exist between documents, between documents and words, and between sensitive words, including translation and similarity relations between documents, part-of-speech relations between documents and words, and part-of-speech relations between sensitive words;
Step3.2: document-document edges: to capture the semantic information contained in the documents and allow Chinese and Thai documents to perform cross-language transfer learning more effectively, two types of document relation edges are set; first, translation edges are constructed between corresponding Chinese and Thai documents based on the pseudo-parallel corpus produced by machine translation; second, after vector representations of the Chinese and Thai documents are obtained from a multilingual pre-trained model, the similarity between documents is computed from the document vectors;
Step3.3: document-word edges: the words in each document are segmented accurately by a word segmentation tool assisted by the constructed bilingual sensitive dictionary, part-of-speech tagging is performed on the words with POS-Tagger, and words with different parts of speech are connected to the documents in which they occur through part-of-speech relations, forming edges of different types;
Step3.4: word-word edges: based on the constructed bilingual sensitive word list, for the sensitive words segmented from the documents, bilingual sensitive words with similar meanings are used as word nodes, and the graph structure is built through edges between the word nodes, which increases the weight of sensitive information in the documents and performs Chinese-Thai cross-language word-level alignment and aggregation.
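For illustration only, the edge construction of Step3.2 to Step3.4 can be sketched as below. The function name, the edge-type labels, the cosine similarity measure and the threshold of 0.8 are all assumptions made for this sketch; the patent does not state how document similarity is thresholded. Document vectors and part-of-speech tags are assumed to be supplied by an external encoder and tagger.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def build_hetero_graph(docs, doc_vecs, translations, doc_words, align_pairs,
                       sim_threshold=0.8):
    """Return {edge_type: set of (node_a, node_b)}.

    docs:         list of document ids
    doc_vecs:     doc id -> vector from a multilingual encoder (assumed given)
    translations: (zh_doc, th_doc) pseudo-parallel pairs
    doc_words:    doc id -> list of (word, pos_tag)
    align_pairs:  (zh_word, th_word) pairs from the bilingual dictionary
    """
    edges = {}

    def add(edge_type, a, b):
        edges.setdefault(edge_type, set()).add((a, b))

    # Step3.2: translation edges and similarity edges between documents.
    for zh_doc, th_doc in translations:
        add("doc-doc:translation", zh_doc, th_doc)
    for i, a in enumerate(docs):
        for b in docs[i + 1:]:
            if cosine(doc_vecs[a], doc_vecs[b]) >= sim_threshold:
                add("doc-doc:similar", a, b)

    # Step3.3: one document-word edge type per part-of-speech tag.
    present = set()
    for doc, tagged in doc_words.items():
        for word, pos in tagged:
            add("doc-word:" + pos, doc, word)
            present.add(word)

    # Step3.4: alignment edges for dictionary pairs seen in the corpus.
    # Assumption: a pair is kept if at least one side occurs in a document.
    for zh_word, th_word in align_pairs:
        if zh_word in present or th_word in present:
            add("word-word:aligned", zh_word, th_word)
    return edges
```

Keeping one edge set per relation type makes it straightforward to build one subgraph per edge type afterwards, which is what the per-subgraph graph convolution in Step4 operates on.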
6. The Chinese-Thai cross-language sensitive information recognition method fusing a bilingual sensitive dictionary and a heterogeneous graph according to claim 2, wherein Step4 comprises the following steps:
Step4.1: a heterogeneous graph $G_F$ is constructed for the relations contained in the Chinese-Thai cross-language sensitive information text dataset; $G_F$ is composed of different types of subgraphs, each subgraph $G = (V, E)$, where $V$ denotes the defined nodes, including document nodes and word nodes, and $E$ denotes the different types of edges between nodes; all nodes are encoded with a multilingual pre-trained model, and the initialized $X \in \mathbb{R}^{n \times m}$ is defined as a matrix containing the $n$ nodes and their features, where $m$ is the dimension of the feature vectors and each row $X_v \in \mathbb{R}^{m}$ is the feature vector of node $v$; for each graph $G$, its adjacency matrix $A$ and degree matrix
Figure FDA0003925800790000031
are used, where the elements of the degree matrix are $D_{ii} = \sum_{j_1} A_{i j_1}$, and $i$ and $j_1$ index the rows and columns of the adjacency matrix $A$, respectively; using a trainable weight matrix $W^{(j)}$, for the first GCN layer of each subgraph the $m$-dimensional node feature matrix $H^{(1)} \in \mathbb{R}^{n \times m}$ is computed as follows:
Figure FDA0003925800790000032
where $\sigma(\cdot)$ is an activation function,
Figure FDA0003925800790000033
is the adjacency matrix of the undirected graph $G$ with added self-connections, and $I_N$ is the identity matrix; for a multi-layer GCN structure, higher-order neighborhood information is obtained as follows:
Figure FDA0003925800790000034
where $j$ denotes the index of the graph convolution layer and $H^{(0)} = X$; after the graph convolution operation is completed, the features of all the different types of subgraphs are aggregated into a common latent space, as follows:
Figure FDA0003925800790000035
where $\tau$ denotes the different subgraphs; the different subgraphs are aggregated to obtain the representation of the whole heterogeneous graph, and the information of the word nodes is aggregated into the document nodes;
Step4.2: the document features $h$ obtained from the GCN layers then pass through the LeakyReLU activation function into a fully connected layer to produce the output; finally, the softmax function (normalized exponential function) performs class prediction on the document nodes, giving a predicted value for each class, and the class with the highest predicted value is the predicted classification result, specifically:
Figure FDA0003925800790000036
q = Linear(p)
Figure FDA0003925800790000037
Figure FDA0003925800790000038
where $\alpha$ is 0.01, $W_q$ and $b$ are the weight and bias, respectively, $q$ is the document feature output by the last fully connected layer of the model,
Figure FDA0003925800790000039
represents the probabilities of the document over the $M$ categories, and
Figure FDA00039258007900000310
represents the class predicted by the model for the document.
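For illustration only, Step4 can be sketched in plain Python as below. Because the formula images are not reproduced in this text, the sketch assumes the widely used GCN propagation rule $H^{(j+1)} = \sigma(\tilde{D}^{-1/2}(A + I_N)\tilde{D}^{-1/2} H^{(j)} W^{(j)})$ with ReLU as $\sigma$, and assumes that subgraph outputs are aggregated by summation; the patent's exact formulas may differ. All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def normalize_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2, with D the degree matrix of A + I."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(a_hat, h, w):
    """One graph convolution: ReLU(a_hat @ h @ w). ReLU is an assumption."""
    out = matmul(matmul(a_hat, h), w)
    return [[max(0.0, x) for x in row] for row in out]


def hetero_gcn(subgraphs, x, weights):
    """Run a GCN stack per subgraph type and sum the results (assumed)."""
    total = None
    for edge_type, adj in subgraphs.items():
        a_hat = normalize_adjacency(adj)
        h = x                                  # H^(0) = X
        for w in weights[edge_type]:           # one weight matrix per layer
            h = gcn_layer(a_hat, h, w)
        if total is None:
            total = h
        else:
            total = [[p + q for p, q in zip(r1, r2)]
                     for r1, r2 in zip(total, h)]
    return total


def leaky_relu(v, alpha=0.01):
    """LeakyReLU with the slope alpha = 0.01 stated in the claim."""
    return [x if x > 0 else alpha * x for x in v]


def softmax(v):
    """Numerically stable softmax over a list of scores."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    s = sum(exps)
    return [e / s for e in exps]


def classify(h_doc, w_q, b):
    """p = LeakyReLU(h); q = Linear(p); probabilities = softmax(q); argmax."""
    p = leaky_relu(h_doc)
    q = [sum(pi * wi for pi, wi in zip(p, col)) + bj
         for col, bj in zip(zip(*w_q), b)]
    probs = softmax(q)
    return probs, max(range(len(probs)), key=probs.__getitem__)
```

Here `w_q` has one row per input feature and one column per class. In practice the weights would be learned by training, which this sketch leaves out.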
CN202211373435.1A 2022-11-04 2022-11-04 Chinese-Tai cross-language sensitive information recognition method fusing bilingual sensitive dictionary and heterogeneous graph Pending CN115952794A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211373435.1A CN115952794A (en) 2022-11-04 2022-11-04 Chinese-Tai cross-language sensitive information recognition method fusing bilingual sensitive dictionary and heterogeneous graph

Publications (1)

Publication Number Publication Date
CN115952794A true CN115952794A (en) 2023-04-11

Family

ID=87286500

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211373435.1A Pending CN115952794A (en) 2022-11-04 2022-11-04 Chinese-Tai cross-language sensitive information recognition method fusing bilingual sensitive dictionary and heterogeneous graph

Country Status (1)

Country Link
CN (1) CN115952794A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117149839A (en) * 2023-09-14 2023-12-01 中国科学院软件研究所 Cross-ecological software detection method and device for open source software supply chain
CN117149839B (en) * 2023-09-14 2024-04-16 中国科学院软件研究所 Cross-ecological software detection method and device for open source software supply chain

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination