CN114218389A - Long text classification method in chemical preparation field based on graph neural network - Google Patents


Info

Publication number
CN114218389A
CN114218389A
Authority
CN
China
Prior art keywords
graph
word
nodes
neural network
chemical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111567698.1A
Other languages
Chinese (zh)
Inventor
周焕来
张博阳
陈璐
唐小龙
高源
孙靖哲
贾海涛
王俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yituo Communications Group Co ltd
Original Assignee
Yituo Communications Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yituo Communications Group Co ltd filed Critical Yituo Communications Group Co ltd
Priority to CN202111567698.1A priority Critical patent/CN114218389A/en
Publication of CN114218389A publication Critical patent/CN114218389A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24137 Distances to cluster centroïds
    • G06F18/2414 Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Animal Behavior & Ethology (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a graph-neural-network-based method for classifying long texts in the field of chemical preparation, which comprises the following steps. First, word statistical features and word-vector features are fused, and the resulting multidimensional features are fed into an RNN + CRF model to extract new words from chemical-preparation texts and build a segmentation dictionary. Second, all documents and segmented words (covering both training and prediction data) are initialized as nodes, and a global syntax tensor graph and a sequence tensor graph are constructed. Next, a graph message-passing mechanism iterates information within each graph (over nodes, edges, and the global attribute) and between graphs to update the feature representations. The syntax and sequence tensor graphs are then fused by dimensionality reduction to obtain a global semantic graph. Finally, the global semantic graph is fed into a graph convolutional network for training; a softmax layer classifies the resulting representations and outputs the category of each document node to be predicted, giving the final prediction result.

Description

Long text classification method in chemical preparation field based on graph neural network
Technical Field
The invention belongs to the field of natural language processing, and relates to a long text classification method in the field of chemical preparation based on a graph neural network.
Background
In recent years, breakthroughs in big data and artificial intelligence technology have injected new growth points into traditional industries and exert an ever-deepening influence on industrial development, research, and decision making. The chemical industry is an important pillar of China's secondary industry; China is a recognized major chemical country, and its production capacity for most chemical products ranks first in the world.
This brings new challenges and opportunities to the traditional chemical industry. On the one hand, as a traditional major carbon emitter, the chemical industry faces a difficult emission-reduction task; on the other hand, it has unique advantages in areas such as the resource utilization of carbon dioxide. Method selection and path exploration in the field of chemical preparation are therefore critical and urgent. At present, for the preparation process and workflow of any chemical product, a large amount of text data can be obtained by searching the product's related patents and querying related documents on the internet. Classifying this textual knowledge by manufacturing process becomes crucial. How to classify such massive text data and retrieve texts by category for analysis is thus a key link of this research.
The study of text classification has long been one of the basic problems in the field of natural language processing. From shallow machine learning to deep learning, researchers have focused on capturing long-range relevance in text. The advent of the BERT (Bidirectional Encoder Representations from Transformers) model, pre-trained on large corpora to bidirectionally encode context-dependent word vectors, marked an important turning point for downstream natural-language-processing tasks such as text classification.
BERT and its improved pre-trained variants have two important problems. First, BERT limits the input text length to 512 characters, while production work in the field of chemical preparation involves a large number of long texts exceeding that length, so such semantic pre-training models cannot be extended to the long-text classification task; in recent years, explorations of GNN (graph neural network) text classification have shown that graph methods capture the structural information of long texts well. Second, BERT has no Chinese word segmentation function and performs word-embedding mapping character by character, and the field of chemical preparation contains a large number of new words, which degrades BERT's embedding learning.
The invention therefore designs a graph-neural-network-based method for classifying long texts in the field of chemical preparation: it realizes domain-specific Chinese word segmentation by identifying new chemical-domain words with a new-word discovery algorithm, realizes global graph-structure construction by fusing multi-source relations between nodes, and realizes text-node classification by iterating classification features with a graph convolutional neural network whose fully connected layer feeds a softmax. In this way the problem of long-text classification in the field of chemical preparation is solved.
Disclosure of Invention
The graph-neural-network-based classification of long texts in the field of chemical preparation mainly comprises four steps: chemical-domain new-word discovery, global knowledge-graph construction, acquisition of node classification information by a graph convolutional neural network, and the output layer.
The invention mainly addresses the problem that global semantic features cannot be effectively captured in long-text classification for the field of chemical preparation, and proposes a graph-neural-network-based classification method for such long texts. The method realizes chemical-domain new-word discovery by fusing multi-dimensional word features with a deep learning model; realizes global knowledge-graph node embedding through GloVe vectors for dictionary words and graph-structure embedding for new words; constructs syntax and sequence tensor graphs over the nodes and fuses the inter-node relation features to realize global knowledge-graph edge embedding, thereby building the global knowledge graph. A graph convolutional neural network then acquires node classification information, a fully connected layer followed by softmax classifies the text nodes, and the classification result is output. The method comprises the following steps:
(1) Construct a new-word dictionary, and segment the text with a user-defined dictionary fitted to the special domain. Fuse word statistical features with word-vector features and feed the multidimensional features into an RNN + CRF model to extract new words from the text.
(2) Construct multi-dimensional tensor graphs: initialize all documents and segmented words (including training and prediction data) as nodes with GloVe embeddings, and build the global syntax and sequence tensor-graph representations.
(3) Through a graph message-passing mechanism, iteratively update the feature representations of nodes, edges, and the global attribute within each graph.
(4) Fuse the syntax and sequence feature tensor graphs to obtain a global semantic graph.
(5) Obtain node classification information with a graph convolutional neural network: after the global knowledge graph is built, feed it into the network for training iterations to obtain document-node classification information; a fully connected layer followed by softmax serves as the output layer, which emits the categories of the document nodes to be predicted and yields the final prediction result.
Description of the figures and accompanying tables
FIG. 1 is a block diagram of the overall algorithm of the present invention.
FIG. 2 is a schematic diagram of the extraction of the RNN text word feature of the present invention.
FIG. 3 is a comparison table of the syntactic dependency relation codes used in the syntax tensor graph of the present invention.
FIG. 4 is a diagram illustrating an iterative embedding of a graph message passing mechanism according to the present invention.
Detailed Description
The technical solution in the embodiments of the present invention is described clearly and completely below with reference to the accompanying drawings.
As shown in FIG. 1, the invention addresses long-text classification in the field of chemical preparation and proposes a graph-neural-network-based classification method. It realizes chemical-domain new-word discovery through multi-dimensional feature fusion combined with deep learning; realizes global knowledge-graph node embedding through GloVe vectors for dictionary words and graph-structure embedding for new words; constructs syntax and sequence tensor graphs over the nodes and fuses inter-node relation features to realize global knowledge-graph edge embedding, thereby building the global knowledge graph. A graph convolutional neural network acquires node classification information, and a fully connected layer followed by softmax classifies the text nodes. By constructing a multivariate global semantic graph and applying a graph convolutional neural network, the invention remedies the low classification accuracy of long texts in the field of chemical preparation. The concrete embodiment is as follows:
the method comprises the following steps: discovery of new words in chemical field
To improve the accuracy of word segmentation for chemical-domain text, a chemical-domain new-word dictionary must first be extracted, because the built-in dictionaries of common segmentation tools target the general domain and mis-segment professional chemical terms at a high rate. An improved segmentation method for chemical-domain text is therefore proposed; it proceeds in the following four steps:
1.1 raw corpus preprocessing
Since Chinese grammatical features are clearest at the sentence level, the large-scale chemical-domain corpus is first split into sentences at punctuation marks such as commas and periods, and special symbols are removed to reduce noise characters. All text fragments of at most 5 characters within each sentence are then extracted as new-word candidates.
1.2 extracting word features
Word frequency, word length, mutual information, and context information entropy are extracted as features according to a statistical model, and word vectors are added to enrich the word features.
The invention uses mutual information as a word feature. Mutual information measures the degree of mutual dependence between variables; word mutual information measures how strongly characters are correlated with each other. It is given by formula (1):
PMI(X, Y) = log2( p(X, Y) / ( p(X) p(Y) ) )  (1)
Where p (X, Y) is the joint probability distribution function of X and Y, and p (X) and p (Y) are the edge probability distribution functions of X and Y, respectively.
The invention uses the context information entropy to measure the uncertainty of the characters to the left and right of a character fragment. The larger the entropy, the higher the probability that the fragment forms a word on its own. It is given by formula (2):
H(X) = -∑_{x∈X} p(x) log2 p(x)  (2)
Where p (X) is the probability distribution of X.
1.3 fusing word multidimensional features
The multi-dimensional features are input into the RNN + CRF model to obtain a dictionary of the chemical-domain new words in the text.
Compared with an ordinary neural network, a recurrent neural network (RNN) can process sequential data. Applied to new-word discovery, it draws on context information to extract new words from the text. As shown in FIG. 2, the value of the hidden layer of the recurrent neural network depends not only on the current input text but also on the previous hidden-layer value s; the weight matrix W weights the previous hidden-layer value as part of the current input.
Conditional random fields (CRFs) are a special case of Markov random fields. In chemical-domain new-word discovery, the label of each character is strongly influenced by its neighboring labels. For the new-word discovery task, the CRF learns the corresponding labeling rules, so that the final label sequence fits the current character fragment while being most reasonable overall; a model sketch follows.
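A minimal sketch of such an RNN + CRF tagger, assuming PyTorch plus the third-party pytorch-crf package and an illustrative B/I/O tag scheme; the layer sizes and the feature pipeline (a plain token embedding here, instead of the fused multi-dimensional features) are assumptions, not the patent's exact setup.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf (assumed dependency)

class RnnCrfTagger(nn.Module):
    def __init__(self, vocab_size: int, feat_dim: int = 64, hidden: int = 128, num_tags: int = 3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, feat_dim)
        # The fused multi-dimensional word features (frequency, length, PMI,
        # entropy, word vectors) would be concatenated here; we embed ids only.
        self.rnn = nn.RNN(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.emit = nn.Linear(2 * hidden, num_tags)   # per-character tag scores
        self.crf = CRF(num_tags, batch_first=True)    # learns tag-transition rules

    def loss(self, tokens, tags, mask):
        emissions = self.emit(self.rnn(self.embed(tokens))[0])
        return -self.crf(emissions, tags, mask=mask)  # negative log-likelihood

    def decode(self, tokens, mask):
        emissions = self.emit(self.rnn(self.embed(tokens))[0])
        return self.crf.decode(emissions, mask=mask)  # most likely tag paths

# Usage on a dummy batch of character ids:
model = RnnCrfTagger(vocab_size=5000)
toks = torch.randint(0, 5000, (2, 10))
mask = torch.ones(2, 10, dtype=torch.bool)
print(model.decode(toks, mask))
```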
1.4 New word dictionary participles
The chemical-domain new-word dictionary is then added to the segmentation tool to obtain segmentation results for chemical-domain text. The user-defined dictionary is loaded into the segmentation tool jieba, which gives it priority, improving the applicability of the segmentation results to the professional domain. The optimized method yields high-accuracy segmentation of chemical-domain text, as sketched below.
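A short sketch of the dictionary-backed segmentation, assuming the jieba package and a hypothetical new-word file chem_new_words.txt (one term per line) produced by the step above.

```python
import jieba

jieba.load_userdict("chem_new_words.txt")  # hypothetical new-word file, one term per line
jieba.add_word("聚碳酸酯", freq=20000)       # terms may also be added one by one

text = "本发明涉及一种聚碳酸酯的制备工艺"
print("/".join(jieba.cut(text)))            # the user dictionary takes priority
```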
Step two: initial multidimensional tensor graph construction
The preceding step obtains chemical-domain new-word information and realizes word segmentation for long texts in the field of chemical preparation. Next, a global knowledge graph is built for the corpus of a given chemical product's preparation domain. The nodes of the global knowledge graph are all document nodes and word nodes in the corpus, constructed in the following steps:
2.1 initialization node embedding
The nodes of the global graph are all document nodes and word nodes, including the training and test sets, in the corpus for the preparation domain of a given chemical product.
Word nodes are initialized with GloVe embeddings into the vector space; chemical-domain new-word nodes outside the dictionary are initialized to 0; document nodes are encoded ordinally. The dimension d is set to 300, so that each node obtains a 300-dimensional vector representation.
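A minimal sketch of this initialization, assuming pre-trained 300-dimensional GloVe vectors held in a Python dict; the ordinal one-hot coding of document nodes is one plausible reading of "encoded in order" and is an assumption.

```python
import numpy as np

def init_node_embeddings(words, docs, glove, dim=300):
    """words: segmented word list; docs: document ids; glove: dict word -> 300-d vector."""
    emb = {}
    for w in words:
        # dictionary words get their GloVe vector; OOV chemical new words get zeros
        emb[w] = glove.get(w, np.zeros(dim, dtype=np.float32))
    for k, d in enumerate(docs):
        v = np.zeros(dim, dtype=np.float32)
        v[k % dim] = 1.0                     # ordinal one-hot code for document k
        emb[d] = v
    return emb

# Toy usage with a stand-in GloVe table:
glove = {"乙烯": np.random.rand(300).astype(np.float32)}
emb = init_node_embeddings(["乙烯", "新词X"], ["doc0", "doc1"], glove)
print(emb["新词X"].sum(), emb["doc1"].argmax())  # 0.0 and 1
```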
2.2 Multi-dimensional tensor map construction
A global syntax tensor graph and a sequence tensor graph are constructed; the two graphs share the same nodes. In both graphs, the edges between document nodes and word (new-word) nodes are identical, and their weights are computed with tf-idf (term frequency-inverse document frequency), as shown in formula (3).
tfidf(i, j) = ( T_ij / C_i ) · log( CP / CP_j )  (3)
where i is a document, j is a word (new word), C_i is the total number of words in document i, T_ij is the number of occurrences of j in i, CP is the total number of documents in the corpus, and CP_j is the number of documents in the corpus that contain word j; a counting sketch follows.
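A counting sketch of formula (3), where docs is assumed to be the list of segmented documents (token lists) produced in step one.

```python
import math

def tfidf_weight(docs: list[list[str]], i: int, j: str) -> float:
    t_ij = docs[i].count(j)                  # occurrences of word j in document i
    c_i = len(docs[i])                       # total words in document i
    cp = len(docs)                           # total documents in the corpus
    cp_j = sum(1 for d in docs if j in d)    # documents containing word j
    return (t_ij / c_i) * math.log(cp / cp_j)

docs = [["乙烯", "聚合", "催化剂"], ["乙烯", "裂解"], ["催化剂", "再生"]]
print(tfidf_weight(docs, 0, "乙烯"))
```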
Next, an edge between the participle (new word) and the participle (new word) node is constructed:
G_Syn, the syntax tensor graph: for each document, the ltp parser first extracts the syntactic dependency relations between words, and every relation is treated as an undirected edge. The number of times each pair of words stands in a syntactic dependency across the entire corpus then defines the weight of the edge between the two word (syntax-graph) nodes, as shown in formula (4).
A_{j1,j2} = N_syntactic(w_j1, w_j2) / N_total(w_j1, w_j2)  (4)
where A_{j1,j2} is the weight of the edge between words j1 and j2, N_syntactic(w_j1, w_j2) is the number of times the two words stand in a syntactic dependency across all documents of the corpus, N_total(w_j1, w_j2) is the number of times the two words appear in the same document across the corpus, and num denotes the code of the syntactic dependency relation between the two words, as shown in FIG. 3.
Code (num)  Relation type           Tag
0           Subject-verb            SBV
1           Verb-object             VOB
2           Indirect object         IOB
3           Fronted object          FOB
4           Double (pivotal)        DBL
5           Attribute               ATT
6           Adverbial               ADV
7           Verb-complement         CMP
8           Coordinate              COO
9           Preposition-object      POB
10          Left adjunct            LAD
11          Right adjunct           RAD
12          Independent structure   IS
13          Head (core)             HED
FIG. 3 syntax dependency encoding diagram
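A minimal counting sketch of formula (4); the per-document dependency word pairs are assumed to come from an LTP parse, whose invocation is omitted here.

```python
from collections import Counter
from itertools import combinations

def syntactic_weights(doc_tokens, doc_dep_pairs):
    """doc_tokens: one set of words per document;
    doc_dep_pairs: one set of dependency word pairs per document (from ltp)."""
    n_syn, n_tot = Counter(), Counter()
    for toks, deps in zip(doc_tokens, doc_dep_pairs):
        for pair in deps:
            n_syn[frozenset(pair)] += 1          # undirected dependency edge count
        for a, b in combinations(sorted(toks), 2):
            n_tot[frozenset((a, b))] += 1        # co-occurrence in the same document
    return {p: n_syn[p] / n_tot[p] for p in n_syn if n_tot[p]}

# Toy usage on two parsed documents:
toks = [{"乙烯", "聚合", "催化剂"}, {"乙烯", "催化剂"}]
deps = [{("乙烯", "聚合")}, {("乙烯", "催化剂")}]
print(syntactic_weights(toks, deps))
```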
G_PMI, the sequence tensor graph: for each document, the weights between word (new-word) nodes are computed in a sliding-window fashion, as shown in formula (5).
PMI(j1, j2) = log( p(j1, j2) / ( p(j1) p(j2) ) ), where p(j1, j2) = W(j1, j2) / W and p(j) = W(j) / W  (5)
where W(j1, j2) is the number of sliding windows containing both word j1 and word j2, W(j) is the number of sliding windows containing word j, and W is the total number of sliding windows. The sliding-window length is set to 25 herein; a counting sketch follows.
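A minimal sketch of the sliding-window PMI of formula (5); the toy token list and the small window width are illustrative.

```python
import math

def pmi(tokens: list[str], j1: str, j2: str, width: int = 25) -> float:
    windows = [tokens[k:k + width] for k in range(max(1, len(tokens) - width + 1))]
    w = len(windows)                                     # total number of windows
    w1 = sum(1 for win in windows if j1 in win)          # windows containing j1
    w2 = sum(1 for win in windows if j2 in win)          # windows containing j2
    w12 = sum(1 for win in windows if j1 in win and j2 in win)
    return math.log((w12 / w) / ((w1 / w) * (w2 / w)))

toks = ["乙烯", "在", "催化剂", "作用", "下", "聚合"]
print(pmi(toks, "乙烯", "催化剂", width=3))
```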
Step three: graph message passing feature iteration
At this point, the construction of the initial syntax tensor graph G_Syn and sequence tensor graph G_PMI is complete.
As shown in FIG. 4, a graph message-passing mechanism iterates through the message-passing layers of a GNN to obtain embedded features that fuse adjacent nodes and edges within each graph. This updates the initial vector embeddings of the syntax and sequence tensor graphs; in particular, for new-word nodes the OOV (out-of-vocabulary) features are represented by the features of their neighbors.
The invention constructs a 4-layer GNN to propagate and iterate features. Define the graph as G = (u, V, E), where V = {v_i} is the set of all nodes, E = {(e_k, r_k, s_k)} is the set of all edges (e_k the edge feature, r_k and s_k the indices of its endpoint nodes), and u is the global attribute of the graph. The three update functions for the nodes V, the edges E, and the global feature U are shown in formulas (6) to (8):

e'_k = φ^e( e_k, v_{r_k}, v_{s_k}, u )  (6)
v'_i = φ^v( v_i, E'_i, u )  (7)
u' = φ^u( u, E', V' )  (8)

where E'_i is the set of updated edges incident on node i, and E' and V' are the full sets of updated edges and nodes.
Formula (7) takes as a parameter the set of edges incident on a node and uses the information of all edges in that set to adjust the node's state. Formula (8) takes the sets of updated edges and nodes and uses their information to adjust the global state. A fully connected layer is then attached to obtain the updated syntax tensor graph G'_Syn and sequence tensor graph G'_PMI; a layer sketch follows.
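A minimal sketch of one such message-passing layer in PyTorch, following the generic edge-then-node-then-global update order of formulas (6) to (8); the linear φ-functions, the ReLU, and the mean/sum aggregations are assumptions, since the patent does not fix them.

```python
import torch
import torch.nn as nn

class GNLayer(nn.Module):
    """One message-passing layer in the spirit of formulas (6)-(8)."""
    def __init__(self, dim: int):
        super().__init__()
        self.phi_e = nn.Linear(4 * dim, dim)  # edge update, formula (6)
        self.phi_v = nn.Linear(3 * dim, dim)  # node update, formula (7)
        self.phi_u = nn.Linear(3 * dim, dim)  # global update, formula (8)

    def forward(self, v, e, u, src, dst):
        # v: [N, d] node features, e: [M, d] edge features, u: [1, d] global
        # attribute, src/dst: [M] endpoint indices of each edge.
        ue = u.expand(e.size(0), -1)
        e = torch.relu(self.phi_e(torch.cat([e, v[src], v[dst], ue], -1)))   # (6)
        agg = torch.zeros_like(v).index_add_(0, dst, e)          # sum of incident edges
        deg = torch.zeros(v.size(0), 1).index_add_(
            0, dst, torch.ones(e.size(0), 1)).clamp(min=1)       # for a mean
        un = u.expand(v.size(0), -1)
        v = torch.relu(self.phi_v(torch.cat([v, agg / deg, un], -1)))        # (7)
        u = torch.relu(self.phi_u(torch.cat([u, e.mean(0, keepdim=True),
                                             v.mean(0, keepdim=True)], -1))) # (8)
        return v, e, u

# Usage on a toy graph, stacking 4 layers as in the text:
layers = nn.ModuleList(GNLayer(8) for _ in range(4))
v, e, u = torch.randn(5, 8), torch.randn(6, 8), torch.randn(1, 8)
src = torch.tensor([0, 1, 2, 3, 4, 0]); dst = torch.tensor([1, 2, 3, 4, 0, 2])
for layer in layers:
    v, e, u = layer(v, e, u, src, dst)
print(v.shape, e.shape, u.shape)
```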
Step four: construction of global semantic graph by tensor graph fusion
The invention fuses the syntax tensor graph and the sequence tensor graph using 1×1 convolution kernels.
The 1×1 convolution kernel method is widely used for feature-dimension changes in image processing; a filter of size 1×1 performs the convolution operation. It first appeared in the Network In Network paper, where it was used to deepen and widen the network structure. Borrowing this idea of cross-channel dimension transformation with 1×1 convolution kernels on images, the invention sets several 1×1 filters so that the number of output channels, i.e. the dimensionality, can be reduced or increased at will.
Performing the 1×1-convolution dimensionality-reduction integration on the adjacency matrices of the syntax tensor graph G'_Syn and the sequence tensor graph G'_PMI yields the global semantic graph G_E constructed from the chemical-preparation corpus, as sketched below.
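A minimal sketch of the 1×1-convolution fusion, stacking the two n×n adjacency matrices as channels and reducing them to one; the sizes are illustrative.

```python
import torch
import torch.nn as nn

n = 6
a_syn = torch.rand(n, n)   # adjacency of the updated syntax tensor graph
a_pmi = torch.rand(n, n)   # adjacency of the updated sequence tensor graph

fuse = nn.Conv2d(in_channels=2, out_channels=1, kernel_size=1)  # learned channel mix
stacked = torch.stack([a_syn, a_pmi]).unsqueeze(0)  # shape [1, 2, n, n]
a_global = fuse(stacked).squeeze()                  # shape [n, n]
print(a_global.shape)
```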
Step five: graph convolution neural network obtaining node classification information
Through the above steps, the invention obtains the global semantic graph G_E constructed from the chemical-preparation corpus; it is then used as network input to train and predict the categories of the unclassified document nodes in the global graph.
The graph neural network is a differentiable information-propagation model. The graph neural network that performs document-node classification prediction on the global semantic graph is a variant of the basic graph convolutional network (GCN).
The GCN (graph convolutional network) is a generalized form of the conventional convolutional neural network (CNN) that operates directly on graphs. Formally, consider a graph G = (V, E), where V is the set of nodes, E is the set of edges, and X ∈ R^(n×m) is the matrix containing all n nodes and their m-dimensional features.
Let A be the adjacency matrix of G and D its degree matrix, where D_ii = ∑_j A_ij. One convolution layer of the GCN captures only information from direct neighbors; when multiple GCN layers are stacked, information over larger semantic ranges is integrated. For a single-layer GCN, the node feature matrix L^(1) ∈ R^(n×k) is computed as shown in formula (9).
L^(1) = ρ( Ã X W_0 )  (9)
where Ã = D^(-1/2) A D^(-1/2) is the normalized symmetric adjacency matrix, W_0 ∈ R^(m×k) is a weight matrix, and ρ is an activation function, e.g. the ReLU function.
The invention feeds the global semantic graph G_E into a two-layer GCN (graph convolutional network), in which the embedding size of the second layer equals the size of the label set; a softmax layer is then attached to classify the document nodes to be classified, computed as shown in formula (10).
Z = softmax( Ã ReLU( Ã X W_0 ) W_1 )  (10)

where Ã is the normalized adjacency matrix and W_1 is the weight matrix of the second layer.
and finally, predicting and outputting the graph neural network by using a softmax classifier to obtain a classification result of each document node to be classified.
Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand it, the scope of the present invention is not limited to those specific embodiments. Obvious variations and all inventions that make use of the concepts of the present invention are intended to be protected.

Claims (6)

1. A long text classification method in the chemical preparation field based on a graph neural network is characterized by comprising the following steps:
step 1: discovery of new words in chemical field
Step 2: initial multidimensional tensor graph construction
And step 3: graph message passing feature iteration
And 4, step 4: construction of global semantic graph by tensor graph fusion
And 5: the graph convolution neural network obtains node classification information.
2. The method for classifying long texts in the field of chemical preparation based on a graph neural network as claimed in claim 1, wherein the chemical-domain new-word discovery in step 1 specifically comprises:
step 1.1 raw corpus preprocessing
Since Chinese grammatical features are clearest at the sentence level, the large-scale chemical-domain corpus is first split into sentences at punctuation marks such as commas and periods, and special symbols are removed to reduce noise characters. All text fragments of at most 5 characters within each sentence are then extracted as new-word candidates.
Step 1.2 extraction of word features
Word frequency, word length, mutual information, and context information entropy are extracted as features according to a statistical model, and word vectors are added to enrich the word features.
The invention uses mutual information as a word feature. Mutual information measures the degree of mutual dependence between variables; word mutual information measures how strongly characters are correlated with each other. It is given by formula (1):
PMI(X, Y) = log2( p(X, Y) / ( p(X) p(Y) ) )  (1)
Where p (X, Y) is the joint probability distribution function of X and Y, and p (X) and p (Y) are the edge probability distribution functions of X and Y, respectively.
The invention uses the context information entropy to measure the uncertainty of the characters to the left and right of a character fragment. The larger the entropy, the higher the probability that the fragment forms a word on its own. It is given by formula (2):
H(X) = -∑_{x∈X} p(x) log2 p(x)  (2)
Where p (X) is the probability distribution of X.
Step 1.3 fusion of word multidimensional features
The multi-dimensional features are input into the RNN + CRF model to obtain a dictionary of the chemical-domain new words in the text.
Compared with an ordinary neural network, a recurrent neural network (RNN) can process sequential data. Applied to new-word discovery, it draws on context information to extract new words from the text. As shown in FIG. 1, the value of the hidden layer of the recurrent neural network depends not only on the current input text but also on the previous hidden-layer value s; the weight matrix W weights the previous hidden-layer value as part of the current input.
Conditional random fields (CRFs) are a special case of Markov random fields. In chemical-domain new-word discovery, the label of each character is strongly influenced by its neighboring labels. For the new-word discovery task, the CRF learns the corresponding labeling rules, so that the final label sequence fits the current character fragment while keeping the whole sentence most reasonable.
Step 1.4 New word dictionary participle
The chemical-domain new-word dictionary is then added to the segmentation tool to obtain segmentation results for chemical-domain text. The user-defined dictionary is loaded into the segmentation tool jieba, which gives it priority, improving the applicability of the segmentation results to the professional domain. The optimized method yields high-accuracy segmentation of chemical-domain text.
3. The method for classifying long texts in the field of chemical preparation based on a graph neural network as claimed in claim 2, wherein the initial multi-dimensional tensor graph construction in step 2 specifically comprises:
step 2.1 initialization node embedding
The nodes of the global graph are all document nodes and word nodes, including the training and test sets, in the corpus for the preparation domain of a given chemical product.
Word nodes are initialized with GloVe embeddings into the vector space; chemical-domain new-word nodes outside the dictionary are initialized to 0; document nodes are encoded ordinally. The dimension d is set to 300, so that each node obtains a 300-dimensional vector representation.
Step 2.2 Multi-dimensional tensor map construction
A global syntax tensor graph and a sequence tensor graph are constructed; the two graphs share the same nodes. In both graphs, the edges between document nodes and word (new-word) nodes are identical, and their weights are computed with tf-idf (term frequency-inverse document frequency), as shown in formula (3).
tfidf(i, j) = ( T_ij / C_i ) · log( CP / CP_j )  (3)
where i is a document, j is a word (new word), C_i is the total number of words in document i, T_ij is the number of occurrences of j in i, CP is the total number of documents in the corpus, and CP_j is the number of documents in the corpus that contain word j.
Next, an edge between the participle (new word) and the participle (new word) node is constructed:
G_Syn, the syntax tensor graph: for each document, the ltp parser first extracts the syntactic dependency relations between words, and every relation is treated as an undirected edge. The number of times each pair of words stands in a syntactic dependency across the entire corpus then defines the weight of the edge between the two word (syntax-graph) nodes, as shown in formula (4).
A_{j1,j2} = N_syntactic(w_j1, w_j2) / N_total(w_j1, w_j2)  (4)
where A_{j1,j2} is the weight of the edge between words j1 and j2, N_syntactic(w_j1, w_j2) is the number of times the two words stand in a syntactic dependency across all documents of the corpus, N_total(w_j1, w_j2) is the number of times the two words appear in the same document across the corpus, and num denotes the code of the syntactic dependency relation between the two words, as shown in FIG. 2.
G_PMI, the sequence tensor graph: for each document, the weights between word (new-word) nodes are computed in a sliding-window fashion, as shown in formula (5).
PMI(j1, j2) = log( p(j1, j2) / ( p(j1) p(j2) ) ), where p(j1, j2) = W(j1, j2) / W and p(j) = W(j) / W  (5)
where W(j1, j2) is the number of sliding windows containing both word j1 and word j2, W(j) is the number of sliding windows containing word j, and W is the total number of sliding windows. The sliding-window length is set to 25 herein.
4. The method for classifying long texts in the field of chemical preparation based on a graph neural network as claimed in claim 3, wherein the graph message-passing feature iteration in step 3 specifically comprises:
As shown in FIG. 3, a graph message-passing mechanism iterates through the message-passing layers of a GNN to obtain embedded features that fuse adjacent nodes and edges within each graph. This updates the initial vector embeddings of the syntax and sequence tensor graphs; in particular, for new-word nodes the OOV (out-of-vocabulary) features are represented by the features of their neighbors.
The invention constructs a 4-layer GNN to propagate and iterate features. Define the graph as G = (u, V, E), where V = {v_i} is the set of all nodes, E = {(e_k, r_k, s_k)} is the set of all edges (e_k the edge feature, r_k and s_k the indices of its endpoint nodes), and u is the global attribute of the graph. The three update functions for the nodes V, the edges E, and the global feature U are shown in formulas (6) to (8):

e'_k = φ^e( e_k, v_{r_k}, v_{s_k}, u )  (6)
v'_i = φ^v( v_i, E'_i, u )  (7)
u' = φ^u( u, E', V' )  (8)

where E'_i is the set of updated edges incident on node i, and E' and V' are the full sets of updated edges and nodes.
Formula (7) takes as a parameter the set of edges incident on a node and uses the information of all edges in that set to adjust the node's state. Formula (8) takes the sets of updated edges and nodes and uses their information to adjust the global state. A fully connected layer is then attached to obtain the updated syntax tensor graph G'_Syn and sequence tensor graph G'_PMI.
5. The method for classifying long texts in the field of chemical preparation based on a graph neural network as claimed in claim 4, wherein the construction of the global semantic graph by tensor-graph fusion in step 4 specifically comprises:
the invention adopts a method of 1 multiplied by 1 convolution kernel to fuse the grammatical tensor map and the sequential tensor map.
The 1×1 convolution kernel method is widely used for feature-dimension changes in image processing; a filter of size 1×1 performs the convolution operation. It first appeared in the Network In Network paper, where it was used to deepen and widen the network structure. Borrowing this idea of cross-channel dimension transformation with 1×1 convolution kernels on images, the invention sets several 1×1 filters so that the number of output channels, i.e. the dimensionality, can be reduced or increased at will.
Performing the 1×1-convolution dimensionality-reduction integration on the adjacency matrices of the syntax tensor graph G'_Syn and the sequence tensor graph G'_PMI yields the global semantic graph G_E constructed from the chemical-preparation corpus.
6. The method for classifying long texts in the field of chemical preparation based on a graph neural network as claimed in claim 5, wherein the acquisition of node classification information by the graph convolutional neural network in step 5 specifically comprises:
Through the above steps, the global semantic graph G_E constructed from the chemical-preparation corpus is obtained; it is then used as network input to train and predict the categories of the unclassified document nodes in the global graph.
The graph neural network is a differentiable information-propagation model. The graph neural network that performs document-node classification prediction on the global semantic graph is a variant of the basic graph convolutional network (GCN).
The GCN (graph convolutional network) is a generalized form of the conventional convolutional neural network (CNN) that operates directly on graphs. Formally, consider a graph G = (V, E), where V is the set of nodes, E is the set of edges, and X ∈ R^(n×m) is the matrix containing all n nodes and their m-dimensional features.
Let A be the adjacency matrix of G and D its degree matrix, where D_ii = ∑_j A_ij. One convolution layer of the GCN captures only information from direct neighbors; when multiple GCN layers are stacked, information over larger semantic ranges is integrated. For a single-layer GCN, the node feature matrix L^(1) ∈ R^(n×k) is computed as shown in formula (9).
L^(1) = ρ( Ã X W_0 )  (9)
where Ã = D^(-1/2) A D^(-1/2) is the normalized symmetric adjacency matrix, W_0 ∈ R^(m×k) is a weight matrix, and ρ is an activation function, e.g. the ReLU function.
The invention feeds the global semantic graph G_E into a two-layer GCN (graph convolutional network), in which the embedding size of the second layer equals the size of the label set; a softmax layer is then attached to classify the document nodes to be classified, computed as shown in formula (10).
Z = softmax( Ã ReLU( Ã X W_0 ) W_1 )  (10)

where Ã is the normalized adjacency matrix and W_1 is the weight matrix of the second layer.
and finally, predicting and outputting the graph neural network by using a softmax classifier to obtain a classification result of each document node to be classified.
CN202111567698.1A 2021-12-21 2021-12-21 Long text classification method in chemical preparation field based on graph neural network Pending CN114218389A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111567698.1A CN114218389A (en) 2021-12-21 2021-12-21 Long text classification method in chemical preparation field based on graph neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111567698.1A CN114218389A (en) 2021-12-21 2021-12-21 Long text classification method in chemical preparation field based on graph neural network

Publications (1)

Publication Number Publication Date
CN114218389A (en) 2022-03-22

Family

ID=80704562

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111567698.1A Pending CN114218389A (en) 2021-12-21 2021-12-21 Long text classification method in chemical preparation field based on graph neural network

Country Status (1)

Country Link
CN (1) CN114218389A (en)


Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114817538A (en) * 2022-04-26 2022-07-29 马上消费金融股份有限公司 Training method of text classification model, text classification method and related equipment
CN114817538B (en) * 2022-04-26 2023-08-08 马上消费金融股份有限公司 Training method of text classification model, text classification method and related equipment
CN114942816A (en) * 2022-06-10 2022-08-26 南京大学 Cross-application interface classification method based on text features and graph neural network
CN114942816B (en) * 2022-06-10 2023-09-05 南京大学 Cross-application interface classification method based on text features and graph neural network
CN115858792A (en) * 2023-02-20 2023-03-28 山东省计算中心(国家超级计算济南中心) Short text classification method and system for bidding project names based on graph neural network
CN115858792B (en) * 2023-02-20 2023-06-09 山东省计算中心(国家超级计算济南中心) Short text classification method and system for bidding project names based on graphic neural network
CN116932765A (en) * 2023-09-15 2023-10-24 中汽信息科技(天津)有限公司 Patent text multi-stage classification method and equipment based on graphic neural network
CN116932765B (en) * 2023-09-15 2023-12-08 中汽信息科技(天津)有限公司 Patent text multi-stage classification method and equipment based on graphic neural network
CN117273015A (en) * 2023-11-22 2023-12-22 湖南省水运建设投资集团有限公司 Electronic file archiving and classifying method for semantic analysis
CN117273015B (en) * 2023-11-22 2024-02-13 湖南省水运建设投资集团有限公司 Electronic file archiving and classifying method for semantic analysis
CN117421487A (en) * 2023-12-19 2024-01-19 西安康奈网络科技有限公司 Multiple network information screening management system based on artificial intelligence
CN117421487B (en) * 2023-12-19 2024-03-08 西安康奈网络科技有限公司 Multiple network information screening management system based on artificial intelligence

Similar Documents

Publication Publication Date Title
CN114218389A (en) Long text classification method in chemical preparation field based on graph neural network
CN108681557B (en) Short text topic discovery method and system based on self-expansion representation and similar bidirectional constraint
CN108415953A (en) A kind of non-performing asset based on natural language processing technique manages knowledge management method
CN112487190B (en) Method for extracting relationships between entities from text based on self-supervision and clustering technology
CN112507699A (en) Remote supervision relation extraction method based on graph convolution network
CN112306494A (en) Code classification and clustering method based on convolution and cyclic neural network
CN111767325B (en) Multi-source data deep fusion method based on deep learning
CN114896388A (en) Hierarchical multi-label text classification method based on mixed attention
CN115017299A (en) Unsupervised social media summarization method based on de-noised image self-encoder
CN111143553A (en) Method and system for identifying specific information of real-time text data stream
CN108846033B (en) Method and device for discovering specific domain vocabulary and training classifier
CN114510946B (en) Deep neural network-based Chinese named entity recognition method and system
CN115438709A (en) Code similarity detection method based on code attribute graph
Zhuo et al. Context attention heterogeneous network embedding
CN111382333B (en) Case element extraction method in news text sentence based on case correlation joint learning and graph convolution
CN111930892A (en) Scientific and technological text classification method based on improved mutual information function
CN116821371A (en) Method for generating scientific abstracts of multiple documents by combining and enhancing topic knowledge graphs
CN116629361A (en) Knowledge reasoning method based on ontology learning and attention mechanism
CN116992040A (en) Knowledge graph completion method and system based on conceptual diagram
CN115617981A (en) Information level abstract extraction method for short text of social network
CN113111288A (en) Web service classification method fusing unstructured and structured information
Li et al. An Efficient Minimal Text Segmentation Method for URL Domain Names
CN114722160B (en) Text data comparison method and device
Colombo et al. Discovering patterns within the drilling reports using artificial intelligence for operation monitoring
Yang et al. Construction and analysis of scientific and technological personnel relational graph for group recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination