CN114218389A - Long text classification method in chemical preparation field based on graph neural network - Google Patents


Info

Publication number
CN114218389A
CN114218389A
Authority
CN
China
Prior art keywords
graph
word
nodes
neural network
chemical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111567698.1A
Other languages
Chinese (zh)
Inventor
周焕来
张博阳
陈璐
唐小龙
高源
孙靖哲
贾海涛
王俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yituo Communications Group Co ltd
Original Assignee
Yituo Communications Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yituo Communications Group Co ltd filed Critical Yituo Communications Group Co ltd
Priority to CN202111567698.1A priority Critical patent/CN114218389A/en
Publication of CN114218389A publication Critical patent/CN114218389A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24137 Distances to cluster centroïds
    • G06F18/2414 Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Animal Behavior & Ethology (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a graph-neural-network-based method for classifying long texts in the field of chemical preparation, which comprises the following steps. First, word statistical features and word-vector features are fused, and the resulting multidimensional features are fed into an RNN + CRF model to extract new words from chemical-preparation texts and build a segmentation dictionary. Second, all documents and segmented words (covering both training and prediction data) are initialized as nodes, and a global syntax tensor graph and a sequence tensor graph are constructed. Next, a graph message-passing mechanism iterates information within each graph (over nodes, edges, and the global attribute) and between graphs to update the feature representations. The syntax and sequence tensor graphs are then fused by dimensionality reduction to obtain a global semantic graph. Finally, the global semantic graph is fed into a graph convolutional network for training; a softmax layer classifies the resulting representations and outputs the category of each document node to be predicted, giving the final prediction result.

Description

Long text classification method in chemical preparation field based on graph neural network
Technical Field
The invention belongs to the field of natural language processing, and relates to a long text classification method in the field of chemical preparation based on a graph neural network.
Background
In recent years, breakthroughs in big data and artificial intelligence technology have injected new growth points into traditional industries and exert an ever-deepening influence on industrial development, research, and decision making. The chemical industry is an important pillar of China's secondary industry; China is a recognized major chemical country, and its production capacity for most chemical products ranks first in the world.
This brings new challenges and opportunities to the traditional chemical industry. On the one hand, as a traditional major carbon emitter, the chemical industry faces a difficult emission-reduction task; on the other hand, it has unique advantages in areas such as the resource utilization of carbon dioxide. Method selection and path exploration in the field of chemical preparation are therefore critical and urgent. At present, for the preparation process and workflow of any chemical product, a large amount of text data can be obtained by searching the product's related patents and querying related documents on the internet. Classifying this textual knowledge by manufacturing process becomes crucial. How to classify such massive text data and retrieve texts by category for analysis is thus a key link of this research.
The study of text classification has long been one of the basic problems in the field of natural language processing. From shallow machine learning to deep learning, researchers have focused on capturing long-range relevance in text. The advent of the BERT (Bidirectional Encoder Representations from Transformers) model, pre-trained on large corpora to bidirectionally encode context-dependent word vectors, marked an important turning point for downstream natural-language-processing tasks such as text classification.
BERT and its improved pre-trained variants have two important problems. First, BERT limits the input text length to 512 characters, while production work in the field of chemical preparation involves a large number of long texts exceeding that length, so such semantic pre-training models cannot be extended to the long-text classification task; in recent years, explorations of GNN (graph neural network) text classification have shown that graph methods capture the structural information of long texts well. Second, BERT has no Chinese word segmentation function and performs word-embedding mapping character by character, and the field of chemical preparation contains a large number of new words, which degrades BERT's embedding learning.
The invention therefore designs a graph-neural-network-based method for classifying long texts in the field of chemical preparation: it realizes domain-specific Chinese word segmentation by identifying new chemical-domain words with a new-word discovery algorithm, realizes global graph-structure construction by fusing multi-source relations between nodes, and realizes text-node classification by iterating classification features with a graph convolutional neural network whose fully connected layer feeds a softmax. In this way the problem of long-text classification in the field of chemical preparation is solved.
Disclosure of Invention
The graph-neural-network-based classification of long texts in the field of chemical preparation mainly comprises four steps: chemical-domain new-word discovery, global knowledge-graph construction, acquisition of node classification information by a graph convolutional neural network, and the output layer.
The invention mainly addresses the problem that global semantic features cannot be effectively captured in long-text classification for the field of chemical preparation, and proposes a graph-neural-network-based classification method for such long texts. The method realizes chemical-domain new-word discovery by fusing multi-dimensional word features with a deep learning model; realizes global knowledge-graph node embedding through GloVe vectors for dictionary words and graph-structure embedding for new words; constructs syntax and sequence tensor graphs over the nodes and fuses the inter-node relation features to realize global knowledge-graph edge embedding, thereby building the global knowledge graph. A graph convolutional neural network then acquires node classification information, a fully connected layer followed by softmax classifies the text nodes, and the classification result is output. The method comprises the following steps:
(1) Construct a new-word dictionary, and segment the text with a user-defined dictionary fitted to the special domain. Fuse word statistical features with word-vector features and feed the multidimensional features into an RNN + CRF model to extract new words from the text.
(2) Construct multi-dimensional tensor graphs: initialize all documents and segmented words (including training and prediction data) as nodes with GloVe embeddings, and build the global syntax and sequence tensor-graph representations.
(3) Through a graph message-passing mechanism, iteratively update the feature representations of nodes, edges, and the global attribute within each graph.
(4) Fuse the syntax and sequence feature tensor graphs to obtain a global semantic graph.
(5) Obtain node classification information with a graph convolutional neural network: after the global knowledge graph is built, feed it into the network for training iterations to obtain document-node classification information; a fully connected layer followed by softmax serves as the output layer, which emits the categories of the document nodes to be predicted and yields the final prediction result.
Description of the figures and accompanying tables
FIG. 1 is a block diagram of the overall algorithm of the present invention.
FIG. 2 is a schematic diagram of the extraction of the RNN text word feature of the present invention.
FIG. 3 is a comparison table of the syntactic dependency relation codes used in the syntax tensor graph of the present invention.
FIG. 4 is a diagram illustrating an iterative embedding of a graph message passing mechanism according to the present invention.
Detailed Description
The technical solution in the embodiments of the present invention is described clearly and completely below with reference to the accompanying drawings.
As shown in FIG. 1, the invention addresses long-text classification in the field of chemical preparation and proposes a graph-neural-network-based classification method. It realizes chemical-domain new-word discovery through multi-dimensional feature fusion combined with deep learning; realizes global knowledge-graph node embedding through GloVe vectors for dictionary words and graph-structure embedding for new words; constructs syntax and sequence tensor graphs over the nodes and fuses inter-node relation features to realize global knowledge-graph edge embedding, thereby building the global knowledge graph. A graph convolutional neural network acquires node classification information, and a fully connected layer followed by softmax classifies the text nodes. By constructing a multivariate global semantic graph and applying a graph convolutional neural network, the invention remedies the low classification accuracy of long texts in the field of chemical preparation. The concrete embodiment is as follows:
the method comprises the following steps: discovery of new words in chemical field
To improve the accuracy of word segmentation for chemical-domain text, a chemical-domain new-word dictionary must first be extracted, because the built-in dictionaries of common segmentation tools target the general domain and mis-segment professional chemical terms at a high rate. An improved segmentation method for chemical-domain text is therefore proposed; it proceeds in the following four steps:
1.1 raw corpus preprocessing
Since Chinese grammatical features are clearest at the sentence level, the large-scale chemical-domain corpus is first split into sentences at punctuation marks such as commas and periods, and special symbols are removed to reduce noise characters. All text fragments of at most 5 characters within each sentence are then extracted as new-word candidates.
1.2 extracting word features
Word frequency, word length, mutual information, and context information entropy are extracted as features according to a statistical model, and word vectors are added to enrich the word features.
The invention uses mutual information as a word feature. Mutual information measures the degree of mutual dependence between variables; word mutual information measures how strongly characters are correlated with each other. It is given by formula (1):
PMI(X, Y) = log2( p(X, Y) / ( p(X) p(Y) ) )  (1)
Where p (X, Y) is the joint probability distribution function of X and Y, and p (X) and p (Y) are the edge probability distribution functions of X and Y, respectively.
The invention uses the context information entropy to measure the uncertainty of the characters to the left and right of a character fragment. The larger the entropy, the higher the probability that the fragment forms a word on its own. It is given by formula (2):
H(X) = -∑_{x∈X} p(x) log2 p(x)  (2)
Where p (X) is the probability distribution of X.
1.3 fusing word multidimensional features
The multi-dimensional features are input into the RNN + CRF model to obtain a dictionary of the chemical-domain new words in the text.
Compared with an ordinary neural network, a recurrent neural network (RNN) can process sequential data. Applied to new-word discovery, it draws on context information to extract new words from the text. As shown in FIG. 2, the value of the hidden layer of the recurrent neural network depends not only on the current input text but also on the previous hidden-layer value s; the weight matrix W weights the previous hidden-layer value as part of the current input.
Conditional random fields (CRFs) are a special case of Markov random fields. In chemical-domain new-word discovery, the label of each character is strongly influenced by its neighboring labels. For the new-word discovery task, the CRF learns the corresponding labeling rules, so that the final label sequence fits the current character fragment while being most reasonable overall; a model sketch follows.
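A minimal sketch of such an RNN + CRF tagger, assuming PyTorch plus the third-party pytorch-crf package and an illustrative B/I/O tag scheme; the layer sizes and the feature pipeline (a plain token embedding here, instead of the fused multi-dimensional features) are assumptions, not the patent's exact setup.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf (assumed dependency)

class RnnCrfTagger(nn.Module):
    def __init__(self, vocab_size: int, feat_dim: int = 64, hidden: int = 128, num_tags: int = 3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, feat_dim)
        # The fused multi-dimensional word features (frequency, length, PMI,
        # entropy, word vectors) would be concatenated here; we embed ids only.
        self.rnn = nn.RNN(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.emit = nn.Linear(2 * hidden, num_tags)   # per-character tag scores
        self.crf = CRF(num_tags, batch_first=True)    # learns tag-transition rules

    def loss(self, tokens, tags, mask):
        emissions = self.emit(self.rnn(self.embed(tokens))[0])
        return -self.crf(emissions, tags, mask=mask)  # negative log-likelihood

    def decode(self, tokens, mask):
        emissions = self.emit(self.rnn(self.embed(tokens))[0])
        return self.crf.decode(emissions, mask=mask)  # most likely tag paths

# Usage on a dummy batch of character ids:
model = RnnCrfTagger(vocab_size=5000)
toks = torch.randint(0, 5000, (2, 10))
mask = torch.ones(2, 10, dtype=torch.bool)
print(model.decode(toks, mask))
```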
1.4 New word dictionary participles
The chemical-domain new-word dictionary is then added to the segmentation tool to obtain segmentation results for chemical-domain text. The user-defined dictionary is loaded into the segmentation tool jieba, which gives it priority, improving the applicability of the segmentation results to the professional domain. The optimized method yields high-accuracy segmentation of chemical-domain text, as sketched below.
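A short sketch of the dictionary-backed segmentation, assuming the jieba package and a hypothetical new-word file chem_new_words.txt (one term per line) produced by the step above.

```python
import jieba

jieba.load_userdict("chem_new_words.txt")  # hypothetical new-word file, one term per line
jieba.add_word("聚碳酸酯", freq=20000)       # terms may also be added one by one

text = "本发明涉及一种聚碳酸酯的制备工艺"
print("/".join(jieba.cut(text)))            # the user dictionary takes priority
```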
Step two: initial multidimensional tensor graph construction
The preceding step obtains chemical-domain new-word information and realizes word segmentation for long texts in the field of chemical preparation. Next, a global knowledge graph is built for the corpus of a given chemical product's preparation domain. The nodes of the global knowledge graph are all document nodes and word nodes in the corpus, constructed in the following steps:
2.1 initialization node embedding
The nodes of the global graph are all document nodes and word nodes, including the training and test sets, in the corpus for the preparation domain of a given chemical product.
Word nodes are initialized with GloVe embeddings into the vector space; chemical-domain new-word nodes outside the dictionary are initialized to 0; document nodes are encoded ordinally. The dimension d is set to 300, so that each node obtains a 300-dimensional vector representation.
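A minimal sketch of this initialization, assuming pre-trained 300-dimensional GloVe vectors held in a Python dict; the ordinal one-hot coding of document nodes is one plausible reading of "encoded in order" and is an assumption.

```python
import numpy as np

def init_node_embeddings(words, docs, glove, dim=300):
    """words: segmented word list; docs: document ids; glove: dict word -> 300-d vector."""
    emb = {}
    for w in words:
        # dictionary words get their GloVe vector; OOV chemical new words get zeros
        emb[w] = glove.get(w, np.zeros(dim, dtype=np.float32))
    for k, d in enumerate(docs):
        v = np.zeros(dim, dtype=np.float32)
        v[k % dim] = 1.0                     # ordinal one-hot code for document k
        emb[d] = v
    return emb

# Toy usage with a stand-in GloVe table:
glove = {"乙烯": np.random.rand(300).astype(np.float32)}
emb = init_node_embeddings(["乙烯", "新词X"], ["doc0", "doc1"], glove)
print(emb["新词X"].sum(), emb["doc1"].argmax())  # 0.0 and 1
```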
2.2 Multi-dimensional tensor map construction
A global syntax tensor graph and a sequence tensor graph are constructed; the two graphs share the same nodes. In both graphs, the edges between document nodes and word (new-word) nodes are identical, and their weights are computed with tf-idf (term frequency-inverse document frequency), as shown in formula (3).
tfidf(i, j) = ( T_ij / C_i ) · log( CP / CP_j )  (3)
where i is a document, j is a word (new word), C_i is the total number of words in document i, T_ij is the number of occurrences of j in i, CP is the total number of documents in the corpus, and CP_j is the number of documents in the corpus that contain word j; a counting sketch follows.
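A counting sketch of formula (3), where docs is assumed to be the list of segmented documents (token lists) produced in step one.

```python
import math

def tfidf_weight(docs: list[list[str]], i: int, j: str) -> float:
    t_ij = docs[i].count(j)                  # occurrences of word j in document i
    c_i = len(docs[i])                       # total words in document i
    cp = len(docs)                           # total documents in the corpus
    cp_j = sum(1 for d in docs if j in d)    # documents containing word j
    return (t_ij / c_i) * math.log(cp / cp_j)

docs = [["乙烯", "聚合", "催化剂"], ["乙烯", "裂解"], ["催化剂", "再生"]]
print(tfidf_weight(docs, 0, "乙烯"))
```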
Next, an edge between the participle (new word) and the participle (new word) node is constructed:
G_Syn, the syntax tensor graph: for each document, the ltp parser first extracts the syntactic dependency relations between words, and every relation is treated as an undirected edge. The number of times each pair of words stands in a syntactic dependency across the entire corpus then defines the weight of the edge between the two word (syntax-graph) nodes, as shown in formula (4).
A_{j1,j2} = N_syntactic(w_j1, w_j2) / N_total(w_j1, w_j2)  (4)
where A_{j1,j2} is the weight of the edge between words j1 and j2, N_syntactic(w_j1, w_j2) is the number of times the two words stand in a syntactic dependency across all documents of the corpus, N_total(w_j1, w_j2) is the number of times the two words appear in the same document across the corpus, and num denotes the code of the syntactic dependency relation between the two words, as shown in FIG. 3.
Code (num)  Relation type           Tag
0           Subject-verb            SBV
1           Verb-object             VOB
2           Indirect object         IOB
3           Fronted object          FOB
4           Double (pivotal)        DBL
5           Attribute               ATT
6           Adverbial               ADV
7           Verb-complement         CMP
8           Coordinate              COO
9           Preposition-object      POB
10          Left adjunct            LAD
11          Right adjunct           RAD
12          Independent structure   IS
13          Head (core)             HED
FIG. 3 syntax dependency encoding diagram
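A minimal counting sketch of formula (4); the per-document dependency word pairs are assumed to come from an LTP parse, whose invocation is omitted here.

```python
from collections import Counter
from itertools import combinations

def syntactic_weights(doc_tokens, doc_dep_pairs):
    """doc_tokens: one set of words per document;
    doc_dep_pairs: one set of dependency word pairs per document (from ltp)."""
    n_syn, n_tot = Counter(), Counter()
    for toks, deps in zip(doc_tokens, doc_dep_pairs):
        for pair in deps:
            n_syn[frozenset(pair)] += 1          # undirected dependency edge count
        for a, b in combinations(sorted(toks), 2):
            n_tot[frozenset((a, b))] += 1        # co-occurrence in the same document
    return {p: n_syn[p] / n_tot[p] for p in n_syn if n_tot[p]}

# Toy usage on two parsed documents:
toks = [{"乙烯", "聚合", "催化剂"}, {"乙烯", "催化剂"}]
deps = [{("乙烯", "聚合")}, {("乙烯", "催化剂")}]
print(syntactic_weights(toks, deps))
```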
G_PMI, the sequence tensor graph: for each document, the weights between word (new-word) nodes are computed in a sliding-window fashion, as shown in formula (5).
PMI(j1, j2) = log( p(j1, j2) / ( p(j1) p(j2) ) ), where p(j1, j2) = W(j1, j2) / W and p(j) = W(j) / W  (5)
where W(j1, j2) is the number of sliding windows containing both word j1 and word j2, W(j) is the number of sliding windows containing word j, and W is the total number of sliding windows. The sliding-window length is set to 25 herein; a counting sketch follows.
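A minimal sketch of the sliding-window PMI of formula (5); the toy token list and the small window width are illustrative.

```python
import math

def pmi(tokens: list[str], j1: str, j2: str, width: int = 25) -> float:
    windows = [tokens[k:k + width] for k in range(max(1, len(tokens) - width + 1))]
    w = len(windows)                                     # total number of windows
    w1 = sum(1 for win in windows if j1 in win)          # windows containing j1
    w2 = sum(1 for win in windows if j2 in win)          # windows containing j2
    w12 = sum(1 for win in windows if j1 in win and j2 in win)
    return math.log((w12 / w) / ((w1 / w) * (w2 / w)))

toks = ["乙烯", "在", "催化剂", "作用", "下", "聚合"]
print(pmi(toks, "乙烯", "催化剂", width=3))
```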
Step three: graph message passing feature iteration
At this point, the construction of the initial syntax tensor graph G_Syn and sequence tensor graph G_PMI is complete.
As shown in FIG. 4, a graph message-passing mechanism iterates through the message-passing layers of a GNN to obtain embedded features that fuse adjacent nodes and edges within each graph. This updates the initial vector embeddings of the syntax and sequence tensor graphs; in particular, for new-word nodes the OOV (out-of-vocabulary) features are represented by the features of their neighbors.
The invention constructs a 4-layer GNN to propagate and iterate features. Define the graph as G = (u, V, E), where V = {v_i} is the set of all nodes, E = {(e_k, r_k, s_k)} is the set of all edges (e_k the edge feature, r_k and s_k the indices of its endpoint nodes), and u is the global attribute of the graph. The three update functions for the nodes V, the edges E, and the global feature U are shown in formulas (6) to (8):

e'_k = φ^e( e_k, v_{r_k}, v_{s_k}, u )  (6)
v'_i = φ^v( v_i, E'_i, u )  (7)
u' = φ^u( u, E', V' )  (8)

where E'_i is the set of updated edges incident on node i, and E' and V' are the full sets of updated edges and nodes.
Formula (7) takes as a parameter the set of edges incident on a node and uses the information of all edges in that set to adjust the node's state. Formula (8) takes the sets of updated edges and nodes and uses their information to adjust the global state. A fully connected layer is then attached to obtain the updated syntax tensor graph G'_Syn and sequence tensor graph G'_PMI; a layer sketch follows.
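A minimal sketch of one such message-passing layer in PyTorch, following the generic edge-then-node-then-global update order of formulas (6) to (8); the linear φ-functions, the ReLU, and the mean/sum aggregations are assumptions, since the patent does not fix them.

```python
import torch
import torch.nn as nn

class GNLayer(nn.Module):
    """One message-passing layer in the spirit of formulas (6)-(8)."""
    def __init__(self, dim: int):
        super().__init__()
        self.phi_e = nn.Linear(4 * dim, dim)  # edge update, formula (6)
        self.phi_v = nn.Linear(3 * dim, dim)  # node update, formula (7)
        self.phi_u = nn.Linear(3 * dim, dim)  # global update, formula (8)

    def forward(self, v, e, u, src, dst):
        # v: [N, d] node features, e: [M, d] edge features, u: [1, d] global
        # attribute, src/dst: [M] endpoint indices of each edge.
        ue = u.expand(e.size(0), -1)
        e = torch.relu(self.phi_e(torch.cat([e, v[src], v[dst], ue], -1)))   # (6)
        agg = torch.zeros_like(v).index_add_(0, dst, e)          # sum of incident edges
        deg = torch.zeros(v.size(0), 1).index_add_(
            0, dst, torch.ones(e.size(0), 1)).clamp(min=1)       # for a mean
        un = u.expand(v.size(0), -1)
        v = torch.relu(self.phi_v(torch.cat([v, agg / deg, un], -1)))        # (7)
        u = torch.relu(self.phi_u(torch.cat([u, e.mean(0, keepdim=True),
                                             v.mean(0, keepdim=True)], -1))) # (8)
        return v, e, u

# Usage on a toy graph, stacking 4 layers as in the text:
layers = nn.ModuleList(GNLayer(8) for _ in range(4))
v, e, u = torch.randn(5, 8), torch.randn(6, 8), torch.randn(1, 8)
src = torch.tensor([0, 1, 2, 3, 4, 0]); dst = torch.tensor([1, 2, 3, 4, 0, 2])
for layer in layers:
    v, e, u = layer(v, e, u, src, dst)
print(v.shape, e.shape, u.shape)
```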
Step four: construction of global semantic graph by tensor graph fusion
The invention fuses the syntax tensor graph and the sequence tensor graph using 1×1 convolution kernels.
The 1×1 convolution kernel method is widely used for feature-dimension changes in image processing; a filter of size 1×1 performs the convolution operation. It first appeared in the Network In Network paper, where it was used to deepen and widen the network structure. Borrowing this idea of cross-channel dimension transformation with 1×1 convolution kernels on images, the invention sets several 1×1 filters so that the number of output channels, i.e. the dimensionality, can be reduced or increased at will.
Performing the 1×1-convolution dimensionality-reduction integration on the adjacency matrices of the syntax tensor graph G'_Syn and the sequence tensor graph G'_PMI yields the global semantic graph G_E constructed from the chemical-preparation corpus, as sketched below.
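A minimal sketch of the 1×1-convolution fusion, stacking the two n×n adjacency matrices as channels and reducing them to one; the sizes are illustrative.

```python
import torch
import torch.nn as nn

n = 6
a_syn = torch.rand(n, n)   # adjacency of the updated syntax tensor graph
a_pmi = torch.rand(n, n)   # adjacency of the updated sequence tensor graph

fuse = nn.Conv2d(in_channels=2, out_channels=1, kernel_size=1)  # learned channel mix
stacked = torch.stack([a_syn, a_pmi]).unsqueeze(0)  # shape [1, 2, n, n]
a_global = fuse(stacked).squeeze()                  # shape [n, n]
print(a_global.shape)
```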
Step five: graph convolution neural network obtaining node classification information
Through the above steps, the invention obtains the global semantic graph G_E constructed from the chemical-preparation corpus; it is then used as network input to train and predict the categories of the unclassified document nodes in the global graph.
The graph neural network is a differentiable information-propagation model. The graph neural network that performs document-node classification prediction on the global semantic graph is a variant of the basic graph convolutional network (GCN).
The GCN (graph convolutional network) is a generalized form of the conventional convolutional neural network (CNN) that operates directly on graphs. Formally, consider a graph G = (V, E), where V is the set of nodes, E is the set of edges, and X ∈ R^(n×m) is the matrix containing all n nodes and their m-dimensional features.
Let A be the adjacency matrix of G and D its degree matrix, where D_ii = ∑_j A_ij. One convolution layer of the GCN captures only information from direct neighbors; when multiple GCN layers are stacked, information over larger semantic ranges is integrated. For a single-layer GCN, the node feature matrix L^(1) ∈ R^(n×k) is computed as shown in formula (9).
L^(1) = ρ( Ã X W_0 )  (9)
where Ã = D^(-1/2) A D^(-1/2) is the normalized symmetric adjacency matrix, W_0 ∈ R^(m×k) is a weight matrix, and ρ is an activation function, e.g. the ReLU function.
The invention feeds the global semantic graph G_E into a two-layer GCN (graph convolutional network), in which the embedding size of the second layer equals the size of the label set; a softmax layer is then attached to classify the document nodes to be classified, computed as shown in formula (10).
Z = softmax( Ã ReLU( Ã X W_0 ) W_1 )  (10)

where Ã is the normalized adjacency matrix and W_1 is the weight matrix of the second layer.
and finally, predicting and outputting the graph neural network by using a softmax classifier to obtain a classification result of each document node to be classified.
Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand it, the scope of the present invention is not limited to those specific embodiments. Obvious variations and all inventions that make use of the concepts of the present invention are intended to be protected.

Claims (6)

1. A long text classification method in the chemical preparation field based on a graph neural network is characterized by comprising the following steps:
step 1: discovery of new words in chemical field
Step 2: initial multidimensional tensor graph construction
And step 3: graph message passing feature iteration
And 4, step 4: construction of global semantic graph by tensor graph fusion
And 5: the graph convolution neural network obtains node classification information.
2. The method for classifying long texts in the field of chemical preparation based on a graph neural network as claimed in claim 1, wherein the chemical-domain new-word discovery in step 1 specifically comprises:
step 1.1 raw corpus preprocessing
Since Chinese grammatical features are clearest at the sentence level, the large-scale chemical-domain corpus is first split into sentences at punctuation marks such as commas and periods, and special symbols are removed to reduce noise characters. All text fragments of at most 5 characters within each sentence are then extracted as new-word candidates.
Step 1.2 extraction of word features
Word frequency, word length, mutual information, and context information entropy are extracted as features according to a statistical model, and word vectors are added to enrich the word features.
The invention uses mutual information as a word feature. Mutual information measures the degree of mutual dependence between variables; word mutual information measures how strongly characters are correlated with each other. It is given by formula (1):
PMI(X, Y) = log2( p(X, Y) / ( p(X) p(Y) ) )  (1)
Where p (X, Y) is the joint probability distribution function of X and Y, and p (X) and p (Y) are the edge probability distribution functions of X and Y, respectively.
The invention uses the context information entropy to measure the uncertainty of the characters to the left and right of a character fragment. The larger the entropy, the higher the probability that the fragment forms a word on its own. It is given by formula (2):
H(X) = -∑_{x∈X} p(x) log2 p(x)  (2)
Where p (X) is the probability distribution of X.
Step 1.3 fusion of word multidimensional features
The multi-dimensional features are input into the RNN + CRF model to obtain a dictionary of the chemical-domain new words in the text.
Compared with an ordinary neural network, a recurrent neural network (RNN) can process sequential data. Applied to new-word discovery, it draws on context information to extract new words from the text. As shown in FIG. 1, the value of the hidden layer of the recurrent neural network depends not only on the current input text but also on the previous hidden-layer value s; the weight matrix W weights the previous hidden-layer value as part of the current input.
Conditional random fields (CRFs) are a special case of Markov random fields. In chemical-domain new-word discovery, the label of each character is strongly influenced by its neighboring labels. For the new-word discovery task, the CRF learns the corresponding labeling rules, so that the final label sequence fits the current character fragment while keeping the whole sentence most reasonable.
Step 1.4 New word dictionary participle
The chemical-domain new-word dictionary is then added to the segmentation tool to obtain segmentation results for chemical-domain text. The user-defined dictionary is loaded into the segmentation tool jieba, which gives it priority, improving the applicability of the segmentation results to the professional domain. The optimized method yields high-accuracy segmentation of chemical-domain text.
3. The method for classifying long texts in the field of chemical preparation based on a graph neural network as claimed in claim 2, wherein the initial multi-dimensional tensor graph construction in step 2 specifically comprises:
step 2.1 initialization node embedding
The nodes of the global graph are all document nodes and word nodes, including the training and test sets, in the corpus for the preparation domain of a given chemical product.
Word nodes are initialized with GloVe embeddings into the vector space; chemical-domain new-word nodes outside the dictionary are initialized to 0; document nodes are encoded ordinally. The dimension d is set to 300, so that each node obtains a 300-dimensional vector representation.
Step 2.2 Multi-dimensional tensor map construction
A global syntax tensor graph and a sequence tensor graph are constructed; the two graphs share the same nodes. In both graphs, the edges between document nodes and word (new-word) nodes are identical, and their weights are computed with tf-idf (term frequency-inverse document frequency), as shown in formula (3).
tfidf(i, j) = ( T_ij / C_i ) · log( CP / CP_j )  (3)
where i is a document, j is a word (new word), C_i is the total number of words in document i, T_ij is the number of occurrences of j in i, CP is the total number of documents in the corpus, and CP_j is the number of documents in the corpus that contain word j.
Next, an edge between the participle (new word) and the participle (new word) node is constructed:
G_Syn, the syntax tensor graph: for each document, the ltp parser first extracts the syntactic dependency relations between words, and every relation is treated as an undirected edge. The number of times each pair of words stands in a syntactic dependency across the entire corpus then defines the weight of the edge between the two word (syntax-graph) nodes, as shown in formula (4).
A_{j1,j2} = N_syntactic(w_j1, w_j2) / N_total(w_j1, w_j2)  (4)
where A_{j1,j2} is the weight of the edge between words j1 and j2, N_syntactic(w_j1, w_j2) is the number of times the two words stand in a syntactic dependency across all documents of the corpus, N_total(w_j1, w_j2) is the number of times the two words appear in the same document across the corpus, and num denotes the code of the syntactic dependency relation between the two words, as shown in FIG. 2.
G_PMI, the sequence tensor graph: for each document, the weights between word (new-word) nodes are computed in a sliding-window fashion, as shown in formula (5).
PMI(j1, j2) = log( p(j1, j2) / ( p(j1) p(j2) ) ), where p(j1, j2) = W(j1, j2) / W and p(j) = W(j) / W  (5)
where W(j1, j2) is the number of sliding windows containing both word j1 and word j2, W(j) is the number of sliding windows containing word j, and W is the total number of sliding windows. The sliding-window length is set to 25 herein.
4. The method for classifying long texts in the field of chemical preparation based on a graph neural network as claimed in claim 3, wherein the graph message-passing feature iteration in step 3 specifically comprises:
As shown in FIG. 3, a graph message-passing mechanism iterates through the message-passing layers of a GNN to obtain embedded features that fuse adjacent nodes and edges within each graph. This updates the initial vector embeddings of the syntax and sequence tensor graphs; in particular, for new-word nodes the OOV (out-of-vocabulary) features are represented by the features of their neighbors.
The invention constructs a 4-layer GNN to propagate and iterate features. Define the graph as G = (u, V, E), where V = {v_i} is the set of all nodes, E = {(e_k, r_k, s_k)} is the set of all edges (e_k the edge feature, r_k and s_k the indices of its endpoint nodes), and u is the global attribute of the graph. The three update functions for the nodes V, the edges E, and the global feature U are shown in formulas (6) to (8):

e'_k = φ^e( e_k, v_{r_k}, v_{s_k}, u )  (6)
v'_i = φ^v( v_i, E'_i, u )  (7)
u' = φ^u( u, E', V' )  (8)

where E'_i is the set of updated edges incident on node i, and E' and V' are the full sets of updated edges and nodes.
Formula (7) takes as a parameter the set of edges incident on a node and uses the information of all edges in that set to adjust the node's state. Formula (8) takes the sets of updated edges and nodes and uses their information to adjust the global state. A fully connected layer is then attached to obtain the updated syntax tensor graph G'_Syn and sequence tensor graph G'_PMI.
5. The method for classifying long texts in the field of chemical preparation based on a graph neural network as claimed in claim 4, wherein the construction of the global semantic graph by tensor-graph fusion in step 4 specifically comprises:
the invention adopts a method of 1 multiplied by 1 convolution kernel to fuse the grammatical tensor map and the sequential tensor map.
The 1×1 convolution kernel method is widely used for feature-dimension changes in image processing; a filter of size 1×1 performs the convolution operation. It first appeared in the Network In Network paper, where it was used to deepen and widen the network structure. Borrowing this idea of cross-channel dimension transformation with 1×1 convolution kernels on images, the invention sets several 1×1 filters so that the number of output channels, i.e. the dimensionality, can be reduced or increased at will.
Performing the 1×1-convolution dimensionality-reduction integration on the adjacency matrices of the syntax tensor graph G'_Syn and the sequence tensor graph G'_PMI yields the global semantic graph G_E constructed from the chemical-preparation corpus.
6. The method for classifying long texts in the field of chemical preparation based on a graph neural network as claimed in claim 5, wherein the acquisition of node classification information by the graph convolutional neural network in step 5 specifically comprises:
Through the above steps, the global semantic graph G_E constructed from the chemical-preparation corpus is obtained; it is then used as network input to train and predict the categories of the unclassified document nodes in the global graph.
The graph neural network is a differentiable information-propagation model. The graph neural network that performs document-node classification prediction on the global semantic graph is a variant of the basic graph convolutional network (GCN).
The GCN (graph convolutional network) is a generalized form of the conventional convolutional neural network (CNN) that operates directly on graphs. Formally, consider a graph G = (V, E), where V is the set of nodes, E is the set of edges, and X ∈ R^(n×m) is the matrix containing all n nodes and their m-dimensional features.
Let A be the adjacency matrix of G and D its degree matrix, where D_ii = ∑_j A_ij. One convolution layer of the GCN captures only information from direct neighbors; when multiple GCN layers are stacked, information over larger semantic ranges is integrated. For a single-layer GCN, the node feature matrix L^(1) ∈ R^(n×k) is computed as shown in formula (9).
L^(1) = ρ( Ã X W_0 )  (9)
where Ã = D^(-1/2) A D^(-1/2) is the normalized symmetric adjacency matrix, W_0 ∈ R^(m×k) is a weight matrix, and ρ is an activation function, e.g. the ReLU function.
The invention feeds the global semantic graph G_E into a two-layer GCN (graph convolutional network), in which the embedding size of the second layer equals the size of the label set; a softmax layer is then attached to classify the document nodes to be classified, computed as shown in formula (10).
Z = softmax( Ã ReLU( Ã X W_0 ) W_1 )  (10)

where Ã is the normalized adjacency matrix and W_1 is the weight matrix of the second layer.
and finally, predicting and outputting the graph neural network by using a softmax classifier to obtain a classification result of each document node to be classified.
CN202111567698.1A 2021-12-21 2021-12-21 Long text classification method in chemical preparation field based on graph neural network Pending CN114218389A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111567698.1A CN114218389A (en) 2021-12-21 2021-12-21 Long text classification method in chemical preparation field based on graph neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111567698.1A CN114218389A (en) 2021-12-21 2021-12-21 Long text classification method in chemical preparation field based on graph neural network

Publications (1)

Publication Number Publication Date
CN114218389A (en) 2022-03-22

Family

ID=80704562

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111567698.1A Pending CN114218389A (en) 2021-12-21 2021-12-21 Long text classification method in chemical preparation field based on graph neural network

Country Status (1)

Country Link
CN (1) CN114218389A (en)


Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114817538A (en) * 2022-04-26 2022-07-29 马上消费金融股份有限公司 Training method of text classification model, text classification method and related equipment
CN114817538B (en) * 2022-04-26 2023-08-08 马上消费金融股份有限公司 Training method of text classification model, text classification method and related equipment
CN114942816A (en) * 2022-06-10 2022-08-26 南京大学 Cross-application interface classification method based on text features and graph neural network
CN114942816B (en) * 2022-06-10 2023-09-05 南京大学 Cross-application interface classification method based on text features and graph neural network
CN115858792A (en) * 2023-02-20 2023-03-28 山东省计算中心(国家超级计算济南中心) Short text classification method and system for bidding project names based on graph neural network
CN115858792B (en) * 2023-02-20 2023-06-09 山东省计算中心(国家超级计算济南中心) Short text classification method and system for bidding project names based on graphic neural network
CN116932765A (en) * 2023-09-15 2023-10-24 中汽信息科技(天津)有限公司 Patent text multi-stage classification method and equipment based on graphic neural network
CN116932765B (en) * 2023-09-15 2023-12-08 中汽信息科技(天津)有限公司 Patent text multi-stage classification method and equipment based on graphic neural network
CN117273015A (en) * 2023-11-22 2023-12-22 湖南省水运建设投资集团有限公司 Electronic file archiving and classifying method for semantic analysis
CN117273015B (en) * 2023-11-22 2024-02-13 湖南省水运建设投资集团有限公司 Electronic file archiving and classifying method for semantic analysis
CN117421487A (en) * 2023-12-19 2024-01-19 西安康奈网络科技有限公司 Multiple network information screening management system based on artificial intelligence
CN117421487B (en) * 2023-12-19 2024-03-08 西安康奈网络科技有限公司 Multiple network information screening management system based on artificial intelligence

Similar Documents

Publication Publication Date Title
CN114218389A (en) Long text classification method in chemical preparation field based on graph neural network
CN108681557B (en) Short text topic discovery method and system based on self-expansion representation and similar bidirectional constraint
CN108415953A (en) A kind of non-performing asset based on natural language processing technique manages knowledge management method
CN112487190B (en) Method for extracting relationships between entities from text based on self-supervision and clustering technology
CN112507699A (en) Remote supervision relation extraction method based on graph convolution network
CN112306494A (en) Code classification and clustering method based on convolution and cyclic neural network
CN111767325B (en) Multi-source data deep fusion method based on deep learning
CN114896388A (en) Hierarchical multi-label text classification method based on mixed attention
CN115017299A (en) Unsupervised social media summarization method based on de-noised image self-encoder
CN111143553A (en) Method and system for identifying specific information of real-time text data stream
CN108846033B (en) Method and device for discovering specific domain vocabulary and training classifier
CN114510946B (en) Deep neural network-based Chinese named entity recognition method and system
CN115438709A (en) Code similarity detection method based on code attribute graph
Zhuo et al. Context attention heterogeneous network embedding
CN111382333B (en) Case element extraction method in news text sentence based on case correlation joint learning and graph convolution
CN111930892A (en) Scientific and technological text classification method based on improved mutual information function
CN116821371A (en) Method for generating scientific abstracts of multiple documents by combining and enhancing topic knowledge graphs
CN116629361A (en) Knowledge reasoning method based on ontology learning and attention mechanism
CN116992040A (en) Knowledge graph completion method and system based on conceptual diagram
CN115617981A (en) Information level abstract extraction method for short text of social network
CN113111288A (en) Web service classification method fusing unstructured and structured information
Li et al. An Efficient Minimal Text Segmentation Method for URL Domain Names
CN114722160B (en) Text data comparison method and device
Colombo et al. Discovering patterns within the drilling reports using artificial intelligence for operation monitoring
Yang et al. Construction and analysis of scientific and technological personnel relational graph for group recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination