CN116263783A - Text classification method, device, equipment and storage medium - Google Patents

Text classification method, device, equipment and storage medium Download PDF

Info

Publication number
CN116263783A
CN116263783A (Application No. CN202111506316.4A)
Authority
CN
China
Prior art keywords
text
node
nodes
word
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111506316.4A
Other languages
Chinese (zh)
Inventor
丁辰晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Suzhou Software Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202111506316.4A priority Critical patent/CN116263783A/en
Publication of CN116263783A publication Critical patent/CN116263783A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a text classification method, device, equipment and storage medium, wherein the method comprises the following steps: acquiring text data; determining feature vectors of document nodes, concept nodes and word nodes corresponding to the text data based on the text data; constructing a text heterogeneous graph based on the feature vectors of the document nodes, concept nodes and word nodes; determining the weights of the edges between nodes in the text heterogeneous graph; obtaining a text feature vector corresponding to the text data based on the text heterogeneous graph; and classifying the text feature vector by using a classification function to determine the text category. In this way, prior knowledge in the text is obtained by acquiring the feature vectors of the concept nodes; because concept nodes are fused into the text heterogeneous graph when it is constructed, the feature sparsity problem caused by short texts lacking context can be alleviated to a certain extent, so that the text feature vector extracted based on the text heterogeneous graph represents the features of the text more accurately, which further improves the accuracy of text classification.

Description

Text classification method, device, equipment and storage medium
Technical Field
The present invention relates to the field of data processing, and in particular, to a text classification method, apparatus, device, and storage medium.
Background
In the big data era, a large amount of short text appears on the network. Because these short texts are short, lack context information, and contain a large amount of colloquial noise, how to accurately extract text features and classify short texts with a suitable classification model is an important problem. When prior-art text classification methods are used for text classification, the semantic relations among short texts cannot be fully captured, which leads to problems such as low classification accuracy.
Disclosure of Invention
In order to solve the above technical problems, an embodiment of the present application is expected to provide a text classification method, apparatus, device and storage medium.
The technical scheme of the application is realized as follows:
in a first aspect, a text classification method is provided, the method comprising:
acquiring text data;
determining feature vectors of document nodes, feature vectors of concept nodes and feature vectors of word nodes corresponding to the text data based on the text data;
constructing a text heterogeneous graph based on the feature vectors of the document nodes, the feature vectors of the concept nodes and the feature vectors of the word nodes;
determining the weights of edges between nodes in the text heterogeneous graph;
obtaining a text feature vector corresponding to the text data based on the text heterogeneous graph;
and classifying the text feature vector by using a classification function to determine the text category.
In the above solution, the obtaining, based on the text heterogeneous graph, a text feature vector corresponding to the text data includes: determining at least one type attention weight of each node based on the text heterogeneous graph, where the type attention weight is a document-type attention weight, a concept-type attention weight or a word-type attention weight; determining an inter-node attention weight between each node and its neighboring nodes based on the at least one type attention weight, the feature vector of each node, and the feature vectors of the at least one type of neighboring nodes; and determining the text feature vector based on the inter-node attention weights between all nodes and their neighboring nodes and the feature vectors of all nodes.
In the above solution, the determining at least one type attention weight of each node based on the text heterogeneous graph includes: calculating the sum of the feature vectors of the i-th type neighboring nodes of the k-th node in the text heterogeneous graph to obtain the i-th type feature vector of the k-th node, where the i-th type is any one of the document type, the concept type or the word type; and determining the i-th type attention weight of the k-th node based on the feature vector of the k-th node, the feature vectors of the i-th type neighboring nodes and the i-th type feature vector.
In the above solution, the determining the text feature vector based on the inter-node attention weights between all nodes and their neighboring nodes and the feature vectors of all nodes includes: inputting the inter-node attention weights between all nodes and their neighboring nodes and the feature vectors of all nodes into a heterogeneous graph convolutional network to obtain the text feature vector corresponding to the text data, where the heterogeneous graph convolutional network is constructed based on a multi-head attention mechanism.
In the above solution, the determining, based on the text data, the feature vector of the document node, the feature vector of the concept node and the feature vector of the word node corresponding to the text data includes: calculating the term frequency-inverse document frequency (TF-IDF) vector of the text data as the feature vector of the document node; acquiring a concept set of the text data based on a concept graph; mapping the concepts in the concept set into feature vectors based on a word vector model to obtain the feature vectors of the concept nodes; and acquiring the one-hot vectors of the word nodes in the text data from a preset vocabulary as the feature vectors of the word nodes.
In the above solution, the determining the weights of the edges between the nodes in the text heterogeneous graph includes: acquiring, based on a concept graph, at least one concept node corresponding to a document node and a relevance value between the document node and the at least one concept node; determining the weights of the edges between the document node and the at least one concept node based on the relevance values; determining the weights of the edges between the document node and at least one word node based on the term frequency-inverse document frequency (TF-IDF) algorithm; and determining the weights of the edges between word nodes based on the pointwise mutual information between words.
In the above solution, the acquiring text data includes: acquiring original text data; and preprocessing the original text data to obtain the text data, where the preprocessing includes at least one of: noise information removal, word segmentation and stop-word removal.
In a second aspect, there is provided a text classification apparatus, the apparatus comprising:
the acquisition module is used for acquiring text data;
the processing module is used for determining the feature vector of the document node, the feature vector of the concept node and the feature vector of the word node corresponding to the text data based on the text data;
The processing module is further used for constructing a text iso-graph based on the feature vector of the document node, the feature vector of the concept node and the feature vector of the word node;
the processing module is further used for determining the weight of the edges between the nodes in the text iso-graph;
the processing module is further used for obtaining text feature vectors corresponding to the text data based on the text iso-graph;
the processing module is further used for classifying the text feature vectors by using a classification function to determine text categories.
In a third aspect, there is provided a text classification apparatus, the apparatus comprising: a processor and a memory configured to store a computer program capable of running on the processor, wherein the processor is configured to perform the steps of any of the preceding methods when the computer program is run.
In a fourth aspect, a computer storage medium is provided, on which a computer program is stored, wherein the computer program, when being executed by a processor, carries out the steps of the aforementioned method.
The application discloses a text classification method, device, equipment and storage medium. The method obtains prior knowledge in the text by acquiring the feature vectors of concept nodes; because concept nodes are fused into the text heterogeneous graph when it is constructed, the feature sparsity problem caused by short texts lacking context can be alleviated to a certain extent, so that the text feature vector extracted based on the text heterogeneous graph represents the features of the text more accurately, which further improves the accuracy of text classification.
Drawings
FIG. 1 is a schematic flow chart of a text classification method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a text heterogeneous graph set in an embodiment of the present application;
FIG. 3 is a schematic flow chart of determining text feature vectors according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a text heterogeneous graph aggregation calculation method according to an embodiment of the present application;
fig. 5 is a schematic diagram of a composition structure of a text classification device according to an embodiment of the present application;
fig. 6 is a schematic diagram of a composition structure of a text classification apparatus according to an embodiment of the present application.
Detailed Description
For a more complete understanding of the features and technical content of the embodiments of the present application, reference should be made to the following detailed description of the embodiments of the present application, taken in conjunction with the accompanying drawings, which are for purposes of illustration only and not intended to limit the embodiments of the present application.
Fig. 1 is a schematic flow chart of a text classification method according to an embodiment of the present application. As shown in fig. 1, the text classification method specifically may include:
step 101: acquiring text data;
here, the text data may be understood as text data corresponding to a text to be classified (also referred to as "document" in the embodiment of the present application), and if multiple texts need to be classified, the multiple text data may form a text data set, where the acquiring text data includes: text data of a text to be classified is obtained from a text data set.
Illustratively, in some embodiments, the acquiring text data includes: acquiring original text data; and preprocessing the original text data to obtain the text data, where the preprocessing includes at least one of: noise information removal, word segmentation and stop-word removal.
Here, the original text data may be understood as the original data corresponding to the text to be classified. The preprocessing simplifies the original text data while retaining the relevant terms in it, which improves the efficiency of text classification.
Step 102: determining feature vectors of document nodes, feature vectors of concept nodes and feature vectors of word nodes corresponding to the text data based on the text data;
illustratively, in some embodiments, the determining, based on the text data, the feature vector of the document node, the feature vector of the concept node and the feature vector of the word node corresponding to the text data includes: calculating the term frequency-inverse document frequency (TF-IDF) vector of the text data as the feature vector of the document node; acquiring a concept set of the text data based on a concept graph; mapping the concepts in the concept set into feature vectors based on a word vector model to obtain the feature vectors of the concept nodes; and acquiring the one-hot vectors of the word nodes in the text data from a preset vocabulary as the feature vectors of the word nodes.
For example, in practical application, the TF-IDF vector of a document to be classified in the text data is calculated and used as the feature vector of its document node.
For example, the acquiring the concept set of the text data based on the concept graph may be: calling an API of the concept graph, inputting the text data, and obtaining the concept set corresponding to the text data. Illustratively, inputting the text data "Eason Chan was born in Hong Kong" into the concept graph yields the concept set {top chinese entertainer, singer, place, asian city}.
For example, in practical application, the preset vocabulary includes all words in the corpus and the one-hot vectors corresponding to the words. The one-hot vector of a word node in the text data can be obtained directly from the preset vocabulary and used as the feature vector of the word node.
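The following Python sketch illustrates one way the three kinds of node feature vectors described above could be assembled. It is a minimal sketch, not the patent's reference implementation: the toy corpus, the concept list and the `pretrained_vectors` lookup table are illustrative assumptions standing in for the concept graph API and real pre-trained word vectors.

```python
# Minimal sketch: building document, word and concept node features.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["eason chan was born in hong kong",
          "the stock market fell sharply today"]

# 1) Document nodes: TF-IDF vector of each document.
vectorizer = TfidfVectorizer()
doc_features = vectorizer.fit_transform(corpus).toarray()   # shape: (num_docs, vocab_size)

# 2) Word nodes: one-hot vectors taken from a preset vocabulary.
vocab = vectorizer.get_feature_names_out()
word_features = np.eye(len(vocab))                           # row i is the one-hot of word i

# 3) Concept nodes: concepts returned by a concept graph, mapped through
#    pre-trained word vectors (here a toy lookup table, for illustration only).
pretrained_vectors = {"singer": np.random.rand(50), "place": np.random.rand(50)}
concepts = ["singer", "place"]
concept_features = np.stack([pretrained_vectors[c] for c in concepts])

print(doc_features.shape, word_features.shape, concept_features.shape)
```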
Step 103: constructing a text iso-graph based on the feature vector of the document node, the feature vector of the concept node and the feature vector of the word node;
here, by acquiring the feature vector of the concept node corresponding to the text, a priori knowledge in the text is obtained; by constructing the text iso-composition based on the feature vectors of the document nodes, the word nodes and the concept nodes, the feature sparseness problem caused by short text lack context can be relieved to a certain extent, so that the text feature vector extracted based on the text iso-composition can more accurately represent the features of the text, and further the accuracy of text classification is improved.
Here, when text of a document is included in the text data, all document nodes, concept nodes, and word nodes in the document are included in the text iso-graph, and the document is classified.
Illustratively, in some embodiments, the method further comprises: a set of text iso-composition containing a plurality of texts is constructed, which set can be used to categorize the plurality of texts.
Fig. 2 is a schematic diagram of a text iso-composition set according to an embodiment of the present application. As shown in fig. 2, the text iso-composition set includes at least three text-corresponding text iso-compositions. The text iso-composition of each text includes: document nodes, concept nodes, and word nodes.
Step 104: determining the weights of the edges between the nodes in the text heterogeneous graph;
illustratively, in some embodiments, the determining the weights of the edges between the nodes in the text heterogeneous graph includes: acquiring, based on a concept graph, at least one concept node corresponding to a document node and a relevance value between the document node and the at least one concept node; determining the weights of the edges between the document node and the at least one concept node based on the relevance values; determining the weights of the edges between the document node and at least one word node based on the term frequency-inverse document frequency (TF-IDF) algorithm; and determining the weights of the edges between word nodes based on the pointwise mutual information between words.
Illustratively, inputting the text data "Eason Chan was born in Hong Kong" into the concept graph yields a set of concepts and relevance values {<top chinese entertainer, 0.6>, <singer, 0.4>, <place, 0.2>, <asian city, 0.1>}. The relevance value between each concept and the text data is taken as the weight of the edge between the document node and the corresponding concept node. For example, the weight of the edge between the concept node "top chinese entertainer" and the document node corresponding to the text data "Eason Chan was born in Hong Kong" is 0.6.
Illustratively, in some embodiments, the determining the weights of the edges between the document node and at least one word node based on the term frequency-inverse document frequency (TF-IDF) algorithm comprises: calculating the TF-IDF value between a word and the document corresponding to the text data; and taking the TF-IDF value as the weight of the edge between the document node and the word node.
Specifically, the TF-IDF value between a word and the document corresponding to the text data is calculated as follows: the ratio of the number of times the word appears in the document to the total number of words in the document is calculated as the term frequency TF; the inverse document frequency IDF is calculated as

IDF = log(N / n)

where N is the total number of texts in the text data set and n is the number of documents in the text data set that contain the word; the product of TF and IDF is taken as the TF-IDF value.
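As a concrete illustration of this calculation, the short sketch below computes the TF-IDF weight of one word with respect to one document, following the description above; the toy corpus is an assumption made only for the example.

```python
import math

def tf_idf(word, document, corpus):
    """TF-IDF of `word` for `document`:
    TF  = count of word in document / total words in document,
    IDF = log(total number of texts / number of documents containing the word)."""
    words = document.split()
    tf = words.count(word) / len(words)
    n_containing = sum(1 for doc in corpus if word in doc.split())
    idf = math.log(len(corpus) / n_containing) if n_containing else 0.0
    return tf * idf

corpus = ["eason chan was born in hong kong",
          "hong kong is an asian city",
          "the singer released a new song"]
print(tf_idf("hong", corpus[0], corpus))   # weight of the edge between the document node and the word node "hong"
```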
Illustratively, determining the weights of the edges between word nodes based on the pointwise mutual information between words includes: acquiring the pointwise mutual information between the words in the text data from a word pointwise-mutual-information library; establishing an edge between two words whose pointwise mutual information is positive; and using the pointwise mutual information between the two words as the weight of the edge between the corresponding word nodes.
Here, the word pointwise-mutual-information library is determined based on a corpus and contains the pointwise mutual information between the words of all documents in the corpus. Illustratively, to utilize the global co-occurrence information of words, co-occurrence statistics are collected by sliding a fixed-size window over all documents in the corpus; pointwise mutual information (PMI) is used to measure the relevance between words and serves as the weight between two word nodes. The PMI is calculated as follows:

PMI(i, j) = log( p(i, j) / (p(i) · p(j)) )
p(i, j) = #W(i, j) / #W
p(i) = #W(i) / #W

where PMI(i, j) is the pointwise mutual information between words i and j, #W(i) is the number of sliding windows in the corpus that contain word i, #W(i, j) is the number of sliding windows that contain both words i and j, and #W is the total number of sliding windows in the corpus. When the PMI is positive, the semantic association between the words in the corpus is high; when the PMI is negative, the semantic association between the words in the corpus is low or absent.
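A minimal sketch of the sliding-window PMI statistics described above is shown below; the window size and the toy corpus are assumptions made for illustration, and, as in the text, only word pairs with positive PMI produce an edge.

```python
import math
from itertools import combinations
from collections import Counter

def pmi_weights(corpus, window_size=3):
    """Collect #W, #W(i) and #W(i, j) with a fixed-size sliding window,
    then return PMI(i, j) for every co-occurring word pair with positive PMI."""
    single, pair, total = Counter(), Counter(), 0
    for doc in corpus:
        tokens = doc.split()
        for start in range(max(1, len(tokens) - window_size + 1)):
            window = set(tokens[start:start + window_size])
            total += 1
            single.update(window)
            pair.update(frozenset(p) for p in combinations(sorted(window), 2))
    weights = {}
    for key, n_ij in pair.items():
        i, j = tuple(key)
        pmi = math.log((n_ij / total) / ((single[i] / total) * (single[j] / total)))
        if pmi > 0:                       # only positive PMI produces a word-word edge
            weights[(i, j)] = pmi
    return weights

corpus = ["eason chan was born in hong kong", "hong kong is an asian city"]
print(pmi_weights(corpus))
```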
Step 105: obtaining a text feature vector corresponding to the text data based on the text heterogeneous graph;
here, the text heterogeneous graph includes the document nodes, concept nodes and word nodes, together with the weights of the edges between the nodes. The text feature vector corresponding to the text data is a feature vector used to represent the text data, and the text is classified based on this text feature vector.
Illustratively, in some embodiments, the obtaining, based on the text heterogeneous graph, a text feature vector corresponding to the text data includes: calculating the attention weights between the nodes in the text heterogeneous graph; and inputting the attention weights between the nodes and the feature vectors of all nodes in the text heterogeneous graph into a heterogeneous graph convolutional neural network to obtain the text feature vector corresponding to the text data. Here, the graph convolutional neural network is used to extract the spatial features of the topological graph and can be obtained through training.
Step 106: and classifying the text feature vectors by using a classification function to determine text categories.
For example, in some embodiments, the classification function may be a softmax function.
Illustratively, in some embodiments, the text classification method further comprises: acquiring a training text data set; the text classification model is trained based on the training text dataset.
Illustratively, in some embodiments, the text classification method further comprises: constructing a training text heterogeneous graph containing all texts in the training text data set. The training text heterogeneous graph is used to train the text classification model, and the trained text classification model can classify the texts in a directly input text data set to obtain the text category corresponding to each text.
Illustratively, constructing the training text heterogeneous graph containing all texts in the training text data set includes: constructing a training text heterogeneous graph with the short-text documents D = {d_1, d_2, ..., d_m}, the words W = {w_1, w_2, ..., w_n} and the concepts C = {c_1, c_2, ..., c_k} as nodes, where m is the total number of documents in the corpus, n is the number of unique words in the corpus (the vocabulary size), and k is the total number of concepts of all documents in the training text data set. For document nodes, their term frequency-inverse document frequency (TF-IDF) vectors are used as their feature vectors; the one-hot vectors of the vocabulary are used as the feature vectors of the word nodes; and pre-trained word vectors are used to map the concept words into feature vectors.
Illustratively, in some embodiments, cross entropy is used as the loss function when the model is trained, and L2 regularization is used to prevent the model from overfitting. The loss function is:

Loss = - Σ_{d∈D} Σ_{c=1}^{C} y_dc · ln(Z_dc) + λ‖θ‖_2

where C is the number of classes, D is the training set, Z is the predicted category, y is the actual category, and λ‖θ‖_2 is the regularization term; the model is optimized by gradient descent.
Here, the execution subject of steps 101 to 106 may be a processor of the text classification apparatus.
The text data is text data corresponding to short texts. Because short texts are too short and lack context information, common text classification methods cannot obtain effective classification features from their sparse text features. According to the technical scheme of the present application, prior knowledge in the text is obtained by acquiring the feature vectors of the concept nodes corresponding to the text; by constructing the text heterogeneous graph based on the feature vectors of the document nodes, the word nodes and the concept nodes, the feature sparsity problem caused by short texts lacking context can be alleviated to a certain extent, so that the text feature vector extracted based on the text heterogeneous graph represents the features of the text more accurately, which further improves the accuracy of text classification.
On the basis of the above embodiment, the method for obtaining the text feature vector corresponding to the text data based on the text heterogeneous graph in step 105 is further illustrated. Fig. 3 is a schematic flow chart of determining a text feature vector according to an embodiment of the present application. As shown in fig. 3, the procedure for determining the text feature vector includes:
Step 301: determining at least one type attention weight of each node based on the text heterogeneous graph; the type attention weight is a document-type attention weight, a concept-type attention weight or a word-type attention weight;
illustratively, in some embodiments, the determining at least one type attention weight of each node based on the text heterogeneous graph includes: calculating the sum of the feature vectors of the i-th type neighboring nodes of the k-th node in the text heterogeneous graph to obtain the i-th type feature vector of the k-th node, where the i-th type is any one of the document type, the concept type or the word type; and determining the i-th type attention weight of the k-th node based on the feature vector of the k-th node, the feature vectors of the i-th type neighboring nodes and the i-th type feature vector.
Illustratively, the i-th type is the document type (type 1), the concept type (type 2) or the word type (type 3). The i-th type feature vector of the k-th node is accordingly the document-type feature vector, the concept-type feature vector or the word-type feature vector of the k-th node: the document-type feature vector of the k-th node is the sum of the feature vectors of its document-type neighboring nodes; the concept-type feature vector of the k-th node is the sum of the feature vectors of its concept-type neighboring nodes; and the word-type feature vector of the k-th node is the sum of the feature vectors of its word-type neighboring nodes.
Illustratively, in some embodiments, the i-th type attention weight of the k-th node is determined based on the feature vector of the k-th node, the feature vectors of the i-th type neighboring nodes and the i-th type feature vector as follows:

α_i = softmax( LeakyReLU( μ_i^T · [h_k || h_i] ) )

where α_i is the i-th type attention weight of the k-th node, μ_i is the i-th type attention vector, h_k is the feature vector of the k-th node, h_i is the i-th type feature vector of the k-th node, || denotes the concatenation operation, LeakyReLU(·) is the LeakyReLU activation function, and softmax(·) denotes processing with the softmax function.
Here μ_i, the i-th type attention vector of the current node, is obtained from the learnable network parameters V^T, W and U together with h_k, the feature vector of the k-th node, and h_k', the feature vectors of the i-th type neighboring nodes of the k-th node.
Illustratively, in some embodiments, the method further comprises: the type attention value is regularized using a softmax function.
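A small numpy sketch of the type-level attention just described is given below, under the assumption that the type attention vector μ is a directly learnable parameter (the patent derives it from the further parameters V, W and U, whose exact combination is not reproduced here); all shapes and the toy inputs are illustrative.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def type_attention(h_k, neighbors_by_type, mu_by_type):
    """h_k: feature vector of node k.
    neighbors_by_type: {type: list of neighbor feature vectors}.
    mu_by_type: {type: learnable type attention vector} (assumed parameterisation).
    Returns one softmax-regularized attention weight per node type."""
    scores = {}
    for t, neigh in neighbors_by_type.items():
        h_t = np.sum(neigh, axis=0)                  # i-th type feature vector: sum of type-t neighbors
        scores[t] = leaky_relu(mu_by_type[t] @ np.concatenate([h_k, h_t]))
    types = list(scores)
    weights = softmax(np.array([scores[t] for t in types]))
    return dict(zip(types, weights))

dim = 4
h_k = np.random.rand(dim)
neighbors = {"doc": [np.random.rand(dim)],
             "word": [np.random.rand(dim), np.random.rand(dim)]}
mu = {t: np.random.rand(2 * dim) for t in neighbors}
print(type_attention(h_k, neighbors, mu))
```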
Step 302: determining an inter-node attention weight between each node and its neighboring nodes based on the at least one type attention weight, the feature vector of each node, and the feature vectors of the at least one type of neighboring nodes;
illustratively, in some embodiments, the inter-node attention weight β_kk' between a node k (also referred to herein as the k-th node) and its neighboring node k' can be calculated by the following formula, where the neighboring node k' is a node of the i'-th type:

β_kk' = softmax( LeakyReLU( V^T · α_i' [h_k || h_k'] ) )

where V is the attention vector of node k, a learnable network parameter shared by all neighboring nodes of node k; α_i' is the i'-th type attention weight of node k; h_k is the feature vector of the k-th node; h_k' is the feature vector of the neighboring node k'; || denotes the concatenation operation, LeakyReLU(·) is the LeakyReLU activation function, and softmax(·) denotes processing with the softmax function.
Illustratively, in some embodiments, the method further comprises: the inter-node attention value is regularized using a softmax function.
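Continuing the sketch, the inter-node attention between node k and each of its neighbors k' can be computed as below; V and the per-type weights α come from the previous step, and all parameter shapes are illustrative assumptions rather than the patent's exact parameterisation.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def node_attention(h_k, neighbors, neighbor_types, type_weights, V):
    """Attention weight beta_kk' for each neighbor k' of node k.
    neighbors: list of neighbor feature vectors; neighbor_types: their node types;
    type_weights: {type: alpha} from the type-level attention; V: learnable attention vector."""
    scores = []
    for h_kp, t in zip(neighbors, neighbor_types):
        scores.append(leaky_relu(V @ (type_weights[t] * np.concatenate([h_k, h_kp]))))
    return softmax(np.array(scores))     # regularized over all neighbors of node k

dim = 4
h_k = np.random.rand(dim)
neighbors = [np.random.rand(dim), np.random.rand(dim), np.random.rand(dim)]
types = ["word", "word", "concept"]
alpha = {"word": 0.7, "concept": 0.3}
V = np.random.rand(2 * dim)
print(node_attention(h_k, neighbors, types, alpha, V))
```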
Step 303: the text feature vector is determined based on the inter-node attention weights between all nodes and neighboring nodes, and the feature vectors of all nodes.
Illustratively, in some embodiments, the determining the text feature vector based on the inter-node attention weights between all nodes and their neighboring nodes and the feature vectors of all nodes includes: inputting the inter-node attention weights between all nodes and their neighboring nodes and the feature vectors of all nodes into a heterogeneous graph convolutional network to obtain the text feature vector corresponding to the text data.
Illustratively, in some embodiments, the propagation rule of the heterogeneous graph convolutional network includes:

H^(l+1) = σ( Σ_{i∈I} β_i · H_i^(l) · W_i^(l) )

where H^(l+1) is the text feature vector of layer L+1; β_i is the attention weight between the i-th type nodes and their neighboring nodes; H_i^(l) is the layer-L feature vector of the i-th type nodes; W_i^(l) is a trainable transformation matrix; σ(·) is the ReLU activation function; and I is the set of node types.
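Putting this propagation rule into code, the sketch below implements one single-head heterogeneous graph attention convolution layer with dense numpy matrices; the attention matrices B are assumed to have been assembled from the β weights above, and all shapes are illustrative (the multi-head variant is discussed next).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def hetero_gat_layer(H_by_type, B_by_type, W_by_type):
    """One propagation step H^(l+1) = ReLU( sum_i B_i @ H_i @ W_i ).
    H_by_type: {type: (num_type_nodes, in_dim_type) feature matrix},
    B_by_type: {type: (num_nodes, num_type_nodes) attention weight matrix},
    W_by_type: {type: (in_dim_type, out_dim) trainable transformation matrix}."""
    out = None
    for t, H_t in H_by_type.items():
        msg = B_by_type[t] @ H_t @ W_by_type[t]
        out = msg if out is None else out + msg
    return relu(out)

num_nodes, out_dim = 5, 8
H = {"doc": np.random.rand(2, 6), "word": np.random.rand(3, 4)}
B = {t: np.random.rand(num_nodes, H[t].shape[0]) for t in H}
W = {t: np.random.rand(H[t].shape[1], out_dim) for t in H}
print(hetero_gat_layer(H, B, W).shape)    # (5, 8)
```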
Illustratively, in some embodiments, the heterogeneous graph convolutional network is a multi-head heterogeneous graph convolutional network constructed based on a multi-head attention mechanism.
By calculating the attention weights of the different types, combining the type attention weights to calculate the inter-node attention weights between nodes, and then adding a multi-head heterogeneous graph attention mechanism to calculate the importance of the different types of neighboring nodes, text feature extraction based on the text heterogeneous graph achieves better performance and stronger robustness.
In order to further illustrate the purpose of the application and its embodiments, a text classification method based on a heterogeneous graph attention network is provided, which can be applied to a text classification model. The method specifically includes the following steps:
Step 401: acquiring an original text data set and preprocessing the original text data set to obtain the text data set;
wherein the text data set comprises at least one text data.
Specifically, the original short text data set is preprocessed: word spelling errors are corrected with a script, unnecessary labels are cleaned, punctuation marks and special symbols are removed, and high-frequency words that do not affect the semantics are deleted from the original text data with the nltk toolkit, i.e., stop-word removal is performed, so as to obtain the text data set.
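A rough sketch of this preprocessing step is shown below; it uses the nltk stop-word list mentioned above, while the regular expressions and the toy spelling-correction table are assumptions of the example, not the patent's actual script.

```python
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))
SPELL_FIXES = {"borne": "born"}           # toy correction table, for illustration only

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)              # clean unnecessary labels/tags
    text = re.sub(r"[^a-zA-Z\s]", " ", text).lower()  # remove punctuation and special symbols
    tokens = [SPELL_FIXES.get(tok, tok) for tok in text.split()]
    return [tok for tok in tokens if tok not in STOPWORDS]

print(preprocess("Eason Chan was <b>borne</b> in Hong Kong!"))
```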
Illustratively, in some embodiments, the method further comprises: a training text dataset is obtained. Wherein the training text data set contains text data of a plurality of training texts.
Step 402: using the term frequency-inverse document frequency (TF-IDF) vector of the text as its feature vector, i.e., as the feature vector of the document node;
step 403: acquiring the concept set of the short text by using the Microsoft Concept Graph, and mapping the concepts into feature vectors by using pre-trained word vectors to serve as the feature vectors of the concept nodes;
illustratively, the short text data is input into the concept graph by calling an API of the concept graph to obtain a short text concept set. The method comprises the following specific steps:
Step 411: input the original text data set t, the number of expansion words Topk, the expansion algorithm EX, the concept graph data MSCG and the selection mode SelectMode;
Step 412: Words = SplitData(t); // segment each text in the text data set to obtain the initial feature set Words
Step 413: Words = ReduceStopWords(Words); // remove the stop words from the feature words
Step 414: Select = GetSelectWordSet(MSCG, SelectMode); // obtain the corresponding word set Select in the concept graph according to the representation mode
Step 415: Sel_Words = Words ∩ Select; // select the feature word set Sel_Words to be expanded according to Select
Step 416: Words_dic = AccessAPI(Sel_Words, EX, Topk); // for each feature word in Sel_Words, call the corresponding interface of the conceptualized expansion algorithm; the interface returns the Topk concept words related to the feature word, forming the expansion dictionary Words_dic
Step 417: d* = GetExpansion(Words_dic); // expand the original feature words according to Words_dic to obtain the conceptually expanded semantic representation d*
Step 418: return d*.
In addition to the concept set of the short text, the relevance value between each concept in the concept set and the short text can also be obtained from the concept graph. By calling the API of the concept graph and inputting the short text, the concept set C = {<c_1, w_1>, <c_2, w_2>, ..., <c_k, w_k>} can be obtained, where c_i represents a concept in the concept set and w_i represents the relevance between the concept c_i and the short text. For example, inputting the short text {Eason Chan was born in Hong Kong} yields the concept set and relevance values {<top chinese entertainer, 0.6>, <singer, 0.4>, <place, 0.2>, <asian city, 0.1>}.
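Since the exact interface of the concept-graph API is not reproduced here, the sketch below only mimics its behaviour with a local lookup table; `concept_lookup` and the Topk truncation are assumptions used to illustrate steps 411 to 418, not the real Microsoft Concept Graph API.

```python
# Hypothetical stand-in for the concept-graph API: maps a feature word to
# (concept, relevance) pairs, as the Microsoft Concept Graph would.
concept_lookup = {
    "eason": [("top chinese entertainer", 0.6), ("singer", 0.4)],
    "kong":  [("place", 0.2), ("asian city", 0.1)],
}

def conceptualize(text, topk=2):
    words = [w.lower() for w in text.split()]                      # step 412: segmentation
    sel_words = [w for w in words if w in concept_lookup]          # steps 414-415: words present in the graph
    words_dic = {w: concept_lookup[w][:topk] for w in sel_words}   # step 416: Topk concepts per word
    expansion = [pair for pairs in words_dic.values() for pair in pairs]  # step 417: expanded representation d*
    return expansion

print(conceptualize("Eason Chan was born in Hong Kong"))
# e.g. [('top chinese entertainer', 0.6), ('singer', 0.4), ('place', 0.2), ('asian city', 0.1)]
```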
Step 404: using One-Hot vector of the vocabulary as a feature vector of the word node;
step 405: constructing a text heterogeneous graph;
specifically, a text heterogeneous graph corresponding to each text in the text data set is constructed, and the text heterogeneous graphs of all texts in the text data set are integrated to obtain a text heterogeneous graph set. A text heterogeneous graph is represented as G = (V, E), where V is the set of nodes in the graph and E is the set of edges between nodes.
Step 406: determining the weights of the edges between the nodes in the text heterogeneous graph;
specifically, an edge is established between a document and its related concepts by using the relevance values between the concepts and the short text, which are acquired from the concept graph together with the concepts, and the relevance value is taken as the weight of the edge between the concept node and the document node;
edges between document nodes and word nodes are built based on word occurrence in the document, and the edge weights are calculated with the TF-IDF algorithm, where the term frequency is the number of times the word occurs in the document and the inverse document frequency is the logarithm of the ratio of the total number of documents to the number of documents containing the word;
To take advantage of the global co-occurrence information of words, co-occurrence statistics are collected by sliding a fixed-size window over all documents in the corpus. The system uses pointwise mutual information (PMI) to measure the relevance between words and to calculate the weight between two word nodes; the PMI is calculated as follows:

PMI(i, j) = log( p(i, j) / (p(i) · p(j)) )
p(i, j) = #W(i, j) / #W
p(i) = #W(i) / #W

where PMI(i, j) is the pointwise mutual information between words i and j, #W(i) is the number of sliding windows in the corpus that contain word i, #W(i, j) is the number of sliding windows that contain both words i and j, and #W is the total number of sliding windows in the corpus. When the PMI is positive, the semantic association between the words in the corpus is high; when the PMI is negative, the semantic association between the words in the corpus is low or absent. Edges are added between word pairs whose PMI is positive.
Step 407: in the constructed text iso-graph, aggregation calculation is carried out on each node by using a multi-head iso-graph attention mechanism, and finally text feature vectors are aggregated.
The spatial features of a topological graph can be extracted by a graph convolutional neural network (GCN), a multi-layer neural network that calculates an embedding vector for each node by aggregating the features of its neighboring nodes. For the text heterogeneous graph G = (V, E), where V and E are the sets of nodes and edges in the graph, the node feature matrix X ∈ R^(|V|×q) contains the feature vectors x_v ∈ R^q of all nodes. With the self-connected adjacency matrix A' = A + I and its degree matrix D', the propagation rule of the standard graph convolutional neural network is:

H^(l+1) = σ( D'^(-1/2) · A' · D'^(-1/2) · H^(l) · W^(l) )

where H^(l) ∈ R^(N×D) is the hidden feature of layer L, H^(0) = X, W^(l) is a trainable transformation matrix, D'^(-1/2) · A' · D'^(-1/2) is the normalized symmetric adjacency matrix, and σ(·) is the ReLU activation function. The formula expresses that the text feature vector of layer L+1 is calculated by aggregating the text feature vectors of layer L.
However, the standard GCN cannot be directly applied to the text heterogeneous graph in the present application, because the text heterogeneous graph contains three different types of nodes, namely the document type, the concept type and the word type, and different types of nodes have different feature spaces. A common solution is to concatenate the feature vectors of the different types of nodes to obtain a new, larger feature space and to fill the irrelevant dimensions of each node with 0 values, but this method ignores some feature information and affects the performance of the model.
In order to solve this problem, the method uses heterogeneous graph convolution, which takes the differences between different types of information into account and uses a separate transformation matrix for each node type to project the nodes into a common implicit space. The basic propagation rule of standard heterogeneous graph convolution is as follows:

H^(l+1) = σ( Σ_{t∈T} Ã_t · H_t^(l) · W_t^(l) )

where Ã_t is the submatrix of the normalized adjacency matrix Ã whose rows correspond to all nodes V and whose columns correspond to the |V_t| neighboring nodes of node type t. The expression transforms the feature vectors of the different types of neighboring nodes of a node with the type-specific transformation matrices W_t^(l) and aggregates the different types of neighboring nodes to obtain H^(l+1). The features of the nodes thus take into account the differences between the different feature spaces and are projected into the common implicit space, with the initial H_t^(0) being the corresponding node feature matrix X_t.
Specifically, for a given node in the heterogeneous graph, its neighboring nodes of different types have different effects on it. Neighboring nodes of the same type can contribute more useful information, while nodes of different types also carry some effective information; in order to capture the effective information between nodes and between different types, the method uses a multi-head heterogeneous graph attention mechanism.
Given a node v (corresponding to the k-th node in this application), type attention learns the weights of the different types of its neighboring nodes. The feature vector of type t (where type t is any one of the document type, the concept type and the word type, and corresponds to the i-th type in this application) is expressed as:

h_t = Σ_{v'} Ã_{vv'} · h_v'

where Ã_{vv'} is the entry of the adjacency matrix for node v and its neighboring node v', and v' denotes a t-type neighboring node of the current node v; the expression sums the feature vectors h_v' of the t-type neighboring nodes to obtain the t-type feature vector h_t. Then, using the type feature vectors of all types together with the feature vector h_v of the current node, the t-type attention weight α_t is calculated by the following formula:

α_t = softmax( LeakyReLU( μ_t^T · [h_v || h_t] ) )

where α_t is the t-type attention weight of node v and μ_t is the t-type attention vector; || denotes the concatenation operation, LeakyReLU(·) is the LeakyReLU activation function, and softmax(·) denotes processing with the softmax function.
Here μ_t, the t-type attention vector of the current node, is derived from the learnable network parameters V^T, W and U.
The text classification method further comprises: regularizing the attention values of all types with a softmax function to obtain the type attention weights.
To capture the information of important neighboring nodes while reducing the weight of noise nodes, the model uses node-level attention over the neighboring nodes. Given a node v of type t and its neighboring node v' ∈ N_v of type t', the attention weight between the nodes is calculated by the following formula:

β_vv' = softmax( LeakyReLU( V^T · α_t' [h_v || h_v'] ) )

where V is the attention vector, α_t' is the t'-type attention weight, h_v is the feature vector of the current node, and h_v' is the feature vector of the neighboring node; the node attention values are then regularized, again using the softmax function.
Integrating the obtained inter-node attention weights into the heterogeneous graph convolution yields a new heterogeneous graph attention network, whose propagation rule is as follows:

H^(l+1) = σ( Σ_{t∈T} β_t · H_t^(l) · W_t^(l) )

where H^(l+1) is the text feature vector of layer L+1, β_t is the attention weight calculated between the current nodes of type t and their neighboring nodes, H_t^(l) is the layer-L feature vector of the current nodes of type t, W_t^(l) is a trainable transformation matrix, σ(·) is the ReLU activation function, and T is the set of node types.
Fig. 4 is a schematic diagram illustrating the text heterogeneous graph aggregation calculation method according to an embodiment of the present application. As shown in Fig. 4, h_di^(l) represents the layer-L feature vector of document d_i, c_1, c_2 and c_3 represent the concept nodes adjacent to document d_i, and w_1, w_2, w_3 and w_4 represent the word nodes adjacent to document d_i; convolution based on the inter-node attention and the node feature vectors yields the layer-(L+1) feature vector h_di^(l+1) of document d_i.
To make the model more stable, the present system incorporates a multi-head attention mechanism. K groups of attention mechanisms are calculated independently, each group calculating its own attention weights, and the calculation results are concatenated to obtain the features of all heads, as shown in the following formula:

H^(l+1) = ||_{k=1..K} σ( Σ_{t∈T} β_t^k · H_t^(l) · W_t^k )

where H^(l+1) is the text feature vector of layer L+1, || is the concatenation operation, β_t^k is the k-th inter-node attention weight between the t-type nodes and the other nodes, H_t^(l) is the layer-L feature vector of the current nodes of type t, and W_t^k is the k-th group of trainable transformation matrices.
For the last layer of the network, concatenation is no longer used; instead, averaging is used to obtain the final node features:

H^(l+1) = σ( (1/K) · Σ_{k=1..K} Σ_{t∈T} β_t^k · H_t^(l) · W_t^k )
calculating a neighboring concept node c i Word node w i With its attention weight, aggregate computation is performed in combination with the attention weight, while using a multi-headed attention mechanism, different arrow patterns represent independent attention computation by concatenationEach head is then or averaged to obtain the node characteristics of the new layer. The multi-head heterographing attention mechanism calculates the importance of different types of adjacent nodes, so that the model has better performance and stronger robustness.
Step 408: text feature vectors are classified using a softmax function.
Specifically, through the calculation of L layers of the multi-head heterogeneous graph attention network (MHGAT), the text feature vectors corresponding to all texts in the text heterogeneous graph can be obtained, and the text feature vectors are then classified with a softmax function to obtain the text category corresponding to each text.
Illustratively, in some embodiments, the text classification method further comprises: acquiring a training text data set; the text classification model is trained based on the training text dataset.
Illustratively, in some embodiments, the text classification method further comprises: constructing a training text heterogeneous graph containing all texts in the training text data set. The training text heterogeneous graph is used to train the text classification model, and the trained text classification model can classify the texts in a directly input text data set to obtain the text category corresponding to each text.
Illustratively, constructing the training text heterogeneous graph containing all texts in the training text data set includes: constructing a training text heterogeneous graph with the short-text documents D = {d_1, d_2, ..., d_m}, the words W = {w_1, w_2, ..., w_n} and the concepts C = {c_1, c_2, ..., c_k} as nodes, where m is the total number of documents in the corpus, n is the number of unique words in the corpus (the vocabulary size), and k is the total number of concepts of all documents in the training text data set. For document nodes, their term frequency-inverse document frequency (TF-IDF) vectors are used as their feature vectors; the one-hot vectors of the vocabulary are used as the feature vectors of the word nodes; and pre-trained word vectors are used to map the concept words into feature vectors.
Illustratively, in some embodiments, cross entropy is used as the loss function when the model is trained, and L2 regularization is used to prevent the model from overfitting. The loss function is:

Loss = - Σ_{d∈D} Σ_{c=1}^{C} y_dc · ln(Z_dc) + λ‖θ‖_2

where C is the number of classes, D is the training set, Z is the predicted category, y is the actual category, and λ‖θ‖_2 is the regularization term; the model is optimized by gradient descent.
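A compressed sketch of this training objective (softmax classification, cross-entropy loss plus an L2 term, plain gradient descent) is shown below; the tiny synthetic data and the single linear layer standing in for the graph network are assumptions of the example, not the patent's actual model.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.random((20, 16))                 # stand-in for the text feature vectors H^(L)
y = rng.integers(0, 3, size=20)          # actual categories
Y = np.eye(3)[y]                         # one-hot labels
W = rng.random((16, 3)) * 0.01           # classifier weights (theta)
lr, lam = 0.1, 1e-3

for step in range(200):
    Z = softmax(X @ W)                                                # predicted category distribution
    loss = -np.sum(Y * np.log(Z + 1e-12)) + lam * np.sum(W ** 2)      # cross entropy + L2 term
    grad = X.T @ (Z - Y) + 2 * lam * W                                # gradient of the loss
    W -= lr * grad                                                    # gradient descent update

print(round(float(loss), 4), (Z.argmax(axis=1) == y).mean())          # final loss and training accuracy
```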
According to the technical scheme of the present application, when the text heterogeneous graph is constructed, not only the feature vectors of the document nodes and the word nodes are used, but the feature vectors of the concept nodes are also acquired, so that prior knowledge in the text is obtained; this alleviates, to a certain extent, the feature sparsity problem caused by short texts lacking context, so that the text feature vector extracted based on the text heterogeneous graph represents the features of the text more accurately, which further improves the accuracy of text classification. In addition, the type attention weights are obtained by calculating the attention weights between nodes of different types, the dual inter-node attention weights are calculated in combination with the type attention weights, and a multi-head heterogeneous graph attention mechanism is added to calculate the importance of the different types of neighboring nodes, so that the aggregation calculation on the text heterogeneous graph has better performance and stronger robustness; more accurate text feature vectors are thereby obtained, and the accuracy of the text classification result is improved.
Fig. 5 is a schematic diagram of a composition structure of a text classification device in an embodiment of the present application, which shows a device 50 for implementing a text classification method, where the device 50 specifically includes:
an obtaining module 501, configured to obtain text data;
a processing module 502, configured to determine, based on the text data, a feature vector of a document node, a feature vector of a concept node, and a feature vector of a word node corresponding to the text data;
the processing module 502 is further configured to construct a text iso-graph based on the feature vector of the document node, the feature vector of the concept node, and the feature vector of the word node;
the processing module 502 is further configured to determine weights of edges between nodes in the text iso-graph;
the processing module 502 is further configured to obtain a text feature vector corresponding to the text data based on the text iso-graph;
the processing module 502 is further configured to classify the text feature vector by using a classification function, and determine a text category.
In some embodiments, the processing module 502 is configured to determine at least one type attention weight of each node based on the text heterogeneous graph, where the type attention weight is a document-type attention weight, a concept-type attention weight or a word-type attention weight; determine an inter-node attention weight between each node and its neighboring nodes based on the at least one type attention weight, the feature vector of each node, and the feature vectors of the at least one type of neighboring nodes; and determine the text feature vector based on the inter-node attention weights between all nodes and their neighboring nodes and the feature vectors of all nodes.
In some embodiments, the processing module 502 is configured to calculate the sum of the feature vectors of the i-th type neighboring nodes of the k-th node in the text heterogeneous graph to obtain the i-th type feature vector of the k-th node, where the i-th type is any one of the document type, the concept type or the word type; and determine the i-th type attention weight of the k-th node based on the feature vector of the k-th node, the feature vectors of the i-th type neighboring nodes and the i-th type feature vector.
In some embodiments, the processing module 502 is configured to input the inter-node attention weights between all nodes and their neighboring nodes and the feature vectors of all nodes into the heterogeneous graph convolutional network to obtain the text feature vector corresponding to the text data, where the heterogeneous graph convolutional network is constructed based on a multi-head attention mechanism.
In some embodiments, the processing module 502 is configured to calculate the term frequency-inverse document frequency (TF-IDF) vector of the text data as the feature vector of the document node; acquire a concept set of the text data based on a concept graph; map the concepts in the concept set into feature vectors based on a word vector model to obtain the feature vectors of the concept nodes; and acquire the one-hot vectors of the word nodes in the text data from a preset vocabulary as the feature vectors of the word nodes.
In some embodiments, the processing module 502 is configured to acquire, based on a concept graph, at least one concept node corresponding to a document node and a relevance value between the document node and the at least one concept node; determine the weights of the edges between the document node and the at least one concept node based on the relevance values; determine the weights of the edges between the document node and at least one word node based on the term frequency-inverse document frequency (TF-IDF) algorithm; and determine the weights of the edges between word nodes based on the pointwise mutual information between words.
In some embodiments, the obtaining module 501 is configured to acquire original text data and preprocess the original text data to obtain the text data, where the preprocessing includes at least one of: noise information removal, word segmentation and stop-word removal.
Based on the hardware implementation of each unit in the text classification device, another text classification device is further provided in the embodiment of the present application, and fig. 6 is a schematic diagram of the composition structure of the text classification device in the embodiment of the present application. As shown in fig. 6, the apparatus 60 includes: a processor 601 and a memory 602 configured to store a computer program capable of running on the processor;
Wherein the processor 601 is arranged to execute the method steps of the previous embodiments when running a computer program.
Of course, in actual practice, the various components of the text classification apparatus are coupled together via a bus system 603 as shown in fig. 6. It is understood that the bus system 603 is used to enable connected communications between these components. The bus system 603 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration the various buses are labeled as bus system 603 in fig. 6.
In practical applications, the processor may be at least one of an application specific integrated circuit (ASIC, application Specific Integrated Circuit), a digital signal processing device (DSPD, digital Signal Processing Device), a programmable logic device (PLD, programmable Logic Device), a Field-programmable gate array (Field-Programmable Gate Array, FPGA), a controller, a microcontroller, and a microprocessor. It will be appreciated that the electronic device for implementing the above-mentioned processor function may be other for different apparatuses, and embodiments of the present application are not specifically limited.
The Memory may be a volatile Memory (RAM) such as Random-Access Memory; or a nonvolatile Memory (non-volatile Memory), such as a Read-Only Memory (ROM), a flash Memory (flash Memory), a Hard Disk (HDD) or a Solid State Drive (SSD); or a combination of the above types of memories and provide instructions and data to the processor.
In an exemplary embodiment, the present application also provides a computer readable storage medium, e.g. a memory comprising a computer program executable by a processor of a text classification apparatus to perform the steps of the aforementioned method.
It is to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items. The expressions "having," "including," and "containing," or "including" and "comprising" are used herein to indicate the presence of corresponding features (e.g., elements such as values, functions, operations, or components), but do not exclude the presence of additional features.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another and do not necessarily describe a particular order or sequence. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application.
The technical solutions described in the embodiments of the present application may be arbitrarily combined without any conflict.
In the several embodiments provided in the present application, it should be understood that the disclosed methods, apparatuses, and devices may be implemented in other manners. The above-described embodiments are merely illustrative; for example, the division of units is merely a logical function division, and other divisions may be used in practice, such as: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communicative connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communicative connection between devices or units may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may all be integrated in one processing unit, or each unit may serve as a separate unit, or two or more units may be integrated in one unit; the integrated units may be implemented in the form of hardware or in the form of hardware plus software functional units.
The foregoing is merely specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can readily conceive of changes or substitutions within the technical scope disclosed in the present application, and such changes and substitutions shall fall within the protection scope of the present application.

Claims (10)

1. A method of text classification, the method comprising:
acquiring text data;
determining feature vectors of document nodes, feature vectors of concept nodes and feature vectors of word nodes corresponding to the text data based on the text data;
constructing a text iso-graph based on the feature vector of the document node, the feature vector of the concept node and the feature vector of the word node;
determining the weight of edges between nodes in the text iso-graph;
obtaining a text feature vector corresponding to the text data based on the text iso-graph;
and classifying the text feature vector by using a classification function to determine a text category.
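For illustration only (not the claimed implementation), the sketch below applies a linear projection and a softmax as the classification function of claim 1 to an already computed text feature vector; the dimensions, weights, and number of categories are arbitrary.

import numpy as np

def classify(text_feature, W, b):
    logits = W @ text_feature + b
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.argmax(probs)), probs

rng = np.random.default_rng(0)
text_feature = rng.normal(size=64)              # text feature vector from the graph model
W, b = rng.normal(size=(4, 64)), np.zeros(4)    # 4 hypothetical text categories
label, probs = classify(text_feature, W, b)
print(label, probs)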
2. The method according to claim 1, wherein the obtaining, based on the text iso-graph, a text feature vector corresponding to the text data includes:
determining at least one type of attention weight for each node based on the text iso-graph; wherein the type attention weight is a document type attention weight, a concept type attention weight, or a word type attention weight;
determining an inter-node attention weight between each node and a neighboring node based on the at least one type of attention weight, the feature vector of each node, and the feature vector of the at least one type of neighboring node;
and determining the text feature vector based on the inter-node attention weights between all nodes and their neighboring nodes, and the feature vectors of all nodes.
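One possible reading of claim 2, sketched for illustration only: the attention score of each neighboring node is scaled by the attention weight of that neighbor's type and normalized over all neighbors; the scoring vector and the concatenation form are assumptions.

import numpy as np

def inter_node_attention(h_k, neighbours, type_weights, a):
    # h_k: feature of node k; neighbours: list of (type, feature); a: scoring vector
    scores = []
    for node_type, h_j in neighbours:
        scores.append(type_weights[node_type] * (np.concatenate([h_k, h_j]) @ a))
    scores = np.array(scores)
    e = np.exp(scores - scores.max())
    return e / e.sum()                      # attention weight per neighboring node

rng = np.random.default_rng(1)
h_k = rng.normal(size=8)
neighbours = [("word", rng.normal(size=8)), ("concept", rng.normal(size=8))]
type_weights = {"word": 0.6, "concept": 0.3, "document": 0.1}
print(inter_node_attention(h_k, neighbours, type_weights, rng.normal(size=16)))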
3. The method of claim 2, wherein the determining at least one type of attention weight for each node based on the text iso-graph comprises:
calculating the sum of the feature vectors of the i-th type adjacent nodes of the k-th node in the text iso-graph to obtain the i-th type feature vector of the k-th node; wherein the i-th type is any one of a document type, a concept type, or a word type;
and determining the i-th type attention weight of the k-th node based on the feature vector of the k-th node, the feature vectors of the i-th type adjacent nodes, and the i-th type feature vector.
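For illustration, a rough sketch of claim 3 under stated assumptions: the i-th type feature of the k-th node is the sum of its i-th type neighbors' feature vectors, and the type attention weight is obtained by scoring the node feature together with that type feature and normalizing over the types; the scoring vectors are hypothetical.

import numpy as np

def type_attention(h_k, neighbours_by_type, mu):
    # neighbours_by_type: {type_name: [feature, ...]}; mu: one scoring vector per type
    scores = {}
    for t, feats in neighbours_by_type.items():
        h_t = np.sum(feats, axis=0)                     # i-th type feature vector
        scores[t] = np.concatenate([h_k, h_t]) @ mu[t]
    m = max(scores.values())
    exp = {t: np.exp(s - m) for t, s in scores.items()}
    z = sum(exp.values())
    return {t: v / z for t, v in exp.items()}           # type attention weights of node k

rng = np.random.default_rng(2)
h_k = rng.normal(size=8)
neigh = {"document": [rng.normal(size=8)], "word": [rng.normal(size=8), rng.normal(size=8)]}
mu = {t: rng.normal(size=16) for t in neigh}
print(type_attention(h_k, neigh, mu))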
4. The method of claim 2, wherein the determining the text feature vector based on inter-node attention weights between all nodes and neighboring nodes, and feature vectors of all nodes, comprises:
inputting the inter-node attention weights between all nodes and their adjacent nodes, and the feature vectors of all nodes, into a heterogeneous graph convolution network to obtain the text feature vector corresponding to the text data;
wherein the heterogeneous graph convolution network is constructed based on a multi-head attention mechanism.
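A very small sketch, for illustration only, of the multi-head aggregation named in claim 4: each head performs one attention-weighted aggregation of node features and the head outputs are concatenated; the head count, activation, and projection sizes are arbitrary choices rather than the claimed network.

import numpy as np

def multi_head_layer(H, A, W_heads):
    # H: (n_nodes, d) node features; A: (n_nodes, n_nodes) inter-node attention weights;
    # W_heads: list of (d, d_head) projection matrices, one per attention head
    heads = [np.tanh(A @ H @ W) for W in W_heads]
    return np.concatenate(heads, axis=1)

rng = np.random.default_rng(3)
H = rng.normal(size=(5, 8))                                # 5 nodes, 8-dim features
A = rng.random((5, 5)); A /= A.sum(axis=1, keepdims=True)  # row-normalized attention
W_heads = [rng.normal(size=(8, 4)) for _ in range(2)]      # 2 heads
print(multi_head_layer(H, A, W_heads).shape)               # (5, 8) aggregated features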
5. The method of claim 1, wherein the determining, based on the text data, a feature vector of a document node, a feature vector of a concept node, and a feature vector of a word node corresponding to the text data comprises:
calculating a word frequency-inverse document frequency (TF-IDF) vector of the text data as the feature vector of the document node;
acquiring a concept set of the text data based on a concept graph;
mapping concepts in the concept set into feature vectors based on a word vector model to obtain feature vectors of the concept nodes;
and acquiring the one-hot encoding vector of the word node in the text data from a preset vocabulary as the feature vector of the word node.
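An illustrative sketch of the feature initialization in claim 5 with stand-in components: scikit-learn's TfidfVectorizer replaces the TF-IDF computation for document nodes, a random lookup stands in for the word-vector model used for concept nodes, and an identity matrix over a preset vocabulary yields the one-hot word-node vectors.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["graph neural text classification", "text classification with concepts"]
vocab = sorted({w for t in texts for w in t.split()})

doc_features = TfidfVectorizer(vocabulary=vocab).fit_transform(texts).toarray()  # document nodes

rng = np.random.default_rng(4)
concept_features = np.stack([rng.normal(size=16)])         # stand-in for word-vector concept embeddings

word_features = np.eye(len(vocab))                          # one-hot vectors for word nodes
print(doc_features.shape, concept_features.shape, word_features.shape)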
6. The method of claim 1, wherein determining weights for edges between nodes in the text iso-graph comprises:
acquiring, based on a concept graph, at least one concept node corresponding to a document node and a relevance value between the document node and the at least one concept node;
determining weights for edges between the document node and the at least one concept node based on the relevance values;
determining the weight of edges between the document nodes and at least one word node based on a word frequency-inverse document frequency TF-IDF algorithm;
and determining the weight of edges between word nodes based on the point-wise mutual information (PMI) between the words.
7. The method of claim 1, wherein the obtaining text data comprises:
acquiring original text data;
preprocessing the original text data to obtain the text data;
wherein the preprocessing comprises at least one of: noise information removal, word segmentation, and stop word removal.
8. A text classification device, the device comprising:
the acquisition module is used for acquiring text data;
the processing module is used for determining the feature vector of the document node, the feature vector of the concept node and the feature vector of the word node corresponding to the text data based on the text data;
the processing module is further used for constructing a text iso-graph based on the feature vector of the document node, the feature vector of the concept node and the feature vector of the word node;
the processing module is further used for determining the weight of the edges between the nodes in the text iso-graph;
the processing module is further used for obtaining text feature vectors corresponding to the text data based on the text iso-graph;
the processing module is further used for classifying the text feature vectors by using a classification function to determine text categories.
9. A text classification device, the device comprising: a processor and a memory configured to store a computer program capable of running on the processor,
Wherein the processor is configured to perform the steps of the method of any of claims 1-7 when the computer program is run.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any of claims 1-7.
CN202111506316.4A 2021-12-10 2021-12-10 Text classification method, device, equipment and storage medium Pending CN116263783A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111506316.4A CN116263783A (en) 2021-12-10 2021-12-10 Text classification method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111506316.4A CN116263783A (en) 2021-12-10 2021-12-10 Text classification method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116263783A true CN116263783A (en) 2023-06-16

Family

ID=86721661

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111506316.4A Pending CN116263783A (en) 2021-12-10 2021-12-10 Text classification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116263783A (en)

Similar Documents

Publication Publication Date Title
CN108388651B (en) Text classification method based on graph kernel and convolutional neural network
CN112990280B (en) Class increment classification method, system, device and medium for image big data
WO2021089013A1 (en) Spatial graph convolutional network training method, electronic device and storage medium
JP7178513B2 (en) Chinese word segmentation method, device, storage medium and computer equipment based on deep learning
CN110968692B (en) Text classification method and system
Zhang et al. Quantifying the knowledge in a DNN to explain knowledge distillation for classification
CN114925205B (en) GCN-GRU text classification method based on contrast learning
CN115546525A (en) Multi-view clustering method and device, electronic equipment and storage medium
Wang et al. A band selection approach based on Lévy sine cosine algorithm and alternative distribution for hyperspectral image
CN115080749A (en) Weak supervision text classification method, system and device based on self-supervision training
CN114238746A (en) Cross-modal retrieval method, device, equipment and storage medium
CN112668633B (en) Adaptive graph migration learning method based on fine granularity field
CN117349494A (en) Graph classification method, system, medium and equipment for space graph convolution neural network
Liu et al. A weight-incorporated similarity-based clustering ensemble method
CN117009613A (en) Picture data classification method, system, device and medium
CN116701647A (en) Knowledge graph completion method and device based on fusion of embedded vector and transfer learning
CN116451081A (en) Data drift detection method, device, terminal and storage medium
CN111126443A (en) Network representation learning method based on random walk
CN116263783A (en) Text classification method, device, equipment and storage medium
CN112307914B (en) Open domain image content identification method based on text information guidance
CN114611668A (en) Vector representation learning method and system based on heterogeneous information network random walk
JP6993250B2 (en) Content feature extractor, method, and program
CN114626378A (en) Named entity recognition method and device, electronic equipment and computer readable storage medium
CN114662687B (en) Graph comparison learning method and system based on interlayer mutual information
CN116402554B (en) Advertisement click rate prediction method, system, computer and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination