CN113495958A - Text classification method and device - Google Patents

Text classification method and device

Info

Publication number
CN113495958A
CN113495958A
Authority
CN
China
Prior art keywords
text
word
neural network
constructing
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010202427.5A
Other languages
Chinese (zh)
Inventor
陈龙
李宥壑
周伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN202010202427.5A priority Critical patent/CN113495958A/en
Publication of CN113495958A publication Critical patent/CN113495958A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text classification method and device, relating to the technical field of deep learning. One embodiment of the method comprises: constructing a text graph from each text in a text set, wherein the text set comprises labeled texts and unlabeled texts; constructing a graph convolutional neural network from the text graph, and training the graph convolutional neural network with the labeled texts in the text set; and inputting a text to be classified into the trained graph convolutional neural network to obtain the classification label of the text to be classified. The embodiment can solve the technical problem that text classification is not accurate enough.

Description

Text classification method and device
Technical Field
The invention relates to the technical field of deep learning, in particular to a text classification method and device.
Background
Text classification underpins applications such as intelligence analysis, topic classification, spam detection, and sentiment analysis, and has very broad development prospects, so research on text classification is particularly important.
Most traditional text classification methods are based on the bag-of-words model, on top of which more complex features (such as n-grams and entity-recognition labels) are designed and then fed to traditional classification algorithms (such as support vector machines and logistic regression). There are also methods that represent texts as graph structures and classify them with graph algorithms (such as hidden Markov chains and conditional random fields).
With the rapid development of deep neural networks in the 2010s, classification methods based on word embeddings and convolutional neural networks (CNN) / recurrent neural networks (RNN) appeared. They all attempt to learn mathematical representations of words from context co-occurrence, word order, and distance features. Tools such as Word2Vec, GloVe, and TextCNN were born of these ideas.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:
the traditional text classification methods have at least two important defects. On one hand, they struggle to express the latent semantics of a text; synonyms, for example, are often not recognized. On the other hand, they struggle to describe semantic distances between texts, for example, determining which of the two words "eggplant" and "tires" is closer to "watermelon".
Deep neural networks are better at solving the above problems, but they also have drawbacks: word-embedding-based text classification requires first training a vector representation of the text and then classifying according to that representation; this is not an end-to-end model and may bring extra loss of semantic category information.
Text classification models represented by CNN/RNN mine local context and word-order relations well, but mine global relations, especially word-to-text relations, poorly, which also brings great information loss. If all the texts in the corpus are short (e.g., a set of names), too little information may be mined, resulting in poor classification.
Disclosure of Invention
In view of this, embodiments of the present invention provide a text classification method and apparatus to solve the technical problem of inaccurate text classification.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a text classification method including:
constructing a text graph from each text in a text set; wherein the text set comprises labeled texts and unlabeled texts;
constructing a graph convolutional neural network according to the text graph, and training the graph convolutional neural network with the labeled texts in the text set;
and inputting a text to be classified into the trained graph convolutional neural network to obtain the classification label of the text to be classified.
Optionally, constructing a text graph from each text in the text set comprises:
performing word segmentation on each text in the text set;
calculating, for each word in each text in the text set, the word frequency-inverse text frequency index of the word in the text, and the mutual information between words;
and constructing the text graph according to the word frequency-inverse text frequency index of each word in each text and the mutual information between words.
Optionally, for each word in each text, the word frequency-inverse text frequency index of the word in the text is calculated as follows:
dividing the number of occurrences of the word in the text by the number of words in the text to obtain a first index;
dividing the total number of texts in the text set by the number of texts containing the word, and then taking the logarithm to obtain a second index;
and multiplying the first index by the second index to obtain the word frequency-inverse text frequency index of the word in the text.
Optionally, for each word in the text set, the mutual information between the word and any other word is calculated as follows:
taking k adjacent words as a window, counting the number of windows in which the word and the other word appear together, and the number of windows in which at least one of the two words appears;
and dividing the number of windows in which the two words appear together by the number of windows in which at least one of them appears, to obtain the mutual information between the two words.
Optionally, constructing the text graph according to the word frequency-inverse text frequency index of each word in each text and the mutual information between words comprises:
taking words and texts as entities, constructing a text graph with weighted undirected edges; the weight of a word-text edge is the word frequency-inverse text frequency index of the word in the text, and the weight of a word-word edge is the mutual information between the words.
Optionally, constructing a graph convolutional neural network according to the text graph comprises:
constructing an adjacency matrix and a degree matrix from the text graph, and constructing a Laplacian matrix from the adjacency matrix and the degree matrix;
and constructing the graph convolutional neural network according to the Laplacian matrix.
Optionally, the number of layers of the graph convolutional neural network is 2-3.
Optionally, the output of the graph convolutional neural network is calculated by:
applying a linear rectification (ReLU) activation function to the product of the Laplacian matrix, the sample text input, and the first-layer network parameters to obtain an intermediate result;
and applying a softmax classification function to the product of the Laplacian matrix, the intermediate result, and the second-layer network parameters to obtain the output result.
In addition, according to another aspect of the embodiments of the present invention, there is provided a text classification apparatus including:
the construction module is configured to construct a text graph from each text in a text set; wherein the text set comprises labeled texts and unlabeled texts;
the training module is configured to construct a graph convolutional neural network according to the text graph, and to train the graph convolutional neural network with the labeled texts in the text set;
and the classification module is configured to input a text to be classified into the trained graph convolutional neural network to obtain the classification label of the text to be classified.
Optionally, the construction module is further configured to:
perform word segmentation on each text in the text set;
calculate, for each word in each text in the text set, the word frequency-inverse text frequency index of the word in the text, and the mutual information between words;
and construct the text graph according to the word frequency-inverse text frequency index of each word in each text and the mutual information between words.
Optionally, the construction module is further configured to calculate, for each word in each text, the word frequency-inverse text frequency index of the word in the text as follows:
dividing the number of occurrences of the word in the text by the number of words in the text to obtain a first index;
dividing the total number of texts in the text set by the number of texts containing the word, and then taking the logarithm to obtain a second index;
and multiplying the first index by the second index to obtain the word frequency-inverse text frequency index of the word in the text.
Optionally, the construction module is further configured to calculate, for each word in the text set, the mutual information between the word and any other word as follows:
taking k adjacent words as a window, counting the number of windows in which the word and the other word appear together, and the number of windows in which at least one of the two words appears;
and dividing the number of windows in which the two words appear together by the number of windows in which at least one of them appears, to obtain the mutual information between the two words.
Optionally, the construction module is further configured to:
take words and texts as entities and construct a text graph with weighted undirected edges; the weight of a word-text edge is the word frequency-inverse text frequency index of the word in the text, and the weight of a word-word edge is the mutual information between the words.
Optionally, the training module is further configured to:
construct an adjacency matrix and a degree matrix from the text graph, and construct a Laplacian matrix from the adjacency matrix and the degree matrix;
and construct the graph convolutional neural network according to the Laplacian matrix.
Optionally, the number of layers of the graph convolutional neural network is 2-3.
Optionally, the output of the graph convolutional neural network is calculated by:
applying a linear rectification (ReLU) activation function to the product of the Laplacian matrix, the sample text input, and the first-layer network parameters to obtain an intermediate result;
and applying a softmax classification function to the product of the Laplacian matrix, the intermediate result, and the second-layer network parameters to obtain the output result.
According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including:
one or more processors;
a storage device for storing one or more programs,
which, when executed by the one or more processors, cause the one or more processors to implement the method of any of the embodiments described above.
According to another aspect of the embodiments of the present invention, there is also provided a computer readable medium, on which a computer program is stored, which when executed by a processor implements the method of any of the above embodiments.
One embodiment of the above invention has the following advantage or benefit: by constructing a text graph from each text in the text set, constructing a graph convolutional neural network from the text graph, and training the network with the labeled texts in the set, the technical problem of insufficiently accurate text classification in the prior art is solved. By building a word-text and word-word text graph and combining it with a graph convolutional neural network, the embodiment of the invention not only fully mines the contextual information of local words but also fully mines the global relations between words and texts, thereby improving the accuracy of text classification.
Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of a main flow of a text classification method according to an embodiment of the invention;
FIG. 2 is a schematic illustration of textual content according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a text atlas according to an embodiment of the invention;
FIG. 4 is a diagram illustrating a main flow of a text classification method according to a referential embodiment of the present invention;
fig. 5 is a schematic diagram of main blocks of a text classification apparatus according to an embodiment of the present invention;
FIG. 6 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
fig. 7 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram of a main flow of a text classification method according to an embodiment of the present invention. As an embodiment of the present invention, as shown in fig. 1, the text classification method may include:
Step 101, constructing a text graph from each text in the text set.
In an embodiment of the invention, the text set comprises labeled texts and unlabeled texts. Assume the set of labeled texts is Y, with C categories, and the set of unlabeled texts is Z; then the text set (i.e., all texts) is Φ = Y ∪ Z, with N texts in total. In Φ, all class labels of the texts in Y are retained (as training data for the graph convolutional neural network).
In practical applications, the training data, i.e., Y, may be produced by manual labeling. Limited by the manual effort involved, Y is generally much smaller than Z.
Optionally, step 101 may comprise: performing word segmentation on each text in the text set; calculating, for each word in each text in the text set, the word frequency-inverse text frequency index of the word in the text and the mutual information between words; and constructing the text graph from these values. In step 101, all texts in the text set Φ are segmented with a word segmenter (such as the NLPIR Chinese word segmentation system), dividing each text into several words, so the text set yields a set of words. Then the word frequency-inverse text frequency index (TF-IDF) of every word in every text and the mutual information (PMI) between words are calculated, and finally the text graph is constructed from the TF-IDF and PMI values.
Optionally, for each word in each text, the word frequency-inverse text frequency index of the word in the text is calculated as follows: dividing the number of occurrences of the word in the text by the number of words in the text to obtain a first index; dividing the total number of texts in the text set by the number of texts containing the word, and then taking the logarithm to obtain a second index; and multiplying the first index by the second index to obtain the word frequency-inverse text frequency index of the word in the text.
Let the j-th word of text i be $w_{i,j}$. For each word of each text, its tfidf value is calculated with the following formula:

$$\mathrm{tfidf}_{i,j} = \frac{n_{i,j}}{|d_i|} \cdot \log\frac{N}{N_{w_{i,j}}}$$

where $n_{i,j}$ is the number of occurrences of $w_{i,j}$ in text $d_i$, $|d_i|$ is the number of words in $d_i$, $N_{w_{i,j}}$ is the number of texts containing $w_{i,j}$, and N is the total number of texts in the text set Φ.
$\mathrm{tfidf}_{i,j}$ can be used to describe the association between the word $w_{i,j}$ and the text $d_i$.
Since the tf values of some stop words, such as "of" and "and", may be large and interfere with the final result, these stop words are removed from the original corpus, and there is no need to calculate tfidf values for them. Therefore, the words of the text set are the word set obtained by segmenting each text in the set and removing stop words.
If the distribution of text categories in the labeled text set Y is uneven, for example if all texts relate to digital products, the idf values of certain words may become small, and deviations may occur when they are used to gauge the classification of new texts. Therefore, the text set Y should include texts of the global categories as far as possible.
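To make the computation concrete, the TF-IDF step can be sketched in Python as below. This is a minimal illustrative sketch, not the patent's implementation; the names (tfidf_scores, texts, stopwords) are assumptions for the example, and texts is taken to be the already-segmented corpus.

```python
import math
from collections import Counter

def tfidf_scores(texts, stopwords=frozenset()):
    """texts: list of token lists (one per document), already segmented."""
    # Remove stop words first, as described above, so their tf values
    # cannot interfere with the result.
    texts = [[w for w in t if w not in stopwords] for t in texts]
    n = len(texts)                                    # N: total number of texts
    n_containing = Counter(w for t in texts for w in set(t))
    scores = {}
    for i, t in enumerate(texts):
        if not t:
            continue
        for w, c in Counter(t).items():
            tf = c / len(t)                           # first index
            idf = math.log(n / n_containing[w])       # second index
            scores[(i, w)] = tf * idf                 # tfidf of word w in text i
    return scores
```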
Optionally, for each word in the text set, the mutual information between the word and any other word is calculated as follows: taking k adjacent words as a window, count the number of windows in which the word and the other word appear together, and the number of windows in which at least one of the two words appears; dividing the former by the latter gives the mutual information between the two words.
Word-word mutual information can describe the contextual relation between words in the same text: for example, "apple" and "iPhone" have a higher probability of appearing in the same text, while "iPhone" and "tomato" have a lower one.
For a document $d_i$, take k adjacent words as a window; in each window, the co-occurrence count of every pair of words that appear together is increased by 1. This yields the number of times the words $w_i$ and $w_j$ appear in the same window (the co-occurrence frequency), $\mathrm{CoFreq}_{i,j}$, and the number of times at least one of them appears in a window, $\mathrm{Freq}_{i,j}$. The mutual information between the two words is then:

$$\mathrm{PMI}_{i,j} = \frac{\mathrm{CoFreq}_{i,j}}{\mathrm{Freq}_{i,j}}$$
As shown in fig. 2, if k = 3, then "search engine" and "advertisement" appear in the same window 2 times, and at least one of the two appears in a window 8 times, so the mutual information of "search engine" and "advertisement" is 2/8 = 0.25. "Search engine" and "mature" never co-occur in the same window, so their mutual information is 0.
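The window statistics can likewise be sketched in Python. Again a hedged illustration with assumed names (pmi_scores, k); windows are taken as contiguous runs of k adjacent words, and the "at least one of the pair" count is obtained by inclusion-exclusion over per-word window counts, consistent with the fig. 2 example above.

```python
from collections import Counter
from itertools import combinations

def pmi_scores(texts, k=3):
    """Mutual information between word pairs over sliding windows of k words."""
    co_freq = Counter()    # CoFreq: windows containing both words of a pair
    word_freq = Counter()  # windows containing a given word
    for t in texts:
        for s in range(max(1, len(t) - k + 1)):
            window = set(t[s:s + k])
            for w in window:
                word_freq[w] += 1
            for a, b in combinations(sorted(window), 2):
                co_freq[(a, b)] += 1
    scores = {}
    for (a, b), co in co_freq.items():
        # Freq: windows where at least one of the pair appears
        at_least_one = word_freq[a] + word_freq[b] - co
        scores[(a, b)] = co / at_least_one
    return scores
```

Pairs that never co-occur get no entry, i.e., a mutual information of 0, as for "search engine" and "mature" above.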
Optionally, constructing the text graph according to the word frequency-inverse text frequency index of each word in each text and the mutual information between words comprises: taking words and texts as entities, constructing a text graph with weighted undirected edges, where the weight of a word-text edge is the word frequency-inverse text frequency index of the word in the text and the weight of a word-word edge is the mutual information between the words. The embodiment of the invention establishes a text graph containing two types of entities: texts and words. The associations between entities are weighted undirected edges.
Assuming there are only two labeled texts, [internet / search engine / commercial value / yes / advertisement] and [search engine / advertisement / technology], the constructed text graph is shown in fig. 3.
The solid edges are links from "text" to "word", and the dashed edges are links from "word" to "word"; the category labels on the edges are for illustration only. The weight of each category of edge is defined as follows:
Word-word: the weight between a word and another word is the mutual information of the two words.
Word-text: the weight between a word (denoted $w_j$) and a document (denoted $d_i$) is the TFIDF value of the word (i.e., $\mathrm{tfidf}_{i,j}$).
Self-self: each entity is naturally identical to itself, so the weight is 1.
The embodiment of the invention constructs the text graph through quantitative calculation of word-word and word-text relations, which is key to classifying texts correctly.
Step 102, constructing a graph convolutional neural network according to the text graph, and training the network with the labeled texts in the text set.
A graph convolutional neural network (GCN) can learn the adjacency relations between the nodes (i.e., entities) of the text graph well, and can learn a vector representation for each node with relatively little resource consumption. It can learn node feature information and structure information simultaneously, end to end.
Optionally, constructing a graph convolutional neural network according to the text graph comprises: constructing an adjacency matrix and a degree matrix from the text graph, constructing a Laplacian matrix from the adjacency matrix and the degree matrix, and constructing the graph convolutional neural network according to the Laplacian matrix.
In an embodiment of the invention, the node representations of the text graph (i.e., the vector representations of texts and words) are learned by the graph convolutional network model, and documents and words are classified simultaneously by a softmax classifier at the output layer of the network.
An adjacency matrix A of the text graph may be constructed, defined as follows:

$$A_{ij} = \begin{cases} \mathrm{PMI}_{i,j} & i, j \text{ are both words} \\ \mathrm{tfidf}_{i,j} & i \text{ is a text and } j \text{ is a word} \\ 1 & i = j \\ 0 & \text{otherwise} \end{cases}$$
Assuming the degree matrix of the adjacency matrix A is D, the Laplacian matrix is $A' = D^{-1/2} A D^{-1/2}$.
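Assembling the adjacency matrix and the Laplacian matrix can be sketched with NumPy as follows. The node ordering (texts first, then words) and the dense representation are assumptions for the example; at scale a sparse matrix would be used.

```python
import numpy as np

def build_graph_matrices(n_texts, vocab, tfidf, pmi):
    """vocab: list of words; tfidf[(i, w)] and pmi[(a, b)] as computed above."""
    col = {w: n_texts + j for j, w in enumerate(vocab)}  # texts first, then words
    n = n_texts + len(vocab)
    A = np.eye(n)                                        # self-self edges: weight 1
    for (i, w), v in tfidf.items():                      # word-text edges
        A[i, col[w]] = A[col[w], i] = v
    for (a, b), v in pmi.items():                        # word-word edges
        A[col[a], col[b]] = A[col[b], col[a]] = v
    # Laplacian A' = D^{-1/2} A D^{-1/2}, with D the degree matrix of A
    d = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(d)
    d_inv_sqrt[d > 0] = d[d > 0] ** -0.5
    D_inv_sqrt = np.diag(d_inv_sqrt)
    return A, D_inv_sqrt @ A @ D_inv_sqrt
```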
Optionally, the number of layers of the graph convolutional neural network is 2 to 3. While improving text classification accuracy, the method uses only a 2-3 layer network structure, so it does not add much resource consumption compared with the prior art and is highly practical.
Optionally, the output of the graph convolutional neural network may be calculated by applying a linear rectification activation function to the product of the Laplacian matrix, the sample text input, and the first-layer network parameters to obtain an intermediate result, and then applying a softmax function to the product of the Laplacian matrix, the intermediate result, and the second-layer network parameters to obtain the output. Taking a two-layer network as an example, the graph convolutional neural network is described as follows:

$$Z = \mathrm{softmax}(A' \cdot \mathrm{ReLU}(A' X W_0) \, W_1)$$

where X is the sample text input, $W_0$ and $W_1$ are the first- and second-layer network parameters respectively, ReLU is the linear rectification activation function, and A' is the Laplacian matrix.
The loss value of model training is defined as the sum, over the labeled texts of the training set, of the cross entropy between each text's label and the softmax values output by the network over the categories:

$$L = -\sum_{d \in Y} \sum_{f=1}^{C} Y_{d,f} \ln Z_{d,f}$$

where C is the number of categories of the labeled texts and $Z_{d,f}$ is the predicted probability that text d belongs to class f. $Y_{d,f}$ is defined as:

$$Y_{d,f} = \begin{cases} 1 & \text{text } d \text{ belongs to class } f \\ 0 & \text{otherwise} \end{cases}$$
Meanwhile, the raw output of the second layer of the graph convolutional neural network, $E_2 = A' \cdot \mathrm{ReLU}(A' X W_0) \, W_1$, gives the embedded (Embedding) representations of texts and words. Obviously, by modifying the width of the hidden layer, the dimension of these representation vectors can be changed.
The embodiment of the invention optimizes the model with an adaptive moment estimation (Adam) optimizer. Optionally, the width of the hidden layer may be set to 500 dimensions, and 32 samples may be randomly drawn from the labeled text set Y for each training step.
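A compact sketch of the training loop is given below, here in PyTorch (the patent does not prescribe a framework). a_hat (the Laplacian A' as a float tensor), labeled_idx (node indices of the labeled texts in Y, as a LongTensor), labels (their class ids, aligned with labeled_idx), and n_classes are assumed to exist; using the identity matrix as the input features X is also an assumption, since the patent does not define X.

```python
import torch
import torch.nn.functional as F

class TwoLayerGCN(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, n_classes):
        super().__init__()
        self.w0 = torch.nn.Linear(in_dim, hidden_dim, bias=False)     # W0
        self.w1 = torch.nn.Linear(hidden_dim, n_classes, bias=False)  # W1

    def forward(self, a_hat, x):
        h = torch.relu(a_hat @ self.w0(x))  # ReLU(A' X W0): the intermediate result
        return a_hat @ self.w1(h)           # A' H W1: logits; softmax is folded into the loss

n_nodes = a_hat.shape[0]
x = torch.eye(n_nodes)                        # assumed one-hot node features
model = TwoLayerGCN(n_nodes, 500, n_classes)  # 500-dim hidden layer, as above
opt = torch.optim.Adam(model.parameters())    # Adam optimizer, as above

for step in range(1000):
    sel = torch.randperm(len(labeled_idx))[:32]  # 32 random samples from Y per step
    opt.zero_grad()
    logits = model(a_hat, x)
    # cross entropy: -sum_f Y_{d,f} ln Z_{d,f}, summed over the sampled labeled texts
    loss = F.cross_entropy(logits[labeled_idx[sel]], labels[sel], reduction="sum")
    loss.backward()
    opt.step()
```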
Step 103, inputting the text to be classified into the trained graph convolutional neural network to obtain the classification label of the text to be classified.
After the model converges, any text in the unlabeled text set Z can be input into the model to obtain its classification label. The text classification method of the embodiment is end to end: classification is completed while the text representations are obtained through training.
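Continuing the sketch above, inference over the unlabeled set is a single forward pass; unlabeled_idx, the assumed tensor of node indices for the texts in Z, follows the same convention as labeled_idx.

```python
with torch.no_grad():
    pred = model(a_hat, x).argmax(dim=1)  # predicted class per node
    z_labels = pred[unlabeled_idx]        # classification labels for the texts in Z
```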
In addition, the embodiment of the invention trains the graph convolutional neural network with relatively few labeled texts and can still classify texts accurately. It should be noted that the trained network may classify not only the unlabeled texts in the text set Φ but also texts outside Φ; the embodiment of the invention imposes no limitation in this respect.
According to the various embodiments above, constructing a text graph from each text in the text set, constructing a graph convolutional neural network from the text graph, and training the network with the labeled texts solves the technical problem of insufficiently accurate text classification in the prior art. By building a word-text and word-word text graph and combining it with a graph convolutional neural network, the embodiment of the invention not only fully mines the contextual information of local words but also fully mines the global relations between words and texts, thereby improving the accuracy of text classification.
Fig. 4 is a schematic diagram of a main flow of a text classification method according to a referential embodiment of the present invention. As another embodiment of the present invention, as shown in fig. 4, the text classification method may include:
Step 401, performing word segmentation on each text in the text set.
Optionally, the text set Φ includes both the labeled text set Y and the unlabeled text set Z. The training data, i.e., Y, may be produced by manual labeling. Limited by the manual effort, Y is generally much smaller than Z.
All texts in the text set Φ are segmented with a word segmenter (such as the NLPIR Chinese word segmentation system), dividing each text into several words.
Step 402, removing stop words from the word segmentation result.
Since the tf values of some stop words may be large and interfere with the final result, these stop words are removed from the text set.
Step 403, calculating, for each word in each text in the text set, the word frequency-inverse text frequency index of the word in the text, and the mutual information between words.
Let the j-th word of text i be $w_{i,j}$. For each word of each text, its tfidf value is calculated with the following formula:

$$\mathrm{tfidf}_{i,j} = \frac{n_{i,j}}{|d_i|} \cdot \log\frac{N}{N_{w_{i,j}}}$$

where $n_{i,j}$ is the number of occurrences of $w_{i,j}$ in text $d_i$, $|d_i|$ is the number of words in $d_i$, $N_{w_{i,j}}$ is the number of texts containing $w_{i,j}$, and N is the total number of texts in the text set Φ.
For a document $d_i$, take k adjacent words as a window; in each window, the co-occurrence count of every pair of words that appear together is increased by 1. This yields the number of times the words $w_i$ and $w_j$ appear in the same window (the co-occurrence frequency), $\mathrm{CoFreq}_{i,j}$, and the number of times at least one of them appears in a window, $\mathrm{Freq}_{i,j}$. The mutual information between the two words is then:

$$\mathrm{PMI}_{i,j} = \frac{\mathrm{CoFreq}_{i,j}}{\mathrm{Freq}_{i,j}}$$
and step 404, constructing a text map according to the word frequency-inverse text frequency index of each word in each text and mutual information among words.
Taking words and texts as entities, a text graph with weighted undirected edges is constructed; the weight of a word-text edge is the word frequency-inverse text frequency index of the word in the text, and the weight of a word-word edge is the mutual information between the words.
Step 405, constructing a graph convolutional neural network according to the text graph.
The node representations of the text graph (i.e., the vector representations of texts and words) are learned through the graph convolutional neural network model, and documents and words are classified simultaneously through the softmax classifier at the output layer of the graph convolutional network.
An adjacency matrix A of the text graph may be constructed, defined as follows:

$$A_{ij} = \begin{cases} \mathrm{PMI}_{i,j} & i, j \text{ are both words} \\ \mathrm{tfidf}_{i,j} & i \text{ is a text and } j \text{ is a word} \\ 1 & i = j \\ 0 & \text{otherwise} \end{cases}$$

Assuming the degree matrix of the adjacency matrix A is D, the Laplacian matrix is $A' = D^{-1/2} A D^{-1/2}$.
Taking a two-layer network as an example, the graph convolutional neural network is described as follows:

$$Z = \mathrm{softmax}(A' \cdot \mathrm{ReLU}(A' X W_0) \, W_1)$$

where X is the sample text input, $W_0$ and $W_1$ are the first- and second-layer network parameters respectively, ReLU is the linear rectification activation function, and A' is the Laplacian matrix.
Step 406, training the graph convolutional neural network with the labeled texts in the text set.
The loss value of model training is defined as the sum, over the labeled texts of the training set, of the cross entropy between each text's label and the softmax values output by the network over the categories:

$$L = -\sum_{d \in Y} \sum_{f=1}^{C} Y_{d,f} \ln Z_{d,f}$$

where C is the number of categories of the labeled texts and $Z_{d,f}$ is the predicted probability that text d belongs to class f. $Y_{d,f}$ is defined as:

$$Y_{d,f} = \begin{cases} 1 & \text{text } d \text{ belongs to class } f \\ 0 & \text{otherwise} \end{cases}$$
Meanwhile, the raw output of the second layer of the graph convolutional neural network, $E_2 = A' \cdot \mathrm{ReLU}(A' X W_0) \, W_1$, gives the Embedding representations of texts and words.
Step 407, inputting the text to be classified into the trained graph convolutional neural network to obtain the classification label of the text to be classified.
After the model converges, any text in the unlabeled text set Z can be input into the model to obtain its classification label.
In addition, the detailed implementation of this referential embodiment has already been described in the text classification method above, so the repeated content is not described again.
Fig. 5 is a schematic diagram of the main modules of a text classification apparatus according to an embodiment of the present invention. As shown in fig. 5, the text classification apparatus 500 includes a construction module 501, a training module 502, and a classification module 503. The construction module 501 is configured to construct a text graph from each text in a text set, wherein the text set comprises labeled texts and unlabeled texts; the training module 502 is configured to construct a graph convolutional neural network according to the text graph and to train the network with the labeled texts in the text set; and the classification module 503 is configured to input a text to be classified into the trained graph convolutional neural network to obtain the classification label of the text to be classified.
Optionally, the construction module 501 is further configured to:
perform word segmentation on each text in the text set;
calculate, for each word in each text in the text set, the word frequency-inverse text frequency index of the word in the text, and the mutual information between words;
and construct the text graph according to the word frequency-inverse text frequency index of each word in each text and the mutual information between words.
Optionally, the construction module 501 is further configured to calculate, for each word in each text, the word frequency-inverse text frequency index of the word in the text as follows:
dividing the number of occurrences of the word in the text by the number of words in the text to obtain a first index;
dividing the total number of texts in the text set by the number of texts containing the word, and then taking the logarithm to obtain a second index;
and multiplying the first index by the second index to obtain the word frequency-inverse text frequency index of the word in the text.
Optionally, the construction module 501 is further configured to calculate, for each word in the text set, the mutual information between the word and any other word as follows:
taking k adjacent words as a window, counting the number of windows in which the word and the other word appear together, and the number of windows in which at least one of the two words appears;
and dividing the number of windows in which the two words appear together by the number of windows in which at least one of them appears, to obtain the mutual information between the two words.
Optionally, the construction module 501 is further configured to:
take words and texts as entities and construct a text graph with weighted undirected edges; the weight of a word-text edge is the word frequency-inverse text frequency index of the word in the text, and the weight of a word-word edge is the mutual information between the words.
Optionally, the training module 502 is further configured to:
construct an adjacency matrix and a degree matrix from the text graph, and construct a Laplacian matrix from the adjacency matrix and the degree matrix;
and construct the graph convolutional neural network according to the Laplacian matrix.
Optionally, the number of layers of the graph convolutional neural network is 2-3.
Optionally, the output of the graph convolutional neural network is calculated by:
applying a linear rectification (ReLU) activation function to the product of the Laplacian matrix, the sample text input, and the first-layer network parameters to obtain an intermediate result;
and applying a softmax classification function to the product of the Laplacian matrix, the intermediate result, and the second-layer network parameters to obtain the output result.
According to the various embodiments above, constructing a text graph from each text in the text set, constructing a graph convolutional neural network from the text graph, and training the network with the labeled texts solves the technical problem of insufficiently accurate text classification in the prior art. By building a word-text and word-word text graph and combining it with a graph convolutional neural network, the embodiment of the invention not only fully mines the contextual information of local words but also fully mines the global relations between words and texts, thereby improving the accuracy of text classification.
It should be noted that the details of the embodiment of the text classification apparatus have already been described in the text classification method above, so the repeated content is not described here again.
Fig. 6 shows an exemplary system architecture 600 to which the text classification method or the text classification apparatus of the embodiments of the present invention may be applied.
As shown in fig. 6, the system architecture 600 may include terminal devices 601, 602, 603, a network 604, and a server 605. The network 604 serves to provide a medium for communication links between the terminal devices 601, 602, 603 and the server 605. Network 604 may include various types of connections, such as wire, wireless communication links, or fiber optic cables, to name a few.
A user may use the terminal devices 601, 602, 603 to interact with the server 605 via the network 604 to receive or send messages or the like. The terminal devices 601, 602, 603 may have installed thereon various communication client applications, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 601, 602, 603 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 605 may be a server providing various services, such as a background management server (for example only) providing support for shopping websites browsed by users using the terminal devices 601, 602, 603. The background management server may analyze and otherwise process the received data such as the item information query request, and feed back a processing result (for example, target push information, item information — just an example) to the terminal device.
It should be noted that the text classification method provided by the embodiment of the present invention is generally executed by the server 605, and accordingly, the text classification apparatus is generally disposed in the server 605. The text classification method provided by the embodiment of the present invention may also be executed by the terminal devices 601, 602, and 603, and accordingly, the text classification apparatus may be disposed in the terminal devices 601, 602, and 603.
It should be understood that the number of terminal devices, networks, and servers in fig. 6 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 7, shown is a block diagram of a computer system 700 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU) 701, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
The following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is mounted into the storage section 708 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product, which comprises a computer program carried on a computer readable medium, the computer program containing program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 701.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer programs according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or by hardware. The described modules may also be provided in a processor, which may be described as: a processor including a construction module, a training module, and a classification module, where the names of these modules do not, in some cases, limit the modules themselves.
As another aspect, the present invention also provides a computer readable medium, which may be contained in the apparatus described in the above embodiments or may exist separately without being assembled into the apparatus. The computer readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: construct a text graph from each text in a text set, wherein the text set comprises labeled texts and unlabeled texts; construct a graph convolutional neural network according to the text graph, and train the network with the labeled texts in the text set; and input a text to be classified into the trained graph convolutional neural network to obtain the classification label of the text to be classified.
According to the technical scheme of the embodiments of the present invention, constructing a text graph from each text in the text set, constructing a graph convolutional neural network from the text graph, and training the network with the labeled texts in the set solves the technical problem of insufficiently accurate text classification in the prior art. By building a word-text and word-word text graph and combining it with a graph convolutional neural network, the embodiments not only fully mine the contextual information of local words but also fully mine the global relations between words and texts, thereby improving the accuracy of text classification.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (11)

1. A method of text classification, comprising:
constructing a text graph from each text in a text set; wherein the text set comprises labeled texts and unlabeled texts;
constructing a graph convolutional neural network according to the text graph, and training the graph convolutional neural network with the labeled texts in the text set;
and inputting a text to be classified into the trained graph convolutional neural network to obtain the classification label of the text to be classified.
2. The method of claim 1, wherein constructing a text graph from each text in the text set comprises:
performing word segmentation on each text in the text set;
calculating, for each word in each text in the text set, the word frequency-inverse text frequency index of the word in the text, and the mutual information between words;
and constructing the text graph according to the word frequency-inverse text frequency index of each word in each text and the mutual information between words.
3. The method of claim 2, wherein, for each word in each text, the word frequency-inverse text frequency index of the word in the text is calculated as follows:
dividing the number of occurrences of the word in the text by the number of words in the text to obtain a first index;
dividing the total number of texts in the text set by the number of texts containing the word, and then taking the logarithm to obtain a second index;
and multiplying the first index by the second index to obtain the word frequency-inverse text frequency index of the word in the text.
4. The method of claim 2, wherein, for each word in the text set, the mutual information between the word and any other word is calculated as follows:
taking k adjacent words as a window, counting the number of windows in which the word and the other word appear together, and the number of windows in which at least one of the two words appears;
and dividing the number of windows in which the two words appear together by the number of windows in which at least one of them appears, to obtain the mutual information between the two words.
5. The method of claim 2, wherein constructing the text graph according to the word frequency-inverse text frequency index of each word in each text and the mutual information between words comprises:
taking words and texts as entities, constructing a text graph with weighted undirected edges; the weight of a word-text edge is the word frequency-inverse text frequency index of the word in the text, and the weight of a word-word edge is the mutual information between the words.
6. The method of claim 1, wherein constructing a graph convolutional neural network from the text graph comprises:
constructing an adjacency matrix and a degree matrix from the text graph, and constructing a Laplacian matrix from the adjacency matrix and the degree matrix;
and constructing the graph convolutional neural network according to the Laplacian matrix.
7. The method of claim 6, wherein the number of layers of the graph convolutional neural network is 2-3.
8. The method of claim 7, wherein the output result of the graph convolutional neural network is calculated by:
applying a linear rectification (ReLU) activation function to the product of the Laplacian matrix, the sample text input, and the first-layer network parameters to obtain an intermediate result;
and applying a softmax classification function to the product of the Laplacian matrix, the intermediate result, and the second-layer network parameters to obtain the output result.
9. A text classification apparatus, comprising:
the construction module is configured to construct a text graph from each text in a text set; wherein the text set comprises labeled texts and unlabeled texts;
the training module is configured to construct a graph convolutional neural network according to the text graph, and to train the graph convolutional neural network with the labeled texts in the text set;
and the classification module is configured to input a text to be classified into the trained graph convolutional neural network to obtain the classification label of the text to be classified.
10. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-8.
11. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1 to 8.
CN202010202427.5A 2020-03-20 2020-03-20 Text classification method and device Pending CN113495958A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010202427.5A CN113495958A (en) 2020-03-20 2020-03-20 Text classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010202427.5A CN113495958A (en) 2020-03-20 2020-03-20 Text classification method and device

Publications (1)

Publication Number Publication Date
CN113495958A true CN113495958A (en) 2021-10-12

Family

ID=77993702

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010202427.5A Pending CN113495958A (en) 2020-03-20 2020-03-20 Text classification method and device

Country Status (1)

Country Link
CN (1) CN113495958A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109189925A * 2018-08-16 2019-01-11 South China Normal University (华南师范大学) Word vector model based on mutual information and text classification method based on CNN
CN109299270A * 2018-10-30 2019-02-01 Information Center of Yunnan Power Grid Co., Ltd. (云南电网有限责任公司信息中心) Unsupervised text data clustering method based on convolutional neural networks
CN110717047A * 2019-10-22 2020-01-21 Hunan University of Science and Technology (湖南科技大学) Web service classification method based on graph convolutional neural network



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination