WO2022105123A1 - Text classification method, topic generation method, apparatus, device, and medium - Google Patents

Text classification method, topic generation method, apparatus, device, and medium

Info

Publication number
WO2022105123A1
Authority
WO
WIPO (PCT)
Prior art keywords: node, nodes, common, vector, classified
Application number
PCT/CN2021/090711
Other languages
French (fr)
Chinese (zh)
Inventor
刘金克
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2022105123A1

Classifications

    • G06F 16/35: Information retrieval of unstructured textual data; Clustering; Classification
    • G06F 16/3334: Query processing; Query translation; Selection or weighting of terms from queries, including natural language queries
    • G06F 16/3346: Query processing; Query execution; Query execution using probabilistic model
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Provided are a text classification method, topic generation method, apparatus, device, and medium. The text classification method comprises: crawling online articles and obtaining keywords corresponding to each article (S1); obtaining common keywords between pairs of articles and constructing a representation graph on the basis of the common keywords, each node in the representation graph representing one article, with lines connecting pairs of nodes that have common keywords (S2); calculating the closeness between each node and the other connected nodes on the basis of the common keywords, and obtaining a node vector of each node on the basis of the closeness (S3); and feeding the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model (S4). The method can classify texts accurately.

Description

Text classification method, topic generation method, apparatus, device, and medium
This application claims priority to Chinese patent application No. CN202011305385.4, filed with the China Patent Office on November 19, 2020 and entitled "Text classification method, topic generation method, apparatus, device, and medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of artificial intelligence, and in particular to a text classification method and a topic generation method, apparatus, device, and medium.
Background
At present, a large amount of information is produced on the Internet every day, covering breaking events, event analyses, public opinion predictions, social development events, and so on. Information spreads rapidly through the Internet, and everyone can quickly obtain large amounts of it. Text classification plays an important role in information processing, and classifying information accurately through effective methods is of great value. Traditional text classification methods fall into two kinds: one is based on clustering and similarity, grouping related texts together by computing the similarity of their titles or abstracts; the other is based on classification models, for example modeling texts such as articles with algorithms like RNN or Text-CNN and outputting the text categories.
The inventor realized that the above methods all process serialized representation features of the text and can achieve certain results, but a text carries far more information than that. For example, a given article is associated with multiple other articles, and each pairwise association is relative to that article: it characterizes the relative degree of association between that article and each of the other articles. Methods based on serialized representation features cannot mine this intrinsic relationship and therefore cannot classify texts accurately, so the technology for accurately classifying texts needs further improvement.
Summary of the Invention
A text classification method, comprising:
crawling web articles and obtaining keywords corresponding to each article;
obtaining common keywords between pairs of articles, and constructing a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords;
calculating the closeness between each node and the other nodes connected to it based on the common keywords, and obtaining a node vector of each node based on the closeness; and
inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model.
A topic generation method based on the above text classification method, the topic generation method comprising:
crawling web articles and obtaining keywords corresponding to each article;
obtaining common keywords between pairs of articles, and constructing a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords;
calculating the closeness between each node and the other nodes connected to it based on the common keywords, and obtaining a node vector of each node based on the closeness;
inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model; and
selecting a preset number of nodes from the set of each category, extracting common information of the corresponding articles based on the selected nodes, and generating a topic based on the common information.
A text classification apparatus, comprising:
a crawling module, configured to crawl web articles and obtain keywords corresponding to each article;
a construction module, configured to obtain common keywords between pairs of articles and construct a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords;
a processing module, configured to calculate the closeness between each node and the other connected nodes based on the common keywords, and obtain a node vector of each node based on the closeness; and
a classification module, configured to input the node vector of each node into a predetermined classification model for training, and obtain the set of classified nodes output by the classification model.
A computer device, comprising a memory and a processor connected to the memory, the memory storing a computer program executable on the processor, where the processor, when executing the computer program, implements the following steps:
crawling web articles and obtaining keywords corresponding to each article;
obtaining common keywords between pairs of articles, and constructing a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords;
calculating the closeness between each node and the other connected nodes based on the common keywords, and obtaining a node vector of each node based on the closeness; and
inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model;
or implements the following steps:
crawling web articles and obtaining keywords corresponding to each article;
obtaining common keywords between pairs of articles, and constructing a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords;
calculating the closeness between each node and the other connected nodes based on the common keywords, and obtaining a node vector of each node based on the closeness;
inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model; and
selecting a preset number of nodes from the set of each category, extracting common information of the corresponding articles based on the selected nodes, and generating a topic based on the common information.
A computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the following steps:
crawling web articles and obtaining keywords corresponding to each article;
obtaining common keywords between pairs of articles, and constructing a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords;
calculating the closeness between each node and the other connected nodes based on the common keywords, and obtaining a node vector of each node based on the closeness; and
inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model;
or implements the following steps:
crawling web articles and obtaining keywords corresponding to each article;
obtaining common keywords between pairs of articles, and constructing a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords;
calculating the closeness between each node and the other connected nodes based on the common keywords, and obtaining a node vector of each node based on the closeness;
inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model; and
selecting a preset number of nodes from the set of each category, extracting common information of the corresponding articles based on the selected nodes, and generating a topic based on the common information.
The present application can accurately group the most similar articles into one category, achieving better classification.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of an embodiment of the text classification method of the present application;
FIG. 2 is a schematic diagram of the representation graph in FIG. 1;
FIG. 3 is a detailed flowchart of the step in FIG. 1 of calculating the closeness between each node and the other connected nodes based on the common keywords and obtaining the node vector of each node based on the closeness;
FIG. 4 is a detailed flowchart of the step in FIG. 1 of inputting the node vector of each node into a predetermined classification model for training and obtaining the set of classified nodes output by the classification model;
FIG. 5 is a schematic flowchart of an embodiment of the topic generation method of the present application;
FIG. 6 is a schematic structural diagram of an embodiment of the text classification apparatus of the present application;
FIG. 7 is a schematic diagram of the hardware architecture of an embodiment of the computer device of the present application.
Detailed Description of the Embodiments
To make the purposes, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present application, not to limit it. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.
It should be noted that descriptions such as "first" and "second" in this application are for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly specifying the number of the indicated technical features. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments can be combined with one another, but only on the basis that they can be realized by those of ordinary skill in the art; when a combination of technical solutions is contradictory or cannot be realized, such a combination should be deemed not to exist and falls outside the protection scope claimed by this application.
Referring to FIG. 1, a schematic flowchart of an embodiment of the text classification method of the present application, the method includes:
Step S1: crawl web articles and obtain keywords corresponding to each article.
Web articles may be crawled from the Internet at regular intervals (for example, daily) to generate topics for the corresponding period. The web articles include articles of different tag categories, for example headline news, finance, education, and sports.
First, each article is segmented into words. A word segmentation tool may be used to process the articles one by one, for example the Stanford Chinese word segmenter or the jieba segmenter. For each article, word segmentation yields a corresponding word list.
Keywords are then extracted by a predetermined keyword extraction algorithm. For example, any one of the TF-IDF (Term Frequency-Inverse Document Frequency), LSA (Latent Semantic Analysis), or PLSA (Probabilistic Latent Semantic Analysis) algorithms may be applied to the word list of each article, and the words with higher scores are taken as the article's keywords. As another implementation, multiple keyword extraction algorithms may be applied to the same article simultaneously, and the keywords extracted in common by all of them are taken as the article's keywords.
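By way of illustration only, and not as part of the original disclosure, the segmentation and TF-IDF scoring described above might look like the following Python sketch; the function name, the top_k cutoff of 10, and the choice of jieba over the Stanford segmenter are assumptions made for the example.

```python
import math
from collections import Counter

import jieba  # one of the segmenters mentioned above

def extract_keywords(articles, top_k=10):
    """Segment each article, score its words by TF-IDF, and keep the
    top_k highest-scoring words as that article's keywords."""
    word_lists = [list(jieba.cut(text)) for text in articles]
    n_docs = len(word_lists)
    # document frequency of each word across all articles
    df = Counter(w for words in word_lists for w in set(words))
    keywords = []
    for words in word_lists:
        tf = Counter(words)
        scores = {w: (tf[w] / len(words)) * math.log(n_docs / (1 + df[w]))
                  for w in tf}
        keywords.append(set(sorted(scores, key=scores.get, reverse=True)[:top_k]))
    return word_lists, keywords  # word lists are reused by later sketches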
Step S2: obtain common keywords between pairs of articles, and construct a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords.
Whether two articles share common keywords is analyzed; if they do, each article is taken as a node and a line is drawn between the two nodes. After all the crawled articles have been analyzed, every pair of nodes with common keywords is connected, thereby constructing the representation graph. The constructed representation graph is shown in FIG. 2.
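Continuing the sketch under the same assumptions (the keyword sets come from the hypothetical extract_keywords above), the representation graph can be held as a map from node pairs to their shared keywords:

```python
from itertools import combinations

def build_representation_graph(keywords):
    """One node per article; an edge between every pair of articles
    whose keyword sets intersect. Returns {(a, b): common_keyword_set}."""
    edges = {}
    for a, b in combinations(range(len(keywords)), 2):
        common = keywords[a] & keywords[b]
        if common:
            edges[(a, b)] = common
    return edges
```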
Step S3: calculate the closeness between each node and the other connected nodes based on the common keywords, and obtain the node vector of each node based on the closeness.
In one embodiment, as shown in FIG. 3, step S3 includes:
Step S31: count the number of common keywords in the two articles corresponding to two connected nodes;
Step S32: count the number of times each common keyword appears in each of the two articles corresponding to the two connected nodes;
Step S33: calculate the closeness S between each node and the other connected nodes based on the number of common keywords and the respective occurrence counts:
S = μ·∑_{i=1}^{n} (A_i / B_i)
where A and B denote connected nodes in the representation graph, n is the number of common keywords in the two articles corresponding to nodes A and B, i is the index of a common keyword, A_i is the number of times the i-th common keyword appears in the article corresponding to node A, B_i is the number of times the i-th common keyword appears in the article corresponding to node B, and μ is the reciprocal of the number of common keywords.
The summation ∑_{i=1}^{n} (A_i / B_i) is the sum of all the ratios; multiplying it by μ yields the ratio averaged over the common keywords. The closeness S expresses the association between two articles that share common keywords and how close that association is. When two articles are very similar, the value of S approaches 1; for example, two identical articles have S = 1. If two articles are not similar, S approaches 0 or is much larger than 1, i.e., it fluctuates widely around the value 1.
Step S34: vectorize the closeness between each node and the other connected nodes to obtain the node vector corresponding to each node.
In this embodiment, the closeness between each node and the other connected nodes is vectorized to obtain the node vector corresponding to the node. For example, denote all the crawled article nodes as A0, A1, A2, …, An; the closeness between node A0 and node A1 is S1, the closeness between A0 and A2 is S2, and so on, giving the node vector representation (S1, S2, …, Sn) for node A0. The node vector expression of each article is constructed in the same way; this completes the vectorization of node A0 and ultimately yields a vector expression for every node in the representation graph. The vector expression of each node contains not only the sequence features of the keywords but also the degree of closeness between that node and the other nodes.
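An illustrative sketch of steps S31 to S34, assuming the ratio formula reconstructed above; storing 0 for unconnected node pairs is an assumption of this sketch, since the text does not specify a value for them:

```python
from collections import Counter

def closeness(counts_a, counts_b, common):
    """Closeness S of step S33: the per-keyword count ratio A_i / B_i,
    averaged over the n common keywords (mu = 1/n)."""
    mu = 1.0 / len(common)
    return mu * sum(counts_a[w] / counts_b[w] for w in common)

def node_vectors(word_lists, edges):
    """Step S34: one vector per node whose j-th entry is the closeness
    to node j (0 where the two nodes are not connected)."""
    counts = [Counter(words) for words in word_lists]
    n = len(word_lists)
    vectors = [[0.0] * n for _ in range(n)]
    for (a, b), common in edges.items():
        vectors[a][b] = closeness(counts[a], counts[b], common)
        vectors[b][a] = closeness(counts[b], counts[a], common)
    return vectors
```

Every common keyword was extracted from both articles' word lists, so each count is at least 1 and the ratios are well defined.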
Step S4: input the node vector of each node into a predetermined classification model for training, and obtain the set of classified nodes output by the classification model.
The predetermined classification model may be any of a Naive Bayes (NB) model, a random forest (RF) model, an SVM classifier, a KNN classifier, or a neural network classifier, and may of course also be another deep learning text classification model such as fastText or TextCNN. The classification model in this embodiment adopts a graph neural network (GNN). A graph neural network is a connectionist model for learning over graphs that contain a large number of connections; as information propagates among the nodes of the graph, it captures the dependencies between the nodes. Unlike other classification models, a graph neural network maintains a state that can represent information from a manually specified depth. Moreover, the goal of a graph neural network is to learn a state embedding of each node's neighbors; this state embedding is a vector and can be used to produce the output. This embodiment specifically adopts Graph Attention Networks, which introduce an attention mechanism into the graph neural network and use it to give greater weight to the more important nodes.
In one embodiment, as shown in FIG. 4, step S4 includes:
Step S41: input the node vector of each node into the graph attention network, take the nodes whose node vectors are input into the graph attention network as the nodes to be classified, and calculate the loss function of each node to be classified;
Step S42: for each node to be classified, when the loss function is minimized, calculate the contribution of its neighbor nodes to the node to be classified based on the node vector of the node to be classified, where the neighbor nodes are the nodes connected to the node to be classified in the representation graph;
Step S43: aggregate the neighbor nodes based on the contributions.
The loss function adopted encourages more similar nodes to aggregate, while less similar nodes are kept apart in the embedding space. The formula of the loss function is:
J(Z_u) = -log(σ(Z_u^T·Z_v)) - Q·E_{v_n∼P_n(v)}[log(σ(-Z_u^T·Z_{v_n}))]
where Z_u is the embedding vector generated for node u; node v is a neighbor node reached from node u by random walk; Z_v is the embedding vector generated for node v; σ denotes the sigmoid function; T denotes transposition; the negative samples are nodes that cannot become neighbor nodes under the random walk; Q is the number of negative samples; E is the expected value over the probability distribution; P_n(v) is the probability distribution of the negative samples; n is the node index; and "∼" means "distributed according to".
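For illustration, the loss can be read numerically as in the sketch below; using a Monte-Carlo mean over drawn negatives in place of the expectation over P_n(v) is an assumption of this sketch:

```python
import numpy as np

def unsupervised_loss(z_u, z_v, z_negatives, Q):
    """Loss for one node u: pull the random-walk neighbor v close
    (-log sigma(z_u . z_v)) and push Q negative samples away."""
    sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
    positive = -np.log(sigmoid(z_u @ z_v))
    negative = -Q * np.mean([np.log(sigmoid(-(z_u @ z_n))) for z_n in z_negatives])
    return positive + negative
```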
The node vector of each node is input into the graph attention network, and these nodes are taken as the nodes to be classified. For each node to be classified, when its loss function is minimized, the contributions of its neighbor nodes are calculated, the neighbor nodes are aggregated based on the contributions, and several categories are output; the nodes contained in each category are the most similar nodes. Classification here means classification by the similarity of article content: the more similar two articles are, the greater the probability that they belong to the same category.
Calculating the contribution of a neighbor node to the node to be classified based on the node vector of the node to be classified includes:
e_AB = LeakyReLU(α^T [W_A || W_B]), where A and B are connected nodes in the representation graph, node A is the node to be classified, node B is a neighbor node of node A, e_AB is the contribution of neighbor node B to node A, LeakyReLU is the leaky rectified linear unit function, which performs a nonlinear activation, W_A is the node vector of node A, W_B is the node vector of node B, || denotes the concatenation of the node vectors W_A and W_B, α is the shared attention function, and α^T is the transpose of the shared attention function.
When generating the new features of the next hidden layer, node A relies on the contribution e_AB of neighbor node B; the larger e_AB is, the greater the probability that the nodes are aggregated together.
The contribution e_AB of neighbor node B to the generation of node A's new features is computed by the graph attention network with a feed-forward neural network; the graph attention network computes the contributions of the neighbor nodes and aggregates similar nodes. A node may be aggregated into only one category, or into multiple different categories.
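A minimal sketch of the contribution formula above; in a real graph attention network the shared attention vector alpha would be learned, and the 0.2 negative slope is the conventional LeakyReLU default rather than a value given in the text:

```python
import numpy as np

def attention_contribution(w_a, w_b, alpha, negative_slope=0.2):
    """e_AB = LeakyReLU(alpha^T [W_A || W_B]) for one neighbor pair."""
    concat = np.concatenate([w_a, w_b])  # [W_A || W_B]
    score = float(alpha @ concat)        # alpha^T [W_A || W_B]
    return score if score > 0 else negative_slope * score  # LeakyReLU
```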
The above step S4 further includes: using a normalized exponential function to calculate, after aggregation, the score of each node when it is aggregated under the current category, and determining the category of the node based on the score.
The normalized exponential function is calculated as follows:
p(y|x) = exp(W_y·x) / ∑_{c∈C} exp(W_c·x)
where p(y|x) is the probability that node x belongs to category y, C is the set of categories, c is the index of a category, and W is the vector mapping matrix. The larger p(y|x) is, the greater the probability that the node is assigned to the corresponding category. In this embodiment, the probability p(y|x) of a node being assigned to each category is obtained and used as the node's score for that category, and the category with the largest score is taken as the node's final category.
In this embodiment, a representation graph is constructed from the common keywords between articles, the closeness between each node in the graph and the other connected nodes is calculated to obtain the node vector of each node, and the node vectors are input into the classification model for training to obtain the sets of classified nodes. By constructing a representation graph of the articles, taking the closeness between a node and its connected nodes as the node vector, and classifying the nodes by training the classification model on the node vectors, this embodiment can mine the relative closeness of association between one node and multiple other nodes. This closeness is a further intrinsic or spatial connection between the article and the other articles, and through it the most similar articles can be accurately grouped into one category, achieving better classification.
The present application further provides a topic generation method based on the above text classification method. As shown in FIG. 5, the topic generation method includes:
Step S101: crawl web articles and obtain keywords corresponding to each article;
Step S102: obtain common keywords between pairs of articles, and construct a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords;
Step S103: calculate the closeness between each node and the other connected nodes based on the common keywords, and obtain the node vector of each node based on the closeness;
Step S104: input the node vector of each node into a predetermined classification model for training, and obtain the set of classified nodes output by the classification model;
Step S105: select a preset number of nodes from the set of each category, extract common information of the corresponding articles based on the selected nodes, and generate a topic based on the common information.
For the definitions of steps S101 to S104 above, reference may be made to the embodiments of the text classification method described above. In step S105, in one implementation, a preset number of nodes are selected from the set of each category: the nodes may be sorted in descending order by the score of their assignment to the category, and the preset number of nodes with the largest scores are selected, for example the 5 nodes with the largest scores. The common information of the articles corresponding to the selected nodes is then obtained, and a topic is generated based on that common information. Among the preset number of nodes, the common information may be taken from the articles of two or more of the nodes, or directly from the articles of all the nodes, and the topic is generated according to the category of these nodes and the common information. Obtaining the common information of the articles can be implemented with existing means of extracting text features, which are not elaborated here.
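As a closing illustration, the category scoring and node selection described above might be combined as follows; treating each row of the mapping matrix W as one category and using 5 as the preset number are assumptions mirroring the example in the text:

```python
import numpy as np

def category_scores(x, W):
    """Normalized exponential function above: p(y|x) for every category,
    with one row of W per category."""
    logits = W @ x
    logits -= logits.max()  # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def select_topic_nodes(node_embeddings, W, preset=5):
    """Step S105: for each category, the `preset` nodes whose scores
    p(y|x) under that category are largest."""
    scores = np.stack([category_scores(x, W) for x in node_embeddings])
    top = {}
    for c in range(W.shape[0]):
        ranked = np.argsort(-scores[:, c])
        top[c] = ranked[:preset].tolist()
    return top
```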
As in the preceding embodiment, a representation graph is constructed from the common keywords between articles, the closeness between each node and the other connected nodes is calculated to obtain the node vector of each node, and the node vectors are input into the classification model for training to obtain the sets of classified nodes. By mining the relative closeness of association between one node and multiple other nodes, i.e., the further intrinsic or spatial connection between an article and the other articles, the most similar articles can be accurately grouped into one category; extracting the common information of the nodes based on this classification and generating topics from it yields high-quality topics.
In one embodiment, the present application provides a text classification apparatus, which corresponds one-to-one with the text classification method of the above embodiment. As shown in FIG. 6, the text classification apparatus includes:
a crawling module 101, configured to crawl web articles and obtain keywords corresponding to each article;
a construction module 102, configured to obtain common keywords between pairs of articles and construct a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords;
a processing module 103, configured to calculate the closeness between each node and the other connected nodes based on the common keywords, and obtain the node vector of each node based on the closeness;
a classification module 104, configured to input the node vector of each node into a predetermined classification model for training, and obtain the set of classified nodes output by the classification model.
For the specific definition of the text classification apparatus, reference may be made to the definition of the text classification method above, which is not repeated here. Each module of the above text classification apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in or independent of a processor of a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, the present application provides a topic generation apparatus, which corresponds one-to-one with the topic generation method of the above embodiment. The topic generation apparatus includes:
a crawling module, configured to crawl web articles and obtain keywords corresponding to each article;
a construction module, configured to obtain common keywords between pairs of articles and construct a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords;
a processing module, configured to calculate the closeness between each node and the other connected nodes based on the common keywords, and obtain the node vector of each node based on the closeness;
a classification module, configured to input the node vector of each node into a predetermined classification model for training, and obtain the set of classified nodes output by the classification model; and
a generation module, configured to select a preset number of nodes from the set of each category, extract common information of the corresponding articles based on the selected nodes, and generate a topic based on the common information.
For the specific definition of the topic generation apparatus, reference may be made to the definition of the topic generation method above, which is not repeated here. Each module of the above topic generation apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in or independent of a processor of a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, namely a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions. The computer device may be a PC (Personal Computer), a smartphone, a tablet computer, or a computer; it may also be a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing, where cloud computing is a kind of distributed computing: a super virtual computer composed of a group of loosely coupled computers.
As shown in FIG. 7, the computer device may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that can be communicatively connected to one another through a system bus, with the memory 11 storing a computer program that can run on the processor 12. It should be noted that FIG. 7 only shows a computer device with components 11-13, but it should be understood that not all the illustrated components are required to be implemented, and more or fewer components may be implemented instead.
The memory 11 may be a non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM). In this embodiment, the readable storage medium of the memory 11 is generally used to store the operating system and the various kinds of application software installed on the computer device, for example the program code of the computer program in an embodiment of the present application. In addition, the memory 11 may also be used to temporarily store various kinds of data that have been output or are to be output.
In some embodiments, the processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip, and is used to run the program code stored in the memory 11 or to process data, for example to run the computer program.
The network interface 13 may include a standard wireless network interface and a wired network interface, and is generally used to establish a communication connection between the computer device and other electronic devices.
The computer program is stored in the memory 11 and includes at least one computer-readable instruction stored in the memory 11, and the at least one computer-readable instruction can be executed by the processor 12 to implement the following steps:
crawling web articles and obtaining keywords corresponding to each article;
obtaining common keywords between pairs of articles, and constructing a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords;
calculating the closeness between each node and the other connected nodes based on the common keywords, and obtaining the node vector of each node based on the closeness; and
inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model; or
the at least one computer-readable instruction can be executed by the processor 12 to implement the following steps:
crawling web articles and obtaining keywords corresponding to each article;
obtaining common keywords between pairs of articles, and constructing a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords;
calculating the closeness between each node and the other connected nodes based on the common keywords, and obtaining the node vector of each node based on the closeness;
inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model; and
selecting a preset number of nodes from the set of each category, extracting common information of the corresponding articles based on the selected nodes, and generating a topic based on the common information.
In one embodiment, the present application provides a computer-readable storage medium, which may be a non-volatile and/or volatile memory, on which a computer program is stored, the computer program, when executed by a processor, implementing the following steps:
crawling web articles and obtaining keywords corresponding to each article;
obtaining common keywords between pairs of articles, and constructing a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords;
calculating the closeness between each node and the other connected nodes based on the common keywords, and obtaining the node vector of each node based on the closeness; and
inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model;
or implementing the following steps:
crawling web articles and obtaining keywords corresponding to each article;
obtaining common keywords between pairs of articles, and constructing a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords;
calculating the closeness between each node and the other connected nodes based on the common keywords, and obtaining the node vector of each node based on the closeness;
inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model; and
selecting a preset number of nodes from the set of each category, extracting common information of the corresponding articles based on the selected nodes, and generating a topic based on the common information.
For example, steps S1 to S4 shown in FIG. 1, or steps S101 to S105 shown in FIG. 5, are implemented. Alternatively, when executed by the processor, the computer program implements the functions of the modules/units of the text classification apparatus of the above embodiment, for example the functions of modules 101 to 104 shown in FIG. 6. To avoid repetition, they are not described again here.
It should be emphasized that, to further ensure the privacy and security of data such as the above common information, topics, and representation graph, such data may also be stored in a node of a blockchain.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks generated in association using cryptographic methods; each data block contains a batch of network transaction information used to verify the validity (anti-counterfeiting) of its information and to generate the next block. A blockchain may comprise an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be accomplished by instructing the relevant hardware through a computer program, and the computer program, when executed, may include the processes of the embodiments of the above methods.
The above serial numbers of the embodiments of the present application are for description only and do not represent the superiority or inferiority of the embodiments.
It should be noted that, herein, the terms "comprising", "including", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, apparatus, article, or method that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, apparatus, article, or method.
The above are only preferred embodiments of the present application and do not thereby limit the patent scope of the present application. Any equivalent structural or process transformation made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present application.

Claims (20)

  1. A text classification method, comprising:
    crawling web articles and obtaining keywords corresponding to each article;
    obtaining common keywords between pairs of articles, and constructing a representation graph based on the common keywords, where each node in the representation graph represents one article and lines are drawn between pairs of nodes that have common keywords;
    calculating the closeness between each node and the other nodes connected to it based on the common keywords, and obtaining a node vector of each node based on the closeness; and
    inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model.
  2. The text classification method according to claim 1, wherein the step of calculating the closeness between each node and the other connected nodes based on the common keywords and obtaining the node vector of each node based on the closeness specifically comprises:
    counting the number of common keywords in the two articles corresponding to two connected nodes;
    counting the number of times each common keyword appears in each of the two articles corresponding to the two connected nodes;
    calculating the closeness S between each node and the other connected nodes based on the number of common keywords and the respective occurrence counts:
    S = μ·∑_{i=1}^{n} (A_i / B_i)
    where A and B denote connected nodes in the representation graph, n is the number of common keywords in the two articles corresponding to nodes A and B, i is the index of a common keyword, A_i is the number of times the i-th common keyword appears in the article corresponding to node A, B_i is the number of times the i-th common keyword appears in the article corresponding to node B, and μ is the reciprocal of the number of common keywords; and
    vectorizing the closeness between each node and the other connected nodes to obtain the node vector corresponding to each node.
  3. The method of text classification according to claim 1 or 2, wherein the step of inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model, specifically comprises:
    inputting the node vector of each node into a graph attention network, taking each node whose node vector is input into the graph attention network as a node to be classified, and calculating a loss function for each node to be classified;
    for each node to be classified, when the loss function is minimized, calculating the contribution of its neighbor nodes to the node to be classified based on the node vector of the node to be classified, the neighbor nodes being the nodes connected to the node to be classified in the representation graph;
    aggregating the neighbor nodes based on the contribution.
  4. The method of text classification according to claim 3, wherein calculating the contribution of the neighbor nodes to the node to be classified based on the node vector of the node to be classified comprises:
    e_AB = LeakyReLU(α^T [W_A || W_B]), wherein LeakyReLU is the leaky rectified linear unit function, A and B are connected nodes in the representation graph, W_A is the node vector of node A, W_B is the node vector of node B, || denotes the concatenation of the node vectors W_A and W_B, α is the shared attention computation function, and α^T is the transpose of α.
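Claims 3 and 4 together describe one graph-attention step. A NumPy sketch follows, using an illustrative vector dimension; the softmax normalization of the coefficients is an assumption borrowed from standard graph attention networks, since the claims only say the neighbors are aggregated by contribution:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8  # illustrative node-vector dimension

def leaky_relu(x: np.ndarray, slope: float = 0.2) -> np.ndarray:
    return np.where(x > 0, x, slope * x)

# Node vector of the node to be classified (A) and of its neighbors.
w_a = rng.normal(size=dim)
neighbors = [rng.normal(size=dim) for _ in range(3)]

# Shared attention vector alpha applied to the concatenation
# [W_A || W_B]: e_AB = LeakyReLU(alpha^T [W_A || W_B]) as in claim 4.
alpha = rng.normal(size=2 * dim)
e = np.array([leaky_relu(alpha @ np.concatenate([w_a, w_b]))
              for w_b in neighbors])

# Contribution-based aggregation (claim 3): normalize the raw
# coefficients, then take the contribution-weighted sum of the
# neighbor vectors as the updated representation of node A.
contrib = np.exp(e) / np.exp(e).sum()
h_a = sum(c * w_b for c, w_b in zip(contrib, neighbors))
```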
  5. The method of text classification according to claim 3, wherein the step of inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model, further comprises:
    calculating, using a normalized exponential function, the score corresponding to each aggregated node when it is aggregated under the current category;
    determining the category corresponding to the node based on the score.
  6. The method of text classification according to claim 5, wherein calculating, using the normalized exponential function, the score corresponding to each aggregated node when it is aggregated under the current category comprises:
    p(y|x) = exp(W_y · x) / Σ_{c∈C} exp(W_c · x)   [formula image PCTCN2021090711-appb-100002]
    wherein p(y|x) is the probability that node x belongs to category y, C is the set of categories, c is the index of a category, and W is the vector mapping matrix.
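A sketch of the normalized exponential scoring follows; treating W as holding one row per category and forming the logits by a dot product are assumptions consistent with the variable definitions in the claim:

```python
import numpy as np

def category_scores(x: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Normalized exponential (softmax) over the category set C.

    x is the aggregated node vector and W the vector mapping matrix,
    assumed to hold one row W_c per category; returns p(y|x) for
    every category y in C.
    """
    logits = W @ x              # one raw score per category c in C
    logits -= logits.max()      # shift for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

# Example: 4 categories, 8-dimensional aggregated node vector.
rng = np.random.default_rng(1)
p = category_scores(rng.normal(size=8), rng.normal(size=(4, 8)))
print(int(p.argmax()), p.round(3))  # chosen category and its scores
```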
  7. A method of topic generation based on the method of text classification according to any one of claims 1 to 6, wherein the method of topic generation comprises:
    crawling web articles and obtaining the keywords corresponding to each article;
    obtaining the common keywords between each pair of articles and constructing a representation graph based on the common keywords, wherein each node in the representation graph represents an article, and an edge connects any two nodes having common keywords;
    calculating the closeness between each node and the other nodes connected to it based on the common keywords, and obtaining a node vector for each node based on the closeness;
    inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model;
    selecting a preset number of nodes from the set of each category, extracting the common information of the articles corresponding to the selected nodes, and generating a topic based on the common information.
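The final step of claim 7, sketched with keyword sets standing in for the "common information" of the selected articles; both the node selection and the reduction of common information to frequent shared keywords are assumed simplifications, as the claim leaves them open:

```python
from collections import Counter

def generate_topic(category_nodes: dict[str, set[str]], k: int = 3) -> str:
    """Generate a topic label for one classified set of nodes.

    Selecting the first k nodes and using the most frequent shared
    keywords as the "common information" are assumed simplifications.
    """
    selected = list(category_nodes.items())[:k]  # preset number of nodes
    counts = Counter(kw for _, kws in selected for kw in kws)
    shared = [kw for kw, c in counts.most_common() if c > 1]
    return " / ".join(shared[:3]) if shared else "miscellaneous"

# Example: one category set produced by the classification model.
category = {
    "a1": {"graph", "attention", "classification"},
    "a2": {"graph", "classification", "benchmark"},
    "a3": {"graph", "topic"},
}
print(generate_topic(category))  # graph / classification
```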
  8. An apparatus for text classification, comprising:
    a crawling module, configured to crawl web articles and obtain the keywords corresponding to each article;
    a construction module, configured to obtain the common keywords between each pair of articles and construct a representation graph based on the common keywords, wherein each node in the representation graph represents an article, and an edge connects any two nodes having common keywords;
    a processing module, configured to calculate the closeness between each node and the other connected nodes based on the common keywords, and obtain the node vector of each node based on the closeness;
    a classification module, configured to input the node vector of each node into a predetermined classification model for training, and obtain the set of classified nodes output by the classification model.
  9. A computer device, comprising a memory and a processor connected to the memory, the memory storing a computer program executable on the processor, wherein the processor, when executing the computer program, implements the following steps:
    crawling web articles and obtaining the keywords corresponding to each article;
    obtaining the common keywords between each pair of articles and constructing a representation graph based on the common keywords, wherein each node in the representation graph represents an article, and an edge connects any two nodes having common keywords;
    calculating the closeness between each node and the other nodes connected to it based on the common keywords, and obtaining a node vector for each node based on the closeness;
    inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model;
    or implements the following steps:
    crawling web articles and obtaining the keywords corresponding to each article;
    obtaining the common keywords between each pair of articles and constructing a representation graph based on the common keywords, wherein each node in the representation graph represents an article, and an edge connects any two nodes having common keywords;
    calculating the closeness between each node and the other nodes connected to it based on the common keywords, and obtaining a node vector for each node based on the closeness;
    inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model;
    selecting a preset number of nodes from the set of each category, extracting the common information of the articles corresponding to the selected nodes, and generating a topic based on the common information.
  10. The computer device according to claim 9, wherein the step of calculating the closeness between each node and the other connected nodes based on the common keywords, and obtaining the node vector of each node based on the closeness, specifically comprises:
    counting the number of common keywords in the two articles corresponding to two connected nodes;
    counting the number of times each common keyword appears in each of the two articles corresponding to the two connected nodes;
    calculating the closeness S between each node and the other connected nodes based on the number of common keywords and the respective numbers of occurrences:
    [formula image PCTCN2021090711-appb-100003: the closeness S]
    wherein A and B represent connected nodes in the representation graph, n is the number of common keywords in the two articles corresponding to nodes A and B, i is the index of a common keyword, A_i is the number of times the i-th common keyword appears in the article corresponding to node A, B_i is the number of times the i-th common keyword appears in the article corresponding to node B, and μ is the reciprocal of the number of common keywords;
    vectorizing the closeness between each node and the other connected nodes to obtain the node vector corresponding to each node.
  11. The computer device according to claim 9 or 10, wherein the step of inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model, specifically comprises:
    inputting the node vector of each node into a graph attention network, taking each node whose node vector is input into the graph attention network as a node to be classified, and calculating a loss function for each node to be classified;
    for each node to be classified, when the loss function is minimized, calculating the contribution of its neighbor nodes to the node to be classified based on the node vector of the node to be classified, the neighbor nodes being the nodes connected to the node to be classified in the representation graph;
    aggregating the neighbor nodes based on the contribution.
  12. The computer device according to claim 11, wherein calculating the contribution of the neighbor nodes to the node to be classified based on the node vector of the node to be classified comprises:
    e_AB = LeakyReLU(α^T [W_A || W_B]), wherein LeakyReLU is the leaky rectified linear unit function, A and B are connected nodes in the representation graph, W_A is the node vector of node A, W_B is the node vector of node B, || denotes the concatenation of the node vectors W_A and W_B, α is the shared attention computation function, and α^T is the transpose of α.
  13. The computer device according to claim 11, wherein the step of inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model, further comprises:
    calculating, using a normalized exponential function, the score corresponding to each aggregated node when it is aggregated under the current category;
    determining the category corresponding to the node based on the score.
  14. The computer device according to claim 13, wherein calculating, using the normalized exponential function, the score corresponding to each aggregated node when it is aggregated under the current category comprises:
    p(y|x) = exp(W_y · x) / Σ_{c∈C} exp(W_c · x)   [formula image PCTCN2021090711-appb-100004]
    wherein p(y|x) is the probability that node x belongs to category y, C is the set of categories, c is the index of a category, and W is the vector mapping matrix.
  15. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the following steps:
    crawling web articles and obtaining the keywords corresponding to each article;
    obtaining the common keywords between each pair of articles and constructing a representation graph based on the common keywords, wherein each node in the representation graph represents an article, and an edge connects any two nodes having common keywords;
    calculating the closeness between each node and the other nodes connected to it based on the common keywords, and obtaining a node vector for each node based on the closeness;
    inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model;
    or implements the following steps:
    crawling web articles and obtaining the keywords corresponding to each article;
    obtaining the common keywords between each pair of articles and constructing a representation graph based on the common keywords, wherein each node in the representation graph represents an article, and an edge connects any two nodes having common keywords;
    calculating the closeness between each node and the other nodes connected to it based on the common keywords, and obtaining a node vector for each node based on the closeness;
    inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model;
    selecting a preset number of nodes from the set of each category, extracting the common information of the articles corresponding to the selected nodes, and generating a topic based on the common information.
  16. The computer-readable storage medium according to claim 15, wherein the step of calculating the closeness between each node and the other connected nodes based on the common keywords, and obtaining the node vector of each node based on the closeness, specifically comprises:
    counting the number of common keywords in the two articles corresponding to two connected nodes;
    counting the number of times each common keyword appears in each of the two articles corresponding to the two connected nodes;
    calculating the closeness S between each node and the other connected nodes based on the number of common keywords and the respective numbers of occurrences:
    [formula image PCTCN2021090711-appb-100005: the closeness S]
    wherein A and B represent connected nodes in the representation graph, n is the number of common keywords in the two articles corresponding to nodes A and B, i is the index of a common keyword, A_i is the number of times the i-th common keyword appears in the article corresponding to node A, B_i is the number of times the i-th common keyword appears in the article corresponding to node B, and μ is the reciprocal of the number of common keywords;
    vectorizing the closeness between each node and the other connected nodes to obtain the node vector corresponding to each node.
  17. The computer-readable storage medium according to claim 15 or 16, wherein the step of inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model, specifically comprises:
    inputting the node vector of each node into a graph attention network, taking each node whose node vector is input into the graph attention network as a node to be classified, and calculating a loss function for each node to be classified;
    for each node to be classified, when the loss function is minimized, calculating the contribution of its neighbor nodes to the node to be classified based on the node vector of the node to be classified, the neighbor nodes being the nodes connected to the node to be classified in the representation graph;
    aggregating the neighbor nodes based on the contribution.
  18. The computer-readable storage medium according to claim 17, wherein calculating the contribution of the neighbor nodes to the node to be classified based on the node vector of the node to be classified comprises:
    e_AB = LeakyReLU(α^T [W_A || W_B]), wherein LeakyReLU is the leaky rectified linear unit function, A and B are connected nodes in the representation graph, W_A is the node vector of node A, W_B is the node vector of node B, || denotes the concatenation of the node vectors W_A and W_B, α is the shared attention computation function, and α^T is the transpose of α.
  19. The computer-readable storage medium according to claim 17, wherein the step of inputting the node vector of each node into a predetermined classification model for training, and obtaining the set of classified nodes output by the classification model, further comprises:
    calculating, using a normalized exponential function, the score corresponding to each aggregated node when it is aggregated under the current category;
    determining the category corresponding to the node based on the score.
  20. The computer-readable storage medium according to claim 19, wherein calculating, using the normalized exponential function, the score corresponding to each aggregated node when it is aggregated under the current category comprises:
    p(y|x) = exp(W_y · x) / Σ_{c∈C} exp(W_c · x)   [formula image PCTCN2021090711-appb-100006]
    wherein p(y|x) is the probability that node x belongs to category y, C is the set of categories, c is the index of a category, and W is the vector mapping matrix.
PCT/CN2021/090711 2020-11-19 2021-04-28 Text classification method, topic generation method, apparatus, device, and medium WO2022105123A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011305385.4 2020-11-19
CN202011305385.4A CN112380344B (en) 2020-11-19 2020-11-19 Text classification method, topic generation method, device, equipment and medium

Publications (1)

Publication Number Publication Date
WO2022105123A1

Family

ID=74584415

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/090711 WO2022105123A1 (en) 2020-11-19 2021-04-28 Text classification method, topic generation method, apparatus, device, and medium

Country Status (2)

Country Link
CN (1) CN112380344B (en)
WO (1) WO2022105123A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117493490A (en) * 2023-11-17 2024-02-02 南京信息工程大学 Topic detection method, device, equipment and medium based on heterogeneous multi-relation graph
CN117493490B (en) * 2023-11-17 2024-05-14 南京信息工程大学 Topic detection method, device, equipment and medium based on heterogeneous multi-relation graph

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112380344B (en) * 2020-11-19 2023-08-22 平安科技(深圳)有限公司 Text classification method, topic generation method, device, equipment and medium
CN113254603B (en) * 2021-07-08 2021-10-01 北京语言大学 Method and device for automatically constructing field vocabulary based on classification system
CN113722483B (en) * 2021-08-31 2023-08-22 平安银行股份有限公司 Topic classification method, device, equipment and storage medium

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107526785A (en) * 2017-07-31 2017-12-29 广州市香港科大霍英东研究院 Text classification method and device
CN108804432A (en) * 2017-04-26 2018-11-13 慧科讯业有限公司 Method, system and device for discovering and tracking hot topics based on network media data streams
CN109471937A (en) * 2018-10-11 2019-03-15 平安科技(深圳)有限公司 Text classification method and terminal device based on machine learning
US20190095425A1 (en) * 2017-09-28 2019-03-28 Oracle International Corporation Enabling autonomous agents to discriminate between questions and requests
CN109977223A (en) * 2019-03-06 2019-07-05 中南大学 Method for classifying papers using a graph convolutional network fused with a capsule mechanism
CN110032606A (en) * 2019-03-29 2019-07-19 阿里巴巴集团控股有限公司 Sample clustering method and device
CN110134764A (en) * 2019-04-26 2019-08-16 中国地质大学(武汉) Automatic classification method and system for text data
CN110175224A (en) * 2019-06-03 2019-08-27 安徽大学 Patent recommendation method and device based on semantic-link heterogeneous information network embedding
CN110543563A (en) * 2019-08-20 2019-12-06 暨南大学 Hierarchical text classification method and system
CN110704626A (en) * 2019-09-30 2020-01-17 北京邮电大学 Short text classification method and device
CN111125358A (en) * 2019-12-17 2020-05-08 北京工商大学 Text classification method based on hypergraph
CN112380344A (en) * 2020-11-19 2021-02-19 平安科技(深圳)有限公司 Text classification method, topic generation method, device, equipment and medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102591988B (en) * 2012-01-16 2014-10-15 西安电子科技大学 Short text classification method based on semantic graphs
CN108228587A (en) * 2016-12-13 2018-06-29 北大方正集团有限公司 Stock discrimination method and Stock discrimination device
CN110019659B (en) * 2017-07-31 2021-07-30 北京国双科技有限公司 Method and device for searching judgment documents
CN110196920B (en) * 2018-05-10 2024-02-09 腾讯科技(北京)有限公司 Text data processing method and device, storage medium and electronic device
CN109299379B (en) * 2018-10-30 2021-02-05 东软集团股份有限公司 Article recommendation method and device, storage medium and electronic equipment
CN109522410B (en) * 2018-11-09 2021-02-09 北京百度网讯科技有限公司 Document clustering method and platform, server and computer readable medium
CN110489558B (en) * 2019-08-23 2022-03-18 网易传媒科技(北京)有限公司 Article aggregation method and device, medium and computing equipment
CN110781275B (en) * 2019-09-18 2022-05-10 中国电子科技集团公司第二十八研究所 Question answering distinguishing method based on multiple characteristics and computer storage medium
CN111428488A (en) * 2020-03-06 2020-07-17 平安科技(深圳)有限公司 Resume data information analyzing and matching method and device, electronic equipment and medium


Also Published As

Publication number Publication date
CN112380344A (en) 2021-02-19
CN112380344B (en) 2023-08-22

Similar Documents

Publication Publication Date Title
WO2022105123A1 (en) Text classification method, topic generation method, apparatus, device, and medium
Gui et al. Large-scale embedding learning in heterogeneous event data
US11341417B2 (en) Method and apparatus for completing a knowledge graph
Gui et al. Embedding learning with events in heterogeneous information networks
WO2021120677A1 (en) Warehousing model training method and device, computer device and storage medium
Ma et al. Learn to forget: Machine unlearning via neuron masking
Cheng et al. Fblg: A simple and effective approach for temporal dependence discovery from time series data
Song et al. eXtreme gradient boosting for identifying individual users across different digital devices
US20210117454A1 (en) Decentralized Latent Semantic Index Using Distributed Average Consensus
US11449788B2 (en) Systems and methods for online annotation of source data using skill estimation
Sun et al. Extreme learning machine for classification over uncertain data
EP3839764A1 (en) Method and system for detecting duplicate document using vector quantization
Cai et al. Network linear discriminant analysis
Wang et al. Semi-supervised node classification on graphs: Markov random fields vs. graph neural networks
US11941867B2 (en) Neural network training using the soft nearest neighbor loss
Song et al. Top-k link recommendation in social networks
Tu et al. Crowdwt: Crowdsourcing via joint modeling of workers and tasks
Xu et al. Latent interest and topic mining on user-item bipartite networks
Ding et al. User identification across multiple social networks based on naive Bayes model
Meeus et al. Achilles’ heels: vulnerable record identification in synthetic data publishing
CN114386604A (en) Model distillation method, device, equipment and storage medium based on multi-teacher model
Wu et al. Collaborative filtering recommendation based on conditional probability and weight adjusting
Hajdu et al. Use of artificial neural networks to identify fake profiles
Xu et al. Cluster-aware multiplex InfoMax for unsupervised graph representation learning
CN115099875A (en) Data classification method based on decision tree model and related equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21893279

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21893279

Country of ref document: EP

Kind code of ref document: A1