CN112231482A - Long and short text classification method based on scalable representation learning - Google Patents
Long and short text classification method based on scalable representation learning
- Publication number
- CN112231482A CN112231482A CN202011230707.3A CN202011230707A CN112231482A CN 112231482 A CN112231482 A CN 112231482A CN 202011230707 A CN202011230707 A CN 202011230707A CN 112231482 A CN112231482 A CN 112231482A
- Authority
- CN
- China
- Prior art keywords
- similarity
- node
- long
- nodes
- learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2135—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computational Linguistics (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a long and short text classification method based on scalable representation learning, which comprises the following steps: preprocessing the texts in a long and short text set and representing the text set as a feature matrix M, where the elements of M are the weights of the corresponding words calculated with the TF-IDF method; inputting the feature matrix M into a scalable representation learning process to obtain a low-dimensional target matrix; training a KNN classifier with the training set represented by the low-dimensional target matrix; and classifying the documents to be classified with the trained KNN classifier. The scalable representation learning method designed by the invention retains the similarity relationships of the data, is extensible and easy to parallelize, and is suitable for general long and short text classification applications; experiments show that the method has better classification performance on large-scale long and short text classification problems.
Description
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a long and short text classification method based on scalable expression learning.
Background
In machine learning and data mining applications for text classification, the input data often contains redundant features or noise, which can negatively impact the generalization ability of machine learning and data mining methods. As a result, much of the practical effort has to be spent on data preprocessing and data transformation in order to apply machine learning and data mining effectively to real problems. Feature engineering is well known to be important, but it requires a great deal of manual labor, which highlights the weakness of many machine learning algorithms in extracting and organizing discriminative information from data. To solve this problem, improve data quality, alleviate the curse of dimensionality and further reduce memory consumption, it is important to develop data representation learning methods.
Generally, there are three families of representation learning methods in machine learning tasks: feature selection, dimensionality reduction and embedded representation learning. (1) Feature selection involves selecting the best subset of variables from all available variables. Typical feature selection methods include Relief, Las Vegas Wrapper, etc., which are severely limited in the kinds of feature representations they can produce. (2) Dimensionality reduction studies how to reduce data size while retaining the most important information. There are many exemplary linear and nonlinear dimensionality reduction algorithms, such as PCA (principal component analysis), LLE (locally linear embedding), SVD (singular value decomposition) and LE (Laplacian eigenmap). Although these methods are suitable for low-dimensional representation of matrices without requiring a specified data domain, they have high computational and storage complexity and are difficult to apply to representation learning of large-scale data. For example, the computational complexity of both LLE and LE is O(n²), where n is the number of data points. (3) Embedded representation learning aims to learn an informative representation of data using a neural network. However, embedded representation learning methods are generally tied to a specific domain, and it is difficult for them to adapt universally to low-dimensional matrix representations in general long and short text classification applications.
Disclosure of Invention
In view of the above, the present invention aims to provide a long and short text classification method based on scalable representation learning, which overcomes the disadvantages of the prior art and rapidly and efficiently classifies the texts in a long and short text set. The method adapts well and universally to low-dimensional matrix data representation in general long and short text classification applications.
Based on the above purpose, the method for classifying the long and short texts based on the scalable representation learning comprises the following steps:
step 1, preprocessing the texts in a long and short text set and representing the text set as a feature matrix M ∈ R^{n×D}, where n is the number of documents in the text set, D is the number of words in the data set, and the elements of M are the weights of the corresponding words calculated with the TF-IDF (term frequency-inverse document frequency) method;
step 2, inputting the feature matrix M into a scalable representation learning process to obtain a low-dimensional target matrix;
step 3, training a KNN (k-Nearest Neighbor) classifier by adopting the training set represented by the low-dimensional target matrix;
and 4, classifying the documents to be classified by using the trained KNN classifier.
Specifically, the scalable representation learning process in step 2 includes the following steps:
step 201, constructing an adjacency graph G according to the pairwise similarity of the vectors in the feature matrix M, wherein the vectors in the feature matrix form nodes of the adjacency graph;
step 202, generating a context of a node in an adjacency graph by using a weighted random walk model in the adjacency graph G;
and step 203, learning the embedded representation by expanding the skip-gram model to obtain a low-dimensional target matrix of the embedded representation.
Specifically, in step 201, each node in the adjacency graph G represents a vector in the feature matrix, the similarity between nodes is calculated, and if one of two nodes is among the top-k most similar nodes of the other, the two nodes are directly connected by an edge;
in step 202, the weighted random walk model is a method for generating random sequences on the adjacency graph. If (x_{w1}, x_{w2}, …, x_{wl}) is a random sequence of length l and a sliding window of size c is adopted to represent the context of a node, the context NC(x_{wj}) of a node x_{wj} in the random sequence can be expressed as NC(x_{wj}) = {x_{wm} | −c ≤ m − j ≤ c, m ∈ (1, 2, …, l)}. Given the previous node x_{w(t−1)} = υ_b in the adjacency graph, the probability that the current node is υ_a is calculated with the following formula:

P(x_{wt} = υ_a | x_{w(t−1)} = υ_b) = sim(υ_a, υ_b) / Z, if (υ_a, υ_b) ∈ E, and 0 otherwise,

where E is the set of edges of the adjacency graph, P denotes the conditional probability, sim() denotes the similarity between two nodes, and Z = Σ_{(υ_b, υ_x) ∈ E} sim(υ_b, υ_x) is a normalization constant.
In step 203, the objective function for learning the embedded representation with the extended skip-gram model is:

max_f Σ_{x_i ∈ M} log P(NC(x_i) | f(x_i)), with P(x_j | f(x_i)) = exp(f(x_j) · f(x_i)) / Σ_{x_v ∈ M} exp(f(x_v) · f(x_i)),

where f denotes the function to be learned, which represents the current high-dimensional data as low-dimensional data while still preserving the similarity information between the data after the dimension-reduced representation, NC(x_i) denotes the context of x_i, and exp() denotes the exponential function with base e.
Further, in step 203, a negative sampling method is used to approximate the denominator Σ_{x_v ∈ M} exp(f(x_v) · f(x_i)) for fast calculation; negative sampling is performed with the gensim toolkit, the sampling threshold is set to 0.001, and the objective function is optimized with stochastic gradient descent to learn the function f.
Further, the similarity between the nodes is measured using cosine similarity in step 201.
Drawings
FIG. 1 is a schematic overall flow chart of an embodiment of the present invention;
fig. 2 is a schematic diagram of a scalable representation learning process according to an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the accompanying drawings, but the invention is not limited in any way, and any alterations or substitutions based on the teaching of the invention are within the scope of the invention.
As shown in fig. 1, the text classification method based on scalable representation learning includes the following steps:
step 1, preprocessing the texts in the long and short text set and representing the text set as a feature matrix M ∈ R^{n×D}, where n is the number of documents in the text set, D is the number of words in the data set, and the elements of M are the weights of the corresponding words calculated with the TF-IDF (term frequency-inverse document frequency) method;
step 2, inputting the feature matrix M into a scalable representation learning process to obtain a low-dimensional target matrix;
step 3, training a KNN (k-Nearest Neighbor) classifier by adopting the training set represented by the low-dimensional target matrix;
and 4, classifying the documents to be classified by using the trained KNN classifier.
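As a non-authoritative illustration of how steps 1–4 could fit together, the following Python sketch uses scikit-learn for the TF-IDF matrix and the KNN classifier; the function name classify_long_short_texts and the embed_fn argument (a stand-in for the scalable representation learning process of step 2, sketched further below) are assumptions introduced for this sketch only.

```python
# Minimal sketch of steps 1-4, assuming a scikit-learn environment.
# `embed_fn` is a hypothetical stand-in for the scalable representation
# learning process of step 2.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def classify_long_short_texts(texts, labels, embed_fn, k=5):
    # Step 1: represent the preprocessed text set as an n x D TF-IDF matrix M
    vectorizer = TfidfVectorizer(stop_words="english", min_df=5)
    M = vectorizer.fit_transform(texts).toarray()

    # Step 2: scalable representation learning -> low-dimensional matrix Z (n x d)
    Z = embed_fn(M)

    # Step 3: train a KNN classifier on the training portion of Z
    Z_train, Z_test, y_train, y_test = train_test_split(Z, labels, test_size=0.25)
    knn = KNeighborsClassifier(n_neighbors=k).fit(Z_train, y_train)

    # Step 4: classify the held-out documents and report accuracy
    return knn.score(Z_test, y_test)
```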
As shown in fig. 2, the scalable representation learning process in step 2 includes the following steps:
step 201, constructing an adjacency graph G according to the pairwise similarity of the vectors in the feature matrix M, wherein the vectors in the feature matrix form nodes of the adjacency graph;
step 202, generating a context of a node in an adjacency graph by using a weighted random walk model in the adjacency graph G;
and step 203, learning the embedded representation by expanding the skip-gram model to obtain a low-dimensional target matrix of the embedded representation.
The purpose of step 2 is to learn, by means of a function f, a mapping from the matrix M_{n×D} to a target matrix Z_{n×d} ∈ R^{n×d} that retains the most important information in the original data set. Formally, the problem can be represented by the following equation.

Z_{n×d} = f(M_{n×D}, sim())

In the equation, each vector x_i ∈ M (x_i ∈ R^D) is expressed as a d-dimensional vector (d ≪ D), and the number of vectors in M is unchanged. In order to define the learning function f, it is necessary to consider which attributes of the data the invention should capture or retain. In the above equation, sim() is a similarity function defined for capturing and retaining those data attributes.
In the process of constructing the adjacency graph, since the vectors in the matrix M should keep their pairwise similarity relationships, a weighted adjacency graph G is used to represent the matrix M, with each vector forming a node of the graph. The problem of learning a matrix representation that preserves vector similarity is thereby transferred to learning a representation of the weighted graph that preserves the neighborhood of each node in the adjacency graph. For the top-k similar vectors of each vector in matrix M, there are edges connecting the corresponding nodes in graph G: if v_i is one of the top-k nearest neighbors of v_j, or v_j is one of the top-k nearest neighbors of v_i, nodes v_i and v_j are connected by an edge. To calculate the pairwise similarity between two nodes, many common global and local similarity/distance functions may be used, such as the Jaccard, cosine, Dice and overlap similarities of sets and vectors; in the experiments of this embodiment, the local similarity function "cosine similarity" is chosen to preserve the pairwise similarity relationships of the vectors in M.
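A minimal sketch of this graph construction step is given below, assuming dense TF-IDF vectors and cosine similarity; the helper name build_adjacency_graph and the default k = 10 are illustrative assumptions rather than values fixed by the embodiment.

```python
# Sketch of step 201: connect nodes that appear in each other's top-k
# cosine-similarity lists; `build_adjacency_graph` and k=10 are assumed.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def build_adjacency_graph(M, k=10):
    sim = cosine_similarity(M)             # pairwise similarity of all vectors
    np.fill_diagonal(sim, -np.inf)         # exclude self-similarity
    edges = {i: {} for i in range(sim.shape[0])}
    for i in range(sim.shape[0]):
        for j in np.argsort(sim[i])[-k:]:  # top-k most similar nodes of node i
            j = int(j)
            # an edge exists if either node is in the other's top-k list
            edges[i][j] = sim[i, j]
            edges[j][i] = sim[i, j]
    return edges                           # weighted adjacency lists
```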
Skip-gram is a language model that maximizes the probability of co-occurrence between words appearing within a sentence window. In this embodiment, an extended skip-gram model is used to learn the embedded representation of the adjacency graph, but the graph has no natural notion of node context, so the context of each node must be generated; we generate random walk paths over the graph as node contexts, simulating the process by which sentences are generated from words. The weighted random walk is chosen over the traditional search methods of breadth-first sampling (BFS) and depth-first sampling (DFS) to establish node contexts in the adjacency graph because it is computationally efficient in terms of both memory and time. The method can sample nodes multiple times and capture both neighborhood similarity and graph structure. The memory complexity of storing the direct neighbors of each node in the graph is O(|E|). Random walks have been used as a similarity measure for various problems in content recommendation and community detection. Random walks are also very easy to parallelize: several random walkers can simultaneously explore different parts of the graph, and they can accommodate subtle changes in the graph structure without requiring global recomputation.
The weighted random walk model is a method of generating random sequences on the adjacency graph. Suppose (x_{w1}, x_{w2}, …, x_{wl}) is a random sequence of length l, and a sliding window of size c is used to represent the context of a node. The context NC(x_{wj}) of a node x_{wj} in the random sequence can be expressed as NC(x_{wj}) = {x_{wm} | −c ≤ m − j ≤ c, m ∈ (1, 2, …, l)}. A random walk is a Markov chain: the generation of the current node depends only on the previous node. In the adjacency graph, given the previous node x_{w(t−1)} = υ_b, we use the following formula to calculate the probability that the current node is υ_a. For any node, the length of the random walk is fixed to a small length l, and w random walks are started from each node.

P(x_{wt} = υ_a | x_{w(t−1)} = υ_b) = sim(υ_a, υ_b) / Z, if (υ_a, υ_b) ∈ E, and 0 otherwise,

where E is the set of edges of the adjacency graph, P denotes the conditional probability, sim() denotes the similarity between two nodes, and Z = Σ_{(υ_b, υ_x) ∈ E} sim(υ_b, υ_x) is a normalization constant.
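The walk generation described above could be sketched as follows; the walk length, walks-per-node count and the helper name weighted_random_walks are assumptions for illustration, and the next node is drawn with probability proportional to sim(current, neighbour), matching the sim/Z formula above.

```python
# Sketch of step 202: fixed-length weighted random walks from every node;
# `weighted_random_walks`, walk_length and walks_per_node are assumed names/values.
import random


def weighted_random_walks(edges, walk_length=10, walks_per_node=10):
    walks = []
    for start in edges:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbours = edges[walk[-1]]
                nodes = list(neighbours)
                weights = [neighbours[n] for n in nodes]
                # transition probability proportional to similarity (sim / Z)
                walk.append(random.choices(nodes, weights=weights, k=1)[0])
            walks.append([str(node) for node in walk])  # string ids for gensim
    return walks
```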
Inspired by the recently developed distributed representation learning methods Word2Vec, Node2Vec and DeepWalk, this embodiment extends the skip-gram model to learn the embedded representation function f. The skip-gram model was originally developed to learn word similarity from training text. The algorithm works as follows: it scans the words in the documents of the dataset and learns an embedding for each word, so that the features of a word can predict nearby words. Its output is an embedding vector for each word that preserves the similarity between words, learned by optimizing the likelihood with SGD and negative sampling. The algorithm is based on the distributional hypothesis that words occurring in similar contexts tend to have similar meanings. By extending the skip-gram model, the objective for learning the embedded representation in this embodiment can be defined as:

max_f Σ_{x_i ∈ M} log P(NC(x_i) | f(x_i)), with P(x_j | f(x_i)) = exp(f(x_j) · f(x_i)) / Σ_{x_v ∈ M} exp(f(x_v) · f(x_i)),

where f denotes the function to be learned, which represents the current high-dimensional data as low-dimensional data while still preserving the similarity information between the data after the dimension-reduced representation, NC(x_i) denotes the context of x_i, and exp() denotes the exponential function with base e.
For a large data set, the data set is,is very costly and is therefore approximated using a negative sampling method for fast calculation. In the experiment, negative sampling was performed using the gensim toolkit and the sampling threshold was set to 0.001. We optimize the objective function using random gradient descent (SGD) to learn the function f.
To test the performance of the method of the present invention, it was experimentally compared with all selected methods in long and short text classification applications. In the experiments, 4-fold cross-validation was performed on all data sets. For the documents in each dataset, rare words with a total frequency of less than 5 were discarded, and 201 common meaningless stop words, such as "a", "is", "of", etc., were also deleted. Some standard text preprocessing steps were performed as well, such as stemming and lower-casing documents. Then the weight of each remaining word was calculated with the "TF-IDF" method. Documents are represented by vectors, and a matrix M ∈ R^{n×D} represents a data set, where n is the number of documents and D is the number of words in the data set. The following four real data sets were used in the experiments to evaluate the performance of the methods on long and short text classification.
1. Movie reviews. This collection contains 2000 reviews of the movie, of which 1000 express positive reviews and 1000 express negative reviews about the movie.
2. 20 news groups. After removing duplicate documents and rare words, this collection contains 18,825 articles in 20 categories (approximately 1,000 documents per category). These articles were taken from the "Usenet" newsgroup collection, and we used only the subject and body of each message. For computational reasons, 2,000 documents were randomly selected for the experiments.
3. 20 news short groups. To evaluate the performance of the method in short text classification, only the article titles in 20 newsgroup datasets were used for short text classification in the experiment.
4. Google fragments. This tagged collection was retrieved from Google search using JWebPro; it consists of 12,000 fragments (10,000 for training and 2,000 for testing) labeled with 8 categories. The fragments in the dataset are on average about 17.99 words long. This set was used in the experiments to evaluate the performance of the methods on short text classification.
To demonstrate the effect of the method, five other unsupervised dimensionality reduction methods were compared in the experiments: (1) principal component analysis (PCA), (2) locally linear embedding (LLE), (3) Laplacian eigenmap (LE), (4) classical multidimensional scaling (CMDS), and (5) Isomap. PCA and CMDS are typical linear methods, while LLE, LE and Isomap are typical nonlinear methods. A brief introduction to these five comparison representation learning methods follows:
principal Component Analysis (PCA). PCA is a linear dimensionality reduction method that performs dimensionality reduction by embedding data into a lower-dimensional linear subspace. Although various techniques exist for linear and non-linear dimensionality reduction, PCA remains one of the most popular and most powerful unsupervised linear techniques. We performed experiments using the implementation of PCA in the "scimit-spare" kit.
Locally Linear Embedding (LLE). LLE is a manifold learning method based on concepts of manifold geometry. LLE constructs a graph representation of the data points that retains only local attributes of the data and treats each high-dimensional data point as a linear combination of its nearest neighbors. We performed experiments using the implementation of LLE in the "scikit-learn" toolkit.
Laplacian Eigenmap (LE). Similar to LLE, LE finds a low-dimensional data representation by preserving the local properties of the manifold, where the local properties are based on the pairwise distances between neighbors. We performed experiments using the implementation of LE in the "scikit-learn" toolkit.
Classical multidimensional scaling (CMDS). CMDS seeks a low-dimensional representation of the data in which the distances reflect well the distances in the original high-dimensional space. It attempts to model similarity or dissimilarity data as distances in a geometric space. Again, we performed experiments using the implementation of classical MDS in the "scikit-learn" toolkit.
ISOMAP. Isomap is one of several widely used nonlinear dimensionality reduction methods. It computes a quasi-isometric, low-dimensional embedding of a set of high-dimensional data points. The algorithm provides a simple method to estimate the intrinsic geometry of a data manifold based on a rough estimate of the neighborhood of each data point on the manifold. It is very efficient and is generally applicable to a wide range of data sources. We performed the experiments using the implementation in the "scikit-learn" toolkit.
During the experiments, KNN was implemented with "scikit-learn". To select the parameter k in KNN, the "GridSearch" method was used to find the optimal k for each dataset and method (k = 1, 3, 5, 7, 9, 11). In the experiments, the target dimension for each method was 128. "GridSearch" was also used to find the best number of neighbors for LLE, LE, Isomap and matrix2vec, since the number of neighbors is an important parameter of these methods.
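As an illustration of this parameter search, the sketch below uses scikit-learn's GridSearchCV with the candidate k values and 4-fold cross-validation described above; the helper name select_knn_k is an assumption introduced for the sketch.

```python
# Sketch of the KNN parameter search; `select_knn_k` is an assumed helper name.
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier


def select_knn_k(Z, y):
    grid = GridSearchCV(
        KNeighborsClassifier(),
        param_grid={"n_neighbors": [1, 3, 5, 7, 9, 11]},
        cv=4,                    # 4-fold cross-validation as in the experiments
        scoring="accuracy",
    )
    grid.fit(Z, y)               # Z: embedded n x d matrix, y: document labels
    return grid.best_params_["n_neighbors"], grid.best_score_
```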
Table 1 Experimental results of long and short text classification

| Method | Google fragments | 20 news short groups | 20 news groups | Movie reviews |
|---|---|---|---|---|
| LLE | 0.7970 (+/-0.1127) | 0.4258 (+/-0.0277) | 0.4980 (+/-0.0199) | 0.6650 (+/-0.0254) |
| LE | 0.7675 (+/-0.0402) | 0.4022 (+/-0.0287) | 0.4805 (+/-0.0308) | 0.6555 (+/-0.0325) |
| PCA | 0.9210 (+/-0.0681) | 0.4006 (+/-0.0248) | 0.4864 (+/-0.0600) | 0.6875 (+/-0.0380) |
| CMDS | 0.8565 (+/-0.1504) | 0.0862 (+/-0.0208) | 0.1575 (+/-0.0197) | 0.5345 (+/-0.0312) |
| ISOMAP | 0.8735 (+/-0.1552) | 0.4152 (+/-0.0367) | 0.3613 (+/-0.0418) | 0.6415 (+/-0.0266) |
| Method of the invention | 0.9135 (+/-0.0633) | 0.5452 (+/-0.0128) | 0.7710 (+/-0.0194) | 0.6875 (+/-0.0380) |
Table 1 shows the accuracy of the representation learning methods on long and short text classification. Among the matrix representation learning methods (LLE, LE, PCA, CMDS, Isomap and the method of the present invention), the proposed method exhibits the best performance on both long and short text classification. On the short text data sets "Google fragments" and "20 news short groups", the performance of the method is improved by 19.97% and 28.04% respectively over the LLE method. On the long text dataset "20 news groups", the method is much better than the comparison methods, with a 100.96% improvement over the LLE method. Furthermore, on the long text data set "Movie reviews", the method performs the same as the PCA method, both achieving an accuracy score of 0.6875. The matrix representation learning in the method is based on the weighted random walk model and the extended skip-gram model; its computational and storage complexities are O(|E|) and O(n) respectively, the smallest among existing matrix representation learning methods.
The above embodiment is one implementation of the text classification method, but the implementation of the invention is not limited by the above embodiment; any other change, modification, substitution, combination or simplification that does not depart from the spirit and principle of the invention shall be regarded as an equivalent replacement and is included in the protection scope of the invention.
Claims (4)
1. The long and short text classification method based on scalable representation learning is characterized by comprising the following steps of:
step 1, preprocessing the texts in a long and short text set and representing the text set as a feature matrix M ∈ R^{n×D}, where n is the number of documents in the text set, D is the number of words in the data set, and the elements of M are the weights of the corresponding words calculated with the TF-IDF method;
step 2, inputting the feature matrix M into a scalable representation learning process to obtain a low-dimensional target matrix;
step 3, training the KNN classifier by adopting the training set represented by the low-dimensional target matrix;
step 4, classifying the documents to be classified by using the trained KNN classifier;
the scalable representation learning process in step 2 comprises the following steps:
step 201, constructing an adjacency graph G according to the pairwise similarity of the vectors in the feature matrix M, wherein the vectors in the feature matrix form nodes of the adjacency graph;
step 202, generating a context of a node in an adjacency graph by using a weighted random walk model in the adjacency graph G;
and step 203, learning the embedded representation by expanding the skip-gram model to obtain a low-dimensional target matrix of the embedded representation.
2. The method according to claim 1, wherein in step 201 each node in the adjacency graph G represents a vector in the feature matrix, the similarity between nodes is calculated, and if one of two nodes is among the top-k most similar nodes of the other, the two nodes are directly connected by an edge;
in step 202, the weighted random walk model is a method for generating random sequences on the adjacency graph. If (x_{w1}, x_{w2}, …, x_{wl}) is a random sequence of length l and a sliding window of size c is adopted to represent the context of a node, the context NC(x_{wj}) of a node x_{wj} in the random sequence can be expressed as NC(x_{wj}) = {x_{wm} | −c ≤ m − j ≤ c, m ∈ (1, 2, …, l)}. Given the previous node x_{w(t−1)} = v_b in the adjacency graph, the probability that the current node is v_a is calculated with the following formula:

P(x_{wt} = v_a | x_{w(t−1)} = v_b) = sim(v_a, v_b) / Z, if (v_a, v_b) ∈ E, and 0 otherwise,

where E is the set of edges of the adjacency graph, P denotes the conditional probability, sim() denotes the similarity between two nodes, and Z = Σ_{(v_b, v_x) ∈ E} sim(v_b, v_x) is a normalization constant;
in step 203, the objective function for learning the embedded representation with the extended skip-gram model is:

max_f Σ_{x_i ∈ M} log P(NC(x_i) | f(x_i)), with P(x_j | f(x_i)) = exp(f(x_j) · f(x_i)) / Σ_{x_v ∈ M} exp(f(x_v) · f(x_i)),

where f denotes the function to be learned, which represents the current high-dimensional data as low-dimensional data while still preserving the similarity information between the data after the dimension-reduced representation, NC(x_i) denotes the context of x_i, and exp() denotes the exponential function with base e.
3. The method for classifying long and short texts according to claim 2, wherein in step 203 a negative sampling method is used to approximate the denominator Σ_{x_v ∈ M} exp(f(x_v) · f(x_i)) for fast calculation, negative sampling is implemented with the gensim toolkit, the sampling threshold is set to 0.001, and the objective function is optimized with stochastic gradient descent to learn the function f.
4. The method according to claim 1, wherein the similarity between nodes is measured by cosine similarity in step 201.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011230707.3A CN112231482A (en) | 2020-11-06 | 2020-11-06 | Long and short text classification method based on scalable representation learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011230707.3A CN112231482A (en) | 2020-11-06 | 2020-11-06 | Long and short text classification method based on scalable representation learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112231482A true CN112231482A (en) | 2021-01-15 |
Family
ID=74122435
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011230707.3A Pending CN112231482A (en) | 2020-11-06 | 2020-11-06 | Long and short text classification method based on scalable representation learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112231482A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113158391A (en) * | 2021-04-30 | 2021-07-23 | 中国人民解放军国防科技大学 | Method, system, device and storage medium for visualizing multi-dimensional network node classification |
CN114595741A (en) * | 2022-01-17 | 2022-06-07 | 中国人民解放军国防科技大学 | High-dimensional data rapid dimension reduction method and system based on neighborhood relationship |
CN115015390A (en) * | 2022-06-08 | 2022-09-06 | 华侨大学 | MWTLMDS-based curtain wall working modal parameter identification method and system |
CN115767204A (en) * | 2022-11-10 | 2023-03-07 | 北京奇艺世纪科技有限公司 | Video processing method, electronic equipment and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104915386A (en) * | 2015-05-25 | 2015-09-16 | 中国科学院自动化研究所 | Short text clustering method based on deep semantic feature learning |
CN110134786A (en) * | 2019-05-14 | 2019-08-16 | 南京大学 | A kind of short text classification method based on theme term vector and convolutional neural networks |
2020-11-06: CN202011230707.3A — patent CN112231482A (en), active, Pending
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104915386A (en) * | 2015-05-25 | 2015-09-16 | 中国科学院自动化研究所 | Short text clustering method based on deep semantic feature learning |
CN110134786A (en) * | 2019-05-14 | 2019-08-16 | 南京大学 | A kind of short text classification method based on theme term vector and convolutional neural networks |
Non-Patent Citations (4)
Title |
---|
BRUNO TRSTENJAK et al.: "KNN with TF-IDF Based Framework for Text Categorization", Procedia Engineering *
XIANG WANG et al.: "A Low-Dimensional Representation Learning Method for Text Classification and Clustering", 2020 IEEE Fifth International Conference on Data Science in Cyberspace (DSC) *
JING Yongxia et al.: "Research on Text Classification Algorithms Based on Matrix Singular Value Decomposition", Journal of Northwest Normal University *
CHEN Zonghai: "System Simulation Technology and Its Applications", 31 August 2017 *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113158391A (en) * | 2021-04-30 | 2021-07-23 | 中国人民解放军国防科技大学 | Method, system, device and storage medium for visualizing multi-dimensional network node classification |
CN113158391B (en) * | 2021-04-30 | 2023-05-30 | 中国人民解放军国防科技大学 | Visualization method, system, equipment and storage medium for multidimensional network node classification |
CN114595741A (en) * | 2022-01-17 | 2022-06-07 | 中国人民解放军国防科技大学 | High-dimensional data rapid dimension reduction method and system based on neighborhood relationship |
CN114595741B (en) * | 2022-01-17 | 2023-09-01 | 中国人民解放军国防科技大学 | High-dimensional data rapid dimension reduction method and system based on neighborhood relation |
CN115015390A (en) * | 2022-06-08 | 2022-09-06 | 华侨大学 | MWTLMDS-based curtain wall working modal parameter identification method and system |
CN115767204A (en) * | 2022-11-10 | 2023-03-07 | 北京奇艺世纪科技有限公司 | Video processing method, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109189925B (en) | Word vector model based on point mutual information and text classification method based on CNN | |
CN112231482A (en) | Long and short text classification method based on scalable representation learning | |
WO2020199591A1 (en) | Text categorization model training method, apparatus, computer device, and storage medium | |
CN109408743B (en) | Text link embedding method | |
WO2019019860A1 (en) | Method and apparatus for training classification model | |
TW201837746A (en) | Method, apparatus, and electronic devices for searching images | |
CN111079419B (en) | National defense science and technology hotword discovery method and system based on big data | |
CN113392191B (en) | Text matching method and device based on multi-dimensional semantic joint learning | |
CN107357895B (en) | Text representation processing method based on bag-of-words model | |
CN110688479A (en) | Evaluation method and sequencing network for generating abstract | |
CN114461890A (en) | Hierarchical multi-modal intellectual property search engine method and system | |
Ekbal et al. | A deep learning architecture for protein-protein interaction article identification | |
CN114186017A (en) | Code searching method based on multi-dimensional matching | |
CN112487110A (en) | Overlapped community evolution analysis method and system based on network structure and node content | |
Parvathi et al. | Identifying relevant text from text document using deep learning | |
Wong et al. | Feature selection and feature extraction: highlights | |
Song et al. | Sparse multi-modal topical coding for image annotation | |
Benghuzzi et al. | An investigation of keywords extraction from textual documents using Word2Vec and Decision Tree | |
CN117435685A (en) | Document retrieval method, document retrieval device, computer equipment, storage medium and product | |
Tian et al. | Chinese short text multi-classification based on word and part-of-speech tagging embedding | |
Ronghui et al. | Application of Improved Convolutional Neural Network in Text Classification. | |
CN112766297A (en) | Image classification method based on scalable representation learning | |
WO2023147299A1 (en) | Systems and methods for short text similarity based clustering | |
CN114298020B (en) | Keyword vectorization method based on topic semantic information and application thereof | |
Foncubierta-Rodríguez et al. | From visual words to a visual grammar: using language modelling for image classification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210115 |