CN112231482A - Long and short text classification method based on scalable representation learning - Google Patents

Long and short text classification method based on scalable representation learning Download PDF

Info

Publication number
CN112231482A
Authority
CN
China
Prior art keywords
similarity
node
long
nodes
learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011230707.3A
Other languages
Chinese (zh)
Inventor
汪祥
李小勇
王辉赞
朱俊星
张卫民
任开军
李金才
邓科峰
吴松
赵娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202011230707.3A
Publication of CN112231482A
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a long and short text classification method based on scalable representation learning, which comprises the following steps: preprocessing the texts in a long and short text set and expressing the text set as a feature matrix M, wherein the elements in M are the weights of the corresponding words calculated by the TF-IDF method; inputting the feature matrix M into a scalable representation learning process to obtain a low-dimensional target matrix; training a KNN classifier on the training set represented by the low-dimensional target matrix; and classifying the documents to be classified with the trained KNN classifier. The scalable representation learning method designed by the invention retains the similarity relations in the data, is extensible and easy to parallelize, and is suitable for general classification of long and short texts; experiments show that the method achieves better classification performance on large-scale long and short text classification problems.

Description

Long and short text classification method based on scalable representation learning
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a long and short text classification method based on scalable representation learning.
Background
In machine learning and data mining applications for text classification, the input data often contains redundant features and noise, which can negatively impact the generalization ability of machine learning and data mining methods. As a result, much of the practical work has to be spent on data preprocessing and data transformation in order to effectively apply machine learning and data mining to real problems. It is well known that feature engineering is very important but requires a lot of labor, which highlights the weakness of many machine learning algorithms in extracting and organizing discriminative information from data. To solve this problem, improve data quality, alleviate the curse of dimensionality and further reduce storage space, it is important to develop a data representation learning method.
Generally, there are three kinds of representation learning methods for data in machine learning tasks: feature selection, dimensionality reduction and embedded representation learning. (1) Feature selection involves selecting the best subset of variables from all available variables. Typical feature selection methods include Relief, Las Vegas Wrapper, etc., which severely limit the kinds of features that can be represented. (2) Dimensionality reduction studies how to reduce the data size while retaining the most important information. There are many exemplary linear and nonlinear dimensionality reduction algorithms, such as PCA (principal component analysis), LLE (locally linear embedding), SVD (singular value decomposition) and LE (Laplacian eigenmap). Although these methods are suitable for low-dimensional representation of matrices without specifying a data field, they have high computational and storage complexity and are difficult to apply to representation learning on large-scale data. For example, the computational complexity of LLE and LE is O(n²), where n is the number of data points. (3) Embedded representation learning aims at learning an informative representation of data using a neural network. However, embedded representation learning methods are generally used in a specified domain, and it is difficult for them to universally adapt to low-dimensional matrix data representation in general classification applications of long and short texts.
Disclosure of Invention
In view of the above, the present invention aims to provide a long and short text classification method based on scalable representation learning, which overcomes the disadvantages of the prior art and rapidly and efficiently classifies the texts in a long and short text set. The method can universally produce a low-dimensional data representation of the feature matrix in general classification applications of long and short texts.
Based on the above purpose, the method for classifying the long and short texts based on the scalable representation learning comprises the following steps:
step 1, preprocessing texts in a long and short text set, and expressing the text set as a feature matrix
$M \in \mathbb{R}^{n \times D}$,
n is the number of documents in the text set, D is the number of words in the data set, and the elements in M are the weights of corresponding words calculated by using a TF-IDF (term frequency-inverse document frequency) method;
step 2, inputting the feature matrix M into a scalable representation learning process to obtain a low-dimensional target matrix;
step 3, training a KNN (k-Nearest Neighbor) classifier by adopting the training set represented by the low-dimensional target matrix;
and 4, classifying the documents to be classified by using the trained KNN classifier.
Specifically, the scalable representation learning process in step 2 includes the following steps:
step 201, constructing an adjacency graph G according to the pairwise similarity of the vectors in the feature matrix M, wherein the vectors in the feature matrix form nodes of the adjacency graph;
step 202, generating a context of a node in an adjacency graph by using a weighted random walk model in the adjacency graph G;
and step 203, learning the embedded representation by expanding the skip-gram model to obtain a low-dimensional target matrix of the embedded representation.
Specifically, in step 201, each node in the adjacency graph G represents a vector in the feature matrix, the similarity between the nodes is calculated, and if one of two nodes is among the top-k most similar nodes of the other node, the two nodes are directly connected by an edge;
in step 202, the weighted random walk model is a method for generating random sequences on the adjacency graph; if $(x_{w_1}, x_{w_2}, \ldots, x_{w_l})$ is a random sequence of length $l$, a sliding window of size $c$ is adopted to represent the context of a node, and the context $NC(x_{w_j})$ of a node $x_{w_j}$ in the random sequence can be expressed as $NC(x_{w_j}) = \{x_{w_m} \mid -c \le m - j \le c,\ m \in \{1, 2, \ldots, l\}\}$; given the previous node $x_{w_{t-1}} = v_b$ in the adjacency graph, the following formula is adopted to calculate the probability that the current node is $v_a$:

$$P\big(x_{w_t} = v_a \mid x_{w_{t-1}} = v_b\big) = \begin{cases} \dfrac{\mathrm{sim}(v_a, v_b)}{Z}, & (v_a, v_b) \in E \\ 0, & \text{otherwise} \end{cases}$$
where E is the set of edges of the adjacency graph, P represents the conditional probability, sim () represents the similarity between two nodes,
$$Z = \sum_{(v,\, v_b) \in E} \mathrm{sim}(v, v_b)$$
is a normalization constant.
In step 203, the objective function for learning the embedded representation in the extended skip-gram model is:
$$\max_{f} \sum_{x_i \in M} \sum_{x_j \in NC(x_i)} \log \frac{\exp\big(f(x_j) \cdot f(x_i)\big)}{\sum_{x_v \in M} \exp\big(f(x_v) \cdot f(x_i)\big)}$$
where f denotes the representation function to be learned, which expresses the current high-dimensional data as low-dimensional data while still maintaining the similarity information between the data after the dimension-reduced representation, $NC(x_i)$ denotes the context of $x_i$, and $\exp(\cdot)$ denotes the exponential function with base e.
Further, in step 203, a negative sampling method is used to approximate the term

$$\sum_{x_v \in M} \exp\big(f(x_v) \cdot f(x_i)\big)$$

so that it can be computed quickly; negative sampling is implemented using the gensim toolkit with the sampling threshold set to 0.001, and the objective function is optimized using stochastic gradient descent to learn the function f.
Further, the similarity between the nodes is measured using cosine similarity in step 201.
Drawings
FIG. 1 is a schematic overall flow chart of an embodiment of the present invention;
fig. 2 is a schematic diagram of a scalable representation learning process according to an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the accompanying drawings, but the invention is not limited in any way, and any alterations or substitutions based on the teaching of the invention are within the scope of the invention.
As shown in fig. 1, the text classification method based on scalable representation learning includes the following steps:
step 1, preprocessing the texts in the long and short text set, and representing the text set as a feature matrix

$M \in \mathbb{R}^{n \times D}$,
n is the number of documents in the text set, D is the number of words in the data set, and the elements in M are the weights of corresponding words calculated by using a TF-IDF (term frequency-inverse document frequency) method;
step 2, inputting the feature matrix M into a scalable representation learning process to obtain a low-dimensional target matrix;
step 3, training a KNN (k-Nearest Neighbor) classifier by adopting the training set represented by the low-dimensional target matrix;
and 4, classifying the documents to be classified by using the trained KNN classifier.
As shown in fig. 2, the scalable representation learning process in step 2 includes the following steps:
step 201, constructing an adjacency graph G according to the pairwise similarity of the vectors in the feature matrix M, wherein the vectors in the feature matrix form nodes of the adjacency graph;
step 202, generating a context of a node in an adjacency graph by using a weighted random walk model in the adjacency graph G;
and step 203, learning the embedded representation by expanding the skip-gram model to obtain a low-dimensional target matrix of the embedded representation.
The purpose of step 2 is to learn, by means of the function f, a mapping from the matrix $M_{n \times D}$ to the target matrix $Z_{n \times d} \in \mathbb{R}^{n \times d}$ while retaining the most important information in the original data set. Formally, the problem can be represented by the following equation.

$$Z_{n \times d} = f\big(M_{n \times D}, \mathrm{sim}(\cdot)\big)$$

In the equation, each vector $x_i \in M$ ($x_i \in \mathbb{R}^D$) is expressed as a $d$-dimensional vector ($d \ll D$), and the number of vectors in M is unchanged. In order to define the learning function f, it is necessary to consider which attributes of the data the invention should capture or retain. In the above equation, $\mathrm{sim}(\cdot)$ is a similarity function defined for capturing and retaining the data attributes.
In the process of constructing the adjacency graph, considering that the matrix M consists of vectors and that the pairwise similarity relationships of the vectors should be maintained, a weighted adjacency graph G is used to represent the matrix M, with each vector as a node of the graph. The problem of learning a matrix representation while preserving vector similarity is thus transferred to learning a representation of the weighted graph while preserving the neighborhoods of nodes in the adjacency graph. For the top-k similar vectors of each vector in matrix M, there are edges connecting the corresponding nodes in graph G: if $v_i$ is one of the top-k nearest neighbors of $v_j$, or $v_j$ is one of the top-k nearest neighbors of $v_i$, nodes $v_i$ and $v_j$ are connected by an edge. To calculate the pairwise similarity between two nodes, many common global and local similarity/distance functions may be used, such as the Jaccard, cosine, Dice and overlap similarities of sets and vectors; in the experiments of this embodiment, the local similarity function "cosine similarity" is chosen to preserve the pairwise similarity relationships of the vectors in M.
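As an illustration of step 201, the following Python sketch builds such a top-k cosine-similarity adjacency graph with scikit-learn and networkx. It is a minimal sketch under the assumptions stated in the comments, not the patented implementation; the function name build_adjacency_graph and the default k are illustrative.

```python
# Sketch of step 201 (assumed, not the patent's exact code): connect each row
# of the TF-IDF matrix M to its top-k most similar rows under cosine similarity.
import networkx as nx
from sklearn.neighbors import NearestNeighbors


def build_adjacency_graph(M, k=10):
    """Build a weighted, undirected adjacency graph over the rows of M."""
    # k+1 neighbors because each vector is its own nearest neighbor (distance 0).
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(M)
    distances, indices = nn.kneighbors(M)

    G = nx.Graph()
    G.add_nodes_from(range(M.shape[0]))
    for i, (dist_row, idx_row) in enumerate(zip(distances, indices)):
        for d, j in zip(dist_row, idx_row):
            if i == j:
                continue  # skip self-loops
            sim = 1.0 - d  # cosine similarity = 1 - cosine distance
            # Undirected edge covers "i is in top-k of j OR j is in top-k of i".
            G.add_edge(i, int(j), weight=sim)
    return G
```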
Skip-gram is a language model that maximizes the probability of co-occurrence between words appearing within a sentence window. In this embodiment, an extended skip-gram model is used to learn the embedded representation of the adjacency graph. However, the graph has no natural context for its nodes, so the context of each node must be constructed; we generate random walk paths over the graph as the contexts of the nodes, simulating the way sentences are generated from words. Compared with the traditional search methods of breadth-first sampling (BFS) and depth-first sampling (DFS), the weighted random walk is chosen to establish the context of a node in the adjacency graph because it is computationally efficient in terms of both memory and time requirements. The method can sample the nodes multiple times and capture both neighborhood similarity and graph structure. The memory complexity of storing the direct neighbors of each node in the graph is O(|E|). Random walks have been used as similarity measures for various problems in content recommendation and community detection. Random walks are very easy to parallelize, and several random walkers can simultaneously explore different parts of the graph. They can also accommodate subtle changes in the graph structure without requiring global recalculation.
The weighted random walk model is a method of generating random sequences on the adjacency graph. Suppose $(x_{w_1}, x_{w_2}, \ldots, x_{w_l})$ is a random sequence of length $l$; we use a sliding window of size $c$ to represent the context of a node. The context $NC(x_{w_j})$ of a node $x_{w_j}$ in the random sequence can be expressed as $NC(x_{w_j}) = \{x_{w_m} \mid -c \le m - j \le c,\ m \in \{1, 2, \ldots, l\}\}$. A random walk is a Markov chain, so the generation of the current node is only related to the previous node. In the adjacency graph, given the previous node $x_{w_{t-1}} = v_b$, we use the following formula to calculate the probability that the current node is $v_a$. For any node, the length of the random walk is fixed to a small length $l$, and from each node the random walk is started $w$ times.

$$P\big(x_{w_t} = v_a \mid x_{w_{t-1}} = v_b\big) = \begin{cases} \dfrac{\mathrm{sim}(v_a, v_b)}{Z}, & (v_a, v_b) \in E \\ 0, & \text{otherwise} \end{cases}$$

where E is the set of edges of the adjacency graph, P represents the conditional probability, $\mathrm{sim}(\cdot)$ represents the similarity between two nodes, and

$$Z = \sum_{(v,\, v_b) \in E} \mathrm{sim}(v, v_b)$$

is the normalization constant.
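A minimal sketch of this weighted random walk (step 202) is shown below, assuming the networkx graph G built earlier; the function name weighted_random_walks and the default walk parameters are illustrative rather than taken from the patent. Normalization by Z is handled implicitly by the weighted sampling call.

```python
# Illustrative sketch of step 202: generate random walks on G where the next
# node is sampled with probability proportional to the edge similarity.
import random


def weighted_random_walks(G, walk_length=10, walks_per_node=10, seed=42):
    """Return a list of walks; each node starts `walks_per_node` walks."""
    rng = random.Random(seed)
    walks = []
    for start in G.nodes():
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                current = walk[-1]
                neighbors = list(G.neighbors(current))
                if not neighbors:
                    break  # isolated node: stop this walk early
                weights = [G[current][nbr]["weight"] for nbr in neighbors]
                # Weighted choice ~ sim(v_a, v_b) / Z
                walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
            walks.append([str(n) for n in walk])  # gensim expects string tokens
    return walks
```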
Inspired by the recently developed distributed representation learning methods Word2Vec, Node2Vec and DeepWalk, this embodiment extends the skip-gram model to learn the embedded representation function f. The skip-gram model was developed to learn the similarity of words from training text. It proceeds as follows: it scans the words in the documents of the data set and computes an embedding for each word so that the features of the word can predict nearby words. Its output is an embedding vector for each word, and it preserves the similarity between words. It learns by optimizing the likelihood using SGD with negative sampling. The algorithm is based on the distributional hypothesis that words appearing in similar contexts tend to have similar meanings. By extending the skip-gram model, the objective of learning the embedded representation in this embodiment can be defined as:
$$\max_{f} \sum_{x_i \in M} \sum_{x_j \in NC(x_i)} \log \frac{\exp\big(f(x_j) \cdot f(x_i)\big)}{\sum_{x_v \in M} \exp\big(f(x_v) \cdot f(x_i)\big)}$$
where f denotes the representation function to be learned, which expresses the current high-dimensional data as low-dimensional data while still maintaining the similarity information between the data after the dimension-reduced representation, $NC(x_i)$ denotes the context of $x_i$, and $\exp(\cdot)$ denotes the exponential function with base e.
For a large data set, the data set is,
Figure BDA0002765093900000072
is very costly and is therefore approximated using a negative sampling method for fast calculation. In the experiment, negative sampling was performed using the gensim toolkit and the sampling threshold was set to 0.001. We optimize the objective function using random gradient descent (SGD) to learn the function f.
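The following sketch shows how this training step (step 203) could look with the gensim toolkit mentioned above, assuming the gensim 4.x API; the helper name learn_embedding and the window, negative and epoch values are illustrative, while the sampling threshold of 0.001 and the 128-dimensional target follow the text.

```python
# Sketch of step 203 (assumptions noted above): train a skip-gram model with
# negative sampling on the random walks, then read back the low-dimensional
# target matrix Z with one row per node.
import numpy as np
from gensim.models import Word2Vec


def learn_embedding(walks, n_nodes, dim=128, window=5):
    model = Word2Vec(
        sentences=walks,   # each walk is a list of node ids as strings
        vector_size=dim,   # target dimension d (128 in the experiments)
        window=window,     # sliding window size c
        min_count=0,       # keep every node
        sg=1,              # skip-gram rather than CBOW
        negative=5,        # negative sampling
        sample=0.001,      # sampling threshold from the text
        workers=4,
        epochs=5,
    )
    # Stack node vectors in node order to obtain Z (n x d).
    Z = np.vstack([model.wv[str(i)] for i in range(n_nodes)])
    return Z
```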
To test the performance of the method of the present invention, it was experimentally compared with all selected methods in long and short text classification applications. In the experiments, 4-fold cross-validation was performed on all data sets. For the documents in each data set, rare words with a total frequency of less than 5 were discarded, and 201 common meaningless stop words, such as "a", "is", "of", etc., were also removed. Some standard text preprocessing steps were also performed, such as stemming and lower-casing the documents. Then, the weight of each remaining word is calculated using the TF-IDF method. Each document is represented by a vector, and a data set is represented by a matrix $M \in \mathbb{R}^{n \times D}$, where n is the number of documents and D is the number of words in the data set. The following four real data sets were used in the experiments to evaluate the performance of the methods on long and short text classification; a minimal preprocessing sketch follows the data set list.
1. Movie reviews. This collection contains 2,000 movie reviews, of which 1,000 express positive opinions and 1,000 express negative opinions about the movie.
2. 20 news groups. After removing duplicate documents and rare words, it contains 18,825 articles in 20 categories (approximately 1,000 documents in each category). These articles were taken from the "Usenet" newsgroup collection, and we used only the subject and body of each message. For computational reasons, 2,000 documents were randomly selected for the experiments.
3. 20 news short groups. To evaluate the performance of the method on short text classification, only the article titles of the 20 news groups data set were used in the experiments.
4. Google fragments. This labeled collection was retrieved from Google searches using JWebPro; it consists of 12,000 snippets (10,000 for training and 2,000 for testing) labeled with 8 categories. The snippets in the data set are on average about 17.99 words long. This set was used to evaluate the performance of the methods on short text classification.
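The preprocessing and TF-IDF weighting described above can be sketched with scikit-learn as follows. This is an assumed approximation rather than the exact experimental code: min_df counts document frequency rather than total frequency, the built-in English stop-word list stands in for the 201 stop words, and stemming is omitted for brevity.

```python
# Approximate preprocessing sketch: drop rare words, remove stop words,
# lower-case, and build the TF-IDF feature matrix M (n documents x D words).
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    stop_words="english",  # stand-in for the 201 stop words used in the text
    lowercase=True,
    min_df=5,              # rough stand-in for "total frequency less than 5"
)

# documents: list of raw text strings, one per document
# M = vectorizer.fit_transform(documents)   # sparse TF-IDF matrix, shape (n, D)
```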
In order to show the effect of the method, five other unsupervised dimensionality reduction methods were compared in the experiments: (1) Principal Component Analysis (PCA), (2) Locally Linear Embedding (LLE), (3) Laplacian Eigenmap (LE), (4) classical multidimensional scaling (CMDS), and (5) Isomap. PCA and CMDS are typical linear methods, while LLE, LE and Isomap are typical nonlinear methods. These five comparison representation learning methods are briefly introduced as follows:
principal Component Analysis (PCA). PCA is a linear dimensionality reduction method that performs dimensionality reduction by embedding data into a lower-dimensional linear subspace. Although various techniques exist for linear and non-linear dimensionality reduction, PCA remains one of the most popular and most powerful unsupervised linear techniques. We performed experiments using the implementation of PCA in the "scimit-spare" kit.
Locally Linear Embedding (LLE). LLE is a manifold learning method based on manifold geometry concepts. LLE constructs a graphical representation of the data points that retains only local attributes of the data and treats the high-dimensional data points as linear combinations of their nearest neighbors. We performed experiments using the implementation of LLE in the "scimit-spare" kit.
Laplace feature map (LE). Similar to LLE, LE finds a low-dimensional data representation by preserving the local properties of the manifold. The local attribute is based on the pair-wise distance between neighbors. We performed for experiments on LE in the "scimit-spare" kit.
Classical multidimensional scaling (CMDS). The CMDS seeks a low-dimensional representation of the data, where distances take good account of distances in the original high-dimensional space. It attempts to model similarity or dissimilarity data as distances in geometric space. Also, we performed experiments using the implementation of classical MDS in the "scimit-spare" kit.
ISOMAP. It is one of several widely used non-linear dimension reduction methods. It is used to compute quasi-equidistant and low-dimensional embeddings of a set of high-dimensional data points. The algorithm provides a simple method to estimate the inherent geometry of a data manifold based on a rough estimate of the neighborhood of each data point on the manifold. It is very efficient and is generally applicable to a wide range of data sources. We performed the experiments using the implementation in the "scimit-spare" kit.
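For reference, the five comparison methods could be instantiated from scikit-learn roughly as follows. This is a hedged sketch: the neighbor counts are placeholders (the text tunes them with GridSearch), and scikit-learn's metric MDS is used here as a stand-in for classical MDS.

```python
# Baseline dimensionality reduction methods, target dimension 128 as in the text.
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, Isomap, LocallyLinearEmbedding, SpectralEmbedding

d = 128  # target dimension used in the experiments
baselines = {
    "PCA":    PCA(n_components=d),
    "LLE":    LocallyLinearEmbedding(n_components=d, n_neighbors=10),
    "LE":     SpectralEmbedding(n_components=d, n_neighbors=10),  # Laplacian eigenmaps
    "CMDS":   MDS(n_components=d),       # metric MDS as a stand-in for classical MDS
    "Isomap": Isomap(n_components=d, n_neighbors=10),
}

# X = M.toarray()                          # most manifold methods expect dense input
# Z = baselines["PCA"].fit_transform(X)    # low-dimensional representation
```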
In the experiments, KNN is implemented using scikit-learn. To select the parameter k in KNN, the "GridSearch" method is used to find the optimal parameter k for each data set and method (k = 1, 3, 5, 7, 9, 11). In the experiments, the target dimension for each method was 128. "GridSearch" is also used to find the best number of neighbors for LLE, LE, Isomap and matrix2vec, since the number of neighbors is an important parameter for these methods.
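A minimal sketch of the KNN training and classification (steps 3 and 4), assuming scikit-learn and the embedded matrix Z produced by the representation learning step; the variable names are illustrative.

```python
# Fit a KNN classifier on the low-dimensional representation and pick k by grid search.
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}  # the k values from the text
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=4)  # 4-fold CV as in the experiments

# Z_train, y_train: embedded training documents and their labels
# search.fit(Z_train, y_train)
# y_pred = search.predict(Z_test)   # classify the documents to be classified
```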
Table 1 Experimental results of long and short text classification

Method             Google fragments      20 news short groups   20 news groups        Movie reviews
LLE                0.7970 (+/-0.1127)    0.4258 (+/-0.0277)     0.4980 (+/-0.0199)    0.6650 (+/-0.0254)
LE                 0.7675 (+/-0.0402)    0.4022 (+/-0.0287)     0.4805 (+/-0.0308)    0.6555 (+/-0.0325)
PCA                0.9210 (+/-0.0681)    0.4006 (+/-0.0248)     0.4864 (+/-0.0600)    0.6875 (+/-0.0380)
CMDS               0.8565 (+/-0.1504)    0.0862 (+/-0.0208)     0.1575 (+/-0.0197)    0.5345 (+/-0.0312)
ISOMAP             0.8735 (+/-0.1552)    0.4152 (+/-0.0367)     0.3613 (+/-0.0418)    0.6415 (+/-0.0266)
Proposed method    0.9135 (+/-0.0633)    0.5452 (+/-0.0128)     0.7710 (+/-0.0194)    0.6875 (+/-0.0380)
Table 1 shows the accuracy of the representation learning methods on long and short text classification. Among the matrix representation learning methods (LLE, LE, PCA, CMDS, Isomap and the method of the present invention), we can find that the proposed algorithm exhibits the best performance on both long and short text classification. On the short text data sets "Google fragments" and "20 news short groups", the performance of the method improves by 19.97% and 28.04% over the LLE method. On the long text data set "20 news groups", the method is much better than the comparative methods and has a 100.96% improvement over the LLE method. Furthermore, on the long text data set "movie reviews", the method has the same performance as the PCA method, both reaching an accuracy of 0.6875. The matrix representation learning in the method is based on the weighted random walk model and the extended skip-gram model; its computational and storage complexities are O(|E|) and O(n) respectively, which are the smallest among existing methods for learning matrix representations.
The above embodiment is an implementation manner of the method for text classification, but the implementation manner of the invention is not limited by the above embodiment, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the invention should be regarded as equivalent substitutions, and are included in the protection scope of the invention.

Claims (4)

1. The long and short text classification method based on scalable representation learning is characterized by comprising the following steps of:
step 1, preprocessing texts in a long and short text set, and expressing the text set as a feature matrix
$M \in \mathbb{R}^{n \times D}$,
n is the number of documents in the text set, D is the number of words in the data set, and the elements in M are the weights of corresponding words calculated by using a TF-IDF method;
step 2, inputting the feature matrix M into a scalable representation learning process to obtain a low-dimensional target matrix;
step 3, training the KNN classifier by adopting the training set represented by the low-dimensional target matrix;
step 4, classifying the documents to be classified by using the trained KNN classifier;
the scalable representation learning process in step 2 comprises the following steps:
step 201, constructing an adjacency graph G according to the pairwise similarity of the vectors in the feature matrix M, wherein the vectors in the feature matrix form nodes of the adjacency graph;
step 202, generating a context of a node in an adjacency graph by using a weighted random walk model in the adjacency graph G;
and step 203, learning the embedded representation by expanding the skip-gram model to obtain a low-dimensional target matrix of the embedded representation.
2. The method according to claim 1, wherein in step 201, each node in the adjacency graph G represents a vector in the feature matrix, the similarity between the nodes is calculated, and if one of two nodes is among the top-k most similar nodes of the other node, the two nodes are directly connected by an edge;
in step 202, the weighted random walk model is a method for generating random sequences on the adjacency graph; if $(x_{w_1}, x_{w_2}, \ldots, x_{w_l})$ is a random sequence of length $l$, a sliding window of size $c$ is adopted to represent the context of a node, and the context $NC(x_{w_j})$ of a node $x_{w_j}$ in the random sequence can be expressed as $NC(x_{w_j}) = \{x_{w_m} \mid -c \le m - j \le c,\ m \in \{1, 2, \ldots, l\}\}$; given the previous node $x_{w_{t-1}} = v_b$, the following formula is used to calculate the probability that the current node is $v_a$:

$$P\big(x_{w_t} = v_a \mid x_{w_{t-1}} = v_b\big) = \begin{cases} \dfrac{\mathrm{sim}(v_a, v_b)}{Z}, & (v_a, v_b) \in E \\ 0, & \text{otherwise} \end{cases}$$
where E is the set of edges of the adjacency graph, P represents the conditional probability, sim () represents the similarity between two nodes,
$$Z = \sum_{(v,\, v_b) \in E} \mathrm{sim}(v, v_b)$$
is a normalization constant;
in step 203, the objective function for learning the embedded representation in the extended skip-gram model is:
$$\max_{f} \sum_{x_i \in M} \sum_{x_j \in NC(x_i)} \log \frac{\exp\big(f(x_j) \cdot f(x_i)\big)}{\sum_{x_v \in M} \exp\big(f(x_v) \cdot f(x_i)\big)}$$
where f denotes the representation function to be learned, which expresses the current high-dimensional data as low-dimensional data while still maintaining the similarity information between the data after the dimension-reduced representation, $NC(x_i)$ denotes the context of $x_i$, and $\exp(\cdot)$ denotes the exponential function with base e.
3. The method for classifying long and short texts according to claim 2, wherein in step 203 a negative sampling method is used to approximate the term

$$\sum_{x_v \in M} \exp\big(f(x_v) \cdot f(x_i)\big)$$

for fast calculation; negative sampling is implemented using the gensim toolkit, the sampling threshold is set to 0.001, and the objective function is optimized using stochastic gradient descent to learn the function f.
4. The method according to claim 1, wherein the similarity between nodes is measured by cosine similarity in step 201.
CN202011230707.3A 2020-11-06 2020-11-06 Long and short text classification method based on scalable representation learning Pending CN112231482A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011230707.3A CN112231482A (en) 2020-11-06 2020-11-06 Long and short text classification method based on scalable representation learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011230707.3A CN112231482A (en) 2020-11-06 2020-11-06 Long and short text classification method based on scalable representation learning

Publications (1)

Publication Number Publication Date
CN112231482A true CN112231482A (en) 2021-01-15

Family

ID=74122435

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011230707.3A Pending CN112231482A (en) 2020-11-06 2020-11-06 Long and short text classification method based on scalable representation learning

Country Status (1)

Country Link
CN (1) CN112231482A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113158391A (en) * 2021-04-30 2021-07-23 中国人民解放军国防科技大学 Method, system, device and storage medium for visualizing multi-dimensional network node classification
CN114595741A (en) * 2022-01-17 2022-06-07 中国人民解放军国防科技大学 High-dimensional data rapid dimension reduction method and system based on neighborhood relationship
CN115015390A (en) * 2022-06-08 2022-09-06 华侨大学 MWTLMDS-based curtain wall working modal parameter identification method and system
CN115767204A (en) * 2022-11-10 2023-03-07 北京奇艺世纪科技有限公司 Video processing method, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915386A (en) * 2015-05-25 2015-09-16 中国科学院自动化研究所 Short text clustering method based on deep semantic feature learning
CN110134786A (en) * 2019-05-14 2019-08-16 南京大学 A kind of short text classification method based on theme term vector and convolutional neural networks

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915386A (en) * 2015-05-25 2015-09-16 中国科学院自动化研究所 Short text clustering method based on deep semantic feature learning
CN110134786A (en) * 2019-05-14 2019-08-16 南京大学 A kind of short text classification method based on theme term vector and convolutional neural networks

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BRUNO TRSTENJAKA等: "KNN with TF-IDF Based Framework for Text Categorization", 《PROCEDIA ENGINEERING》 *
XIANG WANG等: "A Low-Dimensional Representation Learning Method for Text Classification and Clustering", 《2020 IEEE FIFTH INTERNATIONAL CONFERENCE ON DATA SCIENCE IN CYBERSPACE(DSC)》 *
JING Yongxia et al.: "Research on text classification algorithm based on matrix singular value decomposition", Journal of Northwest Normal University *
CHEN Zonghai: "System Simulation Technology and Its Applications", 31 August 2017 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113158391A (en) * 2021-04-30 2021-07-23 中国人民解放军国防科技大学 Method, system, device and storage medium for visualizing multi-dimensional network node classification
CN113158391B (en) * 2021-04-30 2023-05-30 中国人民解放军国防科技大学 Visualization method, system, equipment and storage medium for multidimensional network node classification
CN114595741A (en) * 2022-01-17 2022-06-07 中国人民解放军国防科技大学 High-dimensional data rapid dimension reduction method and system based on neighborhood relationship
CN114595741B (en) * 2022-01-17 2023-09-01 中国人民解放军国防科技大学 High-dimensional data rapid dimension reduction method and system based on neighborhood relation
CN115015390A (en) * 2022-06-08 2022-09-06 华侨大学 MWTLMDS-based curtain wall working modal parameter identification method and system
CN115767204A (en) * 2022-11-10 2023-03-07 北京奇艺世纪科技有限公司 Video processing method, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109189925B (en) Word vector model based on point mutual information and text classification method based on CNN
CN112231482A (en) Long and short text classification method based on scalable representation learning
WO2020199591A1 (en) Text categorization model training method, apparatus, computer device, and storage medium
CN109408743B (en) Text link embedding method
WO2019019860A1 (en) Method and apparatus for training classification model
TW201837746A (en) Method, apparatus, and electronic devices for searching images
CN111079419B (en) National defense science and technology hotword discovery method and system based on big data
CN113392191B (en) Text matching method and device based on multi-dimensional semantic joint learning
CN107357895B (en) Text representation processing method based on bag-of-words model
CN110688479A (en) Evaluation method and sequencing network for generating abstract
CN114461890A (en) Hierarchical multi-modal intellectual property search engine method and system
Ekbal et al. A deep learning architecture for protein-protein interaction article identification
CN114186017A (en) Code searching method based on multi-dimensional matching
CN112487110A (en) Overlapped community evolution analysis method and system based on network structure and node content
Parvathi et al. Identifying relevant text from text document using deep learning
Wong et al. Feature selection and feature extraction: highlights
Song et al. Sparse multi-modal topical coding for image annotation
Benghuzzi et al. An investigation of keywords extraction from textual documents using Word2Vec and Decision Tree
CN117435685A (en) Document retrieval method, document retrieval device, computer equipment, storage medium and product
Tian et al. Chinese short text multi-classification based on word and part-of-speech tagging embedding
Ronghui et al. Application of Improved Convolutional Neural Network in Text Classification.
CN112766297A (en) Image classification method based on scalable representation learning
WO2023147299A1 (en) Systems and methods for short text similarity based clustering
CN114298020B (en) Keyword vectorization method based on topic semantic information and application thereof
Foncubierta-Rodríguez et al. From visual words to a visual grammar: using language modelling for image classification

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210115

RJ01 Rejection of invention patent application after publication