CN107609121B - News text classification method based on LDA and word2vec algorithm - Google Patents

News text classification method based on LDA and word2vec algorithm

Info

Publication number
CN107609121B
CN107609121B
Authority
CN
China
Prior art keywords
text
classified
category
vector
sample set
Prior art date
Legal status
Expired - Fee Related
Application number
CN201710828232.XA
Other languages
Chinese (zh)
Other versions
CN107609121A (en)
Inventor
赵阔
王峰
谢珍真
孙小雅
Current Assignee
Jinan University
Original Assignee
Jinan University
Priority date
Filing date
Publication date
Application filed by Jinan University filed Critical Jinan University
Priority to CN201710828232.XA
Publication of CN107609121A
Application granted
Publication of CN107609121B


Abstract

The invention discloses a news text classification method based on the LDA and word2vec algorithms, which comprises the following steps: obtaining word vectors of a corpus with word2vec; performing word segmentation on the texts in a training sample set and removing stop words; obtaining the category core words of the training sample set through an LDA model; constructing the category center vectors of the training sample set; after preprocessing a text to be classified, extracting its feature words to obtain its text vector; calculating the similarity between the text vector of the text to be classified and the category center vectors of the training sample set and classifying the text accordingly; and performing a secondary classification of the text to be classified with the KNN algorithm. The beneficial effects of the invention are: when the primary classification is not sufficient to decide the category, the KNN algorithm performs a secondary classification in which an equal number of samples is drawn from each candidate category to form a new sample set, eliminating the influence of uneven sample distribution on classification accuracy.

Description

News text classification method based on LDA and word2vec algorithm
Technical Field
The invention relates to the technical field of text classification, in particular to a news text classification method based on the LDA and word2vec algorithms.
Background
At present, the most widely used text representation methods are based on the bag-of-words model, which treats a document as a set of words, assumes that every word occurs independently, and ignores information such as word order, grammar, and semantics. The method organizes the feature items of a training text set into a vector space model; each document is represented as a vector with the same dimension as the model, and the value at each position of the vector is the weight, in the training sample set, of the word that the position represents. The main problems of this method are as follows:
(1) The vector dimension is too high:
the dimension of the vector equals the number of feature items retained in the whole training sample set, which can reach tens of thousands or even hundreds of thousands; this causes the curse of dimensionality, and the text vectors occupy a large amount of storage space;
(2) The data are sparse:
a document vector has non-zero weights only at the positions of the feature items that appear in the document, and the weights at most other positions are 0, which reduces computational efficiency in the text classification task and wastes storage space;
(3) Semantic information of a document cannot be represented well:
the bag-of-words model assumes that the words in a document are completely independent and discards the semantic relations between them; for two documents that are semantically similar but share no feature words, the text similarity computed from their bag-of-words vectors is 0.
The KNN algorithm is simple in principle, easy to implement, and high in stability and accuracy, and it is currently one of the classic algorithms for text classification; it has, however, the following two main defects:
(1) When the training sample set is large, the KNN algorithm is inefficient:
the standard KNN algorithm must compute the similarity between the feature vector of the text to be classified and the feature vectors of all samples in the training set, select the K nearest training samples, count the categories to which they belong, and finally assign the text to the most frequent category; computing the similarity between the text to be classified and every text in the whole training sample set is the key cause of the algorithm's inefficiency;
(2) All attributes carry the same weight, which affects the accuracy of the classification result:
when the samples of the categories in the training sample set are unevenly distributed, for example when one category contains many samples and the others contain few, the large category may account for most of the K nearest neighbors of an input text; because the KNN algorithm finally considers only the "nearest" neighbor samples, a text that is not particularly close to the samples of a large category may still be misclassified into it, which harms classification accuracy.
Disclosure of Invention
In order to solve the above problems, the invention aims to provide a news text classification method based on the LDA and word2vec algorithms, in which a primary classification is performed by calculating the similarity between the feature vector of the text to be classified and the category center vectors, greatly reducing the amount of computation; when the primary classification is not sufficient to decide the category, a secondary classification is performed with the KNN algorithm, drawing an equal number of samples from each category of the pruned sample set and thereby eliminating the influence of uneven sample distribution on classification accuracy.
The invention provides a news text classification method based on LDA and word2vec algorithms, which comprises the following steps:
Step 1, obtaining word vectors of a corpus with the word2vec tool:
performing word segmentation on a large-scale corpus, inputting the segmented text into the word2vec tool, and training to obtain a word vector for every word in the corpus;
Step 2, performing text preprocessing on the training sample set:
performing word segmentation on the texts in the training sample set and removing stop words;
Step 3, obtaining the category core words of the training sample set through the LDA topic model:
training an LDA topic model separately on each category of the training sample set, obtaining the text-topic and topic-word probability distributions of each category, and, according to the output of the LDA topic models, taking the words in each category whose probability under the dominant topic exceeds a threshold α as the core words of that category;
Step 4, constructing the category center vectors c_i of the training sample set from the word vectors a_i of the category core words;
Step 5, after preprocessing the text to be classified, extracting its feature words to obtain the text vector d_j of the text to be classified;
Step 6, calculating the similarity between the text vector of the text to be classified and the category center vectors of the training sample set, sorting the similarity values in descending order, primarily classifying the text to be classified according to the ranking, and proceeding to step 7 when the difference between the first two similarity values in the descending order is smaller than a threshold ε;
Step 7, performing a secondary classification of the text to be classified with the KNN algorithm.
As a further improvement of the present invention, step 4 specifically includes:
Step 401, selecting the word vectors a_i of the core words of each category from all the word vectors of step 1;
Step 402, taking the topic-word probability value β_k obtained from the LDA topic model as the weight of each core word for its category, then summing the weighted word vectors of the same category and averaging them to obtain the category center vector c_i of that category, as in formula (1):
c_i = \frac{1}{n} \sum_{k=1}^{n} \beta_k a_k \qquad (1)
where n is the number of core words of the category.
As a further improvement of the present invention, step 5 specifically includes:
Step 501, preprocessing the text to be classified, including word segmentation and stop-word removal;
Step 502, extracting the text feature words with the TF-IDF algorithm:
calculating the TF-IDF value of each word according to formula (2), and taking the words whose TF-IDF value exceeds a threshold θ as the feature words w of the text to be classified:
\mathrm{tfidf}(w) = \frac{m}{M} \times \log\frac{N}{n} \qquad (2)
where m is the number of occurrences of the feature word w in the text to be classified, M is the total number of words in the text to be classified, N is the total number of texts in the training sample set, and n is the number of texts in the training sample set that contain the feature word w;
Step 503, inputting the feature words of the text to be classified into the word2vec tool to obtain their word vectors, and averaging the word vectors of all the feature words to obtain the text vector d_j of the text to be classified.
As a further improvement of the present invention, step 6 specifically includes:
Step 601, calculating the similarity between the text vector d_j of the text to be classified and the category center vector c_i of each category according to formula (3):
\mathrm{sim}(c_i, d_j) = \frac{\sum_{k=1}^{T} w_{ik} w_{jk}}{\sqrt{\sum_{k=1}^{T} w_{ik}^2} \sqrt{\sum_{k=1}^{T} w_{jk}^2}} \qquad (3)
where sim(c_i, d_j) is the similarity value, T is the dimension of the text vector of the text to be classified and of each category center vector, w_ik is the value of the k-th dimension of the category center vector, and w_jk is the value of the k-th dimension of the text vector of the text to be classified;
Step 602, sorting the similarity values calculated in step 601 in descending order;
Step 603, calculating the difference between the first and second similarity values in the descending order of step 602:
if the difference is larger than ε, classifying the text to be classified into the category corresponding to the first similarity value;
if the difference is smaller than ε, performing the secondary classification of step 7.
As a further improvement of the present invention, step 7 specifically includes:
Step 701, extracting from the training text set the texts of the categories corresponding to the top x similarity values in the descending order of step 6 whose adjacent differences are smaller than ε;
Step 702, randomly extracting z texts from each of these categories to form a new training sample set;
Step 703, repeating step 5 for each text in the new training sample set to obtain the text vector of each text;
Step 704, using the KNN algorithm, calculating the similarity between the text vector d_j of the text to be classified and the text vectors d_i of all texts in the new training sample set according to formula (4), and selecting the K most similar texts:
\mathrm{sim}(d_j, d_i) = \frac{\sum_{k=1}^{T} w'_{ik} w_{jk}}{\sqrt{\sum_{k=1}^{T} {w'_{ik}}^2} \sqrt{\sum_{k=1}^{T} w_{jk}^2}} \qquad (4)
where sim(d_j, d_i) is the similarity value, T is the dimension of the text vector of the text to be classified and of the text vectors in the new training sample set, w'_ik is the value of the k-th dimension of a text vector in the new training sample set, and w_jk is the value of the k-th dimension of the text vector of the text to be classified;
Step 705, for the selected K texts, calculating in turn the weight of each category to which they belong according to formula (5):
W(d_j, C_i) = \sum_{d_i \in \mathrm{KNN}(d_j)} \mathrm{sim}(d_j, d_i) \, y(d_i, C_i) \qquad (5)
where W(d_j, C_i) is the weight for classifying the text to be classified into category C_i, sim(d_j, d_i) is the similarity value calculated in step 704, and y(d_i, C_i) is the category attribute function: the category of each text in the new training sample set is known, and for each of the K selected texts y(d_i, C_i) is 1 if the text belongs to category C_i and 0 otherwise;
Step 706, classifying the text to be classified into the category corresponding to the maximum weight value calculated in step 705.
The invention has the beneficial effects that:
1. Word vectors trained with the word2vec tool are used to represent text information. The word2vec model uses the context of words in the text to map each word to a low-dimensional real-valued vector, and the semantic similarity between words is obtained from the distance between their vectors; when constructing text vectors, vector concatenation is replaced by averaging the word vectors of the keywords, which effectively solves the problem of high vector dimensionality and removes the limitation on the number of keywords;
2. The invention proposes a method of constructing category features by combining the LDA model with the word2vec algorithm, taking the probability value of a topic word as the weight of the feature word; this captures both the different contributions of different words to the same category and the different contributions of the same word to different categories. Because word2vec encodes the semantic relations between words, averaging word vectors to represent a text preserves the similarity information between texts while keeping the dimension of the text vector small, so the amount of computation is greatly reduced when calculating the similarity between the feature vector of a text to be classified and a category center vector;
3. In text classification, traditional methods mostly consider only the similarity between texts. The invention extracts category features directly and establishes a relation between texts and categories; when the primary classification is not sufficient to decide the category, a secondary classification is performed with the KNN algorithm, at which point categories far from the text to be classified need not be considered, and an equal number of samples per category is drawn from the pruned sample set, eliminating the influence of uneven sample distribution on classification accuracy.
Drawings
Fig. 1 is a flowchart illustrating a news text classification method based on LDA and word2vec algorithms according to an embodiment of the present invention.
Detailed Description
The present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
As shown in fig. 1, a news text classification method based on LDA and word2vec algorithms according to an embodiment of the present invention includes:
Step 1, obtaining word vectors of a corpus with the word2vec tool:
performing word segmentation on a large-scale corpus, inputting the segmented text into the word2vec tool, and training to obtain a word vector for every word in the corpus.
Step 2, performing text preprocessing on the training sample set:
performing word segmentation on the texts in the training sample set and removing stop words.
Step 3, obtaining the category core words of the training sample set through the LDA topic model:
training an LDA topic model separately on each category of the training sample set, obtaining the text-topic and topic-word probability distributions of each category, and, according to the output of the LDA topic models, taking the words in each category whose probability under the dominant topic exceeds a threshold α as the core words of that category.
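An illustrative reading of step 3 using gensim's LdaModel; `docs_by_category`, the topic count, and the value of α are assumptions, and "dominant topic" is interpreted here as the topic with the greatest total weight over the category's documents:

```python
# Illustrative sketch of step 3: per-category LDA, core words above α.
from gensim import corpora
from gensim.models import LdaModel

ALPHA = 0.01   # threshold α; an assumed value

def category_core_words(docs, num_topics=5):
    """docs: list of token lists for one category; returns (word, β) pairs."""
    dictionary = corpora.Dictionary(docs)
    bow = [dictionary.doc2bow(d) for d in docs]
    lda = LdaModel(bow, id2word=dictionary, num_topics=num_topics, passes=10)
    # Dominant topic: highest total document-topic weight across the category.
    totals = [0.0] * num_topics
    for b in bow:
        for topic_id, p in lda.get_document_topics(b):
            totals[topic_id] += p
    top = max(range(num_topics), key=totals.__getitem__)
    # Core words: topic-word probability β above the threshold α.
    return [(w, beta) for w, beta in lda.show_topic(top, topn=100) if beta > ALPHA]

# `docs_by_category` (hypothetical): category name -> preprocessed token lists.
core = {c: category_core_words(d) for c, d in docs_by_category.items()}
```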
Step 4, constructing the category center vector c_i of the training sample set from the word vectors a_i of the category core words:
Step 401, selecting the word vectors a_i of the core words of each category from all the word vectors of step 1;
Step 402, taking the topic-word probability value β_k obtained from the LDA topic model as the weight of each core word for its category, then summing the weighted word vectors of the same category and averaging them to obtain the category center vector c_i of that category, as in formula (1):
c_i = \frac{1}{n} \sum_{k=1}^{n} \beta_k a_k \qquad (1)
where n is the number of core words of the category.
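A possible rendering of step 4 and formula (1), reusing `w2v` and `core` from the sketches above:

```python
# Illustrative sketch of step 4 / formula (1): average the β-weighted
# core-word vectors of a category to get its center vector c_i.
import numpy as np

def category_center(core_words, w2v):
    vecs = [beta * w2v.wv[w] for w, beta in core_words if w in w2v.wv]
    return np.mean(vecs, axis=0)   # sum of weighted vectors divided by their count

centers = {c: category_center(words, w2v) for c, words in core.items()}
```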
step 5, after the text to be classified is preprocessed, extracting text characteristic words to obtain a text vector d of the text to be classifiedj
Step 501, preprocessing a text to be classified, including word segmentation and stop word removal;
step 502, extracting text feature words by adopting a TF-IDF algorithm:
calculating text feature words extracted by TF-IDF according to a formula (2), and taking words with TF-IDF values larger than a threshold value theta as feature words w of the text to be classified;
Figure BDA0001408035540000072
in the formula, M is the occurrence frequency of the characteristic words w in the text to be classified, M is the total number of words in the text to be classified, N is the total number of texts, and N is the total number of texts containing the characteristic words w;
step 503, inputting the feature words in the text to be classified into the word2vec tool to obtain word vectors of the feature words in the text to be classified, adding the word vectors of all the feature words to obtain an average value to obtain a text vector d of the text to be classifiedj
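An illustrative sketch of step 5 and formula (2); `train_docs` (the preprocessed token lists of the training sample set) and the value of θ are assumptions:

```python
# Illustrative sketch of step 5 / formula (2): TF-IDF feature selection,
# then averaging feature-word vectors into the text vector d_j.
import math
import numpy as np

THETA = 0.02                 # threshold θ; an assumed value
N = len(train_docs)          # total number of texts in the training sample set
doc_freq = {}                # n: number of training texts containing each word
for doc in train_docs:
    for w in set(doc):
        doc_freq[w] = doc_freq.get(w, 0) + 1

def text_vector(tokens, w2v):
    M = len(tokens)          # total number of words in the text
    feats = []
    for w in set(tokens):
        m = tokens.count(w)  # occurrences of w in the text
        # Unseen words are given document frequency 1 here (an assumption).
        tfidf = (m / M) * math.log(N / doc_freq.get(w, 1))
        if tfidf > THETA:
            feats.append(w)
    vecs = [w2v.wv[w] for w in feats if w in w2v.wv]
    return np.mean(vecs, axis=0)
```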
Step 6, calculating the similarity between the text vector of the text to be classified and the category center vectors of the training sample set, sorting the similarity values in descending order, and classifying the text to be classified according to the ranking:
Step 601, calculating the similarity between the text vector d_j of the text to be classified and the category center vector c_i of each category according to formula (3):
\mathrm{sim}(c_i, d_j) = \frac{\sum_{k=1}^{T} w_{ik} w_{jk}}{\sqrt{\sum_{k=1}^{T} w_{ik}^2} \sqrt{\sum_{k=1}^{T} w_{jk}^2}} \qquad (3)
where sim(c_i, d_j) is the similarity value, T is the dimension of the text vector of the text to be classified and of each category center vector, w_ik is the value of the k-th dimension of the category center vector, and w_jk is the value of the k-th dimension of the text vector of the text to be classified;
Step 602, sorting the similarity values calculated in step 601 in descending order;
Step 603, calculating the difference between the first and second similarity values in the descending order of step 602:
if the difference is larger than ε, classifying the text to be classified into the category corresponding to the first similarity value;
if the difference is smaller than ε, performing the secondary classification of step 7.
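A sketch of the primary classification of step 6 and formula (3); the value of ε is assumed, and `centers` comes from the step 4 sketch:

```python
# Illustrative sketch of step 6 / formula (3): cosine similarity against
# every category center; fall through to KNN when the top two are close.
import numpy as np

EPSILON = 0.05   # threshold ε; an assumed value

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def primary_classify(d_j, centers):
    ranked = sorted(((cosine(c, d_j), name) for name, c in centers.items()),
                    reverse=True)
    if ranked[0][0] - ranked[1][0] > EPSILON:
        return ranked[0][1]   # unambiguous: accept the top-ranked category
    return None               # ambiguous: proceed to the secondary KNN of step 7
```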
Step 7, performing a secondary classification of the text to be classified with the KNN algorithm:
Step 701, extracting from the training text set the texts of the categories corresponding to the top x similarity values in the descending order of step 6 whose adjacent differences are smaller than ε;
Step 702, randomly extracting z texts from each of these categories to form a new training sample set;
Step 703, repeating step 5 for each text in the new training sample set to obtain the text vector of each text;
Step 704, using the KNN algorithm, calculating the similarity between the text vector d_j of the text to be classified and the text vectors d_i of all texts in the new training sample set according to formula (4), and selecting the K most similar texts:
\mathrm{sim}(d_j, d_i) = \frac{\sum_{k=1}^{T} w'_{ik} w_{jk}}{\sqrt{\sum_{k=1}^{T} {w'_{ik}}^2} \sqrt{\sum_{k=1}^{T} w_{jk}^2}} \qquad (4)
where sim(d_j, d_i) is the similarity value, T is the dimension of the text vector of the text to be classified and of the text vectors in the new training sample set, w'_ik is the value of the k-th dimension of a text vector in the new training sample set, and w_jk is the value of the k-th dimension of the text vector of the text to be classified;
Step 705, for the selected K texts, calculating in turn the weight of each category to which they belong according to formula (5):
W(d_j, C_i) = \sum_{d_i \in \mathrm{KNN}(d_j)} \mathrm{sim}(d_j, d_i) \, y(d_i, C_i) \qquad (5)
where W(d_j, C_i) is the weight for classifying the text to be classified into category C_i, sim(d_j, d_i) is the similarity value calculated in step 704, and y(d_i, C_i) is the category attribute function: the category of each text in the new training sample set is known, and for each of the K selected texts y(d_i, C_i) is 1 if the text belongs to category C_i and 0 otherwise;
Step 706, classifying the text to be classified into the category corresponding to the maximum weight value calculated in step 705.
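A sketch of the secondary classification of step 7 and formulas (4)-(5), reusing `cosine`, `text_vector`, and `docs_by_category` from the sketches above; z and K are assumed values:

```python
# Illustrative sketch of step 7 / formulas (4)-(5): equal-size sampling from
# the candidate categories, K nearest neighbours, similarity-weighted voting.
import random
from collections import defaultdict

def secondary_classify(d_j, candidates, docs_by_category, w2v, z=50, K=15):
    # Steps 701-703: a new, evenly sampled training set of z texts per category.
    pool = []
    for cat in candidates:
        docs = docs_by_category[cat]
        for doc in random.sample(docs, min(z, len(docs))):
            pool.append((cat, text_vector(doc, w2v)))   # step 703 reuses step 5
    # Step 704 / formula (4): keep the K most similar texts.
    scored = sorted(((cosine(v, d_j), cat) for cat, v in pool), reverse=True)[:K]
    # Step 705 / formula (5): y(d_i, C_i) selects the same-category neighbours.
    weights = defaultdict(float)
    for sim, cat in scored:
        weights[cat] += sim
    # Step 706: the category with the maximum weight wins.
    return max(weights, key=weights.get)
```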
The invention proposes to extract category features directly and to establish a relation between the text and the categories, namely step 6; when the category cannot be decided from the category vectors alone, the KNN algorithm is used for further classification, and categories far from the text to be classified no longer need to be considered, namely step 7, which prunes the sample set and reduces the amount of computation. Meanwhile, whereas most traditional methods extract features from the training sample set with the TF-IDF algorithm and then build a vector space model, the invention constructs category features by combining the LDA model with the word2vec algorithm and takes the probability value of a topic word as the weight of the feature word, namely step 4.
The above description covers only preferred embodiments of the present invention and is not intended to limit it; those skilled in the art may make various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (3)

1. A news text classification method based on LDA and word2vec algorithm is characterized by comprising the following steps:
Step 1, obtaining word vectors of a corpus with the word2vec tool:
performing word segmentation on a large-scale corpus, inputting the segmented text into the word2vec tool, and training to obtain a word vector for every word in the corpus;
Step 2, performing text preprocessing on the training sample set:
performing word segmentation on the texts in the training sample set and removing stop words;
Step 3, obtaining the category core words of the training sample set through the LDA topic model:
training an LDA topic model separately on each category of the training sample set, obtaining the text-topic and topic-word probability distributions of each category, and, according to the output of the LDA topic models, taking the words in each category whose probability under the dominant topic exceeds a threshold α as the core words of that category;
Step 4, constructing the category center vector c_i of the training sample set from the word vectors a_i of the category core words, comprising the following steps:
Step 401, selecting the word vectors a_i of the core words of each category from all the word vectors of step 1;
Step 402, taking the topic-word probability value β_k obtained from the LDA topic model as the weight of each core word for its category, then summing the weighted word vectors of the same category and averaging them to obtain the category center vector c_i of that category:
c_i = \frac{1}{n} \sum_{k=1}^{n} \beta_k a_k \qquad (1)
where n is the number of core words of the category;
Step 5, after preprocessing the text to be classified, extracting its feature words to obtain the text vector d_j of the text to be classified;
Step 6, calculating the similarity between the text vector of the text to be classified and the category center vectors of the training sample set, sorting the similarity values in descending order, primarily classifying the text to be classified according to the ranking, and proceeding to step 7 when the difference between the first two similarity values in the descending order is smaller than a threshold ε;
Step 7, performing a secondary classification of the text to be classified with the KNN algorithm, comprising the following steps:
Step 701, extracting from the training text set the texts of the categories corresponding to the top x similarity values in the descending order of step 6 whose adjacent differences are smaller than ε;
Step 702, randomly extracting z texts from each of these categories to form a new training sample set;
Step 703, repeating step 5 for each text in the new training sample set to obtain the text vector of each text;
Step 704, using the KNN algorithm, calculating the similarity between the text vector d_j of the text to be classified and the text vectors d_i of all texts in the new training sample set according to formula (4), and selecting the K most similar texts:
\mathrm{sim}(d_j, d_i) = \frac{\sum_{k=1}^{T} w'_{ik} w_{jk}}{\sqrt{\sum_{k=1}^{T} {w'_{ik}}^2} \sqrt{\sum_{k=1}^{T} w_{jk}^2}} \qquad (4)
where sim(d_j, d_i) is the similarity value, T is the dimension of the text vector of the text to be classified and of the text vectors in the new training sample set, w'_ik is the value of the k-th dimension of a text vector in the new training sample set, and w_jk is the value of the k-th dimension of the text vector of the text to be classified;
Step 705, for the selected K texts, calculating in turn the weight of each category to which they belong according to formula (5):
W(d_j, C_i) = \sum_{d_i \in \mathrm{KNN}(d_j)} \mathrm{sim}(d_j, d_i) \, y(d_i, C_i) \qquad (5)
where W(d_j, C_i) is the weight for classifying the text to be classified into category C_i, sim(d_j, d_i) is the similarity value calculated in step 704, and y(d_i, C_i) is the category attribute function: the category of each text in the new training sample set is known, and for each of the K selected texts y(d_i, C_i) is 1 if the text belongs to category C_i and 0 otherwise;
Step 706, classifying the text to be classified into the category corresponding to the maximum weight value calculated in step 705.
2. The news text classification method according to claim 1, wherein step 5 specifically includes:
Step 501, preprocessing the text to be classified, including word segmentation and stop-word removal;
Step 502, extracting the text feature words with the TF-IDF algorithm:
calculating the TF-IDF value of each word according to formula (2), and taking the words whose TF-IDF value exceeds a threshold θ as the feature words w of the text to be classified:
\mathrm{tfidf}(w) = \frac{m}{M} \times \log\frac{N}{n} \qquad (2)
where m is the number of occurrences of the feature word w in the text to be classified, M is the total number of words in the text to be classified, N is the total number of texts in the training sample set, and n is the number of texts in the training sample set that contain the feature word w;
Step 503, inputting the feature words of the text to be classified into the word2vec tool to obtain their word vectors, and averaging the word vectors of all the feature words to obtain the text vector d_j of the text to be classified.
3. The news text classification method according to claim 1, wherein step 6 specifically includes:
Step 601, calculating the similarity between the text vector d_j of the text to be classified and the category center vector c_i of each category according to formula (3):
\mathrm{sim}(c_i, d_j) = \frac{\sum_{k=1}^{T} w_{ik} w_{jk}}{\sqrt{\sum_{k=1}^{T} w_{ik}^2} \sqrt{\sum_{k=1}^{T} w_{jk}^2}} \qquad (3)
where sim(c_i, d_j) is the similarity value, T is the dimension of the text vector of the text to be classified and of each category center vector, w_ik is the value of the k-th dimension of the category center vector, and w_jk is the value of the k-th dimension of the text vector of the text to be classified;
Step 602, sorting the similarity values calculated in step 601 in descending order;
Step 603, calculating the difference between the first and second similarity values in the descending order of step 602:
if the difference is larger than ε, classifying the text to be classified into the category corresponding to the first similarity value;
if the difference is smaller than ε, performing the secondary classification of step 7.
CN201710828232.XA 2017-09-14 2017-09-14 News text classification method based on LDA and word2vec algorithm Expired - Fee Related CN107609121B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710828232.XA CN107609121B (en) 2017-09-14 2017-09-14 News text classification method based on LDA and word2vec algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710828232.XA CN107609121B (en) 2017-09-14 2017-09-14 News text classification method based on LDA and word2vec algorithm

Publications (2)

Publication Number Publication Date
CN107609121A CN107609121A (en) 2018-01-19
CN107609121B (en) 2021-03-30

Family

ID=61062711

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710828232.XA Expired - Fee Related CN107609121B (en) 2017-09-14 2017-09-14 News text classification method based on LDA and word2vec algorithm

Country Status (1)

Country Link
CN (1) CN107609121B (en)

Families Citing this family (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108520030B (en) * 2018-03-27 2022-02-11 深圳中兴网信科技有限公司 Text classification method, text classification system and computer device
CN108597519B (en) * 2018-04-04 2020-12-29 百度在线网络技术(北京)有限公司 Call bill classification method, device, server and storage medium
CN108829661B (en) * 2018-05-09 2020-03-27 成都信息工程大学 News subject name extraction method based on fuzzy matching
CN108932228B (en) * 2018-06-06 2023-08-08 广东南方报业移动媒体有限公司 Live broadcast industry news and partition matching method and device, server and storage medium
CN108846097B (en) * 2018-06-15 2021-01-29 北京搜狐新媒体信息技术有限公司 User interest tag representation method, article recommendation device and equipment
CN108846120A (en) * 2018-06-27 2018-11-20 合肥工业大学 Method, system and storage medium for classifying to text set
CN108804622B (en) * 2018-08-20 2021-09-03 天津探数科技有限公司 Short text classifier construction method considering semantic background
CN109145116A (en) * 2018-09-03 2019-01-04 武汉斗鱼网络科技有限公司 A kind of file classification method, device, electronic equipment and storage medium
CN109284379B (en) * 2018-09-21 2022-01-04 福州大学 Adaptive microblog topic tracking method based on two-way quantity model
CN110969172A (en) * 2018-09-28 2020-04-07 武汉斗鱼网络科技有限公司 Text classification method and related equipment
CN110969023B (en) * 2018-09-29 2023-04-18 北京国双科技有限公司 Text similarity determination method and device
CN109446324B (en) * 2018-10-16 2020-12-15 北京字节跳动网络技术有限公司 Sample data processing method and device, storage medium and electronic equipment
CN109522408A (en) * 2018-10-30 2019-03-26 广东原昇信息科技有限公司 The classification method of information streaming material intention text
CN109684444A (en) * 2018-11-02 2019-04-26 厦门快商通信息技术有限公司 A kind of intelligent customer service method and system
CN109685109B (en) * 2018-11-26 2020-10-30 浙江工业大学 Base station label track classification method based on twin neural network
CN110046340A (en) * 2018-12-28 2019-07-23 阿里巴巴集团控股有限公司 The training method and device of textual classification model
CN109766410A (en) * 2019-01-07 2019-05-17 东华大学 A kind of newsletter archive automatic classification system based on fastText algorithm
CN109815400A (en) * 2019-01-23 2019-05-28 四川易诚智讯科技有限公司 Personage's interest extracting method based on long text
CN109947939B (en) * 2019-01-30 2022-07-05 中兴飞流信息科技有限公司 Text classification method, electronic device and computer-readable storage medium
CN111753079A (en) * 2019-03-11 2020-10-09 阿里巴巴集团控股有限公司 Text classification method and device, electronic equipment and computer-readable storage medium
CN111723199A (en) * 2019-03-19 2020-09-29 北京沃东天骏信息技术有限公司 Text classification method and device and computer readable storage medium
CN110781271A (en) * 2019-09-02 2020-02-11 国网天津市电力公司电力科学研究院 Semi-supervised network representation learning model based on hierarchical attention mechanism
CN110569351A (en) * 2019-09-02 2019-12-13 北京猎云万罗科技有限公司 Network media news classification method based on restrictive user preference
CN110674239B (en) * 2019-09-27 2022-11-04 中国航空无线电电子研究所 Automatic classification method and device for geographic elements
CN110704626B (en) * 2019-09-30 2022-07-22 北京邮电大学 Short text classification method and device
CN110795564B (en) * 2019-11-01 2022-02-22 南京稷图数据科技有限公司 Text classification method lacking negative cases
CN111177373B (en) * 2019-12-12 2023-07-14 北京明略软件系统有限公司 Method and device for acquiring training data, and model training method and device
CN111459959B (en) * 2020-03-31 2023-06-30 北京百度网讯科技有限公司 Method and apparatus for updating event sets
CN111859979A (en) * 2020-06-16 2020-10-30 中国科学院自动化研究所 Ironic text collaborative recognition method, ironic text collaborative recognition device, ironic text collaborative recognition equipment and computer readable medium
CN112069058A (en) * 2020-08-11 2020-12-11 国网河北省电力有限公司保定供电分公司 Defect disposal method based on expert database and self-learning technology
CN112052333B (en) * 2020-08-20 2024-04-30 深圳市欢太科技有限公司 Text classification method and device, storage medium and electronic equipment
CN112667806A (en) * 2020-10-20 2021-04-16 上海金桥信息股份有限公司 Text classification screening method using LDA
CN112417153B (en) * 2020-11-20 2023-07-04 虎博网络技术(上海)有限公司 Text classification method, apparatus, terminal device and readable storage medium
CN112632971B (en) * 2020-12-18 2023-08-25 上海明略人工智能(集团)有限公司 Word vector training method and system for entity matching
CN112287669B (en) * 2020-12-28 2021-05-25 深圳追一科技有限公司 Text processing method and device, computer equipment and storage medium
CN113268597B (en) * 2021-05-25 2023-06-27 平安科技(深圳)有限公司 Text classification method, device, equipment and storage medium
CN113486176B (en) * 2021-07-08 2022-11-04 桂林电子科技大学 News classification method based on secondary feature amplification
CN113255340B (en) * 2021-07-09 2021-11-02 北京邮电大学 Theme extraction method and device for scientific and technological requirements and storage medium
CN113535965A (en) * 2021-09-16 2021-10-22 杭州费尔斯通科技有限公司 Method and system for large-scale classification of texts
CN113920373A (en) * 2021-10-29 2022-01-11 平安银行股份有限公司 Object classification method and device, terminal equipment and storage medium

Citations (2)

Publication number Priority date Publication date Assignee Title
CN106844424A (en) * 2016-12-09 2017-06-13 宁波大学 A kind of file classification method based on LDA
CN107122349A (en) * 2017-04-24 2017-09-01 无锡中科富农物联科技有限公司 A kind of feature word of text extracting method based on word2vec LDA models

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US9575952B2 (en) * 2014-10-21 2017-02-21 At&T Intellectual Property I, L.P. Unsupervised topic modeling for short texts

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN106844424A (en) * 2016-12-09 2017-06-13 宁波大学 A kind of file classification method based on LDA
CN107122349A (en) * 2017-04-24 2017-09-01 无锡中科富农物联科技有限公司 A kind of feature word of text extracting method based on word2vec LDA models

Non-Patent Citations (1)

Title
"基于类中心向量的文本分类模型研究与实现";郭茂;《中国优秀硕士学位论文全文数据库 信息科技辑》;20100915(第09期);第2.1、2.5、4.5、4.7节 *

Also Published As

Publication number Publication date
CN107609121A (en) 2018-01-19

Similar Documents

Publication Publication Date Title
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN110413780B (en) Text emotion analysis method and electronic equipment
CN107391772B (en) Text classification method based on naive Bayes
CN108763213A (en) Theme feature text key word extracting method
CN109165294B (en) Short text classification method based on Bayesian classification
CN106599054B (en) Method and system for classifying and pushing questions
CN109960799B (en) Short text-oriented optimization classification method
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN109739986A (en) A kind of complaint short text classification method based on Deep integrating study
CN110134792B (en) Text recognition method and device, electronic equipment and storage medium
CN105808526A (en) Commodity short text core word extracting method and device
WO2021068683A1 (en) Method and apparatus for generating regular expression, server, and computer-readable storage medium
CN104392006B (en) A kind of event query processing method and processing device
CN108804595B (en) Short text representation method based on word2vec
CN102289522A (en) Method of intelligently classifying texts
CN106570170A (en) Text classification and naming entity recognition integrated method and system based on depth cyclic neural network
CN109522544A (en) Sentence vector calculation, file classification method and system based on Chi-square Test
CN106503153B (en) A kind of computer version classification system
CN110990676A (en) Social media hotspot topic extraction method and system
CN106528768A (en) Consultation hotspot analysis method and device
US20230074771A1 (en) Hierarchical clustering on graphs for taxonomy extraction and applications thereof
CN111368529B (en) Mobile terminal sensitive word recognition method, device and system based on edge calculation
CN108153899B (en) Intelligent text classification method
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
CN110569351A (en) Network media news classification method based on restrictive user preference

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210223

Address after: No. 601, Huangpu Avenue West, Shenzhen, Guangdong 510632

Applicant after: Jinan University

Address before: 518057 room 503, block C, building 5, Shenzhen Bay ecological science and Technology Park, Yuehai street, Nanshan District, Shenzhen City, Guangdong Province

Applicant before: SHENZHEN MATENG TECHNOLOGY Co.,Ltd.

CB02 Change of applicant information

Address after: 510632 No. 601, Whampoa Avenue, Guangzhou, Guangdong

Applicant after: Jinan University

Address before: No. 601, Huangpu Avenue West, Shenzhen, Guangdong 510632

Applicant before: Jinan University

GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210330

Termination date: 20210914
