News text classification method based on LDA and word2vec algorithm
Technical Field
The invention relates to the technical field of text classification, in particular to a news text classification method based on the LDA and word2vec algorithms.
Background
At present, the most widely used text representation methods are based on the bag-of-words model, which regards a document as a set of words in which each word occurs independently; word order, grammar, and semantics are not considered. The feature items of the training text set are organized into a vector space model, each document is represented as a vector of the same dimension as the model, and the value at each position of the vector is the weight, in the training sample set, of the word that the position represents. The main problems of this method are as follows:
(1) vector dimension is too high:
the dimension of the vector equals the number of feature items retained in the whole training sample set, which can reach tens or even hundreds of thousands; this leads to the curse of dimensionality, and the text vectors occupy a large amount of storage space;
(2) data sparseness:
a document vector has nonzero weights only at the positions of the feature items that appear in that document, while the weights at most other positions are 0; this reduces computational efficiency in the text classification task and wastes storage space;
(3) semantic information of a document cannot be represented well:
the bag-of-words model assumes that the words in a document are completely independent and discards the semantic relations between them; for two documents that are semantically similar but share no feature words, the similarity computed from their bag-of-words vectors is 0.
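The semantic blind spot of the bag-of-words representation can be seen in a small sketch: two documents that describe the same thing in different words share no vocabulary positions, so their cosine similarity is exactly 0. The documents and vocabulary below are made up for illustration:

```python
import math

# Two semantically similar documents with no shared feature words.
doc_a = ["car", "fast", "road"]
doc_b = ["automobile", "quick", "highway"]

vocab = sorted(set(doc_a) | set(doc_b))

def bow_vector(doc):
    # Term-frequency bag-of-words vector over the shared vocabulary.
    return [doc.count(w) for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(cosine(bow_vector(doc_a), bow_vector(doc_b)))  # 0.0: no shared words
```

No matter how close the meanings are, the dot product over disjoint vocabularies is 0, which is exactly the failure mode the method of the invention addresses with word2vec vectors.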
The KNN algorithm is simple in principle, easy to implement, and offers good stability and accuracy, making it one of the classic algorithms applied to text classification at present; however, it has the following two main defects:
(1) when the training sample set is large, the KNN algorithm is inefficient:
the ordinary KNN algorithm must compute the similarity between the feature vector of the text to be classified and the feature vectors of all samples in the training set, select the K nearest training samples, count the classes to which those samples belong, and finally assign the text to be classified to the most frequent class; computing the similarity against the whole training sample set is the key factor behind the low efficiency of the KNN algorithm;
(2) all attributes carry the same weight, which affects the accuracy of the classification result:
when the classes in the training sample set are unevenly distributed, for example when one class has many samples and the other classes few, the large class may account for most of the K nearest neighbours of an input text; since the KNN algorithm considers only the nearest neighbours, a text that is not actually close to that class may still be misclassified into it, which harms classification accuracy.
Disclosure of Invention
In order to solve the above problems, the invention provides a news text classification method based on the LDA and word2vec algorithms. The similarity between the feature vector of the text to be classified and the class center vectors is computed for a primary classification, which greatly reduces the amount of computation; when the primary classification cannot clearly decide the class, a secondary classification is performed with the KNN algorithm, in which an equal number of samples is drawn from each candidate class in the pruned sample set, eliminating the influence of uneven sample distribution on classification accuracy.
The invention provides a news text classification method based on LDA and word2vec algorithms, which comprises the following steps:
step 1, obtaining word vectors of a corpus by a word2vec tool:
performing word segmentation on a large-scale corpus, inputting the text after word segmentation into a word2vec tool, and training to obtain word vectors of all words in the corpus;
step 2, performing text preprocessing on the training sample set:
performing word segmentation on the text in the training sample set and removing stop words;
step 3, obtaining the category core words of the training sample set through the LDA topic model:
training an LDA topic model separately on each category of the training sample set to obtain each category's text-topic and topic-word probability distributions; according to the output of the LDA topic models, taking the words whose probability under the most probable topic of the category exceeds a threshold α as the core words of that category;
step 4, constructing the class center vector c_i of the training sample set from the word vectors a_i of the category core words;
step 5, after preprocessing the text to be classified, extracting its feature words to obtain the text vector d_j of the text to be classified;
step 6, computing the similarity between the text vector of the text to be classified and each class center vector of the training sample set, sorting the similarity values in descending order, performing a primary classification of the text according to this ordering, and proceeding to step 7 when the difference between the first two similarity values in the descending order is smaller than a threshold ε;
step 7, performing secondary classification on the text to be classified with the KNN algorithm.
As a further improvement of the present invention, step 4 specifically includes:
step 401, selecting the word vectors a_i of the core words of each category from the word vectors obtained in step 1;
step 402, taking the topic-word probability value β_i obtained from the LDA topic model as the weight of each word for the category, adding the weighted word vectors of the same category, and taking the mean as the class center vector c_i of the category, as expressed in formula (1);
as a further improvement of the present invention, step 5 specifically includes:
step 501, preprocessing a text to be classified, including word segmentation and stop word removal;
step 502, extracting text feature words by adopting a TF-IDF algorithm:
calculating the TF-IDF value of each word according to formula (2), and taking the words whose TF-IDF value is greater than a threshold θ as the feature words w of the text to be classified;
in the formula, m is the number of occurrences of the feature word w in the text to be classified, M is the total number of words in the text to be classified, N is the total number of texts in the training sample set, and n is the number of texts in the training sample set that contain the feature word w;
step 503, inputting the feature words of the text to be classified into the word2vec tool to obtain their word vectors, then adding the word vectors of all the feature words and taking the mean to obtain the text vector d_j of the text to be classified.
As a further improvement of the present invention, step 6 specifically includes:
step 601, computing the similarity between the text vector d_j of the text to be classified and the class center vector c_i of each category, as shown in formula (3);
in the formula, sim(c_i, d_j) is the similarity value, T is the common dimension of the text vector of the text to be classified and the class center vectors, w_ik is the value of the k-th dimension of the class center vector, and w_jk is the value of the k-th dimension of the text vector of the text to be classified;
step 602, sorting the similarity values calculated in step 601 in descending order;
step 603, calculating the difference between the first and second similarity values in the descending order of step 602:
if the difference is greater than ε, classifying the text to be classified into the category corresponding to the first similarity value;
if the difference is smaller than ε, performing the secondary classification of step 7.
As a further improvement of the present invention, step 7 specifically includes:
step 701, extracting from the training text set the texts of the categories corresponding to the first x similarity values in the descending order of step 6, where the differences between these adjacent similarity values are smaller than ε;
step 702, randomly extracting z texts from each of these categories to form a new training sample set;
step 703, repeating step 5 for each text in the new training sample set to obtain a text vector of each text;
step 704, using the KNN algorithm, computing the similarity between the text vector d_j of the text to be classified and the text vectors d_i of all texts in the new training sample set according to formula (4), and selecting the K most similar texts;
in the formula, sim(d_j, d_i) is the similarity value, T is the common dimension of the text vector of the text to be classified and the text vectors in the new training sample set, w'_ik is the value of the k-th dimension of a text vector in the new training sample set, and w_jk is the value of the k-th dimension of the text vector of the text to be classified;
step 705, for each of the selected K texts, calculating the weight of the category to which it belongs according to formula (5);
in the formula, W(d_j, C_i) is the weight for assigning the text to be classified to class C_i, sim(d_j, d_i) is the similarity value calculated in step 704, and y(d_i, C_i) is the class attribute function: the class of every text in the new training sample set is known, and for each of the K selected texts, y(d_i, C_i) is 1 if the text belongs to class C_i and 0 otherwise;
step 706, classifying the text to be classified into the category corresponding to the maximum weight value calculated in step 705.
The invention has the beneficial effects that:
1. word vectors trained with the word2vec tool are used to represent text information. The word2vec model uses the context of each word in the text to map it to a low-dimensional real-valued vector, so that the semantic similarity of words is reflected in the distance between their vectors. In constructing the text vector, vector concatenation is replaced by adding the word vectors of the keywords and taking the mean, which effectively solves the problem of high vector dimensionality and removes the limitation on the number of keywords;
2. the invention provides a method for constructing category features by combining the LDA (latent Dirichlet allocation) model with the word2vec algorithm, taking the probability value of the topic word as the weight of the feature word. This incorporates both the contribution of different words to the same category and the contribution of the same word to different categories. Because word2vec captures the semantic relations between words, adding the word vectors and taking the mean to represent a text preserves the similarity information between texts while keeping the dimension of the text vector small, so the amount of computation is greatly reduced when the similarity between the feature vector of the text to be classified and the class center vectors is calculated;
3. in the process of classifying texts, traditional methods mostly consider only the similarity between texts. The invention directly extracts class features and establishes the relation between texts and classes; when the primary classification is not enough to clearly decide the class, a secondary classification is performed with the KNN algorithm. At this point, classes far from the text to be classified no longer need to be considered, and an equal number of samples is drawn from each candidate class in the pruned sample set, eliminating the influence of uneven sample distribution on classification accuracy.
Drawings
Fig. 1 is a flowchart illustrating a news text classification method based on LDA and word2vec algorithms according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to specific embodiments and with reference to the attached drawings.
As shown in fig. 1, a news text classification method based on LDA and word2vec algorithms according to an embodiment of the present invention includes:
step 1, obtaining word vectors of a corpus by a word2vec tool:
Performing word segmentation on the large-scale corpus, inputting the segmented text into the word2vec tool, and training to obtain the word vector of every word in the corpus.
Step 2, performing text preprocessing on the training sample set:
Performing word segmentation on the texts in the training sample set and removing stop words.
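A minimal preprocessing sketch is shown below; the whitespace tokenizer and the tiny stop-word list are illustrative stand-ins for a real word segmenter and stop-word dictionary suited to the corpus language:

```python
# Minimal preprocessing sketch: tokenization plus stop-word removal.
# The stop-word list and the whitespace tokenizer are illustrative stand-ins;
# in practice a segmenter suited to the corpus language would be used.
STOP_WORDS = {"the", "a", "of", "is", "and", "to", "in"}

def preprocess(text):
    tokens = text.lower().split()          # stand-in for real word segmentation
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The growth of the economy is strong"))
# ['growth', 'economy', 'strong']
```

The same routine is reused in step 501 for the text to be classified, so both the training texts and the input text pass through identical preprocessing.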
Step 3, obtaining the category core words of the training sample set through the LDA topic model:
training an LDA topic model separately on each category of the training sample set to obtain each category's text-topic and topic-word probability distributions; according to the output of the LDA topic models, taking the words whose probability under the most probable topic of the category exceeds a threshold α as the core words of that category.
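The core-word selection of this step can be sketched as follows, assuming a per-category LDA model has already been trained; the toy topic-word probabilities, topic weights, and the threshold α are made up for illustration:

```python
# Toy topic-word distribution for one category (topic -> {word: probability}),
# standing in for the output of a per-category LDA model.
topic_word = {
    0: {"economy": 0.12, "market": 0.09, "growth": 0.03},
    1: {"economy": 0.02, "stock": 0.08, "bank": 0.05},
}
# Toy aggregate text-topic proportions for the category's documents:
topic_weight = {0: 0.7, 1: 0.3}

ALPHA = 0.05  # illustrative threshold

# The most probable topic of the category:
top_topic = max(topic_weight, key=topic_weight.get)

# Core words: words whose probability under the top topic exceeds alpha.
core_words = sorted(w for w, p in topic_word[top_topic].items() if p > ALPHA)
print(core_words)
# ['economy', 'market']
```

Only words strongly associated with the category's dominant topic survive the α filter, which keeps the later class center vectors focused on genuinely category-specific vocabulary.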
Step 4, constructing the class center vector c_i of the training sample set from the word vectors a_i of the category core words:
Step 401, selecting the word vectors a_i of the core words of each category from the word vectors obtained in step 1;
Step 402, taking the topic-word probability value β_i obtained from the LDA topic model as the weight of each word for the category, adding the weighted word vectors of the same category, and taking the mean as the class center vector c_i of the category, as expressed in formula (1):
c_i = (1/n_i) · Σ_w β_w · a_w   (1)
where the sum runs over the n_i core words w of category i.
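The construction of steps 401-402 can be sketched as follows; the core words, their toy 3-dimensional word vectors, and the β weights are made up for illustration (real word2vec vectors typically have hundreds of dimensions):

```python
# Class center vector as the beta-weighted mean of the core-word vectors
# (toy 3-dimensional word vectors; real word2vec vectors are much larger).
core = {
    "economy": ([0.9, 0.1, 0.0], 0.12),  # (word vector a_w, weight beta_w)
    "market":  ([0.7, 0.3, 0.1], 0.09),
}

dim = 3
center = [0.0] * dim
for vec, beta in core.values():
    for k in range(dim):
        center[k] += beta * vec[k]
center = [x / len(core) for x in center]   # mean of the weighted vectors
print([round(x, 4) for x in center])
```

Because the β weights come from the topic-word distribution, words that matter more to the category pull the center vector further in their direction.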
Step 5, after preprocessing the text to be classified, extracting its feature words to obtain the text vector d_j of the text to be classified:
Step 501, preprocessing a text to be classified, including word segmentation and stop word removal;
step 502, extracting text feature words by adopting a TF-IDF algorithm:
calculating the TF-IDF value of each word according to formula (2), and taking the words whose TF-IDF value is greater than a threshold θ as the feature words w of the text to be classified:
tf-idf(w) = (m/M) × log(N/n)   (2)
in the formula, m is the number of occurrences of the feature word w in the text to be classified, M is the total number of words in the text to be classified, N is the total number of texts, and n is the number of texts containing the feature word w;
step 503, inputting the feature words of the text to be classified into the word2vec tool to obtain their word vectors, then adding the word vectors of all the feature words and taking the mean to obtain the text vector d_j of the text to be classified.
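The feature selection of step 502 can be sketched as follows, using the TF-IDF form (m/M)·log(N/n) implied by the variable definitions above; the document tokens, document frequencies, and the threshold θ are made up for illustration:

```python
import math

# TF-IDF feature selection sketch: tf-idf(w) = (m/M) * log(N/n).
doc = ["economy", "growth", "economy", "policy"]   # tokens of the text to classify
N = 100                                            # total number of texts
df = {"economy": 5, "growth": 40, "policy": 60}    # texts containing each word (toy)
THETA = 0.2                                        # illustrative threshold

M = len(doc)

def tf_idf(w):
    m = doc.count(w)
    return (m / M) * math.log(N / df[w])

features = sorted(w for w in set(doc) if tf_idf(w) > THETA)
print(features)
# ['economy', 'growth']
```

Step 503 would then look up the word2vec vector of each selected feature word and average them to form the text vector d_j.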
Step 6, performing similarity calculation on the text vector of the text to be classified and the category center vector of the training sample set, sorting the similarity values in a descending order, and classifying the text to be classified according to the sorting:
Step 601, computing the similarity between the text vector d_j of the text to be classified and the class center vector c_i of each category by cosine similarity, as shown in formula (3):
sim(c_i, d_j) = Σ_{k=1..T} w_ik·w_jk / ( sqrt(Σ_{k=1..T} w_ik²) · sqrt(Σ_{k=1..T} w_jk²) )   (3)
in the formula, sim(c_i, d_j) is the similarity value, T is the common dimension of the text vector of the text to be classified and the class center vectors, w_ik is the value of the k-th dimension of the class center vector, and w_jk is the value of the k-th dimension of the text vector of the text to be classified;
step 602, sorting the similarity values calculated in step 601 in descending order;
step 603, calculating the difference between the first and second similarity values in the descending order of step 602:
if the difference is greater than ε, classifying the text to be classified into the category corresponding to the first similarity value;
if the difference is smaller than ε, performing the secondary classification of step 7.
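Steps 601-603 can be sketched as follows; the class center vectors, the text vector, and the threshold ε are made up for illustration, and cosine similarity is assumed as the similarity measure:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy class center vectors and a text vector (3 dimensions for illustration).
centers = {"finance": [0.9, 0.1, 0.0],
           "sports":  [0.1, 0.9, 0.1],
           "tech":    [0.5, 0.4, 0.6]}
d_j = [0.8, 0.2, 0.1]
EPSILON = 0.05  # illustrative threshold

# Step 601-602: similarity to each class center, sorted in descending order.
ranked = sorted(((cosine(c, d_j), name) for name, c in centers.items()),
                reverse=True)

# Step 603: compare the gap between the top two similarities with epsilon.
gap = ranked[0][0] - ranked[1][0]
if gap > EPSILON:
    print("primary classification:", ranked[0][1])
else:
    print("ambiguous, fall through to KNN secondary classification")
```

Only when the top two classes are nearly tied (gap below ε) does the method pay the cost of the KNN secondary classification, which is where the computational savings come from.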
Step 7, performing secondary classification on the text to be classified with the KNN algorithm:
step 701, extracting from the training text set the texts of the categories corresponding to the first x similarity values in the descending order of step 6, where the differences between these adjacent similarity values are smaller than ε;
step 702, randomly extracting z texts from each of these categories to form a new training sample set;
step 703, repeating step 5 for each text in the new training sample set to obtain a text vector of each text;
step 704, using the KNN algorithm, computing the similarity between the text vector d_j of the text to be classified and the text vectors d_i of all texts in the new training sample set, as shown in formula (4), and selecting the K most similar texts:
sim(d_j, d_i) = Σ_{k=1..T} w_jk·w'_ik / ( sqrt(Σ_{k=1..T} w_jk²) · sqrt(Σ_{k=1..T} w'_ik²) )   (4)
in the formula, sim(d_j, d_i) is the similarity value, T is the common dimension of the text vector of the text to be classified and the text vectors in the new training sample set, w'_ik is the value of the k-th dimension of a text vector in the new training sample set, and w_jk is the value of the k-th dimension of the text vector of the text to be classified;
step 705, for each of the selected K texts, calculating the weight of the category to which it belongs according to formula (5):
W(d_j, C_i) = Σ_{d_i ∈ KNN} sim(d_j, d_i) · y(d_i, C_i)   (5)
in the formula, W(d_j, C_i) is the weight for assigning the text to be classified to class C_i, sim(d_j, d_i) is the similarity value calculated in step 704, and y(d_i, C_i) is the class attribute function: the class of every text in the new training sample set is known, and for each of the K selected texts, y(d_i, C_i) is 1 if the text belongs to class C_i and 0 otherwise;
step 706, classifying the text to be classified into the category corresponding to the maximum weight value calculated in step 705.
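The weighted vote of steps 705-706 can be sketched as follows; the neighbour similarities and labels are made up for illustration, and the example is chosen so that the summed-similarity winner differs from a plain majority vote:

```python
# Weighted KNN vote over the K nearest neighbours:
# W(d_j, C_i) = sum of sim(d_j, d_i) over neighbours d_i belonging to C_i.
# Neighbour similarities and labels below are made-up toy values.
neighbours = [  # (similarity to d_j, class label)
    (0.95, "finance"),
    (0.93, "finance"),
    (0.64, "tech"),
    (0.62, "tech"),
    (0.60, "tech"),
]

weights = {}
for sim, label in neighbours:
    weights[label] = weights.get(label, 0.0) + sim

# Step 706: the class with the largest accumulated weight wins.
predicted = max(weights, key=weights.get)
print(predicted)
# finance: 0.95 + 0.93 = 1.88 beats tech's 0.64 + 0.62 + 0.60 = 1.86
```

Note that tech has more neighbours but finance wins on summed similarity; weighting by sim(d_j, d_i) is what lets two very close neighbours outvote three distant ones, mitigating the majority-class bias described in the background.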
The invention provides a method for directly extracting class features and establishing the relation between a text and its class (step 6). When the class cannot be clearly decided from the class vectors alone, the KNN algorithm is used for further classification; at this point, classes far from the text to be classified no longer need to be considered (step 7), so the sample set is pruned and the amount of computation is reduced. Meanwhile, most traditional methods extract features from the training sample set with the TF-IDF algorithm and then build a vector space model; the invention instead constructs category features by combining the LDA model with the word2vec algorithm and uses the probability value of the topic word as the weight of the feature word (step 4).
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.