News text classification method based on LDA and word2vec algorithms
Technical field
The present invention relates to the technical field of document classification, and in particular to a news text classification method based on LDA and word2vec algorithms.
Background technology
The most popular document representation methods at present are based on the bag-of-words model. The bag-of-words model treats a document as a set of words whose occurrences are mutually independent, ignoring information such as word order, syntax, and semantics. It organizes the feature items of the training text set into a vector space model, and every document is represented as a vector with the same dimension as the model, where the value at each position is the weight, over the training sample set, of the word that the position represents. This method has the following main problems:
(1) The vector dimension is too high:
The dimension of the vector equals the number of feature items retained over the whole training sample set, which can reach tens of thousands or even hundreds of thousands, causing the "curse of dimensionality"; such text vectors also occupy a large amount of storage space;
(2) Sparsity:
A document vector has a weight value only at the positions of feature items that appear in the document; the weight at most other positions is 0, which reduces computational efficiency in text classification tasks and wastes storage space;
(3) The semantic information of the document is poorly represented:
The bag-of-words model assumes the words in a document are completely independent, ignoring the semantic relations between them. For two documents that are semantically close but share no feature words, the text similarity computed from bag-of-words vectors is 0.
The KNN algorithm is simple in principle and easy to implement, with high stability and high accuracy; it is currently one of the classic algorithms applied to text classification. Its deficiencies are mainly the following two points:
(1) When the training sample set is large, the KNN algorithm is inefficient:
The common KNN algorithm computes the similarity between the feature vector of the text to be classified and the feature vectors of all samples in the training set, selects the K nearest training samples, counts how many of them belong to each category, and finally assigns the text to be classified to the category with the largest count. Computing similarity between the feature vector of the text to be classified and the feature vectors of all texts in the training sample set is the key factor in KNN's low efficiency;
(2) Every attribute carries the same weight, which affects classification accuracy:
When the category distribution of the training sample set is unbalanced, e.g. one category has very many samples while other categories have very few, the samples of the large category may dominate the K nearest neighbors of an input text. Because the KNN algorithm only considers the "nearest" neighbor samples, a text that is not itself close to a large category may still be wrongly assigned to it, harming classification accuracy.
The content of the invention
To solve the above problems, the object of the present invention is to provide a news text classification method based on LDA and word2vec algorithms: the feature vector of the text to be classified is compared against the class center vectors by similarity for a preliminary classification, greatly reducing the amount of computation; when the preliminary classification is not sufficient to decide the category clearly, a secondary classification is performed with the KNN algorithm, and samples are drawn in equal amounts per category from the pruned new sample set, eliminating the effect of unbalanced sample distribution on classification accuracy.
The present invention provides a news text classification method based on LDA and word2vec algorithms, comprising:
Step 1: obtain the word vectors of the corpus with the word2vec tool:
Perform word segmentation on a large-scale corpus, feed the segmented text into the word2vec tool, and train to obtain the word vector of each word in the corpus;
Step 2: preprocess the texts of the training sample set:
Segment the texts in the training sample set and remove stop words;
Step 3: obtain the class core words of the training sample set through LDA topic models:
Train an LDA topic model separately on each category of the training sample set; after training, the text-topic and topic-word probability distributions of each category are obtained. According to the LDA output, take the words in each category whose maximum topic probability value exceeds a threshold α as the core words of that category;
Step 4: construct the class center vectors c_i of the training sample set from the word vectors a_i of the class core words;
Step 5: preprocess the text to be classified and extract its feature words to obtain the text vector d_j of the text to be classified;
Step 6: compute the similarity between the text vector of the text to be classified and the class center vectors of the training sample set, sort the similarity values in descending order, and perform a preliminary classification according to the ranking; when the difference between the first two similarity values in the descending order is less than a threshold ε, proceed to step 7;
Step 7: perform a secondary classification of the text to be classified with the KNN algorithm.
As a further improvement of the present invention, step 4 specifically comprises:
Step 401: select the word vectors a_i of the core words of each category from the word vectors of step 1;
Step 402: take the topic-word probability value obtained by the LDA topic model as the weight of each core word for its category, then add the weighted word vectors within the same category and average them to obtain the class center vector c_i of that category, expressed as formula (1): c_i = (1/n)·Σ_{t=1}^{n} p_t·a_t, where n is the number of core words of the category and p_t is the weight of core word t;
As a further improvement of the present invention, step 5 specifically comprises:
Step 501: preprocess the text to be classified, including word segmentation and stop-word removal;
Step 502: extract the text feature words with the TF-IDF algorithm:
Compute the TF-IDF value of each word according to formula (2), i.e. TF-IDF = (m/M)·log(N/n), and take the words whose TF-IDF value exceeds a threshold θ as the feature words w of the text to be classified;
where m is the number of occurrences of feature word w in the text to be classified, M is the total number of words in the text to be classified, N is the total number of texts in the training sample set, and n is the number of texts in the training sample set containing feature word w;
Step 503: input the feature words of the text to be classified into the word2vec tool to obtain their word vectors, then add the word vectors of all the feature words and average them to obtain the text vector d_j of the text to be classified.
As a further improvement of the present invention, step 6 specifically comprises:
Step 601: compute the similarity between the text vector d_j of the text to be classified and the class center vector c_i of each category according to formula (3), i.e. sim(c_i, d_j) = Σ_{k=1}^{T} w_ik·w_jk / (√(Σ_{k=1}^{T} w_ik²)·√(Σ_{k=1}^{T} w_jk²));
where sim(c_i, d_j) is the similarity value, T is the dimension of the text vector and of the class center vectors, w_ik is the value of the k-th dimension of the class center vector, and w_jk is the value of the k-th dimension of the text vector of the text to be classified;
Step 602: sort the similarity values computed in step 601 in descending order;
Step 603: compute the difference between the first and the second similarity value in the descending order of step 602:
if the difference is greater than ε, classify the text into the category corresponding to the first similarity value;
if the difference is less than ε, perform the secondary classification of step 7.
As a further improvement of the present invention, step 7 specifically comprises:
Step 701: extract from the training text set the texts of the categories corresponding to the top x similarity values in the descending order of step 6 whose adjacent differences are less than ε;
Step 702: randomly select z texts from each of these categories to form a new training sample set;
Step 703: repeat step 5 for every text in the new training sample set to obtain the text vector of each text;
Step 704: with the KNN algorithm, compute the similarity between the text vector d_j of the text to be classified and the text vectors d_i of all texts in the new training sample set according to formula (4), and select the K most similar texts;
where sim(d_j, d_i) is the similarity value, T is the dimension of the text vector of the text to be classified and of the text vectors in the new training sample set, w'_ik is the value of the k-th dimension of a text vector in the new training sample set, and w_jk is the value of the k-th dimension of the text vector of the text to be classified;
Step 705: for the K selected texts, compute in turn the weight of each text's category according to formula (5), i.e. W(d_j, C_i) = Σ sim(d_j, d_i)·y(d_i, C_i) summed over the K selected texts;
where W(d_j, C_i) is the weight value with which the text to be classified belongs to category C_i, sim(d_j, d_i) is the similarity value computed in step 704, and y(d_i, C_i) is the category attribute function: the category of each text in the new training sample set is known, so for each of the K selected texts the function value is 1 if the text belongs to category C_i and 0 otherwise;
Step 706: classify the text to be classified into the category with the maximum weight value computed in step 705.
Beneficial effects of the present invention are:
1. The present invention represents text information with word vectors trained by the word2vec tool. The word2vec model uses the contextual information of words in text to convert each word into a low-dimensional real-valued vector, so that the semantic similarity of words is captured by the distance between vectors. For the construction of text vectors, adding and averaging the keyword word vectors replaces vector concatenation, which solves the problem of high vector dimensionality and also removes the limitation on the number of keywords that can be chosen;
2. Traditional methods for extracting features from the training sample set mostly use the TF-IDF algorithm and construct a vector space model. The present invention proposes constructing category features by combining the LDA model with the word2vec algorithm, taking the topic-word probability value as the weight of each feature word. This method accounts for the different contributions of different words within the same category, and of the same word across different categories. Because word2vec captures the semantic relations between words, the present invention adds and averages the word vectors to represent a text, preserving the similarity information between texts while keeping the dimension of the text vector from growing too large; therefore, computing the similarity between the feature vector of the text to be classified and the class center vectors requires greatly reduced computation;
3. In the process of classifying text, conventional methods mostly consider only the similarity between texts. The present invention proposes directly extracting category features and establishing a link between texts and categories. When the preliminary classification is not sufficient to decide the category clearly, the KNN algorithm is reused for a secondary classification, and at that point the categories far from the text to be classified need not be considered. Samples are drawn in equal amounts per category from the pruned new sample set, eliminating the effect of unbalanced sample distribution on classification accuracy.
Brief description of the drawings
Fig. 1 is a schematic flow chart of a news text classification method based on LDA and word2vec algorithms according to an embodiment of the present invention.
Embodiment
The present invention is described in further detail below through a specific embodiment and with reference to the accompanying drawing.
As shown in Fig. 1, a news text classification method based on LDA and word2vec algorithms according to an embodiment of the present invention comprises:
Step 1: obtain the word vectors of the corpus with the word2vec tool:
Perform word segmentation on a large-scale corpus, feed the segmented text into the word2vec tool, and train to obtain the word vector of each word in the corpus.
Step 2: preprocess the texts of the training sample set:
Segment the texts in the training sample set and remove stop words.
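Step 2's preprocessing can be sketched in plain Python (the stop-word list is a hypothetical stand-in, and whitespace splitting stands in for a real segmenter such as jieba):

```python
STOP_WORDS = {"the", "a", "an", "of", "is", "on"}  # illustrative stop-word list

def preprocess(text):
    """Segment the text (here: whitespace split) and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

print(preprocess("The market is a barometer of the economy"))
# -> ['market', 'barometer', 'economy']
```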
Step 3: obtain the class core words of the training sample set through LDA topic models:
Train an LDA topic model separately on each category of the training sample set; after training, the text-topic and topic-word probability distributions of each category are obtained. According to the LDA output, take the words in each category whose maximum topic probability value exceeds the threshold α as the core words of that category.
Step 4: construct the class center vectors c_i of the training sample set from the word vectors a_i of the class core words:
Step 401: select the word vectors a_i of the core words of each category from the word vectors of step 1;
Step 402: take the topic-word probability value obtained by the LDA topic model as the weight of each core word for its category, then add the weighted word vectors within the same category and average them to obtain the class center vector c_i of that category, expressed as formula (1): c_i = (1/n)·Σ_{t=1}^{n} p_t·a_t, where n is the number of core words of the category and p_t is the weight of core word t;
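The weighted average of step 402 (formula (1)) can be sketched with NumPy; the core words, weights, and 4-dimensional vectors below are hypothetical stand-ins for the LDA topic-word probabilities and the word2vec outputs:

```python
import numpy as np

# Hypothetical core words of one category, with their topic-word
# probabilities (weights) and 4-dimensional word vectors a_t.
weights = {"market": 0.30, "stock": 0.25, "earnings": 0.20}
vectors = {
    "market":   np.array([0.1, 0.4, 0.0, 0.2]),
    "stock":    np.array([0.3, 0.1, 0.5, 0.0]),
    "earnings": np.array([0.2, 0.2, 0.1, 0.4]),
}

# Formula (1): weight each core word vector, add them, and average.
center = sum(weights[w] * vectors[w] for w in weights) / len(weights)
print(center)  # class center vector c_i, same dimension as the word vectors
```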
Step 5: preprocess the text to be classified and extract its feature words to obtain the text vector d_j of the text to be classified:
Step 501: preprocess the text to be classified, including word segmentation and stop-word removal;
Step 502: extract the text feature words with the TF-IDF algorithm:
Compute the TF-IDF value of each word according to formula (2), i.e. TF-IDF = (m/M)·log(N/n), and take the words whose TF-IDF value exceeds the threshold θ as the feature words w of the text to be classified;
where m is the number of occurrences of feature word w in the text to be classified, M is the total number of words in the text to be classified, N is the total number of texts in the training sample set, and n is the number of texts in the training sample set containing feature word w;
Step 503: input the feature words of the text to be classified into the word2vec tool to obtain their word vectors, then add the word vectors of all the feature words and average them to obtain the text vector d_j of the text to be classified.
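Formula (2), as determined by the variable definitions above, can be sketched directly (the counts are illustrative, and the logarithm base is an assumption since the source does not state it):

```python
import math

def tf_idf(m, M, N, n):
    """Formula (2): TF-IDF = (m / M) * log(N / n), natural log assumed."""
    return (m / M) * math.log(N / n)

# A word occurring 3 times in a 100-word text, where 50 of the 1000
# training texts contain the word (illustrative numbers).
score = tf_idf(m=3, M=100, N=1000, n=50)
print(score > 0)  # words scoring above the threshold theta become feature words
```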
Step 6: compute the similarity between the text vector of the text to be classified and the class center vectors of the training sample set, sort the similarity values in descending order, and classify the text to be classified according to the ranking:
Step 601: compute the similarity between the text vector d_j of the text to be classified and the class center vector c_i of each category according to formula (3), i.e. sim(c_i, d_j) = Σ_{k=1}^{T} w_ik·w_jk / (√(Σ_{k=1}^{T} w_ik²)·√(Σ_{k=1}^{T} w_jk²));
where sim(c_i, d_j) is the similarity value, T is the dimension of the text vector and of the class center vectors, w_ik is the value of the k-th dimension of the class center vector, and w_jk is the value of the k-th dimension of the text vector of the text to be classified;
Step 602: sort the similarity values computed in step 601 in descending order;
Step 603: compute the difference between the first and the second similarity value in the descending order of step 602:
if the difference is greater than ε, classify the text into the category corresponding to the first similarity value;
if the difference is less than ε, perform the secondary classification of step 7.
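Steps 601-603 can be sketched as follows (the vectors, category names, and threshold ε are illustrative; the similarity is the cosine of formula (3)):

```python
import numpy as np

def cosine(a, b):
    """Formula (3): sim = sum(w_ik * w_jk) / (|c_i| * |d_j|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

d_j = np.array([0.2, 0.5, 0.1])                 # text vector (illustrative)
centers = {"finance": np.array([0.3, 0.4, 0.2]),
           "sports":  np.array([0.9, 0.1, 0.0])}
EPSILON = 0.05

# Steps 601-603: similarity to each class center, sorted in descending order;
# a clear gap between the top two values decides the preliminary classification.
sims = sorted(((cosine(d_j, c), name) for name, c in centers.items()),
              reverse=True)
(first, first_cat), (second, _) = sims[0], sims[1]
label = first_cat if first - second > EPSILON else "secondary KNN pass needed"
print(label)  # -> finance
```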
Step 7: perform a secondary classification of the text to be classified with the KNN algorithm:
Step 701: extract from the training text set the texts of the categories corresponding to the top x similarity values in the descending order of step 6 whose adjacent differences are less than ε;
Step 702: randomly select z texts from each of these categories to form a new training sample set;
Step 703: repeat step 5 for every text in the new training sample set to obtain the text vector of each text;
Step 704: with the KNN algorithm, compute the similarity between the text vector d_j of the text to be classified and the text vectors d_i of all texts in the new training sample set according to formula (4), and select the K most similar texts;
where sim(d_j, d_i) is the similarity value, T is the dimension of the text vector of the text to be classified and of the text vectors in the new training sample set, w'_ik is the value of the k-th dimension of a text vector in the new training sample set, and w_jk is the value of the k-th dimension of the text vector of the text to be classified;
Step 705: for the K selected texts, compute in turn the weight of each text's category according to formula (5), i.e. W(d_j, C_i) = Σ sim(d_j, d_i)·y(d_i, C_i) summed over the K selected texts;
where W(d_j, C_i) is the weight value with which the text to be classified belongs to category C_i, sim(d_j, d_i) is the similarity value computed in step 704, and y(d_i, C_i) is the category attribute function: the category of each text in the new training sample set is known, so for each of the K selected texts the function value is 1 if the text belongs to category C_i and 0 otherwise;
Step 706: classify the text to be classified into the category with the maximum weight value computed in step 705.
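Steps 704-706 reduce to a similarity-weighted vote over the K nearest texts; formula (5) sums sim(d_j, d_i) over the neighbors of each category, with y(d_i, C_i) selecting the members of that category (the neighbor list below is hypothetical):

```python
from collections import defaultdict

# K = 5 nearest texts from the new training sample set, as
# (similarity, category) pairs; all values are illustrative.
neighbors = [(0.92, "finance"), (0.88, "finance"), (0.85, "sports"),
             (0.80, "finance"), (0.79, "sports")]

# Formula (5): W(d_j, C_i) = sum of sim(d_j, d_i) * y(d_i, C_i); the
# category attribute function y simply selects the neighbors of class C_i.
weights = defaultdict(float)
for sim, cat in neighbors:
    weights[cat] += sim

label = max(weights, key=weights.get)
print(label)  # -> finance (weight 2.60 vs 1.64)
```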
The present invention proposes directly extracting category features and establishing a link between texts and categories, i.e. step 6. When the category vectors alone cannot decide the category clearly, the KNN algorithm is reused for further classification, and at that point the categories far from the text to be classified need not be considered, i.e. step 7, which prunes the sample set and reduces the amount of computation. Meanwhile, traditional methods for extracting features from the training sample set mostly use the TF-IDF algorithm and construct a vector space model; the present invention proposes constructing category features by combining the LDA model with the word2vec algorithm, taking the topic-word probability value as the weight of each feature word, i.e. step 4. This method accounts for the different contributions of different words within the same category and of the same word across different categories; adding and averaging the word vectors to represent a text preserves the similarity information between texts while keeping the dimension of the text vector from growing too large.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit the invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent substitution, improvement, etc. made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.