News text classification method based on LDA and word2vec algorithms
Technical field
The present invention relates to the technical field of document classification, and in particular to a news text classification method based on LDA and word2vec algorithms.
Background technology
The most popular document representation methods at present are based on the bag-of-words model. The bag-of-words model treats a document as a set of words whose occurrences are mutually independent, ignoring information such as word order, syntax, and semantics. It organizes the feature items of the training text set into a vector space model, and every document is represented as a vector with the same dimension as the model, where the value at each position is the weight, over the training sample set, of the word that the position represents. This method has the following main problems:
(1) The vector dimension is too high:
The dimension of the vector equals the number of feature items retained over the whole training sample set, which can reach tens of thousands or even hundreds of thousands, causing the "curse of dimensionality"; such text vectors also occupy a large amount of storage space;
(2) Sparsity:
A document vector has a weight value only at the positions of feature items that appear in the document; the weight at most other positions is 0, which reduces computational efficiency in text classification tasks and wastes storage space;
(3) The semantic information of the document is poorly represented:
The bag-of-words model assumes the words in a document are completely independent, ignoring the semantic relations between them. For two documents that are semantically close but share no feature words, the text similarity computed from bag-of-words vectors is 0.
The KNN algorithm is simple in principle and easy to implement, with high stability and high accuracy; it is currently one of the classic algorithms applied to text classification. Its deficiencies are mainly the following two points:
(1) When the training sample set is large, the KNN algorithm is inefficient:
The common KNN algorithm computes the similarity between the feature vector of the text to be classified and the feature vectors of all samples in the training set, selects the K nearest training samples, counts how many of them belong to each category, and finally assigns the text to be classified to the category with the largest count. Computing similarity between the feature vector of the text to be classified and the feature vectors of all texts in the training sample set is the key factor in KNN's low efficiency;
(2) Every attribute carries the same weight, which affects classification accuracy:
When the category distribution of the training sample set is unbalanced, e.g. one category has very many samples while other categories have very few, the samples of the large category may dominate the K nearest neighbors of an input text. Because the KNN algorithm only considers the "nearest" neighbor samples, a text that is not itself close to a large category may still be wrongly assigned to it, harming classification accuracy.
The content of the invention
To solve the above problems, the object of the present invention is to provide a news text classification method based on LDA and word2vec algorithms: the feature vector of the text to be classified is compared against the class center vectors by similarity for a preliminary classification, greatly reducing the amount of computation; when the preliminary classification is not sufficient to decide the category clearly, a secondary classification is performed with the KNN algorithm, and samples are drawn in equal amounts per category from the pruned new sample set, eliminating the effect of unbalanced sample distribution on classification accuracy.
The present invention provides a news text classification method based on LDA and word2vec algorithms, comprising:
Step 1: obtain the word vectors of the corpus with the word2vec tool:
Perform word segmentation on a large-scale corpus, feed the segmented text into the word2vec tool, and train to obtain the word vector of each word in the corpus;
Step 2: preprocess the texts of the training sample set:
Segment the texts in the training sample set and remove stop words;
Step 3: obtain the class core words of the training sample set through LDA topic models:
Train an LDA topic model separately on each category of the training sample set; after training, the text-topic and topic-word probability distributions of each category are obtained. According to the LDA output, take the words in each category whose maximum topic probability value exceeds a threshold α as the core words of that category;
Step 4: construct the class center vectors c_i of the training sample set from the word vectors a_i of the class core words;
Step 5: preprocess the text to be classified and extract its feature words to obtain the text vector d_j of the text to be classified;
Step 6: compute the similarity between the text vector of the text to be classified and the class center vectors of the training sample set, sort the similarity values in descending order, and perform a preliminary classification according to the ranking; when the difference between the first two similarity values in the descending order is less than a threshold ε, proceed to step 7;
Step 7: perform a secondary classification of the text to be classified with the KNN algorithm.
As a further improvement of the present invention, step 4 specifically comprises:
Step 401: select the word vectors a_i of the core words of each category from the word vectors of step 1;
Step 402: take the topic-word probability value obtained by the LDA topic model as the weight of each core word for its category, then add the weighted word vectors within the same category and average them to obtain the class center vector c_i of that category, expressed as formula (1): c_i = (1/n)·Σ_{t=1}^{n} p_t·a_t, where n is the number of core words of the category and p_t is the weight of core word t;
As a further improvement of the present invention, step 5 specifically comprises:
Step 501: preprocess the text to be classified, including word segmentation and stop-word removal;
Step 502: extract the text feature words with the TF-IDF algorithm:
Compute the TF-IDF value of each word according to formula (2), i.e. TF-IDF = (m/M)·log(N/n), and take the words whose TF-IDF value exceeds a threshold θ as the feature words w of the text to be classified;
where m is the number of occurrences of feature word w in the text to be classified, M is the total number of words in the text to be classified, N is the total number of texts in the training sample set, and n is the number of texts in the training sample set containing feature word w;
Step 503: input the feature words of the text to be classified into the word2vec tool to obtain their word vectors, then add the word vectors of all the feature words and average them to obtain the text vector d_j of the text to be classified.
As a further improvement of the present invention, step 6 specifically comprises:
Step 601: compute the similarity between the text vector d_j of the text to be classified and the class center vector c_i of each category according to formula (3), i.e. sim(c_i, d_j) = Σ_{k=1}^{T} w_ik·w_jk / (√(Σ_{k=1}^{T} w_ik²)·√(Σ_{k=1}^{T} w_jk²));
where sim(c_i, d_j) is the similarity value, T is the dimension of the text vector and of the class center vectors, w_ik is the value of the k-th dimension of the class center vector, and w_jk is the value of the k-th dimension of the text vector of the text to be classified;
Step 602: sort the similarity values computed in step 601 in descending order;
Step 603: compute the difference between the first and the second similarity value in the descending order of step 602:
if the difference is greater than ε, classify the text into the category corresponding to the first similarity value;
if the difference is less than ε, perform the secondary classification of step 7.
As a further improvement of the present invention, step 7 specifically comprises:
Step 701: extract from the training text set the texts of the categories corresponding to the top x similarity values in the descending order of step 6 whose adjacent differences are less than ε;
Step 702: randomly select z texts from each of these categories to form a new training sample set;
Step 703: repeat step 5 for every text in the new training sample set to obtain the text vector of each text;
Step 704: with the KNN algorithm, compute the similarity between the text vector d_j of the text to be classified and the text vectors d_i of all texts in the new training sample set according to formula (4), and select the K most similar texts;
where sim(d_j, d_i) is the similarity value, T is the dimension of the text vector of the text to be classified and of the text vectors in the new training sample set, w'_ik is the value of the k-th dimension of a text vector in the new training sample set, and w_jk is the value of the k-th dimension of the text vector of the text to be classified;
Step 705: for the K selected texts, compute in turn the weight of each text's category according to formula (5), i.e. W(d_j, C_i) = Σ sim(d_j, d_i)·y(d_i, C_i) summed over the K selected texts;
where W(d_j, C_i) is the weight value with which the text to be classified belongs to category C_i, sim(d_j, d_i) is the similarity value computed in step 704, and y(d_i, C_i) is the category attribute function: the category of each text in the new training sample set is known, so for each of the K selected texts the function value is 1 if the text belongs to category C_i and 0 otherwise;
Step 706: classify the text to be classified into the category with the maximum weight value computed in step 705.
Beneficial effects of the present invention are:
1. The present invention represents text information with word vectors trained by the word2vec tool. The word2vec model uses the contextual information of words in text to convert each word into a low-dimensional real-valued vector, so that the semantic similarity of words is captured by the distance between vectors. For the construction of text vectors, adding and averaging the keyword word vectors replaces vector concatenation, which solves the problem of high vector dimensionality and also removes the limitation on the number of keywords that can be chosen;
2. Traditional methods for extracting features from the training sample set mostly use the TF-IDF algorithm and construct a vector space model. The present invention proposes constructing category features by combining the LDA model with the word2vec algorithm, taking the topic-word probability value as the weight of each feature word. This method accounts for the different contributions of different words within the same category, and of the same word across different categories. Because word2vec captures the semantic relations between words, the present invention adds and averages the word vectors to represent a text, preserving the similarity information between texts while keeping the dimension of the text vector from growing too large; therefore, computing the similarity between the feature vector of the text to be classified and the class center vectors requires greatly reduced computation;
3. In the process of classifying text, conventional methods mostly consider only the similarity between texts. The present invention proposes directly extracting category features and establishing a link between texts and categories. When the preliminary classification is not sufficient to decide the category clearly, the KNN algorithm is reused for a secondary classification, and at that point the categories far from the text to be classified need not be considered. Samples are drawn in equal amounts per category from the pruned new sample set, eliminating the effect of unbalanced sample distribution on classification accuracy.
Brief description of the drawings
Fig. 1 is a schematic flow chart of a news text classification method based on LDA and word2vec algorithms according to an embodiment of the present invention.
Embodiment
The present invention is described in further detail below through a specific embodiment and with reference to the accompanying drawing.
As shown in Fig. 1, a news text classification method based on LDA and word2vec algorithms according to an embodiment of the present invention comprises:
Step 1: obtain the word vectors of the corpus with the word2vec tool:
Perform word segmentation on a large-scale corpus, feed the segmented text into the word2vec tool, and train to obtain the word vector of each word in the corpus.
Step 2: preprocess the texts of the training sample set:
Segment the texts in the training sample set and remove stop words.
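Step 2's preprocessing can be sketched in plain Python (the stop-word list is a hypothetical stand-in, and whitespace splitting stands in for a real segmenter such as jieba):

```python
STOP_WORDS = {"the", "a", "an", "of", "is", "on"}  # illustrative stop-word list

def preprocess(text):
    """Segment the text (here: whitespace split) and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

print(preprocess("The market is a barometer of the economy"))
# -> ['market', 'barometer', 'economy']
```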
Step 3: obtain the class core words of the training sample set through LDA topic models:
Train an LDA topic model separately on each category of the training sample set; after training, the text-topic and topic-word probability distributions of each category are obtained. According to the LDA output, take the words in each category whose maximum topic probability value exceeds the threshold α as the core words of that category.
Step 4: construct the class center vectors c_i of the training sample set from the word vectors a_i of the class core words:
Step 401: select the word vectors a_i of the core words of each category from the word vectors of step 1;
Step 402: take the topic-word probability value obtained by the LDA topic model as the weight of each core word for its category, then add the weighted word vectors within the same category and average them to obtain the class center vector c_i of that category, expressed as formula (1): c_i = (1/n)·Σ_{t=1}^{n} p_t·a_t, where n is the number of core words of the category and p_t is the weight of core word t;
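The weighted average of step 402 (formula (1)) can be sketched with NumPy; the core words, weights, and 4-dimensional vectors below are hypothetical stand-ins for the LDA topic-word probabilities and the word2vec outputs:

```python
import numpy as np

# Hypothetical core words of one category, with their topic-word
# probabilities (weights) and 4-dimensional word vectors a_t.
weights = {"market": 0.30, "stock": 0.25, "earnings": 0.20}
vectors = {
    "market":   np.array([0.1, 0.4, 0.0, 0.2]),
    "stock":    np.array([0.3, 0.1, 0.5, 0.0]),
    "earnings": np.array([0.2, 0.2, 0.1, 0.4]),
}

# Formula (1): weight each core word vector, add them, and average.
center = sum(weights[w] * vectors[w] for w in weights) / len(weights)
print(center)  # class center vector c_i, same dimension as the word vectors
```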
Step 5: preprocess the text to be classified and extract its feature words to obtain the text vector d_j of the text to be classified:
Step 501: preprocess the text to be classified, including word segmentation and stop-word removal;
Step 502: extract the text feature words with the TF-IDF algorithm:
Compute the TF-IDF value of each word according to formula (2), i.e. TF-IDF = (m/M)·log(N/n), and take the words whose TF-IDF value exceeds the threshold θ as the feature words w of the text to be classified;
where m is the number of occurrences of feature word w in the text to be classified, M is the total number of words in the text to be classified, N is the total number of texts in the training sample set, and n is the number of texts in the training sample set containing feature word w;
Step 503: input the feature words of the text to be classified into the word2vec tool to obtain their word vectors, then add the word vectors of all the feature words and average them to obtain the text vector d_j of the text to be classified.
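Formula (2), as determined by the variable definitions above, can be sketched directly (the counts are illustrative, and the logarithm base is an assumption since the source does not state it):

```python
import math

def tf_idf(m, M, N, n):
    """Formula (2): TF-IDF = (m / M) * log(N / n), natural log assumed."""
    return (m / M) * math.log(N / n)

# A word occurring 3 times in a 100-word text, where 50 of the 1000
# training texts contain the word (illustrative numbers).
score = tf_idf(m=3, M=100, N=1000, n=50)
print(score > 0)  # words scoring above the threshold theta become feature words
```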
Step 6: compute the similarity between the text vector of the text to be classified and the class center vectors of the training sample set, sort the similarity values in descending order, and classify the text to be classified according to the ranking:
Step 601: compute the similarity between the text vector d_j of the text to be classified and the class center vector c_i of each category according to formula (3), i.e. sim(c_i, d_j) = Σ_{k=1}^{T} w_ik·w_jk / (√(Σ_{k=1}^{T} w_ik²)·√(Σ_{k=1}^{T} w_jk²));
where sim(c_i, d_j) is the similarity value, T is the dimension of the text vector and of the class center vectors, w_ik is the value of the k-th dimension of the class center vector, and w_jk is the value of the k-th dimension of the text vector of the text to be classified;
Step 602: sort the similarity values computed in step 601 in descending order;
Step 603: compute the difference between the first and the second similarity value in the descending order of step 602:
if the difference is greater than ε, classify the text into the category corresponding to the first similarity value;
if the difference is less than ε, perform the secondary classification of step 7.
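Steps 601-603 can be sketched as follows (the vectors, category names, and threshold ε are illustrative; the similarity is the cosine of formula (3)):

```python
import numpy as np

def cosine(a, b):
    """Formula (3): sim = sum(w_ik * w_jk) / (|c_i| * |d_j|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

d_j = np.array([0.2, 0.5, 0.1])                 # text vector (illustrative)
centers = {"finance": np.array([0.3, 0.4, 0.2]),
           "sports":  np.array([0.9, 0.1, 0.0])}
EPSILON = 0.05

# Steps 601-603: similarity to each class center, sorted in descending order;
# a clear gap between the top two values decides the preliminary classification.
sims = sorted(((cosine(d_j, c), name) for name, c in centers.items()),
              reverse=True)
(first, first_cat), (second, _) = sims[0], sims[1]
label = first_cat if first - second > EPSILON else "secondary KNN pass needed"
print(label)  # -> finance
```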
Step 7: perform a secondary classification of the text to be classified with the KNN algorithm:
Step 701: extract from the training text set the texts of the categories corresponding to the top x similarity values in the descending order of step 6 whose adjacent differences are less than ε;
Step 702: randomly select z texts from each of these categories to form a new training sample set;
Step 703: repeat step 5 for every text in the new training sample set to obtain the text vector of each text;
Step 704: with the KNN algorithm, compute the similarity between the text vector d_j of the text to be classified and the text vectors d_i of all texts in the new training sample set according to formula (4), and select the K most similar texts;
where sim(d_j, d_i) is the similarity value, T is the dimension of the text vector of the text to be classified and of the text vectors in the new training sample set, w'_ik is the value of the k-th dimension of a text vector in the new training sample set, and w_jk is the value of the k-th dimension of the text vector of the text to be classified;
Step 705: for the K selected texts, compute in turn the weight of each text's category according to formula (5), i.e. W(d_j, C_i) = Σ sim(d_j, d_i)·y(d_i, C_i) summed over the K selected texts;
where W(d_j, C_i) is the weight value with which the text to be classified belongs to category C_i, sim(d_j, d_i) is the similarity value computed in step 704, and y(d_i, C_i) is the category attribute function: the category of each text in the new training sample set is known, so for each of the K selected texts the function value is 1 if the text belongs to category C_i and 0 otherwise;
Step 706: classify the text to be classified into the category with the maximum weight value computed in step 705.
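Steps 704-706 reduce to a similarity-weighted vote over the K nearest texts; formula (5) sums sim(d_j, d_i) over the neighbors of each category, with y(d_i, C_i) selecting the members of that category (the neighbor list below is hypothetical):

```python
from collections import defaultdict

# K = 5 nearest texts from the new training sample set, as
# (similarity, category) pairs; all values are illustrative.
neighbors = [(0.92, "finance"), (0.88, "finance"), (0.85, "sports"),
             (0.80, "finance"), (0.79, "sports")]

# Formula (5): W(d_j, C_i) = sum of sim(d_j, d_i) * y(d_i, C_i); the
# category attribute function y simply selects the neighbors of class C_i.
weights = defaultdict(float)
for sim, cat in neighbors:
    weights[cat] += sim

label = max(weights, key=weights.get)
print(label)  # -> finance (weight 2.60 vs 1.64)
```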
The present invention proposes directly extracting category features and establishing a link between texts and categories, i.e. step 6. When the category vectors alone cannot decide the category clearly, the KNN algorithm is reused for further classification, and at that point the categories far from the text to be classified need not be considered, i.e. step 7, which prunes the sample set and reduces the amount of computation. Meanwhile, traditional methods for extracting features from the training sample set mostly use the TF-IDF algorithm and construct a vector space model; the present invention proposes constructing category features by combining the LDA model with the word2vec algorithm, taking the topic-word probability value as the weight of each feature word, i.e. step 4. This method accounts for the different contributions of different words within the same category and of the same word across different categories; adding and averaging the word vectors to represent a text preserves the similarity information between texts while keeping the dimension of the text vector from growing too large.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit the invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent substitution, improvement, etc. made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.